Climate-invariant machine learning

Projecting climate change is a generalization problem: We extrapolate the recent past using physical models across past, present, and future climates. Current climate models require representations of processes that occur at scales smaller than model grid size, which have been the main source of model projection uncertainty. Recent machine learning (ML) algorithms hold promise to improve such process representations but tend to extrapolate poorly to climate regimes that they were not trained on. To get the best of the physical and statistical worlds, we propose a framework, termed “climate-invariant” ML, incorporating knowledge of climate processes into ML algorithms, and show that it can maintain high offline accuracy across a wide range of climate conditions and configurations in three distinct atmospheric models. Our results suggest that explicitly incorporating physical knowledge into data-driven models of Earth system processes can improve their consistency, data efficiency, and generalizability across climate regimes.

model processes that generalize across climates.
ML algorithms typically optimize an objective on a training dataset and make implicit assumptions when extrapolating. Here, extrapolation refers to predictions outside of the training data range, henceforth referred to as out-of-distribution predictions. As an example, multiple linear regressions (MLR) assume that the linear relationship that best describes the training set is valid outside of that training set. Alternatively, when confronted with out-of-distribution inputs, random forests (RF, (24)) find the closest inputs in their training sets and assign the corresponding outputs regardless of the out-of-distribution input values. Neural networks (NN), which are powerful nonlinear regression and classification tools, rely on nonlinear activation functions and fitted weights to extrapolate. Except in specific situations (e.g., samples in the close neighborhood of the training set or described by the same nonlinear mapping as the training set), there is no reason why NNs should generalize well far outside of their training sets. NNs trained differently on the same training data can give drastically different out-of-distribution predictions, underscoring how little confidence should be placed in any single such prediction.
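These contrasting extrapolation behaviors can be illustrated with a toy regression problem. The sketch below is hypothetical: it fits a line (MLR-like) to samples of y = x² on [0, 1], and uses a nearest-neighbor predictor as a stand-in for an RF's tendency to saturate outside the training range.

```python
import numpy as np

# Toy problem: learn y = x^2 from samples in [0, 1], then predict at x = 2.
x_train = np.linspace(0.0, 1.0, 50)
y_train = x_train ** 2

# MLR-like extrapolation: the best linear fit on [0, 1] is applied as-is at x = 2.
slope, intercept = np.polyfit(x_train, y_train, 1)
linear_pred = slope * 2.0 + intercept  # roughly 1.8, far from the true value of 4

# RF-like extrapolation: out-of-distribution inputs are mapped to the output of
# the closest training input, so predictions saturate outside the training range.
def nearest_neighbor_predict(x):
    return y_train[np.argmin(np.abs(x_train - x))]

saturated_pred = nearest_neighbor_predict(2.0)  # stuck at y(1) = 1.0
```

Neither prediction approaches the true value of 4, and the two models disagree strongly with each other, mirroring the low confidence one should place in out-of-distribution ML predictions.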
In climate applications, this extrapolation issue means that ML algorithms typically fail when exposed to dynamic, thermodynamic, or radiative conditions that differ substantially from the range of conditions they were trained on, an issue that remains under-reported in the research literature. Exceptions include (25), who showed that an RF-based moist convection scheme generalizes poorly in the tropics of a climate 6.5K warmer than the training climate; and (26), who showed that NNs and support vector machines downscaling surface air temperature made significant extrapolation errors when exposed to temperatures 2-3K warmer than in the training set. (27) showed that an NN-based thermodynamic subgrid-scale closure generalizes well to climates 1-2K warmer than the training one, but makes large errors as soon as the test climate is 4K warmer than the training one. (28) confirmed that these generalization errors remain even when the NN subgrid closure is modified to enforce conservation laws to within machine precision. This has led several studies to recommend training ML models in multiple climates if possible (29,30). Interestingly, both Guillaumin et al. (31), who trained an NN parameterization for subgrid oceanic momentum transport, and Molina et al. (32), who trained convolutional NNs (33) to classify thunderstorms in high-resolution model outputs, found that their ML models generalized well to a warmer climate. While this may be because both models relied heavily on velocity inputs and their gradients, whose distributions changed only slightly when the climate warmed, Molina et al. noted that using two types of ML layers, namely batch normalization (BN, (34)) followed by dropout (DP, (35)), was key to this successful generalization. DP and BN are two examples of a larger set of methods that help NNs generalize and avoid overfitting, broadly referred to as "regularizations" (36).
Most empirical regularization methods (e.g., L1 regularization) rely on the parsimony principle, i.e., that simpler models, accurately describing the training set with fewer fitted parameters, are preferable to more complex models and generalize better to unseen conditions. More systematic approaches to regularization have been developed to use ML models in out-of-distribution situations that still require the same inputs/outputs, referred to as domain adaptation (e.g., (37)(38)(39)), a particular case of transfer learning (e.g., (40)). While not all domain adaptation approaches (sample-based, statistics-based, ensemble-based, domain-invariant feature learning, domain mapping, etc.) need supervision (41), they usually require at least a few samples in the generalization domain.
Without dismissing existing domain adaptation methods, we here focus on physically-informed methods that do not require samples in the generalization domain for three reasons: (1) a long-term goal of the climate science community is to train ML models that rely on historical observations only, as we cannot, by definition, observe the future climate; (2) as shown later, even if we have access to simulation data across climates, ML models that intrinsically generalize to climates they have not been trained on tend to be more data-efficient and robust to other changes (e.g., model geography changes); and (3) physically-informed methods can be readily combined with existing domain adaptation methods. Motivated by these challenges, we ask: How can we enhance ML algorithms with physical knowledge to make accurate predictions in climate conditions that, in standard variables, lie far outside of the training set?

Problem Definition
Our scientific contribution is to transform the emulated ML mapping that changes with climate, here referred to as "brute-force", to a mapping that approximately remains constant across climates, here referred to as "climate-invariant". Inspired by invariants in physics and self-similarity in fluid mechanics (42), we change the emulated mapping from brute-force to climate-invariant by transforming the input and output vectors so that their distributions shift minimally across climates (see Fig 1).
We demonstrate this framework's utility by adapting ML closures of subgrid atmospheric thermodynamics (i.e., coarse-scale thermodynamic tendencies resulting from subgrid convection, radiation, gravity waves, and turbulence) so that they generalize better across climates. The motivation for this application is two-fold. First, purely physically-based subgrid closures remain one of the largest sources of uncertainties in Earth system models (43)(44)(45). While ML-based closures have emerged as a promising alternative to traditional semi-empirical models (46), they lack robustness (47,48) and, as discussed earlier, usually fail to generalize across climates (25,27,28). Second, atmospheric thermodynamic processes are directly affected by global temperature changes, e.g., in response to anthropogenically-forced climate change (49). Therefore, predicting subgrid thermodynamics in a warm climate with an ML model trained in a cold climate leads to very apparent failure modes (28) that we can transparently tackle. In mathematical terms, our goal is to build a climate-invariant mapping between the input vector x representing the large-scale (≈100km) climate state and the output vector y grouping large-scale thermodynamic tendencies due to explicitly-resolved convection and parameterized radiative transfer and turbulent mixing at the ∼1km-scale (see SM B1 for details). We keep the overall structure of the mapping x → y fixed throughout the manuscript and aim to predict y as accurately as possible in training and generalization climates (out-of-distribution prediction). Note that this mapping makes some implicit assumptions based on successful past work (27,50), including locality in horizontal space and time (outputs only depend on inputs in the same atmospheric column at the same time step) and determinism (only one possible output vector for a given input vector). 
We include cloud radiative effects in all heating terms (total heating Ṫ, longwave heating lw, and shortwave heating sw), but for simplicity do not predict changes in cloud liquid water and ice, and exclude cloud water and greenhouse gases other than water vapor q_v from the input vector x.
After introducing the climate simulations and training/validation/test split (Section 2), we define the climate-invariant mapping and feature transformations (Section 3), and demonstrate and explain their ability to generalize (Section 4) before concluding. We refer the reader to the Supplementary Materials (SM) for data availability (SM A), additional derivations and descriptions of the mapping and physical transformations (SM B), the implementation of our ML framework (SM C), and additional results (SM D).

Data
To test the robustness of our framework across model configurations and geographies, we use three distinct storm-resolving climate models and experimental set-ups: aquaplanet simulations using the Super-Parameterized Community Atmosphere Model version 3.0 (SPCAM3), Earth-like simulations (i.e., with continents) using the Super-Parameterized Community Earth System Model version 2 (SPCESM2), and quasi-global aquaplanet hypohydrostatic simulations using the System for Atmospheric Modeling version 6.3 (SAM). SPCAM3 and SPCESM2 assume a strict scale separation between the resolved coarse scales and subgrid processes, making them ideal testbeds to machine learn local subgrid closures (27,51). In contrast, SAM, as a global storm-resolving model, does not assume scale separation, which is more realistic but requires carefully coarse-graining its high-resolution simulation output (52,53). For each climate model, we run three simulations with three different prescribed surface temperature distributions:
• (+0K) A reference simulation with a temperature range analogous to the present climate;
• (-4K) A cold simulation with surface temperatures 4K cooler than the (+0K) simulation; and
• (+4K) A warm simulation with surface temperatures 4K warmer than the (+0K) simulation,
with the exception of SAM, for which only the (-4K) and (+0K) simulations are available. We summarize the simulations and indicate their spatiotemporal resolutions in Tab S1. Fig 2 visualizes the surface temperatures in each model (Fig 2a) and snapshots of mid-tropospheric subgrid heating (Fig 2b), one of our ML models' outputs.

Super-Parameterized Aquaplanet Simulations
We use data from two-year SPCAM3 (54) climate simulations in an aquaplanet configuration (55), with zonally-symmetric surface temperatures fixed to a realistic meridionally-asymmetric profile (56). The insolation is fixed to boreal summer conditions with a full diurnal cycle. A two-dimensional storm-resolving model is embedded in each grid cell of SPCAM3, namely 8 SAM atmospheric columns using a spatiotemporal resolution of 4km×30lev×25s, and the default one-moment microphysical scheme (57). SPCAM3 combines a spectral primitive equation solver with a semi-Lagrangian dynamical core for advection (55). The (+0K) SPCAM3 simulation was first presented in (51) and subsequently used to train ML subgrid closures in (27,48,58). Inspired by the generalization experiment of (25), the (+4K) simulation was introduced in (27), and we ran the (-4K) simulation for the work presented here to increase the surface temperature generalization gap from 4K to 8K.

Super-Parameterized Earth-like Simulations
We run three two-year SPCESM2 (59) climate simulations in an Earth-like configuration with realistic surface boundary conditions, including a land surface model, seasonality, aerosol conditions representative of the year 2000, and a zonally-asymmetric annual climatology of sea surface temperatures derived from the "HadOIB1" dataset (60). We use CESM v2.1.3 to couple CAM v4.0 with the Community Land Model, and similarly embed 32 SAM columns in each atmospheric grid cell to explicitly represent deep convection. Our (+0K) simulation is similar to (50)'s, which showed the potential of ML for subgrid closures in Earth-like conditions.

Quasi-Global Aquaplanet Hypohydrostatic Simulations
While super-parameterization is well-adapted to statistically learning subgrid closures thanks to its explicit scale separation, this scale separation comes at the cost of distorted mesoscale systems and momentum fluxes (61). Furthermore, most ML subgrid closures are based on coarse-graining high-resolution simulations (e.g., (62,63)). This motivates us to also test the climate-invariant framework in hypohydrostatic SAM simulations in which the dynamics are not affected by a prescribed scale separation. Computational expense is reduced through hypohydrostatic scaling, which multiplies the vertical acceleration in the equations of motion by a factor of 16 to increase the horizontal scale of convection without overly affecting the larger-scale flow (64,65). While these simulations use idealized settings, such as aquaplanet configurations, an anelastic dynamical core, a quasi-global equatorial beta plane domain, and perpetual equinox without a diurnal cycle, (66) showed that they produce tropical rainfall intensity and cluster-area distributions that are close to satellite observations. The prescribed surface temperature distribution in the control simulation of (66) is designed to be close to zonal-mean observations (67), and its maximum value is roughly 2K colder than that of the distribution used for the (+0K) SPCAM3 simulation. To better match the SPCAM3 maxima of distributions of upper-level temperatures and humidities, we choose to treat this SAM control simulation as the (-4K) SAM simulation, and the warm simulation of (66) as the (+0K) SAM simulation. We refer the reader interested in the details of the simulations and the coarse-graining (here by a factor of 8) to (52).
We summarize key differences in the mappings learned for SAM, as compared to SPCAM3/SPCESM2, below:
• the input vector does not contain specific humidity, surface pressure, sensible heat fluxes, or latent heat fluxes, but instead contains the total non-precipitating water concentration and uses distance to the equator as a proxy for solar insolation;
• the output vector includes the subgrid total non-precipitating water tendency instead of the subgrid specific humidity tendency;
• the output vector does not contain subgrid longwave and shortwave heating;
• SAM uses a height-based vertical coordinate rather than a pressure-based one; and
• the generalization experiment is from (-4K) to (+0K) (the (+4K) simulation is unavailable).

Training, Validation, and Test Split
Both generalization experiments expose ML models to out-of-distribution inputs they have not been trained on. Following best ML practices (68), we use the training set to optimize the ML model's trainable parameters, save the trainable parameters that led to the best performance on the validation set to avoid overfitting the training set, and evaluate the final model on samples from a separate test set. We split each of the 9 simulations into training/validation/test sets by using non-contiguous 3-month periods (reported in Tab S1) to avoid high temporal correlations between training/validation/test set samples (69).
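The period-based split can be sketched as follows; the block assignment here is hypothetical (the actual 3-month periods are reported in Tab S1), but it illustrates how non-contiguous blocks keep training, validation, and test samples separated by months-long gaps.

```python
import numpy as np

# 24 simulated months (two years), grouped into eight 3-month blocks.
months = np.arange(24)
blocks = months // 3

# Hypothetical assignment of non-contiguous blocks to each set, so that
# training/validation/test samples are never temporally adjacent.
train_months = months[np.isin(blocks, [0, 2, 4, 6])]
val_months = months[np.isin(blocks, [1, 5])]
test_months = months[np.isin(blocks, [3, 7])]
```

Splitting by whole periods rather than by random samples avoids the high temporal autocorrelation that would otherwise leak information between the three sets.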
To understand which solutions are most promising for helping ML algorithms generalize to unseen conditions, we design two generalization experiments: (1) Training and validating ML models on cold simulations (-4K) and testing them on warm simulations (+4K for SP-CAM3/SPCESM2 and +0K for SAM); and (2) training and validating ML models on aquaplanet simulations (SPCAM3) and testing them on Earth-like simulations with continents (SPCESM2).

Theory
We formally define a climate-invariant mapping as a mapping that is unchanged across climates. In practice, it is difficult to find mappings that are exactly invariant and we will use the terminology climate-invariant for any mapping that remains approximately constant across climates.
To achieve climate invariance, we introduce physically-based feature transformations, defined as physically-informed functions that map the inputs/outputs to different inputs/outputs whose distributions vary little across climates. We deem a physical transformation climate-invariant if it successfully limits the variation of the input/output distributions across climates. Note that climate-invariant transformations are distinct from nondimensionalization in dimensional analysis: nondimensionalization does not necessarily alter a distribution's shape, while climate-invariant transformations may yield variables that are not dimensionless.
Throughout the following section, we compare two transformation options for each input, whose univariate PDFs are depicted for all three atmospheric models in Fig 3: No transformation (top) and our most successful transformation (bottom) (see SM B2a for our second best transformations). Our comparison relies on the Hellinger and Jensen-Shannon PDF distance metrics defined and calculated in SM B5. To prevent information leaks from generalization test sets into the physically-informed ML framework, we take two precautions: (1) the physical transformations are fixed, meaning that their structure and parameters are non-trainable; and (2) transformations are ranked based on their generalization from (-4K) to (+0K) in SPCAM3. Our (+4K) results across models and geographies independently confirm this ranking.
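The two PDF distance metrics used to rank transformations can be computed directly from normalized histograms of each input; a minimal sketch (our exact definitions and calculations are in SM B5; natural logarithm here, so the Jensen-Shannon divergence is bounded by ln 2):

```python
import numpy as np

def hellinger(p, q):
    """Hellinger distance between two discrete PDFs (0 = identical, 1 = disjoint)."""
    p, q = p / p.sum(), q / q.sum()
    return np.sqrt(0.5 * np.sum((np.sqrt(p) - np.sqrt(q)) ** 2))

def jensen_shannon(p, q):
    """Jensen-Shannon divergence in nats (0 = identical, ln 2 = disjoint)."""
    p, q = p / p.sum(), q / q.sum()
    m = 0.5 * (p + q)

    def kl(a, b):
        mask = a > 0  # convention: 0 * log(0) = 0
        return np.sum(a[mask] * np.log(a[mask] / b[mask]))

    return 0.5 * kl(p, m) + 0.5 * kl(q, m)
```

Small values for a transformed input's histograms in the (-4K) and (+4K) climates indicate that the transformation has made that input approximately climate-invariant.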
Specific Humidity: Without any transformation, the PDF of specific humidity q (Fig 3a, top) extends through a considerably larger range as the climate warms. This is because, barring supersaturation, q has a theoretical upper bound in a given climate, namely the saturation specific humidity, which increases quasi-exponentially with temperature through the Clausius-Clapeyron relation (e.g., (70,71)). The relative humidity (RH) transformation q_RH (Fig 3a, bottom) normalizes specific humidity by its saturation value. As a result, most of the RH PDF lies within [0, 1], except for a few atmospheric columns exhibiting supersaturation in SPCAM, and that PDF changes little as the climate warms (72). In addition to capturing grid-scale saturation, q_RH helps predict the subgrid effects of dry-air entrainment, known to regulate tropical convection (73-75) (see SM B2b for details of RH calculations).
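A minimal sketch of this transformation, using Bolton's (1980) approximation for saturation vapor pressure over liquid water (the paper's exact RH calculation is detailed in SM B2b, so treat this as an illustrative stand-in):

```python
import numpy as np

def q_sat(T, p):
    """Approximate saturation specific humidity (kg/kg) at temperature T (K), pressure p (Pa)."""
    # Bolton (1980) approximation for saturation vapor pressure (Pa)
    e_sat = 611.2 * np.exp(17.67 * (T - 273.15) / (T - 273.15 + 243.5))
    eps = 0.622  # ratio of gas constants R_d / R_v
    return eps * e_sat / (p - (1.0 - eps) * e_sat)

def to_relative_humidity(q, T, p):
    """Climate-invariant input: specific humidity normalized by its saturation value."""
    return q / q_sat(T, p)
```

Because q_sat grows quasi-exponentially with temperature (Clausius-Clapeyron), normalizing by it collapses the specific humidity PDFs of cold and warm climates onto a nearly common [0, 1] range.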
Temperature: The PDF of temperature T (Fig 3b, top) shifts quasi-linearly as the climate warms. To address this shift without compromising the approximate invariance of tropopause temperatures with warming (76-78), we derive a temperature transformation directly relevant for moist convection: the buoyancy of a non-entraining, moist static energy-conserving plume, T_buoyancy (Fig 3b, bottom; see SM B2c for this buoyancy's derivation). This transformation is inspired by recently introduced lower-tropospheric buoyancy measures (79,80), but extends them to the full troposphere (81). While T_buoyancy does not explicitly include entrainment effects, the mapping of T_buoyancy(p) and q_RH(p) to heating and moisture sink implicitly includes them. This transformation captures the leading-order effects needed to yield approximate climate invariance (Fig 3b). T_buoyancy also increases physical interpretability by linking the vertical temperature structure and near-surface humidity changes to a metric that correlates well with deep convective activity (82). The presence of near-surface specific humidity in its definition captures the role of near-surface humidity, relative to the temperature structure aloft, in contributing to moist convective instability in the tropics.
Latent Heat Flux: The last input whose distribution significantly changes with warming is the latent heat flux LHF (Fig 3c, top; the remaining inputs, sensible heat flux and surface pressure, change less with warming and are discussed in SM B2d). Similar to specific humidity, the increase of latent heat fluxes with warming is directly linked to the Clausius-Clapeyron relationship (e.g., (83)). To address this shift, we leverage the bulk aerodynamic formula for surface fluxes to provide a physics-motivated transformation of LHF using the near-surface saturation deficit (Fig 3c, bottom). This transforms LHF, a thermodynamic variable, into LHF_Δq, which is approximately proportional to the magnitude of near-surface horizontal winds and density (e.g., (83)), whose distributions vary less with warming.
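Under the bulk aerodynamic formula, LHF ≈ ρ L_v C_E |U| Δq, with Δq the near-surface saturation deficit, so dividing LHF by L_v Δq recovers a wind-and-density-like quantity whose distribution shifts little with warming. A schematic numpy version (variable names and the exchange coefficient value are illustrative, not the paper's exact implementation):

```python
L_V = 2.5e6   # latent heat of vaporization (J/kg)
C_E = 1.2e-3  # illustrative bulk exchange coefficient (dimensionless)

def bulk_lhf(rho, wind_speed, delta_q):
    """Bulk aerodynamic latent heat flux (W/m^2): rho * L_v * C_E * |U| * delta_q."""
    return rho * L_V * C_E * wind_speed * delta_q

def lhf_transform(lhf, delta_q):
    """Climate-invariant input LHF_dq: LHF normalized by the near-surface saturation deficit."""
    return lhf / (L_V * delta_q)
```

For a flux generated by the bulk formula, the transform returns exactly ρ C_E |U|, independent of the saturation deficit and hence of the quasi-exponential Clausius-Clapeyron increase with warming.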
We now show that all three input transformations (q_RH(p), T_buoyancy(p), and LHF_Δq) lead to statistically significant improvements in the ML models' ability to generalize.

Results
The results are organized as follows. After demonstrating the benefits of progressively transforming the ML models' inputs (Fig 4), we show how climate-invariant models learn subgrid closures across climates and geographies during training (Tab 1, Fig S7). We then discuss the global skill of different models after training (Fig 5, Fig S8). Finally, we investigate the structure of climate-invariant mappings to understand why they generalize better across climates (Fig 6), even when data from multiple climates are available (Fig 7).

Benefits of Incremental Input Transformations
In this section, we demonstrate that sequentially transforming the inputs of neural networks progressively improves their generalization abilities from the cold (-4K) aquaplanet (SPCAM3) simulation to the warm (+4K) aquaplanet simulation. The largest surface temperature jump tested in this study is between the cold aquaplanet simulation and the tropics of the warm aquaplanet simulation ("warm Tropics" for short), defined as the regions with out-of-distribution surface temperatures, whose latitudes lie between 15°S and 23°N (approximately the red regions in Fig 2's top-left subplots). To expose the failure modes of the "brute-force" model and the benefits of progressively transforming the inputs, we first trained several NNs on the cold aquaplanet until they reached high accuracy (Fig 4a) before testing their out-of-distribution generalizability in the warm Tropics (Fig 4b).
In the cold Tropics, the vertical profiles of the mean-squared error (MSE) are nearly indistinguishable for all types of NNs and roughly follow the vertical structure of subgrid variance, as discussed in (51,58). When evaluated in the warm Tropics, the MSE of the "brute-force" NN (blue line) increases by a factor of ≈10 and peaks above 100 W² m⁻⁴, underlining how standard NNs fail to generalize across climates. As discussed in Section 3, we progressively transformed the inputs, starting with specific humidity, which is transformed to relative humidity (orange line). This first transformation decreases the MSE so much (by a factor of 5-10) that we need to zoom in on each panel to distinguish the generalization abilities of additional NNs. Adding the transformations of temperature to plume buoyancy (green line) and then of LHF to LHF_Δq (red line) incrementally decreases the MSE, which is particularly clear for the subgrid longwave heating. Impressively, transforming all three inputs decreases the MSE so much that the resulting climate-invariant NN's MSE (red line) is within ≈25% of the MSE of a "brute-force" NN trained directly in the warm climate (purple line).

[Fig 4 caption: The generalization error decreases as inputs are sequentially transformed: no transformation (blue), then specific humidity (orange), then temperature (green), then latent heat fluxes (red). For reference, the purple line depicts an NN trained in the warm climate. Panels show the tendencies' mean-squared error versus pressure, horizontally averaged over the Tropics of SPCAM3 aquaplanet simulations, for the four model outputs: total moistening (q̇), total heating (Ṫ), longwave heating (lw), and shortwave heating (sw). Because the brute-force NN's generalization error (blue line) greatly exceeds that of the transformed NNs, each panel includes a zoomed view to facilitate visualization.]
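The per-level diagnostic underlying these comparisons is simply an MSE computed at each pressure level, averaged horizontally over the tropical samples; a sketch with hypothetical arrays:

```python
import numpy as np

def mse_profile(y_true, y_pred):
    """Mean-squared error per vertical level.

    y_true, y_pred: arrays of shape (n_samples, n_levels), e.g., subgrid heating
    tendencies for all tropical columns, ordered along the vertical coordinate.
    """
    return np.mean((y_pred - y_true) ** 2, axis=0)

# Hypothetical example: 1000 tropical samples on 30 vertical levels.
rng = np.random.default_rng(0)
truth = rng.normal(size=(1000, 30))
prediction = truth + 0.1 * rng.normal(size=(1000, 30))  # small, level-wise errors
profile = mse_profile(truth, prediction)  # one MSE value per vertical level
```

Plotting such profiles for differently transformed NNs, evaluated on the same warm-Tropics test samples, reproduces the style of comparison shown in Fig 4.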
Hereafter, we use "climate-invariant" to refer to models for which all three inputs (q, T , LHF) but no outputs were transformed, solely based on physical principles. After demonstrating their success in the aquaplanet case, we are now ready to investigate how these climate-invariant models learn in the other climates and simulations introduced in Section 2.

Learning across Climates and Geographies
In this section, we show that climate-invariant models learn mappings that are valid across climates and geographies, and that their efficacy improves when used in conjunction with ML regularization techniques like BN and DP layers.
Tab 1 shows the MSE of ML models trained in three different datasets, and evaluated over their training and validation sets, and over test sets of different temperature and geography. As discussed previously, climate-invariant NNs (NN CI) generalize better to warmer climates than brute-force NNs (NN BF). We go one step further by examining learning curves, defined as the MSE of an ML model at the end of each epoch during training. Impressively, the learning curve of the climate-invariant NN trained in the cold aquaplanet but tested in the warm aquaplanet (starred blue line in Fig S7a) is mostly decreasing, supporting this manuscript's key result: climate-invariant NNs continuously learn about subgrid thermodynamics in the warm aquaplanet as they are trained in the cold aquaplanet. In contrast, the "brute-force" NN trained in the cold aquaplanet but tested in the warm aquaplanet makes extremely large generalization errors, which worsen as the model is trained in the cold aquaplanet (see SM D2 for details).
Climate-invariant NNs also facilitate learning across geographies, i.e., from the aquaplanet to the Earth-like simulations and vice-versa (see NN CI rows in Tab 1). Climate-invariant transformations additionally improve the MLR baseline's generalization ability, albeit less dramatically. This smaller improvement in MLR's generalization abilities is linked to its relatively small number of trainable parameters, resulting in (1) "brute-force" MLRs generalizing better than "brute-force" NNs; and (2) MLRs having lower descriptiveness and fitting their training sets less well, limiting the maximal accuracy of climate-invariant MLRs on test sets.
There are a few cases in which transforming inputs does not fully solve the generalization problem, e.g., when trying to generalize from the aquaplanet to the Earth-like simulation. In that case, we leverage the fact that input transformations can easily be combined with standard techniques to improve generalization, such as DP layers before each activation function and a single BN layer before the first DP layer (34). As DP layers randomly drop a fixed proportion of the trainable parameters during training, NNs with DP fit their training set less well (see NN CI+DN row of Tab 1). However, they improve generalization in difficult cases (e.g., between cold aquaplanet and Earth-like simulations) and do not overly deteriorate generalization in cases where the input transformations work particularly well (e.g., from cold to warm aquaplanet). Our results suggest that combining physics-guided generalization methods (e.g., physical transformation of the inputs/outputs) with domain adaptation methods (e.g., DP) is advantageous and deserves further investigation. After analyzing the overall MSE during training, we now turn to the spatial characteristics of our ML models' skill after training.
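A minimal numpy sketch of the two regularization layers at training time may clarify their roles (inverted dropout so activations keep their expected magnitude; BN's learnable scale and shift parameters are omitted for brevity):

```python
import numpy as np

def batch_norm(x, eps=1e-5):
    """Normalize each feature to zero mean and unit variance over the batch."""
    return (x - x.mean(axis=0)) / np.sqrt(x.var(axis=0) + eps)

def dropout(x, rate, rng):
    """Inverted dropout: randomly zero a fraction `rate` of activations during
    training and rescale the rest so the expected activation is unchanged."""
    mask = rng.random(x.shape) >= rate
    return mask * x / (1.0 - rate)

rng = np.random.default_rng(0)
batch = rng.normal(loc=5.0, scale=2.0, size=(256, 8))  # hypothetical activations
normalized = batch_norm(batch)
regularized = dropout(normalized, rate=0.25, rng=rng)
```

BN re-centers and re-scales each feature per batch (which partially absorbs distribution shifts between domains), while DP discourages the network from relying on any single input pathway.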

Global Performance after Training
In this section, we first compare the spatial characteristics of the brute-force and climate-invariant NNs' skill across climates of different temperatures to further establish the advantages of our climate-invariant transformation. These advantages are clearly visible in Fig 5, where the "brute-force" models struggle to generalize to the warm Tropics for all simulations despite fitting the cold training set well (Fig 5a). We can trace these generalization errors to warm surface temperature conditions the NNs were not exposed to during training, visible when comparing Fig 5a with Fig 2a. In contrast, the climate-invariant models fit the warm climate almost as well as the cold climate they were trained in (Fig 5b). Note that Fig 5 focuses on the horizontal map of a single output, i.e., the total subgrid heating at 500hPa, but the horizontal maps of other outputs, such as the near-surface subgrid heating (see Fig S8), all exhibit the same pattern: "brute-force" models fail in the warm Tropics, and climate-invariant models mostly correct these generalization errors. Finally, the spatial distribution of the skill in the training set (e.g., middle column of Fig S7) is reassuringly consistent with the skill maps of highly tuned NNs trained in similar conditions (e.g., (50)). This confirms that the "brute-force" models, representative of state-of-the-art ML subgrid closures, fail to generalize. This failure is confirmed in the latitude-pressure map of the subgrid heating at all vertical levels shown in Fig 5c. In contrast, Fig 5d underlines how climate-invariant NNs improve generalization throughout the atmosphere in the warm Tropics without deteriorating skill in the mid-latitudes and poles of the warm simulation. This consideration helped us choose our final input transformation: the T_from NS temperature transformation significantly deteriorated generalization in the mid-latitudes, while the T_buoyancy transformation helps generalization in the Tropics without overly compromising skill at other latitudes. There is a slight skill compromise at high latitudes, visible when comparing the second rows of Fig 5c and Fig 5d, which is especially apparent in the SAM case and can be partially traced back to challenges in generalizing subgrid ice sedimentation (not shown here; see (52) for details).

[Tab 1 caption: MSE of six models trained in three simulations (first column) and evaluated over the training or validation set of the same and two other simulations (last four columns). The models (second column) are brute-force (BF) or climate-invariant (CI), multiple linear regressions (MLR) or neural nets (NN), and sometimes include dropout layers preceded by a batch normalization layer (DN). The models are trained for 20 epochs; we give the MSE corresponding to the epoch of minimal validation loss, followed in parentheses by the MSE averaged over the 5 epochs with the lowest validation losses. "Different Temperature" refers to (+4K) for (-4K) training sets and vice versa. In each application case, the best model's error is highlighted in bold font.]
To fully compare ML models across climates and geographies, we evaluate their overall MSEs in the training, validation, and both generalization test sets in Tab 1. In addition to the MSE at minimal validation loss, we show the MSE averaged over the 5 epochs of minimal validation loss in parentheses to confirm that our models have converged. Consistent with Fig S7's learning curves, climate-invariant NNs with DP and BN layers often demonstrate the highest level of generalizability (two rightmost columns of each row's NN CI+DN models). While they fit their training sets less well, "brute-force" MLRs generalize better than "brute-force" NNs because they have fewer trainable parameters (see MLR BF and NN BF models). In Tab 1, we also show that while DP and BN layers generally increase the generalization performance of "brute-force" NNs (NN BF+DN models), we can systematically improve on these standard ML generalization methods by combining them with input transformations (NN CI+DN models).

Understanding Climate Invariance
To interpret our NN results, we use a game theory-based explainable ML approach, called SHAP (84, 85), to dissect climate-invariant mappings and provide insight into why they generalize better across climates and geographies. Note that MLRs are interpretable by construction, and we can draw preliminary conclusions by visualizing MLR weights without the need for explainable ML libraries (see SM D4). Similarly, we can directly visualize the NNs' linear responses (78, 86-88) by calculating their Jacobians (gradients) via automatic differentiation (47). However, the difference between BF and CI MLRs is small, and the Jacobians (86) cannot always be reliably used to explain nonlinear NN predictions as they are first-order derivatives calculated about a single sample (89).
Therefore, as climate-invariant NNs have shown superior generalizability from cold to warm climates (see NN CI errors in Tab 1), we employ SHAP's Kernel Explainer to elucidate the NNs' generalizability. We choose this attribution method for its versatility, as it can be used for any multi-input/output ML model. SHAP estimates the impact of a particular input value x_i on each output y_j of our model. It is designed with a local accuracy property, ensuring that the sum of the effects φ_ij of individual inputs equals the difference between y_j and its average value in the training set:

Σ_i φ_ij = ỹ_j,

where we have introduced the deviation ỹ_j, defined as the difference between y_j and its training-set average: ỹ_j := y_j − ⟨y_j⟩. We use these "Shapley values" φ_ij to build a nonlinear feature matrix M capturing the influence of an input x_i on an output y_j:

M_ij = sign(x̃_i) × φ_ij,

where we use the sign of the input deviation x̃_i to make M_ij positive if x̃_i and ỹ_j have the same sign, e.g., if a positive input deviation leads to a positive output deviation. In Fig 6, each panel depicts the SHAP feature matrix M for a given model: brute-force (a) and climate-invariant (b). Each model's inputs (e.g., specific humidity q and temperature T) are organized on the x-axis from the surface to the top of the atmosphere. Each model's outputs (subgrid moistening q̇ and subgrid heating Ṫ; see Fig S13/14 for subgrid longwave heating lw and subgrid shortwave heating sw) are organized on the y-axis, from the surface to the top of the atmosphere. Following a horizontal line shows how different inputs contribute to a given output, while following a vertical line shows how a given input influences different outputs. Fig 6 contains a wealth of information about subgrid closures trained in aquaplanet simulations; we focus here on visualizing how the climate-invariant NNs (b) operate in ways that generalize better than their brute-force counterpart (a). Consider the row for subgrid heating Ṫ.
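Given Shapley values φ_ij (e.g., from shap's KernelExplainer, whose output is a list of per-output attribution arrays), the signed feature matrix M can be assembled in a few lines. This is an illustrative sketch; the function name and array layout are our own, not the paper's exact implementation:

```python
import numpy as np

def signed_feature_matrix(shap_values, X, X_train_mean):
    """Build M[j, i]: signed mean influence of input i on output j.

    shap_values: list of arrays, one per output, each (n_samples, n_inputs),
                 e.g. as returned by shap.KernelExplainer(...).shap_values(X).
    X:           the samples that were explained, (n_samples, n_inputs).
    X_train_mean: training-set mean of each input, (n_inputs,).

    Each attribution is multiplied by the sign of the input deviation so
    that M_ij > 0 when the input and output deviations share the same sign,
    then averaged over samples.
    """
    x_dev_sign = np.sign(X - X_train_mean)
    return np.stack(
        [np.mean(x_dev_sign * sv, axis=0) for sv in shap_values], axis=0
    )  # shape: (n_outputs, n_inputs)
```

Plotting the resulting (n_outputs, n_inputs) matrix as a heat map then reproduces the kind of panels shown in Fig 6.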
In the brute-force case (a), M has large coefficients in most of the troposphere (in the entire square below the dashed lines depicting the approximate tropopause level). This means that specific humidity and temperature deviations at all levels impact subgrid heating at a given level, i.e., there are large non-local relations in the vertical. Some non-local relations are physically plausible for convection, since buoyant plumes tend to rise from the surface, and indeed near-surface T and q influence Ṫ through the entire troposphere. However, in this model, moisture at higher altitudes appears to influence q̇ at lower altitudes, raising suspicions that some of the brute-force NN's non-localities are not causal, but rather due to high auto-correlations within the input's vertical profile, as (47) showed could happen. Temperature variations are observed to have strong vertical correlations (90), in part because of deep convective effects. Because temperature affects the saturation threshold for moisture, the BF NN has to correctly capture the effects of both temperature and moisture wherever either has influence. In contrast, in the B_plume-RH climate-invariant case (b), leading non-local effects between the boundary layer and the free troposphere have already been taken into account in the buoyancy formulation, and the temperature-dependent saturation threshold is built into RH. Thus, M for Ṫ tends to be concentrated near the red diagonal, meaning that positive deviations of plume buoyancy and relative humidity increase subgrid heating near the same vertical level. The use of domain knowledge has effectively reduced the effects that must be estimated by the NN for the climate-invariant models. This tends to yield differences between the models trained in the cold and the warm climates that are much smaller than for the brute-force models (see last column of Fig S14).

[Fig 7 caption (fragment): "...climates outperform brute-force (BF) models offline in ≈95% of cases, with less sensitivity to the data partition used for training. Dots on the left represent the median BF error from a 10-fold cross-validation without replacement, with horizontal lines indicating the 1st and 9th deciles. Stars on the right correspond to the median climate-invariant CI error. Ticks denote the majority of cases, for which CI models outperform BF models, even when data from both climates are available; crosses indicate the rare exceptions."]

Advantages of Climate Invariance with Multi-Climate Data
It is natural to wonder whether the benefits of climate invariance carry over to training scenarios that include data from multiple simulations spanning diverse temperatures; until now, we have only trained ML models on single-climate simulations. To find out, we examine the benefits of our climate-invariant transformation approach under the ideal scenario where we have access to data from multiple climates. We conduct experiments where we train NNs on both the cold (-4K) and warm (+4K) aquaplanet simulations. We progressively increase the amount of training data to assess data efficiency. Throughout the experiment, we use eight batches and systematically increase the batch size by powers of 2, starting with a batch size of 1. Note that we obtain similar results when increasing the number of batches while keeping the batch size fixed (not shown). To obtain well-defined uncertainty estimates, we employ a 10-fold cross-validation procedure via random sampling without replacement. Our findings, depicted in Fig 7, demonstrate that CI NNs (1) consistently outperform BF NNs, particularly in data-rich scenarios; and (2) exhibit lower sensitivity to the training data partition, resulting in more reliable performance with reduced variability in test errors. This confirms that our climate-invariant mapping enhances data efficiency, performance, and fit reproducibility across different climates, even when training data from multiple surface temperatures are available.

Discussion
In the context of climate change, we hypothesized that ML models emulating climate-invariant mappings (Fig 1), for which the input/output distributions change little across climates (Fig 3), generalize much better than ML models emulating brute-force mappings, for which the input/output distributions change significantly across climates. Tested on a suite of storm-resolving atmospheric simulations with different surface temperatures in three atmospheric models with distinct geographies (Fig 2), physically-transformed NNs generalize better as their inputs are progressively transformed (Fig 4). Climate-invariant NNs whose inputs have all been transformed learn mappings that are robust to temperature and geography changes (Tab 1), and hence exhibit superior generalization skill almost everywhere on the globe (Fig 5), including when data from multiple climates are available (Fig 7). Finally, attribution maps reveal that climate-invariant NNs learn more spatially local mappings that facilitate generalization across climates and geographies (Fig 6).
From a computational perspective, incorporating physical knowledge, here of climate change, into an ML framework to improve its generalization skill is a successful example of using domain knowledge to extract more informative predictors, informally referred to as "feature engineering" (e.g., (91)). This also aids interpretability of the mapping. From a climate science perspective, requiring that a nonlinear statistical model of the atmosphere generalize across climates is a stringent test that helped us discover new mappings. This climate-invariant mapping is more robust to climate and geography changes, and is more advantageous than directly using model and observational outputs (e.g., specific humidity, temperature), even when data are available in various climate regimes. In the particular case of subgrid thermodynamics, our generalization results suggest the possibility of NN-powered closures that could work in Earth-like settings, even in vastly different climate conditions. Finally, the attribution maps suggest the possibility of new analytic representations of convection from data, facilitated by the more local climate-invariant representation of subgrid thermodynamics. Our strategy paves the way for the successful use of machine learning models for climate change studies.

Supplementary Materials
This PDF file includes: Supplementary Text (SM A to SM D), Figs. S1 to S14, Tables S1 to S3, References (S1 to S39).

The Supplementary Materials (SM) are organized as follows: First, we discuss code and data availability (SM A), including links to multiple repositories to reproduce the different ML-based closures and climate simulations discussed in the manuscript. Then, we derive the input transformations used in the manuscript (SM B1) and the output transformations tested in this SM (SM B2), and discuss possible vertical coordinate transformations (SM B3). We provide a guide to finding new climate-invariant transformations in SM B4.
SM C details the practical implementation of our climate-invariant ML workflow. Finally, we present supplementary results in SM D, including the Jensen-Shannon distance between input distributions to complement the Hellinger distance presented in the main manuscript (SM D1), the learning curves of climate-invariant models across climates and geographies (SM D2), the generalization skill of climate-invariant NNs near the surface (SM D3), and three methods to visualize the "brute-force" and climate-invariant mappings and compare them in the cold (-4K) and warm (+4K) climates (SM D4).
As described in Section 2, we use data from eight climate simulations using three climate models (SPCAM3, SPCESM2, and SAM) to form our training, validation, and test sets. We report the exact characteristics of the splits in Tab S2 and information to re-generate the full simulation output below.

SPCAM3
The codebase for running the "SPCAM3" simulation is the same employed by (4), which is archived at https://gitlab.com/mspritch/spcam3.0-neural-net/-/tree/sp-diagnostic for the (+0K) simulation. The sea surface temperature is uniformly cooled by 4K to produce the (-4K) simulation and uniformly warmed by 4K to produce the (+4K) simulation. Raw output of the (+0K) simulation can be found at https://zenodo.org/record/1402384#.YaUCsdDMI-w (3). The full simulation output, which is several TB, is archived on the GreenPlanet cluster at UC Irvine and available upon request.
SPCESM2
The codebase for running the "SPCESM2" simulations is the same employed by (5), which is archived at https://github.com/mspritch/UltraCAM-spcam2_0_cesm1_1_1; this code was in turn forked from a development version of CESM1.1.1 located on the NCAR central subversion repository under tag spcam_cam5_2_00_forCESM1_1_1Rel_V09, which dates to February 25, 2013. The full simulation output, which is several TB, is also archived on the GreenPlanet cluster at UC Irvine and available upon request. We additionally archived the input data and run scripts necessary to re-run all three simulations as part of the manuscript's accompanying data using Zenodo: https://zenodo.org/record/5775541#.YbeMHNDMKUl (6).
SAM
The codebase for running the "SAM" simulations is the same employed by (7). The data from the (-4K) simulation are several TB and can be found at https://drive.google.com/drive/folders/1TRPDL6JkcLjgTHJL9Ib_Z4XuPyvNVIyY. The initial sounding, meridional surface temperature profile, and source code to re-run the simulation can be found at https://zenodo.org/record/4118346#.YaT_WtBKg-w (8). To produce the (+0K) simulation, the initial sounding and surface temperature profiles are both uniformly warmed by 4K.

B1. Mapping
In this subsection, we present the characteristics of the emulated mapping, highlighting the differences between the superparameterized models (SPCAM3, SPCESM2) and the storm-resolving model (SAM). In both cases, the input vector x encodes the large-scale (≈100 km) climate state; for the superparameterized models,

x = [ q(p), T(p), p_s, S_0, SHF, LHF ],

where q(p) is the vertical profile of specific humidity in units of kg/kg (written as a function of the background pressure coordinate p in units of Pa), T(p) is the temperature's vertical profile in units of K, p_s is surface pressure in units of Pa, S_0 is solar insolation in units of W/m², SHF is surface sensible heat flux in units of W/m², LHF is surface latent heat flux in units of W/m², q_np(p) is the vertical profile of non-precipitating water concentration in units of kg/kg, and d_Equator is the distance to the Equator, which is used as a proxy for solar insolation in the mapping learned for SAM. The output vector y groups subgrid-scale thermodynamic tendencies; for the superparameterized models,

y = [ L_v q̇(p), c_p Ṫ(p), lw(p), sw(p) ],

where L_v in units of J kg⁻¹ is the latent heat of vaporization of water in standard conditions, q̇(p) is the subgrid moistening vertical profile, c_p in units of J kg⁻¹ K⁻¹ is the specific heat of water at constant pressure in standard atmospheric conditions, Ṫ(p) is the total subgrid heating vertical profile (including subgrid radiation), lw(p) is the subgrid longwave radiative heating vertical profile, sw(p) is the subgrid shortwave radiative heating vertical profile, and q̇_np(p) is the time-tendency of non-precipitating water condensation vertical profile. Following (9), all components of the output vector y are mass-weighted and vertically integrated within each vertical layer to yield energy flux units (W/m²). Assuming vertical profiles have N_p vertical levels, x is of length (2N_p + 4) for SPCAM3 and SPCESM2, and of length (2N_p + 1) for SAM. y is of length 4N_p for SPCAM3 and SPCESM2, and of length 2N_p for SAM.

B2. Physically-Based Input Transformations
In Fig S1, we compare three transformation options for each input via their univariate PDFs: no transformation (top), our most successful transformation (bottom), and our second-best transformation (middle). After defining our second-best input transformations (B2a), we delve into the details of our relative humidity (B2b) and plume buoyancy (B2c) transformations before discussing the distribution shifts of the other inputs (B2d).

B2a. Second Best Input Transformations
Along the way to our optimal feature transformations, we explored candidate options that proved second best. For completeness these are reviewed first in the SM.
Saturation Deficit: We explored the saturation deficit but found that it did not lead to climate invariance. Similar to q, the saturation deficit (Fig S1a, middle) still exhibits an expansion of the PDF with warming as a result of the Clausius-Clapeyron relation. It is defined as the amount by which the water vapor concentration must be increased to achieve saturation without changing the environmental temperature and pressure:

∆q(q, T, p) := q_sat(T, p) − q,

where q_sat(T, p) is the saturation specific humidity. In contrast, the relative humidity transformation q̃_RH(p) := RH(q, T, p) (Fig S1a, bottom) results in a climate-invariant PDF, as evidenced by PDFs that mostly overlap across all three climates.
Temperature minus Near-Surface Temperature: Assuming that the temperature PDF's shift with warming is almost uniform with height, we can derive an approximate invariant by subtracting the temperature at all levels T(p) from the near-surface temperature T(p_NS):

T̃_NS(p) := T(p_NS) − T(p),

where p_NS is the lowest atmospheric pressure level (Fig S1b, middle). However, this linear transformation fails in the upper atmosphere, especially near the tropopause, where temperatures are approximately invariant with warming (10-12) and therefore decoupled from surface temperature changes. This is why the buoyancy of a moist static energy-conserving plume (derived in SM B2c), where q_NS = q(p_NS) is the near-surface specific humidity, yields approximate climate invariance (Fig S1b, bottom), unlike the temperature minus near-surface temperature transformation.
Scaling Latent Heat Fluxes by Near-Surface Specific Humidity: To address the increase of LHF with warming, we scale LHF by the near-surface specific humidity q(p_NS) (Fig S1c, middle):

L̃HF_q := LHF / max{ q(p_NS), ε_q },

where ε_q is a user-chosen parameter that we set to 10⁻⁴ to avoid division by zero. While better than directly using LHF, this transformation fails for very dry atmospheres when the latent heat flux is negative, e.g., in polar oceans where atmospheric water vapor may be condensing on the surface, or when the near-surface specific humidity is very small, e.g., in subtropical regions. This is why scaling LHF using the near-surface saturation deficit (Fig S1c, bottom):

L̃HF_∆q := LHF / max{ q_sat(p_NS) − q(p_NS), ε_q },

yields better generalizability. While both the Jensen-Shannon and Hellinger distances would suggest that L̃HF_∆q is a slightly worse transformation than L̃HF_q, the L̃HF_∆q transformation leads to improved generalization performance compared to L̃HF_q (not shown). This confirms that only considering the PDF distances is not always sufficient to find the optimal transformations (discussed in SM B4).
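The two scalings can be sketched as follows; the max-based guard against division by zero and the variable names are our reading, not necessarily the paper's exact implementation:

```python
import numpy as np

EPS_Q = 1e-4  # kg/kg; user-chosen guard against division by zero

def lhf_scaled_by_q(lhf, q_ns):
    """Second-best option: scale LHF by near-surface specific humidity q(p_NS)."""
    return lhf / np.maximum(q_ns, EPS_Q)

def lhf_scaled_by_deficit(lhf, q_ns, qsat_ns):
    """Preferred option: scale LHF by the near-surface saturation deficit
    q_sat(p_NS) - q(p_NS)."""
    return lhf / np.maximum(qsat_ns - q_ns, EPS_Q)
```

Both functions broadcast over NumPy arrays, so they can be applied to whole batches of samples at once.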

B2b. Relative Humidity
Relative humidity (RH) provides our optimal transformation for the specific humidity inputs. RH is defined as the ratio of the partial pressure of water vapor e(p, q) to its saturation value e_sat(T), and can be expressed analytically:

RH = e(p, q) / e_sat(T) = (R_v / R_d) × p q / e_sat(T),

where R_v ≈ 461 J kg⁻¹ K⁻¹ is the specific gas constant for water vapor, R_d ≈ 287 J kg⁻¹ K⁻¹ is the specific gas constant for dry air, p (in units of Pa) is the total atmospheric pressure, q (in units of kg/kg) is specific humidity, and e_sat(T) (in units of Pa) is the saturation pressure of water vapor, whose analytic expression in our case is given below. Consistent with Eq 8, the saturation specific humidity q_sat, corresponding to RH = 1, is

q_sat(T, p) = (R_d / R_v) × e_sat(T) / p.

SAM's single-moment microphysics scheme (13), which is also used in the SPCAM3 and SPCESM2 simulations, partitions water between the liquid and ice phases using a weight ω that is a linear function of the absolute temperature:

ω(T) = max{ 0, min[ 1, (T − T_00) / (T_0 − T_00) ] },

where T_0 = 273.16 K and T_00 = 253.16 K. Under the assumptions of this microphysics scheme, the saturation pressure of water vapor can then be found by integrating the Clausius-Clapeyron equation with respect to temperature, expressed analytically as:

e_sat(T) = ω(T) e_liq(T) + [1 − ω(T)] e_ice(T).

In Eq 11, as temperature increases, the saturation pressure of water vapor goes from the saturation vapor pressure with respect to ice e_ice to the saturation vapor pressure with respect to liquid e_liq. These are given by polynomial approximations: e_liq is a polynomial fit with a coefficient vector a_liq of length 9 containing nonzero polynomial coefficients, while the polynomial approximation for e_ice, with the same temperature switches as Eq 11, additionally involves a ramp function of temperature C(T) and coefficient vectors (a_ice, c_ice) of length 9 and 5 containing nonzero elements, respectively. Between temperatures of T_00 and T_0, the saturation pressure of water vapor is a weighted mean of e_liq and e_ice.
The reader interested in the numerical details of this transformation is referred to our implementation of relative humidity at https://colab.research.google.com/github/tbeucler/CBRAIN-CAM/blob/master/Climate_Invariant_Guide.ipynb.
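The RH transformation reduces to a few lines of array code. In this illustrative sketch we substitute Bolton's (1980) empirical formula for SAM's liquid-phase polynomial fit (whose coefficients are only given in the linked notebook) and ignore the ice phase, so the numbers are approximate rather than bit-identical to the paper's implementation:

```python
import numpy as np

R_V, R_D = 461.0, 287.0  # specific gas constants for water vapor / dry air, J/kg/K

def e_sat_liquid(T):
    """Saturation vapor pressure over liquid water in Pa.

    Bolton (1980) approximation, used here as an illustrative stand-in for
    SAM's ninth-order polynomial fit e_liq."""
    T_celsius = T - 273.15
    return 611.2 * np.exp(17.67 * T_celsius / (T_celsius + 243.5))

def relative_humidity(q, T, p):
    """RH = e / e_sat, with the partial pressure approximated as
    e ~ (R_v / R_d) * q * p (valid for q << 1)."""
    e = (R_V / R_D) * q * p
    return e / e_sat_liquid(T)
```

For example, near the freezing point at 1000 hPa, a specific humidity of a few g/kg yields RH close to 1, consistent with the saturation specific humidity formula above.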

B2c. Plume Buoyancy
Plume buoyancy (B_plume) is our most successful transformation for the temperature inputs. Buoyancy is defined as the upward acceleration exerted upon parcels by virtue of the density difference between the parcel and the surrounding air of the atmospheric column (e.g., (14)). Because our ML model's inputs represent the large-scale thermodynamic state, the ML model does not have information about the storm-scale buoyancy field, and we must rely on idealized approximations to estimate the buoyancy that a plume would have for given specific humidity and temperature profiles. To be consistent with the model's conserved quantities (13), we derive a simple buoyancy metric based on a moist static energy (h) conserving plume below, following similar derivations in (15) and (16). We refer the reader interested in the numerical details of this transformation to https://colab.research.google.com/github/tbeucler/CBRAIN-CAM/blob/master/Climate_Invariant_Guide.ipynb. For the purposes of this transformation, we omit virtual temperature effects and condensate loading (effects of the environmental water vapor on heating/moisture sink are estimated separately). Thus the parcel buoyancy is simply proportional to the relative difference between its temperature T_par and the environmental temperature T:

B = g (T_par − T) / T,

where g is the gravity constant. Further assuming that the plume is non-entraining, obeys hydrostatic balance, and lifts parcels from the near-surface, the lifted parcel's moist static energy is conserved and equal to its near-surface value (at pressure p_NS):

h_par = h(p_NS) ≈ c_p T(p_NS) + L_v q(p_NS),

where we have used the environmental moist static energy's definition:

h = c_p T + g z + L_v q,

where z is geopotential height, and we neglected z(p_NS) as the near-surface is close to the surface by definition.
To express the parcel's buoyancy as a function of the environmental thermodynamic state, we finally assume that the parcel is saturated (not necessarily true close to the surface) and that the thermodynamic differences between the parcel and the environment are small, which allows us to linearize the Clausius-Clapeyron equation about the environmental temperature:

q_sat(T_par) ≈ q_sat(T) + (∂q_sat/∂T)|_T × (T_par − T).

Using Eq 18 to write T_par − T as a function of the environmental thermodynamic state and substituting the resulting expression into Eq 15 yields an estimation of plume buoyancy from (q, T, p):

B_plume(q, T, p) = g × [ h(p_NS) − h_sat(p) ] / [ c_p T (1 + γ) ],

where the parcel's moist static energy is expressed as a function of near-surface (q, T) in Eq 16, the environmental saturated moist static energy in pressure coordinates is defined as:

h_sat(p) = c_p T(p) + g z(p) + L_v q_sat(T(p), p),

and we have introduced the dimensionless factor:

γ = (L_v / c_p) × (∂q_sat/∂T)|_T.

Note that in pressure coordinates, we calculate the geopotential height by vertically integrating the hydrostatic equation after using the ideal gas law:

z(p) = (R_d / g) ∫_p^{p_s} T(p') dp' / p'.
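Under these assumptions, the buoyancy estimate reduces to a few array operations. The sketch below uses rounded constants, approximates ∂q_sat/∂T with the Clausius-Clapeyron relation as γ ≈ L_v² q_sat / (c_p R_v T²), and uses illustrative names; the exact implementation lives in the linked notebook:

```python
import numpy as np

G, CP, LV, R_V = 9.81, 1004.0, 2.5e6, 461.0  # rounded physical constants (SI)

def plume_buoyancy(T, qsat, z, T_ns, q_ns):
    """Estimate B_plume = g * (h_NS - h_sat) / (c_p * T * (1 + gamma)).

    T, qsat, z: environmental temperature (K), saturation specific humidity
                (kg/kg), and geopotential height (m), scalars or profiles.
    T_ns, q_ns: near-surface temperature and specific humidity.
    """
    # Parcel moist static energy, conserved from the near-surface (z_NS ~ 0)
    h_plume = CP * T_ns + LV * q_ns
    # Environmental saturated moist static energy
    h_sat = CP * T + G * z + LV * qsat
    # gamma = (L_v / c_p) dqsat/dT, with Clausius-Clapeyron linearized about T
    gamma = LV**2 * qsat / (CP * R_V * T**2)
    return G * (h_plume - h_sat) / (CP * T * (1.0 + gamma))
```

As a sanity check, a parcel whose near-surface moist static energy equals the environmental saturated value is neutrally buoyant (B = 0), while excess near-surface moisture yields positive buoyancy.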

B2d. Sensible Heat Fluxes and Surface Pressure
In this subsection, we discuss in Fig S2 the univariate PDFs of the two inputs we did not transform in the main manuscript (see Section 3), for both super-parameterized models and all three surface temperatures (-4K, +0K, and +4K). The PDF of sensible heat fluxes changes very little with warming. There is a slight expansion of the left tail of the surface pressure PDF with warming as the most extreme low-pressure systems become more intense, but we hypothesize that these changes are small enough not to require a dedicated input transformation.

B3a. Theory
In contrast to input transformation, transforming our ML models' outputs, namely subgrid thermodynamics, only marginally improves the models' ability to generalize. In the absence of physical theory on how the full vertical profile of subgrid thermodynamics changes with warming, we place ourselves in an idealized scenario: Assuming we know how the outputs' marginal PDF changes with warming, can we help our ML models generalize via output transformation?
We note that assuming knowledge of how the marginal (univariate) PDFs (or, equivalently, the CDFs) of convective heating and moistening change with warming is more realistic than assuming full knowledge of how their joint PDFs change with warming. This knowledge could come from convection theory (e.g., (17)) or from simulations shorter than those required to train a subgrid closure. Under this assumption, a natural transformation is the outputs' cumulative distribution function (CDF): ỹ = CDF(y).
In essence, we are assuming that the mapping is more likely to be invariant in quantile than in physical space, which is a common practice when debiasing the outputs of climate models referred to as quantile mapping (e.g., review by (18)). In practice, we test two distinct methods for transforming the outputs using their CDFs and report the results for SPCAM3 aquaplanet simulations in SM B2b.
Quantile mapping after training: The first method is to transform the ML model's input during training, and then transform the ML model's output after training. This is akin to standard, post-hoc quantile mapping. In the particular case of trying to generalize from a (-4K) cold simulation to a (+4K) warm simulation, the entire transformation yielding outputs in physical units can be mathematically written as:

y_{+4K} = CDF_{+4K}⁻¹( CDF_{−4K}(y) ),

where, for simplicity but without loss of generality, we have considered a single output y whose CDF is CDF_{−4K} in the (-4K) cold simulation and CDF_{+4K} in the (+4K) warm simulation.
Quantile mapping before training: The second method is to transform the ML model's output before training. In that case, we directly train the ML model to predict ỹ = CDF(y) as accurately as possible. We then map the output back to physical units using CDF⁻¹ after training.
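With empirical CDFs estimated from samples of each climate, post-hoc quantile mapping can be sketched as follows; the function names and the interpolation-based CDF estimate are illustrative, and the paper's CDFs may be estimated differently:

```python
import numpy as np

def make_cdf(samples):
    """Return an empirical CDF and its inverse, both via linear interpolation."""
    s = np.sort(np.asarray(samples))
    ranks = (np.arange(len(s)) + 0.5) / len(s)  # plotting-position probabilities
    cdf = lambda y: np.interp(y, s, ranks)
    inv = lambda u: np.interp(u, ranks, s)
    return cdf, inv

def quantile_map(y_pred, cold_samples, warm_samples):
    """Post-hoc mapping y_warm = CDF_warm^{-1}( CDF_cold( y_pred ) )."""
    cdf_cold, _ = make_cdf(cold_samples)
    _, inv_warm = make_cdf(warm_samples)
    return inv_warm(cdf_cold(y_pred))
```

For instance, if the warm-climate distribution is simply the cold one shifted by a constant, the mapping recovers that shift at every quantile.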

B3b. Results
The two methods to transform outputs presented above are depicted in Fig S3, and the generalization results are presented in Fig S4. Transforming outputs after training slightly improves generalization skill (from an overall R² of 0.58 to 0.62 on the (+4K) generalization set). In contrast, transforming outputs before training leads to equally poor results on both the training and generalization sets, a negative result underlining the challenge of designing an appropriate loss function in probability space. A possible solution would be to convert the outputs back to physical space before feeding them to the loss function during training; further investigation is required to fully assess the potential and limitations of training these ML models in probability space.

B4. Coordinate Transformation
Another transformation to consider when input/output variables are functions of spatiotemporal coordinates is coordinate transformation, resulting in a coordinate change. In our specific example, it is possible to transform the vertical coordinate, i.e., the hybrid pressure coordinate p. Possible transformations include:
1. the temperature (p̃ = T), e.g., for radiative heating, which tends to vary less in temperature coordinates (19);
2. the saturation specific humidity (p̃ = q_sat), which is consistent with a transformation of the primitive equations that captures an upward shift of the circulation as the climate warms (20);
3. the geopotential height or the altitude (p̃ = z), which could more consistently capture gravity wave propagation;
4. or a coordinate with fixed values for characteristic vertical levels in the atmosphere, such as the top of the planetary boundary layer or the tropopause (10).
While we were able to transform the vertical coordinate using interpolation functions (not shown), the benefits were not visible in our particular case. This could be because input transformation already addresses some of the upward shift of convective activity with warming, as shown by the influence of humidity inputs in Fig S13.
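A vertical coordinate change of this kind amounts to interpolating each profile onto the new coordinate. A minimal sketch, assuming a monotonic target coordinate and using illustrative names:

```python
import numpy as np

def regrid_profile(profile, coord_old, coord_new):
    """Interpolate a vertical profile from one monotonic coordinate to another,
    e.g. from pressure p to saturation specific humidity q_sat."""
    # np.interp requires increasing x; flip if the coordinate decreases with index
    if coord_old[0] > coord_old[-1]:
        coord_old, profile = coord_old[::-1], profile[::-1]
    return np.interp(coord_new, coord_old, profile)
```

In practice the target levels `coord_new` would be fixed across climates, so that each ML input index refers to the same (transformed) vertical location in every simulation.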

B5. Finding Feature Transformations Yielding Climate Invariance
This section outlines how to find successful feature transformations. Fig S5 illustrates our proposed workflow for finding robust inputs/outputs transformations that transform the initial "brute-force" mapping into a climate-invariant mapping when combined. Note that this workflow assumes that we cannot or do not want to retrain ML algorithms in the target climate, which excludes automatically finding a transformation by training a model. This limitation could arise because the data in the target climate are insufficient or less reliable, or because we seek to uncover new physical relations that hold across an even wider range of climates.
The first step is to propose a physical transformation to implement. We can do this through knowledge of robust physical or statistical relations that link and/or preserve distributions (e.g., state equations, self-similarities, conservation laws, accurate empirical relations, etc.), as exemplified in SM B2. These relations help derive invariants (e.g., (21)) under a change in thermodynamic conditions. Before taking the time to implement a transformation in the ML workflow, we can verify that the PDFs of the transformed inputs/outputs match in the training and target climates. Ideally, the joint PDFs of the transformed inputs/outputs would match. In practice, because it is easier to transform one variable at a time and the data are often insufficient in the target climate, we fall back on the necessary (but not sufficient) condition that the univariate PDFs of the transformed inputs/outputs must match in the training and target climates. Mathematically, this match can be quantified using PDF distance metrics.
An additional challenge is that the original and transformed variables may have different units and ranges, meaning that any nonlinear distance metric will complicate the PDF comparison. To address this, we normalize the PDFs and their support variables X so that the PDFs' domains strictly lie within [0, 1]. For a given variable, we use the same normalization factors across climates:

X_norm = ( X − min_cl X ) / ( max_cl X − min_cl X ),

where PDF_norm is the correspondingly rescaled PDF and X_norm is its transformed support, and max_cl and min_cl respectively refer to the maximum and minimum operators over the variable's domain and across climates, i.e., over the (-4K), (+0K), and (+4K) simulations.
Once the PDFs of each variable are normalized, we may pick any informative PDF distance metric to quantify how PDFs match across climates. Here, we pick the commonly-used Hellinger distance between two PDFs p and q, formally defined (22) as:

H(p, q) = sqrt( (1/2) ∫ [ sqrt(p(x)) − sqrt(q(x)) ]² dx ).

This distance is symmetric (i.e., the arguments' order does not affect the outcome) and easy to interpret: H(p, q) is bounded by 0 (when p = q) and 100% (when p is zero wherever q is positive and vice-versa). In SM D1, we show that our results using the Hellinger distance (see Tab S1) are consistent with those using the Jensen-Shannon distance (23) (see Tab S3), a PDF distance metric giving large weight to the PDFs' tails, which tend to be particularly problematic for generalization purposes.
Once the univariate PDFs of the physically transformed variables match across climates, the second step is to train two inexpensive or "baseline" models on the reference climate to quickly check whether the transformation improves an ML model's generalization ability: (1) a "brute-force" model without the transformation; and (2) a "climate-invariant" model with the transformation. If the transformation does not improve the baseline model's generalization abilities (i.e., (2) performs worse than (1) in the target climate), then the transformation may not be appropriate. Note that because we here develop climate-invariant neural nets, we suggest training MLR baselines, but the ML model used to define the baseline should be tailored to the desired final ML model.

If the transformation improves the inexpensive baseline model's performance, the last step is to train the "brute-force" and "climate-invariant" versions of the desired ML model (usually nonlinear) on the reference climate. If the physical transformation improves the desired ML model's generalization abilities (i.e., the climate-invariant model beats the brute-force model in the target climate using the same performance metric calculated over a validation set), then we may keep the transformation. This workflow may be repeated for the ML model's additional input/output variables until the emulated mapping is as climate-invariant as possible.
Before applying this workflow to subgrid thermodynamics closures, we underline one of its key challenges: Because some transformations are much more impactful than others, it is often not possible to develop each physical transformation independently. In our case, the specific humidity inputs vary the most across climates, meaning that transforming specific humidity affects the model's generalization abilities the most. As a result, initial experiments that independently tested the effect of transforming temperature suggested a negative impact of temperature transformation on generalization ability (not shown). This initial result was later invalidated by experiments that jointly transformed specific humidity and temperature (see below). Following this, we adopt a progressive input transformation approach, where the most important inputs are transformed first: Specific humidity, then temperature, and finally surface energy fluxes.

C. Implementation
For reproducibility purposes (24), we now detail the practical implementation of a climate-invariant ML workflow (see Fig S6), from its overall structure (SM C1) to its benchmarking (SM C4) via the characteristics of the multiple linear regressions (MLR, SM C2) and neural networks (NN, SM C3) presented in this manuscript.

C1. Overall Workflow
We present three ways to implement physical transformations. The first way is to physically transform the inputs/outputs before training. While this option is easiest to implement and debug, it usually comes at the cost of disk space: Every time we try a new transformation, we need to duplicate our training/validation/test datasets for all the climates/geographies we are interested in, which can quickly be prohibitive when trying multiple transformation combinations.
Therefore, it can be advantageous to transform the input/output variables within the ML framework, so that the transformations occur during training. In essence, we are trading disk space for computational time. In that spirit, the second method is to transform the inputs/outputs via custom layers (e.g., Ch 12 of (25)) in the ML algorithm itself. Since this second method tends to substantially slow down training, as it adds sequential operations on the GPU, we take advantage of the fact that the transformations occur before and after the emulated mapping and propose a third method that can happen in parallel on the CPU: transforming inputs/outputs by customizing the pipeline or "data generator", i.e., the algorithm responsible for feeding numbers to the ML model after reading the training data files. For each batch, the custom data generator transforms the inputs before feeding them to the ML algorithm. In our case, note that we transform outputs independently via quantile mapping (see SM B2).
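As a sketch, such a generator can be written following the Keras `Sequence` interface (`__len__`/`__getitem__`); the class and argument names below are illustrative and framework-agnostic:

```python
import numpy as np

class TransformingGenerator:
    """Keras-style data generator (cf. tf.keras.utils.Sequence) that applies
    the physical input transformation batch by batch on the CPU, so raw
    training files never need to be duplicated on disk."""

    def __init__(self, X, y, batch_size, input_transform=None):
        self.X, self.y = X, y
        self.batch_size = batch_size
        self.input_transform = input_transform  # None => "brute-force" model

    def __len__(self):
        # Number of batches per epoch.
        return int(np.ceil(len(self.X) / self.batch_size))

    def __getitem__(self, idx):
        sl = slice(idx * self.batch_size, (idx + 1) * self.batch_size)
        Xb = self.X[sl]
        if self.input_transform is not None:
            Xb = self.input_transform(Xb)   # e.g., q -> RH in physical units
        return Xb, self.y[sl]
```

Setting `input_transform=None` recovers the brute-force configuration, so the same generator serves both model families.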
For the rest of this manuscript, we will train our ML models using custom data generators: For "brute-force" models, the transformations are set to None (no transformation), while for "climate-invariant" models, the q transformation is set to RH, the T transformation is set to T̃_buoyancy, and the LHF transformation is set to LHF_∆q. For all models, we additionally subtract the mean from each input before dividing it by its range, so as to feed the ML algorithm floating-point numbers between (-1) and 1. Note that for each transformation, numbers are "de-normalized" before the transformation and "re-normalized" after, so that all transformations are done in physical units while the ML algorithm is always fed single-precision floating-point numbers in that range. For simplicity, and building upon previous ML-powered subgrid closures (4,7,26), we use the mean-squared error (MSE) of the prediction in physical units (here W² m⁻⁴) as our loss function. Motivated by the framework presented in Fig S5, we first train MLRs (SM C2) before training NNs (SM C3) and benchmarking our ML models to quantify their accuracy and ability to generalize (SM C4).
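The de-normalize / transform / re-normalize round trip can be sketched as follows; the function names and the per-variable statistics (`mean`, `rng_`) are illustrative:

```python
import numpy as np

def normalize(x, mean, rng_):
    """Map a variable to roughly (-1, 1) using its mean and range."""
    return (x - mean) / rng_

def denormalize(xn, mean, rng_):
    """Invert normalize(), recovering physical units."""
    return xn * rng_ + mean

def transform_normalized(xn, mean_in, range_in, transform, mean_out, range_out):
    """De-normalize to physical units, apply the physical transformation,
    then re-normalize with the transformed variable's own statistics,
    returning single-precision floats for the ML algorithm."""
    x_phys = denormalize(xn, mean_in, range_in)
    return normalize(transform(x_phys), mean_out, range_out).astype(np.float32)
```

With the identity transformation, the round trip reduces to the original normalization, which is a useful sanity check when debugging a new transformation.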

C2. Multiple Linear Regressions
To use the same data generator for both MLRs and NNs, we implement our MLRs in TensorFlow 2.0 (27) and train them using the Adam optimizer (28), which builds on stochastic gradient descent (29). Training a climate-invariant MLR results in a weight matrix A of size 4N_p × (2N_p + 4) and a bias vector b of length 4N_p such that:

y = A x + b,  (28)

where x is the vector of (transformed) inputs of length 2N_p + 4, y is the vector of predicted outputs of length 4N_p, and stochastic optimization means that there is no unique optimal solution for A and b. We train MLRs for 20 epochs using the default Keras learning rate of 0.001 and save the weights and biases corresponding to the minimal loss over the validation set.
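Given the shapes above, the fitted MLR is a single affine map; a minimal numpy sketch (the choice N_p = 30 is arbitrary and purely illustrative):

```python
import numpy as np

# Shapes of the climate-invariant MLR: with N_p vertical levels, the inputs
# are two profiles plus 4 scalars, and the outputs are 4 profiles.
Np = 30                                 # hypothetical number of levels
n_in, n_out = 2 * Np + 4, 4 * Np

rng = np.random.default_rng(0)
A = 0.01 * rng.standard_normal((n_out, n_in))  # weight matrix, 4Np x (2Np + 4)
b = np.zeros(n_out)                            # bias vector, length 4Np

x = rng.standard_normal(n_in)           # one (normalized) input sample
y = A @ x + b                           # the MLR prediction, y = Ax + b
```

In the TensorFlow implementation, this corresponds to a single Dense layer with linear activation, which is what lets the MLR share the NN's data generator.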

C3. Neural Network Design
To isolate the effects of physically transforming the NN's inputs, we fix the hyperparameters of all NNs trained in this study and leave the joint investigation of hyperparameter tuning and physical transformations for future work. Informed by (30) and (31), we use a cyclic learning-rate schedule, decreasing the minimal and maximal learning rates by 10% for the next 6 epochs before further reducing them by a factor of 10 for the last 3 epochs.
For SPCAM and following (35), we augment some of our NNs with batch normalization (BN) and dropout (DP) layers, more specifically one DP layer before each activation function and a single BN layer before the first DP layer. Following (36), we use the default DP rate of 30% and the default parameters of the Keras BN layer, which normalize each feature using its mean and standard deviation in a given batch (37). Note that we do not adjust the default parameters of DP and BN to optimize generalization skill, as this would require misusing the generalization test set as a validation set.
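The training-time behavior of these two regularization layers, and their ordering within a hidden block, can be sketched in numpy as follows; `hidden_block` and its ReLU choice are illustrative stand-ins, not the exact SPCAM architecture:

```python
import numpy as np

def batch_norm(x, eps=1e-3):
    """Normalize each feature by its mean and standard deviation over the
    batch, as Keras' BatchNormalization does at training time (defaults)."""
    return (x - x.mean(axis=0)) / (x.std(axis=0) + eps)

def dropout(x, rate=0.3, rng=None):
    """Inverted dropout with the default 30% rate used in the text:
    surviving activations are rescaled so the expected value is unchanged."""
    rng = rng or np.random.default_rng()
    mask = (rng.random(x.shape) >= rate).astype(x.dtype)
    return x * mask / (1.0 - rate)

def hidden_block(x, weights, first=False, rng=None):
    """One hidden layer with the ordering described above: a single BN
    layer before the first DP layer, and one DP layer before each
    activation function (here ReLU, as an illustrative choice)."""
    h = x @ weights
    if first:
        h = batch_norm(h)
    h = dropout(h, rate=0.3, rng=rng)
    return np.maximum(h, 0.0)
```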

C4. Benchmarking
We benchmark our ML models using two different metrics: their MSE and their coefficient of determination R², defined for a single output y_k as:

R²(y_k) = 1 − ⟨(y_k − ŷ_k)²⟩_samp / ⟨(y_k − ⟨y_k⟩_samp)²⟩_samp,

where ŷ_k is the ML prediction of y_k and ⟨·⟩_samp is the averaging operator over the samples of interest. For instance, if we want a horizontal map of R², we average samples at a given location over time, while we average over time and horizontal space if we want a single R² value for y_k. Similarly, if we want one value of MSE per output y_k, we only average the MSE over time and horizontal space rather than over all outputs, as when calculating the loss function. While comparing MSE and R² in the reference and target generalization climates is enough to assess generalization skill after training, we are also interested in how a given ML model learns to generalize during training. To address that question, we augment our SPCAM ML models with a function (technically a "Keras callback" (38)) that, at the end of each epoch, calculates the MSE over two datasets corresponding to the two generalization experiments: (1) a dataset of different temperature (warm when training on cold, and vice-versa); and (2) a dataset of different geography (Earth-like when training on aquaplanet, and vice-versa). At the end of training, we hence obtain three learning curves for each ML model: the validation loss and the loss in the two generalization sets as a function of the number of epochs. Note that these callbacks are computationally expensive, as they require evaluating the ML model over ≈100M samples at the end of each epoch, which means they should be avoided when purely seeking performance, e.g., during hyperparameter tuning.
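A minimal, framework-agnostic sketch of the per-output R² and the end-of-epoch generalization tracking; `GeneralizationTracker` is our illustrative stand-in for a Keras callback (cf. `tf.keras.callbacks.Callback`, whose `on_epoch_end` hook it mimics):

```python
import numpy as np

def r2(y_true, y_pred):
    """Coefficient of determination per output, averaging over samples
    (axis 0); averaging over different sample subsets yields, e.g., maps."""
    ss_res = np.mean((y_true - y_pred) ** 2, axis=0)
    ss_tot = np.mean((y_true - y_true.mean(axis=0)) ** 2, axis=0)
    return 1.0 - ss_res / ss_tot

class GeneralizationTracker:
    """Evaluates the model's MSE on held-out generalization sets
    (e.g., different-temperature and different-geography datasets)
    at the end of each training epoch."""

    def __init__(self, predict_fn, gen_sets):
        self.predict = predict_fn
        self.gen_sets = gen_sets          # e.g. {"warm": (X, y), "earth": (X, y)}
        self.history = {name: [] for name in gen_sets}

    def on_epoch_end(self, epoch, logs=None):
        for name, (X, y) in self.gen_sets.items():
            mse = float(np.mean((self.predict(X) - y) ** 2))
            self.history[name].append(mse)
```

After training, `tracker.history` holds one generalization learning curve per dataset, which is how the curves of Fig S7 can be assembled.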

D1. Jensen-Shannon Distance between PDFs across Climates
As an alternative to the Hellinger PDF distance, we pick the Jensen-Shannon distance (23) because it is a symmetric distance (i.e., the arguments' order does not affect the outcome) that uses the logarithms of the PDFs, hence giving large weight to the PDFs' tails, which tend to be particularly problematic for generalization purposes:

JSD(p, q) = sqrt{ [KL(p ∥ m) + KL(q ∥ m)] / 2 },  where  m = (p + q)/2,

p and q are the normalized PDFs to compare, and KL is the Kullback-Leibler divergence, defined for continuous PDFs as:

KL(p ∥ q) = ∫ p(x) log[p(x)/q(x)] dx.
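For discrete (binned) PDFs, the integral becomes a sum and the distance can be computed directly; a minimal numpy sketch:

```python
import numpy as np

def kl(p, q):
    """Discrete Kullback-Leibler divergence KL(p || q), with the
    convention 0 * log(0) = 0 for empty bins of p."""
    mask = p > 0
    return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))

def jensen_shannon_distance(p, q):
    """Symmetric Jensen-Shannon distance between two binned PDFs,
    normalized so each sums to one."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    p, q = p / p.sum(), q / q.sum()
    m = 0.5 * (p + q)
    return float(np.sqrt(0.5 * kl(p, m) + 0.5 * kl(q, m)))
```

With the natural logarithm, the distance is bounded by sqrt(log 2), attained for PDFs with disjoint support (the "maximally different climates" limit).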

D2. Learning across Climates and Geographies
This section complements Sec 4.2 and confirms that climate-invariant models learn mappings that are valid across climates and geographies during training. For this purpose, we track the models' generalizability throughout the training process as explained below. Fig S7 shows learning curves; the color of each line indicates the dataset the model was trained in, while the color of the row indicates the dataset the model was tested in. To gain intuition, we can start by looking at lines that have the same color as their axes: These are the "standard" learning curves, showing that each model's validation loss in the same climate/geography monotonically decreases as the model is trained, confirming that we are not overfitting the training set.
We are now ready to zoom in on a key result of this manuscript: The learning curve of the "climate-invariant" NN trained in the cold aquaplanet but tested in the warm aquaplanet (starred blue line in the red box (a)). Impressively, this learning curve is mostly decreasing, confirming that "climate-invariant" NNs are able to continuously learn about subgrid thermodynamics in the warm aquaplanet as they are trained in the cold aquaplanet. In contrast, the "brute-force" NN trained in the cold aquaplanet but tested in the warm aquaplanet (circled blue line in the red box (a)) makes extremely large generalization errors, which worsen as the model is trained in the cold aquaplanet.
"Climate-invariant" NNs also facilitate learning across geographies, i.e., from the aquaplanet to the Earth-like simulations (starred blue line in green box (b) is consistently below circled blue line) and vice-versa (starred green line in blue box (c) is consistently below circled green line). "Climate-invariant" transformations additionally improve the MLR baseline's generalization ability (see right column, e.g., starred blue line in red box (a) and starred green line in blue box (c)), albeit less dramatically. This smaller improvement in MLR's generalization abilities is linked to its relatively small number of free parameters, resulting in (1) "brute-force" MLRs generalizing better than "brute-force" NNs; and (2) MLRs having lower representation power and fitting their training sets less well, limiting the maximal accuracy of "climate-invariant" MLRs on the test set.
There are a few cases in which transforming inputs does not fully solve the generalization problem, e.g., when trying to generalize from the aquaplanet to the Earth-like simulation (starred blue line in green box (b)). NNs with DP fit their training set less well (squared lines that have the same color as their boxes are above corresponding circled/starred lines). However, they improve generalization in difficult cases (e.g., squared blue line in green box (b)) and do not overly deteriorate generalization in cases where the input transformations work particularly well (e.g., squared green line in blue box (a)). This confirms that combining physics-guided generalization methods (e.g., physical transformation of the inputs/outputs) with standard ML generalization methods (e.g., DP) is advantageous.

D3. Geographic Skill
To show that the improved generalization skill of climate-invariant NNs for subgrid heating is not unique to the mid-troposphere (Fig 5 in the main manuscript), in Fig S8 we also show the generalization skill of climate-invariant NNs near the surface. Consistent with (31,39), the highest skill for the training climate is over land for all NNs, as most of the variability comes from the diurnal cycle, which is easy for NNs to predict. Similarly to Fig 5, the generalization error is apparent for the brute-force NN (a) and mostly resolved by making the NN climate-invariant (b).

D4. Visualizing Climate-Invariant Mappings
Before using SHAP in Section 4 to visualize the difference between brute-force and climate-invariant mappings, we test simple linear methods to analyze ML models. First, we directly plot the weights A (see Eq 28) of our multiple linear regressions in Figs S9/S10. Second, we plot the mean Jacobian of our NN, calculated via automatic differentiation, in Figs S11/S12. Unlike SHAP, the MLR weights and the Jacobian matrices both suggest that the climate-invariant mapping is non-local in the vertical. Figs S9/S10 are consistent with the climate-invariant MLR generalizing only slightly better than the brute-force MLR (see top-right panel of Fig S7). Meanwhile, comparing Figs S11/S12 to the full SHAP feature importance matrix (Figs S13/S14) suggests that while the linear sensitivity of subgrid heating/moistening with respect to lower-tropospheric plume buoyancy is high (top panels of Fig S11b), as expected, subgrid heating/moistening can be well predicted using mostly local plume buoyancy information (top panels of Fig S13b).

Table S3: Jensen-Shannon distance away from the (-4K) simulation for the PDFs of (q_600hPa, T_850hPa, T_150hPa, LHF) and their transformations: (+0K) distance in gray and (+4K) distance in red. For a given variable and transformation, we use the same vertical logarithmic scale across models. Note that unlike for q, the best options for T and LHF do not decrease distribution distance more than the second-best options, which is discussed in the text.

Figure S3: Two types of bias-correction methods used to transform outputs in SM B2: (dark green, "post-processing" method) quantile mapping is typically done after training the model to bias-correct the outputs, and (light green, "pre-processing" method) we additionally test directly making predictions in probability space by converting the outputs to their CDF values before training. Note that this usually changes the loss function.
Figure S4: Coefficient of determination R² for 500-hPa subgrid heating of brute-force (a), climate-invariant (b), climate-invariant with outputs transformed after training (c), and climate-invariant with outputs transformed before training (d) NNs trained using the cold (-4K) training set of SPCAM3 and calculated over the warm (+4K) training set of SPCAM3.

For the climate-invariant mapping (b), LHF is transformed to LHF_∆q as described in Sec 3, which, in conjunction with the temperature and humidity transformations, changes the multiple linear regression weights for all input variables.

For the climate-invariant mapping (b), LHF is transformed to LHF_∆q as described in the "Theory" section, which, in conjunction with the temperature and humidity transformations, changes the SHAP feature importance matrix for all input variables.