Basin-scale biogeography of marine phytoplankton reflects cellular-scale optimization of metabolism and physiology

Extensive microdiversity within Prochlorococcus, the most abundant marine cyanobacterium, occurs at scales from a single droplet of seawater to ocean basins. To interpret the structuring role of variations in genetic potential, as well as metabolic and physiological acclimation, we developed a mechanistic constraint-based modeling framework that incorporates the full suite of genes, proteins, metabolic reactions, pigments, and biochemical compositions of 69 sequenced isolates spanning the Prochlorococcus pangenome. Optimizing each strain to the local, observed physical and chemical environment along an Atlantic Ocean transect, we predicted variations in strain-specific patterns of growth rate, metabolic configuration, and physiological state, defining subtle niche subspaces directly attributable to differences in their encoded metabolic potential. Predicted growth rates covaried with observed ecotype abundances, affirming their significance as a measure of fitness and inferring a nonlinear density dependence of mortality. Our study demonstrates the potential to interpret global-scale ecosystem organization in terms of cellular-scale processes.


Environmental data
Data were compiled from the Atlantic Meridional Transect (AMT-13) cruise of the RSS James Clark Ross in the fall of 2003 from the English Channel to the Falkland Islands (Islas Malvinas; fig. 2a), provided by request from the British Oceanographic Data Centre (BODC). 66 measured variables were compiled (data S2), including hydrographic data, spectral light data, nutrient concentrations, pigment concentrations, particulate and dissolved elemental compositions, microbial community compositions, and biological rate measurements. Most measurements were taken from bottle samples collected in depth profiles at 10 m resolution in the upper 100 m, and 20 m to 50 m resolution to 300 m, and additional samples were sometimes collected at the deep chlorophyll maximum depth. Prochlorococcus ecotype abundances, determined by qPCR, were included for 32 stations as reported in [9]. Simulations conducted in this study were restricted to this subset of 32 stations (from a total of 78 casts at 54 stations) for which synoptic ecotype abundance data were available, although we provide the complete transect dataset with permission from the BODC (data S2).
CTD Vertical profiles of pressure, temperature, salinity, dissolved oxygen, and chlorophyll a were collected using a Sea-Bird SBE 911plus CTD. Data were processed at 1 m resolution and calibrated with in situ samples. Potential density was calculated using the UNESCO 1983 algorithm. The deep chlorophyll maximum depth interval was defined as ±10m from the maximal CTD fluorescence depth.
Spectral irradiance Spectral downwelling irradiance profiles were conducted at 22 stations along the transect, using a free-falling optical profiler equipped with 8 wavebands in the photosynthetically active radiation range (412 nm to 685 nm). Di↵use spectral attenuation coecients were measured at 1 m intervals at each station. To capture light dynamics on the photoacclimation timescale, an 8-day average cloud optical thickness ⌧ c was applied as a mask to a model of direct and di↵use clear-sky spectral radiation at each station. Level 2 Aqua cloud products were retrieved from NASA's Atmosphere Archive and Distribution System (https://earthdata.nasa.gov/eosdis/daacs/laads) at 0.1 degree resolution for an area of 3 x 3 pixels centered at each station at the time of sampling. A spatial average was calculated for each station, omitting pixels with missing data. Clear-sky direct and di↵use spectral radiation was calculated at 2 nm resolution using the SMARTS algorithm (Simple Model of the Atmospheric Radiatived Transfer of Sunshine, National Renewable Energy Laboratory; https://www.nrel.gov/grid/solar-resource/smarts.html) at local noon for the spectral attenuation measurement band. Spectral downwelling irradiance (E d ;W m -2 nm -1 ) was then calculated for each station at discrete depths as where is the wavelength (nm), and k d is the attenuation coe cient (m -1 ) measured at each depth z (m). To match the resolution of both the surface irradiance and the pigment spectral absorption measurements (described in sections 1.5 and 1.6), a spline interpolation was applied in the wavelength domain of k d at 2 nm resolution.
Nutrients Nutrients used in simulations included ammonium, nitrite, nitrate, and phosphate. High-and low-sensitivity methods were implemented for their appropriate concentration intervals on each cast, and where overlapping data were available the high-sensitivity measurement was chosen. Nitrate concentrations were calculated as the di↵erence between spectrophotometrically determined nitrite and nitrate+nitrite. Silicate and the reduced sulfur compounds dimethylsulfoxide (DMS) and dimethylsulfionopropionate (DMSP) were quantified on a subset of stations, and are included in the dataset but were not used in simulations. Values below the detection limit were encountered in 4 stations (data S2), and in those cases concentrations were assigned to the detection limit.
Particulate carbon, nitrogen, and chlorophyll Particulate organic carbon and nitrogen were determined by filtration onto glass fiber filters (Whatman GF/F), acidification to aspirate inorganic carbon, and elemental analysis.
Size fractionated chlorophyll a (including divinyl chlorphyll a) concentrations were measured fluorometrically in the intervals 0.2-5 µm, 5-10 µm, and >10 µm using polycarbonate membrane filters, as well as on glass fiber filters (Whatman GF/F). An additional 14 pigments were quantified using HPLC methods, also extracted from GF/F's.

Flux balance analysis
A chemical flux balance can be derived from the Law of Mass Action. A system of n chemical reactions involving m chemical species at concentrations x i can be expressed in terms of the stoichiometry of their reactants a ij and products b ij , The j th reaction proceeds in the forward and reverse directions according to the rate constants k + j and k j , respectively. Thus, the net rate of change in the concentration of the i th chemical species is given by the sum: The system of reactions can be represented by an m ⇥ n matrix of the directed stoichiometric coe cients S ij = b ij a ij , so that reactants and products are assigned negative and positive values, respectively. The forward and backward rates of each reaction, described by the second term of the sum, can also be represented by a vector of fluxes v j , such that It is often the case that both x and k are unknown, so the system of reactions is stoichiometrically balanced but the rates are e↵ectively unbounded. The steady-state scenario (Sv = 0) is often imposed in flux balance analysis (FBA) as a simplifying assumption which implies that, over some time interval t, there is no net change in the concentration of any internal chemical species. A relaxation of this assumption in non-steady-state situations can be found elsewhere [61], but ideally one should limit their interpretation of FBA predictions to the chemostat.
Wherever possible, knowledge of the reaction kinetics and thermodynamic parameters are leveraged to establish directionality, and in some cases, capacity constraints (i.e., v max ; discussed below). Reactions which are reversible remain unbounded, while reactions which are expected to only proceed in the forward or backward direction are constrained by lower bounds v lb j or upper bounds v ub j of zero, respectively. It is critical that all reactions included in the network be balanced for mass and charge, and that each chemical species be stoichiometrically constistent. In practice, however, a number of extracellular chemical species are unconstrained by mass balance and are allowed to freely exchange with the system boundary, enabling flux through the network and the net synthesis of biomass.
Assuming a steady-state, a set of optimal flux distributions can be predicted that maximizes "fitness" by solving the linear program where c is a predefined coe cient vector whose entries define the weighted set of reactions that determine fitness. Here, fitness is defined as growth rate, the rate of synthesis of one gram of ash-free dry biomass of specified composition. Although FBA will find a global maximum function value, the stoichiometric matrix S is typically underdetermined. The flux vector v is not unique, and S often contains thermodynamically infeasible loops. One strategy to retrieve a more sparse solution and to eliminate loops is to minimize the l 1norm of the flux vector [62]. This is achieved with a second linear program. Assume that the previous program results in fitness f = c T v. To retrieve a sparse and unique solution, we minimize the l 1 -norm of the flux vector v subject to the additional linear constraint f = c T v.
When the linear constraint prohibits a sparse solution of v, one can relax the constraint within a predefined tolerance 2 [0, 1] and solve the above problem using c T v f instead. To further explore alternative optima of the FBA solutions, flux variability analysis (FVA) is implemented to identify the maximal and minimal fluxes of each reaction while sustaining the optimal function value within a predefined tolerance. In FVA, the extrema are determined in two sequential optimizations (maximizing and minimizing) for each reaction v j , holding the objective value c T v = f from Eq. (7) within tolerance . For brevity, we show only the minimization step: A measure of the sensitivity of the objective function value to each metabolite is given by the dual solution to the FBA linear program (Eq. 5). Dual variables (of length n) 1 and 2 are assigned to the lower and upper bounds on each reaction. In our application, the biomass formation reaction v B is maximized, so the sensitivity of growth rate to the supply of each metabolite ( i = dv B dx i ) is given by the minimization Critically, the above optimizations require the assignment of appropriate metabolic rate constraints and biochemical compositions for each strain in each environment. In the following sections, we describe the process of reconstructing the metabolic networks (sections 1.3 and 1.4), specifying strainspecific constraints on biochemical compositions and fluxes (section 1.5), incorporating physiological acclimation processes (section 1.6), and accounting for temperature e↵ects (section 1.7).

Pangenome-scale metabolic network assembly
An aggregate GEM was reconstructed to represent the full set of metabolic functions encoded in the pangenome of Prochlorococcus (PanGEM). The current pangenome consists of 77 sequenced isolates, 7 metagenome assembled genomes, and 564 single cell amplified genomes, comprising 866,894 unique ORFs. Gene assemblies were retrieved from the JGI Integrated Microbial Genomes and Microbiomes portal (IMG; http://img.jgi.doe.gov). Translated coding sequences were mapped to KEGG orthologs (KOs) using a bi-directional blast with KEGG's kofamKOALA, paring the number of unique KOs to 2084. A database containing nucleotide sequences, translated amino acid sequences, header strings, database identifiers (NCBI, IMG, and KEGG), metadata and statistics for each genome was compiled; this database was the basis for draft reconstruction of the PanGEM network.
Gene-protein-reaction (GPR) associations were established for all functionally annotated metabolic KO's using the KEGG API. Reactions linked to each KO were parsed into their participating metabolites and stoichiometric coe cients, tabulated and annotated for the draft reconstruction. Wherever possible, each reaction and metabolite was associated with unique identifiers from KEGG, MetaNetX, NCBI PubChem, EMBL-EBI ChEBI, EMBL-EBI RHEA, Enzyme Commission numbers, and InChI. In rare cases, multiple identifiers were linked to a single entry.
Each metabolite linked to a GPR was assigned a chemical formula, charge, and molecular weight using KEGG and PubChem databases. In entries where the molecular formula was generic (i.e., wherever an R-group was present), molecular weights and charges were recorded only for the parent moiety. For polysaccharides, the number of monomers was assigned based on glycogen containing 10-mers, solely for convenience. Each lipid head group and their associated acyl chain lengths and saturation states were assigned based on data available for the HLI strain MED4 [ ]. Standard Gibbs formation energies ( f G 0 o ) for each metabolite were predicted using the eQuilibrator tool [64] at pH = 7.4, ionic strength of 0.7 M (appropriate for marine microbes), and an intracellular concentration of 1 mM, however intracellular concentrations were unknown and may introduce error to in vivo f G 0 o predictions. Entries with e↵ectively zero-mass (e.g., photons, excitons, heat) were assigned a charge and their own 'element' e.
In addition to internal metabolic reactions with KO associations, several other sets of reactions were added to the model including: (1.) transport, exchange, and spontaneous reactions, (2.) light-harvesting and photosynthetic reactions, (3.) biomass formation reactions, and (4.) 'gap-filled' reactions with no known GPR, described in more detail below.
1. KOs were manually assigned to transport reactions using the Trans-portDB database http://www.membranetransport.org/, but the direction of transport was not always known. For example, in the case of ATP binding cassette transporters, whether the membrane receptor domain is inward facing (exporter) or outward facing (importer) was not assumed and was instead parsed into two irreversible reactions. In the case of symporters and antiporters, transporter flux bounds were assumed to be reversible. Exchange reactions were added for all external metabolites, including 'biomass', and were allowed unbounded flux with unconstrained 'boundary' metabolites.
2. Photons are absorbed by each pigment separately, according to the in situ downwelling irradiance spectrum and the pigment-specific absorption spectrum (Section 1.6). Non-photochemical quenching was also explicitly represented for the relaxation of each pigment in the form of heat, which was included as a 'metabolite.' Reactions involving the photoprotective pigment zeaxanthin were included separately for its co-localized quenching of singlet, doublet, and triplet excited divinyl chlorophylls a and b. The role of the secondary light-harvesting pigment ↵-Carotene in Prochlorococcus photosystems remains unclear; however, we currently consider it to be a light-harvesting pigment and allow exciton transfer by the relaxation of a singly-excited electronic intermediate state of ↵-Carotene to the ground state, transferring an exciton to either photosystem I or II: In the cases of divinylchlorophyll's a and b, the series of singly-, doubly-, and triply-excited electronic states are explicitly represented both for their excitation and relaxation intermediate reactions: Although the mechanism of the photoprotective pigment zeaxanthin is not known in Prochlorococcus, there is evidence [65] in Synechococcus that it is bound to the light harvesting complex as accessory antennae, suggesting that it may directly strip photons from its neighboring chlorophylls through resonance energy transfer, and further provides a mechanism to quench singlet oxygen in the thylakoid lumen (th). The relaxation of singly-excited zeaxanthin dissipates to heat (nonphotochemical quenching), thus we chose to represent its function as: (f) Cell Wall -containing crosslinked peptidoglycans, UDP-diphosphate, and lipid A disaccharide.
(g) Ions and Metals -containing the major ions of potassium, calcium, sodium, chloride, and magnesium, as well as the trace elements cadmium, cobalt, copper, iron (II), a guanylyl cofactor bound molybdenum, strontium and zinc.
(i) Nucleosides and nucleobases -containing a set of 19 nucleosides, deoxynucleosides and their bases.
(k) "BioPool" -containing a subset of metabolites which are known to be present at su ciently high concentrations in cells but do not belong to any of the other macromolecular pools. These include chorismate, glutathione, heme, S-adenosyl-L-methionine, and spermidine. Metabolites identified in our metabolomics data are appended to this subset.
(l) Biomass -this is the lumped reaction on a weight basis (g of macromolecule g biomass -1 and includes each of the previously described macromolecular pools in their respective proportions and a growth associated maintenance ATP requirement which is cell size dependent (section 1.5).
4. Reactions with no gene or protein association, but which were required for biomass synthesis on minimal medium were added. This 'gap-fill' subset has no candidate GPR association and therefore requires considerable manual inspection and literature review to rationalize the choice of including. In most cases, the choice was simple, where a single missing reaction in a linear pathway could be added; however, in some rare cases the decision was more nuanced. We therefore have low confidence in their presence and have flagged these reactions to encourage future functional genomics insights and improvement to the quality of the model. The number of gap filled reactions was 36 (2.4% of all reactions).
The process from draft to final reconstruction followed procedures detailed previously [54,66] and fall outside the scope of this article. Ensuring stoichiometric consistency, identifying unbalanced reactions and metabolites, detecting dead-ends, leaks and orphaned reactions, as well as a multitude of other tasks were conducted using the Matlab toolboxes COBRA (version 2.21.1; [5 ]) and RAVEN (version 2.4.0; [6 ]), both of which are extensively documented.

Semi-automatic strain GEM reconstruction
A novel approach to automatically extract functional, stoichiometrically consistent strain GEMs from the PanGEM was developed. By definition, any strain is a subset of the PanGEM, however simply excluding the remaining reactions often results in lethal deletions. In lieu of manual curation for each strain, we employ a strategy to identify an essential subset of reactions which are required to synthesize all biomass components de novo and satisfy the growth objective.
While a minimal essential reaction subset is possible, flexibility and redundancy in metabolic networks introduces some ambiguity in how we might define 'essential.' The removal of a key reaction may divert resources through a less e cient pathway, or one which cannot stoichiometrically carry the flux required to achieve maximal growth rates. Instead of searching for a minimal essential reaction subset, we identify an essential reaction subset which still allows for growth at some proportion ✏ 2 [0, 1] of the maximal growth rate f . To do this, we implement a compressed sensing linear program which iteratively solves a constrained l 1 -norm minimization of fluxes over a range of function values ✏f . The sparsest set of fluxes which satisfies the threshold c T v ✏f is taken to be the essential reaction set E. The choice of a suitable ✏ value depends on the application, however in our case the most conservative approach was taken (✏ = 1 tol, where tol is the interior point solver step tolerance). One can interpret this choice as taking the minimal set of reactions required to satisfy maximal biomass synthesis rates. To populate the complete metabolic network for each particular strain S, we append GPR associations unique to a strain's genome V, such that S = E [ V.
A possible shortcoming of our approach is that the appended set V is assumed to be complete but may contain gaps. This decision reflects our limited knowledge of strain-specific growth capabilities. Coupling our approach to an automated gap filling method (e.g., ReFill; [68]) may provide additional metabolic capabilities, albeit at the risk of spurious additions. We note that, even without this step, predicted metabolic capabilities (fig S16) for the subset of strains that we do have experimental evidence for were comprehensively captured by our algorithm. The table below summarizes a pairwise comparison of experimental and model predicted growth capabilities of nine strains on six sole nitrogen and phosphorus sources: The PanGEM reconstruction steps and the CS algorithm are provided as a Matlab toolbox which can be accessed in a GitHub repository https:// www.github.com/jrcasey/PanGEM. Strain model reconstruction was applied only to isolates with complete genomes. Metagenome assembled genomes and single cell assembled genomes were typically incomplete, resulting in large gaps in metabolic networks which would require other approaches.

Biochemical and physiological customization of strain GEMs
Biochemical compositions, physiological traits, and constraints on their ranges were assigned to each strain. Initial compositions represent the nutrientreplete batch acclimated state of isolates, as a starting point for acclimation. Data assigned to each strain were derived both from literature sources and from measurements taken for this study.
GAM and NGAM Growth associated maintenance (GAM; fmol ATP cell -1 h -1 ) and non-growth associated maintenance (NGAM; fmol ATP cell -1 ) ATP requirements were calculated based on allometric relationships [69]: where V is the cell volume (µm 3 ). NGAM costs were included as fixed upper and lower bounds on an ATP hydrolysis drain reaction. To di↵erentiate other GAM ATP costs from structural biosynthesis ATP costs of macromolecular pools, the polymerization costs of amino acids, nucleic acids, lipids, and lipopolysaccharides were separately assigned to each respective macromolecule synthesis equation from the total predicted GAM value. Residual GAM costs were included in the bulk biomass equation.  [72]. Data are reported as peak area for each feature, and features were assigned chemical names when identification was possible upon comparison to an MS library. For conversion to dry cell weight compositions, peak areas were normalized to protein concentrations measured by the BCA assay.
Additional biomass composition data were gathered from previous studies, although taxonomic coverage was less extensive for certain classes. Lipid head groups and acyl chain lengths and saturation states were only available for strain MED4 [63,73]; amino acid profiles for strain MED4 [73], mineral content and major ions for strains NATL1A, GP2, SB, EQPAC1, SS120, and PCC 9511 [74]; iron and cobalt for strain MIT 9215 [75]. The concentrations of other vitamins, free nucleotides and nitrogenous bases, and a subset of metabolites and osmolytes not detected or not annotated in our metabolomics results have not be determined for any strain of Prochlorococcus; in these cases, compositions from the model cyanobacterium Synechocystis PCC6803 were used for all strains [45].
Pigments Initial values and boundary constraints on pigment compositions were derived from previously published datasets [4,17,76] [77] to discriminate MED4 cells from potential contaminating heterotrophic bacteria. Nutrient-replete growth media was based on the AMP1 recipe [78] modified to have lower concentrations of ammonium (110 µM), phosphate (10 µM), and trace metals (10% of AMP1 concentrations). Nutrient-replete cultures were maintained in optically-thin semi-continuous growth and sampled for macromolecular and elemental composition when fully acclimated to these conditions (>10 generations with <15% variation in growth rate). These same parameters were assessed in separate nitrogen and phosphorus starvation experiments in which nutrient-replete cultures were diluted with N-free or P-free media (modified AMP1 media as described above without ammonium or phosphate added) and sampled four times over 7-8 days from late exponential phase to late stationary phase (4 days after cessation of growth). Samples for total particulate carbohydrate, lipid, protein, DNA, RNA, and divinyl-chlorophyll-a as well as elemental composition (particulate C, N, and P) were collected and analyzed as in [53] with the following exceptions: a) smaller pore size polycarbonate filters (0.2µm) were used to collect protein, RNA, and DNA samples; b) samples for all other macromolecules were collected on two pre-combusted glass fibre filters filters (Whatman GF/F) stacked on top of each other during filtration; and c) extracted carbohydrates were quantified using the phenol-sulfuric acid method [79]. At each sampling point, cell abundances and particulate dry weight were also determined, allowing each macromolecular pool to be expressed as a cell quota or proportion of cell mass. Cell abundances were determined as described above while samples for particulate dry weight were collected on GF/F filters as described above for macromolecular samples and weighed after being rinsed with 0.5M ammonium formate and dried at 60 o C to remove salts. The upper and lower bounds of compositions of each macromolecular pool in both nutrient starvation experiments were applied to all strains in the acclimation pipeline (described below).

Acclimation pipeline for strain GEMs
Two sequential bilevel optimizations are implemented to allow each in silico strain to acclimate to each environment, subject to physiological constraints ( fig. S17). The first, a linear program called OptTrans, searches for the optimal distribution of membrane transporters and cell size, while the second, a non-linear program called PhysOpt, searches for the optimal macromolecular and pigment composition. Both optimizations repeatedly solve the FBA problem with varying compositions, using an interior point solver (Mosek 9, Mosek, API).
OptTrans It was recently shown that substrate-limited growth rates can be predicted for an arbitrary substrate using a combination of molecular modeling, quantitative proteomics, and FBA [32]. The model accounts for the dependence of both the maximum uptake rate v max and the half-saturation concentration k S on transporter abundance, cell size, and temperature. To distinguish from other reactions in the network S, we refer to a subset Q (given as subscripts q for each column) to refer to all transporter reactions, their substrates, and associated parameters.
Briefly, under nutrient replete growth conditions, the substrate uptake rate (v q = v max,q ; predicted by FBA) and transporter abundance (a q ; measured by quantitative proteomics; [80]) is used to determine the in vivo catalytic rate of the transporter (k cat,q ). Two additional parameters are required to compute the uptake rate dependence on transporter abundance -the e↵ective capture area of the transporter A and the hydrated molecular di↵usivity of the substrate molecule D. A q can be quantified by measuring the dimensions (PyMol; The Py-MOL Molecular Graphics System, Version 2.3 Schrodinger, LLC) of the transmembrane domain of each transporter using 3-dimensional renderings of the predicted protein structure (RaptorX, [81]). D q is quantified as an empirical function of hydrated molecular volumes ( [82]; using the LeBas incremental method; [83]) and the dynamic viscosity, determined as a function of in situ temperature, pressure, and salinity. Given a known temperature and bulk concentration of substrate S 1 q , these three parameters are su cient to calculate the e↵ective half saturation concentrations at two limits -the di↵usive limit k D s,q as S 1 q ! 0, which is independent of the number of transporters, and the so-called porter limit k P s,q as S 1 q ! 1), which is dependent on the number of transporters. The uptake rate v q is then This transport model was applied in an optimization framework, termed OptTrans, to account for multiple substrates and to allow for flexibility in cell size. Cell size and transporter density (transporters m -2 ) for each substrate (e.g., nitrate, nitrite, ammonium, phosphate) are optimized, subject to membrane surface area constraints, transporter catchment areas, and in vivo catalytic rates. Cells are assumed to be spherical and we do not currently optimize the shape coe cient . The Sherwood number is also not explicitly considered, and we assume transport due to advective shear to be equal to that of molecular di↵usion within the boundary layer of the cell. With these two simplifications, the bi-level optimization for OptTrans includes an outer program which varies the number of transporters a and the cell size r and assigns flux bounds to corresponding transport reactions in the inner program. Let a 2 R Q be a vector with elements a q that comprise the number of transporters for substrate q, and let r be the radius of the cell. Then OptTrans maximizes the objective µ(a, r) as follows: For the AMT simulations, di↵erent strains are capable of transporting di↵erent nutrients (ammonia, nitrite, nitrate, and phosphate). For strains lacking genomic or experimental evidence of a particular transporter, the upper bound on its abundance is set to zero.
The optimal transporter abundance a ⇤ and a critical substrate concentration S ⇤ were calculated [32] as a reference for comparison against simulated optima. S ⇤ represents the delineation between two rate limiting regimes -di↵usion limitation (S  S ⇤ ), wherein a ⇤ is the minimal number of transporters required to maintain uptake rates at the di↵usive flux, and growth limitation (S > S ⇤ ), wherein a ⇤ is the minimal number of transporters required to maintain maximal growth, set by some other limiting factor (e.g., growth limitation by another resource, ribosome limitation of translation rates, RuBP carboxylase capacity, molecular crowding, etc.). When used in conjunction with other metrics of nutrient stress (e.g., metabolite sensitivity analysis), comparisons of ambient nutrient concentrations with S ⇤ and comparisons of optimal a with a ⇤ can parse otherwise indistinguishable nutrient limitation states.
PhysOpt Uptake rates from OptTrans are set as upper bounds on transporter fluxes for the second bilevel optimization step, PhysOpt, which searches for the optimal macromolecular composition and pigment composition for each strain in each grid cell, wherein the spectral light field and the nutrient supply are known. Additionally, the optimal cell radius r from OptTrans is implemented in PhysOpt as cell ash-free dry weights (M ; g DW cell -1 ), converted from the radius as described previously.
Similar to OptTrans, the outer program of PhysOpt solves for elements of the constraints on the vector of fluxes v corresponding to the absorption reactions in the inner program. However, in contrast to OptTrans, PhysOpt also simultaneously adjusts coe cients of the S matrix corresponding to the biomass reaction (denoted by the column superscript B). These compositions are subject to experimentally determined upper and lower bounds on each macromolecular pool i and subject to the simplex constraint ( P m i=1 S B i = 1). Since each pigment and macromolecular pool incurs a di↵erent elemental and energetic cost, tradeo↵s in resource allocation are implicitly considered as part of the holistic fitness maximization. For example, de novo protein synthesis is energetically costly relative to total biomass, so it might be expected that compositions would be depleted at low light levels, however since more than 90% of total cellular nitrogen is bound up in protein, the fitness cost of high protein content in the nitrogen-deplete surface layer is considerably higher. Note that, in principle, the PhysOpt formulation need not be restricted to such a coarse resolution (e.g., protein) and could accommodate metabolomic datasets (i.e., individual amino acids).
Several reactions corresponding to the absorption of light by each pigment, as well as reactions representing the transitions through excitation and relaxation of electronic states of each pigment were added. To distinguish between these and other reactions, and for clarity, we refer to this subset using the subscript p. Thus, in PhysOpt, the rate of each pigment absorption reaction v p corresponds to the wavelengthintegrated product of the pigment's specific absorption spectrum a ⇤ p [84], the downwelling irradiance spectrum E d , and the amount of pigment g p . Irradiance and absorption data typically spanned the range ( min : max ) from 350 nm to 800 nm. In addition to the experimentally determined upper and lower bounds on g p (g ub p and g lb p , respectively), we also include the upper and lower bounds on the divinylchlorophyll b/a ratio g dvChlb g dvChla (g ub b/a and g lb b/a , respectively) as a further stoichiometric constraint on pigment compositions. The packaging e↵ect is not currently included, as it should not appreciably influence absorption for cells as small as Prochlorococcus. The full PhysOpt formulation is:

Temperature dependence of metabolic rates
Growth rates and metabolic fluxes were adjusted for temperature e↵ects by comparing in situ growth temperatures to a model of temperature dependent growth, parameterized for each strain. Our approach provides a strainspecific parameterization while avoiding the e↵ects of adaptive laboratory evolution, which can be substantial [55]. First, the Arrhenius activation energy E a was calculated based on growth rates determined in batch cultures maintained over a range of temperatures T ( K; fig S15) [9,56], where A is a dimensionless constant and R is the ideal gas constant (8.314 J mol -1 K -1 ). Calculated activation energies were statistically identical for the 12 strains (p=0.415) with a mean value of 52.7±5.1 KJ mol -1 (R 2 = 0.89; p = 5.1 ⇥ 10 -20 ). The Arrhenius function captures the temperature dependence of growth rate within a range where enzyme stability is preserved. At temperatures exceeding that range, growth rates of the 12 analyzed strains rapidly approached zero, requiring two additional parameters (optimal growth temperature T ⇤ and maximum growth temperature T + ) to capture these dynamics. We could detect no consistent pattern in the width of the interval between T ⇤ and T + , so a single parameterization (2±0.8 o C above T ⇤ ) was assumed for all strains. A modification of a previously proposed model [57], but which preserves the Arrhenius parameterization, was used to capture the dynamics of a dimensionless growth rate constant k in both regimes ( fig. S14c), Optimal growth temperatures (OGT; T ⇤ ) were predicted for each strain from proteome sequences using the machine learning algorithm TOME-cool. The OGT dataset from TOME 1.0 [26] was expanded to include additional psychrophilic and psychrotolerant taxa [58,59], resulting in a training dataset of 6020 microorganisms (https://github.com/EngqvistLab/tome_cool). The distribution of OGT values is shown in fig. S18. The same machine learning approach as described in [26] which uses 5-fold cross-validation with 5-fold inner cross-validation for hyperparameter optimization was used to select the best regression model. Among linear, elastic net, Bayesian Ridge, decision tree, and random forest regression models, the support vector machine regression achieves the best performance (coe cient of determination score, R 2 = 0.88 obtained from the cross-validation, fig. S19). The residual plots show that the model tends to overestimate the OGT values for psychrophiles ( fig. S19b). As the OGT values shows a skewed distribution with most samples with an OGT between 20 o C and 40 o C, we next binned OGT values with an interval size of 5 o C. Equal weights were assigned to each bin. Organism-specific weights were calculated by normalizing each bin weight by the number of organism in each bin. The weighted regression was used, and di↵erent regression models were tested with the same cross-validation approach. Among those, the Bayesian Ridge regression showed the best performance (R 2 = 0.65, fig. S19c). The residual plot shows that the model improved performance for psychrophiles, however at the cost of precision for other organisms (fig. S19d). The fitting results with the best models (SVR and Bayesian Ridge) from those two di↵erent settings was compared (fig. S19e-g).

Weighted population rates and stocks
Acclimated single cell biomass-specific fluxes and compositions were scaled up to whole population, volumetric rates and concentrations (stocks) using qPCR based ecotype abundance data [9]. Since the primers captured a variable proportion (75 ± 21%) of the total Prochlorococcus FCM counts, ecotype relative abundances were scaled to flow cytometric cell counts, thereby treating the unknown remaining population as an average of those quantified. Rather than assuming an even composition of strains within each ecotype, the most fit strain in each grid cell was chosen as a representative. Accordingly, population variables X pop (e.g., metabolic fluxes, macromolecular compositions, elemental stoichiometries) were calculated as an abundance weighted sum of a variable associated with the fittest strains from each ecotype x e , where B F CM is the flow cytometry derived total Prochlorococcus cell count, b e is the relative abundance of the e th ecotype in the set of all ecotypes E, and M e is the ash-free dry weight of cellular biomass for the most fit strain within each ecotype, approximated using its calculated optimal cell size and a fixed dry weight content of 21% and ash content of the dry weight of 16% [ ]. 73

Datasets
Data S1 -Strains (data S1.csv) Table of Prochlorococcus strains included in this manuscript and a variety of metadata associated with each strain. Columns include link identifiers to NCBI and IMG databases, genome metrics and statistics, sequencing information, taxonomic classifications, cell size and maximum photosynthetic rates.
Data S2 -AMT-13 cruise data (data S2.csv)  Data S6 -Literature sources (data s7.xlsx)        rates of primary production. 14 C-PP data were acquired during roughly 12 hour incubations and most closely approximate net primary production (NPP). By accounting for respiratory oxygen consumption during 24 hour incubations in paired dark bottles, Winkler-GPP most closely approximates gross primary production. Both incubations integrate the entire microbial community. Pro-NPP is defined by the rate of inorganic carbon exchange (uptake minus loss) and is comparable with        Left panel -growth rates of 12 strains of Prochlorococcus, each grown over a range of cultivation temperatures. Supra-optimal growth temperatures are excluded from this plot. Least-squares regression line is shown, corresponding to an activation energy of 52.7±5.1 KJ mol -1 . Middle panel -violin plots of TOME-cool predicted optimal growth temperatures for each of the five ecotypes analyzed. Right panel -comparison of the observed and predicted temperature dependence of growth rates for two strains (MED4 and MIT 9312). Predictions are based on TOME-cool OGT parameterizations of our modified Arrhenius function for each strain.