Active learning of the thermodynamics-dynamics trade-off in protein condensates

Phase-separated biomolecular condensates exhibit a wide range of dynamic properties, which depend on the sequences of the constituent proteins and RNAs. However, it is unclear to what extent condensate dynamics can be tuned without also changing the thermodynamic properties that govern phase separation. Using coarse-grained simulations of intrinsically disordered proteins, we show that the dynamics and thermodynamics of homopolymer condensates are strongly correlated, with increased condensate stability being coincident with low mobilities and high viscosities. We then apply an “active learning” strategy to identify heteropolymer sequences that break this correlation. This data-driven approach and accompanying analysis reveal how heterogeneous amino acid compositions and nonuniform sequence patterning map to a range of independently tunable dynamic and thermodynamic properties of biomolecular condensates. Our results highlight key molecular determinants governing the physical properties of biomolecular condensates and establish design rules for the development of stimuli-responsive biomaterials.


INTRODUCTION
Biomolecular condensates-also known as "membraneless organelles"-are self-organized structures within the cytoplasm and nucleoplasm of living cells [1][2][3].Condensates play diverse roles in a wide variety of biological processes [4][5][6], in large part because of their ability to concentrate proteins, RNAs, and small molecules in chemically specific environments [5][6][7][8].Nonetheless, the ability of condensates to tune the rates of biochemical reactions is not only determined by their biochemical compositions.Dynamical properties, such as viscosities and molecular mobilities, also affect how condensates exchange biomolecules with the rest of the cell [9][10][11] and interact with other intracellular structures, such as chromatin [12] and cytoskeletal components [13].Dynamical properties are also directly relevant to condensate assembly and disassembly kinetics [14] and irreversible aging processes [15].Recent experiments have demonstrated that dynamical properties vary widely across different condensates, both in vitro and in vivo [11,16].
Many condensates are believed to form spontaneously as a result of phase separation taking place close to a local thermodynamic equilibrium [1,17].The thermodynamic stability of a condensate, which can be quantified by the critical temperature of a biopolymer solution [18] or by the saturation concentration required for phase separation [9,19], is an equilibrium property related to the multiplicity and strengths of the "multivalent" interactions among the biomolecules comprising a condensate [9,20,21].Intuitively, one might expect that the thermodynamic stability of a condensate is correlated with its dynamical properties, which also emerge from the collective effects of biomolecular interactions in the condensed phase.This anticipated relationship represents a "tradeoff" between stability and dynamics, since stronger thermodynamic driving forces are expected to correlate with lower molecular mobilities [22].Experimental evidence suggests that such a correlation exists for certain ribonucleic condensates, such as Arg/Gly-rich repeat protein/RNA condensates [11] and LAF-1/RNA condensates [16].However, understanding the general relationship between thermodynamic stability and the internal dynamics of biomolecular condensates requires additional and systematic exploration [23,24].
Determining whether a stability-dynamics tradeoff extends to phase-separating biopolymers is challenging due to the enormous diversity of biopolymers that can form condensates.Here, we focus on condensates composed of intrinsically disordered proteins (IDPs), a class of proteins that tend not to adopt stable secondary and tertiary structures and are known to be essential components of many naturally occurring condensates [25].To study the properties of IDP sequences, we employ a coarse-grained (CG) model that represents IDPs at residue-level resolution [26].Models of this type have been shown to reproduce experimental measurements of both single-molecule and condensed-phase properties [18,[27][28][29], suggesting that CG IDP models capture essential physics of disordered polypeptides and retain sufficient chemical specificity to predict differences among condensates composed of various IDP sequences.Thus, CG IDP models seem well-suited to systematically explore the relationship between IDP stability and condensate dynamics.
In this article, we combine molecular dynamics simulations and machine learning to navigate the sequence space [30,31] of IDPs and elucidate the relationship between the phase behavior and internal dynamics of singlecomponent IDP condensates.We first establish a strong correlation between proxies for condensate stability and dynamics in homomeric polypeptides, which is understood using simple physical arguments.We then deploy a computationally efficient machine-learning strategy based on "active learning" [32][33][34] to identify IDP sequences that break this correlation, exhibiting faster internal dynamics than homomeric polypeptides that assemble into equally stable condensates.In this way, we identify "Pareto-optimal" IDP sequences, meaning that sequence perturbations cannot enhance the dynamics further without reducing the stability of the condensate.Finally, we examine sequence features of Pareto-optimal sequences and perform a counterfactual analysis [35,36] to identify the sequence determinants of the limiting thermodynamics-dynamics tradeoff.Taken together, our results demonstrate how sequence design can be used to tune thermodynamic and dynamical properties independently in the context of biomolecular condensates.

Thermodynamic and dynamic properties of homomeric polypeptides are strongly correlated
To establish a baseline expectation for the thermodynamics-dynamics tradeoff in IDP condensates, we begin by studying a collection of homopolymeric sequences at 300 K. Specifically, we consider homopolymers consisting of each amino-acid type with chain lengths of N = 20, 30, 40, or 50 monomers.Molecular dynamics simulations are performed using the CG model developed by Mittal and coworkers [26,28], which treats IDPs at the single-residue level in implicit aqueous solvent.Amino-acid residues interact via a combination of bonded and nonbonded pair potentials, the latter of which account for both electrostatic interactions and short-ranged hydrophobic forces.We use a Debye screening length of 1 nm, corresponding to an ionic strength of 0.1 M. See "Model of intrinsically disordered proteins" in Materials and Methods for further details.
Throughout this work, we utilize the second-virial coefficient, B 2 , as a proxy for the thermodynamic stability of the condensed phase for IDPs that undergo phase separation.B 2 quantifies the net attractive or net repulsive intermolecular interactions of a pair of molecules in dilute solution [37].It is well established that B 2 can also be used to predict the critical point of simple atomic and molecular fluids [38], colloidal suspensions [39,40], and homopolymer solutions [41].A strong correlation between B 2 and the critical temperature for IDP phase separation has also been reported, albeit for a limited set of IDP sequences [18].In all these systems, a negative B 2 value-implying net attractive interactions-is necessary for phase separation to occur.Although this condition turns out to be insufficient to guarantee phase separation in heteropolymer solutions [42,43], we find that B 2 nonetheless correlates strongly with the difference be-tween the coexisting phase densities for IDP sequences that do phase separate, as we demonstrate using molecular dynamics simulations below.In practice, we compute B 2 by calculating the potential of mean force, u(r), between the centers of mass of two polymer chains at 300 K using adaptive biasing force simulations [44] (Fig. 1a; see also "Physical property calculations" in Materials and Methods).B 2 is then obtained from the equation where r represents the distance between the center of mass of each chain and β ≡ 1/k B T .For convenience, we report B 2 relative to a reference volume, V 0 = 5529 Å3 , equal to the pervaded volume of an ideal polymer chain, V 0 = (4π/3)b 3 0 (N/6) 3/2 [41], with the CG equilibrium bond length, b 0 = 3.8 Å, and a chain length of N = 50.
Next, we develop a method to confirm whether an IDP sequence phase separates and, if so, to determine dynamical properties in the condensed phase.We perform an equation-of-state (EOS) analysis to approximate the condensed-phase density, ρ c , for polymers with negative second-virial coefficients (Fig. 1b; see also "Physical property calculations" in Materials and Methods).The EOS is a relationship between the osmotic pressure and volume of the polymer solution at constant temperature.In macroscopic systems, the pressure is a monotonically increasing function of the density at constant temperature.However, in finite-size simulations, the EOS exhibits a non-monotonic "van der Waals loop" [38,45] at temperatures for which phase separation occurs; fundamentally, this non-monotonicity arises due to the inability of a small system to form a stable interface between coexisting bulk phases.We therefore use short simulations of 100 polymer chains at 300 K to compute the pressure as a function of density, and we identify nonmonotonic behavior as evidence of a phase-separated region.Making the approximation that the coexistence pressure is near zero, corresponding to a dilute lowdensity phase, we approximate the condensed-phase density, ρ c , as the highest-density root of a non-monotonic EOS.By contrast, we consider a strictly non-negative EOS as evidence that phase separation does not occur at 300 K. Following this approach, we find that 29 out of the 80 homomeric polypeptides phase separate at 300 K. We later show that this method, which is amenable to high-throughput simulations, is accurate to within a few percent of ρ c values determined via direct-coexistence slab simulations [28].
Finally, we use the self-diffusion coefficient, D, of a tagged chain within the bulk condensed phase as a simple measure of the internal condensate dynamics.We calculate D from the mean-squared displacement of a tagged chain in a canonical-ensemble simulation conducted at the condensed-phase density, ρ c (see "Physical property calculations" in Materials and Methods).This quantity is negatively correlated with the viscosity of the condensed phase in phase-separated homopolymer solutions and ρ Y 20 c , respectively.(c) The dimensionless self-diffusion coefficients, D/D0, of homopolymers in the condensed phase are anticorrelated with their scaled dimensionless second-virial coefficients, −B2N 1/2 /V0.Points are shown on a log-log plot for the 29 homopolymers (out of 80 simulated) that phase separate at 300 K. Self-diffusion coefficients are obtained from canonicalensemble simulations conducted at the condensed-phase density, ρc, determined from the equation-of-state analysis.Statistical errors reflect the standard error of the mean and are comparable to the symbol size.
(Fig. S1 in the Supplementary Materials).However, the greater statistical uncertainties in the viscosity calculations make D the more practical choice for quantifying internal condensate dynamics in high-throughput simulations.We report self-diffusion coefficients relative to a reference value, D 0 = 0.42 × 10 −9 m 2 /s, which corresponds to an ideal Rouse chain with chain length N = 50, D 0 = τ k B T /N M [41], where τ = 1 ps is the damping time in our Langevin simulations and M = 118 g/mol is the average molecular weight of the 20 amino acid types.
Plotting D/D 0 versus B 2 /V 0 , we observe a strong negative correlation between these quantities across all homopolymeric sequences that are determined to phase separate (Fig. 1c).Surprisingly, this correlation is stronger than the direct correlation between the condensed-phase density, ρ c , and either of these quantities alone (Fig. S2 in the Supplementary Materials).This indicates that even though the condensed-phase density at 300 K cannot be predicted solely on the basis of the second-virial coefficient, this quantity nonetheless captures the key physics necessary to predict the internal dynamics of a homopolymer condensate at its equilibrium density.
To rationalize the empirical relationship between D/D 0 and B 2 /V 0 , we consider a Rouse model [41] of an unentangled polymer melt.This model can describe the condensed phases in these simulations, which have densities on the order of 1 g/ml.In this model, the selfdiffusion coefficient is inversely proportional to the total frictional force experienced by the chain, which typically follows an Arrhenius law for a thermally activated process [24,41].It is thus reasonable to relate the logarithm of the total friction to the reversible work required to separate a pair of chains, which is proportional to log(−B 2 ) in the limit of strong attractive interactions.Accounting for the number of interacting chains within the pervaded volume of a tagged polymer [41], we therefore propose that the condensed-phase self-diffusion coefficient for phase-separating homopolymers should scale as D ∼ −N 1/2 B 2 .Empirically, we find that including the prefactor N 1/2 in the scaling relationship indeed improves the correlation across the 29 phase-separating homopolymers that we simulated (Fig. 1c and Fig. S3 of the Supplementary Materials).We thus conclude that this simple model captures the essential relationship between the second-virial coefficients and condensed-phase self-diffusion coefficients of homomeric polypeptide sequences.

Short polypeptides from known intrinsically disordered regions do not form condensed phases
Having established a correlation between D/D 0 and B 2 /V 0 for homomeric polypeptides, we next examine the behavior of heteromeric polypeptides based on extant IDP sequences.To do so, we perform CG simulations of 1,266 short (20 ≤ N ≤ 50) polypeptide sequences from DisProt [46], a manually curated database of disordered regions of proteins.This range of sequence lengths is used throughout this study to limit the total computational expense of these simulations.Although these polypeptides feature diverse sequence characteristics (Fig. S4 in the Supplementary Materials), none of these polypeptide sequences form condensed phases according to the criteria set forth in the prior section.This observation could be attributed to potential bias within the DisProt database towards soluble sequences [47], the imposition of an upper limit on sequence lengths probed in this study, which excludes many longer extant IDP sequences that are known to play important roles in intracellular phase separation [46,48], and our focus on single-component systems (e.g., exclusion of RNAs that may be required to observe phase-separation [6]).
Of all analyzed DisProt sequences with 20 ≤ N ≤ 50, we find that the vast majority (c.a.80%) have positive B 2 (Fig. S5 in the Supplementary Materials), while the remaining 255 sequences exhibit monotonically increasing EOS curves, indicating that phase separation does not occur at 300 K when simulated using this CG IDP model.Nonetheless, we note that some of these B 2 values are more negative than some of the phase-separating homomeric polypeptides considered in Fig. 1.This observation highlights the importance of performing the EOS analysis to confirm whether a polypeptide undergoes phase separation at 300 K.
Active learning identifies limits of tunability for condensate thermodynamic-dynamic properties Since the lack of phase separation precludes analysis of condensed-phase dynamical properties using short Dis-Prot sequences, we set out to generate novel heteromeric sequences and explore the thermodynamics-dynamics tradeoff beyond homopolymers.To this end, we adopt an active learning scheme [32] that iteratively selects and simulates new polypeptides to probe this relationship methodically and efficiently (Fig. 2a).The selection of new polypeptides is guided by Bayesian optimization, which has been usefully deployed in materials design, including the design of heteropolymer sequences, by informing the next "best" sequence(s) to characterize [30,[49][50][51][52][53].Within the context of the thermodynamics-dynamics tradeoff, the "best" sequences are those considered as Pareto optimal [54,55], meaning that no other polypeptide sequence can be found that simultaneously exhibits an increased D and a decreased B 2 .
Our active learning scheme integrates Bayesian optimization with supervised machine-learning models and a genetic algorithm for designing polypeptides.Polypeptides are chosen for characterization based on a policy of expected hypervolume improvement (EHVI) [56], which has previously been demonstrated to converge towards a true set of Pareto-optimal sequences [57].These Paretooptimal sequences are referred to as the Pareto front.Optimizing EHVI over sequence space is facilitated by a genetic algorithm that leverages machine-learning (ML) models, trained using the results of prior CG simulations, to predict physical properties (i.e., phase behavior, B 2 , and D) from sequence features of prospective polypeptides.In particular, the ML models employ a 30dimensional feature vector that incorporates the aminoacid composition [58,59] and ten additional sequencelevel descriptors, including the sequence length, N ; the average hydrophobicity per residue, λ; the sequence hydropathy decoration parameter, SHD [60]; the fraction of positively and negatively charged residues, q+ and q− ; the net charge per residue, |q|; and the sequence charge decoration, SCD; the mean-field second virial coefficient [60], B MF 2 ; the Shannon entropy, S; and the average molecular weight, M .All the ten sequence-level feature values are normalized and standardized as described in "Design of novel intrinsically disordered proteins" in Materials and Methods.Each iteration of active learning involves training ML models with the current simulation data, generating 96 new polypeptide sequences via the genetic algorithm, and subsequently simulating these sequences according to the protocols introduced previously.Additional technical details are reported in Materials and Methods, and complete descriptions of ML model generation and validation are provided in the Supplementary Materials (see Fig. S6).
We find that this active-learning approach rapidly identifies a representative set of Pareto-optimal polypeptides (Fig. 2b and Fig. S7).First, using a ML model for B 2 trained using DisProt sequence data, we perform an initial round of simulations involving IDPs chosen to minimize B 2 only (Iteration 0).These simulations establish relevant data for training a ML model for D, enabling subsequent multiobjective optimization via EHVI (see "Design of novel intrinsically disordered proteins" in Materials and Methods).We quantitatively assess convergence of the active-learning approach by monitoring both the hypervolume (HV) under the current approximation of the Pareto front and the fraction of newly added Pareto-optimal sequences relative to the current number of Pareto-optimal points (Fig. 2c).Active learning initially leads to a rapid expansion of the Pareto front (Iterations 1-3), whereas subsequent iterations (Iterations 4-7) result in substantially diminished expansion and mostly inconsequential additions to the Pareto front, especially along the weakly-interaction portion (−B 2 /V 0 ≲ 50) of the front (dashed lines in Fig. 2c).In these latter rounds, the expansion of the front occurs primarily via the addition of sequences with extremely negative B 2 and low D. We therefore halt optimization based on EHVI at this point and perform one last iteration of sequence design based on pure exploitation of the ML models, which confirms convergence of the Pareto front (Fig. 2c).The final data set includes 2,114 polypeptide sequences, including 35 Pareto-optimal sequences (denoted P1-P35, see Table S1) that provide a good representation of the true Pareto front of heteromeric polypeptides.
By contrast with homomeric polypeptides, active learning identifies "designed" heteropolymer sequences spanning a range of D values at fixed B 2 and vice versa (Fig. 2b).Although heteromeric polypeptides follow the same trend of decreasing D with decreasing B 2 as found for as homomeric polypeptides, the data cannot be neatly collapsed by any obvious scaling.For strongly attracting polypeptides (−B 2 /V 0 ≳ 50), all se- quences possess limited condensed-phase mobility.Most of these, including the homomeric sequences Y 40 , W 40 , and W 50 , are within the statistical error region spanned by Pareto-optimal heteromeric sequences (Fig. 2b).However, more weakly interacting polypeptides at the Pareto front (−B 2 /V 0 ≲ 50) exhibit D values that are roughly twice those of homopolymer sequences with the same B 2 .For example, at a fixed B 2 /V 0 ≈ 10 in Fig. 2b, the reduced diffusion coefficient D/D 0 increases from ∼ 0.4 to ∼ 0.9.(We note that, because the Bayesian optimization only targets expansion of the Pareto front, it is likely that the maximum dynamic range of D values at fixed B 2 can exceed a factor of two.)Thus, Fig. 2b illustrates the overall prospects and limitations of tuning the thermodynamic-dynamic tradeoff via sequence de-sign.Specifically, when IDPs interact relatively weakly, sequence design might be used to fine-tune the dynamical properties of single-component IDP condensates that possess similar thermodynamic stabilities (here assessed by −B 2 ) or, alternatively, to design IDP condensates that likely undergo phase separation below disparate critical temperatures while exhibiting similar dynamical properties.

Coexistence simulations validate Pareto-optimal sequences discovered via active learning
We validate the IDP sequences designed via active learning by performing large-scale molecular dynamics simulations of phase coexistence.We carry out direct coexistence simulations of five representative sequences from the thermodynamics-dynamics Pareto front (P1, P8, P14, P19, and P35) using 1000 chains in a slab geometry [18,27] at 300 K. Simulation durations of at least 2 µs are required to reach equilibrium, resulting in the formation of an interface between equilibrium condensed and dilute phases (Fig. 3a).The interfaces between the coexisting dilute phases also become sharper as we move along the Pareto front from P1 to P35 (Fig. 3b), which is consistent with a trend towards increasing thermodynamic stability.Importantly, we also find that the differences between the condensed and dilute-phase densities are anticorrelated with B 2 , which decreases from P1 to P35 (Fig. 3c).Both observations are consistent with the previously reported correlation between critical temperatures and second-virial coefficients in single-component IDP solutions [18] and support our use of B 2 as a proxy for the condensed-phase thermodynamic stability.
Importantly, these simulations recapitulate the condensed-phase dynamical properties of Pareto-optimal sequences predicted by our active-learning approach.From equilibrated simulations of phase coexistence, we determine the mean condensed-phase density and compute the self-diffusion coefficient of chains within the bulk of the condensed phase (i.e, away from the interface with the dilute phase).The physical properties determined in this way compare favorably with the values determined from the EOS analysis (Fig. 3c and Fig. S8 in the Supplementary Materials) and singlephase simulations of condensed-phase dynamical properties (Fig. 3d).In all cases, quantities computed via these different methods differ by less than 10% and typically agree within the statistical error.This agreement demonstrates that the high-throughput simulations utilized by our active-learning approach are sufficiently accurate compared to more computationally expensive direct-coexistence simulations.Furthermore, these results suggest that our active-learning protocol (Fig. 2c) indeed converges to a representative sequences on the thermodynamics-dynamics Pareto front for this CG IDP model.

Sequence features change drastically along the thermodynamics-dynamics Pareto front
To better understand what governs the thermodynamics-dynamics tradeoff for heteromeric polypeptides (Fig. 2b), we analyze the sequence characteristics of IDPs on the Pareto front (Fig. 4a, stars).An aggregate comparison among the Pareto-optimal polypeptides (P1-P35) is based on the pairwise cosine similarity of their feature vectors, where higher scores indicate greater similarity.For clarity, we decompose sequence similarity into the contributions from (i) the amino-acid composition (Fig. 4b, lower triangle) and (ii) the various sequence-derived descriptors (Fig. 4b, upper triangle).The average cosine similarities among pairs of Pareto-optimal sequences are (i) 0.60 and (ii) 0.56, respectively.For reference, the average cosine similarities among pairs of DisProt sequences (20 ≤ N ≤ 50) are (i) 0.57 and (ii) 0.22; these reference values are indicated by the vanishing points of the color bars in Fig. 4b.
Based on relative similarities, we propose that the Pareto-optimal sequences roughly divide into three groups: P1-P8, P9-P20, and P21-P35, outlining a high-B 2 /high-D regime, a transition regime, and a low-B 2 /low-D regime, respectively (top-left, middle, and bottom-right in Fig. 4a,b).While P1-P8 generally exhibit stronger composition and sequence-descriptor similarity than randomly selected pairs of disordered regions from DisProt, they significantly differ from the other groups.Collectively, P9 to P35 tend to show strong composition and sequence-descriptor similarities with one another.However, the transition group (P9 to P20) has lower intra-group similarity compared to the low-B 2 /low-D group (P21 to P35) (e.g., see also dashed lines in Fig. S9 of the Supplementary Materials), which motivates their distinction.
However, we find that neither composition nor descriptor-based similarity alone guarantees similar macroscopic properties among Pareto-optimal sequences.On the one hand, Pareto-optimal sequences can exhibit high variability in sequence characteristics despite occasionally possessing similar physical properties.This is particularly true in the transition regime, which is typified by high intra-group dissimilarity; for instance, P13 and P17 differ significantly in terms of sequence descriptors relative to sequences that are directly adjacent on the Pareto front.This variability in sequence characteristics also suggests a higher degeneracy of near-Paretooptimal polypeptides in the transition regime compared to the high-B 2 /high-D and low-B 2 /low-D regimes, where the characteristics of Pareto-optimal sequences tend to be more consistent.On the other hand, sequences with high overall composition and sequence-descriptor similarity can exhibit vastly different macroscopic properties.For example, extremal sequences P1 and P35 have greater sequence-descriptor similarity than P1 does with several other sequences in the high-B 2 /high-D regime.Even within the higher-degeneracy transition regime, P9 and P20 maintain overall high composition and sequencedescriptor similarity despite a 2.4-fold change in B 2 and a 1.9-fold change in D between these points on the Pareto front.Taken together, these observations suggest that there is substantial flexibility in choosing combinations of sequence features to achieve near-Pareto-optimal condensate properties.
We next consider whether specific sequence characteristics, rather than overall sequence similarities, might control property changes across the Pareto front.In this way, we find that a subset of sequence descriptors are most relevant and that the relative importance of each descriptor depends on the regime.In Fig. 4c, we highlight the trends along the Pareto front of key sequence descriptors including sequence length, N ; average sequence hydropathy, λ; sequence hydropathy decoration, SHD; the absolute net charge per residue, |q|; the fraction of charged residues (or the total charge), q+ + q− ; and sequence charge decoration, SCD.Data for other features that have previously been cited as relevant to IDP condensates [9,[60][61][62] are shown in Fig. S9 in the Supplementary Materials.The highlighted descriptors display an array of behaviors when moving across the Pareto front.N is roughly sigmoidal with −B 2 , starting at small N in the high-B 2 /high-D regime (P1-P8) before gradually but noisily increasing to plateau at large N in the low-B 2 /low-D regime (P27-P35).By contrast, λ and SHD exhibit U-shaped behavior, first generally decreasing in the high-B 2 /high-D regime and then increasing once entering the transition region.Interestingly, the net charge, |q|, of Pareto-optimal sequences is always close to zero, but there are qualitative differences in total charge, q+ + q− , and SCD in different regimes.In particular, q+ + q− starts at finite fractional charge at high-B 2 /high-D and initially increases before diminishing to zero in the transition and low-B 2 /low-D regimes; SCD displays similar behavior, but with the opposite sign.
The collective observations from Fig. 4c suggest two important guiding principles regarding the thermodynamics-dynamics tradeoff of protein condensates.First, the trends in charge-related features indicate that Pareto-optimal polypeptides are polyampholytic only in the high-B 2 /high-D regime.Polypeptides in this regime also tend to be shorter and less hydrophobic than other phase-separating IDPs generated during active learning.Therefore, polyampholicity in combination with nonuniform charge distribution and weak hydrophobicity is key to achieving phase separation with short (and therefore fast-diffusing) chains.Second, and by contrast, Pareto-optimal polypeptides are largely composed of uncharged, hydrophobic amino acids in the low-B 2 /low-D regime.In particular, sequences P27-P35 are predominantly composed of tyrosine, tryptophan, and phenylalanine, which have the highest hydrophobic character in the CG model and are also among the more massive amino acids.Greater hydrophobicity and longer chain lengths therefore result in both stronger inter-chain attraction and more sluggish diffusion in the low-B 2 /low-D regime.It is notable that these insights are consistent with prior computational and experimental studies of protein-condensate rheology [9,63,64].Therefore, these principles may provide insights into general property trends resulting from sequence modifications, even though they stem from analyzing a specific set of Pareto-optimal polypeptides.

Pareto-optimality is characterized by regime-dependent sequence determinants
We perform a counterfactual analysis [36,65,66] of the Pareto-optimal IDPs to gain further insight into the sequence determinants of the heteropolymer thermodynamics-dynamics tradeoff.In this context, a "counterfactual" is an IDP sequence that bears strong sequence similarity to a Pareto-optimal sequence but exhibits distinctly different physical properties (Fig. 4a).Such counterfactuals can therefore be used to attribute Pareto-optimality to small variations in IDP sequences.Here, we consider a polypeptide to be a counterfactual of a specific sequence on the Pareto front if (i) the cosine similarity of their feature vectors is greater than 0.9 and (ii) the counterfactual candidate is within a dimensionless distance of 0.15-0.3away from the Pareto front; for this calculation, we use a Euclidean distance in the in the standard-normalized B 2 -D plane (see Supplementary Materials for details).Requiring that the dimensionless distance is in the range 0.15-0.3ensures that the properties of counterfactual sequences are statistically resolvable from those on the Pareto front while remaining in a similar regime of physical property behavior.Because each Pareto-optimal polypeptide may have several counterfactuals satisfying the aforementioned criteria, we examine the average feature differences, where k denotes a particular feature of a feature vector ⃗ x, P i is the ith Pareto-optimal sequence with its value of x k indicated as x , C ij is the jth counterfactual of the ith Pareto-optimal sequence with its value of x k indicated as x , and n Cij is the total number of counterfactuals of the ith Pareto-optimal sequence.Plotting ⟨∆x k ⟩ for all features versus −B 2 /V 0 (Fig. 4d and Fig. S10 in the Supplementary Materials) allows us to ascertain whether specific sequence descriptors distinguish Pareto-optimal versus non-Pareto-optimal sequences.
In the low-B 2 /low-D regime, we find that Paretooptimal polypeptides possess several distinctive features relative to their counterfactual sequences.In particular, Pareto-optimal sequences within this regime are generally longer, feature more hydrophobic residues, and have larger values of the sequence hydropathy decoration parameter, SHD, than their counterfactuals.Paretooptimal sequences in this regime also tend to have fewer charged residues and a lower net charge than their counterfactuals.Since the Pareto-optimal sequences in this regime are also neutral (Fig. 4c), this implies that the counterfactual sequences possess a net charge despite having near zero values of the sequence charge decoration parameter, SCD (Fig. 4c,d).Our counterfactual analysis therefore indicates that hydrophobic interactions are the dominant determinants of Pareto-optimal sequences in the low-B 2 /low-D regime.
However, outside the low-B 2 /low-D regime, differences between counterfactuals and Pareto-optimal sequences are not as easily resolved.In the high-B 2 /high-D regime, no single sequence descriptor differs consistently between Pareto-optimal and counterfactual sequences.Instead, variations in B 2 and D manifest via subtle manipulations, such as exchanging residues with comparable hydrophobicity but different masses (Fig. S10 in the Supple-mentary Materials).Within the transition regime, there are statistically significant differences between features of Pareto-optimal polypeptides and counterfactuals, but the directions of such differences are not necessarily preserved across the Pareto front.In particular, the hydrophobicity differences, ⟨∆ λ⟩, and total charge differences, ⟨∆(q + + q− )⟩, fluctuate between positive and negative values within this regime (P9-P20 in Fig. 4d).This consistently enhanced variability across multiple features (see also Fig. S10 in the Supplementary Materials) can be attributed to the greater degeneracy of sequence space near the Pareto front in the transition regime, as discussed in the previous section.Overall, this counterfactual analysis indicates that the relationship between sequence features and Pareto optimality is nuanced, since relatively subtle changes from counterfactuals dictate behavior in the high-B 2 /high-D regime and myriad feature combinations exist for achieving near-optimality in the transition regime.

DISCUSSION
Using an active-learning approach, we have investigated the extent to which thermodynamic and dynamical properties are coupled in IDP condensates.Our approach uniquely combines high-throughput simulations of a coarse-grained model of IDPs, Bayesian active learning strategies for predicting physical properties from IDP sequences, and numerical optimization techniques to efficiently explore sequence space.Overall, this approach confirms that a tradeoff between thermodynamic stability and internal condensate dynamics is a general, albeit tunable, feature of protein-based condensates.Importantly, this work provide a robust assessment of condensate dynamics via a global analysis of IDP sequence space, rather than relying on systematic sequence mutations alone.
The essential insight of our active-learning study is that heteromeric polypeptides, by virtue of their diverse physicochemical properties, tend to decouple the thermodynamic and dynamical properties of IDP condensates.In our study, the correlation between proxies for thermodynamic stability and condensed-phase dynamics is strongest in the case of homomeric polypeptides.In particular, both the thermodynamic stability and the condensed-phase dynamics of homomeric polypeptide condensates can be predicted with high accuracy on the basis of the second-virial coefficient and the chain length alone.However, this correlation between thermodynamic and dynamic properties weakens when we consider heteromeric IDP sequences, indicating that we can design sequences to tune the thermodynamic and dynamic properties of IDP condensates independently.Consistent with this notion, we observe that the second-virial coefficient is not always sufficient for predicting phase separation of heteromeric IDP sequences.Ultimately, by identifying a representative set of Pareto-optimal polypeptides-heteromeric sequences that outline a limiting tradeoff boundary with respect to the second-virial and diffusion coefficients-we find that the condensed-phase selfdiffusion coefficient can be substantially increased relative to homomeric polypeptides with the same secondvirial coefficient.
Our systematic approach also provides insight into specific sequence descriptors that govern Pareto-optimal sequences.However, we find that these sequence determinants are surprisingly nuanced and differ depending on the position of a sequence along the thermodynamicsdynamics Pareto front.Whereas the extremal regions (high-B 2 /high-D and low-B 2 /low-D) display consistent trends with respect to key sequence features, the transition regime is characterized by high sequence variability, which likely reflects greater degeneracy with respect to sequence space within this regime.Chain length, hydrophobicity, and hydrophobic patterning are consistently important determinants of Pareto-optimality.By contrast, total charge and charge patterning descriptors are primarily relevant in the high-B 2 /high-D regime that is dominated by polyampholytes, and enhancements to charge patterning tend to suppress polypeptide diffusion.Finally, our counterfactual analysis indicates that relatively minor sequence perturbations can result in large changes to both B 2 and D, particularly in the transition regime.These findings suggest that different design principles are relevant in different regimes of physicochemical properties, and they highlight the importance of considering multiple sequence features when optimizing polypeptide sequences to achieve specific physical properties.
We emphasize that our numerical results apply to single-component protein condensates and rely on the accuracy of the underlying coarse-grained IDP model, which lacks physicochemical features such as directional sidechain interactions, hydrogen bonds, and secondary structure.These limitations of the underlying coarsegrained model influence the Pareto-optimal sequences, many of which look dissimilar from known naturally occurring IDPs, despite having overall low sequence complexity.In this sense, our designed sequences should be interpreted as polypeptide-like heteropolymers with hydrophobic, electrostatic, and excluded-volume interactions as opposed to literal amino-acid sequences.Coarsegrained models are also known to predict dynamical properties with reduced fidelity compared to all-atom models with many more degrees of freedom [67][68][69].It is likely that a greater range of variability of dynamical properties would be predicted by a model that incorporates such features, especially if such modifications were to allow for more accurate descriptions of low-density condensed phases [16].It is also likely that the range of variability of dynamic properties would be increased by examining IDP sequences with lengths greater than the upper limit of 50 used in this study.Nonetheless, despite these considerations, we expect that our findings regarding the existence of a thermodynamics-dynamics trade-off, as well as the relative importance of chain length, charge patterning, and hydrophobicity along the Pareto front, are robust with respect to alternative IDP models [29,[70][71][72][73][74] and actual polypeptide-like polymers.Experiments and higher-resolution simulations [67,[75][76][77] will be needed to test such predictions.
A major strength of our approach is its ability to explore IDP sequence space efficiently and characterize relationships among disparate biophysical properties.Crucially, the strategy of combining high-throughput simulations and Bayesian-guided optimization is independent of any limitations of the underlying coarse-grained model.Moreover, we have demonstrated that this approach can efficiently converge the Pareto front for a pair of biophysical properties that depend on a complex sequence of computations.We anticipate that this framework could be applied to other combinations of biophysical traits and may also be useful for studying multicomponent and multiphasic systems [78][79][80].Our work therefore establishes a promising approach for engineering custom biomolecular condensates with tunable thermodynamic and dynamic properties.

Model of intrinsically disordered proteins
The coarse-grained (CG) model developed by Regy et al. [26] represents IDP sequences, comprising 20 aminoacid types, in implicit solvent.The force field consists of a harmonic bonded potential as well as short-range van der Waals (vdW) and long-range electrostatic (el) nonbonded interactions, ) where r i,i+1 is the distance between bonded residues i and i + 1, the force constant is k b = 10 kcal/(mol • Å2 ), and the equilibrium bond length is b 0 = 3.82 Å.The vdW interaction takes the Ashbaugh-Hatch functional form, ) where r ij is the distance between nonbonded residues i and j, and is the Lennard-Jones potential with ϵ = 0.2 kcal/mol.The interaction strength, λ ij = (λ i + λ j )/2, and distance, σ ij = (σ i + σ j )/2, parameters are determined from the hydropathy scaling factors, λ i , and vdW diameters, σ i , of the residues, respectively [26].Lastly, screened electrostatic interactions take the form where q i is the charge of residue i, D = 80 is the dielectric constant of the solvent, and κ = 10 Å is the Debye screening length.

Physical property calculations
Second-virial coefficient.Second-virial coefficients are determined by calculating the potential of mean force, u(r), as a function of the center-of-mass (COM) distance, r, between an isolated pair of chains.These calculations are carried out using the adaptive biasing force (ABF) method [44] implemented in the COLVARS package in LAMMPS [81].A Langevin thermostat is applied to maintain a constant temperature of 300 K. Two molecules are placed in a cubic simulation box with dimensions 300 × 300 × 300 Å. ABF simulations are performed as a function of r, which varies from 1 Å to 100 Å. Force statistics are stored in bins of width 0.2 Å.The biasing force is applied after 1000 samples are collected in each bin, after which a final production run is performed for 5 µs with a timestep of 10 fs.The second-virial coefficient is then obtained from Eq. ( 1).Thirty independent simulations are performed for each free energy calculation to determine the statistical error.
Condensed-phase density.Considering the high computational costs of slab simulations, we developed an alternative method to determine whether a CG sequence undergoes phase separation at 300 K. To this end, we look for evidence of a van der Waal's loop in the equation of state (EOS) of a small system.Specifically, we compute the pressure using canonical-ensemble simulations of 100 chains with periodic boundary conditions at densities ranging from 0.2 g/ml-1.2g/ml.A Langevin thermostat is applied to maintain a constant temperature of 300 K. Simulations are carried out for 100 ns with a timestep of 10 fs.We assume a near-zero coexistence pressure to minimize statistical and fitting errors, so that phase separation can be identified by the existence of a negative pressure at a finite density.This approximation is justified by the fact that the coexisting dilute phase is characterized by a low polymer density and is nearly ideal, implying that the coexistence pressure is also close to zero.In cases where negative pressures are observed, we fit a cubic spline to the EOS to determine the condensed-phase density, ρ c , defined as the highest density at which the pressure crosses zero.The pressure values are bootstrapped with replacement 50 times to determine the average condensate densities and their standard errors.
Self-diffusion coefficient.To determine the self-diffusion coefficient of a chain within the condensed phase, canonical-ensemble simulations of 100 chains are performed at the condensed-phase density, ρ c , with periodic boundary conditions.The long-time behavior of the mean-squared displacement is then used to compute the self-diffusion coefficient, where ⃗ r is the position of the COM of a tagged chain.The temperature is maintained at 300 K using a Langevin thermostat with a damping frequency of 1 ps −1 .Simulations are carried out for 100 ns with a timestep of 10 fs.
For each diffusion-coefficient calculation, the statistical error is quantified using 30 independent simulations.

Design of novel intrinsically disordered proteins
Overview of framework.We deploy an active learning [32,55] framework based on Bayesian optimization [34] to identify IDP sequences that define a Pareto front of the properties B 2 and D. The active learning framework utilizes coarse-grained simulations to generate data for B 2 as well as pressure-density data and D, as appropriate.These data are used to train three different machine-learning (ML) models: two Gaussian process regressors [82] (GPR) and one random forest (RF) binary classifier [83].These ML models are used to make surrogate predictions during the optimization step of active learning, during which new IDP sequences are selected for simulation.Sequence optimization is carried out using a genetic algorithm, as described below.Following optimization, simulations are performed to generate new data and begin the next iteration of active learning.
Machine-learning models.GPR models are trained to predict B 2 and D, while a RF model is trained to predict whether a given IDP sequence will undergo phase separation.For a given sequence i, the input for all ML models is ⃗ x (i) , size-explicit augmented fingerprint as a feature vector [58].In particular, the feature vector is 30-dimensional, consisting of 20 features reflecting the composition of the amino acids and ten sequence-level chain descriptors.The elements of the feature vector are all normalized via linear transformation methods to be unit-scale.Additional details regarding the featurization, training, and hyperparameter optimization for all ML models are provided in the Supplementary Materials.Although model accuracy is not a primary objective, the three trained ML models show high accuracy in predicting the aforementioned properties (see Fig. S6 in the Supplementary Materials) at the conclusion of active learning.The coefficients of determination R 2 of the two GPR models are as high as 0.975±0.015and 0.970±0.015,and the accuracy of the RF classifier is 0.96.
Sequence optimization.Different fitness functions were used for sequence design by a genetic algorithm depending on the active-learning iteration.Denoting the itera-tion number by the index k, the fitness function where EHVI(⃗ x (i) ) is the expected hypervolume improvement acquisition function, RF(⃗ x (i) ) is the output label of the random forest classifier (evaluating to 1 if phase separation is predicted and to 0 otherwise), α(⃗ x (i) ) is a scaling function that biases against generation of sequences that have high similarity to previously simulated or proposed sequences, and Ã indicates a standard-normalized transformation of the property A based on the data acquired up to the given iteration.For each iteration, 96 unique sequences are generated.In Eq. ( 7), we compute the EHVI [56] for a given sequence ⃗ x * , where B * 2 and D * are predicted values of the secondvirial and diffusion coefficients, respectively, obtained from the GPR models; σ * B2 and σ * D are corresponding uncertainty estimates from the GPR models; P ′ = {⃗ x (0) , ⃗ x (1) , . . ., ⃗ x (n) , ⃗ x (n+1) } represents the current npoint approximation of the Pareto front, augmented with additional reference points ⃗ x (0) = (X, −∞) ⊤ and ⃗ x (0) = (−∞, X) ⊤ to facilitate the hypervolume calculation; Φ(s) denotes the cumulative probability distribution function for a standard normal distribution of s; and Ψ is a function defined as The similarity penalty α( ⃗ x (j) ) is defined to be where j = 2, 3, . . ., 96 are indices of candidate sequences generated in a given iteration.When a sequence is highly similar to one or more previously generated sequences, α tends to zero.For j = 1, we take α = 1 (i.e., no penalty is applied).
The active learning process is seeded with data obtained from simulations of 1,266 disordered sequences reported in DisProt [84].We select and simulate all sequences with sequence lengths between 20 and 50 residues (inclusive), taking care to remove duplicated samples.Since none of these sequences are determined to phase separate, we generate new sequences by maximizing a fitness function that involves only B 2 in iteration 0. The resulting data enable the training of ML models for B 2 , D, and phase separation.Then in subsequent iterations, we maximize a fitness function based principally on the expected hypervolume improvement (EHVI) to identify polypeptides that outline a Pareto front of −B 2 versus D. To confirm the convergence of the Pareto front, a final exploitation-only round of optimization is performed by maximizing a simple linear function of the standard-normalized properties − B2 and D.
For each iteration, 96 "child" sequences are successively generated that aim to maximize the relevant fitness function in Eq. (7).To facilitate convergence towards selecting each child sequence, 96 independent trials of sequence optimization are executed in parallel, and only the sequence with the best fitness score is chosen for explicit simulation; overall, this means that 96×96 sequence optimizations are performed in each iteration.
In iteration 0, the initial "parent" sequences are Dis-Prot sequences with negative B 2 .In all subsequent iterations, the parent sequences are Pareto-optimal sequences identified from the previous iteration.Candidate child sequences are derived from the parent sequences via a combination of crossover and mutation moves.In addition, "deletion" and "growth" moves are performed, which either remove a portion of the current sequence or add a portion of the current sequence within itself.These moves are executed independently in the order of crossover, mutation, deletion and growth with probabilities of 0.5, 0.8, 0.2, and 0.5, respectively.These moves are performed for 100 steps, which is sufficient for the genetic algorithm to converge to a local maximum of the fitness function (Fig. S11 in the Supplementary Materials).Additional details can be found in the Supplementary Materials.

C. Random forest classifier construction
A random forest (RF) classifier is used to predict whether phase separation is expected to occur as a function of the feature vector, ⃗ x.Here, the RF classifier consists of 100 decision trees.The best split in decision trees is determined by the Gini impurity.The minimum number of samples required to split an internal node is set at two, and the minimum number of samples required to be a leaf node is one.Five-fold cross-validation is employed on the training set to mitigate overfitting and to obtain optimal hyperparameters in the same fashion as described in SI Sec.I B. The performance of the RF classifier at the conclusion of active learning for an 80/20 train/test split is shown as a confusion matrix in Fig. S6c.

D. Sequence optimization and genetic algorithm
Sequence optimization to identify candidate proteins with high EHVI is facilitated using a genetic algorithm.In addition to conventional mutation and crossover moves, we also employ "deletion" and "growth" moves to improve diversity of proposed sequences.All moves are constrained as necessary to produce sequences with 20 ≤ N ≤ 50.In all discussion, sequences of N residues are index from 1 to N .In each iteration, batches of 96 candidate sequences are produced as described below.
• Step 1 : A set of n possible parent sequences sorted by their fitness from high to low.In Iteration 0, the parent sequences are sequences from DisProt that exhibit negative B 2 .At the beginning of every other iteration, the parent sequences correspond to those sequences that define the current approximation to the Pareto front.
• Step 2 : To generate "child" sequences, a pair of sequences are randomly selected from the top 30% of parent sequences.These sequences then undergo a series of crossover, mutation, growth and deletion moves: -Crossover : Suppose the pair of selected sequences have a length of N and M .First, two indices are selected s 1 and s 2 both from the set ∈ [1, min({N, M }) with uniform probability.Then, another index is selected -Growth: For a given sequence, the length of a sub-sequence for addition l gro ∈ [0, 50 − N ] is selected with uniform random probability.Next, a random position s in the sequence is selected with uniform probability.Then, the sub-sequence with indices ∈ [1, l gro ] is replicated and inserted at s.As necessary, the bond between amino acids at s and s + 1 is eliminated, the amino acid at s is bonded to the first amino acid in the replicated sequence, and the amino acid formerly at s + 1 is bonded the last amino acid in the replicated sequence.
-Mutation: For a given sequence of length N , an index s ∈ [1, N ] is selected with uniform random probability.Then, the amino acid at position s is exchanged for another amino acid, which is selected from the pool of twenty amino acids with uniform random probability.
The probabilities for crossover, deletion, growth, and mutation are 0.5, 0.2, 0.5, and 0.5, respectively.The moves are executed independently in the order of crossover, deletion, growth, and mutation.This procedure generates two new "child" sequences.
• Step 3 : Repeat Step 2 until ⌊0.7n⌉ new sequences have been produced.These newly generated child sequences are combined with the top 30% of the parent sequences from the prior generation to comprise a new set of n parent sequences, which represent a new generation.
• Step 4 : The fitness function is evaluated over the new generation of prospective parent sequences.The sequences are ranked by fitness.
• Step 5 : Repeat starting from Step 2 until 100 generations have been produced. • Step 6: The sequence with the best fitness at the conclusion of all generations is identified and proposed as a candidate for simulation as part of the active learning iteration.
Steps 1-6 above results in one candidate sequence.To facilitate convergence towards an optimal sequence, Steps 1-6 are executed 96 times in parallel (see Fig. S11) without communication.The resulting 96 sequences, which differ by the stochastic nature of the genetic algorithm, are ranked by their fitness, and the one with the best overall fitness is marked for characterization by simulation.This overall workflow is then executed again (going back to Step 1 ) 95 additional times to obtain a total of 96 distinct sequences for characterization by simulation.Note that the fitness function changes based on the active learning iteration and the prior history of proposed sequences (see Eqn. 7 of the main text).Although a genetic algorithm was used for this task, any reasonable optimization algorithm should be able to achieve similar results.In addition, different move probabilities or move sets may facilitate better convergence but were not explored here.
To characterize the stochasticity of the sequences produced by the genetic algorithm (GA), we perform five replicate rounds of GA to generate alternative sequences for the final "exploitation" step of our active learning approach (see Figure 2c in the main text).The predicted B 2 and D values of these polypeptides are shown in Fig. S12, and representative sequences are listed in Tables S2-S5.In Fig. S12, we consider four distinct regions along the Pareto front as examples for detailed comparisons.Within each region, the sequences generated by different GA replicates are predicted to have similar B 2 and D values and thus are predicted to lie in close proximity to the Pareto front.However, as shown in Tables S2-S5, the sequences generated by different GA replicates differ substantially from one another due to the stochastic nature of the GA optimization approach.Nevertheless, in this exercise, we do not identify any sequences that notably improve upon our previously identified Pareto front.This suggests that our stochastic optimization has converged to a representative boundary in terms of the thermodynamics-dynamics tradeoff, although precise sequences would likely differ were the study to be conducted again.

E. Determination of counterfactuals
The task of determining a counterfactual for a sequence requires identifying another sequence that possesses overall similar characteristics but exhibits a different classification.In the context of this study, the classifications correspond to Pareto-optimal versus near-Pareto-optimal.To treat our two objective properties (D and B 2 ) on equal footing, we consider the classification of near-Pareto-optimal to be based on a distance determined according to standardnormalized variables D and B2 .Each sequence i can then be represented as a coordinate in the standard-normalized B 2 -D plane as ⃗ z (i) = ( B(i) 2 , D(i) ) ⊺ .To account for statistical uncertainties associated with the quantities obtained from simulations, we associated near-Pareto-optimal with distances d ∈ [0.15, 0.3], which are measured from a given coordinate to a piecewise-linear approximation to the Pareto front.In particular, we first calculate three sets of 35 distances ({d v,i }, {d 1,i } and {d 2,i }) associated with our 35 Pareto-optimal sequences with coordinates ⃗ z (P1) , . . ., ⃗ z (P35) as below: and where d v,i is the distance between the point ⃗ z (k) to the piecewise lines determined by two adjacent Pareto-optimal points ⃗ z (Pi) and ⃗ z (Pi+1) , d 1,i is the distance between ⃗ z (k) and ⃗ z (Pi) , and d 2,i is the distance between ⃗ z (k) and ⃗ z (Pi+1) ; for the above i ∈ [1,34].

FIG. 1 .
FIG.1.Coarse-grained simulations of homomeric polypeptides predict a strong relationship between condensate stability and internal dynamics.(a) The potential of mean force, u(r), between the centers of mass of two polymer chains at 300 K. Three example curves are shown for net repulsive, weakly attractive, and strongly attractive 20-mers, A20, F20, and Y20, respectively.(b) Empirical equation-of-state curves at 300 K for the three homomeric polypeptides in panel a.Only two of these 20-mers, F20 and Y20, are predicted to phase separate to form condensed phases with the indicated densities ρ F 20 c

FIG. 2 .
FIG. 2. Active-learning approach to heteromeric sequence design.(a) The workflow for designing new IDP sequences by integrating coarse-grained modeling, machine learning (ML), and genetic-algorithm-based optimization.Active learning is an iterative process consisting of (i) performing molecular dynamics simulations to compute three physical properties of IDPs (B2, phase behavior, and D); (ii) learning the predictive features of IDP sequences by training ML models to estimate these three physical properties; (iii) designing new sequences by maximizing a fitness function; and (iv) returning to step (i) to test the physical properties of the proposed IDP sequences.The data set and ML models are subsequently updated via successive iterations.(b) The relationship between the scaled condensed-phase self-diffusion coefficients, D/D0, and scaled second-virial coefficients, −B2/V0, of all the designed IDP sequences that undergo phase separation.The color of each grid element indicates the number of distinct sequences within that region.The red highlighted area represents a one-standarddeviation region spanned by the Pareto-optimal heteromeric sequences (red stars).Homopolymer sequences (white circles) are shown for comparison.(c) Convergence analysis of the active-learning approach.The hypervolume under the Pareto front (blue symbols, left axis) increases with active-learning iterations, while the fraction of newly updated data points at the Pareto front (black symbols, right axis) tends to zero as the Pareto front converges.Filled symbols/solid lines show the convergence of the entire Pareto front, while open symbols/dashed lines describe the convergence of the Pareto front in the regime where −B2/V0 ≲ 50.A final "exploitation" round (shaded region) of sequence design confirms convergence of the hypervolume, introducing only a few new sequences into the Pareto front representation.

FIG. 3 .
FIG. 3. Validation of Pareto-optimal IDP sequences using direct-coexistence simulations.(a) Snapshots of slabgeometry direct-coexistence simulations at 300 K for five representative IDP sequences from the thermodynamics-dynamics Pareto front (see Fig.2b): P1 (blue), P8 (orange), P14 (red), P19 (cyan), and P35 (purple).Slab simulations of 1000 chains are performed in the canonical ensemble for 4.5 µs (P1) and 2 µs (P8, P14, P19, and P35).(b) Average density profiles, measured orthogonally to the interface between the dilute and condensed phases, obtained from the simulations in panel a.The final 3 µs of the P1 trajectory and the final 0.5 µs of the P8, P14, P19, and P35 trajectories are analyzed to compute these density profiles.Shaded areas indicate each bulk condensed phase (except for P14 and P19 for clarity), defined as the region where variations in the average density are less than 0.02 g/ml.(c) The correlation of the dimensionless second virial coefficient −B2/V0 and the density difference between the condensed and dilute phases, ρC − ρD.(d) Comparison of the dimensionless condensed-phase self-diffusion coefficients, D/D0, computed via the equation-of-state method within the active-learning framework (AL) or via the direct-coexistence (DC) simulations shown in panel a.The dashed line shows a linear fit to the data with a coefficient of determination of R 2 = 0.996 and a slope of 0.94.

FIG. 4 .
FIG. 4. Physicochemical features of sequences at the thermodynamics-dynamics Pareto front.(a) Representative IDP sequences along the Pareto front (stars) divide into three regimes.Also shown are counterfactual sequences (circles) of three representative Pareto-optimal sequences, P1 (blue), P8 (orange), and P35 (purple); see text for the criteria used to select counterfactual sequences.(b) Cosine similarities between the amino-acid composition component of the feature vectors (lower triangle, left color bar) and between the sequence-descriptor component of the feature vectors (upper triangle, right color bar) of the 35 representative Pareto-optimal sequences.(c) Variations of selected sequence descriptors (degree of polymerization, N ; average hydrophobicity, λ; hydropathy distribution, SHD; absolute net charge per residue, |q|; the total charge, q+ + q−; and charge distribution, SCD) along the Pareto front.Dashed lines indicate the mean value of each feature across all phaseseparating sequences studied via active learning.(d) Average feature differences between each Pareto-optimal sequence and its counterfactuals, Eq. (2).Error bars represent standard errors of the mean, and dashed lines indicate zero feature difference.
Finally, the sub-sequence with the indices ∈ [s 1 , s 2 ] from the smaller sequence is exchanged with the sub-sequence with indices ∈ [s 3 , s 3 + (s 2 − s)1] from the larger sequence, accounting for changes to bond connectivity as needed.-Deletion: For a given sequence of length N , the length of a sub-sequence for deletion l del ∈ [0, N − 20] is selected with uniform random probability.Next, an index s is chosen from the set [0, N − l del ].Then, the amino acids with indices in the sequence ∈ [s, s + l del − 1] are removed from the chain.If necessary, bonds are established between the amino acids with indices of s − 1 and s + l del .

)
FIG. S1.Relationship of viscosity and self-diffusivity in the condensed-phase of homomeric polypeptides.The viscosity, η, is anticorrelated with self-diffusivity, D, although the former possesses larger statistical uncertainties.The Pearson correlation coefficient between η and log(D) is −0.91.For all systems, quantities are measured in the canonical ensemble at the estimated coexistence density of the condensed phase as determined by the approximate EOS method (see main manuscript).The viscosity is determined using the Green-Kubo formalism(85).Error bars represent the standard error of the mean.
FIG. S11.Illustration of convergence toward an optimal sequence using parallel execution of the genetic algorithm.Each solid line indicates the current highest fitness of a sequence produced over the course of 100 generations for a single execution of the genetic algorithm (Steps 1-6 in Sec.I D).There are 96 lines corresponding to 96 independent executions with the same set of possible parent sequences.The different executions lead to a range of sequences with different fitness scores.The single best sequence (red star) is proposed for characterization by molecular dynamics simulation.A given active learning iteration produces 96 sequences in this manner.

5 FIG
FIG. S12.Comparison of optimized sequences produced during replicate runs of the genetic algorithm.The machine-learning predicted values of B2 and D for sequences generated by five replicate runs of the genetic algorithm (GA) in "exploitation" mode (see Figure 2c in the main text).The final Pareto front (red stars; cross referenced in Figure 2b in the main text) is shown for comparison.We compare representative sequences (opaque solid circles), each generated by a different GA replicate, from four distinct regions, 1-4, along the Pareto front to understand the stochasticity of the genetic algorithm.These replicate sequences are listed in Tables S2-S5.Sequences not listed in the tables are shown as partially transparent circles for clarity.

TABLE S2 .
Representative sequences in the region 1 of Fig. S12 generated by five GA replicates.The Pareto-optimal sequence P1 study is shown for comparison.

TABLE S3 .
Representative sequences in the region 2 of Fig. S12 generated by five GA replicates.The Pareto-optimal sequence P8 is shown for comparison.

TABLE S4 .
Representative sequences in the region 3 of Fig. S12 generated by five GA replicates.The Pareto-optimal sequences P14 and P15 are shown for comparison.

TABLE S5 .
Representative sequences in the region 4 of Fig. S12 generated by five GA replicates.The Pareto-optimal sequences P21-P24 are shown for comparison.