Physically informed data-driven modeling of active nematics

A continuum description is essential for understanding a variety of collective phenomena in active matter. However, building quantitative continuum models of active matter from first principles can be extremely challenging due to both the gaps in our knowledge and the complicated structure of nonlinear interactions. Here, we use a physically informed data-driven approach to construct a complete mathematical model of an active nematic from experimental data describing kinesin-driven microtubule bundles confined to an oil-water interface. We find that the structure of the model is similar to the Leslie-Ericksen and Beris-Edwards models, but there are appreciable and important differences. Rather unexpectedly, elastic effects are found to play no role in the experiments considered, with the dynamics controlled entirely by the balance between active stresses and friction stresses.


Introduction
Active matter and the associated emergent phenomena such as spontaneous organized motion have attracted a lot of attention recently [1][2][3][4][5].Many different types of active matter exist at different length scales.In this paper we focus on a particular class of such systems, known as active nematics [6], which feature highly elongated apolar interacting units.Some notable examples include systems comprised of vibrated monolayers of cylindrical rods [7], microtubules [8], actin filaments [9], and certain types of bacteria [10] suspended in a layer of fluid.
Active nematics exhibit a range of complex defect-mediated flows [8,[11][12][13] and a number of hydrodynamic models have been proposed to understand and quantify the observed flow patterns and transitions between dynamical regimes, mostly in two spatial dimensions, [6,[14][15][16][17][18][19][20][21][22][23].Most of these models are variants of the Leslie-Ericksen model [24,25] or the explicitly nematic Beris-Edwards model [26,27], which provide a coarsegrained description of microscopic nematic molecules in three spatial dimensions.Their applicability to macroscopic filaments such as microtubules (MTs) or actin bundles is questionable, especially when those filaments are confined to an interface between a pair of immiscible fluids [8,12,13].Indeed, while existing hydrodynamic models capture some aspects of the observed phenomena [28,29], they also fail to describe a number of experimental observations for MT suspensions [12,13].
Furthermore, hydrodynamic models of active nematics have a dozen or so parameters, few of which can be directly measured.Some progress has been made in indirectly identifying parameters in models of known functional form from experimental data for both filamentous bacteria [10] and MT suspensions [30].However, a more fundamental question of what the correct form of the hydrodynamic model of a particular active nematic system is remains unresolved.The present study goes a step further than these studies and shows that both the functional form of the model and the values of the coefficients can be identified from experimental data using an algorithm known as sparse physics-informed discovery of empiric relations (SPIDER) [31].
Building on a body of machine learning literature devoted to spatially extended nonequilibrium systems [32][33][34][35][36][37][38], SPIDER combines the relevant domain knowledge (e.g., the symmetries of the physical system) and experimental measurements (e.g., orientation and velocity fields) to identify a complete set of parsimonious physical relations necessary to describe the observed dynamics quantitatively.Unlike neural networks that can be trained to reproduce the observed dynamics, including those of active nematics [30,39] but yield little physical insight, SPIDER yields a set of physical relations in the familiar form of partial differential equations (PDEs) which can be directly compared against existing models.Moreover, these physical relations are easily interpretable and provide substantial new physical insight, especially when discrepancies with existing models are found.

Problem Description
We consider here an active nematic where the flow is driven by a suspension of MTs confined at a water-oil interface [8].This system has traditionally been described [14,17,20,30,40] using variations of either the Leslie-Ericksen or the Beris-Edwards model.Both models were originally derived to provide a continuum descriptions of molecular nematic liquid crystals in three spatial dimensions.Neither model has a firstprinciples justification for suspensions of colloidal mesogens confined to a two-dimensional interface between a pair of fluid layers which are themselves in contact with one or more rigid boundaries.For instance, while both fluids are incompressible in three dimensions, the interfacial flow generally will not satisfy the divergence-free conditions in two spatial dimensions as is assumed by both models.Similarly, the viscous flows in the two fluid layers can be very different from those in a three-dimensional volume.Some attempts have been made to address vertical confinement by, e.g., introducing Rayleigh friction [40,41].However, this does not address, but rather exacerbates, the key problem: both models contain a multitude of material parameters which cannot all be computed or measured, making direct quantitative comparison between experiment and theory a challenging task.
First-principles analyses uniformly assume, but never prove, that the system can be described by a two-dimensional model, whatever its functional form.While confinement indeed effectively constrains the motion of MTs to two spatial dimensions, the flow in both fluid layers driven by that motion generally remains fully three-dimensional [42].One of the key objectives of this study is therefore to address whether three-dimensional effects are weak and the system affords an effective quantitative description in two spatial dimensions, analogous to the treatment of other weakly turbulent flows in thin fluid layers [43].We will assume that such a model can be synthesized using measurements of three physical observables at the twodimensional oil-water interface.These observables -the director field, the flow velocity, and a local order parameter -appear in most first-principles models of active nematics.
In particular, the Leslie-Ericksen model is formulated in terms of the director field n.Nematic symmetry n → −n combined the presence of topological defects, which is a generic feature of this system, implies that the director field cannot be defined globally as a continuous field.To avoid this complication, most theoretical models instead use the globally continuous nematic tensor Q ij = Sn i n j , where 0 ≤ S ≤ 1 is a scalar order parameter, or its traceless counterpart Qij .The scalar order parameter S measures the degree of local alignment of nematic molecules and mainly serves to describe disorder arising from thermal motion of microscopic nematic molecules.In contrast, away from defects, MTs tend to be well-aligned due to a combination of their large aspect ratio and their relatively strong interaction; hence, we set S = 1 in our analysis.
Unlike molecular nematics, which tend to have a uniform density, MT-based systems are nonuniform, with their density or, more accurately, packing fraction φ varying between zero in the neighborhood of topological defects and unity far from the defects.The density field φ plays a role analogous -but note equivalent -to the (dis)order parameter S. Note that the values of φ ≈ 1 can only be achieved due the near-perfect alignment of the MTs.While some phenomenological models that include the density field have been proposed [15,44], they lack proper justification or validation.Most commonly, φ is simply assumed to be a constant in the first-principles models of MT suspensions.

Data-driven model discovery
The difficulties facing first-principles analyses call for an alternative approach where a quantitative model of a particular experimental system is constructed directly from experimental data.Here we will rely on a recently introduced technique named sparse physically-inspired discovery of empiric relations (SPIDER) which has already been validated using both numerical data describing a highly turbulent flow driven by pressure gradients [31] and experimental data describing a weakly turbulent flow driven by Lorentz force [38].
The latter case in particular has numerous similarities with the problem considered here.Most notably, it also involves two thin fluid layers supported by a rigid bottom boundary and strong vertical confinement is used to synthesize a quantitative two-dimensional model of the nominally three-dimensional fluid flow.Before discussing the application of SPIDER to the experimental system considered here, let us make several observations regarding the data, which are represented by sequences of snapshots of dense flourescently labeled MT bundles, such as the one shown in Figure 1(a).From these one can identify the director field n, the flow velocity u, and the MT density φ, all at the interface on a rectangular grid spanned by the two spatial coordinates and time.Examples of reconstructed fields are shown in Figure 1(b-d).The spatial resolution of the images however is insufficient to resolve the fast variation of φ and n near the topological defects.Moreover, no reliable information about either n or u can be obtained in the regions where φ ≈ 0 (dark areas in the image) which typically surround topological defects, as illustrated in Figure 1(b).As a result, we exclude these regions from our analysis.Outside of those regions, φ ≈ 1 is effectively constant.Therefore, only the fields n and u serve as useful data.
These velocity and director fields are, in fact, sufficient to synthesize a hydrodynamic model of this experimental system with the help of SPIDER, as described in detail in the Methods section.Figure 2 summarizes the key ingredients and steps of the algorithm.First, a sufficiently large set of tensor products F r (n, u), up to rank 2, is constructed in symbolic form from the vector fields n and u as well as their spatial and temporal derivatives.These are split into irreducible components according to the symmetries of the system, yielding libraries of terms with similar transformation laws.For instance, terms that are invariant with respect to the nematic symmetry and transform as vectors under rotation form one library, while terms that are invariant under rotations but change sign when n is replaced with −n form another.Each library represents a PDE of the form r c r F r = 0 (1) with coefficients c r that are assumed to be constant, reflecting the symmetry of the problem with respect to translations in space and time.Note that the vertical confinement implies that any model that might be discovered, would be two-dimensional, so ∇ = x∇ x + ŷ∇ y , u = xu x + ŷu y , and so on.1) evaluated using appropriately sampled data yields a coefficient equation (2).Finally, a sparse regression algorithm is applied to each coefficient equation to identify one or more empirical relations.
To combat measurement noise [36], PDE (1) is converted to its weak form.Specifically, it is multiplied by one of several smooth weight functions and integrated over one of many rectangular spatiotemporal domains to generate an overdetermined linear system of equations for the coefficients c r .The matrix G contains integrals of individual terms which are evaluated using properly nondimensionalized data.This process is repeated for each library, yielding a set of coefficient equations (2).To make sure that regions with unreliable director and velocity field data are excluded, the weights are constructed as a product with a mask ψ that vanishes in the regions to be excluded.In practice, the mask is constructed automatically based on the values of φ and ∇ i n j .An example of a mask overlaid on the director field is shown in Figure 1(c) and corresponding Supplementary Movie S4, which shows how the mask evolves in time.
Finally, a sparse regression algorithm illustrated by Figure 3 is applied to each coefficient equation ( 2).This iterative algorithm finds one or more approximate sparse solutions c, which correspond to parsimonious yet quantitatively accurate relations between the fields u and n.In this manner, nine equations were found across six different libraries (see Supplementary Material).It is straightforward to show that all of them can be derived from the following set of three fundamental relations: an incompressibility condition an evolution equation for the director field and a stress balance

Sparse model
Full library

Remove largest term
Figure 3: An iterative regression algorithm for finding a set of sparse solutions for the coefficient equation ( 2).An empirical relation balancing sparsity and accuracy is identified using sequential regression of the full library.The largest term in that relation is then removed from the full library and the procedure is repeated to find additional empirical relations contained in the library.G n represents the n-th column of the matrix G.
Here P⊥ is the projection operator onto the direction normal to n, A ij and Ω ij represent, respectively, the symmetric and antisymmetric components of the velocity gradient tensor ∇ i u j , and the bar denotes the trace-free component of a symmetric tensor.The two terms in equation ( 5) could be interpreted as an anisotropic viscous stress σ v ij = µ Qij Ākl Qkl and an active stress σ a ij = α Qij , with the coefficient c 5 ∝ −α/µ relating the strengths of activity and viscosity.The activity coefficient α is positive (negative) for an extensile (contractile) nematic [20,45].The coefficient c 5 is indeed negative, suggesting that α > 0, as it should be for an extensile nematic considered here.
Before discussing how these PDEs are related to the existing models of active nematics, let us emphasize that SPIDER ensures that these relations are satisfied in weak form.To check that the strong form of these PDEs is also satisfied, we evaluated and compared various terms at every location in space and time.In particular, the divergence ∇ • u for a typical snapshot is shown in Figure 4(b) and Supplementary Movie S1.We find that its magnitude is small almost everywhere, consistent with what has been previously reported for the active nematic under consideration [29].Regions where the divergence takes large, positive or negative, values are collocated with the regions where the MT density is low (φ ≈ 0).These regions are excluded from our analysis, so the velocity field can be considered essentially divergence-free where the MTs are dense (φ ≈ 1).
Figures 4(c,d) and Supplementary Movie S2 compare the field ∂ t n computed directly from the data with that given by equation ( 4).In this case, we find good agreement in the entire domain, not just in the regions with φ ≈ 1.Finally, Figure 4(e-h) and Supplementary Movie S3 compare the active stress σ a ij and the viscous stress σ v ij that appear in the stress balance relation (5).The two independent components of the (symmetric) stress tensor are shown in Figure 4(e-h) and Supplementary Movie S3.Again, we find good agreement in the entire domain.The minor discrepancies that can be seen are a result of insufficient accuracy in the numerical evaluation of the derivatives of the velocity field.

Discussion
Let us now turn to the discussion of the physical insight that the identified relations suggest.First of all, it should be emphasized that we obtained a complete mathematical description of the problem in the regions with nearly uniform density φ of MTs.This description has the same number of equations (three) as both the Leslie-Ericksen and the Beris-Edwards models, describing the fluid flow and the orientation of the MTs.
Two of these relations, the incompressibility condition (3) and the evolution equation for the director field (4), are the same as in the Leslie-Ericksen model since the coefficients c 1 and c 2 are both very close to the expected values of ±1.Furthermore, λ = −c 3 is found to be very close to unity, as expected for thin filaments [46].It is notable that no contributions from the free energy F [n], including those due to elasticity, are identified in the evolution equation ( 4).The remaining equation ( 5) however is rather unexpected.This is a tensor relation representing local balance between active and viscous stresses, not a vector relation representing momentum balance, as is the case in the Leslie-Ericksen and the Beris-Edwards model.Given that we are dealing with a creeping flow, it is hardly surprising that the terms ∂ t u and u • ∇u representing inertia can be ignored, so that equation (6c) should reduce to the force balance This equation is consistent with the discovered relation ( 5), provided the stress tensor can be decomposed as σ = σ v + σ a , where the viscous stress and active stress were defined previously.Note that, again, no elastic contribution is found, which is consistent with the absence of elastic effects in the evolution equation ( 4).In our experimental setup, which is characterized by a relatively low density of topological defects, it is the balance of active and viscous stresses that controls the flow [6,18,19] rather than the balance between active and elastic stresses, as is more commonly assumed [12,[14][15][16][17][20][21][22][23].In addition, we find that the pressure is essentially constant and thus, we can neglect it as well.Finally, the stress tensor σ v representing viscous effects is highly anisotropic, in contrast to what is commonly assumed.It represents a special case of the more general phenomenological expression proposed by Leslie [25].Here β 1 = ν 1 − λ(ν 2 + ν 3 ), β 2 = ν 5 + λν 2 , and ν 1 through ν 5 are the "Leslie viscosities."Only one term is present (β 1 = 0), the other two are either absent or too small to be detected While we refer to the two-dimensional tensors σ a and σ v as "stress tensors" in keeping with the convention, it is important to understand that their components do not represent the stress according to its standard definition in three dimensions.For instance, the forces at the interface generated by kinesin (active stress) are balanced by the xz and yz components of the viscous stresses in the fluid layers above and below the interface which do not appear in our effectively two-dimensional description.Rather, it is more appropriate to think of the relation ( 5) as a two-dimensional "projection" of the proper stress balance relation in three dimensions.In particular, although σ v involves spatial derivatives of the interfacial flow velocity, rather than the flow velocity itself, it does represent friction stresses [40,41], as explained in the Supplementary Material.One can also find there a discussion of the correct physical interpretation of the parameter c 5 .
Although the evolution equation for the Q-tensor contained in the Beris-Edwards model is not listed among the three fundamental relations (3)-( 5), it follows immediately from the evolution equation ( 4) if the latter is multiplied by n j .It is also identified via symbolic regression with coefficients very close to those in equation ( 4) (see Supplementary Material).In this case, as before, no elastic (or, more generally, free energy) contributions are found.Symbolic regression also identifies several other relations that follow from one of the fundamental relations.For instance, we find a simple relation Qij Āij + c 5 = 0 (9) with c 5 = (−0.55± 0.3%) ≈ c 5 that is equivalent to the stress balance ( 5), although it does not allow an equally intuitive physical interpretation.Note that two of the three physical relations (3)-(5) describing this system involve no time derivatives and cannot be identified using methods such as SINDy [47] which assume their presence.Physical constraints play a crucial role in constructing the libraries, and one should be careful to include as much physics as possible and, at the same time, avoid using assumptions that only appear logical but are not physically grounded.In addition, when properly constrained, symbolic regression becomes an extremely powerful and general tool for synthesizing new scientific knowledge, as the results presented here vividly illustrate.

Limitations and future work
It is worth reiterating that the model we have identified does not provide a full description of the dense MT suspension at a flat interface.This model only describes regions where the curvature of the MTs is low and their density φ is high and nearly uniform.The dynamics in this system are controlled by the topological defects, the neighborhoods of which have been excluded in our analysis.To properly account for these dynamics, our model has to be generalized to describe the regions around the defects where the curvature is high and the density φ varies in both space and time.That generalization has to include an evolution equation for φ and incorporate the dependence on φ into the remaining governing equations.The former is undoubtedly the continuity equation reflecting mass conservation Diffusion of MTs is negligible due to both confinement to the interface and their large size, hence the flux is likely dominated by advection [44], j = φu, although curvature corrections [15] are possible.
The dimensional version of the model ( 3)-( 5) contains only one parameter c 5 which defines a time scale.It does not contain any parameters that can be used to define a length scale.The absence of elastic stresses in our model suggests that interaction between topological defects is mediated by the flow in the two fluid layers when the mean defect separation L is large compared with the layer thickness h.Indeed, in our experiment, L ≈ 240 µm is larger than h = 50 µm.Our results, however, do not exclude the possibility that elastic effects might play a role in the high-curvature regions, requiring generalization of the model.Equation (4) does not appear to need any modification, as it describes the evolution of the director field quite accurately in regions even with low φ, as illustrated by Figure 4(c,d) and the corresponding Supplementary Movie S2.On the other hand, both relations describing the fluid flow have to be generalized.
In particular, there is no physical reason for the interfacial fluid flow to be divergence-free everywhere in two dimensions, even though the data suggests that the incompressibility equation ( 3) is satisfied away from the defects.As illustrated by Figure 4(a,b) and the corresponding Supplementary Movie S2, ∇ • u is large and positive in the neighborhood of topological defects with +1/2 charge where φ changes between low and high values.The regions with ∇ • u large and negative are also collocated with the regions where φ varies between low and high values, but represents past locations of these moving topological defects.In contrast, the flow is found to be essentially divergence-free in the neighborhood of topological defects with -1/2 charge.
Positive divergence is only found in high-curvature regions, suggesting that elastic effects play a key role in creating the defects and pushing the MTs apart, lowering the density φ.Indeed, there is experimental evidence supporting this role [13].We find the divergence to be negative in regions where the curvature is low and the density φ is below unity suggesting that depletion interaction plays an important role.Therefore, one might expect to see two additional dimensional parameters characterizing, respectively, the stiffness of the MT bundles and the depletion interaction in the generalized model.In particular, the stiffness parameter can be used to define a characteristic length scale, as discussed in the Supplementary Material.Indeed, exponential distributions of vortex areas at high densities of topological defects suggest the presence of a length scale which also depends on activity [12] and viscosity [42].
The stress balance equation ( 5) should also be generalized.In regions without MTs, where φ = 0, the active stresses are expected to vanish, while the viscous stresses are expected to become isotropic, with the latter balanced by the pressure gradient.Regions where the pressure gradients are nonnegligible likely correspond to locations where there is a strong flow towards or away from the interface associated with the large (positive or negative) values of ∇ • u.Indeed, the largest discrepancy between the active and viscous stresses is localized to those regions as well, as illustrated by Figure 4(e-h) and movie S2.
Finally, let us revisit the assumption of the local order in the average MT orientation that underlies the validity of the hydrodynamic description of this system.While the vast majority of MTs are oriented in the same direction (S = 1), there are rare exceptions.An example is shown in Figure 5(a) which features several MT bundles that are misaligned with the rest.In regions where MTs cross, their orientation cannot be described by a continuous field and the hydrodynamic description breaks down.This is illustrated in Figure 5(b,c) and Supplementary Movies S1 and S3, which show that the error in both the incompressibility condition (3) and the stress balance ( 5) is the largest in the region where misaligned MTs are found.This ultimately reflects that there is a 3D aspect of these MT suspensions that is often neglected.
It is straightforward to extend our analysis to include additional physical fields such as the MT density and kinesin or ATP concentration.The main challenge is the capability to measure and/or reconstruct the corresponding quantities in experiment in parallel with the director and flow fields.Additional fields need to be similarly well-resolved in space and time, so that the corresponding derivatives and integrals can be computed with reasonable accuracy.

Experimental setup and data acquisition
We conduct our experiments on the microtubule-kinesin active nematic system pioneered in [8].The long rod-like MTs are bundled together via depletion interactions and are driven out of equilibrium by the action of kinesin-streptavidin motor protein complexes, which are units that induce relative motion utilizing ATP as the energy source.Depletion forces also aid in driving the MT bundles to form bundles to the oil-water fluid interface, where they execute self-sustained bending and buckling instabilities.The system is extensile, which means that active stresses cause the MT bundles to extend in length and contract in width.
To investigate the dynamics of defects in 2D flat space, we prepare the active nematic in a flow-cell setup where the entire pool of ingredients is confined in a 2D sealed cell roughly 10 cm 2 in area and 100 µm in thickness.The lower surface of the cell is subjected to hydrophobic treatment (using Aquapel) and the upper surface to hydrophilic treatment (using polyacrylamide coating) to enhance wetting by the respective fluid phases.A fluorinated oil (HFE-7500 with surfactant E2K0660) forms the oil-phase, and the active MT suspension forms the water-phase.We obtained purified tubulin monomers and kinesin-streptavidin motor protein complexes from the Dogic Group at Brandeis University [8,48].The polymerization of tubulin to MTs is performed in our lab before mixing with other biomaterials as per the protocols described in previous works [8,29,48,49].The final active mix has 20% MTs by volume aided with 144µM ATP.The entire flow cell is sealed by epoxy resin and centrifuged at 1000 RPM to accelerate the depletion mechanism to the interface.
We use confocal fluorescence microscopy for visualization.The MTs are labeled with AlexaFlour 647 dye and illuminated at 633 nm; the excitation and emission peaks are at 651 nm and 667 nm, respectively.After sample preparation and centrifugation, we wait for 15-20 minutes to allow for uniform depletion, and then image at a constant framerate till the activity ceases.Typically, the MTs stay active for 6+ hours.Imaging is done using 10× and 20× objectives to focus on regions with area on the order of mm 2 , away from the edges of the flow cell.The imaging process results in a time series of 8-bit grayscale images, which are stored as the raw data [29].

Data processing
Two fields are extracted directly from experimental data: the flow velocity u and the nematic director field n.The video analyzed has O(1000) frames with 512x512 resolution and side length 484.35 µm, collected at a frame rate of 0.88 s −1 .The flow field u is extracted with Particle Image Velocimetry (PIV) performed by LaVision DaVis 10.1.The resulting flow field is only resolved on a 128x128 mesh, and all other fields are restricted to this resolution.
The director field n is extracted using Coherence Enhanced Diffusion Filtering (CEDF) [50].CEDF determines the direction along which the spatial intensity variation is the smallest, which corresponds to local average alignment of the MTs.This allows constructing the nematic tensor order parameter Qij , which upon diagonalization yields the nematic director (n).A detailed description of the image analysis technique can be found in [51].While this method reliably finds n relatively far from defects where MTs are dense, the low MT density near the defects prevents resolving the director field in these regions, even when carefully using the blurring schemes inherent in our CEDF methodologies.This can be seen in Figure 1c.
The MT density φ can also be extracted from experimental images as a function of blurred intensity.Far from defects, the MTs can be reasonably assumed to be of uniform density.In the neighborhood of defects, the density is discontinuous; it vanishes close to defects where the director field becomes undefined.We restrict our study to the former regions, characterized by negligible density fluctuations.We also ignore the scalar order parameter S which describes the local alignment of the MTs.Far from defects, S = 1 as the MTs are all well-aligned with rare exceptions.
Both u and n fields are smoothed with a moving least squares multivariate fit, and any necessary derivatives are computed with second order centered differences.A good choice of units is helpful for model discovery, and we choose a characteristic length and time scale so that mean velocity and vorticity are both unity, |u| = |ω| = 1.In these units, the experimental image sequence has dimensions of L x ×L y ×L t = 5×5×17.Libraries even powers of n odd powers of n rank-0 tensor (scalar) F r F r rank-1 tensor (vector) Table 1: The summary of the model libraries and their symmetry properties.

Model libraries
Locality and smoothness imply that a physical relation between the two fields, u and n, can be written as a superposition (1) of a number of terms F r , each constructed from u, n and/or their spatial and temporal derivatives, i.e., that relation has the form of a PDE.For instance, every relation in the Leslie-Ericksen model ( 6) has just such a form.For relations involving more than one term, all terms should transform in the same way under every operation in the symmetry group describing the problem [31], which includes rotations around the vertical axis and nematic symmetry n → −n.Hence, terms with different transformation properties are grouped into separate libraries {F r }.
The rotation symmetry implies that every term in a library has to transform as a tensor of a specific rank (we only consider tensors of rank 0 through 2).The nematic symmetry implies that every term in a library must involve either even or odd powers of n.An arbitrary rank-2 tensor can be decomposed into a symmetric and antisymmetric part.The symmetric part can be further decomposed into a traceless component and the trace, with the latter transforming as a scalar.Therefore, without loss of generality, the rank-2 tensor library can be split into two parts: symmetric traceless (denoted with a bar) and antisymmetric (denoted with a tilde).
All the libraries considered in this study are summarized in Table 1 with the terms contained in various libraries listed in the Supplementary Material.Subscripts denote the tensor indices and superscripts denote the hyperparameters associated with regression, e.g., the index of the term in the library.

Weak formulation
Once a library has been constructed, the corresponding PDE ( 1) is converted to an over-determined system of linear algebraic equations Gc = 0 for the unknown coefficients c = (c 1 , c 2 , • • • ) following Ref.[37].This is accomplished by multiplying every term by weight functions w k and integrating the result over a rectangular spatiotemporal subdomain V l , with each combination of k and l defining one or more rows of the feature matrix G. Weak form of the PDEs allows SPIDER to deal with noise levels as high as 100% [31].
The weight functions w k are constructed as a product of three components: (i) an envelope (1 − x 2 ) τ (1 − y 2 ) τ (1 − t 2 ) τ which vanishes, along with τ − 1 derivatives, on the boundary of subdomain V l to eliminate the boundary terms left after integration by parts; (ii) a modulation term which is taken to be one of {1, cos(πx − θ k x ), cos(πy − θ k y ), cos(πt − θ k t )} with arbitrary phases θ k m , and (i) a mask ψ which excludes unreliable data.Note that, in the above expressions, x, y, and t have been shifted and scaled such that The choice of τ is determined by the highest order derivative that appears in the PDE (1); here we take τ = 4 [36].The smooth mask ψ vanishes in regions where the density of MTs is low and/or their curvature is high and approaches unity far from those regions.It is constructed by successively smoothing the Boolean field ψ 0 , which vanishes when the image intensity is below some threshold or derivatives of n are above some threshold and is unity otherwise.
After performing the integration by parts to move as many derivatives as possible from the library term containing noisy data onto a smooth weight function (see Supplementary Material for details), each integral is evaluated numerically using the trapezoidal rule [36].The subdomains V l are chosen to include sufficiently many grid points in every direction to ensure reasonable accuracy of numerical quadrature (54x54x65 grid points) and are centered at uniformly distributed random points of the spatiotemporal domain describing the data set in order to ensure the data is well sampled and to avoid linear dependence.

Sparse regression
Once the feature matrix G has been constructed, we look for a sparse solution c that corresponds to a parsimonious physical relation that balances simplicity and accuracy.Simplicity can be measured in a number of ways; here we take it to be determined by the number of nonzero coefficients c r .The accuracy can be quantified by a properly normalized residual; we choose to use the 2-norm of the residual for the weak form of the relation, η = Ξ −1 Gc 2 .For relations involving multiple terms, we normalize by the 2-norm of the largest term, Ξ = max r c r G r 2 , where G r is the r-th column of G.For single-term relations, we instead use the 2-norm of the corresponding tensor before contraction, i.e., Ξ = ∇u 2 for the incompressibility condition (3).
Parsimonious relations are identified using sequentially thresholded regression (STR).At each step of this iterative algorithm, we compute c as the right singular vector of G corresponding to the smallest singular value.This corresponds to the solution of a constrained least squares problem G T Gc = 0 with the normalization c 2 = 1 [31].We start with the full library and, at each step, discard the term F r with the smallest magnitude, r = arg min c r G r 2 .Note that quality (completeness) of a library can be quantified by computing the residual η before any terms are discarded.For data not corrupted by noise, a good library should have η 1.After discarding a term from the library, the procedure is repeated, with the residual η k increasing with every iteration k as the number of terms in the model decreases.The iteration is terminated when either there is only one term left or the residual increases by a factor exceeding some threshold, i.e., η k+1 > γη k , where we typically take γ = 1.15.This choice is somewhat arbitrary, as most of our results are robust to a choice 1.1 ≤ γ ≤ 1.3.In the latter case, we find a multi-term relation, if the final residual is sufficiently small.In the former case, we have to recompute the residual using the normalization appropriate for single-term relations to decide whether that relation is suitably accurate.
Note that STR is not guaranteed to yield the most accurate sparse relation (of a given complexity) contained in the original library.For relatively simple relations (such as the ones identified in this study) and relatively small libraries, we can verify the results of STR by performing a combinatorial search computing the norm of all the relations containing a given number of terms.To validate a relation with K terms contained in a library with N terms through combinatorial search requires an order of N (N − 1) • • • (N − K + 1) operations, which is a tractable problem for the values of K and N considered here.For relations with larger K, the results of STR can be validated by adding one or more terms from the library that decrease the residual the most.If none of these lower the residual significantly, the result of STR is considered validated.
Multiple relations, including identities, can coexist in the same library.STR is therefore performed iteratively; the library is pruned by throwing out the most complex term of the previously identified relation.STR generally finds identities with machine precision residuals before physical relations with higher residuals.Relations can be more robustly labeled identities by testing them on random smooth synthetic data.Identities will have low residual independent of whether synthetic of experimental data is used, while physical relations are only found for experimental data.Low-dimensional combinatorial searches can more thoroughly explore the relation space to find both identities and physical relations.We stop searching for new relations once the residual for the full pruned library increases above some threshold e.g.0.4.
Finally, the coefficients c r of sparse physical relations are found to vary slightly depending on the number (chosen to be an order of magnitude larger than the size of the library) and location of the integration domains.To quantify the uncertainty in the coefficients, c r is first identified using the entire feature matrix.The coefficients are then recomputed for the same sparse relation using 100 different samples containing half of the rows in the feature matrix and only the columns corresponding to c r .The mean and standard deviations of the distribution define the value and uncertainty, respectively, of the corresponding coefficient.

Construction of libraries
Construction of the libraries in an ad-hoc manner is prone to errors, so we developed a systematic procedure.The first step is to construct tensors, up to a certain rank, from the vectors such as u, n, and ∇ and scalars such as 1 and ∂ t .The number of different such tensors quickly grows with the rank, so we use known physics to further constrain what terms can appear.In particular, the flow is slow, so inertia is negligible.Hence we allow u and ∂ t to appear at most once in any tensor.Elastic and viscous effects are described by terms with two spatial derivatives, so we allow ∇ to appear at most twice in any tensor.Lastly, we do not allow mixed spatiotemporal derivatives to appear.There are no physical constraints on n, so this field can appear an arbitrary number of times.Let us define the fundamental tensors of rank k as T (k) : The normalization n 2 = 1 constrains the derivatives of n: n i ∇ j n i = 0 and n i ∂ t n i = 0.A stronger consequence is that the gradient tensor ∇n is reducible [52]: It is the contractions of these reduced tensors which make up our libraries.This reduction does not remove all identities from the library, but it eliminates many.

Nematic-invariant scalar library
There are nine nematic-invariant scalars that can be obtained from even-rank reduced fundamantal tensors with the same nematic symmetry listed in (12): This library contains a single identity which can be used to prune its last term.These library terms are well-behaved everywhere, and so they can be integrated without any complications yielding the following entries in the feature matrix Two parsimonious physical relations are identified via symbolic regression, the incompressibility condition (3) and a relation (9) between the director and flow fields with the relative residuals of η = 0.03 and η = 0.08, respectively.Figures S1(a) and S1(b) illustrate how the residual varies with the number K of terms retained in the relation.Note that the coefficient c 1 has units of inverse time.Its magnitude is c 1 = O(1), which is consistent with our choice of units.Figure S1(a) shows that there is a version of the incompressibility condition involving 5 terms that has an even lower residual (η = 0.02) than the one-term relation.However, given that our library is missing terms which incorporate the crucial dependence on the microtubule (MT) density φ (as discussed in the main text), it is rather pointless to look for a physical interpretation of this more general relation.

Nematic-covariant scalar library
There are 18 nematic-covariant scalars that can be constructed from even-rank reduced fundamental tensors with the same nematic symmetry: where the last term can be eliminated using the identity (14).The nematic-covariant scalar library { F r } involves discontinuous fields.Director field changes sign at "branch cuts" connecting the topological singularities, causing problems with evaluating derivatives.To restore continuity, all terms are multiplied by n, making every term a vector.The two components of this vector are treated separately, effectively doubling the number of rows of G (one for i = 1 and another for i = 2): The most accurate parsimonious physical relation identified via symbolic regression contains two terms: where c (2) 5 = −0.57± 1%.This relation follows from (9) as long as c (2) 5 = c 5 and, indeed, the values of c  (9), which highlights the importance of the quality of the data-processing algorithm (here the one that extracts n from the images), especially for relations containing derivatives.

Nematic-covariant vector library
The nematic-covariant vector library is expected to include an angular momentum balance relation describing the evolution of the director field such as (6a) and hence contains the term ∂ t n.Since |n| = 1, this time derivative should be orthogonal to n, and without loss of generality, we can restrict our attention to vectors orthogonal to n that can be constructed from the reduced fundamental odd-rank tensors with the same nematic symmetry: where the term n j u j b i has been replaced by its more familiar form u j ∇ j n i .We can exclude the component of every term Fr along n and eliminate the discontinuities in the director field by considering the z component of the vector product Fr × n or, in the index notation: No identities are found in this library.Symbolic regression identifies one parsimonious physical relation (4).This relation is formally equivalent to the evolution equation (6a) of the Leslie-Ericksen model (sans the elastic contribution Γh) with the coefficients c r that are very close to ±1.The relative residual η = 0.08 is quite low and comparable to that of equation (9). Figure S1(d) shows how it varies during the regression.
Note that symmetric trace-free tensors have two independent components ij = 11 and 12, doubling the number of rows in G. Three identities appear in this library.
We use these to discard the last two library terms.Two physical relations are found in this library, the stress balance relation ( 5) and an evolution equation for the orientation tensor where c 1 = 1 ± 0.1%, c 2 = −0.96± 0.1%, c 3 = −1.02± 0.1%, and c 4 = 2.05 ± 0.1%.These relations have low residuals η = 0.1 and η = 0.09, respectively.For comparison, the tensor balance between Āij and Qij proposed in Ref. [18] has a much higher residual η = 0.67.The variation of the residuals during regression is shown Figure S1(g-h).Note that relation (30) corresponds to the equation (??) of the Beris-Edwards model and can be derived by multiplying the evolution equation ( 4) by n j .This gives the following correspondence between the coefficients: Using a lower STR threshold γ = 1.1, the residual of the relation ( 5) can be decreased slightly (about 10%) by including a term A ij representing isotropic viscous contribution.The corresponding coefficient is quite small (−0.02), which suggests that the isotropic contribution to viscosity is negligible in the regions of high density of MTs.On the other hand, this term is expected to be dominant in the regions with the low density of MTs.Hence, in general, one should expect the viscous stresses to include both contributions ( Qkl A kl Qij and A ij ) with the coefficients dependent on the MT density φ.The coefficients are also expected to depend on the shape of the nematic units, similar to the tumbling parameter λ.It should be possible to derive these coefficients from first principles by solving for the flow above and below the interface with appropriate boundary conditions.
The physical nature of the stress balance relation Equations ( 5) and ( 9) describing the fluid flow can be understood from fist principles in the regions where the curvature of the director field n is low.Indeed, MT bundles experience extension in the direction of n due to the action of kinesin motors and contraction in the transverse direction due to depletion interaction.The (ATP-concentration-dependent) extension rate E is the same as the contraction rate to preserve the mean density of MTs at the interface.Let us orient the x and y axes, locally, such that n = x, so that Qxx = − Qyy = 1/2 and Qxy = Qyx = 0.The kinematic condition at the interface containing the MTs requires a divergence-free interfacial flow to have a velocity where ψ(x, y) = Exy + f (x) + g(y), is a two-dimensional stream function and f (x) and g(y) are arbitrary functions that represent the mean flow.
Hence the MT bundles are characterized by a constant extension rate E rather than a constant stress magnitude α.The corresponding flow above and below the interface can be easily found by assuming its vertical component to vanish everywhere (this assumption is clearly invalid in the regions of nonzero ∇ • u near topological defects).Let z = 0 denote the interface between the two fluid layers.The corresponding flow field in the bottom fluid layer satisfying the kinematic boundary condition at the interface and the no-slip boundary condition at the bottom z = −h b of the cell is then u b (x, y, z) = (1 + z/h b )u i (x, y).Similarly, the flow in the top layer u t (x, y, z) = (1 − z/h t )u i (x, y) satisfies the kinematic boundary condition at the interface and the no-slip boundary condition at the top z = h t of the cell.The corresponding viscous stresses at the interface are linear in u i and represent Rayleigh friction [40,41].The friction coefficient depends on the thicknesses of the two layers and their dynamic viscosities µ b and µ t .We can now obtain a characteristic length scale that balances elastic stresses with viscous stresses caused by the extension of the MT bundles.The initial instability leading to an eventual formation of topological defects and controlling their spacing involves buckling of initially straight filaments [16,22].Let us denote the radius and length of our MT bundles as r and L, respectively.The force exerted by the viscous stresses is given by where h and µ describe the layer with the higher ratio µ/h, according to the expression (38), and the contribution from the mean flow can be ignored.Elastic force at the threshold of the buckling instability is given by where E is the Young's modulus.Balancing these forces, we find a characteristic length scale

Figure 1 :
Figure 1: Raw experimental images and the extracted fields.(a) A snapshot of the MTs.The complete image is shown, with the red box highlighting a small region centered on a −1/2 topological defect.Panel (b) shows the zoomed-in view of the red box.The extracted director field n (red arrows) does not line up with the orientation of the MTs in the center of the image, which indicates that director field data is unreliable near topological defects.(c) Director field (black arrows) and the mask ψ (color), (d) The flow field u (black arrows) and the corresponding vorticity ω = −2Ω xy (color).Panels (c) and (d) show the vector fields corresponding to panel (a) on a much coarser grid than that on which the data are available.

Figure 2 :
Figure2: A schematic representation of the SPIDER algorithm.Tensors F r are constructed in symbolic form from fields and their derivatives and projected into irreducible representations of the underlying symmetry group, yielding a set of libraries.Weak form of the corresponding equation (1) evaluated using appropriately sampled data yields a coefficient equation(2).Finally, a sparse regression algorithm is applied to each coefficient equation to identify one or more empirical relations.

Figure 4 :
Figure 4: The strong form of the identified relations.Symbolic regression identifies physical relations in weak form.To check the validity of the corresponding PDEs in strong form, we computed each term on the entire spatial domain using finite differences.Panel (a) shows a cropped snapshot of the MTs, and all other panels correspond to this snapshot.Panel (b) shows the divergence ∇ i u i of the interfacial flow.Panel (c) shows the observed angular velocity ∂ t θ = ε ij n i ∂ t n j of the MTs and panel (d) its value reconstructed using the vector relation (4).The remaining panels compare the two components of the active and viscous stresses in arbitrary units: the diagonal component σ a 11 (e) and −σ v 11 (f) and the off-diagonal component σ a 12 (g) and −σ v 12 (h).The viscous stress shown in panels (f) and (g) involves spatial derivatives and is therefore much noisier than the active stress shown in panels (e) and (g).Solid black curves in panels (b-h) correspond to a level set of the number density field φ and describe the edges of the regions devoid of MTs.

Figure 5 :
Figure 5: MT alignment and the accuracy of the identified relations.(a) Three noticeable out-ofplane MT bundles are misaligned with the rest of the MTs at the bottom left of the image.(b) Divergence of the flow is the largest in the regions of misalignment.(c) The error (residual) of the scalar form (9) of the stress balance is also the largest there.

and c 5
are found to be very close.Figure ()(c) shows how the residual varies during regression.The inclusions of the additional factor s = ∇ • n increases the residual to η = 0.23, almost three-fold compared with

) provided c 1 =
−c 2 = c 1 , 2c 3 + c 4 = c 2 , and c 4 = −c 3 .The relative residual for relation (34) is η = 0.05, making it the most accurate representation of ∂ t n found.The variation of the residual during regression is shown in Figure S1(i).