Geospatial distributions reflect temperatures of linguistic features

The geographical distributions of features of language contain a footprint of the cultural evolutionary dynamics of language.


S1 Introduction
This document provides further technical details of the model, its analytical solution, numerical simulations, and the resultant procedure for estimating temperatures of linguistic features from empirical data. We begin by addressing a limited special case of the model, the case where = = 0, i.e. when the horizontal process is completely faithful. The stationary distribution of this model can be analytically calculated for an isotropic lattice. Details of the calculation are given in Section S2. We then generalize the model first by assuming ≠ 0 and ≠ 0 but retaining the requirement that = . This generalization also turns out to yield to an analytical treatment, as detailed in Section S3. In Section S4 we drop the requirement that = . The full analytical solubility is lost for this more general model, but we show, using numerical simulations, that equations (1)-(4) of the main paper, which supply feature frequency, isogloss density and temperature at the stationary distribution, are valid for this general formulation also. Section S5 reports the results of a number of sanity checks in order to show that our method for evaluating empirical temperatures is robust to (a) choice of geographical area, (b) choice of spatial scale, (c) the spatial structure of the underlying substrate, and (d) potential idiosyncracies arising from the empirical spatial distributions of language families. Finally, Section S6 supplies detailed metadata about the features consulted in our study.

S2 Model: faithful horizontal transmission
To facilitate the analytical exploration of the model's behaviour, we first study the special case = = 0, i.e. we assume that all horizontal events result in the faithful transmission of a feature from one lattice node to another.
For the purposes of the analytical calculation we will treat the model as a spin system on a two-dimensional regular square lattice with = × sites and periodic boundary conditions; comparable approaches are available in the literature on voter models (26,27). We write (x) ∈ {−1, 1} for the spin at lattice site x = ( 1 , 2 ); (x) = (x) for the average spin of x (over realizations of the stochastic process); and = x (x)/ for the mean magnetization over the entire lattice. The feature frequency , or fraction of up-spins in the system, is related to by the identity = ( + 1)/2. We further write (x, y) = (x) (y) for the pair correlation of (x) and (y). In summations, x is understood to index the set of von Neumann neighbours of site x, i.e. the set {( 1 − 1, 2 ), ( 1 + 1, 2 ), ( 1 , 2 − 1), ( 1 , 2 + 1)}. (S1) In Sections S2.1-S2.4 below we calculate analytically the stationary-state feature frequency and isogloss density. While we are not aware of existing work in the exact same model, the calculation follows well established principles from the analytical treatments of voter models in statistical mechanics (see (21,27) and references therein). In particular a very similar calculation is carried out in (26). The novelty of our work is therefore not in the calculation itself, but in the application of the analytical calculation to define temperature for linguistic features.

S2.1 Spin flip probability
Central to our analytical derivations is the spin flip probability, i.e. the probability with which the spin at site x changes its state from −1 to +1 or vice versa, if it is selected for potential update. In our model this is of the form where (x) is the contribution of the ingress-egress process and (x) the contribution of the spatial (voter) process. These are and where it is important to remember that the summation over x runs over the four nearest neighbours of x. Hence we have

S2.2 Stationary-state feature frequency
The spin at x changes by the amount −2 (x) with probability 1 (x), the prefactor 1/ representing the probability of x being picked for update. Consequently, the mean spin (x) = (x) evolves as where Δ is the time step associated with each attempted spin flip. Bearing in mind that (x) (x) = 1 and plugging Eq. (S5) in, this implies Taking the sum over all sites x, one has Now x x (x ) = 4 x (x), since the LHS is the sum of the four von Neumann neighbours of all lattice sites, so that each site, having four neighbours, gets counted four times. Using = x (x)/ , we then have With the standard choice Δ = 1/ , and taking the limit → ∞ (i.e. the continuous-time limit Δ → 0), we thus find Hence the mean magnetization in the stationary state ( / = 0) is (S11) From this, using = ( + 1)/2, we find for the fraction of up-spins in the stationary state.

S2.3 Pair correlation function
To compute the pair correlation (x, y) = (x) (y) , we note that (x) (y) changes by the amount −2 (x) (y) if either x or y flips spin. Assuming x ≠ y, and working directly in the continuous-time limit, we have After some algebra we find where the summation over y is over the four nearest neighbours of y.
We now assume translation invariance and write (r) = (x − y) = (x, y). Then, for r ≠ 0, (S15) Due to translation invariance the two summations on the RHS coincide, and we have (always restricting r ≠ 0) where Δ is the lattice Laplace operator We note that we always have the boundary condition (0, ) = 1 for all (self-correlation is 1 at all times, as (x) 2 = 1).

S2.4 Stationary-state isogloss density
If (e 1 , ) were known, where e 1 is the unit vector (1, 0), the isogloss density could be obtained via the identity (S18) Thus, knowing the limiting ( → ∞) value of (e 1 ) would imply the stationary-state isogloss density . At the stationary state, (x) + (y) = 2 and (r) = 0. Assuming r ≠ 0, Eq. (S16) then implies in other words, (S20) In the special case = 0 (i.e., no spatial interaction between neighbouring sites), this implies (r) = 2 for all r ≠ 0. Thus for any two sites x ≠ y, (x) (y) = 2 , indicating that spins at different sites are fully uncorrelated. This is what one would expect, as all sites operate independently when = 0.
In the special case = 1 (no ingress-egress dynamics within the sites), on the other hand, we obtain Δ (r) = 0 for r ≠ 0. We also have (0) = 0. This implies (r) = 1 everywhere, so either all spins are up or all are down. This is the only possible stationary state when the only dynamics is through nearest-neighbour interactions. Now suppose 0 < < 1. Dividing Eq. (S20) by we obtain where we write We note that Δ (r) = Δ (r). To solve Eq. (S21), it then suffices to solve (for r ≠ 0) We now first focus on the equation for all r (including r = 0), and where x,y is the Kronecker delta. Let (r) be a solution of Eq. (S26). Then is a solution of Eqs. (S24) and (S25). This can be seen as follows: first, from Eq. (S27), so the condition in Eq. (S25) is met. Second, we need to show that Eqs. (S26) and (S27) imply Δ (r) + (r) = 0 for r ≠ 0. For r ≠ 0 we have from Eq. (S26). The quantity (r) in Eq. (S27) is proportional to (r) with a proportionality constant (1 − 2 )/ (0) which is independent of r. Given that (r) fulfills Eq. (S29) for r ≠ 0 it is then clear that (r) fulfills Δ (r) + (r) = 0 for r ≠ 0, i.e. Eq. (S24).
So we are left with the task of finding a solution of Let us write (r) in Fourier representation: where k = ( 1 , 2 ) and (k) is the Fourier transform of (r), i.e.

(S39)
This expression is related to the complete elliptic integral of the first kind (48). We use the following notation: . (S40) Hence we have where we abbreviate for convenience. Next, we write the Laplacian Δ (0) in full: Assuming isotropy, each term in the square brackets is equal to (e 1 ) = (1, 0). Hence Eq. (S30), evaluated at r = 0, takes the form From this we find Now using Eqs. (S23) and (S27), The stationary-state isogloss density is then where we have used Eq. (S41). On the other hand, so that Recalling that see Eq. (S22), and defining = − , i.e. = − , yields our final equation for the stationarystate isogloss density: where The expression in Eq. (S52) is a downward-opening parabola in whose height is fixed by

S2.5 Sanity check: the limits → 0 and → ∞
The complete elliptic integral ( ) has the following known properties (49): Now consider the limit → 0. This limit is relevant when either → 1, or both → 0 and → 0, see Eq. (S54). These are situations in which the spatial (voter) process dominates the ingress-egress dynamics. In this limit 1/(1 + ) → 1, so that − = (1/(1 + )) → ∞. Noting that Eq. (S53) can be written in the form we then find that ( ) → 0. Thus, in the limit where the spatial (voter) process completely overtakes the ingress-egress process, the stationary-state isogloss density is zero, indicating that all sites agree in their spin. Next consider → ∞. This limit occurs when + > 0 and → 0, i.e. the ingress-egress process dominates. Then 1/(1 + ) → 0, so that − → /2. From Eq. (S56), we then find that ( ) → 1 in this case. Thus, in the limit where the ingress-egress process completely overtakes the spatial process, the stationary-state isogloss density is given by the parabola = 2 (1 − ), indicating complete independence of the individual spins.

S3 Model: undirected horizontal error
We now drop the requirement that and be zero, but retain the assumption that = , i.e. that errors in the horizontal process are equally likely to innovate a feature as they are to lose it.

S3.1 Spin flip probability
The spin flip rate associated with the (vertical) dynamics within a site is not affected by the generalization, and remains as before The rate with which the spin at site x flips due to interactions with one of the neighbours changes though due to the modification of the model. We now have The first term only contributes if (x) = (x ) = −1. Without copying errors, no change of (x) would occur. However, with probability there is an error the attempted copying of (x ) = −1, and (x) flips from (x) = −1 to (x) = 1. The second term contributes only when (x) = 1 and (x ) = −1. This only results in a flip of (x) (from 1 to −1) if no error is made in the copying process, i.e. if there is no ingress event. Hence the prefactor 1 − . With probability the copying is associated with an ingress-type error, and (x) remains at = −1, and hence no flip occurs. The third and fourth terms have a similar interpretation.
We now re-organise terms in Eq. (S58): We therefore have as in the original model, and where it is important to remember that the summation over x runs over the four nearest neighbours of x. The total rate with which the spin at site x flips is therefore

S3.2 Stationary-state feature frequency
As in the more basic model, we proceed to compute the magnetization in the stationary state, or equivalently the number of spins in state +1 (the fraction of sites where the feature is present). Similar to Eq. (S6) we have It is now difficult to proceed for the case of general and . The difficulty arises from the last term in Eq. (S63), which contains a correlation function. This means that the equation for the mean magnetisation is not closed. Similarly, if we proceeded to calculate the correlation function, we would meet averages of objects involving three different sites.
However, we can continue if we set = . The problematic term then vanishes. Assuming = means that errors in the copying process from neighbours occur independent of what spin state is copied. We write = = and always assume 0 ≤ < 1/2. The case = 0 describes error-free horizontal transmission. For = 1/2 all information is lost in the transmission, the received state is entirely random (feature present or absent, each with probability 1/2), and independent of the transmitted state. The restriction < 1/2 therefore encapsulates the assumption that some information is transmitted on average. A choice of > 1/2 would reflect situations in which there is a tendency for bits to become actively inverted in transmission ('worse than random'). We do not study these here.
Taking the sum over all sites x in Eq. (S63), one now finds From this in turn we obtain Hence in the stationary state Comparing this to Eq. (S11) we note the additional term proportional to in the denominator. This term drives to zero. This makes sense, as errors in the copying process from neighbouring sites 'equalize' the frequencies of the two spin states present in the system. Suppose for example there is a majority of +1 spins. Then an attempted copying event is more likely to be one in which a +1 spin is copied. Errors in this process reduce the fraction of +1 spins, and increase the fraction of −1 spins. Using = ( + 1)/2, we find from Eq. (S66): This quantity is manifestly between 0 and 1 (numerator and denominator are both non-negative, and the denominator is always greater than or equal to the numerator). For = 0 this reduces to Eq. (S12), the formula for the simpler model. We also notice that = 1/2 when = 1. This makes sense, there is then only the horizontal process, and copying errors happen with the same rate , independently of the state that is copied (absence or presence of the feature). As mentioned above, this drives the system to equal fractions of −1 and +1 spins.

S3.3 Correlation function and isogloss density
We start from [see Eq. (S13)] where now From this we find x (x , y) where the summation over y is over the four nearest neighbours of y.
As in the analysis of the simpler model we now assume translation invariance and write (r) = (x − y) = (x, y). Then, for r ≠ 0, Adding and subtracting a term (the aim is to isolate a lattice Laplacian) we obtain Due to translation invariance the two summations on the RHS coincide, and we have (always restricting r ≠ 0) where Δ is the lattice Laplacian.
In the stationary state, (x) + (y) = 2 and (r) = 0. Assuming r ≠ 0, we then obtain the following from Eq. (S73): In other words, In the special case = 0 (i.e., no spatial interaction between neighbouring sites), this implies (r) = 2 for all r ≠ 0, as in the simpler model (but note that the stationary value of is different in the two models).
We note that the term containing the lattice Laplacian does not contribute when either = 0 or = 1/2. If = 0, then there is no copying process from neighbouring sites in the model from the beginning, so there is no interaction at all between the different sites. As mentioned above, if = 1/2 no information is transferred in copying processes from neighbours, as the state to be copied is transferred faithfully with probability or inverted, each with probability 1/2. This does not constitute any meaningful interaction, and neither does it create correlations between sites. The absence of the Laplacian reflects this.
In the following we will always exclude such pathological cases, and assume 0 < < 1 and ≠ 1 2 . Dividing Eq. (S75) by (1 − 2 ) we obtain This is of the same form as Eq. (S21), except that the expression for has changed [cf. Eq. (S22)]. We can therefore use all results obtained for the simpler model, provided we use the expression for in Eq. (S77). In particular we have where The height of the downward-opening parabola in in Eq. (S79) is fixed by ( ), where i.e., This expression for temperature clearly generalizes our earlier result (S54) for the special case = = 0. We recall that the meaningful range of is 0 ≤ < 1/2, so the temperature in Eq. (S82) is non-negative.

S4 Model: no restrictions on error rates
As discussed in Section S3.2 above, we have been unable to find a closed-form solution for the stationary-state feature frequency , isogloss density and temperature for the fully general model, i.e. when all of the parameters , , , and are in free variation. The root of the problem lies in the fact that the equations for the first and second moments of the distribution of features do not close, when ≠ . In such cases there is either a tendency towards introducing a feature in the horizontal process, or towards losing it, amounting effectively to selection for or against the feature. Truncation schemes would be required to obtain approximations for the stationary values for and . Interestingly very similar issues are known in the population genetic literature, in the presence of selection (28). We have not attempted to explore approximations for and based on truncations of the resulting hierarchy of equations. However, the following ansatz is reasonable, and, as we show below, sufficiently accurate for our purposes.
In the model with = , the stationary-state feature frequency reads [see Eq. (S67)] with = = , so 2 = + . Since is by definition the frequency of up-spins, the following generalization of (S83) suggests itself: Secondly, when = , the stationary-state temperature reads [see Eq. (S82)] Substituting again + for 2 yields the following ansatz for the general case ≠ : Finally, from the arguments in Section S3.3, we hypothesize the following form for the stationarystate isogloss density for the fully general model: where and is taken from Eq. (S86).
To corroborate these hypotheses, we ran a simulation of the model on a lattice of 50 × 50 sites and 100 features, with randomly drawn values for the parameters , , , and for each feature, measured the empirical values of , and at the end of each simulation, and correlated these against the above theoretical predictions. Each simulation was run for 25,000,000 iterations, so each feature underwent either a vertical or a horizontal event in each lattice site approximately 100 times during the simulation. The correlations are visualized in Fig. S1. The fact that all points lie close to the diagonal suggests that Equations (S84), (S85), (S86), even though not rigorously derived from theory, do faithfully describe these quantities at the stationary distribution. The Spearman correlation coefficients are = 0.994 ( < 0.001) for , = 0.985 ( < 0.001) for , and = 0.931 ( < 0.001) for .

S5 Robustness of method
To establish that our method for estimating empirical temperatures is robust against minor variations in the data it is fed, we have repeated the analysis by applying it to disjoint subsets of the WALS (32) dataset and by applying different spatial scales in the computation of isogloss densities, on whose values our empirical temperature estimates are based. We have also run the model on a neighbourhood graph inferred from the WALS atlas (instead of a regular lattice, as in the analytical solution), to assess to what extent the less regular real-life distribution of languages over physical space might invalidate the analytical regular-lattice idealization. Finally, in order to assess whether the irregular spatial distributions of genealogical units (language families and genera, rather than individual languages) might skew isogloss densities in unpredictable and non-universal ways, we have repeated the analysis with the restriction that nearest neighbour languages are always drawn from within the language family the focal language belongs to. The results of these extensions, which support our original idealized lattice model and the choice to consider only a small number of nearest neighbours for each language in the calculation of isogloss densities, are reported in the following subsections.

S5.1 The two hemispheres test
Our main analysis relies on a bootstrap scheme to arrive at estimates of temperature . To make sure that is actually catching a signal rather than the particularities of an arbitrary sample of the world's languages, the procedure is repeated 1,000 times and the median temperatures and bootstrap distributions reported in the main paper. The question remains, however, to what extent temperature estimates from one sample of languages correlate with estimates from another such sample-a robust method should arrive at a similar ranking of features with respect to temperature regardless of which set of languages is considered, provided that that set is large enough. To answer this, we follow (9) by first partitioning the world into two subsets which are both genealogically and areally independent to a great degree: Old World (OW), consisting of longitudes between 0°E and 180°E, and New World (NW), comprising latitudes between 0°W and 180°W. The analysis is now conducted for both subsets (OW and NW) separately, and the median estimates from the OW bootstrap correlated against the median estimates from the NW bootstrap. The Spearman correlation coefficient was found to be = 0.52 ( < 0.01). This should be compared against the value = 0.51, which is what has previously been reported for a similar two hemispheres comparison based on a phylogenetic-areal technique (9). This reasonably high (and statistically significant) positive correlation suggests that temperature ( ) is a measurable quantity whose value is to some extent independent of the specific composition of the language sample used in its calculation.

S5.2 Varying the spatial scale
In the main paper, as well as in the above two extensions, we have computed isogloss densities on the basis of considering the 10 nearest geographical neighbours of each language-that is, assuming a neighbourhood size of = 10. The question arises whether considering either smaller or larger neighbourhoods would noticeably affect our temperature estimates. To assess this, we repeated the analysis for neighbourhood sizes varying from = 1 to = 50 and correlated the temperature estimates against those found with = 10 in the original analysis. The values of the Spearman correlation coefficients are plotted in Fig. S2 (left); all correlations are statistically significant at < 0.001. The high correlations indicate that our method is robust over a range of spatial scales. Fig. S2 (right) shows that as neighbourhood size increases, the method becomes increasingly non-local. There is thus little incentive to consider larger neighbourhoods. This, moreover, does not exhaust the set of possible operationalizations of spatial neighbourhood. For example, the definition of isogloss density could incorporate geographical distance between languages such that the weight of any pair of languages decreases with increasing distance.
Known physical barriers could also be taken into account.

S5.3 Spatial adjacency graph vs. regular lattice
The above analytical solution of the model assumes a regular isotropic infinite lattice. Empirically, languages are not distributed over physical space in this manner, so the question arises to what extent the spatial configuration of the mathematical model actually reflects reality. To assess this, we simulated the model on an adjacency graph inferred from the geographical coordinates of real languages extracted from WALS (altogether 2639 languages, disregarding the sign languages recorded in the atlas): each language was connected to its 10 nearest geographical neighbours. To simulate a variety of features of varying temperatures, 100 independent features were modelled, drawing the parameters , , , and for each feature at random. Empirical values of feature frequency , isogloss density and temperature were computed at the end of each simulation (after 25,000,000 iterations, so that each feature got updated in each language roughly 100 times). The empirical quantities were then correlated against what the theory predicts on an ideal lattice. The Spearman correlation coefficients are = 0.995 ( < 0.001) for , = 0.988 ( < 0.001) for , and = 0.915 ( < 0.001) for (Fig. S3).

S5.4 Potential baseline isogloss density bias resulting from the geographical distributions of languages belonging to a genealogical unit
It is possible to imagine a feature which is stable in the sense of having very low ingress and egress rates but being confined to a single genealogical unit (a language family or genus) whose languages happen to be scattered among languages from other genealogical units in which this feature is not present (cf. Fig. 2C in the main paper, imagining that the colours signify both the presence of a feature as well as the genealogical unit, i.e. presence/absence of the feature precisely carves the genealogical boundary). In such a case, our method, which does not rely on genealogical information but only looks at the empirical spatial scattering, would conceivably report an intermediate (rather than low) temperature simply because of the areal scattering of the genealogical units, whereas a genealogical method would report maximal stability. Certain features of our method render it relatively robust to such a scenario. Our main procedure estimates features' temperatures across languages belonging to several genealogical units; moreover, in the bootstrap set-up described in the main paper, the specific geographical composition of the language sample varies from one bootstrap run to the other. However, to make the point more quantitatively, we have implemented a variation of the original analysis, in which the nearest neighbour languages are always drawn from the same language family as the focal language, as specified in the genealogical information contained in WALS. On the assumption that the proposed scattered nature of the geospatial distributions of genealogical units causes a significant problem for our method, the temperature estimates from this genetic-areal variation of the method ought not to correlate, or at least not to correlate well, with those from our main analysis. This is, however, not the case. The Spearman correlation is 0.814, statistically significant at the < 0.001 level.

S6 WALS feature levels
The following list gives the values of the WALS (32) features considered in this study; the italicized part after each value gives its value in our binarization (present for 'feature present', absent for 'feature absent' and N/A if the WALS value was excluded from the binarization as irrelevant). Languages attesting irrelevant values were excluded from the corresponding feature language sample for the purposes of calculating our statistics.