Detecting and quantifying causal associations in large nonlinear time series datasets

A novel causal discovery method for estimating nonlinear interdependency networks from large time series datasets.


This PDF file includes:
Section S1. Time series graphs
Section S2. Alternative methods
Section S3. Further PCMCI variants
Section S4. Conditional independence tests
Section S5. Theoretical properties of PCMCI
Section S6. Numerical experiments
Algorithm S1. Pseudo-code for condition selection algorithm.
Algorithm S2. Pseudo-code for MCI causal discovery stage.
Algorithm S3. Pseudo-code for adaptive Lasso regression.
Table S1. Overview of conditional independence tests.
Fig. S10. Experiments for nonlinear models (part 1).
Fig. S11. Experiments for nonlinear models with different sample sizes (part 1).
Fig. S12. Experiments for nonlinear models (part 2).
Fig. S13. Experiments for nonlinear models with different sample sizes (part 2).
Fig. S14. Experiments for observational noise models.
Fig. S15. Experiments for nonstationary models.
Fig. S16. Runtimes for numerical experiments.
Fig. S17. ANCOVA interaction plots.
Fig. S18. Comparison of PCMCI and CCM on logistic maps.

Section S1. Time series graphs

Definition 1 (Definition of time series graph). Let X be a multivariate discrete-time stochastic process and G = (X × Z, E) the associated time series graph. The nodes of this graph are the individual time-dependent variables X = (X^1, ..., X^N) at each time t ∈ Z. Let X_t denote the process at a particular time step, which we call the present, and X^-_t = (X_{t-1}, X_{t-2}, ...) the past process. Consider an underlying time-dependent data-generating process

X^j_t = f_j(P(X^j_t), η^j_t),    (S1)

where f_j is some potentially nonlinear functional dependency and η^j_t represents mutually independent dynamical noise, i.e., η^j_t ⊥⊥ η^k_{t'} for k ≠ j or t ≠ t'. P(X^j_t) ⊂ X^-_t denotes the causal parents of variable X^j_t. Variables X^i_{t-τ} and X^j_t for τ > 0 are connected by a lag-specific directed link "X^i_{t-τ} → X^j_t" in G, pointing forward in time, if X^i_{t-τ} ∈ P(X^j_t).
Another way to define the edges E of the graph is, for τ > 0,

X^i_{t-τ} → X^j_t ∈ E  ⟺  X^i_{t-τ} ⊥̸⊥ X^j_t | X^-_t \ {X^i_{t-τ}},    (S2)

where '⊥̸⊥' denotes the absence of a conditional independence and X^-_t \ {X^i_{t-τ}} the past of the multivariate process excluding X^i_{t-τ}. Both definitions are equivalent for processes satisfying the assumptions used here (section "Assumptions of causal discovery from observational data"), in particular the Markov property, which holds since the noise terms η in Eq. (S1) are assumed independent. The graph is infinite in time, but in practice it is only considered up to some maximum time lag τ_max. Throughout this work we assume that only one time series realization is available and assume stationarity: the above definition of links at time t holds for links at every t' ∈ Z. The parents P(X^j_t) for all variables X^j_t ∈ X_t then represent the graph G. Contemporaneous links for τ = 0 (P(X^j_t) ⊂ X^-_{t+1} in Eq. (S1)) can be defined in different ways (34, 39). Here they are left undirected, but other techniques (59, 60) could be applied to determine causal directionality for contemporaneous links.
Section S2. Alternative methods

In the following we define the alternative causal and non-causal methods considered here. Supplementary Tab. S3 gives an overview of the methods compared in the numerical experiments.

S2.1 FullCI
The most straightforward way to test the existence of causal links is to directly test

X^i_{t-τ} ⊥⊥ X^j_t | X^-_t \ {X^i_{t-τ}}.    (S3)

As before, the past X^-_t is truncated at a maximum time lag τ_max. This test, too, can be implemented with any of the conditional independence tests defined in Supplementary Sect. S4 below. In its original formulation, Granger causality between time series variables X and Y is based on fitting a linear or nonlinear model, including other covariates, to Y; a causal link X → Y is assessed by quantifying whether including the past of variable X in the model significantly reduces the prediction error for Y (12). Our formulation can be interpreted as a general, lag-specific version. For linear models we implement FullCI by fitting a vector-autoregressive (VAR) model using the statsmodels package, which also provides the corresponding p-values.
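As an illustration of this lag-specific linear test, the following sketch fits the full regression of Eq. (S3) by ordinary least squares and returns the t-test p-value of the coefficient of interest. This is a simplified stand-in for the VAR implementation (statsmodels' VAR results expose analogous coefficient p-values); the toy process and all parameter choices are hypothetical.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
T, tau_max = 500, 2
# Toy process: X0 drives X1 at lag 1 (illustrative, not from the paper).
data = np.zeros((T, 2))
for t in range(1, T):
    data[t, 0] = 0.5 * data[t - 1, 0] + rng.standard_normal()
    data[t, 1] = 0.6 * data[t - 1, 1] + 0.8 * data[t - 1, 0] + rng.standard_normal()

def fullci_parcorr(data, i, j, tau, tau_max):
    """Linear FullCI test of X^i_{t-tau} -> X^j_t: regress X^j_t on the
    whole truncated past and read off the t-test p-value of the
    X^i_{t-tau} coefficient. Minimal sketch, not Tigramite's code."""
    T, N = data.shape
    n = T - tau_max                       # effective sample size
    y = data[tau_max:, j]
    cols, idx = [np.ones(n)], None        # intercept + all lagged variables
    for lag in range(1, tau_max + 1):
        for k in range(N):
            if lag == tau and k == i:
                idx = len(cols)           # position of the tested coefficient
            cols.append(data[tau_max - lag:T - lag, k])
    Z = np.column_stack(cols)
    beta, *_ = np.linalg.lstsq(Z, y, rcond=None)
    resid = y - Z @ beta
    dof = n - Z.shape[1]
    se = np.sqrt(resid @ resid / dof * np.linalg.inv(Z.T @ Z)[idx, idx])
    return 2 * stats.t.sf(abs(beta[idx] / se), dof)
```
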

S2.2 Lasso
In the linear case, FullCI can be phrased as testing for nonzero coefficients in a VAR model. If implemented using standard ordinary least-squares regression, the problem becomes ill-defined if the number of coefficients exceeds the number of samples. One way to address this problem is regularized high-dimensional regression. In particular, Lasso (17) is a common regression method that can also be used for variable selection. Lasso regression is known to be inconsistent in some scenarios, which is overcome by the Adaptive Lasso (18) that yields a consistent estimator by utilizing an adaptively weighted penalty. Here we implemented an adaptive Lasso that consists of computing several Lasso regressions with iterative feature re-weighting, see Supplementary Algorithm S3. After several iterations, the active set of variables is determined as the non-zero coefficients. Then all zero coefficients are assigned a p-value of one, while the p-values for the active set of variables are determined by an OLS regression including only the active variables. To select the optimal regularization parameter, we used a time-series-based cross-validation scheme; some tests based on AIC hyperparameter selection did not yield good results. We found that the p-values tend to be over-conservative and that false positive levels cannot be reliably controlled at a desired threshold.
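The iterative re-weighting idea can be sketched as follows, with a simple proximal-gradient (ISTA) Lasso solver inside each re-weighting round. The hyperparameters `lam`, `n_reweight`, and `n_iter` are illustrative, not those of Supplementary Algorithm S3.

```python
import numpy as np

def adaptive_lasso(X, y, lam=0.1, n_reweight=3, n_iter=2000):
    """Adaptive-Lasso sketch via iterative feature re-weighting: each
    round solves a weighted Lasso by proximal gradient descent (ISTA),
    then re-weights penalties by 1/|beta|. Zeroed features receive a
    huge penalty in the next round and stay inactive."""
    n, p = X.shape
    w = np.ones(p)                        # adaptive penalty weights
    L = np.linalg.norm(X, 2) ** 2 / n     # Lipschitz constant of the gradient
    for _ in range(n_reweight):
        beta = np.zeros(p)
        for _ in range(n_iter):
            grad = X.T @ (X @ beta - y) / n
            z = beta - grad / L
            beta = np.sign(z) * np.maximum(np.abs(z) - lam * w / L, 0.0)
        w = 1.0 / (np.abs(beta) + 1e-8)   # re-weight for the next round
    return beta

# Illustrative sparse-recovery example (hypothetical data).
rng = np.random.default_rng(0)
X = rng.standard_normal((200, 10))
y = 1.0 * X[:, 0] + 0.8 * X[:, 3] + 0.1 * rng.standard_normal(200)
beta = adaptive_lasso(X, y)
```

The non-zero entries of `beta` then form the active set on which the final OLS p-values are computed.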

S2.3 PC algorithm
The original PC algorithm was formulated for general random variables without assuming a time order. It consists of several phases: first, in the skeleton-discovery phase, an undirected graphical model (62) is estimated whose links are then oriented using a set of rules (5, 25) in subsequent phases. Since here time order yields an orientation of time-lagged links, we implement only the skeleton-discovery phase of the PC algorithm, as given in Supplementary Algorithm S1; in contrast to our fast variant, however, the original version does not restrict the number of condition combinations q_max to test. In our numerical experiments, we use a large choice of q_max = 10 and α_PC = 0.2. In contrast to FullCI or PCMCI, it is not straightforward to assess the confidence of causal links with the PC algorithm, because links are iteratively removed. Here we use a p-value assessment for causal links according to ref. (66): the maximum of all p-values from the conditional independence tests for different condition sets S in Eq. (4) defines the aggregated p-value of a causal link. We term this causal discovery method PC in the numerical experiments. Also here, we found that the p-values tend to be over-conservative for the most part, with outliers for strong autocorrelations, and that false positive levels cannot be reliably controlled at a desired threshold.
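A minimal sketch of the skeleton-discovery loop with this max-p-value aggregation, for a single target variable and a generic CI test; the function names and simplified control flow are ours, not Supplementary Algorithm S1 itself.

```python
from itertools import combinations

def pc_skeleton(cands, ci_test, alpha_pc=0.2, q_max=10):
    """Skeleton discovery for one target variable in the spirit of the
    PC algorithm. `cands` is the initial candidate-parent list;
    `ci_test(x, S)` returns the p-value of a CI test of candidate x
    with the target given condition set S. A link's aggregated p-value
    is the maximum over all tests performed on it (66)."""
    cands = list(cands)
    pvals = {}
    p = 0                                 # cardinality of condition sets
    while p <= len(cands) - 1:
        for x in list(cands):
            others = [c for c in cands if c != x]
            # Test at most q_max condition-set combinations of size p.
            for q, S in enumerate(combinations(others, p)):
                if q >= q_max:
                    break
                pv = ci_test(x, S)
                pvals[x] = max(pv, pvals.get(x, 0.0))   # aggregate p-value
                if pv > alpha_pc:         # independence found: remove link
                    cands.remove(x)
                    break
        p += 1
    return cands, pvals
```
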

S2.4 Convergent cross-mapping
Convergent cross-mapping (CCM) (14) assumes an underlying dynamical system and is based on state-space reconstruction (see also refs. (15, 53)). In this framework a causal relationship between two dynamical variables X and Y can be established if they belong to a common dynamical system. If variable X can be predicted using the reconstructed system based on the time-delay embedding of variable Y, then X has a causal effect on Y. CCM is here estimated with embedding dimension E = 2 and, alternatively, by optimizing the embedding dimension using the simplex algorithm (14). To test significance we used the surrogate test ebisuzaki with 500 surrogates in the R package rEDM. CCM requires two criteria (14): (1) a significant CCM value at library size n and (2) an increasing CCM value over increasing library length. As a p-value of CCM we hence take max(p_n, p_conv), where p_n is the p-value of CCM at library size n and p_conv is the p-value for the hypothesis of an increasing linear trend of the CCM value with increasing library length, here from n' = 10 to n' = n. Note that to test X → Y, CCM and related works (53) only use the time series of X and Y, with the underlying assumption that the dynamics of any common driver can be reconstructed using delay embedding. In Supplementary Fig. S18 we depict a comparison with PCMCI on coupled logistic maps described in Supplementary Sect. S6 below.
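The combination of the two criteria into a single p-value can be sketched as follows, assuming the surrogate p-value p_n and the cross-map skills over library sizes have already been computed (e.g., by rEDM). The one-sided linear-trend test is our simplified stand-in for the convergence criterion.

```python
import numpy as np
from scipy import stats

def ccm_pvalue(p_n, ccm_skill, lib_sizes):
    """Combine CCM's two criteria into one p-value: p_n is the
    surrogate-test p-value at full library size, and the convergence
    p-value tests for an increasing linear trend of the CCM skill over
    library size (one-sided). Illustrative sketch, not the rEDM workflow."""
    res = stats.linregress(lib_sizes, ccm_skill)
    # One-sided p-value for a positive slope.
    p_conv = res.pvalue / 2 if res.slope > 0 else 1 - res.pvalue / 2
    return max(p_n, p_conv)
```
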

S2.5 Unconditional pairwise tests
We also investigate unconditional pairwise test measures I(X t−τ ; Y t ). In the ParCorr implementation this is simply the Pearson correlation coefficient (Corr), in the GPDC implementation the distance correlation (dCor), and for CMI the mutual information (MI), see definitions in Supplementary Sect. S4.

Section S3. Further PCMCI variants

S3.1 PCMCI0
In the MCI stage of the PCMCI method, one can also test for all links without the condition on the parents P(X^i_{t-τ}), or, equivalently, with p_X = 0; we denote this variant PC1+MCI0. Note that for X^i_{t-τ} ∈ P̂(X^j_t) this is the test in the last iteration of the PC algorithm, but here also links that were removed in PC1 are tested again. In our numerical experiments (Supplementary Figs. S4, S5, S6, S7) we found that PC1+MCI0 has inflated false positives. We also test a variant, called PC1+MCI0 pw, where all time series are pre-whitened prior to running PC1+MCI0: we preprocess all N time series by estimating the univariate lag-1 autocorrelation coefficients â_i = ρ(X^i_{t-1}; X^i_t) and regressing out the AR(1) part of the signals,

X̃^i_t = X^i_t - â_i X^i_{t-1}.

Then the PC1+MCI0 test is applied to these residuals X̃. Our numerical results in Supplementary Figs. S4, S5, S6, S7 show that this approach also fails to control false positives.
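The AR(1) pre-whitening step can be sketched as follows (a minimal version that demeans each series first; Tigramite's preprocessing may differ in details):

```python
import numpy as np

def prewhiten_ar1(data):
    """AR(1) pre-whitening as described above: estimate each series'
    lag-1 autoregression coefficient by OLS and regress it out.
    Returns the residual series (one sample shorter)."""
    T, N = data.shape
    resid = np.empty((T - 1, N))
    for i in range(N):
        x = data[:, i] - data[:, i].mean()
        a_hat = (x[:-1] @ x[1:]) / (x[:-1] @ x[:-1])   # lag-1 AR coefficient
        resid[:, i] = x[1:] - a_hat * x[:-1]
    return resid
```
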

S3.2 BivCI
In the numerical experiments, we also evaluate a bivariate conditional independence test (BivCI),

X^i_{t-τ} ⊥⊥ X^j_t | (X^j_{t-1}, ..., X^j_{t-τ_max}),

which corresponds to a pairwise lag-specific version of Granger causality or transfer entropy (67). This test thus removes autocorrelation to some extent, but does not exclude common drivers or indirect dependencies; auto-dependencies induced by common drivers are also not removed. As analyzed in Supplementary Figs. S4, S5, S6, S7, BivCI reduces autocorrelation effects to some extent, but false positives are not correctly controlled, as expected.

Section S4. Conditional independence tests
Similarly to the PC algorithm, the PCMCI framework (Supplementary Algorithms S1 and S2) that we propose can be used in conjunction with any conditional independence test; these tests will typically be based on estimating different dependence measures with associated test statistics. Here we implement three tests: partial correlation (ParCorr), a nonlinear two-step conditional independence test we term GPDC, and a fully non-parametric test based on conditional mutual information (44). Supplementary Tab. S1 gives an overview of the tests. Note that the conditional independence tests in the present context are conducted with a sample size of n = T - 2τ_max, where T is the time series length; the first 2τ_max samples are cut off since 2τ_max is the maximum time lag of a lagged parent in P(X^i_{t-τ}) relative to X^j_t. Further, we might need to account for missing and masked samples.

S4.1 ParCorr
Partial correlation for testing X ⊥⊥ Y | Z is here estimated (Supplementary Tab. S1) in a two-stage procedure: a multivariate regression of X and of Y, separately, on Z, followed by a correlation test on the residuals. Its advantages are fast computation and that the distribution under the null hypothesis of conditional independence is known analytically, but it is applicable only to the multivariate Gaussian case, which can capture only linear dependencies.
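A minimal sketch of this two-stage test (the degrees-of-freedom convention below is our assumption; the exact Tigramite convention may differ slightly):

```python
import numpy as np
from scipy import stats

def parcorr_test(x, y, Z):
    """Residual-based partial-correlation CI test: regress x and y
    separately on the conditions Z (with intercept), then apply a
    t-test to the Pearson correlation of the residuals."""
    n = len(x)
    Zc = np.column_stack([np.ones(n), Z])
    rx = x - Zc @ np.linalg.lstsq(Zc, x, rcond=None)[0]
    ry = y - Zc @ np.linalg.lstsq(Zc, y, rcond=None)[0]
    r = np.corrcoef(rx, ry)[0, 1]
    dof = n - Zc.shape[1] - 2             # assumed convention
    t = r * np.sqrt(dof / (1 - r ** 2))
    return r, 2 * stats.t.sf(abs(t), dof)
```
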

S4.2 GPDC
GPDC also belongs to the class of residual-based conditional independence tests. Instead of a linear regression, here the first step is a Gaussian process (GP) regression (42) and the second step a test for the (unconditional) independence of the uniformized residuals with the distance correlation coefficient (DC) (43). GP regression is a widely used Bayesian nonparametric regression approach. Distance correlation (43) is a measure of dependence between two random variables and is zero if and only if the variables are independent. Thus, distance correlation measures both linear and nonlinear association. See Supplementary Tab. S1 for details. Note that the underlying assumption of GPDC is that of an additive functional dependency, see ref.
(72) for a more general, but still residual-based, test. The estimator for distance correlation is defined as

dCor(X, Y) = dCov(X, Y) / sqrt(dVar(X) dVar(Y)),

where the distance covariance dCov (and analogously the distance variance dVar) is computed by

dCov²(X, Y) = (1/n²) Σ_{j,k=1}^{n} A_{j,k} B_{j,k},

where A_{j,k} and B_{j,k} are the doubly-centered distance matrices of X and Y (see ref. (43)), respectively, and n is the sample size (in the present context the length of the time series minus cutoffs due to τ_max). Distance correlation is implemented in Tigramite based on the code of the original dcov.test function in the energy package for the R language. The prior transformation of X and Y to uniform marginals allows the distribution and critical values under the independence null hypothesis to be pre-computed for each sample size n (implemented in Tigramite), thereby avoiding computationally expensive permutation tests for each test of X ⊥⊥ Y | Z. Note that, since we are not aware of a way to account for the GP step when assessing the degrees of freedom of the subsequent distance correlation test, we simply used the sample size n, which did not seem to inflate false positives in our experiments, at least for not too high dimensions. Figure 3D illustrates a quadratic relationship X = Z² + η^X and Y = -Z² + η^Y for Z ∼ N(0, 1), where X ⊥⊥ Y | Z can be detected with GPDC, but not with ParCorr.
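The doubly-centered estimator can be sketched directly from the formulas above (omitting the uniform-marginal transformation and the pre-computed null distribution):

```python
import numpy as np

def distance_correlation(x, y):
    """Sample distance correlation (43) as used in GPDC's second step:
    doubly-center the pairwise distance matrices A, B and combine
    dCov^2 = mean(A*B) with dVar^2 = mean(A*A). Sketch following the
    energy-package definition, not Tigramite's exact code."""
    def centered_dist(v):
        d = np.abs(v[:, None] - v[None, :])
        return d - d.mean(0) - d.mean(1)[:, None] + d.mean()
    A, B = centered_dist(x), centered_dist(y)
    dcov2 = (A * B).mean()
    denom = np.sqrt((A * A).mean() * (B * B).mean())
    return np.sqrt(max(dcov2, 0.0) / denom) if denom > 0 else 0.0
```

Unlike the Pearson correlation, this detects the purely nonlinear (here quadratic) association between x and x².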

S4.3 CMI
ParCorr and GPDC, like any two-step procedure for conditional independence testing that consists of a regression followed by an unconditional test on the regression residuals, rest on an underlying assumption of additive noise and a functional dependence on the conditioning variables. In the presence of dependencies that cannot be represented even in a nonlinear functional form, a regression will not be able to remove the dependence on the conditioning variables. Figure 3D for CMI illustrates a multiplicative case with X = Zη^X and Y = -Zη^Y for Z ∼ N(0, 1), where both ParCorr and GPDC fail to establish X ⊥⊥ Y | Z. In such cases the two-step procedure should be replaced with fully non-parametric techniques for measuring and testing conditional dependence. Here, a fully non-parametric test (44) for continuous data based on conditional mutual information, combined with a local permutation scheme, is implemented. The conditional mutual information for continuous and possibly multivariate random variables X, Y, Z is defined as

I(X; Y | Z) = ∫∫∫ dx dy dz p(z) p(x, y | z) log [ p(x, y | z) / ( p(x | z) p(y | z) ) ],    (S10)

and is estimated with the nearest-neighbor estimator

Î(X; Y | Z) = ψ(k) + (1/n) Σ_{i=1}^{n} [ ψ(k^z_i) - ψ(k^{xz}_i) - ψ(k^{yz}_i) ],

with the Digamma function ψ(x) = d/dx ln Γ(x) as the logarithmic derivative of the Gamma function and sample length n. The only free parameter k is the number of nearest neighbors in the joint space of X ⊗ Y ⊗ Z, which defines the local length scale (in maximum norm) ϵ_i around each sample point i. Then k^{xz}_i, k^{yz}_i and k^z_i are computed by counting the number of points with distance strictly smaller than ϵ_i (including the reference point i) in the subspace X ⊗ Z to get k^{xz}_i, in the subspace Y ⊗ Z to get k^{yz}_i, and in the subspace Z to get k^z_i. The decisive advantage of this estimator compared to fixed global bandwidth approaches is its local data-adaptiveness (Fig. 3D): the hypercubes around each sample point are smaller where more samples are available. As opposed to GP regression, this feature also allows highly non-smooth dependencies to be detected (44).
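A sketch of this nearest-neighbor estimator, without the local permutation scheme; the choice k = 5 and the tie-breaking used for the strict inequality are illustrative.

```python
import numpy as np
from scipy.spatial import cKDTree
from scipy.special import digamma

def cmi_knn(x, y, z, k=5):
    """Nearest-neighbor CMI estimator as described above: find the
    k-th-neighbor distance eps_i in the joint space (maximum norm),
    count neighbors strictly within eps_i in the XZ, YZ and Z
    subspaces (including the point itself), and combine with digamma
    terms. Minimal sketch, not Tigramite's implementation."""
    n = len(x)
    xyz = np.column_stack([x, y, z])
    # Distance to the k-th neighbor, excluding the point itself.
    eps = cKDTree(xyz).query(xyz, k=k + 1, p=np.inf)[0][:, -1]
    def count(sub):
        tree = cKDTree(sub)
        # Shrink the radius slightly to enforce the strict inequality.
        return np.array([len(tree.query_ball_point(sub[i], eps[i] * (1 - 1e-12),
                                                   p=np.inf))
                         for i in range(n)])
    k_xz = count(np.column_stack([x, z]))
    k_yz = count(np.column_stack([y, z]))
    k_z = count(z.reshape(-1, 1))
    return digamma(k) + np.mean(digamma(k_z) - digamma(k_xz) - digamma(k_yz))
```
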
Unfortunately, few theoretical results are available for this complex conditional mutual information estimator. While the Kozachenko-Leonenko estimator is asymptotically unbiased and consistent (73, 77), its variance and finite-sample convergence rates are unknown. Hence, the null distribution in the CMI test (44) relies on a local permutation test that is also based on nearest neighbors and data-adaptive. Here we set k_CMI = 50, k_perm = 5 for the nearest neighbors in the local permutation scheme, and B = 500 for the number of surrogates used to approximate the null distribution. These choices are based on the findings in ref. (44). CMI is implemented in Tigramite, where k_CMI = ⌊0.1n⌋ with sample size n, B = 500, and k_perm = 5 are the default parameters. Alternative conditional independence tests are, e.g., kernel conditional independence tests (78-80). PCMCI-CMI can be considered a doubly adaptive causal discovery method: PCMCI adapts the condition selection locally to the causal network, and CMI adapts the conditional independence estimation locally to the sample density.

S4.4 Discrete data
The previous tests were developed for continuously-valued data. For discrete or symbolic data, the software package Tigramite implements a CMI test based on discrete entropy estimation, called CMIsymb. This test directly estimates CMI based on

Î(X; Y | Z) = Σ_{x,y,z} p̂(x, y, z) log [ p̂(x, y, z) p̂(z) / ( p̂(x, z) p̂(y, z) ) ],

where the discrete densities p̂ are estimated from symbol frequencies. For this test no analytical results for the null distribution are available and a permutation test is recommended.
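A plug-in sketch of this estimator via discrete entropies, using the equivalent identity I(X; Y | Z) = H(X, Z) + H(Y, Z) - H(Z) - H(X, Y, Z); this is illustrative, not Tigramite's CMIsymb code.

```python
import numpy as np
from collections import Counter

def cmi_symb(x, y, z):
    """Plug-in CMI estimate for symbolic data from symbol frequencies,
    combining the four discrete entropies (in nats)."""
    def entropy(*cols):
        counts = np.array(list(Counter(zip(*cols)).values()), dtype=float)
        p = counts / counts.sum()
        return -(p * np.log(p)).sum()
    return entropy(x, z) + entropy(y, z) - entropy(z) - entropy(x, y, z)
```
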

S5.1 Computational complexity
PCMCI has polynomial complexity in the number of variables N and the maximum lag τ_max. The computational complexity of the PC1 stage of PCMCI (Supplementary Algorithm S1) strongly depends on the network structure and the parameter α_PC. The sparser the causal dependencies, the faster the convergence. In the worst case, where the network is completely connected (which is rather pathological), the PC1 condition-selection stage for N variables amounts to O(N · (Nτ_max)²) conditional independence tests with iteratively increasing cardinality. The MCI stage (Supplementary Algorithm S2) further involves N²τ_max tests (for τ > 0). Hence, the worst-case total computational complexity in the number of variables is polynomial, of order N³τ²_max + N²τ_max; if α_PC-optimization is taken into account, the first term acquires a factor of the number of α_PC-values tested. The maximal dimensionality of the estimations in the MCI stage is 2 + |P̂(X^j_t)| + |P̂(X^i_{t-τ})|. The computational runtime will then depend on how the conditional independence test scales with this dimensionality and the time series length T. In the numerical experiments, we analyze runtimes of the compared methods for different network sizes N and time series lengths T (Supplementary Fig. S16). In Tigramite, a recycle_residuals option is implemented for ParCorr and GPDC that stores already computed residuals in memory for re-use in later tests of PC1 or MCI.

S5.2 Consistency
We here give a proof of consistency for the population version of PCMCI, that is, PCMCI estimates the true graph if there are no errors in the conditional independence tests. The proof relies on the three standard assumptions of causal discovery (5): (1) Causal Sufficiency, which requires that all common drivers of the observed variables are themselves observed, (2) the Causal Markov Condition, implying that X^j_t is independent of X^-_t \ P(X^j_t) given its parents P(X^j_t), and (3) Faithfulness, which guarantees that all conditional independence relations in the distribution are entailed by the graph. Faithfulness says that if two variables are independent conditionally on a set S, then they are also separated by S in the graph. See ref. (34) for formal definitions of separation and the assumptions. As part of the Causal Markov Condition in the time series graph context, we also assume no contemporaneous causal effects, which excludes causal effects at τ = 0; only then are the lagged parents sufficient for the Causal Markov Condition. Last, we also assume stationarity, so that dependencies (or the lack of them) remain unchanged across time. With these assumptions we can state the consistency of PCMCI.

Proposition 1. (Consistency) Let X be a stochastic process with true time series graph G as defined in Def. 1 (equivalent to the definition in Eq. (S2)) and let Ĝ be the graph estimated by PCMCI (Algorithms S1, S2) implemented with a consistent conditional independence test. Assuming Causal Sufficiency, Faithfulness and the Causal Markov Condition, we have that

Ĝ = G.

A proof is given below. Note that the consistency of the population version of PCMCI is a weaker statement than, for example, uniform consistency, which bounds the error probability as a function of the sample size, giving a rate of convergence. Robins et al.
(45) showed that no uniformly consistent causal discovery technique from the class of independence-based methods (5) exists, since the convergence can always be made arbitrarily slow by a distribution that is almost unfaithful with some dependencies made arbitrarily small. Uniform consistency can only be achieved under further assumptions that exclude these almost unfaithful dependencies (46). To prove consistency, we need the following two lemmas.
Lemma 1. Let P̂(X^j_t) denote the estimated condition set of Algorithm S1 for X^j ∈ X and let P(X^j_t) denote the true parents. Assuming Causal Sufficiency, Faithfulness and the Causal Markov Condition together with a consistent conditional independence test, we have that

P̂(X^j_t) ⊇ P(X^j_t),

that is, the estimated parents are a superset of the true parents.
Proof. Suppose X^i_{t-τ} ∉ P̂(X^j_t). Causal Sufficiency implies that X^i_{t-τ} was observed and independence can be tested. For p_max = ∞ (default parameter value in PC1) and with a consistent conditional independence test, PC1 removes X^i_{t-τ} from P̂(X^j_t) if and only if X^i_{t-τ} ⊥⊥ X^j_t | P̂(X^j_t) \ {X^i_{t-τ}} in the last iteration of PC1. Faithfulness then implies the absence of the link X^i_{t-τ} → X^j_t and hence X^i_{t-τ} ∉ P(X^j_t). By contraposition, every true parent is retained. In Lemma 2 we prove that PC1 estimates exactly the true parents.
Lemma 2. Let P̂(X^j_t) denote the estimated condition set of Algorithm S1 for X^j ∈ X. Assuming Faithfulness, the Causal Markov Condition, and Causal Sufficiency, we have that

P̂(X^j_t) = P(X^j_t),

that is, the estimated parents are the true parents.
Proof. From Lemma 1 we know that P̂(X^j_t) is a superset of P(X^j_t), so we only need to check whether all members of P̂(X^j_t) are also in P(X^j_t). Assume to the contrary that there is some X^i_{t-τ} ∈ P̂(X^j_t) with X^i_{t-τ} ∉ P(X^j_t). The Causal Markov Condition implies X^j_t ⊥⊥ X^-_t \ P(X^j_t) | P(X^j_t), and from the weak union property of conditional independence it follows that X^i_{t-τ} ⊥⊥ X^j_t | P̂(X^j_t) \ {X^i_{t-τ}}, so that X^i_{t-τ} would have been removed in the last iteration of PC1, contrary to the assumption. Hence P̂(X^j_t) = P(X^j_t). With these two Lemmas we can prove Proposition 1.
Proof. (Proposition 1) From Lemma 2, under the assumptions of Causal Sufficiency, Faithfulness, the Causal Markov Condition, and with a consistent conditional independence test, the first stage of PCMCI estimates the true set of parents, that is, P̂(X^j_t) = P(X^j_t). The MCI test (Eq. (3), Algorithm S2) establishes the absence of a link through

X^i_{t-τ} ⊥⊥ X^j_t | P(X^j_t) \ {X^i_{t-τ}}, P(X^i_{t-τ}).

We need to prove that (1) this independence holds if X^i_{t-τ} ∉ P(X^j_t), and (2) the corresponding dependence holds if X^i_{t-τ} ∈ P(X^j_t) (see Supplementary Fig. S1). In addition to the standard assumptions of causal discovery, we will make use of the basic properties of conditional independence: decomposition, weak union, and contraction, as well as their contrapositions (68).

Ad 1) Suppose X^i_{t-τ} ∉ P(X^j_t). The Causal Markov Condition gives X^j_t ⊥⊥ X^-_t \ P(X^j_t) | P(X^j_t), and since both X^i_{t-τ} and P(X^i_{t-τ}) \ P(X^j_t) are contained in X^-_t \ P(X^j_t), decomposition and weak union yield X^i_{t-τ} ⊥⊥ X^j_t | P(X^j_t), P(X^i_{t-τ}). From Lemma 2 it now follows that the MCI test is conducted with exactly these conditioning sets, which proves the first part.

Ad 2) Suppose X^i_{t-τ} ∈ P(X^j_t), that is, by Eq. (S2), X^i_{t-τ} ⊥̸⊥ X^j_t | X^-_t \ {X^i_{t-τ}}. Abbreviate Z = P(X^j_t) \ {X^i_{t-τ}}, W = P(X^i_{t-τ}), and let R = X^-_t \ ({X^i_{t-τ}} ∪ Z ∪ W) collect the remaining past variables. The contraposition of weak union yields X^j_t ⊥̸⊥ (X^i_{t-τ}, R) | Z, W. Now the contraposition of contraction implies that

X^j_t ⊥̸⊥ X^i_{t-τ} | Z, W   or   X^j_t ⊥̸⊥ R | Z, W, X^i_{t-τ}.

But the latter of these relations cannot hold since we assume the Causal Markov Condition, which implies

X^j_t ⊥⊥ R | Z, W, X^i_{t-τ}.

Hence the dependence X^j_t ⊥̸⊥ X^i_{t-τ} | Z, W must hold, which proves the second part.

S5.3 False positive control
For finite samples, PCMCI will correctly control false positives at the specified significance level provided that (1) the condition-selection stage PC1 detects, in particular, those parents that are necessary to identify conditional independencies, and (2) the MCI test is correctly calibrated. In the following two sections we discuss the effect of condition selection and of autocorrelation as the main factors for calibration.

S5.3.1 Condition-selection stage
As mentioned regarding uniform consistency above, it is difficult to derive theoretical guarantees for the first point, but empirically we found that PC1 detects a superset of the parents with high probability. The detection performance of the PC1 stage is shown for different time series lengths, network sizes, and network densities in Supplementary Figs. S4, S5, S6, S7. Depending on the time series length, even for N = 100 still between 80% and 90% of the true parents are detected (black lines). Note that true positives here refer to the detection rate for all parents, cross-links as well as auto-links. In all these PC1 implementations, the hyperparameter α_PC was chosen among a range of values based on AIC (see discussion in Materials and Methods section "Choice of α_PC"), which empirically yielded a high detection rate. Clearly, too small an α_PC increases the risk of missing a parent. A missing condition in P̂(X^j_t) can potentially lead to a false positive in the MCI stage, but it seems that, empirically, those conditions that were missed are weak drivers and not very likely to induce spurious links. As mentioned, PC1 is tuned toward high power, and for N = 100 more than 80% of the identified conditions are false positives (red line); still, the number of conditions is only a small fraction (grey line) of that used for FullCI.
As for the second point, the calibration of the MCI test depends on the correct specification of the degrees of freedom, in addition to the parametric assumptions of the test. The degrees of freedom are reduced mainly by autocorrelation effects (discussed in the next section), but may also be reduced by the condition-selection stage. In the present implementation we do not take the latter factor into account; we have not found any sign of inflated false positives in our numerical experiments. Note that the PC1 condition-selection does not decide which links are tested in the MCI stage (all are tested), only which conditioning sets are used. However, to ensure a more conservative false positive control in this respect, one could split the dataset and conduct the PC1 condition-selection on a separate part of the dataset from that used for the MCI tests. This would, however, clearly decrease power and carry the problem of choosing the split point in the time series.

S5.3.2 Autocorrelation effects
The consistency proof does not require that the MCI test also conditions on the lagged parents P(X^i_{t-τ}); conditioning on P(X^j_t) suffices. We condition on the parents of the lagged variable because, for finite sample sizes, this helps to account for autocorrelation. Another reason, discussed in Supplementary Sect. S5.5 below, is that the MCI test statistic value can then be interpreted as a notion of causal strength, which allows causal links to be ranked in large-scale studies in a meaningful way. In the following, we provide a mathematical intuition. Conditional independence testing requires access to the null distribution of the test statistic under the null hypothesis of conditional independence. As described in Supplementary Tab. S1, for the conditional independence tests considered in this paper, the null distribution is either analytically given (ParCorr), pre-computed in advance (GPDC), or generated via a local permutation test (CMI). All three methods assume that the data for a particular test X ⊥⊥ Y | Z are independent and identically distributed (iid). Consider the simple two-variable model

X_t = a X_{t-1} + η^X_t,    Y_t = b Y_{t-1} + c X_{t-τ} + η^Y_t,

where η^{X,Y} are iid. For c = 0 we have (unconditional) independence between X and Y. But the Pearson correlation test statistic ρ(X, Y) for this case is not distributed according to a t-distribution with n - 2 degrees of freedom (Supplementary Tab. S1). In fact, due to the autocorrelation between samples for a, b > 0, the unknown true distribution has fewer degrees of freedom and will typically be wider than the assumed null distribution, leading to more false positives. The same holds for the pre-computed distribution for the distance correlation and the permutation-based distribution for mutual information. Now consider the idea of excluding autocorrelation in Y, which is the idea behind bivariate Granger causality and the information-theoretic transfer entropy (TE) (67), using the measure ρ(X_{t-τ}; Y_t | Y_{t-1}).
For the above model with c = 0, this partial correlation can be simplified to ρ(X_{t-τ}; η^Y_t). While η^Y_t is iid, X_{t-τ} is not for a > 0, and a test would still lead to false positives, as analyzed further in ref. (34). The conditioning of the standalone PC algorithm is also based only on the parents of Y and, hence, does not control false positives correctly for large autocorrelation in X (see Fig. 5C). Typical remedies to account for autocorrelation are to adjust the degrees of freedom in some way, to use pre-whitening, or to use block-shuffling. While these approaches help to some extent in the simple bivariate case, they fail in the multivariate case that is relevant for causal discovery (34). Now consider the MCI test for this example, which, in the ParCorr implementation, can be simplified to

ρ^MCI_{X→Y}(τ) = ρ(η^X_{t-τ}; η^Y_t).

Thus, the final Pearson correlation test on the residuals, after regressing out the conditions, only depends on the noise terms, and these are iid. Therefore, the analytical null distribution for n - 2 - 2 degrees of freedom is appropriate here and yields expected false positive rates. A similar reasoning holds for GPDC, where also nonlinear auto-dependencies are regressed out. This case can be generalized to nonlinear additive models as discussed in refs. (39, 55); here we briefly summarize this result in an information-theoretic framework based on the conditional mutual information [Eq. (S10)].
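The argument can be illustrated numerically: for two independent AR(1) processes, a naive correlation test is badly miscalibrated, while correlating the estimated innovations (the MCI idea in this simple case) restores the nominal level. All parameters below are illustrative, not those of the paper's experiments.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)

def ar1(a, T):
    """Generate a simple AR(1) series with iid standard-normal noise."""
    x = np.zeros(T)
    for t in range(1, T):
        x[t] = a * x[t - 1] + rng.standard_normal()
    return x

def resid_ar1(x):
    """OLS innovations of an AR(1) fit (regress x_t on x_{t-1})."""
    a_hat = (x[:-1] @ x[1:]) / (x[:-1] @ x[:-1])
    return x[1:] - a_hat * x[:-1]

T, trials, a, b = 200, 300, 0.9, 0.9
fp_corr = fp_mci = 0
for _ in range(trials):
    x, y = ar1(a, T), ar1(b, T)          # independent processes (c = 0)
    # Naive Pearson test: ignores autocorrelation, too many rejections.
    fp_corr += stats.pearsonr(x, y)[1] < 0.05
    # MCI-style test: the innovations are iid, so the t-test is calibrated.
    fp_mci += stats.pearsonr(resid_ar1(x), resid_ar1(y))[1] < 0.05
fpr_corr, fpr_mci = fp_corr / trials, fp_mci / trials
```
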

Proposition 2. (MCI iid-ness) Assume a model with no link between variables X and Y,

X_t = g_X(P(X_t)) + η^X_t,    Y_t = g_Y(P(Y_t)) + η^Y_t,    (S33)

where g_X and g_Y are arbitrary linear or nonlinear deterministic functions of the parents and the innovations η^X, η^Y are iid. Then the population MCI reduces to a function of the innovation terms only,

I^MCI_{X→Y}(τ) = I(X_{t-τ}; Y_t | P(Y_t), P(X_{t-τ})) = I(η^X_{t-τ}; η^Y_t) = 0.

Proof. Conditioning on P(X_{t-τ}) and P(Y_t) renders the deterministic parts g_X(P(X_{t-τ})) and g_Y(P(Y_t)) fixed, so that only the innovation terms η^X_{t-τ} and η^Y_t remain random, from which the claim follows.
Importantly, the innovation terms η^X_{t-τ}, η^Y_t are iid. The dependence of MCI only on these innovation terms then implies that statistical tests of I^MCI_{X→Y}(τ) = 0 can be conducted under the iid assumption, and the null distribution assumptions discussed above are appropriate, yielding well-calibrated tests. Assumption (S34) is further discussed in ref. (39). Note, however, that this result is derived here for the population version of MCI; its application to empirical estimators should be considered with some caution and relies on consistency and unbiasedness of these estimators, e.g., linear regression in ParCorr and GP regression in GPDC. The consistency properties of GP regression for specific classes of functions have been studied in ref. (69). A full analysis of GPDC would require considering those learning-theoretic guarantees on regression functions and how they impact the properties of the subsequent distance correlation independence test, which is beyond the scope of this work. For CMI no finite-sample consistency results are available (44). Our numerical experiments show that the MCI test largely has the expected rate of false positives even for strongly autocorrelated and nonlinear dependencies. We found this approach to avoiding time dependence in the sample to outperform other remedies such as pre-whitening or block-shuffling (34). The FullCI test is also essentially performed on independent samples since the condition on X^-_t \ {X_{t-τ}} removes any dependence on the past. However, in the GPDC implementation we found inflated false positives (Fig. 5D), which is likely due to the high dimensionality, where the autocorrelations are not properly regressed out.

S5.4 Effect size and FullCI
Now we turn to the dependent case, where there is a causal link and the detection power depends on dimensionality and effect size, assuming the sample size and significance level are fixed. Besides the lower dimensionality of the MCI test compared to FullCI, one can prove that the MCI test statistic generally has a larger or equal effect size compared to FullCI. Let I denote conditional mutual information as a general measure of dependence.

Proposition 3. (MCI is larger than or equal to FullCI) With FullCI defined as a conditional mutual information corresponding to Eq. (S3), it holds that

I^MCI_{X→Y}(τ) ≥ I^FullCI_{X→Y}(τ).
Proof. To simplify notation (see Supplementary Fig. S1) Thus, R denotes the additional conditions of FullCI compared to MCI. Note that these are independent of Y given (P contains all of Y 's parents and by the Markov assumption I (R; Y |P * Y , P X , X) = 0. Now consider the following two possibilities for decomposing a multivariate mutual information using the chain rule FullCI and MCI are equal if the additional conditioning variables R are independent of Y given {P * Y , P X }. If they are not, for example, if variables in R are causally affected by X (i.e., they are causal descendents of X), then conditioning on them will generally lead to a lower effect size. Hence, the variables R are irrelevant in that they are not required as a separating set to establish that X and Y are independent (if this is the case) and, further, increase the risk to reduce effect size. Both the lower dimensionality and higher effect size are responsible for the empirically found higher power of the MCI test compared to FullCI (or Granger causality).

S5.5 Causal strength
MCI's effect size is not only always larger than (or equal to) that of FullCI, but it can also be interpreted as a measure of causal strength. Here, we employ an information-theoretic concept based on a number of desirable properties of dependence measures (39). Consider model (S33) with an added dependency term of Y on X,

Y_t = f(X_{t−τ}) + g_Y(P(Y_t) \ {X_{t−τ}}) + η^Y_t.

We now investigate an information-theoretic definition of causal strength based on conditional mutual information,

I^MCI_{X→Y}(τ) = I(X_{t−τ}; Y_t | P(Y_t) \ {X_{t−τ}}, P(X_{t−τ})).

If we had experimental access to intervene on η^X_{t−τ} at a particular time t − τ, then this notion of causal strength information-theoretically quantifies how much of this momentary perturbation can be detected in Y_t, excluding information contained in the past. This measure directly corresponds to the "momentary" dependence of Y_t on X_{t−τ} that does not come through the parents of X_{t−τ}. There are several information-theoretic proposals for measures of causal strength; see, for example, ref. (57). Our definition of causal strength is based on the fundamental concept of source entropy, as further discussed in ref. (39). MCI for this model is an estimator of causal strength since, similarly to the above proof, the conditioning reduces the dependence to one between the innovation terms. For a linear dependence f(X_{t−τ}) = cX_{t−τ}, MCI can be further simplified to

I^MCI_{X→Y}(τ) = I(η^X_{t−τ}; c η^X_{t−τ} + η^Y_t),

which for partial correlation in the Gaussian case becomes

ρ^MCI_{X→Y}(τ) = c σ_X / √(c² σ²_X + σ²_Y),

where σ²_· now denotes the variances of the noise terms η. Thus, for a linear additive dependency, where causal strength can be attributed to a single coefficient c, MCI depends only on this coefficient and on the noise terms, but not on g_Y, g_X. MCI is then independent of dependencies due to the parents P(X_{t−τ}) and P(Y_t), which may include auto-dependencies. A causal signal can thus be better detected against noise coming from confounding drivers or autocorrelation. This theoretical result is confirmed by the numerical experiments in Fig. 6: PCMCI results in a well-confined scaling of power with link strength.
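The linear-Gaussian case can be checked numerically. The following sketch uses purely illustrative coefficients (a_x, a_y, c are assumptions, not values from the experiments) and estimates the MCI partial correlation by correlating OLS residuals:

```python
import numpy as np

# Hypothetical linear instance of the model:
# X_t = a_x X_{t-1} + eta^X_t,  Y_t = c X_{t-1} + a_y Y_{t-1} + eta^Y_t
rng = np.random.default_rng(42)
T, c, a_x, a_y = 200_000, 0.5, 0.8, 0.8
x, y = np.zeros(T), np.zeros(T)
ex, ey = rng.normal(size=T), rng.normal(size=T)   # sigma_X = sigma_Y = 1
for t in range(1, T):
    x[t] = a_x * x[t - 1] + ex[t]
    y[t] = c * x[t - 1] + a_y * y[t - 1] + ey[t]

def parcorr(u, v, conds):
    """Partial correlation: correlate the OLS residuals of u and v on conds."""
    D = np.column_stack(conds + [np.ones(len(u))])
    ru = u - D @ np.linalg.lstsq(D, u, rcond=None)[0]
    rv = v - D @ np.linalg.lstsq(D, v, rcond=None)[0]
    return np.corrcoef(ru, rv)[0, 1]

# MCI conditions: parents of Y_t without the link (here Y_{t-1}) and
# parents of X_{t-1} (here X_{t-2})
rho_mci = parcorr(x[1:-1], y[2:], [y[1:-1], x[:-2]])
theory = c / np.sqrt(c**2 + 1.0)   # c sigma_X / sqrt(c^2 sigma_X^2 + sigma_Y^2)
print(rho_mci, theory)             # both approx 0.447, independent of a_x, a_y
```

Changing the autocorrelations a_x, a_y leaves rho_mci unchanged, in line with the formula above.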
In sum, larger effect size and lower dimensionality lead to higher detection power of PCMCI compared to FullCI.
Correlation can be very different from the causal effect, that is, from the link coefficient in a linear model. Consider an example in which, next to the causal coefficient c, further coefficients a and b enter the dependence structure. The correlation for the link then depends not only on c (with Γ_· denoting the variances) and may even become zero for suitable a and b. The MCI partial correlation, on the other hand, estimates the causal strength. For nonlinear cases, however, various dependencies can remain, because the function f mixes η^X_{t−τ} with P(X_{t−τ}) (39). As for the consistency proof given above, the results here are derived only for the population version of MCI.
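As an illustration, consider a hypothetical toy model (not necessarily the example referred to in the text): a common driver Z enters X with coefficient a and Y with coefficient b, next to the causal coefficient c. With a = 1, b = −1, c = 0.5 the lagged covariance Cov(X_{t−1}, Y_t) = c Var(X) + ab = 0.5 · 2 − 1 = 0 cancels exactly:

```python
import numpy as np

# Hypothetical toy model:
#   X_t = a Z_{t-1} + eta^X_t,   Y_t = c X_{t-1} + b Z_{t-2} + eta^Y_t
rng = np.random.default_rng(0)
T, a, b, c = 200_000, 1.0, -1.0, 0.5
z, ex, ey = rng.normal(size=(3, T))
x, y = np.zeros(T), np.zeros(T)
for t in range(2, T):
    x[t] = a * z[t - 1] + ex[t]
    y[t] = c * x[t - 1] + b * z[t - 2] + ey[t]

u, v, w = x[2:-1], y[3:], z[1:-2]        # X_{t-1}, Y_t, Z_{t-2} aligned
corr = np.corrcoef(u, v)[0, 1]           # approx 0 despite c = 0.5
# MCI-style partial correlation given Z_{t-2} recovers the causal strength
ru = u - w * (u @ w) / (w @ w)
rv = v - w * (v @ w) / (w @ w)
parcorr = np.corrcoef(ru, rv)[0, 1]      # approx c / sqrt(c^2 + 1) = 0.447
```

The plain correlation is thus uninformative about the causal coefficient, while the partial correlation given the relevant parents is not.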

S6.1 Model setup
To evaluate and compare different causal discovery methods, we use a model that mimics the properties of real-world data. Here we model six of the major challenges of time series from complex systems: high-dimensionality, nonlinearity, strong autocorrelation, time-lagged causal dependencies, observational noise, and nonstationarity. Consider the following model, from which we generate 20 ensemble members per number of variables N, number of links L, and coupling strength c: for i, j ∈ {1, ..., N} we randomly choose L links "i → j" with i ≠ j and generate time series accordingly for j ∈ {1, ..., N}, where (1) the a_j are uniformly randomly drawn from {0, 0.2, 0.4, 0.6, 0.8, 0.9} for one half of the ensemble and from {0.6, 0.8, 0.9, 0.95} for the other, more autocorrelated, half of the 20 network ensemble members, except for the high-density experiments, where the a_j are drawn from a restricted set; (4) τ_i is uniformly randomly drawn from {1, 2}; and (5) c is constant for all links in a model and its absolute value differs among the experiments (see Supplementary Tab. S2), while the sign of c is positive or negative with equal probability. To guarantee stationarity, the functions f_i(x) are all linear in the limit of large x, and we dismiss all models for which the corresponding vector autoregressive model, with the nonlinear functions f_i replaced by linear ones, is nonstationary according to a unit root test (70). Figure 5B shows an example realization. With L = N links in each model, we have an average cross-indegree of 1 for all network sizes (plus an auto-dependency). The cross-link density, on the other hand, decays with N as L/(N(N−1)τ_max) = 1/((N−1)τ_max). We also tested networks with L = 2N links, but note that it is difficult to generate stationary time series for such densely connected processes. The effect of observational noise is evaluated by adding Gaussian noise N(0, σ²) with different standard deviations σ.
Non-stationarity is modeled by adding a sinusoidal dependence a sin(2πt/25) with different amplitudes a. The causality benchmark platform www.causeme.net contains these and further datasets to facilitate method evaluation.
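The generator can be sketched as follows. The autocorrelation set, the lag set {1, 2}, the random coupling signs, and the linearized stationarity check follow the description above, while the concrete nonlinearity f (linear for large |x|) and all remaining details are illustrative assumptions:

```python
import numpy as np

def generate_model(N=5, L=5, c=0.3, T=500, tau_max=2, seed=0):
    """Sketch of the benchmark generator (details beyond the text are assumed)."""
    rng = np.random.default_rng(seed)
    a = rng.choice([0.0, 0.2, 0.4, 0.6, 0.8, 0.9], size=N)   # autocorrelations
    pairs = set()
    while len(pairs) < L:                   # L random cross-links i -> j
        i, j = rng.integers(N), rng.integers(N)
        if i != j:
            pairs.add((int(i), int(j)))
    links = [(i, j, c * rng.choice([-1.0, 1.0]), int(rng.integers(1, 3)))
             for (i, j) in pairs]
    # f is linear for large |x|: the nonlinear part 0.5 x^2 exp(-x^2/20) is bounded
    f = lambda x: (1.0 + 0.5 * x * np.exp(-x**2 / 20.0)) * x
    # Stationarity check on the linearized VAR via its companion matrix
    A = np.zeros((2, N, N))
    A[0][np.arange(N), np.arange(N)] = a
    for i, j, coef, lag in links:
        A[lag - 1, j, i] += coef
    comp = np.block([[A[0], A[1]], [np.eye(N), np.zeros((N, N))]])
    if np.max(np.abs(np.linalg.eigvals(comp))) >= 1.0:
        raise ValueError("nonstationary draw; redraw with another seed")
    data = np.zeros((T, N))
    noise = rng.normal(size=(T, N))
    for t in range(tau_max, T):
        for j in range(N):
            data[t, j] = a[j] * data[t - 1, j] + noise[t, j]
        for i, j, coef, lag in links:
            data[t, j] += coef * f(data[t - lag, i])
    return data, links

data = links = None
for s in range(10):                         # redraw until a stationary model appears
    try:
        data, links = generate_model(seed=s)
        break
    except ValueError:
        continue
```

Rejecting draws with companion-matrix spectral radius ≥ 1 plays the role of the unit root test on the linearized model.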

S6.2 Performance evaluation
As in Fig. 5A, the left and right boxplots in the figures depict the distributions for all weakly autocorrelated pairs, with mean autocorrelation (ρ(X_{t−1}, X_t) + ρ(Y_{t−1}, Y_t))/2 < 0.7 among the two variables X and Y of a link, and for strongly autocorrelated pairs, with (ρ(X_{t−1}, X_t) + ρ(Y_{t−1}, Y_t))/2 ≥ 0.7, respectively. The boxes show the 25th to 75th and the whiskers the 1st to 99th percentile range; the median is marked by a bar and the mean by an 'x'. Note the logarithmic y-axis in the bottom panels for false positive rates > 0.1. The tick labels at the top of the figures give the average runtime and its standard deviation across the different model setups. The runtime estimates were evaluated on Intel Xeon E5-2667 v3 8C processors with 3.2 GHz; these runtimes will depend on the implementation. An overview of runtimes for the experiments is shown in Supplementary Fig. S16. In Supplementary Tab. S2 we list the model setups for the numerical experiments, and Supplementary Tab. S3 gives details on the compared methods. The experiments were evaluated on a high-performance cluster.

S6.3 AN(C)OVA analysis
In Supplementary Fig. S17 and Tabs. S4-S12 we show the results of an analysis of variance (ANOVA) and covariance (ANCOVA) of the numerical experiments. Supplementary Tab. S2 lists the ANOVA results tables corresponding to the different experiments. We use the statsmodels package and estimate the ANOVA model R ∼ C(W), where R is the true positive rate, the false positive rate, or the runtime, W = N, T, σ, a in the different experiment analyses, and C indicates that we treat W as a categorical variable. Every ANOVA analysis yields coefficients for the change of the mean of the response variable, e.g., detection power, for every level of the independent variable, e.g., N. With N = 2 as a reference level (intercept), this results in M coefficients ∆_i for the change of detection power for N_i = 5, 10, 20, 40, 60, 80, 100 with respect to N = 2. To summarize these results, in Supplementary Tabs. S4-S8 we depict R̄ ± ∆̄, where R̄ is the value at the reference level and ∆̄ gives the average relative change of R per one unit increase in W (or rescaled for better readability). We mark ∆̄ in bold if at least one coefficient ∆_i is significantly different from zero at the 1% significance level. Furthermore, we study the ANCOVA model R ∼ C(N) + C(T) + C(N) ∗ C(T) on the experiment named ANCOVA in Supplementary Tab. S2 to also study interaction effects between N and T, that is, how the dependence of R on N changes for different values of T, and vice versa. In Supplementary Tabs. S9-S12 we show all corresponding coefficients and further statistics.
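Conceptually, the model R ∼ C(W) fits one mean of R per level of W. A dependency-free sketch of the quantities involved (the F statistic and the per-level offsets ∆_i relative to the reference level), on made-up toy data rather than the experiment results:

```python
import numpy as np

def one_way_anova(R, W):
    """One-way ANOVA of response R on categorical factor W (the model R ~ C(W)).
    Returns the F statistic and per-level offsets from the first (reference)
    level, analogous to the coefficients Delta_i reported by statsmodels."""
    R, W = np.asarray(R, float), np.asarray(W)
    levels = sorted(set(W.tolist()))
    groups = [R[W == lv] for lv in levels]
    grand = R.mean()
    ss_between = sum(len(g) * (g.mean() - grand) ** 2 for g in groups)
    ss_within = sum(((g - g.mean()) ** 2).sum() for g in groups)
    F = (ss_between / (len(groups) - 1)) / (ss_within / (len(R) - len(groups)))
    ref = groups[0].mean()
    return F, {lv: g.mean() - ref for lv, g in zip(levels, groups)}

# Toy data: a detection power R that decreases with N (values are made up)
rng = np.random.default_rng(3)
N = np.repeat([2, 5, 10], 100)
R = 0.9 - 0.02 * N + 0.01 * rng.normal(size=N.size)
F, deltas = one_way_anova(R, N)
```

The offset for the reference level is zero by construction; the remaining offsets are the ∆_i.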

S6.4 Chaotic coupled logistic maps
In Supplementary Fig. S18 we depict a comparison of PCMCI in the CMI implementation with the nonlinear state-space method convergent cross-mapping (CCM) (14), see Supplementary Sect. S2.4. As in ref. (34), we consider three coupled logistic maps with optionally added dynamical noise and here additionally investigate the effect of strong common driver coupling with uniformly distributed independent noise η and r = 4 leading to chaotic dynamics. Here the coupling coefficient c controls the degree of common driving and σ the amount of dynamical noise in the system. To evaluate true and false positive rates, 200 realizations with time series length T = 150, 300 were generated. All methods were evaluated at a significance level of 5%. PCMCI was estimated with α PC = 0.1, τ min = 0, τ max = 2 and CMIknn parameters k CM I = 0.1n, k perm = 5 and B = 500 permutation surrogates (44), where n is the sample size.
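A sketch of such a system is below; the convex-combination coupling form is an illustrative stand-in, not necessarily the exact scheme of ref. (34):

```python
import numpy as np

def coupled_logistic_maps(T=150, c=0.5, sigma=0.0, r=4.0, seed=0):
    """Three logistic maps in which X^0 acts as a common driver of X^1 and X^2.
    The coupling form here is an assumption for illustration only."""
    rng = np.random.default_rng(seed)
    x = np.empty((T, 3))
    x[0] = rng.uniform(0.1, 0.9, size=3)
    logistic = lambda v: r * v * (1.0 - v)
    for t in range(1, T):
        drive = logistic(x[t - 1, 0])
        for j in range(3):
            val = logistic(x[t - 1, j])
            if j > 0:                        # common-driver coupling into X^1, X^2
                val = (1.0 - c) * val + c * drive
            # uniform dynamical noise, wrapped back into [0, 1)
            x[t, j] = (val + sigma * rng.uniform()) % 1.0
    return x

x = coupled_logistic_maps(T=300, c=0.8, sigma=0.05)
```

With r = 4 the individual maps are chaotic, and increasing c strengthens the common driving that makes the X^1, X^2 pair a difficult false-positive test case.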

S7 Supplementary Algorithms
Algorithm S1. Pseudo-code for the condition-selection algorithm with a maximum number of condition combinations q_max to estimate the parents of X^j_t. We use this algorithm as a pre-selection stage in PCMCI with p_max = Nτ_max (i.e., no restriction on the maximum number of parents) and q_max = 1. For the standalone PC-stable algorithm, we set q_max to a large value of 10.
Require: Time series dataset X = (X^1, X^2, ..., X^N), selected variable X^j, maximum time lag τ_max, significance threshold α_PC, maximum condition dimension p_max (default p_max = Nτ_max), maximum number of combinations q_max (default q_max = 1), conditional independence test function CI
1: function CI(X, Y, Z)
2:   Test X ⊥⊥ Y | Z using test statistic measure I
3:   return p-value, test statistic value I
4: Initialize preliminary set of parents P(X^j_t) = {X^i_{t−τ} : i ∈ {1, ..., N}, τ ∈ {1, ..., τ_max}}
5: Initialize minimum test statistic values I_min(X^i_{t−τ} → X^j_t) = ∞ for all X^i_{t−τ} ∈ P(X^j_t)
6: for p = 0, 1, 2, ..., p_max do
7:   if |P(X^j_t)| − 1 < p then
8:     Break for-loop
9:   for all X^i_{t−τ} in P(X^j_t) do
10:    q = 0
11:    for all subsets S ⊆ P(X^j_t) \ {X^i_{t−τ}} with |S| = p, chosen in order of decreasing I_min, do
12:      q ← q + 1
13:      if q > q_max then
14:        Break from inner for-loop
15:      (p-value, I) ← CI(X^i_{t−τ}, X^j_t, S)
16:      I_min(X^i_{t−τ} → X^j_t) ← min(|I|, I_min(X^i_{t−τ} → X^j_t))
17:      if p-value > α_PC then
18:        Mark X^i_{t−τ} for removal from P(X^j_t)
19:        Break from inner for-loop
20:  Remove non-significant (marked) parents from P(X^j_t)
21:  Sort parents in P(X^j_t) by I_min(X^i_{t−τ} → X^j_t) from largest to smallest
22: return P(X^j_t)
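A compact Python sketch of this condition-selection loop follows, with a ParCorr-style CI test using Fisher-z p-values. The tigramite package is the reference implementation; details such as the exact subset ordering are simplified here:

```python
import numpy as np
from itertools import combinations
from math import erfc, sqrt, atanh

def parcorr_test(x, y, Z):
    """ParCorr CI test: correlation of OLS residuals with a Fisher-z p-value."""
    n = len(x)
    if Z:
        D = np.column_stack(Z + [np.ones(n)])
        x = x - D @ np.linalg.lstsq(D, x, rcond=None)[0]
        y = y - D @ np.linalg.lstsq(D, y, rcond=None)[0]
    r = float(np.clip(np.corrcoef(x, y)[0, 1], -0.9999, 0.9999))
    z = abs(atanh(r)) * sqrt(max(n - len(Z) - 3, 1))
    return erfc(z / sqrt(2)), r

def pc1_parents(data, j, tau_max=2, alpha_pc=0.2, q_max=1, p_max=None):
    """Condition-selection loop for target variable j; candidate parents are
    (i, tau) pairs and at most q_max condition subsets are tested per parent."""
    T, N = data.shape
    p_max = N * tau_max if p_max is None else p_max

    def series(i, tau):                      # X^i_{t-tau} aligned with X^j_t
        return data[tau_max - tau: T - tau, i]

    target = data[tau_max:, j]
    parents = [(i, tau) for tau in range(1, tau_max + 1) for i in range(N)]
    imin = {par: np.inf for par in parents}
    for p in range(p_max + 1):
        if len(parents) - 1 < p:             # not enough parents left for p conditions
            break
        remove = []
        for par in parents:
            others = [o for o in parents if o != par]
            for q, S in enumerate(combinations(others, p)):
                if q >= q_max:               # test at most q_max combinations
                    break
                pval, stat = parcorr_test(series(*par), target,
                                          [series(*s) for s in S])
                imin[par] = min(imin[par], abs(stat))
                if pval > alpha_pc:          # found independence: remove parent
                    remove.append(par)
                    break
        parents = [par for par in parents if par not in remove]
        parents.sort(key=lambda par: imin[par], reverse=True)
    return parents

# Toy example: an autocorrelated X^0 drives X^1 at lag 1
rng = np.random.default_rng(7)
T = 2000
e = rng.normal(size=(T, 2))
data = np.zeros((T, 2))
for t in range(1, T):
    data[t, 0] = 0.7 * data[t - 1, 0] + e[t, 0]
    data[t, 1] = 0.8 * data[t - 1, 0] + e[t, 1]
parents = pc1_parents(data, j=1)
```

Because the parents are kept sorted by I_min, the subsets enumerated first contain the currently strongest parents, approximating the "strongest conditions first" ordering of the pseudo-code.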

Algorithm S2. Pseudo-code for the MCI causal discovery stage. If run with minimum time lag τ_min = 0, causal links for τ = 0 correspond to contemporaneous links, which are left undirected here.

Require: Time series dataset X = (X^1, X^2, ..., X^N), sorted parents P(X^j_t) for all variables X^j estimated with Algorithm S1, maximum time lag τ_max, maximum number p_X of parents of variable X^i, and conditional independence test function CI
1: for all pairs (X^i_{t−τ}, X^j_t) with i, j ∈ {1, ..., N} and τ ∈ {τ_min, ..., τ_max} do
2:   Define P_{p_X}(X^i_{t−τ}) as the first p_X parents from P(X^i_t), shifted by τ
3:   Run MCI test CI(X^i_{t−τ}, X^j_t, Z = P(X^j_t) \ {X^i_{t−τ}} ∪ P_{p_X}(X^i_{t−τ})) to obtain (p-value, test statistic value I)
4: return all p-values and test statistic values
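The MCI stage can be sketched as follows, again with a ParCorr-style test; the `parents` argument plays the role of the sorted parent sets from Algorithm S1 (the helper functions are redefined here so the sketch is self-contained):

```python
import numpy as np
from math import erfc, sqrt, atanh

def parcorr(x, y, Z):
    """Partial correlation via OLS residuals, with a Fisher-z p-value."""
    n = len(x)
    if Z:
        D = np.column_stack(Z + [np.ones(n)])
        x = x - D @ np.linalg.lstsq(D, x, rcond=None)[0]
        y = y - D @ np.linalg.lstsq(D, y, rcond=None)[0]
    r = float(np.clip(np.corrcoef(x, y)[0, 1], -0.9999, 0.9999))
    z = abs(atanh(r)) * sqrt(max(n - len(Z) - 3, 1))
    return r, erfc(z / sqrt(2))

def mci_stage(data, parents, tau_max, p_x=3):
    """For each candidate link X^i_{t-tau} -> X^j_t, condition on the parents
    of X^j_t (minus the link) plus the first p_x parents of X^i_t shifted by
    tau; parents[j] is the sorted list of (i, tau) pairs from Algorithm S1."""
    T, N = data.shape
    off = 2 * tau_max                        # cut-off so all shifted lags exist

    def series(i, tau):                      # X^i_{t-tau} aligned with X^j_t
        return data[off - tau: T - tau, i]

    results = {}
    for j in range(N):
        for i in range(N):
            for tau in range(1, tau_max + 1):
                cond = [series(*p) for p in parents[j] if p != (i, tau)]
                cond += [series(pi, ptau + tau) for pi, ptau in parents[i][:p_x]]
                results[(i, tau, j)] = parcorr(series(i, tau), data[off:, j], cond)
    return results

# Toy example with the true link X^0 -> X^1 at lag 1
rng = np.random.default_rng(7)
T = 2000
e = rng.normal(size=(T, 2))
data = np.zeros((T, 2))
for t in range(1, T):
    data[t, 0] = 0.7 * data[t - 1, 0] + e[t, 0]
    data[t, 1] = 0.8 * data[t - 1, 0] + e[t, 1]
res = mci_stage(data, {0: [(0, 1)], 1: [(0, 1)]}, tau_max=2)
```

The cut-off of 2τ_max samples mirrors the effective sample size n = T − 2τ_max noted below for the conditional independence tests.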
Algorithm S3. Pseudo-code for adaptive Lasso regression for inference of non-zero coefficients and p-values for the linear model Y = Xβ. The Lasso is implemented with python's sklearn package LassoCV with default parameters (except fit_intercept=False) and with λ_n chosen by cross-validation using TimeSeriesSplit(n_splits=5). In the numerical experiments we used k_max = 5. The first part of the implementation follows gist.github.com/agramfort/1610922.

1: Initialize weights w_j = 1 for j = 1, ..., d
2: for k = 1, ..., k_max do
3:   Rescale predictors X*_j = X_j / w_j for j = 1, ..., d
4:   Solve the Lasso problem for Y = X*β with λ_n chosen by time-series-based cross-validation to obtain β*
5:   Re-weight coefficients β**_j = β*_j / w_j for j = 1, ..., d
6:   Compute new weights w_j = 1 / (2 |β**_j|^{1/2})
7: return β**

Tables

Table S1. Overview of conditional independence tests to test whether X and Y are independent conditionally on Z. The tests are discussed in Supplementary Sect. S4. All tests assume continuously-valued data. The Gaussian process (GP) was fitted with sklearn's GaussianProcessRegressor with kernel=RBF()+WhiteKernel() and alpha=0. The bandwidth of the kernel in sklearn is estimated by maximizing the marginal likelihood (ML-II). D_Z is the cardinality of Z.
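The reweighting loop of Algorithm S3 can be sketched as follows. Unlike the actual implementation, this sketch fixes λ instead of cross-validating it and replaces sklearn's solver with a simple ISTA iteration:

```python
import numpy as np

def lasso_ista(X, y, lam, n_iter=1000):
    """Plain Lasso via ISTA: min (1/2n)||y - Xb||^2 + lam * ||b||_1."""
    n, d = X.shape
    L = np.linalg.norm(X, 2) ** 2 / n        # Lipschitz constant of the gradient
    b = np.zeros(d)
    for _ in range(n_iter):
        b = b - X.T @ (X @ b - y) / (n * L)  # gradient step
        b = np.sign(b) * np.maximum(np.abs(b) - lam / L, 0.0)  # soft threshold
    return b

def adaptive_lasso(X, y, lam=0.1, k_max=5, eps=1e-6):
    """Adaptive Lasso reweighting loop as in Algorithm S3, with a fixed lam
    (the actual implementation cross-validates lambda via TimeSeriesSplit)."""
    w = np.ones(X.shape[1])
    for _ in range(k_max):
        b_star = lasso_ista(X / w, y, lam)   # Lasso on rescaled predictors
        b = b_star / w                        # re-weighted coefficients
        w = 1.0 / (2.0 * np.sqrt(np.abs(b)) + eps)
    return b

# Toy regression: only columns 0 and 2 are truly active
rng = np.random.default_rng(5)
X = rng.normal(size=(200, 5))
y = 2.0 * X[:, 0] - 1.5 * X[:, 2] + 0.1 * rng.normal(size=200)
b = adaptive_lasso(X, y)
```

The reweighting drives truly zero coefficients to exact zero while shrinking the active coefficients much less than a single Lasso pass would.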
Note that the conditional independence tests in the present context are conducted with a sample size of n = T − 2τ_max, where T is the time series length and the first 2τ_max samples are cut off, since 2τ_max is the maximum time lag of a lagged parent in P(X^i_{t−τ}). Further, we might need to account for missing and masked samples.

[Table S1, recoverable cells] Estimation: ParCorr: residuals from an OLS fit; GPDC: null distribution pre-computed for each sample size n from dCor(u, v) with u, v ∼ U(0, 1) (valid due to the copula transform); CMIknn: local permutation test (44).
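The pre-computed GPDC null distribution can be illustrated as follows. In the actual test, dCor is applied to the copula-transformed GP residuals; here plain independent uniforms stand in, which is exactly what makes the null pre-computable per sample size n:

```python
import numpy as np

def dcor(u, v):
    """Sample distance correlation (V-statistic form) for 1-D arrays."""
    def dc(a):
        D = np.abs(a[:, None] - a[None, :])
        return D - D.mean(0) - D.mean(1)[:, None] + D.mean()
    A, B = dc(u), dc(v)
    return np.sqrt((A * B).mean() / np.sqrt((A * A).mean() * (B * B).mean()))

# Null distribution for sample size n, pre-computed once from independent
# uniforms (valid for GPDC due to the copula transform of the residuals)
rng = np.random.default_rng(2)
n, n_null = 100, 200
null = np.sort([dcor(rng.uniform(size=n), rng.uniform(size=n))
                for _ in range(n_null)])
crit = null[int(0.95 * n_null)]              # 5% critical value

# A nonlinear, nearly uncorrelated dependence is still detected by dCor
u = rng.uniform(size=n)
v = np.abs(u - 0.5) + 0.02 * rng.normal(size=n)
```

Since the null depends only on n, tabulating it once avoids permutation tests at runtime.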

Summarized ANOVA results for high-dimensionality ParCorr experiments.

(A) We investigate the relationship between the monthly climate index Nino and land temperature anomalies over Northwestern Canada, mostly British Columbia (BCT, hatched region). Nino is defined as the average sea surface temperature anomaly (HadISST dataset (27)) over the red Nino3.4 region (5° North to 5° South and 170°-120° West). BCT is defined as the land surface temperature (CRUTEM4 dataset (28)) over British Columbia and parts of Yukon and the Northwest Territories, Canada (50°-65° North and 120°-140° West). The grid location 62.5° North, 132.5° West was excluded since more than 1% of its samples were missing. The anomalized time series have the seasonal cycle removed. We constrain our analysis to the period with reliable satellite data (1979-2017), with a length of T = 468 months. To remove any long-term temperature trend, a Gaussian kernel smoothing mean with a bandwidth of σ = 120 months was subtracted from the raw time series. (B) Time series of Nino and BCT. (C) Matrix of lag functions between Nino and BCT for correlation (Corr), PCMCI, and FullCI (conditional on the whole past of both time series up to τ_max = 6). Note that for autocorrelations (on the diagonal) the zero lag is not drawn. (D) Matrix of p-values. The black line denotes the 5% significance level. Note that the sample size here is n = T − 2τ_max = 456, and that in Fig. 2 we only depict links detected in at least 20% of the realizations.
Fig. S16. Runtimes for numerical experiments.

Fig. S17. ANCOVA interaction plots. In the previous figures we show not only averages but also boxplots; these are omitted here to better depict the dependencies for different time series lengths T. An interaction between N and T is present if the difference between, for example, the detection power for T = 300 and T = 150 changes for different N; here this is the case for FullCI and PCMCI, but not for Lasso and PC.