Sweepstake reproductive success and collective dispersal produce chaotic genetic patchiness in a broadcast spawner

Description


Bioinformatic analysis
Already demultiplexed and cleaned sequence reads were obtained from BGI and further quality checked using FastQC (http://www.bioinformatics.babraham.ac.uk/projects/fastqc/). All of the samples were retained for subsequent analysis as they contained similar numbers of reads (mean = 4,769,382, sd = 3,149,063, range: 720,804-19,432,533). The sequence reads were de novo assembled into RAD loci using Stacks 2.52 (68). Values for the two main parameters -M and -n were chosen following the optimization procedure described by Rochette & Catchen (69). Briefly, as many of the populations in our dataset are likely to be closely related, we set -M = -n (84) and evaluated the performance of values ranging from one to nine in a randomly selected subset of 12 samples. The combination of these parameters for which the number of polymorphic loci present in at least 80% of individuals reached a plateau was defined as optimal. The optimal combination (-M = 6 and -n = 6) was then selected for analyzing the entire dataset. The resulting catalog of de novo assembled RAD loci was filtered to retain only RAD loci that were found in at least 80% of the samples, in order to avoid the risk of incorporating RAD loci erroneously assembled in a subset of samples. This threshold was based on visual inspection of the distribution of loci coverage across samples. PCR duplicates were then identified and removed by specifying the --rm-pcr-duplicates option within the gstacks module. The resulting raw SNP genotypes were then filtered to retain only biallelic SNPs with genotype quality and depth of coverage greater than five using VCFtools (70) as well as to retain only SNPs genotyped in at least 90% of individuals. Subsequently, we discarded all SNPs with a depth of coverage greater than twice the mean depth of the raw dataset (> 27) in order to filter out potentially paralogous loci. Next, all individuals with more than 30% missing data were removed and only variants with MAF greater than 0.01 were retained. Finally, the software PLINK version 1.9 (71) was used to filter out SNPs showing significant departures from Hardy-Weinberg equilibrium (HWE) with an alpha level of 0.05 after having implemented mid-p adjustment (85) and pruned out putatively linked loci using an r 2 threshold of 0.5. The depth of coverage of the SNPs retained in the final dataset was quantified using VCFtools in order to ensure that no significant differences were present among the three sequencing libraries.

Outlier analysis
Temporally replicated genome scans for outlier loci were performed on the cohorts sampled within Ryder Bay in 1999 and 2015 using the software BayPass (47). This program implements a Bayesian hierarchical model approach to identify loci that are potentially under divergent selection while accounting for evolutionary relationships among the samples. To do this, BayPass estimates a covariance matrix of allele frequencies across populations, which is informative about population history and provides a comprehensive description of population structure. We ran BayPass under the core model with default parameters and specifying five different populations, corresponding to the five different sampling locations. To evaluate genome-wide statistical significance, we used a Bonferroni adjusted -log10 (p-value) threshold of 6.34. To check whether we might have inadvertently removed loci under selection from our dataset by filtering out SNPs showing significant deviations from HWE, we repeated this analysis using a dataset that was not filtered for HWE. This dataset consisted of 112,246 SNPs and the corresponding Bonferroni adjusted -log10 (p-value) threshold to determine significance was then set to 6.35.

Simulations
Forward genetic simulations were implemented in SLiM (31) to investigate the specific conditions under which our empirical results for the Rose Garden 1999 cohort could be replicated. In each simulation, individuals were modelled as diploid organisms with a 100 kb genome size, with neutral mutations occurring at a rate of 1 x 10 -5 and the recombination rate set to the default value of 1 x 10 -8 . These parameter settings were chosen to allow the simulations to be completed within a reasonable amount of time, while also allowing for a common baseline level of nucleotide diversity to establish in the simulated populations. All of the simulations were initially run for 200 generations during which all populations were connected by random migration.
Our simulations were built using a non Wright-Fisher modeling approach, which allowed us to replicate the reproductive behavior of N. concinna. Specifically, this species aggregates in "stacks" of up to 20 individuals prior to spawning, which is believed to maximize egg fertilization rates by enhancing spawning synchrony and the proximity of gametes (51,86,87). Consequently, at the start of each generation, we divided individuals into equal sized groups representing stacks, and only individuals within the same stack were allowed to reproduce. The sex ratio within each stack was set to 1:1 to reflect empirical estimates of the adult sex ratio for N. concinna at King George Island on the Antarctic Peninsula (88). Opposite sex individuals within a stack were mated at random and could generate offspring with multiple partners within the same stack. Reproduction continued until the carrying capacity (K) of the population was reached. We modeled high variance in reproductive success according to a Weibull distribution, with scale and shape parameters of λ = 0.4 and k = 0.55 for non-sweepstake reproduction and λ = 0.4 and k = 0.22 for sweepstake reproduction. To reflect the sampling design of our empirical study, overlapping generations were modeled.
The first set of simulations attempted to evaluate the conditions that emulate the empirical kinship structure of the Rose Garden sampling cohort in 1999. We simulated a single population in which we forced a strong sweepstake event to occur by allowing only a limited number of females (F) located within a single stack to reproduce. We evaluated the effects of varying the number of reproducing females within the stack (F), the size of the stack (St), and K on the kinship structure of the resulting offspring cohort, which was subsampled to ten individuals to reflect our empirical sample size. Specifically, we tested values of F between 1 and 5, values of St equal to 5, 10 or 20, and values of K equal to 500, 1000 or 2000, and ran 100 simulation for each possible combination of parameters values. We then treated a simulation as having successfully replicated our empirical results when the offspring sample contained only full and halfsiblings. For comparison, simulations involving multiple reproducing females (i.e. 2 ≤ F ≤ 5) were re-run while allowing reproducing females to be located on different stacks.
The second set of simulations aimed at exploring the importance of collective dispersal for the generation of CGP and for the detection of sweepstake signatures. We simulated a total of five populations which included a 'source' and a 'sink' population. Reproduction at the source followed the sweepstake model described above, while using the parameter settings F = 1, K = 1000 and St = 10, which were found to be optimal in the first set of simulations. In the other four populations, we implemented non-sweepstake reproduction, with multiple stacks producing offspring and multiple females breeding within stacks. We then allowed varying proportions of the offspring generated at the source population to collectively disperse (CD) to the sink population, while all other migration among populations occurred following a unidimensional stepping stone model until carrying capacities were reached. We evaluated 12 values of CD (0, 10, 20, 30, 40, 50, 60, 70, 80, 90, 95 and 100) and ran 100 simulations for each value. Again, the resulting cohorts at the sink population were subsampled to ten individuals and we regarded a simulation as having successfully replicated our empirical results when the sampled offspring contained only full and half-siblings.
Finally, we selected the combination of parameters that provided the best fit to the empirical kinship structure of Rose Garden in 1999 (F = 1, K = 1000, St = 10 and CD = 100) and ran the simulations for three further generations without enforcing any additional sweepstake events. The aim of this analysis was to understand how the occurrence of a sweepstake event followed by collective dispersal might affect the kinship structure and genetic diversity of adjacent populations in subsequent generations. At each timepoint, the simulated cohorts were again subsampled to ten individuals, which were used to calculate nucleotide diversity and kinship. A total of 100 simulations were run.  Results of exploratory simulations of sweepstake events in N. concinna (see Supplementary methods for details). At the start of each generation, we randomly built stacks of up to 20 individuals with a 1:1 sex ratio. We then selected a focal stack at random and allowed individuals within that stack to randomly mate until the carrying capacity (K) of the population was reached. We explored the effects of varying the number of reproducing females within the stack (F), the size of the stack (St), and K on the kinship structure of the resulting offspring cohort, which was subsampled to ten individuals to reflect our empirical sample size. A simulation was regarded as having successfully reproduced our empirical results for Rose Garden 1999 when the resulting offspring sample contained only full and half-siblings.  Results of exploratory simulations in which females from more than one stack were allowed to reproduce (see Supplementary methods for details). At the start of each generation, we randomly built stacks of up to 20 individuals with a 1:1 sex ratio. We then allowed between two and five females, each from a different stack, to mate at random within their stacks until the carrying capacity (K) of the population was reached. We explored the effects of varying the number of reproducing females (F), the size of the stack (St), and K on the kinship structure of the resulting offspring cohort, which was subsampled to ten individuals to reflect our empirical sample size. A simulation was regarded as having successfully reproduced our empirical results for Rose Garden 1999 when the resulting offspring sample contained only full and half-siblings.   S4. Simulated nucleotide diversity and kinship structure in the three generations following the sweepstake event (see Supplementary methods for details). We simulated a total of five populations including a 'source' (So) and a 'sink' (Si) population and allowed varying proportions of the offspring generated at the source population to collectively disperse to the sink (CD), while migration among the other populations (labelled 1, 2 and 3) followed a stepping stone model (see Supplementary methods for details). Simulations then proceeded for three further generations where no further sweepstake events were enforced and migration among all five populations followed a stepping stone model. The boxplot shows variation in nucleotide diversity across cohorts sampled from the five simulated populations at a) one generation after the original sweepstake event; b) two generations afterwards; and c) three generations afterwards. Nucleotide diversity is expressed as the proportion of the maximum value observed within each simulation, with the sink population highlighted in purple. The barplots show proportions of unrelated individuals (orange), half-siblings (green) and full-siblings (purple) in cohorts sampled from the five simulated populations. Table S1.
Pooling strategy during the library preparation and related summary statistics. The sampling cohorts were assigned at random to one of three sequencing libraries as shown in the table. To provide an indication of the quality of the sequencing data obtained from each library, we calculated the mean (± standard deviation) per-base phred-scaled quality scores and the mean (± standard deviation) depth of coverage of the SNPs retained for analysis.  Table S5.

Library ID
Cohort-specific effective population size. Effective population size estimates were calculated for each cohort using the linkage disequilibrium (NeLD) and the molecular coancestry (NeC) methods implemented in Neestimator (50). 95% confidence intervals are reported in parentheses.  Table S6.
Goodness of fit between observed and expected allele frequency spectra for each cohort using the l2 metric. l2 (Xi-Beta) refers to distances to allele frequency spectra expected under the Xi-Beta (2-α, α) coalescent model, while l2 (Kingman) refers to distances to allele frequency spectra expected under the Kingman coalescent model.