Evidence for a single, ancient origin of a genus-wide alternative life history strategy

Understanding the evolutionary origins and factors maintaining alternative life history strategies (ALHS) within species is a major goal of evolutionary research. While alternative alleles causing discrete ALHS are expected to purge or fix over time, one-third of the ~90 species of Colias butterflies are polymorphic for a female-limited ALHS called Alba. Whether Alba arose once, evolved in parallel, or has been exchanged among taxa is currently unknown. Using comparative genome-wide association study (GWAS) and population genomic analyses, we placed the genetic basis of Alba in time-calibrated phylogenomic framework, revealing that Alba evolved once near the base of the genus and has been subsequently maintained via introgression and balancing selection. CRISPR-Cas9 mutagenesis was then used to verify a putative cis-regulatory region of Alba, which we identified using phylogenetic foot printing. We hypothesize that this cis-regulatory region acts as a modular enhancer for the induction of the Alba ALHS, which has likely facilitated its long evolutionary persistence.

After these splits and joins, the JoinSingles2All and OrderMarkers2 were rerun to obtain final (de novo) linkage maps with 31 linkage groups. The splits and joins were necessary due to the complex family structure (multiple families) and low marker density.
With the help of the linkage map, the scaffolds (all except Sc0000116) were manually put together into these 31 linkage groups. Linkage groups were named based on chromosome numbers in Melitea cinxia (92). The linkage map was re-evaluated in this physical order with OrderMarkers2 (recombination2=0 chromosome=1..31 evaluateOrder=phys_order improveOrder=0 hyperPhaser=1 phasingIterations=3), put into grandparental phase (phasematch.awk) and this map was used for QTL mapping.

C. eurytheme genome polishing, quality control, and annotation
Polishing of the genome was performed using Pilon v1.2.2 (57), using data from a single orange female, with an Illumina TruSeq Nano library prep and sequenced (150 bp PE reads with a 350bp insert size, Illumina HiSeqX) to provide ~30X genome coverage, aligned using NextGenMap v0.5.2. The assembly quality was assessed using basic length metrics and BUSCO v1.1b1 before and after polishing to evaluate the difference, using the insecta_odb10 dataset. The genome was softmasked for repetitive regions using RED v.05/22/2015 (61).
Genome annotation was generated using BRAKER2 (v2.1) trained on data from C. eurytheme transcriptome and proteins. The transcriptome was assembled using RNA-seq data from several developmental life stages that was generated in a previous study (63). We used a reference protein dataset from the Arthropoda section of OrthoDB (v10). Transcriptome reads were aligned using HISAT2 v.2.1.0 (64), against the unmasked genome, and the alignment was then filtered, sorted, and indexed using SAMTOOLS v.1.7. Braker2 was run using the ETP mode and set to take softmasking into account. Annotation of the resulting chromosome level resulted in 18077 transcripts from 16842 genes.
Beta statistics were calculated using Betascan2 (43), the previous VCF file, and default settings used for filtering the VCF file. We only analyzed scaffold 2, focusing the analysis to the region surrounding the Alba locus. Since scaffold 2 was 11 Mbp in length, with Alba located at 6.5Mb, this provided sufficient flanking region for a representative global analysis.
We called structural variants in all C. eurytheme samples used for the GWAS using the default pipeline in DELLY (84). Variants were first called in each sample individually using Delly call. Variants from each sample were then merged and genotyped jointly using delly merge, delly call and bcftools merge. The final output was first filtered using the germline filter provided by DELLY, and secondly by including only sites in which no samples had a lowqual call (not enough read support to genotype).

Phylogenetic analyses
The longest exon per gene dataset was generated by running BUSCO as above, but upon the protein dataset from our annotation of the C. eurytheme genome, with the protein dataset generated using our GFF annotation and the genome as inputs for the gffread script from cufflinks v.2.2.1 (93). Of the total lepidopteran BUSCO genes searched (n= 5286), 4476 were complete and single copy in our protein dataset for C. eurytheme. Among the BUSCO outputs is a table where for each annotated protein identified, its BUSCO status is indicated (e.g., as single, complete, and duplicated, etc.). Using this table, the exons of the complete and single-copy proteins in our annotation were extracted and converted to a bed file. Then the length of each exon was calculated, allowing for the longest exon per protein ID to be selected using custom scripts, and the resulting bed file of these longest exons was then used as input for the bam2fasta script from the package bambam v1.4 tool-kit (71), along with all the bam alignment files and a minimum depth requirement of 5 reads per base pair. The resulting set of exons had a wide range of sizes (min,max exon lengths: 94,11667 bp). The resulting set of fasta-files (busco_exons) was then used to generate gene trees using iQtree, with each gene tree using extended model selection, a random starting tree, 1000 ultrafast bootstraps and optimization, and Z. cesonia set as an outgroup (-m MFP -t RANDOM -bb 1000 -alrt 1000 -bnni -o Z_cesonia). A total of 4244 gene trees were generated. These were then used as input for species tree estimates by Astral using default settings. Gene trees were also grouped by chromosome using the genome annotation, which allowed for species tree for each chromosome to be estimated.
We also generated a filtered busco_exon dataset, using AMAS V.1.0 (94) to remove Z. cesonia from all fasta-files, and then generate a summary table of all files, which was then parsed to produce a set of files filtered to remove those with missing content > 1 %, < 5 % variable sites), and length < 300 bp and > 2000 bp. With this filtered set of fasta file IDs (n=1400), these iQtree gene trees were then selected for species tree estimation to assess whether dataset quality affected species tree topology. The resulting species tree from these filtered fasta files was identical to the full busco_exon analysis. Additional analyses using complete CDS for ~9000 genes, or their second exon, produced essentially identical species tree results (Fig. S4).
Gene tree concordance with the species tree was assessed using Phyplots (95) to calculate the number of gene trees concordant and discordant with the species tree topology per node. Phyplots output was further parsed to distinguish among gene trees discordant with the species tree, into those supporting a main alternative tree vs. many alternative trees and those gene trees having less than 50% bootstrap support at the node in question. We used pieplots to represent these proportions ( https://github.com/mossmatters/MJPythonNotebooks/blob/master/PhyParts_PieCharts.ipy nb, accessed on 28/04/2021). Importantly, these pieplots results were concordant with estimated gene concordance factors via iQtree (Fig. S4).
We note here that the placement of the C. philodice Alba specimen from British Columbia (Canada) is grouping with C. canadensis instead of C. philodice from Connecticut (USA). Whether this represents a hybrid individual, paraphyly of C. philodice, the need to reassess the status of C. canadensis, or misidentification of the collected specimen from Canada, will be addressed in subsequent work.
SNAPP analyses (74), which are run as an add-on to the BEAST2 software program, were used to generate multi-species-coalescent analyses using SNPs from the busco_exon dataset from above. SNAPP estimates the species tree probability by integrating over all the possible gene trees observed among the SNPs. Despite dramatically decreasing parameter space for analyses, this approach remains computationally demanding. Recent work has investigated the number of SNPs for optimal inference while minimizing computation demands and determined accurate SNAPP settings for time calibration, which uses a strict-clock model with fossil calibrations and the linking of all population sizes during analysis (33). Thus, in order to stay within a multispecies-coalescent framework for divergence time and phylogenetic relationship estimation, we followed these recommendations (33) and downsampled our taxa to remove redundancy among closely related species while retaining regional diversity. For species where Alba was polymorphic and we sequenced both morphs, we only included the colored morph. This took our entire dataset of 29 samples down to 21.
Using AMAS, exon fasta files were subsampled to these 21 species, concatenated, and then converted to phylip file format. A ruby script was used to generate an input XML file for SNAPP via snapp_prep.rb (https://github.com/mmatschiner/snapp_prep, accessed on 22/04/2021), which takes a phylip formatted sequence dataset, allows for specifications of run iterations, inter-SNP intervals, total SNPs, a starting tree, and various constraints to be incorporated. Constraints included a temporal calibration for the timing of the split between Zerene and Colias using a secondary calibration of 10.9 million years ago with sigma=1, along with two monophyletic constraints well supported by previous Astral analyses (one for South American taxa, one for the remaining Colias species). The SNP dataset was constructed drawing random SNPs from among the concatenated BUSCO exons with at least 300 bp between each SNP, monomorphic sites removed, as were non-bi-allelic sites, resulting in 131 bi-allelic sites.
In order to generate a starting tree, the exons in the busco_exon set were concatenated this full dataset run using iQtree (extended model selection, a random starting tree, 1000 ultrafast bootstraps and optimization, and Z. cesonia set as an outgroup (-m MFP -t RANDOM -bb 1000 -alrt 1000 -bnni -o Z_cesonia)). This concatenated exon tree shows nearly the identical topology to that of the ASTRAL gene tree (Fig. S19). From this, an ultrametric tree with branch lengths, and a cladogram tree without branch lengths, were used as the two starting trees for SNAPP runs. For each starting tree, four independent runs were initiated, each with 4 million iterations. All 8 resulting runs were assessed for convergence, which was determined by effective sample sizes (ESS) > 200 and convergent likelihood and posterior distributions, which was assessed using Tracer in the BEAST2 package. Tree files were combined using Treeannotator in the BEAST2 package, with posterior estimates using median tree credibility after 10% of data was discarded as burn-in. The resulting phylogenetic relationships and divergence estimates for Colias and non-South American Colias crown groups were nearly identical in their results, which were visualized using Figtree v.1.4.4 (96).

Introgression analysis
In order to assess the level of gene flow and introgression among Colias species, we calculated the D-statistic between all possible trios of species for which we had sequence data. Using the software Dsuite (76) we were able to analyze this jointly and infer between which, past and present taxa, have likely had introgression. By calculating a f-branch statistic and analyzing it together with the phylogeny, we can infer between which taxa and which nodes in the phylogeny gene flow is most likely.
For this analysis, we used the sequence data for different Colias species mentioned previously and aligned them to the manually curated C. eurytheme Alba-reference genome using NGM (same settings as previously). No filtering was done before using Freebayes to call variants. The resulting vcf-file was filtered for strand-bias, a quality score of > 30, 90% of species sharing the site, minimum sample depth of 5, and minor allele frequency of 5%. In the analysis, only biallelic SNPs are included. D statistic between all our trios was calculated with the phylogeny as a reference tree to guide the analysis using the Dtrios tool in the Dsuite-toolkit (v.0.4). Dtrios compares all possible trios of species and calculates a genomewide minimum D and a significance value of each trial, and it will also calculate F4 statistics for each trio that we used to calculate the f-branch statistic we used to infer signals of past introgression. To test if the difference in observed levels of introgression we observed between regions was driven by an incorrect placement of C. phicomone or C. palaeno, the two species whose placement differed between the SNAPP tree and the Astral tree, we removed these taxa from the analysis, however, while we saw a decrease in the strength of introgression by doing this, the general patterns persisted (Fig. S24 & S25).  (77), and the resulting VCF-file was filtered using a mix of VCFTOOLS v0.1.13 (78) and custom awk scripts. The final GWAS was run using PLINK v1.9 (82) to identify associated loci. We filtered the GWAS based on two separate levels of stringency. First, we filtered the VCF file to remove low quality SNPs or sites that did not match our depth or frequency criteria (minimum depth 3, minQ 20, max-missing 0.95, and minor allele cutoff of 0.05). Second, we added an additional filter to this set and added the prior criteria to include only sites unique to the Alba females. We also filtered the output using the information gained from the QTL analysis and only kept SNPs on the scaffolds that made up Chromosome 3.

GWAS of Alba in
The same filtering approach was also applied to the GWAS done using the Alba reference genome.

Generation of draft genomes of Alba individuals of C. eurytheme, C. nastes, and C. crocea
DNA from wild-caught females was extracted from the thorax of samples stored frozen in 95% ethanol using the same protocol as in the resequencing done for the GWAS. Prior to library preparation, the molecular weight of the DNA was estimated using gel electrophoresis (0.5% agarose LE). DNA was extracted from 2 individuals of each species, and the ones with the highest quality DNA (as determined by gel electrophoresis, the 260/280 ratio measured via a spectrophotometer (Nanodrop 8000; Thermo Scientific, Waltham, MA, USA), and DNA concentration quantified on a Qubit 2.0 Fluorometer (dsDNA BR; Invitrogen, Carlsbad, CA, USA)), were selected for library prep. Sequencing and assembly via Supernova v2.1.1 were performed at SciLifeLab (Stockholm, Sweden).

PCR-based validation of insertion
The presence and uniqueness of the insertion to Alba individuals was further validated using PCR-based markers. The primers were designed to bind within the insertion region, which was unique to Alba. Optimal primer binding sites were identified using the primer3 software v2.5.0 (97). PCR reactions were run on DNA extracted from 8 orange and 8 Alba females that had not yet been sequenced or used in the GWAS. Positive controls were run using previously validated primers binding to mitochondrial cytochrome oxidase I gene (98). The reactions were run using Invitrogen Platinum Taq in a Veriti 96-well thermocycler (Applied Biosystems, Foster City, CA, USA) using the recommended settings for the polymerase (72C x 2min followed 35 cycles of 94C x 30sec + 54C x 30sec + 72C x 15 sec followed by 72C x 5min). The PCR product was visualized by agarose gel electrophoresis in a 1% gel (Fig. S9).

Generation of the C. eurytheme Alba reference genome
We identified the scaffold containing BarH1 in the supernova assembly of an Alba C. eurytheme using tBLASTn. We then aligned all the resequencing data from the GWAS to this contig alone and visually evaluated the contig in IGV. Regions where no orange reads aligned, but Alba did, were extracted and blasted back against the C. crocea reference genome (28) to assess whether this was the previously identified Alba insertion region. The one contig that both contained the BarH1 gene and only had reads from Alba individuals mapping, was identified as orthologous to the previously identified C. crocea Alba insertion using BlastN. To identify the orthologous region of this contig in the orange C. eurytheme reference, we used BlastN. Once the regions in common between the Alba insertion contig and the orange reference were established, we manually inserted the entire Alba insertion contig into the orange reference genome. This resulted in what we refer to as the Alba reference C. eurytheme genome.
We decided to generate the additional C. crocea assembly to make sure that we would be able to assemble and subsequently detect the Alba insertion sequence. We used the same method here as we described in the main paper for the detection of the Alba allele in the C. nastes and C. eurytheme assemblies, where we blasted the previously known Alba allele detected in the C. crocea assembly (28), and if this insertion was co-located with the BarH1 gene. We also assessed sequence similarity and synteny using Blastn, which we visualized using Kablammo (Fig. S10 & S11).

Alignment and assessment of the Alba insertion across species
The sequence data generated for the phylogeny was aligned to the Alba reference-genome using NextGenMap, filtered for mapQ 20, and being in proper pairs. Quantification of the proportion of mapped reads, and confirmation of sex in cases where it was uncertain, used a read coverage analysis via goleft indexcov (99) (see ST_CovStats). This confirmed that seven samples in our analysis were indeed male (Colias erate poligraphus, Colias tamerlana mongola, Colias phicomone, Colias tyche, Colias behrii (B033), Colias tamerlana (B037), and Colias wiscotti) based on the increased coverage on the scaffolds that make up the sexchromosome (9, 19 & 52). It also emphasizes the divergence between the South American species C. lesbia and C. euxanthe and the other species as they consistently showed highly reduced coverage on the terminal ends of the shorter chromosomes in the assembly (ST_CovStats).
Coverage across the insertion region was then visually inspected in IGV. We primarily assessed the regions where we had observed a difference in coverage between orange and Alba C. eurytheme individuals and whether this extended to other species where we had sampled either orange or Alba individuals. The sequence that was unique to only whitecolored species (putative Alba species) was the conserved Alba region and likely causal for the phenotype across Colias.
To establish a null expectation for how often a region of the same size as the conserved Alba region would segregate in this manner between white and colored (e. g orange, yellow. red) species, we performed a read depth analysis of all the sequenced Colias species. Reads were aligned using NextGenMap and filtered for proper pairs. We then used goleft v.0.2.1 (https://github.com/brentp/goleft) to evaluate the average read depth of 600 bp windows across all scaffolds. We selected 600 bps rather than the full 1200 bp of the candidate Alba locus to ensure that at least one window ended up inside the insertion region; this also made the analysis more sensitive compared to using a larger window size. The read depth coverage in the windows was then classified in a binary fashion: having read coverage (1) or not (0), with the latter classification assigned if the window had less than 25% of the read depth compared to the scaffold average. Thus, for each window, each species had a value of 0 or 1. Then, using these values of either having reads covering the window or not, we calculated the mean value per window for the white and colored groups of species. Finally, we subtracted the white value per window from the colored value per window. For the Alba identified region in white-colored butterflies, this value was 1, which was then compared to the rest of the windows across the genome, which served as a genome-scale control.

Phylogenetic analysis of the Alba insertion.
We extracted the consensus sequence of the region from all 43 Alba samples in our analysis (14 C. eurytheme, ten C. crocea, seven C. philodice (Maryland), one C. philodice (from British Columbia), and one from each remaining Alba species in the phylogeny) using the bam2fasta 1.4 tool-kit. We kept all sites with at least a read depth of one and selected the most common allele at polymorphic sites. We then ran IQtree to generate a phylogenetic tree of the sequences. IQtree was run with model finder plus enabled to allow it to find the most parsimonious model.

CRISPR/Cas9 targeted mutagenesis of the Alba insertion.
PROMO v 3.0.2 (86, 87) was used to predict transcription factor binding sites within the Alba candidate locus. Putative binding sites for proteins that have previously been shown to interact with BarH1 or are known to be involved in sexually dimorphic phenotypes were identified. Of the candidates, a putative binding site of doublesex was particularly intriguing due to its known role in the development of a variety of sexually dimorphic phenotypes (100), including butterfly wing color (101,102). Thus, 4gRNAs that targeted the putative doublesex binding site were designed. Two gRNAs were designed to cut within the putative doublesex binding site, while the other two targeted upstream and downstream from the site (Fig. S21). Four types of cocktails of gRNA/Cas9 mixture were injected: 1. gRNA_P (within the doublesex binding site) 2. gRNA_P and gRNA_U (3bp upstream of the doublesex binding site) 3. gRNA_2 (~250bp upstream of gRNA_P) and gRNA_5 (~150 bp downstream of gRNA_P). 4. All four together.
Unfortunately, due to poor egg-laying of the Alba females, we could not inject an equal number of eggs with each combination, instead primarily injecting eggs with either gRNA_P alone or gRNA_U and gRNA_P. The only combination we saw any phenotype from was when we injected all four together. Unfortunately, we were only able to inject 40 eggs with this combination before the females stopped laying eggs. The primary cause of death among our injections was accidental, due to larvae sticking to the double-sided tape that we used to attach the eggs to the glass slide during the injection. Out of 200 injected eggs, we could only transfer 39 larvae to a fresh hostplant (5 of the 40 eggs we injected with all four gRNAs: 2 males, 1 orange and 2 Alba females); the remaining died from the trauma induced as we attempted to release them from the glue.

Validation of CRISPR/Cas9 targeted mutagenesis.
Using Primer3 (97) we designed PCR primer pairs that bind upstream and downstream of gRNA_2 and gRNA_5. Due to a large amount of repetitive DNA in the region, we were forced to have the forward primer bind to a non-unique location, leading to a risk of having some off-target binding sites. The reverse primer was unique, and there were no alternative bindsites for the forward primer within 200Kb of the intended location. DNA was extracted using the Monarch HMW DNA extraction kit for Tissue (Cat# T3060S) using the front half of the thorax as well as from the eyes of both individuals with visible KO-phenotype. The PCRreactions were run using Invitrogen Platinum Taq in a Veriti 96-well thermocycler (Applied Biosystems, Foster City, CA, USA) using the recommended settings for the polymerase (72C x 2min followed 35 cycles of 94C x 30sec + 54C x 30sec + 72C x 15 sec followed by 72C x 5min). The PCR products of the thorax samples were visualized on an agarose gel electrophoresis in a 1% gel (Fig. S15). We also used the primer pair developed by (28) to validate the Alba status of the sample, and as a positive control for the reaction as the two primer pairs do not bind near each other. We additionally sequenced the PCR products from the eyes and thorax of the two KO-individuals using Nanopore Minion. The PCR-product of each sample was individually cleaned using a 1.8x Ampure XP cleanup (Beckman Coulter, Brea, CA, USA) and prepared for sequencing according to the Nanopore LSK109 ligation library protocol. Each library was sequenced until approximately 100K reads had been generated. The reads were basecalled with guppy v6.0.1 using the super accuracy configuration and filtered for reads with a Q-score of >10. The resulting fastq reads were mapped against the C. crocea reference genome (28) and the location and size of deletions were visually identified and correlated to the target sites of the individual gRNAs used (Fig.  S16).

Supplementary figures
Fig. S1 Flowchart of the assembly pipeline. Flowchart illustrating the genome assembly, annotation, and evaluation process for the C. eurytheme genome construction. Generation of raw contigs was done using 150x PacBio sequence data coming from a single orange female pupa. Linkage mapping of the contig data was done using Rad sequencing from multiple individuals of C. eurytheme X C. philodice hybrid crosses was used to create a linkage map of the contigs. The draft assembly was then polished using Pilon to reduce indel errors introduced by the PacBio sequencing, and finally annotated using BRAKER2. Each step of the assembly process was analyzed using N50 and BUSCO. Blue boxes represent the generation of data, Yellow boxed filtering of data, and green boxes visualization. In total, we ran three levels of filtering of the raw VCF-file with different levels of priors. Given the high synteny between Z. cesonia and Heliconius erato (65), and Heliconius to other butterflies and moths (92), we infer that Colias chromosomal structure adheres to the standard Lepidoptera chromosome structure (103). Fig. S4. Pie charts of gene tree concordance, conflict, and lack of signal compared to the Astral species tree. While traditional node support values, whether bootstraps or posterior probability, generally overstate gene tree support, here we present direct quantification per node of gene tree topology with the species tree topology. The number of gene trees concordant (top number) and in conflict (bottom number) is shown at each node. Pie charts at each node give a further breakdown of gene tree proportions by those concordant with species tree (blue), those that support a common alternative topology (green), those that support the remaining low-frequency alternatives (red), and those lacking robust information as they have less than 50% bootstrap support (gray). Here, the South American taxa are very well supported, as are the clades containing: C. crocea, C. erate, and C. electo; C. tyche, C. tamerlana mongola, C. nastes; C. pelidne, and C. behrii; C. alexandra, C. canadensis, C. philodice, C. eurytheme, and C. eriphyle. Note how the vast majority of nodes are red, indicating extensive gene tree conflict rather than a lack of phylogenetic signal (gray).   The contig containing BarH1 and the insertion was identified using blast and read depth analysis. Overlapping sequences between the contig and the orange reference genome were used to define the edges and to insert the sequence. The sequence was then inserted and overlapping sequences removed, favoring the Alba contig sequence leading to the generation of the Alba reference genome. Orthology assessment of the C. crocea Alba scaffold identified in the 10X assembly (query), identified by using BlastN of the C. eurytheme BarH1 gene and insertion sequence against the assembly, compared to the one found in the C. crocea reference genome(subject). The darker the blue hue is, the higher identity is. Fig. S11. Orthology assessment of the Alba insertion in C. nastes. The Alba scaffold, as identified using the BarH-1 gene and insertion identified in C. eurytheme, from the C. nastes 10X assembly (query), aligned against the Colias crocea reference genome(subject). The darker the blue hue is, the higher identity is.    Black lines indicate a deletion in the amplicon, and a gray means that it is matching the reference. The deletions overlap with locations of the gRNA target sites (with the exception of the rightmost gRNA that is missed by our primer pair. Deletions vary in size. Small deletions likely result from non-homologous end joining due to a single cut, while a larger deletion likely arises when 2 or more double stranded breaks occur due to multiple cut sites on the same DNA molecule. The gRNA that is causing a consistent deletion across both samples and tissues is located on top of the putative doublesex binding site.  Fig. S19. Busco exon tree. Phylogenetic tree of the 21 taxa subset generated using concatenated busco_exons. This set of genes was used as input for iQtree. Tree is rooted with Z. cesonia, and for each node, the branch support is shown with values and color. Only one node has < 100 % bootstrap support.   Table S9 for more details.  Distribution of species used in our phylogenetic analysis. This information was not used in our analyses and are only included to facilitate general insights into approximate ranges. The locations and maps are all taken from the funet website. These were automatically generated from regional information drawn from primary literature using a text mining approach. As such, these maps are approximate and included here for a general overview. However, should readers wish to have more accurate, more confident range information, we direct them to the funet website and the primary literature listed therein.