Within-host genomics of SARS-CoV-2

Extensive global sampling and whole genome sequencing of the pandemic virus SARS-CoV-2 have enabled researchers to characterise its spread, and to identify mutations that may increase transmission or enable the virus to escape therapies or vaccines. Two important components of viral spread are how frequently variants arise within individuals, and how likely they are to be transmitted. Here, we characterise the within-host diversity of SARS-CoV-2, and the extent to which genetic diversity is transmitted, by quantifying variant frequencies in 1390 clinical samples from the UK, many from individuals in known epidemiological clusters. We show that SARS-CoV-2 infections are characterised by low levels of within-host diversity across the entire viral genome, with evidence of strong evolutionary constraint in Spike, a key target of vaccines and antibody-based therapies. Although within-host variants can be observed in multiple individuals in the same phylogenetic or epidemiological cluster, highly infectious individuals with high viral load carry only a limited repertoire of viral diversity. Most viral variants are either lost, or occasionally fixed, at the point of transmission, consistent with a narrow transmission bottleneck. These results suggest potential vaccine-escape mutations are likely to be rare in infectious individuals. Nonetheless, we identified Spike variants present in multiple individuals that may affect receptor binding or neutralisation by antibodies. Since the fitness advantage of escape mutations in highly-vaccinated populations is likely to be substantial, resulting in rapid spread if and when they do emerge, these findings underline the need for continued vigilance and monitoring. Introduction The ongoing evolution of SARS-CoV-2 has been the topic of considerable interest as the pandemic has unfolded. Clear lineage-defining single nucleotide polymorphisms (SNPs) have emerged ​(​1​ )​, enabling tracking of viral spread ​(​2​, ​3 ​)​, but also raising concerns that new mutations may confer selective advantages on the virus, hampering efforts at control. Most prominently, there is increasing evidence that the D614G mutation (genome position 23403) in the Spike protein (S) increases viral transmissibility ​(​4​, ​5 ​)​ and N439K (genome position 22879) evades antibodies without loss of fitness ​(​6​)​. Most analyses have been focused on mutations observed in viral consensus genomes, which represent the dominant variants within infected individuals. Ultimately though, new mutations emerge within individuals, and hence knowledge of the full underlying within-host diversity of the virus at the population level, and how frequently this is transmitted, is important for understanding adaptation and patterns of spread. The United Kingdom (UK) experienced one of the most severe first waves of infection, with over a thousand independent importation events contributing to substantial viral diversity during this period ​(​7​)​. In this study, we collected and analysed 1390 samples predominantly from symptomatic individuals (1173 unique individuals plus 93 anonymous samples) who tested positive for COVID-19 during the first wave of infection (March June 2020; ​Table S1). The samples were collected by​ two geographically separate hospital trusts: Oxford University Hospitals and Basingstoke and North Hampshire Hospital, located 60 km apart. Using veSEQ, an RNA-Seq protocol based on a quantitative targeted enrichment strategy ​(​8​ )​, which we previously validated for other viruses ​(​8​ – ​11​)​, we . CC-BY-NC-ND 4.0 International license perpetuity. It is made available under a preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in The copyright holder for this this version posted December 10, 2020. ; https://doi.org/10.1101/2020.05.28.118992 doi: bioRxiv preprint


Introduction
The ongoing evolution of SARS-CoV-2 has been the topic of considerable interest as the pandemic has unfolded. Clear lineage-defining single nucleotide polymorphisms (SNPs) have emerged ( 1 ) , enabling tracking of viral spread ( 2 , 3 ) , but also raising concerns that new mutations may confer selective advantages on the virus, hampering efforts at control. Most prominently, there is increasing evidence that the D614G mutation (genome position 23403) in the Spike protein (S) increases viral transmissibility ( 4 , 5 ) and N439K (genome position 22879) evades antibodies without loss of fitness ( 6 ) . Most analyses have been focused on mutations observed in viral consensus genomes, which represent the dominant variants within infected individuals. Ultimately though, new mutations emerge within individuals, and hence knowledge of the full underlying within-host diversity of the virus at the population level, and how frequently this is transmitted, is important for understanding adaptation and patterns of spread.
The United Kingdom (UK) experienced one of the most severe first waves of infection, with over a thousand independent importation events contributing to substantial viral diversity during this period ( 7 ) . In this study, we collected and analysed 1390 samples predominantly from symptomatic individuals (1173 unique individuals plus 93 anonymous samples) who tested positive for COVID-19 during the first wave of infection (March -June 2020; Table S1). The samples were collected by two geographically separate hospital trusts: Oxford University Hospitals and Basingstoke and North Hampshire Hospital, located 60 km apart. Using veSEQ, an RNA-Seq protocol based on a quantitative targeted enrichment strategy ( 8 ) , which we previously validated for other viruses ( 8 -11 ) , we characterised the full spectrum of within-host diversity in SARS-CoV-2 and analysed it in the context of the consensus phylogeny.
We observed low levels of viral diversity within individuals, with evidence of strong within-host evolutionary constraint in Spike and other regions of the genome. Although within-host variants can be observed in multiple individuals in the same phylogenetic or epidemiological cluster, most viral variants are either lost, or occasionally fixed, at the point of transmission, with a narrow transmission bottleneck. These results suggest potential vaccine-or therapy-escape mutations are likely to rarely emerge or be transmitted from infectious individuals. Nonetheless, we identified Spike variants present in multiple individuals that may affect receptor binding or neutralisation by antibodies. Since the fitness advantage of escape mutations in highly-vaccinated populations is likely to be substantial, resulting in rapid spread if and when they do emerge, these findings underline the need for continued vigilance and monitoring.

Within-host variant frequencies are reproducible
Establishing reliable variant calling thresholds for clinical samples, where true variant frequencies are unknown, ideally requires re-sequencing of multiple samples from RNA to test for concordance. Working within the constraints of small volumes of remnant RNA from laboratory testing, we re-sequenced 65 samples, of which 27 replicate pairs generated sufficient read numbers (>50,000 unique mapped reads) for reliable minor variant detection. Intrahost single-nucleotide variants (iSNVs) with <2% MAF were generally indistinguishable from noise, whereas those >=3% MAF were highly concordant between replicates (Fig. 1D,  Fig. S2).
Based on the above considerations, we identified a set of 583 iSNV sites that were observed (i) in high-VL samples with at least 50,000 unique mapped reads, (ii) at depth of at least 100 reads, (iii) with a MAF of at least 3%, and (iv) not observed to vary in synthetic RNA controls (Table S2; see Methods). Of these, we excluded the 18 sites which were variant in over 20 samples. Variants at these sites occurred at low frequency in many samples (Table S2), with some showing evidence of strand bias and/or low reproducibility between technical replicates (Fig. S2). Among the excluded sites was 11083, which was observed in 46 samples and is globally ubiquitous in GISAID data. From manual examination of mapped reads in our dataset, this appears to be due to a common mis-calling of a within-host polymorphic deletion upstream at site 11082, occurring in a poly-T homopolymeric stretch. If genuine, this homopolymer stutter may have a structural or regulatory role; however, methodological issues in resolving this difficult-to-map region cannot be ruled out. The remaining 565 sites were taken forward for variant analysis. The black line is the estimated mean value by linear regression, with the green ribbon the 95% confidence interval. C: Distribution of number of identified iSNV sites at 3% from the real data (green) and from a simulation (orange). For the simulation 'true' MAFs were beta-distributed along the genome, and the estimated minor allele count at each site was drawn from a binomial distribution with number of trials equal to the read depth at that site, and probability equal to the "true" MAF at that site. D : Comparison of MAFs from 27 replicate pairs resequenced from RNA. The plot represents all MAF frequency comparisons for the 27 samples where both replicates had >50,000 unique mapped reads, limited to genomic sites where the MAF > 0.02 in at least one of the 54 replicates, and excluding sites observed to be variant in more than 20 samples from our whole dataset at MAF > 0.03. The green lines are the threshold value of 0.03.
. CC-BY-NC-ND 4.0 International license perpetuity. It is made available under a preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in The copyright holder for this this version posted December 10, 2020. ; https://doi.org/10.1101/2020.05.28.118992 doi: bioRxiv preprint

Fig. 2. Intra-host variable (iSNV) sites are present in most samples and often shared. A:
Distribution of the number of samples with an intra-host variant at a site. B: Distribution of the number of sites with iSNVs for all samples with more than 50,000 mapped reads (dark) and samples with fewer than 50,000 mapped reads (light). Only identified sites were included (see main text) with sites variable in 20+ individuals excluded.

SARS-CoV-2 is evolutionary constrained at the within-host level
The distribution of iSNV sites varies across the genome (Table 1). Even excluding the UTR regions, which have a highly elevated density of iSNV sites, there is considerable variability across the genome, with open-reading frames (ORFs) 3a, 7a, and 8, and nucleocapsid (N) showing the highest densities. Most areas of the genome appear to be under strong purifying selection, with dn/ds values less than 1, including S. However, a few regions seem to be prone to directional selection within individuals, notably ORFs 3a, 7a, 7b, and 10. These patterns are broadly consistent with d n /d s values calculated for SNPs among consensus genomes ( 17 ) , suggesting evolutionary forces at the within-host level are reflected at the between-host level, at least for within-host variant sites in high VL samples. The exception is ORF7a which appears to be under purifying selection at the between-host level, but positive selection at the within-host level. Table 1. iSNVs and dn/ds by gene and over the whole genome.
All genome positions are relative to the Wuhan-Hu-1 reference sequence. iSNVs at the 18 "highly shared" sites and those identified from the synthetic controls are excluded, as are those in the poly-A tail (positions 29865-29903). The "mean iSNVs per 100 sites" column is the mean number in each gene over all 1390 samples. Note that due to gene overlap and non-coding intergenic regions, the total number of iSNVs (781) cannot be obtained as the sum of any column in this table, even if the rows for nonstructural proteins in ORF1ab are excluded. * nsp12 overlaps the boundary between ORF1a and ORF1b. ** Intergenic regions are excluded from this row, for which the start and end points do not represent a continuous range.

Within-host variant sites are phylogenetically associated
Consensus viral sequences that cluster closely on a phylogenetic tree have been used successfully in SARS-CoV-2 to identify epidemiological links ( 18 -20 ) . Due to the recent emergence and low evolutionary rate of SARS-CoV-2, its global phylogeny has only limited genetic diversity, and hence limited resolution to identify clusters. We sought to gain a better understanding of SARS-CoV-2 evolution and determine whether iSNVs could be used to help resolve phylogenies and transmission clusters. For the 1390 samples in our study, we . CC-BY-NC-ND 4.0 International license perpetuity. It is made available under a preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in The copyright holder for this this version posted December 10, 2020. ; https://doi.org/10.1101/2020.05.28.118992 doi: bioRxiv preprint constructed a phylogeny using the robust procedure outlined by ( 21 ) (Fig. 3A). Using this tree, we determined whether iSNVs, and SNPs (indicating a difference in the most common variant among samples) at the same locus, are phylogenetically associated.
For the 153 iSNVs that are also consensus SNPs in at least one sample, termed iSNV-SNPs, we examined the proximity of tips with the iSNV to the position of consensus changes (between the two most common bases at the site of the iSNV) on the phylogeny (see Methods). A highly significant negative association (one-sided Mann-Whitney U-test, p <3x10 -16 ; Fig. S5A) was found between the presence of an iSNV at a given site in a sample and the patristic distance to the nearest example of a consensus change at the same site. When we tested sites individually, six showed a significant association after Benjamini-Hochberg correction ( p <0.05), reducing to five if only one sample from each individual was included.
In Fig. 3B we show the example of site 28580, with the red clade representing change from the global consensus G to A (a nonsynonymous change D103N in N), and nearby iSNVs occurring, both as minor As in the nodes ancestral to the change branch, and as minor Gs in the branch's immediate descendants. Based on corresponding epidemiological data, this represents a likely healthcare-associated cluster with onward transmission to close contacts. In Fig. 3C we give the further example of site 20796, a synonymous substitution L6843 in ORF1a. Trees for the other significant sites after Benjamini-Hochberg correction appear in Fig. S6. In addition, we examined 16 epidemiologically identified household clusters, in 5 of which we observed an iSNV in one individual that was fixed in the other (for the household analyses we did not constrain on sites only present in high VL samples; Table 2).
For the 261 iSNVs that are present in at least two individuals but never reach consensus, we analysed the association with the phylogeny of each iSNV variant as a discrete trait, using two statistics: the association index ( 22 ) and the mean patristic distance between iSNV tips. After adjustment for multiple testing, no sites showed a p-value less than 0.05 for a phylogeny-iSNV association for either statistic. Similarly, if we simply compare the distance to the nearest iSNV tip amongst iSNV and non-iSNV tips across all 261 iSNV sites, there is also no evidence for an association (one-sided Mann-Whitney U-test p~=1 , Fig.  S5B). Nevertheless, some individual sites do show patterns suggestive of iSNV transmission, with diversity maintained after transmission (22 with p <0.05 before adjustment for multiple testing for at least one of the two statistics; those 9 with p <0.025 are shown in Fig. S6) suggesting we may lack the power to statistically detect some associations. Among the 16 known household clusters, we observed only one iSNV shared in two individuals within the same household. This iSNV was unique to these two individuals, demonstrating a likely example of transmitted viral diversity ( Table 2).
. CC-BY-NC-ND 4.0 International license perpetuity. It is made available under a preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in The copyright holder for this this version posted December 10, 2020. ; https://doi.org/10.1101/2020.05.28.118992 doi: bioRxiv preprint

The transmission bottleneck size within households is small
Estimating bottleneck size is difficult for SARS-CoV-2, since it requires sufficient genetic diversity to differentiate distinct viruses that may be transmitted in known source-recipient pairs, and confidence that transmission is the cause of variants observed in both source and recipients ( 23 -25 ) . Using the exact beta-binomial method ( 23 ) we estimated bottleneck sizes between 1 and 8 among 14 household transmission pairs ( Table 2). These observations are consistent with the small bottleneck sizes observed for influenza ( 25 ) .
We speculate that situations where multiple phylogenetically linked cases share sub-consensus variants could be a consequence of superspreader events, or other high-exposure situations, where many individuals are exposed to high viral doses. An association between the route of exposure and the transmission bottleneck has been demonstrated experimentally for influenza ( 26 ) . Here, we sequenced clinical samples, which likely include infections from some high-exposure events. For example, the clearest example . CC-BY-NC-ND 4.0 International license perpetuity. It is made available under a preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in The copyright holder for this this version posted December 10, 2020. ; https://doi.org/10.1101/2020.05.28.118992 doi: bioRxiv preprint of shared diversity is at site 28580, with three individuals attending the same hospital department on the same day, and estimated bottleneck sizes of 4 between the assumed source (determined by date of positive test) and each of the two recipients.
Taken together, our observations suggest the transmission bottleneck can be wide enough to permit co-transmission of multiple genotypes in some instances, but small enough that multiple variants do not persist after a small number of subsequent transmissions. In the cases where this transmission culminates in a consensus change on the phylogeny these patterns are readily observable, but in most cases patterns of co-transmission are drowned out by the high proportion of iSNVs that fail to transmit, or are transmitted but then lost. Table 2. Household analysis of variants and transmission bottleneck size. 1 All households consisted of two individuals with sequence data. 2 The number of genome positions where a minor variant in one of the individuals (defined as >3% MAF and more than 100 reads) is the consensus variant in the other individual in the household (in all cases the consensus variant was >99.5%). All (non-masked) genomic sites were considered. 3 The number of genome positions where a minor variant is >3% MAF in both individuals in the household. 4 Bottleneck size was calculated using the exact beta-binomial method described in ( 23 ) . All sites >3% MAF and more than 100 reads in the assumed source individual were used in the analysis. In the recipient all reads at these sites were considered, with an error threshold of 0.5% MAF. Where the first samples for each individual in the household were more than one week apart, we assumed the earlier sampled individual was the donor. Where an individual had more than one sample, and/or the first individuals were positively sampled within a week, we calculated all possible combinations of donor-recipient samples. The maximum and minimum maximum likelihood estimates are recorded if different. No estimate is recorded if there are no identified iSNVs >3% in the donor (household 8), or the two individuals in the household had more than two consensus differences (household 15). . CC-BY-NC-ND 4.0 International license perpetuity. It is made available under a preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in The copyright holder for this this version posted December 10, 2020. ; https://doi.org/10.1101/2020.05.28.118992 doi: bioRxiv preprint

Some within-host variants show signatures of selection
Variants occurring repeatedly, but without phylogenetic association, could indicate sites under selection in distinct individuals ( 27 ) . Of particular note are variants we observed at three sites in S: 21575 (V5F), 22899 (G446V) and 24198 (A879V), with G446V lying within the receptor binding domain (RBD). The minor variant F5 was observed in 14 samples, and represented SNPs in 8 samples, but did not have phylogenetic association in our iSNV-SNP analysis ( p = 0.771 before multiple testing adjustment, Fig. 3D). This V5F mutation has been shown to increase infectivity in vitro (28) , and has previously been identified as a potential site subject to selection ( 29 ) . This variant has repeatedly been observed in global samples, including as minority variant, but appears to be increasing in frequency slowly if at all, suggesting it is only advantageous within a small subset of individuals, with the variant either 'reverting' in subsequent infections (as seen in HIV ( 30 ) ), or failing to transmit at all. Similarly, we observed the minor variants V446 and V879 in 4 and 6 individuals respectively. Both variants have previously been shown to reduce sensitivity to convalescent sera in vitro ( 28 ) , and V446 strongly reduces binding of one of the antibodies (REGN10987) in the REGN-Cov2 antibody cocktail ( 31 ) , suggesting these may represent antibody escape mutations.

Implications of intra-host variation on consensus phylogenies
The presence of minority variants could explain some phylogenetic inconsistencies observed in SARS-CoV-2. The global phylogeny is reconstructed from an alignment with relatively few SNPs, and therefore the particular base identified as consensus at iSNV sites can affect the overall phylogeny ( 32 ) . Minority variants can result in changes to branch lengths, either by shortening branches due to lack of resolution at an informative site, or by extending branches if a minor variant -real or artifactual -is miscalled as consensus, as may occur in low VL samples. Miscalling a minor variant as consensus can also generate homoplasies (sites that are repeatedly mutated on the SARS-CoV-2 phylogeny), particularly where the minor variant was the result of contamination or co-infection and represents a lineage-defining SNP in another part of the phylogeny. The same effect would be expected for sites that are prone to host RNA editing or RNA degradation, resulting in the same minority variants arising in different parts of the phylogeny ( 33 ) .
For the iSNV-SNP associated sites that we detected, representing the emergence and/or transmission of genuine variants, our phylogeny appears to be robust to the presence of iSNVs. We observed relatively few homoplasies on our tree (97 out of 1254 SNPs; Table  S3), which suggests that at least in our dataset, the presence of minority variants did not strongly impact the phylogenetic signal. However, some of the longest terminal branch lengths in our phylogeny were indeed associated with low VL samples and high MAFs (Fig.  S7), which suggests that in some cases, minor variants could be responsible for branch length extension in low VL samples. While the presence of high-MAF, consensus-impacting minor variants in such samples could be due to the effect of proportional sampling from a smaller viral population (Fig. 1A,C), genuine biological explanations are also plausible, including late infection being associated with both low VLs and higher diversity ( 34 ) , or of an association of low VL with RNA degradation, host editing, or deleterious mutations which in turn reduce the likelihood of onwards transmission.
We emphasise however, that the presence of iSNVs at common SNP and/or homoplasic sites is not necessarily indicative of co-infection or contamination, or the . CC-BY-NC-ND 4.0 International license perpetuity. It is made available under a preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in The copyright holder for this this version posted December 10, 2020. ; https://doi.org/10.1101/2020.05.28.118992 doi: bioRxiv preprint generation of methodological variants. The generation and transmission of iSNVs is a prerequisite for the generation of SNPs on the phylogeny, and homoplasic sites may represent sites under diversifying selection (positively selected in some individuals but negatively selected in others), or sites prone to generation of within-host variants. Nonetheless, as is increasingly being recognised, care is needed when both calling iSNVs and SNPs ( 32 ) . By sequencing synthetic RNA controls, resequencing samples, and only identifying sites if variable in at least one high VL sample, we retained only high confidence variants for our analyses.

Concluding remarks
We uncovered a consistent and reproducible pattern of within-host SARS-CoV-2 diversity in a large dataset of over 1000 individuals, with iSNV sites showing strong phylogenetic clustering patterns if they are also associated with a change in the consensus variant at the same site. However, most samples harboured few variant sites, with a pattern of strong within-host evolutionary constraint in most regions of the genome, including Spike. This indicates that the within-host emergence of vaccine-and therapeutic-escape mutations is likely to be relatively rare. Moreover, the transmission bottleneck size was very small (between 1 and 8) in most instances where we had epidemiological data, suggesting that even if escape-mutations do arise they will be prone to loss at the point of transmission.
Although this bodes well for the longevity of vaccines and antibody-based treatments, we observed two mutations in Spike ( G446V and A879V) that have previously been shown to escape antibody binding ( 28 , 31 ) , and a third that has been shown to increase viral infectivity (V5F, ( 31 ) ), emphasising the need for continuing vigilance. We identified 30 nonysynonymous iSNVs in Spike that are present in multiple individuals (Table S2), and we suggest these and other commonly occurring iSNVs in other regions of the genome should be investigated and monitored, particularly as vaccines and therapeutics are rolled out more widely.
Throughout, we aimed to minimise sequencing artefacts and sample contamination where possible. The dense sampling and deep sequencing of SARS-CoV-2 has enabled us to witness 'evolution-in-action', with diversity generated in one individual leading to a change in consensus and fixation in subsequently infected individuals. The observation of shared diversity among phylogenetically and epidemiologically linked individuals suggests within-host variants could be used, at least in some instances, to help better resolve patterns of transmission in a background of low consensus diversity.
Our work demonstrates that an essential requirement for incorporating intrahost variants in any analysis is an understanding of the population prevalence of intrahost diversity, conditional on the methods used to produce the deep sequencing data. Moreover, our results emphasise the power of open data, large and rigorously controlled datasets, and the importance of integrating genomic, clinical, and epidemiological information, to gain in depth understanding of SARS-CoV-2 as the pandemic unfolds.
. CC-BY-NC-ND 4.0 International license perpetuity. It is made available under a preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in

Material and methods
Figures S1-S8 Tables S1-S3 OVSG Analysis group Membership COG-UK full list of consortium names and affiliations . CC-BY-NC-ND 4.0 International license perpetuity. It is made available under a preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in The copyright holder for this this version posted December 10, 2020. ; https://doi.org/10.1101/2020.05.28.118992 doi: bioRxiv preprint

RNA extraction. Residual RNA from COVID-19 RT-qPCR-based testing was obtained from Oxford University Hospitals ('Oxford'), extracted on the QIASymphony platform with QIAsymphony DSP Virus/Pathogen Kit (QIAGEN), and from Basingstoke and North
Hampshire Hospital ('Basingstoke'), extracted with one of: Maxwell RSC Viral total nucleic acid kit (Promega); Reliaprep blood gDNA miniprep system (Promega); or Prepito NA body fluid kit (PerkinElmer). An internal extraction control was added to the lysis buffer prior to extraction to act as a control for extraction efficiency (genesig qRT-PCR kit, #Z-Path-2019-nCoV in Basingstoke, MS2 bacteriophage ( 37 ) in Oxford). The #Z-Path-2019-nCoV control is a linear, synthetic RNA target based on sequence from the rat ptprn2 gene, which has no sequence similarity with SARS-CoV-2 (GENESIG PrimerDesign pers. comm, 6 April 2020). The MS2 RNA likewise has no SARS-CoV-2 similarity ( 37 ) . Neither control RNA interfered with sequencing.
Targeted metagenomic sequencing. Samples with suspected epidemiological linkage, where this information was available prior to sequencing, were processed in different batches. Sequencing libraries were constructed from remnant volume of nucleic acid after clinical testing, ranging from 5 to 45 μl (median 30μl) for each sample depending on the available amount of eluate. These volumes represented 1-15% of the original specimen (swab). Libraries were generated following the veSEQ protocol ( 8 ) with some modifications. Briefly, unique dual indexed (UDI) libraries for Illumina sequencing were constructed using the SMARTer Stranded Total RNA-Seq Kit v2-Pico Input Mammalian (Takara Bio USA, California, US) with no fragmentation of the RNA. An equal volume of library from each sample was pooled for capture. Size selection was performed on the captured pool to eliminate fragments shorter than 400nt, which otherwise may be preferentially amplified and sequenced. Target enrichment of SARS-CoV-2 libraries in the pool was obtained through a custom xGen Lockdown Probes panel (IDT, Coralville, USA), using the SeqCap EZ Accessory Kits v2 and SeqCap Hybridization and Wash Kit (Roche, Madison, US) for hybridization of the probes and removal of unbound DNA. Following 12 cycles of PCR for post-capture amplification, the final product was purified using Agencourt AMPure XP (Beckman Coulter, California, US). Sequencing was performed on the Illumina MiSeq (batches 1-2) or NovaSeq 6000 (batches 3-27) platform (Illumina, California, US) at the Oxford Genomics Centre (OGC), generating 150bp or 250bp paired-end reads.

Quantification controls. A dilution series of in vitro
transcribed SARS-CoV-2 RNA (Twist Synthetic SARS-CoV-2 RNA Control 1 (MT007544.1),Twist Bioscience) was included in every capture pool of 90 samples starting from batch 3, and sequenced alongside the clinical samples. Control RNA was serially diluted into Universal Human Reference RNA (UHRR) to a final concentration of SARS-CoV-2 RNA of 500,000, 50,000, 5,000, 500, 100 and 0 copies/reaction. From this we produced a standard curve demonstrating linear association between viral load (VL) and read depth (Fig. S1). For an experiment comparing iSNV presence with and without probe capture, we additionally sequenced two replicates of the Twist RNA control without capture, diluted into UHRR to give an expected concentration of 50,000 copies per reaction.
As an additional validation step, we compared intrahost single-nucleotide variants (iSNVs) in re-sequenced controls with data for the stock RNA sequenced and provided by the manufacturer (Twist Bioscience). Six well-defined iSNVs, which were present in the manufacturer's data and presumably arose during in vitro transcription, were also recovered by our protocol (Fig. S8). In addition, we identified 112 sites that appeared vulnerable to . CC-BY-NC-ND 4.0 International license perpetuity. It is made available under a preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in The copyright holder for this this version posted December 10, 2020. ; https://doi.org/10.1101/2020.05.28.118992 doi: bioRxiv preprint low-frequency intrahost variation in vitro (Table S3), possibly as a result of structural variation along the genome or interaction with the sequencing protocol. We blacklisted vulnerable sites from further analysis.
In-run controls. In addition to the synthetic RNA standards described above, each batch included a non-SARS-CoV-2 in-run control consisting of purified in vitro transcribed HIV RNA from clone p92BR025.8, obtained from the National Institute for Biological Standards and Control (NIBSC) ( 38 ) . For batches 1 and 2, which were sequenced prior to synthetic RNA becoming available, we included negative buffer controls. As additional negative controls, we sequenced 6 matched clinical samples from non-COVID-19 patients, distributed across different sequencing runs; none contained any SARS-CoV-2 reads.

Minimising risk of index misassignment
All samples had unique dual indexing (UDI) to prevent cross-detection of reads in the same pool. We used the in-run HIV RNA controls to estimate index misassignment, as this provided a sequence-distinct source of RNA: <3 SARS-CoV-2 reads were detected in any HIV control (median 0) and <10 HIV reads were detected in any SARS-CoV-2 control (median 0), suggesting that index misassignment, if present, occurred at extremely low levels.
Bioinformatics processing. De-multiplexed sequence read pairs were classified by Kraken v2 ( 39 ) using a custom database containing the human genome (GRCh38 build) and the full RefSeq set of bacterial and viral genomes (pulled May 2020). Sequences identified as either human or bacterial were removed using filter_keep_reads.py from the Castanet ( 9 ) workflow ( https://github.com/tgolubch/castanet ). Remaining reads, comprised of viral and unclassified reads, were trimmed in two stages: first to remove the random hexamer primers from the forward read and SMARTer TSO from the reverse read, and then to remove Illumina adapter sequences using Trimmomatic v0.36 ( 40 ) , with the ILLUMINACLIP options set to "2:10:7:1:true MINLEN:80". Trimmed reads were mapped to the SARS-CoV-2 RefSeq genome of isolate Wuhan-Hu-1 (NC_045512.2), using shiver ( 41 ) v1.5.7, with either smalt ( 42 ) or bowtie2 ( 43 ) as the mapper. Both mappers generated comparable results; smalt was used for the final analysis. Only properly paired reads with insert size under 2000 and with at least 70% sequence identity to the reference were retained. For analysis of consensus genomes, consensus calls required a minimum of 2 uniquely mapped (deduplicated) reads per position, equivalent to >15 raw reads per position. Analysis of within-host diversity was restricted only to positions with minimum raw depth of 100, except when examining diversity within presumed recipients of transmissions in the bottleneck analysis. Minor allele frequencies were computed at every position using shiver ( 41 ) (tools/AnalysePileup.py), with the default settings of no BAQ and maximum pileup depth of 1000000. Lineages were assigned by the Pangolin web server (https://pangolin.cog-uk.io) using the determined consensus genome for each sequenced sample.

Alignment.
Oxford and Basingstoke samples were selected if the consensus sequence (inferred from unique mapped reads) consisted of no more than 25% N characters. As an alignment to the reference sequence was already performed in shiver , no further alignment was necessary. To place these data into the global phylogenetic context and help resolve ancestry, a collection of non-UK consensus sequences from the GISAID database ( 44 ) were included in the set of sequences to be aligned. All GISAID ( 36 ) sequences were downloaded from the database on the 26th April 2020 and filtered to remove sequences that were less than 29800 base pairs in length, were more than 1% Ns, or were from the United Kingdom. The remaining sequences were clustered using CD-HIT-EST ( 45 ) using a similarity threshold of 0.995, and then one sequence per cluster picked. The resulting set, along with the reference genome Wuhan-Hu-1 (RefSeq ID NC_045512), were aligned using MAFFT ( 46 ) , with some manual improvement of the algorithmic alignment and removal of problematic sequences performed as a post-processing step. Indels with respect to Wuhan-Hu-1 in both the Oxford/Basingstoke and GISAID alignments were deleted, resulting in two alignments of 29903 nucleotides that could be readily combined.
Simulation of expected number of iSNVs for a given VL sample. To demonstrate the effect of read depth on estimated iSNV counts, we first assumed that within-host MAFs at each site s (here regarded as simply a proportion of reads that do not share the majority nucleotide at s ) for each isolate i were drawn from a Beta(1, 331.41) distribution. Under this distribution, whose mode is 0, the expected number of sites across the whole genome with a MAF greater than 0.03 is 1.22, which is the mean number of iSNVs per sample in the real data. Let p si represent this "true" simulated MAF, and d si the empirical read depth, of sample i at site s .
Then the estimated number of MAFs for i was calculated by summing draws from a Binomial( d si , p si ) distribution for each site s .

Phylogenetics.
Phylogenetic reconstruction was performed on the alignment consisting of the 1390 consensus sequences, along with the GISAID set and the Wuhan-Hu-1 reference sequence. We followed the recommendations of ( 21 ) whereby 100 separate maximum likelihood phylogenies were generated using RAxML-NG ( 47 ) and the GTR+G substitution model, such that each reconstruction used a different random starting parsimony tree. The final phylogeny was then obtained from this set using majority rule. This final tree was rooted with respect to the reference sequence, and then that and all GISAID isolates were pruned.
To identify homoplasic sites, we selected sites that changed state more than once along the tree, after inferring the states at internal nodes using ancestral state reconstruction as implemented in ClonalFrameML ( 48 ) and rooting the tree using the reference genome NC_045512.
Calculation of d n /d s . The total number of synonymous and nonsynonymous substitutions in the SARS-CoV-2 genome was estimated using the first method of ( 49 ) applied to the coding regions of the Wuhan-Hu-1 reference sequence. Overlapping reading frames were accounted for such that a substitution was considered nonsynonymous overall if it was nonsynonymous in either frame. The d n /d s ratio for iSNVs over a genomic region G was then calculated as: where is the fraction of iSNVs at p that are nonsynonymous, or 0 if there are no iSNVs at i p N p , the total number of potential nonsynonymous substitutions in G , and the denominator T G N replaces N with S to represent synonymous substitutions.
. CC-BY-NC-ND 4.0 International license perpetuity. It is made available under a preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in The copyright holder for this this version posted December 10, 2020. ; https://doi.org/10.1101/2020.05.28.118992 doi: bioRxiv preprint Phylogenetic association of iSNVs and SNPs. Where an iSNV corresponded to a consensus SNP (by the base pair involved, not simply the site), we performed ancestral state reconstruction on the consensus trees using ClonalFrameML ( 48 ) to identify all branches upon which that substitution was involved. Tips derived from the same clinical sample were then pruned until only one (the one with the highest overall depth) remained. We then, for each tip in the tree, calculated the patristic distance from that tip to the midpoint of the closest one of these branches, and used a one-tailed Mann-Whitney U-test to test for association between the iSNV existing in a sample and this distance. Multiple testing was controlled for using the Benjamini-Hochberg adjustment. As a sensitivity analysis, this was repeated such that all but one tip per infected individual, rather than per clinical sample, were pruned. These analyses were done both on an individual site level and across all sites of interest.

Phylogenetic association of iSNVs at consensus invariant positions.
For the remaining iSNVs, we calculated the extent of association with the consensus phylogeny by treating the presence of an iSNV as a discrete character and calculating the association index, and the mean patristic distance between iSNV tips. Once again the consensus tree was pruned such that tips corresponding to samples with read depth <100 at the position and all but one tip coming from the same individual were removed. A null distribution was generated by permuting the tip labels of this tree 10,000 times, and a one-sided permutation test p -value calculated. Multiple testing was adjusted for as above. In addition, for each tip in the phylogeny at each site of interest, we calculated the minimum patristic distance to a different tip corresponding to an iSNV, and used the Mann-Whitney U-test again to compare the distribution of these distances between iSNV and non-iSNV tips.
. CC-BY-NC-ND 4.0 International license perpetuity. It is made available under a preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in The copyright holder for this this version posted December 10, 2020. ; https://doi.org/10.1101/2020.05.28.118992 doi: bioRxiv preprint  . CC-BY-NC-ND 4.0 International license perpetuity. It is made available under a preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in The copyright holder for this this version posted December 10, 2020. ; https://doi.org/10.1101/2020.05.28.118992 doi: bioRxiv preprint Supplementary Figure 5 -A Across all iSNVs that reach consensus, kernel density plot of the patristic distances from iSNV tips (orange) and other tips (green) to the nearest consensus branch change of the nucleotides involved. B Across all iSNVs that do not reach consensus and occur at least twice, kernel density plot of the patristic distances from iSNV tips and other tips to the nearest iSNV tip (other than the tip itself).
. CC-BY-NC-ND 4.0 International license perpetuity. It is made available under a preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in The copyright holder for this this version posted December 10, 2020. CC-BY-NC-ND 4.0 International license perpetuity. It is made available under a preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in The copyright holder for this this version posted December 10, 2020. ; https://doi.org/10.1101/2020.05.28.118992 doi: bioRxiv preprint reduced sensitivity to convalescent sera (Li et al). The remaining 9 subfigures are for iSNVs which never reach consensus but show a p <0.025 for phylogenetic association of iSNV tips using the association index or the mean patristic distance between iSNV tips. While we lack the power to identify these once the Benjamini-Hochberg adjustment is applied, the patterns remain suggestive of transmission of iSNVs by eye. Figure 7. Relationship between terminal branch length and number of unique mapped reads (viral load indicator). Terminal branch lengths on the phylogeny of 1390 samples were plotted against the viral load as estimated from the number of uniquely mapped reads for each sample.

Supplementary
. CC-BY-NC-ND 4.0 International license perpetuity. It is made available under a preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in The copyright holder for this this version posted December 10, 2020. ; https://doi.org/10.1101/2020.05.28.118992 doi: bioRxiv preprint Supplementary Figure 8. Within-sample diversity assessed in control RNA (Twist Bioscience). Within-sample diversity was assessed in RNA controls sequenced with each sequencing batch (0.5 mln copies per reaction). At all sites where at least 2 replicates had a minor variant with minimum 3% MAF (boxplot), diversity was compared against a set of NGS reads obtained from Twist Bioscience for the ancestral stock of the in vitro transcribed RNA used in this study (red circles). Six variants were consistently recovered from both the manufacturer data and the in-batch controls, at positions 3350, 6669, 10001, 26791, 26793, 26796. To check whether the remaining within-host variants arose during the SMARTer library prep or during probe capture, we additionally resequenced two replicates of the Twist RNA without capture (blue crosses), by diluting neat RNA 50:50 v/v in Universal Human Reference RNA (UHRR) and taking a proportion for sequencing, to yield approximately 50,000 copies of the Twist control RNA per sample. We generated SMARTer libraries from these replicates, and sequenced these alongside other samples in separate batches. The two capture-free replicates had the same range of intra-sample variants as were observed in our routinely sequenced controls, implying that any differences from the manufacturer data cannot be explained by probe capture and must be the result of the SMARTer library protocol and/or stochastic variation between our laboratory aliquot and the ancestral RNA stock sequenced by Twist.
. CC-BY-NC-ND 4.0 International license perpetuity. It is made available under a preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in The copyright holder for this this version posted December 10, 2020. ; https://doi.org/10.1101/2020.05.28.118992 doi: bioRxiv preprint Tables S1-S3 Table S1. Baseline characteristics of SARS-CoV-2 samples in our dataset collected by participating hospitals in Oxford and Basingstoke, UK, between 8 March and 10 June 2020. Lineages are given for the first sample per participant, excluding anonymous samples. . CC-BY-NC-ND 4.0 International license perpetuity. It is made available under a preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in The copyright holder for this this version posted December 10, 2020. ; https://doi.org/10.1101/2020.05.28.118992 doi: bioRxiv preprint Table S2. Identified within-host variable sites. Sites with at least one minor allele at frequency >= 3% at depth of at least 100 reads, in a sample depth >=50,000 unique mapped reads. Throughout, "samples" refers to all sequencing runs, and therefore includes replicates in the totals. n_notPopConsensus refers to the number of samples in which the minor variant is not the population-level consensus (most common consensus allele); n_SNPs gives the number of SNPs on the tree; homoplasy is "TRUE" if a homoplasy exists on the tree; maf_median is the median minor allele frequency (MAF) for all samples with >2% MAF; maf_IQR is the inter-quartile range of minor allele frequencies for all samples with >2% MAF. https://github.com/katrinalythgoe/COVIDdiversity

COG-UK Full list of consortium names and affiliations
Funding acquisition, leadership, supervision, metadata curation, project administration, samples, logistics, Sequencing, analysis, and Software and analysis tools: Thomas R Connor 33,34 , and Nicholas J Loman 15 .
Project administration, metadata curation, samples, logistics, sequencing, analysis, and software and analysis tools: William L Hamilton 8,10 .
. CC-BY-NC-ND 4.0 International license perpetuity. It is made available under a preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in The copyright holder for this this version posted December 10, 2020. ; https://doi.org/10.1101/2020.05.28.118992 doi: bioRxiv preprint