Establishment and lineage dynamics of the SARS-CoV-2 epidemic in the UK

The UK’s COVID-19 epidemic during early 2020 was one of the world’s largest and unusually well represented by virus genomic sampling. Here we reveal the fine-scale genetic lineage structure of this epidemic through analysis of 50,887 SARS-CoV-2 genomes, including 26,181 from the UK sampled throughout the country’s first wave of infection. Using large-scale phylogenetic analyses, combined with epidemiological and travel data, we quantify the size, spatio-temporal origins and persistence of genetically distinct UK transmission lineages. Rapid fluctuations in virus importation rates resulted in >1000 lineages; those introduced prior to national lockdown tended to be larger and more dispersed. Lineage importation and regional lineage diversity declined after lockdown, while lineage elimination was size-dependent. We discuss the implications of our genetic perspective on transmission dynamics for COVID-19 epidemiology and control.

Infectious disease epidemics are composed of chains of transmission, yet surprisingly little is known about how co-circulating transmission lineages vary in size, spatial distribution, and persistence, or how key properties such as epidemic size and duration arise from their combined action. Although individual-level contact-tracing investigations can reconstruct the structure of small-scale transmission clusters [e.g., (1)(2)(3)], they cannot be extended practically to large national epidemics. However, recent studies of Ebola, Zika, influenza, and other viruses have demonstrated that virus emergence and spread can instead be tracked using large-scale pathogen genome sequencing [e.g., (4)(5)(6)(7)]. Such studies show that regional epidemics can be highly dynamic at the genetic level, with recurrent importation and extinction of transmission chains within a given location. In addition to measuring genetic diversity, understanding pathogen lineage dynamics can help researchers to target interventions effectively [e.g., (8,9)], track variants with potentially different phenotypes [e.g., (10,11)], and improve the interpretation of incidence data [e.g., (12,13)].
The rate and scale of virus genome sequencing worldwide during the COVID-19 pandemic has been unprecedented, with >100,000 severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) genomes shared online by 1 October 2020 (14). About half of these represent infections in the United Kingdom and were generated by the national COVID-19 Genomics UK (COG-UK) consortium (15). The UK experienced one of the largest epidemics worldwide during the first half of 2020. Numbers of positive SARS-CoV-2 tests rose in March and peaked in April; by 26 June, there had been 40,453 nationally notified COVID-19 deaths in the UK [deaths occurring ≤28 days after first positive test (16)]. Here, we combine this large genomic dataset with epidemiological and travel data to provide a full characterization of the genetic structure and lineage dynamics of the UK epidemic.
Our study encompasses the initial epidemic wave of COVID-19 in the UK and comprises all SARS-CoV-2 genomes available before 26 June 2020 (50,887 genomes, of which 26,181 were from the UK; Fig. 1A) (17). The data represent genomes from 9.29% of confirmed UK COVID-19 cases by 26 June (16). Further, using an estimate of the actual size of the UK epidemic (18), we infer that virus genomes were generated for 0.66% [95% confidence interval (CI), 0.46 to 0.95%] of all UK infections by 5 May (Fig. 1B).

Genetic structure and lineage dynamics of the UK epidemic from January to June
We first sought to identify and enumerate all independently introduced, genetically distinct chains of infection within the UK. We developed a large-scale molecular clock phylogenetic pipeline to identify "UK transmission lineages" that (i) contain two or more UK genomes and (ii) descend from an ancestral lineage inferred to exist outside of the UK (Fig. 2, A and B). Sources of statistical uncertainty in lineage assignation were taken into account (17). We identified a total of 1179 [95% highest posterior density (HPD), 1143 to 1286] UK transmission lineages. Although each is intended to capture a chain of local transmission arising from a single importation event, some UK transmission lineages will be unobserved or aggregated as a result of limited SARS-CoV-2 genetic diversity (19) or incomplete or uneven genome sampling (20,21). Therefore we expect this number to be an underestimate (17). In our phylogenetic analysis, 1650 (95% HPD, 1611 to 1783) UK genomes could not be allocated to a UK transmission lineage (singletons). Had more genomes been sequenced, it is likely that many of these singletons would have been assigned to a UK transmission lineage. Further, many singleton importations are likely to be unobserved.
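The two criteria above can be illustrated with a minimal sketch. It assumes a rooted tree whose nodes already carry an inferred location state (UK or non-UK); the `Node` class and the `find_uk_lineages` function are hypothetical names introduced here, and the sketch ignores the phylogenetic uncertainty that the actual pipeline (17) takes into account.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class Node:
    name: str
    in_uk: bool  # inferred location state, assumed to be given
    children: List["Node"] = field(default_factory=list)


def uk_tips(node: Node) -> List[str]:
    """Collect UK tips reachable from `node` through UK-labelled nodes only."""
    if not node.in_uk:
        return []
    if not node.children:
        return [node.name]
    tips: List[str] = []
    for child in node.children:
        tips.extend(uk_tips(child))
    return tips


def find_uk_lineages(root: Node) -> Tuple[List[List[str]], List[str]]:
    """Return (lineages with two or more UK genomes, singleton UK genomes)."""
    lineages: List[List[str]] = []
    singletons: List[str] = []
    stack = [(root, False)]  # (node, whether its parent is labelled UK)
    while stack:
        node, parent_in_uk = stack.pop()
        if node.in_uk and not parent_in_uk:
            # A non-UK -> UK transition marks one inferred introduction.
            tips = uk_tips(node)
            if len(tips) >= 2:
                lineages.append(tips)
            elif len(tips) == 1:
                singletons.append(tips[0])
        for child in node.children:
            stack.append((child, node.in_uk))
    return lineages, singletons


# Toy tree: one two-genome UK lineage, one UK singleton, one non-UK tip.
root = Node("root", False, [
    Node("A", True, [Node("a1", True), Node("a2", True)]),
    Node("b", True),
    Node("c", False),
])
lineages, singletons = find_uk_lineages(root)
print(lineages, singletons)
```

In this toy tree the clade under `A` satisfies both criteria, whereas `b` is a UK genome with no sampled UK relatives and is therefore counted as a singleton.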
Most transmission lineages are small, and 72.4% (95% HPD, 69.3 to 72.9%) contain <10 genomes (Fig. 2C). However, the lineage size distribution is strongly skewed and follows a power-law distribution (Fig. 2C, inset), such that the eight largest UK transmission lineages contain >25% of all sampled UK genomes (Fig. 2D; figs. S2 to S5 show further visualizations). Although the two largest transmission lineages are estimated to comprise >1500 UK genomes each, there is phylogenetic uncertainty in their sizes (95% HPDs, 1280 to 2133 and 1342 to 2011 genomes, respectively). Because our dataset constitutes only a small fraction of all UK infections, these observed lineage sizes will underestimate true lineage size. However, the true distribution of relative lineage sizes will closely match our observation, and its power-law shape indicates that almost all unobserved lineages will be small. All eight largest lineages were first detected before the UK national lockdown was announced on 23 March and, as expected, larger lineages were observed for longer (Pearson's r = 0.82; 95% CI, 0.8 to 0.83; fig. S7). The sampling frequency of lineages of varying sizes differed over time (Fig. 3A and figs. S8 and S9); whereas UK transmission lineages containing >100 genomes consistently accounted for >40% of weekly sampled genomes, the proportion of small transmission lineages (≤10 genomes) and singletons decreased over the course of the epidemic (Fig. 3A).
The detection of UK transmission lineages in our data changed markedly through time. In early March, the epidemic was characterized by lineages first observed within the previous week (Fig. 3B). The per-genome rate of appearance of new lineages was initially high, then declined throughout March and April (Fig. 3C), such that by 1 May, 96.2% of sampled genomes belonged to transmission lineages that were first observed >7 days previously. By 1 June, a growing number of lineages (>73%) had not been detected by genomic sampling for >4 weeks, which suggests that they were rare or had gone extinct; this result is robust to the sampling rate (Fig. 1, A and B, and Fig. 3C). Together, these results indicate that the UK's first epidemic wave resulted from the concurrent growth of many hundreds of independently introduced transmission lineages, and that the introduction of nonpharmaceutical interventions (NPIs) was followed by the apparent extinction of lineages in a size-dependent manner.
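Summary statistics of this kind can be computed directly from a list of lineage sizes. The following is a minimal sketch using made-up sizes, not the study data; the function names and the optional `total_genomes` argument (which allows singletons to be included in the denominator) are assumptions introduced for illustration.

```python
from typing import Optional, Sequence


def fraction_smaller_than(sizes: Sequence[int], threshold: int) -> float:
    """Fraction of lineages that contain fewer than `threshold` genomes."""
    return sum(1 for s in sizes if s < threshold) / len(sizes)


def share_in_largest(sizes: Sequence[int], k: int,
                     total_genomes: Optional[int] = None) -> float:
    """Fraction of genomes that fall in the k largest lineages.

    If `total_genomes` is omitted, the denominator is the sum of lineage
    sizes; pass the full genome count to include singletons as well.
    """
    ordered = sorted(sizes, reverse=True)
    denominator = sum(ordered) if total_genomes is None else total_genomes
    return sum(ordered[:k]) / denominator


# Hypothetical, strongly skewed lineage sizes (illustrative only).
toy_sizes = [500, 300, 50, 20, 8, 5, 3, 2, 2, 2]
print(fraction_smaller_than(toy_sizes, 10))
print(share_in_largest(toy_sizes, 2))
```

With a skewed distribution such as this one, most lineages are small even though a handful of large lineages account for most genomes.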
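The proportion of lineages that have gone undetected for more than a given period can be sketched as follows. The dates are invented for illustration, `fraction_undetected` is a hypothetical helper, and the sketch does not attempt the sampling-rate robustness checks described in the text.

```python
from datetime import date
from typing import Dict


def fraction_undetected(last_seen: Dict[str, date], as_of: date,
                        weeks: int = 4) -> float:
    """Share of lineages whose most recent genome predates `as_of`
    by more than `weeks` weeks. Lineages first seen after `as_of`
    are ignored."""
    observed = [d for d in last_seen.values() if d <= as_of]
    if not observed:
        raise ValueError("no lineages observed on or before as_of")
    inactive = sum(1 for d in observed if (as_of - d).days > 7 * weeks)
    return inactive / len(observed)


# Hypothetical date of the most recent sampled genome per lineage.
toy_last_seen = {
    "L1": date(2020, 4, 10),
    "L2": date(2020, 5, 28),
    "L3": date(2020, 3, 30),
    "L4": date(2020, 5, 5),
}
print(fraction_undetected(toy_last_seen, date(2020, 6, 1)))
```

Here two of the four toy lineages were last sampled more than 28 days before 1 June, so the function returns 0.5.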

Transmission lineage diversity and geographic range
We also characterized the spatial distribution of UK transmission lineages using available data on 107 virus genome sampling locations, which correspond broadly to UK counties or metropolitan regions (data S1). Although genomes were not collected randomly [some lineages and regions will be overrepresented because of targeted investigation of local outbreaks; e.g., (22)], the number of UK lineages detected in each region correlates with the number of genomes sequenced (  transmission lineages, and larger lineages were more geographically widespread. These observations indicate substantial dissemination of a subset of lineages across the UK and suggest that many regions experienced a series of introductions of new lineages from elsewhere, potentially hindering the impact of local interventions. We quantified the substantial variation among regions in the diversity of transmission lineages present using Shannon's index (SI; this value increases as both the number of lineages and the evenness of their frequencies increase; Fig. 4C and data S3). We observed the highest SIs in Hertfordshire (4.77), Greater London (4.62), and Essex (4.49); these locations are characterized by frequent commuter travel to or within London and proximity to major international airports (23). Locations with the three lowest nonzero SIs were in Scotland (Stirling = 0.96, Aberdeenshire = 1.04, Inverclyde = 1.32; Fig. 4C). We speculate that regional differences in transmission lineage diversity may be related to the level of connectedness to other regions.
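Shannon's index for a region can be computed from the lineage assignments of its sampled genomes. The sketch below assumes the standard form, SI = -Σ p_i ln p_i with the natural logarithm; the text does not state which logarithm base was used, so the base is an assumption, and the labels are invented.

```python
import math
from collections import Counter
from typing import Iterable


def shannon_index(lineage_labels: Iterable[str]) -> float:
    """Shannon's index of lineage diversity (natural log, assumed base)."""
    counts = Counter(lineage_labels)
    total = sum(counts.values())
    return -sum((n / total) * math.log(n / total) for n in counts.values())


# Hypothetical regions: same number of lineages, different evenness.
even_region = ["A", "B", "C", "D"] * 5
uneven_region = ["A"] * 17 + ["B", "C", "D"]
print(shannon_index(even_region), shannon_index(uneven_region))
```

Both toy regions contain four lineages, but the region dominated by a single lineage has a lower index, which reflects the evenness component described above.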
To illustrate temporal trends in transmission lineage diversity, we plotted SI through time for each of the UK's national capital cities (Fig. 4D). Lineage diversities in each peaked in late March and declined after the UK national lockdown, congruent with Fig. 3, C and D. Greater London's epidemic was the most diverse and was characterized by an early, rapid rise in SI (Fig. 4D), consistent with epidemiological trends there (16,24). Belfast's lineage diversity was notably lower (data S4 shows other locations).
We observe variation in the spatial range of individual UK transmission lineages. Although some lineages are widespread, most are more localized and the range size distribution is right-skewed (fig. S11), congruent with an observed abundance of small lineages (Figs. 2C and 4B) and biogeographic theory [e.g., (25)]. For example, lineage DTA_13 is geographically dispersed (>50% of sequence pairs sampled >234 km apart), whereas DTA_290 is strongly local (95% of sequence pairs sampled <100 km apart) and DTA_62 has multiple foci of sampled genomes (Fig. 4E and fig. S12). The national distribution of cases therefore arose from the aggregation of multiple heterogeneous lineage-specific patterns.
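Pairwise-distance summaries such as these can be sketched from sampling coordinates. The example assumes great-circle (haversine) distances with a mean Earth radius of 6371 km and uses approximate city coordinates as stand-ins for sampling locations; how distances were measured in the study is not stated here, so both choices are assumptions.

```python
import math
from itertools import combinations
from typing import Sequence, Tuple

Coord = Tuple[float, float]  # (latitude, longitude) in degrees


def haversine_km(a: Coord, b: Coord) -> float:
    """Great-circle distance in km between two points (Earth radius 6371 km)."""
    lat1, lon1, lat2, lon2 = map(math.radians, (a[0], a[1], b[0], b[1]))
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * 6371.0 * math.asin(math.sqrt(h))


def fraction_pairs_beyond(coords: Sequence[Coord], threshold_km: float) -> float:
    """Fraction of all genome pairs sampled more than `threshold_km` apart."""
    pairs = list(combinations(coords, 2))
    if not pairs:
        raise ValueError("need at least two sampling locations")
    return sum(haversine_km(a, b) > threshold_km for a, b in pairs) / len(pairs)


# Approximate coordinates, one hypothetical genome per city.
london, edinburgh, cardiff = (51.5074, -0.1278), (55.9533, -3.1883), (51.4816, -3.1791)
print(fraction_pairs_beyond([london, edinburgh, cardiff], 234))
```

A lineage would be described as dispersed when this fraction is high for a large threshold, and as local when nearly all pairs fall below a small one.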

Dynamics of international introduction of transmission lineages
The process by which transmission lineages are introduced to an area is an important aspect of early epidemic growth [e.g., (26)]. To investigate this at a national scale, we estimated the rate and source of SARS-CoV-2 importations into the UK. Because standard phylogeographic approaches were precluded by strong biases in genome sampling among countries (20), we developed a new approach that combines virus phylogenetics with epidemiological and travel data. First, we estimated the TMRCA (time of the most recent common ancestor) of each UK transmission lineage (17). The TMRCAs of most UK lineages are dated to March and early April [median = 21 March; interquartile range (IQR) = 14 to 29 March]. UK lineages with earlier TMRCAs tend to be larger and longer-lived than those whose TMRCAs postdate the national lockdown (Fig. 5A and fig. S15).
Because of incomplete sampling, TMRCAs best represent the date of the first inferred transmission event in a lineage, not its importation date (Fig. 2B). To infer the latter and to quantify the delay between importation and onward within-UK transmission, we generated daily estimates of the number of travelers arriving in the UK and of global SARS-CoV-2 infections (17). Before March, the UK received ~1.75 million inbound travelers per week (school holidays explain the end-February ~10% increase; Fig. 5B). International arrivals fell by ~95% during March, and this reduction was maintained through April. Elsewhere, estimated numbers of infectious cases peaked in late March (Fig. 5B). We combined these two trends to generate an estimated importation intensity (EII), a daily empirical measure of the intensity of SARS-CoV-2 importation into the UK (17). Because both travel volumes and epidemic incidence fluctuate rapidly over orders of magnitude, the EII is robust to other sources of variation in the relative importation risk among countries (17). The EII peaked in mid-March, when high UK inbound travel volumes coincided with growing numbers of infectious cases elsewhere (Fig. 5, B and C).
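One simple way to combine the two trends is sketched below. It assumes that the daily intensity is the sum, over source countries, of inbound arrivals multiplied by the estimated proportion of infectious people in that country; this weighting is an illustrative assumption and is not necessarily the exact formulation in (17). All numbers, country labels, and the function name are hypothetical.

```python
from typing import Dict, List


def estimated_importation_intensity(
    arrivals: Dict[str, List[float]],
    prevalence: Dict[str, List[float]],
) -> List[float]:
    """Daily intensity = sum over countries of arrivals x infectious prevalence.

    `arrivals[c][t]` is the number of travelers arriving from country c on
    day t; `prevalence[c][t]` is the assumed infectious fraction in c on day t.
    """
    n_days = len(next(iter(arrivals.values())))
    eii = [0.0] * n_days
    for country, daily_arrivals in arrivals.items():
        daily_prevalence = prevalence[country]
        for t in range(n_days):
            eii[t] += daily_arrivals[t] * daily_prevalence[t]
    return eii


# Toy data: travel collapses on day 2 while prevalence keeps rising.
toy_arrivals = {"X": [1000, 1000, 100], "Y": [500, 500, 50]}
toy_prevalence = {"X": [0.001, 0.01, 0.02], "Y": [0.0, 0.002, 0.01]}
toy_eii = estimated_importation_intensity(toy_arrivals, toy_prevalence)
print(toy_eii)
```

In the toy data the intensity peaks on the middle day, when travel volumes are still high and prevalence abroad has begun to rise, which mirrors the mid-March peak described above.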
Crucially, the EII's temporal profile closely matches, but precedes, that of the TMRCAs of UK transmission lineages (Fig. 5, A and C). The difference between the two represents the "importation lag," the time elapsed between lineage importation and the first detected local transmission event (Fig. 2B). Using a statistical model (17), we find that the importation lag depends on lineage size (table S2). This size dependency likely arises because the earliest transmission event in a lineage is more likely to be captured if it contains many genomes (Fig. 2B) (17). We use this model to impute an importation date for each UK transmission lineage (Fig. 5D). Using these values, we estimated the numbers of inferred importations each day attributable to inbound travel from each source location. This assignment is statistical and does not take the effects of superspreading events into account. As with the rate of importation (Fig. 5A), the relative contributions of arrivals from different countries were dynamic (Fig. 5D). Dominant source locations shifted rapidly in February and March, and the diversity of source locations increased in mid-March (fig. S17). The earliest importations were most likely from China or elsewhere in Asia but were rare relative to those from Europe. Over our study period, we infer that ~33% of UK transmission lineages stemmed from arrivals from Spain, 29% from France, 12% from Italy, and 26% from elsewhere (fig. S20 and table S4). These large-scale trends were not apparent from individual-level travel histories; routine collection of such data ceased on 12 March (27).

Conclusions
The exceptional size of our genomic survey provides insight into the micro-epidemiological patterns that underlie the features of a large, national COVID-19 epidemic, allowing us to quantify the abundance, size distribution, and spatial range of transmission lineages. Before the lockdown, high travel volumes and few restrictions on international arrivals (Fig. 5B and table S5) led to the establishment and cocirculation of >1000 identifiable UK transmission lineages (Fig. 5A), jointly contributing to accelerated epidemic growth that quickly exceeded national contact-tracing capacity (27). The relative contributions of importation and local transmission to initial epidemic dynamics under such circumstances warrant further investigation. We expect that similar trends occurred in other countries with comparably large epidemics and high international travel volumes; virus genomic studies from regions with smaller or controlled COVID-19 epidemics have reported high importation rates followed by more transient lineage persistence [e.g., (28)(29)(30)].
Earlier lineages were larger, more dispersed, and harder to eliminate, highlighting the importance of rapid or preemptive interventions in reducing transmission [e.g., (31)(32)(33)]. The high heterogeneity in SARS-CoV-2 transmission at the individual level (34)(35)(36) appears to extend to whole transmission lineages, such that >75% of sampled viruses belong to the top 20% of lineages ranked by size. Although the national lockdown coincided with limited importation and reduced regional lineage diversity, its impact on lineage extinction was size-dependent (Fig. 3, B and C). The overdispersed nature of SARS-CoV-2 transmission likely exacerbated this effect (37), thereby favoring, as the epidemic reproduction number (R_t) declined, greater survival of larger widespread lineages and faster local elimination of lineages in low-prevalence regions. The degree to which the surviving lineages contributed to the UK's ongoing second epidemic, including the effect of specific mutations on lineage growth rates [e.g., (11)], is currently under investigation. The transmission structure and dynamics measured here provide a new context in which future public health actions at regional, national, and international scales should be planned and evaluated.