Recurrent deletions in the SARS-CoV-2 spike glycoprotein drive antibody escape

Eluding detection
Influenza viruses evade immunity initiated by previous infection, which explains recurrent influenza pandemics. Unlike the error-prone RNA-dependent RNA polymerase of influenza, severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) and related viruses contain polymerases with proofreading activity. However, proofreading cannot correct deletions, which during a long-term persistent infection could result in the generation of viruses showing alteration of entire stretches of amino acids and the structures they form. McCarthy et al. identified an evolutionary signature defined by prevalent and recurrent deletions in the spike protein of SARS-CoV-2 at four antigenic sites. Deletion variants show human-to-human transmission of viruses with altered antigenicity. Science, this issue p. 1139

Severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) emerged from a yet-to-be-defined animal reservoir and initiated a pandemic in 2020 (1)(2)(3)(4)(5). It has acquired limited adaptations, most notably the Asp 614 → Gly (D614G) substitution in the spike (S) glycoprotein (6)(7)(8). Humoral immunity to S glycoprotein appears to be the strongest correlate of protection (9), and recently approved vaccines deliver this antigen by immunization. Coronaviruses such as SARS-CoV-2 acquire substitutions slowly as the result of a proofreading RNA-dependent RNA polymerase (RdRp) (10,11). Other emerging respiratory viruses have produced pandemics followed by endemic human-to-human spread. The latter is often contingent upon the introduction of antigenic novelty that enables reinfection of previously immune individuals. Whether SARS-CoV-2 S glycoprotein will evolve altered antigenicity, or specifically how it may change in response to immune pressure, remains unknown. We and others have reported the acquisition of deletions in the N-terminal domain (NTD) of the S glycoprotein during long-term infections of immunocompromised patients (12)(13)(14)(15). We have identified this as an evolutionary pattern defined by recurrent deletions that alter defined antibody epitopes. Unlike substitutions, deletions cannot be corrected by proofreading activity, and this may accelerate adaptive evolution in SARS-CoV-2.
An immunocompromised cancer patient infected with SARS-CoV-2 was unable to clear the virus and succumbed to the infection 74 days after COVID-19 diagnosis (15). Treatment included remdesivir, dexamethasone, and two infusions of convalescent serum. We designate this individual as Pittsburgh long-term infection 1 (PLTI1). We consensus-sequenced and cloned S genes directly from clinical material obtained 72 days after COVID-19 diagnosis and identified two variants with deletions in the NTD (Fig. 1A).
These data from PLTI1 and a similar report (12) prompted us to interrogate patient metadata sequences deposited in GISAID (16). In searching for similar viruses, we identified eight patients with deletions in the S glycoproteins of viruses sampled longitudinally over a period of weeks to months (Fig. 1A and fig. S1A). For each, early time points had intact S sequences and later time points had deletions within the S gene. Six had deletions that were identical to, overlapping with, or adjacent to those in PLTI1. Deletions at a second site were present in viruses isolated from two other patients (Fig. 1B); reports on these patients have since been published (13,14). Viruses from all but one patient could be distinguished from one another by nucleotide differences present at both early and late time points (fig. S1B). On a tree of representative contemporaneously circulating isolates, they form monophyletic clades, making either a second community-acquired or nosocomially acquired infection unlikely (fig. S1C). The most parsimonious explanation is that these deletions arose independently as the result of a common selective pressure to produce strikingly convergent outcomes.
We searched the GISAID sequence database (16) for additional instances of deletions within S glycoproteins. From a dataset of 146,795 sequences (deposited from 1 December 2019 to 24 October 2020) we identified 1108 viruses with deletions in the S gene. When mapped to the S gene, 90% of these deletions occupied four discrete sites within the NTD (Fig. 2A). We term these important sites recurrent deletion regions (RDRs), numbering them 1 to 4 from the 5′ to the 3′ end of the S gene. Deletions identified in patient samples correspond to RDR2 (Fig. 1A) and RDR4 (Fig. 1B). Most deletions appear to have arisen and been retained in replication-competent viruses. Without selective pressure, in-frame deletions should occur one-third of the time. However, we observed a preponderance of in-frame deletions with lengths of 3, 6, 9, and 12 (Fig. 2B). Among all deletions, 93% are in frame and do not produce a stop codon (Fig. 2C). In the NTD, >97% of deletions maintain the open reading frame. Other S glycoprotein domains do not follow this trend; for example, deletions in the receptor binding domain (RBD) and S2 preserve the reading frame 30% and 37% of the time, respectively.
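The reading-frame argument above can be illustrated with a minimal Python sketch. It is not the authors' pipeline: the function names `find_deletions`, `is_in_frame`, and `in_frame_fraction` are hypothetical, and the input is assumed to be a pair of already-aligned nucleotide strings in which `-` marks a gap. A deletion preserves the reading frame only when its length is a multiple of three, so lengths drawn without bias would be in frame about one-third of the time.

```python
from typing import List, Tuple


def find_deletions(ref_aln: str, qry_aln: str) -> List[Tuple[int, int]]:
    """Return (start, length) for each run of query gaps opposite reference bases.

    Both strings are assumed to come from one pairwise alignment, so they
    must have equal length. `start` is the 1-based reference position of
    the first deleted nucleotide.
    """
    if len(ref_aln) != len(qry_aln):
        raise ValueError("aligned sequences must have equal length")
    deletions: List[Tuple[int, int]] = []
    ref_pos = 0          # number of reference bases consumed so far
    run_start = None     # reference position where the current gap run began
    run_len = 0
    for r, q in zip(ref_aln, qry_aln):
        if r != "-":
            ref_pos += 1
        if q == "-" and r != "-":
            if run_start is None:
                run_start, run_len = ref_pos, 0
            run_len += 1
        elif run_start is not None:
            # Gap run ended: record it and reset.
            deletions.append((run_start, run_len))
            run_start = None
    if run_start is not None:
        deletions.append((run_start, run_len))
    return deletions


def is_in_frame(length: int) -> bool:
    """A deletion keeps the reading frame only if it removes whole codons' worth of bases."""
    return length % 3 == 0


def in_frame_fraction(lengths: List[int]) -> float:
    """Fraction of deletion lengths that are multiples of three."""
    if not lengths:
        return 0.0
    return sum(is_in_frame(n) for n in lengths) / len(lengths)
```

For example, `in_frame_fraction(list(range(1, 13)))` evaluates to one-third, the expectation stated in the text for deletion lengths sampled without selection, which is the baseline against which the observed 93% is striking.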
To trace the origins of RDR variants, we produced phylogenies for each with 101 additional genomes that sample much of the genetic diversity within the pandemic (Fig. 2D).
The RDR variants interleave with nondeletion sequences and occupy distinct branches, indicating their recurrent generation. This is most pronounced for RDRs 1, 2, and 4 but is also true of RDR3, with conservatively four independent instances. RDR variants form distinct lineages/branches, most prominently in RDR1 (lineage B.1.258), and suggest human-to-human transmission events. Using sequences with sufficient metadata to explicitly differentiate individuals, we verified the transmission of a variant within each RDR between people (fig. S2).
We defined the RDRs on the basis of peaks in the spectrum of S glycoprotein deletions. Deletion lengths and positions vary within RDRs 1, 2, and 4 (Fig. 2E). Variation is greatest in RDRs 2 and 4, with the loss of S glycoprotein residues 144/145 (adjacent tyrosine codons) in RDR2 and residues 243 and 244 in RDR4 appearing to be favored. In contrast, the loss of residues 69 and 70 accounts for the vast majority of RDR1 deletions. On the basis of our phylogenetic analysis and accompanying lineage classifications, this two-amino acid deletion has arisen independently at least 13 times. RDR3 largely consists of three-nucleotide deletions in codon 220.
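One way to picture "peaks in the spectrum of deletions" is to tally, for every codon of the S gene, how many observed deletions touch it, and then report contiguous stretches above a cutoff. The Python sketch below does this under stated assumptions: the helper names, the `min_count` cutoff, and the toy input are all hypothetical, deletions are given as (1-based nucleotide start within the S gene, length), and a codon counts as affected if any of its three nucleotides is removed. The authors' actual procedure for delimiting the RDRs is not described at this level of detail.

```python
from typing import List, Tuple

SPIKE_CODONS = 1273  # length of the SARS-CoV-2 S glycoprotein in amino acids


def deleted_codons(start_nt: int, length: int) -> List[int]:
    """1-based codon numbers touched by a deletion of `length` nt starting at `start_nt`."""
    first = (start_nt - 1) // 3 + 1
    last = (start_nt + length - 2) // 3 + 1
    return list(range(first, last + 1))


def codon_coverage(deletions: List[Tuple[int, int]], n_codons: int) -> List[int]:
    """Count, per codon, how many deletions touch it (index 0 is unused)."""
    counts = [0] * (n_codons + 1)
    for start_nt, length in deletions:
        for codon in deleted_codons(start_nt, length):
            if 1 <= codon <= n_codons:
                counts[codon] += 1
    return counts


def call_regions(counts: List[int], min_count: int) -> List[Tuple[int, int]]:
    """Merge consecutive codons with coverage >= min_count into (first, last) regions."""
    regions: List[Tuple[int, int]] = []
    start = None
    for codon in range(1, len(counts)):
        if counts[codon] >= min_count:
            if start is None:
                start = codon
        elif start is not None:
            regions.append((start, codon - 1))
            start = None
    if start is not None:
        regions.append((start, len(counts) - 1))
    return regions


# Toy input (not GISAID data): codon-aligned deletions at residues 69-70,
# 144, and 243-244, plus one isolated deletion that falls below the cutoff.
toy = [(205, 6)] * 5 + [(430, 3)] * 3 + [(727, 6)] * 2 + [(1000, 3)]
peaks = call_regions(codon_coverage(toy, SPIKE_CODONS), min_count=2)
```

With this toy input, `peaks` is `[(69, 70), (144, 144), (243, 244)]`; the single deletion at codon 334 is excluded because it does not recur. Real deletions are often out of phase with codon boundaries, so counting "touched" codons is a simplification.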
We evaluated the genetic, geographic, and temporal sampling of RDR variants (Fig. 3, A and B). This analysis was limited to sequences deposited in GISAID (16), where sequences from specific nations and regions are overrepresented (e.g., United Kingdom and other European countries). We show the distribution of all sequences within the database for reference. For RDR2 and RDR4, the genetic and geographic distributions largely mirror those of reported sequences. Variants of RDR1 and RDR3 are strongly polarized to specific clades and geographies. This is likely the result of successful lineages circulating in regions with strong sequencing initiatives. Our temporal analysis indicates that RDR variants have been present throughout the pandemic (Fig. 3C). Specific variant lineages such as B.1.258 (Fig. 2D) harboring Δ69-70 in RDR1 have rapidly risen to notable abundance (Fig. 3D). Circulation of B.1.36 with RDR3 Δ210 accounts for most of the RDR3 examples (Fig. 2D and Fig. 3, C and D). The abundance of RDR2 Δ144/145 is explained by independent deletion events followed by transmission (Fig. 2D and Fig. 3, C and D).
The recurrence and convergence of RDR deletions, particularly during long-term infections, is indicative of adaptation in response to a common selective pressure. RDRs 2 and 4 and RDRs 1 and 3 occupy two distinct surfaces on the S glycoprotein NTD (Fig. 4A). Both sites contain antibody epitopes (17)(18)(19). The epitope for neutralizing antibody 4A8 is formed entirely by the β sheets and extended connecting loops that harbor RDRs 2 and 4 (17). We generated a panel of S glycoprotein mutants representing the four RDRs to assess the impact of deletions on expression and antibody binding; we included an additional double mutant containing the deletions present in the B.1.1.7 variant of concern flagged initially in the United Kingdom. Cells were transfected with plasmids expressing these mutant glycoproteins, and indirect immunofluorescence was used to determine whether RDR deletions modulated 4A8 binding (Fig. 4B). Deletions at RDRs 1 and 3 had no impact on the binding of the monoclonal antibody, confirming that they alter independent sites. The three RDR2 deletions, the one RDR4 deletion, and the double RDR1/2 deletions completely abolished binding of 4A8 while still allowing recognition by a monoclonal antibody targeting the RBD (Fig. 4B). Thus, convergent evolution operates in individual RDRs and between RDRs, as exemplified by the same phenotype produced by deletions in RDR2 or RDR4. We used the non-plaque-purified viral population from PLTI1 to determine whether RDR variants escape the activity of a neutralizing antibody. This viral stock was completely resistant to neutralization by 4A8, whereas an isolate with authentic RDRs (20) was neutralized (Fig. 4C). We used a high-titer neutralizing human convalescent polyclonal antiserum to demonstrate that both viral stocks could be neutralized efficiently. These data demonstrate that naturally arising and circulating variants of SARS-CoV-2 have altered antigenicity.
We used a range of high-, medium-, and low-titer neutralizing human convalescent polyclonal antisera to assess whether there was an appreciable difference in neutralization between the S glycoprotein-deleted and undeleted viruses. No major difference was observed, which suggests that many more changes would be required to generate serologically distinct SARS-CoV-2 variants (table S1).
Coronaviruses, including SARS-CoV-2, have lower substitution rates than other RNA viruses because of an RdRp with proofreading activity (10,11). However, proofreading cannot correct deletions. We find that adaptive evolution of S glycoprotein is augmented by a tolerance for deletions, particularly within RDRs. The RDRs occupy defined antibody epitopes within the NTD (17)(18)(19), and deletions at multiple sites confer resistance to a neutralizing antibody. Deletions represent a generalizable mechanism through which the SARS-CoV-2 S glycoprotein rapidly acquires genetic and antigenic novelty.
The fitness of RDR variants is evident by their representation in the consensus genomes from patients, transmission between individuals, and presence in emergent lineages. Initially documented in the context of long-term infections of immunosuppressed patients, specific variants transmit efficiently between immunocompetent individuals. Characterization of these cases led to the very early identification of RDR variants that are escape mutants. Because deletions are a product of replication, they will occur at a certain rate and variants are likely to emerge in otherwise healthy populations. Indeed, influenza explores variation that approximates future antigenic drift in immunosuppressed patients (21).
The RDRs occupy defined antibody epitopes within the S glycoprotein NTD. Selected in vivo, these deletion variants resist neutralization by monoclonal antibodies. Viruses cultured in vitro in the presence of immune serum have also acquired substitutions in RDR2 that confer neutralization resistance (22). Potent neutralizing responses and an array of monoclonal antibodies are directed to the RBD (18,19,23). A growing number of NTD-directed antibodies have been identified (24,25). Why antibody escape in nature is nonetheless most evident in the NTD is an unresolved discrepancy that requires further study.
Defining recurrent, convergent patterns of adaptation can provide predictive potential. From viral sequences, we have identified a pattern of deletions, contextualized their outcomes in protein structure and antibody epitope(s), and characterized their functional impact on antigenicity. During evaluation of this manuscript, multiple lineages with altered antigenicity and perhaps increased transmissibility have emerged and spread. These variants of global concern are RDR variants and include Mink Cluster 5 Δ69-70 (26), B.1.1.7 Δ69-70 and Δ144/145 (27), as well as B.1.351 Δ242-244 (28). Our analysis preceded the description of these lineages. We had demonstrated that identical or similar recurrent deletions that alter positions 144/145 and 243-244 in the S glycoprotein disrupt binding of antibody 4A8, which defines an immunodominant epitope within the NTD. Our survey for deletion variants captured the first representative of what would become the B.1.1.7 lineage. These real-world outcomes demonstrate the predictive potential of this and like approaches and show the need to monitor viral evolution carefully and continually.
Additional circulating RDR variants have gone virtually unnoticed. Are they intermediates on a pathway of immune evasion? That remains to be determined. However, deletions and substitutions within major NTD and RBD epitopes will likely continue to contribute to that process, as they have already in current variants of concern. The progression of adaptations in both immunocompromised patients and SARS-CoV-2 variants of concern remains to be resolved. Their evolution has thus far converged. The recurrence of adaptations in single patients and on global scales underscores the need to track and monitor deletion variants.