Undiagnosed SARS-CoV-2 seropositivity during the first 6 months of the COVID-19 pandemic in the United States

16.8 million SARS-CoV-2 infections in the US went undiagnosed in the first 6 months of the pandemic compared to 3.5 million diagnosed infections.


INTRODUCTION
Coronavirus disease 2019 (COVID-19), the disease caused by severe acute respiratory syndrome coronavirus type 2 (SARS-CoV-2) infection, presents with a spectrum of illness ranging from asymptomatic to severe disease. As with most respiratory viral diseases, it is difficult to estimate the true prevalence of the disease during a pandemic and the extent of its spread is only known after extensive study (1)(2)(3). Most patients infected with SARS-CoV-2 develop robust antibody responses against the viral spike protein, nucleocapsid protein, and the envelope protein that can be detected by serological testing (4)(5)(6)(7)(8). Antibodies against spike protein persist for months and can neutralize SARS-CoV-2 (9). Frequently, these neutralizing antibodies bind to the receptor binding domain (RBD) of the spike protein, but antibodies against the spike protein S2 domain have also been observed (10)(11)(12)(13)(14)(15).
To characterize the spread of SARS-CoV-2 infection in the United States, we evaluated seropositivity in a national survey of participants who had not previously been diagnosed with SARS-CoV-2 infection. We used quota sampling from a large pool of volunteers (n = 462,949) to obtain a representative sample (n = 9089) and performed statistical weighting to generate prevalence estimates that revealed the extent of SARS-CoV-2 infection in the general population. To ensure accurate classification of seropositivity, we used our dual-antigen enzyme-linked immunosorbent assay (ELISA) protocol that evaluated immunoglobulin G (IgG) and IgM antibodies against both the full viral spike protein ectodomain and the RBD (8,16).

Enrollment and demographic representation
Recruitment took place from 1 April 2020 to 4 August 2020. During that time, 11,283 participants were enrolled from a pool of 462,949 volunteers in the United States (50 states and the District of Columbia). Of these participants, 214 had blood collected via venipuncture and 11,069 were sent volumetric dried blood microsamplers (absorbent polymer, 20-μl collection volume). More than 80% of the microsamplers were returned (9089 participants). Ultimately, 9028 participant blood samples were analyzed using ELISA for the presence of anti-SARS-CoV-2 spike protein antibodies. Of those, 8058 participants had a complete clinical questionnaire and were included in the weighted analysis (Fig. 1). Most blood sample collection (>88%) occurred within the 11-week period between 10 May and 31 July 2020 (figs. S1 and S2). The six major demographic factors used in participant selection are summarized in Table 1. Participant sampling was representative of the U.S. population. When expanded to include the additional 10 demographic or health-related factors captured by the Behavioral Risk Factor Surveillance System (BRFSS), many factors were well matched, but there were some differences: our sample population was more highly educated, had higher employment rates, and had better access to health care than the general U.S. population (Table 1).

Estimates of seroprevalence
There were 304 seropositive participants in the analysis set (Fig. 2). This gave a weighted estimate of 4.6% of the undiagnosed adults in the U.S. population who were seropositive for SARS-CoV-2 infection [95% confidence interval (CI), 2.6% to 6.5%, n = 8058 with complete testing and survey]. Using this average rate over the study period, we estimated that there were 4.8 undiagnosed SARS-CoV-2 infections for each diagnosed case over the course of the study (95% CI, 2.8 to 6.8). Among seropositive participants, 36.51% were IgG+IgM+IgA+, 28.29% were IgG+IgM−IgA+, 17.11% were IgG+IgM−IgA−, 13.16% were IgG+IgM+IgA−, 4.28% were IgG−IgM+IgA−, and 0.66% were IgG−IgM+IgA+ (Fig. 2, A to D, and fig. S3). There were variations in antibody profiles across different demographic groups, specifically in anti-spike protein and anti-RBD IgG antibodies (figs. S4 and S5). We found regional variations in seroprevalence estimates across the United States (Figs. 2E and 3). The Northeast and Mid-Atlantic regions showed the highest rates of seropositivity, whereas the lowest seropositivity was in the Midwest. Urban areas had higher point estimates of seropositivity (5.3%) than rural areas (1.1%) at the time blood samples were collected. Estimates of seroprevalence were calculated for other demographic subgroups (Fig. 3). The youngest age group, 18 to 44 years, had the highest estimated seropositivity (5.9%). Estimated seroprevalence was 5.5% for females and 3.5% for males. The seroprevalence estimate was highest for African Americans (14.2%), followed by participants who self-identified as other/unlisted race (11.1%), American Indian/Alaska Native (6.8%), and White/Caucasian (3.1%), whereas those identifying as Asian had the lowest seroprevalence estimate (2.0%). Participants who reported a known exposure to a SARS-CoV-2-infected individual had a higher seroprevalence estimate (15.6%) than those who did not (2.7%). In comparison to the national average (4.6%), those who worked from home had a lower seropositivity estimate of 3.0%. Those who reported previous vaccination had lower estimates of undiagnosed seropositivity (influenza, 3.2%; pneumonia, 2.3%). Those who had health conditions associated with poor outcomes in SARS-CoV-2 infection, including coronary heart disease, asthma, and diabetes, displayed lower rates of seropositivity (Fig. 4). Other health conditions, such as skin cancer, stroke, or arthritis, were also correlated with a decreased seropositivity rate.
Fig. 1. SARS-CoV-2 serosurvey study overview and statistical workflow. A flow chart of participant recruitment through data analysis displays steps in data acquisition and lists participant attrition. Ovals show the start and end of data analysis or data acquisition; gray rectangles indicate subsets of participants in this study; blue parallelograms represent individuals from outside data sets that contributed to adjusted prevalence estimates; blue rounded rectangles present analysis processes.
Table 1. Characteristics of the serosurvey population compared to the U.S. population. Census and BRFSS (2018) data on selection criteria were used for quota-based sampling in our SARS-CoV-2 serosurvey. Other values from BRFSS were used for statistical weighting. The table compares the estimated proportion of the U.S. population in each category, according to weighted BRFSS data, with our sample population in the SARS-CoV-2 serosurvey. NLF, not in the labor force (student, retired, unable to work, refused to answer, not asked/missing).

Our results estimate that as of July 2020, there were about 4.79 undiagnosed infections (95% CI, 2.76 to 6.82; fig. S6) for every identified case of COVID-19, suggesting a potential 16.8 million undiagnosed infections by July 2020 in addition to the reported 3.5 million diagnosed cases in the United States. These data suggest that a higher level of infection-induced immunity exists in the U.S. population than previously predicted.
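The arithmetic behind the headline figure can be checked directly. The sketch below multiplies the reported ratio and its confidence bounds by the 3.5 million diagnosed cases. Applying the interval bounds to a fixed case count is a simplifying assumption made here for illustration, not a calculation reported in the study, and the function name is hypothetical.

```python
# Values quoted in the text above; the names are illustrative only.
DIAGNOSED_CASES = 3.5e6      # reported diagnosed cases by July 2020
RATIO_POINT = 4.79           # undiagnosed infections per diagnosed case
RATIO_CI = (2.76, 6.82)      # 95% CI for that ratio


def implied_undiagnosed(ratio: float, diagnosed: float = DIAGNOSED_CASES) -> float:
    """Undiagnosed infections implied by an undiagnosed:diagnosed ratio."""
    return ratio * diagnosed


point = implied_undiagnosed(RATIO_POINT)
# Assumption: the ratio's CI bounds are applied to the same fixed case count.
low, high = (implied_undiagnosed(r) for r in RATIO_CI)

print(f"point estimate: {point / 1e6:.1f} million")
print(f"range from CI bounds: {low / 1e6:.1f} to {high / 1e6:.1f} million")
```

The point estimate rounds to the 16.8 million quoted in the text.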

DISCUSSION
These results, including the subgroup analysis, provide us a previously undescribed view into the spread of the COVID-19 pandemic by more clearly identifying the large numbers of individuals with undiagnosed infections during the initial months of the pandemic. These data are of great importance as we consider the impact vaccination may have on the future course of the pandemic and plan for current and future available vaccines to be administered. In addition, these data can also help us to better assess the public health measures taken during the pandemic and how to take the best approaches forward during any future public health emergencies.
This study demonstrates that spread of the SARS-CoV-2 virus in the United States during the first 6 months of the pandemic was more extensive than has been suggested by data reporting diagnostic test-confirmed cases. As with other respiratory viruses, such as influenza, many individuals develop asymptomatic or mild disease that is not medically attended and therefore never diagnosed. Our findings indicate that there are nearly five individuals with a previous undiagnosed infection for every diagnosed case of COVID-19. Furthermore, patterns of our seroprevalence data match well with those of diagnosed cases reported during a similar time frame (17). For example, the greater seropositivity estimated in densely populated urban areas follows the observed initial spread of SARS-CoV-2. In comparison to the national average, we found that the Midwest, South, and West had lower seroprevalence rates during the study time frame, which preceded a substantial increase in SARS-CoV-2 infections in these regions detected by viral testing. Our data suggest that the youngest age group had the highest undiagnosed seroprevalence, which is consistent with observations that they display less severe symptoms than older patients (18). We also found higher undiagnosed seroprevalence in females, possibly suggesting a higher risk for asymptomatic disease. Participants with chronic diseases that are more likely to be associated with severe clinical manifestations of COVID-19, including diabetes, heart disease, and asthma, had a lower prevalence of asymptomatic SARS-CoV-2 infection in comparison to the national average. Those with known exposure to SARS-CoV-2-infected individuals had a higher estimated prevalence of undiagnosed seropositivity. We also found that African American and Hispanic participants had higher undiagnosed seropositivity, correlating with national data on disease burden in these subgroups.

Our study reports a representative population sample across the United States and evaluated regional, demographic, and socioeconomic differences in the prevalence of asymptomatic SARS-CoV-2 infection. In contrast, other reports of seroprevalence data focus on specific groups of individuals or geographic locations, such as dialysis patients or individuals who reported for blood draws that may be biased toward those needing medical care during the pandemic (19)(20)(21)(22)(23)(24)(25)(26)(27)(28)(29)(30)(31)(32)(33)(34)(35)(36). These previous studies came within the range of our estimate of undiagnosed cases when considering the additional diagnosed cases within the same time frame. Our results provide new insight into the spread of SARS-CoV-2, estimating the national undiagnosed exposure rate to illuminate the scope of infection during the first 6 months of the pandemic. As expected, given delayed arrival in different geographic areas such as the Midwest and rural South, undiagnosed infection estimates varied by region, with the Mid-Atlantic region having the largest proportion of undiagnosed infections in comparison to diagnosed cases. Given the high point estimate of undiagnosed seropositivity in younger participants, lower point estimates in individuals with preexisting conditions such as diabetes, and the vaccine rollout starting with older persons and those at risk, we could see a faster onset of herd immunity due to these undiagnosed infections in populations that are in lower priority groups for vaccination. Young and healthy individuals, such as those under the age of 16 who were not eligible for the first wave of vaccines in the United States and those under 12 who are still ineligible, could serve as an asymptomatic reservoir for viral mutations leading to increased transmissibility or vaccine escape mutations, which has been shown in unvaccinated children and adults with viral persistence (37). 
Further long-term studies of immunity in the population will be necessary to understand durability of the immune response to the vaccine versus infection, how infection-induced immunity affects vaccine response and performance, and whether herd immunity can play a role in controlling the spread of SARS-CoV-2. In addition, further subgroup analysis of these data will be useful in clarifying the spread of disease in the presence of public health measures and how we may be able to refine and further target those measures in the future.
Our study has several limitations. First, although extensive statistical adjustments were made, our study cohort is based on a nonrandom volunteer sample, which can have selection bias. Traditional random sampling studies using probability sampling design may have low response rates, calling into question the advantages of that practice (38,39). Our study population also exhibited some differences from the general U.S. population, such as higher education level and access to health care, that had to be adjusted for with statistical weighting. Larger sample sizes would allow us to make more detailed estimates, although potentially at the cost of how representative the population is. We used both census and behavioral data to weight our results, although it is possible that there are variables associated with disease transmission that were not accounted for in our weighting. Although we used extensive validation methods on our ELISA (8), dried blood was unavailable from historical samples on the collection devices. Future cross-verification with an independent analyte, such as the nucleocapsid protein, could prove useful, although antibodies to nucleocapsid fade and would require correction for antibody decay.
Our data suggest a larger spread of the COVID-19 pandemic in the United States during the first 6 months than originally thought.
Our findings have implications for understanding SARS-CoV-2 spread, epidemiological characteristics of spread, and prevalence in different communities and could have a potential impact on decisions involved in vaccine rollout. Continued large-scale surveillance of SARS-CoV-2 immunity is in progress, discriminating infection-based and vaccine-induced antibody responses. Mathematical models are being generated to understand the pandemic, vaccine performance, and public health measure efficacy and to provide insight into the best approach for handling the next virus with pandemic potential.

Study design
This study was designed to determine the seroprevalence of anti-SARS-CoV-2 antibodies in adults 18 years of age or older in the United States who had not been previously diagnosed with COVID-19. This serosurvey clinical study (ClinicalTrials.gov NCT04334954) is ongoing and will follow the same cohort of participants over time to evaluate seroprevalence and antibody profiles in comparison to the demographic, health, and socioeconomic data provided by each participant. This study was approved by the NIH Institutional Review Board and conducted in accordance with the provisions of the Declaration of Helsinki and Good Clinical Practice guidelines. All participants provided verbal informed consent before enrollment.

Participant selection
The study was advertised online through an official NIH Press Release that linked to an email address to volunteer for selection in the study (www.niaid.nih.gov/news-events/nih-begins-study-quantify-undetected-cases-coronavirus-infection). This press release was subsequently publicized by local and national news outlets and covered via broadcast television news, print news, and internet news articles. All volunteers were emailed an initial survey to collect basic demographic characteristics. Survey responses were de-identified and aggregated by subcategory of state, type of locality approximated from zip codes, age, sex, race, and ethnicity (Fig. 1). Target sample sizes for these subcategories were determined from the U.S. census and were updated every evening based on the characteristics of people who had already enrolled to ensure that individuals in each subcategory were enrolled evenly over time. Within each subcategory, participants were initially assigned a selection probability calculated from the target number as a proportion of the available pool. Specific subcategories that had insufficient numbers were aggregated to estimate their impact on the overall distribution of the six main characteristics. If a particular characteristic had insufficient numbers, sample probabilities were boosted for volunteers who had the characteristic. For each day's call list, the most representative of 20,000 randomly generated lists was used, each list drawn without replacement from the volunteer pool based on the sampling probabilities previously defined. Representativeness was assessed by estimating a weighted sum of squared differences from the desired targets and picking the list with the lowest deviation. Unselected participants were eligible to be called at a later date. This algorithm is designed such that each cohort of invited participants is representative of the diversity of the U.S. population with respect to the six sampling variables (see section S4).
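The daily call-list step described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the study's code (which is linked in the "Statistical analysis" section): the function names, the toy volunteer pool, and the use of Efraimidis-Spirakis keys for weighted sampling without replacement are all choices made here, because the text does not specify the drawing scheme.

```python
import random


def weighted_sample_without_replacement(rng, probs, k):
    """Draw k distinct indices, favoring larger probs (Efraimidis-Spirakis keys)."""
    keyed = [(rng.random() ** (1.0 / p), i) for i, p in enumerate(probs) if p > 0]
    keyed.sort(reverse=True)
    return [i for _, i in keyed[:k]]


def deviation(indices, traits, targets, weights):
    """Weighted sum of squared differences between observed and target proportions."""
    n = len(indices)
    total = 0.0
    for c, target in enumerate(targets):
        observed = sum(traits[i][c] for i in indices) / n
        total += weights[c] * (observed - target) ** 2
    return total


def pick_call_list(traits, probs, targets, weights, list_size,
                   n_candidates=20000, seed=0):
    """Generate n_candidates random lists and keep the one with the lowest deviation."""
    rng = random.Random(seed)
    best, best_dev = None, float("inf")
    for _ in range(n_candidates):
        candidate = weighted_sample_without_replacement(rng, probs, list_size)
        dev = deviation(candidate, traits, targets, weights)
        if dev < best_dev:
            best, best_dev = candidate, dev
    return best, best_dev


# Hypothetical toy pool: 60 volunteers with two 0/1 traits (urban, female).
traits = [[1 if i < 45 else 0, i % 2] for i in range(60)]
probs = [1.0] * 60            # equal selection probabilities for the toy example
targets = [0.8, 0.5]          # desired proportion with each trait
weights = [1.0, 1.0]          # equal importance for both traits

call_list, dev = pick_call_list(traits, probs, targets, weights,
                                list_size=10, n_candidates=500, seed=1)
print(sorted(call_list), dev)
```

Generating many candidates and keeping the best trades computation for simplicity: no optimization routine is needed, and every candidate is still a random draw from the pool.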

Blood sample collection
Participants provided blood samples by mail using a Mitra microsampling kit (Neoteryx, Torrance, CA) or standard venipuncture. Microsampling kits contained visual instructions on the sampling process, bandages, gauze, lancets, and four 20-μl microsampling devices for a total collection of 80 μl of whole blood. Participants used the lancet to draw blood from their fingertip and collect blood onto each of the four microsamplers. Participants returned the dried microsamplers with desiccant via overnight shipping. Those who underwent venipuncture did so in the NIH Clinical Center phlebotomy laboratory, where 18 ml of blood was collected in a serum separator and whole blood tube. Once received in the laboratory, serum samples were processed, and microsamplers were stored dry at −80°C until elution and analysis.

Serologic assays
Antibodies from samples were analyzed using ELISA as previously described (8,40-42). To maintain longitudinal quality control and ensure that the assays remained stable across multiple months of assay implementation, positive and negative controls were included on each assay plate and monitored for stability (fig. S7). Seropositivity cut points were defined by evaluating 300 true-negative samples and 56 true-positive samples. Positivity thresholds were based on the mean optical density (absorbance) plus 3 SDs (see the Supplementary Materials for details). The final criterion of a Spike+ and RBD+ for any combination of IgG or IgM gave estimated sensitivity and specificity of 1, with raw values for recombinant antibody results reported in fig. S8 and table S1. In addition, IgA was evaluated via previously described ELISA to further phenotype the participant's serologic status. Raw sample positivity data by state can be found in fig. S9.
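The cut-point and dual-antigen rule can be illustrated with a short sketch. Two assumptions are made here and should be checked against the Supplementary Materials: the mean and SD are taken over the true-negative samples (using the sample SD), and "any combination of IgG or IgM" is read as spike-positive by IgG or IgM together with RBD-positive by IgG or IgM. The function names, dictionary layout, and optical density values are hypothetical.

```python
from statistics import mean, stdev


def cut_point(negative_ods, n_sd=3.0):
    """Threshold = mean + n_sd * SD of optical densities from true negatives (assumed)."""
    return mean(negative_ods) + n_sd * stdev(negative_ods)


def is_seropositive(od, thresholds):
    """Dual-antigen rule: spike AND RBD must each exceed threshold for IgG or IgM.

    od and thresholds are dicts keyed by (antigen, isotype), e.g. ("spike", "IgG").
    """
    def antigen_positive(antigen):
        return any(od[(antigen, iso)] > thresholds[(antigen, iso)]
                   for iso in ("IgG", "IgM"))

    return antigen_positive("spike") and antigen_positive("RBD")


# Made-up optical densities for illustration only.
negatives = [0.10, 0.20, 0.30]
keys = [(a, i) for a in ("spike", "RBD") for i in ("IgG", "IgM")]
thresholds = {key: cut_point(negatives) for key in keys}

sample_a = {("spike", "IgG"): 1.2, ("spike", "IgM"): 0.1,
            ("RBD", "IgG"): 0.2, ("RBD", "IgM"): 0.9}
sample_b = {("spike", "IgG"): 1.2, ("spike", "IgM"): 1.1,
            ("RBD", "IgG"): 0.2, ("RBD", "IgM"): 0.3}

print(is_seropositive(sample_a, thresholds), is_seropositive(sample_b, thresholds))
```

Requiring both antigens guards against a false positive from cross-reactivity to a single antigen, which is why sample_b, reactive to spike only, is not classified as seropositive under this reading.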

Statistical analysis
The iterative quota sampling (described in the "Participant selection" section) that we used continuously matched the proportion of people in the study with the census estimated proportion of people in the United States on six variables (Table 1 and Fig. 1). This ensured that each periodic sample of participants over the course of the study was representative, and the time effects of the pandemic were approximately independent of those six variables (fig. S2). Each participant was asked demographic and health-related questions that matched those on the BRFSS survey, a large probability-based national survey (43). Responses to those matching questions were used with BRFSS survey data to adjust estimators to account for important criteria that may be related to both selection probability and seropositivity but were not accounted for in our quota sampling. Those adjusted estimators used weighting based on the propensity of being a quota sample versus a BRFSS sample participant and poststratification to U.S. census data. Weighting additionally accounted for sensitivity and specificity. CIs were calculated for the final seroprevalence estimates accounting for both the variability of the weighting and of the sensitivity and specificity adjustment. The ratio of undiagnosed SARS-CoV-2 infections to diagnosed cases of COVID-19 was estimated as the final seroprevalence estimate times a factor calculated from the daily national population and diagnosed cases. Detailed statistical methods are provided in the Supplementary Materials. The main computer code used in this study is available at: https://zenodo.org/record/4958017#.YMkzYpNKh26. Sources used for analysis can be found in (8,38,39,43-56).
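The three building blocks named above (a weighted prevalence, a sensitivity and specificity adjustment, and a population-based factor for the ratio) can be sketched as follows. This is not the study's estimator: the adjustment shown is the standard Rogan-Gladen correction, swapped in here as a stand-in, and the form of the daily factor (the average over days of undiagnosed people per diagnosed case) is an assumption. All function names are hypothetical, and the propensity weighting and CI calculation are omitted.

```python
def weighted_prevalence(positives, weights):
    """Share of total survey weight carried by seropositive participants."""
    return sum(w for y, w in zip(positives, weights) if y) / sum(weights)


def rogan_gladen(apparent, sensitivity, specificity):
    """Standard correction of an apparent prevalence for imperfect test accuracy."""
    denom = sensitivity + specificity - 1.0
    if denom <= 0:
        raise ValueError("sensitivity + specificity must exceed 1")
    adjusted = (apparent + specificity - 1.0) / denom
    return min(1.0, max(0.0, adjusted))   # clamp to a valid proportion


def ratio_factor(daily_population, daily_diagnosed):
    """Assumed form: mean over days of (population - diagnosed) / diagnosed."""
    terms = [(n - d) / d for n, d in zip(daily_population, daily_diagnosed)]
    return sum(terms) / len(terms)


def undiagnosed_per_diagnosed(prevalence, factor):
    """Ratio = seroprevalence among the undiagnosed times the population factor."""
    return prevalence * factor


# Toy inputs for illustration only.
prev = weighted_prevalence([True, False, False], [2.0, 1.0, 1.0])
adj = rogan_gladen(prev, sensitivity=1.0, specificity=1.0)
factor = ratio_factor([100.0, 100.0], [10.0, 20.0])
print(prev, adj, factor, undiagnosed_per_diagnosed(adj, factor))
```

With sensitivity and specificity both equal to 1, as estimated for the dual-antigen criterion, the correction leaves the weighted prevalence unchanged.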