The Human Proteoform Project: Defining the human proteome

Description

The Human Genome Project (HGP) was a remarkable and unqualified success, profoundly transforming and accelerating biological and medical research while converting a ~$4B public investment into over $700B of economic activity and new industries (1). The challenge of revealing the "Blueprints of Life," however, is surpassed by the challenge we face today: deriving from these blueprints an understanding of the structures they dictate and how these function within biological systems.
Proteins are primary effectors of function in biology, and thus, complete knowledge of their structure and behavior is fundamental to deciphering function in basic and translational research (2). The richness of protein structure and function goes far beyond the linear amino acid sequence dictated by the genetic code. Genetic variation, alternative splicing, and posttranslational modification (PTM) work together to create a rich variety of different proteoforms arising from our genes (Fig. 1) (3). The chemical diversity of proteins is foundational for the biological complexes and networks that control biology yet remains largely unknown. Genome sequence alone does not provide the needed information; only direct analysis of the proteoforms themselves can reveal their composition, enabling studies of their spatial distributions and temporal dynamics in biological systems. We propose here an ambitious initiative to define the human proteome, that is, to generate a definitive set of reference proteoforms produced from the genome (see Box 1).

PROTEOFORM-LEVEL KNOWLEDGE IS ESSENTIAL TO UNDERSTAND BIOLOGICAL FUNCTION
Proteins are the central intermediaries between genotype and phenotype (2)(3)(4). It is not possible to understand the functioning of a biological system if one does not know what protein molecules are present, as well as the nature and abundances of their proteoforms. Knowledge of where the proteoforms are located within cells or tissues, what other proteoforms they interact with to form the multifunctional complexes that carry out critical functions in cell biology, and how they change in response to stimuli is essential. Innovative new tools are needed to comprehensively define the proteome, allowing proteoform abundances, interactors, and locations to be assessed with far greater depth at lower cost. The foundational premise of the HGP, that knowledge of the genome sequence will provide a fundamental understanding of biological systems, will not be realized in the absence of detailed proteoform-level information. This was clearly articulated by Collins et al. (2), "A critical step toward gaining a complete understanding... will be to take an accurate census of the proteins present in particular cell types. It will be a major challenge to catalog proteins present in low abundance or in membranes. Determining the absolute abundance of each protein, including all modified forms, will be an important next step." The Human Proteoform Project we present here is the critical next step in the quest to understand human health and disease. Several examples from five important disease areas illustrate the critical role of proteoforms in disease and health (Fig. 2). These examples show how disease-driven research has been advanced by discovery of proteoforms and their PTMs.
We put forward a two-pronged strategy. On the one hand, we pursue deep proteoform-level analysis in medically relevant systems (Fig. 2); this will continue to open up fundamental insights into targets and use cases of high biomedical importance. In parallel, we invest heavily in the accelerated development of proteoform discovery and characterization technologies and deploy them for large-scale proteoform analysis of specimens from nominally healthy donors.
The project is modeled roughly after the successful roadmap provided by the HGP, which generated the human reference genome sequence while advancing technology in the process (2,3,5). An international effort on the scale of the HGP in both funding and time will reveal the full chemical complexity of our proteins, drive the frontiers of research and medicine well beyond what is currently possible, and be critical in the assignment of function to proteins and their PTMs in the decades ahead.

THE HUMAN PROTEOFORM PROJECT
We propose the Human Proteoform Project, a program to aggressively develop new technologies for comprehensive proteoform analysis and to assemble an extensive, high-quality atlas of human proteoforms. We envision next-generation proteomics in humans to be based on ~20,000 proteoform families (6), one for each gene in the genome. Deep catalogs of proteoforms compiled for widely characterized mammalian cell lines and primarily human samples will markedly accelerate our understanding and exploitation of proteins. This more profound knowledge of the central molecules of biology will provide an essential cornerstone for 21st century biology. New technologies will be central to this effort, as today's ability to comprehensively identify proteoforms in complex systems is limited.

ASSEMBLING THE HUMAN PROTEOFORM ATLAS
Proteoform expression varies across cells and tissues, and studies of proteoform expression can be either global or targeted. The expression of rare proteoforms is stochastic in nature. The Human Proteoform Project will thus necessarily focus on capturing the identities of the dominant functional proteoform population rather than rare occurrences. We propose the bifurcated approach shown in Fig. 3. In global studies, all proteoforms present at detectable levels are characterized; in targeted studies, specific proteoform families arising from each human gene will be enriched and subjected to systematic proteoform discovery to reveal the molecular diversity present. The two paths are described below.
Cell-based approach to proteoform discovery
An important thrust of the project is the delineation of proteoform expression patterns in human cell types (Fig. 3, bottom) (7). Defining the number and nature of human cell types is an ambitious undertaking in its own right and is currently being pursued by several consortia (see below). Anchoring proteoform analysis with cell types provides a generalized strategy to access human biology across the natural context present within our tissues. The depth of proteoform analysis obtained depends on the detection sensitivity of the technology used: While today's mass spectrometric platforms are pushing toward detection limits of ~25 copies per cell (7), aggressive technology investment is needed to further develop these platforms and to develop new approaches and paradigms (see the section below). A cell-based approach can begin using many thousands of cells of a given type and adopt single-cell proteoform technologies as they become available.

Gene-based approach for targeted proteoform discovery
The development of affinity reagents to capture the proteins encoded by each human gene will be invaluable to enrich and then characterize their proteoform families in a selection of human specimens. The fundamental role of proteoform-level knowledge in understanding human disease and health (Fig. 2) is evident from consideration of the most highly cited human genes in the biomedical literature (Fig. 4). Tumor necrosis factor, at the top of the list, has >200,000 citations; this high citation count can be considered a reasonable proxy for the research funding that has gone into its study over decades. Notably, even the most-studied genes have unknown proteoforms that are essential to understanding their biological and disease-related functions. The economies of scale afforded by a concerted project to obtain comprehensive proteoform-level knowledge will make possible the acquisition of such information for the 20,000 proteoform families derived from the human genome.

NEW TECHNOLOGIES
At present, the dominant "bottom-up" paradigm of mass spectrometry (MS)-based proteomics sacrifices information about proteoforms by cleaving proteins into peptides; this is done for a pragmatic reason: it works, as the resultant peptides are generally much easier to identify than their parent intact proteoforms (8,9). Top-down proteomics, in contrast, analyzes the entire intact proteoform and is the most powerful proteoform-level analysis technology in existence, providing knowledge regarding RNA isoform translation and combinatorial PTMs, but is limited in depth and throughput (4,6). The flagship efforts of the Clinical Proteomic Tumor Analysis Consortium (CPTAC) have brought targeted proteomics and proteogenomics into regular use and produced major studies on ovarian (10), breast (11), and colorectal cancer (12). Using the bottom-up approach to proteomics, CPTAC noted recently that "the aggregated NCI-60 proteomics dataset covers only 12% of the whole encoded proteome, and only ~5% of the genes had sequence coverage of >50% of their protein coding regions" (13). Regarding alternative splicing, "there is yet a major gap between the number of alternative transcripts asserted by RNA sequencing and that detectable by proteomics (e.g., <0.1% of putative novel splice junctions in cancer xenografts)" (13). This state of affairs underscores the critical need to advance the state of the art in proteomic analysis (14)(15)(16)(17) via new technologies and extensive proteoform-level characterization of biological systems.
Fig. 3. Approach to creating an integrated Human Proteoform Atlas. The upper path illustrates the use of protein affinity reagents to capture proteoform families derived from targeted genes. The lower path illustrates the in-depth analysis of human cell types for proteoform discovery and characterization. Relative abundance refers to the ratio of a given proteoform to the sum of all proteoforms in that family.
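The relative abundance defined in the Fig. 3 caption (a given proteoform divided by the sum of all proteoforms in its family) can be written out as a minimal sketch. The function name, the dictionary layout, and the use of arbitrary intensity units are assumptions made for illustration; they are not taken from the project description.

```python
from typing import Dict


def relative_abundances(family_intensities: Dict[str, float]) -> Dict[str, float]:
    """Return each proteoform's share of its family's total signal.

    `family_intensities` maps a (hypothetical) proteoform label to its
    measured abundance in arbitrary units, for ONE proteoform family.
    """
    total = sum(family_intensities.values())
    if total <= 0:
        raise ValueError("family total must be positive to form a ratio")
    # Ratio of a given proteoform to the sum of all proteoforms in the family.
    return {name: value / total for name, value in family_intensities.items()}
```

By construction the returned values for one family sum to 1, so they describe composition within a family and say nothing about absolute copy numbers.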
To achieve the objectives outlined above, it is critical to expand our technological abilities through a concerted long-term and multifaceted research and development effort. This effort should pursue both the continued development of MS-based technologies for proteoform analysis, as well as the exploration of potential paradigm-shifting new ideas and approaches that offer the possibility of transformative change. The development of increasingly powerful and effective nucleic acid sequencing has demonstrated the importance of investing heavily in ambitious new efforts to drive technology development. Similarly, single-molecule MS (18)(19)(20), nanopore sequencing (21,22), cryoelectron microscopy and visual proteomics (23,24), single-cell proteomics (25)(26)(27)(28)(29), single-molecule protein arrays (30,31), and other ideas yet to be conceived need to be encouraged, supported, and developed to advance proteoform biology.
The outstanding success of the technology development program in the HGP and the associated private sector engagement provide an inspiring model for how this can be done well. Just as the $1 per base estimate for the HGP provided an important target to spur technology competition and development, so will a $1 per proteoform goal for the Human Proteoform Project as proposed previously (7). Although the details of its implementation plan will be developed with key stakeholders, at this time, the main parameters and their estimates help frame the project. For the cell-based prong (Fig. 3), we can anticipate that the output of the Human Cell Atlas, Human Biomolecular Atlas Program (HuBMAP), and other consortia will be a defined ontology and number of human cell types, allowing the proteome of each to be targeted. Assuming 5000 cell types and prescribing a depth of 1 million proteoforms in each, constructing the Human Proteoform Atlas would involve ~5 billion measurements of redundant proteoforms (32). Combined with the gene-based approach, perhaps ~50 million unique (nonredundant) proteoforms will be asserted with defined quality metrics over the course of the project.
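The scale estimate above is simple arithmetic, sketched here for readers who want to vary the inputs. The three constants are the figures quoted in the text; the function names are illustrative, and the redundancy ratio is only a rough indicator because the ~50 million unique proteoforms also include those found through the gene-based approach.

```python
# Figures quoted in the text (all are planning estimates, not measurements).
CELL_TYPES = 5_000                      # assumed number of human cell types
PROTEOFORMS_PER_CELL_TYPE = 1_000_000   # prescribed depth per cell type
UNIQUE_PROTEOFORMS = 50_000_000         # approximate nonredundant proteoforms


def total_measurements(cell_types: int, depth_per_cell_type: int) -> int:
    """Redundant proteoform measurements across all cell types."""
    return cell_types * depth_per_cell_type


def mean_redundancy(measurements: int, unique_proteoforms: int) -> float:
    """Average number of measurements per unique proteoform (rough ratio)."""
    return measurements / unique_proteoforms


measurements = total_measurements(CELL_TYPES, PROTEOFORMS_PER_CELL_TYPE)
redundancy = mean_redundancy(measurements, UNIQUE_PROTEOFORMS)
print(f"{measurements:,} measurements, about {redundancy:.0f} per unique proteoform")
```

Under these inputs, each unique proteoform would be observed about 100 times on average across the cell-type survey.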

THE PIVOT FROM PROTEOFORM DISCOVERY TO PROTEOFORM SCORING
A central principle in comprehensive proteoform analysis concerns the distinction between discovery and scoring. Comprehensive analysis of protein primary structure requires the generation of highly complex data in the discovery phase of proteoform analysis. However, once we have in hand a comprehensive index of these proteoforms for the system under study, efforts can shift to a scoring mode informed by the previous knowledge. This transition from discovery to scoring is central to many fields: in genomics, for example, the initial discovery of single-nucleotide polymorphisms (SNPs) led to the generation of SNP databases and technologies for their scoring at scale. The scoring technology enabled cost-effective functional studies and disease-based research across human populations. Similarly, in MS, initial work to develop small molecule identification from gas-phase fragmentation patterns led to the establishment of rich databases of molecular fragmentation spectra allowing the rapid identification of already known compounds. This venerable principle will be invaluable to driving increased throughput and decreased cost. We anticipate that disruptive technologies such as single-molecule proteoform sequencing and analysis would benefit from a reference set of the human proteoforms actually present.
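The shift from discovery to scoring can be illustrated with a toy lookup: once a reference index exists, an observed intact mass is matched against known entries instead of being characterized from scratch. This is a minimal sketch under stated assumptions, not a description of any existing pipeline. The names `Proteoform`, `build_index`, and `score_mass`, the 10 ppm default tolerance, and the example masses are all hypothetical; real scoring would draw on more evidence than intact mass alone.

```python
from bisect import bisect_left, bisect_right
from typing import List, NamedTuple


class Proteoform(NamedTuple):
    name: str       # hypothetical label for a reference proteoform
    mass_da: float  # intact mass in daltons


def build_index(proteoforms: List[Proteoform]) -> List[Proteoform]:
    """Discovery output, sorted by mass so scoring can use binary search."""
    return sorted(proteoforms, key=lambda p: p.mass_da)


def score_mass(index: List[Proteoform], observed_mass_da: float,
               tol_ppm: float = 10.0) -> List[Proteoform]:
    """Return reference proteoforms within tol_ppm of an observed mass."""
    masses = [p.mass_da for p in index]
    delta = observed_mass_da * tol_ppm / 1e6  # ppm tolerance in daltons
    lo = bisect_left(masses, observed_mass_da - delta)
    hi = bisect_right(masses, observed_mass_da + delta)
    return index[lo:hi]


# Placeholder reference set: a made-up 10,000 Da protein and two modified forms.
reference = build_index([
    Proteoform("example_unmodified", 10000.000),
    Proteoform("example_acetylated", 10042.011),      # +42.011 Da
    Proteoform("example_phosphorylated", 10079.966),  # +79.966 Da
])
matches = score_mass(reference, 10079.97)
```

The lookup is cheap compared with discovery, which is the property that lets scoring technologies scale once the reference set is in hand.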

ENABLING NEW LEVELS OF BIOMEDICAL RESEARCH
With a new generation of precision measurement tools, studies of mutations, disease, infection, and drug treatment will all operate with more detailed knowledge afforded by creation of a comprehensive proteoform index. This will further accelerate the goals of 21st century biomedicine, such as regenerative biology, enhanced drug development, and better detection of human disease, all of which involve proteins. Beyond improving the use of proteins as biomarkers, the reference atlas of proteoforms will enable the study of their spatial and temporal distributions within cells and tissues, information presently impossible to obtain. This will often involve protein affinity capture reagents enabling readouts using a wide array of technologies (Fig. 5). Scoring technologies for single-molecule and single-cell biology will be propelled by having proteoform answers in the "back of the book" as we develop and optimize them in the decade ahead.
Fig. 4. The most studied proteins have essential proteoforms that contain common PTMs such as phosphorylation, methylation, acetylation, and other important variations of primary structure such as disulfide bond formation, metal attachment, and proteolytic processing. TNF, tumor necrosis factor; TAU, tubulin-associated unit; HB, hemoglobin subunits (HbA, HbB, etc.); CASP, cysteine-aspartic proteases (Casp1 to Casp9); SOD, superoxide dismutases (SOD-1, SOD-2, and SOD-3); EGFR, epidermal growth factor receptor; CYTC, cytochrome C; TN, troponin (Tn-C, Tn-I, and Tn-T); APOE, apolipoprotein E; CA, carbonic anhydrase; CREB, cyclic adenosine 5′-monophosphate response element-binding protein; TP53, cellular tumor antigen p53. Citations are from the Web of Science Core Collection from 1975 to 2020. Citations per year and a history of research trends have been chronicled for a subset of these proteins (46).

SYNERGY OF THE HUMAN PROTEOFORM PROJECT WITH OTHER INITIATIVES
The Human Proteoform Project, by capturing all sources of protein variation for creation of a reference atlas of whole proteoforms, is fundamentally different from other proteomics initiatives. Prior initiatives such as those describing first drafts of the human proteome in 2014 (33,34) and ongoing work under the aegis of the Human Protein Atlas and the Human Proteome Project (35) have accomplished a great deal over the past several years, and the Human Proteome Organization has called for the community to "systematically map all human proteoforms" (36). There has also been an industry-led call from several pharmaceutical companies underscoring the need for major improvements in proteoform measurement (37). Clear synergies with initiatives focused on human cell typing and protein capture reagents are visible. The Human Protein Atlas with its existing set of >15,000 antibodies provides a valuable resource for targeted studies while also driving efforts to develop "open source" renewable affinity reagents of known sequence (38). These affinity reagents enable targeted enrichment of proteoform families deriving from each human gene (Fig. 3, top). Once the members of proteoform families are known, creation of a next generation of proteoform-directed affinity reagents will be possible (Fig. 5) (39). An important thrust of the Human Proteoform Project is the delineation of proteoform expression patterns across human tissues and cell types to be archived in the Human Proteoform Atlas. This effort will benefit greatly from the output of the now accelerating efforts in the HuBMAP (40), the Human Cell Atlas (41), and several affiliated consortia. These groups are actively in the process of defining all human cell types in an organized and interoperable ontology. This includes generating markers of cell types that will facilitate their sampling for cell-based proteomics to determine the proteoforms present.

ROLES OF GOVERNMENT, FOUNDATIONS, AND THE PRIVATE SECTOR
For the necessary transformation of technology and knowledge to take place over the coming decade, numerous stakeholders will need to engage and align with the project to bring it to fruition (42). Within the emergent proteomics ecosystem that we envisage, three categories of organizations can be identified: those focused on creating new knowledge (universities and research institutes), those creating new value for customers (instrument, biopharma, and diagnostics companies), and those providing financial and other resource support for the creation of knowledge or customer value (government agencies, philanthropies, nonprofit foundations, and well-established companies) (43). The role of the knowledge creators is paramount for a research-intensive area such as this, and the major universities and research institutes will generate the structural, large-scale data to drive this effort. This will require substantial funding; for comparison, genomics research worldwide was publicly funded at about $3B per year from 2003 to 2006, with the United States contributing about 35% of this (44).
The companies and institutions that commercialize the tools, technologies, and services to advance the field also play a pivotal role in this endeavor, often collaborating with academic researchers to bring new technologies to the marketplace. This cycle of innovation and commercialization was a fundamental enabler of the HGP. The biopharmaceutical and diagnostic companies invest heavily in research and development [for example, having spent $97 billion in R&D in the United States in 2017 (45)] and so are well poised to participate in these efforts. As noted above, generating the definitive proteoform set for the expressed human proteome presents a major economic opportunity for the private sector.
Efforts to bring alignment and find common goals among the various members of the emerging "proteoform ecosystem" are already underway, with organizations starting to build bridges across boundaries. Increasing cooperation between public agencies, organizations, and international institutions will hasten the discovery and understanding of human proteoforms and provide marked growth in therapeutics, diagnostics, and the life sciences.

CONCLUSION AND OUTLOOK
The Human Proteoform Project will revolutionize our understanding of human health and disease. This ambitious project to develop and apply powerful new technologies to reveal the molecular complexity that underlies human biology will be transformative. While a full exploration into the nature of its many impacts is beyond the scope of this article, we provide in Fig. 6 an overview of some of the many areas in which it will open new vistas and enable revolutionary new technologies. We offer the roadmap outlined here to inspire its realization.