Global analysis of the yeast knockout phenome

Genome-wide phenotypic screens in the budding yeast Saccharomyces cerevisiae, enabled by its knockout collection, have produced the largest, richest, and most systematic phenotypic description of any organism. However, integrative analyses of this rich data source have been virtually impossible because of the lack of a central data repository and consistent metadata annotations. Here, we describe the aggregation, harmonization, and analysis of ~14,500 yeast knockout screens, which we call Yeast Phenome. Using this unique dataset, we characterized two unknown genes (YHR045W and YGL117W) and showed that tryptophan starvation is a by-product of many chemical treatments. Furthermore, we uncovered an exponential relationship between phenotypic similarity and intergenic distance, which suggests that gene positions in both yeast and human genomes are optimized for function.


Note S1 - Reproducibility of YKO screens
By virtue of its size and robust annotations, Yeast Phenome enables an assessment of knock-out screen reproducibility. An example of an experiment repeated multiple times is growth on glycerol, a non-fermentable carbon source that can only be metabolized via respiration. Inability of knock-out mutants to respire, revealed by their failure to proliferate on rich media containing glycerol as the sole carbon source, has been systematically tested 8 times in 6 different laboratories (table S1). By comparing the results of all 8 screens, we found that each screen reported a similar number of slow-growing mutants (NPV < -3) that were deemed respiration deficient (140 ± 42 mutants, mean ± std. dev., n = 8). Importantly, most mutants (57-82%) identified in any one screen were reproduced in at least 5 of the 8 independent replicates, while 0-24% of mutants remained unique to single datasets (fig. S2A).

One possible explanation for genes being required for respiration in some but not all screens is that differences in experimental design (e.g., glycerol dosage, type of growth assay, mutant ploidy or mating type) create conditional essentialities. If that is the case, we expect genes unique to any single experiment to share common biological functions as often as genes reproduced in multiple experiments. To test this hypothesis, we used proximity on the genetic interaction similarity (GIS) network as an unbiased measure of shared function. The GIS network connects genes if their mutations have similar effects on the fitness of other mutants (11). Without any prior knowledge of gene function, the GIS network places genes relative to one another based on the extent of their functional similarity, producing an unsupervised view of cellular organization spanning multiple levels of resolution, from molecular pathways to organelles (5, 11). We performed Spatial Analysis of Functional Enrichment (SAFE) (5) to identify regions of the GIS network that are overrepresented for respiration-deficient mutants from each screen (fig. S2B). We found that the enrichment profiles of all screens were nearly identical to one another (cosine correlation between neighborhood enrichment scores ρ = 0.994 ± 0.003, mean ± std. dev., n = 28 pairs) and consistent with respiration, oxidative phosphorylation, and other mitochondrial and metabolic functions (fig. S2B). The enrichment was driven primarily by mutants identified in multiple screens, whereas mutants unique to single experiments scattered randomly throughout the network (fig. S2C). This observation does not support the hypothesis that isolated findings share common biological functions and suggests, instead, that they are likely false positives. Furthermore, it suggests that, whenever replicate screens are not available, SAFE enrichment profiles could inform our level of confidence in single observations.
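The gene-by-gene overlap described in this note can be sketched as follows. This is a minimal illustration, not the code used in the study: the function names `call_hits` and `reproducibility`, the screen labels, and the toy NPV values are all hypothetical; only the hit threshold (NPV < -3) comes from the text.

```python
from collections import Counter

def call_hits(npv_by_gene, threshold=-3.0):
    """Return the set of genes whose normalized phenotypic value is below the threshold."""
    return {gene for gene, npv in npv_by_gene.items() if npv < threshold}

def reproducibility(hits_by_screen):
    """For each screen, return {k: fraction of its hits also found in exactly k other screens}."""
    # Count in how many screens each gene was called a hit.
    screens_per_gene = Counter(g for hits in hits_by_screen.values() for g in set(hits))
    result = {}
    for screen, hits in hits_by_screen.items():
        hits = set(hits)
        if not hits:
            result[screen] = {}
            continue
        # Subtract 1 so that the screen itself is not counted as a replicate.
        others = Counter(screens_per_gene[g] - 1 for g in hits)
        result[screen] = {k: others[k] / len(hits) for k in sorted(others)}
    return result

# Toy example with three hypothetical screens (values are made up).
toy_npv = {
    "screen_1": {"geneA": -4.0, "geneB": -5.0, "geneC": 0.0},
    "screen_2": {"geneA": -3.5, "geneB": -1.0, "geneC": -6.0},
    "screen_3": {"geneA": -10.0, "geneB": 0.0, "geneC": 0.0},
}
toy_hits = {name: call_hits(npv) for name, npv in toy_npv.items()}
toy_fractions = reproducibility(toy_hits)
print(toy_fractions)
```

With 8 screens, summing the fractions for k ≥ 4 gives the share of a screen's hits reproduced in at least 5 of the 8 replicates, and the fraction at k = 0 gives the share unique to that screen.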

Note S2 - Analysis of secondary mutations
Knock-out collections are built on the principle that genetic loci can be systematically altered, one at a time, while keeping the rest of the genome constant. However, the genomes of knock-out mutants are not expected to remain constant over generations. Depending on time and selective pressure, knock-out mutants may acquire secondary mutations that partially compensate for gene loss and alleviate any corresponding fitness defects. Consistent with this expectation, studies have reported heritable phenotypic heterogeneity within isogenic knock-out populations (83), mapped secondary site suppressors using systematic genetic crosses (84) and identified a wide range of genomic alterations through whole genome sequencing (18, 25). Recognizing that adaptation is a general property of living systems that can be minimized but not eliminated completely, we sought to understand to what extent acquired secondary mutations may affect the interpretation of phenotypes derived from the yeast knock-out collection.

We compared phenotypes across two independently constructed versions of the haploid collection (Mat-a and Mat-α), as well as the homozygous diploid collection produced by mating them. Copies of the collections were housed separately across many laboratories and exposed to vastly different experimental conditions, giving each strain an opportunity to evolve independently from its siblings. Despite the opportunity to diverge, estimates of phenotype rate, i.e. the frequency of strong phenotypes (|NPV| > 3) displayed by each gene (see also Fig. 2 and related section in the main text), were correlated across collections (e.g., cosine r = 0.66-0.72; fig. S4A), suggesting that secondary mutations are either rare, reoccur frequently in strains lacking the same gene or have relatively little impact on most phenotypes.

To examine the impact on phenotypes more directly, we asked how often secondary mutations mask existing phenotypes or produce new ones, thus lowering the degree to which a phenotypic profile reflects the function of the deleted gene. Using data from previous investigations (83, 84), we compiled a list of 207 knock-out mutants that show (n = 103) or do not show (n = 104) evidence of secondary mutations (Materials & Methods). We presented random subsets of this list to two independent examiners and asked them to evaluate the phenotypes of each gene with respect to the gene's known biological function (Materials & Methods). The evaluations provided by each examiner, as well as their consensus, showed no statistical association between evidence of secondary mutations and phenotype-function inconsistency (p-value = 0.98, one-tailed Fisher's exact test; fig. S4B). Indeed, in contrast to expectation, the phenotypes of knock-out mutants carrying secondary mutations were more, not less, likely to agree with the functions of the deleted genes (70% among strains with secondary mutations vs 58% in the control group; fig. S4B). Given these data, we estimate with 95% confidence that secondary mutations increase the relative risk of phenotype-function inconsistency by no more than 3% (relative risk RR = 0.711, 95% CI [0.491, 1.030]; table S2).

The relatively low impact of secondary mutations on the functional interpretation of knock-out phenotypes may be explained by close functional proximity (and therefore high phenotypic similarity) between the two affected genes. Indeed, in cases where secondary mutations have been identified, they often occurred in genes that act in the same biological pathway, protein complex or regulatory response as the deleted gene (83, 84). These observations suggest that secondary mutations are likely to modulate, but not obscure, the phenotypes of the original deletion. Consistent with this hypothesis, knock-out mutants with and without secondary site suppressors show highly similar genetic interaction profiles (84). We therefore conclude that secondary mutations, arising spontaneously during routine laboratory manipulations, should not impede the use and interpretation of phenotypic profiles derived from the yeast knock-out collection.
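The phenotype rate comparison in this note can be illustrated with a short sketch. It assumes that "cosine correlation" is the plain cosine similarity between two vectors of per-gene phenotype rates; the function names and the toy values are hypothetical, and untested mutants are represented here as `None`, which is a convention of this example only.

```python
import math

def phenotype_rate(npvs, threshold=3.0):
    """Fraction of screens in which a mutant shows a strong phenotype (|NPV| > threshold).

    Screens where the mutant was not tested are passed as None and ignored.
    """
    tested = [v for v in npvs if v is not None]
    if not tested:
        return float("nan")
    return sum(abs(v) > threshold for v in tested) / len(tested)

def cosine_similarity(x, y):
    """Cosine similarity between two equal-length numeric vectors."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    if norm_x == 0 or norm_y == 0:
        return float("nan")
    return dot / (norm_x * norm_y)

# Toy NPVs for one gene across five screens (made-up numbers; None = not tested).
toy_rate = phenotype_rate([4.0, -5.0, 0.0, None, 1.0])

# Toy per-gene phenotype rates in two collections (made-up numbers).
rates_collection_1 = [0.10, 0.02, 0.30, 0.05]
rates_collection_2 = [0.12, 0.01, 0.25, 0.07]
toy_similarity = cosine_similarity(rates_collection_1, rates_collection_2)
print(toy_rate, toy_similarity)
```

In practice, the two rate vectors would have to be aligned so that each position refers to the same deleted gene in both collections.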

Note S3 - Phenotypic screens of the heterozygous diploid YKO
The data provided at www.yeastphenome.org and used for analysis in this study include only phenotypic screens of the two haploid YKO collections (Mat-a and Mat-α) and the homozygous diploid YKO collection. However, we also assembled and annotated data derived from phenotypic screens of the heterozygous diploid collection. These data were processed and transformed in the same manner as the haploid/homozygous phenotypic screens (Materials & Methods). The unprocessed input data and processing code for each publication reporting heterozygous screens are provided in the yp-data GitHub repository (https://github.com/yeastphenome/yp-data) and its archived version (https://doi.org/10.5281/zenodo.7714347). In addition, we provide the following 3 files in the "Bundle downloads" section of the yeastphenome.org website (https://yeastphenome.org/downloads/):
1. yp_datasets_het_20221018.tar.gz (tab-delimited) - A list of heterozygous diploid screens with relevant metadata (explained in the README.txt file).
2. yp_matrix_het_20221018.tar.gz (tab-delimited) - A gene x screen matrix of cleaned but unnormalized phenotypic values.
3. yp_matrix_het_z_20221018.tar.gz (tab-delimited) - A gene x screen matrix of normalized phenotypic values.

Yeast Phenome
Versions: 2022-02-08 and 2022-10-25
URL: https://yeastphenome.org/downloads/
Notes:
1. Similarity of phenotypic profiles was measured for each pair of genes using a bootstrap strategy as described below ("Calculating profile similarity"). The similarity metric was cosine correlation. Gene expression data from Kemmeren et al., 2014 (4) were excluded from similarity analyses because only ~1,500 knock-out mutants were tested.
2. When comparing phenotypic profiles to genetic interaction, protein-protein interaction and gene expression profiles, genes encoding ribosome components were excluded.

Genetic interactions
Publication: Costanzo et al., 2016 (11)
Notes:
1. Similarity of genetic interaction profiles was measured for each pair of query strains using a bootstrap strategy as described below ("Calculating profile similarity"). The similarity metric was cosine correlation. Dubious ORFs, essential genes and genes encoding ribosome components were excluded.
2. Genetic interaction degree was calculated as the number of genetic interactions per query strain that satisfy the intermediate stringency cutoff: |ε| > 0.08 and p-value < 0.05 (where ε is the genetic interaction score).

Protein-protein interactions
Notes:
1. Similarity of protein-protein interaction profiles was measured using a bootstrap strategy as described below ("Calculating profile similarity"). The similarity metric was Jaccard index. Dubious ORFs, essential proteins, ribosome components and proteins with fewer than 4 interactions were excluded.
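Two of the quantities defined in these notes, genetic interaction degree and the Jaccard index, are simple enough to sketch directly. The snippet below is an illustration under stated assumptions: the function names and toy inputs are hypothetical, interactions are represented as (ε, p-value) tuples, and only the cutoffs (|ε| > 0.08, p-value < 0.05) come from the text.

```python
def interaction_degree(interactions, eps_cutoff=0.08, p_cutoff=0.05):
    """Count interactions of one query strain passing the intermediate stringency cutoff.

    `interactions` is an iterable of (epsilon, p_value) tuples.
    """
    return sum(1 for eps, p in interactions if abs(eps) > eps_cutoff and p < p_cutoff)

def jaccard_index(partners_a, partners_b):
    """Jaccard index between the interaction partner sets of two proteins."""
    a, b = set(partners_a), set(partners_b)
    union = a | b
    if not union:
        return 0.0
    return len(a & b) / len(union)

# Toy inputs (made-up values).
toy_interactions = [(0.10, 0.01), (-0.20, 0.04), (0.05, 0.001), (0.30, 0.20)]
toy_degree = interaction_degree(toy_interactions)  # two tuples pass both cutoffs
toy_jaccard = jaccard_index({"P1", "P2", "P3"}, {"P2", "P3", "P4"})
print(toy_degree, toy_jaccard)
```

The Jaccard index suits binary interaction profiles, where only the presence or absence of an interaction is known, whereas cosine correlation is used for the quantitative profiles.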

Gene expression
Database: SPELL (37)
URL: http://sgd-archive.yeastgenome.org/expression/microarray/
Accessed on: 2017-09-24
Notes:
1. Similarity of gene expression profiles was measured using a bootstrap strategy as described below ("Calculating profile similarity"). The similarity metric was cosine correlation. Dubious ORFs, essential genes and genes encoding ribosome components were excluded.

1. In the precision-recall analysis of phenotypic profiles, as well as the analysis of chromosomal co-clustering, GO was restricted to a list of 295 biological process terms that were previously identified by expert biologists as moderately specific (87).

Only experiments that employed YP-based media and glycerol as the sole carbon source were included in this analysis. Yeast Phenome contains additional data for growth on synthetic media (partial and complete) and media supplemented with other carbon sources (e.g., glucose, ethanol).

The relative risk (RR) of a secondary mutation affecting the phenotypes is given by RR = ARM / ARW = 0.3 / 0.42 = 0.711 (values below 1 indicate that secondary mutations help, rather than harm, phenotypic profiles).
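The relative risk and its reported 95% confidence interval ([0.491, 1.030]) can be reproduced with a short calculation. The sketch below assumes the standard Taylor-series (log-scale) variance for a risk ratio. The event counts (31 of 103 and 44 of 104 mutants with phenotype-function inconsistency) are not stated in the text; they are assumptions chosen to match the rounded risks of 0.3 and 0.42 and the group sizes given in note S2. The function name is hypothetical.

```python
import math

def relative_risk_ci(events_m, n_m, events_w, n_w, z=1.96):
    """Relative risk with a two-sided confidence interval from the Taylor-series variance.

    events_m / n_m: inconsistent mutants / all mutants WITH evidence of secondary mutations.
    events_w / n_w: inconsistent mutants / all mutants WITHOUT such evidence.
    Assumes both event counts are non-zero.
    """
    ar_m = events_m / n_m          # absolute risk with secondary mutations (ARM)
    ar_w = events_w / n_w          # absolute risk without secondary mutations (ARW)
    rr = ar_m / ar_w
    # Var[ln RR] = (1 - ARM) / (n_m * ARM) + (1 - ARW) / (n_w * ARW)
    se_log_rr = math.sqrt((1 - ar_m) / events_m + (1 - ar_w) / events_w)
    half_width = z * se_log_rr
    return rr, rr * math.exp(-half_width), rr * math.exp(half_width)

# Assumed counts (see lead-in); not taken from the source.
rr, ci_low, ci_high = relative_risk_ci(31, 103, 44, 104)
print(f"RR = {rr:.3f}, 95% CI [{ci_low:.3f}, {ci_high:.3f}]")
```

Because the interval is built on the log scale, it is asymmetric around RR, which is why the upper limit sits closer to the point estimate in relative terms than a symmetric interval would suggest.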

Phenotype-function consistency
To estimate the confidence interval around this relative risk, we can use the Taylor series approximate variance (90). The two-sided 95% confidence limits are given by: $\mathrm{RR} \cdot \exp\left(\pm 1.96 \sqrt{\frac{1 - \mathrm{ARM}}{n_M \cdot \mathrm{ARM}} + \frac{1 - \mathrm{ARW}}{n_W \cdot \mathrm{ARW}}}\right)$, where $n_M$ and $n_W$ are the numbers of evaluated mutants with and without evidence of secondary mutations. So, in this case, RR = 0.711, 95% CI: [0.491, 1.030]. That means that, with 95% confidence, secondary mutations raise the relative risk of a negative impact on the phenotypic profile of a knock-out mutant by at most 3%.

All published screens of the YKO collection were identified, curated, assembled and normalized to enable analysis and integration. (A) In the YKO collection, each open reading frame is deleted and replaced via homologous recombination with a selectable marker (kanMX) flanked by locus-specific molecular barcodes (UP and DN), as well as universal sequences that can be used for amplification (black vertical bars). (B) Phenotypic screens involving the YKO collection are typically performed in an arrayed or a pooled format. In the arrayed format, each strain is examined independently from other strains by virtue of being grown in a separate well in a 96-well plate and/or as a separate colony on solid media. In the pooled format, all strains are co-cultured together in the same vessel and identified by barcode sequencing or microarray hybridization. (C) Publications that report phenotypic screens of the YKO collection were discovered using a comprehensive strategy (Materials & Methods). Each publication was associated with a list of screens and each screen was annotated with a set of standard vocabularies, i.e. lists of standardized terms that describe the measured phenotype (e.g., growth, expression of RNR3, mtDNA copy number) and the environment or experimental condition in which the phenotype was measured (e.g., growth medium, exposure to a chemical compound, temperature). Each screen was also associated with the corresponding data, which comprise the list of tested knock-out mutants (whenever available) and the list of phenotypic values for each tested mutant (whenever available). These data were cleaned, harmonized and normalized (Materials & Methods). The original and normalized data, as well as the Python code used for processing, were also stored in a database and a GitHub repository (see "Data and materials availability").

Thanks to its size and metadata annotations, Yeast Phenome makes it possible to identify similar screens and assess their reproducibility (note S1). (A) We identified 8 independent screens of respiratory metabolism (i.e., growth on rich media with glycerol as sole carbon source) using criteria described in table S1. In each screen, "hits" were defined as knock-out mutants with a strong growth defect (NPV < -3) relative to the most typical mutant in that screen (i.e., mode of all phenotypic values). The fractions of hits identified in one screen and reproduced in 0-7 other screens are shown as stacked bars and color-coded. For example, ~4% of hits reported by the first screen (Dimmer et al., 2002 (76)) were unique to that screen (black). In contrast, ~18% of hits were reproduced by 6 or 7 other studies (dark red + red). (B) The reproducibility of respiration deficiency across the 8 screens was nearly complete when, instead of a gene-by-gene overlap, we compared their SAFE enrichment profiles. A screen's SAFE profile illustrates the statistical association between the identified hits and one or more domains of the genetic interaction similarity network (11).

The comparison of 164 pairs of near-replicate screens provides an estimate of screen reproducibility within and between labs. The plots show a cumulative distribution of near-replicate screen-screen similarities computed using phenotypic profiles (A) and SAFE enrichment profiles (B). Near-replicate screen pairs were defined as screens that have tested the same phenotype under similar experimental conditions (Materials & Methods). Additionally, screens performed by the same lab were analyzed separately from screens performed by different labs. The background distribution corresponds to all screens, regardless of their tested phenotype, condition or lab of origin.

Secondary mutations are unlikely to impact gene-phenotype associations from YKO screens (note S2). (A) Phenotype rate is defined as the fraction of screens in which a knock-out mutant shows a strong phenotype (|NPV| > 3). The phenotype rates of Mat-a, Mat-α and homozygous diploid strains mutated for the same genes are generally correlated (cosine correlation 0.66-0.72), suggesting that secondary mutations are either rare, reoccur frequently in strains lacking the same gene or have relatively little impact on most phenotypes. (B) The phenotypic profiles of knock-out mutants with evidence of secondary mutations are more, not less, likely to be consistent with known functions of the knocked-out genes than the phenotypic profiles of mutants without evidence of secondary mutations. Each blue box represents a set of knock-out mutants with (right) and without (left) evidence of secondary mutations (note S2). The grey areas in each box are proportional to the fraction of mutants presenting phenotypes that are inconsistent with the gene's known function.

Phenotypic profiles predict functional relationships as accurately as other genome-scale datasets. (A) We examined gene-gene similarities using 4 independent data sources: Yeast Phenome, genetic interactions, protein-protein interactions and gene expression. For each dataset, we calculated profile similarities using a bootstrap strategy (20 samples, 1,500 features per sample). In each sample, we ranked gene pairs by their profile similarity and computed recall (the number of functionally related pairs with similarity above a threshold, for decreasing values of the threshold) and precision (the fraction of functionally related pairs among all gene pairs with similarity above the threshold, for decreasing values of the threshold). A gene pair was considered functionally related if both genes are co-annotated to the same GO biological process term, protein complex or biochemical pathway. The plot shows the relationship between recall and precision for each dataset. Lines and shaded areas represent the average and standard deviation of precision-recall curves across the bootstrap samples. Data sources and details about calculating phenotypic similarities, precision, recall and areas under the precision-recall curve are described in Materials & Methods. (B) Different types of functional relationships are better predicted by different data types. The heatmap shows areas under the precision-recall curves computed following the precision-recall analysis described in (A) but using narrower definitions of functional relationship (e.g., only co-annotation to the same protein complex or only co-annotation to the same biochemical pathway). (C) Despite an overall consistent performance in functional prediction, we observed little redundancy between data types, such that genes correlated in one dataset were generally uncorrelated in others. The heatmap shows Pearson correlation coefficients between gene-gene profile similarity values computed from the different data sources.

Phenotype rate, defined as the fraction of screens in which a gene shows a strong phenotype (|NPV| > 3), is not uniformly distributed across biological processes. An average phenotype rate was computed for each GO biological process with more than 10 genes represented in Yeast Phenome. The true phenotype rate ($P_{true}$, red line) was compared to the average and standard deviation of 1,000 randomly sampled gene sets of the same size ($P_{random}$, black dot, and $\sigma_{random}$, grey box, respectively). A z-score for each biological process was computed as $(P_{true} - P_{random}) / \sigma_{random}$. The top 50 biological processes with the highest (A) and the lowest (B) z-scores are shown. The aromatic amino acid family biosynthetic process (highlighted in red) is the only metabolic process with an elevated phenotype rate.

Phenotype rates and genetic interaction degrees for 1,099 GO biological processes are correlated. An average phenotype rate was computed for each process with more than 10 genes represented in Yeast Phenome. A z-score was computed by comparing the true phenotype rate ($P_{true}$) to the average and standard deviation of 1,000 randomly sampled gene sets of the same size ($P_{random}$ and $\sigma_{random}$): $Z = (P_{true} - P_{random}) / \sigma_{random}$. The same calculation was performed for genetic interaction degrees. A scatter plot of phenotype rate and genetic interaction degree z-scores is shown. The solid red line corresponds to y = x. The dotted red lines correspond to y = x - 2 and y = x + 2. Processes that fall outside of the red dotted lines and are associated with intracellular trafficking, transcription/chromatin remodeling and DNA replication are colored in blue, yellow and green, respectively.

Evidence from Yeast Phenome and validation experiments suggests that Ygl117w is a novel member or regulator of the aromatic amino acid biosynthesis pathway. (A) The phenotypic profile of ygl117w is as similar to the phenotypic profiles of trp and aro mutants as they are to one another. Cosine correlations between all pairs of genes were computed using the bootstrap method described in Materials & Methods. (B) The growth defect of ygl117w on media lacking tryptophan is rescued by the expression of a plasmid-borne YGL117W (Materials & Methods). (C) A model describing the potential role of Ygl117w in the aromatic amino acid biosynthesis pathway. Data in the literature, Yeast Phenome data and our own validation experiments are consistent with the hypothesis that Ygl117w negatively regulates the ability of phenylalanine to feedback-inhibit the activity of an Aro enzyme.

The relationship between phenotypic similarity and chromosomal proximity is consistent across all chromosomes examined independently. Gene pairs located on each chromosome were sorted by their intergenic distance and subdivided into groups of 250 pairs. In each group, the average intergenic distance and average phenotypic similarity were computed and plotted on the

As a reference, the red line indicates the same linear fit estimated from all chromosomes (Fig. 5).

Figure S14
Percent of adjacent genes (color scale)

The relationship between phenotypic similarity and chromosomal proximity is consistent across several independent subsets of Yeast Phenome data. Phenotypic similarity was computed using only screens from 3 large studies (Hillenmeyer et al., 2008

The relationship between phenotypic similarity and chromosomal proximity is consistent across all sets of strains constructed by the same laboratory. Gene pairs located on each chromosome were sorted by their intergenic distance and subdivided into groups of 100 pairs. In each group, the average intergenic distance and average phenotypic similarity were computed and plotted on the x and y-axis, respectively. Distance was plotted on a log10 scale. The yellow line indicates the average phenotypic similarity for gene pairs located on different chromosomes. The blue line indicates the approximate boundary of the exponential relationship estimated from all gene pairs (380 kb; Fig. 5; Materials & Methods). The green line indicates the linear fit between log10 intergenic distance and phenotypic similarity for all gene pairs within the estimated distance boundary (left of the blue line) constructed by a given laboratory. As a reference, the red line indicates the same linear fit estimated from all gene pairs (Fig. 5). Although strains constructed by laboratory 14 have a consistently higher phenotypic similarity (see main text), they still show an exponential relationship with intergenic distance.

YKO mutants constructed by laboratory 14 appear to be the only case of a lab-linked aneuploidy that affects proximal genes. (A) The average correlation was computed for YKO mutants constructed by the same laboratory and affecting genes located on the same chromosome (x-axis) vs different chromosomes (y-axis). In all cases (except for laboratory 14), YKO mutants affecting genes located on different chromosomes show background levels of phenotypic similarity (yellow line), whereas those located on the same chromosome are consistently more similar. (B) YKO mutants constructed by laboratory 14 have been previously shown to carry an extra copy of chromosome XI (see main text). An overview of aneuploidy across ~4,400 YKO mutants indicates that amplification of chromosome XI in strains generated by laboratory 14 is the only example of chromosome aneuploidy shared by proximal genes.

The relationship between phenotypic similarity and chromosomal proximity is not explained by known cases of functional co-clustering. (A) The relationship persists if we exclude all gene pairs with prior evidence of functional co-clustering (Materials & Methods): members of the same protein complex, metabolic pathway or moderately specific GO biological process term; genes co-expressed or co-regulated by the same transcription factor; paralogous gene pairs. Gene pairs located on the same chromosome were sorted by their intergenic distance and subdivided into groups of 1,000 pairs. In each group, the average intergenic distance and average phenotypic similarity were computed and plotted on the x and y-axis, respectively. Distance was plotted on a log

The relationship between phenotypic similarity (co-essentiality) and chromosomal proximity among human genes persists if we exclude amplified genes. Genes with reported evidence of copy number amplification (n > 4) in at least 2 cancer cell lines were excluded from the analysis (Materials & Methods). Gene pairs located on the same chromosome were sorted by their intergenic distance and subdivided into groups of 1,000 pairs. In each group, the average intergenic distance and average phenotypic similarity were computed and plotted on the x and y-axis, respectively. Distance was plotted on a log10 scale. The color of each point indicates the fraction of immediately adjacent genes in the group. The yellow line indicates the average phenotypic similarity for gene pairs located on different chromosomes. The blue line indicates the approximate boundary of the exponential relationship estimated from each subset. The red line indicates the linear fit between log10 intergenic distance and phenotypic similarity for all points within the estimated distance boundary (left of the blue line).
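The binning and fitting procedure repeated in these captions can be sketched as follows. This is a minimal illustration, assuming an ordinary least-squares fit of similarity against log10 distance; the function names, the toy data and the `max_distance` argument (standing in for the estimated distance boundary) are hypothetical.

```python
import math

def bin_gene_pairs(pairs, group_size=1000):
    """Sort (distance, similarity) pairs by distance and average them in consecutive groups."""
    ordered = sorted(pairs)
    binned = []
    for start in range(0, len(ordered), group_size):
        group = ordered[start:start + group_size]
        mean_distance = sum(d for d, _ in group) / len(group)
        mean_similarity = sum(s for _, s in group) / len(group)
        binned.append((mean_distance, mean_similarity))
    return binned

def fit_log10_distance(points, max_distance):
    """Least-squares fit of similarity vs log10(distance) for points within max_distance.

    Returns (slope, intercept); requires at least two distinct distances.
    """
    xs = [math.log10(d) for d, _ in points if d <= max_distance]
    ys = [s for d, s in points if d <= max_distance]
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

# Toy pairs following similarity = 0.5 - 0.1 * log10(distance) (made-up relationship).
toy_pairs = [(10, 0.4), (100, 0.3), (1000, 0.2), (10000, 0.1)]
toy_binned = bin_gene_pairs(toy_pairs, group_size=2)
toy_slope, toy_intercept = fit_log10_distance(bin_gene_pairs(toy_pairs, group_size=1), 380_000)
print(toy_binned, toy_slope, toy_intercept)
```

A straight line on a log10 distance axis is what the captions refer to as an exponential relationship: similarity falls by a constant amount for every tenfold increase in intergenic distance.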

Figure S2
Visual and quantitative comparisons of the SAFE profiles of the 8 screens (3 of which are shown here) demonstrate that, on a functional level, the sets of identified hits are highly similar to one another and consistently associated with respiration, oxidative phosphorylation and mitochondrial targeting functions. (C) Reproducible hits are more likely to associate with relevant biological functions than non-reproducible hits. Nodes of the genetic interaction similarity network (B) that correspond to hits from Merz et al., 2009 (80) are represented as dots. The color of each dot indicates the total number of screens in which that gene was identified as a hit. Dark red, red, orange and light green colors indicate hits reproduced in at least 5 of the 8 studies. As expected, these hits are concentrated in the network domain associated with respiration, oxidative phosphorylation and mitochondrial targeting. In contrast, black and dark blue dots indicate genes identified in only 1 or 2 screens. These hits are more randomly distributed throughout the network.

Figure S12, Figure S13
x and y-axis, respectively. Distance was plotted on a log10 scale. The color of each point indicates the fraction of immediately adjacent genes in the group. The yellow line indicates the average phenotypic similarity for gene pairs located on different chromosomes. The blue line indicates the approximate boundary of the exponential relationship estimated from all chromosomes (380 kb; Fig. 5; Materials & Methods). The green line indicates the linear fit between log10 intergenic distance and phenotypic similarity for all points within the estimated distance boundary (left of the blue line) on each chromosome.
7); Hoepfner et al., 2014 (18); Lee et al., 2014 (71)), as well as all screens excluding the top 5 largest chemo-genomics datasets (the top 3 plus Parsons et al., 2006 (6); Ericson et al., 2008 (72)). Gene pairs located on the same chromosome were sorted by their intergenic distance and subdivided into groups of 1,000 pairs. In each group, the average intergenic distance and average phenotypic similarity were computed and plotted on the x and y-axis, respectively. Distance was plotted on a log10 scale. The color of each point indicates the fraction of immediately adjacent genes in the group. The yellow line indicates the average phenotypic similarity for gene pairs located on different chromosomes. The blue line indicates the approximate boundary of the exponential relationship estimated from each subset (Materials & Methods). The red line indicates the linear fit between log10 intergenic distance and phenotypic similarity for all points within the estimated distance boundary (left of the blue line).

Figure S16, Figure S17
10 scale. The color of each point indicates the fraction of immediately adjacent genes in the group. The yellow line indicates the average phenotypic similarity for gene pairs located on different chromosomes. The blue line indicates the approximate boundary of the exponential relationship estimated from each subset (Materials & Methods). The red line indicates the linear fit between log10 intergenic distance and phenotypic similarity for all points within the estimated distance boundary (left of the blue line). (B) Similarity of gene expression across multiple experimental conditions (37) is also related to chromosomal proximity. However, this relationship has a much shorter range (10.8 kb) than phenotypic similarity (380 kb). Gene pairs located on the same chromosome were sorted by their intergenic distance and subdivided into groups of 1,000 pairs. In each group, the average intergenic distance and average gene co-expression (measured by cosine similarity) were computed and plotted on the x and y-axis, respectively. Distance was plotted on a log10 scale. The color of each point indicates the fraction of immediately adjacent genes in the group. The yellow line indicates the average phenotypic similarity for gene pairs located on different chromosomes. The blue line indicates the approximate boundary of the exponential relationship estimated from each subset (Materials & Methods). The red line indicates the linear fit between log10 intergenic distance and phenotypic similarity for all points within the estimated distance boundary (left of the blue line).