Origin and functional diversification of PAS domain, a ubiquitous intracellular sensor

Signal perception is a key function in regulating biological activities and adapting to changing environments. Per-Arnt-Sim (PAS) domains are ubiquitous sensors found in diverse receptors in bacteria, archaea, and eukaryotes, but their origins, distribution across the tree of life, and extent of their functional diversity are not fully characterized. Here, we show that using sequence conservation and structural information, it is possible to propose specific and potential functions for a large portion of nearly 3 million PAS domains. Our analysis suggests that PAS domains originated in bacteria and were horizontally transferred to archaea and eukaryotes. We reveal that gas sensing via a heme cofactor evolved independently in several lineages, whereas redox and light sensing via flavin adenine dinucleotide and flavin mononucleotide cofactors have the same origin. The close relatedness of human PAS domains to those in bacteria provides an opportunity for drug design by exploring potential natural ligands and cofactors for bacterial homologs.


INTRODUCTION
Signal transduction pathways in all living cells detect nutrients, hormones, oxygen, redox potential, and other signals and regulate various cellular functions accordingly.Signals are recognized by receptor proteins via dedicated sensor domains.PAS domains are ubiquitous sensor domains that are found in transcription factors, chemoreceptors, histidine and serine/threonine protein kinases and phosphatases, enzymes controlling second messenger turnover, ion channels, and other signaling proteins (1).While many sensor domains are extracytoplasmic, PAS domains are predominantly cytoplasmic (1,2).All experimentally studied PAS domains that were reported as extracytoplasmic (3,4) were later reclassified as Cache domains that share the PAS fold but belong to a distinct domain superfamily (2).PAS domains adopt a conserved globular fold with a distinct binding cavity that can accommodate various ligands and cofactors thus enabling sensing capability (Fig. 1) (4).In addition to the sensory role, PAS domains may act as signal transducers and promote protein-protein interactions (1).PAS domain containing signaling pathways control complex behaviors ranging from motility (5), quorum sensing (6), and virulence (7) in bacteria to phototropism (8), circadian rhythms (9), oxygen homeostasis (10), and immune response (11) in eukaryotes.PAS domains in human transcription factors have become important drug targets for cancer therapy (12)(13)(14).
Over a million of PAS domain containing proteins are identifiable in current databases.However, signals recognized by these domains have been experimentally identified only in a handful of model proteins (table S1).PAS domains are classified into several families based on sequence similarity, but the current classification does not reflect their biological functions.This is largely due to the extreme sequence divergence among PAS domains and the fact that only a few residues define their signal specificity.PAS domains are found in organisms ranging from bacteria to humans, suggesting that they might have originated in the common ancestor.However, their evolutionary history, distribution across the Tree of Life, and potential functional diversification remain largely unknown.
Here, we provide an extensive comparative genomic analysis of PAS domains across bacteria, archaea, and eukaryotes.We show that PAS domains originated in bacteria and evolved sensory functions through different paths.Eukaryotes acquired PAS domains from bacteria via many independent horizontal gene transfer events.Similarity of some of the human PAS domains to those in bacteria suggests that bacterial proteins can serve as useful models to determine signal specificity of the human counterparts, as it was recently shown for another ubiquitous sensor domain (15).Our findings open avenues for functional studies and drug development using PAS domains, and our approach can serve as a framework for studying other protein families with extremely diverse members and versatile functions.

PAS domains are distributed across the tree of life
The distribution of PAS domains across the tree of life has not been systematically investigated since the original discovery of this superfamily, which involved a small number of genomes available at the time (16,17).We searched for PAS domains in the entire set of UniProt reference proteomes (see Materials and Methods for details) and established that PAS domains are present in 66% of archaeal, 93% of bacterial, and 93% of eukaryotic proteomes (data S1).In bacteria, PAS domains are widely present in most phyla; they are absent in the reduced genomes of endosymbionts and intracellular parasites, such as Dependentiae, Endomicrobia, Mycoplasmatales, and Rickettsiales (Fig. 2A and data S1).In archaea, PAS domains have an uneven distribution: Halobacteriota have disproportionally more PAS domains than other phyla (Fig. 2B).In eukaryotes, PAS domains are widely distributed across different phyla except for Apicomplexa (Fig. 2C), which includes obligate endoparasites with reduced genomes, such as Plasmodium, Babesia, Cryptosporidium, and Toxoplasma.
Next, we collected all PAS-containing proteins defined by the Pfam PAS Fold profiles (CL0183) from InterPro (18) and identified their domain composition (data S2).We found that histidine kinases are the most common PAS proteins in bacteria and archaea, whereas transcription factors are more prevalent PAS proteins in eukaryotes (fig.S1).Examples of typical domain architectures of the PAS proteins in three model organisms are shown in Fig. 2

(D to F).
To find out whether PAS domains and PAS-containing proteins are enriched in some taxa, we quantified their content for each genome (normalized by the genome size at the family level; see Materials and Methods).Because of alternative splicing, we calculated the number of eukaryotic PAS proteins both with and without isoforms (Fig. 2G).We found that archaeal and bacterial genomes on average encode more PAS proteins and PAS domains and have higher variances than eukaryotic genomes (Fig. 2G).Moreover, all 17 Pfam PAS families are found in bacteria, whereas only 10 PAS families were found in eukaryotes and 7 in archaea (table S2).Within each kingdom, the number of PAS proteins and PAS domains varies in different taxonomic groups (data S2).For example, genomes of Desulfomonilaceae (desulfobacteria) and Phormidiaceae (cyanobacteria) are enriched with PAS domains (Fig. 2G, blue).In archaea, the Methanomicrobia class (Halobacteriota) encodes a substantially higher proportion of PAS proteins and PAS domains (Fig. 2G, orange).In eukaryotes, teleost fishes have the highest number of PAS proteins especially when considering isoforms (Fig. 2G, gold).
On average, PAS-containing proteins in all three kingdoms contain two PAS domains (Fig. 2H).However, some organisms have proteins with a much larger number of PAS domains.For example, Aulosira tenue and some other cyanobacteria contain up to 19 PAS domains in a single histidine kinase (Fig. 2H, blue; UniProt: A0A1Z4MZL8).Histidine kinases with more than 14 PAS domains are also found in eukaryotes, specifically, in notorious plant pathogens of Phytophthora genus (Fig. 2H, gold, and data S2).We then analyzed the relationship between the number of PAS domains and the genome size and found statistically significant correlation between the number of genes and the number of PAS proteins and PAS domains (fig.S2; see Materials and Methods).
We noticed that some of eukaryotic PAS domains were not identified by Pfam profile models and the default search tool HMMER (e.g., the third PAS domain of human PASK in Fig. 2F), indicating that the actual number of eukaryotic PAS domains may be higher.
To explore this, we compared the number of PAS domains identified in 14 representative eukaryote genomes by HMMER (19) with that identified by a manually curated structure search using Foldseek against the AlphaFold database (20,21).Proteins containing a PAS fold in their AlphaFold models were subjected to sequence similarity searches followed by multiple sequence alignment to distinguish PAS domains from structurally similar Cache and GAF domains.Notably, 36.4% PAS domains revealed by FoldSeek were not identified by HMMER, indicating that the thresholds used by Pfam were too strict (data S3).Therefore, we applied a relaxed threshold for HMMER (E value < 1) and collected PAS proteins from 43 representative eukaryotic genomes.False-positive matches without a PAS fold have been removed by manually checking the ESMFold structures (22).As a result, we found that most eukaryotes have around 10 to 30 PAS proteins per proteome, but much larger numbers (50 to 80 PAS proteins) are seen in Naegleria gruberi, Acanthamoeba castellanii, Guillardia theta, and Danio rerio (data S4).Moreover, alternative splicing in Chordata results in up to 300 to 400 PAS isoforms per proteome (data S4).As the structural search requires substantial manual curation, we performed it only on representative genomes.

Current classification of PAS domains does not reflect their biological functions
The current Pfam database classified PAS domains into 17 families (23).To understand whether the current classification of PAS domains reflects their biological functions, we conducted a comprehensive literature search for cofactor-binding PAS domains and compared their distribution within Pfam protein families (table S1) (23).We established that the current classification of PAS domains does not reflect sensory functions: (i) PAS domains with the same cofactor can be found in different families and (ii) the same family can contain PAS domains with different cofactors (table S1).
Most PAS families contain general structural elements of the conserved PAS fold (Fig. 1 and table S3).However, some of the large families do not cover the entire PAS domain region.For example, PAS_3 family profile (Pfam entry: PF08447) does not include the N-terminal Aβ and Bβ, whereas PAS_8 family profile (Pfam entry: PF13188) does not include the C-terminal Gβ, Hβ, and Iβ (fig.S3 and table S3) (23).On the other hand, profiles for some other families include structural elements that are not part of the PAS fold.For example, additional multiple α helices are included at the N terminus of PAS_5 and PAS_12 profiles (table S3).Curiously, PAS domain profiles for two families, CpxA_peri and AbfS_sensor, contain two transmembrane regions, which is typical of Cache domains (table S3) (2).
Next, we performed all-versus-all BLAST searches using seed sequences from all families after adding missing regions to PAS_3 and PAS_8.We found that while small families form distinct clusters, large families, including PAS, PAS_3, PAS_4, PAS_8, and PAS_9 (Pfam entries: PF00989, PF08447, PF08448, PF13188, and PF13426), are mixed in the sequence similarity network, indicating overlaps and misclassification (fig.S3B).Functional PAS families can be identified using sequence and structural information Using HMMER, we collected approximately 2.8 million sequences that belong to 17 PAS families from the NCBI reference sequence (RefSeq) database (24) and searched sequences of each PAS family against the entire Pfam database (23) (see Materials and Methods).Noticeably, we observed numerous cases of overlaps between nine major families, including PAS, PAS_3, PAS_4, PAS_7, PAS_8, PAS_9, PAS_10, PAS_11, and PAS_12 (fig.S4 and table S4).According to the InterPro database, sequences of these families comprise more than 80% of the entire PAS fold superfamily (18).To reclassify these families, we combined these sequences and reduced redundancy over 80% identities, which resulted in 369,910 PAS domain sequences (data available at https://github.com/bioliners/PAS).Next, we performed an all-versus-all sequence alignment using DIAMOND (25) and assigned sequences into Markov clusters (MCL) based on the sequence similarity network (26).This resulted in 462 clusters, 163 of which contain at least 100 sequences and comprised 98% of the entire dataset (Fig. 3A, fig.S5, and data S5).The size of the cluster correlates with the average sequence length of its members (fig.S5), indicating that clusters 164 to 462 may consist of partial sequences.Of the 163 clusters, more than 100 (over 300,000 sequences) have conserved residues inside the PAS domain cavity that might serve as potential binding sites for cofactors or ligands (Fig. 3A; all data available at https://github.com/bioliners/PAS).
Some of these clusters contain well-studied PAS domains with known cofactors (Fig. 3B, fig.S6, and table S1).For example, cluster 3 includes PAS domains of the histidine kinase FixL and the diguanylate cyclase DosP, both of which have a histidine at Fα serving as a heme-binding site (Fig. 3C and table S1) (27,28).We performed a structural comparison using available structures from the Protein Data Bank (PDB) and found that the histidine is the primary determinant of heme binding, while a few other conserved residues in the surroundings do not interact with heme in all structures and may serve as secondary key residues (fig.S7A).In more than 9700 (60%) sequences in this cluster, this histidine is invariable, suggesting that all these homologs may bind heme as the cofactor (fig.S6).A comparison between the His-containing sequences and the rest of the sequences from cluster 3 demonstrated that their overall sequence conservation is very similar, supporting that our clusters reflect sequence homology, and the observed lack of conservation in 40% of sequences may be due to site-specific mutations that potentially lead to acquiring different sensory specificity (fig.S8).Custer 119 contains PAS domains from the oxygen sensor Aer2, which has a histidine for heme binding at a different location, Eα (Eη) (Fig. 3C and table S1) (29).Similar to cluster 3, this histidine is the primary determinant of heme binding (fig.S7A).More than 200 (73%) sequences in cluster 119 have this invariable histidine, suggesting heme binding for all these homologs (fig.S6).Clusters 4, 9, and 14 contain PAS domains of the histidine kinases NifL and MmoS, and the chemoreceptors AerC and Aer, all of which are redox sensors with flavin adenine dinucleotide (FAD) as the cofactor (Fig. 3C and table S1) (30)(31)(32)(33).In accordance with a previous study, we identified a conserved tryptophan at Fα as the primary determinant for FAD binding across the three clusters (fig.S7B) (33).However, two other FAD-binding residues are conserved in some of the clusters (a histidine in clusters 14 and 9 and an asparagine in clusters 4 and 14), indicating secondary key residues specifical for these clusters.Across the three clusters, more than 22,600 sequences (53, 91, and 84%, respectively) have the invariable tryptophan in the same position, suggesting FAD as the cofactor for all these PAS domains (fig.S6).Cluster 11 contains a PAS domain subfamily called LOV (light-oxygen-voltage) domains, which serve as blue light sensors and have a conserved cysteine for covalent bonding to flavin mononucleotide (FMN), FAD, or riboflavin (Fig. 3C and table S1) (34)(35)(36).More than 4300 (57%) PAS domains from this cluster have the conserved cysteine, suggesting binding of these cofactors (fig.S6).Cluster 144 contains a different blue light sensor-photoactive yellow protein (PYP), which has conserved cysteine, tyrosine, and glutamate for p-coumaric acid binding (Fig. 3C and table S1) (37).The three residues are also conserved in this cluster, indicating p-coumaric acid as their cofactor (fig.S6).
Our clustering approach enabled assignment of specific cofactor binding and indication of potential ligand/cofactor binding to more than 80% of PAS domains with currently unknown functions (Fig. 3A).It further confirmed that PAS domains have a broad phyletic distribution, especially in bacteria (Fig. 3D and data S6).Furthermore, some clusters are enriched in distinct signal transduction proteins.For example, clusters 3 (heme), 4 (FAD), and 11 (FMN) are associated with histidine kinases and diguanylate cyclases, clusters 9 (FAD) and 119 (heme) with chemoreceptors, and cluster 14 (FAD) with all three types of receptors (Fig. 3D and data S6).

Heme and flavin containing PAS domains have different evolutionary paths
PAS domains have short and diverse sequences, which complicates their phylogenetic analysis (1).However, they all share the same structural fold (38).Here, we sampled 1% sequences from each cluster and constructed phylogenetic trees based on a previously established structure-guided approach designed for short and diverse sequences (see Materials and Methods) (39,40).We found that clusters 3 and 119 that contain known heme-binding PAS domains are not related to each other (Fig. 4A).Other putative heme-binding PAS domains are also found in different clusters (table S1): specifically, ERG3(KCNH7)-PAS (41) is in cluster 11, CLOCK/NPAS2-PAS B (42,43) are in cluster 72, and CLOCK/ NPAS2-PAS A (42,43) are in cluster 87, all of which are also not related to clusters 3 and 119 (Fig. 4A).In addition, the histidine residue for heme binding is also located in a different position in each cluster (Figs.3C and 4B) (41,42).Notably, evidence for heme binding in ERG3 and CLOCK/NPAS2 is inconclusive due to the lack of heme-bound structures and formally defined binding sites.As shown for PER2, heme binding may be unspecific (44).Together, the sequence, structural, and phylogenetic information suggests that PAS domains in these clusters evolved hemebinding capabilities independently (Fig. 4C).
In contrast, clusters 4 (FAD), 9 (FAD), 11 (FMN), and 14 (FAD) form a single clade on the tree (Fig. 4A).We also built a more accurate tree using all sequences from these four clusters (after reducing redundancy) and found that cluster 11 (FMN) and other clusters (FAD) form two branches, indicating functional divergence from a single origin (Fig. 4A, right).Furthermore, a similar cofactorbinding site is present in all four clusters (Fig. 3C).Together, these data suggest that FMN and FAD binding PAS domains evolved from the same origin and then diverged into four clusters (Fig. 4C).Intriguingly, the heme-binding PAS domain from ERG3 (KCNH7) is a homolog of cluster 11 LOV domains, indicating neofunctionalization from FMN to heme binding (Fig. 4C).A scheme for binding of the three cofactors in PAS domains is shown in Fig. 4D: Histidine is the only invariable residue interacting with heme in all structures in both clusters 3 and 119; tryptophan, histidine, and asparagine are the three conserved residues binding FAD in clusters 4, 14, and 9 (at least two residues conserved in each cluster); arginine, cysteine, and asparagine are the three conserved residues binding FMN corresponding to the three residues in FAD binding (fig.S7).

Eukaryotic PAS domains have bacterial origins
Using the sequence and structure searches, we identified 34 PAS proteins in the human genome (data S3, S4, and S7; see Materials and Methods).A literature search showed that these proteins play important roles in human health (table S5).On the basis of the domain architecture, these proteins can be grouped into four families: the potassium ion channels KCNH, the phosphodiesterases PDE8, the transcription factors bHLH-PAS, and the serine/threonine kinase PASK (Figs. 2F and 5 and table S5).According to the analysis of our representative genome dataset, all these four families were present before the emergence of Metazoa; however, the PAS domains were integrated into these proteins during early stages of metazoan life (Fig. 5A and data S4).To investigate origins of these PAS domains, we performed BLAST searches against the NCBI RefSeq database using human PAS orthologs from the organisms located closest to the root of the tree (Fig. 5A): These searches identified no similar sequences in eukaryotes other than Metazoa but found PAS domains from various bacterial proteins with more than 40% sequence identity, suggesting their origins in bacteria (data S8).To determine whether this observation holds true for other eukaryotic PAS domains, we performed the same type of searches for randomly selected PAS domains from eukaryotes representing major lineages other than Metazoa: fungus Spizellomyces punctatus (RefSeq accession XP_016607844.1),choanoflagellate Monosiga brevicollis (XP_001742645.1),plant Arabidopsis thaliana (NP_001077788.1),cryptophyte Guillardia theta (XP_005835918.1),oomycete Phytophthora sojae (XP_009526312.1),alveolate Perkinsus marinus (XP_002767112.1), and excavate N. gruberi (XP_002678807.1).Notably, we observed the same pattern: In each case, no eukaryotic sequences from other lineages were found, but instead diverse bacterial PAS domains with 35 to 48% sequence identities were identified (data S8).Together, these data suggest that eukaryotes acquired PAS domains from bacteria via multiple independent horizontal transfer events.

Human PAS domains are promising drug targets
To further investigate the bacterial origins of human PAS domains, we integrated information from clustering and phylogenetic trees.We found that KCNH-PAS belong to cluster 11 (FMN) and may originate from Alphaproteobacteria FMN-PAS, as previously suggested (Fig. 5B and fig.S9) (45).Structure comparison reveals a putative binding site in EAG1(KCNH1)-PAS, where polar residues for flavin binding are replaced by hydrophobic residues (Fig. 5B).Notably, the antipsychotic drug chlorpromazine, which is a hydrophobic flavin analog, binds EAG1-PAS and modulates channel activities (Fig. 5B) (46).Thus, EAG1-PAS may bind chlorpromazine in a similar way as bacterial PAS domains bind flavins.In addition, we found that PDE8-PAS belongs to cluster 4 (FAD) (Fig. 5C).Structure comparison also shows a putative binding site and hydrophobic residues, suggesting that PDE8-PAS might bind flavin analogs (Fig. 5C).
The two PAS domains of bHLH transcription factors form their own clusters (87 for PAS A and 72 for PAS B) (Fig. 5D).Cluster 87 has low sequence conservation, but cluster 72 groups with cluster 8 on the phylogenetic tree, suggesting that the bHLH-PAS may originate from cluster 8, most likely from the PAS domains in Firmicutes (Bacilli) (fig.S10).Notably, both bHLH-PAS and cluster 8 have a conserved H(P)(D/E)D…R motif connecting the α helix and β strand, but how this conserved structure affects PAS domain properties remains unknown (Fig. 5D and fig.S6).Although cluster 8 is largely unstudied, ligand binding has been reported in the bHLH-PAS protein AHR and anticancer drug binding in HIF2α (Fig. 5D) (47,48).PAS domains in PASK also form their own clusters, with cluster 3 (heme), which primarily consists of bacterial sequences, being the closest neighbor (Fig. 5E).Moreover, PASK-PAS from A. castellanii, Thecamonas trahens, and Capsaspora owczarzaki are found in cluster 3, further suggesting their origins from bacterial heme-PAS (data S3).However, in human PASK-PAS A, the heme-binding residue is not conserved, and potential ligand-binding cavities are in a different location compared to bacterial heme-PAS, indicating that it evolved a different function (Fig. 5E).A previous study has identified non-heme compounds as ligands for PASK-PAS A (Fig. 5E) (49).

DISCUSSION
The last time genome-wide survey of PAS domains was carried out in 1999, when only 300 PAS domains from fewer than 20 genomes were available for analysis (1).In this study, we analyzed nearly 3 million PAS domains from more than 100,000 genomes of bacteria, archaea, and eukaryotes.Here, we present findings about evolution, genomic landscape, and functional diversification of these remarkable biological sensors.We now show that PAS domains are present in most life forms; they are absent from organisms with reduced genomes, such as endosymbionts and intracellular parasites (50).We found that the number of PAS domains per genome correlates with the genome size (fig.S2), thus following the trend established previously for major signal transduction families in bacteria (51).PAS domains were identified in major types of signal transduction proteins.In bacteria, they are predominantly found in histidine kinases, whereas in eukaryotes, they are more often found in transcription regulators (fig.S1).
We demonstrate that the current classification of PAS domains and profile models used for their identification in genomic datasets are outdated and lack biological information.We showed that using advanced clustering methods, sequence conservation, and structural information, it is possible to produce a better classification system.Still, deriving PAS domain function from sequence and structure remains a grand challenge.First, sequences from the same cluster might have different functions due to recent neofunctionalization, as we have recently shown for the PYP (52).Here, we found similar cases.For example, most sequences in cluster 3 contain a conserved histidine, which is a known heme-binding site, but some sequences in the cluster have a different conservation pattern, resulting in association with a different cofactor, Fe-S cluster (table S1) (53).However, it should be noted that the mixing of different functions occurs to a much lesser extent in our clusters compared to the current Pfam classification; for example, cluster 3 contains only nine PAS domains with the cysteine motif for putative binding of Fe-S clusters (53).Second, proteins from different clusters may have the same function-what is known as nonhomologous, isofunctional proteins or results of convergent evolution.Several examples of heme-binding PAS domains fall into this category.Third, homologous PAS domains with the same function may be found in different clusters.This could be due to their association with different downstream signaling domains.For example, FAD-binding PAS domains form three clusters (4, 9, and 14), but they are more related to each other than to any other cluster and share the same cofactor-binding site, suggesting the same function despite substantial sequence divergence (Fig. 3, B  and C).Such cases are also infrequent compared to the Pfam families and can be addressed by combining closely related clusters with the same function.A previous study identified two clusters of FAD-PAS chemoreceptors exemplified by Aer and AerC that have different domain composition (33).Satisfactorily, these clusters correspond to our clusters 14 and 9.In addition, here, we identified an extra cluster (cluster 4), where PAS domains come from signaling proteins other than chemoreceptors.
Despite these problems, we were able to propose a specific function (a conserved binding site for a specific cofactor) or a potential function (a conserved pattern within a binding cavity, indicating potential cofactor or ligand binding) for a large portion of the vastly unstudied "galaxy" of PAS domains (Fig. 3A).The latter category involves ligand-binding PAS domains (54), which were not analyzed in this study, because of limited published data.Furthermore, we hypothesize that clusters of PAS domains without conspicuous conservation may perform other biological functions, such as promoting protein-protein interactions and serving as signal transducers, which does not require strict conservation of individual amino acid residues.Thus, our findings provide opportunities for targeted biochemical characterization of this important but largely unstudied superfamily.
Our study also offers a view on the origin and evolution of PAS domains.The fact that PAS domains were identified in such diverse organisms as bacteria, archaea, and eukaryotes, including humans (1), suggested early origins of PAS domains, perhaps even their presence in the last universal common ancestor.Here, we show that this is not the case.Considering our current understanding of the relationship between bacteria, archaea, and eukaryotes (55), several lines of evidence suggest that PAS domains originated in bacteria and were horizontally transferred to archaea and eukaryotes: (i) Bacteria have the widest phyletic distribution of PAS domains, and nearly all bacterial phyla have a similar proportion of PAS domains per genome (Fig. 2A).In contrast, many archaeal phyla do not have any representatives with PAS domains.Among archaea, Halobacteriota have disproportionally more PAS domains than other phyla (Fig. 2B), which is consistent with a previous finding that they acquired many signal transduction genes from bacteria (56).(ii) Bacteria have the widest diversity of PAS domains: They come from all currently known PAS families, whereas many of these families are not present in archaea and eukaryotes (table S2).(iii) Histidine kinases are the class of signal transduction proteins that are mostly enriched in PAS domains (fig.S1).It was previously shown that histidine kinases likely originated in bacteria and were horizontally transferred to archaea and eukaryotes (57).(iv) Metazoan PAS domains are more similar to bacterial PAS domains than to those from other eukaryotic lineages, indicating horizontal gene transfer (data S8).(v) Randomly selected PAS domains from major eukaryotic supergroups are more similar to bacterial PAS domains than to those from other eukaryotic supergroups, indicating horizontal gene transfer (data S8).A nearly equal distribution of PAS domains among bacterial phyla suggests that the PAS domain emerged early in bacterial evolution.
With respect to the evolution of a specific function, our results suggest that while heme-PAS domains have evolved several times, flavin-PAS domains have the same evolutionary origin.Heme binding usually requires only one conserved residue, a histidine, for iron coordination, which may facilitate its de novo evolution (Fig. 3D).In contrast, flavin-binding is much more complex, as it requires multiple hydrogen bonds and stacking interactions, thus constraining de novo evolution (Fig. 3D).
Our finding that PAS domains in humans have close homologs in bacteria is exciting.An anticancer drug targeting the PAS B domain in the transcription factor HIF2α was recently approved by US Food and Drug Administration (14,48), revealing a promising role of PAS domains as drug targets (Fig. 5D).The bacterial origin of human PAS domains could provide valuable information for drug design.For example, we found that bacterial flavin-PAS is the origin of human KCNH-PAS and PDE8-PAS, both are potential targets for cancer therapies (58,59).EAG1(KCNH1)-PAS is regulated by the antipsychotic drug chlorpromazine, which is one of the flavin analogs blocking the flavokinase activity (46,60).The fact that EAG1-PAS evolved from flavin-PAS and binds a flavin analog suggests that potentially drugs for PAS domains can be designed based on their original sensory functions in bacterial homologs.This hypothesis is further supported by a recent finding that the Cache domain (a distant homolog of PAS domains) (2) in the human α2δ-1 protein binds γ-aminobutyric acid (GABA)-derived drugs the same way as bacterial Cache domains bind their natural ligands-GABA and other amino acids (15,61,62).
This study highlights key properties of PAS domains as universal molecular sensors-their plasticity, evolvability, and transmissibility.It also exposes the need for a better classification system and current limitations in their identification and function prediction.Conserved clusters of PAS domains with proposed functions present an opportunity to design targeted experiments for validation and further exploration of their diverse properties.

Data collection
PAS-containing proteomes and corresponding genome assembly identifiers were retrieved from the InterPro database (18) using its API (PAS Fold, CL0183).We assigned the NCBI (63) and GTDB (64) taxonomy using taxonomy IDs and a custom Python script.Protein and gene identifiers of PAS proteins were retrieved from the UniProt database (65).In the case of eukaryotes, the longest isoform of the PAS proteins was kept for the subsequent analysis.Protein domain compositions were retrieved from the UniProt database based on the available Pfam models (23).PAS protein counts and PAS domain counts were calculated using a custom Python script.PAS protein and domain counts were normalized on the basis of the total gene count of corresponding genomes.All scripts used are available at https://github.com/bioliners/PAS.
Hidden Markov models (HMMs) of PAS domain families were downloaded from the Pfam database (23).These HMMs were used as queries in hmmsearch to collect PAS domain sequences from the NCBI reference sequence protein database (RefSeq) (24) with a threshold of E value < 0.01 (19).To find overlaps between PAS domain families, these sequences were then searched against all HMMs from Pfam using hmmscan (19).

PAS domain distribution analyses
To mitigate the issue of uneven distribution of sequenced genomes across taxonomic groups when assessing the distribution of PAS domains across the tree of life, we sequentially averaged normalized PAS protein and domain counts from the species level up to the family level and expressed the final values as a percentage.In this way, even if bacteria have substantially greater number of orders per family (and correspondingly genera per order and species per genus), it gets compressed to a single value per family.At the family level, bacteria and eukaryotes have the most comparable numbers of members (families) without compromising the power of statistical tests.Spearman test was used to investigate the correlation of the number of PAS proteins and PAS domains with genome size expressed as the number of all encoded genes in the analyzed genomes.The number of PAS proteins was subtracted from the total number of genes before calculations to alleviate possible self-correlation.Scripts used are available at https://github.com/bioliners/PAS.

Sequence clustering
PAS domain sequences collected from RefSeq were combined, and sequence redundancy was reduced using CD-HIT with a threshold of identity > 80% (24,66).A sequence similarity network was generated by DIAMOND in the sensitive mode (E value < 0.05, query coverage > 80%) (25).The network was then used for clustering by the high-performance Markov cluster algorithm (HipMCL) with an inflation of 1.4 (26,67).The distance between two clusters was calculated as the average number of DIAMOND hits, i.e., the total number of DIAMOND hits between them divided by the product of the numbers of sequences in the two clusters.

Cluster analyses
For each cluster, multiple sequence alignments were constructed using MAFFT with default parameters (68).Sequence logos were generated using WebLogo 3 (69) and Skyalign (70).For clusters 1 to 163 with conserved residues that may serve as binding sites, a representative sequence was selected, defined by the best hit in hmmsearch when searching sequences against the HMM of the cluster built by hmmbuild (19).The conserved residues were highlighted in the AlphaFold model of the representative PAS domain (21).For the clusters with putative cofactors, taxonomy was obtained from GTDB (for bacteria and archaea) (64) and NCBI (for eukaryotes) (63), and protein architectures were identified using TREND with default parameters (71,72).Sequences, logos, and structure models for clusters are available at https://github.com/bioliners/PAS.

Phylogenetic analyses
Approaches for structure-guided alignment and phylogenetic tree construction were modified from previous studies (39,40).PDB structures (1DP6, 1NWZ, 1V9Z, 2GJ3, 2MWG, 2PD7, 2Z6C, 3BWL, 3EWK, 3FG8, 3K3C, 4F3L, 4HI4, 4MN5, 4ZP4, 5HWT, 6CEQ, 6DGA, 6KJU, and 6QPJ) were used for structure alignment using MUSTANG (73).This alignment resulted in an HMM using hmmbuild, which guided the alignment of representative sequences from each cluster using hmmalign (19).After trimming gaps (gaps > 90%) and removing sequences with low coverages (query coverage < 80%), the alignment was used for building a maximum likelihood tree using IQ-Tree (74) with the LG + G4 substitution model proposed by ProtTest 3 (75) and 1000 bootstraps.Sequences (1%) were randomly sampled from each cluster using a custom Python script, from which a final sequence alignment was built on the basis of the previous alignment and tree using SEPP (76).A final maximum likelihood tree was built from this alignment using IQ-Tree with the same parameters (74).

Structure search in human proteomes
PAS B from BMAL1 (PDB: 2KDK) was identified as the representative human PAS domain by hmmsearch against HMM (19).This PAS domain was used to search against the human proteome (UP000005640) from the AlphaFold database (77) using TMalign with the default parameters (78).Data S5 shows results with TM-scores larger than 0.5.

Supplementary Materials
This PDF file includes: Figs.S1 to S10 Tables S1 to S5 Legends for data S1 to S8 References Other Supplementary Material for this manuscript includes the following: Data S1 to S8

Fig. 2 .
Fig. 2. Global distribution of PAS domains.(A to C) PAS domain distributions in reference proteomes from bacterial (A), archaeal (B), and eukaryotic (C) phyla.Colored bars show numbers of proteomes with PAS domains in each phylum.Gray bars show numbers of proteomes without PAS domains in each phylum.Percentages show proportions of PAS-containing proteomes in each phylum.Phyla with fewer than 20 reference proteomes are not shown.(D to F) Representative PAS-containing proteins in three model organisms.(G) Average numbers of PAS proteins per genome (left) and PAS domains per genome (right) across three kingdoms (normalized by total gene numbers, see Materials and Methods).Each dot represents the average count of a family.(H) Numbers of PAS domains per protein across three kingdoms.Each dot represents the number of PAS domains from a protein.

Fig. 3 .
Fig. 3. Functional clusters of PAS domains.(A) Clustering results.Possible functions stand for sequences with conservation but unknown functions.Specific functions indicate sequences with predicted cofactors.Numbers of sequences in each category are shown.(B) Markov clusters of PAS domains.Clusters are numbered according to the number of sequences (clusters with smaller numbers contain more sequences).Colored edges connecting clusters reflect BLAST hits between clusters (yellow to red from few to many).Only clusters 1 to 163 with conserved residues within the cavity are shown.Outlier clusters are not shown for simplicity.Clusters with specific conserved cofactors are highlighted: blue, heme binding; orange, FAD binding; red, FMN binding; yellow, PYP homologs.Representative proteins are labeled next to these clusters.(C) Cofactor-binding structures.A representative PAS domain is shown for each cluster: cluster 3, FixL (PDB: 1DRM); cluster 119, Aer2 (PDB: 4HI4); cluster 144, PYP (PDB: 2PHY); cluster 4, NifL (PDB: 2GJ3); cluster 14, Aer (PDB: 8DIK); cluster 11, Phot1 (PDB: 2Z6C).Yellow, cofactor; green, conserved residues for cofactor binding.(D) Distribution of PAS domain clusters.The size of the chart represents the number of PAS domains.Colors indicate PAS domains from different proteins.HK/RR, two component systems; GGDEF/EAL, cGMP regulators.

Fig. 4 .
Fig. 4. Evolutionary history of cofactor-binding PAS domains.(A) Maximum likelihood tree using sampled sequences from each cluster.Clusters of interest are marked with different colors.Clusters 4, 9, 11, and 14 form a single clade, and a more accurate tree is shown on the right using all sequences from these clusters after reducing 60% sequence redundancy.Dots on the tree indicate bootstrap values not less than 70.(B) Structures of heme-binding PAS domains.ERG3-PAS (PDB: 6Y7Q) and CLOCK-PAS A (PDB: 6QPJ) are shown.Putative residues for heme coordination are highlighted in green.ERG3-PAS is from cluster 11 [shown in (A)].CLOCK/NPAS2-PAS A is from cluster 87 (not included for tree building due to low conservation).CLOCK/NPAS2-PAS B is from cluster 72 [shown in (A)] and may also bind heme, but the putative binding site is unknown.Note that heme binding in these PAS domains is inconclusive.(C) Evolutionary paths of heme and flavin binding PAS domains.Ancestor 1, ancestor of cluster 3 PAS domains; ancestor 2, ancestor of cluster 119 PAS domains; ancestor 3, ancestor of bHLH-PAS; ancestor 4, common ancestor of clusters 4, 9, 11, and 14.Blue, hemebinding PAS domains; red, FMN-binding PAS domains; orange, FAD-binding PAS domains.Key residues for cofactor binding are labeled for each PAS domain.LOV domains bind FMN (or FAD) for blue light sensing.(D) Scheme of heme and flavin binding sites.Key residues for cofactor binding are shown.

Fig. 5 .
Fig. 5. Evolutionary origin of human PAS domains.(A) Distribution of human PAS protein homologs across the tree of life.Filled boxes show organisms with PAS proteins homologous to human proteins.Empty boxes show organisms with PAS proteins not homologous to human proteins.(B) Potassium ion channel KCNH (e.g., EAG1, ERG3).KCNH-PAS belongs to cluster 11 (FMN).In gray: the FMN-binding PAS domain of A. thaliana phototropin 1 (PDB: 2Z6C).In red: the mouse EAG1-PAS (PDB: 4HOI) with a putative drug-binding site (in yellow) visualized as the culled cavity and pocket in the structure using a "show surface" function in PyMOL.The drug chlorpromazine is a flavin analog.(C) Phosphodiesterase PDE8 (PDE8A and PDE8B).PDE8-PAS belongs to cluster 4 (FAD).In gray: the FAD-binding PAS domain of NifL (PDB: 2GJ3).In orange: the human PDE8A-PAS with a putative binding site (visualized on the AlphaFold structure using PyMOL).(D) bHLH-PAS transcription factors (e.g., CLOCK, NPAS2, AHR, and HIF2α).PAS A form cluster 87 and PAS B form cluster 72.Both clusters are close to cluster 8 (clusters with average BLAST hit > 0.001 are shown).In green: the ligand-bound PAS B of AHR and HIF2α (PDB: 7ZUB and 7W80).Dashed boxes indicate the H(P)(D/E)D…R motif, which is also conserved in cluster 8 (structure model available at https://github.com/bioliners/PAS). (E) PAS domains of the serine/threonine kinase PASK.PAS A form cluster 172, PAS B form cluster 313, and PAS C does not match to any Pfam HMMs.Both clusters are close to cluster 3 (clusters with average BLAST hit > 0.001 are shown).In gray: the heme-binding PAS domain of FixL (PDB: 1DRM).The blue structure shows the human PASK-PAS A with cavities (visualized in PyMOL; PDB: 1LL8).