1.
Statistical and learning techniques are becoming increasingly popular for different tasks in bioinformatics. Many of the most powerful statistical and learning techniques are applicable to points in a Euclidean space but not directly applicable to discrete sequences such as protein sequences. One way to apply these techniques to protein sequences is to embed the sequences into a Euclidean space and then apply these techniques to the embedded points. In this work we introduce a biologically motivated sequence embedding, the homology kernel, which takes into account intuitions from local alignment, sequence homology, and predicted secondary structure. This embedding allows us to directly apply learning techniques to protein sequences. We apply the homology kernel in several ways. We demonstrate how the homology kernel can be used for protein family classification and outperforms state-of-the-art methods for remote homology detection. We show that the homology kernel can be used for secondary structure prediction and is competitive with popular secondary structure prediction methods. Finally, we show how the homology kernel can be used to incorporate information from homologous sequences in local sequence alignment.
2.
3.
Background
In general, the length of a protein sequence is determined by its function and the wide variance in the lengths of an organism's proteins reflects the diversity of specific functional roles for these proteins. However, additional evolutionary forces that affect the length of a protein may be revealed by studying the length distributions of proteins evolving under weaker functional constraints.
4.
Amino acid background distribution is an important factor for entropy-based methods which extract sequence conservation information from protein multiple sequence alignments (MSAs). However, MSAs are usually not large enough to allow a reliable observed background distribution. In this paper, we propose two new estimations of background distribution. One is an integration of the observed background distribution and the position-specific residue distribution, and the other is a normalized square root of observed background frequency. To validate these new background distributions, they are applied to the relative entropy model to find catalytic sites and ligand binding sites from protein MSAs. Experimental results show that they are superior to the observed background distribution in predicting functionally important residues.
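As a rough illustration of the second estimate described in this abstract, the sketch below takes the square root of each observed background frequency, renormalizes, and plugs the result into a per-column relative entropy score. The abstract gives no formulas, so the pseudocount, the way the observed background is pooled over the whole alignment, and every function name here are assumptions made for illustration only.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def column_frequencies(column, pseudocount=1e-6):
    """Residue frequencies for a string of residues; gaps and unknowns are ignored.
    The pseudocount is an assumption that keeps every frequency strictly positive."""
    counts = Counter(c for c in column if c in AMINO_ACIDS)
    total = sum(counts.values()) + pseudocount * len(AMINO_ACIDS)
    return {a: (counts[a] + pseudocount) / total for a in AMINO_ACIDS}

def observed_background(msa):
    """Observed background: frequencies pooled over every residue in the alignment."""
    return column_frequencies("".join(msa))

def sqrt_background(background):
    """Normalized square root of the observed background frequencies."""
    roots = {a: math.sqrt(p) for a, p in background.items()}
    norm = sum(roots.values())
    return {a: r / norm for a, r in roots.items()}

def relative_entropy(column, background):
    """Relative entropy (in bits) of a column's distribution against the background."""
    p = column_frequencies(column)
    return sum(p[a] * math.log2(p[a] / background[a]) for a in AMINO_ACIDS)

# Toy alignment (hypothetical data): columns 0 and 1 are fully conserved.
msa = ["ACDA", "ACEA", "ACFG", "ACGH"]
bg = sqrt_background(observed_background(msa))
scores = [relative_entropy("".join(seq[i] for seq in msa), bg) for i in range(len(msa[0]))]
```

Taking the square root flattens the background, so rare residues are penalized less harshly than under the raw observed frequencies. That may matter when the alignment is too small to estimate them reliably.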
5.
Background
The sequencing of the human genome has enabled us to access a comprehensive list of genes (both experimental and predicted) for further analysis. While a majority of the approximately 30000 known and predicted human coding genes are characterized and have been assigned at least one function, there remains a fair number of genes (about 12000) for which no annotation has been made. The recent sequencing of other genomes has provided us with a huge amount of auxiliary sequence data which could help in the characterization of the human genes. Clustering these sequences into families is one of the first steps to perform comparative studies across several genomes.
6.
7.
Protein simple sequences, a subset of low-complexity sequences, are regions of sequence highly enriched in one or a few residue types. Simple sequences are exceedingly common, the average being more than one per protein sequence. Despite being so common, such sequences are not well-studied. The simple sequences that have been subjected to detailed study are often found to possess important functions. Here we present a survey of protein simple sequences, generally enriched in a single residue type, with the aim of studying their conservation. We find that the majority of such simple sequences are not conserved. However, conserved protein simple sequences are relatively common, with approximately 11% of the surveyed protein families possessing a conserved simple sequence. The data obtained in this study support the idea that simple sequences are conserved for functional reasons. Such functions can range from substrate binding, to mediating protein-protein interactions, to structural integrity. A perhaps surprising finding is that the residue enriching a conserved simple sequence is itself not necessarily conserved. Neither is the length of many of the highly conserved simple sequences. In the few cases where structural and functional data is available it is found that the conserved simple sequences are consistent with both local structure and function. The data presented support the idea that protein simple sequences can be conserved and have important roles in protein structure and function.
8.
9.
Zakharia M. Frenkel, Zeev M. Frenkel, Edward N. Trifonov, Sagi Snir. Journal of theoretical biology, 2009, 260(3): 438-444
A novel approach for evaluation of sequence relatedness via a network over the sequence space is presented. This relatedness is quantified by graph theoretical techniques. The graph is perceived as a flow network, and flow algorithms are applied. The number of independent pathways between nodes in the network is shown to reflect structural similarity of corresponding protein fragments. These results provide an appropriate parameter for quantitative estimation of such relatedness, as well as reliability of the prediction. They also demonstrate a new potential for sequence analysis and comparison by means of the flow network in the sequence space.
10.
MOTIVATION: Due to the recent advances in technology of mass spectrometry, there has been an exponential increase in the amount of data being generated in the past few years. Database searches have not been able to keep up with this data explosion. Thus, speeding up the data searches becomes increasingly important in mass-spectrometry-based applications. Traditional database search methods use one-against-all comparisons of a query spectrum against a very large number of peptides generated from in silico digestion of protein sequences in a database, to filter potential candidates from this database followed by a detailed scoring and ranking of those filtered candidates. RESULTS: In this article, we show that we can avoid the one-against-all comparisons. The basic idea is to design a set of hash functions to pre-process peptides in the database such that for each query spectrum we can use the hash functions to find only a small subset of peptide sequences that are most likely to match the spectrum. The construction of each hash function is based on a random spectrum and the hash value of a peptide is the normalized shared peak counts score (cosine) between the random spectrum and the hypothetical spectrum of the peptide. To implement this idea, we first embed each peptide into a unit vector in a high-dimensional metric space. The random spectrum is represented by a random vector, and we use random vectors to construct a set of hash functions called locality sensitive hashing (LSH) for preprocessing. We demonstrate that our mapping is accurate. We show that our method can filter out >95.65% of the spectra without missing any correct sequences, or gain 111 times speedup by filtering out 99.64% of spectra while missing at most 0.19% (2 out of 1014) of the correct sequences. In addition, we show that our method can be effectively used for other mass spectra mining applications such as finding clusters of spectra efficiently and accurately.
SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
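The filtering idea in this abstract can be sketched with the textbook random-hyperplane form of LSH for cosine similarity. This is a stand-in, not the authors' construction: the abstract defines each hash value as a cosine score against a random spectrum, whereas the sketch keeps only the sign of the dot product with each random vector. The binned peptide vectors, the bucket layout, and all names below are hypothetical.

```python
import random

def random_hyperplanes(num_bits, dim, seed=0):
    """Draw num_bits random Gaussian vectors; each one defines one hash bit."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(num_bits)]

def signature(vector, planes):
    """One bit per hyperplane: which side of the plane the vector falls on.
    Vectors separated by a small angle agree on most bits."""
    return tuple(
        1 if sum(v * w for v, w in zip(vector, plane)) >= 0 else 0
        for plane in planes
    )

def build_index(vectors, planes):
    """Pre-process the database once: bucket every peptide by its signature."""
    index = {}
    for key, vec in vectors.items():
        index.setdefault(signature(vec, planes), []).append(key)
    return index

def candidates(query, index, planes):
    """Return only the peptides sharing the query's bucket (no one-against-all scan)."""
    return index.get(signature(query, planes), [])

# Hypothetical binned theoretical spectra for three peptides.
peptides = {
    "pepA": [1, 0, 1, 0, 1, 0, 0, 1],
    "pepB": [0, 1, 0, 1, 0, 1, 1, 0],
    "pepC": [1, 0, 1, 0, 1, 0, 1, 1],
}
planes = random_hyperplanes(num_bits=4, dim=8, seed=42)
index = build_index(peptides, planes)
hits = candidates(peptides["pepA"], index, planes)
```

In practice several independent hash tables are usually combined so that a true match missed by one table can still be found in another. The number of bits per table trades filtering power against the miss rate.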
11.
12.
Phylogeny, sequence conservation, and functional complementation of the SBDS protein family (cited by: 5; self-citations: 0; cited by others: 5)
The Shwachman-Bodian-Diamond syndrome (SBDS) protein family occurs widely in nature, although its function has not been determined. Comprehensive database searches revealed SBDS homologues from 159 species, including examples from all sequenced archaeal and eukaryotic genomes and all eukaryotic kingdoms. Sequence alignment with ClustalX and MUSCLE algorithms led to the identification of conserved residues that occurred predominantly in the amino-terminal FYSH domain where they appeared to contribute to protein folding or stability. Only SBDS residue Gly91 was invariant in all species. Four distantly related protists were found to have two divergent SBDS genes in their genomes. In each case, phylogenetic analyses and the identification of shared sequence features suggested that one gene was derived from lateral gene transfer. We also identified a shared C-terminal zinc finger domain fusion in flowering plants and chromalveolates that may shed light on the function of the protein family and the evolutionary histories of these kingdoms. To assess the extent of SBDS functional conservation, we carried out complementation studies of SBDS homologues and interspecies chimeras in Saccharomyces cerevisiae. We determined that the FYSH domain was widely interchangeable among eukaryotes, while domain 2 imparted species specificity to protein function. Domain 3 was largely dispensable for function in our yeast complementation assay. Overall, the phylogeny of SBDS was shared with a group of proteins that were markedly enriched for RNA metabolism and/or ribosome-associated functions. These findings link Shwachman-Diamond syndrome to other bone marrow failure syndromes with defects in nucleolus-associated processes, including Diamond-Blackfan anemia, cartilage-hair hypoplasia, and dyskeratosis congenita.
13.
MOTIVATION: Amino acid sequence alignments are widely used in the analysis of protein structure, function and evolutionary relationships. Proteins within a superfamily usually share the same fold and possess related functions. These structural and functional constraints are reflected in the alignment conservation patterns. Positions of functional and/or structural importance tend to be more conserved. Conserved positions are usually clustered in distinct motifs surrounded by sequence segments of low conservation. Poorly conserved regions might also arise from the imperfections in multiple alignment algorithms and thus indicate possible alignment errors. Quantification of conservation by attributing a conservation index to each aligned position makes motif detection more convenient. Mapping these conservation indices onto a protein spatial structure helps to visualize spatial conservation features of the molecule and to predict functionally and/or structurally important sites. Analysis of conservation indices could be a useful tool in detection of potentially misaligned regions and will aid in improvement of multiple alignments. RESULTS: We developed a program to calculate a conservation index at each position in a multiple sequence alignment using several methods. Namely, amino acid frequencies at each position are estimated and the conservation index is calculated from these frequencies. We utilize both unweighted frequencies and frequencies weighted using two different strategies. Three conceptually different approaches (entropy-based, variance-based and matrix score-based) are implemented in the algorithm to define the conservation index. Calculating conservation indices for 35522 positions in 284 alignments from SMART database we demonstrate that different methods result in highly correlated (correlation coefficient more than 0.85) conservation indices. 
Conservation indices show statistically significant correlation between sequentially adjacent positions i and i + j, where j < 13, and averaging of the indices over the window of three positions is optimal for motif detection. Positions with gaps display substantially lower conservation properties. We compare conservation properties of the SMART alignments or FSSP structural alignments to those of the ClustalW alignments. The results suggest that conservation indices should be a valuable tool of alignment quality assessment and might be used as an objective function for refinement of multiple alignments. AVAILABILITY: The C code of the AL2CO program and its pre-compiled versions for several platforms as well as the details of the analysis are freely available at ftp://iole.swmed.edu/pub/al2co/.
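A minimal sketch of the entropy-based variant mentioned in this abstract, followed by averaging over a window of three positions, might look like the following. It uses unweighted frequencies only and an assumed normalization by log 20. The weighting schemes, gap handling, and exact index definitions of AL2CO are not reproduced, and the function names are illustrative.

```python
import math
from collections import Counter

def entropy_conservation(column):
    """Conservation index in [0, 1] from unweighted residue frequencies:
    1 for an invariant column, lower as the column becomes more variable.
    Gaps are skipped here, which is an assumption of this sketch."""
    residues = [c for c in column if c != "-"]
    if not residues:
        return 0.0
    counts = Counter(residues)
    n = len(residues)
    entropy = -sum((k / n) * math.log(k / n) for k in counts.values())
    return 1.0 - entropy / math.log(20)

def smooth(values, window=3):
    """Average each index with its neighbours; edge windows are truncated."""
    half = window // 2
    out = []
    for i in range(len(values)):
        lo, hi = max(0, i - half), min(len(values), i + half + 1)
        out.append(sum(values[lo:hi]) / (hi - lo))
    return out

# Toy alignment (hypothetical data); '-' marks a gap.
alignment = ["MKVLA", "MKILG", "MRVLS", "MK-LT"]
columns = ["".join(seq[i] for seq in alignment) for i in range(len(alignment[0]))]
raw = [entropy_conservation(col) for col in columns]
smoothed = smooth(raw, window=3)
```

Smoothing makes runs of conserved positions stand out as motifs, since an isolated conserved column surrounded by variable ones is pulled down by its neighbours.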
14.
Prediction of protein-protein interactions by combining structure and sequence conservation in protein interfaces (cited by: 4; self-citations: 0; cited by others: 4)
MOTIVATION: Elucidation of the full network of protein-protein interactions is crucial for understanding of the principles of biological systems and processes. Thus, there is a need for in silico methods for predicting interactions. We present a novel algorithm for automated prediction of protein-protein interactions that employs a unique bottom-up approach combining structure and sequence conservation in protein interfaces. RESULTS: Running the algorithm on a template dataset of 67 interfaces and a sequentially non-redundant dataset of 6170 protein structures, 62 616 potential interactions are predicted. These interactions are compared with the ones in two publicly available interaction databases (Database of Interacting Proteins and Biomolecular Interaction Network Database) and also the Protein Data Bank. A significant number of predictions are verified in these databases. The unverified ones may correspond to (1) interactions that are not covered in these databases but known in literature, (2) unknown interactions that actually occur in nature and (3) interactions that do not occur naturally but may possibly be realized synthetically in laboratory conditions. Some unverified interactions, supported significantly with studies found in the literature, are discussed. AVAILABILITY: http://gordion.hpc.eng.ku.edu.tr/prism CONTACT: agursoy@ku.edu.tr; okeskin@ku.edu.tr.
15.
One of the major goals of comparative genomics is to understand the evolutionary history of each nucleotide in the human genome sequence, and the degree to which it is under selective pressure. Ascertainment of selective constraint at nucleotide resolution is particularly important for predicting the functional significance of human genetic variation and for analyzing the sequence substructure of cis-regulatory sequences and other functional elements. Current methods for analysis of sequence conservation are focused on delineation of conserved regions comprising tens or even hundreds of consecutive nucleotides. We therefore developed a novel computational approach designed specifically for scoring evolutionary conservation at individual base-pair resolution. Our approach estimates the rate at which each nucleotide position is evolving, computes the probability of neutrality given this rate estimate, and summarizes the result in a Sequence CONservation Evaluation (SCONE) score. We computed SCONE scores in a continuous fashion across 1% of the human genome for which high-quality sequence information from up to 23 genomes are available. We show that SCONE scores are clearly correlated with the allele frequency of human polymorphisms in both coding and noncoding regions. We find that the majority of noncoding conserved nucleotides lie outside of longer conserved elements predicted by other conservation analyses, and are experiencing ongoing selection in modern humans as evident from the allele frequency spectrum of human polymorphism. We also applied SCONE to analyze the distribution of conserved nucleotides within functional regions. These regions are markedly enriched in individually conserved positions and short (<15 bp) conserved “chunks.” Our results collectively suggest that the majority of functionally important noncoding conserved positions are highly fragmented and reside outside of canonically defined long conserved noncoding sequences. 
A small subset of these fragmented positions may be identified with high confidence.
16.
M J Sternberg. Protein engineering, 1990, 4(1): 45-47
The extent of inter-species sequence identity in single-spanning transmembrane regions of integral membrane proteins was evaluated. The sequences of the 32 human transmembrane regions were compared with the respective rodent homologues. The identity between homologous transmembrane regions ranged from 32 to 100%, compared with a mean value of 14% identity between unrelated transmembrane sections. On average the identity between homologous transmembrane regions is slightly higher than for the rest of the chain. These values suggest that, in general, there are structural and/or functional constraints on the transmembrane regions beyond the simple requirement to act as a passive, nonpolar, connecting region across the cell membrane. Although there is limited experimental evidence available, the three transmembrane regions (CD2 antigen, MHC class I and ICAM-1) with particularly low values of inter-species identity (less than 50%) are probably not involved in an interaction with another transmembrane section in the same cell.
17.
MOTIVATION: All residues in a protein are not equally important. Some are essential for the proper structure and function of the protein, whereas others can be readily replaced. Conservation analysis is one of the most widely used methods for predicting these functionally important residues in protein sequences. RESULTS: We introduce an information-theoretic approach for estimating sequence conservation based on Jensen-Shannon divergence. We also develop a general heuristic that considers the estimated conservation of sequentially neighboring sites. In large-scale testing, we demonstrate that our combined approach outperforms previous conservation-based measures in identifying functionally important residues; in particular, it is significantly better than the commonly used Shannon entropy measure. We find that considering conservation at sequential neighbors improves the performance of all methods tested. Our analysis also reveals that many existing methods that attempt to incorporate the relationships between amino acids do not lead to better identification of functionally important sites. Finally, we find that while conservation is highly predictive in identifying catalytic sites and residues near bound ligands, it is much less effective in identifying residues in protein-protein interfaces. AVAILABILITY: Data sets and code for all conservation measures evaluated are available at http://compbio.cs.princeton.edu/conservation/
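The two ingredients named in this abstract, a Jensen-Shannon divergence score per column and a heuristic that mixes in neighboring sites, can be sketched roughly as below. The uniform background, the pseudocount, the window radius, and the 0.5 mixing weight are all assumptions chosen for illustration; the published code at the URL above should be consulted for the authors' actual choices.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def distribution(column, pseudocount=1e-6):
    """Residue distribution of one alignment column; gaps are ignored."""
    counts = Counter(c for c in column if c in AMINO_ACIDS)
    total = sum(counts.values()) + pseudocount * len(AMINO_ACIDS)
    return {a: (counts[a] + pseudocount) / total for a in AMINO_ACIDS}

def kl(p, q):
    """Kullback-Leibler divergence in bits."""
    return sum(p[a] * math.log2(p[a] / q[a]) for a in AMINO_ACIDS if p[a] > 0)

def js_divergence(p, q):
    """Jensen-Shannon divergence: symmetric, and bounded by 1 when using log base 2."""
    m = {a: 0.5 * (p[a] + q[a]) for a in AMINO_ACIDS}
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def window_scores(scores, radius=1, weight=0.5):
    """Mix each site's score with the mean score of its sequential neighbours."""
    out = []
    for i, s in enumerate(scores):
        lo, hi = max(0, i - radius), min(len(scores), i + radius + 1)
        neighbours = [scores[j] for j in range(lo, hi) if j != i]
        if neighbours:
            out.append(weight * s + (1 - weight) * sum(neighbours) / len(neighbours))
        else:
            out.append(s)
    return out

# Assumed uniform background; a real analysis would use an empirical one.
background = {a: 1.0 / len(AMINO_ACIDS) for a in AMINO_ACIDS}
alignment = ["MKVLA", "MKILG", "MRVLS", "MK-LT"]  # hypothetical toy data
columns = ["".join(seq[i] for seq in alignment) for i in range(len(alignment[0]))]
raw = [js_divergence(distribution(col), background) for col in columns]
windowed = window_scores(raw, radius=1, weight=0.5)
```

Unlike plain Shannon entropy, the divergence is measured against a background, so a column dominated by a rare residue can score higher than one dominated by a common residue once an empirical background replaces the uniform one used here.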
18.
BBA, 2023, 1864(2): 148958
Pyruvate:quinone oxidoreductases (PQOs) catalyse the oxidative decarboxylation of pyruvate to acetate and concomitant reduction of quinone to quinol with the release of CO2. They are thiamine pyrophosphate (TPP) and flavin-adenine dinucleotide (FAD) containing enzymes, which interact with the membrane in a monotopic way. PQOs are considered as part of alternatives to most recognized pyruvate catabolizing pathways, and little is known about their taxonomic distribution and structural/functional relationship. In this bioinformatics work we tackled these gaps in PQO knowledge. We used the KEGG database to identify PQO coding genes, performed a multiple sequence analysis which allowed us to study the amino acid conservation on these enzymes, and looked at their possible cellular function. We observed that PQOs are enzymes exclusively present in prokaryotes with most of the sequences identified in bacteria. Regarding the amino acid sequence conservation, we found that 75 amino acid residues (out of 570, on average) have a conservation over 90%, and that the most conserved regions in the protein are observed around the TPP and FAD binding sites. We systematized the presence of conserved features involved in Mg2+, TPP and FAD binding, as well as residues directly linked to the catalytic mechanism. We also established the presence of a new motif named “HEH lock”, possibly involved in the dimerization process. The results here obtained for the PQO protein family contribute to a better understanding of the biochemistry of these respiratory enzymes.
19.
Prediction of transmembrane (TM) segments of amino acid sequences of membrane proteins is a well-known and very important problem. The accuracy of its solution can be improved for approaches that do not use a homology search in an additional data bank. There is a lack of tested data in this area of research, because information on the structure of membrane proteins is scarce. In this work we created a test sample of structural alignments for membrane proteins. The TM segments of these proteins were mapped according to aligned 3D structures resolved for these proteins. A method for predicting TM segments in an alignment was developed on the basis of the forward-backward algorithm from the HMM theory. This method allows a user not only to predict TM segments, but also to create a probabilistic membrane profile, which can be employed in multiple alignment procedures taking the secondary structure of proteins into account. The method was implemented in a computer program available at http://bioinf.fbb.msu.ru/fwdbck/. It provides better results than the MEMSAT method, which is nearly the only tool predicting TM segments in multiple alignments, without a homology search.
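To show how the forward-backward algorithm yields a per-position "probabilistic membrane profile", here is a minimal sketch for a two-state HMM (membrane versus non-membrane) run on a single sequence. The abstract describes a method that works on alignments and does not give its model, so the two-state topology, the hydrophobic/polar alphabet, and every probability below are invented for illustration.

```python
def forward_backward(observations, states, start, trans, emit):
    """Posterior probability of each hidden state at each position (scaled version)."""
    n = len(observations)
    fwd, scale = [], []
    for t, obs in enumerate(observations):
        row = {}
        for s in states:
            prior = start[s] if t == 0 else sum(fwd[t - 1][r] * trans[r][s] for r in states)
            row[s] = prior * emit[s][obs]
        z = sum(row.values())  # scaling factor prevents underflow on long sequences
        fwd.append({s: row[s] / z for s in states})
        scale.append(z)
    bwd = [None] * n
    bwd[n - 1] = {s: 1.0 for s in states}
    for t in range(n - 2, -1, -1):
        bwd[t] = {
            s: sum(trans[s][r] * emit[r][observations[t + 1]] * bwd[t + 1][r] for r in states)
            / scale[t + 1]
            for s in states
        }
    posterior = []
    for t in range(n):
        raw = {s: fwd[t][s] * bwd[t][s] for s in states}
        z = sum(raw.values())
        posterior.append({s: raw[s] / z for s in states})
    return posterior

# Hypothetical model: "M" = membrane segment, "O" = outside the membrane.
HYDROPHOBIC = set("AILMFVWC")  # assumed residue classes, not from the source
states = ("M", "O")
start = {"M": 0.5, "O": 0.5}
trans = {"M": {"M": 0.9, "O": 0.1}, "O": {"M": 0.1, "O": 0.9}}
emit = {"M": {"h": 0.8, "p": 0.2}, "O": {"h": 0.3, "p": 0.7}}

sequence = "KKDEKKLLLLLLLLLLKKDE"  # toy sequence with a hydrophobic stretch
observations = ["h" if aa in HYDROPHOBIC else "p" for aa in sequence]
posterior = forward_backward(observations, states, start, trans, emit)
membrane_profile = [p["M"] for p in posterior]
```

The profile is a probability per position, not a hard segmentation. TM segments could then be called by thresholding it, and the raw probabilities could serve as position-specific weights in an alignment procedure.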
20.
We have recently shown that the weighted contact number profiles (or the packing density profiles) of proteins are well correlated with those of the corresponding sequence conservation profiles. The results suggest that a protein structure may contain sufficient information about sequence conservation comparable to that derived from multiple homologous sequences. However, there are ambiguities concerning how to compute the packing density of the subunit of a protein complex. For a subunit of a complex, there are two ways to compute its packing density: one including the packing contributions of the other subunits and the other excluding their contributions. Here we selected two sets of enzyme complexes. Set A contains complexes with the active sites comprising residues from multiple subunits, while set B contains those with the active sites residing on single subunits. In Set A, if the packing density profile of a subunit is computed considering the contributions of the other subunits of the complex, it will agree better with the sequence conservation profile. But in Set B the situation is reversed. The results may be due to the stronger functional and structural constraints on the evolution processes on the complexes of Set A than those of Set B to maintain the enzymatic functions of the complexes. The comparison of the packing density and the sequence conservation profiles may provide a simple yet potentially useful way to understand the structural and evolutionary couplings between the subunits of protein complexes. Proteins 2013; 81:1192–1199. © 2013 Wiley Periodicals, Inc.