Found 20 similar documents (search time: 0 ms)
1.
Eleazar Eskin William Stafford Noble Yoram Singer 《Journal of computational biology》2003,10(2):187-213
We present a method for classifying proteins into families based on short subsequences of amino acids using a new probabilistic model called sparse Markov transducers (SMT). We classify a protein by estimating probability distributions over subsequences of amino acids from the protein. Sparse Markov transducers, similar to probabilistic suffix trees, estimate a probability distribution conditioned on an input sequence. SMTs generalize probabilistic suffix trees by allowing for wild-cards in the conditioning sequences. Since substitutions of amino acids are common in protein families, incorporating wild-cards into the model significantly improves classification performance. We present two models for building protein family classifiers using SMTs. As protein databases become larger, data driven learning algorithms for probabilistic models such as SMTs will require vast amounts of memory. We therefore describe and use efficient data structures to improve the memory usage of SMTs. We evaluate SMTs by building protein family classifiers using the Pfam and SCOP databases and compare our results to previously published results and state-of-the-art protein homology detection methods. SMTs outperform previous probabilistic suffix tree methods and under certain conditions perform comparably to state-of-the-art protein homology methods.
2.
Taylor WR 《Protein science : a publication of the Protein Society》1999,8(3):654-665
A protein structure comparison method is described that allows the generation of large populations of high-scoring alternate alignments. This was achieved by incorporating a random element into an iterative double dynamic programming algorithm. The maximum scores from repeated comparisons of a pair of structures converged on a value that was taken as the global maximum. This lay 15% over the score obtained from the single fixed (unrandomized) calculation. The effect of the gap penalty was observed through the shift of the alignment populations, characterized by their alignment length and root-mean-square deviation (RMSD). The best (lowest RMSD) values found in these populations provided a base-line against which other methods were compared.
3.
Protein engineers can alter the properties of enzymes by directing their evolution in vitro. Many methods to generate molecular diversity and to identify improved clones have been developed, but experimental evolution remains as much an art as a science. We previously used DNA shuffling (sexual recombination) and a histochemical screen to direct the evolution of Escherichia coli beta-glucuronidase (GUS) variants with improved beta-galactosidase (BGAL) activity. Here, we employ the same model evolutionary system to test the efficiencies of several other techniques: recursive random mutagenesis (asexual), combinatorial cassette mutagenesis (high-frequency recombination) and a versatile high-throughput microplate screen. GUS variants with altered specificity evolved in each trial, but different combinations of mutagenesis and screening techniques effected the fixation of different beneficial mutations. The new microplate screen identified a broader set of mutations than the previously employed X-gal colony screen. Recursive random mutagenesis produced essentially asexual populations, within which beneficial mutations drove each other into extinction (clonal interference); DNA shuffling and combinatorial cassette mutagenesis led instead to the accumulation of beneficial mutations within a single allele. These results explain why recombinational approaches generally increase the efficiency of laboratory evolution.
4.
Because membrane proteins play a key role in drug targeting, transmembrane protein prediction is an active and challenging area of the biological sciences. Location-based prediction of transmembrane proteins is significant for the functional annotation of protein sequences. Hidden Markov model based methods have been widely applied to transmembrane topology prediction. Here we present a revised and more understandable model than the existing one for transmembrane protein prediction. A MATLAB script was written and compiled to estimate the model parameters, and the model was applied to amino acid sequences to identify transmembrane regions and their adjacent locations. The estimated model of transmembrane topology was based on the TMHMM model architecture. Only 7 super states are defined in the given dataset, and these were converted to 96 states on the basis of their length in the sequence. The prediction accuracy of the model was about 74%, which is good enough in the area of transmembrane topology prediction. We therefore conclude that the hidden Markov model plays a crucial role in transmembrane helix prediction on the MATLAB platform and could also be useful for drug discovery strategy. AVAILABILITY: The database is available for free at bioinfonavneet@gmail.com, vinaysingh@bhu.ac.in.
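The hidden Markov model decoding that this abstract relies on can be illustrated with a minimal Viterbi sketch. It is not the authors' MATLAB model or the TMHMM architecture: the two states (`M` for membrane, `O` for non-membrane), the two-letter alphabet (`H` hydrophobic, `P` polar) and all probabilities below are hypothetical values chosen only for illustration.

```python
import math

def viterbi(obs, states, start_p, trans_p, emit_p):
    """Return the most probable state path for obs as a string (log-space Viterbi)."""
    # best[t][s] = log-probability of the best path that ends in state s at time t
    best = [{s: math.log(start_p[s]) + math.log(emit_p[s][obs[0]]) for s in states}]
    back = [{}]  # back[t][s] = predecessor of s on that best path
    for t in range(1, len(obs)):
        best.append({})
        back.append({})
        for s in states:
            prev, score = max(
                ((r, best[t - 1][r] + math.log(trans_p[r][s])) for r in states),
                key=lambda pair: pair[1],
            )
            best[t][s] = score + math.log(emit_p[s][obs[t]])
            back[t][s] = prev
    # Trace back from the best final state
    path = [max(states, key=lambda s: best[-1][s])]
    for t in range(len(obs) - 1, 0, -1):
        path.append(back[t][path[-1]])
    return "".join(reversed(path))

# Hypothetical toy parameters (not taken from the paper)
states = ["M", "O"]
start_p = {"M": 0.5, "O": 0.5}
trans_p = {"M": {"M": 0.9, "O": 0.1}, "O": {"M": 0.1, "O": 0.9}}
emit_p = {"M": {"H": 0.8, "P": 0.2}, "O": {"H": 0.2, "P": 0.8}}

decoded = viterbi("PPHHHHPP", states, start_p, trans_p, emit_p)
```

With these toy parameters the hydrophobic run in the middle of the sequence is labelled as membrane. A realistic topology model would use many more states, as the 96 states mentioned in the abstract suggest.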
5.
Taylor WR 《Molecular & cellular proteomics : MCP》2002,1(4):334-339
A measure of protein structure similarity is calculated from the matching of pairs of secondary structure elements between two proteins. The interaction of each pair was estimated from their axial line segments and combined with other geometric features to produce an optimal discrimination between intrafamily and interfamily relationships. The matching used a fast bipartite graph-matching algorithm that avoids the computational complexity of searching for the full subgraph isomorphism between the two sets of interactions. The main algorithm used was the "stable marriage" algorithm, which works on the ranked "preferences" of one interaction for another. The method takes 1/10 of a second for a typical comparison making it suitable as a fast pre-filter for slower, more exhaustive approaches. An application to protein structure classification is described.
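The "stable marriage" matching named in this abstract can be sketched with the classic Gale-Shapley procedure. This is an illustration rather than the author's implementation: the item names (`a1`, `b1`, and so on) and the preference lists are hypothetical, and the sketch assumes both sides have the same number of items with complete rankings.

```python
from collections import deque

def stable_marriage(proposer_prefs, receiver_prefs):
    """Gale-Shapley: return a stable matching as {proposer: receiver}.

    Each argument maps an item to the full list of items on the other side,
    most preferred first.
    """
    # rank[r][p] = position of proposer p in receiver r's list (lower is better)
    rank = {r: {p: i for i, p in enumerate(prefs)} for r, prefs in receiver_prefs.items()}
    next_choice = {p: 0 for p in proposer_prefs}  # next list index each proposer tries
    engaged_to = {}                               # receiver -> current proposer
    free = deque(proposer_prefs)

    while free:
        p = free.popleft()
        r = proposer_prefs[p][next_choice[p]]
        next_choice[p] += 1
        current = engaged_to.get(r)
        if current is None:
            engaged_to[r] = p                     # receiver was unmatched
        elif rank[r][p] < rank[r][current]:
            engaged_to[r] = p                     # receiver prefers the new proposer
            free.append(current)
        else:
            free.append(p)                        # rejected; p tries its next choice later
    return {p: r for r, p in engaged_to.items()}

# Hypothetical ranked preferences between two sets of interactions
prefs_a = {"a1": ["b1", "b2"], "a2": ["b1", "b2"]}
prefs_b = {"b1": ["a2", "a1"], "b2": ["a1", "a2"]}
matching = stable_marriage(prefs_a, prefs_b)
```

Each proposer makes at most one proposal per receiver, so the loop runs in time quadratic in the number of items, which is consistent with the use as a fast pre-filter described above.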
6.
7.
Protein structure and neutral theory of evolution (total citations: 2; self-citations: 0; citations by others: 2)
The neutral theory of evolution is extended to the origin of protein molecules. Arguments are presented which suggest that the amino acid sequences of many globular proteins mainly represent "memorized" random sequences, while biological evolution reduces to the "editing" of these random sequences. Physical requirements for a functional globular protein are formulated, and it is shown that many of these requirements do not involve strategic selection of amino acid sequences during biological evolution but are also inherent in typical random sequences. In particular, it is shown that random sequences of polar and amino acid residues can form alpha-helices and beta-strands with lengths and arrangement along the chain similar to those in real globular proteins. These alpha- and beta-regions in random sequences can form three-dimensional folding patterns that are also similar to those in proteins. Arguments are presented suggesting that even the tight packing of side groups inside the protein core does not require very strong biological selection of amino acid sequences either. Thus many structural features of real proteins can also exist in random sequences, and biological selection is needed mainly for the creation of the active sites of proteins and for their stability under physiological conditions.
8.
Motivation: Most genome-wide association studies rely on single nucleotide polymorphism (SNP) analyses to identify causal loci. The increased stringency required for genome-wide analyses (with per-SNP significance threshold typically 10^-7) means that many real signals will be missed. Thus it is still highly relevant to develop methods with improved power at low type I error. Haplotype-based methods provide a promising approach; however, they suffer from statistical problems such as abundance of rare haplotypes and ambiguity in defining haplotype block boundaries. Results: We have developed an ancestral haplotype clustering (AncesHC) association method which addresses many of these problems. It can be applied to biallelic or multiallelic markers typed in haploid, diploid or multiploid organisms, and also handles missing genotypes. Our model is free from the assumption of a rigid block structure but recognizes a block-like structure if it exists in the data. We employ a Hidden Markov Model (HMM) to cluster the haplotypes into groups of predicted common ancestral origin. We then test each cluster for association with disease by comparing the numbers of cases and controls with 0, 1 and 2 chromosomes in the cluster. We demonstrate the power of this approach by simulation of case-control status under a range of disease models for 1500 outcrossed mice originating from eight inbred lines. Our results suggest that AncesHC has substantially more power than single-SNP analyses to detect disease association, and is also more powerful than the cladistic haplotype clustering method CLADHC. Availability: The software can be downloaded from http://www.imperial.ac.uk/medicine/people/l.coin Contact: I.coin{at}imperial.ac.uk Supplementary Information: Supplementary data are available at Bioinformatics online.
Associate Editor: Martin Bishop
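The per-cluster comparison of cases and controls carrying 0, 1 or 2 chromosomes in a cluster can be illustrated with a Pearson chi-square statistic on a 2x3 table. The abstract does not say which test AncesHC uses, so the choice of a Pearson test is an assumption, and the counts below are hypothetical.

```python
def pearson_chi_square(table):
    """Return (statistic, degrees of freedom) for a contingency table given as rows."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count under independence of disease status and cluster membership
            expected = row_totals[i] * col_totals[j] / total
            stat += (observed - expected) ** 2 / expected
    dof = (len(table) - 1) * (len(col_totals) - 1)
    return stat, dof

# Hypothetical counts: columns are 0, 1 and 2 chromosomes in the cluster
counts = [
    [50, 30, 20],  # cases
    [70, 25, 5],   # controls
]
chi2, dof = pearson_chi_square(counts)
```

The statistic would then be compared against a chi-square distribution with `dof` degrees of freedom, at a threshold that accounts for the number of clusters tested.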
9.
MOTIVATION: Protein structure comparison (PSC) has been used widely in studies of structural and functional genomics. However, PSC is computationally expensive and as a result almost all of the PSC methods currently in use look only for the optimal alignment and ignore many alternative alignments that are statistically significant and that may provide insight into protein evolution or folding. RESULTS: We have developed a new PSC method with efficiency to detect potentially viable alternative alignments in all-against-all database comparisons. The efficiency of the new PSC method derives from the ability to directly home in on a limited number of viable and ranked alignment solutions based on intuitively derived SSE (secondary structure element)-matching probabilities.
10.
11.
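The modified inhomogeneous model is a Markov model for protein-coding regions, proposed on the basis of the homogeneous and inhomogeneous models. It can be used to analyze species evolution and gene mutation: the Markov degree in the model is associated with the level of sequence evolution, and the transition matrix is associated with gene mutation. By comparing the content of degree -1 Markov chains across 7 classes of species, this paper verifies the features of species evolution that are reflected in codon usage. By computing the transition matrices between codon positions, it analyzes the likelihood of gene mutation occurring at the different codon positions.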
12.
Background
For the purposes of finding and aligning noncoding RNA gene- and cis-regulatory elements in multiple-genome datasets, it is useful to be able to derive multi-sequence stochastic grammars (and hence multiple alignment algorithms) systematically, starting from hypotheses about the various kinds of random mutation event and their rates.
13.
A new protein structure alignment procedure is described. An initial alignment is made by comparing a one-dimensional list of primary, secondary and tertiary structural features (profiles) of two proteins, without explicitly considering the three-dimensional geometry of the structures. The alignment is then iteratively refined in the second step, in which new alignments are found by three-dimensional superposition of the structures based on the current alignment. This new procedure is fast enough to do all-against-all structural comparisons routinely. The procedure sometimes finds an alignment that suggests an evolutionary relationship and which is not normally obtained if only geometry is considered. All pair-wise comparisons were made among 3539 protein structural domains that represent all known protein structures. The resulting 3539 z-scores were used to cluster the proteins. The number of main clusters increased continuously as the z-cutoff was raised, but the number of multiple-member clusters showed a maximum at z-cutoff values of 5.0 and 5.5. When a z-cutoff value of 5.0 was used, the total number of main clusters was 2043, of which only 336 clusters had more than one member.
14.
We test models for the evolution of helical regions of RNA sequences, where the base pairing constraint leads to correlated compensatory substitutions occurring on either side of the pair. These models are of three types: 6-state models include only the four Watson-Crick pairs plus GU and UG; 7-state models include a single mismatch state that combines all of the 10 possible mismatches; 16-state models treat all mismatch states separately. We analyzed a set of eubacterial ribosomal RNA sequences with a well-established phylogenetic tree structure. For each model, the maximum-likelihood values of the parameters were obtained. The models were compared using the Akaike information criterion, the likelihood-ratio test, and Cox's test. With a high significance level, models that permit a nonzero rate of double substitutions performed better than those that assume zero double substitution rate. Some models assume symmetry between GC and CG, between AU and UA, and between GU and UG. Models that relaxed this symmetry assumption performed slightly better, but the tests did not all agree on the significance level. The most general time-reversible model significantly outperformed any of the simplifications. We consider the relative merits of all these models for molecular phylogenetics.
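Two of the comparison criteria named in this abstract have simple closed forms: the Akaike information criterion, AIC = 2k - 2 ln L, and the likelihood-ratio statistic for nested models, 2(ln L_alt - ln L_null). The sketch below applies them to hypothetical values; the model names, log-likelihoods and parameter counts are made up for illustration and are not results from the paper.

```python
def aic(log_likelihood, n_params):
    """Akaike information criterion: lower values indicate a preferred model."""
    return 2 * n_params - 2 * log_likelihood

def likelihood_ratio_statistic(lnl_null, lnl_alt):
    """Test statistic for nested models; the alternative must contain the null."""
    return 2 * (lnl_alt - lnl_null)

# Hypothetical (maximized log-likelihood, number of free parameters) per model
models = {
    "restricted": (-10250.0, 5),
    "general": (-10200.0, 9),
}
scores = {name: aic(lnl, k) for name, (lnl, k) in models.items()}
best = min(scores, key=scores.get)
lrt = likelihood_ratio_statistic(models["restricted"][0], models["general"][0])
```

In this toy case the general model wins on AIC despite its four extra parameters, and `lrt` would be compared against a chi-square distribution with four degrees of freedom.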
15.
Protein secondary structure prediction using three neural networks and a segmental semi Markov model
Prediction of protein secondary structure is an important step towards elucidating its three dimensional structure and its function. This is a challenging problem in bioinformatics. Segmental semi Markov models (SSMMs) are one of the best studied methods in this field. However, incorporating evolutionary information into these methods is somewhat difficult. On the other hand, systems of multiple neural networks (NNs) are powerful tools for multi-class pattern classification which can easily be applied to take these sorts of information into account. To overcome the weakness of SSMMs in prediction, in this work we consider an SSMM as a decision function on the outputs of three NNs that use multiple sequence alignment profiles. We consider four types of observations for the outputs of a neural network. The profile table related to each sequence is then reduced to a sequence of four observations. In order to predict the secondary structure of each amino acid we need to consider a decision function. We use an SSMM on the outputs of three neural networks. The proposed SSMM has discriminative power and weights over different dependency models for the outputs of the neural networks. The results show that the accuracy of our model in predictions, particularly for strands, is considerably increased.
16.
Torrance GM Gilbert DR Michalopoulos I Westhead DW 《Bioinformatics (Oxford, England)》2005,21(10):2537-2538
We describe a fold level fast protein comparison and motif matching facility based on the TOPS representation of structure. This provides an update to a previous service at the EBI, with a better graph matching with faster results and visualization of both the structures being compared against and the common pattern of each with the target domain. AVAILABILITY: Web service at http://balabio.dcs.gla.ac.uk/tops or via the main TOPS site at http://www.tops.leeds.ac.uk. Software is also available for download from these sites.
17.
Protein structure alignment using a genetic algorithm (total citations: 3; self-citations: 0; citations by others: 3)
We have developed a novel, fully automatic method for aligning the three-dimensional structures of two proteins. The basic approach is to first align the proteins' secondary structure elements and then extend the alignment to include any equivalent residues found in loops or turns. The initial secondary structure element alignment is determined by a genetic algorithm. After refinement of the secondary structure element alignment, the protein backbones are superposed and a search is performed to identify any additional equivalent residues in a convergent process. Alignments are evaluated using intramolecular distance matrices. Alignments can be performed with or without sequential connectivity constraints. We have applied the method to proteins from several well-studied families: globins, immunoglobulins, serine proteases, dihydrofolate reductases, and DNA methyltransferases. Agreement with manually curated alignments is excellent. A web-based server and additional supporting information are available at http://engpub1.bu.edu/-josephs.
18.
We present a comprehensive evaluation of a new structure mining method called PB-ALIGN. It is based on the encoding of protein structure as a 1D sequence of a combination of 16 short structural motifs or protein blocks (PBs). PBs are short motifs capable of representing most of the local structural features of a protein backbone. Using a derived PB substitution matrix and a simple dynamic programming algorithm, PB sequences are aligned in the same way as amino acid sequences to yield a structure alignment. Alignment of these local features as a sequence of symbols enables fast detection of structural similarities between two proteins. The ability of the method to characterize and align regions beyond regular secondary structures, for example, N and C caps of helices and loops connecting regular structures, puts it a step ahead of existing methods, which strongly rely on secondary structure elements. PB-ALIGN achieved an efficiency of 85% in extracting the true fold from a large database of 7259 SCOP domains and identified true super-family members in 82% of cases. In comparison with 13 existing structure comparison/mining methods, PB-ALIGN emerged as the best on the general ability test dataset and was on a par with methods like YAKUSA and CE on the nontrivial test dataset. Furthermore, the proposed method performed well when compared to a flexible structure alignment method like FATCAT and outperformed it in processing speed (less than 45 s per database scan). This work also establishes a reliable cut-off value for the demarcation of similar folds. It finally shows that global alignment scores of unrelated structures using PBs follow an extreme value distribution. PB-ALIGN is freely available on a web server called Protein Block Expert (PBE) at http://bioinformatics.univ-reunion.fr/PBE/.
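Aligning PB sequences "the same way as amino acid sequences" can be sketched with a standard global alignment score (Needleman-Wunsch style dynamic programming). This is not the PB-ALIGN code: the toy scoring function stands in for the derived PB substitution matrix, and the one-letter block codes, match and mismatch scores and linear gap penalty are all hypothetical.

```python
def global_alignment_score(seq_a, seq_b, score, gap=-2):
    """Return the optimal global alignment score of two symbol sequences."""
    n, m = len(seq_a), len(seq_b)
    # dp[i][j] = best score for aligning seq_a[:i] with seq_b[:j]
    dp = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        dp[i][0] = i * gap
    for j in range(1, m + 1):
        dp[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            dp[i][j] = max(
                dp[i - 1][j - 1] + score(seq_a[i - 1], seq_b[j - 1]),  # align two blocks
                dp[i - 1][j] + gap,                                    # gap in seq_b
                dp[i][j - 1] + gap,                                    # gap in seq_a
            )
    return dp[n][m]

def toy_score(x, y):
    """Hypothetical stand-in for a PB substitution matrix."""
    return 2 if x == y else -1

identical = global_alignment_score("mmmd", "mmmd", toy_score)
with_gap = global_alignment_score("mmd", "md", toy_score)
```

Because each structure is reduced to a string of block symbols, the cost is proportional to the product of the two sequence lengths, which is what makes a full database scan fast.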
19.
Robinson DM Jones DT Kishino H Goldman N Thorne JL 《Molecular biology and evolution》2003,20(10):1692-1704
Markovian models of protein evolution that relax the assumption of independent change among codons are considered. With this comparatively realistic framework, an evolutionary rate at a site can depend both on the state of the site and on the states of surrounding sites. By allowing a relatively general dependence structure among sites, models of evolution can reflect attributes of tertiary structure. To quantify the impact of protein structure on protein evolution, we analyze protein-coding DNA sequence pairs with an evolutionary model that incorporates effects of solvent accessibility and pairwise interactions among amino acid residues. By explicitly considering the relationship between nonsynonymous substitution rates and protein structure, this approach can lead to refined detection and characterization of positive selection. Analyses of simulated sequence pairs indicate that parameters in this evolutionary model can be well estimated. Analyses of lysozyme c and annexin V sequence pairs yield the biologically reasonable result that amino acid replacement rates are higher when the replacements lead to energetically favorable proteins than when they destabilize the proteins. Although the focus here is evolutionary dependence among codons that is associated with protein structure, the statistical approach is quite general and could be applied to diverse cases of evolutionary dependence where surrogates for sequence fitness can be measured or modeled.
20.
In the present paper, we describe how a directed graph was constructed and then searched for the optimum path using a dynamic programming approach, based on the secondary structure propensity of the protein short sequence derived from a training data set. The protein secondary structure was thus predicted in this way. The average three-state accuracy of the algorithm used was 76.70%.