Similar articles
Found 20 similar articles (search time: 31 ms)
1.
The alpha-helical coiled coil can adopt a variety of topologies, among the most common of which are parallel and antiparallel dimers and trimers. We present Multicoil2, an algorithm that predicts both the location and oligomerization state (two versus three helices) of coiled coils in protein sequences. Multicoil2 combines the pairwise correlations of the previous Multicoil method with the flexibility of Hidden Markov Models (HMMs) in a Markov Random Field (MRF). The resulting algorithm integrates sequence features, including pairwise interactions, through multinomial logistic regression to devise an optimized scoring function for distinguishing dimer, trimer and non-coiled-coil oligomerization states; this scoring function is used to produce Markov Random Field potentials that incorporate pairwise correlations localized in sequence. Multicoil2 significantly improves both coiled-coil detection and dimer versus trimer state prediction over the original Multicoil algorithm retrained on a newly constructed database of coiled-coil sequences. The new database, comprising 2,105 sequences containing 124,088 residues, includes reliable structural annotations based on experimental data in the literature. Notably, the enhanced performance of Multicoil2 is evident when tested in stringent leave-family-out cross-validation on the new database, reflecting expected performance on challenging new prediction targets that have minimal sequence similarity to known coiled-coil families. The Multicoil2 program and training database are available for download from http://multicoil2.csail.mit.edu.

2.
3.
In this study we apply a genetic algorithm to a set of RNA sequences to find common RNA secondary structures. Our method is a three-step procedure. In the first step, for each sequence, a genetic algorithm is used to optimize the structures in a population to a certain degree of stability. In this step, the free energy of a structure is the fitness criterion for the algorithm. Next, for each structure, we define a measure of structural conservation with respect to those in other sequences. We use this measure in a genetic algorithm to improve the structural similarity among sequences for the structures in the population of a sequence. Finally, we select those structures satisfying certain conditions of structural stability and similarity as predicted common structures for a set of RNA sequences. We have obtained satisfactory results from sets of tRNA, 5S rRNA, rev response elements (RRE) of HIV-1, and RRE of HIV-2/SIV.

4.
We propose a new method, called the ‘size leap’ algorithm, to search for motifs of maximum size that are common to at least two fragments. It allows the creation of a reduced database of motifs from a set of sequences whose size obeys the series of Fibonacci numbers. The convenience lies in the efficiency of the motif extraction. It can be applied in the establishment of overlap regions for DNA sequence reconstruction and multiple alignment of biological sequences. The method of complete DNA sequence reconstruction by extraction of the longest motifs (‘anchor motifs’) is presented as an application of the size leap algorithm. The details of a reconstruction from three sequenced fragments are given as an example. Received on February 12, 1991; accepted on February 15, 1991

5.
MOTIVATION: Non-coding RNA genes and RNA structural regulatory motifs play important roles in gene regulation and other cellular functions. They are often characterized by specific secondary structures that are critical to their functions and are often conserved in phylogenetically or functionally related sequences. Predicting common RNA secondary structures in multiple unaligned sequences remains a challenge in bioinformatics research. METHODS AND RESULTS: We present a new sampling based algorithm to predict common RNA secondary structures in multiple unaligned sequences. Our algorithm finds the common structure between two sequences by probabilistically sampling aligned stems based on stem conservation calculated from intrasequence base pairing probabilities and intersequence base alignment probabilities. It iteratively updates these probabilities based on sampled structures and subsequently recalculates stem conservation using the updated probabilities. The iterative process terminates upon convergence of the sampled structures. We extend the algorithm to multiple sequences by a consistency-based method, which iteratively incorporates and reinforces consistent structure information from pairwise comparisons into consensus structures. The algorithm has no limitation on predicting pseudoknots. In extensive testing on real sequence data, our algorithm outperformed other leading RNA structure prediction methods in both sensitivity and specificity with a reasonably fast speed. It also generated better structural alignments than other programs in sequences of a wide range of identities, which more accurately represent the RNA secondary structure conservations. AVAILABILITY: The algorithm is implemented in a C program, RNA Sampler, which is available at http://ural.wustl.edu/software.html

6.
Rhizobium leguminosarum biovar phaseoli CFN23 loses its ability to nodulate beans at a high frequency because of a deletion of part of its symbiotic (pSym) plasmid (Soberón-Chávez et al., 1986). We report here that at least 80 kb of pSym are deleted upon loss of the symbiotic phenotype; the deletion removes the nitrogenase structural nifHDK and the common nodABC genes. The size of the deleted pSym is not reduced, since it is accompanied by an amplification of other pSym plasmid sequences. This genetic rearrangement is similar to the deletion and amplification of yeast mitochondrial DNA leading to 'petite' mutations.

7.
We report an interesting case of structural similarity between 2 small, nonhomologous proteins, the third domain of ovomucoid (ovomucoid) and the C-terminal fragment of ribosomal L7/L12 protein (CTF). The region of similarity consists of a 3-stranded beta-sheet and an alpha-helix. This region is highly similar; the corresponding elements of secondary structure share a common topology, and the RMS difference for "equivalent" C alpha atoms is 1.6 Å. Surprisingly, this common structure arises from completely different sequences. For the common core, the sequence identity is less than 3%, and there is neither significant sequence similarity nor similarity in the position or orientation of conserved hydrophobic residues. This superposition raises the question of how 2 entirely different sequences can produce an identical structure. Analyzing this common region in ovomucoid revealed that it is stabilized by disulfide bonds. In contrast, the corresponding structure in CTF is stabilized in the alpha-helix by a composition of residues with high helix-forming propensities. This result suggests that different sequences and different stabilizing interactions can produce an identical structure.

8.
MOTIVATION: Pair-wise alignment of protein sequences and local similarity searches produce many false positives because of compositionally biased regions, also called low-complexity regions (LCRs), of amino acid residues. Masking and filtering such regions significantly improves the reliability of homology searches and, consequently, functional predictions. Most of the available algorithms are based on a statistical approach. We wished to investigate the structural properties of LCRs in biological sequences and develop an algorithm for filtering them. RESULTS: We present an algorithm for detecting and masking LCRs in protein sequences to improve the quality of database searches. We developed the algorithm based on the complexity analysis of subsequences delimited by a pair of identical, repeating subsequences. Given a protein sequence, the algorithm first computes the suffix tree of the sequence. It then collects repeating subsequences from the tree. Finally, the algorithm iteratively tests whether each subsequence delimited by a pair of repeating subsequences meets a given criterion. Test results with 1000 proteins from 20 families in Pfam show that the repeating subsequences are a good indicator of low-complexity regions, and that the algorithm based on such structural information competes strongly with other methods. AVAILABILITY: http://bioinfo.knu.ac.kr/research/CARD/ CONTACT: swshin@bioinfo.knu.ac.kr

9.
We present a new algorithm for the display of RNA secondary structure. The principle of the algorithm is entirely different from those currently in use in that our algorithm is ‘object oriented’ while current algorithms are ‘procedural’. The circular RNA molecule of chrysanthemum stunt viroid was used as input data for demonstrating the operation of the program. The major interest of this method will be found in its potential use in simulation graphics of RNA folding processes. Received on October 9, 1986; accepted on February 17, 1987

10.
We have previously reported an algorithm (Yamamoto and Yoshikura, 1985) for the prediction of optimum and suboptimum RNA folding structures. The calculation data was presented as an ‘information map’. However, the result was affected by the starting point of calculation. In this paper, we have improved the method so that the result will not be affected by the starting point. In addition, we present a method of converting the information map into a set of pictures of optimal and suboptimal molecular structures. Received on May 19, 1986; accepted on December 22, 1986

11.
The reconstruction and synthesis of ancestral RNAs is a feasible goal for paleogenetics. This will require new bioinformatics methods, including a robust statistical framework for reconstructing histories of substitutions, indels and structural changes. We describe a “transducer composition” algorithm for extending pairwise probabilistic models of RNA structural evolution to models of multiple sequences related by a phylogenetic tree. This algorithm draws on formal models of computational linguistics as well as the 1985 protosequence algorithm of David Sankoff. The output of the composition algorithm is a multiple-sequence stochastic context-free grammar. We describe dynamic programming algorithms, which are robust to null cycles and empty bifurcations, for parsing this grammar. Example applications include structural alignment of non-coding RNAs, propagation of structural information from an experimentally-characterized sequence to its homologs, and inference of the ancestral structure of a set of diverged RNAs. We implemented the above algorithms for a simple model of pairwise RNA structural evolution; in particular, the algorithms for maximum likelihood (ML) alignment of three known RNA structures and a known phylogeny and inference of the common ancestral structure. We compared this ML algorithm to a variety of related, but simpler, techniques, including ML alignment algorithms for simpler models that omitted various aspects of the full model and also a posterior-decoding alignment algorithm for one of the simpler models. In our tests, incorporation of basepair structure was the most important factor for accurate alignment inference; appropriate use of posterior-decoding was next; and fine details of the model were least important. Posterior-decoding heuristics can be substantially faster than exact phylogenetic inference, so this motivates the use of sum-over-pairs heuristics where possible (and approximate sum-over-pairs). For more exact probabilistic inference, we discuss the use of transducer composition for ML (or MCMC) inference on phylogenies, including possible ways to make the core operations tractable.

12.
One of the main advantages of de novo gene synthesis is the fact that it frees the researcher from any limitations imposed by the use of natural templates. To make the most out of this opportunity, efficient algorithms are needed to calculate a coding sequence, combining different requirements, such as adapted codon usage or avoidance of restriction sites, in the best possible way. We present an algorithm where a “variation window” covering several amino acid positions slides along the coding sequence. Candidate sequences are built comprising the already optimized part of the complete sequence and all possible combinations of synonymous codons representing the amino acids within the window. The candidate sequences are assessed with a quality function, and the first codon of the best candidates’ variation window is fixed. Subsequently the window is shifted by one codon position. As an example of a freely accessible software implementing the algorithm, we present the Mr. Gene web-application. Additionally two experimental applications of the algorithm are shown.
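The sliding "variation window" described in this abstract can be illustrated with a minimal Python sketch. It is not the Mr. Gene implementation: the codon table is a small subset of the standard genetic code, and the weights in `WEIGHTS`, the `quality` function, the penalty value, and the single forbidden site (the EcoRI site GAATTC) are all hypothetical choices made for illustration.

```python
from itertools import product

# Subset of the standard genetic code (synonymous codons per amino acid).
CODONS = {
    "M": ["ATG"],
    "K": ["AAA", "AAG"],
    "F": ["TTT", "TTC"],
    "E": ["GAA", "GAG"],
}

# Hypothetical codon-preference weights; not taken from any real organism.
WEIGHTS = {"ATG": 1.0, "AAA": 0.7, "AAG": 0.3, "TTT": 0.4,
           "TTC": 0.6, "GAA": 0.7, "GAG": 0.3}


def quality(dna, forbidden=("GAATTC",)):
    """Toy quality function: codon weights minus a penalty per forbidden site."""
    score = sum(WEIGHTS[dna[i:i + 3]] for i in range(0, len(dna), 3))
    hits = sum(dna.startswith(site, i)
               for site in forbidden
               for i in range(len(dna) - len(site) + 1))
    return score - 10.0 * hits


def optimize(protein, window=3):
    """Slide a variation window along the protein, fixing one codon per step."""
    fixed = ""
    for pos in range(len(protein)):
        residues = protein[pos:pos + window]
        # Enumerate every synonymous-codon combination inside the window and
        # score it together with the part of the sequence that is already fixed.
        best = max(product(*(CODONS[aa] for aa in residues)),
                   key=lambda combo: quality(fixed + "".join(combo)))
        fixed += best[0]  # keep only the first codon, then shift the window
    return fixed


print(optimize("MEFK"))
```

For the peptide MEFK, the individually preferred codons GAA and TTC would together create a GAATTC site, so the sketch settles on TTT for phenylalanine instead. The cost per step grows with the product of the synonymous-codon counts inside the window, which is why the window is kept short.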

13.
A new approach to search for common patterns in many sequences is presented. The idea is that one sequence from the set of sequences to be compared is considered as a ‘basic’ one and all its similarities with other sequences are found. Multiple similarities are then reconstructed using these data. This approach allows one to search for similar segments which can differ in both substitutions and deletions/insertions. These segments can be situated at different positions in various sequences. No regions of complete or strong similarity within the segments are required. The other parts of the sequences can have no similarity at all. The only requirement is that the similar segments can be found in all the sequences (or in the majority of them, given that the common segments are present in the basic sequence). The running time of the algorithm presented is proportional to n·L² when n sequences of length L are analyzed. The algorithm proposed is implemented as programs for the IBM-PC and IBM/370. Its applications to the analysis of biopolymer primary structures as well as the dependence of the results on the choice of basic sequence are discussed.

14.
15.
K Han, H J Kim. Nucleic Acids Research, 1993, 21(5): 1251-1257
We have developed an algorithm and a computer program for simultaneously folding homologous RNA sequences. Given an alignment of M homologous sequences of length N, the program performs phylogenetic comparative analysis and predicts a common secondary structure conserved in the sequences. When the structure is not uniquely determined, it infers multiple structures which appear most plausible. This method is superior to energy minimization methods in the sense that it is not sensitive to point mutation of a sequence. It is also superior to usual phylogenetic comparative methods in that it does not require manual scrutiny for covariation or secondary structures. The most plausible 1-5 structures are produced in O(MN² + N³) time and O(N²) space, which are the same requirements as those of widely used dynamic programs based on energy minimization for folding a single sequence. This is probably the first algorithm that is practical in terms of both time and space for finding secondary structures of homologous RNA sequences. The algorithm has been implemented in C on a Sun SparcStation, and has been verified by testing on tRNAs, 5S rRNAs, 16S rRNAs, TAR RNAs of human immunodeficiency virus type 1 (HIV-1), and RRE RNAs of HIV-1. We have also applied the program to cis-acting packaging sequences of HIV-1, for which no generally accepted structures yet exist, and propose potentially stable structures. Simulation of the program with random sequences with the same base composition and the same degree of similarity as the above sequences shows that structures common to homologous sequences are very unlikely to occur by chance in random sequences.

16.
A basic, amphiphilic alpha helix is a structural feature common to a variety of inhibitors of calmodulin and to the calmodulin-binding domains of myosin light chain kinases. To aid in recognizing this structural feature in sequences of peptides and proteins we have developed a computer algorithm which searches for sequences of appropriate length, hydrophobicity, helical hydrophobic moment, and charge to be considered as potential calmodulin-binding sequences. Such sequences occurred infrequently in proteins of known crystal structure. This algorithm was used to find the most likely site in the catalytic (gamma) subunit of phosphorylase b kinase for interaction with calmodulin (the delta subunit). A peptide corresponding to this site (residues 341-361 of the gamma subunit) was synthesized and found to bind calmodulin with approximately an 11 nM dissociation constant. A variant of this peptide in which an aspartic acid at position 7 in its sequence (347 of the gamma subunit) was replaced with an asparagine was found to bind calmodulin with approximately a 3 nM dissociation constant.
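The abstract does not give the scoring details, so the following Python sketch only illustrates the kind of window scan it describes. It uses the commonly cited helical hydrophobic moment (a vector sum of per-residue hydrophobicities at 100 degrees per residue), a crude net-charge count, and a mean-hydrophobicity filter. The function names, the thresholds, and the hydrophobicity scale passed in as `scale` are assumptions, not the authors' parameters.

```python
import math


def hydrophobic_moment(values, angle_deg=100.0):
    """Helical hydrophobic moment: magnitude of the vector sum of per-residue
    hydrophobicities, with successive residues rotated by angle_deg
    (100 degrees is the usual alpha-helix periodicity)."""
    delta = math.radians(angle_deg)
    s = sum(h * math.sin(i * delta) for i, h in enumerate(values))
    c = sum(h * math.cos(i * delta) for i, h in enumerate(values))
    return math.hypot(s, c)


def net_charge(peptide):
    """Crude net charge: +1 for K/R, -1 for D/E (histidine ignored)."""
    return sum(peptide.count(a) for a in "KR") - sum(peptide.count(a) for a in "DE")


def candidate_windows(seq, scale, length, min_mean, min_moment, min_charge):
    """Return (start, peptide) for windows passing all three hypothetical filters."""
    hits = []
    for start in range(len(seq) - length + 1):
        pep = seq[start:start + length]
        values = [scale[a] for a in pep]
        if (sum(values) / length >= min_mean
                and hydrophobic_moment(values) >= min_moment
                and net_charge(pep) >= min_charge):
            hits.append((start, pep))
    return hits
```

A caller would supply a published hydrophobicity scale as the `scale` dictionary and choose the window length and thresholds to suit the binding domains of interest; none of those values are implied by the abstract.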

17.
Identifying the fold class of a protein sequence of unknown structure is a fundamental problem in modern biology. We apply a supervised learning algorithm to the classification of protein sequences with low sequence identity from a library of 174 structural classes created with the Combinatorial Extension structural alignment methodology. A class of rules is considered that assigns test sequences to structural classes based on the closest match of an amino acid index profile of the test sequence to a profile centroid for each class. A mathematical optimization procedure is applied to determine an amino acid index of maximal structural discriminatory power by maximizing the ratio of between-class to within-class profile variation. The optimal index is computed as the solution to a generalized eigenvalue problem, and its performance for fold classification is compared to that of other published indices. The optimal index has significantly more structural discriminatory power than all currently known indices, including average surrounding hydrophobicity, which it most closely resembles. It demonstrates >70% classification accuracy over all folds and nearly 100% accuracy on several folds with distinctive conserved structural features. Finally, there is a compelling universality to the optimal index in that it does not appear to depend strongly on the specific structural classes used in its computation.
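A ratio-maximization problem of this kind typically reduces to a generalized eigenvalue problem as follows. The notation is an assumption rather than the paper's: $w$ is the amino acid index written as a vector, and $B$ and $W$ are symmetric matrices such that the between-class and within-class profile variations are $w^{\top} B w$ and $w^{\top} W w$, with $W$ positive definite.

```latex
% Objective: ratio of between-class to within-class variation
J(w) = \frac{w^{\top} B\, w}{w^{\top} W\, w}

% Setting the gradient of J to zero gives the stationarity condition
B\, w = \lambda\, W\, w, \qquad \lambda = J(w)

% so the maximizer is the eigenvector belonging to the largest
% generalized eigenvalue:
w^{*} = \arg\max_{w \neq 0} J(w), \qquad J(w^{*}) = \lambda_{\max}
```

Because $J$ is unchanged when $w$ is rescaled, the index obtained this way is determined only up to a multiplicative constant.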

18.
MOTIVATION: Structural RNA genes exhibit unique evolutionary patterns that are designed to conserve their secondary structures; these patterns should be taken into account while constructing accurate multiple alignments of RNA genes. The Sankoff algorithm is a natural alignment algorithm that includes the effect of base-pair covariation in the alignment model. However, the extremely high computational cost of the Sankoff algorithm precludes its application to most RNA sequences. RESULTS: We propose an efficient algorithm for the multiple alignment of structural RNA sequences. Our algorithm is a variant of the Sankoff algorithm, and it uses an efficient scoring system that reduces the time and space requirements considerably without compromising on the alignment quality. First, our algorithm computes the match probability matrix that measures the alignability of each position pair between sequences as well as the base pairing probability matrix for each sequence. These probabilities are then combined to score the alignment using the Sankoff algorithm. By itself, our algorithm does not predict the consensus secondary structure of the alignment but uses external programs for the prediction. We demonstrate that both the alignment quality and the accuracy of the consensus secondary structure prediction from our alignment are the highest among the other programs examined. We also demonstrate that our algorithm can align relatively long RNA sequences such as the eukaryotic-type signal recognition particle RNA that is approximately 300 nt in length; multiple alignment of such sequences has not been possible by using other Sankoff-based algorithms. The algorithm is implemented in the software named 'Murlet'. AVAILABILITY: The C++ source code of the Murlet software and the test dataset used in this study are available at http://www.ncrna.org/papers/Murlet/. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.

19.
An algorithm has been developed for the identification of unknown patterns which are distinctive for a set of short DNA sequences believed to be functionally equivalent. A pattern is defined as being a string, containing fully or partially specified nucleotides at each position of the string. The advantage of this ‘vague’ definition of the pattern is that it imposes minimum constraints on the characterization of patterns. A new feature of the approach developed here is that it allows a ‘fair’ simultaneous testing of patterns of all degrees of degeneracy. This analysis is based on an evaluation of inhomogeneity in the empirical occurrence distribution of any such pattern within a set of sequences. The use of the nonparametric kernel density estimation of Parzen allows one to assess small disturbances among the sequence alignments. The method also makes it possible to identify sequence subsets with different characteristic patterns. This algorithm was implemented in the analysis of patterns characteristic of sets of promoters, terminators and splice junction sequences. The results are compared with those obtained by other methods. Received on November 17, 1986; accepted on June 15, 1987

20.
Pattern matching of biological sequences with limited storage (total citations: 1; self-citations: 0; citations by others: 1)
Existing methods for getting the locally best matched alignments between a pair of biological sequences require O(N²) computational steps and O(N²) storage, where N is the average sequence length. An improved method is presented with which the storage requirement is greatly reduced, while the computational steps remain O(N²). Only a small number of additional steps are required to display any common subsequences with similarity scores greater than a given threshold. The alignments found by the algorithm are optimal in the sense that their scores are locally maximal, where each score is a sum of weights given to individual matches/replacements, insertions and deletions involved in the alignment. The algorithm was implemented in the C programming language on a personal computer. A data area of 64 kbytes of random access memory and a few hundred kbytes on a disk is sufficient for comparing two protein or nucleic acid sequences of 2500 residues. The programs are particularly valuable when used in combination with fast sequence search programs. Received on July 25, 1986; accepted on October 27, 1986
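The abstract does not spell out how the storage is reduced. One standard way to keep the time quadratic while storing only O(N) values is to hold just two rows of the local-alignment scoring matrix, as in the Python sketch below. This is a generic Smith-Waterman-style score computation with a linear gap penalty, not the authors' method, and the `match`, `mismatch` and `gap` weights are arbitrary illustrative values.

```python
def local_alignment_score(a, b, match=2, mismatch=-1, gap=-2):
    """Best local alignment score and the cell where it ends.

    Runs in O(len(a) * len(b)) time but stores only two rows of the
    scoring matrix, i.e. O(len(b)) memory.
    """
    prev = [0] * (len(b) + 1)
    best, best_end = 0, (0, 0)
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # A local alignment may restart anywhere, hence the floor at 0.
            curr[j] = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            if curr[j] > best:
                best, best_end = curr[j], (i, j)
        prev = curr  # the older row is no longer needed
    return best, best_end


print(local_alignment_score("XXACGTXX", "YYACGTYY"))
```

The two-row form yields the score and the end position only; recovering the aligned residues needs extra work, for example re-running the recurrence on the region that ends at the reported cell.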
