Similar literature
20 similar documents found (search time: 15 ms)
1.
2.

Background  

We introduce the decision support system for Protein (Structure) Comparison, Knowledge, Similarity and Information (ProCKSI). ProCKSI integrates various protein similarity measures through an easy-to-use interface that allows the comparison of multiple proteins simultaneously. It employs the Universal Similarity Metric (USM), the Maximum Contact Map Overlap (MaxCMO) of protein structures, and external methods such as DaliLite, TM-align, the Combinatorial Extension (CE) of the optimal path, and the FAST Align and Search Tool (FAST). Additionally, ProCKSI allows the user to upload a user-defined similarity matrix to supplement the methods mentioned, and it computes a similarity consensus in order to provide a rich, integrated, multicriteria view of large datasets of protein structures.

3.

Background

For over a decade the idea of representing biological sequences in a continuous coordinate space has maintained its appeal but has not been fully realized. The basic idea is that any sequence of symbols may define trajectories in the continuous space that conserve all its statistical properties. Ideally, such a representation would allow scale-independent sequence analysis, without the context of a fixed memory length. A simple example would consist of being able to infer the homology between two sequences solely by comparing the coordinates of any two homologous units.

Results

We have successfully identified such an iterative function for the bijective mapping ψ of discrete sequences into objects of continuous state space that enables scale-independent sequence analysis. The technique, named Universal Sequence Mapping (USM), is applicable to sequences of arbitrary length with an arbitrary number of unique units, and it generates a representation in which map distance estimates sequence similarity. The novel USM procedure is based on earlier work by these and other authors on the properties of Chaos Game Representation (CGR). The latter enables the representation of sequences with four unit types (like DNA) as an order-free Markov chain transition table. The properties of USM are illustrated with test data and can be verified for other data by using the accompanying web-based tool: http://bioinformatics.musc.edu/~jonas/usm/.

Conclusions

USM is shown to enable a statistical mechanics approach to sequence analysis. The scale independent representation frees sequence analysis from the need to assume a memory length in the investigation of syntactic rules.
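
As a rough illustration of the Chaos Game Representation that USM is said to generalize, the sketch below maps a DNA sequence to points in the unit square by moving halfway toward the corner assigned to each base. The corner assignment, the starting point, and the function name `cgr_coordinates` are assumptions made for illustration; they are not taken from the abstract, and this is not the USM procedure itself.

```python
# Hypothetical assignment of the four DNA bases to unit-square corners
# (an assumption for this sketch, not specified in the abstract).
CORNERS = {"A": (0.0, 0.0), "C": (0.0, 1.0), "G": (1.0, 1.0), "T": (1.0, 0.0)}


def cgr_coordinates(sequence, start=(0.5, 0.5)):
    """Return the CGR point reached after each symbol of `sequence`."""
    x, y = start
    points = []
    for base in sequence:
        cx, cy = CORNERS[base]
        # Move halfway from the current point toward this base's corner.
        x, y = (x + cx) / 2.0, (y + cy) / 2.0
        points.append((x, y))
    return points


if __name__ == "__main__":
    for point in cgr_coordinates("ACGT"):
        print(point)
```

In this iteration, two sequences that end in the same k symbols finish within 2^-k of each other along each axis, which is the sense in which distance between coordinates can reflect shared sequence context.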

4.
A multiple alignment methodology that can produce high-quality alignment is extremely important for predicting the structure of unknown proteins. Nearly all the methodologies developed so far have employed two-way alignment only. Although these methods are fast, the alignments they produce lose reliability as the similarity of sequences reduces. We developed the MASCOT multiple alignment system. MASCOT can sustain the reliability of alignment even when the similarity of sequences is low. MASCOT achieves high-quality alignment by employing three-way alignment in addition to two-way alignment. The resultant alignments are refined by simulated annealing to higher quality. We also use a cluster analysis of sequences to produce highly reliable alignments.

5.

Background  

Recently, Almeida and Vinga offered a new approach for the representation of arbitrary discrete sequences, referred to as Universal Sequence Maps (USM), and discussed its applicability to genomic sequence analysis. Their work generalizes and extends Chaos Game Representation (CGR) of DNA for arbitrary discrete sequences.

6.
Most phylogenetic-tree building applications use multiple sequence alignments as a starting point. A recent meta-level methodology, called Heads or Tails, aims to reveal the quality of multiple sequence alignments by comparing alignments taken in the forward direction with the alignments of the same sequences when the sequences are reversed. Through an examination of a special case of multiple sequence alignment (pair-wise alignments, where an optimal algorithm exists) and the use of a modified global-alignment application, it is shown that the forward and reverse alignments, even when they are the same, do not capture all the possible variations in the alignments, and that when the forward and reverse alignments differ there may be other alignments that remain unaccounted for. The implication is that comparing just the forward and (biologically irrelevant) reverse alignments is not sufficient to capture the variability in multiple sequence alignments, and the Heads or Tails methodology is therefore not suitable as a method for investigating multiple sequence alignment accuracy. Part of the reason is the inability of individual multiple sequence alignment applications to adequately sample the space of possible alignments. A further implication is that the Hall [Hall, B.G., 2008. Mol. Biol. Evol. 25, 1576-1580] methodology may create optimal synthetic multiple sequence alignments that extant aligners will be unable to completely recover ab initio due to alternative alignments being possible at particular sites. In general, it is shown that more divergent sequences will give rise to an increased number of alternative alignments, so sequence sets with a higher degree of similarity are preferable to sets with lower similarity as the starting point for phylogenetic tree building. © The Willi Hennig Society 2009.

7.
When preparing data sets of amino acid or nucleotide sequences it is necessary to exclude redundant or homologous sequences in order to avoid overestimating the predictive performance of an algorithm. For some time methods for doing this have been available in the area of protein structure prediction. We have developed a similar procedure based on pair-wise alignments for sequences with functional sites. We show how a correlation coefficient between sequence similarity and functional homology can be used to compare the efficiency of different similarity measures and choose a nonarbitrary threshold value for excluding redundant sequences. The impact of the choice of scoring matrix used in the alignments is examined. We demonstrate that the parameter determining the quality of the correlation is the relative entropy of the matrix, rather than the assumed (PAM or identity) substitution model. Results are presented for the case of prediction of cleavage sites in signal peptides. By inspection of the false positives, several errors in the database were found. The procedure presented may be used as a general outline for finding a problem-specific similarity measure and threshold value for analysis of other functional amino acid or nucleotide sequence patterns.

8.
Lin HN, Notredame C, Chang JM, Sung TY, Hsu WL. PLoS ONE 2011, 6(12): e27872
Most sequence alignment tools can successfully align protein sequences with high levels of sequence identity. The accuracy of the corresponding structure alignment, however, decreases rapidly when considering distantly related sequences (<20% identity). In this range of identity, alignments optimized so as to maximize sequence similarity are often inaccurate from a structural point of view. Over the last two decades, most multiple protein aligners have been optimized for their capacity to reproduce structure-based alignments while using sequence information. Methods currently available differ essentially in how they measure similarity between aligned residues: using substitution matrices, Fourier transform, sophisticated profile-profile functions, or, more recently, consistency-based approaches. In this paper, we present a flexible similarity measure for residue pairs to improve the quality of protein sequence alignment. Our approach, called SymAlign, relies on the identification of conserved words found across a sizeable fraction of the considered dataset and supported by evolutionary analysis. These words are then used to define a position-specific substitution matrix that better reflects the biological significance of local similarity. The experimental results show that the SymAlign scoring scheme can be incorporated within T-Coffee to improve sequence alignment accuracy. We also demonstrate that SymAlign is less sensitive to the presence of structurally non-similar proteins. In the analysis of the relationship between sequence identity and structure similarity, SymAlign can better differentiate structurally similar proteins from non-similar proteins. We show that protein sequence alignments can be significantly improved using a similarity estimation based on weighted n-grams. In our analysis of the alignments thus produced, sequence conservation becomes a better indicator of structural similarity.
SymAlign also provides alignment visualization that can display sub-optimal alignments on dot-matrices. The visualization makes it easy to identify well-supported alternative alignments that may not have been identified by dynamic programming. SymAlign is available at http://bio-cluster.iis.sinica.edu.tw/SymAlign/.

9.
Abstract: Amplification and sequence analysis of the 16S rRNA genes from DNA samples extracted directly from the environment allows the study of microbial diversity in natural ecosystems without the need for cultivation. In this study this methodology has been applied to two coastal lagoons. Activity and numbers of heterotrophic bacteria have indicated that, as expected, Prévost lagoon (located on the French Mediterranean coast) is more eutrophic than Arcachon Bay (French Atlantic coast). Analysis of partial 16S rRNA gene sequences revealed that, in both environments, a relatively large number of clones were related to Cytophaga/Flexibacter/Bacteroides as well as to α-Proteobacteria. One hundred percent similarity with sequences in the databases was not found for any of the more than one hundred clones studied; in fact, for most clones the maximum similarity was below 95% for the approximately 200 bases sequenced. Similarity was not higher with any of the sequences found for the 14 isolates (pure cultures) obtained from the same samples. Redundancy, i.e. the number of identical sequences, was higher in the samples from Arcachon. In addition, sequences related to representatives of ten major phylogenetic branches of Bacteria were obtained from Prévost lagoon, whereas only five branches were represented in the data from Arcachon. These findings indicated a higher bacterial phylogenetic diversity in the Prévost lagoon.

10.

Background  

Protein alignments are an essential tool for many bioinformatics analyses. While sequence alignments are accurate for proteins of high sequence similarity, they become unreliable as they approach the so-called 'twilight zone', where sequence similarity becomes indistinguishable from random. For such distant pairs, structure alignment is of much better quality. Nevertheless, sequence alignment is the only choice in the majority of cases, where structural data are not available. This situation demands the development of methods that extend the applicability of accurate sequence alignment to distantly related proteins.

11.
Sequence alignment is an important bioinformatics tool for identifying homology, but searching against the full set of available sequences is likely to result in many hits to poorly annotated sequences providing very little information. Consequently, we often want alignments against a specific subset of sequences: for instance, we are looking for sequences from a particular species, sequences that have known 3D structures, sequences that have a reliable (curated) function annotation, and so on. Although such subset databases are readily available, they only represent a small fraction of all sequences. Thus, the likelihood of finding close homologs for query sequences is smaller, and the alignments will in general have lower scores. This makes it difficult to distinguish hits to homologous sequences from random hits to unrelated sequences. Here, we propose a method that addresses this problem by first aligning query sequences against a large database representing the corpus of known sequences, and then constructing indirect (or transitive) alignments by combining the results with alignments from the large database against the desired target database. We compare the results to direct pairwise alignments, and show that our method gives us higher sensitivity alignments against the target database.
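
The idea of an indirect (transitive) alignment can be sketched as the composition of two residue-level mappings: query to intermediate, and intermediate to target. The dict-based representation and the name `compose_alignments` are assumptions for illustration only; the abstract does not describe how the authors combine or score alignments.

```python
def compose_alignments(query_to_mid, mid_to_target):
    """Chain two residue-level alignments into a query-to-target alignment.

    Each alignment is a dict mapping a position in the first sequence to the
    aligned position in the second; gapped positions are simply absent.
    """
    transitive = {}
    for q_pos, m_pos in query_to_mid.items():
        # Keep a query position only if its intermediate partner is itself
        # aligned to some position in the target.
        if m_pos in mid_to_target:
            transitive[q_pos] = mid_to_target[m_pos]
    return transitive


if __name__ == "__main__":
    query_to_mid = {0: 2, 1: 3, 2: 4, 3: 6}
    mid_to_target = {3: 0, 4: 1, 5: 2, 6: 3}
    print(compose_alignments(query_to_mid, mid_to_target))
```

Query positions whose intermediate partner falls in a gap of the second alignment drop out, so the transitive alignment can only be as long as the overlap of the two direct ones.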

12.
13.
By searching the current protein sequence databases using sequences from human and chicken histones H1/H5, H2A, H2B, H3 and H4, a database of aligned histone protein sequences with statistically significant sequence similarity to the search sequence was constructed. In addition, a nucleotide sequence database of the corresponding coding regions for these proteins has been assembled. The region of each of the core histones containing the histone fold motif is identified in the protein alignments. The database contains >1300 protein and nucleotide sequences. All sequences and alignments in this database are available through the World Wide Web at http://www.ncbi.nlm.nih.gov/Baxevani/HISTONES.

14.
Tan YH, Huang H, Kihara D. Proteins 2006, 64(3): 587-600
Aligning distantly related protein sequences is a long-standing problem in bioinformatics, and a key to successful protein structure prediction. Its importance has been increasing recently in the context of structural genomics projects because more and more experimentally solved structures are available as templates for protein structure modeling. Toward this end, recent structure prediction methods employ profile-profile alignments, and various ways of aligning two profiles have been developed. More fundamentally, a better amino acid similarity matrix can improve a profile itself, thereby resulting in more accurate profile-profile alignments. Here we have developed novel amino acid similarity matrices from knowledge-based amino acid contact potentials. Contact potentials are used because the contact propensity to the other amino acids would be one of the most conserved features of each position of a protein structure. The derived amino acid similarity matrices are tested on benchmark alignments at three different levels, namely, the family, the superfamily, and the fold level. Compared to BLOSUM45 and the other existing matrices, the contact potential-based matrices perform comparably in the family level alignments, but clearly outperform them in the fold level alignments. The contact potential-based matrices perform even better when suboptimal alignments are considered. Comparing the matrices themselves with each other revealed that the contact potential-based matrices are very different from BLOSUM45 and the other matrices, indicating that they are located in a different basin in the amino acid similarity matrix space.

15.
Protein homology detection using string alignment kernels   (Total citations: 2; self-citations: 0; citations by others: 2)
MOTIVATION: Remote homology detection between protein sequences is a central problem in computational biology. Discriminative methods involving support vector machines (SVMs) are currently the most effective methods for the problem of superfamily recognition in the Structural Classification Of Proteins (SCOP) database. The performance of SVMs depends critically on the kernel function used to quantify the similarity between sequences. RESULTS: We propose new kernels for strings adapted to biological sequences, which we call local alignment kernels. These kernels measure the similarity between two sequences by summing up scores obtained from local alignments with gaps of the sequences. When tested in combination with SVM on their ability to recognize SCOP superfamilies on a benchmark dataset, the new kernels outperform state-of-the-art methods for remote homology detection. AVAILABILITY: Software and data available upon request.

16.
Introduction

In the present study, we sought to quantify and contrast the secretome and biomechanical properties of the non-chondrodystrophic (NCD) and chondrodystrophic (CD) canine intervertebral disc (IVD) nucleus pulposus (NP).

Methods

We used iTRAQ proteomic methods to quantify the secretome of both CD and NCD NP. Differential levels of proteins detected were further verified using immunohistochemistry, Western blotting, and proteoglycan extraction in order to evaluate the integrity of the small leucine-rich proteoglycans (SLRPs) decorin and biglycan. Additionally, we used robotic biomechanical testing to evaluate the biomechanical properties of spinal motion segments from both CD and NCD canines.

Results

We detected differential levels of decorin, biglycan, and fibronectin, as well as of other important extracellular matrix (ECM)-related proteins, such as fibromodulin and HAPLN1 in the IVD NP obtained from CD canines compared with NCD canines. The core proteins of the vital SLRPs decorin and biglycan were fragmented in CD NP but were intact in the NP of the NCD animals. CD and NCD vertebral motion segments demonstrated significant differences, with the CD segments having less stiffness and a more varied range of motion.

Conclusions

The CD NP recapitulates key elements of human degenerative disc disease. Our data suggest that at least some of the compromised biomechanical properties of the degenerative disc arise from fibrocartilaginous metaplasia of the NP secondary to fragmentation of SLRP core proteins and associated degenerative changes affecting the ECM. This study demonstrates that the degenerative changes that naturally occur within the CD NP make this animal a valuable animal model with which to study IVD degeneration and potential biological therapeutics.

Electronic supplementary material

The online version of this article (doi:10.1186/s13075-015-0733-z) contains supplementary material, which is available to authorized users.

17.
A strategy for finding regions of similarity in complete genome sequences   (Total citations: 3; self-citations: 2; citations by others: 1)
MOTIVATION: Complete genomic sequences will become available in the future. New methods to deal with very large sequences (sizes beyond 100 kb) efficiently are required. One of the main aims of such work is to increase our understanding of genome organization and evolution. This requires studies of the locations of regions of similarity. RESULTS: We present here a new tool, ASSIRC ('Accelerated Search for SImilarity Regions in Chromosomes'), for finding regions of similarity in genomic sequences. The method involves three steps: (i) identification of short exact chains of fixed size, called 'seeds', common to both sequences, using hashing functions; (ii) extension of these seeds into putative regions of similarity by a 'random walk' procedure; (iii) final selection of regions of similarity by assessing alignments of the putative sequences. We used simulations to estimate the proportion of regions of similarity not detected for particular region sizes, base identity proportions and seed sizes. This approach can be tailored to the user's specifications. We looked for regions of similarity between two yeast chromosomes (V and IX). The efficiency of the approach was compared to that of the conventional programs BLAST and FASTA, by assessing the CPU time required and the regions of similarity found for the same data set. AVAILABILITY: Source programs are freely available at the following address: ftp://ftp.biologie.ens.fr/pub/molbio/assirc.tar.gz CONTACT: vincens@biologie.ens.fr, hazout@urbb.jussieu.fr
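
Step (i) above, finding fixed-size exact words shared by two sequences through hashing, can be sketched as follows. This is a minimal illustration that uses Python's built-in dict hashing; the function name `find_seeds` and its parameters are hypothetical, and the sketch is not the ASSIRC implementation. The random-walk extension and alignment-based selection of steps (ii) and (iii) are not covered.

```python
from collections import defaultdict


def find_seeds(seq_a, seq_b, seed_size):
    """Return sorted (pos_a, pos_b) pairs where both sequences share a word."""
    # Hash every word of length `seed_size` in the first sequence.
    index = defaultdict(list)
    for i in range(len(seq_a) - seed_size + 1):
        index[seq_a[i:i + seed_size]].append(i)

    # Look up each word of the second sequence in the index.
    seeds = []
    for j in range(len(seq_b) - seed_size + 1):
        for i in index.get(seq_b[j:j + seed_size], ()):
            seeds.append((i, j))
    return sorted(seeds)


if __name__ == "__main__":
    print(find_seeds("ACGTAC", "TTACGT", 4))
```

Indexing one sequence and scanning the other keeps the work roughly linear in the combined length plus the number of seeds reported; a larger seed size reduces chance matches but can miss regions with frequent mismatches.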

18.
All popular algorithms for pair-wise alignment of protein primary structures (e.g. Smith-Waterman (SW), FASTA, BLAST, etc.) utilize only amino acid sequences. The SW algorithm is the most accurate among them, i.e. it produces alignments that are most similar to the alignments obtained by superposition of protein 3D structures. But even the SW algorithm is unable to restore the 3D-based alignment if the similarity of the amino acid sequences (%id) is below 30%. We have proposed a novel alignment method that explicitly takes into account the secondary structure of the compared proteins. We have shown that it creates significantly more accurate alignments compared to the SW algorithm. In particular, for sequences with %id < 30% the average accuracy of the new method is 58% compared to 35% for the SW algorithm (the accuracy of an algorithmic sequence alignment is the fraction of positions of a "gold standard" alignment, obtained by superposition of the corresponding 3D structures, that it restores). The accuracy of the proposed method is approximately the same for experimentally determined and for theoretically predicted secondary structures. Thus the method can be applied to the alignment of protein sequences even if the protein 3D structure is unknown. The program is available at ftp://194.149.64.196/STRUSWER/.
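
The accuracy measure defined in parentheses above can be sketched as the share of gold-standard aligned residue pairs that also appear in the algorithmic alignment. Representing an alignment as a collection of (position in sequence 1, position in sequence 2) pairs, and the name `alignment_accuracy`, are assumptions for illustration; the abstract does not specify how the authors compute the measure in practice.

```python
def alignment_accuracy(predicted_pairs, reference_pairs):
    """Fraction of reference aligned residue pairs recovered by the prediction."""
    reference = set(reference_pairs)
    if not reference:
        raise ValueError("reference alignment has no aligned pairs")
    # Count reference pairs that the predicted alignment reproduces exactly.
    restored = reference & set(predicted_pairs)
    return len(restored) / len(reference)


if __name__ == "__main__":
    reference = [(0, 0), (1, 1), (2, 2), (3, 4)]
    predicted = [(0, 0), (1, 1), (2, 3), (3, 4)]
    print(alignment_accuracy(predicted, reference))  # 0.75
```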

19.
Constructing multiple homologous alignments for protein-coding DNA sequences is crucial for a variety of bioinformatic analyses but remains computationally challenging. With the growing amount of sequence data available and the ongoing efforts largely dependent on protein-coding DNA alignments, there is an increasing demand for a tool that can process a large number of homologous groups and generate multiple protein-coding DNA alignments. Here we present a parallel tool, ParaAT, that is capable of constructing multiple protein-coding DNA alignments in parallel for a large number of homologs. As tested on empirical datasets, ParaAT is well suited for large-scale data analysis in the high-throughput era, providing good scalability and exhibiting high parallel efficiency for computationally demanding tasks. ParaAT is freely available for academic use only at http://cbb.big.ac.cn/software.

20.
Racolta S, Juhl PB, Sirim D, Pleiss J. Proteins 2012, 80(8): 2009-2019
Triterpene cyclases catalyze a broad range of cyclization reactions to form polycyclic triterpenes. Triterpene cyclases that convert squalene to hopene are named squalene-hopene cyclases (SHC) and triterpene cyclases that convert oxidosqualene are named oxidosqualene cyclases (OSC). Many sequences have been published, but there is only one structure available for each of SHCs and OSCs. Although they catalyze a similar reaction, the sequence similarity between SHCs and OSCs is low. A family classification based on phylogenetic analysis revealed 20 homologous families which are grouped into two superfamilies, SHCs and OSCs. Based on this family assignment, the Triterpene Cyclase Engineering Database (TTCED) was established. It integrates available information on sequence and structure of 639 triterpene cyclases as well as on structurally and functionally relevant amino acids. Family specific multiple sequence alignments were generated to identify the functionally relevant residues. Based on sequence alignments, conserved residues in SHCs and OSCs were analyzed and compared to experimentally confirmed mutational data. Functional schematic models of the central cavities of OSCs and SHCs were derived from structure comparison and sequence conservation analysis. These models demonstrate the high similarity of the substrate binding cavity of SHCs and OSCs and the equivalences of the respective residues. The TTCED is a novel source for comprehensive information on the triterpene cyclase family, including a compilation of previously described mutational data. The schematic models present the conservation analysis in a readily available fashion and facilitate the correlation of residues to a specific function or substrate interaction.
