Similar literature
1.
Comparison of methods for searching protein sequence databases.   Total citations: 12; self-citations: 2; citations by others: 10
We have compared commonly used sequence comparison algorithms, scoring matrices, and gap penalties using a method that identifies statistically significant differences in performance. Search sensitivity with either the Smith-Waterman algorithm or FASTA is significantly improved by using modern scoring matrices, such as BLOSUM45-55, and optimized gap penalties instead of the conventional PAM250 matrix. More dramatic improvement can be obtained by scaling similarity scores by the logarithm of the length of the library sequence (ln()-scaling). With the best modern scoring matrix (BLOSUM55 or JO93) and optimal gap penalties (-12 for the first residue in the gap and -2 for additional residues), Smith-Waterman and FASTA performed significantly better than BLASTP. With ln()-scaling and optimal scoring matrices (BLOSUM45 or Gonnet92) and gap penalties (-12, -1), the rigorous Smith-Waterman algorithm performs better than either BLASTP or FASTA, although with the Gonnet92 matrix the difference with FASTA was not significant. ln()-scaling performed better than normalization based on other simple functions of library sequence length. ln()-scaling also performed better than scores based on normalized variance, but the differences were not statistically significant for the BLOSUM50 and Gonnet92 matrices. Optimal scoring matrices and gap penalties are reported for Smith-Waterman and FASTA, using conventional or ln()-scaled similarity scores. Searches with no penalty for gap extension, or no penalty for gap opening, or an infinite penalty for gaps performed significantly worse than the best methods. Differences in performance between FASTA and Smith-Waterman were not significant when partial query sequences were used. However, the best performance with complete query sequences was obtained with the Smith-Waterman algorithm and ln()-scaling.
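
The two quantities this abstract turns on, the affine gap penalty and length-scaled scores, can be illustrated with a minimal sketch. The abstract does not spell out the scaling procedure, so the regression of score on ln(library sequence length) below is an assumed interpretation, and the function names `affine_gap_penalty` and `ln_scaled_scores` are hypothetical.

```python
import math


def affine_gap_penalty(length, first=-12, extend=-2):
    """Penalty for a gap of `length` residues: `first` for the first
    residue in the gap, `extend` for each additional residue."""
    if length <= 0:
        return 0
    return first + extend * (length - 1)


def ln_scaled_scores(scores, lengths):
    """Assumed reading of ln()-scaling: fit score = a + b*ln(length) by
    least squares, then express each score as a standardized residual."""
    xs = [math.log(n) for n in lengths]
    count = len(xs)
    mean_x = sum(xs) / count
    mean_y = sum(scores) / count
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, scores))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    residuals = [y - (intercept + slope * x) for x, y in zip(xs, scores)]
    sd = math.sqrt(sum(r * r for r in residuals) / count)
    if sd == 0:
        return [0.0] * count
    return [r / sd for r in residuals]


# Made-up library scores and lengths; the last entry is an unusually high score.
z = ln_scaled_scores([20, 22, 24, 60], [100, 200, 400, 300])
```

With the (-12, -2) penalties quoted above, a three-residue gap costs -12 + 2 × (-2) = -16. The regression step removes the tendency of longer library sequences to score higher by chance.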

2.
MOTIVATION: We consider the problem of identifying low-complexity regions (LCRs) in a protein sequence. LCRs are regions of biased composition, normally consisting of different kinds of repeats. RESULTS: We define new complexity measures to compute the complexity of a sequence based on a given scoring matrix, such as BLOSUM62. Our complexity measures also consider the order of amino acids in the sequence and the sequence length. We develop a novel graph-based algorithm called GBA to identify LCRs in a protein sequence. In the graph constructed for the sequence, each vertex corresponds to a pair of similar amino acids. Each edge connects two pairs of amino acids that can be grouped together to form a longer repeat. GBA finds short subsequences as LCR candidates by traversing this graph. It then extends them to find longer subsequences that may contain full repeats with low complexities. Extended subsequences are then post-processed to refine repeats to LCRs. Our experiments on real data show that GBA has significantly higher recall compared to existing algorithms, including 0j.py, CARD, and SEG. AVAILABILITY: The program is available on request.

3.
SEGMENT: identifying compositional domains in DNA sequences   Total citations: 2; self-citations: 0; citations by others: 2
MOTIVATION: DNA sequences are formed by patches or domains of different nucleotide composition. In a few simple sequences, domains can simply be identified by eye; however, most DNA sequences show a complex compositional heterogeneity (fractal structure), which cannot be properly detected by current methods. Recently, a computationally efficient segmentation method to analyse such nonstationary sequence structures, based on the Jensen-Shannon entropic divergence, has been described. Specific algorithms implementing this method are now needed. RESULTS: Here we describe a heuristic segmentation algorithm for DNA sequences, which was implemented in a Windows program (SEGMENT). The program divides a DNA sequence into compositionally homogeneous domains by iterating a local optimization procedure at a given statistical significance. Once a sequence is partitioned into domains, a global measure of sequence compositional complexity (SCC), accounting for both the sizes and compositional biases of all the domains in the sequence, is derived. SEGMENT computes SCC as a function of the significance level, which provides a multiscale view of sequence complexity.
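
The core quantity behind this kind of segmentation, the Jensen-Shannon divergence between the two sides of a candidate cut, can be sketched as follows. This is only an illustration of a single split under the standard length-weighted definition; the significance test and the iteration that SEGMENT applies are not reproduced, and the function names are hypothetical.

```python
import math
from collections import Counter


def shannon_entropy(seq):
    """Shannon entropy (in bits) of the symbol composition of `seq`."""
    n = len(seq)
    if n == 0:
        return 0.0
    return -sum((c / n) * math.log2(c / n) for c in Counter(seq).values())


def js_divergence(seq, i):
    """Length-weighted Jensen-Shannon divergence between seq[:i] and seq[i:]."""
    left, right = seq[:i], seq[i:]
    n = len(seq)
    return (shannon_entropy(seq)
            - len(left) / n * shannon_entropy(left)
            - len(right) / n * shannon_entropy(right))


def best_split(seq):
    """Cut position (1 .. len(seq)-1) that maximizes the divergence."""
    return max(range(1, len(seq)), key=lambda i: js_divergence(seq, i))


# Toy sequence with two perfectly homogeneous domains.
toy = "AAAAACCCCC"
cut = best_split(toy)
```

For the toy sequence the best cut falls exactly between the A-rich and C-rich halves, where the divergence reaches 1 bit. A full segmentation would accept a cut only if its divergence is statistically significant and would then recurse on each side.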

4.
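Constructing a folding-core-based substitution matrix for all-α proteins   Total citations: 1; self-citations: 0; citations by others: 1
The amino acid residue substitution matrix is an important factor in the quality of multiple sequence alignment, and existing substitution matrices perform poorly when aligning low-similarity sequences. Building on the existing BLOSUM substitution matrix algorithm, sequence-structure blocks based on protein folding core structures are defined, and a new amino acid residue substitution matrix based on the folding core structures of all-α proteins, TOPSSUM25, is proposed to improve the alignment of low-similarity sequences. With TOPSSUM25 loaded into a multiple sequence alignment program, tests on a set of four-helix-bundle sequence-structure blocks with less than 25% similarity show that multiple sequence alignment with TOPSSUM25 is clearly better than with the BLOSUM30 matrix. Alignment tests on a BAliBASE subset further show that TOPSSUM25 outperforms BLOSUM30 in pairwise alignment of all-α proteins. The results may help to further clarify the sequence-structure-function relationships of low-homology proteins.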

5.
6.
Position-specific substitution matrices, known as profiles, derived from multiple sequence alignments are currently used to search sequence databases for distantly related members of protein families. The performance of the database searches is enhanced by using (i) a sequence weighting scheme which assigns higher weights to more distantly related sequences based on branch lengths derived from phylogenetic trees, (ii) exclusion of positions with mainly padding characters at sites of insertions or deletions and (iii) the BLOSUM62 residue comparison matrix. A natural consequence of these modifications is an improvement in the alignment of new sequences to the profiles. However, the accuracy of the alignments can be further increased by employing a similarity residue comparison matrix. These developments are implemented in a program called PROFILEWEIGHT which runs on Unix and Vax computers. The only input required by the program is the multiple sequence alignment. The output from PROFILEWEIGHT is a profile designed to be used by existing searching and alignment programs. Test results from database searches with four different families of proteins show the improved sensitivity of the weighted profiles.

7.
MOTIVATION: Algorithm development for finding typical patterns in sequences, especially multiple pseudo-repeats (pseudo-periodic regions), is at the core of many problems arising in biological sequence and structure analysis. In fact, one of the most significant features of biological sequences is their high quasi-repetitiveness. Variation in the quasi-repetitiveness of genomic and proteomic texts demonstrates the presence and density of different biologically important information. It is very important to develop sensitive automatic computational methods for the identification of pseudo-periodic regions of sequences through which we can infer, describe and understand biological properties, and seek precise molecular details of biological structures, dynamics, interactions and evolution. RESULTS: We develop a novel, powerful computational tool for partitioning a sequence to pseudo-periodic regions. The pseudo-periodic partition is defined as a partition, which intuitively has the minimal bias to some perfect-periodic partition of the sequence based on the evolutionary distance. We devise a quadratic time and space algorithm for detecting a pseudo-periodic partition for a given sequence, which actually corresponds to the shortest path in the main diagonal of the directed (acyclic) weighted graph constructed by the Smith-Waterman self-alignment of the sequence. We use several typical examples to demonstrate the utilization of our algorithm and software system in detecting functional or structural domains and regions of proteins. A big advantage of our software program is that there is a parameter, the granularity factor, associated with it and we can freely choose a biological sequence family as a training set to determine the best parameter. In general, we choose all repeats (including many pseudo-repeats) in the SWISS-PROT amino acid sequence database as a typical training set. 
We show that the granularity factor is 0.52 and the average agreement accuracy of pseudo-periodic partitions, detected by our software for all pseudo-repeats in the SWISS-PROT database, is as high as 97.6%.

8.
Scaffold/matrix-associated region (S/MAR) sequences are DNA regions that are attached to the nuclear matrix, and participate in many cellular processes. The nuclear matrix is a complex structure consisting of various elements. In this paper we compared frequencies of simple nucleotide motifs in S/MAR sequences and in sequences extracted directly from various nuclear matrix elements, such as the nuclear lamina, cores of rosette-like structures and the synaptonemal complex. Multivariate linear discriminant analysis revealed significant differences between these sequences. Based on this result we have developed a program, ChrClass (Win/NT version, ftp.bionet.nsc.ru/pub/biology/chrclass/chrclass.zip), for the prediction of the regions associated with various elements of the nuclear matrix in a query sequence. Subsequently, several test samples were analyzed by using two S/MAR prediction programs (ChrClass and MAR-Finder) and a simple MRS criterion (S/MAR recognition signature) indicating the presence of S/MARs. Some overlap between the predictions of all MAR prediction tools has been found. Simultaneous use of the ChrClass, MRS criterion and MAR-Finder programs may help to obtain a more clearcut picture of S/MAR distribution in a query sequence. In general, our results suggest that the proportion of missed S/MARs is lower for ChrClass, whereas the proportion of wrong S/MARs is lower for MAR-Finder and MRS.

9.
Karp PD, Paley S, Zhu J. Bioinformatics (Oxford, England), 2001, 17(6): 526-32; discussion 533-4
PROBLEM STATEMENT: We have studied the relationships among SWISS-PROT, TrEMBL, and GenBank with two goals. First is to determine whether users can reliably identify those proteins in SWISS-PROT whose functions were determined experimentally, as opposed to proteins whose functions were predicted computationally. If this information was present in reasonable quantities, it would allow researchers to decrease the propagation of incorrect function predictions during sequence annotation, and to assemble training sets for developing the next generation of sequence-analysis algorithms. Second is to assess the consistency between translated GenBank sequences and sequences in SWISS-PROT and TrEMBL. RESULTS: (1) Contrary to claims by the SWISS-PROT authors, we conclude that SWISS-PROT does not identify a significant number of experimentally characterized proteins. (2) SWISS-PROT is more incomplete than we expected in that version 38.0 from July 1999 lacks many proteins from the full genomes of important organisms that were sequenced years earlier. (3) Even if we combine SWISS-PROT and TrEMBL, some sequences from the full genomes are missing from the combined dataset. (4) In many cases, translated GenBank genes do not exactly match the corresponding SWISS-PROT sequences, for reasons that include missing or removed methionines, differing translation start positions, individual amino-acid differences, and inclusion of sequence data from multiple sequencing projects. For example, results show that for Escherichia coli, 80.6% of the proteins in the GenBank entry for the complete genome have identical sequence matches with SWISS-PROT/TrEMBL sequences, 13.4% have exact substring matches, and matches for 4.1% can be found using BLAST search; the remaining 2.0% of E.coli protein sequences (most of which are ORFs) have no clear matches to SWISS-PROT/TrEMBL. 
Although many of these differences can be explained by the complexity of the DB, and by the curation processes used to create it, the scale of the differences is notable.
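
The first two match categories reported for E. coli (identical sequence, exact substring) can be illustrated with a small sketch. The abstract does not say in which direction the substring test was applied, so checking both directions is an assumption here, and the function name `classify_match` and the toy sequences are hypothetical. The BLAST-based category is not covered.

```python
def classify_match(query, database):
    """Classify a translated protein against a list of database sequences.

    Returns "identical" for an exact match, "substring" if the query is
    contained in a database sequence or contains one, and "none" otherwise.
    """
    if query in set(database):
        return "identical"
    if any(query in entry or entry in query for entry in database):
        return "substring"
    return "none"


# Toy database; the second query lacks the initial methionine of the first entry.
db = ["MKTAYIAK", "MSLLT"]
results = [classify_match(q, db) for q in ("MKTAYIAK", "KTAYIAK", "GGGG")]
```

A missing or removed methionine, one of the causes of mismatch named in the abstract, shows up as a "substring" result rather than an "identical" one.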

10.
MOTIVATION: When analysing novel protein sequences, it is now essential to extend search strategies to include a range of 'secondary' databases. Pattern databases have become vital tools for identifying distant relationships in sequences, and hence for predicting protein function and structure. The main drawback of such methods is the relatively small representation of proteins in trial samples at the time of their construction. Therefore, a negative result of an amino acid sequence comparison with such a databank forces a researcher to search for similarities in the original protein banks. We developed a database of patterns constructed for groups of related proteins with maximum representation of amino acid sequences of SWISS-PROT in the groups. RESULTS: Software tools and a new method have been designed to construct patterns of protein families. Using this method, a new version of the databank of protein family patterns, PROF_PAT 1.3, has been produced. This bank is based on SWISS-PROT (r1.38) and TrEMBL (r1.11), and contains patterns of more than 13 000 groups of related proteins in a format similar to that of PROSITE. Motifs of patterns that had the minimum probability of being found in random sequences were selected. A flexible, fast search program accompanies the bank. The researcher can specify a similarity matrix (PAM, BLOSUM or another type). Variable levels of similarity can be set (permitting search strategies ranging from exact matches to increasing levels of 'fuzziness'). AVAILABILITY: The Internet address for comparing sequences with the bank is: http://wwwmgs.bionet.nsc.ru/mgs/programs/prof_pat/. The local version of the bank and search programs (approximately 50 Mb) is available via ftp: ftp://ftp.bionet.nsc.ru/pub/biology/vector/prof_pat/, and ftp://ftp.ebi.ac.uk/pub/databases/prof_pat/. Another way to use it externally is to mail amino acid sequences to bachin@vector.nsc.ru for comparison with PROF_PAT 1.3.

11.
MOTIVATION: Protein and DNA are generally represented by sequences of letters. In a number of circumstances simplified alphabets (where one or more letters would be represented by the same symbol) have proved their potential utility in several fields of bioinformatics including searching for patterns occurring at an unexpected rate, studying protein folding and finding consensus sequences in multiple alignments. The main issue addressed in this paper is the possibility of finding a general approach that would allow an exhaustive analysis of all the possible simplified alphabets, using substitution matrices like PAM and BLOSUM as a measure for scoring. RESULTS: The computational approach presented in this paper has led to a computer program called AlphaSimp (Alphabet Simplifier) that can perform an exhaustive analysis of the possible simplified amino acid alphabets, using a branch and bound algorithm together with standard or user-defined substitution matrices. The program returns a ranked list of the highest-scoring simplified alphabets. When the extent of the simplification is limited and the simplified alphabets are maintained above ten symbols the program is able to complete the analysis in minutes or even seconds on a personal computer. However, the performance becomes worse, taking up to several hours, for highly simplified alphabets. AVAILABILITY: AlphaSimp and other accessory programs are available at http://bioinformatics.cribi.unipd.it/alphasimp

12.
MOTIVATION: Clustering of protein sequences is widely used for the functional characterization of proteins. However, it is still not easy to cluster distantly-related proteins, which have only regional similarity among their sequences. It is therefore necessary to develop an algorithm for clustering such distantly-related proteins. RESULTS: We have developed a time and space efficient clustering algorithm. It uses a graph representation where its vertices and edges denote proteins and their sequence similarities above a certain cutoff score, respectively. It repeatedly partitions the graph by removing edges that have small weights, which correspond to low sequence similarities. To find the appropriate partitions, we introduce a score combining the normalized cut and locally minimal cut capacities. Our method is applied to all 40,703 human proteins in SWISS-PROT and TrEMBL. The resulting clusters show a 76% recall (20,529 proteins) of the 26,917 classified by InterPro. It also finds relationships not found by other clustering methods. AVAILABILITY: The complete result of our algorithm for all the human proteins in SWISS-PROT and TrEMBL, and other supplementary information are available at http://motif.ics.es.osaka-u.ac.jp/Ncut-KL/
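
The graph representation described here (proteins as vertices, similarities above a cutoff as edges) can be sketched with a much simpler stand-in for the partitioning step. The code below keeps only edges at or above a cutoff and reports connected components via union-find; it does not implement the normalized-cut score of the paper, and the function name, protein identifiers and scores are all hypothetical.

```python
def cluster_by_cutoff(proteins, similarities, cutoff):
    """Group proteins into connected components of the similarity graph,
    keeping only edges whose score is at least `cutoff`."""
    parent = {p: p for p in proteins}

    def find(p):
        # Follow parent links to the representative, compressing the path.
        while parent[p] != p:
            parent[p] = parent[parent[p]]
            p = parent[p]
        return p

    for (a, b), score in similarities.items():
        if score >= cutoff:
            parent[find(a)] = find(b)

    clusters = {}
    for p in proteins:
        clusters.setdefault(find(p), set()).add(p)
    return sorted(sorted(members) for members in clusters.values())


# Made-up pairwise similarity scores.
proteins = ["P1", "P2", "P3", "P4"]
scores = {("P1", "P2"): 90, ("P2", "P3"): 30, ("P3", "P4"): 80}
strict = cluster_by_cutoff(proteins, scores, cutoff=50)
loose = cluster_by_cutoff(proteins, scores, cutoff=20)
```

Raising the cutoff removes the weak P2-P3 edge and splits one cluster into two, which is the same effect the abstract describes for removing small-weight edges, although the published method chooses its cuts with a dedicated score rather than a single global threshold.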

13.
MOTIVATION: Low-complexity or cryptically simple sequences are widespread in protein sequences but their evolution and function are poorly understood. To date, methods for the detection of low complexity in proteins have been directed towards the filtering of such regions prior to sequence homology searches but not to the analysis of the regions per se. However, many of these regions are encoded by non-repetitive DNA sequences and may therefore result from selection acting on protein structure and/or function. RESULTS: We have developed a new tool, based on the SIMPLE algorithm, that facilitates the quantification of the amount of simple sequence in proteins and determines the type of short motifs that show clustering above a certain threshold. By modifying the sensitivity of the program, simple sequence content can be studied at various levels, from highly organised tandem structures to complex combinations of repeats. We compare the relative amount of simplicity in different functional groups of yeast proteins and determine the level of clustering of the different amino acids in these proteins. AVAILABILITY: The program is available on request or online at http://www.biochem.ucl.ac.uk/bsm/SIMPLE.

14.
TMCompare is an alignment and visualization tool for comparison of sequence information for membrane proteins contained in SWISS-PROT entries, with structural information contained in PDB files. The program can be used for: detection of breaks in alpha helical structure of transmembrane regions; examination of differences in coverage between PDB and SWISS-PROT files; examination of annotation differences between PDB files and associated SWISS-PROT files; examination and comparison of assigned PDB alpha helix regions and assigned SWISS-PROT transmembrane regions in linear sequence (one letter code) format; examination of these differences in 3D using the CHIME plugin; and analysis of the alpha and non-alpha content of transmembrane regions. AVAILABILITY: TMCompare is available for use through selection of a query protein via the internet (http://www.membraneproteins.org/TMCompare). CONTACT: tmcompare@membraneproteins.org

15.
We present a neural network based method (ChloroP) for identifying chloroplast transit peptides and their cleavage sites. Using cross-validation, 88% of the sequences in our homology reduced training set were correctly classified as transit peptides or nontransit peptides. This performance level is well above that of the publicly available chloroplast localization predictor PSORT. Cleavage sites are predicted using a scoring matrix derived by an automatic motif-finding algorithm. Approximately 60% of the known cleavage sites in our sequence collection were predicted to within ±2 residues from the cleavage sites given in SWISS-PROT. An analysis of 715 Arabidopsis thaliana sequences from SWISS-PROT suggests that the ChloroP method should be useful for the identification of putative transit peptides in genome-wide sequence data. The ChloroP predictor is available as a web-server at http://www.cbs.dtu.dk/services/ChloroP/.

16.
The application of degenerate oligonucleotides to DNA Sequencing by Hybridisation with Oligonucleotide Matrix (SHOM) is proposed. The use of degenerate oligonucleotides is regarded as an example of pooling methods that are suitable for various laboratory procedures requiring numerous samples to be assayed. As each DNA sequence coded by four letters (A, G, C, T) may be defined by two sequences, a sequence coded by W and S (W: weak, A or T; S: strong, G or C) and a sequence coded by R and Y (R: purine, A or G; Y: pyrimidine, T or C), the 4^n n-nucleotide sequences may be defined with the help of 2 × 2^n sequences. In place of the originally described microchip matrix composed of all possible unambiguous octanucleotides (4^8 = 65 536) attached to an equal number of microlocations, a matrix composed of 512 microlocations containing 256-fold (2^8) degenerate octanucleotides is proposed. The matrix contains all 256 possible octanucleotides coded by W and S variations and all 256 possible octanucleotides coded by R and Y variations. The 512 256-fold degenerate octanucleotides allow the same information to be retrieved as with 65 536 unambiguous octanucleotides. A variant of the DNA sequence reconstruction method applicable to this system is presented. The use of degenerate oligonucleotides also makes it possible to apply matrices composed of longer oligonucleotides without increasing the number of microlocations, which would increase the length of unambiguously reconstructed sequence; e.g. a matrix comprising 131 072 16-mer oligonucleotides, i.e. 65 536 65 536-fold degenerate oligonucleotides coded by W and S variations and 65 536 65 536-fold degenerate oligonucleotides coded by R and Y variations, could replace one matrix comprising all possible unambiguous 16-mer oligonucleotides (ca. 4.3 × 10^9).
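
The claim that a four-letter sequence is fully defined by its W/S and R/Y encodings can be checked with a short sketch. The letter assignments follow the abstract; the function names `encode` and `decode` are hypothetical, and the hybridisation and reconstruction steps themselves are not modelled.

```python
# Two-letter codes as defined in the abstract.
WS = {"A": "W", "T": "W", "G": "S", "C": "S"}  # weak / strong
RY = {"A": "R", "G": "R", "T": "Y", "C": "Y"}  # purine / pyrimidine

# Each (WS, RY) pair identifies exactly one base.
DECODE = {(WS[base], RY[base]): base for base in "ACGT"}


def encode(seq):
    """Return the W/S-coded and R/Y-coded versions of a DNA sequence."""
    ws = "".join(WS[base] for base in seq)
    ry = "".join(RY[base] for base in seq)
    return ws, ry


def decode(ws, ry):
    """Recover the original sequence from its two binary encodings."""
    return "".join(DECODE[pair] for pair in zip(ws, ry))


ws, ry = encode("GATTACA")
restored = decode(ws, ry)

# Microlocation counts quoted in the abstract.
degenerate_octamer_spots = 2 * 2 ** 8
unambiguous_octamer_spots = 4 ** 8
```

Because the four (W/S, R/Y) combinations map one-to-one onto A, C, G and T, the two binary readouts together lose no information, which is why 2 × 2^8 = 512 degenerate probes can stand in for 4^8 = 65 536 unambiguous ones.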

17.
MOTIVATION: We propose representing amino acids by bit-patterns so they may be used in a filter algorithm for similarity searches over protein databases, to rapidly eliminate non-homologous regions of database sequences. The filter algorithm would be based on dynamic programming optimization. It would have the advantage over previous filter algorithms that its substitution scoring function distinguishes between conservative and non-conservative amino acid substitutions. RESULTS: Simulated annealing was used to search for the best five-bit or three-bit patterns to represent amino acids, where similar amino acids were given similar bit-patterns. The similarity between amino acids was estimated from the BLOSUM45 matrix. Representing amino acids by these five-bit and three-bit patterns, the Escherichia coli PhoE precursor and the bacteriophage PA2 LC precursor were aligned. The alignments were nearly the same as that obtained when BLOSUM45 was used to score substitutions. AVAILABILITY: The C code of the optimization algorithm for searching for the optimal bit-pattern representation of amino acids is available from the authors upon request.

18.
19.
20.
JAE     
One of the most basic methods of understanding the biological significance of a sequence is to produce an alignment with related sequences. A vital aspect of correctly aligning sequences is to apply biological intuition through manual editing of an alignment produced by multiple-sequence alignment software. As part of the European Molecular Biology Open Source Software Suite (EMBOSS), a new alignment editor in the Jemboss package is freely available for download. The Jemboss Alignment Editor (JAE) incorporates standard methods of editing, and colouring residues and nucleotides to highlight important regions of interest. JAE also makes use of scoring matrices (PAM and BLOSUM), selected by the user, to display regions of high degrees of similarity and identity. Other tools include the ability to calculate a consensus, a consensus plot (using a selected scoring matrix) and pairwise identities. AVAILABILITY: The JAE can be launched from the webpage (http://emboss.sourceforge.net/Jemboss/).
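
Two of the tools mentioned here, the consensus and pairwise identities, can be illustrated with a minimal sketch. This is a plain majority-vote consensus and a gap-ignoring identity fraction, not the matrix-based calculation JAE performs; the function names and the toy alignment are hypothetical.

```python
from collections import Counter


def consensus(alignment, gap="-"):
    """Most frequent non-gap character in each alignment column."""
    result = []
    for column in zip(*alignment):
        counts = Counter(c for c in column if c != gap)
        result.append(counts.most_common(1)[0][0] if counts else gap)
    return "".join(result)


def pairwise_identity(a, b, gap="-"):
    """Fraction of identical residues over columns where neither row has a gap."""
    pairs = [(x, y) for x, y in zip(a, b) if x != gap and y != gap]
    if not pairs:
        return 0.0
    return sum(x == y for x, y in pairs) / len(pairs)


# Toy alignment of three equal-length rows.
rows = ["ACG-T", "ACGAT", "TCGAT"]
cons = consensus(rows)
ident = pairwise_identity(rows[0], rows[2])
```

A scoring-matrix-based consensus would replace the majority vote with the residue that maximizes the summed PAM or BLOSUM score against the column, which can differ from the most frequent residue when a column contains several similar amino acids.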
