Similar literature
 20 similar articles found (search time: 15 ms)
1.
MOTIVATION: In 2001 and 2002, we published two papers (Bioinformatics, 17, 282-283; Bioinformatics, 18, 77-82) describing an ultrafast protein sequence clustering program called cd-hit. This program can efficiently cluster a huge protein database with millions of sequences. However, the underlying algorithm is not limited to protein sequence clustering. Here we present several new programs that use the same algorithm: cd-hit-2d, cd-hit-est and cd-hit-est-2d. Cd-hit-2d compares two protein datasets and reports similar matches between them; cd-hit-est clusters a DNA/RNA sequence database; and cd-hit-est-2d compares two nucleotide datasets. All these programs can handle huge datasets with millions of sequences and can be hundreds of times faster than methods based on popular sequence comparison and database search tools, such as BLAST.

2.
Shaw G, BioTechniques, 2000, 28(6): 1198-1201
Biologists today make extensive use of word processing programs for the production of research reports, literature reviews and grant proposals. Frequently, such programs become the default platform for viewing and later publishing protein and nucleic acid sequence data. Thus, researchers often switch between their word processor and more specialized programs designed to analyze protein and nucleic acid sequences. It would be more convenient to perform these simple sequence analyses in the word processor without switching to another program. The focus here is on the use of the Visual Basic programming language, which is built into all recent versions of Microsoft Word, to generate surprisingly complex and useful macros that can conveniently analyze several important features of protein and nucleic acid sequences. The standard Word interface can also be easily modified to display and run these macros from a pull-down menu. Several examples of this approach are provided.

3.

Background

Phylogenetic study of protein sequences provides unique and valuable insights into the molecular and genetic basis of important medical and epidemiological problems as well as insights about the origins and development of physiological features in present day organisms. Consensus phylogenies based on the bootstrap and other resampling methods play a crucial part in analyzing the robustness of the trees produced for these analyses.

Methodology

Our focus was to increase the number of bootstrap replications that can be performed on large protein datasets using the maximum parsimony, distance matrix, and maximum likelihood methods. We have modified the PHYLIP package using MPI to enable large-scale phylogenetic study of protein sequences, using a statistically robust number of bootstrapped datasets, to be performed in a moderate amount of time. This paper discusses the methodology used to parallelize the PHYLIP programs and reports the performance of the parallel PHYLIP programs that are relevant to the study of protein evolution on several protein datasets.
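
The resampling step behind those bootstrapped datasets can be sketched as follows. This is a minimal illustration, not the modified PHYLIP code: the function names and the toy alignment are hypothetical, and no MPI is used. Each replicate draws alignment columns with replacement, and because replicates are independent of one another they are natural units of work to spread across processes.

```python
import random


def bootstrap_alignment(alignment, rng):
    """Return one bootstrap replicate: columns resampled with replacement."""
    n_cols = len(alignment[0])
    if any(len(row) != n_cols for row in alignment):
        raise ValueError("all aligned sequences must have the same length")
    # Draw as many column indices as the alignment has columns.
    cols = [rng.randrange(n_cols) for _ in range(n_cols)]
    return ["".join(row[c] for c in cols) for row in alignment]


def bootstrap_replicates(alignment, n_replicates, seed=0):
    """Generate independent replicates (hypothetical helper, fixed seed)."""
    rng = random.Random(seed)
    return [bootstrap_alignment(alignment, rng) for _ in range(n_replicates)]


if __name__ == "__main__":
    toy = ["MKVLA", "MKILA", "MRVLS"]  # toy aligned protein sequences
    for replicate in bootstrap_replicates(toy, 3):
        print(replicate)
```

A tree would then be built from every replicate and the resulting trees summarized as a consensus; that part is left to the phylogeny package.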

Conclusions

Calculations that currently take a few days on a state-of-the-art desktop workstation are reduced to calculations that can be performed over lunchtime on a modern parallel computer. Of the three protein methods tested, the maximum likelihood method scales the best, followed by the distance method, and then the maximum parsimony method. However, the maximum likelihood method requires significant memory resources, which limits its application to more moderately sized protein datasets.

4.
R Staden, DNA Sequence, 1991, 1(6): 369-374
We describe programs that can screen nucleic acid and protein sequences against libraries of motifs and patterns. Such comparisons are likely to play an important role in interpreting the function of sequences determined during large scale sequencing projects. In addition we report programs for converting the Prosite protein motif library into a form that is compatible with our searching programs. The programs work on VAX and SUN computers.
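
As an illustration of what converting a motif library into a searchable form can involve, the sketch below translates a Prosite-style pattern into a Python regular expression and scans a sequence with it. It is an assumption-laden toy, not the programs described above: the function names are hypothetical, only the basic syntax is handled (`x` wildcard, `[..]` allowed residues, `{..}` forbidden residues, `(n)` and `(n,m)` repeats, `<` and `>` anchors), and the example pattern is purely illustrative.

```python
import re


def prosite_to_regex(pattern):
    """Translate a Prosite-style pattern string into a Python regex string."""
    parts = []
    for element in pattern.strip().rstrip(".").split("-"):
        prefix = suffix = ""
        if element.startswith("<"):   # anchored at the N-terminus
            prefix, element = "^", element[1:]
        if element.endswith(">"):     # anchored at the C-terminus
            suffix, element = "$", element[:-1]
        # Split off an optional repeat count such as (3) or (2,4).
        match = re.fullmatch(r"([^()]+)(?:\((\d+)(?:,(\d+))?\))?", element)
        if match is None:
            raise ValueError(f"unsupported pattern element: {element!r}")
        core, low, high = match.groups()
        if core == "x":                          # any residue
            core = "."
        elif re.fullmatch(r"\{[A-Z]+\}", core):  # forbidden residues
            core = "[^" + core[1:-1] + "]"
        elif not re.fullmatch(r"[A-Z]|\[[A-Z]+\]", core):
            raise ValueError(f"unsupported pattern element: {element!r}")
        repeat = ""
        if low is not None:
            repeat = "{" + low + ("," + high if high else "") + "}"
        parts.append(prefix + core + repeat + suffix)
    return "".join(parts)


def scan(sequence, pattern):
    """Return (start, matched_text) for each non-overlapping hit."""
    regex = re.compile(prosite_to_regex(pattern))
    return [(m.start(), m.group()) for m in regex.finditer(sequence)]


if __name__ == "__main__":
    print(prosite_to_regex("C-x(2,4)-C-x(3)-[LIVMFYWC]-x(8)-H-x(3,5)-H"))
```

Note that `finditer` reports non-overlapping matches only; a screening tool that must report every occurrence would need a different search loop.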

5.
We use methods from Data Mining and Knowledge Discovery to design an algorithm for detecting motifs in protein sequences. The algorithm assumes that a motif is constituted by the presence of a "good" combination of residues in appropriate locations of the motif. The algorithm attempts to compile such good combinations into a "pattern dictionary" by processing an aligned training set of protein sequences. The dictionary is subsequently used to detect motifs in new protein sequences. Statistical significance of the detection results is ensured by statistically determining the various parameters of the algorithm. Based on this approach, we have implemented a program called GYM. The Helix-Turn-Helix motif was used as a model system on which to test our program. The program was also extended to detect Homeodomain motifs. The detection results for the two motifs compare favorably with existing programs. In addition, the GYM program provides a lot of useful information about a given protein sequence.

6.
Newly developed in silico protein design methods have recently been applied to problems in protein stabilization. Stabilized protein sequences can be designed by combining potential functions that model a protein sequence's compatibility with a structure and fast optimization tools that can search the enormous number of sequence possibilities. The experimental testing of several sequence-design strategies has demonstrated that a wide range of protein structures can be stabilized. The primary advantage of in silico design is the vast number of sequences that can be rapidly screened in the search for an optimal design, far exceeding non-computational methods. This feature allows very large changes in protein properties to be discovered.

7.
Distance-based methods are popular for reconstructing evolutionary trees of protein sequences, mainly because of their speed and generality. A number of variants of the classical neighbor-joining (NJ) algorithm have been proposed, as well as a number of methods to estimate protein distances. We here present a large-scale assessment of performance in reconstructing the correct tree topology for the most popular algorithms. The programs BIONJ, FastME, Weighbor, and standard NJ were run using 12 distance estimators, producing 48 tree-building/distance estimation method combinations. These were evaluated on a test set based on real trees taken from 100 Pfam families. Each tree was used to generate multiple sequence alignments with the ROSE program using three evolutionary models. The accuracy of each method was analyzed as a function of both sequence divergence and location in the tree. We found that BIONJ produced the overall best results, although the average accuracy differed little between the tree-building methods (normally less than 1%). A noticeable trend was that FastME performed worse than the rest on long branches. Weighbor was several orders of magnitude slower than the other programs. Larger differences were observed when using different distance estimators. Protein-adapted Jukes-Cantor and Kimura distance corrections produced clearly poorer results than the other methods, even worse than uncorrected distances. We also assessed the recently developed Scoredist measure, which performed as well as more complex methods.
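
For readers unfamiliar with the distance corrections named here, the sketch below computes an uncorrected distance and the commonly cited textbook forms of the two corrections. These forms are an assumption: the abstract does not state which exact variants the study evaluated, and the function names are hypothetical. With p the fraction of differing sites, the 20-state Jukes-Cantor distance is d = -(19/20) ln(1 - (20/19) p) and Kimura's protein distance is d = -ln(1 - p - 0.2 p^2).

```python
import math


def p_distance(seq_a, seq_b):
    """Uncorrected distance: fraction of differing residues, ignoring gap columns."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    pairs = [(a, b) for a, b in zip(seq_a, seq_b) if a != "-" and b != "-"]
    if not pairs:
        raise ValueError("no ungapped columns to compare")
    return sum(a != b for a, b in pairs) / len(pairs)


def jukes_cantor_protein(p, n_states=20):
    """Jukes-Cantor correction generalized to n_states equally likely residues."""
    b = (n_states - 1) / n_states
    if p >= b:
        return math.inf  # saturated: the correction is undefined
    return -b * math.log(1 - p / b)


def kimura_protein(p):
    """Kimura's empirical correction for protein distances."""
    arg = 1 - p - 0.2 * p * p
    if arg <= 0:
        return math.inf  # saturated: the correction is undefined
    return -math.log(arg)


if __name__ == "__main__":
    p = p_distance("MKVLA", "MKILS")
    print(p, jukes_cantor_protein(p), kimura_protein(p))
```

Both corrections return a value larger than p for any p above zero, since they try to account for multiple substitutions at the same site.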

8.
ANTHEPROT: a package for protein sequence analysis using a microcomputer   Total citations: 2, self-citations: 0, citations by others: 2
A simple microcomputer package is described for the theoretical analysis of protein sequences. Several methods designed to compare two sequences, to model proteolytic reactions and to predict the secondary structure, the hydrophobic/hydrophilic regions and the potential antigenic sites of proteins have been included in a software package for the Apple II microcomputer. The package comprises 21 programs as well as the secondary structure database of Kabsch and Sander (1983). Received on November 24, 1987; accepted on March 8, 1988

9.
10.
Two programs written in BASIC are used for the teaching of protein synthesis. Students may individually and at their own pace test their knowledge of base pairing using one of the programs. Again individually, students then investigate the process of protein synthesis using randomly generated DNA sequences produced by the second program. Students thereby reinforce their understanding of base pairing, complementary sequences and triplet codons, and relate nucleotide sequences to amino acid sequences. The final part of the second exercise introduces a known human genetic defect.
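
The concepts these exercises cover (base pairing, complementary strands, triplet codons) can be sketched in a few lines. This is a Python illustration, not the BASIC programs described above; the function names are hypothetical, and the random sequence generator is only a guess at what such an exercise might use.

```python
import random

# Watson-Crick pairing for DNA, and pairing of a DNA template with RNA bases.
DNA_PAIR = {"A": "T", "T": "A", "G": "C", "C": "G"}
RNA_PAIR = {"A": "U", "T": "A", "G": "C", "C": "G"}

# Standard genetic code, listed in T, C, A, G order for each codon position.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def template_strand(coding):
    """Base-pair each position of the coding strand (no reversal applied)."""
    return "".join(DNA_PAIR[base] for base in coding)


def transcribe(template):
    """Pair each template base with its RNA partner to obtain the mRNA."""
    return "".join(RNA_PAIR[base] for base in template)


def translate(mrna):
    """Read triplet codons from the start of the mRNA until a stop codon."""
    protein = []
    for i in range(0, len(mrna) - 2, 3):
        residue = CODON_TABLE[mrna[i:i + 3].replace("U", "T")]
        if residue == "*":
            break
        protein.append(residue)
    return "".join(protein)


def random_coding_dna(n_codons, rng):
    """ATG, then n_codons random sense codons, then a TAA stop codon."""
    sense = [codon for codon, aa in CODON_TABLE.items() if aa != "*"]
    return "ATG" + "".join(rng.choice(sense) for _ in range(n_codons)) + "TAA"


if __name__ == "__main__":
    dna = random_coding_dna(6, random.Random(0))
    mrna = transcribe(template_strand(dna))
    print(dna, mrna, translate(mrna))
```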

11.
Hydrophobic cluster analysis (HCA) is an efficient method for analysing and comparing the amino acid sequences of proteins. It relies on two-dimensional representations of the sequences presently generated by simple plot programs working on microcomputers. Two interactive programs, MANSEK and SUNHCA, are described here that operate from Vax and Sun workstations respectively. These programs allow the display of several protein sequences in the form of two-dimensional helical plots suitable for HCA. Several tedious, repetitive and time-consuming steps of HCA have been eliminated by implementing several features such as interactive on-screen manipulations (zoom, translations) of the plots and HCA score calculations on segments chosen by the user. Plots on paper can be obtained through hard copies or plotting subroutines.

12.
Knowledge of structural class plays an important role in understanding protein folding patterns, so effective and reliable computational methods for predicting protein structural class are needed. To this end, we present a new method called NN-CDM, a nearest neighbor classifier with a complexity-based distance measure. Instead of extracting features from protein sequences as done previously, the distance between each pair of protein sequences is directly evaluated by a complexity measure of symbol sequences. Then the nearest neighbor classifier is adopted as the predictive engine. To verify the performance of this method, jackknife cross-validation tests are performed on several benchmark datasets. Results show that our approach achieves higher prediction accuracy than some classical methods.
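
The general idea of pairing a nearest neighbor classifier with a complexity-style distance can be sketched as below. This is not NN-CDM itself: the abstract does not specify its complexity measure, so the normalized compression distance computed with zlib is swapped in here as a stand-in, and the function names, toy sequences and class labels are all hypothetical.

```python
import zlib


def compressed_size(text):
    """Length of the zlib-compressed text, used as a rough complexity estimate."""
    return len(zlib.compress(text.encode("ascii"), 9))


def ncd(x, y):
    """Normalized compression distance between two symbol sequences."""
    cx, cy = compressed_size(x), compressed_size(y)
    cxy = compressed_size(x + y)
    return (cxy - min(cx, cy)) / max(cx, cy)


def predict(query, training):
    """Return the label of the training sequence closest to the query.

    training is a list of (sequence, label) pairs.
    """
    _, label = min(training, key=lambda item: ncd(query, item[0]))
    return label


if __name__ == "__main__":
    # Toy, highly repetitive sequences; real proteins compress far less well.
    training = [("AEELLKKA" * 20, "alpha"), ("VTVYVSGT" * 20, "beta")]
    print(predict("AEELLKKA" * 15 + "AEELLKRA", training))
```

General-purpose compressors are a crude complexity estimate for short protein sequences, which is one reason a purpose-built measure may be preferable in practice.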

13.
Statistical and learning techniques are becoming increasingly popular for different tasks in bioinformatics. Many of the most powerful statistical and learning techniques are applicable to points in a Euclidean space but not directly applicable to discrete sequences such as protein sequences. One way to apply these techniques to protein sequences is to embed the sequences into a Euclidean space and then apply these techniques to the embedded points. In this work we introduce a biologically motivated sequence embedding, the homology kernel, which takes into account intuitions from local alignment, sequence homology, and predicted secondary structure. This embedding allows us to directly apply learning techniques to protein sequences. We apply the homology kernel in several ways. We demonstrate how the homology kernel can be used for protein family classification and outperforms state-of-the-art methods for remote homology detection. We show that the homology kernel can be used for secondary structure prediction and is competitive with popular secondary structure prediction methods. Finally, we show how the homology kernel can be used to incorporate information from homologous sequences in local sequence alignment.

14.
Genomic SELEX is a method for studying the network of nucleic acid–protein interactions within any organism. Here we report the discovery of several interesting and potentially biologically important interactions using genomic SELEX. We have found that bacteriophage MS2 coat protein binds several Escherichia coli mRNA fragments more tightly than it binds the natural, well-studied, phage mRNA site. MS2 coat protein binds mRNA fragments from rffG (involved in formation of lipopolysaccharide in the bacterial outer membrane), ebgR (lactose utilization repressor), as well as from several other genes. Genomic SELEX may yield experimentally induced artifacts, such as molecules in which the fixed sequences participate in binding. We describe several methods (annealing of oligonucleotides complementary to fixed sequences or switching fixed sequences) to eliminate some, or almost all, of these artifacts. Such methods may be useful tools for both randomized sequence SELEX and genomic SELEX.

15.
Locating protein coding regions in genomic DNA is a critical step in accessing the information generated by large scale sequencing projects. Current methods for gene detection depend on statistical measures of content differences between coding and noncoding DNA in addition to the recognition of promoters, splice sites, and other regulatory sites. Here we explore the potential value of recurrent amino acid sequence patterns 3-19 amino acids in length as a content statistic for use in gene finding approaches. A finite mixture model incorporating these patterns can partially discriminate protein sequences which have no (detectable) known homologs from randomized versions of these sequences, and from short (≤ 50 amino acids) non-coding segments extracted from the S. cerevisiae genome. The mixture model derived scores for a collection of human exons were not correlated with the GENSCAN scores, suggesting that the addition of our protein pattern recognition module to current gene recognition programs may improve their performance.

16.
Recent progress in predicting protein sub-subcellular locations   Total citations: 1, self-citations: 0, citations by others: 1
In the last two decades, the number of known protein sequences has increased very rapidly. However, knowledge of protein function exists for only a small portion of these sequences. Since the experimental approaches for determining protein functions are costly and time consuming, in silico methods have been introduced to bridge the gap between knowledge of protein sequences and their functions. Knowing the subcellular location of a protein is considered to be a critical step in understanding its biological functions. Many efforts have been undertaken to predict protein subcellular locations in silico. With the accumulation of available data, the substructures of some subcellular organelles, such as the cell nucleus, mitochondria and chloroplasts, have been taken into consideration by several studies in recent years. These studies create a new research topic, namely 'protein sub-subcellular location prediction', which goes one level deeper than classic protein subcellular location prediction.

17.
The progress achieved by several groups in the field of computational protein design shows that successful design methods include two major features: efficient algorithms to deal with the combinatorial exploration of sequence space and optimal energy functions to rank sequences according to their fitness for the given fold.

18.
Hydrophobic cluster analysis (HCA) [15] is a very efficient method to analyse and compare protein sequences. Despite its effectiveness, this method is not widely used because it relies in part on the experience and training of the user. In this article, detailed guidelines as to the use of HCA are presented and include discussions on: the definition of the hydrophobic clusters and their relationships with secondary and tertiary structures; the length of the clusters; the amino acid classification used for HCA; the HCA plot programs; and the working strategies. Various procedures for the analysis of a single sequence are presented: structural segmentation, structural domains and secondary structure evaluation. Like most sequence analysis methods, HCA is more efficient when several homologous sequences are compared. Procedures for the detection and alignment of distantly related proteins by HCA are described through several published examples along with 2 previously unreported cases: the beta-glucosidase from Ruminococcus albus is clearly related to the beta-glucosidases from Clostridium thermocellum and Hansenula anomala although they display a reverse organization of their constitutive domains; the alignment of the sequence of human GTPase activating protein with that of the Crk oncogene is presented. Finally, the pertinence of HCA in the identification of important residues for structure/function as well as in the preparation of homology modelling is discussed.

19.
During the past several years, the use of computer programs in the analysis of protein and DNA sequences has become commonplace. In all but the simplest procedures, the ability to critically review the results obtained with computer methods requires a basic knowledge of the algorithms employed (and the assumptions upon which they are based), an awareness of the capabilities and limitations of the particular program that implements an algorithm, and some familiarity with probability and statistics. We describe a number of computer methods that have been applied to the analysis of apolipoprotein sequences. We discuss the suitability of these methods for particular problems, how the choice of initial "parameters" can affect the results, and what the results can tell us about protein or gene sequences. We also identify some outstanding problems of apolipoprotein sequence analysis where further work is needed.

20.

Background  

To date, analysis of 16S ribosomal RNA (rRNA) sequences has been the de facto gold standard for the assessment of phylogenetic relationships among prokaryotes. However, the branching order of the individual phyla is not well resolved in 16S rRNA-based trees. In search of an improvement, new phylogenetic methods have been developed alongside the growing availability of complete genome sequences. Unfortunately, only a few genes in prokaryotic genomes qualify as universal phylogenetic markers and almost all of them have a lower information content than the 16S rRNA gene. Therefore, emphasis has been placed on methods that are based on multiple genes or even entire genomes. The concatenation of ribosomal protein sequences is one method which has been ascribed an improved resolution. Since there is neither a comprehensive database for ribosomal protein sequences nor a tool that assists in sequence retrieval and generation of respective input files for phylogenetic reconstruction programs, RibAlign has been developed to fill this gap.
