Found 20 similar articles (search time: 15 ms)
1.
MOTIVATION: Word-matching algorithms such as BLAST are routinely used for sequence comparison. These algorithms typically use areas of matching words to seed alignments which are then used to assess the degree of sequence similarity. In this paper, we show that by formally separating the word-matching and sequence-alignment process, and using information about word frequencies to generate alignments and similarity scores, we can create a new sequence-comparison algorithm which is both fast and sensitive. The formal split between word searching and alignment allows users to select an appropriate alignment method without affecting the underlying similarity search. The algorithm has been used to develop software for identifying entries in DNA sequence databases which are contaminated with vector sequence. RESULTS: We present three algorithms, RAPID, PHAT and SPLAT, which together allow vector contaminations to be found and assessed extremely rapidly. RAPID is a word search algorithm which uses probabilities to modify the significance attached to different words; PHAT and SPLAT are alignment algorithms. An initial implementation has been shown to be approximately an order of magnitude faster than BLAST. The formal split between word searching and alignment not only offers considerable gains in performance, but also allows alignment generation to be viewed as a user interface problem, allowing the most useful output method to be selected without affecting the underlying similarity search. Receiver Operator Characteristic (ROC) analysis of an artificial test set allows the optimal score threshold for identifying vector contamination to be determined. ROC curves were also used to determine the optimum word size (nine) for finding vector contamination. An analysis of the entire expressed sequence tag (EST) subset of EMBL found a contamination rate of 0.27%. 
A more detailed analysis of the 50 000 ESTs in est10.dat (an EST subset of EMBL) finds an error rate of 0.86%, principally due to two large-scale projects. AVAILABILITY: A Web page for the software exists at http://bioinf.man.ac.uk/rapid, or it can be downloaded from ftp://ftp.bioinf.man.ac.uk/RAPID CONTACT: crispin@cs.man.ac.uk
2.
Genomics, 2019, 111(6): 1298-1305
Based on the k-mer model for protein sequences, a novel k-mer natural vector method is proposed to characterize the features of k-mers in a protein sequence, taking into account both the numbers and the distributions of k-mers. It is proved that the relationship between a protein sequence and its k-mer natural vector is one-to-one. Phylogenetic analysis of protein sequences can therefore be performed easily without requiring evolutionary models or human intervention. However, there is no criterion for choosing a suitable k, and k has a great influence on the results as well as on computational complexity. In this paper, a compound k-mer natural vector is used to quantify each protein sequence. The results of phylogenetic analysis on three protein datasets demonstrate that the new method can precisely describe the evolutionary relationships of proteins and greatly improve computing efficiency.
3.
Daniel Luis Notari, Aurione Molin, Vanessa Davanzo, Douglas Picolotto, Helena Graziottin Ribeiro, Scheila de Avila e Silva. Bioinformation, 2014, 10(6): 381-383
A whole genome contains not only coding regions but also non-coding regions, which are located between the end of one coding region and the beginning of the next. For this reason, information about the gene regulation process lies in the intergenic regions. There is no easy way to obtain intergenic regions from currently available databases. IntergenicDB was developed to integrate data on intergenic regions and their related gene information from NCBI databases. The main goal of IntergenicDB is to offer a user-friendly database of intergenic sequences from bacterial genomes.
Availability: http://intergenicdb.bioinfoucs.com/
4.
In database searches for sequence similarity, matches to a distinct sequence region (e.g., protein domain) are frequently obscured by numerous matches to another region of the same sequence. In order to cope with this problem, algorithms are developed to discard redundant matches. One model for this problem begins with a list of intervals, each with an associated score; each interval gives the range of positions in the query sequence that align to a database sequence, and the score is that of the alignment. If interval I is contained in interval J, and I's score is less than J's, then I is said to be dominated by J. The problem is then to identify each interval that is dominated by at least K other intervals, where K is a given level of "tolerable redundancy." An algorithm is developed to solve the problem in O(N log N) time and O(N*) space, where N is the number of intervals and N* is a precisely defined value that never exceeds N and is frequently much smaller. This criterion for discarding database hits has been implemented in the Blast program, as illustrated herein with examples. Several variations and extensions of this approach are also described.
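The domination criterion described in this abstract can be illustrated with a minimal Python sketch. It is a brute-force O(N^2) check, not the O(N log N) algorithm the paper develops, and the names `Interval`, `domination_counts` and `redundant_hits` are hypothetical, chosen only for this illustration. Intervals are assumed to be inclusive `(start, end, score)` tuples.

```python
from typing import List, Tuple

# (start, end, score): inclusive query positions and the alignment score.
# This tuple layout is an assumption made for the sketch.
Interval = Tuple[int, int, float]


def domination_counts(intervals: List[Interval]) -> List[int]:
    """For each interval I, count the intervals J that dominate it.

    J dominates I when I is contained in J and I's score is lower than J's.
    Brute force over all pairs: O(N^2), for illustration only.
    """
    counts = []
    for i, (s_i, e_i, score_i) in enumerate(intervals):
        n = 0
        for j, (s_j, e_j, score_j) in enumerate(intervals):
            if i != j and s_j <= s_i and e_i <= e_j and score_i < score_j:
                n += 1
        counts.append(n)
    return counts


def redundant_hits(intervals: List[Interval], k: int) -> List[int]:
    """Indices of intervals dominated by at least k other intervals."""
    return [i for i, c in enumerate(domination_counts(intervals)) if c >= k]


hits = [(1, 100, 50.0), (10, 90, 40.0), (20, 80, 30.0), (200, 300, 10.0)]
print(redundant_hits(hits, k=2))  # only the third hit is dominated twice
```

With K = 1 the second hit would also be discarded, which shows how K acts as the level of tolerable redundancy.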
5.
Shahmuradov IA, Gammerman AJ, Hancock JM, Bramley PM, Solovyev VV. Nucleic acids research, 2003, 31(1): 114-117
6.
Over the last decade, the modeling and storage of biological data has been a topic of wide interest for scientists engaged in biological and biomedical research. Currently, most data is still stored in text files, which leads to data redundancies and file chaos. In this paper we show how to use relational modeling techniques and relational database technology for modeling and storing biological sequence data, i.e. data maintained in collections like EMBL or SWISS-PROT, to better serve the needs of these application domains. For this reason we propose a two-step approach. First, we model the structure (and therefore the meaning) of the data using an Entity-Relationship approach. The ER model leads to a clean design of a relational database schema for storing and retrieving the DNA and protein data extracted from various sources. Our approach provides a clean basis for building complex biological applications that are more amenable to changes and software ports than their file-based counterparts.
7.
8.
To measure the similarity or dissimilarity between two given biological sequences, several papers proposed metrics based on the "word-composition vector". The essence of these metrics is as follows. First, we count the appearance frequencies of all the K-tuple words throughout each of two given sequences. Then, the two given sequences are transformed into their respective word-composition vectors. Next, the distance metrics, for example the angle between the two vectors, are calculated. A significant issue is to determine the optimal word size K. With a mathematical model of mutational events (including substitutions, insertions, deletions and duplications) that occur in sequences, we analyzed how the angle between the composition vectors depends on the mutational events. We also considered the optimal word size (=resolution) from our original approach. Our results were verified by computational experiments using artificially generated sequences, amino acid sequences of hemoglobin and nucleotide sequences of 16S ribosomal RNA.
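The three steps named in this abstract (count K-tuple words, form the word-composition vectors, compute the angle between them) can be sketched in Python as follows. This is only an illustration under simple assumptions: overlapping words, raw counts rather than normalized frequencies (the angle is unchanged by scaling a vector), and hypothetical function names `word_composition` and `angle_between`. It does not reproduce the paper's mutational model.

```python
import math
from collections import Counter


def word_composition(seq: str, k: int) -> Counter:
    """Count every overlapping K-tuple word in the sequence."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def angle_between(u: Counter, v: Counter) -> float:
    """Angle in radians between two sparse word-count vectors."""
    dot = sum(u[w] * v[w] for w in u.keys() & v.keys())
    norm_u = math.sqrt(sum(c * c for c in u.values()))
    norm_v = math.sqrt(sum(c * c for c in v.values()))
    if norm_u == 0 or norm_v == 0:
        raise ValueError("sequence is shorter than the word size")
    # Clamp to [-1, 1] to guard against floating-point rounding.
    cosine = max(-1.0, min(1.0, dot / (norm_u * norm_v)))
    return math.acos(cosine)


a = word_composition("ACGTACGTAC", 3)
b = word_composition("ACGTTCGTAC", 3)  # one substitution relative to a
print(angle_between(a, b))
```

Repeating the comparison for several values of K on the same pair of sequences gives a feel for why the choice of word size matters: small K saturates quickly, while large K makes almost every word unique.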
9.
MIPS: a database for genomes and protein sequences (cited 15 times in total; self-citations: 0; citations by others: 15)
H. W. Mewes, D. Frishman, U. Güldener, G. Mannhaupt, K. Mayer, M. Mokrejs, B. Morgenstern, M. Münsterkötter, S. Rudd, B. Weil. Nucleic acids research, 2002, 30(1): 31-34
10.
Mewes HW, Frishman D, Gruber C, Geier B, Haase D, Kaps A, Lemcke K, Mannhaupt G, Pfeiffer F, Schüller C, Stocker S, Weil B. Nucleic acids research, 2000, 28(1): 37-40
The Munich Information Center for Protein Sequences (MIPS-GSF), Martinsried, near Munich, Germany, continues its longstanding tradition to develop and maintain high quality curated genome databases. In addition, efforts have been intensified to cover the wealth of complete genome sequences in a systematic, comprehensive form. Bioinformatics, supporting national as well as European sequencing and functional analysis projects, has resulted in several up-to-date genome-oriented databases. This report describes growing databases reflecting the progress of sequencing the Arabidopsis thaliana (MATDB) and Neurospora crassa genomes (MNCDB), the yeast genome database (MYGD) extended by functional analysis data, the database of annotated human EST-clusters (HIB) and the database of the complete cDNA sequences from the DHGP (German Human Genome Project). It also contains information on the up-to-date database of complete genomes (PEDANT), the classification of protein sequences (ProtFam) and the collection of protein sequence data within the framework of the PIR-International Protein Sequence Database. These databases can be accessed through the MIPS WWW server (http://www.mips.biochem.mpg.de).
11.
ConoServer, a database for conopeptide sequences and structures (cited 1 time in total; self-citations: 0; citations by others: 1)
SUMMARY: ConoServer is a new database dedicated to conopeptides, a large family of peptides found in the venom of marine snails of the genus Conus. These peptides have an exceptional diversity of sequences and chemical modifications and their ability to block ion channels makes them important as drug leads and tools for physiological studies. ConoServer uses standardized names and a genetic and structural classification scheme to present data retrieved from SwissProt, GenBank, the Protein DataBank and the literature. The ConoServer web site incorporates specialized features like the graphic display of post-translational modifications that are extensively present in conopeptides. Currently, ConoServer manages 1214 nucleic acid sequences (from 54 Conus species), 2258 protein sequences (from 66 Conus species) and 99 3D structures. AVAILABILITY: http://research1t.imb.uq.edu.au/conoserver/.
12.
MOTIVATION: We present a new concept that combines data storage and data analysis in genome research, based on an associative network memory. As an illustration, 115 000 conserved regions from over 73 000 published sequences (i.e. from the entire annotated part of the SWISSPROT sequence database) were identified and clustered by a self-organizing network. Similarity and kinship, as well as degree of distance between the conserved protein segments, are visualized as neighborhood relationships on a two-dimensional topographical map. RESULTS: Such a display overcomes the restrictions of linear list processing and allows local and global sequence relationships to be studied visually. Families are memorized as prototype vectors of conserved regions. On a massively parallel machine, clustering and updating of the database take only a few seconds; a rapid analysis of incoming data such as protein sequences or ESTs is carried out on present-day workstations. AVAILABILITY: Access to the database is available at http://www.bioinf.mdc-berlin.de/unter2.html CONTACT: (hanke,lehmann,reich)@mdc-berlin.de; bork@embl-heidelberg.de
13.
Gilks WR, Audit B, De Angelis D, Tsoka S, Ouzounis CA. Bioinformatics (Oxford, England), 2002, 18(12): 1641-1649
Public sequence databases contain information on the sequence, structure and function of proteins. Genome sequencing projects have led to a rapid increase in protein sequence information, but reliable, experimentally verified, information on protein function lags a long way behind. To address this deficit, functional annotation in protein databases is often inferred by sequence similarity to homologous, annotated proteins, with the attendant possibility of error. Now, the functional annotation in these homologous proteins may itself have been acquired through sequence similarity to yet other proteins, and it is generally not possible to determine how the functional annotation of any given protein has been acquired. Thus the possibility of chains of misannotation arises, a process we term 'error percolation'. With some simple assumptions, we develop a dynamical probabilistic model for these misannotation chains. By exploring the consequences of the model for annotation quality it is evident that this iterative approach leads to a systematic deterioration of database quality.
14.
MOTIVATION: The specific hybridization of complementary DNA molecules underlies many widely used molecular biology assays, including the polymerase chain reaction and various types of microarray analysis. In order for such an assay to work well, the primer or probe must bind to its intended target, without also binding to additional sequences in the reaction mixture. For any given probe or primer, potential non-specific binding partners can be identified using state-of-the-art models of DNA binding stability. Unfortunately, these models rely on dynamic programming algorithms that are too slow to apply on a genomic scale. RESULTS: We present an algorithm that efficiently scans a DNA database for short (approximately 20-30 base) sequences that will bind to a query sequence. We use a filtering approach, in which a series of increasingly stringent filters is applied to a set of candidate k-mers. The k-mers that pass all filters are then located in the sequence database using a precomputed index, and an accurate model of DNA binding stability is applied to the sequence surrounding each of the k-mer occurrences. This approach reduces the time to identify all binding partners for a given DNA sequence in human genomic DNA by approximately three orders of magnitude, from two days for the ENCODE regions to less than one minute for typical queries. Our approach is scalable to large DNA sequences. Our method can scan the human genome for medium strength binding sites to a candidate PCR primer in an average of 34.5 minutes. AVAILABILITY: Software implementing the algorithms described here is available at http://noble.gs.washington.edu/proj/dna-binding.
15.
16.
Signal-exon trap: a novel method for the identification of signal sequences from genomic DNA (cited 2 times in total; self-citations: 1; citations by others: 1)
We describe a genomic DNA-based signal sequence trap method, signal-exon trap (SET), for the identification of genes encoding secreted and membrane-bound proteins. SET is based on the coupling of an exon trap to the translation of captured exons, which allows screening of the exon-encoded polypeptides for signal peptide function. Since most signal sequences are expected to be located in the 5′-terminal exons of genes, we first demonstrate that trapping of these exons is feasible. To test the applicability of SET for the screening of complex genomic DNA, we evaluated two critical features of the method. Specificity was assessed by the analysis of random genomic DNA and efficiency was demonstrated by screening a 425 kb YAC known to contain the genes of four secretory or membrane-bound proteins. All trapped clones contained a translation initiation signal followed by a hydrophobic stretch of amino acids representing either a known signal peptide, transmembrane domain or novel sequence. Our results suggest that SET is a potentially useful method for the isolation of signal sequence-containing genes and may find application in the discovery of novel members of known secretory gene clusters, as well as in other positional cloning approaches.
17.
TATPred: a Bayesian method for the identification of twin arginine translocation pathway signal sequences
The twin arginine translocation (TAT) system ferries folded proteins across the bacterial membrane. Proteins are directed into this system by the TAT signal peptide present at the amino terminus of the precursor protein, which contains the twin arginine residues that give the system its name. There are currently only two computational methods for the prediction of TAT translocated proteins from sequence. Both methods have limitations that make the creation of a new algorithm for TAT-translocated protein prediction desirable. We have developed TATPred, a new sequence-model method, based on a Naïve-Bayesian network, for the prediction of TAT signal peptides. In this approach, a comprehensive range of models was tested to identify the most reliable and robust predictor. The best model comprised 12 residues: three residues prior to the twin arginines and the seven residues that follow them. We found a prediction sensitivity of 0.979 and a specificity of 0.942.
18.
In order to identify cross-culture contamination of cell lines, we applied DNA fingerprinting using variable number of tandem repeat (VNTR) loci and short tandem repeat (STR) loci amplified by polymerase chain reaction (PCR) instead of a radioisotope labeled multilocus probe. Eleven cell lines were used for the Apo B and D1S80 loci detection, and twelve cell lines were examined in the Y-chromosome analysis. The data obtained from the sister cell lines NALM-6 and B85, two MOLM-1 cultures from two cryopreserved tubes, and four subclones of BALM-9 and its sister cell line BALM-10, displayed clear and distinct bands of each PCR product for both Apo B and D1S80. Detection of a Y-chromosome DNA sequence is another very informative marker for the identification of cell lines, if the Y-chromosome is present. We examined eight cell lines for the expression of four STR loci; the data thus generated were compared with the results previously reported from other laboratories. The resulting electrophoretic banding patterns showed that our "home-made" STR detection system is a useful and efficient tool for the authentication of cell lines. PCR detection of VNTR and STR loci represents a simple, rapid and powerful DNA fingerprinting technique to authenticate human cell lines and to detect cross-culture contamination. This PCR technique may be used in lieu of the more time-consuming, labor-intensive and radioactive Southern blot multilocus method.
19.
MOTIVATION: Genome projects for many prokaryotic and eukaryotic species have been completed, and many more genome projects are currently underway. The availability of a large number of genomic sequences creates a need for graphic tools that let researchers study genomes in a perceivable form. The Z curve is one such tool for visualizing genomes. The Z curve is a unique three-dimensional curve representation of a given DNA sequence, in the sense that each can be uniquely reconstructed given the other. A Z curve database for more than 1000 genomes has been established here. RESULTS: The database contains the Z curves for archaea, bacteria, eukaryota, organelles, phages, plasmids, viroids and viruses whose genomic sequences are currently available. All the three-dimensional Z curves and their three component curves are stored in the database. The applications of the Z curve database to comparative genomics, gene prediction, computation of G+C content with a windowless technique, prediction of replication origins and termini of bacterial and archaeal genomes, and the study of local deviations from Chargaff Parity Rule 2 are presented in detail. The Z curve database reported here is a treasure trove in which biologists can find useful biological knowledge.
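The one-to-one correspondence between a DNA sequence and its Z curve can be seen in a short Python sketch. It uses the standard Z curve components, which the abstract itself does not spell out: after n bases, x = (A + G) - (C + T), y = (A + C) - (G + T) and z = (A + T) - (C + G), where each letter stands for the cumulative count of that base. The function names `z_curve` and `sequence_from_z_curve` are hypothetical, and ambiguous bases such as N are rejected rather than handled.

```python
from typing import List, Tuple

Point = Tuple[int, int, int]

# Step taken along (x, y, z) for each base:
#   x: purine (A, G) vs pyrimidine (C, T)
#   y: amino (A, C) vs keto (G, T)
#   z: weak hydrogen bond (A, T) vs strong (C, G)
STEP = {
    "A": (1, 1, 1),
    "C": (-1, 1, -1),
    "G": (1, -1, -1),
    "T": (-1, -1, 1),
}
BASE_OF_STEP = {step: base for base, step in STEP.items()}


def z_curve(seq: str) -> List[Point]:
    """Cumulative (x, y, z) coordinates after each base of the sequence."""
    x = y = z = 0
    points = []
    for base in seq.upper():
        if base not in STEP:
            raise ValueError(f"unsupported base: {base!r}")
        dx, dy, dz = STEP[base]
        x, y, z = x + dx, y + dy, z + dz
        points.append((x, y, z))
    return points


def sequence_from_z_curve(points: List[Point]) -> str:
    """Recover the sequence from the differences between successive points."""
    bases = []
    previous = (0, 0, 0)
    for point in points:
        step = tuple(c - p for c, p in zip(point, previous))
        bases.append(BASE_OF_STEP[step])
        previous = point
    return "".join(bases)


curve = z_curve("GATTACA")
print(curve[-1])                     # final point of the curve
print(sequence_from_z_curve(curve))  # GATTACA
```

Because -z equals (C + G) - (A + T), the z component tracks the excess of G+C over A+T along the sequence, which is consistent with the windowless G+C content computation mentioned in the abstract.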
20.
Takeshi Kawabata, Satoshi Fukuchi, Keiichi Homma, Motonori Ota, Jiro Araki, Takehiko Ito, Nobuyuki Ichiyoshi, Ken Nishikawa. Nucleic acids research, 2002, 30(1): 294-298
Large-scale genome projects generate an unprecedented number of protein sequences, most of which are experimentally uncharacterized. Predicting the 3D structures of sequences provides important clues as to their functions. We constructed the Genomes TO Protein structures and functions (GTOP) database, containing protein fold predictions of a huge number of sequences. Predictions are mainly carried out with the homology search program PSI-BLAST, currently the most popular among high-sensitivity profile search methods. GTOP also includes the results of other analyses, e.g. homology and motif search, detection of transmembrane helices and repetitive sequences. We have completed analyzing the sequences of 41 organisms, with the number of proteins exceeding 120 000 in total. GTOP uses a graphical viewer to present the analytical results of each ORF in one page in a ‘color-bar’ format. The assigned 3D structures are presented by Chime plug-in or RasMol. The binding sites of ligands are also included, providing functional information. The GTOP server is available at http://spock.genes.nig.ac.jp/~genome/gtop.html.