Similar literature
20 similar articles found
1.
Nucleic acid sequences from genome sequencing projects are submitted as raw data, from which biologists attempt to elucidate the function of the predicted gene products. The protein sequences are stored in public databases, such as the UniProt Knowledgebase (UniProtKB), where curators try to add predicted and experimental functional information. Protein function prediction can be done using sequence similarity searches, but an alternative approach is to use protein signatures, which classify proteins into families and domains. The major protein signature databases are available through the integrated InterPro database, which provides a classification of UniProtKB sequences. As well as characterization of proteins through protein families, many researchers are interested in analyzing the complete set of proteins from a genome (i.e. the proteome), and there are databases and resources that provide non-redundant proteome sets and analyses of proteins from organisms with completely sequenced genomes. This article reviews the tools and resources available on the web for single and large-scale protein characterization and whole proteome analysis.

2.
Amino acid substitution matrices play an essential role in protein sequence alignment, a fundamental task in bioinformatics. The most widely used matrices, such as PAM matrices derived from homologous sequences and BLOSUM matrices derived from aligned segments of PROSITE, did not integrate conformation information in their construction. There are a few structure-based matrices, which are derived from limited structure alignment data. Using the PDB_SELECT and DSSP databases, we create a database of sequence-conformation blocks which explicitly represent the sequence-structure relationship. Members of a block are identical in conformation and highly similar in sequence. From this block database, we derive a conformation-specific amino acid substitution matrix, CBSM60. The matrix shows improved performance in conformational segment search and homolog detection.

3.
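Searching protein mass spectrometry data against databases for identification is an important step in proteomics research. Because the protein information in public databases is incomplete, some protein mass spectrometry data cannot be identified effectively. Building a dedicated mass spectrometry database from the EST sequences of related species can increase the chance of identifying unknown proteins. This article describes the specific methods and steps for building a local Mascot database from EST sequences, which extends the range of protein mass spectrometry data that the Mascot search engine can identify, raises the chance of identifying unknown proteins at the database level, and provides a practical bioinformatics technique for proteomics research.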

4.
Babnigg G, Giometti CS. Proteomics 2006, 6(16):4514-4522
In proteome studies, identification of proteins requires searching protein sequence databases. The public protein sequence databases (e.g., NCBInr, UniProt) each contain millions of entries, and private databases add thousands more. Although much of the sequence information in these databases is redundant, each database uses distinct identifiers for the identical protein sequence and often contains unique annotation information. Users of one database obtain a database-specific sequence identifier that is often difficult to reconcile with the identifiers from a different database. When multiple databases are used for searches or the databases being searched are updated frequently, interpreting the protein identifications and associated annotations can be problematic. We have developed a database of unique protein sequence identifiers called Sequence Globally Unique Identifiers (SEGUID) derived from primary protein sequences. These identifiers serve as a common link between multiple sequence databases and are resilient to annotation changes in either public or private databases throughout the lifetime of a given protein sequence. The SEGUID Database can be downloaded (http://bioinformatics.anl.gov/SEGUID/) or easily generated at any site with access to primary protein sequence databases. Since SEGUIDs are stable, predictions based on the primary sequence information (e.g., pI, Mr) can be calculated just once; we have generated approximately 500 different calculations for more than 2.5 million sequences. SEGUIDs are used to integrate MS and 2-DE data with bioinformatics information and provide the opportunity to search multiple protein sequence databases, thereby providing a higher probability of finding the most valid protein identifications.
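The idea of deriving an identifier from the primary sequence alone can be sketched in a few lines. The abstract does not say how a SEGUID is computed; the sketch below assumes the commonly described scheme (SHA-1 digest of the upper-cased sequence, base64-encoded, with the trailing padding removed). The accession strings and sequences are hypothetical.

```python
import base64
import hashlib


def seguid(sequence: str) -> str:
    """Return an identifier that depends only on the primary sequence.

    Assumption: SHA-1 of the upper-cased sequence, base64-encoded,
    with the trailing '=' padding stripped.
    """
    normalized = sequence.strip().upper()
    digest = hashlib.sha1(normalized.encode("ascii")).digest()
    return base64.b64encode(digest).decode("ascii").rstrip("=")


# Hypothetical records: the same protein stored under two
# database-specific identifiers, plus one different protein.
records = {
    "dbA:P00001": "MKTAYIAKQR",
    "dbB:12345": "mktayiakqr",
    "dbA:P00002": "MKTAYIAKQK",
}

# Group database-specific accessions under the shared sequence identifier.
by_seguid = {}
for accession, seq in records.items():
    by_seguid.setdefault(seguid(seq), []).append(accession)
```

Because the key depends only on the sequence, annotation changes or identifier changes in a source database do not affect the grouping.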

5.
Most pockets in the human leukocyte antigen-group DR (HLA-DR) groove are shaped by clusters of polymorphic residues and, thus, have distinct chemical and size characteristics in different HLA-DR alleles. Each HLA-DR pocket can be characterized by "pocket profiles," a quantitative representation of the interaction of all natural amino acid residues with a given pocket. In this report we demonstrate that pocket profiles are nearly independent of the remaining HLA-DR cleft. A small database of profiles was sufficient to generate a large number of HLA-DR matrices, representing the majority of human HLA-DR peptide-binding specificity. These virtual matrices were incorporated in software (TEPITOPE) capable of predicting promiscuous HLA class II ligands. This software, in combination with DNA microarray technology, has provided a new tool for the generation of comprehensive databases of candidate promiscuous T-cell epitopes in human disease tissues. First, DNA microarrays are used to reveal genes that are specifically expressed or upregulated in disease tissues. Second, the prediction software enables the scanning of these genes for promiscuous HLA-DR binding sites. In an example, we demonstrate that starting from nearly 20,000 genes, a database of candidate colon cancer-specific and promiscuous T-cell epitopes could be fully populated within a matter of days. Our approach has implications for the development of epitope-based vaccines.

6.
Southwestern blotting is used to investigate DNA-protein interactions. The advantage of this technique over related methods such as the electrophoretic mobility shift assay (EMSA) and DNA footprinting is that it provides information about the molecular weight of the unknown protein factor. The method combines features of the Southern and Western blotting techniques: a denaturing SDS-PAGE is first employed to separate proteins electrophoretically based on size, and after the proteins are transferred to a membrane support, the membrane-bound proteins are renatured and incubated with a 32P-labeled double-stranded oligonucleotide probe of specific DNA sequence. The interaction of the probe with the protein(s) is later visualized by autoradiography. This technique can be combined with database searching (TransFac, http://www.gene-regulation.com/pub/databases.html#transfac), prediction of potential protein factors binding to a target motif (e.g., Patch search), in vitro supershift EMSA and in vivo chromatin immunoprecipitation (ChIP) assays for effective identification of protein factors. The whole Southwestern blotting procedure takes approximately 4 d to complete. In this article, a commonly used protocol and expected results are described and discussed.

7.
DNA databases     
Schuster A. Bio Systems 2005, 81(3):234-246
This paper presents DNA algorithms for five relational algebra database operations (selection, projection, union, set difference, and Cartesian product) on so-called DNA databases. A DNA database is a database in which data records are encoded as DNA strands. These five operations are fundamental in the field of databases and perform most of the data retrieval operations on current databases.
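For readers unfamiliar with the five operations, a minimal in-memory sketch follows. It models a relation as a Python set of tuples and does not attempt to reproduce the paper's DNA-strand encoding or its DNA algorithms; the example relations are hypothetical.

```python
from itertools import product


def selection(relation, predicate):
    """Keep only the rows for which the predicate holds."""
    return {row for row in relation if predicate(row)}


def projection(relation, columns):
    """Keep only the listed column positions (duplicates collapse)."""
    return {tuple(row[i] for i in columns) for row in relation}


def union(r, s):
    """Rows present in either relation."""
    return r | s


def difference(r, s):
    """Rows present in r but not in s."""
    return r - s


def cartesian_product(r, s):
    """Every row of r concatenated with every row of s."""
    return {a + b for a, b in product(r, s)}


# Hypothetical relations with columns (gene_id, function).
genes = {("g1", "kinase"), ("g2", "protease"), ("g3", "kinase")}
more_genes = {("g3", "kinase"), ("g4", "ligase")}

kinases = selection(genes, lambda row: row[1] == "kinase")
functions = projection(genes, [1])
```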

8.
Optimizing substitution matrices by separating score distributions
MOTIVATION: Homology search is one of the most fundamental tools in bioinformatics. Typical alignment algorithms use substitution matrices and gap costs. Thus, improving substitution matrices increases the accuracy of homology searches. Generally, substitution matrices are derived from aligned sequences whose relationships are known, and gap costs are determined by trial and error. To discriminate relationships more clearly, we optimize the substitution matrices from a statistical viewpoint, using both positive and negative examples and Bayesian decision theory. RESULTS: Using the Clusters of Orthologous Groups (COG) database, we optimized substitution matrices. The classification accuracy of the obtained matrix on the COG database is better than that of conventional substitution matrices. It also achieves good performance in classification with other databases.
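How a substitution matrix and a gap cost combine into an alignment score can be illustrated with a small global-alignment (Needleman-Wunsch) scorer. This is a generic sketch, not the optimization method of the paper; the scoring function `toy_score` and the linear gap cost are illustrative assumptions standing in for a real matrix such as BLOSUM62 and affine gaps.

```python
def global_alignment_score(a, b, score, gap):
    """Needleman-Wunsch optimal score with a linear gap cost.

    `score(x, y)` plays the role of the substitution matrix and
    `gap` is the (negative) cost of aligning a residue to a gap.
    """
    # prev[j] holds the best score for a[:i-1] aligned with b[:j].
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * gap]
        for j in range(1, len(b) + 1):
            curr.append(max(
                prev[j - 1] + score(a[i - 1], b[j - 1]),  # substitution
                prev[j] + gap,                            # gap in b
                curr[j - 1] + gap,                        # gap in a
            ))
        prev = curr
    return prev[-1]


def toy_score(x, y):
    """Hypothetical two-valued substitution 'matrix'."""
    return 2 if x == y else -1
```

Changing the values returned by the scoring function changes which sequence pairs score highly, which is the quantity the optimization described above tries to tune.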

9.
UniProt archive     
UniProt Archive (UniParc) is the most comprehensive, non-redundant protein sequence database available. Its protein sequences are retrieved from predominant, publicly accessible resources. All new and updated protein sequences are collected and loaded daily into UniParc for full coverage. To avoid redundancy, each unique sequence is stored only once with a stable protein identifier, which can be used later in UniParc to identify the same protein in all source databases. When proteins are loaded into the database, database cross-references are created to link them to the origins of the sequences. As a result, performing a sequence search against UniParc is equivalent to performing the same search against all databases cross-referenced by UniParc. UniParc contains only protein sequences and database cross-references; all other information must be retrieved from the source databases.
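The "store each unique sequence once, attach cross-references" design can be sketched as a small in-memory structure. The class name, the identifier format `ARC000001`, and the source database names below are hypothetical and are not UniParc's actual schema or identifiers.

```python
class NonRedundantArchive:
    """Toy archive: one stable identifier per unique sequence."""

    def __init__(self):
        self._id_by_sequence = {}
        self._xrefs_by_id = {}

    def load(self, source_db, accession, sequence):
        """Register a record and return the archive identifier."""
        sequence = sequence.upper()
        archive_id = self._id_by_sequence.get(sequence)
        if archive_id is None:
            # First time this sequence is seen: assign a new identifier.
            archive_id = f"ARC{len(self._id_by_sequence) + 1:06d}"
            self._id_by_sequence[sequence] = archive_id
            self._xrefs_by_id[archive_id] = set()
        # Always record where the sequence came from.
        self._xrefs_by_id[archive_id].add((source_db, accession))
        return archive_id

    def cross_references(self, archive_id):
        """Return the source records that share this sequence."""
        return sorted(self._xrefs_by_id[archive_id])


archive = NonRedundantArchive()
first = archive.load("DB1", "X1", "MKT")
second = archive.load("DB2", "Y9", "MKT")   # same sequence, other source
third = archive.load("DB1", "X2", "MKV")    # different sequence
```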

10.
Zhang C, Liu S, Zhou H, Zhou Y. Biophysical Journal 2004, 86(6):3349-3358
An accurate statistical energy function that is suitable for the prediction of protein structures of all classes should be independent of the structural database used for energy extraction. Here, two high-resolution, low-sequence-identity structural databases of 333 alpha-proteins and 271 beta-proteins were built for examining the database dependence of three all-atom statistical energy functions. They are RAPDF (residue-specific all-atom conditional probability discriminatory function), atomic KBP (atomic knowledge-based potential), and DFIRE (statistical potential based on distance-scaled finite ideal-gas reference state). These energy functions differ in the reference states used for energy derivation. The energy functions extracted from the different structural databases are used to select native structures from multiple decoys of 64 alpha-proteins and 28 beta-proteins. The performance in native structure selections indicates that the DFIRE-based energy function is mostly independent of the structural database whereas RAPDF and KBP have a significant dependence. The construction of two additional structural databases of alpha/beta and alpha + beta-proteins further confirmed the weak dependence of DFIRE on the structural databases of various structural classes. The possible source for the difference between the three all-atom statistical energy functions is that the physical reference state of ideal gas used in the DFIRE-based energy function is least dependent on the structural database.

11.
Hernandez P, Gras R, Frey J, Appel RD. Proteomics 2003, 3(6):870-878
In recent years, proteomics research has gained importance due to increasingly powerful techniques in protein purification, mass spectrometry and identification, and due to the development of extensive protein and DNA databases from various organisms. Nevertheless, current methods for identification from spectrometric data have difficulties in handling modifications or mutations in the source peptide. Moreover, they perform poorly when run on large databases (such as genomic databases), or with low-quality data, for example due to bad calibration or low fragmentation of the source peptide. We present a new algorithm dedicated to automated protein identification from tandem mass spectrometry (MS/MS) data by searching a peptide sequence database. Our identification approach shows promising properties for solving the specific difficulties enumerated above. It consists of matching theoretical peptide sequences taken from a database with a structured representation of the source MS/MS spectrum. The representation is similar to the spectrum graphs commonly used by de novo sequencing software. The identification process involves parsing the graph in order to emphasize relevant sections for each theoretical sequence, and leads to a list of peptides ranked by a correlation score. The parsing of the graph, which can be a highly combinatorial task, is performed by a bio-inspired algorithm called Ant Colony Optimization.

12.
In this article, we present some simple yet effective statistical techniques for analysing and comparing large DNA sequences. These techniques are based on frequency distributions of DNA words in a large sequence, and have been packaged into a software package called SWORDS. Using sequences available in public domain databases on the Internet, we demonstrate how SWORDS can be conveniently used by molecular biologists and geneticists to unmask biologically important features hidden in large sequences and assess their statistical significance.
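A frequency distribution of DNA words (overlapping substrings of fixed length k) can be computed as follows. This is a generic sketch and not the SWORDS implementation; the function name and the choice of overlapping words are assumptions, and no significance testing is attempted.

```python
from collections import Counter


def word_frequencies(sequence, k):
    """Relative frequency of every overlapping k-letter word."""
    sequence = sequence.upper()
    counts = Counter(sequence[i:i + k] for i in range(len(sequence) - k + 1))
    total = sum(counts.values())
    # An input shorter than k yields no words and an empty result.
    return {word: n / total for word, n in counts.items()}


# "ACGACG" contains the 3-letter words ACG, CGA, GAC, ACG.
profile = word_frequencies("ACGACG", 3)
```

Two sequences can then be compared by comparing their profiles, for example word by word.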

13.
Multiple sequence alignments are powerful tools for understanding the structures, functions, and evolutionary histories of linear biological macromolecules (DNA, RNA, and proteins), and for finding homologs in sequence databases. We address several ontological issues related to RNA sequence alignments that are informed by structure. Multiple sequence alignments are usually shown as two-dimensional (2D) matrices, with rows representing individual sequences, and columns identifying nucleotides from different sequences that correspond structurally, functionally, and/or evolutionarily. However, the requirement that sequences and structures correspond nucleotide-by-nucleotide is unrealistic and hinders representation of important biological relationships. High-throughput sequencing efforts are also rapidly making 2D alignments unmanageable because of vertical and horizontal expansion as more sequences are added. Solving the shortcomings of traditional RNA sequence alignments requires explicit annotation of the meaning of each relationship within the alignment. We introduce the notion of “correspondence,” which is an equivalence relation between RNA elements in sets of sequences as the basis of an RNA alignment ontology. The purpose of this ontology is twofold: first, to enable the development of new representations of RNA data and of software tools that resolve the expansion problems with current RNA sequence alignments, and second, to facilitate the integration of sequence data with secondary and three-dimensional structural information, as well as other experimental information, to create simultaneously more accurate and more exploitable RNA alignments.

14.
GenBank
GenBank® is a comprehensive sequence database that contains publicly available DNA sequences for more than 119 000 different organisms, obtained primarily through the submission of sequence data from individual laboratories and batch submissions from large-scale sequencing projects. Most submissions are made using the BankIt (web) or Sequin programs and accession numbers are assigned by GenBank staff upon receipt. Daily data exchange with the EMBL Data Library in the UK and the DNA Data Bank of Japan helps ensure worldwide coverage. GenBank is accessible through NCBI's retrieval system, Entrez, which integrates data from the major DNA and protein sequence databases along with taxonomy, genome, mapping, protein structure and domain information, and the biomedical journal literature via PubMed. BLAST provides sequence similarity searches of GenBank and other sequence databases. Complete bimonthly releases and daily updates of the GenBank database are available by FTP. To access GenBank and its related retrieval and analysis services, go to the NCBI home page at: http://www.ncbi.nlm.nih.gov.

15.
Substitution matrices have been useful for sequence alignment and protein sequence comparisons. The BLOSUM series of matrices, which had been derived from a database of alignments of protein blocks, improved the accuracy of alignments previously obtained from the PAM-type matrices estimated from only closely related sequences. Although BLOSUM matrices are scoring matrices now widely used for protein sequence alignments, they do not describe an evolutionary model. BLOSUM matrices do not permit the estimation of the actual number of amino acid substitutions between sequences by correcting for multiple hits. The method presented here uses the Blocks database of protein alignments, along with the additivity of evolutionary distances, to approximate the amino acid substitution probabilities as a function of actual evolutionary distance. The PMB (Probability Matrix from Blocks) defines a new evolutionary model for protein evolution that can be used for evolutionary analyses of protein sequences. Our model is directly derived from, and thus compatible with, the BLOSUM matrices. The model has the additional advantage of being easily implemented.

16.
Mishra P, Pandey PN. Bioinformation 2011, 6(10):372-374
The number of amino acid sequences is increasing very rapidly in protein databases such as Swiss-Prot, UniProt, PIR and others, but the structures of only some amino acid sequences are found in the Protein Data Bank. Thus, an important problem in genomics is automatically clustering homologous protein sequences when only sequence information is available. Here, we use graph theoretic techniques for clustering amino acid sequences. A similarity graph is defined, and clusters in that graph correspond to connected subgraphs. Cluster analysis seeks to group amino acid sequences into subsets based on the distance or similarity score between pairs of sequences. Our goal is to find disjoint subsets, called clusters, such that two criteria are satisfied: homogeneity (sequences in the same cluster are highly similar to each other) and separation (sequences in different clusters have low similarity to each other). We tested our method on several subsets of the SCOP (Structural Classification of Proteins) database, a gold standard for protein structure classification. The results show that for a given set of proteins the number of clusters we obtained is close to the number of superfamilies in that set; there are fewer singletons; and the method correctly groups most remote homologs.
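One simple reading of "clusters correspond to connected subgraphs" is to link every pair of sequences whose similarity reaches a threshold and report the connected components. The sketch below does exactly that with a union-find structure; it is not the authors' algorithm. The similarity measure (`difflib.SequenceMatcher.ratio`), the threshold, and the example sequences are all assumptions chosen for illustration; a real pipeline would use alignment-based scores.

```python
from difflib import SequenceMatcher
from itertools import combinations


def similarity(a, b):
    """Stand-in similarity score in [0, 1] (not an alignment score)."""
    return SequenceMatcher(None, a, b).ratio()


def cluster_sequences(sequences, threshold):
    """Connected components of the thresholded similarity graph."""
    parent = list(range(len(sequences)))

    def find(i):
        # Follow parent links to the component root, compressing the path.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    # Add an edge (merge components) for every sufficiently similar pair.
    for i, j in combinations(range(len(sequences)), 2):
        if similarity(sequences[i], sequences[j]) >= threshold:
            parent[find(i)] = find(j)

    clusters = {}
    for i, seq in enumerate(sequences):
        clusters.setdefault(find(i), []).append(seq)
    return sorted(clusters.values())


# Hypothetical sequences: two near-identical pairs and one singleton.
example = ["MKTAYIAKQR", "MKTAYIAKQK", "GGSSGGSSGG", "GGSSGGSSGA", "WWWWPPPPHH"]
clusters = cluster_sequences(example, threshold=0.8)
```

Raising the threshold favors homogeneity within clusters, while lowering it merges more sequences and reduces the number of singletons.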

17.
Expressed sequence tags (ESTs) from the marine red alga Gracilaria gracilis
Expressed sequence tags (ESTs) are partial sequences of cDNAs, and can be used to characterize gene expression in organisms or tissues. We have constructed a 200-sequence EST database from vegetative thalli of Gracilaria gracilis, the first ESTs reported from any alga. This database contains recognizable ESTs corresponding to genes of carbohydrate metabolism (seven), amino acid metabolism (three), photosynthesis (five), nucleic acid synthesis, repair and processing (three), protein synthesis (14), protein degradation (six), cellular maintenance and stress response (three), other identifiable protein-coding genes (13) and 146 sequences for which significant matches were not found in existing sequence databases. We have already used this EST database to recover genes of carbohydrate biosynthesis from G. gracilis. This revised version was published online in August 2006 with corrections to the Cover Date.

18.
Peptide mass fingerprint (PMF) matching is a high-throughput method used for protein spot identification in connection with two-dimensional gel electrophoresis (2DE). However, the success of PMF matching largely depends on whether the proteins to be identified exist in the database searched. Consequently, it is often necessary to apply other more sophisticated but also time-consuming technologies to generate sequence-tags for definitive protein identification. On the other hand, modern sequencing technologies are generating a large quantity of DNA sequences, first in unfinished form or with low genome coverage due to the time-consuming and thus limiting steps of finishing and annotation. We recently started to sequence the genome of Bacillus megaterium DSM 319, a bacterium of industrial interest. In this study, we demonstrate that a protein database generated from merely three-fold coverage, unfinished genomic sequences of this bacterium allows a fast and reliable protein spot identification solely based on PMF from high-throughput MALDI-TOF MS analysis. We further show that the strain-specific protein database from low coverage genomic sequence greatly outperforms the commonly used cross-species databases constructed from 13 completely sequenced Bacillus strains for protein spot identification via PMF.
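The dependence of PMF matching on the searched database can be seen in a minimal sketch: each database protein is digested in silico, theoretical peptide masses are computed, and the protein with the most observed masses matched wins. This is a simplified illustration, not the study's pipeline. It assumes trypsin cleavage after K or R except before P, neutral monoisotopic masses, no missed cleavages and no modifications; the protein names, sequences, tolerance, and observed masses are hypothetical.

```python
# Monoisotopic residue masses in daltons.
MONO = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.01056


def tryptic_peptides(protein):
    """Cleave after K or R unless the next residue is P."""
    peptides, start = [], 0
    for i, residue in enumerate(protein):
        next_residue = protein[i + 1] if i + 1 < len(protein) else ""
        if residue in "KR" and next_residue != "P":
            peptides.append(protein[start:i + 1])
            start = i + 1
    if start < len(protein):
        peptides.append(protein[start:])
    return peptides


def peptide_mass(peptide):
    """Neutral monoisotopic mass: residue masses plus one water."""
    return sum(MONO[aa] for aa in peptide) + WATER


def count_matches(observed_masses, protein, tolerance=0.2):
    """Number of observed masses explained by the protein's peptides."""
    theoretical = [peptide_mass(p) for p in tryptic_peptides(protein)]
    return sum(
        any(abs(m - t) <= tolerance for t in theoretical)
        for m in observed_masses
    )


# Hypothetical database and observed peak list.
protein_db = {"protA": "AKRPGKLL", "protB": "GGGKSSSR"}
observed = [217.14, 244.18]
best = max(protein_db, key=lambda name: count_matches(observed, protein_db[name]))
```

If the true protein is absent from `protein_db`, no entry can collect matches, which is why a strain-specific database, even one built from low-coverage sequence, can help.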

19.
MOTIVATION: Multiple sequence alignments (MSAs) are at the heart of bioinformatics analysis. Recently, a number of multiple protein sequence alignment benchmarks (i.e. BAliBASE, OXBench, PREFAB and SMART) have been released to evaluate new and existing MSA applications. These databases have been well received by researchers and help to quantitatively evaluate MSA programs on protein sequences. Unfortunately, analogous DNA benchmarks are not available, making evaluation of MSA programs difficult for DNA sequences. RESULTS: This work presents the first known multiple DNA sequence alignment benchmarks that are (1) composed of protein-coding portions of DNA and (2) based on biological features such as the tertiary structure of encoded proteins. These reference DNA databases contain a total of 3545 alignments, comprising 68 581 sequences. Two versions of the database are available: mdsa_100s and mdsa_all. The mdsa_100s version contains the alignments of the data sets for which TBLASTN found 100% sequence identity for each sequence. The mdsa_all version includes all hits with an E-value score above the threshold of 0.001. A primary use of these databases is to benchmark the performance of MSA applications on DNA data sets. The first such case study is included in the Supplementary Material.

20.
Recently, Krane et al. addressed the effect of estimating various target DNA profile frequencies in Finnish, Italian, and general U.S. Caucasian databases. They observed that, when using an "inappropriate ethnic," or noncognate, database (e.g., a Finnish target profile frequency estimated in an Italian database, and vice versa), estimates were less common than when the target profile frequencies were estimated using their own ethnic, or cognate, database, and these estimates differed by more than one order of magnitude. These differences were attributed to the effects of subgroups. We demonstrate that the differences can be ascribed to statistical artifacts that induce large biases and correlations. Additionally, we show that the differences of estimates of any specific DNA profile frequency based on allelic size frequencies from different databases become smaller when the fixed-bin rather than a +/- 2.5% floating-bin approach is used.
