Similar articles
1.
Mass spectrometry‐based proteomics is a popular and powerful method for precise and highly multiplexed protein identification. The most common method of analyzing untargeted proteomics data is called database searching, where the database is simply a collection of protein sequences from the target organism, derived from genome sequencing. Experimental peptide tandem mass spectra are compared to simplified models of theoretical spectra calculated from the translated genomic sequences. However, in several interesting application areas, such as forensics, archaeology, venomics, and others, a genome sequence may not be available, or the correct genome sequence to use is not known. In these cases, de novo peptide identification can play an important role. De novo methods infer peptide sequence directly from the tandem mass spectrum without reference to a sequence database, usually using graph‐based or machine learning algorithms. In this review, we provide a basic overview of de novo peptide identification methods and applications, briefly covering de novo algorithms and tools, and focusing in more depth on recent applications from venomics, metaproteomics, forensics, and characterization of antibody drugs.

2.
Mass spectrometric identification of proteins in species lacking validated sequence information is a major problem in veterinary science. In the present study, we used ochratoxin A‐producing Penicillium verrucosum to identify and quantitatively analyze proteins of an organism for which no protein information is yet available. The work presented here aimed to provide a comprehensive protein identification of P. verrucosum using shotgun proteomics. We were able to identify 3631 proteins in an “ab initio” translated database from DNA sequences of P. verrucosum. Additionally, a sequential window acquisition of all theoretical fragment‐ion spectra analysis was done to find differentially regulated proteins at two different time points of the growth curve. We compared the proteins at the beginning (day 3) and at the end of the log phase (day 12).

3.
Rapid identification of proteins by peptide-mass fingerprinting   Cited by: 33 (self-citations: 0; citations by others: 33)
BACKGROUND: Developments in 'soft' ionisation techniques have revolutionized mass-spectrometric approaches for the analysis of protein structure. For more than a decade, such techniques have been used, in conjunction with digestion by specific proteases, to produce accurate peptide molecular weight 'fingerprints' of proteins. These fingerprints have commonly been used to screen known proteins, in order to detect errors of translation, to characterize post-translational modifications and to assign disulphide bonds. However, the extent to which peptide-mass information can be used alone to identify unknown sample proteins, independent of other analytical methods such as protein sequence analysis, has remained largely unexplored. RESULTS: We report here on the development of the molecular weight search (MOWSE) peptide-mass database at the SERC Daresbury Laboratory. Practical experience has shown that sample proteins can be uniquely identified from as few as three or four experimentally determined peptide masses when these are screened against a fragment database that is derived from over 50 000 proteins. Experimental errors of a few Daltons are tolerated by the scoring algorithms, thus permitting the use of inexpensive time-of-flight mass spectrometers. As with other types of physical data, such as amino-acid composition or linear sequence, peptide masses provide a set of determinants that are sufficiently discriminating to identify or match unknown sample proteins. CONCLUSION: Peptide-mass fingerprints can prove as discriminating as linear peptide sequences, but can be obtained in a fraction of the time using less protein. In many cases, this allows for a rapid identification of a sample protein before committing it to protein sequence analysis. Fragment masses also provide information, at the protein level, that is complementary to the information provided by large-scale DNA sequencing or mapping projects.
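
The idea of matching measured peptide masses against a fragment database can be illustrated with a minimal Python sketch. It is not the MOWSE scoring algorithm, which the abstract does not describe: it assumes tryptic cleavage (after K or R, but not before P), uses standard monoisotopic residue masses, and simply counts how many observed masses fall within a tolerance of a theoretical peptide mass. The function names and the two toy protein sequences are hypothetical.

```python
import re

# Standard monoisotopic residue masses in Daltons.
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.01056  # mass of H2O added to the sum of residue masses


def tryptic_peptides(sequence):
    """Split a protein after K or R, except when the next residue is P."""
    return [p for p in re.split(r"(?<=[KR])(?!P)", sequence) if p]


def peptide_mass(peptide):
    """Monoisotopic mass of an unmodified peptide."""
    return sum(RESIDUE_MASS[aa] for aa in peptide) + WATER


def count_matches(observed_masses, sequence, tolerance):
    """Count observed masses lying within `tolerance` Da of any theoretical peptide."""
    theoretical = [peptide_mass(p) for p in tryptic_peptides(sequence)]
    return sum(
        any(abs(obs - theo) <= tolerance for theo in theoretical)
        for obs in observed_masses
    )


def rank_candidates(observed_masses, database, tolerance=2.0):
    """Return (protein_id, match_count) pairs, best match first.

    `database` is a hypothetical dict mapping protein IDs to sequences.
    """
    scores = [
        (protein_id, count_matches(observed_masses, seq, tolerance))
        for protein_id, seq in database.items()
    ]
    return sorted(scores, key=lambda item: item[1], reverse=True)
```

A call such as `rank_candidates(measured, database, tolerance=2.0)` returns the candidates ordered by the number of matched masses. A real scoring scheme would also weight matches by how frequent each fragment mass is in the database, which this sketch leaves out.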

4.
Los Alamos sequence analysis package for nucleic acids and proteins.   Cited by: 58 (self-citations: 11; citations by others: 47)
An interactive system for computer analysis of nucleic acid and protein sequences has been developed for the Los Alamos DNA Sequence Database. It provides a convenient way to search or verify various sequence features, e.g., restriction enzyme sites, protein coding frames, and properties of coded proteins. Further, the comprehensive analysis package on a large-scale database can be used for comparative studies on sequence and structural homologies in order to find unnoted information stored in nucleic acid sequences.

5.
The ever increasing speed of DNA sequencing widens the discrepancy between the number of known gene products, and the knowledge of their function and structure. Proper annotation of protein sequences is therefore crucial if the missing information is to be deduced from sequence‐based similarity comparisons. These comparisons become exceedingly difficult as the pairwise identities drop to very low values. To improve the accuracy of domain identification, we exploit the fact that the three‐dimensional structures of domains are much more conserved than their sequences. Based on structure‐anchored multiple sequence alignments of low identity homologues we constructed 850 structure‐anchored hidden Markov models (saHMMs), each representing one domain family. Since the saHMMs are highly family specific, they can be used to assign a domain to its correct family and clearly distinguish it from domains belonging to other families, even within the same superfamily. This task is not trivial and becomes particularly difficult if the unknown domain is distantly related to the rest of the domain sequences within the family. In a search with full length protein sequences, harbouring at least one domain as defined by the structural classification of proteins database (SCOP), version 1.71, versus the saHMM database based on SCOP version 1.69, we achieve an accuracy of 99.0%. All of the few hits outside the family fall within the correct superfamily. Compared to Pfam_ls HMMs, the saHMMs obtain about 11% higher coverage. A comparison with BLAST and PSI‐BLAST demonstrates that the saHMMs have consistently fewer errors per query at a given coverage. Within our recommended E‐value range, the same is true for a comparison with SUPERFAMILY. Furthermore, we are able to annotate 232 proteins with 530 nonoverlapping domains belonging to 102 different domain families among human proteins labelled “unknown” in the NCBI protein database. 
Our results demonstrate that the saHMM database represents a versatile and reliable tool for identification of domains in protein sequences. With the aid of saHMMs, homology on the family level can be assigned, even for distantly related sequences. Due to the construction of the saHMMs, the hits they provide are always associated with high quality crystal structures. The saHMM database can be accessed via the FISH server at http://babel.ucmp.umu.se/fish/ . Proteins 2009. © 2008 Wiley‐Liss, Inc.

6.
Species identification of Scenedesmus-like microalgae, comprising Desmodesmus, Tetradesmus, and Scenedesmus, has been challenging due to their high morphological and genetic similarity. After developing a DNA signaturing tool for Desmodesmus identification, we built a DNA signaturing database for Tetradesmus. The DNA signaturing tool contained species-specific nucleotide sequences of Tetradesmus species or strain groups with high similarity in ITS2 sequences. To construct DNA signaturing, we collected data on ITS2 sequences, aligned the sequences, organized the data by ITS2 sequence homology, and determined signature sequences according to hemi-compensatory base changes (hCBC)/CBC data from previous studies. Four Tetradesmus species and 11 strain groups had DNA signatures. The signature sequence of the genus Tetradesmus, TTA GAG GCT TAA GCA AGG ACCC, recognized 86% (157/183) of the collected Tetradesmus strains. Phylogenetic analysis of Scenedesmus-like species revealed that the Tetradesmus species were monophyletic and closely related to each other based on branch lengths. Desmodesmus was suggested to split into two subgenera due to their genetic and morphological distinction. Scenedesmus must be analyzed along with other genera of the Scenedesmaceae family to determine their genetic relationships. Importantly, DNA signaturing was integrated into a database for identifying Scenedesmus-like species through BLAST.
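
As a rough illustration of how a genus-level signature might be checked against a set of ITS2 sequences, the Python sketch below tests each sequence for the signature quoted in the abstract and reports the fraction recognized. Exact substring matching is an assumption made here for simplicity; the published tool is queried through BLAST, and the function names are hypothetical.

```python
def normalize(sequence):
    """Remove whitespace and upper-case a nucleotide sequence."""
    return "".join(sequence.split()).upper()


# Genus-level signature as quoted in the abstract, with spaces removed.
TETRADESMUS_SIGNATURE = normalize("TTA GAG GCT TAA GCA AGG ACCC")


def has_signature(its2_sequence, signature=TETRADESMUS_SIGNATURE):
    """True if the signature occurs exactly in the sequence (assumed matching rule)."""
    return signature in normalize(its2_sequence)


def recognition_rate(sequences, signature=TETRADESMUS_SIGNATURE):
    """Fraction of sequences that contain the signature."""
    if not sequences:
        return 0.0
    hits = sum(has_signature(seq, signature) for seq in sequences)
    return hits / len(sequences)
```

Applied to the 183 collected strains, a function like `recognition_rate` would yield the kind of figure reported above (157/183), provided the matching rule is the same as the authors', which this sketch does not establish.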

7.
Peptide mass fingerprint (PMF) matching is a high-throughput method used for protein spot identification in connection with two-dimensional gel electrophoresis (2DE). However, the success of PMF matching largely depends on whether the proteins to be identified exist in the database searched. Consequently, it is often necessary to apply other more sophisticated but also time-consuming technologies to generate sequence-tags for definitive protein identification. On the other hand, modern sequencing technologies are generating a large quantity of DNA sequences, first in unfinished form or with low genome coverage due to the time-consuming and thus limiting steps of finishing and annotation. We recently started to sequence the genome of Bacillus megaterium DSM 319, a bacterium of industrial interest. In this study, we demonstrate that a protein database generated from merely three-fold coverage, unfinished genomic sequences of this bacterium allows a fast and reliable protein spot identification solely based on PMF from high-throughput MALDI-TOF MS analysis. We further show that the strain-specific protein database from low coverage genomic sequence greatly outperforms the commonly used cross-species databases constructed from 13 completely sequenced Bacillus strains for protein spot identification via PMF.

8.
GenBank.   Cited by: 2 (self-citations: 0; citations by others: 2)
The GenBank® sequence database incorporates DNA sequences from all available public sources, primarily through the direct submission of sequence data from individual laboratories and from large-scale sequencing projects. Most submitters use the BankIt (Web) or Sequin programs to format and send sequence data. Data exchange with the EMBL Data Library and the DNA Data Bank of Japan helps ensure comprehensive worldwide coverage. GenBank data is accessible through NCBI's integrated retrieval system, Entrez, which integrates data from the major DNA and protein sequence databases along with taxonomy, genome and protein structure information. MEDLINE® abstracts from published articles describing the sequences are included as an additional source of biological annotation through the PubMed search system. Sequence similarity searching is offered through the BLAST series of database search programs. In addition to FTP, e-mail, and server/client versions of Entrez and BLAST, NCBI offers a wide range of World Wide Web retrieval and analysis services based on GenBank data. The GenBank database and related resources are freely accessible via the URL: http://www.ncbi.nlm.nih.gov

9.
Identification of North Sea molluscs with DNA barcoding   Cited by: 1 (self-citations: 0; citations by others: 1)
Sequence‐based specimen identification, known as DNA barcoding, is a common method complementing traditional morphology‐based taxonomic assignments. The fundamental resource in DNA barcoding is the availability of a taxonomically reliable sequence database to use as a reference for sequence comparisons. Here, we provide a reference library including 579 sequences of the mitochondrial cytochrome c oxidase subunit I for 113 North Sea mollusc species. We tested the efficacy of this library by simulating a sequence‐based specimen identification scenario using Best Match, Best Close Match (BCM) and All Species Barcode (ASB) criteria with three different threshold values. Each identification result was compared with our prior morphology‐based taxonomic assignments. Our simulation resulted in 87.7% congruent identifications (93.8% when excluding singletons). The highest number of congruent identifications was obtained with BCM and ASB and a 0.05 threshold. We also compared identifications with genetic clustering (Barcode Index Numbers, BINs) computed by the Barcode of Life Datasystem (BOLD). About 68% of our morphological identifications were congruent with BINs created by BOLD. Forty‐nine sequences were clustered in 16 discordant BINs, and these were divided in two classes: sequences from different species clustered in a single BIN and conspecific sequences divided in more BINs. Whereas former incongruences were probably caused by BOLD entries in need of a taxonomic update, the latter incongruences regarded taxa requiring further investigations. These include species with amphi‐Atlantic distribution, whose genetic structure should be evaluated over their entire range to produce a reliable sequence‐based identification system.
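
The Best Close Match criterion mentioned above can be sketched in Python as follows. This is a simplified reading of the criterion as it is commonly described, not the authors' pipeline: the query takes the species of its nearest reference sequence only if that distance is within the threshold, and ties between different species are reported as ambiguous. The uncorrected p-distance on pre-aligned, equal-length sequences and all function names are assumptions.

```python
def p_distance(seq_a, seq_b):
    """Proportion of differing sites between two aligned, equal-length sequences."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    mismatches = sum(a != b for a, b in zip(seq_a, seq_b))
    return mismatches / len(seq_a)


def best_close_match(query, library, threshold=0.05):
    """Identify `query` against `library`, a list of (species, sequence) pairs.

    Returns the species name, "ambiguous" if equally close best matches
    belong to different species, or None if no reference is close enough.
    """
    if not library:
        return None
    distances = [(p_distance(query, seq), species) for species, seq in library]
    best = min(dist for dist, _ in distances)
    if best > threshold:
        return None  # nearest reference lies outside the threshold
    candidates = {species for dist, species in distances if dist == best}
    return candidates.pop() if len(candidates) == 1 else "ambiguous"
```

The default threshold of 0.05 mirrors the value that gave the most congruent identifications in the abstract. Distances computed with a substitution model, as barcoding studies often use, would replace `p_distance`.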

10.
GenBank.   Cited by: 2 (self-citations: 1; citations by others: 2)
The GenBank® sequence database (http://www.ncbi.nlm.nih.gov/) incorporates DNA sequences from all available public sources, primarily through the direct submission of sequence data from individual laboratories and from large-scale sequencing projects. Most submitters use the BankIt (WWW) or Sequin programs to send their sequence data. Data exchange with the EMBL Data Library and the DNA Data Bank of Japan helps ensure comprehensive worldwide coverage. GenBank data is accessible through NCBI's integrated retrieval system, Entrez, which integrates data from the major DNA and protein sequence databases along with taxonomy, genome and protein structure information. MEDLINE® abstracts from published articles describing the sequences are also included as an additional source of biological annotation. Sequence similarity searching is offered through the BLAST series of database search programs. In addition to FTP, e-mail and server/client versions of Entrez and BLAST, NCBI offers a wide range of World Wide Web retrieval and analysis services of interest to biologists.

11.
Virtually every molecular biologist has searched a protein or DNA sequence database to find sequences that are evolutionarily related to a given query. Pairwise sequence comparison methods (i.e., measures of similarity between query and target sequences) provide the engine for sequence database search and have been the subject of 30 years of computational research. For the difficult problem of detecting remote evolutionary relationships between protein sequences, the most successful pairwise comparison methods involve building local models (e.g., profile hidden Markov models) of protein sequences. However, recent work in massive data domains like web search and natural language processing demonstrates the advantage of exploiting the global structure of the data space. Motivated by this work, we present a large-scale algorithm called ProtEmbed, which learns an embedding of protein sequences into a low-dimensional "semantic space." Evolutionarily related proteins are embedded in close proximity, and additional pieces of evidence, such as 3D structural similarity or class labels, can be incorporated into the learning process. We find that ProtEmbed achieves superior accuracy to widely used pairwise sequence methods like PSI-BLAST and HHSearch for remote homology detection; it also outperforms our previous RankProp algorithm, which incorporates global structure in the form of a protein similarity network. Finally, the ProtEmbed embedding space can be visualized, both at the global level and local to a given query, yielding intuition about the structure of protein sequence space.

12.
Babnigg G, Giometti CS. Proteomics, 2006, 6(16): 4514-4522
In proteome studies, identification of proteins requires searching protein sequence databases. The public protein sequence databases (e.g., NCBInr, UniProt) each contain millions of entries, and private databases add thousands more. Although much of the sequence information in these databases is redundant, each database uses distinct identifiers for the identical protein sequence and often contains unique annotation information. Users of one database obtain a database-specific sequence identifier that is often difficult to reconcile with the identifiers from a different database. When multiple databases are used for searches or the databases being searched are updated frequently, interpreting the protein identifications and associated annotations can be problematic. We have developed a database of unique protein sequence identifiers called Sequence Globally Unique Identifiers (SEGUID) derived from primary protein sequences. These identifiers serve as a common link between multiple sequence databases and are resilient to annotation changes in either public or private databases throughout the lifetime of a given protein sequence. The SEGUID Database can be downloaded (http://bioinformatics.anl.gov/SEGUID/) or easily generated at any site with access to primary protein sequence databases. Since SEGUIDs are stable, predictions based on the primary sequence information (e.g., pI, Mr) can be calculated just once; we have generated approximately 500 different calculations for more than 2.5 million sequences. SEGUIDs are used to integrate MS and 2-DE data with bioinformatics information and provide the opportunity to search multiple protein sequence databases, thereby providing a higher probability of finding the most valid protein identifications.
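
A sequence-derived identifier of this kind can be sketched in a few lines of Python. The abstract does not state how a SEGUID is computed; the sketch assumes the commonly described construction (SHA-1 digest of the upper-cased sequence, Base64-encoded, with the trailing "=" padding removed). Stripping whitespace first and the helper `group_by_seguid` are additions made here for illustration.

```python
import base64
import hashlib


def seguid(sequence):
    """Checksum-style identifier derived only from the primary sequence.

    Assumed construction: SHA-1 of the upper-cased sequence, Base64-encoded,
    with "=" padding stripped. Whitespace removal is an extra convenience.
    """
    normalized = "".join(sequence.split()).upper()
    digest = hashlib.sha1(normalized.encode("ascii")).digest()
    return base64.b64encode(digest).decode("ascii").rstrip("=")


def group_by_seguid(entries):
    """Group database-specific identifiers that share an identical sequence.

    `entries` is an iterable of (identifier, sequence) pairs, e.g. drawn
    from several hypothetical databases.
    """
    groups = {}
    for identifier, sequence in entries:
        groups.setdefault(seguid(sequence), []).append(identifier)
    return groups
```

Because the identifier depends only on the sequence, entries from different databases that carry the same protein land in the same group regardless of how each database names or annotates them.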

13.
14.
Numts are nonfunctional mitochondrial sequences that have translocated into nuclear DNA, where they evolve independently from the original mitochondrial DNA (mtDNA) sequence. Numts can be unintentionally amplified in addition to authentic mtDNA, complicating both the analysis and interpretation of mtDNA-based studies. Amplification of numts creates particular issues for studies on the noncoding, hypervariable 1 mtDNA region of gorillas. We provide data on putative numt sequences of the coding mitochondrial gene cytochrome oxidase subunit II (COII). Via polymerase chain reaction (PCR) and cloning, we obtained COII sequences for gorilla, orangutan, and human high-quality DNA and also from a gorilla fecal DNA sample. Both gorilla and orangutan samples yielded putative numt sequences. Phylogenetically more anciently transferred numts were amplified with a greater incidence from the gorilla fecal DNA sample than from the high-quality gorilla sample. Data on phylogenetically more recently transferred numts are equivocal. We further demonstrate the need for additional investigations into the use of mtDNA markers for noninvasively collected samples from gorillas and other primates.

15.
In the 10 years since we published our first full analysis of mitochondrial DNA (mtDNA) variation in Rattus exulans as a means for tracking human migration in Polynesia, we have extended the commensal approach through time and space with the use of ancient DNA (aDNA) and by analysing samples from across the Pacific. Not only can mtDNA phylogenies provide information regarding population origins and paths of migration, they have also provided information regarding degrees of contact and interaction between islands. An important extension of the R. exulans project is the creation and on-going development of a genetic database for the identification of Rattus species based on mtDNA sequences. The phylogenetic analysis of sequences from 18 species and 1 subspecies of Rattus has thus far raised some questions regarding species identification and species distributions in the Pacific.

16.
17.
GenBank   Cited by: 51 (self-citations: 4; citations by others: 47)
The GenBank® sequence database incorporates publicly available DNA sequences of >55 000 different organisms, primarily through direct submission of sequence data from individual laboratories and large-scale sequencing projects. Most submissions are made using the BankIt (Web) or Sequin programs and accession numbers are assigned by GenBank staff upon receipt. Data exchange with the EMBL Data Library and the DNA Data Bank of Japan helps ensure comprehensive worldwide coverage. GenBank data is accessible through NCBI's integrated retrieval system, Entrez, which integrates data from the major DNA and protein sequence databases along with taxonomy, genome, mapping and protein structure information, plus the biomedical literature via PubMed. Sequence similarity searching is provided by the BLAST family of programs. Complete bimonthly releases and daily updates of the GenBank database are available by FTP. NCBI also offers a wide range of WWW retrieval and analysis services based on GenBank data. The GenBank database and related resources are freely accessible via the NCBI home page at http://www.ncbi.nlm.nih.gov

18.
GenBank.   Cited by: 4 (self-citations: 1; citations by others: 3)
The GenBank sequence database incorporates DNA sequences from all available public sources, primarily through the direct submission of sequence data from authors and from large-scale sequencing projects. Data exchange with the EMBL Data Library and the DNA Data Bank of Japan helps ensure comprehensive coverage. GenBank continues to focus on quality control and annotation while expanding data coverage and retrieval services. An integrated retrieval system, known as Entrez, incorporates data from the major DNA and protein sequence databases, along with genome maps and protein structure information. MEDLINE abstracts from published articles describing the sequences are also included as an additional source of biological annotation. Sequence similarity searching is offered through the BLAST family of programs. All of NCBI's services are offered through the World Wide Web. In addition, there are specialized server/client versions as well as FTP and e-mail server access.

19.
Identification of fern gametophytes is generally hampered by low morphological complexity. Here we explore an alternative: DNA‐based identification. We obtained a plastid rbcL sequence from a sterile gametophyte of unknown origin (cultivated for more than 30 years) and employed BLAST to determine its affinities. Using this approach, we identified the gametophyte as Osmunda regalis. To evaluate the robustness of this determination, and the usefulness of rbcL in differentiating among species, we conducted a phylogenetic analysis of osmundaceous fern sequences. Based on our results, it is evident that DNA‐based identification has considerable potential in exploring the ecology of fern gametophytes.

20.
We have constructed a subtractive cDNA library from regenerating Retzius cells of the leech, Hirudo medicinalis. It is highly enriched in sequences up-regulated during nerve regeneration. Sequence analysis of selected recombinants has identified both novel sequences and sequences homologous to molecules characterised in other species. Homologies include α-tubulin, a calmodulin-like protein, CAAT/enhancer-binding protein (C/EBP), protein 4.1 and synapsin. These types of proteins are exactly those predicted to be associated with axonal growth and their identification confirms the quality of the library. Most interesting, however, is the isolation of 5 previously uncharacterised cDNAs which appear to be up-regulated during regeneration. Their analysis is likely to provide new information on the molecular mechanisms of neuronal regeneration. Data deposition: The sequence of Hm C/EBPγ has been deposited in the EMBL database. Accession no. U67068
