Similar articles
Found 20 similar articles (search time: 9 ms)
1.
We derive the optimal number of peaks (defined as the minimum number that provides the required efficiency of spectra identification) in the theoretical spectra as a function of (i) the experimental accuracy, sigma, of the measured ratio m/z; (ii) experimental spectrum density; (iii) size of the database; (iv) number of peaks in the theoretical spectra; and (v) types of ions that the peaks represent. We show that if theoretical spectra are constructed including b and y ions alone, then for sigma = 0.5, which is typical for high-throughput data, peptide chains of eight amino acids or longer can be identified based on the positions of peaks alone, at a rate of false identification below 1%. To discriminate between shorter peptides, additional (e.g., intensity-inferred) information is necessary. We derive the dependence of the probability of false identification on the number of peaks in the theoretical spectra and on the types of ions that the peaks represent. Our results suggest that the class of mass spectrum identification problems, for which more elaborate development of fragmentation rules (such as intensity model) is required, can be reduced to the problems that involve homologous peptides.
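As a rough illustration of what a theoretical spectrum built from b and y ions alone looks like, the sketch below computes singly charged b- and y-ion m/z values from standard monoisotopic residue masses and counts how many theoretical peaks fall within a tolerance of some observed peak. The function names and the use of sigma as a hard matching window are assumptions made for this example; the abstract does not specify the authors' matching procedure.

```python
# Monoisotopic residue masses (Da) for the 20 standard amino acids.
MONO = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
PROTON = 1.007276
WATER = 18.010565


def by_ions(peptide):
    """Return (b_ions, y_ions) as lists of singly charged m/z values."""
    masses = [MONO[aa] for aa in peptide]
    b_ions, y_ions = [], []
    total = 0.0
    for m in masses[:-1]:            # b1 .. b(n-1), N-terminal fragments
        total += m
        b_ions.append(total + PROTON)
    total = 0.0
    for m in reversed(masses[1:]):   # y1 .. y(n-1), C-terminal fragments
        total += m
        y_ions.append(total + WATER + PROTON)
    return b_ions, y_ions


def count_matches(theoretical, observed, sigma=0.5):
    """Count theoretical peaks lying within +/- sigma of any observed peak.

    Using sigma as a hard window is an assumption of this sketch.
    """
    return sum(
        any(abs(t - o) <= sigma for o in observed) for t in theoretical
    )
```

A peptide of n residues yields n - 1 b ions and n - 1 y ions here, so the number of theoretical peaks grows linearly with peptide length, which is consistent with longer peptides being easier to discriminate by peak positions alone.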

2.
Methods for calculating the probabilities of finding patterns in sequences   Cited by: 1 (self-citations: 0, citations by others: 1)
This paper describes the use of probability-generating functions for calculating the probabilities of finding motifs in nucleic acid and protein sequences. Equations and algorithms are given for calculating the probabilities associated with nine different ways of defining motifs. Comparisons are made with searches of random sequences. A higher-level structure, the pattern, is defined as a list of motifs. A pattern also specifies the permitted ranges of spacing allowed between its constituent motifs. Equations for calculating the expected numbers of matches to patterns are given. Received on March 1, 1988; accepted on September 30, 1988
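For orientation only, the simplest quantity of this kind is the expected number of exact matches of a single motif in a random sequence with independent, identically distributed symbols. The sketch below computes it directly; it does not use probability-generating functions and is not the paper's method. The function name and the uniform default base frequencies are assumptions.

```python
def expected_motif_matches(motif, seq_len, freqs=None):
    """Expected number of exact motif occurrences in an i.i.d. random sequence.

    freqs maps each symbol to its probability; a uniform DNA background
    is assumed when it is omitted (an assumption of this sketch).
    """
    if freqs is None:
        freqs = {base: 0.25 for base in "ACGT"}
    p_match = 1.0
    for symbol in motif:
        p_match *= freqs[symbol]
    # Number of start positions at which the motif can begin.
    positions = max(seq_len - len(motif) + 1, 0)
    return positions * p_match
```

By linearity of expectation this value holds even though overlapping occurrences are not independent; the probability of at least one match is where overlap matters and where more elaborate machinery is needed.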

3.
4.
A tool for aligning very similar DNA sequences   Cited by: 4 (self-citations: 0, citations by others: 4)
Results: We have produced a computer program, named sim3, that solves the following computational problem. Two DNA sequences are given, where the shorter sequence is very similar to some contiguous region of the longer sequence. Sim3 determines such a similar region of the longer sequence, and then computes an optimal set of single-nucleotide changes (i.e. insertions, deletions or substitutions) that will convert the shorter sequence to that region. Thus, the alignment scoring scheme is designed to model sequencing errors, rather than evolutionary processes. The program can align a 100 kb sequence to a 1 megabase sequence in a few seconds on a workstation, provided that there are very few differences between the shorter sequence and some region in the longer sequence. The program has been used to assemble sequence data for the Genomes Division at the National Center for Biotechnology Information. Availability: A version of sim3 for UNIX machines can be obtained by anonymous ftp from ncbi.nlm.nih.gov, in the pub/sim3 directory. Contact: For portable versions for Macs and PCs, contact zjing@sunset.nlm.nih.gov.
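The computational problem stated above (find the region of the longer sequence that the shorter one matches with the fewest single-nucleotide changes) can be sketched with a textbook semi-global edit-distance recurrence. This is a plain quadratic-time dynamic program, not sim3's algorithm, which is evidently far faster on near-identical inputs; the function name is hypothetical.

```python
def best_region_edit_distance(short, long_seq):
    """Return (distance, end) for the best match of `short` inside `long_seq`.

    distance: minimum number of insertions, deletions and substitutions
    end: 1-based end position in long_seq of the first best-scoring region
    """
    # Row 0 is all zeros: the match may start anywhere in long_seq for free.
    prev = [0] * (len(long_seq) + 1)
    for i, a in enumerate(short, 1):
        cur = [i] + [0] * len(long_seq)   # column 0: delete i bases of short
        for j, b in enumerate(long_seq, 1):
            cur[j] = min(
                prev[j - 1] + (a != b),   # match or substitution
                prev[j] + 1,              # base of short has no partner
                cur[j - 1] + 1,           # extra base in long_seq
            )
        prev = cur
    # The match may also end anywhere, so take the minimum of the last row.
    best = min(prev)
    return best, prev.index(best)
```

Because unit costs are used for all three operations, the score counts sequencing-error-like changes rather than reflecting an evolutionary substitution model, in the spirit of the scoring scheme the abstract describes.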

5.
Hu Y, Li Y, Lam H. Proteomics, 2011, 11(24): 4702-4711
Spectral library searching is a promising alternative to sequence database searching in peptide identification from MS/MS spectra. The key advantage of spectral library searching is the utilization of more spectral features to improve score discrimination between good and bad matches, and hence sensitivity. However, the coverage of reference spectral libraries is limited by current experimental and computational methods. We developed a computational approach to expand the coverage of spectral libraries with semi-empirical spectra predicted from perturbing known spectra of similar sequences, such as those with single amino acid substitutions. We hypothesized that peptides of similar sequences should produce similar fragmentation patterns, at least in most cases. Our results confirm our hypothesis and specify when this approach can be applied. In actual spectral searching of real data sets, the sensitivity advantage of spectral library searching over sequence database searching can be mostly retained even when all real spectra are replaced by semi-empirical ones. We demonstrated the applicability of this approach by detecting several known non-synonymous single-nucleotide polymorphisms in three large human data sets by spectral searching.
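One plausible reading of "perturbing known spectra" for a single amino acid substitution is to shift every annotated fragment ion that contains the substituted residue by the mass difference, leaving intensities unchanged. The sketch below does only that. The peak tuple layout, the function name, and the restriction to singly charged b and y ions are assumptions of this example; the abstract does not describe the authors' actual procedure.

```python
def perturb_spectrum(peaks, peptide_len, sub_pos, mass_delta):
    """Shift fragment ions that contain the substituted residue.

    peaks: list of (ion_type, ion_index, mz, intensity), singly charged,
           with ion_type "b" or "y" (hypothetical annotation format)
    sub_pos: 1-based position of the substituted residue in the peptide
    mass_delta: new residue mass minus old residue mass, in Da
    """
    shifted = []
    for ion_type, index, mz, intensity in peaks:
        if ion_type == "b":
            # b_i covers residues 1 .. i
            contains = index >= sub_pos
        elif ion_type == "y":
            # y_j covers residues (n - j + 1) .. n
            contains = index >= peptide_len - sub_pos + 1
        else:
            contains = False
        new_mz = mz + mass_delta if contains else mz
        shifted.append((ion_type, index, new_mz, intensity))
    return shifted
```

Keeping intensities fixed encodes the hypothesis that similar sequences fragment similarly; any refinement of intensities near the substitution site would be a further assumption.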

6.

Background  

In the research on protein functional sites, researchers often need to identify binding-site residues on a protein. A commonly used strategy is to find a complex structure from the Protein Data Bank (PDB) that consists of the protein of interest and its interacting partner(s) and calculate binding-site residues based on the complex structure. However, since a protein may participate in multiple interactions, the binding-site residues calculated based on one complex structure usually do not reveal all binding sites on a protein. Thus, this requires researchers to find all PDB complexes that contain the protein of interest and combine the binding-site information gleaned from them. This process is very time-consuming. In particular, combining binding-site information obtained from different PDB structures requires tedious work to align protein sequences. The process becomes overwhelmingly difficult when researchers have a large set of proteins to analyze, which is usually the case in practice.

7.
8.
MOTIVATION: A tool that simultaneously aligns multiple protein sequences, automatically utilizes information about protein domains, and has a good compromise between speed and accuracy will have practical advantages over current tools. RESULTS: We describe COBALT, a constraint based alignment tool that implements a general framework for multiple alignment of protein sequences. COBALT finds a collection of pairwise constraints derived from database searches, sequence similarity and user input, combines these pairwise constraints, and then incorporates them into a progressive multiple alignment. We show that using constraints derived from the conserved domain database (CDD) and PROSITE protein-motif database improves COBALT's alignment quality. We also show that COBALT has reasonable runtime performance and alignment accuracy comparable to or exceeding that of other tools for a broad range of problems. AVAILABILITY: COBALT is included in the NCBI C++ toolkit. A Linux executable for COBALT, and the CDD and PROSITE data used, are available at: ftp://ftp.ncbi.nlm.nih.gov/pub/agarwala/cobalt

9.
The present study examined the suitability of matrix assisted laser desorption/ionisation time-of-flight mass spectrometry (MALDI-TOF MS) for the rapid grouping of bacterial isolates, i.e. dereplication. Dereplication is important in large-scale isolation campaigns and screening programs since it can significantly reduce labor intensity, time and costs in further downstream analyses. Still, current dereplication techniques are time consuming and costly. MALDI-TOF MS is an attractive tool since it performs fast and cheap analyses with the potential of automation. However, its taxonomic resolution for a broad diversity of bacteria remains largely unknown. To verify the suitability of MALDI-TOF MS for dereplication, a total of 249 unidentified bacterial isolates retrieved from the rhizosphere of potato plants were analyzed with both MALDI-TOF MS and repetitive element sequence based polymerase chain reaction (rep-PCR). The latter technique was used as a benchmark. Cluster analysis and inspection of the profiles showed that for 204 isolates (82%) the taxonomic resolution of both techniques was comparable, while for 45 isolates (18%) one of the two techniques had a higher taxonomic resolution. Additionally, 16S rRNA gene sequence analysis was performed on all members of each delineated cluster to gain insight into the identity and sequence similarity between members in each cluster. MALDI-TOF MS proved to have higher reproducibility than rep-PCR and seemed to be more promising with respect to high-throughput analyses, automation, and time and cost efficiency. Its taxonomic resolution was situated at the species to strain level. The present study demonstrated that MALDI-TOF MS is a powerful tool for dereplication.

10.
11.
In this article, we present some simple yet effective statistical techniques for analysing and comparing large DNA sequences. These techniques are based on frequency distributions of DNA words in a large sequence, and have been packaged into a software package called SWORDS. Using sequences available in public domain databases on the Internet, we demonstrate how SWORDS can be conveniently used by molecular biologists and geneticists to unmask biologically important features hidden in large sequences and assess their statistical significance.
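The basic object behind such techniques, a frequency distribution of DNA words (overlapping k-mers), can be sketched in a few lines. This is a generic illustration under the assumption that words are counted at every overlapping position; it is not the SWORDS implementation, and the function name is hypothetical.

```python
from collections import Counter


def word_frequencies(sequence, k):
    """Return relative frequencies of all overlapping words of length k."""
    sequence = sequence.upper()
    counts = Counter(
        sequence[i:i + k] for i in range(len(sequence) - k + 1)
    )
    total = sum(counts.values())
    # An empty dict is returned when the sequence is shorter than k.
    return {word: count / total for word, count in counts.items()}
```

Two sequences can then be compared by applying any distance between the resulting frequency dictionaries; which statistic SWORDS uses for that comparison is not stated in the abstract.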

12.
Sequence annotation is essential for genomics-based research. Investigators of a specific genomic region who have developed abundant local discoveries such as genes and genetic markers, or have collected annotations from multiple resources, can be overwhelmed by the difficulty in creating local annotation and the complexity of integrating all the annotations. Presenting such integrated data in a form suitable for data mining and high-throughput experimental design is even more daunting. DNannotator, a web application, was designed to perform batch annotation on a sizeable genomic region. It takes annotation source data, such as SNPs, genes, primers, and so on, prepared by the end-user and/or a specified target of genomic DNA, and performs de novo annotation. DNannotator can also robustly migrate existing annotations in GenBank format from one sequence to another. Annotation results are provided in GenBank format and in tab-delimited text, which can be imported and managed in a database or spreadsheet and combined with existing annotation as desired. Graphic viewers, such as Genome Browser or Artemis, can display the annotation results. Reference data (reports on the process) facilitating the user's evaluation of annotation quality are optionally provided. DNannotator can be accessed at http://sky.bsd.uchicago.edu/DNannotator.htm.

13.
LDDist is a Perl module implemented in C++ that allows the user to calculate LogDet pair-wise genetic distances for amino acid as well as nucleotide sequence data. It can handle site-to-site rate variation by treating a proportion of the sites as invariant and/or by assigning sites to different, presumably homogeneous, rate categories. The rate-class assignments and invariant proportion can be set explicitly, or estimated by the program; the latter using either of two different capture-recapture methods. The assignment to rate categories in lieu of a phylogeny can be done using the Shannon-Wiener index as a crude token for relative rate.
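For readers unfamiliar with the LogDet distance itself, the sketch below computes one commonly cited normalized form for a pair of aligned sequences: d = -(1/r) [ln det F - (1/2) ln(det Px det Py)], where F is the r-by-r matrix of joint symbol frequencies and Px, Py are the diagonal matrices of each sequence's symbol frequencies. This is a plain-Python illustration with no treatment of invariant sites or rate categories, so it is not a substitute for LDDist; all names are hypothetical, and the choice of this particular normalization is an assumption.

```python
import math


def determinant(matrix):
    """Determinant by Gaussian elimination with partial pivoting."""
    m = [row[:] for row in matrix]
    n = len(m)
    det = 1.0
    for col in range(n):
        pivot = max(range(col, n), key=lambda row: abs(m[row][col]))
        if m[pivot][col] == 0:
            return 0.0
        if pivot != col:
            m[col], m[pivot] = m[pivot], m[col]
            det = -det                      # a row swap flips the sign
        det *= m[col][col]
        for row in range(col + 1, n):
            factor = m[row][col] / m[col][col]
            for k in range(col, n):
                m[row][k] -= factor * m[col][k]
    return det


def logdet_distance(x, y, alphabet="ACGT"):
    """Normalized LogDet distance between two aligned, gap-free sequences.

    Assumes every symbol of the alphabet occurs in both sequences and that
    det F is positive; otherwise the logarithms are undefined.
    """
    index = {symbol: i for i, symbol in enumerate(alphabet)}
    r, n = len(alphabet), len(x)
    joint = [[0.0] * r for _ in range(r)]
    for a, b in zip(x, y):
        joint[index[a]][index[b]] += 1.0 / n
    freq_x = [sum(row) for row in joint]
    freq_y = [sum(joint[i][j] for i in range(r)) for j in range(r)]
    # ln det of a diagonal matrix is the sum of the logs of its entries.
    norm = 0.5 * sum(math.log(f) for f in freq_x + freq_y)
    return -(math.log(determinant(joint)) - norm) / r
```

With this normalization two identical sequences have distance zero, which is a quick sanity check on any implementation.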

14.
We report an isotope labeling shotgun proteome analysis strategy to validate the spectrum-to-sequence assignments generated by using sequence-database searching for the construction of a more reliable MS/MS spectral library. This strategy is demonstrated in the analysis of the E. coli K12 proteome. In the workflow, E. coli cells were cultured in normal and 15N-enriched media. The differentially labeled proteins from the cell extracts were subjected to trypsin digestion and two-dimensional liquid chromatography quadrupole time-of-flight tandem mass spectrometry (2D-LC QTOF MS/MS) analysis. The MS/MS spectra of the two samples were individually searched using Mascot against the E. coli proteome database to generate lists of peptide sequence matches. The two data sets were compared by overlaying the spectra of unlabeled and labeled matches of the same peptide sequence for validation. Two cutoff filters, one based on the number of common fragment ions and another one on the similarity of intensity patterns among the common ions, were developed and applied to the overlaid spectral pairs to reject the low quality or incorrectly assigned spectra. By examining 257,907 and 245,156 spectra acquired from the unlabeled and 15N-labeled samples, respectively, an experimentally validated MS/MS spectral library of tryptic peptides was constructed for E. coli K12 that consisted of 9,302 unique spectra with unique sequence and charge state, representing 7,763 unique peptide sequences. This E. coli spectral library could be readily expanded, and the overall strategy should be applicable to other organisms. Even with this relatively small library, it was shown that more peptides could be identified with higher confidence using the spectral search method than by sequence-database searching.

15.
Small I, Peeters N, Legeai F, Lurin C. Proteomics, 2004, 4(6): 1581-1590
Probably more than 25% of the proteins encoded by the nuclear genomes of multicellular eukaryotes are targeted to membrane-bound compartments by N-terminal targeting signals. The major signals are those for the endoplasmic reticulum, the mitochondria, and in plants, plastids. The most abundant of these targeted proteins are well-known and well-studied, but a large proportion remain unknown, including most of those involved in regulation of organellar gene expression or regulation of biochemical pathways. The discovery and characterization of these proteins by biochemical means will be long and difficult. An alternative method is to identify candidate organellar proteins via their characteristic N-terminal targeting sequences. We have developed a neural network-based approach (Predotar: Prediction of Organelle Targeting sequences) for identifying genes encoding these proteins amongst eukaryotic genome sequences. The power of this approach for identifying and annotating novel gene families has been illustrated by the discovery of the pentatricopeptide repeat family.

16.
NMR spectroscopy is a widely used technique for characterizing the structure and dynamics of macromolecules. Often large amounts of NMR data are required to characterize the structure of proteins. To save valuable time and resources on data acquisition, simulated data are useful in the developmental phase, for data analysis, and for comparison with experimental data. However, existing tools for this purpose can be difficult to use, are sometimes specialized for certain types of molecules or spectra, or produce data that are too idealized. Here we present a fast, flexible and robust tool, VirtualSpectrum, for generating peak lists for most multi-dimensional NMR experiments for both liquid and solid state NMR. It is possible to tune the quality of the generated peak lists to include sources of artifacts from peak overlap, noise and missing signals. VirtualSpectrum uses an analytic expression to represent the spectrum and derive the peak positions, seamlessly handling overlap between signals. We demonstrate our tool by comparing simulated and experimental spectra for different multi-dimensional NMR spectra and systematically analyzing three cases where overlap between peaks is particularly relevant: solid state NMR data, liquid state NMR homonuclear 1H and 15N-edited spectra, and 2D/3D heteronuclear correlation spectra of unstructured proteins. We analyze the impact of protein size and secondary structure on peak overlap and on the accuracy of structure determination based on data of different qualities simulated by VirtualSpectrum.

17.
MOTIVATION: The study and comparison of mutational spectra is an important problem in molecular biology, because these spectra often reveal important features of the action of various mutagens and the functioning of repair/replication enzymes. As is known, mutability varies significantly along nucleotide sequences: mutations often concentrate at certain positions in a sequence, otherwise termed 'hotspots'. RESULTS: Herein, we propose a regression analysis method based on the use of regression trees in order to analyse the influence of nucleotide context on the occurrence of such hotspots. The REGRT program developed has been tested on simulated and real mutational spectra. For the G:C→T:A mutational spectra induced by Sn1 alkylating agents (nine spectra), the prediction accuracy was 0.99. AVAILABILITY: The REGRT program is available upon request from V.Berikov.

18.
MOTIVATION: Structural RNA genes exhibit unique evolutionary patterns that are designed to conserve their secondary structures; these patterns should be taken into account while constructing accurate multiple alignments of RNA genes. The Sankoff algorithm is a natural alignment algorithm that includes the effect of base-pair covariation in the alignment model. However, the extremely high computational cost of the Sankoff algorithm precludes its application to most RNA sequences. RESULTS: We propose an efficient algorithm for the multiple alignment of structural RNA sequences. Our algorithm is a variant of the Sankoff algorithm, and it uses an efficient scoring system that reduces the time and space requirements considerably without compromising on the alignment quality. First, our algorithm computes the match probability matrix that measures the alignability of each position pair between sequences as well as the base pairing probability matrix for each sequence. These probabilities are then combined to score the alignment using the Sankoff algorithm. By itself, our algorithm does not predict the consensus secondary structure of the alignment but uses external programs for the prediction. We demonstrate that both the alignment quality and the accuracy of the consensus secondary structure prediction from our alignment are the highest among the other programs examined. We also demonstrate that our algorithm can align relatively long RNA sequences such as the eukaryotic-type signal recognition particle RNA that is approximately 300 nt in length; multiple alignment of such sequences has not been possible by using other Sankoff-based algorithms. The algorithm is implemented in the software named 'Murlet'. AVAILABILITY: The C++ source code of the Murlet software and the test dataset used in this study are available at http://www.ncrna.org/papers/Murlet/. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.

19.
Blogo is a web-based tool that detects and displays statistically significant position-specific sequence bias with reduced background noise. The over-represented and under-represented symbols in a particular position are shown above and below the zero line, respectively. When the sequences are in open reading frames, the background frequency of nucleotides can be calculated separately for the three positions of a codon, thus greatly reducing the background noise. The chi-squared test or Fisher's exact test is used to evaluate the statistical significance of every symbol in every position, and only those that are significant are highlighted in the resulting logo. The Perl source code of the program is freely available and can be run locally. AVAILABILITY: http://acephpx.cropdb.org/blogo/, http://www.bioinformatics.org/blogo/.
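A minimal sketch of the per-position, per-symbol significance idea follows. For each symbol at each alignment column it compares the observed count against the count expected from a background frequency using a one-degree-of-freedom chi-squared statistic (symbol versus not-symbol) and reports only the significant cases. The exact test formulation used by Blogo is not given in the abstract, so this formulation, the 0.05 threshold, and all names are assumptions.

```python
from collections import Counter

# Critical value of the chi-squared distribution, 1 degree of freedom, p = 0.05.
CHI2_CRITICAL = 3.841


def position_bias(sequences, background):
    """Return (position, symbol, direction, chi2) for significant biases.

    sequences: equal-length aligned strings
    background: symbol -> expected frequency, each strictly between 0 and 1
    """
    n = len(sequences)
    significant = []
    for pos in range(len(sequences[0])):
        counts = Counter(seq[pos] for seq in sequences)
        for symbol, p in background.items():
            observed = counts.get(symbol, 0)
            expected = n * p
            # Two cells: "is symbol" and "is not symbol".
            chi2 = ((observed - expected) ** 2 / expected
                    + (observed - expected) ** 2 / (n - expected))
            if chi2 > CHI2_CRITICAL:
                direction = "over" if observed > expected else "under"
                significant.append((pos, symbol, direction, chi2))
    return significant
```

Passing a different background dictionary per codon position would mimic the codon-aware background described above. For small counts an exact test is the safer choice, which is presumably why the tool also offers Fisher's exact test.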

20.
A solution is presented for the problem of how to find ancestral codons which minimize the number of mutations over a given network of species for which character-states of aligned amino acid sequences among the contemporary species are known. Three theorems which allow this “maximum parsimony” problem to be solved are proved; then the use of these theorems in finding maximum parsimony ancestral codons is illustrated on a network of chicken and mammalian alpha globin amino acid sequences at two alignment positions.
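To make the maximum parsimony objective concrete, the sketch below applies Fitch's small-parsimony recurrence to a single character on a rooted binary tree and returns the candidate ancestral states at the root together with the minimum number of changes. This is a deliberately simpler, swapped-in technique: it treats one character with unit-cost changes rather than codons on an unrooted network, and it is not the set of theorems proved in the paper. The tree encoding, function name, and the example states are hypothetical.

```python
def fitch(tree, leaf_states):
    """Fitch's bottom-up pass for one character.

    tree: a leaf name (str) or a pair (left_subtree, right_subtree)
    leaf_states: leaf name -> observed character state
    Returns (candidate_root_states, minimum_number_of_changes).
    """
    if isinstance(tree, str):
        return {leaf_states[tree]}, 0
    left, right = tree
    left_set, left_cost = fitch(left, leaf_states)
    right_set, right_cost = fitch(right, leaf_states)
    common = left_set & right_set
    if common:
        # Children agree on at least one state: no change needed here.
        return common, left_cost + right_cost
    # Children disagree: keep both options and count one change.
    return left_set | right_set, left_cost + right_cost + 1
```

For example, with the hypothetical tree (("chicken", "human"), ("mouse", "rabbit")) and states A, G, G, G, the call returns the root set {"G"} and a single change, placed on the chicken lineage.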
