Similar literature
 20 similar documents found (search time: 15 ms)
1.

Background  

The key to mass-spectrometry-based proteomics is peptide identification, which relies on software analysis of tandem mass spectra. Although each search engine has its strengths, combining the strengths of various search engines is not yet realizable, largely due to the lack of a unified statistical framework that is applicable to any method.

2.
Balding DJ 《Biometrics》2002,58(1):241-244
A recent article in Biometrics (Stockmarr, 1999, 55, 671-677) has generated correspondence (56, 1274-1277; 57, 976-980) reigniting a controversy started by a 1996 report on DNA profile evidence issued by the U.S. National Research Council (NRC). The issue concerns the evidential weight of a DNA profile match when the match results from a search through a profile database. The views of both Stockmarr and the NRC report conflict with those of many statisticians working in the area, and the differing viewpoints lead to dramatically different assessments of evidence. I outline reasons why Stockmarr and the NRC report are wrong. I also briefly discuss possible reasons why forensic applications tend to be problematic for statisticians.

3.
Filtration techniques in the form of rapid elimination of candidate sequences while retaining the true one are key ingredients of database searches in genomics. Although SEQUEST and Mascot perform a conceptually similar task to the tool BLAST, the key algorithmic idea of BLAST (filtration) was never implemented in these tools. As a result MS/MS protein identification tools are becoming too time-consuming for many applications including search for post-translationally modified peptides. Moreover, matching millions of spectra against all known proteins will soon make these tools too slow in the same way that "genome vs genome" comparisons instantly made BLAST too slow. We describe the development of filters for MS/MS database searches that dramatically reduce the running time and effectively remove the bottlenecks in searching the huge space of protein modifications. Our approach, based on a probability model for determining the accuracy of sequence tags, achieves superior results compared to GutenTag, a popular tag generation algorithm. Our tag generating algorithm along with our de novo sequencing algorithm PepNovo can be accessed via the URL http://peptide.ucsd.edu/.

4.
A method for fast database search for all k-nucleotide repeats.
A significant portion of DNA consists of repeating patterns of various sizes, from very small (one, two and three nucleotides) to very large (over 300 nucleotides). Although the functions of these repeating regions are not well understood, they appear important for understanding the expression, regulation and evolution of DNA. For example, increases in the number of trinucleotide repeats have been associated with human genetic disease, including Fragile-X mental retardation and Huntington's disease. Repeats are also useful as a tool in mapping and identifying DNA; the number of copies of a particular pattern at a site is often variable among individuals (polymorphic) and is therefore helpful in locating genes via linkage studies and also in providing DNA fingerprints of individuals. The number of repeating regions is unknown as is the distribution of pattern sizes. It would be useful to search for such regions in the DNA database in order that they may be studied more fully. The DNA database currently consists of approximately 150 million basepairs and is growing exponentially. Therefore, any program to look for repeats must be efficient and fast. In this paper, we present some new techniques that are useful in recognizing repeating patterns and describe a new program for rapidly detecting repeat regions in the DNA database where the basic unit of the repeat has size up to 32 nucleotides. It is our hope that the examples in this paper will illustrate the unrealized diversity of repeats in DNA and that the program we have developed will be a useful tool for locating new and interesting repeats.
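To make the notion of a k-nucleotide repeat concrete, the following is a minimal sketch of a naive exact tandem-repeat scan. It is not the program described in the abstract, whose techniques are not given here; the function name `find_tandem_repeats` and the parameters `max_unit` and `min_copies` are assumptions chosen for illustration, and the scan is far slower than a production tool would need to be.

```python
def find_tandem_repeats(seq, max_unit=32, min_copies=3):
    """Naive scan for exact tandem repeats (illustrative only).

    Returns a list of (start, unit, copies) tuples. The parameter names
    and the min_copies cutoff are assumptions, not taken from the paper.
    """
    hits = []
    n = len(seq)
    for unit_len in range(1, max_unit + 1):
        i = 0
        while i + unit_len <= n:
            unit = seq[i:i + unit_len]
            # Skip units that are themselves repeats of a shorter unit
            # (e.g. "GCAGCA"), so each run is reported once.
            if unit in (unit + unit)[1:-1]:
                i += 1
                continue
            copies = 1
            j = i + unit_len
            # Count consecutive exact copies of the unit.
            while seq[j:j + unit_len] == unit:
                copies += 1
                j += unit_len
            if copies >= min_copies:
                hits.append((i, unit, copies))
                i = j  # jump past the reported run
            else:
                i += 1
    return hits


if __name__ == "__main__":
    print(find_tandem_repeats("GGCAGCAGCAGCAGTT", max_unit=6))
```

The scan costs roughly O(n · max_unit) string comparisons per unit length, which illustrates why a database of the size mentioned above calls for a faster method.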

5.

Background  

Sequence similarity searching is an important and challenging task in molecular biology and next-generation sequencing should further strengthen the need for faster algorithms to process such vast amounts of data. At the same time, the internal architecture of current microprocessors is tending towards more parallelism, leading to the use of chips with two, four and more cores integrated on the same die. The main purpose of this work was to design an effective algorithm to fit with the parallel capabilities of modern microprocessors.

6.
We introduce a metric for local sequence alignments that has utility for accelerating optimal alignment searches without loss of sensitivity. The metric's triangle inequality property permits identification of redundant database entries guaranteed to have optimal alignments to the query sequence that fall below a specified score threshold, thereby permitting comparisons to these entries to be skipped. We prove the existence of the metric for a variety of scoring systems, including the most commonly used ones, and show that a triangle inequality can be established as well for nucleotide-to-protein sequence comparisons. We discuss a database clustering and search strategy that takes advantage of the triangle inequality. The strategy permits moderate but significant acceleration of searches against the widely used "nr" protein database. It also provides a theoretically based method for database clustering in general and provides a standard against which to compare heuristic clustering strategies.
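The pruning idea can be sketched with any true metric. The example below uses Hamming distance on equal-length strings as a stand-in, which is an assumption made purely for illustration; the alignment metric of the abstract is not reproduced here. Each cluster stores the precomputed distance from its representative to every member, and the triangle inequality d(q, m) >= d(q, r) - d(r, m) lets a member be skipped whenever that lower bound already exceeds the search radius.

```python
def hamming(a, b):
    """Hamming distance between equal-length strings (a true metric)."""
    return sum(x != y for x, y in zip(a, b))


def search_with_pruning(query, clusters, max_dist, dist=hamming):
    """Return (hits, skipped) for a radius search over clustered entries.

    clusters: list of (representative, [(member, d(rep, member)), ...]).
    The data layout and names are hypothetical, chosen for this sketch.
    """
    hits, skipped = [], 0
    for rep, members in clusters:
        d_rep = dist(query, rep)
        if d_rep <= max_dist:
            hits.append(rep)
        for member, d_rep_member in members:
            # Triangle inequality: d(query, member) >= d_rep - d_rep_member.
            if d_rep - d_rep_member > max_dist:
                skipped += 1  # guaranteed miss, no comparison needed
                continue
            if dist(query, member) <= max_dist:
                hits.append(member)
    return hits, skipped


if __name__ == "__main__":
    clusters = [
        ("AAAAAAAA", [("AAAAAAAT", 1), ("AAAAAATT", 2)]),
        ("CCCCCCCC", [("CCCCCCCG", 1)]),
    ]
    print(search_with_pruning("AAAAAAAT", clusters, max_dist=1))
```

Pruning of this kind never discards a true hit, which matches the abstract's claim of acceleration without loss of sensitivity; how much work is saved depends on how tight the clusters are.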

7.
In database searches for sequence similarity, matches to a distinct sequence region (e.g., protein domain) are frequently obscured by numerous matches to another region of the same sequence. In order to cope with this problem, algorithms are developed to discard redundant matches. One model for this problem begins with a list of intervals, each with an associated score; each interval gives the range of positions in the query sequence that align to a database sequence, and the score is that of the alignment. If interval I is contained in interval J, and I's score is less than J's, then I is said to be dominated by J. The problem is then to identify each interval that is dominated by at least K other intervals, where K is a given level of "tolerable redundancy." An algorithm is developed to solve the problem in O(N log N) time and O(N*) space, where N is the number of intervals and N* is a precisely defined value that never exceeds N and is frequently much smaller. This criterion for discarding database hits has been implemented in the Blast program, as illustrated herein with examples. Several variations and extensions of this approach are also described.
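The dominance criterion defined above can be checked directly with a quadratic-time reference implementation. This sketch is deliberately the brute-force O(N^2) version, not the O(N log N) algorithm of the abstract; the tuple layout `(start, end, score)` and the function name are assumptions for illustration.

```python
def dominated_intervals(intervals, k):
    """Return indices of intervals dominated by at least k others.

    intervals: list of (start, end, score) tuples (hypothetical layout).
    Interval I is dominated by J when I lies inside J and I's score is
    strictly lower than J's. Brute force, O(N^2) comparisons.
    """
    result = []
    for i, (s_i, e_i, score_i) in enumerate(intervals):
        count = 0
        for j, (s_j, e_j, score_j) in enumerate(intervals):
            contained = s_j <= s_i and e_i <= e_j
            if i != j and contained and score_i < score_j:
                count += 1
        if count >= k:
            result.append(i)
    return result


if __name__ == "__main__":
    hits = [(1, 100, 50), (10, 90, 40), (20, 80, 30), (200, 300, 10)]
    print(dominated_intervals(hits, k=1))
    print(dominated_intervals(hits, k=2))
```

A version like this is mainly useful as a correctness oracle when testing a faster implementation on small inputs.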

8.
Yang JM  Tung CH 《Nucleic acids research》2006,34(13):3646-3659
As more protein structures become available and structural genomics efforts provide structural models in a genome-wide strategy, there is a growing need for fast and accurate methods for discovering homologous proteins and evolutionary classifications of newly determined structures. We have developed 3D-BLAST, in part, to address these issues. 3D-BLAST is as fast as BLAST and calculates the statistical significance (E-value) of an alignment to indicate the reliability of the prediction. Using this method, we first identified 23 states of the structural alphabet that represent pattern profiles of the backbone fragments and then used them to represent protein structure databases as structural alphabet sequence databases (SADB). Our method enhanced BLAST as a search method, using a new structural alphabet substitution matrix (SASM) to find the longest common substructures with high-scoring structured segment pairs from an SADB database. Using personal computers with Intel Pentium4 (2.8 GHz) processors, our method searched more than 10 000 protein structures in 1.3 s and achieved a good agreement with search results from detailed structure alignment methods. [3D-BLAST is available at http://3d-blast.life.nctu.edu.tw].

9.
10.

Background  

Current experimental techniques, especially those applying liquid chromatography mass spectrometry, have made high-throughput proteomic studies possible. The increase in throughput, however, also raises concerns about the accuracy of identification or quantification. Most experimental procedures select only the few most intense parent ions in a given MS scan, each to be fragmented (MS2) separately; most other minor co-eluted peptides with similar chromatographic retention times are ignored and their information is lost.

11.
Chen Y  Hanan J 《Bio Systems》2002,65(2-3):187-197
Models of plant architecture allow us to explore how genotype-environment interactions affect the development of plant phenotypes. Such models generate masses of data organised in complex hierarchies. This paper presents a generic system for creating and automatically populating a relational database from data generated by the widely used L-system approach to modelling plant morphogenesis. Techniques from compiler technology are applied to generate attributes (new fields) in the database, to simplify query development for the recursively-structured branching relationship. Use of biological terminology in an interactive query builder contributes towards making the system biologist-friendly.

12.
Tandem mass spectrometry (MS/MS) combined with protein database searching has been widely used in protein identification. A validation procedure is generally required to reduce the number of false positives. Advanced tools using statistical and machine learning approaches may provide faster and more accurate validation than manual inspection and empirical filtering criteria. In this study, we use two feature selection algorithms based on random forest and support vector machine to identify peptide properties that can be used to improve validation models. We demonstrate that an improved model based on an optimized set of features reduces the number of false positives by 58% relative to the model which used only search engine scores, at the same sensitivity score of 0.8. In addition, we develop classification models based on the physicochemical properties and protein sequence environment of these peptides without using search engine scores. The performance of the best model based on the support vector machine algorithm is at 0.8 AUC, 0.78 accuracy, and 0.7 specificity, suggesting a reasonably accurate classification. The identified properties important to fragmentation and ionization can be either used in independent validation tools or incorporated into peptide sequencing and database search algorithms to improve existing software programs.

13.
It has become standard to evaluate newly devised database search methods in terms of sensitivity and selectivity and to compare them with existing methods. This involves the construction of a suitable evaluation scenario, the execution of the methods, the assessment of their performances, and the presentation of the results. Each of these four phases and their smooth connection usually imposes formidable work. To relieve the evaluator of this burden, a system has been designed with which evaluations can be effected rapidly. It is implemented in the programming language Python whose object-oriented features are used to offer a great flexibility in changing the evaluation design. A graphical user interface is provided which offers the usual amenities such as radio- and checkbuttons or file browsing facilities.

14.
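The dbSNP database provides researchers with a wealth of SNP information, and making full use of its resources can substantially reduce research costs and improve research efficiency. Drawing on the work of our laboratory, we explored the searching and application of the chicken dbSNP database. We conclude that, depending on the research objective, searching and applying the dbSNP database needs to be combined with other databases.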

15.
Proteins are extensively modified after translation due to cellular regulation, signal transduction, or chemical damage. Peptide tandem mass spectrometry can discover post-translational modifications, as well as sequence polymorphisms. Recent efforts have studied modifications at the proteomic scale. In this context, it becomes crucial to assess the accuracy of modification discovery. We discuss methods to quantify the false discovery rate from a search and demonstrate how several features can be used to distinguish valid modifications from search artifacts. We present a tool, PTMFinder, which implements these methods. We summarize the corpus of post-translational modifications identified on large data sets. Thousands of known and novel modification sites are identified, including site-specific modifications conserved over vast evolutionary distances.

16.
A wealth of bioinformatics tools and databases has been created over the last decade and most are freely available to the general public. However, these valuable resources live a shadow existence compared to experimental results and methods that are widely published in journals and relatively easily found through publication databases such as PubMed. For the general scientist as well as bioinformaticists, these tools can deliver great value to the design and analysis of biological and medical experiments, but there is no inventory presenting an up-to-date and easily searchable index of all these resources. To remedy this, the BioWareDB search engine has been created. BioWareDB is an extensive and current catalog of software and databases of relevance to researchers in the fields of biology and medicine, and presently consists of 2800 validated entries. AVAILABILITY: BioWareDB is freely available over the Internet at http://www.biowaredb.org/

17.

Background  

Analysis of complex samples with tandem mass spectrometry (MS/MS) has become routine in proteomic research. However, validation of database search results creates a bottleneck in MS/MS data processing. Recently, methods based on a randomized database have become popular for quality control of database search results. However, it remains unclear how to combine different database search scores to improve the sensitivity of randomized database methods.
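As background for randomized-database quality control, here is a minimal sketch of a commonly used estimate: among matches scoring at or above a threshold, the false discovery rate is approximated by the ratio of randomized (decoy) hits to target hits. This is an assumed, simplified estimator for a single score, not the score-combination method the study goes on to address; the function names and the `(score, is_decoy)` tuple layout are hypothetical.

```python
def fdr_at_threshold(psms, threshold):
    """Estimate FDR as decoy hits / target hits at or above threshold.

    psms: list of (score, is_decoy) pairs (hypothetical layout).
    Returns 0.0 when no target hit passes, since nothing is accepted.
    """
    targets = sum(1 for s, decoy in psms if s >= threshold and not decoy)
    decoys = sum(1 for s, decoy in psms if s >= threshold and decoy)
    if targets == 0:
        return 0.0
    return decoys / targets


def threshold_for_fdr(psms, max_fdr):
    """Return the lowest observed score whose estimated FDR <= max_fdr."""
    for score in sorted({s for s, _ in psms}):
        if fdr_at_threshold(psms, score) <= max_fdr:
            return score
    return None


if __name__ == "__main__":
    psms = [(9.0, False), (8.0, False), (7.0, False),
            (6.0, True), (5.0, False), (4.0, True)]
    print(fdr_at_threshold(psms, 4.0))
    print(threshold_for_fdr(psms, 0.25))
```

With several search scores per match, the same counting applies once the scores are merged into one ranking; how best to do that merging is the open question raised above.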

18.
MOTIVATION: Due to the recent advances in technology of mass spectrometry, there has been an exponential increase in the amount of data being generated in the past few years. Database searches have not been able to keep up with this data explosion. Thus, speeding up the data searches becomes increasingly important in mass-spectrometry-based applications. Traditional database search methods use one-against-all comparisons of a query spectrum against a very large number of peptides generated from in silico digestion of protein sequences in a database, to filter potential candidates from this database followed by a detailed scoring and ranking of those filtered candidates. RESULTS: In this article, we show that we can avoid the one-against-all comparisons. The basic idea is to design a set of hash functions to pre-process peptides in the database such that for each query spectrum we can use the hash functions to find only a small subset of peptide sequences that are most likely to match the spectrum. The construction of each hash function is based on a random spectrum and the hash value of a peptide is the normalized shared peak counts score (cosine) between the random spectrum and the hypothetical spectrum of the peptide. To implement this idea, we first embed each peptide into a unit vector in a high-dimensional metric space. The random spectrum is represented by a random vector, and we use random vectors to construct a set of hash functions called locality sensitive hashing (LSH) for preprocessing. We demonstrate that our mapping is accurate. We show that our method can filter out >95.65% of the spectra without missing any correct sequences, or gain 111 times speedup by filtering out 99.64% of spectra while missing at most 0.19% (2 out of 1014) of the correct sequences. In addition, we show that our method can be effectively used for other mass spectra mining applications such as finding clusters of spectra efficiently and accurately.
SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
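The hashing idea can be illustrated with sign random projection, a standard LSH family for cosine similarity. This is a swapped-in simplification: the abstract describes hash values derived from the cosine between a random spectrum and the peptide's hypothetical spectrum, and its exact construction may differ. All names, the vector dimension, and the toy peptide vectors below are hypothetical.

```python
import random
from collections import defaultdict


def make_hyperplanes(dim, n_bits, seed=0):
    """Draw n_bits random Gaussian vectors, one per hash bit."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(n_bits)]


def signature(vec, planes):
    """One bit per random vector: the sign of its dot product with vec."""
    return tuple(
        1 if sum(p * x for p, x in zip(plane, vec)) >= 0 else 0
        for plane in planes
    )


def build_index(peptide_vectors, planes):
    """Pre-process the database: bucket peptide names by signature."""
    index = defaultdict(list)
    for name, vec in peptide_vectors.items():
        index[signature(vec, planes)].append(name)
    return index


def candidates(query_vec, index, planes):
    """Look up only the bucket that the query spectrum hashes to."""
    return index.get(signature(query_vec, planes), [])


if __name__ == "__main__":
    planes = make_hyperplanes(dim=4, n_bits=8)
    peptides = {  # toy spectrum vectors, purely illustrative
        "PEPTIDEA": [1.0, 0.0, 2.0, 0.0],
        "PEPTIDEB": [0.0, 3.0, 0.0, 1.0],
    }
    index = build_index(peptides, planes)
    print(candidates([2.0, 0.0, 4.0, 0.0], index, planes))
```

Vectors separated by a small angle agree on most bits, so a query is compared in detail only against peptides in its own bucket. In practice several independent hash tables are typically used so that near matches falling in neighbouring buckets are not missed.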

19.
SUMMARY: An algorithm and software are described that provide a fast method to produce a novel, function-oriented visualization of the results of a sequence database search. Text mining of sequence annotations allows position specific plots of potential functional similarity to be compared in a simple compact representation. AVAILABILITY: The application can be accessed via a web server at http://www.compbio.dundee.ac.uk. The RHIMS software may be obtained by request to the authors.

20.
Protein identification is important in proteomics. Proteomic analyses based on mass spectra (MS) constitute innovative ways to identify the components of protein complexes. Instruments can obtain the mass spectrum to an accuracy of 0.01 Da or better, but identification errors are inevitable. This study presents a novel tool, MultiProtIdent, which can identify proteins using additional information about protein-protein interactions and protein functional associations. Both single and multiple Peptide Mass Fingerprints (PMFs) are input to MultiProtIdent, which matches the PMFs to a theoretical peptide mass database. The relationships or interactions among proteins are considered to reduce false positives in PMF matching. Experiments to identify protein complexes reveal that MultiProtIdent is highly promising. The website associated with this study is http://dbms104.csie.ncu.edu.tw/.
