Similar documents
20 similar documents found (search time: 15 ms)
1.
We propose an ab initio method, named DiscoverR, for finding common patterns in two RNA secondary structures. The method represents RNA secondary structures as ordered labeled trees and performs tree pattern discovery using an efficient dynamic programming algorithm. DiscoverR is able to identify and extract the largest common substructures from two RNA molecules of different sizes without prior knowledge of the locations and topologies of these substructures. We also extend DiscoverR to find repeated regions in an RNA secondary structure, and apply this extended method to detect structural repeats in the 3'-untranslated region of a protein kinase gene. We describe the biological significance of a repeated hairpin found by our method, demonstrating the usefulness of the method. DiscoverR is implemented in Java; a jar file including the source code of the program is available for download at http://bioinformatics.njit.edu/DiscoverR.
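A minimal sketch (in Python, not the DiscoverR code itself) of the tree representation mentioned above: an RNA secondary structure in dot-bracket notation is converted into an ordered labeled tree, the data structure on which tree pattern discovery operates. The node labels used here are assumptions chosen for illustration.

```python
class Node:
    def __init__(self, label):
        self.label = label      # "pair" for a base pair, "unpaired" for a loop base
        self.children = []      # ordered children preserve the 5'->3' order

def dotbracket_to_tree(structure):
    """Convert a dot-bracket string such as '((..))..' into an ordered labeled tree."""
    root = Node("root")
    stack = [root]
    for ch in structure:
        if ch == "(":                      # opening a base pair: descend one level
            node = Node("pair")
            stack[-1].children.append(node)
            stack.append(node)
        elif ch == ")":                    # closing a base pair: ascend
            stack.pop()
        else:                              # unpaired base stays at the current level
            stack[-1].children.append(Node("unpaired"))
    return root

if __name__ == "__main__":
    tree = dotbracket_to_tree("((..((...))..))")
    print(len(tree.children))   # 1: a single outermost helix
```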

2.
MOTIVATION: Accurate time series for biological processes are difficult to estimate due to problems of synchronization, temporal sampling and rate heterogeneity. Methods are needed that can utilize multi-dimensional data, such as those resulting from DNA microarray experiments, in order to reconstruct time series from unordered or poorly ordered sets of observations. RESULTS: We present a set of algorithms for estimating temporal orderings from unordered sets of sample elements. The techniques we describe are based on modifications of a minimum-spanning tree calculated from a weighted, undirected graph. We demonstrate the efficacy of our approach by applying these techniques to an artificial data set as well as several gene expression data sets derived from DNA microarray experiments. In addition to estimating orderings, the techniques we describe also provide useful heuristics for assessing relevant properties of sample datasets such as noise and sampling intensity, and we show how a data structure called a PQ-tree can be used to represent uncertainty in a reconstructed ordering. AVAILABILITY: Academic implementations of the ordering algorithms are available as source code (in the programming language Python) on our web site, along with documentation on their use. The artificial 'jelly roll' data set upon which the algorithm was tested is also available from this web site. The publicly available gene expression data may be found at http://genome-www.stanford.edu/cellcycle/ and http://caulobacter.stanford.edu/CellCycle/.
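The following is a minimal sketch of the general idea (not the authors' implementation): build a complete weighted graph from pairwise distances between samples, compute its minimum spanning tree, and read a crude ordering off the tree's longest (diameter) path. It assumes the numpy and networkx packages; the Euclidean distance measure and the use of the diameter path are simplifying assumptions.

```python
import numpy as np
import networkx as nx

def mst_ordering(samples):
    """samples: array of shape (n_samples, n_features); returns sample indices
    along the diameter path of the minimum spanning tree."""
    n = len(samples)
    g = nx.Graph()
    for i in range(n):
        for j in range(i + 1, n):
            g.add_edge(i, j, weight=float(np.linalg.norm(samples[i] - samples[j])))
    mst = nx.minimum_spanning_tree(g)
    ends = nx.periphery(mst)                 # endpoints of a longest (hop-count) path in the tree
    return nx.shortest_path(mst, ends[0], ends[1])

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    t = np.linspace(0, 1, 30)
    curve = np.c_[t, np.sin(3 * t)] + rng.normal(scale=0.01, size=(30, 2))
    print(mst_ordering(curve))               # indices roughly in (or reverse) temporal order
```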

3.
Comparing the 3D structures of proteins is an important but computationally hard problem in bioinformatics. In this paper, we study the problem when much less information is available and fewer assumptions are made. We model the structural alignment of proteins as a combinatorial problem. In the problem, each protein is simply a set of points in 3D space, without sequence order information, and the objective is to discover all large enough alignments for any subset of the input. We propose a data-mining approach for this problem. We first perform geometric hashing of the structures such that points with similar locations in 3D space are hashed into the same bin in the hash table. The novelty is that we consider each bin as a coincidence group and mine for frequent patterns, a well-studied technique in data mining. We observe that these frequent patterns are already potentially large alignments. A simple heuristic is then used to extend the alignments if possible. We implemented the algorithm and tested it using real protein structures. The results were compared with existing tools. They showed that the algorithm is capable of finding conserved substructures that do not preserve sequence order, especially those existing in protein interfaces. The algorithm can also identify conserved substructures of functionally similar structures within a mixture with dissimilar ones. The running time of the program was smaller than or comparable to that of the existing tools.
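A minimal sketch of the binning step described above, under the assumption of a fixed axis-aligned grid (the paper's exact hashing scheme may differ): points with similar 3D coordinates hash to the same bin, and each bin can then be treated as a coincidence group for frequent-pattern mining.

```python
from collections import defaultdict

def geometric_hash(points, cell_size=2.0):
    """points: iterable of (structure_id, (x, y, z)); returns bin key -> set of structure ids."""
    bins = defaultdict(set)
    for structure_id, (x, y, z) in points:
        # Discretize coordinates so that nearby points share a bin.
        key = (int(x // cell_size), int(y // cell_size), int(z // cell_size))
        bins[key].add(structure_id)
    return bins

if __name__ == "__main__":
    pts = [("A", (1.0, 0.5, 3.2)), ("B", (1.3, 0.7, 3.0)), ("A", (10.0, 0.1, 0.2))]
    for cell, members in geometric_hash(pts).items():
        print(cell, sorted(members))
```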

4.
To interpret LC-MS/MS data in proteomics, most popular protein identification algorithms primarily use predicted fragment m/z values to assign peptide sequences to fragmentation spectra. The intensity information is often undervalued, because it is not as easy to predict and incorporate into algorithms. Nevertheless, the use of intensity to assist peptide identification is an attractive prospect and can potentially improve the confidence of matches and generate more identifications. On the basis of our previously reported study of fragmentation intensity patterns, we developed a protein identification algorithm, SeQuence IDentification (SQID), that makes use of the coarse intensity from a statistical analysis. The scoring scheme was validated by comparing with Sequest and X!Tandem using three data sets, and the results indicate an improvement in the number of identified peptides, including unique peptides that are not identified by Sequest or X!Tandem. The software and source code are available under the GNU GPL license at http://quiz2.chem.arizona.edu/wysocki/bioinformatics.htm.
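As a hedged illustration of folding coarse intensity into a match score (this is not SQID's published scoring function), the sketch below counts predicted fragment m/z values matched within a tolerance and weights each match by the relative intensity of the matched peak; the tolerance and the weighting scheme are assumptions.

```python
def intensity_weighted_score(predicted_mz, spectrum, tol=0.5):
    """predicted_mz: theoretical fragment m/z values for a candidate peptide.
    spectrum: list of (mz, intensity) pairs. Returns a simple match score."""
    if not spectrum:
        return 0.0
    max_intensity = max(inten for _, inten in spectrum)
    score = 0.0
    for mz_pred in predicted_mz:
        best = max(
            (inten for mz, inten in spectrum if abs(mz - mz_pred) <= tol),
            default=None,
        )
        if best is not None:
            # Base credit for the m/z match plus a coarse relative-intensity bonus in [0, 1].
            score += 1.0 + best / max_intensity
    return score

if __name__ == "__main__":
    spec = [(175.1, 1200.0), (262.2, 300.0), (375.3, 950.0)]
    print(intensity_weighted_score([175.12, 375.25, 500.0], spec))
```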

5.
Recent advances in high-throughput technologies have made it possible to generate both gene and protein sequence data at an unprecedented rate and scale, thereby enabling entirely new "omics"-based approaches to the analysis of complex biological processes. However, the amount and complexity of data that even a single experiment can produce seriously challenges researchers with limited bioinformatics expertise, who need to handle, analyze and interpret the data before it can be understood in a biological context. Thus, there is an unmet need for tools allowing non-bioinformatics users to interpret large data sets. We have recently developed a method, NNAlign, which is generally applicable to any biological problem where quantitative peptide data is available. This method efficiently identifies underlying sequence patterns by simultaneously aligning peptide sequences and identifying motifs associated with quantitative readouts. Here, we provide a web-based implementation of NNAlign allowing non-expert end-users to submit their data (optionally adjusting method parameters) and, in return, receive a trained method (including a visual representation of the identified motif) that can subsequently be used as a prediction method and applied to unknown proteins/peptides. We have successfully applied this method to several different data sets, including peptide microarray-derived sets containing more than 100,000 data points. NNAlign is available online at http://www.cbs.dtu.dk/services/NNAlign.
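A highly simplified sketch of the alignment idea described above: for each peptide, choose the window that best matches the current motif model, then re-estimate the model from the chosen windows weighted by the quantitative readouts, and iterate. NNAlign itself trains a neural network; the position weight matrix, fixed motif length, and update rule below are stand-in assumptions for illustration only.

```python
import numpy as np

ALPHABET = "ACDEFGHIKLMNPQRSTVWY"
IDX = {aa: i for i, aa in enumerate(ALPHABET)}

def best_window(peptide, pwm, L):
    """Return the start of the length-L window scoring highest under the current model."""
    scores = [sum(pwm[pos][IDX[aa]] for pos, aa in enumerate(peptide[s:s + L]))
              for s in range(len(peptide) - L + 1)]
    return int(np.argmax(scores))

def fit_motif(peptides, readouts, L=4, n_iter=10, seed=0):
    rng = np.random.default_rng(seed)
    pwm = rng.random((L, len(ALPHABET)))           # random initial model
    for _ in range(n_iter):
        counts = np.full((L, len(ALPHABET)), 0.1)  # pseudocounts
        for pep, y in zip(peptides, readouts):
            start = best_window(pep, pwm, L)
            for pos, aa in enumerate(pep[start:start + L]):
                counts[pos, IDX[aa]] += y          # readout-weighted counts
        pwm = np.log(counts / counts.sum(axis=1, keepdims=True))
    return pwm

if __name__ == "__main__":
    peps = ["AAQRKWAA", "CCQRKWDD", "EEQRKWFF", "GGHILMGG"]
    ys = [0.9, 0.8, 0.95, 0.1]
    motif = fit_motif(peps, ys)
    print("".join(ALPHABET[i] for i in motif.argmax(axis=1)))   # consensus of the learned motif
```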

6.
The secondary structure of a 38 kDa core protein from pig skin proteodermatan sulfate (PDS) was investigated in solution using CD and Fourier transform (FT) ir spectroscopy. Both techniques have generally provided complementary data on the secondary structures of proteins. CD spectral analysis has shown that the core protein contains 60% beta-turn and alpha-helical structures, the rest being "unordered" structure. At present, FT ir data do not permit calculation of the quantitative contributions of substructures to the overall secondary structure of the core protein. The CD spectrum of the intact PDS is similar to that of the core protein.

7.
We present an approach to predicting protein structural class that uses amino acid composition and hydrophobic pattern frequency information as input to two types of neural networks: (1) a three-layer back-propagation network and (2) a learning vector quantization network. The results of these methods are compared to those obtained from a modified Euclidean statistical clustering algorithm. The protein sequence data used to drive these algorithms consist of the normalized frequency of up to 20 amino acid types and six hydrophobic amino acid patterns. From these frequency values the structural class predictions for each protein (all-alpha, all-beta, or alpha-beta classes) are derived. Examples consisting of 64 previously classified proteins were randomly divided into multiple training (56 proteins) and test (8 proteins) sets. The best performing algorithm on the test sets was the learning vector quantization network using 17 inputs, obtaining a prediction accuracy of 80.2%. The Matthews correlation coefficients are statistically significant for all algorithms and all structural classes. The differences between algorithms are in general not statistically significant. These results show that information exists in protein primary sequences that is easily obtainable and useful for the prediction of protein structural class by neural networks as well as by standard statistical clustering algorithms.
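A minimal sketch of the simplest part of the input featurization described above: the 20-dimensional normalized amino acid composition vector. The six hydrophobic-pattern frequencies and the neural-network models themselves are omitted.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def composition_vector(sequence):
    """Return the normalized frequency of each of the 20 amino acid types."""
    sequence = sequence.upper()
    total = sum(sequence.count(aa) for aa in AMINO_ACIDS) or 1   # avoid division by zero
    return [sequence.count(aa) / total for aa in AMINO_ACIDS]

if __name__ == "__main__":
    vec = composition_vector("MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ")
    print([round(v, 3) for v in vec])
```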

8.
Greedily building protein networks with confidence
MOTIVATION: With genome sequences complete for human and model organisms, it is essential to understand how individual genes and proteins are organized into biological networks. Much of the organization is revealed by proteomics experiments that now generate torrents of data. Extracting relevant complexes and pathways from high-throughput proteomics data sets has posed a challenge, however, and new methods to identify and extract networks are essential. We focus on the problem of building pathways starting from known proteins of interest. RESULTS: We have developed an efficient, greedy algorithm, SEEDY, that extracts biologically relevant networks from protein-protein interaction data, building out from selected seed proteins. The algorithm relies on our previous study establishing statistical confidence levels for interactions generated by two-hybrid screens and inferred from mass spectrometric identification of protein complexes. We demonstrate the ability to extract known yeast complexes from high-throughput protein interaction data with a tunable parameter that governs the trade-off between sensitivity and selectivity. DNA damage repair pathways are presented as a detailed example. We highlight the ability to join heterogeneous data sets, in this case protein-protein interactions and genetic interactions, and the appearance of cross-talk between pathways caused by re-use of shared components. SIGNIFICANCE AND COMPARISON: The significance of the SEEDY algorithm is that it is fast, with running time O((E + V) log V) for V proteins and E interactions, a single adjustable parameter controls the size of the pathways that are generated, and an associated P-value indicates the statistical confidence that the pathways are enriched for proteins with a coherent function. Previous approaches have focused on extracting sub-networks by identifying motifs enriched in known biological networks. SEEDY provides the complementary ability to perform a directed search based on proteins of interest. AVAILABILITY: SEEDY software (Perl source), data tables and confidence score models (R source) are freely available from the author.
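A hedged sketch of greedy seed expansion over a confidence-weighted interaction graph (illustrative only; SEEDY's exact expansion rule, confidence model, and tunable parameter are defined in the paper): starting from seed proteins, the highest-confidence edge leaving the growing network is repeatedly added until no remaining edge exceeds a threshold. The protein names and confidence values in the example are hypothetical.

```python
import heapq

def greedy_expand(edges, seeds, min_confidence=0.7):
    """edges: dict mapping protein -> list of (neighbor, confidence).
    Returns the set of proteins reached from the seed proteins."""
    network = set(seeds)
    frontier = []
    for s in seeds:
        for nbr, conf in edges.get(s, []):
            heapq.heappush(frontier, (-conf, nbr))   # max-heap via negated confidence
    while frontier:
        neg_conf, prot = heapq.heappop(frontier)
        if -neg_conf < min_confidence:
            break                                    # threshold governs sensitivity vs. selectivity
        if prot in network:
            continue
        network.add(prot)
        for nbr, conf in edges.get(prot, []):
            if nbr not in network:
                heapq.heappush(frontier, (-conf, nbr))
    return network

if __name__ == "__main__":
    ppi = {"RAD51": [("RAD52", 0.9), ("BRCA2", 0.8)],
           "RAD52": [("RAD51", 0.9), ("RPA1", 0.75)],
           "BRCA2": [("RAD51", 0.8)],
           "RPA1": [("RAD52", 0.75), ("POL2", 0.4)]}
    print(sorted(greedy_expand(ppi, {"RAD51"})))
```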

9.

Background  

Many studies have provided algorithms or methods to assess statistical significance in quantitative proteomics when multiple replicates of a protein sample and an LC/MS analysis are available. However, confidence is still lacking when using datasets for biological interpretation without protein sample replicates. Although a fold-change is a conventional threshold that can be used when there are no sample replicates, it does not provide an assessment of statistical significance such as a false discovery rate (FDR), which is an important indicator of the reliability of identifying differentially expressed proteins. In this work, we investigate whether differentially expressed proteins can be detected with statistical significance from a pair of unlabeled protein samples without replicates and with only duplicate LC/MS injections per sample. An FDR is used to gauge the statistical significance of the differentially expressed proteins.
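As a hedged illustration of what gauging significance by FDR means in general (this is the standard Benjamini-Hochberg procedure, not the paper's specific model for duplicate-injection data), the sketch below converts per-protein p-values into an FDR-controlled list of differentially expressed proteins.

```python
def benjamini_hochberg(p_values, alpha=0.05):
    """Return indices of proteins declared differentially expressed at FDR level alpha."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])   # ranks by ascending p-value
    threshold_rank = 0
    for rank, i in enumerate(order, start=1):
        if p_values[i] <= rank * alpha / m:
            threshold_rank = rank                          # largest rank passing the BH criterion
    return [order[k] for k in range(threshold_rank)]

if __name__ == "__main__":
    pvals = [0.001, 0.20, 0.03, 0.004, 0.65, 0.045]
    print(benjamini_hochberg(pvals))   # indices of the significant proteins
```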

10.
The integration of proteomics data with biological knowledge is a recent trend in bioinformatics. A great deal of biological information is available, spread across different sources and encoded in different ontologies (e.g. Gene Ontology). Annotating existing protein data with biological information may enable the use (and the development) of algorithms that use biological ontologies as a framework to mine annotated data. Recently, many methodologies and algorithms that use ontologies to extract knowledge from data, as well as to analyse ontologies themselves, have been proposed and applied to other fields. Conversely, the use of such annotations for the analysis of protein data is a relatively novel research area that is currently becoming more and more central in research. Existing approaches span from the definition of the similarity among genes and proteins on the basis of the annotating terms, to the definition of novel algorithms that use such similarities for mining protein data on a proteome-wide scale. After defining the main concepts of such analysis, this work presents a systematic discussion and comparison of the main approaches. Finally, remaining challenges, as well as possible future directions of research, are presented.
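A minimal sketch of one of the simplest annotation-based similarity measures in this family: the Jaccard index over the sets of ontology terms annotating two proteins. More refined semantic similarity measures exploit the ontology graph and term information content, which are omitted here; the GO identifiers in the example are illustrative.

```python
def annotation_similarity(terms_a, terms_b):
    """Jaccard similarity between two sets of GO term identifiers."""
    a, b = set(terms_a), set(terms_b)
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)

if __name__ == "__main__":
    p1 = {"GO:0006281", "GO:0003677", "GO:0005634"}   # DNA repair, DNA binding, nucleus
    p2 = {"GO:0006281", "GO:0005634", "GO:0008094"}
    print(round(annotation_similarity(p1, p2), 2))
```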

11.
Comparative analyses of cellular interaction networks enable understanding of the cell's modular organization through identification of functional modules and complexes. These techniques often rely on topological features such as connectedness and density, based on the premise that functionally related proteins are likely to interact densely and that these interactions follow similar evolutionary trajectories. Significant recent work has focused on efficient algorithms for identification of such functional modules and their conservation. In spite of algorithmic advances, development of a comprehensive infrastructure for interaction databases is in relative infancy compared to corresponding sequence analysis tools. One critical, and as yet unresolved, aspect of this infrastructure is a measure of the statistical significance of a match, or a dense subcomponent. In the absence of analytical measures, conventional methods rely on computationally expensive simulations based on ad hoc models for quantifying significance. In this paper, we present techniques for analytically quantifying the statistical significance of dense components in reference model graphs. We consider two reference models: a G(n, p) model in which each pair of nodes in a graph has an identical likelihood, p, of sharing an edge, and a two-level G(n, p) model, which accounts for the high-degree hub nodes generally observed in interaction networks. Experiments performed on a rich collection of protein-protein interaction (PPI) networks show that the proposed model provides a reliable means of evaluating the statistical significance of dense patterns in these networks. We also adapt existing state-of-the-art network clustering algorithms by using our statistical significance measure as an optimization criterion. Comparison of the resulting module identification algorithm, SIDES, with existing methods shows that SIDES outperforms existing algorithms in terms of the sensitivity and specificity of identified clusters with respect to available GO annotations.
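A minimal sketch of how density can be assessed analytically under a G(n, p) reference model: under the null, each of the C(k, 2) possible edges among k nodes is present independently with probability p, so the chance of observing at least m edges is a binomial tail. This is illustrative only; the paper's treatment, including the two-level hub-aware model, is more involved.

```python
from math import comb

def dense_subgraph_pvalue(k, m, p):
    """P[at least m edges among C(k, 2) node pairs] under the G(n, p) null model."""
    pairs = comb(k, 2)
    return sum(comb(pairs, e) * p**e * (1 - p)**(pairs - e) for e in range(m, pairs + 1))

if __name__ == "__main__":
    # A 6-node module containing 12 of its 15 possible edges in a sparse network (p = 0.05).
    print(f"{dense_subgraph_pvalue(6, 12, 0.05):.2e}")
```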

12.
Background: Statistical validation of predicted complexes is a fundamental issue in proteomics and bioinformatics. The target is to measure the statistical significance of each predicted complex in terms of p-values. Surprisingly, this issue has not received much attention in the literature; to our knowledge, only a few research efforts have been made in this direction. Methods: In this article, we propose a novel method for calculating the p-value of a predicted complex. The null hypothesis is that there is no difference between the number of edges in the target protein complex and that in the random null model. In addition, we assume that a true protein complex must be a connected subgraph. Based on this null hypothesis, we present an algorithm to compute the p-value of a given predicted complex. Results: We test our method on five benchmark data sets to evaluate its effectiveness. Conclusions: The experimental results show that our method is superior to state-of-the-art algorithms in assessing the statistical significance of candidate protein complexes.
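A hedged Monte Carlo illustration of the edge-count null hypothesis described above (the paper derives the p-value analytically and additionally requires the complex to be a connected subgraph): the null distribution is approximated by sampling random protein sets of the same size and counting their induced edges. The toy network is hypothetical.

```python
import random

def count_induced_edges(adj, nodes):
    """Number of interactions among the given set of proteins."""
    nodes = set(nodes)
    return sum(1 for u in nodes for v in adj.get(u, ()) if v in nodes) // 2

def empirical_pvalue(adj, complex_nodes, n_samples=10_000, seed=0):
    rng = random.Random(seed)
    all_proteins = list(adj)
    observed = count_induced_edges(adj, complex_nodes)
    k = len(complex_nodes)
    hits = sum(
        count_induced_edges(adj, rng.sample(all_proteins, k)) >= observed
        for _ in range(n_samples)
    )
    return (hits + 1) / (n_samples + 1)   # add-one correction avoids p = 0

if __name__ == "__main__":
    adj = {"A": {"B", "C"}, "B": {"A", "C"}, "C": {"A", "B", "D"},
           "D": {"C"}, "E": set(), "F": set(), "G": set()}
    print(empirical_pvalue(adj, ["A", "B", "C"]))
```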

13.
14.

Background  

Translational initiation site (TIS) prediction is a very important and actively studied topic in bioinformatics. In order to complete a comparative analysis, it is desirable to have several benchmark data sets which can be used to test the effectiveness of different algorithms. An ideal benchmark data set should be reliable, representative and readily available. Preferably, proteins encoded by members of the data set should also be representative of the protein population actually expressed in cellular specimens.

15.
16.
Selection of representative protein data sets.
The Protein Data Bank currently contains about 600 data sets of three-dimensional protein coordinates determined by X-ray crystallography or NMR. There is considerable redundancy in the data base, as many protein pairs are identical or very similar in sequence. However, statistical analyses of protein sequence-structure relations require nonredundant data. We have developed two algorithms to extract from the data base representative sets of protein chains with maximum coverage and minimum redundancy. The first algorithm focuses on optimizing a particular property of the selected proteins and works by successive selection of proteins from an ordered list and exclusion of all neighbors of each selected protein. The other algorithm aims at maximizing the size of the selected set and works by successive thinning out of clusters of similar proteins. Both algorithms are generally applicable to other data bases in which criteria of similarity can be defined and relate to problems in graph theory. The largest nonredundant set extracted from the current release of the Protein Data Bank has 155 protein chains. In this set, no two proteins have sequence similarity higher than a certain cutoff (30% identical residues for aligned subsequences longer than 80 residues), yet all structurally unique protein families are represented. Periodically updated lists of representative data sets are available by electronic mail from the file server "netserv@embl-heidelberg.de." The selection may be useful in statistical approaches to protein folding as well as in the analysis and documentation of the known spectrum of three-dimensional protein structures.
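A minimal sketch of the first selection strategy described above: walk an ordered list of chains, keep each chain that has not been excluded, and exclude all of its neighbors, i.e. chains exceeding the sequence-similarity cutoff. The similarity test here is a stand-in for the paper's alignment-based criterion, and the chain identifiers are illustrative.

```python
def select_representatives(ordered_chains, is_similar):
    """ordered_chains: chain ids sorted by preference (e.g. by structure quality).
    is_similar(a, b): True if chains a and b exceed the redundancy cutoff."""
    selected = []
    excluded = set()
    for chain in ordered_chains:
        if chain in excluded:
            continue
        selected.append(chain)
        # Exclude every remaining neighbor of the chain just selected.
        excluded.update(c for c in ordered_chains if c != chain and is_similar(chain, c))
    return selected

if __name__ == "__main__":
    chains = ["1abcA", "2xyzB", "3defA", "4ghiC"]
    pairs_over_cutoff = {("1abcA", "3defA"), ("3defA", "1abcA")}
    print(select_representatives(chains, lambda a, b: (a, b) in pairs_over_cutoff))
```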

17.
Current proteomics technology is limited in resolving the proteome complexity of biological systems. The main issue at stake is to increase throughput and spectra quality so that spatiotemporal dimensions, population parameters and the complexity of protein modifications on a quantitative scale can be considered. MS-based proteomics and protein arrays are the main players in large-scale proteome analysis, and an integration of these two methodologies is powerful but presently not sufficient for detailed quantitative and spatiotemporal proteome characterization. Improvements of instrumentation for MS-based proteomics have been achieved recently, resulting in data sets of approximately one million spectra, which is a large step in the right direction. The corresponding raw data range from 50 to 100 Gb and are frequently made available. Multidimensional LC-MS data sets have been demonstrated to identify and quantitate 2000-8000 proteins from whole cell extracts. The analysis of the resulting data sets requires several steps, from raw data processing, to database-dependent search, statistical evaluation of the search result, quantitative algorithms and statistical analysis of quantitative data. A large number of software tools have been proposed for the above-mentioned tasks. However, it is not the aim of this review to cover all software tools, but rather to discuss common data analysis strategies used by various algorithms for each of the above-mentioned steps in a non-redundant approach and to argue that there are still some areas which need improvements.

18.

Background  

Statistical bioinformatics is the study of biological data sets obtained by new micro-technologies by means of proper statistical methods. For a better understanding of environmental adaptations of proteins, orthologous sequences from different habitats may be explored and compared. The main goal of the DeltaProt Toolbox is to provide users with important functionality that is needed for comparative screening and studies of extremophile proteins and protein classes. Visualization of the data sets is also the focus of this article, since visualizations can play a key role in making the various relationships transparent. This application paper is intended to inform the reader of the existence, functionality, and applicability of the toolbox.

19.
Mass spectrometry (MS) is a technique used for biological studies. It associates a spectrum with a biological sample. A spectrum consists of pairs of values (intensity, m/z), where the intensity measures the abundance of biomolecules (such as proteins) with a given mass-to-charge ratio (m/z) present in the originating sample. In proteomics experiments, MS spectra are used to identify expression patterns in clinical samples that may be responsible for diseases. Recently, to improve the identification of peptides/proteins related to such patterns, the MS/MS process has been used, which performs cascades of mass spectrometric analyses on selected peaks. The latter technique has been demonstrated to improve the identification and quantification of proteins/peptides in samples. Nevertheless, MS analysis deals with a huge amount of data, often affected by noise, thus requiring automatic data management systems. Tools have been developed, and most of the time furnished with the instruments, allowing: (i) spectra analysis and visualization, (ii) pattern recognition, (iii) protein database querying, and (iv) peptide/protein quantification and identification. Currently, most of the tools supporting these phases need to be optimized to improve the protein (and protein function) identification process. In this article we survey applications that support spectrometrists and biologists in obtaining information from biological samples, analyzing the available software for the different phases. We consider different mass spectrometry techniques, and thus different requirements. We focus on tools for (i) data preprocessing, which prepare the results obtained from spectrometers for analysis; (ii) spectra analysis, representation and mining, aimed at identifying common and/or hidden patterns in spectra sets or at classifying data; (iii) database querying to identify peptides; and (iv) improving and boosting the identification and quantification of selected peaks. We outline some open problems and report on requirements that represent new challenges for bioinformatics.

20.
Yu L, Gao L, Kong C. Proteomics, 2011, 11(19): 3826-3834.
In this paper, we present a method for core-attachment complex identification based on maximal frequent patterns (CCiMFP) in yeast protein-protein interaction (PPI) networks. First, we detect subgraphs with high degree as candidate protein cores by mining maximal frequent patterns. Then, using topological and functional similarities, we combine highly similar protein cores and filter out insignificant ones. Finally, the core-attachment complexes are formed by adding attachment proteins to each significant core. We experimentally evaluate the performance of our method, CCiMFP, on yeast PPI networks. Using gold standard sets of protein complexes, Gene Ontology (GO) and localization annotations, we show that our method gains an improvement over previous algorithms in terms of precision, recall, and the biological significance of the predicted complexes. The colocalization scores of our predicted complex sets are higher than those of two known complex sets. Moreover, our method can detect GO-enriched complexes with disconnected cores, compared with other methods based on subgraph connectivity.
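A hedged sketch of the final assembly step described above, in which attachment proteins are added to each core. The attachment rule used here (a protein joins if it interacts with at least half of the core's members) is a common convention and an assumption, not necessarily the rule used by CCiMFP; the toy PPI graph is hypothetical.

```python
def attach_to_core(adj, core, min_fraction=0.5):
    """adj: protein -> set of interacting proteins; core: set of core proteins.
    Returns the core plus any attachment proteins meeting the connectivity rule."""
    core = set(core)
    complex_members = set(core)
    for protein, neighbors in adj.items():
        if protein in core:
            continue
        if len(neighbors & core) >= min_fraction * len(core):
            complex_members.add(protein)
    return complex_members

if __name__ == "__main__":
    ppi = {"A": {"B", "C", "X"}, "B": {"A", "C"}, "C": {"A", "B", "X"},
           "X": {"A", "C"}, "Y": {"A"}}
    print(sorted(attach_to_core(ppi, {"A", "B", "C"})))   # X attaches, Y does not
```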
