Similar literature
 Found 20 similar articles; search took 15 ms
1.
MOTIVATION: The maximum expected accuracy optimization criterion for multiple sequence alignment uses pairwise posterior probabilities of residues to align sequences. The partition function methodology is one way of estimating these probabilities. Here, we combine these two ideas for the first time to construct maximal expected accuracy sequence alignments. RESULTS: We bridge the two techniques within the program Probalign. Our results indicate that Probalign alignments are generally more accurate than those of other leading multiple sequence alignment methods (i.e. Probcons, MAFFT and MUSCLE) on the BAliBASE 3.0 protein alignment benchmark. Similarly, Probalign also outperforms these methods on the HOMSTRAD and OXBENCH benchmarks. Probalign ranks statistically highest (P-value < 0.005) on all three benchmarks. Deeper scrutiny of the technique indicates that the improvements are largest on datasets containing N/C-terminal extensions and on datasets containing long proteins of heterogeneous length. These points are demonstrated on both real and simulated data. Finally, our method also produces accurate alignments on long and heterogeneous length datasets containing protein repeats. Here, alignment accuracy scores are at least 10% and 15% higher than those of the other three methods when the standard deviation of length is >300 and >400, respectively. AVAILABILITY: Open source code for Probalign and for producing the simulated data, together with all real and simulated data, is freely available from http://www.cs.njit.edu/usman/probalign

2.
The statistical validation of database search results is a complex issue in bottom-up proteomics. The correct and incorrect peptide spectrum match (PSM) scores overlap significantly, making an accurate assessment of true peptide matches challenging. Since complete separation between the true and false hits is practically never achieved, there is a need for better methods and rescoring algorithms to improve upon the primary database search results. Here we describe the calibration and False Discovery Rate (FDR) estimation of database search scores through a dynamic FDR calculation method, FlexiFDR, which increases both the sensitivity and specificity of search results. Modelling a simple linear regression on the decoy hits for different charge states, the method maximized the number of true positives and reduced the number of false negatives in several standard datasets of varying complexity (18-mix, 49-mix, 200-mix) and a few complex datasets (E. coli and Yeast) obtained from a wide variety of MS platforms. The net positive gain for correct spectral and peptide identifications was up to 14.81% and 6.2%, respectively. The approach is applicable to different search methodologies: separate as well as concatenated database searches, high mass accuracy searches, and semi-tryptic and modification searches. FlexiFDR was also applied to Mascot results and showed better performance than before. We have shown that an appropriate threshold learnt from decoys can be very effective in improving the database search results. FlexiFDR adapts itself to different instruments, data types and MS platforms. It learns from the decoy hits and sets a flexible threshold that automatically aligns itself to the underlying variables of data quality and size.
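The decoy-based thresholding idea in this abstract can be illustrated with a minimal sketch. This is the generic target-decoy FDR estimate, not FlexiFDR's charge-state regression, which the abstract does not specify in detail. The function names and the `(score, is_decoy)` input layout are assumptions made for illustration, and the estimate assumes target and decoy databases of equal size.

```python
def decoy_fdr(psms, threshold):
    """Estimate the FDR among target PSMs scoring >= threshold.

    psms: list of (score, is_decoy) pairs (hypothetical input layout).
    The estimate is decoys / targets above the threshold, capped at 1.0.
    """
    targets = sum(1 for score, is_decoy in psms if score >= threshold and not is_decoy)
    decoys = sum(1 for score, is_decoy in psms if score >= threshold and is_decoy)
    if targets == 0:
        return 0.0
    return min(1.0, decoys / targets)


def lowest_threshold_at_fdr(psms, max_fdr):
    """Return the lowest observed score whose estimated FDR is <= max_fdr, or None."""
    best = None
    for threshold in sorted({score for score, _ in psms}, reverse=True):
        if decoy_fdr(psms, threshold) <= max_fdr:
            best = threshold
    return best


if __name__ == "__main__":
    example = [(5.0, False), (4.0, False), (3.0, True), (2.0, False), (1.0, True)]
    print(decoy_fdr(example, 3.0))
    print(lowest_threshold_at_fdr(example, 0.0))
```

A threshold chosen this way depends only on the decoy hits in the dataset at hand, which is the sense in which a decoy-learnt threshold adapts to data quality and size.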

3.
Comprehensive understanding of biological systems requires efficient and systematic assimilation of high-throughput datasets in the context of the existing knowledge base. A major limitation in the field of proteomics is the lack of an appropriate software platform that can synthesize a large number of experimental datasets in the context of the existing knowledge base. Here, we describe a software platform, termed PROTEOME-3D, that utilizes three essential features for systematic analysis of proteomics data: creation of a scalable, queryable, customized database for identified proteins from published literature; graphical tools for displaying proteome landscapes and trends from multiple large-scale experiments; and interactive data analysis that facilitates identification of crucial networks and pathways. Thus, PROTEOME-3D offers a standardized platform to analyze high-throughput experimental datasets for the identification of crucial players in co-regulated pathways and cellular processes.

4.
As the reliable identification of proteins by tandem mass spectrometry becomes increasingly common, the full characterization of large data sets of proteins remains a difficult challenge. Our goal was to survey the proteome of a human T-cell lymphoma-derived cell line in a single set of experiments and present an automated method for the annotation of lists of proteins. A downstream application of these data includes the identification of novel pathogenetic and candidate diagnostic markers of T-cell lymphoma. Total protein isolated from cytoplasmic, membrane, and nuclear fractions of the SUDHL-1 T-cell lymphoma cell line was resolved by SDS-PAGE, and the entire gel lanes were digested and analyzed by tandem mass spectrometry. Acquired data files were searched against the UniProt protein database using the SEQUEST algorithm. Search results for each subcellular fraction were analyzed using INTERACT and ProteinProphet. All protein identifications with an error rate of less than 10% were directly exported into Excel and analyzed using GOMiner (NIH/NCI). The Gene Ontology molecular function and cell location data were summarized for the identified proteins and the results exported as user-interactive directed acyclic graphs. A total of 1105 unique proteins were identified and fully annotated, including numerous proteins that had not been previously characterized in lymphoma, in functional categories such as cell adhesion, migration, signaling, and stress response. This study demonstrates the utility of currently available bioinformatics tools for the robust identification and annotation of large numbers of proteins in a batchwise fashion.

5.
6.
MOTIVATION: The rapid increase in the amount of protein sequence data has created a need for an automated identification of evolutionarily related subgroups from large datasets. The existing methods typically require a priori specification of the number of putative groups, which defines the resolution of the classification solution. RESULTS: We introduce a Bayesian model-based approach to simultaneous identification of evolutionary groups and conserved parts of the protein sequences. The model-based approach provides an intuitive and efficient way of determining the number of groups from the sequence data, in contrast to the ad hoc methods often exploited for similar purposes. Our model recognizes the areas in the sequences that are relevant for the clustering and regards other areas as noise. We have implemented the method using a fast stochastic optimization algorithm which yields a clustering associated with the estimated maximum posterior probability. The method has been shown to have high specificity and sensitivity in simulated and real clustering tasks. With real datasets the method also highlights the residues close to the active site. AVAILABILITY: Software 'kPax' is available at http://www.rni.helsinki.fi/jic/softa.html

7.
The structural annotation of proteins with no detectable homologs of known 3D structure identified using sequence‐search methods is a major challenge today. We propose an original method that computes the conditional probabilities for the amino‐acid sequence of a protein to fit to known protein 3D structures using a structural alphabet, known as “Protein Blocks” (PBs). PBs constitute a library of 16 local structural prototypes that approximate every part of protein backbone structures. It is used to encode 3D protein structures into 1D PB sequences and to capture sequence to structure relationships. Our method relies on amino acid occurrence matrices, one for each PB, to score global and local threading of query amino acid sequences to protein folds encoded into PB sequences. It does not use any information from residue contacts or sequence‐search methods or explicit incorporation of hydrophobic effect. The performance of the method was assessed with independent test datasets derived from SCOP 1.75A. With a Z‐score cutoff that achieved 95% specificity (i.e., less than 5% false positives), global and local threading showed sensitivity of 64.1% and 34.2%, respectively. We further tested its performance on 57 difficult CASP10 targets that had no known homologs in PDB: 38 compatible templates were identified by our approach and 66% of these hits yielded correctly predicted structures. This method scales up well and offers promising perspectives for structural annotations at the genomic level. It has been implemented in the form of a web‐server that is freely available at http://www.bo‐protscience.fr/forsa .

8.
Trypsin cleaves exclusively C-terminal to arginine and lysine residues   Total citations: 2, self-citations: 0, citations by others: 2
Almost all large-scale projects in mass spectrometry-based proteomics use trypsin to convert protein mixtures into more readily analyzable peptide populations. When searching peptide fragmentation spectra against sequence databases, potentially matching peptide sequences can be required to conform to tryptic specificity, namely, cleavage exclusively C-terminal to arginine or lysine. In many published reports, however, significant numbers of proteins are identified by non-tryptic peptides. Here we use the sub-parts per million mass accuracy of a new ion trap Fourier transform mass spectrometer to achieve more than a 100-fold increased confidence in peptide identification compared with typical ion trap experiments and show that trypsin cleaves solely C-terminal to arginine and lysine. We find that non-tryptic peptides occur only as the C-terminal peptides of proteins and as breakup products of fully tryptic peptides N-terminal to an internal proline. Simulating lower mass accuracy led to a large number of proteins erroneously identified with non-tryptic peptide hits. Our results indicate that such peptide hits in previous studies should be re-examined and that peptide identification should be based on strict trypsin specificity.
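The strict specificity rule stated in this abstract (cleavage only C-terminal to arginine or lysine) can be written down as a short in silico digest. The sketch below is illustrative only: the function names are hypothetical, missed cleavages are not modelled, and the example sequence is made up.

```python
def tryptic_digest(protein):
    """Split a protein sequence after every K or R (no missed cleavages)."""
    peptides = []
    start = 0
    for i, residue in enumerate(protein):
        if residue in "KR":
            peptides.append(protein[start:i + 1])
            start = i + 1
    if start < len(protein):
        # The protein's C-terminal peptide need not end in K or R.
        peptides.append(protein[start:])
    return peptides


def is_fully_tryptic(peptide, protein):
    """True if some occurrence of peptide has both ends consistent with K/R cleavage."""
    if not peptide:
        return False
    start = protein.find(peptide)
    while start != -1:
        end = start + len(peptide)
        n_term_ok = start == 0 or protein[start - 1] in "KR"
        c_term_ok = end == len(protein) or peptide[-1] in "KR"
        if n_term_ok and c_term_ok:
            return True
        start = protein.find(peptide, start + 1)
    return False


if __name__ == "__main__":
    print(tryptic_digest("MKTAYRAKQ"))
    print(is_fully_tryptic("AYR", "MKTAYRAKQ"))
```

A database search restricted to peptides passing a check like `is_fully_tryptic` corresponds to the strict trypsin specificity the abstract recommends.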

9.
10.
11.
Biclustering has emerged as an important approach to the analysis of large-scale datasets. A biclustering technique identifies a subset of rows that exhibit similar patterns on a subset of columns in a data matrix. Many biclustering methods have been proposed, and most, if not all, algorithms are developed to detect regions of “coherence” patterns. These methods perform unsatisfactorily if the purpose is to identify biclusters of a constant level. This paper presents a two-step biclustering method to identify constant level biclusters for binary or quantitative data. This algorithm identifies the maximal dimensional submatrix such that the proportion of non-signals is less than a pre-specified tolerance δ. The proposed method has much higher sensitivity and slightly lower specificity than several prominent biclustering methods in the analysis of two synthetic datasets. It was further compared with the Bimax method for two real datasets. The proposed method was shown to perform most robustly in terms of sensitivity, number of biclusters and number of serotype-specific biclusters identified. However, dichotomization using different signal level thresholds usually leads to different sets of biclusters; this also occurs in the present analysis.
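The acceptance criterion in this abstract (proportion of non-signals below a tolerance δ) is easy to state in code. The sketch below only checks the criterion for a given candidate submatrix of a binary matrix; it does not implement the paper's two-step search for the maximal submatrix, and the function names are hypothetical.

```python
def non_signal_proportion(matrix, rows, cols):
    """Fraction of cells equal to 0 in the submatrix picked out by rows and cols."""
    cells = [matrix[r][c] for r in rows for c in cols]
    if not cells:
        raise ValueError("submatrix must contain at least one cell")
    return sum(1 for value in cells if value == 0) / len(cells)


def is_constant_bicluster(matrix, rows, cols, delta):
    """True if the proportion of non-signals is strictly below the tolerance delta."""
    return non_signal_proportion(matrix, rows, cols) < delta


if __name__ == "__main__":
    data = [
        [1, 1, 0],
        [1, 1, 1],
        [0, 1, 1],
    ]
    print(is_constant_bicluster(data, [0, 1], [0, 1], delta=0.1))
    print(is_constant_bicluster(data, [0, 1, 2], [0, 1, 2], delta=0.1))
```

For quantitative data, the matrix would first have to be dichotomized at a signal threshold, and as the abstract notes, different thresholds generally yield different biclusters.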

12.
Determining the error rate for peptide and protein identification accurately and reliably is necessary to enable evaluation and cross-comparisons of high throughput proteomics experiments. Currently, peptide identification is based either on preset scoring thresholds or on probabilistic models trained on datasets that are often dissimilar to experimental results. The false discovery rates (FDR) and peptide identification probabilities for these preset thresholds or models often vary greatly across different experimental treatments, organisms, or instruments used in specific experiments. To overcome these difficulties, randomized databases have been used to estimate the FDR. However, the cumulative FDR may include low probability identifications when there are a large number of peptide identifications and exclude high probability identifications when there are few. To overcome this logical inconsistency, this study expands the use of randomized databases to generate experiment-specific estimates of peptide identification probabilities. These experiment-specific probabilities are generated by logistic and Loess regression models of the peptide scores obtained from original and reshuffled database matches. These experiment-specific probabilities are shown to approximate very well the "true" probabilities based on known standard protein mixtures across different experiments. Probabilities generated by the earlier Peptide_Prophet and more recent LIPS models are shown to differ significantly from this study's experiment-specific probabilities, especially for unknown samples. The experiment-specific probabilities reliably estimate the accuracy of peptide identifications and overcome potential logical inconsistencies of the cumulative FDR. This estimation method is demonstrated using a Sequest database search, LIPS model, and a reshuffled database. However, this approach is generally applicable to any search algorithm, peptide scoring, and statistical model when using a randomized database.
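The contrast this abstract draws between a cumulative FDR and a per-identification probability can be shown with a small sketch. It replaces the logistic and Loess regressions of the study with a much cruder binned estimate, so it is a stand-in rather than the published method. The function name, the bin edges, and the assumption that the original and reshuffled databases are the same size are all illustrative.

```python
def binned_probabilities(target_scores, decoy_scores, edges):
    """Local estimate of P(correct) for each score bin [edges[i], edges[i+1]).

    In each bin the estimate is 1 - n_decoy / n_target, floored at 0.0.
    Bins without target matches yield None because no estimate is possible.
    """
    estimates = []
    for low, high in zip(edges, edges[1:]):
        n_target = sum(1 for s in target_scores if low <= s < high)
        n_decoy = sum(1 for s in decoy_scores if low <= s < high)
        if n_target == 0:
            estimates.append(None)
        else:
            estimates.append(max(0.0, 1.0 - n_decoy / n_target))
    return estimates


if __name__ == "__main__":
    targets = [1.2, 1.5, 1.8, 2.1, 2.6, 3.2, 3.5]
    decoys = [1.1, 1.4, 1.9, 2.4]
    print(binned_probabilities(targets, decoys, [1, 2, 3, 4]))
```

Unlike a single cumulative cutoff, each identification here receives the probability of its own score bin, which is the property the experiment-specific probabilities are meant to provide; a regression model would smooth these estimates across neighbouring scores.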

13.
The phosphatase and tensin homolog (PTEN) tumor suppressor is a multifunctional protein deregulated in many types of cancer. To date, a comprehensive documentation of PTEN interacting proteins has not been performed. The goal of our study was to characterize the PTEN interactome using affinity pull-down and tandem mass spectrometry (MS/MS). Wild-type PTEN cDNA was inserted into pTRC-His2 vector to create a 6-His tagged protein, which was expressed in Escherichia coli. Lysate from a human lymphoma cell line was used in pull-down assays, utilizing affinity for nickel-agarose beads. Bound proteins were eluted with imidazole, digested and analyzed on an LCQ DecaXP ion trap mass spectrometer. The nickel affinity pull-down efficiency was evaluated by sodium dodecyl sulfate-polyacrylamide gel electrophoresis and Western blot analysis. Acquired data were searched against the NCBI nr.fasta nonredundant protein database using the SEQUEST algorithm and screened using INTERACT and ProteinProphet. All experiments were performed in duplicate with 6-His-lacZ serving as control. A total of 79 proteins were identified in the wild-type 6-His-PTEN pull-down by MS/MS. We further validated a subset of the proteins present in the PTEN interactome by performing immunoprecipitation using an anti-PTEN antibody and establishing the presence of the proteins in the immunocomplex by Western blot analysis. A search of published PTEN interactions was also performed using Online Mendelian Inheritance in Man, Human Protein Reference Database, the IntAct Project database, and PubMed. This in silico analysis confirmed 42 out of 79 (53%) of the proteins identified by MS/MS. The remaining 37 proteins represent probable PTEN interactions not previously documented in public databases or reported in the literature. These results highlight the value of combining in vitro biochemical approaches with in silico analyses for a comprehensive study of protein-protein interactions.

14.
The advent of next-generation sequencing technologies has greatly promoted the field of metagenomics, which studies genetic material recovered directly from an environment. Characterization of the genomic composition of a metagenomic sample is essential for understanding the structure of the microbial community. Multiple genomes contained in a metagenomic sample can be identified and quantitated through homology searches of sequence reads with known sequences catalogued in reference databases. Traditionally, reads with multiple genomic hits are assigned to non-specific or high ranks of the taxonomy tree, thereby impairing accurate estimates of the relative abundance of multiple genomes present in a sample. Instead of assigning reads one by one to the taxonomy tree as many existing methods do, we propose a statistical framework to model the identified candidate genomes to which sequence reads have hits. After obtaining the estimated proportion of reads generated by each genome, sequence reads are assigned to the candidate genomes and the taxonomy tree based on the estimated probability by taking into account both sequence alignment scores and estimated genome abundance. The proposed method is comprehensively tested on both simulated datasets and two real datasets. It assigns reads to the low taxonomic ranks very accurately. Our statistical approach of taxonomic assignment of metagenomic reads, TAMER, is implemented in R and available at http://faculty.wcas.northwestern.edu/hji403/MetaR.htm.
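The two-stage idea in this abstract (first estimate genome proportions, then assign each read using both its alignment evidence and those proportions) matches the shape of a standard mixture-model EM. The sketch below is that generic EM, written in Python rather than the R of the TAMER package, and is not claimed to be TAMER's actual model. The likelihood matrix `lik[i][j]` (evidence for read i under genome j, assumed to be derived from alignment scores) and all function names are assumptions; every read is assumed to have at least one positive entry.

```python
def responsibilities(lik, pi):
    """Posterior probability of each genome for each read, given proportions pi."""
    resp = []
    for row in lik:
        weighted = [p * l for p, l in zip(pi, row)]
        total = sum(weighted)
        resp.append([w / total for w in weighted])
    return resp


def em_abundance(lik, n_iter=200):
    """Estimate genome proportions by EM, starting from uniform proportions."""
    n_reads, n_genomes = len(lik), len(lik[0])
    pi = [1.0 / n_genomes] * n_genomes
    for _ in range(n_iter):
        resp = responsibilities(lik, pi)  # E-step
        pi = [sum(r[j] for r in resp) / n_reads for j in range(n_genomes)]  # M-step
    return pi


def assign_reads(lik, pi):
    """Assign each read to the genome with the highest posterior probability."""
    resp = responsibilities(lik, pi)
    return [max(range(len(r)), key=r.__getitem__) for r in resp]


if __name__ == "__main__":
    # Three reads hit only genome 0, one hits only genome 1, one hits both equally.
    lik = [[1.0, 0.0], [1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
    pi = em_abundance(lik)
    print(pi)
    print(assign_reads(lik, pi))
```

In the toy input the ambiguous read is resolved toward the more abundant genome instead of being pushed to a higher taxonomic rank, which is the behaviour the abstract argues for.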

15.
Computational analysis of human protein interaction networks   Total citations: 4, self-citations: 0, citations by others: 4
Large amounts of human protein interaction data have been produced by experiments and prediction methods. However, the experimental coverage of the human interactome is still low in contrast to predicted data. To gain insight into the value of publicly available human protein network data, we compared predicted datasets, high-throughput results from yeast two-hybrid screens, and literature-curated protein-protein interactions. This evaluation is not only important for further methodological improvements, but also for increasing the confidence in functional hypotheses derived from predictions. Therefore, we assessed the quality and the potential bias of the different datasets using functional similarity based on the Gene Ontology, structural iPfam domain-domain interactions, likelihood ratios, and topological network parameters. This analysis revealed major differences between predicted datasets, but some of them also scored at least as high as the experimental ones regarding multiple quality measures. Therefore, since only a small pairwise overlap between most datasets is observed, they may be combined to enlarge the available human interactome data. For this purpose, we additionally studied the influence of protein length on data quality and the number of disease proteins covered by each dataset. We could further demonstrate that protein interactions predicted by more than one method achieve an elevated reliability.
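Two quantities mentioned in this abstract, the pairwise overlap between interaction datasets and the set of interactions supported by more than one method, can be computed with a few lines of set arithmetic. The sketch below is illustrative: the dataset names and protein identifiers are invented, the Jaccard index is one possible overlap measure (the abstract does not say which was used), and interactions are treated as undirected.

```python
from collections import Counter


def normalize(pairs):
    """Represent undirected interactions as frozensets so (A, B) equals (B, A)."""
    return {frozenset(pair) for pair in pairs}


def jaccard(a, b):
    """Overlap of two interaction sets: |intersection| / |union|."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def supported_by(datasets, k=2):
    """Interactions that occur in at least k of the given datasets."""
    counts = Counter(edge for edges in datasets.values() for edge in edges)
    return {edge for edge, n in counts.items() if n >= k}


if __name__ == "__main__":
    datasets = {
        "y2h": normalize([("A", "B"), ("B", "C")]),
        "predicted": normalize([("B", "A"), ("C", "D")]),
        "literature": normalize([("A", "B"), ("C", "D"), ("E", "F")]),
    }
    print(jaccard(datasets["y2h"], datasets["predicted"]))
    print(sorted(sorted(edge) for edge in supported_by(datasets, k=2)))
```

A low Jaccard value between datasets is what motivates merging them, while filtering with `supported_by` trades coverage for the higher reliability the abstract reports for multiply predicted interactions.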

16.
The process of knowledge discovery from big and high dimensional datasets has become a popular research topic. The classification problem is a key task in bioinformatics, business intelligence, decision science, astronomy, physics, etc. Building associative classifiers has been a notable research interest in recent years because of their superior accuracy. In associative classifiers, using under-sampling or over-sampling methods for imbalanced big datasets reduces accuracy or increases running time, respectively. Hence, there is a significant need to create efficient associative classifiers for imbalanced big data problems. These classifiers should be able to handle challenges such as memory usage, running time and efficient exploration of the search space. To this end, efficient calculation of measures is a primary objective for associative classifiers. In this paper, we propose a new efficient associative classifier for big imbalanced datasets. The proposed method is based on Rare-PEARs (a multi-objective evolutionary algorithm that efficiently discovers rare and reliable association rules) and is able to evaluate rules in a distributed manner by using a new data storage format. This format simplifies the calculation of measures and is fully compatible with the MapReduce programming model. We have applied the proposed method (RPII) to a well-known big dataset (ECBDL’14) and have compared our results with seven other learning methods. The experimental results show that RPII outperforms the other methods in the sensitivity and final score measures (the values of the sensitivity and final score measures were approximately 0.74 and 0.54, respectively). The results demonstrate that the proposed method is a good candidate for large-scale classification problems; furthermore, it achieves reasonable execution time when the target platform is a typical computer cluster.

17.
Proteins from human liver carcinoma Huh7 cells, representing transformed liver cells, and cultured primary human fetal hepatocytes (HFH) and human HH4 hepatocytes, representing nontransformed liver cells, were extracted and processed for proteome analysis. Proteins from stimulated cells (interferon-alpha treatment for the Huh7 and HFH cells and induction of hepatitis C virus [HCV] proteins for the HH4 cells) and corresponding control cells were labeled with light and heavy cleavable ICAT reagents, respectively. The labeled samples were combined, trypsinized, and subjected to cation-exchange and avidin-affinity chromatography. The resulting cysteine-containing peptides were analyzed by microcapillary LC-MS/MS. The MS/MS spectra were initially analyzed by searching the human International Protein Index database using the SEQUEST software (1). Subsequently, new statistical algorithms were applied to the collective SEQUEST search results of each experiment. First, the PeptideProphet software (2) was applied to discriminate true assignments of MS/MS spectra to peptide sequences from false assignments, to assign a probability value for each identified peptide, and to compute the sensitivity and error rate for the assignment of spectra to sequences in each experiment. Second, the ProteinProphet software (3) was used to infer the protein identifications and to compute probabilities that a protein had been correctly identified, based on the available peptide sequence evidence. The resulting protein lists were filtered by a ProteinProphet probability score p ≥ 0.5, which corresponded to an error rate of less than 5%. A total of 1,296, 1,430, and 1,476 proteins or related protein groups were identified in three subdatasets from the Huh7, HFH, and HH4 cells, respectively. In total, these subdatasets contained 2,486 unique protein identifications from human liver cells. An increase of the threshold to p ≥ 0.9 (corresponding to an error rate of less than 1%) resulted in 2,159 unique protein identifications (1,146, 1,235, and 1,318 for the Huh7, HFH, and HH4 cells, respectively).

18.

Background

In a single proteomic project, tandem mass spectrometers can produce hundreds of millions of tandem mass spectra. However, the majority of tandem mass spectra are of poor quality, and searching them for peptides wastes time. Therefore, quality assessment (before database search) is very useful in the pipeline of protein identification via tandem mass spectra, especially for reducing searching time and decreasing false identifications. Most existing methods for quality assessment are supervised machine learning methods based on a number of features which describe the quality of tandem mass spectra. These methods need training datasets in which the quality of every spectrum is known, and such datasets are usually unavailable for new data.

Results

This study proposes an unsupervised machine learning method for quality assessment of tandem mass spectra without any training dataset. The proposed method estimates the conditional probabilities of spectra being of high quality from the quality assessments based on individual features. The probabilities are estimated by solving a constrained optimization problem. An efficient algorithm is developed to solve the constrained optimization problem and is proved to be convergent. Experimental results on two datasets illustrate that if we search only the tandem spectra judged to be of high quality by the proposed method, we can save about 56% and 62% of database searching time while losing only a small amount of high-quality spectra.

Conclusions

Results indicate that the proposed method performs well for the quality assessment of tandem mass spectra and that the way we estimate the conditional probabilities is effective.

19.
Replacement rate matrices describe the process of evolution at one position in a protein and are used in many applications where proteins are studied with an evolutionary perspective. Several general matrices have been suggested and have proved to be good approximations of the real process. However, there are data for which general matrices are inappropriate, for example, special protein families, certain lineages in the tree of life, or particular parts of proteins. Analysis of such data could benefit from adaption of a data-specific rate matrix. This paper suggests two new methods for estimating replacement rate matrices from independent pairwise protein sequence alignments and also carefully studies Müller-Vingron’s resolvent method. Comprehensive tests on synthetic datasets show that both new methods perform better than the resolvent method in a variety of settings. The best method is furthermore demonstrated to be robust on small datasets as well as practical on very large datasets of real data. Neither short nor divergent sequence pairs have to be discarded, making the method economical with data. A generalization to multialignment data is suggested and used in a test on protein-domain family phylogenies, where it is shown that the method offers family-specific rate matrices that often have a significantly better likelihood than a general matrix. [Reviewing Editor: Dr. Nicolas Galtier]

20.
Proteomic approaches to biological research that will prove the most useful and productive require robust, sensitive, and reproducible technologies for both the qualitative and quantitative analysis of complex protein mixtures. Here we applied the isotope-coded affinity tag (ICAT) approach to quantitative protein profiling, in this case proteins that copurified with lipid raft plasma membrane domains isolated from control and stimulated Jurkat human T cells. With the ICAT approach, cysteine residues of the two related protein isolates were covalently labeled with isotopically normal and heavy versions of the same reagent, respectively. Following proteolytic cleavage of combined labeled proteins, peptides were fractionated by multidimensional chromatography and subsequently analyzed via automated tandem mass spectrometry. Individual tandem mass spectrometry spectra were searched against a human sequence database, and a variety of recently developed, publicly available software applications were used to sort, filter, analyze, and compare the results of two repetitions of the same experiment. In particular, robust statistical modeling algorithms were used to assign measures of confidence to both peptide sequences and the proteins from which they were likely derived, identified via the database searches. We show that by applying such statistical tools to the identification of T cell lipid raft-associated proteins, we were able to estimate the accuracy of peptide and protein identifications made. These tools also allow for determination of the false positive rate as a function of user-defined data filtering parameters, thus giving the user significant control over and information about the final output of large-scale proteomic experiments. With the ability to assign probabilities to all identifications, the need for manual verification of results is substantially reduced, thus making the rapid evaluation of large proteomic datasets possible. 
Finally, by repeating the experiment, information relating to the general reproducibility and validity of this approach to large-scale proteomic analyses was also obtained.
