20 similar articles found (search time: 15 ms)
1.
The potential for obtaining a true mass spectrometric protein identification result depends on the choice of algorithm as well as on experimental factors that influence the information content in the mass spectrometric data. Current methods can never prove definitively that a result is true, but an appropriate choice of algorithm can provide a measure of the statistical risk that a result is false, i.e., the statistical significance. We recently demonstrated an algorithm, Probity, which assigns the statistical significance to each result. For any choice of algorithm, the difficulty of obtaining statistically significant results depends on the number of protein sequences in the sequence collection searched. By simulations of random protein identifications and using the Probity algorithm, we here demonstrate explicitly how the statistical significance depends on the number of sequences searched. We also provide an example of how the practitioner's choice of taxonomic constraints influences the statistical significance.
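As a rough illustration of the database-size effect described in this abstract, the sketch below computes the chance that at least one of N unrelated sequences matches as well as the observed hit. It is not the Probity algorithm: the function name `search_level_p_value` and the assumption that random matches are independent with a common per-sequence p-value are illustrative only.

```python
import math


def search_level_p_value(p_single: float, n_sequences: int) -> float:
    """Chance that at least one of n_sequences random sequences matches
    as well as the observed hit, i.e. 1 - (1 - p_single) ** n_sequences.

    Assumption (not from the abstract): random matches are independent
    and share the same per-sequence p-value p_single.
    """
    if not 0.0 <= p_single <= 1.0:
        raise ValueError("p_single must lie in [0, 1]")
    if n_sequences < 0:
        raise ValueError("n_sequences must be non-negative")
    if p_single == 1.0:
        # log1p(-1.0) is undefined, so handle the certain-match case directly.
        return 1.0 if n_sequences > 0 else 0.0
    # log1p/expm1 keep precision when p_single is tiny and n_sequences is large.
    return -math.expm1(n_sequences * math.log1p(-p_single))


if __name__ == "__main__":
    # The same per-sequence p-value becomes less significant as the
    # searched collection grows (e.g. when taxonomic constraints are relaxed).
    for n in (1_000, 100_000, 10_000_000):
        print(n, search_level_p_value(1e-6, n))
```

Under these assumptions, restricting the search to a smaller taxonomic subset lowers N and therefore lowers the search-level risk of a false result for the same match quality.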
2.
MOTIVATION: Predicting protein function accurately is an important issue in the post-genomic era. To achieve this goal, several approaches have been proposed to deduce the function of unclassified proteins through sequence similarity, co-expression profiles, and other information. Among these methods, the global optimization method (GOM) is an interesting and powerful tool that assigns functions to unclassified proteins based on their positions in a physical interactions network [Vazquez, A., Flammini, A., Maritan, A. and Vespignani, A. (2003) Global protein function prediction from protein-protein interaction networks, Nat. Biotechnol., 21, 697-700]. To boost both the accuracy and speed of GOM, a new prediction method, MFGO (modified and faster global optimization), is presented in this paper, which employs a local optimal repetition method to reduce calculation time, and takes account of topological structure information to achieve a more accurate prediction. CONCLUSION: On four protein interaction datasets, including the Vazquez dataset, YP dataset, DIP-core dataset, and SPK dataset, MFGO was tested and compared with the popular MR (majority rule) and GOM methods. Experimental results confirm MFGO's improvement in both speed and accuracy. In particular, the MFGO method has a distinctive advantage in accurately predicting functions for proteins with few neighbors. Moreover, the robustness of the approach was validated both in a dataset containing a high percentage of unknown proteins and in a dataset disturbed through random insertion and deletion. The analysis shows that a moderate amount of misplaced interactions does not preclude a reliable function assignment.
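The majority rule (MR) baseline mentioned in this abstract can be sketched in a few lines. This is a minimal illustration, not the MFGO or GOM method: the function name `majority_rule`, the one-label-per-protein simplification, and the alphabetical tie-break are all assumptions made for the example.

```python
from collections import Counter


def majority_rule(interactions, known_functions):
    """Predict a function for each unannotated protein as the most common
    function among its annotated interaction partners.

    interactions: iterable of (protein_a, protein_b) undirected pairs.
    known_functions: dict mapping protein -> a single function label
        (a simplification; real annotations may carry several labels).
    """
    # Build an undirected adjacency map from the interaction pairs.
    neighbors = {}
    for a, b in interactions:
        neighbors.setdefault(a, set()).add(b)
        neighbors.setdefault(b, set()).add(a)

    predictions = {}
    for protein, partners in neighbors.items():
        if protein in known_functions:
            continue
        votes = Counter(
            known_functions[p] for p in partners if p in known_functions
        )
        if votes:
            top = max(votes.values())
            # Ties are broken alphabetically so the result is deterministic.
            predictions[protein] = min(f for f, c in votes.items() if c == top)
    return predictions


if __name__ == "__main__":
    edges = [("A", "X"), ("B", "X"), ("C", "X"), ("X", "Y")]
    known = {"A": "kinase", "B": "kinase", "C": "transport"}
    print(majority_rule(edges, known))
```

Protein `Y` in the usage example receives no prediction because its only partner is itself unannotated, which illustrates why purely local voting struggles for proteins with few annotated neighbors.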
3.
Homology-based method for identification of protein repeats using statistical significance estimates
Short protein repeats, frequently with a length between 20 and 40 residues, represent a significant fraction of known proteins. Many repeats appear to possess high amino acid substitution rates and thus recognition of repeat homologues is highly problematic. Even if the presence of a certain repeat family is known, the exact locations and the number of repetitive units often cannot be determined using current methods. We have devised an iterative algorithm based on optimal and sub-optimal score distributions from profile analysis that estimates the significance of all repeats that are detected in a single sequence. This procedure allows the identification of homologues at alignment scores lower than the highest optimal alignment score for non-homologous sequences. The method has been used to investigate the occurrence of eleven families of repeats in Saccharomyces cerevisiae, Caenorhabditis elegans and Homo sapiens accounting for 1055, 2205 and 2320 repeats, respectively. For these examples, the method is both more sensitive and more selective than conventional homology search procedures. The method allowed the detection in the SwissProt database of more than 2000 previously unrecognised repeats belonging to the 11 families. In addition, the method was used to merge several repeat families that previously were supposed to be distinct, indicating common phylogenetic origins for these families.
4.
To improve the prediction accuracy in the regime where template alignment quality is poor, an updated version of TASSER_2.0, namely TASSER_WT, was developed. TASSER_WT incorporates more accurate contact restraints from a new method, COMBCON. COMBCON uses confidence-weighted contacts from PROSPECTOR_3.5, the latest version, PROSPECTOR_4, and a new local structural fragment-based threading algorithm, STITCH, implemented in two variants depending on expected fragment prediction accuracy. TASSER_WT is tested on 622 Hard proteins, the most difficult targets (incorrect alignments and/or templates and incorrect side-chain contact restraints) in a comprehensive benchmark of 2591 nonhomologous, single domain proteins ≤200 residues that cover the PDB at 35% pairwise sequence identity. For 454 of 622 Hard targets, COMBCON provides contact restraints with higher accuracy and number of contacts per residue. TASSER_WT models improve as the contact coverage with confidence weight ≥3 (Fwt≥3cov) increases. When Fwt≥3cov > 1.0 and > 0.4, the average root mean-square deviation of TASSER_WT (TASSER_2.0) models is 4.11 Å (6.72 Å) and 5.03 Å (6.40 Å), respectively. Regarding a structure prediction as successful when a model has a TM-score to the native structure ≥0.4, when Fwt≥3cov > 1.0 and > 0.4, the success rate of TASSER_WT (TASSER_2.0) is 98.8% (76.2%) and 93.7% (81.1%), respectively.
5.
Fahad Saeed Trairak Pisitkun Jason D Hoffert Sara Rashidian Guanghui Wang Marjan Gucek Mark A Knepper 《Proteome science》2013,11(Z1):S14
Phosphorylation site assignment of high throughput tandem mass spectrometry (LC-MS/MS) data is one of the most common and critical aspects of phosphoproteomics. Correctly assigning phosphorylated residues helps us understand their biological significance. The design of common search algorithms (such as Sequest, Mascot etc.) does not incorporate site assignment; therefore additional algorithms are essential to assign phosphorylation sites for mass spectrometry data. The main contribution of this study is the design and implementation of a linear time and space dynamic programming strategy for phosphorylation site assignment referred to as PhosSA. The proposed algorithm uses summation of peak intensities associated with theoretical spectra as an objective function. Quality control of the assigned sites is achieved using a post-processing redundancy criterion that indicates the signal-to-noise ratio properties of the fragmented spectra. The quality assessment of the algorithm was determined using experimentally generated data sets using synthetic peptides for which phosphorylation sites were known. We report that PhosSA was able to achieve a high degree of accuracy and sensitivity with all the experimentally generated mass spectrometry data sets. The implemented algorithm is shown to be extremely fast and scalable with increasing number of spectra (we report up to 0.5 million spectra/hour on a moderate workstation). The algorithm is designed to accept results from both Sequest and Mascot search engines. An executable is freely available at http://helixweb.nih.gov/ESBL/PhosSA/ for academic research purposes.
6.
We present a novel method for the comparison of multiple protein alignments with assessment of statistical significance (COMPASS). The method derives numerical profiles from alignments, constructs optimal local profile-profile alignments and analytically estimates E-values for the detected similarities. The scoring system and E-value calculation are based on a generalization of the PSI-BLAST approach to profile-sequence comparison, which is adapted for the profile-profile case. Tested along with existing methods for profile-sequence (PSI-BLAST) and profile-profile (prof_sim) comparison, COMPASS shows increased abilities for sensitive and selective detection of remote sequence similarities, as well as improved quality of local alignments. The method allows prediction of relationships between protein families in the PFAM database beyond the range of conventional methods. Two predicted relations with high significance are similarities between various Rossmann-type folds and between various helix-turn-helix-containing families. The potential value of COMPASS for structure/function predictions is illustrated by the detection of an intricate homology between the DNA-binding domain of the CTF/NFI family and the MH1 domain of the Smad family.
7.
Alpha-helices stand out as common and relatively invariant secondary structural elements of proteins. However, alpha-helices are not rigid bodies and their deformations can be significant in protein function (e.g. coiled coils). To quantify the flexibility of alpha-helices we have performed a structural principal-component analysis of helices of different lengths from a representative set of protein folds in the Protein Data Bank. We find three dominant modes of flexibility: two degenerate bend modes and one twist mode. The data are consistent with independent Gaussian distributions for each mode. The mode eigenvalues, which measure flexibility, follow simple scaling forms as a function of helix length. The dominant bend and twist modes and their harmonics are reproduced by a simple spring model, which incorporates hydrogen-bonding and excluded volume. As an application, we examine the amount of bend and twist in helices making up all coiled-coil proteins in SCOP. Incorporation of alpha-helix flexibility into structure refinement and design is discussed.
8.
Jakob Toudahl Nielsen Natalia Kulminskaya Morten Bjerring Niels Chr. Nielsen 《Journal of biomolecular NMR》2014,59(2):119-134
The process of resonance assignment represents a time-consuming and potentially error-prone bottleneck in structural studies of proteins by solid-state NMR (ssNMR). Software for the automation of this process is therefore of high interest. Procedures developed through the last decades for solution-state NMR are not directly applicable for ssNMR due to the inherently lower data quality caused by lower sensitivity and broader lines, leading to overlap between peaks. Recently, the first efforts towards procedures specifically aimed at ssNMR have been realized (Schmidt et al. in J Biomol NMR 56(3):243–254, 2013). Here we present a robust automatic method, which can accurately assign protein resonances using peak lists from a small set of simple 2D and 3D ssNMR experiments, applicable in cases with low sensitivity. The method is demonstrated on three uniformly 13C, 15N labeled biomolecules with different challenges on the assignments. In particular, for the immunoglobulin binding domain B1 of streptococcal protein G automatic assignment shows 100% accuracy for the backbone resonances and 91.8% when including all side chain carbons. It is demonstrated, by using a procedure for generating artificial spectra with increasing line widths, that our method, GAMES_ASSIGN, can handle a significant number of overlapping peaks in the assignment. The impact of including different ssNMR experiments is evaluated as well.
9.
Automated sequence-specific protein NMR assignment using the memetic algorithm MATCH (total citations: 4; self-citations: 2; citations by others: 4)
MATCH (Memetic Algorithm and Combinatorial Optimization Heuristics) is a new memetic algorithm for automated sequence-specific polypeptide backbone NMR assignment of proteins. MATCH employs local optimization for tracing partial sequence-specific assignments within a global, population-based search environment, where the simultaneous application of local and global optimization heuristics guarantees high efficiency and robustness. MATCH thus makes combined use of the two predominant concepts in use for automated NMR assignment of proteins. Dynamic transition and inherent mutation are new techniques that enable automatic adaptation to variable quality of the experimental input data. The concept of dynamic transition is incorporated in all major building blocks of the algorithm, where it enables switching between local and global optimization heuristics at any time during the assignment process. Inherent mutation restricts the intrinsically required randomness of the evolutionary algorithm to those regions of the conformation space that are compatible with the experimental input data. Using intact and artificially deteriorated APSY-NMR input data of proteins, MATCH performed sequence-specific resonance assignment with high efficiency and robustness.
10.
Although there have been several papers recommending appropriate experimental designs for ancient-DNA studies, there have been few attempts at statistical analysis. We assume that we cannot decide whether a result is authentic simply by examining the sequence (e.g., when working with humans and domestic animals). We use a maximum-likelihood approach to estimate the probability that a positive result from a sample is (either partly or entirely) an amplification of DNA that was present in the sample before the experiment began. Our method is useful in two situations. First, we can decide in advance how many samples will be needed to achieve a given level of confidence. For example, to be almost certain (95% confidence interval 0.96-1.00, maximum-likelihood estimate 1.00) that a positive result comes, at least in part, from DNA present before the experiment began, we need to analyze at least five samples and controls, even if all samples and no negative controls yield positive results. Second, we can decide how much confidence to place in results that have been obtained already, whether or not there are positive results from some controls. For example, the risk that at least one negative control yields a positive result increases with the size of the experiment, but the effects of occasional contamination are less severe in large experiments.
11.
A key step in network analysis is to partition a complex network into dense modules. Currently, modularity is one of the most popular benefit functions used to partition network modules. However, recent studies suggested that it has an inherent limitation in detecting dense network modules. In this study, we observed that despite the limitation, modularity has the advantage of preserving the primary network structure of the undetected modules. Thus, we have developed a simple iterative Network Partition (iNP) algorithm to partition a network. The iNP algorithm provides a general framework in which any modularity-based algorithm can be implemented in the network partition step. Here, we tested iNP with three modularity-based algorithms: multi-step greedy (MSG), spectral clustering and Qcut. Compared with the original three methods, iNP achieved a significant improvement in the quality of network partition in a benchmark study with simulated networks, identified more modules with significantly better enrichment of functionally related genes in both yeast protein complex network and breast cancer gene co-expression network, and discovered more cancer-specific modules in the cancer gene co-expression network. As such, iNP should have a broad application as a general method to assist in the analysis of biological networks.
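For readers unfamiliar with the benefit function this abstract refers to, the sketch below evaluates the standard Newman-Girvan modularity Q of a given partition. It does not implement iNP, MSG, spectral clustering or Qcut; the function name `modularity` and the restriction to unweighted, undirected edge lists are assumptions made for the example.

```python
def modularity(edges, community):
    """Newman-Girvan modularity of a partition of an undirected,
    unweighted graph:

        Q = sum over communities c of  L_c / m - (d_c / (2 m)) ** 2

    where m is the number of edges, L_c the number of edges inside
    community c, and d_c the summed degree of the nodes in c.

    edges: list of (u, v) pairs, each undirected edge listed once.
    community: dict mapping every node to its community label.
    """
    m = len(edges)
    if m == 0:
        raise ValueError("modularity is undefined for a graph without edges")

    internal_edges = {}  # L_c per community
    degree_sum = {}      # d_c per community
    for u, v in edges:
        cu, cv = community[u], community[v]
        degree_sum[cu] = degree_sum.get(cu, 0) + 1
        degree_sum[cv] = degree_sum.get(cv, 0) + 1
        if cu == cv:
            internal_edges[cu] = internal_edges.get(cu, 0) + 1

    return sum(
        internal_edges.get(c, 0) / m - (d / (2 * m)) ** 2
        for c, d in degree_sum.items()
    )


if __name__ == "__main__":
    # Two triangles joined by a single bridge edge.
    edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
    split = {0: "a", 1: "a", 2: "a", 3: "b", 4: "b", 5: "b"}
    print(modularity(edges, split))
```

Placing every node in one community always gives Q = 0 under this definition, so a positive Q indicates more within-module edges than expected from the degrees alone.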
12.
Most widely used secondary structure assignment methods such as DSSP identify structural elements based on N-H and C=O hydrogen bonding patterns from X-ray or NMR-determined coordinates. Secondary structure assignment algorithms using limited Cα information have been under development as well, but their accuracy is only ~80% compared to DSSP. We have hereby developed SABA (Secondary Structure Assignment Program Based on only Alpha Carbons) with ~90% accuracy. SABA defines a novel geometrical parameter, termed a pseudo center, which is the midpoint of two continuous Cαs. SABA is capable of identifying α-helices, 3(10)-helices, and β-strands with high accuracy by using cut-off criteria on distances and dihedral angles between two or more pseudo centers. In addition to assigning secondary structures to Cα-only structures, algorithms using limited Cα information with high accuracy have the potential to enhance the speed of calculations for high capacity structure comparison.
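The pseudo-center construction described in this abstract is simple to illustrate. The sketch below assumes that "two continuous Cαs" means two sequence-consecutive Cα atoms; the function names are hypothetical, and SABA's actual distance and dihedral cut-offs are not given in the abstract, so none are reproduced here.

```python
import math


def pseudo_centers(ca_coords):
    """Midpoints of each pair of sequence-consecutive C-alpha atoms.

    ca_coords: list of (x, y, z) tuples in chain order.
    Returns a list with len(ca_coords) - 1 midpoints.
    """
    return [
        tuple((a + b) / 2 for a, b in zip(p, q))
        for p, q in zip(ca_coords, ca_coords[1:])
    ]


def pseudo_center_distance(centers, i, j):
    """Euclidean distance between pseudo centers i and j, the kind of
    quantity a cut-off criterion could be applied to."""
    return math.dist(centers[i], centers[j])


if __name__ == "__main__":
    # Toy coordinates, not taken from any real structure.
    trace = [(0.0, 0.0, 0.0), (2.0, 0.0, 0.0), (2.0, 2.0, 0.0)]
    centers = pseudo_centers(trace)
    print(centers, pseudo_center_distance(centers, 0, 1))
```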
13.
Valentí Rull 《Hydrobiologia》1991,220(2):161-165
The statistical relationship between chrysophycean cyst abundances and ecologically known factors, derived from multivariate analyses, is proposed as a useful way to derive palaeoecological information of non-identified cysts. The method is applied to a case-study from the Spanish Pyrenees, and encouraging results are obtained for some of the morphotypes found.
14.
We describe a program STATSEARCH which implements the method of Mott et al. (1989) for searching DNA and protein sequence databanks for statistically significant similarities to a given query sequence. STATSEARCH is written to run in conjunction with the GCG sequence analysis package.
15.
NvAssign: protein NMR spectral assignment with NMRView (total citations: 2; self-citations: 0; citations by others: 2)
MOTIVATION: Nuclear magnetic resonance (NMR) protein studies rely on the accurate assignment of resonances. The general procedure is to (1) pick peaks, (2) cluster data from various experiments or spectra, (3) assign peaks to the sequence and (4) verify the assignments with the spectra. Many algorithms already exist for automating the assignment process (step 3). What is lacking is a flexible interface to help a spectroscopist easily move from clustering (step 2) to assignment algorithms (step 3) and back to verification of the algorithm output with spectral analysis (step 4). RESULTS: A software module, NvAssign, was written for use with NMRView. It is a significant extension of the previous CBCA module. The module provides a flexible interface to cluster data and interact with the existing assignment algorithms. Further, the software module is able to read the results of other algorithms so that the data can be easily verified by spectral analysis. The generalized interface is demonstrated by connecting the clustered data with the assignment algorithms PACES and MONTE using previously assigned data for the lyase domain of DNA polymerase lambda. The spectral analysis program NMRView is now able to read the output of these programs for simplified analysis and verification. AVAILABILITY: NvAssign is available from http://dir.niehs.nih.gov/dirnmr/nvassign
16.
A suite of tests to evaluate the statistical significance of protein sequence similarities is developed for use in data bank searches. The tests are based on the Wilbur-Lipman word-search algorithm, and take into account the sequence lengths and compositions, and optionally the weighting of amino acid matches. The method is extended to allow for the existence of a sequence insertion/deletion within the region of similarity. The accuracy of statistical distributions underlying the tests is validated using randomly generated sequences and real sequences selected at random from the data banks. A computer program to perform the tests is briefly described.
17.
The unparalleled growth in the availability of genomic data offers both a challenge to develop orthology detection methods that are simultaneously accurate and high throughput and an opportunity to improve orthology detection by leveraging evolutionary evidence in the accumulated sequenced genomes. Here, we report a novel orthology detection method, termed QuartetS, that exploits evolutionary evidence in a computationally efficient manner. Based on the well-established evolutionary concept that gene duplication events can be used to discriminate homologous genes, QuartetS uses an approximate phylogenetic analysis of quartet gene trees to infer the occurrence of duplication events and discriminate paralogous from orthologous genes. We used function- and phylogeny-based metrics to perform a large-scale, systematic comparison of the orthology predictions of QuartetS with those of four other methods [bi-directional best hit (BBH), outgroup, OMA and QuartetS-C (QuartetS followed by clustering)], involving 624 bacterial genomes and >2 million genes. We found that QuartetS slightly, but consistently, outperformed the highly specific OMA method and that, while consuming only 0.5% additional computational time, QuartetS predicted 50% more orthologs with a 50% lower false positive rate than the widely used BBH method. We conclude that, for large-scale phylogenetic and functional analysis, QuartetS and QuartetS-C should be preferred, respectively, in applications where high accuracy and high throughput are required.
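The bi-directional best hit (BBH) baseline that QuartetS is compared against can be sketched as follows. This is an illustration of the general BBH idea only, not of QuartetS: the function names, the nested-dictionary score layout, and the tie handling (first maximum wins) are assumptions made for the example.

```python
def best_hits(scores):
    """Map each query gene to its top-scoring subject gene.

    scores: dict of dicts, scores[query][subject] = similarity score.
    Ties go to whichever subject appears first in the inner dict.
    """
    return {q: max(hits, key=hits.get) for q, hits in scores.items() if hits}


def bidirectional_best_hits(a_vs_b, b_vs_a):
    """Return sorted (gene_a, gene_b) pairs that are each other's best
    hit when genome A is searched against B and B against A."""
    best_ab = best_hits(a_vs_b)
    best_ba = best_hits(b_vs_a)
    return sorted((a, b) for a, b in best_ab.items() if best_ba.get(b) == a)


if __name__ == "__main__":
    # Toy scores, not real alignment results.
    a_vs_b = {"a1": {"b1": 90, "b2": 50}, "a2": {"b1": 80, "b2": 40}}
    b_vs_a = {"b1": {"a1": 88, "a2": 79}, "b2": {"a2": 45, "a1": 30}}
    print(bidirectional_best_hits(a_vs_b, b_vs_a))
```

In the toy data, `a2` and `b2` receive no ortholog call because their best hits are not reciprocal, which shows how BBH can miss orthologs.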
18.
To interpret LC-MS/MS data in proteomics, most popular protein identification algorithms primarily use predicted fragment m/z values to assign peptide sequences to fragmentation spectra. The intensity information is often undervalued, because it is not as easy to predict and incorporate into algorithms. Nevertheless, the use of intensity to assist peptide identification is an attractive prospect and can potentially improve the confidence of matches and generate more identifications. On the basis of our previously reported study of fragmentation intensity patterns, we developed a protein identification algorithm, SeQuence IDentification (SQID), that makes use of the coarse intensity from a statistical analysis. The scoring scheme was validated by comparing with Sequest and X!Tandem using three data sets, and the results indicate an improvement in the number of identified peptides, including unique peptides that are not identified by Sequest or X!Tandem. The software and source code are available under the GNU GPL license at http://quiz2.chem.arizona.edu/wysocki/bioinformatics.htm.
19.
20.
Robert C Edgar 《BMC bioinformatics》2007,8(1):18