Similar articles
1.
Multiple sequence alignments have wide applicability in many areas of computational biology, including comparative genomics, functional annotation of proteins, gene finding, and modeling evolutionary processes. Because of the computational difficulty of multiple sequence alignment and the availability of numerous tools, it is critical to be able to assess the reliability of multiple alignments. We present a tool called StatSigMA to assess whether multiple alignments of nucleotide or amino acid sequences are contaminated with one or more unrelated sequences. There are numerous applications for which StatSigMA can be used. Two such applications are to distinguish homologous sequences from nonhomologous ones and to compare alignments produced by various multiple alignment tools. We present examples of both types of applications.

2.
Current methods for aligning biological sequences are based on dynamic programming algorithms. If large numbers of sequences or a number of long sequences are to be aligned, the required computations are expensive in memory and central processing unit (CPU) time. In an attempt to bring the tools of large-scale linear programming (LP) methods to bear on this problem, we formulate the alignment process as a controlled Markov chain and construct a suggested alignment based on policies that minimise the expected total cost of the alignment. We discuss the LP associated with the total expected discounted cost and show the results of a solution of the problem based on a primal-dual interior point method. Model parameters, estimated from aligned sequences, along with cost function parameters are used to construct the objective and constraint conditions of the LP problem. This article concludes with a discussion of some alignments obtained from the LP solutions of problems with various cost function parameter values.

3.
A technique for prediction of protein membrane topology (intra- and extracellular sidedness) has been developed. Membrane-spanning segments are first predicted using an algorithm based upon multiply aligned amino acid sequences. The compositional differences in the protein segments exposed at each side of the membrane are then investigated. The ratios are calculated for Asn, Asp, Gly, Phe, Pro, Trp, Tyr, and Val, mostly found on the extracellular side, and for Ala, Arg, Cys, and Lys, mostly occurring on the intracellular side. The consensus over these 12 residue distributions is used for sidedness prediction. The method was developed with a set of 42 protein families for which all but one were correctly predicted with the new algorithm. This represents an improvement over previous techniques. The new method, applied to a set of 12 membrane protein families different from the test set and with recently determined topologies, performed well, with 11 of 12 sidedness assignments agreeing with experimental results. The method has also been applied to several membrane protein families for which the topology has yet to be determined. An electronic prediction service is available at the E-mail address tmap@embl-heidelberg.de and on WWW via http://www.emblheidelberg.de.
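The consensus step described above can be illustrated with a minimal Python sketch. It assumes a simple majority vote over the 12 residue types, comparing each residue's frequency on the two sides of the membrane; the abstract does not give the actual ratio computation or decision rule, so the function name `predict_outside` and the voting scheme are illustrative assumptions only.

```python
# Residue groups taken from the abstract (one-letter codes).
OUTSIDE = set("NDGFPWYV")  # Asn, Asp, Gly, Phe, Pro, Trp, Tyr, Val
INSIDE = set("ARCK")       # Ala, Arg, Cys, Lys


def predict_outside(side_a: str, side_b: str) -> str:
    """Guess which side is extracellular: 'a', 'b', or 'undecided'.

    Assumption: each of the 12 residue types casts one vote, based on which
    side has the higher relative frequency of that residue.
    """
    votes_a = votes_b = 0
    for res in OUTSIDE | INSIDE:
        freq_a = side_a.count(res) / max(len(side_a), 1)
        freq_b = side_b.count(res) / max(len(side_b), 1)
        if freq_a == freq_b:
            continue  # this residue carries no information
        a_richer = freq_a > freq_b
        # An outside-type residue enriched on A, or an inside-type residue
        # enriched on B, both support "A is extracellular".
        if a_richer == (res in OUTSIDE):
            votes_a += 1
        else:
            votes_b += 1
    if votes_a == votes_b:
        return "undecided"
    return "a" if votes_a > votes_b else "b"
```

Here `side_a` and `side_b` stand for the concatenated loop segments on either side of the predicted membrane-spanning segments; how those segments are obtained is outside this sketch.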

4.

Background

While the conserved positions of a multiple sequence alignment (MSA) are clearly of interest, non-conserved positions can also be important because, for example, destabilizing effects at one position can be compensated by stabilizing effects at another position. Different methods have been developed to recognize the evolutionary relationship between amino acid sites, and to disentangle functional/structural dependencies from historical/phylogenetic ones.

Methodology/Principal Findings

We have used two complementary approaches to test the efficacy of these methods. In the first approach, we have used a new program, MSAvolve, for the in silico evolution of MSAs, which records a detailed history of all covarying positions, and builds a global coevolution matrix as the accumulated sum of individual matrices for the positions forced to co-vary, the recombinant coevolution, and the stochastic coevolution. We have simulated over 1600 MSAs for 8 protein families, which reflect sequences of different sizes and proteins with widely different functions. The calculated coevolution matrices were compared with the coevolution matrices obtained for the same evolved MSAs with different coevolution detection methods. In a second approach we have evaluated the capacity of the different methods to predict close contacts in the representative X-ray structures of an additional 150 protein families using only experimental MSAs.

Conclusions/Significance

Methods based on the identification of global correlations between pairs were found to be generally superior to methods based only on local correlations in their capacity to identify coevolving residues using either simulated or experimental MSAs. However, the significant variability in the performance of different methods with different proteins suggests that the simulation of MSAs that replicate the statistical properties of the experimental MSA can be a valuable tool to identify the coevolution detection method that is most effective in each case.
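As an illustration of what a "local" pairwise correlation measure can look like, the Python sketch below computes the mutual information between two MSA columns. The abstract does not name the methods it compared, so mutual information is used here only as a familiar example of a statistic that considers one pair of positions at a time; the function name is hypothetical.

```python
import math
from collections import Counter
from typing import Sequence


def column_mutual_information(msa: Sequence[str], i: int, j: int) -> float:
    """Mutual information (in bits) between columns i and j of an alignment.

    `msa` is a list of equal-length aligned sequences. Gaps are treated as
    ordinary symbols, which is a simplification.
    """
    n = len(msa)
    count_i = Counter(seq[i] for seq in msa)
    count_j = Counter(seq[j] for seq in msa)
    count_ij = Counter((seq[i], seq[j]) for seq in msa)
    mi = 0.0
    for (a, b), c in count_ij.items():
        # p(a,b) * log2( p(a,b) / (p(a) * p(b)) ), written with raw counts
        mi += (c / n) * math.log2(c * n / (count_i[a] * count_j[b]))
    return mi
```

Two columns that always change together score high, while independent columns score zero. A measure like this cannot by itself separate direct from indirect or phylogenetic couplings, which is the limitation that global approaches aim to address.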

5.
The rapidly growing availability of genome information has created considerable demand for both fast and accurate phylogenetic inference algorithms. We present a novel method called DendroBLAST for reconstructing phylogenetic dendrograms/trees from protein sequences using BLAST. This method differs from other methods by incorporating a simple model of sequence evolution to test the effect of introducing sequence changes on the reliability of the bipartitions in the inferred tree. Using realistic simulated sequence data we demonstrate that this method produces phylogenetic trees that are more accurate than other commonly used distance-based methods, though not as accurate as maximum likelihood methods from good quality multiple sequence alignments. In addition to tests on simulated data, we use DendroBLAST to generate input trees for a supertree reconstruction of the phylogeny of the Archaea. This independent analysis produces an approximate phylogeny of the Archaea that has both high precision and recall when compared to previously published analysis of the same dataset using conventional methods. Taken together these results demonstrate that approximate phylogenetic trees can be produced in the absence of multiple sequence alignments, and we propose that these trees will provide a platform for improving and informing downstream bioinformatic analysis. A web implementation of the DendroBLAST method is freely available for use at http://www.dendroblast.com/.

6.
7.
In daily life, huge costs can arise from just one incorrect performance on a visual search task (e.g., a fatal accident due to a driver overlooking a pedestrian). One potential way to prevent such drastic accidents would be for people to modify their decision criterion (e.g., placing a greater priority on accuracy rather than speed) during a visual search. The aim of the present study was to manipulate the criterion by creating an awareness of being watched by another person. During a visual search task, study participants were watched (or not watched) via video cameras and monitors. The results showed that, when they believed they were being watched by another person, they searched more slowly and accurately, as measured by reaction times and hit/miss rates. These findings also were obtained when participants were videotaped and they believed their recorded behavior would be watched by another person in the future. The study primarily demonstrated the role of being watched by another on the modulation of the decision criterion for responding during visual searches.

8.
Abstract

Profile-based sequence search procedures are commonly employed to detect remote relationships between proteins. We provide an assessment of a Cascade PSI-BLAST protocol that rigorously employs intermediate sequences in detecting remote relationships between proteins. In this approach, we use PSI-BLAST, which involves multiple rounds of iteration, to detect an initial set of homologues for a protein in a ‘first generation’ search by querying a database. We then propagate a ‘second generation’ search in the database, involving multiple runs of PSI-BLAST that use each of the homologues identified in the previous generation as queries to recognize homologues not detected earlier. This non-directed search process can be viewed as an iteration of iterations that is continued to detect further homologues until no new hits are detectable. We present an assessment of the coverage of this ‘cascaded’ intermediate sequence search on diverse folds and find that searches for up to three generations detect most known homologues of a query. Our assessments show that this approach appears to perform better than the traditional use of PSI-BLAST by detecting 15% more relationships within a family and 35% more relationships within a superfamily. We show that such searches can be performed on generalized sequence databases and non-trivial relationships between proteins can be detected effectively. Such a propagation of searches maximizes the chances of detecting distant homologies by effectively scanning protein “fold space”.
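The "iteration of iterations" described above can be sketched in Python as a generation-by-generation expansion. The `search` callable is a hypothetical stand-in for one complete PSI-BLAST run and is not a real API; the default of three generations follows the figure mentioned in the abstract, and everything else is an illustrative assumption.

```python
from typing import Callable, Iterable, List, Set


def cascade_search(
    query: str,
    search: Callable[[str], Iterable[str]],
    max_generations: int = 3,
) -> Set[str]:
    """Collect homologues by re-querying with every newly found hit.

    `search` (hypothetical) takes a sequence identifier and returns the
    identifiers of the hits reported for it.
    """
    found: Set[str] = set()
    frontier: List[str] = [query]
    for _ in range(max_generations):
        next_frontier: List[str] = []
        for seq_id in frontier:
            for hit in search(seq_id):
                if hit != query and hit not in found:
                    found.add(hit)
                    next_frontier.append(hit)
        if not next_frontier:
            break  # no new hits in this generation: the cascade has converged
        frontier = next_frontier
    return found
```

Each generation issues one search per newly discovered homologue, so the number of searches can grow quickly. Capping the number of generations is one simple way to bound that cost.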

9.
Sequence similarity tools, such as BLAST, seek sequences most similar to a query from a database of sequences. They return results that are significantly similar to the query sequence and typically highly similar to each other. Most sequence analysis tasks in bioinformatics require an exploratory approach, where the initial results guide the user to new searches. However, diversity has not yet been considered an integral component of sequence search tools for this discipline. Some redundancy can be avoided by introducing non-redundancy during database construction, but it is not feasible to dynamically set a level of non-redundancy tailored to a query sequence. We introduce the problem of diverse search and browsing in sequence databases that produce non-redundant results optimized for any given query. We define diversity measures for sequences and propose methods to obtain diverse results extracted from current sequence similarity search tools. We also propose a new measure to evaluate the diversity of a set of sequences that is returned as a result of a sequence similarity query. We evaluate the effectiveness of the proposed methods in post-processing BLAST and PSI-BLAST results. We also assess the functional diversity of the returned results based on available Gene Ontology annotations. Additionally, we include a comparison with a current redundancy elimination tool, CD-HIT. Our experiments show that the proposed methods are able to achieve more diverse yet significant result sets compared to static non-redundancy approaches. In both sequence-based and functional diversity evaluation, the proposed diversification methods significantly outperform original BLAST results and other baselines. A web-based tool implementing the proposed methods, Div-BLAST, can be accessed at cedar.cs.bilkent.edu.tr/Div-BLAST.

10.
Detecting similarities between ligand binding sites in the absence of global homology between target proteins has been recognized as one of the critical components of modern drug discovery. Local binding site alignments can be constructed using sequence order-independent techniques; however, to achieve a high accuracy, many current algorithms for binding site comparison require high-quality experimental protein structures, preferably in the bound conformational state. This, in turn, complicates proteome scale applications, where only structure models of varying quality are available for the majority of gene products. To improve the state-of-the-art, we developed eMatchSite, a new method for constructing sequence order-independent alignments of ligand binding sites in protein models. Large-scale benchmarking calculations using adenine-binding pockets in crystal structures demonstrate that eMatchSite generates accurate alignments for almost three times more protein pairs than SOIPPA. More importantly, eMatchSite offers a high tolerance to structural distortions in ligand binding regions in protein models. For example, the percentage of correctly aligned pairs of adenine-binding sites in weakly homologous protein models is only 4–9% lower than those aligned using crystal structures. This represents a significant improvement over other algorithms, e.g. the performance of eMatchSite in recognizing similar binding sites is 6% and 13% higher than that of SiteEngine using high- and moderate-quality protein models, respectively. Constructing biologically correct alignments using predicted ligand binding sites in protein models opens up the possibility to investigate drug-protein interaction networks for complete proteomes with prospective systems-level applications in polypharmacology and rational drug repositioning. eMatchSite is freely available to the academic community as a web-server and a stand-alone software distribution at http://www.brylinski.org/ematchsite.
This is a PLOS Computational Biology Software Article

11.
In recent years we have witnessed growth in sequencing yield and in the number of samples sequenced and, as a result, growth of publicly maintained sequence databases. The increase in available data has put high demands on protein similarity search algorithms, which face two opposing goals: keeping running times acceptable while maintaining a sufficiently high level of sensitivity. The most time-consuming step of similarity search is the local alignment between query and database sequences. This step is usually performed using exact local alignment algorithms such as Smith-Waterman. Due to its quadratic time complexity, alignments of a query to the whole database are usually too slow. Therefore, the majority of protein similarity search methods apply heuristics before the exact local alignment to reduce the number of candidate sequences in the database. However, there is still a need for the alignment of a query sequence to a reduced database. In this paper we present the SW#db tool and a library for fast exact similarity search. Although its running times, as a standalone tool, are comparable to the running times of BLAST, it is primarily intended to be used for the exact local alignment phase in which the database of sequences has already been reduced. It uses both GPU and CPU parallelization and was 4–5 times faster than SSEARCH, 6–25 times faster than CUDASW++ and more than 20 times faster than SSW at the time of writing, using multiple queries on Swiss-prot and Uniref90 databases.
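The quadratic cost mentioned above comes from filling a score matrix with one cell per pair of query and database residues. The Python sketch below is a plain Smith-Waterman scorer, not the SW#db implementation. It assumes a fixed match/mismatch score and a linear gap penalty (the parameter values are illustrative), whereas production tools typically use substitution matrices and affine gaps.

```python
def smith_waterman_score(
    a: str, b: str, match: int = 2, mismatch: int = -1, gap: int = -2
) -> int:
    """Best local alignment score between a and b.

    Runs in O(len(a) * len(b)) time and keeps only two rows in memory.
    """
    prev = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # A local alignment may restart anywhere, hence the floor at 0.
            curr[j] = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best
```

Because every cell depends only on its upper, left, and upper-left neighbours, cells on the same anti-diagonal are independent. This is the property that parallel implementations generally exploit.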

12.

Background  

Mass spectrometry based peptide mass fingerprints (PMFs) offer a fast, efficient, and robust method for protein identification. A protein is digested (usually by trypsin) and its mass spectrum is compared to simulated spectra for protein sequences in a database. However, existing tools for analyzing PMFs often suffer from missing or heuristic analysis of the significance of search results and insufficient handling of missing and additional peaks.
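The digest-and-compare idea can be sketched in Python as follows. The sketch assumes the commonly used trypsin rule (cleave after K or R unless the next residue is P), no missed cleavages, and neutral monoisotopic masses. The function names and the matching tolerance are illustrative assumptions and do not come from any tool mentioned in the abstract.

```python
from typing import List, Sequence

# Monoisotopic residue masses in daltons; one water is added per peptide.
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.01056


def tryptic_digest(protein: str) -> List[str]:
    """Split after K or R, except when the following residue is P."""
    peptides, start = [], 0
    for i, residue in enumerate(protein):
        next_residue = protein[i + 1] if i + 1 < len(protein) else ""
        if residue in "KR" and next_residue != "P":
            peptides.append(protein[start:i + 1])
            start = i + 1
    if start < len(protein):
        peptides.append(protein[start:])  # C-terminal peptide
    return peptides


def peptide_mass(peptide: str) -> float:
    """Neutral monoisotopic mass of an unmodified peptide."""
    return sum(RESIDUE_MASS[r] for r in peptide) + WATER


def matched_peaks(
    observed: Sequence[float], theoretical: Sequence[float], tol: float = 0.01
) -> int:
    """Count observed masses within `tol` Da of any theoretical mass."""
    return sum(any(abs(o - t) <= tol for t in theoretical) for o in observed)
```

A raw match count like this is exactly the kind of heuristic score the abstract criticizes. It says nothing about how likely the same count would be by chance, and it ignores missing and additional peaks.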

13.

Background  

With the ever-increasing number of gene sequences in the public databases, generating and analyzing multiple sequence alignments becomes increasingly time consuming. Nevertheless it is a task performed on a regular basis by researchers in many labs.

14.
Abstract

The existence and identity of non-Watson-Crick base pairs (bps) within RNA bulges, internal loops, and hairpin loops cannot reliably be predicted by existing algorithms. We have developed the Isfold (Isosteric Folding) program as a tool to examine patterns of nucleotide substitutions from sequence alignments or mutation experiments and identify plausible bp interactions. We infer these interactions based on the observation that each non-Watson-Crick bp has a signature pattern of isosteric substitutions where mutations can be made that preserve the 3D structure. Isfold produces a dynamic representation of predicted bps within defined motifs in order of their probabilities. The software was developed under Windows XP, and is capable of running on PC and MAC with Matlab 7.1 (SP3) or higher. A PC standalone version that does not require Matlab also is available. This software and a user manual are freely available at www.ucsf.edu/frankel/isfold.

15.
The members of the NR5A subfamily of nuclear receptors (NRs) are important regulators of pluripotency, lipid and glucose homeostasis, and steroidogenesis. Liver receptor homologue 1 (LRH-1; NR5A2) and steroidogenic factor 1 (SF-1; NR5A1) have therapeutic potential for the treatment of metabolic and neoplastic disease; however, a poor understanding of their ligand regulation has hampered the pursuit of these proteins as pharmaceutical targets. In this study, we dissect how sequence variation among LRH-1 orthologs affects phospholipid (PL) binding and regulation. Both human LRH-1 (hLRH-1) and mouse LRH-1 (mLRH-1) respond to newly discovered medium chain PL agonists to modulate lipid and glucose homeostasis. These PLs activate hLRH-1 by altering receptor dynamics in a newly identified alternate activation function region. Mouse and Drosophila orthologs contain divergent sequences in this region potentially altering PL-driven activation. Structural evidence suggests that these sequence differences in mLRH-1 and Drosophila FTZ-f1 (dmFTZ-f1) confer at least partial ligand independence, making them poor models for hLRH-1 studies; however, the mechanisms of ligand independence remain untested. We show using structural and biochemical methods that the recent evolutionary divergence of the mLRH-1 stabilizes the active conformation in the absence of ligand, yet does not abrogate PL-dependent activation. We also show by mass spectrometry and biochemical assays that FTZ-f1 is incapable of PL binding. This work provides a structural mechanism for the differential tuning of PL sensitivity in NR5A orthologs and supports the use of mice as viable therapeutic models for LRH-1-dependent diseases.

16.
17.
I have studied mutation patterns around very short microsatellites, focusing mainly on sequences carrying only two repeat units. By using human–chimpanzee–orangutan alignments, inferences can be made about both the relative rates of mutations and which bases have mutated. I find remarkable non-randomness, with mutation rate depending on a base’s position relative to the microsatellite, the identity of the base itself and the motif in the microsatellite. Comparing the patterns around (AC)2 with those around other four-base combinations reveals that (AC)2 does not stand out as being special in the sense that non-repetitive tetramers also generate strong mutation biases. However, comparing (AC)2 and (AC)3 with (AC)4 reveals a step change in both the rate and nature of mutations occurring, suggesting a transition state, (AC)4 exhibiting an alternating high–low mutation rate pattern consistent with the sequence patterning seen around longer microsatellites. Surprisingly, most changes in repeat number occur through base substitutions rather than slippage, and the relative probability of gaining versus losing a repeat in this way varies greatly with repeat number. Slippage mutations reveal rather similar patterns of mutability compared with point mutations, being rare at two repeats where most cause the loss of a repeat, with both mutation rate and the proportion of expansion mutations increasing up to 6–8 repeats. Inferences about longer repeat tracts are hampered by uncertainties about the proportion of multi-species alignments that fail due to multi-repeat mutations and other rearrangements.

18.
Database search programs are essential tools for identifying peptides via mass spectrometry (MS) in shotgun proteomics. Simultaneously achieving high sensitivity and high specificity during a database search is crucial for improving proteome coverage. Here we present JUMP, a new hybrid database search program that generates amino acid tags and ranks peptide spectrum matches (PSMs) by an integrated score from the tags and pattern matching. In a typical run of liquid chromatography coupled with high-resolution tandem MS, more than 95% of MS/MS spectra can generate at least one tag, whereas the remaining spectra are usually too poor to derive genuine PSMs. To enhance search sensitivity, the JUMP program enables the use of tags as short as one amino acid. Using a target-decoy strategy, we compared JUMP with other programs (e.g. SEQUEST, Mascot, PEAKS DB, and InsPecT) in the analysis of multiple datasets and found that JUMP outperformed these preexisting programs. JUMP also permitted the analysis of multiple co-fragmented peptides from “mixture spectra” to further increase PSMs. In addition, JUMP-derived tags allowed partial de novo sequencing and facilitated the unambiguous assignment of modified residues. In summary, JUMP is an effective database search algorithm complementary to current search programs.

Peptide identification by tandem mass spectra is a critical step in mass spectrometry (MS)-based proteomics (1). Numerous computational algorithms and software tools have been developed for this purpose (2–6). These algorithms can be classified into three categories: (i) pattern-based database search, (ii) de novo sequencing, and (iii) hybrid search that combines database search and de novo sequencing. With the continuous development of high-performance liquid chromatography and high-resolution mass spectrometers, it is now possible to analyze almost all protein components in mammalian cells (7). In contrast to rapid data collection, it remains a challenge to extract accurate information from the raw data to identify peptides with low false positive rates (specificity) and minimal false negatives (sensitivity) (8).

Database search methods usually assign peptide sequences by comparing MS/MS spectra to theoretical peptide spectra predicted from a protein database, as exemplified in SEQUEST (9), Mascot (10), OMSSA (11), X!Tandem (12), Spectrum Mill (13), ProteinProspector (14), MyriMatch (15), Crux (16), MS-GFDB (17), Andromeda (18), BaMS2 (19), and Morpheus (20). Some other programs, such as SpectraST (21) and Pepitome (22), utilize a spectral library composed of experimentally identified and validated MS/MS spectra. These methods use a variety of scoring algorithms to rank potential peptide spectrum matches (PSMs) and select the top hit as a putative PSM. However, not all PSMs are correctly assigned. For example, false peptides may be assigned to MS/MS spectra with numerous noisy peaks and poor fragmentation patterns. If the samples contain unknown protein modifications, mutations, and contaminants, the related MS/MS spectra also result in false positives, as their corresponding peptides are not in the database. Other false positives may be generated simply by random matches. Therefore, it is of importance to remove these false PSMs to improve dataset quality. One common approach is to filter putative PSMs to achieve a final list with a predefined false discovery rate (FDR) via a target-decoy strategy, in which decoy proteins are merged with target proteins in the same database for estimating false PSMs (23–26). However, the true and false PSMs are not always distinguishable based on matching scores. It is a problem to set up an appropriate score threshold to achieve maximal sensitivity and high specificity (13, 27, 28).

De novo methods, including Lutefisk (29), PEAKS (30), NovoHMM (31), PepNovo (32), pNovo (33), Vonovo (34), and UniNovo (35), identify peptide sequences directly from MS/MS spectra. These methods can be used to derive novel peptides and post-translational modifications without a database, which is useful, especially when the related genome is not sequenced. High-resolution MS/MS spectra greatly facilitate the generation of peptide sequences in these de novo methods. However, because MS/MS fragmentation cannot always produce all predicted product ions, only a portion of collected MS/MS spectra have sufficient quality to extract partial or full peptide sequences, leading to lower sensitivity than achieved with the database search methods.

To improve the sensitivity of the de novo methods, a hybrid approach has been proposed to integrate peptide sequence tags into PSM scoring during database searches (36). Numerous software packages have been developed, such as GutenTag (37), InsPecT (38), Byonic (39), DirecTag (40), and PEAKS DB (41). These methods use peptide tag sequences to filter a protein database, followed by error-tolerant database searching. One restriction in most of these algorithms is the requirement of a minimum tag length of three amino acids for matching protein sequences in the database. This restriction reduces the sensitivity of the database search, because it filters out some high-quality spectra in which consecutive tags cannot be generated.

In this paper, we describe JUMP, a novel tag-based hybrid algorithm for peptide identification. The program is optimized to balance sensitivity and specificity during tag derivation and MS/MS pattern matching. JUMP can use all potential sequence tags, including tags consisting of only one amino acid. When we compared its performance to that of two widely used search algorithms, SEQUEST and Mascot, JUMP identified ∼30% more PSMs at the same FDR threshold. In addition, the program provides two additional features: (i) using tag sequences to improve modification site assignment, and (ii) analyzing co-fragmented peptides from mixture MS/MS spectra.
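The target-decoy filtering mentioned above can be illustrated with a short Python sketch. It assumes a concatenated target-decoy search, estimates the FDR at a score cutoff as the number of decoy PSMs divided by the number of target PSMs at or above that cutoff, and assumes distinct scores. The function name and input format are hypothetical, and this is not JUMP's own procedure.

```python
from typing import Iterable, Optional, Tuple


def score_threshold_for_fdr(
    psms: Iterable[Tuple[float, bool]], max_fdr: float = 0.01
) -> Optional[float]:
    """Lowest score cutoff whose estimated FDR stays within `max_fdr`.

    Each PSM is a (score, is_decoy) pair. Returns None if no cutoff
    satisfies the requested FDR.
    """
    ranked = sorted(psms, key=lambda p: p[0], reverse=True)
    targets = decoys = 0
    best = None
    for score, is_decoy in ranked:
        if is_decoy:
            decoys += 1
        else:
            targets += 1
        # Estimated FDR if everything scoring >= `score` were accepted.
        if targets and decoys / targets <= max_fdr:
            best = score
    return best
```

Comparing search programs "at the same FDR threshold" then amounts to counting the target PSMs each program retains above its own cutoff.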

19.
Abstract

The microbial biomass and community structure of eight Chinese red soils with different fertility and land use history was investigated. Two community based microbiological measurements, namely, community level physiological profiling (CLPP) using Biolog sole C source utilization tests and phospholipid fatty acid (PLFA) profiles, were used to investigate the microbial ecology of these soils and to determine how land use alters microbial community structure. Microbial biomass-C and total PLFAs were closely correlated to organic carbon and total nitrogen, indicating that these soil microbial measures are potentially good indices of soil fertility in these highly weathered soils. Metabolic quotients and C source utilization were not correlated with organic carbon or microbial biomass. Multivariate analysis of sole carbon source utilization patterns and PLFAs demonstrated that land use history and plant cover type had a significant impact on microbial community structure. PLFAs showed these differences more than CLPP methods. Consequently, PLFA analysis was a better method for assessing broad-spectrum community differences and at the same time attempting to correlate changes with soil fertility. Soils from tea orchards were particularly distinctive in their CLPP. A modified CLPP method, using absorbance readings at 405 nm and different culture media at pH values of 4.7 and 7.0, showed that the discrimination obtained can be influenced by the culture conditions. This method was used to show that the distinctive microbial community structure in tea orchard soils was not, however, due to differences in pH alone.

Received: 1 December 1999; Accepted: 6 June 2000; Online Publication: 28 August 2000

20.
An important step in the proteomic analysis of missing proteins is the use of a wide range of tissues, optimal extraction, and the processing of protein material in order to ensure the highest sensitivity in downstream protein detection. This work describes a purification protocol for identifying low-abundance proteins in human chorionic villi using the proposed “1DE-gel concentration” method. This involves the removal of SDS in a short electrophoresis run in a stacking gel without protein separation. Following the in-gel digestion of the obtained holistic single protein band, we used the peptide mixture for further LC–MS/MS analysis. Statistically significant results were derived from six datasets, containing three treatments, each from two tissue sources (elective or missed abortions). The 1DE-gel concentration increased the coverage of the chorionic villus proteome. Our approach allowed the identification of 15 low-abundance proteins, of which some had not been previously detected via the mass spectrometry of trophoblasts. In the post hoc data analysis, we found a dubious or uncertain protein (PSG7) encoded on human chromosome 19 according to neXtProt. A proteomic sample preparation workflow with the 1DE-gel concentration can be used as a prospective tool for uncovering the low-abundance part of the human proteome.
