Found 20 similar documents; search took 0 ms
1.
MOTIVATION: We present techniques for increasing the speed of sequence analysis using scoring matrices. Our techniques are based on calculating, for a given scoring matrix, the quantile function, which assigns a probability, or p, value to each segmental score. Our techniques also permit the user to specify a p threshold to indicate the desired trade-off between sensitivity and speed for a particular sequence analysis. The resulting increase in speed should allow scoring matrices to be used more widely in large-scale sequencing and annotation projects. RESULTS: We develop three techniques for increasing the speed of sequence analysis: probability filtering, lookahead scoring, and permuted lookahead scoring. In probability filtering, we compute the score threshold that corresponds to the user-specified p threshold. We use the score threshold to limit the number of segments that are retained in the search process. In lookahead scoring, we test intermediate scores to determine whether they will possibly exceed the score threshold. In permuted lookahead scoring, we score each segment in a particular order designed to maximize the likelihood of early termination. Our two lookahead scoring techniques reduce substantially the number of residues that must be examined. The fraction of residues examined ranges from 62 to 6%, depending on the p threshold chosen by the user. These techniques permit sequence analysis with scoring matrices at speeds that are several times faster than existing programs. On a database of 12 177 alignment blocks, our techniques permit sequence analysis at a speed of 225 residues/s for a p threshold of 10^-6, and 541 residues/s for a p threshold of 10^-20. In order to compute the quantile function, we may use either an independence assumption or a Markov assumption. We measure the effect of first- and second-order Markov assumptions and find that they tend to raise the p value of segments, when compared with the independence assumption, by average ratios of 1.30 and 1.69, respectively. We also compare our technique with the empirical 99.5th percentile scores compiled in the BLOCKSPLUS database, and find that they correspond on average to a p value of 1.5 × 10^-5. AVAILABILITY: The techniques described above are implemented in a software package called EMATRIX. This package is available from the authors for free academic use or for licensed commercial use. The EMATRIX set of programs is also available on the Internet at http://motif.stanford.edu/ematrix.
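The lookahead scoring idea in the abstract above can be illustrated with a minimal Python sketch. It is not the EMATRIX implementation: the matrix layout (one dictionary of residue scores per column), the function names, and the toy scores are all assumptions made for illustration. The score threshold is taken as given here, whereas the abstract derives it from a user-specified p threshold via the quantile function.

```python
from typing import Dict, List, Optional, Tuple

# Hypothetical layout: one dict of residue -> score per matrix column.
Matrix = List[Dict[str, int]]


def max_suffix_scores(matrix: Matrix) -> List[int]:
    """best[i] is the highest score still attainable from columns i..end."""
    best = [0] * (len(matrix) + 1)
    for i in range(len(matrix) - 1, -1, -1):
        best[i] = best[i + 1] + max(matrix[i].values())
    return best


def lookahead_score(segment: str, matrix: Matrix, threshold: int,
                    best: List[int]) -> Optional[int]:
    """Score one segment, stopping as soon as the threshold is unreachable."""
    score = 0
    for i, residue in enumerate(segment):
        score += matrix[i][residue]  # assumes every residue is in the matrix
        # Even a perfect remainder cannot lift the score to the threshold.
        if score + best[i + 1] < threshold:
            return None
    return score


def scan(sequence: str, matrix: Matrix, threshold: int) -> List[Tuple[int, int]]:
    """Return (start, score) for every window scoring at least threshold."""
    width = len(matrix)
    best = max_suffix_scores(matrix)
    hits = []
    for start in range(len(sequence) - width + 1):
        score = lookahead_score(sequence[start:start + width], matrix,
                                threshold, best)
        if score is not None:
            hits.append((start, score))
    return hits


# Toy three-column matrix over a two-letter alphabet (made-up scores).
toy_matrix = [{"A": 2, "B": -1}, {"A": -1, "B": 3}, {"A": 1, "B": 0}]
hits = scan("ABABA", toy_matrix, threshold=5)
```

In this toy run the window starting at position 1 ("BAB") is abandoned after its first residue, because the partial score plus the best attainable remainder already falls below the threshold. The permuted variant described in the abstract would visit the columns in a different order to make such early exits more likely; that ordering is not shown here.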
2.
MOTIVATION: Pairwise local sequence alignment is commonly used to search databases for sequences related to some query sequence. Alignments are obtained using a scoring matrix that takes into account the different frequencies of occurrence of the various types of amino acid substitutions. Software like BLAST provides the user with a set of scoring matrices to choose from, and in the literature it is sometimes recommended to try several scoring matrices on the sequences of interest. The significance of an alignment is usually assessed by looking at E-values and p-values. While sequence lengths and database sizes enter the standard calculations of significance, it is much less common to take the use of several scoring matrices on the same sequences into account. Altschul proposed corrections of the p-value that account for the simultaneous use of an infinite number of PAM matrices. Here we consider the more realistic situation where the user may choose from a finite set of popular PAM and BLOSUM matrices, in particular the ones available in BLAST. It turns out that the significance of a result can be considerably overestimated if a set of substitution matrices is used in an alignment problem and the most significant alignment is then quoted. RESULTS: Based on extensive simulations, we study the multiple testing problem that occurs when several scoring matrices for local sequence alignment are used. We consider a simple Bonferroni correction of the p-values and investigate its accuracy. Finally, we propose a more accurate correction based on extreme value distributions fitted to the maximum of the normalized scores obtained from different scoring matrices. For various sets of matrices we provide correction factors which can be easily applied to adjust p- and E-values reported by software packages.
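As a small illustration of the simple Bonferroni correction mentioned in this abstract, the Python sketch below multiplies the best p-value by the number of matrices tried and caps the result at 1. The function names and the example p-values are hypothetical. The more accurate correction that the authors propose, based on fitted extreme value distributions, is not reproduced here.

```python
from typing import Dict, Tuple


def bonferroni_correct(p_value: float, n_matrices: int) -> float:
    """Upper bound on the p-value of the best result over n_matrices tries."""
    if n_matrices < 1:
        raise ValueError("n_matrices must be at least 1")
    return min(1.0, p_value * n_matrices)


def best_corrected(p_values: Dict[str, float]) -> Tuple[str, float, float]:
    """Return (matrix name, raw p, corrected p) for the most significant matrix."""
    name = min(p_values, key=p_values.get)
    raw = p_values[name]
    return name, raw, bonferroni_correct(raw, len(p_values))


# Made-up p-values for one alignment scored with three matrices.
example = {"BLOSUM62": 0.004, "PAM250": 0.02, "BLOSUM45": 0.01}
name, raw, corrected = best_corrected(example)
```

With these made-up numbers the best raw p-value of 0.004 becomes roughly 0.012 after correction. Because scores from different matrices on the same sequences are correlated, a bound of this kind tends to be conservative, which is consistent with the abstract's motivation for seeking a more accurate correction.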
3.
Background
Classification of protein sequences is a central problem in computational biology. Currently, among computational methods, discriminative kernel-based approaches provide the most accurate results. However, kernel-based methods often lack an interpretable model for analysis of discriminative sequence features, and predictions on new sequences usually are computationally expensive.
4.
Protein secondary structure prediction based on position-specific scoring matrices. Total citations: 46 (self-citations: 0; citations by others: 46)
D T Jones 《Journal of molecular biology》1999,292(2):195-202
A two-stage neural network has been used to predict protein secondary structure based on the position-specific scoring matrices generated by PSI-BLAST. Despite the simplicity and convenience of the approach used, the results are found to be superior to those produced by other methods, including the popular PHD method, according to our own benchmarking results and the results from the recent Critical Assessment of Techniques for Protein Structure Prediction experiment (CASP3), where the method was evaluated by stringent blind testing. Using a new testing set based on a set of 187 unique folds, and three-way cross-validation based on structural similarity criteria rather than the sequence similarity criteria used previously (no similar folds were present in both the testing and training sets), the method presented here (PSIPRED) achieved an average Q3 score of between 76.5% and 78.3%, depending on the precise definition of observed secondary structure used, which is the highest published score for any method to date. Given the success of the method in CASP3, it is reasonable to be confident that the evaluation presented here gives a fair indication of the performance of the method in general.
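The Q3 score quoted above is the percentage of residues whose predicted three-state label (helix, strand or coil) matches the observed one. A minimal Python sketch follows; the single-letter labels H, E and C and the example strings are assumptions for illustration and are not taken from the paper.

```python
def q3_score(predicted: str, observed: str) -> float:
    """Percentage of positions where the predicted state equals the observed one."""
    if len(predicted) != len(observed):
        raise ValueError("predicted and observed must have the same length")
    if not observed:
        raise ValueError("sequences must not be empty")
    matches = sum(p == o for p, o in zip(predicted, observed))
    return 100.0 * matches / len(observed)


# Hypothetical eight-residue example with one disagreement at position 5.
example_q3 = q3_score("HHHEECCC", "HHHEEECC")
```

As the abstract notes, the value depends on how the observed secondary structure is defined, that is, on how a richer assignment is reduced to three states before comparison.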
5.
Sequence alignment of citrate synthase proteins using a multiple sequence alignment algorithm and multiple scoring matrices. Total citations: 1 (self-citations: 0; citations by others: 1)
The alignment of Escherichia coli citrate synthase to pig heart citrate synthase and the multiple alignment of the known sequences of the citrate synthase family of enzymes have been performed using six different amino acid similarity scoring matrices and a large range of gap penalty ratios for insertions and deletions of amino acids. The alignment studies have been performed as the first step in a project aimed at homology modelling E. coli citrate synthase (a hexamer) from pig heart citrate synthase (a dimer) in a molecular modelling approach to the study of multi-subunit enzymes. The effects of several important variables in producing realistic alignments have been investigated. The difference between multiple alignment of the family of enzymes and simple pairwise alignment of the pig heart and E. coli proteins was explored. The effects of initial separate multiple alignments of the most highly related or most homologous species of the family of enzymes upon a subsequent pairwise alignment between species were evaluated. The value of 'fingerprinting' certain residues to bias the alignment in favour of matching those residues, as well as the worth of the computerized approach compared to an intuitive alignment technique, were assessed.
6.
A C May 《Protein engineering》1999,12(9):707-712
Hierarchical classifications of the 20 amino acids according to residue relationships within scoring matrices have not hitherto been tested for reliability. In fact, testing here of the residue groupings obtained thus from 18 published matrices shows that they vary considerably in reliability. This behaviour gives a new insight then into the matrices with respect to the relationships between the amino acid scores contained therein. For example, other than the trivial grouping of the 20 amino acids, no reliable residue groupings are present in all 18 matrix amino acid hierarchical classifications. Hierarchical classification of the 18 scoring matrices themselves is investigated in terms of matrix representation and choice of similarity and dissimilarity measures for matrix comparison. There is no absolute standard against which to compare a matrix clustering, of course, but it is possible to assess the usefulness of a measure for the purpose in terms of the reliability of the calculated tree. Matrix representation is shown to be important. Finally, a novel two-step approach for hierarchical classification of the 18 amino acid scoring matrices is described.
7.
Fractionation of several type II specific restriction endonucleases was achieved by separation on two novel biospecific matrices. The matrices are pyran, a copolymer of divinyl ether of maleic anhydride, and Cibacron Blue F3GA, a blue dye commonly used for the calibration of molecular sieves. Both compounds are insolubilized by coupling to Sepharose through a cyanogen bromide linkage and in their soluble form inhibit the restriction endonucleases which we have tested. These affinity matrices can be used to obtain restriction endonucleases from crude extracts after removal of nucleic acids. They have also proven to have a high capacity when used as subsequent steps in enzyme purification. Their additional advantage is the rapid development time and reusability of columns packed with the two matrices.
8.
Kim E, Kececioglu J 《IEEE/ACM transactions on computational biology and bioinformatics / IEEE, ACM》2008,5(4):546-556
When aligning biological sequences, the choice of parameter values for the alignment scoring function is critical. Small changes in gap penalties, for example, can yield radically different alignments. A rigorous way to compute parameter values that are appropriate for aligning biological sequences is through inverse parametric sequence alignment. Given a collection of examples of biologically correct alignments, this is the problem of finding parameter values that make the scores of the example alignments close to those of optimal alignments for their sequences. We extend prior work on inverse parametric alignment to partial examples, which contain regions where the alignment is left unspecified, and to an improved formulation based on minimizing the average error between the score of an example and the score of an optimal alignment. Experiments on benchmark biological alignments show we can find parameters that generalize across protein families and that boost the accuracy of multiple sequence alignment by as much as 25%.
9.
Background
In biological sequence analysis, position-specific scoring matrices (PSSMs) are widely used to represent sequence motifs in nucleotide as well as amino acid sequences. Searching with PSSMs in complete genomes or large sequence databases is a common but computationally expensive task.
10.
Position-specific scoring matrices (PSSMs) are useful for detecting weak homology in protein sequence analysis, and they are thought to contain some essential signatures of the protein families. In order to elucidate what kind of ingredients constitute such family-specific signatures, we apply singular value decomposition to a set of PSSMs and examine the properties of dominant right and left singular vectors. The first right singular vectors were correlated with various amino acid indices including relative mutability, amino acid composition in protein interior, hydropathy, or turn propensity, depending on proteins. A significant correlation between the first left singular vector and a measure of site conservation was observed. It is shown that the contribution of the first singular component to the PSSMs acts to disfavor potentially but falsely functionally important residues at conserved sites. The second right singular vectors were highly correlated with hydrophobicity scales, and the corresponding left singular vectors with contact numbers of protein structures. It is suggested that sequence alignment with a PSSM is essentially equivalent to threading supplemented with functional information. In addition, singular vectors may be useful for analyzing and annotating the characteristics of conserved sites in protein families.
11.
Identifying non-coding RNA regions on the genome using computational methods is currently receiving a lot of attention. In general, it is essentially more difficult than the problem of detecting protein-coding genes because non-coding RNA regions have only weak statistical signals. On the other hand, most functional RNA families have conserved sequences and secondary structures which are characteristic of their molecular function in a cell. These are known as sequence motifs and consensus structures, respectively. In this paper, we propose an improved method which extends a pairwise structural alignment method for RNA sequences to handle position specific scoring matrices and hence to incorporate motifs into structural alignment of RNA sequences. To model sequence motifs, we employ position specific scoring matrices (PSSMs). Experimental results show that PSSMs enable us to find individual RNA families efficiently, especially if we have biological knowledge such as sequence motifs.
K. Sato and K. Morita contributed equally to this work.
12.
MOTIVATION: In recent years, several methods have been proposed for aligning two protein sequence profiles, with reported improvements in alignment accuracy and homolog discrimination versus sequence-sequence methods (e.g. BLAST) and profile-sequence methods (e.g. PSI-BLAST). Profile-profile alignment is also the iterated step in progressive multiple sequence alignment algorithms such as CLUSTALW. However, little is known about the relative performance of different profile-profile scoring functions. In this work, we evaluate the alignment accuracy of 23 different profile-profile scoring functions by comparing alignments of 488 pairs of sequences with identity ≤30% against structural alignments. We optimize parameters for all scoring functions on the same training set and use profiles of alignments from both PSI-BLAST and SAM-T99. Structural alignments are constructed from a consensus between the FSSP database and CE structural aligner. We compare the results with sequence-sequence and sequence-profile methods, including BLAST and PSI-BLAST. RESULTS: We find that profile-profile alignment gives an average improvement over our test set of typically 2-3% over profile-sequence alignment and approximately 40% over sequence-sequence alignment. No statistically significant difference is seen in the relative performance of most of the scoring functions tested. Significantly better results are obtained with profiles constructed from SAM-T99 alignments than from PSI-BLAST alignments. AVAILABILITY: Source code, reference alignments and more detailed results are freely available at http://phylogenomics.berkeley.edu/profilealignment/
13.
F Sakiyama 《Trends in biotechnology》1990,8(10):282-288
Analysis of protein sequence is an important tool in studies of both native and recombinant proteins. Novel techniques and instrumentation which facilitate determination of protein primary structure have recently been developed.
14.
Computer programs are described which allow (a) analysis of DNA sequences to be performed on a laboratory microcomputer or (b) transfer of DNA sequences between a laboratory microcomputer and another computer system, such as a DNA library. The sequence analysis programs are interactive, do not require prior experience with computers and in many other respects resemble programs which have been written for larger computer systems (1-7). The user enters sequence data into a text file, accesses this file with the programs, and is then able to (a) search for restriction enzyme sites or other specified sequences, (b) translate in one or more reading frames in one or both directions in order to find open reading frames, or (c) determine codon usage in the sequence in one or more given reading frames. The results are given in table format and a restriction map is generated. The modem program permits collection of large amounts of data from a sequence library into a permanent file on the microcomputer disc system, or transfer of laboratory data in the reverse direction to a remote computer system.
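The three analyses listed in this abstract (site search, translation in a chosen reading frame and direction, and codon usage) can be sketched in a few lines of Python. This is a modern illustration only, not the programs described; the function names and the example sequence are hypothetical, and the standard genetic code is assumed.

```python
from collections import Counter
from itertools import product
from typing import List

# Standard genetic code, with codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def find_sites(sequence: str, site: str) -> List[int]:
    """0-based start positions of every (possibly overlapping) occurrence."""
    positions = []
    start = sequence.find(site)
    while start != -1:
        positions.append(start)
        start = sequence.find(site, start + 1)
    return positions


def reverse_complement(sequence: str) -> str:
    """Sequence of the opposite strand, read 5' to 3'."""
    return sequence.translate(COMPLEMENT)[::-1]


def codons(sequence: str, frame: int = 0) -> List[str]:
    """Complete codons of the sequence starting at offset frame (0, 1 or 2)."""
    return [sequence[i:i + 3] for i in range(frame, len(sequence) - 2, 3)]


def translate(sequence: str, frame: int = 0) -> str:
    """Translate one reading frame; '*' marks stops, 'X' unknown codons."""
    return "".join(CODON_TABLE.get(c, "X") for c in codons(sequence, frame))


def codon_usage(sequence: str, frame: int = 0) -> Counter:
    """Count how often each codon occurs in the given reading frame."""
    return Counter(codons(sequence, frame))


dna = "ATGGAATTCTAA"                  # made-up sequence with one EcoRI site
ecori_sites = find_sites(dna, "GAATTC")
protein = translate(dna)
usage = codon_usage(dna)
```

Translating the other strand amounts to calling `translate(reverse_complement(dna), frame)`, which covers the "both directions" case mentioned in the abstract. Input is assumed to be uppercase A, C, G and T.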
15.
The Tradescantia-micronucleus (Trad-MCN) bioassay is an efficient short-term test for genotoxicity of pollutants. In order to increase the efficiency and to standardize the micronucleus (MCN) scoring process, an automated scoring system was developed using the principle of image analysis in computer science. This assemblage is called the Tradescantia-micronucleus image analysis (Trad-MCNIA) system. To assess its proficiency, the MCN frequencies scored by this system were compared with those scored by human observation. A set of low MCN frequency (around 5 MCN/100 tetrads) slides prepared from a control group, a set of medium MCN frequency (around 20 MCN/100 tetrads) slides prepared from sodium azide treated plant cuttings and a set of high MCN frequency (around 50 MCN/100 tetrads) slides prepared from X-ray treated materials were used for this study. In the low MCN frequency slides, the Trad-MCNIA system scored about the same value as human observation. In the medium and high frequency slides, MCN frequencies scored by the system were lower than those scored by human observers. This discrepancy was corrected by increasing the power of the objective of the microscope in the system. The MCN frequencies scored by the system attained 90% congruity with those scored by human observers after the correction. The scoring speed of the system was about 3.5 times as fast as that of human observers, and the data could be statistically analyzed immediately after the data scores were recorded. Further improvements can be made by upgrading the video camera and the computer speed.
16.
Background
Molecular database search tools need statistical models to assess the significance of the resulting hits. In the classical approach one asks how probable it is that a certain score is observed by pure chance. Asymptotic theories for such questions are available for two random i.i.d. sequences. Some effort has been made to include effects of finite sequence lengths and to account for specific compositions of the sequences. In many applications, such as a large-scale database homology search for transmembrane proteins, these models are not the most appropriate ones. Search sensitivity and specificity benefit from position-dependent scoring schemes or use of Hidden Markov Models. Additionally, one may wish to go beyond the assumption that the sequences are i.i.d. Despite their practical importance, the statistical properties of these settings have not been well investigated yet.
17.
Membrane proteins play an important role in biochemical processes such as signal transduction and transmembrane transport. Membrane proteins are usually classified into five types [Chou, K.C., Elrod, D.W., 1999. Prediction of membrane protein types and subcellular locations. Proteins: Struct. Funct. Genet. 34, 137-153] or six types [Chou, K.C., Cai, Y.D., 2005. J. Chem. Inf. Modelling 45, 407-413]. Designing in silico methods to identify and classify membrane proteins can help us understand the structure and function of unknown proteins. This paper introduces an integrative approach, IAMPC, to classify membrane proteins based on protein sequences and protein profiles. These modules extract the amino acid composition of the whole profiles, the amino acid composition of N-terminal and C-terminal profiles, the amino acid composition of profile segments and the dipeptide composition of the whole profiles. In the computational experiment, the overall accuracy of the proposed approach is comparable with the functional-domain-based method. In addition, the performance of the proposed approach is complementary to the functional-domain-based method for different membrane protein types.
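The composition features named in this abstract can be illustrated on a plain amino acid sequence. The Python sketch below is a simplification and not the IAMPC method: the paper computes these quantities on profiles, whereas this version counts residues in a single sequence, and the function names and example sequence are hypothetical.

```python
from collections import Counter
from typing import Dict

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def amino_acid_composition(sequence: str) -> Dict[str, float]:
    """Fraction of the sequence made up by each of the 20 amino acids."""
    if not sequence:
        raise ValueError("sequence must not be empty")
    counts = Counter(sequence)
    return {aa: counts[aa] / len(sequence) for aa in AMINO_ACIDS}


def dipeptide_composition(sequence: str) -> Dict[str, float]:
    """Fraction of adjacent residue pairs equal to each of the 400 dipeptides."""
    if len(sequence) < 2:
        raise ValueError("sequence must contain at least two residues")
    pairs = Counter(sequence[i:i + 2] for i in range(len(sequence) - 1))
    total = len(sequence) - 1
    return {a + b: pairs[a + b] / total for a in AMINO_ACIDS for b in AMINO_ACIDS}


# Made-up four-residue sequence.
composition = amino_acid_composition("AACA")
dipeptides = dipeptide_composition("AACA")
```

Terminal and segment compositions of the kind the abstract mentions would follow by applying the same functions to slices of the sequence, for example `sequence[:n]` for an N-terminal region of assumed length `n`.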
18.
We describe here a new strategy of fragment preparation for sequencing procedures using end-labelled DNA fragments as substrates (2,3) which is directly applicable to DNA fragments cloned into the Pst I site of pBR322, or in modified form, to inserts into the BamH I or Sal I site of the same plasmid. Ordered sets of subclones of predetermined overlap are generated. These can be sequenced directly without further strand- or fragment separation steps.
19.
The structure of an oligodeoxyribonucleotide may be determined by a simple two-dimensional separation on a polyethyleneimine-cellulose thin layer sheet. Chromatography in the first dimension fractionates by chain length a nested set of fragments that are generated by subjecting the oligomer to partial spleen phosphodiesterase degradation and then labelling their non-common ends with 32P using polynucleotide kinase. A subsequent in situ treatment with nuclease Bal 31 produces labelled mononucleotides, and these are identified by chromatography in the second dimension. Since the method does not identify the 3' terminal nucleotide, a convenient procedure involving 3' end labelling followed by enzymatic digestion to monomers has been developed for this purpose. This approach to sequence analysis also has the advantage of permitting assignment of the identity and location of any modified or unusual bases within the oligonucleotide.
20.
A simple miniaturized gel system suitable for DNA sequencing is described. Small ultrathin polyacrylamide gels are cast, eight or more at a time, using standard microscope slides. Gels, ready to use, can be stored for approximately 2 weeks. Gels are run horizontally in a standard mini-agarose gel apparatus. Typical run times are 6-8 min. A novel sample loading system permits volumes of standard sequencing reactions as small as 0.1 µl to be analyzed. Sequencing ladders were visualized using 35S-labeled DNA by autoradiography and by colorimetric detection. Band resolution compares favorably with that of large gels. The methods introduced here serve as a step toward the miniaturization of DNA sequencing and are amenable to automated sample loading and detection.