Found 20 similar articles; search took 10 ms
1.
Codon-based substitution models have been widely used to identify amino acid sites under positive selection in comparative analysis of protein-coding DNA sequences. The nonsynonymous-synonymous substitution rate ratio (d(N)/d(S), denoted omega) is used as a measure of selective pressure at the protein level, with omega > 1 indicating positive selection. Statistical distributions are used to model the variation in omega among sites, allowing a subset of sites to have omega > 1 while the rest of the sequence may be under purifying selection with omega < 1. An empirical Bayes (EB) approach is then used to calculate posterior probabilities that a site comes from the site class with omega > 1. Current implementations, however, use the naive EB (NEB) approach and fail to account for sampling errors in maximum likelihood estimates of model parameters, such as the proportions and omega ratios for the site classes. In small data sets lacking information, this approach may lead to unreliable posterior probability calculations. In this paper, we develop a Bayes empirical Bayes (BEB) approach to the problem, which assigns a prior to the model parameters and integrates over their uncertainties. We compare the new and old methods on real and simulated data sets. The results suggest that in small data sets the new BEB method does not generate false positives as did the old NEB approach, while in large data sets it retains the good power of the NEB approach for inferring positively selected sites.
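The naive empirical Bayes step described in this abstract can be illustrated with a minimal sketch. It assumes the class proportions and the per-class likelihoods of the data at one site have already been obtained by maximum likelihood; the function name and all numbers below are hypothetical and are not taken from the paper. The BEB approach would additionally integrate over a prior on the model parameters, which this sketch does not attempt.

```python
def neb_posteriors(proportions, site_likelihoods):
    """Posterior probability of each site class for a single site.

    proportions: estimated proportion p_k of each site class (treated as fixed,
        which is exactly the simplification the NEB approach makes).
    site_likelihoods: likelihood of the data at this site given each class.
    """
    # Bayes' rule: P(class k | data) is proportional to p_k * L(data | omega_k)
    weighted = [p * lk for p, lk in zip(proportions, site_likelihoods)]
    total = sum(weighted)
    return [w / total for w in weighted]


# Hypothetical three-class example; the last class stands for omega > 1.
proportions = [0.7, 0.2, 0.1]
site_likelihoods = [1e-6, 2e-6, 8e-6]
posteriors = neb_posteriors(proportions, site_likelihoods)
print(posteriors[-1])  # posterior probability that the site has omega > 1
```

Because the proportions are plugged in as point estimates, any sampling error in them propagates directly into the posteriors, which is the weakness in small data sets that the abstract points out.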
2.
Detecting amino acid sites under positive selection and purifying selection  Total citations: 15, self-citations: 0, citations by others: 15
An excess of nonsynonymous over synonymous substitution at individual amino acid sites is an important indicator that positive selection has affected the evolution of a protein between the extant sequences under study and their most recent common ancestor. Several methods exist to detect the presence, and sometimes location, of positively selected sites in alignments of protein-coding sequences. This article describes the "sitewise likelihood-ratio" (SLR) method for detecting nonneutral evolution, a statistical test that can identify sites that are unusually conserved as well as those that are unusually variable. We show that the SLR method can be more powerful than currently published methods for detecting the location of positive selection, especially in difficult cases where the strength of selection is low. The increase in power is achieved while relaxing assumptions about how the strength of selection varies over sites and without elevated rates of false-positive results that have been reported with some other methods. We also show that the SLR method performs well even under circumstances where the results from some previous methods can be misleading.
3.
A method for detecting positive selection at single amino acid sites  Total citations: 23, self-citations: 0, citations by others: 23
A method was developed for detecting the selective force at single amino acid sites given a multiple alignment of protein-coding sequences. The phylogenetic tree was reconstructed using the number of synonymous substitutions. Then, the neutrality was tested for each codon site using the numbers of synonymous and nonsynonymous changes throughout the phylogenetic tree. Computer simulation showed that this method accurately estimated the numbers of synonymous and nonsynonymous substitutions per site, as long as the substitution number on each branch was relatively small. The false-positive rate for detecting the selective force was generally low. On the other hand, the true-positive rate for detecting the selective force depended on the parameter values. Within the range of parameter values used in the simulation, the true-positive rate increased as the strength of the selective force and the total branch length (namely the total number of synonymous substitutions per site) in the phylogenetic tree increased. In particular, with the relative rate of nonsynonymous substitutions to synonymous substitutions being 5.0, most of the positively selected codon sites were correctly detected when the total branch length in the phylogenetic tree was ≥ 2.5. When this method was applied to the human leukocyte antigen (HLA) gene, which included antigen recognition sites (ARSs), positive selection was detected mainly on ARSs. This finding confirmed the effectiveness of the present method with actual data. Moreover, two amino acid sites were newly identified as positively selected in non-ARSs. The three-dimensional structure of the HLA molecule indicated that these sites might be involved in antigen recognition. Positively selected amino acid sites were also identified in the envelope protein of human immunodeficiency virus and the influenza virus hemagglutinin protein.
This method may be helpful for predicting functions of amino acid sites in proteins, especially in the present situation, in which sequence data are accumulating at an enormous speed.
4.
The reliabilities of parsimony-based and likelihood-based methods for inferring positive selection at single amino acid sites were studied using the nucleotide sequences of human leukocyte antigen (HLA) genes, in which positive selection is known to be operating at the antigen recognition site. The results indicate that the inference by parsimony-based methods is robust to the use of different evolutionary models and generally more reliable than that by likelihood-based methods. In contrast, the results obtained by likelihood-based methods depend on the models and on the initial parameter values used. It is sometimes difficult to obtain the maximum likelihood estimates of parameters for a given model, and the results obtained may be false negatives or false positives depending on the initial parameter values. It is therefore preferable to use parsimony-based methods as long as the number of sequences is relatively large and the branch lengths of the phylogenetic tree are relatively small.
5.
Effect of recombination on the accuracy of the likelihood method for detecting positive selection at amino acid sites  Total citations: 41, self-citations: 0, citations by others: 41
Maximum-likelihood methods based on models of codon substitution accounting for heterogeneous selective pressures across sites have proved to be powerful in detecting positive selection in protein-coding DNA sequences. Those methods are phylogeny based and do not account for the effects of recombination. When recombination occurs, such as in population data, no unique tree topology can describe the evolutionary history of the whole sequence. This violation of assumptions raises serious concerns about the likelihood method for detecting positive selection. Here we use computer simulation to evaluate the reliability of the likelihood-ratio test (LRT) for positive selection in the presence of recombination. We examine three tests based on different models of variable selective pressures among sites. Sequences are simulated using a coalescent model with recombination and analyzed using codon-based likelihood models ignoring recombination. We find that the LRT is robust to low levels of recombination (with fewer than three recombination events in the history of a sample of 10 sequences). However, at higher levels of recombination, the type I error rate can be as high as 90%, especially when the null model in the LRT is unrealistic, and the test often mistakes recombination as evidence for positive selection. The test that compares the more realistic models M7 (beta) against M8 (beta and omega) is more robust to recombination, where the null model M7 allows the positive selection pressure to vary between 0 and 1 (and so does not account for positive selection), and the alternative model M8 allows an additional discrete class with omega = d(N)/d(S) that could be estimated to be >1 (and thus accounts for positive selection). Identification of sites under positive selection by the empirical Bayes method appears to be less affected than the LRT by recombination.
6.
Stéphane Aris-Brosou 《Génome》2006,49(7):767-776
Codon-based substitution models are routinely used to measure selective pressures acting on protein-coding genes. To this effect, the nonsynonymous to synonymous rate ratio (dN/dS = omega) is estimated. The proportion of amino-acid sites potentially under positive selection, as indicated by omega > 1, is inferred by fitting a probability distribution where some sites are permitted to have omega > 1. These sites are then inferred by means of an empirical Bayes or by a Bayes empirical Bayes approach that, respectively, ignores or accounts for sampling errors in maximum-likelihood estimates of the distribution used to infer the proportion of sites with omega > 1. Here, we extend a previous full-Bayes approach to include models with high power and low false-positive rates when inferring sites under positive selection. We propose some heuristics to alleviate the computational burden, and show that (i) full Bayes can be superior to empirical Bayes when analyzing a small data set or small simulated data, (ii) full Bayes has only a small advantage over Bayes empirical Bayes with our small test data, and (iii) Bayesian methods appear relatively insensitive to mild misspecifications of the random process generating adaptive evolution in our simulations, but in practice can prove extremely sensitive to model specification. We suggest that the codon model used to detect amino acids under selection should be carefully selected, for instance using Akaike information criterion (AIC).
7.
Codon-substitution models for heterogeneous selection pressure at amino acid sites  Total citations: 77, self-citations: 0, citations by others: 77
Comparison of relative fixation rates of synonymous (silent) and nonsynonymous (amino acid-altering) mutations provides a means for understanding the mechanisms of molecular sequence evolution. The nonsynonymous/synonymous rate ratio (omega = d(N)/d(S)) is an important indicator of selective pressure at the protein level, with omega = 1 meaning neutral mutations, omega < 1 purifying selection, and omega > 1 diversifying positive selection. Amino acid sites in a protein are expected to be under different selective pressures and have different underlying omega ratios. We develop models that account for heterogeneous omega ratios among amino acid sites and apply them to phylogenetic analyses of protein-coding DNA sequences. These models are useful for testing for adaptive molecular evolution and identifying amino acid sites under diversifying selection. Ten data sets of genes from nuclear, mitochondrial, and viral genomes are analyzed to estimate the distributions of omega among sites. In all data sets analyzed, the selective pressure indicated by the omega ratio is found to be highly heterogeneous among sites. Previously unsuspected Darwinian selection is detected in several genes in which the average omega ratio across sites is <1, but in which some sites are clearly under diversifying selection with omega > 1. Genes undergoing positive selection include the beta-globin gene from vertebrates, mitochondrial protein-coding genes from hominoids, the hemagglutinin (HA) gene from human influenza virus A, and HIV-1 env, vif, and pol genes. Tests for the presence of positively selected sites and their subsequent identification appear quite robust to the specific distributional form assumed for omega and can be achieved using any of several models we implement. However, we encountered difficulties in estimating the precise distribution of omega among sites from real data sets.
8.
Simulation study of the reliability and robustness of the statistical methods for detecting positive selection at single amino acid sites  Total citations: 11, self-citations: 0, citations by others: 11
Inferring positive selection at single amino acid sites is of biological and medical importance. Parsimony-based and likelihood-based methods have been developed for this purpose, but the reliabilities of these methods are not well understood. Because the evolutionary models assumed in these methods are only rough approximations to reality, it is desirable that the methods are not very sensitive to violation of the assumptions made. In this study we show by computer simulation that the likelihood-based method is sensitive to violation of the assumptions and produces many false-positive results under certain conditions, whereas the parsimony-based method tends to be conservative. These observations, together with those from previous studies, suggest that the positively selected sites inferred by the parsimony-based method are more reliable than those inferred by the likelihood-based method.
9.
ADAPTSITE: detecting natural selection at single amino acid sites.  Total citations: 12, self-citations: 0, citations by others: 12
ADAPTSITE is a program package for detecting natural selection at single amino acid sites, using a multiple alignment of protein-coding sequences for a given phylogenetic tree. The program infers ancestral codons at all interior nodes, and computes the total numbers of synonymous (c(S)) and nonsynonymous (c(N)) substitutions as well as the average numbers of synonymous (s(S)) and nonsynonymous (s(N)) sites for each codon site. The probabilities of occurrence of synonymous and nonsynonymous substitutions are approximated by s(S) / (s(S) + s(N)) and s(N) / (s(S) + s(N)), respectively. The null hypothesis of selective neutrality is tested for each codon site, assuming a binomial distribution for the probability of obtaining c(S) and c(N). AVAILABILITY: ADAPTSITE is available free of charge at the World-Wide Web sites http://mep.bio.psu.edu/adaptivevol.html and http://www.cib.nig.ac.jp/dda/yossuzuk/welcome.html. The package includes the source code written in C, binary files for UNIX operating systems, manual, and example files.
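The per-site binomial test outlined in this abstract can be sketched as follows. This is an illustration only, not ADAPTSITE's code: the function names are hypothetical, the counts are made up, and the one-sided upper tail (an excess of nonsynonymous changes) is an assumption, since the abstract does not say which tail the program reports.

```python
from math import comb


def binomial_upper_tail(k, n, p):
    """P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


def neutrality_p_value(c_s, c_n, s_s, s_n):
    """One-sided p-value for an excess of nonsynonymous changes at one codon site.

    c_s, c_n: counted synonymous and nonsynonymous substitutions at the site.
    s_s, s_n: average numbers of synonymous and nonsynonymous sites.
    Under neutrality each substitution is nonsynonymous with probability
    s_n / (s_s + s_n), as stated in the abstract.
    """
    p_n = s_n / (s_s + s_n)
    return binomial_upper_tail(c_n, c_s + c_n, p_n)


# Hypothetical codon site: 8 nonsynonymous and 0 synonymous changes,
# with 0.75 synonymous and 2.25 nonsynonymous sites.
p_value = neutrality_p_value(c_s=0, c_n=8, s_s=0.75, s_n=2.25)
print(round(p_value, 4))  # → 0.1001
```

With only eight changes the test cannot reach a 5% threshold here even though every change is nonsynonymous, which is consistent with the conservative behaviour of parsimony-based tests reported in other abstracts in this list.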
10.
Not so different after all: a comparison of methods for detecting amino acid sites under selection  Total citations: 22, self-citations: 0, citations by others: 22
We consider three approaches for estimating the rates of nonsynonymous and synonymous changes at each site in a sequence alignment in order to identify sites under positive or negative selection: (1) a suite of fast likelihood-based "counting methods" that employ either a single most likely ancestral reconstruction, weighting across all possible ancestral reconstructions, or sampling from ancestral reconstructions; (2) a random effects likelihood (REL) approach, which models variation in nonsynonymous and synonymous rates across sites according to a predefined distribution, with the selection pressure at an individual site inferred using an empirical Bayes approach; and (3) a fixed effects likelihood (FEL) method that directly estimates nonsynonymous and synonymous substitution rates at each site. All three methods incorporate flexible models of nucleotide substitution bias and variation in both nonsynonymous and synonymous substitution rates across sites, facilitating the comparison between the methods. We demonstrate that the results obtained using these approaches show broad agreement in levels of Type I and Type II error and in estimates of substitution rates. Counting methods are well suited for large alignments, for which there is high power to detect positive and negative selection, but appear to underestimate the substitution rate. A REL approach, which is more computationally intensive than counting methods, has higher power than counting methods to detect selection in data sets of intermediate size but may suffer from higher rates of false positives for small data sets. A FEL approach appears to capture the pattern of rate variation better than counting methods or random effects models, does not suffer from as many false positives as random effects models for data sets comprising few sequences, and can be efficiently parallelized. 
Our results suggest that previously reported differences between results obtained by counting methods and random effects models arise due to a combination of the conservative nature of counting-based methods, the failure of current random effects models to allow for variation in synonymous substitution rates, and the naive application of random effects models to extremely sparse data sets. We demonstrate our methods on sequence data from the human immunodeficiency virus type 1 env and pol genes and simulated alignments.
11.
Influenza A viruses are single-stranded RNA viruses capable of evolving rapidly to adapt to environmental conditions. Examples include the establishment of a virus in a novel host or an adaptation to increasing immunity within the host population due to prior infection or vaccination against a circulating strain. Knowledge of the viral protein regions under positive selection is therefore crucial for surveillance. We have developed a method for detecting positively selected patches of sites on the surface of viral proteins, which we assume to be relevant for adaptive evolution. We measure positive selection based on dN/dS ratios of genetic changes inferred by considering the phylogenetic structure of the data and suggest a graph-cut algorithm to identify such regions. Our algorithm searches for dense and spatially distinct clusters of sites under positive selection on the protein surface. For the hemagglutinin protein of human influenza A viruses of the subtypes H3N2 and H1N1, our predicted sites significantly overlap with known antigenic and receptor-binding sites. From the structure and sequence data of the 2009 swine-origin influenza A/H1N1 hemagglutinin and PB2 protein, we identified regions that provide evidence of evolution under positive selection since introduction of the virus into the human population. The changes in PB2 overlap with sites reported to be associated with mammalian adaptation of the influenza A virus. Application of our technique to the protein structures of viruses of yet unknown adaptive behavior could identify further candidate regions that are important for host-virus interaction.
12.
Cohen S 《Molecular biology and evolution》2002,19(11):1870-1880
Population-level studies using the major histocompatibility complex (Mhc) have linked specific alleles with specific diseases, but data requirements are high and the power to detect disease association is low. A novel use of Mhc population surveys involves mapping allelic substitutions onto the inferred structural molecular model to show functional differentiation related to local selective pressures. In the estuarine fish Fundulus heteroclitus, populations experiencing strong differences in antigenic challenges show significant differences in amino acid substitution patterns that are reflected as variation in the structural location of changes between populations. Fish from a population genetically adapted to severe chemical pollution also show novel patterns of DNA substitution at a highly variable Mhc class II B locus including strong signals of positive selection at inferred antigen-binding sites and population-specific signatures of amino acid substitution. Heavily parasitized fish from an extreme PCB-contaminated (U.S. Environmental Protection Agency Superfund) site show enhanced population-specific substitutions in the α-helix portion of the inferred antigen-binding region. In contrast, fish from an unpolluted site show a significantly different pattern focused on the first strand of the β-pleated sheet. Whether Mhc population profile differences represent the direct effects of chemical toxicants or indirect parasite-mediated selection, the result is a composite habitat-specific signature of strong selection and evolution affecting the genetic repertoire of the major histocompatibility complex.
13.
Suzuki Y 《Genes & genetic systems》2007,82(4):341-360
Natural selection operating on amino acid substitution at single amino acid sites can be detected by comparing the rates of synonymous (r(S)) and nonsynonymous (r(N)) nucleotide substitution at single codon sites. Amino acid substitutions can be classified as conservative or radical according to whether they retain the properties of the substituted amino acid. Here methods for comparing the rates of conservative (r(C)) and radical (r(R)) nonsynonymous substitution with r(S) at single codon sites were developed to detect natural selection operating on these substitutions at single amino acid sites. A method for comparing r(C) and r(R) at single codon sites was also developed to detect biases toward these substitutions at single amino acid sites. Charge was used as the property of the amino acids. In a computer simulation, false-positive rates of these methods were always < 5%, unless termination sites were included in the computation of the numbers of sites and estimates of transition/transversion rate ratio were highly biased. The frequency of detection of natural selection operating on conservative substitution was almost independent of the presence of natural selection operating on radical substitution, and vice versa. Natural selection operating specifically on conservative and radical substitution was detected more efficiently by comparing r(S) with r(C) and r(S) with r(R) than by comparing r(S) with r(N). These methods also appeared to be robust against the occurrence of recombination during evolution. In an analysis of class I human leukocyte antigen, negative selection operating on conservative substitution, but not positive selection operating on radical substitution, was observed at some of the codon sites with r(R) > r(C), suggesting that r(R) > r(C) may not necessarily be an indicator of positive selection operating on radical substitution.
14.
Zhen Chen Yuan Zhou Jiangning Song Ziding Zhang 《Biochimica et Biophysica Acta - Proteins and Proteomics》2013,1834(8):1461-1467
As one of the most common post-translational modifications, ubiquitination regulates the quantity and function of a variety of proteins. Experimental and clinical investigations have also suggested the crucial roles of ubiquitination in several human diseases. The complicated sequence context of human ubiquitination sites revealed by proteomic studies highlights the need to develop effective computational strategies to predict human ubiquitination sites. Here we report the establishment of a novel human-specific ubiquitination site predictor through the integration of multiple complementary classifiers. Firstly, a Support Vector Machine (SVM) classifier was constructed based on the composition of k-spaced amino acid pairs (CKSAAP) encoding, which has been utilized in our previous yeast ubiquitination site predictor. To further exploit the pattern and properties of the ubiquitination sites and their flanking residues, three additional SVM classifiers were constructed using the binary amino acid encoding, the AAindex physicochemical property encoding and the protein aggregation propensity encoding, respectively. Through an integration that relied on logistic regression, the resulting predictor termed hCKSAAP_UbSite achieved an area under ROC curve (AUC) of 0.770 in 5-fold cross-validation test on a class-balanced training dataset. When tested on a class-balanced independent testing dataset that contains 3419 ubiquitination sites, hCKSAAP_UbSite has also achieved a robust performance with an AUC of 0.757. Specifically, it has consistently performed better than the predictor using the CKSAAP encoding alone and two other publicly available predictors which are not human-specific. Given its promising performance in our large-scale datasets, hCKSAAP_UbSite has been made publicly available at our server (http://protein.cau.edu.cn/cksaap_ubsite/).
15.
Hitchhiking under positive Darwinian selection  Total citations: 77, self-citations: 0, citations by others: 77
Positive selection can be inferred from its effect on linked neutral variation. In the restrictive case when there is no recombination, all linked variation is removed. If recombination is present but rare, both deterministic and stochastic models of positive selection show that linked variation hitchhikes to either low or high frequencies. While the frequency distribution of variation can be influenced by a number of evolutionary processes, an excess of derived variants at high frequency is a unique pattern produced by hitchhiking (derived refers to the nonancestral state as determined from an outgroup). We adopt a statistic, H, to measure an excess of high compared to intermediate frequency variants. Only a few high-frequency variants are needed to detect hitchhiking since not many are expected under neutrality. This is of particular utility in regions of low recombination where there is not much variation and in regions of normal or high recombination, where the hitchhiking effect can be limited to a small (<1 kb) region. Application of the H test to published surveys of Drosophila variation reveals an excess of high frequency variants that are likely to have been influenced by positive selection.
16.
Resch AM Carmel L Mariño-Ramírez L Ogurtsov AY Shabalina SA Rogozin IB Koonin EV 《Molecular biology and evolution》2007,24(8):1821-1831
Evolution of protein sequences is largely governed by purifying selection, with a small fraction of proteins evolving under positive selection. The evolution at synonymous positions in protein-coding genes is not nearly as well understood, with the extent and types of selection remaining, largely, unclear. A statistical test to identify purifying and positive selection at synonymous sites in protein-coding genes was developed. The method compares the rate of evolution at synonymous sites (Ks) to that in intron sequences of the same gene after sampling the aligned intron sequences to mimic the statistical properties of coding sequences. We detected purifying selection at synonymous sites in approximately 28% of the 1,562 analyzed orthologous genes from mouse and rat, and positive selection in approximately 12% of the genes. Thus, the fraction of genes with readily detectable positive selection at synonymous sites is much greater than the fraction of genes with comparable positive selection at nonsynonymous sites, i.e., at the level of the protein sequence. Unlike other genes, the genes with positive selection at synonymous sites showed no correlation between Ks and the rate of evolution in nonsynonymous sites (Ka), indicating that evolution of synonymous sites under positive selection is decoupled from protein evolution. The genes with purifying selection at synonymous sites showed significant anticorrelation between Ks and expression level and breadth, indicating that highly expressed genes evolve slowly. The genes with positive selection at synonymous sites showed the opposite trend, i.e., highly expressed genes had, on average, higher Ks. For the genes with positive selection at synonymous sites, a significantly lower mRNA stability is predicted compared to the genes with negative selection. 
Thus, mRNA destabilization could be an important factor driving positive selection in nonsynonymous sites, probably, through regulation of expression at the level of mRNA degradation and, possibly, also translation rate. So, unexpectedly, we found that positive selection at synonymous sites of mammalian genes is substantially more common than positive selection at the level of protein sequences. Positive selection at synonymous sites might act through mRNA destabilization affecting mRNA levels and translation.
17.
Sawyer SA Kulathinal RJ Bustamante CD Hartl DL 《Journal of molecular evolution》2003,57(Z1):S154-S164
One of the principal goals of population genetics is to understand the processes by which genetic variation within species (polymorphism) becomes converted into genetic differences between species (divergence). In this transformation, selective neutrality, near neutrality, and positive selection may each play a role, differing from one gene to the next. Synonymous nucleotide sites are often used as a uniform standard of comparison across genes on the grounds that synonymous sites are subject to relatively weak selective constraints and so may, to a first approximation, be regarded as neutral. Synonymous sites are also interdigitated with nonsynonymous sites and so are affected equally by genomic context and demographic factors. Hence a comparison of levels of polymorphism and divergence between synonymous sites and amino acid replacement sites in a gene is potentially informative about the magnitude of selective forces associated with amino acid replacements. We have analyzed 56 genes in which polymorphism data from D. simulans are compared with divergence from a reference strain of D. melanogaster. The framework of the analysis is Bayesian and assumes that the distribution of selective effects (Malthusian fitnesses) is Gaussian with a mean that differs for each gene. In such a model, the average scaled selection intensity (gamma = N(e)s) of amino acid replacements eligible to become polymorphic or fixed is -7.31, and the standard deviation of selective effects within each locus is 6.79 (assuming homoscedasticity across loci). For newly arising mutations of this type that occur in autosomal or X-linked genes, the average proportion of beneficial mutations is 19.7%. Among the amino acid polymorphisms in the sample, the expected average proportion of beneficial mutations is 47.7%, and among amino acid replacements that become fixed the average proportion of beneficial mutations is 94.3%. The average scaled selection intensity of fixed mutations is +5.1. 
The presence of positive selection is pervasive with the single exception of kl-5, a Y-linked fertility gene. We find no evidence that a significant fraction of fixed amino acid replacements is neutral or nearly neutral or that positive selection drives amino acid replacements at only a subset of the loci. These results are model dependent and we discuss possible modifications of the model that might allow more neutral and nearly neutral amino acid replacements to be fixed.
18.
Detection of positive Darwinian selection has become ever more important with the rapid growth of genomic data sets. Recent branch-site models of codon substitution account for variation of selective pressure over branches on the tree and across sites in the sequence and provide a means to detect short episodes of molecular adaptation affecting just a few sites. In likelihood ratio tests based on such models, the branches to be tested for positive selection have to be specified a priori. In the absence of a biological hypothesis to designate so-called foreground branches, one may test many branches, but a correction for multiple testing becomes necessary. In this paper, we employ computer simulation to evaluate the performance of 6 multiple test correction procedures when the branch-site models are used to test every branch on the phylogeny for positive selection. Four of the methods control the familywise error rates (FWERs), whereas the other 2 control the false discovery rate (FDR). We found that all correction procedures achieved acceptable FWER except for extremely divergent sequences and serious model violations, when the test may become unreliable. The power of the test to detect positive selection is influenced by the strength of selection and the sequence divergence, with the highest power observed at intermediate divergences. The 4 correction procedures that control the FWER had similar power. We recommend Rom's procedure for its slightly higher power, but the simple Bonferroni correction is useable as well. The 2 correction procedures that control the FDR had slightly more power and also higher FWER. We demonstrate the multiple test procedures by analyzing gene sequences from the extracellular domain of the cluster of differentiation 2 (CD2) gene from 10 mammalian species. Both our simulation and real data analysis suggest that the multiple test procedures are useful when multiple branches have to be tested on the same data set.
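To make the contrast between FWER and FDR control concrete, here is a minimal sketch of two standard corrections applied to a set of per-branch p-values. The Bonferroni correction is named in the abstract; the Benjamini-Hochberg step-up procedure is used here as an assumed stand-in for an FDR method, because the abstract does not name the two FDR procedures it evaluated. Rom's procedure is not shown. The p-values are invented for illustration.

```python
def bonferroni_reject(p_values, alpha=0.05):
    """Reject H0 for tests with p <= alpha / m (controls the familywise error rate)."""
    m = len(p_values)
    return [p <= alpha / m for p in p_values]


def benjamini_hochberg_reject(p_values, alpha=0.05):
    """Benjamini-Hochberg step-up procedure (controls the false discovery rate)."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    # Find the largest rank k such that p_(k) <= k * alpha / m.
    max_rank = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank * alpha / m:
            max_rank = rank
    # Reject every hypothesis ranked at or below that k.
    reject = [False] * m
    for idx in order[:max_rank]:
        reject[idx] = True
    return reject


# Hypothetical p-values from testing five branches of one phylogeny.
branch_p_values = [0.001, 0.012, 0.025, 0.045, 0.5]
fwer_calls = bonferroni_reject(branch_p_values)
fdr_calls = benjamini_hochberg_reject(branch_p_values)
print(fwer_calls)  # → [True, False, False, False, False]
print(fdr_calls)   # → [True, True, True, False, False]
```

In this toy case the FDR procedure flags more branches than Bonferroni, in line with the abstract's observation that FDR-controlling procedures had slightly more power along with a higher FWER.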
19.
Translation initiation sites (TISs) are important signals in cDNA sequences. In many previous attempts to predict TISs in cDNA sequences, three major factors affect the prediction performance: the nature of the cDNA sequence sets, the relevant features selected, and the classification methods used. In this paper, we examine different approaches to select and integrate relevant features for TIS prediction. The top selected significant features include the features from the position weight matrix and the propensity matrix, the number of nucleotide C in the sequence downstream ATG, the number of downstream stop codons, the number of upstream ATGs, and the number of some amino acids, such as amino acids A and D. With the numerical data generated from these features, different classification methods, including decision tree, naive Bayes, and support vector machine, were applied to three independent sequence sets. The identified significant features were found to be biologically meaningful, while the experiments showed promising results.
20.