20 similar articles found (search time: 15 ms)
1.
A popular approach to detecting positive selection is to estimate the parameters of a probabilistic model of codon evolution and perform inference based on its maximum likelihood parameter values. This approach has been evaluated intensively in a number of simulation studies and found to be robust when the available data set is large. However, uncertainties in the estimated parameter values can lead to errors in the inference, especially when the data set is small or there is insufficient divergence between the sequences. We introduce a Bayesian model comparison approach to infer whether the sequence as a whole contains sites at which the rate of nonsynonymous substitution is greater than the rate of synonymous substitution. We incorporated this probabilistic model comparison into a Bayesian approach to site-specific inference of positive selection. Using simulated sequences, we compared this approach to the commonly used empirical Bayes approach and investigated the effect of tree length on the performance of both methods. We found that the Bayesian approach outperforms the empirical Bayes method when the amount of sequence divergence is small and is less prone to false-positive inference when the sequences are saturated, while the results are indistinguishable for intermediate levels of sequence divergence.
2.
Molecular markers derived from polymerase chain reaction (PCR) amplification of genomic DNA are an important part of the toolkit of evolutionary geneticists. Random amplified polymorphic DNA markers (RAPDs), amplified fragment length polymorphisms (AFLPs) and intersimple sequence repeat (ISSR) polymorphisms allow analysis of species for which previous DNA sequence information is lacking, but dominance makes it impossible to apply standard techniques to calculate F-statistics. We describe a Bayesian method that allows direct estimates of FST from dominant markers. In contrast to existing alternatives, we do not assume previous knowledge of the degree of within-population inbreeding. In particular, we do not assume that genotypes within populations are in Hardy-Weinberg proportions. Our estimate of FST incorporates uncertainty about the magnitude of within-population inbreeding. Simulations show that samples from even a relatively small number of loci and populations produce reliable estimates of FST. Moreover, some information about the degree of within-population inbreeding (FIS) is available from data sets with a large number of loci and populations. We illustrate the method with a reanalysis of RAPD data from 14 populations of a North American orchid, Platanthera leucophaea.
3.
Background
In this study we present a single-population test (Ewens-Watterson) applied in a genomic context to investigate the presence of recent positive selection in the Irish population. The Irish population is an interesting focus for the investigation of recent selection since several lines of evidence suggest that it may have a relatively undisturbed genetic heritage.
4.
S M Ossadnik, S V Buldyrev, A L Goldberger, S Havlin, R N Mantegna, C K Peng, M Simons, H E Stanley. Biophysical Journal, 1994, 67(1):64-70
Recently, it was observed that noncoding regions of DNA sequences possess long-range power-law correlations, whereas coding regions typically display only short-range correlations. We develop an algorithm based on this finding that enables investigators to perform a statistical analysis on long DNA sequences to locate possible coding regions. The algorithm is particularly successful in predicting the location of lengthy coding regions. For example, for the complete genome of yeast chromosome III (315,344 nucleotides), at least 82% of the predictions correspond to putative coding regions; the algorithm correctly identified all coding regions larger than 3000 nucleotides, 92% of coding regions between 2000 and 3000 nucleotides long, and 79% of coding regions between 1000 and 2000 nucleotides. The predictive ability of this new algorithm supports the claim that there is a fundamental difference in the correlation property between coding and noncoding sequences. This algorithm, which is not species-dependent, can be implemented with other techniques for rapidly and accurately locating relatively long coding regions in genomic sequences.
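The abstract does not spell out the algorithm, so the following is only a minimal sketch of one common way to quantify the kind of correlation it refers to: a "DNA walk" whose root-mean-square fluctuation F(l) is fitted against the window length l on a log-log scale. The purine/pyrimidine mapping, the window sizes, and all function names are assumptions made for illustration; an exponent near 0.5 would be expected for an uncorrelated sequence, and larger values would hint at long-range correlation.

```python
import math
import random

def dna_walk(seq):
    """Cumulative walk: +1 for C or T (pyrimidines), -1 for any other character."""
    pos, walk = 0, [0]
    for base in seq.upper():
        pos += 1 if base in "CT" else -1
        walk.append(pos)
    return walk

def fluctuation(walk, window):
    """Root-mean-square fluctuation of walk displacements over `window` steps."""
    deltas = [walk[i + window] - walk[i] for i in range(len(walk) - window)]
    mean = sum(deltas) / len(deltas)
    return math.sqrt(sum((d - mean) ** 2 for d in deltas) / len(deltas))

def scaling_exponent(seq, windows=(4, 8, 16, 32, 64)):
    """Least-squares slope of log F(l) against log l (window sizes are an assumption)."""
    walk = dna_walk(seq)
    xs = [math.log(w) for w in windows]
    ys = [math.log(fluctuation(walk, w)) for w in windows]
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    return sxy / sxx

# Illustrative input only: a synthetic, uncorrelated sequence.
random.seed(0)
random_seq = "".join(random.choice("ACGT") for _ in range(20000))
alpha = scaling_exponent(random_seq)
print(f"estimated exponent for a random sequence: {alpha:.2f}")
```

Sliding such an estimate along a genome and flagging low-exponent windows as candidate coding regions would be one way to use it, though the thresholds and window sizes used in the paper are not given in the abstract.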
5.
Making choices is a fundamental aspect of human life. For over a century experimental economists have characterized the decisions people make based on the concept of a utility function. This function increases with increasing desirability of the outcome, and people are assumed to make decisions so as to maximize utility. When utility depends on several variables, indifference curves arise that represent outcomes with identical utility that are therefore equally desirable. Whereas in economics utility is studied in terms of goods and services, the sensorimotor system may also have utility functions defining the desirability of various outcomes. Here, we investigate the indifference curves when subjects experience forces of varying magnitude and duration. Using a two-alternative forced-choice paradigm, in which subjects chose between different magnitude–duration profiles, we inferred the indifference curves and the utility function. Such a utility function defines, for example, whether subjects prefer to lift a 4-kg weight for 30 s or a 1-kg weight for a minute. The measured utility function depends nonlinearly on the force magnitude and duration and was remarkably conserved across subjects. This suggests that the utility function, a central concept in economics, may be applicable to the study of sensorimotor control.
6.
Inference of population structure from genetic data plays an important role in population and medical genetics studies. With the advancement and decreasing cost of sequencing technology, the increasingly available whole genome sequencing data provide much richer information about the underlying population structure. The traditional method originally developed for array-based genotype data for computing and selecting top principal components (PCs) that capture population structure may not perform well on sequencing data for two reasons. First, the number of genetic variants p is much larger than the sample size n in sequencing data such that the sample-to-marker ratio is nearly zero, violating the assumption of the Tracy-Widom test used in their method. Second, their method might not be able to handle the linkage disequilibrium well in sequencing data. To resolve those two practical issues, we propose a new method called ERStruct to determine the number of top informative PCs based on sequencing data. More specifically, we propose to use the ratio of consecutive eigenvalues as a more robust test statistic, and then we approximate its null distribution using modern random matrix theory. Both simulation studies and applications to two public data sets from the HapMap 3 and the 1000 Genomes Projects demonstrate the empirical performance of our ERStruct method.
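To make the "ratio of consecutive eigenvalues" statistic concrete, here is a small sketch that computes those ratios from a list of eigenvalues and applies a fixed cut-off. This is not the ERStruct procedure: the abstract says the null distribution is approximated with random matrix theory, whereas the `threshold` parameter, the function names, and the example eigenvalues below are all hypothetical stand-ins chosen for illustration. The eigenvalues are assumed to be strictly positive.

```python
def eigenvalue_ratios(eigenvalues):
    """Ratios lambda_{k+1} / lambda_k for eigenvalues sorted in decreasing order."""
    lam = sorted(eigenvalues, reverse=True)
    return [lam[k + 1] / lam[k] for k in range(len(lam) - 1)]

def count_top_pcs(eigenvalues, threshold=0.5):
    """Hypothetical rule: K is the last position whose ratio falls below `threshold`.

    A real test would replace the fixed threshold with a critical value
    taken from the null distribution of the ratio.
    """
    ratios = eigenvalue_ratios(eigenvalues)
    below = [k + 1 for k, r in enumerate(ratios) if r < threshold]
    return max(below) if below else 0

# Made-up eigenvalues: two large "structure" components, then a flat bulk.
example = [50.0, 20.0, 2.1, 2.0, 1.9, 1.8]
print(eigenvalue_ratios(example))
print(count_top_pcs(example))
```

The ratio has the practical appeal of being scale-free: multiplying every eigenvalue by a constant leaves it unchanged, so no separate estimate of the noise variance is needed for this step.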
7.
Based on nearly complete genome sequences from a variety of organisms, data on naturally occurring genetic variation on the scale of hundreds of loci to entire genomes have been collected in recent years. In parallel, new statistical tests have been developed to infer evidence of recent positive selection from these data and to localize the target regions of selection in the genome. These methods have now been successfully applied to Drosophila melanogaster, humans, mice and a few plant species. In genomic regions of normal recombination rates, the targets of positive selection have been mapped down to the level of individual genes.
8.
MOTIVATION: Accurate detection of positive Darwinian selection can provide important insights to researchers investigating the evolution of pathogens. However, many pathogens (particularly viruses) undergo frequent recombination and the phylogenetic methods commonly applied to detect positive selection have been shown to give misleading results when applied to recombining sequences. We propose a method that makes maximum likelihood inference of positive selection robust to the presence of recombination. This is achieved by allowing tree topologies and branch lengths to change across detected recombination breakpoints. Further improvements are obtained by allowing synonymous substitution rates to vary across sites. RESULTS: Using simulation we show that, even for extreme cases where recombination causes standard methods to reach false positive rates >90%, the proposed method decreases the false positive rate to acceptable levels while retaining high power. We applied the method to two HIV-1 datasets for which we have previously found that inference of positive selection is invalid owing to high rates of recombination. In one of these (env gene) we still detected positive selection using the proposed method, while in the other (gag gene) we found no significant evidence of positive selection. AVAILABILITY: A HyPhy batch language implementation of the proposed methods and the HIV-1 datasets analysed are available at http://www.cbio.uct.ac.za/pub_support/bioinf06. The HyPhy package is available at http://www.hyphy.org, and it is planned that the proposed methods will be included in the next distribution. RDP2 is available at http://darwin.uvigo.es/rdp/rdp.html
9.
Most of the gene prediction algorithms for prokaryotes are based on Hidden Markov Models or similar machine-learning approaches, which require optimizing a large number of parameters. The present paper presents a novel method for the classification of coding and non-coding regions in prokaryotic genomes, based on a suitably defined compression index of a DNA sequence. The main features of this new method are the non-parametric logic and the construction of a dictionary of words extracted from the sequences. These dictionaries can be very useful to perform further analyses on the genomic sequences themselves. The proposed approach has been applied to several complete prokaryotic genomes, obtaining optimal scores of correctly recognized coding and non-coding regions. Several false-positive and false-negative cases have been investigated in detail, which have revealed that this approach can fail in the presence of highly structured coding regions (e.g., genes coding for modular proteins) or quasi-random non-coding regions (e.g., regions hosting non-functional fragments of copies of functional genes; regions hosting promoters or other protein-binding sequences). We perform an overall comparison with other gene-finder software, since at this step we are not interested in building another gene-finder system, but only in exploring the possibility of the suggested approach.
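The abstract does not define its compression index, so the sketch below only illustrates the general idea with a stand-in: the ratio of zlib-compressed size to raw size, computed over sliding windows. The use of zlib, the window and step sizes, and the function names are assumptions for illustration and are not the dictionary-based index of the paper.

```python
import random
import zlib

def compression_index(seq):
    """Compressed size divided by raw size; lower values mean more internal structure."""
    data = seq.upper().encode("ascii")
    return len(zlib.compress(data, 9)) / len(data)

def window_indices(seq, size=300, step=150):
    """(start, index) pairs for sliding windows; size and step are illustrative."""
    return [
        (start, compression_index(seq[start:start + size]))
        for start in range(0, len(seq) - size + 1, step)
    ]

# Synthetic inputs only: a highly repetitive sequence and a quasi-random one.
random.seed(1)
repetitive = "ATGGCC" * 200
quasi_random = "".join(random.choice("ACGT") for _ in range(1200))

print(compression_index(repetitive), compression_index(quasi_random))
print(len(window_indices(quasi_random)))
```

A classifier built on such an index would still need a decision rule separating coding from non-coding windows, which the abstract leaves unspecified. The failure cases it mentions (highly structured coding regions, quasi-random non-coding regions) are exactly the inputs on which any compressibility score would be expected to mislead.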
10.
Coding nucleotide sequences contain myriad functions independent of their encoded protein sequences. We present the COMIT algorithm to detect functional noncoding motifs in coding regions using sequence conservation, explicitly separating nucleotide from amino acid effects. COMIT concurs with diverse experimental datasets, including splicing enhancers, silencers, replication motifs, and microRNA targets, and predicts many novel functional motifs. Intriguingly, COMIT scores are well-correlated to scores uncalibrated for amino acids, suggesting that nucleotide motifs often override peptide-level constraints.
11.
12.
One of the most important tasks of modern bioinformatics is the development of computational tools that can be used to understand and treat human disease. To date, a variety of methods have been explored and algorithms for candidate gene prioritization are gaining in their usefulness. Here, we propose an algorithm for detecting gene-disease associations based on the human protein-protein interaction network, known gene-disease associations, protein sequence, and protein functional information at the molecular level. Our method, PhenoPred, is supervised: first, we mapped each gene/protein onto the spaces of disease and functional terms based on distance to all annotated proteins in the protein interaction network. We also encoded sequence, function, physicochemical, and predicted structural properties, such as secondary structure and flexibility. We then trained support vector machines to detect gene-disease associations for a number of terms in Disease Ontology and provided evidence that, despite the noise/incompleteness of experimental data and unfinished ontology of diseases, identification of candidate genes can be successful even when a large number of candidate disease terms are predicted on simultaneously. Availability: www.phenopred.org.
13.
Estimating evolution of temporal sequence changes: a practical approach to inferring ancestral developmental sequences and sequence heterochrony
Developmental biology often yields data in a temporal context. Temporal data in phylogenetic systematics has important uses in the field of evolutionary developmental biology and, in general, comparative biology. The evolution of temporal sequences, specifically developmental sequences, has proven difficult to examine due to the highly variable temporal progression of development. Issues concerning the analysis of temporal sequences and problems with current methods of analysis are discussed. We present here an algorithm to infer ancestral temporal sequences, quantify sequence heterochronies, and estimate pseudoreplicate consensus support for sequence changes using Parsimov-based genetic inference [PGi]. Real temporal developmental sequence data sets are used to compare PGi with currently used approaches, and PGi is shown to be the most efficient, accurate, and practical method to examine biological data and infer ancestral states on a phylogeny. The method is also expandable to address further issues in developmental evolution, namely modularity.
14.
A Bayesian heterogeneous analysis of variance approach to inferring recent selective sweeps
The distribution of microsatellite allele sizes in populations aids in understanding the genetic diversity of species and the evolutionary history of recent selective sweeps. We propose a heterogeneous Bayesian analysis of variance model for inferring loci involved in recent selective sweeps by analyzing the distribution of allele sizes at multiple loci in multiple populations. Our model is shown to be consistent with a multilocus test statistic, ln RV, proposed for identifying microsatellite loci involved in recent selective sweeps. Our methodology differs in that it accepts original allele size data rather than summary statistics and allows the incorporation of prior knowledge about allele frequencies using a hierarchical prior distribution consisting of log normal and gamma probability distributions. Interesting features of the model are its ability to simultaneously analyze allele size data for any number of populations and to cope with the presence of any number of selected loci. The utility of the method is illustrated by application to two sets of microsatellite allele size data for a group of West African Anopheles gambiae populations. The results are consistent with the suppressed-recombination model of speciation, and additional candidate loci on chromosomes 2 (079 and 175) and 3 (088) are discovered that escaped former analysis.
15.
16.
17.
18.
19.
20.
Selection mapping applies the population genetics theory of hitchhiking to the localization of genomic regions containing genes under selection. This approach predicts that neutral loci linked to genes under positive selection will have reduced diversity due to their shared history with a selected locus, and thus, genome scans of diversity levels can be used to identify regions containing selected loci. Most previous approaches to this problem ignore the spatial genomic pattern of diversity expected under selection. The regression-based approach advocated in this paper takes into account the expected pattern of decreasing genetic diversity with increased proximity to a selected locus. Simulated data are used to examine the patterns of diversity under different scenarios, in order to assess the power of a regression-based approach to the identification of regions under selection. Application of this method to both simulated and empirical data demonstrates its potential to detect selection. In contrast to some other methods, the regression approach described in this paper can be applied to any marker type. Results also suggest that this approach may give more precise estimates of the location of the selected locus than alternative methods, although the power is slightly lower in some cases.
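The spatial pattern described above (diversity falling as markers get closer to a selected locus) can be sketched as a regression of diversity on distance from each candidate position. The abstract does not give the regression model, so the ordinary least-squares line, the `flank` parameter, the function names, and the toy data below are all assumptions for illustration rather than the paper's method.

```python
def fit_line(xs, ys):
    """Ordinary least-squares slope and intercept of ys on xs."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

def scan_for_selection(positions, diversity, candidates, flank):
    """Return (candidate, slope) with the steepest rise of diversity with distance.

    For each candidate locus, markers within `flank` of it are regressed on
    their distance to the candidate; a strongly positive slope matches the
    expected valley of diversity around a selected site.
    """
    best = None
    for locus in candidates:
        pairs = [
            (abs(p - locus), d)
            for p, d in zip(positions, diversity)
            if abs(p - locus) <= flank
        ]
        distances = [dist for dist, _ in pairs]
        if len(set(distances)) < 2:
            continue  # not enough spread in distance to fit a line
        slope, _ = fit_line(distances, [d for _, d in pairs])
        if best is None or slope > best[1]:
            best = (locus, slope)
    return best

# Toy data: diversity forms a V-shaped valley centred on position 50.
positions = list(range(0, 101, 10))
diversity = [0.001 + 0.0002 * abs(p - 50) for p in positions]
best = scan_for_selection(positions, diversity, candidates=[20, 50, 80], flank=50)
print(best)
```

Because the input is just a diversity value per marker position, the same scan applies regardless of how diversity was measured, which is consistent with the abstract's remark that the approach can be applied to any marker type.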