Similar literature
 20 similar articles found (search time: 31 ms)
1.
Wang J 《Genetical research》2001,78(3):243-257
A pseudo maximum likelihood method is proposed to estimate effective population size (Ne) using temporal changes in allele frequencies at multi-allelic loci. The computation is simplified dramatically by (1) approximating the multi-dimensional joint probabilities of all the data by the product of marginal probabilities (hence the name pseudo-likelihood), (2) exploiting the special properties of the transition matrix and (3) using a hidden Markov chain algorithm. Simulations show that the pseudo-likelihood method has a similar performance but needs much less computing time and storage compared with the full likelihood method in the case of 3 alleles per locus. Due to computational developments, I was able to assess the performance of the pseudo-likelihood method against the F-statistic method over a wide range of parameters by extensive simulations. It is shown that the pseudo-likelihood method gives more accurate and precise estimates of Ne than the F-statistic method, and the performance difference is mainly due to the presence of rare alleles in the samples. The pseudo-likelihood method is also flexible and can use three or more temporal samples simultaneously to estimate satisfactorily the Ne values of each period, or the growth parameters of the population. The accuracy and precision of both methods depend on the ratio of the product of sample size and the number of generations involved to Ne, and the number of independent alleles used. In an application of the pseudo-likelihood method to a large data set of an olive fly population, more precise estimates of Ne are obtained than those from the F-statistic method.
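
The abstract above compares against an "F-statistic method" without giving its formula. As a rough illustration only, the sketch below uses one commonly cited moment estimator: a standardized variance in allele frequency (often attributed to Nei and Tajima) with a correction for sampling error. The function names, the choice of this particular estimator, and the sampling correction are assumptions, not details taken from the abstract.

```python
def temporal_fc(x, y):
    """Standardized variance in allele frequency between two samples.

    x and y hold the frequencies of the same alleles at one locus in the
    first and second temporal sample. Alleles whose denominator is zero
    (absent from, or fixed in, both samples) are skipped.
    """
    terms = []
    for xi, yi in zip(x, y):
        denom = (xi + yi) / 2 - xi * yi
        if denom > 0:
            terms.append((xi - yi) ** 2 / denom)
    if not terms:
        raise ValueError("no informative alleles")
    return sum(terms) / len(terms)


def ne_from_fc(fc, generations, s0, st):
    """Moment estimate of Ne (assumed form, see lead-in).

    s0 and st are the numbers of diploid individuals in the two samples.
    Returns infinity when sampling error explains all observed change.
    """
    drift = fc - 1 / (2 * s0) - 1 / (2 * st)
    if drift <= 0:
        return float("inf")
    return generations / (2 * drift)


# Hypothetical data: three alleles, samples taken 5 generations apart.
fc = temporal_fc([0.5, 0.3, 0.2], [0.6, 0.3, 0.1])
ne = ne_from_fc(fc, generations=5, s0=50, st=50)
```

Because each term divides by a quantity that approaches zero for rare alleles, a moment estimator of this kind is sensitive to them, which is consistent with the abstract's remark about rare alleles.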

2.
Liu L  Ho YK  Yau S 《DNA and cell biology》2007,26(7):477-483
The inhomogeneous Markov chain model is used to discriminate acceptor and donor sites in genomic DNA sequences. It outperforms statistical methods such as the homogeneous Markov chain model, higher order Markov chain and interpolated Markov chain models, and machine-learning methods such as k-nearest neighbor and support vector machine as well. Besides its high accuracy, another advantage of the inhomogeneous Markov chain model is its simplicity in computation. In the three-state system (acceptor, donor, and neither), the inhomogeneous Markov chain model is combined with a three-layer feed forward neural network. Using this combined system 3175 primate splice-junction gene sequences have been tested, with a prediction accuracy of greater than 98%.

3.

Background  

Several phylogenetic approaches have been developed to estimate species trees from collections of gene trees. However, maximum likelihood approaches for estimating species trees under the coalescent model are limited. Although the likelihood of a species tree under the multispecies coalescent model has already been derived by Rannala and Yang, it can be shown that the maximum likelihood estimate (MLE) of the species tree (topology, branch lengths, and population sizes) from gene trees under this formula does not exist. In this paper, we develop a pseudo-likelihood function of the species tree to obtain maximum pseudo-likelihood estimates (MPE) of species trees, with branch lengths of the species tree in coalescent units.

4.
5.
Vonesh EF  Chinchilli VM  Pu K 《Biometrics》1996,52(2):572-587
In recent years, generalized linear and nonlinear mixed-effects models have proved to be powerful tools for the analysis of unbalanced longitudinal data. To date, much of the work has focused on various methods for estimating and comparing the parameters of mixed-effects models. Very little work has been done in the area of model selection and goodness-of-fit, particularly with respect to the assumed variance-covariance structure. In this paper, we present a goodness-of-fit statistic which can be used in a manner similar to the R² criterion in linear regression for assessing the adequacy of an assumed mean and variance-covariance structure. In addition, we introduce an approximate pseudo-likelihood ratio test for testing the adequacy of the hypothesized covariance structure. These methods are illustrated and compared to the usual normal theory likelihood methods (Akaike's information criterion and the likelihood ratio test) using three examples. Simulation results indicate the pseudo-likelihood ratio test compares favorably with the standard normal theory likelihood ratio test, but both procedures are sensitive to departures from normality.

6.
The melting of the coding and non-coding classes of natural DNA sequences was investigated using a program, MELTSIM, which simulates DNA melting based upon an empirically parameterized nearest neighbor thermodynamic model. We calculated T(m) results of 8144 natural sequences from 28 eukaryotic organisms of varying F(GC) (mole fraction of G and C) and of 3775 coding and 3297 non-coding sequences derived from those natural sequences. These data demonstrated that the T(m) vs. F(GC) relationships in coding and non-coding DNAs are both linear but have a statistically significant difference (6.6%) in their slopes. These relationships are significantly different from the T(m) vs. F(GC) relationship embodied in the classical Marmur-Schildkraut-Doty (MSD) equation for the intact long natural sequences. By analyzing the simulation results from various base shufflings of the original DNAs and the average nearest neighbor frequencies of those natural sequences across the F(GC) range, we showed that these differences in the T(m) vs. F(GC) relationships are largely a direct result of systematic F(GC)-dependent biases in nearest neighbor frequencies for those two different DNA classes. Those differences in the T(m) vs. F(GC) relationships and biases in nearest neighbor frequencies also appear between the sequences from multicellular and unicellular organisms in the same coding or non-coding classes, albeit of smaller but significant magnitudes.
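
The two sequence statistics this abstract relies on, F(GC) and nearest neighbor frequencies, are simple to compute. The sketch below is a minimal illustration with hypothetical function names; it counts the ten-or-more overlapping dinucleotides on a single strand and does not merge reverse-complement pairs (for example AA with TT) the way a thermodynamic nearest neighbor model typically would. It is not MELTSIM and does not predict T(m).

```python
from collections import Counter


def gc_fraction(seq):
    """Mole fraction of G and C in a DNA sequence, F(GC)."""
    seq = seq.upper()
    return sum(base in "GC" for base in seq) / len(seq)


def nearest_neighbor_frequencies(seq):
    """Relative frequency of each overlapping dinucleotide on one strand."""
    seq = seq.upper()
    pairs = Counter(seq[i:i + 2] for i in range(len(seq) - 1))
    total = sum(pairs.values())
    return {pair: count / total for pair, count in pairs.items()}


# Hypothetical toy sequence.
example = "ATGCGCGTAA"
f_gc = gc_fraction(example)
nn = nearest_neighbor_frequencies(example)
```

Comparing such frequency tables between coding and non-coding sequences at similar F(GC) is one way to look for the composition-dependent biases the abstract describes.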

7.
Measurements on embryonic epithelial tissues in a diverse range of organisms have shown that the statistics of cell neighbor numbers are universal in tissues where cell proliferation is the primary cell activity. Highly simplified non-spatial models of proliferation are claimed to accurately reproduce these statistics. Using a systematic critical analysis, we show that non-spatial models are not capable of robustly describing the universal statistics observed in proliferating epithelia, indicating strong spatial correlations between cells. Furthermore we show that spatial simulations using the Subcellular Element Model are able to robustly reproduce the universal histogram. In addition these simulations are able to unify ostensibly divergent experimental data in the literature. We also analyze cell neighbor statistics in early stages of chick embryo development in which cell behaviors other than proliferation are important. We find from experimental observation that cell neighbor statistics in the primitive streak region, where cell motility and ingression are also important, show a much broader distribution. A non-spatial Markov process model provides excellent agreement with this broader histogram indicating that cells in the primitive streak may have significantly weaker spatial correlations. These findings show that cell neighbor statistics provide a potentially useful signature of collective cell behavior.

8.
Most of the gene prediction algorithms for prokaryotes are based on Hidden Markov Models or similar machine-learning approaches, which imply the optimization of a high number of parameters. The present paper presents a novel method for the classification of coding and non-coding regions in prokaryotic genomes, based on a suitably defined compression index of a DNA sequence. The main features of this new method are the non-parametric logic and the construction of a dictionary of words extracted from the sequences. These dictionaries can be very useful to perform further analyses on the genomic sequences themselves. The proposed approach has been applied to some prokaryotic complete genomes, obtaining optimal scores of correctly recognized coding and non-coding regions. Several false-positive and false-negative cases have been investigated in detail, which have revealed that this approach can fail in the presence of highly structured coding regions (e.g., genes coding for modular proteins) or quasi-random non-coding regions (e.g., regions hosting non-functional fragments of copies of functional genes; regions hosting promoters or other protein-binding sequences). We do not perform an overall comparison with other gene-finder software, since at this step we are not interested in building another gene-finder system, but only in exploring the possibility of the suggested approach.

9.
Optimal spaced seeds were developed as a method to increase sensitivity of local alignment programs similar to BLASTN. Such seeds have been used before in the program PatternHunter, and have given improved sensitivity and running time relative to BLASTN in genome-genome comparison. We study the problem of computing optimal spaced seeds for detecting homologous coding regions in unannotated genomic sequences. By using well-chosen seeds, we are able to improve the sensitivity of coding sequence alignment over that of TBLASTX, while keeping runtime comparable to BLASTN. We identify good seeds by first giving effective hidden Markov models of conservation in alignments of homologous coding regions. We give an efficient algorithm to compute the optimal spaced seed when conservation patterns are generated by these models. Our results offer the hope of improved gene finding due to fewer missed exons in DNA/DNA comparison, and more effective homology search in general, and may have applications outside of bioinformatics.
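
To make the idea of a spaced seed concrete, the sketch below checks where two already aligned, ungapped sequences produce a seed hit. A seed is written as a string of `1` (position must match) and `0` (position is ignored). The function name, the toy sequences, and the short seed `1101` are hypothetical; this only illustrates hit detection and is not the seed-optimization algorithm from the abstract.

```python
def seed_hits(a, b, seed):
    """Return the start offsets at which sequences a and b agree at every
    '1' position of the seed. Both sequences are compared position by
    position, i.e. along a single ungapped diagonal."""
    span = len(seed)
    care = [i for i, flag in enumerate(seed) if flag == "1"]
    n = min(len(a), len(b))
    return [
        start
        for start in range(n - span + 1)
        if all(a[start + i] == b[start + i] for i in care)
    ]


a = "ACGTACGT"
b = "ACCTACGA"  # differs from a at positions 2 and 7

spaced = seed_hits(a, b, "1101")      # three required matches, one wildcard
contiguous = seed_hits(a, b, "111")   # three required matches, no wildcard
```

With these toy inputs the spaced seed hits at offsets 0 and 3, while the contiguous seed of the same weight hits at offsets 3 and 4: the spaced seed tolerates the mismatch at position 2, and its hits overlap less, so they are less redundant. That lower overlap between hits is the usual intuition for why spaced seeds can be more sensitive than contiguous ones.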

10.
Estimating intraclass correlation for binary data   Total citations: 5 (self-citations: 0, citations by others: 5)
This paper reviews many different estimators of intraclass correlation that have been proposed for binary data and compares them in an extensive simulation study. Some of the estimators are very specific, while others result from general methods such as pseudo-likelihood and extended quasi-likelihood estimation. The simulation study identifies several useful estimators, one of which does not seem to have been considered previously for binary data. Estimators based on extended quasi-likelihood are found to have a substantial bias in some circumstances.
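
The abstract does not list the estimators it compares. As one concrete example of what an intraclass correlation estimator for clustered binary data looks like, the sketch below implements the standard one-way analysis-of-variance estimator. Choosing this estimator, and the function name, are assumptions made for illustration; nothing here indicates how it ranked in the paper's simulation study.

```python
def anova_icc(clusters):
    """One-way ANOVA estimator of intraclass correlation for binary data.

    clusters is a list of lists of 0/1 responses, one inner list per cluster.
    Returns NaN when every response is identical (no variation to partition).
    """
    k = len(clusters)
    sizes = [len(c) for c in clusters]
    totals = [sum(c) for c in clusters]
    n_total = sum(sizes)

    # For 0/1 data the sum of squared responses equals the sum of responses.
    within_term = sum(t * t / n for t, n in zip(totals, sizes))
    msb = (within_term - sum(totals) ** 2 / n_total) / (k - 1)
    msw = (sum(totals) - within_term) / (n_total - k)

    # Effective cluster size for unequal cluster sizes.
    n0 = (n_total - sum(n * n for n in sizes) / n_total) / (k - 1)

    denom = msb + (n0 - 1) * msw
    if denom == 0:
        return float("nan")
    return (msb - msw) / denom


# Hypothetical data: responses agree perfectly within each cluster.
rho = anova_icc([[1, 1, 1, 1], [0, 0, 0, 0]])
```

The estimate equals 1 when clusters are internally identical but differ from one another, and it can be negative when responses vary more within clusters than between them.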

11.
Sixty-four eucaryotic nuclear DNA sequences, half of them coding and half noncoding, have been examined as expressions of first-, second-, or third-order Markov chains. Standard statistical tests found that most of the sequences required at least second-order Markov chains for their representation, and some required chains of third order. For all 64 sequences the observed one-step second-order transition count matrices were effective in predicting the two-step transition count matrices, and 56 of 64 were effective in predicting the three-step transition count matrices. The departure from random expectation of the observed first- and second-order transition count matrices meant that a considerable sample of eucaryotic nuclear DNA sequences, both protein coding and noncoding, have significant local structure over subsequences of three to five contiguous bases, and that this structure occurs throughout the total length of the sequence. These results suggested that present DNA sequences may have arisen from the duplication, concatenation, and gradual modification of very early short sequences.
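
The transition count matrices mentioned in this abstract can be tabulated directly from a sequence. The sketch below counts, for a Markov chain of a given order, how often each context of that many bases is followed by each next base, and converts the counts to conditional probabilities. Function names and the toy sequence are hypothetical, and the statistical tests used in the study to choose the order are not reproduced.

```python
from collections import defaultdict


def transition_counts(seq, order):
    """Count occurrences of each (context, next base) pair, where the
    context is the preceding `order` bases."""
    counts = defaultdict(lambda: defaultdict(int))
    for i in range(len(seq) - order):
        context = seq[i:i + order]
        next_base = seq[i + order]
        counts[context][next_base] += 1
    return counts


def transition_probabilities(counts):
    """Normalize each context's counts into conditional probabilities."""
    probs = {}
    for context, followers in counts.items():
        total = sum(followers.values())
        probs[context] = {base: n / total for base, n in followers.items()}
    return probs


# Hypothetical toy sequence.
first_order = transition_probabilities(transition_counts("ACACAG", 1))
second_order = transition_probabilities(transition_counts("ACACAG", 2))
```

A second-order table has up to 16 contexts and a third-order table up to 64, so higher orders need considerably longer sequences before the counts become reliable.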

12.
Summary Sixty-four eucaryotic nuclear DNA sequences, half of them coding and half noncoding, have been examined as expressions of first-, second-, or third-order Markov chains. Standard statistical tests found that most of the sequences required at least second-order Markov chains for their representation, and some required chains of third order. For all 64 sequences the observed one-step second-order transition count matrices were effective in predicting the two-step transition count matrices, and 56 of 64 were effective in predicting the three-step transition count matrices. The departure from random expectation of the observed first- and second-order transition count matrices meant that a considerable sample of eucaryotic nuclear DNA sequences, both protein coding and noncoding, have significant local structure over subsequences of three to five contiguous bases, and that this structure occurs throughout the total length of the sequence. These results suggested that present DNA sequences may have arisen from the duplication, concatenation, and gradual modification of very early short sequences.

13.

Background

Genomic data are used in animal breeding to assist genetic evaluation. Several models to estimate genomic breeding values have been studied. In general, two approaches have been used. One approach estimates the marker effects first and then, genomic breeding values are obtained by summing marker effects. In the second approach, genomic breeding values are estimated directly using an equivalent model with a genomic relationship matrix. Allele coding is the method chosen to assign values to the regression coefficients in the statistical model. A common allele coding is zero for the homozygous genotype of the first allele, one for the heterozygote, and two for the homozygous genotype for the other allele. Another common allele coding changes these regression coefficients by subtracting a value from each marker such that the mean of regression coefficients is zero within each marker. We call this centered allele coding. This study considered effects of different allele coding methods on inference. Both marker-based and equivalent models were considered, and restricted maximum likelihood and Bayesian methods were used in inference.

Results

Theoretical derivations showed that parameter estimates and estimated marker effects in marker-based models are the same irrespective of the allele coding, provided that the model has a fixed general mean. For the equivalent models, the same results hold, even though different allele coding methods lead to different genomic relationship matrices. Calculated genomic breeding values are independent of allele coding when the estimate of the general mean is included into the values. Reliabilities of estimated genomic breeding values calculated using elements of the inverse of the coefficient matrix depend on the allele coding because different allele coding methods imply different models. Finally, allele coding affects the mixing of Markov chain Monte Carlo algorithms, with the centered coding being the best.

Conclusions

Different allele coding methods lead to the same inference in the marker-based and equivalent models when a fixed general mean is included in the model. However, reliabilities of genomic breeding values are affected by the allele coding method used. The centered coding has some numerical advantages when Markov chain Monte Carlo methods are used.
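
The two allele codings described in this abstract, and the reason a fixed general mean makes them equivalent, can be shown on a small example. The sketch below starts from 0/1/2 genotype codes, centers each marker to mean zero, and then checks that the sum of marker effects under the two codings differs by the same constant for every animal. The function names, genotypes, and marker effects are all made up for illustration.

```python
def center_markers(genotypes):
    """Subtract each marker's mean from its column of 0/1/2 codes.

    genotypes is a list of rows, one row per animal, one column per marker.
    Returns the centered matrix and the list of marker means.
    """
    n = len(genotypes)
    m = len(genotypes[0])
    means = [sum(row[j] for row in genotypes) / n for j in range(m)]
    centered = [[row[j] - means[j] for j in range(m)] for row in genotypes]
    return centered, means


def summed_marker_effects(coded, effects):
    """Sum of coded genotype times marker effect, per animal."""
    return [sum(z * a for z, a in zip(row, effects)) for row in coded]


# Hypothetical data: 3 animals, 3 markers, arbitrary marker effects.
genotypes = [[0, 1, 2], [1, 1, 0], [2, 1, 1]]
effects = [0.5, -0.2, 0.1]

centered, means = center_markers(genotypes)
raw_values = summed_marker_effects(genotypes, effects)
centered_values = summed_marker_effects(centered, effects)

# The two codings differ by one constant, which a general mean can absorb.
shift = sum(mu * a for mu, a in zip(means, effects))
```

Here every entry of `raw_values` exceeds the matching entry of `centered_values` by `shift`, so differences between animals are unchanged by the coding. This matches the abstract's statement that breeding values are independent of allele coding once the estimate of the general mean is included.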

14.
15.
MOTIVATION: As the number of fully sequenced prokaryotic genomes continues to grow rapidly, computational methods for reliably detecting protein-coding regions become even more important. Audic and Claverie (1998) Proc. Natl Acad. Sci. USA, 95, 10026-10031, have proposed a clustering algorithm for protein-coding regions in microbial genomes. The algorithm is based on three Markov models of order k associated with subsequences extracted from a given genome. The parameters of the three Markov models are recursively updated by the algorithm which, in simulations, always appears to converge to a unique stable partition of the genome. The partition corresponds to three kinds of regions: (1) coding on the direct strand, (2) coding on the complementary strand, (3) non-coding. RESULTS: Here we provide an explanation for the convergence of the algorithm by observing that it is essentially a form of the expectation maximization (EM) algorithm applied to the corresponding mixture model. We also provide a partial justification for the uniqueness of the partition based on identifiability. Other possible variations and improvements are briefly discussed.

16.
Zhang Y  Jamshidian M 《Biometrics》2003,59(4):1099-1106
In this article, we study nonparametric estimation of the mean function of a counting process with panel observations. We introduce the gamma frailty variable to account for the intracorrelation between the panel counts of the counting process and construct a maximum pseudo-likelihood estimate with the frailty variable. Three simulated examples are given to show that this estimation procedure, while preserving the robustness and simplicity of the computation, improves the efficiency of the nonparametric maximum pseudo-likelihood estimate studied in Wellner and Zhang (2000, Annals of Statistics 28, 779-814). A real example from a bladder tumor study is used to illustrate the method.

17.
MacNab YC 《Biometrics》2003,59(2):305-315
We present Bayesian hierarchical spatial models for spatially correlated small-area health service outcome and utilization rates, with a particular emphasis on the estimation of both measured and unmeasured or unknown covariate effects. This Bayesian hierarchical model framework enables simultaneous modeling of fixed covariate effects and random residual effects. The random effects are modeled via Bayesian prior specifications reflecting spatial heterogeneity globally and relative homogeneity among neighboring areas. The model inference is implemented using Markov chain Monte Carlo methods. Specifically, a hybrid Markov chain Monte Carlo algorithm (Neal, 1995, Bayesian Learning for Neural Networks; Gustafson, MacNab, and Wen, 2003, Statistics and Computing, to appear) is used for posterior sampling of the random effects. To illustrate relevant problems, methods, and techniques, we present an analysis of regional variation in intraventricular hemorrhage incidence rates among neonatal intensive care unit patients across Canada.

18.
The role of competition in community structure and species interactions is universal. However, how one quantifies the outcome of competitive interactions is frequently debated. Here, we review the strengths and weaknesses of the target–neighbor design, a type of additive design where one of the competing species is reduced to a single individual and where controls and analyses are used for the target, but not for the neighbors. We conducted a literature review to determine how the target–neighbor design has been typically used and analyzed. We found that historically, targets were often smaller than neighbors and introduced after neighbor establishment; thus, targets would have little effect on neighbors. However, as co‐establishment of targets and neighbors of similar size is now common, the target is more likely to affect the neighbors than in its earlier usage. This can be problematic, because if targets have a significant effect on neighbor performance, bias is introduced into the assessment of the target results. As target treatment controls are necessary to determine the absolute effect of neighbors on target growth, we advocate that analysis of the neighbor competitive response serves as a necessary control for unexpected target x neighbor interactions.

19.
Statistical characterization of nucleic acid sequence functional domains   Total citations: 20 (self-citations: 14, citations by others: 6)
It has long been recognized that various genome classes were distinguishable on the basis of base composition and nearest neighbor frequencies. In addition, Grantham et al. (8) have recently presented evidence that these distinctions are preserved at the level of codon usage. As discussed in this report, it is now clear that these and related statistics can uniquely characterize the various functional domains of the genome. In particular, peptide coding, intervening segments, structural RNA coding and mitochondrial domains of the vertebrate genome are uniquely characterizable. The statistical measures not only reflect understood functional differences among these domains but suggest others. The ability of these simple statistics of nucleic acid sequences to reflect so much of the encoded complex pattern information and/or effects of selective constraints is somewhat surprising. Here, we investigated the statistical measures most distinctive of the various domains and then linked them to our current understandings insofar as possible.

20.
MOTIVATION: Prediction of the coding potential for stretches of DNA is crucial in gene calling and genome annotation, where it is used to identify potential exons and to position their boundaries in conjunction with functional sites, such as splice sites and translation initiation sites. The ability to discriminate between coding and non-coding sequences relates to the structure of coding sequences, which are organized in codons, and by their biased usage. For statistical reasons, the longer the sequences, the easier it is to detect this codon bias. However, in many eukaryotic genomes, where genes harbour many introns, both introns and exons might be small and hard to distinguish based on coding potential. RESULTS: Here, we present novel approaches that specifically aim at a better detection of coding potential in short sequences. The methods use complementary sequence features, combined with identification of which features are relevant in discriminating between coding and non-coding sequences. These newly developed methods are evaluated on different species, representative of four major eukaryotic kingdoms, and extensively compared to state-of-the-art Markov models, which are often used for predicting coding potential. The main conclusions drawn from our analyses are that (1) combining complementary sequence features clearly outperforms current Markov models for coding potential prediction in short sequence fragments, (2) coding potential prediction benefits from length-specific models, and these models are not necessarily the same for different sequence lengths and (3) comparing the results across several species indicates that, although our combined method consistently performs extremely well, there are important differences across genomes. SUPPLEMENTARY DATA: http://bioinformatics.psb.ugent.be/.
