首页 | 本学科首页   官方微博 | 高级检索  
相似文献
 共查询到20条相似文献,搜索用时 15 毫秒
1.
The structural simplicity and ability to capture serial correlations make Markov models a popular modeling choice in several genomic analyses, such as identification of motifs, genes and regulatory elements. A critical, yet relatively unexplored, issue is the determination of the order of the Markov model. Most biological applications use a predetermined order for all data sets indiscriminately. Here, we show the vast variation in the performance of such applications with the order. To identify the ‘optimal’ order, we investigated two model selection criteria: Akaike information criterion and Bayesian information criterion (BIC). The BIC optimal order delivers the best performance for mammalian phylogeny reconstruction and motif discovery. Importantly, this order is different from orders typically used by many tools, suggesting that a simple additional step determining this order can significantly improve results. Further, we describe a novel classification approach based on BIC optimal Markov models to predict functionality of tissue-specific promoters. Our classifier discriminates between promoters active across 12 different tissues with remarkable accuracy, yielding 3 times the precision expected by chance. Application to the metagenomics problem of identifying the taxum from a short DNA fragment yields accuracies at least as high as the more complex mainstream methodologies, while retaining conceptual and computational simplicity.  相似文献   

2.
Ball RD 《Genetics》2007,177(4):2399-2416
We calculate posterior probabilities for candidate genes as a function of genomic location. Posterior probabilities for quantitative trait loci (QTL) presence in a small interval are calculated using a Bayesian model-selection approach based on the Bayesian information criterion (BIC) and used to combine QTL colocation information with sequence-specific evidence, e.g., from differential expression and/or association studies. Our method takes into account uncertainty in estimation of number and locations of QTL and estimated map position. Posterior probabilities for QTL presence were calculated for simulated data with n = 100, 300, and 1200 QTL progeny and compared with interval mapping and composite-interval mapping. Candidate genes that mapped to QTL regions had substantially larger posterior probabilities. Among candidates with a given Bayes factor, those that map near a QTL are more promising for further investigation with association studies and functional testing or for use in marker-aided selection. The BIC is shown to correspond very closely to Bayes factors for linear models with a nearly noninformative Zellner prior for the simulated QTL data with n > or = 100. It is shown how to modify the BIC to use a subjective prior for the QTL effects.  相似文献   

3.
Membrane proteins move in heterogeneous environments with spatially (sometimes temporally) varying friction and with biochemical interactions with various partners. It is important to reliably distinguish different modes of motion to improve our knowledge of the membrane architecture and to understand the nature of interactions between membrane proteins and their environments. Here, we present an analysis technique for single molecule tracking (SMT) trajectories that can determine the preferred model of motion that best matches observed trajectories. The method is based on Bayesian inference to calculate the posteriori probability of an observed trajectory according to a certain model. Information theory criteria, such as the Bayesian information criterion (BIC), the Akaike information criterion (AIC), and modified AIC (AICc), are used to select the preferred model. The considered group of models includes free Brownian motion, and confined motion in 2nd or 4th order potentials. We determine the best information criteria for classifying trajectories. We tested its limits through simulations matching large sets of experimental conditions and we built a decision tree. This decision tree first uses the BIC to distinguish between free Brownian motion and confined motion. In a second step, it classifies the confining potential further using the AIC. We apply the method to experimental Clostridium Perfingens -toxin (CPT) receptor trajectories to show that these receptors are confined by a spring-like potential. An adaptation of this technique was applied on a sliding window in the temporal dimension along the trajectory. We applied this adaptation to experimental CPT trajectories that lose confinement due to disaggregation of confining domains. This new technique adds another dimension to the discussion of SMT data. The mode of motion of a receptor might hold more biologically relevant information than the diffusion coefficient or domain size and may be a better tool to classify and compare different SMT experiments.  相似文献   

4.
Hummingbirds are an important model system in avian biology, but to date the group has been the subject of remarkably few phylogenetic investigations. Here we present partitioned Bayesian and maximum likelihood phylogenetic analyses for 151 of approximately 330 species of hummingbirds and 12 outgroup taxa based on two protein-coding mitochondrial genes (ND2 and ND4), flanking tRNAs, and two nuclear introns (AK1 and BFib). We analyzed these data under several partitioning strategies ranging between unpartitioned and a maximum of nine partitions. In order to select a statistically justified partitioning strategy following partitioned Bayesian analysis, we considered four alternative criteria including Bayes factors, modified versions of the Akaike information criterion for small sample sizes (AIC(c)), Bayesian information criterion (BIC), and a decision-theoretic methodology (DT). Following partitioned maximum likelihood analyses, we selected a best-fitting strategy using hierarchical likelihood ratio tests (hLRTS), the conventional AICc, BIC, and DT, concluding that the most stringent criterion, the performance-based DT, was the most appropriate methodology for selecting amongst partitioning strategies. In the context of our well-resolved and well-supported phylogenetic estimate, we consider the historical biogeography of hummingbirds using ancestral state reconstructions of (1) primary geographic region of occurrence (i.e., South America, Central America, North America, Greater Antilles, Lesser Antilles), (2) Andean or non-Andean geographic distribution, and (3) minimum elevational occurrence. These analyses indicate that the basal hummingbird assemblages originated in the lowlands of South America, that most of the principle clades of hummingbirds (all but Mountain Gems and possibly Bees) originated on this continent, and that there have been many (at least 30) independent invasions of other primary landmasses, especially Central America.  相似文献   

5.
Bogdan M  Ghosh JK  Doerge RW 《Genetics》2004,167(2):989-999
The problem of locating multiple interacting quantitative trait loci (QTL) can be addressed as a multiple regression problem, with marker genotypes being the regressor variables. An important and difficult part in fitting such a regression model is the estimation of the QTL number and respective interactions. Among the many model selection criteria that can be used to estimate the number of regressor variables, none are used to estimate the number of interactions. Our simulations demonstrate that epistatic terms appearing in a model without the related main effects cause the standard model selection criteria to have a strong tendency to overestimate the number of interactions, and so the QTL number. With this as our motivation we investigate the behavior of the Schwarz Bayesian information criterion (BIC) by explaining the phenomenon of the overestimation and proposing a novel modification of BIC that allows the detection of main effects and pairwise interactions in a backcross population. Results of an extensive simulation study demonstrate that our modified version of BIC performs very well in practice. Our methodology can be extended to general populations and higher-order interactions.  相似文献   

6.
Models of sequence evolution play an important role in molecular evolutionary studies. The use of inappropriate models of evolution may bias the results of the analysis and lead to erroneous conclusions. Several procedures for selecting the best-fit model of evolution for the data at hand have been proposed, like the likelihood ratio test (LRT) and the Akaike (AIC) and Bayesian (BIC) information criteria. The relative performance of these model-selecting algorithms has not yet been studied under a range of different model trees. In this study, the influence of branch length variation upon model selection is characterized. This is done by simulating sequence alignments under a known model of nucleotide substitution, and recording how often this true model is recovered by different model-fitting strategies. Results of this study agree with previous simulations and suggest that model selection is reasonably accurate. However, different model selection methods showed distinct levels of accuracy. Some LRT approaches showed better performance than the AIC or BIC information criteria. Within the LRTs, model selection is affected by the complexity of the initial model selected for the comparisons, and only slightly by the order in which different parameters are added to the model. A specific hierarchy of LRTs, which starts from a simple model of evolution, performed overall better than other possible LRT hierarchies, or than the AIC or BIC. Received: 2 October 2000 / Accepted: 4 January 2001  相似文献   

7.
Nonparametric mixed effects models for unequally sampled noisy curves   总被引:7,自引:0,他引:7  
Rice JA  Wu CO 《Biometrics》2001,57(1):253-259
We propose a method of analyzing collections of related curves in which the individual curves are modeled as spline functions with random coefficients. The method is applicable when the individual curves are sampled at variable and irregularly spaced points. This produces a low-rank, low-frequency approximation to the covariance structure, which can be estimated naturally by the EM algorithm. Smooth curves for individual trajectories are constructed as best linear unbiased predictor (BLUP) estimates, combining data from that individual and the entire collection. This framework leads naturally to methods for examining the effects of covariates on the shapes of the curves. We use model selection techniques--Akaike information criterion (AIC), Bayesian information criterion (BIC), and cross-validation--to select the number of breakpoints for the spline approximation. We believe that the methodology we propose provides a simple, flexible, and computationally efficient means of functional data analysis.  相似文献   

8.
The classical multiple testing model remains an important practical area of statistics with new approaches still being developed. In this paper we develop a new multiple testing procedure inspired by a method sometimes used in a problem with a different focus. Namely, the inference after model selection problem. We note that solutions to that problem are often accomplished by making use of a penalized likelihood function. A classic example is the Bayesian information criterion (BIC) method. In this paper we construct a generalized BIC method and evaluate its properties as a multiple testing procedure. The procedure is applicable to a wide variety of statistical models including regression, contrasts, treatment versus control, change point, and others. Numerical work indicates that, in particular, for sparse models the new generalized BIC would be preferred over existing multiple testing procedures.  相似文献   

9.
In the analysis of data generated by change-point processes, one critical challenge is to determine the number of change-points. The classic Bayes information criterion (BIC) statistic does not work well here because of irregularities in the likelihood function. By asymptotic approximation of the Bayes factor, we derive a modified BIC for the model of Brownian motion with changing drift. The modified BIC is similar to the classic BIC in the sense that the first term consists of the log likelihood, but it differs in the terms that penalize for model dimension. As an example of application, this new statistic is used to analyze array-based comparative genomic hybridization (array-CGH) data. Array-CGH measures the number of chromosome copies at each genome location of a cell sample, and is useful for finding the regions of genome deletion and amplification in tumor cells. The modified BIC performs well compared to existing methods in accurately choosing the number of regions of changed copy number. Unlike existing methods, it does not rely on tuning parameters or intensive computing. Thus it is impartial and easier to understand and to use.  相似文献   

10.
Currently available methods for model selection used in phylogenetic analysis are based on an initial fixed-tree topology. Once a model is picked based on this topology, a rigorous search of the tree space is run under that model to find the maximum-likelihood estimate of the tree (topology and branch lengths) and the maximum-likelihood estimates of the model parameters. In this paper, we propose two extensions to the decision-theoretic (DT) approach that relax the fixed-topology restriction. We also relax the fixed-topology restriction for the Bayesian information criterion (BIC) and the Akaike information criterion (AIC) methods. We compare the performance of the different methods (the relaxed, restricted, and the likelihood-ratio test [LRT]) using simulated data. This comparison is done by evaluating the relative complexity of the models resulting from each method and by comparing the performance of the chosen models in estimating the true tree. We also compare the methods relative to one another by measuring the closeness of the estimated trees corresponding to the different chosen models under these methods. We show that varying the topology does not have a major impact on model choice. We also show that the outcome of the two proposed extensions is identical and is comparable to that of the BIC, Extended-BIC, and DT. Hence, using the simpler methods in choosing a model for analyzing the data is more computationally feasible, with results comparable to the more computationally intensive methods. Another outcome of this study is that earlier conclusions about the DT approach are reinforced. That is, LRT, Extended-AIC, and AIC result in more complicated models that do not contribute to the performance of the phylogenetic inference, yet cause a significant increase in the time required for data analysis.  相似文献   

11.
Clustering is a major tool for microarray gene expression data analysis. The existing clustering methods fall mainly into two categories: parametric and nonparametric. The parametric methods generally assume a mixture of parametric subdistributions. When the mixture distribution approximately fits the true data generating mechanism, the parametric methods perform well, but not so when there is nonnegligible deviation between them. On the other hand, the nonparametric methods, which usually do not make distributional assumptions, are robust but pay the price for efficiency loss. In an attempt to utilize the known mixture form to increase efficiency, and to free assumptions about the unknown subdistributions to enhance robustness, we propose a semiparametric method for clustering. The proposed approach possesses the form of parametric mixture, with no assumptions to the subdistributions. The subdistributions are estimated nonparametrically, with constraints just being imposed on the modes. An expectation-maximization (EM) algorithm along with a classification step is invoked to cluster the data, and a modified Bayesian information criterion (BIC) is employed to guide the determination of the optimal number of clusters. Simulation studies are conducted to assess the performance and the robustness of the proposed method. The results show that the proposed method yields reasonable partition of the data. As an illustration, the proposed method is applied to a real microarray data set to cluster genes.  相似文献   

12.
Statistical methods for mapping quantitative trait loci (QTLs) in full-sib forest trees, in which the number of alleles and linkage phase can vary from locus to locus, are still not well established. Previous studies assumed that the QTL segregation pattern was fixed throughout the genome in a full-sib family, despite the fact that this pattern can vary among regions of the genome. In this paper, we propose a method for selecting the appropriate model for QTL mapping based on the segregation of different types of markers and QTLs in a full-sib family. The QTL segregation patterns were classified into three types: test cross (1:1 segregation), F2 cross (1:2:1 segregation) and full cross (1:1:1:1 segregation). Akaike’s information criterion (AIC), the Bayesian information criterion (BIC) and the Laplace-empirical criterion (LEC) were used to select the most likely QTL segregation pattern. Simulations were used to evaluate the power of these criteria and the precision of parameter estimates. A Windows-based software was developed to run the selected QTL mapping method. A real example is presented to illustrate QTL mapping in forest trees based on an integrated linkage map with various segregation markers. The implications of this method for accurate QTL mapping in outbred species are discussed.  相似文献   

13.
Most existing statistical methods for mapping quantitative trait loci (QTL) are not suitable for analyzing survival traits with a skewed distribution and censoring mechanism. As a result, researchers incorporate parametric and semi-parametric models of survival analysis into the framework of the interval mapping for QTL controlling survival traits. In survival analysis, accelerated failure time (AFT) model is considered as a de facto standard and fundamental model for data analysis. Based on AFT model, we propose a parametric approach for mapping survival traits using the EM algorithm to obtain the maximum likelihood estimates of the parameters. Also, with Bayesian information criterion (BIC) as a model selection criterion, an optimal mapping model is constructed by choosing specific error distributions with maximum likelihood and parsimonious parameters. Two real datasets were analyzed by our proposed method for illustration. The results show that among the five commonly used survival distributions, Weibull distribution is the optimal survival function for mapping of heading time in rice, while Log-logistic distribution is the optimal one for hyperoxic acute lung injury.  相似文献   

14.
15.
生长参数是渔业资源评估和管理策略中的关键参数,因而对目标鱼种选择合适的生长模型至关重要.本文以北部湾多齿蛇鲻为例,采用2006年12月至2009年7逐月采集的体长与年龄鉴定数据(n=2046),运用5个候选生长模型,利用最大似然法在加性误差条件下估算生长参数,并通过模型近似解释率(R2adj)、根平均方差(RMSE)、赤井信息准则(AIC)和贝叶斯信息准则(BIC)检验模型拟合度.结果表明: 在当前大样本的情况下,4种统计方法在模型拟合度排序上表现一致;多模型推论检验结果表明,Generalized VBGF获得足够的模型支持,并占到AIC权重的95.9%,可以独立描述多齿蛇鲻的体长与年龄的生长关系,生长方程为:Lt=578.49\[1-e-0.051(t-0.14)\]0.361.  相似文献   

16.
An important task in the application of Markov models to the analysis of ion channel data is the determination of the correct gating scheme of the ion channel under investigation. Some prior knowledge from other experiments can reduce significantly the number of possible models. If these models are standard statistical procedures nested like likelihood ratio testing, provide reliable selection methods. In the case of non-nested models, information criteria like AIC, BIC, etc., are used. However, it is not known if any of these criteria provide a reliable selection method and which is the best one in the context of ion channel gating. We provide an alternative approach to model selection in the case of non-nested models with an equal number of open and closed states. The models to choose from are embedded in a properly defined general model. Therefore, we circumvent the problems of model selection in the non-nested case and can apply model selection procedures for nested models.  相似文献   

17.
Since hub nodes have been found to play important roles in many networks, highly connected hub genes are expected to play an important role in biology as well. However, the empirical evidence remains ambiguous. An open question is whether (or when) hub gene selection leads to more meaningful gene lists than a standard statistical analysis based on significance testing when analyzing genomic data sets (e.g., gene expression or DNA methylation data). Here we address this question for the special case when multiple genomic data sets are available. This is of great practical importance since for many research questions multiple data sets are publicly available. In this case, the data analyst can decide between a standard statistical approach (e.g., based on meta-analysis) and a co-expression network analysis approach that selects intramodular hubs in consensus modules. We assess the performance of these two types of approaches according to two criteria. The first criterion evaluates the biological insights gained and is relevant in basic research. The second criterion evaluates the validation success (reproducibility) in independent data sets and often applies in clinical diagnostic or prognostic applications. We compare meta-analysis with consensus network analysis based on weighted correlation network analysis (WGCNA) in three comprehensive and unbiased empirical studies: (1) Finding genes predictive of lung cancer survival, (2) finding methylation markers related to age, and (3) finding mouse genes related to total cholesterol. The results demonstrate that intramodular hub gene status with respect to consensus modules is more useful than a meta-analysis p-value when identifying biologically meaningful gene lists (reflecting criterion 1). However, standard meta-analysis methods perform as good as (if not better than) a consensus network approach in terms of validation success (criterion 2). The article also reports a comparison of meta-analysis techniques applied to gene expression data and presents novel R functions for carrying out consensus network analysis, network based screening, and meta analysis.  相似文献   

18.
Luo Y  Lin S 《Biometrics》2003,59(2):393-401
Genetic marker data has been increasingly incorporated into segregation analysis, as combined segregation and linkage analysis has been performed more frequently. In this article, we study the extent of information gains with incorporation of marker data in segregation analysis, a topic that has not been investigated rigorously. Specifically, the current study is to investigate the influence of marker data on genetic model parameter estimation. A variance matrix criterion (as the inverse of the Fisher information matrix) and a relative entropy criterion (a measure of flatness of expected log-likelihood surface) are used to quantify the information gains. Our results indicate that substantial information gain can be achieved with the incorporation of marker data. The amount of variance reduction increases as the heterozygosity of the linked marker increases and as the trait gets closer to the linked marker(s). Incorporation of marker data in larger pedigrees also yields greater information gains based on both criteria. The effect of pedigree structure is also studied.  相似文献   

19.
Fourier transform infrared and Raman spectra of nicorandil have been recorded. The structure, conformational stability, geometry optimisation and vibrational frequencies have been investigated. Complete vibrational assignments were made for the stable conformer of the molecule using restricted Hartree–Fock (RHF) and density functional theory (DFT) calculations (B3LYP) with the 6-31G(d,p) basis set. Comparison of the observed fundamental vibrational frequencies of the molecule and calculated results by RHF and DFT methods indicates that B3LYP is superior for molecular vibrational problems. The thermodynamic functions of the title molecule were also performed using the RHF and DFT methods. Natural bond order analysis of the title molecule was also carried out. Comparison of the simulated spectra with the experimental spectra provides important information about the ability of the computational method to describe the vibration modes.  相似文献   

20.
A number of imprinted genes have been observed in plants, animals and humans. They not only control growth and developmental traits, but may also be responsible for survival traits. Based on the Cox proportional hazards (PH) model, we constructed a general parametric model for dissecting genomic imprinting, in which a baseline hazard function is selectable for fitting the effects of imprinted quantitative trait loci (iQTL) genotypes on the survival curve. The expectation–maximisation (EM) algorithm is derived for solving the maximum likelihood estimates of iQTL parameters. The imprinting patterns of the detected iQTL are statistically tested under a series of null hypotheses. The Bayesian information criterion (BIC) model selection criterion is employed to choose an optimal baseline hazard function with maximum likelihood and parsimonious parameterisation. We applied the proposed approach to analyse the published data in an F2 population of mice and concluded that, among five commonly used survival distributions, the log-logistic distribution is the optimal baseline hazard function for the survival time of hyperoxic acute lung injury (HALI). Under this optimal model, five QTL were detected, among which four are imprinted in different imprinting patterns.  相似文献   

设为首页 | 免责声明 | 关于勤云 | 加入收藏

Copyright©北京勤云科技发展有限公司  京ICP备09084417号