Similar Articles
 20 similar articles found.
1.
Li Z  Sillanpää MJ 《Genetics》2012,190(1):231-249
Bayesian hierarchical shrinkage methods have been widely used for quantitative trait locus mapping. From the computational perspective, the application of the Markov chain Monte Carlo (MCMC) method is not optimal for high-dimensional problems such as the ones arising in epistatic analysis. Maximum a posteriori (MAP) estimation can be a faster alternative, but it usually produces only point estimates without providing any measures of uncertainty (i.e., interval estimates). The variational Bayes method, stemming from mean field theory in theoretical physics, is regarded as a compromise between MAP and MCMC estimation: it can be computed efficiently and produces uncertainty measures for the estimates. Furthermore, variational Bayes methods can be regarded as an extension of traditional expectation-maximization (EM) algorithms and can be applied to a broader class of Bayesian models. Thus, the use of variational Bayes algorithms based on three hierarchical shrinkage models, Bayesian adaptive shrinkage, Bayesian LASSO, and extended Bayesian LASSO, is proposed here. These methods performed generally well and were highly competitive with their MCMC counterparts in our example analyses. The use of posterior credible intervals and permutation tests is considered for deciding between quantitative trait loci (QTL) and non-QTL. The performance of the presented models is also compared with the R/qtlbim and R/BhGLM packages, using a previously studied simulated public epistatic data set.
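The variational Bayes idea above can be illustrated with a minimal coordinate-ascent (CAVI) sketch for a conjugate Gaussian regression with a fixed prior variance, which is far simpler than the adaptive shrinkage priors in the paper; all data and hyperparameters below are synthetic. For this conjugate case the variational means can be checked against the exact ridge-type posterior mean.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: y = X @ beta + noise, with noise variance assumed known.
n, p = 50, 5
X = rng.normal(size=(n, p))
beta_true = np.array([2.0, 0.0, -1.5, 0.0, 0.5])
sigma2 = 1.0   # noise variance (assumed known here)
tau2 = 10.0    # prior variance of each coefficient (fixed, not adaptive)
y = X @ beta_true + rng.normal(scale=np.sqrt(sigma2), size=n)

# Mean-field CAVI: q(beta_j) = N(mu_j, v_j), updated coordinate-wise.
mu = np.zeros(p)
v = np.zeros(p)
for _ in range(100):
    for j in range(p):
        xj = X[:, j]
        v[j] = 1.0 / (xj @ xj / sigma2 + 1.0 / tau2)
        resid = y - X @ mu + xj * mu[j]   # residual excluding coordinate j
        mu[j] = v[j] * (xj @ resid) / sigma2

# In this fully Gaussian model the exact posterior mean is the
# ridge-type solution, so the CAVI means should match it closely,
# while v gives (mean-field) interval estimates MAP would not provide.
exact = np.linalg.solve(X.T @ X / sigma2 + np.eye(p) / tau2, X.T @ y / sigma2)
print(np.round(mu, 3))
print(np.round(exact, 3))
```

The coordinate updates are exactly Gauss-Seidel sweeps on the posterior normal equations, which is why the variational means converge to the exact posterior mean here.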

2.
The evolution of the influenza A virus to increase its host range is a major concern worldwide. The molecular mechanisms by which host range increases are largely unknown. Influenza surface proteins play determining roles in the reorganization of host sialic-acid receptors and in host range. In an attempt to uncover the physico-chemical attributes that govern HA subtyping, we performed a large-scale functional analysis of over 7000 sequences of 16 different HA subtypes. A large number (896) of physico-chemical protein characteristics were calculated for each HA sequence. Then, 10 different attribute weighting algorithms were used to find the key characteristics distinguishing HA subtypes. Furthermore, to discover machine learning models that can predict HA subtypes, various Decision Tree, Support Vector Machine, Naïve Bayes, and Neural Network models were trained on the calculated protein characteristics dataset as well as on 10 trimmed datasets generated by the attribute weighting algorithms. The prediction accuracies of the machine learning methods were evaluated by 10-fold cross-validation. The results highlighted the frequency of Gln (selected by 80% of attribute weighting algorithms), the percentage/frequency of Tyr, the percentage of Cys, and the frequencies of Trp and Glu (selected by 70% of attribute weighting algorithms) as the key features associated with HA subtyping. The Random Forest tree induction algorithm and the RBF kernel function of SVM (scaled by grid search) showed a high accuracy of 98% in clustering and predicting HA subtypes based on protein attributes. Decision tree models were successful in tracing the short mutation/reassortment paths by which the influenza virus can gain the key protein structure of another HA subtype and increase its host range in a short period of time with less energy consumption.
Extracting and mining a large number of amino acid attributes of HA subtypes of influenza A virus through supervised algorithms represents a new avenue for understanding and predicting the possible future structure of influenza pandemics.
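As a toy illustration of the feature-extraction step described above, the sketch below computes per-sequence amino-acid frequencies, including the residues the study flags as discriminative (Gln, Tyr, Cys, Trp, Glu). The example sequence is invented, not a real HA fragment, and this is only one of the 896 kinds of attributes the study computed.

```python
# The 20 standard amino acids, one-letter codes.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def aa_frequencies(seq: str) -> dict:
    """Return the relative frequency of each standard amino acid."""
    seq = seq.upper()
    n = len(seq)
    return {aa: seq.count(aa) / n for aa in AMINO_ACIDS}

# Residues highlighted by the attribute weighting algorithms:
# Gln (Q), Tyr (Y), Cys (C), Trp (W), Glu (E).
KEY_RESIDUES = "QYCWE"

def key_features(seq: str) -> list:
    """Extract the frequency features for the highlighted residues."""
    freqs = aa_frequencies(seq)
    return [freqs[aa] for aa in KEY_RESIDUES]

example = "MKTIIALSYIFCLALGQDLPGNDNSTATLCLGHHAVPN"  # hypothetical fragment
print(key_features(example))
```

Vectors like this one, computed per sequence, are what the decision-tree and SVM classifiers in the study would consume.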

3.

Background  

Many recent studies that relax the assumption of independent evolution of sites have done so at the expense of a drastic increase in the number of substitution parameters. While additional parameters cannot be avoided when modelling context-dependent evolution, a large increase in model dimensionality is only justified when accompanied by careful model-building strategies that guard against overfitting. Increased dimensionality leads to a heavier numerical computation burden, longer convergence times for Bayesian Markov chain Monte Carlo algorithms, and even more tedious Bayes factor calculations.

4.
This paper presents a new statistical technique, Bayesian Generalized Associative Functional Networks (GAFNs), to model the dynamic plant-growth process of greenhouse crops. GAFNs are able to incorporate both domain knowledge and data to model complex ecosystems. Through the functional-network and Bayesian frameworks, prior knowledge can be naturally embedded into the model, and the functional relationship between inputs and outputs can be learned during training. Our main interest is in Generalized Associative Functional Networks, which are appropriate for modelling multiple-variable processes. Three main advantages arise from applying Bayesian GAFN methods to modelling the dynamic process of plant growth. First, the approach provides a powerful tool for revealing useful relationships between greenhouse environmental factors and plant-growth parameters. Second, a Bayesian GAFN can model Multiple-Input Multiple-Output (MIMO) systems from the given data; the final single model showed good generalization, successfully fitting all 12 data sets from 5 years of field experiments. Third, the Bayesian GAFN method can also serve as an optimization tool for estimating parameters of interest in the agro-ecosystem. In this work, two algorithms are proposed for the statistical inference of parameters in GAFNs. Both are based on variational inference, also called variational Bayes (VB), which may provide probabilistic interpretations for the built models. VB-based learning methods are able to yield estimates of the full posterior probability of the model parameters. Synthetic and real-world examples confirm the validity of the proposed methods.

5.
P F Thall 《Biometrics》1988,44(1):197-209
In many longitudinal studies it is desired to estimate and test the rate over time of a particular recurrent event. Often only the event counts corresponding to the elapsed time intervals between each subject's successive observation times, and baseline covariate data, are available. The intervals may vary substantially in length and number between subjects, so that the corresponding vectors of counts are not directly comparable. A family of Poisson likelihood regression models incorporating a mixed random multiplicative component in the rate function of each subject is proposed for this longitudinal data structure. A related empirical Bayes estimate of random-effect parameters is also described. These methods are illustrated by an analysis of dyspepsia data from the National Cooperative Gallstone Study.
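A hedged sketch of the kind of shrinkage such a model produces (not the paper's mixed Poisson regression itself): with a gamma prior on each subject's rate, the empirical Bayes estimate for a subject with count y over exposure t is (alpha + y) / (beta + t), which handles intervals of very different lengths gracefully. All counts, exposures, and hyperparameter values below are invented.

```python
# Per-subject recurrent-event data: counts over unequal total exposures.
counts   = [3, 0, 12, 5]          # event counts per subject
exposure = [2.0, 1.5, 10.0, 4.0]  # total follow-up time per subject

# Assumed Gamma(alpha, beta) prior on the subject-specific rate;
# in practice alpha and beta would be estimated from all subjects.
alpha, beta = 2.0, 2.0

raw_rates = [y / t for y, t in zip(counts, exposure)]
# Posterior mean rate under the gamma-Poisson model: shrinks each raw
# rate toward the prior mean alpha/beta, more strongly for subjects
# with little exposure.
eb_rates = [(alpha + y) / (beta + t) for y, t in zip(counts, exposure)]

for r, e in zip(raw_rates, eb_rates):
    print(f"raw {r:.3f} -> shrunk {e:.3f}")
```

Note how the subject with zero events over a short interval is pulled well away from the implausible raw estimate of 0.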

6.
Recent work on mutation-selection models has revealed that, under specific assumptions on the fitness function and the mutation rates, asymptotic estimates for the leading eigenvalue of the mutation-reproduction matrix may be obtained through a low-dimensional maximum principle in the limit N → ∞ (where N, or N^d with d ≥ 1, is proportional to the number of types). In order to extend this variational principle to a larger class of models, we consider here a family of reversible matrices of asymptotic dimension N^d and identify conditions under which the high-dimensional Rayleigh-Ritz variational problem may be reduced to a low-dimensional one that yields the leading eigenvalue up to an error term of order 1/N. For a large class of mutation-selection models, this implies estimates for the mean fitness, as well as a concentration result for the ancestral distribution of types.

7.
Detection-nondetection data are often used to investigate species range dynamics using Bayesian occupancy models, which rely on Markov chain Monte Carlo (MCMC) methods to sample from the posterior distribution of the model parameters. In this article we develop two Variational Bayes (VB) approximations to the posterior distribution of the parameters of a single-season site occupancy model which uses logistic link functions to model the probability of species occurrence at sites and the species detection probabilities. This task is accomplished through the development of iterative algorithms that do not use MCMC methods. Simulations and small practical examples demonstrate the effectiveness of the proposed technique. We specifically show that (under certain circumstances) the variational distributions can provide accurate approximations to the true posterior distributions of the model parameters when the number of visits per site (K) is as low as three, and that the accuracy of the approximations improves as K increases. We also show that the methodology can be used to obtain the posterior predictive distribution of the proportion of sites occupied (PAO).
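The occupancy logic above can be illustrated with plain Bayes' rule (not the paper's variational algorithm): the probability that a site is occupied after K visits with no detections, given an assumed occupancy probability psi and per-visit detection probability p. The parameter values below are illustrative only.

```python
def p_occupied_given_no_detections(psi: float, p: float, K: int) -> float:
    """Posterior probability a site is occupied after K all-zero visits.

    psi: prior occupancy probability; p: per-visit detection probability.
    """
    miss_all = (1.0 - p) ** K          # P(no detections | occupied)
    num = psi * miss_all
    return num / (num + (1.0 - psi))   # Bayes' rule

# More all-zero visits make occupancy increasingly unlikely,
# which is why accuracy considerations hinge on K.
for K in (1, 3, 10):
    print(K, round(p_occupied_given_no_detections(0.6, 0.4, K), 3))
```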

8.
MOTIVATION: The classification of samples using gene expression profiles is an important application in areas such as cancer research and environmental health studies. However, the classification is usually based on a small number of samples, and each sample is a long vector of thousands of gene expression levels. An important issue in parametric modeling for so many gene expression levels is controlling the number of nuisance parameters in the model. Large models often lead to intensive or even intractable computation, while small models may be inadequate for complex data. METHODOLOGY: We propose a two-step empirical Bayes classification method as a solution to this issue. In the first step, we use a model-based clustering algorithm, with the non-traditional purpose of grouping gene expression levels into abundance groups. In the second step, by assuming the same variance for all genes in the same group, we substantially reduce the number of nuisance parameters in our statistical model. RESULTS: The proposed model is more parsimonious, which leads to efficient computation under an empirical Bayes estimation procedure. We evaluate the method on two real examples and on simulated data. Desirably low classification error rates are obtained even when a large number of genes are pre-selected for class prediction.

9.
We have evaluated two mathematical models to describe the increase in coronary sinus pressure (CSP) following pressure controlled intermittent coronary sinus occlusion (PICSO). The models are evaluated and compared on the basis of human and canine data. Both models were fitted by non-linear least squares algorithms. Next, derived quantities, such as plateau, rise-time and mean integral of the coronary sinus pressure were calculated from the model parameters. Corresponding quantities for the two models were compared with regard to mean values, rate of successful calculation and specific features characterizing the human or canine case. One model was found to be superior for investigational purposes. The other model was found to be more stable in critical situations and is therefore suggested for usage in closed loop regulation of PICSO. Physiologically, the differences in mean values of the derived quantities between the two models were found to be negligible. The formal statistical significance of the differences is but a consequence of the large number of PICSO cycles analysed.

10.
Estimation of extreme quantal-response statistics, such as the concentration required to kill 99.9% of test subjects (LC99.9), remains a challenge in the presence of multiple covariates and complex study designs. Accurate and precise estimates of the LC99.9 for mixtures of toxicants are critical to ongoing control of a parasitic invasive species, the sea lamprey, in the Laurentian Great Lakes of North America. The toxicity of those chemicals is affected by local and temporal variations in water chemistry, which must be incorporated into the modeling. We develop multilevel empirical Bayes models for data from multiple laboratory studies. Our approach yields more accurate and precise estimation of the LC99.9 compared to alternative models considered. This study demonstrates that properly incorporating hierarchical structure in laboratory data yields better estimates of LC99.9 stream treatment values that are critical to larvae control in the field. In addition, out-of-sample prediction of the results of in situ tests reveals the presence of a latent seasonal effect not manifest in the laboratory studies, suggesting avenues for future study and illustrating the importance of dual consideration of both experimental and observational data.
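As a sketch of how an extreme quantile like the LC99.9 is read off a fitted dose-response curve, the snippet below inverts a logistic model on the log-concentration scale. The intercept and slope are invented for illustration; in the study they would come from the multilevel empirical Bayes model, with covariates for water chemistry.

```python
import math

def lc(p: float, intercept: float, slope: float) -> float:
    """Concentration at which a fraction p of subjects is killed,
    under logit(P(kill)) = intercept + slope * log(concentration)."""
    logit = math.log(p / (1.0 - p))
    log_conc = (logit - intercept) / slope
    return math.exp(log_conc)

a, b = -4.0, 3.5  # illustrative logistic coefficients on log-dose
print(round(lc(0.5, a, b), 3))    # LC50
print(round(lc(0.999, a, b), 3))  # LC99.9
```

The leverage of the tail is visible here: moving from the 50% to the 99.9% kill level multiplies the required concentration severalfold, which is why accurate estimation of extreme quantiles is hard.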

11.
Tools for estimating population structure from genetic data are now used in a wide variety of applications in population genetics. However, inferring population structure in large modern data sets imposes severe computational challenges. Here, we develop efficient algorithms for approximate inference of the model underlying the STRUCTURE program using a variational Bayesian framework. Variational methods pose the problem of computing relevant posterior distributions as an optimization problem, allowing us to build on recent advances in optimization theory to develop fast inference tools. In addition, we propose useful heuristic scores to identify the number of populations represented in a data set and a new hierarchical prior to detect weak population structure in the data. We test the variational algorithms on simulated data and illustrate using genotype data from the CEPH–Human Genome Diversity Panel. The variational algorithms are almost two orders of magnitude faster than STRUCTURE and achieve accuracies comparable to those of ADMIXTURE. Furthermore, our results show that the heuristic scores for choosing model complexity provide a reasonable range of values for the number of populations represented in the data, with minimal bias toward detecting structure when it is very weak. Our algorithm, fastSTRUCTURE, is freely available online at http://pritchardlab.stanford.edu/structure.html.

12.
Recent advances in statistical software have led to the rapid diffusion of new methods for modelling longitudinal data. Multilevel (also known as hierarchical or random effects) models for binary outcomes have generally been based on a logistic-normal specification, by analogy with earlier work for normally distributed data. The appropriate application and interpretation of these models remains somewhat unclear, especially when compared with the computationally more straightforward semiparametric or 'marginal' modelling (GEE) approaches. In this paper we pose two interrelated questions. First, what limits should be placed on the interpretation of the coefficients and inferences derived from random-effect models involving binary outcomes? Second, what diagnostic checks are appropriate for evaluating whether such random-effect models provide adequate fits to the data? We address these questions by means of an extended case study using data on adolescent smoking from a large cohort study. Bayesian estimation methods are used to fit a discrete-mixture alternative to the standard logistic-normal model, and posterior predictive checking is used to assess model fit. Surprising parallels in the parameter estimates from the logistic-normal and mixture models are described and used to question the interpretability of the so-called 'subject-specific' regression coefficients from the standard multilevel approach. Posterior predictive checks suggest a serious lack of fit of both multilevel models. The results do not provide final answers to the two questions posed, but we expect that lessons learned from the case study will provide general guidance for further investigation of these important issues.
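Posterior predictive checking, the diagnostic used above, can be sketched generically: draw parameters from the posterior, simulate replicate data under the model, and compare a discrepancy statistic with its observed value. The binary data and "posterior draws" below are fabricated stand-ins, not the paper's smoking model; a posterior predictive p-value near 0 or 1 would signal lack of fit.

```python
import random

random.seed(1)

observed = [1, 0, 1, 1, 0, 1, 1, 1, 0, 1]  # invented binary outcomes
stat_obs = sum(observed)                    # discrepancy: total successes

# Stand-in posterior sample of the success probability (fabricated);
# in a real analysis these would be MCMC draws from the fitted model.
posterior_draws = [0.55, 0.60, 0.72, 0.65, 0.58] * 200

exceed = 0
for p in posterior_draws:
    # Simulate one replicate dataset of the same size and score it.
    rep = sum(1 for _ in observed if random.random() < p)
    if rep >= stat_obs:
        exceed += 1

ppp = exceed / len(posterior_draws)  # posterior predictive p-value
print(round(ppp, 3))
```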

13.
In biomedical research, hierarchical models are very widely used to accommodate dependence in multivariate and longitudinal data and for borrowing of information across data from different sources. A primary concern in hierarchical modeling is sensitivity to parametric assumptions, such as linearity and normality of the random effects. Parametric assumptions on latent variable distributions can be challenging to check and are typically unwarranted, given available prior knowledge. This article reviews some recent developments in Bayesian nonparametric methods motivated by complex, multivariate and functional data collected in biomedical studies. The author provides a brief review of flexible parametric approaches relying on finite mixtures and latent class modeling. Dirichlet process mixture models are motivated by the need to generalize these approaches to avoid assuming a fixed finite number of classes. Focusing on an epidemiology application, the author illustrates the practical utility and potential of nonparametric Bayes methods.

14.

Background

Accurate QTL mapping is a prerequisite in the search for causative mutations. Bayesian genomic selection models that analyse many markers simultaneously should provide more accurate QTL detection results than single-marker models. Our objectives were to (a) evaluate by simulation the influence of heritability, number of QTL and number of records on the accuracy of QTL mapping with Bayes Cπ and Bayes C; (b) estimate the QTL status (homozygous vs. heterozygous) of the individuals analysed. This study focussed on the ten largest detected QTL, assuming they are candidates for further characterization.

Methods

Our simulations were based on a true dairy cattle population genotyped for 38 277 phased markers. Some of these markers were considered biallelic QTL and used to generate corresponding phenotypes. Different numbers of records (4387 and 1500), heritability values (0.1, 0.4 and 0.7) and numbers of QTL (10, 100 and 1000) were studied. QTL detection was based on the posterior inclusion probability for individual markers, or on the sum of the posterior inclusion probabilities for consecutive markers, estimated using Bayes C or Bayes Cπ. The QTL status of the individuals was derived from the contrast between the sums of the SNP allelic effects of their chromosomal segments.

Results

The proportion of markers with null effect (π) frequently did not reach convergence, leading to poor results for Bayes Cπ in QTL detection. Fixing π led to better results. Detection of the largest QTL was most accurate for medium to high heritability, for low to moderate numbers of QTL, and with a large number of records. The QTL status was accurately inferred when the distribution of the contrast between chromosomal segment effects was bimodal.

Conclusions

QTL detection is feasible with Bayes C; it is recommended to use a large dataset and to focus on highly heritable traits and on the largest QTL. QTL statuses were inferred based on the distribution of the contrast between chromosomal segment effects.

15.
We introduce new robust small area estimation procedures based on area-level models. We first find influence functions corresponding to each individual area-level observation by measuring the divergence between the posterior density functions of regression coefficients with and without that observation. Next, based on these influence functions, properly standardized, we propose some new robust Bayes and empirical Bayes small area estimators. The mean squared errors and estimated mean squared errors of these estimators are also found. A small simulation study compares the performance of the robust and the regular empirical Bayes estimators. When the model variance is larger than the sample variance, the proposed robust empirical Bayes estimators are superior.
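For context, the standard (non-robust) empirical Bayes small area estimator that such robust procedures modify can be sketched in a few lines: each direct estimate is shrunk toward the overall mean, with a weight governed by its sampling variance D_i relative to the model variance A. All numbers below are invented, and a simple grand mean stands in for the regression synthetic estimate.

```python
# Area-level data: direct survey estimates with known sampling variances.
direct   = [12.0, 8.5, 15.2, 10.1]  # direct estimates y_i
samp_var = [4.0, 1.0, 6.0, 2.0]     # sampling variances D_i
A = 3.0                             # assumed model (between-area) variance

# Synthetic component: here just the grand mean (a covariate-based
# regression prediction would be used in a real area-level model).
grand_mean = sum(direct) / len(direct)

# Shrinkage: weight A/(A+D_i) on the direct estimate, the rest on
# the synthetic one; noisier areas are shrunk harder.
eb = [(A / (A + D)) * y + (D / (A + D)) * grand_mean
      for y, D in zip(direct, samp_var)]
print([round(v, 3) for v in eb])
```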

16.
Zhao JX  Foulkes AS  George EI 《Biometrics》2005,61(2):591-599
Characterizing the process by which molecular and cellular level changes occur over time will have broad implications for clinical decision making and help further our knowledge of disease etiology across many complex diseases. However, this presents an analytic challenge due to the large number of potentially relevant biomarkers and the complex, uncharacterized relationships among them. We propose an exploratory Bayesian model selection procedure that searches for model simplicity through independence testing of multiple discrete biomarkers measured over time. Bayes factor calculations are used to identify and compare models that are best supported by the data. For large model spaces, i.e., a large number of multi-leveled biomarkers, we propose a Markov chain Monte Carlo (MCMC) stochastic search algorithm for finding promising models. We apply our procedure to explore the extent to which HIV-1 genetic changes occur independently over time.

17.
Lee K  Daniels MJ 《Biometrics》2007,63(4):1060-1067
Generalized linear models with serial dependence are often used for short longitudinal series. Heagerty (2002, Biometrics 58, 342-351) proposed marginalized transition models for the analysis of longitudinal binary data. In this article, we extend this work to accommodate longitudinal ordinal data. Fisher-scoring algorithms are developed for estimation. Methods are illustrated on quality-of-life data from a recent colorectal cancer clinical trial.

18.
Patients who have undergone renal transplantation are monitored longitudinally at irregular time intervals over 10 years or more. This yields a set of biochemical and physiological markers containing valuable information to anticipate a failure of the graft. A general linear, generalized linear, or nonlinear mixed model is used to describe the longitudinal profile of each marker. To account for the correlation between markers, the univariate mixed models are combined into a multivariate mixed model (MMM) by specifying a joint distribution for the random effects. Due to the high number of markers, a pairwise modeling strategy, where all possible pairs of bivariate mixed models are fitted, is used to obtain parameter estimates for the MMM. These estimates are used in a Bayes rule to obtain, at each point in time, the prognosis for long-term success of the transplant. It is shown that allowing the markers to be correlated can improve this prognosis.

19.
Lifestyle and genetic factors play a large role in the development of Type 2 Diabetes (T2D). Despite the important role of genetic factors, genetic information is not incorporated into the clinical assessment of T2D risk. We assessed and compared Whole Genome Regression methods to predict the T2D status of 5,245 subjects from the Framingham Heart Study. For each method we constructed the following set of regression models: a clinical baseline model (CBM), which included non-genetic covariates only; the CBM extended by adding the first two marker-derived principal components and 65 SNPs identified by a recent GWAS consortium for T2D (M-65SNPs); and a further extension adding 249,798 genome-wide SNPs from a high-density array. The Bayesian models used to incorporate genome-wide marker information as predictors were Bayes A, Bayes Cπ, the Bayesian LASSO (BL), and Genomic Best Linear Unbiased Prediction (G-BLUP). Results included estimates of the genetic variance and heritability, genetic scores for T2D, and predictive ability evaluated by 10-fold cross-validation. The predictive AUC estimates for the CBM and M-65SNPs models were 0.668 and 0.684, respectively. We found evidence of a contribution of genetic effects to T2D, as reflected in the genomic heritability estimate (0.492 ± 0.066). The highest predictive AUC among the genome-wide marker Bayesian models was 0.681, for the Bayesian LASSO. Overall, the improvement in predictive ability was moderate and did not differ greatly among models that included genetic information. Approximately 58% of the genetic variants were found to contribute to the overall genetic variation, indicating a complex genetic architecture for T2D.
Our results suggest that the Bayes Cπ and G-BLUP models with a large set of genome-wide markers could be used for predicting T2D risk, as an alternative to using high-density arrays when selected markers from large consortia for a given complex trait or disease are unavailable.
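The evaluation metric used above, the AUC, can be computed directly from predicted risk scores and case/control labels via the rank (Mann-Whitney) formulation: the probability that a randomly chosen case scores higher than a randomly chosen control, with ties counting one half. The scores and labels below are invented.

```python
def auc(scores, labels):
    """AUC via the Mann-Whitney formulation; labels are 1=case, 0=control."""
    pos = [s for s, l in zip(scores, labels) if l == 1]
    neg = [s for s, l in zip(scores, labels) if l == 0]
    # Count case/control pairs where the case wins (ties count 1/2).
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

scores = [0.9, 0.8, 0.7, 0.6, 0.55, 0.4, 0.3, 0.2]  # invented risk scores
labels = [1,   1,   0,   1,   0,    0,   1,   0]    # invented labels
print(round(auc(scores, labels), 3))  # → 0.75
```

An AUC of 0.5 corresponds to random ranking, which is why gains such as 0.668 to 0.684 above count as moderate.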

20.
In longitudinal studies investigators frequently have to assess and address potential biases introduced by missing data. New methods are proposed for modeling longitudinal categorical data with nonignorable dropout using marginalized transition models and shared random effects models. Random effects are introduced both for serial dependence of outcomes and for nonignorable missingness. Fisher-scoring and quasi-Newton algorithms are developed for parameter estimation. Methods are illustrated with a real dataset.
