Similar Articles
20 similar articles retrieved.
1.
SUMMARY: The fundamental problem of gene selection via cDNA data is to identify which genes are differentially expressed across different kinds of tissue samples (e.g. normal and cancer). cDNA data contain a large number of variables (genes) and the sample size is usually relatively small, so the selection process can be unstable. Therefore, models that incorporate sparsity in terms of variables (genes) are desirable for this kind of problem. This paper proposes a two-level hierarchical Bayesian model for variable selection which assumes a prior that favors sparseness. We adopt a Markov chain Monte Carlo (MCMC)-based computation technique to simulate the parameters from the posteriors. The method is applied to leukemia data from a previous study and a published dataset on breast cancer. SUPPLEMENTARY INFORMATION: http://stat.tamu.edu/people/faculty/bmallick.html.
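The entry above does not spell out its sparsity prior. Purely as an illustration (the paper's exact specification may differ), a two-level spike-and-slab hierarchy of the kind such hierarchical Bayesian selection models commonly use can be written, for response y_i and expression vector x_i, as

y_i \mid \beta, \sigma^2 \sim \mathcal{N}(x_i^\top \beta, \sigma^2), \qquad \beta_j \mid \gamma_j \sim (1-\gamma_j)\,\delta_0 + \gamma_j\,\mathcal{N}(0, \tau^2), \qquad \gamma_j \sim \mathrm{Bernoulli}(\pi), \quad \pi \sim \mathrm{Beta}(a, b),

where \delta_0 is a point mass at zero. MCMC then samples the inclusion indicators \gamma_j jointly with the coefficients, and genes with high posterior inclusion probability are reported as differentially expressed.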

2.
MOTIVATION: A common task in microarray data analysis consists of identifying genes associated with a phenotype. When the outcomes of interest are censored time-to-event data, standard approaches assess the effect of genes by fitting univariate survival models. In this paper, we propose a Bayesian variable selection approach, which allows the identification of relevant markers by jointly assessing sets of genes. We consider accelerated failure time (AFT) models with log-normal and log-t distributional assumptions. A data augmentation approach is used to impute the failure times of censored observations, and mixture priors are used for the regression coefficients to identify promising subsets of variables. The proposed method provides a unified procedure for the selection of relevant genes and the prediction of survivor functions. RESULTS: We demonstrate the performance of the method on simulated examples and on several microarray datasets. For the simulation study, we consider scenarios with a large number of noisy variables and different degrees of correlation between the relevant and non-relevant (noisy) variables. We are able to identify the correct covariates and obtain good prediction of the survivor functions. For the microarray applications, some of our selected genes are known to be related to the diseases under study and a few are in agreement with findings from other researchers. AVAILABILITY: The Matlab code for implementing the Bayesian variable selection method may be obtained from the corresponding author. CONTACT: mvannucci@stat.tamu.edu SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
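As a hedged sketch only (the abstract does not give the authors' hyperprior choices), an accelerated failure time model with a spike-and-slab mixture prior of the general kind described above is

\log T_i = x_i^\top \beta + \sigma\,\varepsilon_i, \quad \varepsilon_i \sim \mathcal{N}(0,1)\ \text{or}\ t_\nu, \qquad \beta_j \mid \gamma_j \sim (1-\gamma_j)\,\delta_0 + \gamma_j\,\mathcal{N}(0, c\,\sigma^2), \quad \gamma_j \sim \mathrm{Bernoulli}(\pi).

For an observation censored at time c_i, the data augmentation step draws \log T_i from the error distribution truncated to (\log c_i, \infty) at each MCMC iteration, after which the coefficient and indicator updates proceed as for complete data.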

3.
MOTIVATION: Discriminant analysis is an effective tool for the classification of experimental units into groups. Here, we consider the typical problem of classifying subjects according to phenotypes via gene expression data and propose a method that incorporates variable selection into the inferential procedure for the identification of important biomarkers. To achieve this goal, we build upon a conjugate normal discriminant model, both linear and quadratic, and include a stochastic search variable selection procedure via an MCMC algorithm. Furthermore, we incorporate into the model prior information on the relationships among the genes as described by a gene-gene network. We use a Markov random field (MRF) prior to map the network connections among genes. Our prior model assumes that neighboring genes in the network are more likely to have a joint effect on the relevant biological processes. RESULTS: We use simulated data to assess the performance of our method. In particular, we compare the MRF prior with a situation where independent Bernoulli priors are chosen for the individual predictors. We also illustrate the method on benchmark gene expression datasets. Our simulation studies show that employing the MRF prior improves selection accuracy. In real data applications, in addition to identifying markers and improving prediction accuracy, we show how the integration of existing biological knowledge into the prior model results in an increased ability to identify genes with strong discriminatory power and also aids the interpretation of the results.
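The exact prior is not reproduced in the abstract; a commonly used Markov random field prior on the inclusion indicators \gamma = (\gamma_1, \ldots, \gamma_p), with E the edge set of the gene-gene network, has the form

p(\gamma) \propto \exp\Big( \mu \sum_{j=1}^{p} \gamma_j + \eta \sum_{(j,k) \in E} \gamma_j \gamma_k \Big), \qquad \eta \ge 0,

where \mu controls overall sparsity and \eta > 0 encourages genes that are neighbors in the network to be selected together; setting \eta = 0 reduces to independent Bernoulli priors, the comparison case mentioned above.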

4.
ABSTRACT: BACKGROUND: An important question in the analysis of biochemical data is that of identifying subsets of molecular variables that may jointly influence a biological response. Statistical variable selection methods have been widely used for this purpose. In many settings, it may be important to incorporate ancillary biological information concerning the variables of interest. Pathway and network maps are one example of a source of such information. However, although ancillary information is increasingly available, it is not always clear how it should be used nor how it should be weighted in relation to primary data. RESULTS: We put forward an approach in which biological knowledge is incorporated using informative prior distributions over variable subsets, with prior information selected and weighted in an automated, objective manner using an empirical Bayes formulation. We employ continuous, linear models with interaction terms and exploit biochemically-motivated sparsity constraints to permit exact inference. We show an example of priors for pathway- and network-based information and illustrate our proposed method both on synthetic response data and in an application to cancer drug response data. Comparisons are also made to alternative Bayesian and frequentist penalised-likelihood methods for incorporating network-based information. CONCLUSIONS: The empirical Bayes method proposed here can aid prior elicitation for Bayesian variable selection studies and help to guard against mis-specification of priors. Empirical Bayes, together with the proposed pathway-based priors, results in an approach with competitive variable selection performance. In addition, the overall procedure is fast, deterministic, and has very few user-set parameters, yet is capable of capturing interplay between molecular players. The approach presented is general and readily applicable in any setting with multiple sources of biological prior knowledge.

5.
The use of large-scale microarray expression profiling to identify predictors of disease class has become of major interest. Beyond their impact in the clinical setting (i.e. improving diagnosis and treatment), these markers are also likely to provide clues about the molecular mechanisms underlying the diseases. In this paper we describe a new method for the identification of multiple gene predictors of disease class. The method is applied to the classification of two forms of arthritis that have a similar clinical endpoint but different underlying molecular mechanisms: rheumatoid arthritis (RA) and osteoarthritis (OA). We aim both at the classification of samples and at the identification of genes characterizing the different classes. We achieve both goals simultaneously by combining a binary probit model for classification with Bayesian variable selection methods to identify important genes. We find very small sets of genes that lead to good classification results. Some of the selected genes are clearly correlated with known aspects of the biology of arthritis and, in some cases, reflect already known differences between RA and OA.
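The binary probit model lends itself to Gibbs sampling via the Albert-Chib data augmentation that such approaches typically rely on. The sketch below shows only that backbone on toy data (hypothetical variable names, not the authors' code); the variable selection layer would add mixture priors and indicator updates on top of the coefficient draw.

# Illustrative Albert-Chib-style data augmentation for a Bayesian probit model.
# Minimal sketch, not the paper's implementation.
import numpy as np
from scipy.stats import truncnorm

rng = np.random.default_rng(0)

def sample_latent(X, y, beta):
    """Draw z_i ~ N(x_i' beta, 1) truncated to (0, inf) if y_i = 1, to (-inf, 0) if y_i = 0."""
    mu = X @ beta
    lower = np.where(y == 1, 0.0, -np.inf)
    upper = np.where(y == 1, np.inf, 0.0)
    # truncnorm expects bounds standardized by (loc, scale); scale is 1 here
    return truncnorm.rvs(lower - mu, upper - mu, loc=mu, scale=1.0, random_state=rng)

def sample_beta(X, z, tau2=10.0):
    """Conjugate normal update for beta given latent z, with prior beta ~ N(0, tau2 * I)."""
    p = X.shape[1]
    V = np.linalg.inv(X.T @ X + np.eye(p) / tau2)
    return rng.multivariate_normal(V @ (X.T @ z), V)

# Toy data: 50 samples, 5 "genes", only the first two informative
n, p = 50, 5
X = rng.standard_normal((n, p))
y = (X @ np.array([2.0, -1.5, 0.0, 0.0, 0.0]) + rng.standard_normal(n) > 0).astype(int)

beta = np.zeros(p)
for _ in range(500):          # Gibbs sweeps; keep the last draw for illustration
    z = sample_latent(X, y, beta)
    beta = sample_beta(X, z)
print(np.round(beta, 2))      # coefficients of the informative genes dominate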

6.
We consider the problem of multi-class cancer classification from gene expression data. After discussing the multinomial probit regression model with Bayesian gene selection, we propose two Bayesian gene selection schemes: one employs different strongest genes for different probit regressions; the other employs the same strongest genes for all regressions. Some fast implementation issues for Bayesian gene selection are discussed, including preselection of the strongest genes and recursive computation of the estimation errors using QR decomposition. The proposed gene selection techniques are applied to analyse real breast cancer data, small round blue-cell tumours, the National Cancer Institute's anti-cancer drug-screen data and acute leukaemia data. Compared with existing multi-class cancer classification methods, our proposed methods can identify which genes are most important for which kind of cancer. The strongest genes selected by our methods are also consistent with known biological significance, and the recognition accuracies achieved are very high.
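The fast-implementation point about QR decompositions can be illustrated with a small sketch: for a candidate subset of preselected genes, the least-squares fit error is obtained from a thin QR factorization instead of forming and inverting X'X. The recursive update of the factors as genes enter or leave the model is not shown; this is an assumed, simplified illustration only.

# Minimal sketch: QR-based computation of the residual error for a candidate gene subset.
import numpy as np

def subset_fit_error(X, y, cols):
    """Residual sum of squares from regressing y on the columns listed in `cols`."""
    Q, R = np.linalg.qr(X[:, cols])        # thin QR of the candidate design
    coef = np.linalg.solve(R, Q.T @ y)     # back-substitution, no explicit (X'X)^{-1}
    resid = y - X[:, cols] @ coef
    return float(resid @ resid)

rng = np.random.default_rng(1)
X = rng.standard_normal((60, 200))                                  # 60 samples, 200 genes
y = X[:, [3, 17]] @ np.array([1.5, -2.0]) + 0.1 * rng.standard_normal(60)
print(subset_fit_error(X, y, [3, 17]))    # small error: informative subset
print(subset_fit_error(X, y, [0, 1]))     # large error: uninformative subset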

7.
Here we focus on discrimination problems where the number of predictors substantially exceeds the sample size, and we propose a Bayesian variable selection approach to multinomial probit models. Our method makes use of mixture priors and Markov chain Monte Carlo techniques to select sets of variables that differ among the classes. We apply our methodology to a problem in functional genomics using gene expression profiling data. The aim of the analysis is to identify molecular signatures that characterize two different stages of rheumatoid arthritis.

8.
9.
10.

Background

Among the primary goals of microarray analysis is the identification of genes that could distinguish between different phenotypes (feature selection). Previous studies indicate that incorporating prior information about the genes' function could help identify physiologically relevant features. However, current methods that incorporate prior functional information do not provide a relative estimate of the effect of different genes on the biological processes of interest.

Results

Here, we present a method that integrates gene ontology (GO) information and expression data using Bayesian regression mixture models to perform unsupervised clustering of the samples and identify physiologically relevant discriminating features. As a model application, the method was applied to identify the genes that play a role in the cytotoxic responses of the human hepatoblastoma cell line (HepG2) to a saturated free fatty acid (FFA) and tumor necrosis factor (TNF)-α, as compared to the non-toxic response to unsaturated FFAs and TNF-α. Incorporation of prior knowledge led to a better discrimination of the toxic phenotypes from the others. The model identified roles for lysosomal ATPases and adenylate cyclase (AC9) in the toxicity of palmitate. To validate the role of AC in palmitate-treated cells, we measured the intracellular levels of cyclic AMP (cAMP). The cAMP levels were found to be significantly reduced by palmitate treatment and not by the other FFAs, in accordance with the model's selection of AC9.

Conclusions

A framework is presented that incorporates prior ontology information, which helped to (a) perform unsupervised clustering of the phenotypes, and (b) identify the genes relevant to each cluster of phenotypes. We demonstrate the proposed framework by applying it to identify physiologically-relevant feature genes that conferred differential toxicity to saturated vs. unsaturated FFAs. The framework can be applied to other problems to efficiently integrate ontology information and expression data in order to identify feature genes.

11.
MOTIVATION: Clinical data such as patient history, laboratory analysis and ultrasound parameters, which are the basis of day-to-day clinical decision support, are often underused to guide the clinical management of cancer in the presence of microarray data. We propose a strategy based on Bayesian networks to treat clinical and microarray data on an equal footing. The main advantage of this probabilistic model is that it allows these data sources to be integrated in several ways and that the model structure and parameters can be investigated and understood. Furthermore, using the concept of a Markov blanket, we can identify all the variables that shield the class variable from the influence of the remaining network. Bayesian networks therefore automatically perform feature selection by identifying the (in)dependency relationships with the class variable. RESULTS: We evaluated three methods for integrating clinical and microarray data (decision integration, partial integration and full integration) and used them to classify publicly available data on breast cancer patients into a poor and a good prognosis group. The partial integration method is the most promising and has an independent test set area under the ROC curve of 0.845. After choosing an operating point, the classification performance is better than that of frequently used indices.
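As a small illustration of the Markov blanket idea used for feature selection above: the blanket of the class node in a directed network consists of its parents, its children, and the other parents of those children. The toy network below is hypothetical and is not the structure learned in the paper.

# Markov blanket of a node in a toy directed graph (illustrative only).
from collections import defaultdict

edges = [("age", "prognosis"), ("gene_A", "prognosis"),
         ("prognosis", "marker_X"), ("gene_B", "marker_X"),
         ("gene_C", "gene_B")]

parents, children = defaultdict(set), defaultdict(set)
for u, v in edges:
    parents[v].add(u)
    children[u].add(v)

def markov_blanket(node):
    mb = set(parents[node]) | set(children[node])
    for child in children[node]:
        mb |= parents[child]            # co-parents ("spouses") of shared children
    mb.discard(node)
    return mb

# Variables outside the blanket are conditionally independent of the class node,
# which is why the blanket acts as an automatic feature selector.
print(markov_blanket("prognosis"))      # contains age, gene_A, marker_X, gene_B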

12.
We explore the problem of variable selection in a case-control setting with mass spectrometry proteomic data consisting of paired measurements. Each pair corresponds to a distinct isotope cluster and each component within a pair represents a summary of isotopic expression based on either the intensity or the shape of the cluster. Our objective is to identify a collection of isotope clusters associated with the disease outcome and at the same time assess the predictive added value of shape beyond intensity while maintaining predictive performance. We propose a Bayesian model that exploits the paired structure of our data and utilizes prior information on the relative predictive power of each source by introducing multiple layers of selection. This allows us to make simultaneous inference on which pairs are the most informative and on whether, and to what extent, shape has a complementary value in separating the two groups. We evaluate the Bayesian model on pancreatic cancer data. Results from the fitted model show that most of the predictive potential is achieved with a subset of just six (out of 1289) pairs, while the contribution of the intensity components is much higher than that of the shape components. To demonstrate how the method behaves under a controlled setting we consider a simulation study. Results from this study indicate that the proposed approach can successfully select the truly predictive pairs and accurately estimate the effects of both components, although in some cases the model tends to overestimate the inclusion probability of the second component.

13.
Wilson DJ, McVean G. Genetics. 2006;172(3):1411-1425
Models of molecular evolution that incorporate the ratio of nonsynonymous to synonymous polymorphism (dN/dS ratio) as a parameter can be used to identify sites that are under diversifying selection or functional constraint in a sample of gene sequences. However, when there has been recombination in the evolutionary history of the sequences, reconstructing a single phylogenetic tree is not appropriate, and inference based on a single tree can give misleading results. In the presence of high levels of recombination, the identification of sites experiencing diversifying selection can suffer from a false-positive rate as high as 90%. We present a model that uses a population genetics approximation to the coalescent with recombination and use reversible-jump MCMC to perform Bayesian inference on both the dN/dS ratio and the recombination rate, allowing each to vary along the sequence. We demonstrate that the method has the power to detect variation in the dN/dS ratio and the recombination rate and does not suffer from a high false-positive rate. We use the method to analyze the porB gene of Neisseria meningitidis and verify the inferences using prior sensitivity analysis and model criticism techniques.

14.
Analysis of microarray data is associated with the methodological problems of high dimension and small sample size. Various methods have been used for variable selection in high-dimension, small-sample-size settings with a single survival endpoint. However, little effort has been directed toward addressing competing risks, where there is more than one type of failure. This study compared three typical variable selection techniques, Lasso, elastic net, and likelihood-based boosting, for high-dimensional time-to-event data with competing risks. The performance of these methods was evaluated via a simulation study and by analyzing a real dataset on bladder cancer patients, using time-dependent receiver operating characteristic (ROC) curves and bootstrap .632+ prediction error curves. The elastic net penalization method was shown to outperform Lasso and boosting. Based on the elastic net, 33 genes out of 1381 genes related to bladder cancer were selected. By fitting the Fine and Gray model, eight genes were highly significant (P < 0.001). Among them, expression of RTN4, SON, IGF1R, SNRPE, PTGR1, PLEK, and ETFDH was associated with a decrease in survival time, whereas SMARCAD1 expression was associated with an increase in survival time. This study indicates that the elastic net has a higher capacity than Lasso and boosting for the prediction of survival time in bladder cancer patients. Moreover, the genes selected by all methods improved the predictive power of a model based on only clinical variables, indicating the value of the information contained in the microarray features.
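For reference, the elastic net fit referred to above solves a penalized likelihood problem of the standard form below, here with \ell(\beta) the Fine-Gray partial log-likelihood, tuning parameter \lambda \ge 0 and mixing parameter \alpha \in [0, 1]; \alpha = 1 recovers the Lasso and \alpha = 0 a ridge-type penalty:

\hat{\beta} = \arg\min_{\beta} \Big\{ -\ell(\beta) + \lambda \Big[ \alpha \sum_{j=1}^{p} |\beta_j| + \tfrac{1-\alpha}{2} \sum_{j=1}^{p} \beta_j^2 \Big] \Big\}.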

15.
Variable selection is critical in competing risks regression with high-dimensional data. Although penalized variable selection methods and other machine learning-based approaches have been developed, many of these methods often suffer from instability in practice. This paper proposes a novel method named Random Approximate Elastic Net (RAEN). Under the proportional subdistribution hazards model, RAEN provides a stable and generalizable solution to the large-p-small-n variable selection problem for competing risks data. Our general framework allows the proposed algorithm to be applicable to other time-to-event regression models, including competing risks quantile regression and accelerated failure time models. We show that variable selection and parameter estimation improved markedly using the new computationally intensive algorithm through extensive simulations. A user-friendly R package RAEN is developed for public use. We also apply our method to a cancer study to identify influential genes associated with the death or progression from bladder cancer.

16.
Gene selection using support vector machines with non-convex penalty
MOTIVATION: With the development of DNA microarray technology, scientists can now measure the expression levels of thousands of genes simultaneously in a single experiment. One current difficulty in interpreting microarray data comes from their innate nature of 'high-dimensional low sample size'. Therefore, robust and accurate gene selection methods are required to identify differentially expressed groups of genes across different samples, e.g. between cancerous and normal cells. Successful gene selection will help to classify different cancer types, lead to a better understanding of genetic signatures in cancers and improve treatment strategies. Although gene selection and cancer classification are two closely related problems, most existing approaches handle them separately by selecting genes prior to classification. We provide a unified procedure for simultaneous gene selection and cancer classification, achieving high accuracy in both aspects. RESULTS: In this paper we develop a novel type of regularization in support vector machines (SVMs) to identify important genes for cancer classification. A special non-convex penalty, called the smoothly clipped absolute deviation (SCAD) penalty, is imposed on the hinge loss function in the SVM. By systematically thresholding small estimates to zero, the new procedure eliminates redundant genes automatically and yields a compact and accurate classifier. A successive quadratic algorithm is proposed to convert the non-differentiable and non-convex optimization problem into easily solved linear equation systems. The method is applied to two real datasets and has produced very promising results. AVAILABILITY: MATLAB codes are available upon request from the authors.
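In its standard form (not reproduced in the abstract), the SCAD-penalized SVM minimizes the hinge loss plus a non-convex penalty on each coefficient,

\min_{\beta_0, \beta} \ \sum_{i=1}^{n} \big[ 1 - y_i(\beta_0 + x_i^\top \beta) \big]_{+} + \sum_{j=1}^{p} p_{\lambda}(|\beta_j|),

where the SCAD penalty p_\lambda is defined through its derivative p'_\lambda(\theta) = \lambda \big\{ I(\theta \le \lambda) + \frac{(a\lambda - \theta)_{+}}{(a-1)\lambda} I(\theta > \lambda) \big\} for \theta > 0, with a > 2 (a = 3.7 is a common default). Unlike the L1 penalty, SCAD sets small coefficients exactly to zero while leaving large coefficients essentially unpenalized, which is what yields the compact classifier described above.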

17.
Yize Zhao, Ben Wu, Jian Kang. Biometrics. 2023;79(2):655-668
Multimodality or multiconstruct data arise increasingly in functional neuroimaging studies to characterize brain activity under different cognitive states. Relying on those high-resolution imaging collections, it is of great interest to identify predictive imaging markers and intermodality interactions with respect to behavior outcomes. Currently, most of the existing variable selection models do not consider predictive effects from interactions, and the desired higher-order terms can only be included in the predictive mechanism following a two-step procedure, suffering from potential misspecification. In this paper, we propose a unified Bayesian prior model to simultaneously identify main effect features and intermodality interactions within the same inference platform in the presence of high-dimensional data. To accommodate the brain topological information and correlation between modalities, our prior is designed by compiling the intermediate selection status of sequential partitions in light of the data structure and brain anatomical architecture, so that we can improve posterior inference and enhance biological plausibility. Through extensive simulations, we show the superiority of our approach in main and interaction effects selection, and prediction under multimodality data. Applying the method to the Adolescent Brain Cognitive Development (ABCD) study, we characterize the brain functional underpinnings with respect to general cognitive ability under different memory load conditions.

18.
We propose a model-based clustering method for high-dimensional longitudinal data via regularization in this paper. This study was motivated by the Trial of Activity in Adolescent Girls (TAAG), which aimed to examine multilevel factors related to the change of physical activity by following up a cohort of 783 girls over 10 years from adolescence to early adulthood. Our goal is to identify the intrinsic grouping of subjects with similar patterns of physical activity trajectories and the most relevant predictors within each group. The previous analyses conducted clustering and variable selection in two steps, while our new method can perform the tasks simultaneously. Within each cluster, a linear mixed-effects model (LMM) is fitted with a doubly penalized likelihood to induce sparsity for parameter estimation and effect selection. The large-sample joint properties are established, allowing the dimensions of both fixed and random effects to increase at an exponential rate of the sample size, with a general class of penalty functions. Assuming subjects are drawn from a Gaussian mixture distribution, model effects and cluster labels are estimated via a coordinate descent algorithm nested inside the Expectation-Maximization (EM) algorithm. The Bayesian Information Criterion (BIC) is used to determine the optimal number of clusters and the values of tuning parameters. Our numerical studies show that the new method has satisfactory performance and is able to accommodate complex data with multilevel and/or longitudinal effects.

19.
A vast amount of ecological knowledge generated over the past two decades has hinged upon the ability of model selection methods to discriminate among various ecological hypotheses. The last decade has seen the rise of Bayesian hierarchical models in ecology. Consequently, commonly used tools such as the AIC have become largely inapplicable, and there appears to be no consensus about a particular model selection tool that can be universally applied. We focus on a specific class of competing Bayesian spatial capture-recapture (SCR) models and apply and evaluate some of the recommended Bayesian model selection tools: (1) the Bayes factor, using (a) the Gelfand-Dey and (b) the harmonic mean methods; (2) the Deviance Information Criterion (DIC); (3) the Watanabe-Akaike Information Criterion (WAIC); and (4) the posterior predictive loss criterion. In all, we evaluate 25 variants of model selection tools in our study, from the standpoint of selecting the "true" model and of parameter estimation. We generate 120 simulated data sets using the true model and assess the frequency with which the true model is selected and how well each tool estimates N (population size), a parameter of much importance to ecologists. We find that when the information content of the data is low, no particular model selection tool can be recommended to help realize, simultaneously, both the goals of model selection and parameter estimation. In general, however, when both objectives are considered together, we recommend the use of our application of the Bayes factor (Gelfand-Dey with MAP approximation) for Bayesian SCR models. Our study highlights the point that although new model selection tools, such as WAIC, are emerging in the applied statistics literature, tools based on sound theory may still perform much better, even under approximation.
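For orientation, one of the tools compared, WAIC, is typically computed from S posterior draws \theta^{(1)}, \ldots, \theta^{(S)} as

\mathrm{WAIC} = -2 \big( \widehat{\mathrm{lppd}} - p_{\mathrm{WAIC}} \big), \qquad \widehat{\mathrm{lppd}} = \sum_{i} \log \Big( \tfrac{1}{S} \sum_{s} p(y_i \mid \theta^{(s)}) \Big), \qquad p_{\mathrm{WAIC}} = \sum_{i} \mathrm{Var}_{s} \big( \log p(y_i \mid \theta^{(s)}) \big),

using the common variant in which the effective number of parameters is the summed posterior variance of the pointwise log predictive density; for latent-variable models such as SCR, the choice of the pointwise likelihood contribution itself is a further modeling decision.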

20.
Chen Jiahua, Chen Zehua. Biometrika. 2008;95(3):759-771
The ordinary Bayesian information criterion is too liberal for model selection when the model space is large. In this paper, we re-examine the Bayesian paradigm for model selection and propose an extended family of Bayesian information criteria, which take into account both the number of unknown parameters and the complexity of the model space. Their consistency is established, in particular allowing the number of covariates to increase to infinity with the sample size. Their performance in various situations is evaluated by simulation studies. It is demonstrated that the extended Bayesian information criteria incur a small loss in the positive selection rate but tightly control the false discovery rate, a desirable property in many applications. The extended Bayesian information criteria are extremely useful for variable selection in problems with a moderate sample size but with a huge number of covariates, especially in genome-wide association studies, which are now an active area in genetics research.
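For reference, the extended family of criteria introduced in this paper takes the form, for a candidate model s with |s| of the p available covariates and sample size n,

\mathrm{EBIC}_{\gamma}(s) = -2\,\ell(\hat{\theta}_{s}) + |s| \log n + 2 \gamma \log \binom{p}{|s|}, \qquad 0 \le \gamma \le 1,

where \gamma = 0 recovers the ordinary BIC and larger \gamma penalizes large model spaces more heavily, which is what controls the false discovery rate when p is huge.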
