Similar Documents
20 similar documents found (search time: 15 ms)
1.
Gene selection and classification of microarray data using random forest (times cited: 9; self-citations: 0; cited by others: 9)

Background  

Selection of relevant genes for sample classification is a common task in most gene expression studies, where researchers try to identify the smallest possible set of genes that can still achieve good predictive performance (for instance, for future diagnostic use in clinical practice). Many gene selection approaches use univariate (gene-by-gene) rankings of gene relevance with arbitrary thresholds to select the number of genes, can only be applied to two-class problems, and use ranking criteria unrelated to the classification algorithm. In contrast, random forest is a classification algorithm well suited to microarray data: it shows excellent performance even when most predictive variables are noise, can be used when the number of variables is much larger than the number of observations and in problems involving more than two classes, and returns measures of variable importance. Thus, it is important to understand the performance of random forest with microarray data and its possible use for gene selection.
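Purely as illustration, here is a minimal Python sketch of the backward-elimination idea behind random-forest gene selection — iteratively dropping the least important fraction of genes and tracking out-of-bag (OOB) accuracy. It is not the paper's implementation, and all data and parameters below are invented:

    import numpy as np
    from sklearn.ensemble import RandomForestClassifier

    rng = np.random.default_rng(0)
    n, p, informative = 60, 500, 10                    # few samples, many "genes"
    X = rng.normal(size=(n, p))
    y = (X[:, :informative].sum(axis=1) > 0).astype(int)

    kept = np.arange(p)
    while len(kept) > 2:
        rf = RandomForestClassifier(n_estimators=500, oob_score=True, random_state=0)
        rf.fit(X[:, kept], y)
        print(len(kept), "genes, OOB accuracy:", round(rf.oob_score_, 3))
        order = np.argsort(rf.feature_importances_)     # ascending importance
        n_drop = max(1, int(0.2 * len(kept)))           # drop least important 20%
        kept = kept[order[n_drop:]]

The OOB accuracy printed at each round is the kind of predictive-performance criterion the abstract highlights for choosing a small gene set.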

2.

Background  

Protein alignments are an essential tool for many bioinformatics analyses. While sequence alignments are accurate for proteins of high sequence similarity, they become unreliable as they approach the so-called 'twilight zone', where sequence similarity becomes indistinguishable from random. For such distant pairs, structure alignment is of much better quality. Nevertheless, sequence alignment is the only choice in the majority of cases, where structural data are not available. This situation demands the development of methods that extend the applicability of accurate sequence alignment to distantly related proteins.

3.
4.
Tree-based models are a popular tool for predicting a response from a set of explanatory variables when the regression function is characterized by a certain degree of complexity. Sometimes they are also used to identify important variables and for variable selection. We show that if the generating model contains chains of direct and indirect effects, then the typical variable importance measures suggest selecting as important mainly the background variables, which have a strong indirect effect, while disregarding the variables that directly influence the response. This is attributable mainly to how the splitting variable is chosen in the first steps of the algorithm and to the greedy nature of the search. This pitfall can be relevant when tree-based algorithms are used for understanding the underlying generating process, for population segmentation, and for causal inference.
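A small simulation (invented here, not the paper's setup) can reproduce the pitfall: a background variable Z acts only indirectly, through three mediators that directly generate the response, yet Z typically receives the top impurity importance because the greedy search tends to split on it first:

    import numpy as np
    from sklearn.ensemble import RandomForestRegressor

    rng = np.random.default_rng(1)
    n = 2000
    z = rng.normal(size=n)                                                  # background variable
    X = np.column_stack([z + 0.5 * rng.normal(size=n) for _ in range(3)])   # mediators
    y = X.sum(axis=1) + rng.normal(size=n)                                  # only mediators act directly

    rf = RandomForestRegressor(n_estimators=300, random_state=1)
    rf.fit(np.column_stack([z, X]), y)
    for name, imp in zip(["Z (indirect)", "X1", "X2", "X3"], rf.feature_importances_):
        print(name, round(imp, 3))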

5.

Background

Random forest (RF) is a machine-learning method that generally works well with high-dimensional problems and allows for nonlinear relationships between predictors; however, the presence of correlated predictors has been shown to impact its ability to identify strong predictors. The Random Forest-Recursive Feature Elimination algorithm (RF-RFE) mitigates this problem in smaller data sets, but this approach has not been tested in high-dimensional omics data sets.

Results

We integrated 202,919 genotypes and 153,422 methylation sites in 680 individuals, and compared the abilities of RF and RF-RFE to detect simulated causal associations, which included simulated genotype–methylation interactions, between these variables and triglyceride levels. Results show that RF was able to identify strong causal variables with a few highly correlated variables, but it did not detect other causal variables.

Conclusions

Although RF-RFE decreased the importance of correlated variables, in the presence of many correlated variables, it also decreased the importance of causal variables, making both hard to detect. These findings suggest that RF-RFE may not scale to high-dimensional data.
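As a sketch of the RF-RFE idea (scikit-learn's generic RFE wrapper around a random forest, not the authors' code, on simulated data), recursive elimination drops a fraction of the least important predictors at each round:

    import numpy as np
    from sklearn.ensemble import RandomForestRegressor
    from sklearn.feature_selection import RFE

    rng = np.random.default_rng(2)
    n, p = 300, 100
    X = rng.normal(size=(n, p))
    X[:, 1:5] = X[:, [0]] + 0.1 * rng.normal(size=(n, 4))   # correlated copies of the causal predictor
    y = 2 * X[:, 0] + rng.normal(size=n)

    selector = RFE(RandomForestRegressor(n_estimators=200, random_state=2),
                   n_features_to_select=10, step=0.1)        # drop 10% per round
    selector.fit(X, y)
    print("selected columns:", np.flatnonzero(selector.support_))

Whether column 0 survives alongside (or instead of) its correlated copies in columns 1-4 illustrates the correlation effect discussed above.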

6.

Background

Although many predictors have been evaluated, a set of strong independent prognostic indicators of mortality has not been established for children with pulmonary arterial hypertension (PAH). The aim of this study was to identify a combination of clinical and molecular predictors of survival in pediatric PAH.

Methods

This single-center, retrospective cohort study was performed in children with PAH seen between 2001 and 2008 at Children's Hospital Colorado. Blood samples were obtained from 83 patients (median age, 8.3 years). To establish the best predictors of mortality, we retrospectively analyzed 46 variables: 27 circulating proteins, 7 demographic variables, and 12 hemodynamic and echocardiographic variables. A data mining approach was used to evaluate predictor variables and to uncover complex data structures while performing variable selection in a high-dimensional problem.

Results

Thirteen children (16%) died during follow-up (median, 3.1 years), and survival rates from the time of sample collection at 1, 3 and 5 years were 95%, 85% and 79%, respectively. A subset of potentially informative predictors was identified; the top four, in order of importance, were tissue inhibitor of metalloproteinases-1 (TIMP-1), apolipoprotein A-I, RV/LV diastolic dimension ratio, and age at diagnosis. In univariate analysis, TIMP-1 and apolipoprotein A-I were significantly associated with survival time (hazard ratio [95% confidence interval]: 1.25 [1.03–1.51] and 0.70 [0.54–0.90], respectively). Patients grouped by TIMP-1 and apolipoprotein A-I values had significantly different survival risks (p<0.01).

Conclusion

Important predictors of mortality were identified from a large number of circulating proteins and clinical markers in this cohort. If confirmed in other populations, measurement of a subset of these predictors could aid the management of pediatric PAH by identifying patients at risk of death. These findings further support the clinical utility of measuring circulating proteins.
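The univariate association reported above can be reproduced in outline with a Cox proportional hazards model; the sketch below uses the lifelines package on invented toy values (the column names and numbers are hypothetical, not the study's data):

    import pandas as pd
    from lifelines import CoxPHFitter

    df = pd.DataFrame({
        "time":  [3.1, 5.0, 1.2, 4.4, 2.7, 6.0, 0.9, 5.5],     # years of follow-up (toy)
        "event": [1, 0, 1, 0, 0, 1, 1, 0],                     # 1 = died, 0 = censored
        "timp1": [12.0, 7.5, 15.2, 13.1, 8.3, 7.0, 14.0, 6.1]  # TIMP-1 level (toy)
    })
    cph = CoxPHFitter()
    cph.fit(df, duration_col="time", event_col="event")
    cph.print_summary()    # hazard ratio and 95% CI for TIMP-1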

7.

Background  

Methodologies like phage display selection, in vitro mutagenesis and the determination of allelic expression differences include steps where large numbers of clones need to be compared and characterised. In the current study we show that high-resolution melt curve analysis (HRMA) is a simple, cost-saving tool to quickly study clonal variation without prior nucleotide sequence knowledge.

8.

Background  

Genome-wide association studies for complex diseases will produce genotypes on hundreds of thousands of single nucleotide polymorphisms (SNPs). A logical first approach to dealing with massive numbers of SNPs is to use some test to screen the SNPs, retaining only those that meet some criterion for further study. For example, SNPs can be ranked by p-value, and those with the lowest p-values retained. When SNPs have large interaction effects but small marginal effects in a population, they are unlikely to be retained when univariate tests are used for screening. However, model-based screens that pre-specify interactions are impractical for data sets with thousands of SNPs. Random forest analysis is an alternative method that produces a single measure of importance for each predictor variable, taking into account interactions among variables without requiring model specification. Interactions increase the importance of the individual interacting variables, making them more likely to be given high importance relative to other variables. We test the performance of random forests as a screening procedure to identify small numbers of risk-associated SNPs from among large numbers of unassociated SNPs, using complex disease models with up to 32 loci that incorporate both genetic heterogeneity and multi-locus interaction.
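The marginal-versus-interaction point can be seen in a toy simulation (invented here; haploid 0/1 genotypes for simplicity): a pure two-SNP interaction leaves univariate tests blind, while random forest importance can still surface both SNPs, because once a tree splits on one of them the other becomes informative downstream:

    import numpy as np
    from scipy.stats import ttest_ind
    from sklearn.ensemble import RandomForestClassifier

    rng = np.random.default_rng(3)
    n, p = 1000, 50
    X = rng.integers(0, 2, size=(n, p))             # toy biallelic genotypes
    y = X[:, 0] ^ X[:, 1]                           # pure interaction, no marginal effect
    y = np.where(rng.random(n) < 0.1, 1 - y, y)     # 10% phenotype noise

    pvals = [ttest_ind(y[X[:, j] == 0], y[X[:, j] == 1]).pvalue for j in range(p)]
    print("univariate top-5 SNPs:", np.argsort(pvals)[:5])   # SNPs 0 and 1 rarely appear

    rf = RandomForestClassifier(n_estimators=1000, max_features=5, random_state=3)
    rf.fit(X, y)
    print("RF importance top-5:", np.argsort(rf.feature_importances_)[::-1][:5])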

9.

Background

Hierarchical partitioning (HP) is an analytical method for multiple regression that identifies the most likely causal factors while alleviating multicollinearity problems. Its use in ecology and conservation is increasing because it usefully complements multiple regression analysis. A public-domain package, hier.part, has been developed for running HP in R. Its authors note a “minor rounding error” for hierarchies constructed from >9 variables, but the potential bias introduced by this module has not yet been examined. Knowing this bias is pivotal because, for example, the ranking obtained by HP is used as a criterion for establishing conservation priorities.

Methodology/Principal Findings

Using numerical simulations and two real examples, we assessed the robustness of this HP module to the order in which variables enter the analysis. Results indicated a considerable effect of variable order on the amount of independent variance explained by predictors in models with >9 explanatory variables. For these models, the nominal ranking of predictor importance changed with variable order; that is, predictors declared important by their contribution to explaining the response variable frequently became more or less important under other variable orders. The probability that a variable changed position was best explained by the difference in independent explanatory power between that variable and the previous one in the nominal ranking of importance: the smaller this difference, the more likely the change in position.

Conclusions/Significance

HP should be applied with caution when more than 9 explanatory variables are used to rank covariate importance. The explained variance is not a useful parameter in models with more than 9 independent variables. The inconsistency of results obtained by HP should be considered in future studies, as well as in those already published. Some recommendations for improving analyses with this HP module are given.
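For intuition, a predictor's independent contribution in HP is its average increment in explained variance over all orderings of the predictors; computed exactly (feasible only for small numbers of variables, since there are p! orderings), the result cannot depend on variable order, which is precisely what the hier.part approximation loses beyond 9 variables. A minimal exact Python version of the order-averaging idea (not the hier.part code):

    from itertools import permutations
    import numpy as np
    from sklearn.linear_model import LinearRegression

    def r2(X, y, cols):
        return LinearRegression().fit(X[:, cols], y).score(X[:, cols], y) if cols else 0.0

    def independent_contributions(X, y):
        p = X.shape[1]
        contrib = np.zeros(p)
        perms = list(permutations(range(p)))        # p! orderings: small p only
        for order in perms:
            seen = []
            for j in order:
                contrib[j] += r2(X, y, seen + [j]) - r2(X, y, seen)   # R^2 increment
                seen.append(j)
        return contrib / len(perms)

    rng = np.random.default_rng(4)
    X = rng.normal(size=(200, 4))
    y = X[:, 0] + 0.5 * X[:, 1] + rng.normal(size=200)
    print(independent_contributions(X, y).round(3))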

10.

Background

LASSO is a penalized regression method that facilitates model fitting in situations where there are as many explanatory variables as observations, or even more, and only a few variables are relevant in explaining the data. We focus on the Bayesian version of LASSO and consider four problems that need special attention: (i) controlling false positives, (ii) multiple comparisons, (iii) collinearity among explanatory variables, and (iv) the choice of the tuning parameter that controls the amount of shrinkage and the sparsity of the estimates. The particular application considered is association genetics, where LASSO regression can be used to find links between chromosome locations and phenotypic traits in a biological organism. However, the proposed techniques are also relevant in other contexts where LASSO is used for variable selection.

Results

We separate the true associations from false positives using the posterior distribution of the effects (regression coefficients) provided by Bayesian LASSO. We propose to solve the multiple comparisons problem by using simultaneous inference based on the joint posterior distribution of the effects. Bayesian LASSO also tends to distribute an effect among collinear variables, making detection of an association difficult. We propose to solve this problem by considering not only individual effects but also their functionals (i.e., sums and differences). Finally, whereas in Bayesian LASSO the tuning parameter is often regarded as a random variable, we instead adopt a scale-space view and consider a whole range of fixed tuning parameters. The effect estimates and the associated inference are considered for all tuning parameters in the selected range, and the results are visualized with color maps that provide useful insights into the data and the association problem considered. The methods are illustrated using two sets of artificial data and one real data set, all representing typical settings in association genetics.
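A rough frequentist analogue of the scale-space view (the paper uses Bayesian LASSO; scikit-learn's lasso_path on simulated data is only a stand-in) computes effect estimates over a whole grid of fixed tuning parameters, yielding the matrix behind the color maps described above:

    import numpy as np
    from sklearn.linear_model import lasso_path

    rng = np.random.default_rng(5)
    n, p = 150, 30
    X = rng.normal(size=(n, p))
    X[:, 1] = X[:, 0] + 0.05 * rng.normal(size=n)   # collinear pair
    y = X[:, 0] + 2 * X[:, 5] + rng.normal(size=n)

    alphas, coefs, _ = lasso_path(X, y, n_alphas=50)
    # coefs has shape (p, n_alphas): one row per effect, one column per tuning
    # value. How rows 0 and 1 trade the shared effect along the path illustrates
    # the collinearity problem that functionals (sums of effects) are meant to fix.
    print(coefs.shape, alphas[:3].round(4))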

11.
Understanding the mechanisms of habitat selection is fundamental to the construction of proper conservation and management plans for many avian species. Human-caused habitat changes increase landscape complexity and thus the complexity of the data available for explaining species distributions. New techniques that do not assume linearity and that can extrapolate response variables across landscapes are needed to deal with difficult relationships between habitat variables and distribution data. We used a random forest algorithm to study breeding-site selection by herons and egrets in a human-influenced landscape by analyzing land use around their colonies. We analyzed the importance of each land-use variable at different scales and its relationship to the probability of colony presence. We found two main spatial scales at which herons and egrets select their colony sites: a medium scale (4 km) and a large scale (10–15 km). Colonies were attracted to areas with large amounts of evergreen forest at the medium scale, whereas avoidance of high-density urban areas was important at the large scale. Previous studies used attractive factors, mainly foraging areas, to explain bird-colony distributions, but our study is the first to show the major importance of repellent factors at large scales. We believe that the newest non-linear methods, such as random forests, are needed to model complex variable interactions when organisms are distributed across complex landscapes. These methods could help to improve conservation plans for species threatened by the advance of highly human-influenced landscapes.
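In outline (variable names and data invented, not the study's), the multi-scale analysis amounts to fitting one random forest per buffer radius and comparing land-use importances across scales:

    import numpy as np
    from sklearn.ensemble import RandomForestClassifier

    rng = np.random.default_rng(6)
    n = 400
    # toy land-use fractions [forest, urban] around candidate colony sites at two radii
    scales = {s: np.column_stack([rng.random(n), rng.random(n)]) for s in ["4km", "10km"]}
    # presence driven by forest at 4 km (attraction) and urban at 10 km (avoidance)
    presence = ((scales["4km"][:, 0] > 0.5) & (scales["10km"][:, 1] < 0.4)).astype(int)

    for s, Xs in scales.items():
        rf = RandomForestClassifier(n_estimators=300, random_state=6).fit(Xs, presence)
        print(s, dict(zip(["forest", "urban"], rf.feature_importances_.round(3))))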

12.

Motivation

A grand challenge in the modeling of biological systems is the identification of key variables that can act as targets for intervention. Boolean networks are among the simplest of models, yet they have been shown to adequately model many of the complex dynamics of biological systems. In our recent work, we utilized a logic minimization approach to identify quality single-variable targets for intervention from the state space of a Boolean network. However, as the number of variables in a network increases, the more likely it is that a successful intervention strategy will require multiple variables. Thus, for larger networks, such an approach is required to identify more complex intervention strategies while working within the limited view of the network’s state space. Specifically, we address three primary challenges for large networks: the first is how to consider many subsets of variables, the second is to design clear methods and measures that identify the best targets for intervention in a systematic way, and the third is to work with an intractable state space through sampling.

Results

We introduce a multiple-variable intervention target called a template and show through simulation studies of random networks that these templates are able to identify top intervention targets in increasingly large Boolean networks. We first show that, where other methods suffer a drastic loss in performance, template methods show no significant performance loss between fully explored and partially sampled Boolean state spaces. We also show that, where other methods are entirely unable to produce viable intervention targets in sampled Boolean state spaces, template methods maintain consistent success rates even as state space sizes increase exponentially with larger networks. Finally, we show the utility of the template approach on a real-world Boolean network modeling T-LGL leukemia.

Conclusions

Overall, these results demonstrate how template-based approaches effectively supersede our previous single-variable approaches and produce quality intervention targets in larger networks that require sampled state spaces.
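A toy illustration of template scoring on a sampled state space (the four-node network and its rules below are invented, not the paper's T-LGL model): clamp the template's variables, simulate sampled trajectories, and score the template by how often the desired state is reached:

    import random

    def step(s):                        # s = (x0, x1, x2, apoptosis)
        x0, x1, x2, a = s
        return (x1 and not a, x0 or x2, x2, a or (not x0 and not x1))

    def score(template, n_samples=500, horizon=30):
        hits = 0
        for _ in range(n_samples):                      # sampled, not exhaustive
            s = tuple(random.random() < 0.5 for _ in range(4))
            for _ in range(horizon):
                s = step(s)
                s = tuple(template.get(i, v) for i, v in enumerate(s))  # clamp
            hits += s[3]                                # did apoptosis switch on?
        return hits / n_samples

    random.seed(0)
    for template in [{}, {0: False}, {0: False, 2: False}]:
        print(template, round(score(template), 2))

Larger templates (clamping more variables) drive the target state more reliably in this toy, which is the qualitative behavior the abstract reports for large networks.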

13.

Background

Abel and Trevors have delineated three aspects of sequence complexity observed in biosequences such as proteins: Random Sequence Complexity (RSC), Ordered Sequence Complexity (OSC) and Functional Sequence Complexity (FSC). In this paper, we provide a method to measure functional sequence complexity.

Methods and Results

We have extended Shannon uncertainty by incorporating a functionality variable alongside the data variable. The resulting unit of measure, which we call the functional bit (Fit), is calculated from the sequence data jointly with the defined functionality variable. To demonstrate the relevance to functional bioinformatics, a method to measure functional sequence complexity was developed and applied to 35 protein families. We also considered how the measure can be used to relate functionality to the whole molecule and to sub-molecular regions. In the experiment, we show that when the proposed measure is applied to the aligned protein sequences of ubiquitin, 6 of the 7 highest-value sites correlate with the binding domain.

Conclusion

As future extensions, measures of functional bioinformatics may provide a means to evaluate potential evolutionary pathways arising from effects such as mutations, as well as to analyze the internal structural and functional relationships within the 3-D structure of proteins.
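In outline, the site-wise measure is the drop from the null-state uncertainty — log2(20) for amino acids — to the Shannon uncertainty of the aligned functional sequences at that site. A minimal Python sketch on an invented toy alignment (not the paper's code or data):

    import math
    from collections import Counter

    NULL_H = math.log2(20)              # ground state: 20 equiprobable residues

    def site_fits(alignment):
        fits = []
        for column in zip(*alignment):
            counts = Counter(column)
            total = sum(counts.values())
            h = -sum((c / total) * math.log2(c / total) for c in counts.values())
            fits.append(NULL_H - h)     # functional bits contributed by this site
        return fits

    toy = ["MKVL", "MKIL", "MRVL", "MKVL"]      # invented 4-sequence alignment
    for i, f in enumerate(site_fits(toy)):
        print("site", i, "Fit =", round(f, 2))

A perfectly conserved site contributes the full log2(20) ≈ 4.32 Fits, while a site as variable as the null state contributes none.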

14.

Background  

The population mutation rate (θ) remains one of the most fundamental parameters in genetics, ecology, and evolutionary biology. However, its accurate estimation can be seriously compromised when working with error-prone data such as expressed sequence tags, low-coverage draft sequences, and other such unfinished products. This study is premised on the simple idea that a random sequence error, due to a chance accident during data collection or recording, will be distributed within a population dataset as a singleton (i.e., as a polymorphic site where one sampled sequence exhibits a unique base relative to the common nucleotide of the others). Thus, one can avoid these random errors by ignoring the singletons within a dataset.
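In sketch form (this is the textbook version of the idea, not necessarily the paper's exact estimator): under the standard neutral model E[S] = θ·a_n and the expected number of singletons is roughly θ, so dropping singletons suggests the estimator (S − singletons)/(a_n − 1), which discards singleton-like sequencing errors along the way:

    def theta_w(sequences, drop_singletons=False):
        n = len(sequences)
        a_n = sum(1.0 / i for i in range(1, n))     # Watterson's a_n = sum of 1/i, i < n
        s, singletons = 0, 0
        for column in zip(*sequences):
            counts = {b: column.count(b) for b in set(column)}
            if len(counts) > 1:
                s += 1                               # segregating site
                if len(counts) == 2 and min(counts.values()) == 1:
                    singletons += 1                  # one sequence carries a unique base
        if drop_singletons:
            return (s - singletons) / (a_n - 1)
        return s / a_n

    seqs = ["ACGTACGT", "ACGTACGA", "ACCTACGT", "ACGTACGT"]   # toy alignment, n = 4
    print(theta_w(seqs), theta_w(seqs, drop_singletons=True))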

15.

Background  

The identification of relevant biological features in large and complex datasets is an important step towards gaining insight into the processes underlying the data. Other advantages of feature selection include the ability of the classification system to attain good or even better solutions using a restricted subset of features, and faster classification. Thus, robust methods for fast feature selection are of key importance in extracting knowledge from complex biological data.

16.
Yang X, Belin TR, Boscardin WJ. Biometrics. 2005;61(2):498–506.
Across multiply imputed data sets, variable selection methods such as stepwise regression and other criterion-based strategies that include or exclude particular variables typically result in models with different selected predictors, presenting a problem for combining the results from separate complete-data analyses. Here, drawing on a Bayesian framework, we propose two alternative strategies for choosing among linear regression models when there are missing covariates. One approach, which we call "impute, then select" (ITS), involves initially performing multiple imputation and then applying Bayesian variable selection to the multiply imputed data sets. A second strategy is to conduct Bayesian variable selection and missing-data imputation simultaneously within one Gibbs sampling process, which we call "simultaneously impute and select" (SIAS). The methods are implemented and evaluated using the Bayesian procedure known as stochastic search variable selection for multivariate normal data sets, but both strategies offer general frameworks within which different Bayesian variable selection algorithms could be used for other types of data sets. A study of mental health services utilization among children in foster care programs is used to illustrate the techniques. Simulation studies show that both ITS and SIAS outperform complete-case analysis with stepwise variable selection, and that SIAS slightly outperforms ITS.
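A rough frequentist analogue of ITS (the paper uses stochastic search variable selection; LASSO stands in for it here, and all data are simulated): multiply impute, select per completed data set, then combine via selection frequencies:

    import numpy as np
    from sklearn.experimental import enable_iterative_imputer  # noqa: F401
    from sklearn.impute import IterativeImputer
    from sklearn.linear_model import LassoCV

    rng = np.random.default_rng(7)
    n, p = 200, 10
    X = rng.normal(size=(n, p))
    y = X[:, 0] + 0.5 * X[:, 3] + rng.normal(size=n)
    X_miss = X.copy()
    X_miss[rng.random((n, p)) < 0.15] = np.nan        # 15% missing covariates

    counts = np.zeros(p)
    for m in range(5):                                # 5 imputed data sets
        Xm = IterativeImputer(sample_posterior=True, random_state=m).fit_transform(X_miss)
        coef = LassoCV(cv=5, random_state=m).fit(Xm, y).coef_
        counts += np.abs(coef) > 1e-8                 # selected in this data set?
    print("selection frequency per variable:", counts / 5)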

17.

Background  

Random forests (RF) have been increasingly used in applications such as genome-wide association and microarray studies where predictor correlation is frequently observed. Recent works on permutation-based variable importance measures (VIMs) used in RF have come to apparently contradictory conclusions. We present an extended simulation study to synthesize results.
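A small simulation (invented here) shows the kind of correlation effect those works examine: a correlated twin dilutes a predictor's importance relative to an equally predictive independent variable, and the impurity and permutation VIMs can react differently:

    import numpy as np
    from sklearn.ensemble import RandomForestRegressor
    from sklearn.inspection import permutation_importance

    rng = np.random.default_rng(8)
    n = 500
    x1 = rng.normal(size=n)
    x2 = x1 + 0.1 * rng.normal(size=n)    # strongly correlated with x1
    x3 = rng.normal(size=n)               # independent
    X = np.column_stack([x1, x2, x3])
    y = x1 + x3 + rng.normal(size=n)      # x1 and x3 equally predictive

    rf = RandomForestRegressor(n_estimators=300, random_state=8).fit(X, y)
    print("impurity VIM:   ", rf.feature_importances_.round(3))
    perm = permutation_importance(rf, X, y, n_repeats=20, random_state=8)
    print("permutation VIM:", perm.importances_mean.round(3))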

18.

Background  

The number of genes declared differentially expressed is a random variable and its variability can be assessed by resampling techniques. Another important stability indicator is the frequency with which a given gene is selected across subsamples. We have conducted studies to assess stability and some other properties of several gene selection procedures with biological and simulated data.
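The selection-frequency indicator can be sketched as follows (toy two-class data with top-k selection by t-statistic; the study itself compares several selection procedures):

    import numpy as np
    from scipy.stats import ttest_ind

    rng = np.random.default_rng(9)
    n_per_class, p, k = 30, 200, 20
    X = rng.normal(size=(2 * n_per_class, p))
    X[n_per_class:, :5] += 1.0                       # 5 truly differential genes
    labels = np.array([0] * n_per_class + [1] * n_per_class)

    freq = np.zeros(p)
    for _ in range(100):                             # 100 random subsamples
        idx = rng.choice(len(labels), size=len(labels) // 2, replace=False)
        t = ttest_ind(X[idx][labels[idx] == 0], X[idx][labels[idx] == 1]).statistic
        freq[np.argsort(-np.abs(t))[:k]] += 1        # count top-k membership
    print("selection frequency of the true genes:", (freq[:5] / 100).round(2))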

19.

Background

Imperfect diagnostic testing reduces the power to detect significant predictors in classical cross-sectional studies. Assuming that the misclassification in diagnosis is random, this can be dealt with by increasing the sample size of a study. However, the effects of imperfect tests in longitudinal data analyses are not as straightforward to anticipate, especially if the outcome of the test influences behaviour. The aim of this paper is to investigate the impact of imperfect test sensitivity on the determination of predictor variables in a longitudinal study.

Methodology/Principal Findings

To deal with imperfect test sensitivity affecting the response variable, we transformed the observed response variable into a set of possible temporal patterns of true disease status, whose prior probability was a function of the test sensitivity. We fitted a Bayesian discrete-time survival model using an MCMC algorithm that treats the true response patterns as unknown parameters in the model. We applied our approach to epidemiological data on bovine tuberculosis outbreaks in England and investigated the effect of reduced test sensitivity on the determination of risk factors for the disease. We found that reduced test sensitivity led to changes in the set of risk factors associated with the probability of an outbreak that were chosen in the ‘best’ model, and to an increase in the uncertainty surrounding the parameter estimates for a model with a fixed set of risk factors associated with the response variable.

Conclusions/Significance

We propose a novel algorithm for fitting discrete survival models to longitudinal data in which values of the response variable are uncertain. When analysing longitudinal data, uncertainty surrounding the response variable will affect the significance of the predictors and should therefore be accounted for, either at the design stage by increasing the sample size or after the analysis by conducting appropriate sensitivity analyses.
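The pattern-prior construction can be sketched as follows (assuming, for brevity, perfect specificity, i.e. no false positives; the paper's model is richer than this): enumerate the true-status patterns consistent with the observed test sequence and weight each by how probable it makes the observations given the sensitivity:

    from itertools import product

    def pattern_priors(observed, sensitivity):
        weights = {}
        for true in product([0, 1], repeat=len(observed)):
            w = 1.0
            for o, t in zip(observed, true):
                if t == 1:
                    w *= sensitivity if o == 1 else (1 - sensitivity)
                else:
                    w *= 0.0 if o == 1 else 1.0      # no false positives assumed
            if w > 0:
                weights[true] = w
        total = sum(weights.values())
        return {k: v / total for k, v in weights.items()}

    # a subject testing negative, negative, positive, with test sensitivity 0.8
    for pattern, p in sorted(pattern_priors((0, 0, 1), 0.8).items()):
        print(pattern, round(p, 3))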

20.

Background  

Molecular database search tools need statistical models to assess the significance of the resulting hits. In the classical approach, one asks how probable it is that a certain score would be observed by pure chance. Asymptotic theories for such questions are available for two random i.i.d. sequences. Some effort has been made to include the effects of finite sequence lengths and to account for specific compositions of the sequences. In many applications, such as a large-scale database homology search for transmembrane proteins, these models are not the most appropriate ones. Search sensitivity and specificity benefit from position-dependent scoring schemes or the use of hidden Markov models. Additionally, one may wish to go beyond the assumption that the sequences are i.i.d. Despite their practical importance, the statistical properties of these settings have not yet been well investigated.
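The classical approach can be illustrated by brute force (a toy sketch; real tools use Karlin-Altschul theory rather than simulation): score many pairs of random i.i.d. sequences with a simple ungapped local match scheme and fit an extreme-value (Gumbel) law to the chance scores:

    import numpy as np
    from scipy.stats import gumbel_r

    rng = np.random.default_rng(10)

    def best_ungapped_score(a, b, match=1, mismatch=-1):
        best = 0
        for shift in range(-len(b) + 1, len(a)):              # every diagonal
            run = 0                                           # Kadane-style local run
            for i in range(max(0, shift), min(len(a), shift + len(b))):
                run = max(0, run + (match if a[i] == b[i - shift] else mismatch))
                best = max(best, run)
        return best

    alphabet = np.array(list("ACDEFGHIKLMNPQRSTVWY"))
    scores = [best_ungapped_score(rng.choice(alphabet, 80), rng.choice(alphabet, 80))
              for _ in range(300)]
    loc, scale = gumbel_r.fit(scores)
    print("P(score >= 25 by chance) ~", round(gumbel_r.sf(25, loc, scale), 4))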
