Similar literature
 Found 20 similar articles (search time: 15 ms)
1.
A recent study examined the stability of rankings from random forests using two variable importance measures, mean decrease accuracy (MDA) and mean decrease Gini (MDG), and concluded that rankings based on the MDG were more robust than those based on the MDA. However, few studies have examined the effect of data-specific characteristics on ranking stability. Rankings based on the MDG measure showed sensitivity to within-predictor correlation and differences in category frequencies, even when the number of categories was held constant, and thus may produce spurious results. The MDA measure was robust to these data characteristics. Further, under strong within-predictor correlation, MDG rankings were less stable than those using MDA.
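
As a rough illustration of the two measures named in this abstract, the sketch below computes a Gini impurity decrease for a single split (the quantity that MDG accumulates over all splits on a variable) and an accuracy drop after permuting one column (the idea behind MDA). It is not the study's code: the toy data, the `predict` stand-in for a fitted model, and all function names are assumptions made for illustration. A real random forest averages these quantities over many trees, and MDA is usually computed on out-of-bag samples.

```python
import random


def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    counts = {}
    for y in labels:
        counts[y] = counts.get(y, 0) + 1
    return 1.0 - sum((c / n) ** 2 for c in counts.values())


def gini_decrease(parent, left, right):
    """Impurity decrease obtained by splitting `parent` into `left` and `right`."""
    n = len(parent)
    weighted = (len(left) / n) * gini(left) + (len(right) / n) * gini(right)
    return gini(parent) - weighted


def accuracy(predict, rows, labels):
    """Fraction of rows for which the prediction equals the label."""
    return sum(predict(r) == y for r, y in zip(rows, labels)) / len(labels)


def permutation_decrease(predict, rows, labels, col, seed=0):
    """Drop in accuracy after shuffling the values of column `col`."""
    rng = random.Random(seed)
    shuffled = [r[col] for r in rows]
    rng.shuffle(shuffled)
    permuted = [r[:col] + [v] + r[col + 1:] for r, v in zip(rows, shuffled)]
    return accuracy(predict, rows, labels) - accuracy(predict, permuted, labels)


# Hypothetical toy data: column 0 determines the label, column 1 is noise.
rows = [[0, 1], [0, 0], [1, 1], [1, 0], [0, 1], [1, 0]]
labels = [0, 0, 1, 1, 0, 1]


def predict(row):
    """Stand-in for a fitted model: it only looks at column 0."""
    return row[0]


mdg_split = gini_decrease(labels, [0, 0, 0], [1, 1, 1])
mda_noise = permutation_decrease(predict, rows, labels, col=1)
mda_signal = permutation_decrease(predict, rows, labels, col=0)
```

Because the stand-in model ignores column 1, permuting that column cannot change its accuracy, while permuting column 0 can only lower it.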

2.

Background  

Variable importance measures for random forests have been receiving increased attention as a means of variable selection in many classification tasks in bioinformatics and related scientific fields, for instance to select a subset of genetic markers relevant for the prediction of a certain disease. We show that random forest variable importance measures are a sensible means for variable selection in many applications, but are not reliable in situations where potential predictor variables vary in their scale of measurement or their number of categories. This is particularly important in genomics and computational biology, where predictors often include variables of different types, for example when predictors include both sequence data and continuous variables such as folding energy, or when amino acid sequence data show different numbers of categories.

3.

Background  

Genome-wide association studies for complex diseases will produce genotypes on hundreds of thousands of single nucleotide polymorphisms (SNPs). A logical first approach to dealing with massive numbers of SNPs is to use some test to screen the SNPs, retaining only those that meet some criterion for further study. For example, SNPs can be ranked by p-value, and those with the lowest p-values retained. When SNPs have large interaction effects but small marginal effects in a population, they are unlikely to be retained when univariate tests are used for screening. However, model-based screens that pre-specify interactions are impractical for data sets with thousands of SNPs. Random forest analysis is an alternative method that produces a single measure of importance for each predictor variable that takes into account interactions among variables without requiring model specification. Interactions increase the importance for the individual interacting variables, making them more likely to be given high importance relative to other variables. We test the performance of random forests as a screening procedure to identify small numbers of risk-associated SNPs from among large numbers of unassociated SNPs using complex disease models with up to 32 loci, incorporating both genetic heterogeneity and multi-locus interaction.

4.
Questions: What are important forest characteristics determining colonization of forest patches by forest understorey species? Location: Planted forests on land recently reclaimed from the sea, the Netherlands. Methods: We related the distribution of forest specialist species in the understorey of 55 forests in Dutch IJsselmeer polders to the following forest characteristics: age, area, connectivity, distance to mainland (as a proxy for distance to seed source) and path density. We used species of the Fraxino‐Ulmetum association for the Netherlands as reference for species that could potentially occur in the study area. Results: Area and age of the surveyed forests explained a large part of the variation in overall species composition and species number of forest plant species. The importance of connectivity and distance to the mainland of forest habitats became apparent only at a more detailed level of dispersal groups and individual species. The importance of forest parameters differed between dispersal groups and also between individual species. After 60 years, 75% of the potential pool of wind‐dispersed species has reached the polders, whereas this was only 50% for species lacking specific adaptations to long‐distance dispersal. However, the average percentage of successful colonizing species present per forest was substantially lower, ranging from 15 to 37%. Conclusions: The data strongly suggest that the colonization process in polder forests is still in its initial phase, during which easily dispersed species dominate the vegetation. Colonization success of common species that lack adaptations to long‐distance dispersal is affected by spatial configuration of the forests, and most rare species that could potentially occur in these forests are still absent. Implications for conservation of rare species in fragmented landscapes are discussed.

5.
Prostate cancer is the most common non-skin cancer and the second leading cause of cancer-related mortality for men in the United States. There is strong empirical and epidemiological evidence supporting a stronger role of genetics in early-onset prostate cancer. We performed a genome-wide association scan for early-onset prostate cancer. Novel aspects of this study include the focus on early-onset disease (defined as men with prostate cancer diagnosed before age 56 years) and use of publicly available control genotype data from previous genome-wide association studies. We found genome-wide significant (p < 5×10⁻⁸) evidence for variants at 8q24 and 11p15 and strong supportive evidence for a number of previously reported loci. We found little evidence for individual or systematic inflated association findings resulting from using public controls, demonstrating the utility of using public control data in large-scale genetic association studies of common variants. Taken together, these results demonstrate the importance of established common genetic variants for early-onset prostate cancer and the power of including early-onset prostate cancer cases in genetic association studies.
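
The genome-wide significance threshold quoted in this abstract is conventionally read as a Bonferroni correction of a 0.05 error rate over roughly one million independent tests. The sketch below shows that arithmetic; the figure of one million tests is the usual convention and an assumption here, not something stated in the abstract, and the function names are illustrative.

```python
import math


def bonferroni_threshold(alpha, n_tests):
    """Per-test significance threshold that keeps the family-wise error rate at alpha."""
    return alpha / n_tests


# Assumed convention: about one million independent common-variant tests.
GENOME_WIDE_THRESHOLD = bonferroni_threshold(0.05, 1_000_000)


def is_genome_wide_significant(p_value, threshold=GENOME_WIDE_THRESHOLD):
    """True when a p-value falls below the genome-wide threshold."""
    return p_value < threshold


# The computed threshold matches the conventional 5e-8 up to rounding.
matches_convention = math.isclose(GENOME_WIDE_THRESHOLD, 5e-8)
```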

6.
D. Mishmar, I. Zhidkov. BBA, 2010, 1797(6-7): 1099-1104
Mitochondrial DNA (mtDNA) mutations have long been known to cause diseases but also underlie tremendous population divergence in humans. It was assumed that the two types of mutations differ in one major trait: functionality. However, evidence from disease association studies, cell culture and animal models supports the functionality of common mtDNA genetic variants, leading to the hypothesis that disease-causing mutations and mtDNA genetic variants share considerable common features. Here we provide evidence showing that the two types of mutations obey the rules of evolution, including random genetic drift and natural selection. This similarity does not only converge at the principle level; rather, disease-causing mutations could recapitulate the ancestral DNA sequence state. Thus, the very same mutations could either mark ancient evolutionary changes or cause disease.

7.
Geographic patterns of genetic variation are strongly influenced by historical changes in species habitats. Whether such patterns are common to co‐distributed taxa may depend on the extent to which species vary in ecology and vagility. We investigated whether broad‐scale phylogeographic patterns common to a number of small‐bodied vertebrate and invertebrate species in eastern Australian forests were reflected in the population genetic structure of an Australo‐Papuan forest marsupial, the red‐legged pademelon (Macropodidae: Thylogale stigmatica). Strong genetic structuring of mtDNA haplotypes indicated the persistence of T. stigmatica populations across eastern Australia and southern New Guinea in Pleistocene refugial areas consistent with those inferred from studies of smaller, poorly dispersing species. However, there was limited divergence of haplotypes across two known historical barriers in the northeastern Wet Tropics (Black Mountain Barrier) and coastal mideastern Queensland (Burdekin Gap) regions. Lack of divergence across these barriers may reflect post‐glacial recolonization of forests from a large, central refugium in the Wet Tropics. Additionally, genetic structure is not consistent with the present delimitation of subspecies T. s. wilcoxi and T. s. stigmatica across the Burdekin Gap. Instead, the genetic division occurs further to the south in mideastern Queensland. Thus, while larger‐bodied marsupials such as T. stigmatica did persist in Pleistocene refugia common to a number of other forest‐restricted species, species‐specific local extinction and recolonization events have resulted in cryptic patterns of genetic variation. Our study demonstrates the importance of understanding individualistic responses to historical climate change in order to adequately conserve genetic diversity and the evolutionary potential of species.

8.
Elucidating the relationship between polymorphic sequences and risk of common disease is a challenge. For example, although it is clear that variation in DNA repair genes is associated with familial cancer, aging and neurological disease, progress toward identifying polymorphisms associated with elevated risk of sporadic disease has been slow. This is partly due to the complexity of the genetic variation, the existence of large numbers of mostly low frequency variants and the contribution of many genes to variation in susceptibility. There has been limited development of methods to find associations between genotypes having many polymorphisms and pathway function or health outcome. We have explored several statistical methods for identifying polymorphisms associated with variation in DNA repair phenotypes. The model system comprised 80 cell lines that had been resequenced to identify variation; 191 single nucleotide substitution polymorphisms (SNPs) were included, of which 172 are in 31 base excision repair pathway genes and 19 are in 5 anti-oxidation genes, together with DNA repair phenotypes based on single strand breaks measured by the alkaline Comet assay. Univariate analyses were of limited value in identifying SNPs associated with phenotype variation. Of the multivariable model selection methods tested, the easiest that reduced the prediction error for the phenotype was simple counting of the variant alleles predicted to encode proteins with reduced activity, which led to a genotype including 52 SNPs. The best and most parsimonious model was achieved using a two-step analysis without regard to potential functional relevance: first, SNPs were ranked by importance as determined by random forests regression (RFR); this was followed by cross-validation in a second round of RFR modeling that included ever more SNPs in declining order of importance. With this approach six SNPs were found to minimize prediction error. The results should encourage research into utilization of multivariate analytical methods for epidemiological studies of the association of genetic variation in complex genotypes with risk of common diseases.
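
The two-step analysis described in this abstract (rank predictors by importance, then evaluate nested models that add predictors in declining order of importance) can be sketched generically as follows. This is only an outline of the selection loop, not the authors' procedure: the importance scores, the SNP names, and `toy_cv_error` (a stand-in for cross-validated prediction error from a second round of random forest regression) are all hypothetical.

```python
def best_prefix(ranked_features, cv_error):
    """Evaluate nested prefixes of a ranked feature list and keep the one with lowest error.

    `cv_error` is any callable mapping a list of features to an error estimate.
    Ties keep the smaller model because the comparison is strict.
    """
    best_subset, best_error = None, float("inf")
    for k in range(1, len(ranked_features) + 1):
        subset = ranked_features[:k]
        error = cv_error(subset)
        if error < best_error:
            best_subset, best_error = subset, error
    return best_subset, best_error


# Step 1 (assumed already done): hypothetical importance scores from a first model.
importance = {"snp_a": 0.9, "snp_b": 0.1, "snp_c": 0.5, "snp_d": 0.05}
ranked = sorted(importance, key=importance.get, reverse=True)


def toy_cv_error(subset):
    """Stand-in error: informative SNPs lower it, every added SNP costs a little."""
    informative = {"snp_a", "snp_c"}
    hits = len(informative & set(subset))
    return 1.0 - 0.3 * hits + 0.05 * len(subset)


# Step 2: pick the prefix of the ranking that minimizes the error estimate.
selected, selected_error = best_prefix(ranked, toy_cv_error)
```

In practice the error function would refit the model on each candidate subset inside a cross-validation loop, which is the expensive part that this sketch leaves out.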

9.
Understanding the mechanisms of habitat selection is fundamental to the construction of proper conservation and management plans for many avian species. Habitat changes caused by human beings increase the landscape complexity and thus the complexity of data available for explaining species distribution. New techniques that assume no linearity and are capable of extrapolating the response variables across landscapes are needed for dealing with difficult relationships between habitat variables and distribution data. We used a random forest algorithm to study breeding-site selection of herons and egrets in a human-influenced landscape by analyzing land use around their colonies. We analyzed the importance of each land-use variable for different scales and its relationship to the probability of colony presence. We found that there are two main spatial scales on which herons and egrets select their colony sites: medium scale (4 km) and large scale (10–15 km). Colonies were attracted to areas with large amounts of evergreen forests at the medium scale, whereas avoidance of high-density urban areas was important at the large scale. Previous studies used attractive factors, mainly foraging areas, to explain bird-colony distributions, but our study is the first to show the major importance of repellent factors at large scales. We believe that the newest non-linear methods, such as random forests, are needed for modelling complex variable interactions when organisms are distributed in complex landscapes. These methods could help to improve the conservation plans of those species threatened by the advance of highly human-influenced landscapes.

10.
Genome-wide association studies (GWAS) designed for next-generation sequencing data involve testing association of genomic variants, including common, low frequency, and rare variants. The current strategies for association studies are well developed for identifying association of common variants with common diseases, but may be ill-suited when large amounts of allelic heterogeneity are present in sequence data. Recently, group tests that analyze the collective frequency differences of multiple variants between cases and controls have shifted the variant-by-variant analysis paradigm used for GWAS of common variants toward collective tests of multiple variants in the association analysis of rare variants. However, group tests ignore differences in genetic effects among SNPs at different genomic locations. As an alternative to group tests, we developed a novel genome-information content-based statistic for testing association of the entire allele frequency spectrum of genomic variation with disease. To evaluate the performance of the proposed statistic, we use large-scale simulations based on whole genome low coverage pilot data in the 1000 Genomes Project to calculate the type 1 error rates and power of seven alternative statistics: a genome-information content-based statistic, the generalized T², collapsing method, multivariate and collapsing (CMC) method, individual χ² test, weighted-sum statistic, and variable threshold statistic. Finally, we apply the seven statistics to a published resequencing dataset from the ANGPTL3, ANGPTL4, ANGPTL5, and ANGPTL6 genes in the Dallas Heart Study. We report that the genome-information content-based statistic has significantly improved type 1 error rates and higher power than the other six statistics in both simulated and empirical datasets.
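
To make the idea of a group test concrete, the sketch below shows one simple form of the collapsing method listed among the compared statistics: each individual's rare-variant genotypes in a region are collapsed into a single carrier indicator, and carrier counts are compared between cases and controls with a Pearson chi-square statistic. The genotype matrices are invented toy data and the function names are assumptions; this is not the paper's implementation, and the paper's own statistic is not reproduced here.

```python
def collapse(genotypes):
    """Return 1 per individual carrying at least one rare allele at any site, else 0.

    `genotypes` is a list of per-individual lists of rare-allele counts (0, 1 or 2).
    """
    return [1 if any(g > 0 for g in person) else 0 for person in genotypes]


def chi_square_2x2(a, b, c, d):
    """Pearson chi-square statistic (no continuity correction) for [[a, b], [c, d]]."""
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    if denom == 0:
        return 0.0  # degenerate table: an empty row or column
    return n * (a * d - b * c) ** 2 / denom


def collapsing_statistic(case_genotypes, control_genotypes):
    """Compare the number of rare-variant carriers between cases and controls."""
    case_carriers = sum(collapse(case_genotypes))
    control_carriers = sum(collapse(control_genotypes))
    return chi_square_2x2(
        case_carriers,
        len(case_genotypes) - case_carriers,
        control_carriers,
        len(control_genotypes) - control_carriers,
    )


# Hypothetical toy data: four cases and four controls, three rare sites each.
cases = [[0, 1, 0], [1, 0, 0], [0, 0, 2], [0, 0, 0]]
controls = [[0, 0, 0], [0, 0, 0], [0, 1, 0], [0, 0, 0]]
stat = collapsing_statistic(cases, controls)
```

Collapsing treats every rare variant in the region as interchangeable, which is the limitation the abstract raises when it notes that group tests ignore differences in effect among SNPs.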

11.
Pine wilt disease (PWD) caused by the pine wood nematode is the most serious global threat to pine forests. Hazard ratings of trees and forests to pest attacks provide important information to efficiently identify current or future hazardous conditions. However, in spite of the importance of hazard ratings for managing PWD, there are few studies on hazard ratings in this system. In this study, we evaluated the hazard ratings of pine trees and pine stands to PWD by considering environmental factors at the level of the stand and the individual tree. Our results showed that trees with larger diameter at breast height (DBH) showed a higher risk rate than those with smaller DBH, indicating that large trees have an increased probability of exposure to vector beetles because they are tall and have a large crown volume. We also found that reduced tree vigour could be related to susceptibility to PWD. In pine stands, geographical factors showed a high correlation with the occurrence of PWD. PWD occurrence was rare at high altitudes, but was more common on steep and south-facing slopes. These patterns were consistently observed in the results from two computational approaches: self-organizing map (SOM) and random forest models. The combination of SOM and random forest was effective for extracting ecological information from the dataset. The SOM efficiently characterized relations among variables, and the random forest model was effective at predicting ecological variables, including the hazard rating of trees to disturbances.

12.
During the last 1000 years, massive deforestation events have occurred in Flanders (the northern part of Belgium) and the remaining forests have become very isolated patches. It is expected that organisms bound to these patchy forest habitats and with limited dispersal capacities will likely experience strong effects of genetic drift. One such organism is the spider Coelotes terrestris. Allozyme data suggested that 10 Flemish populations of this spider showed little genetic variation, as only one out of 20 loci was polymorphic (phosphoglucose isomerase). In view of this result, we used random amplified polymorphic DNA (RAPD) markers to test whether this lack of allozyme diversity is an inherent feature of the populations and/or species studied or whether it rather reflects a characteristic of the markers and/or methods used. Since the RAPD data revealed a substantial amount of genetic diversity in the same 10 populations, our results suggest that the latter is true. Furthermore, the RAPD data agree with the expectations for an organism with low dispersal capacities that has lived in isolated forest patches for at least 200 generations. Supplemented with the results of other techniques and studies, these findings might be of importance for the future conservation of this spider species in Flanders.

13.
The availability of high-density single nucleotide polymorphism (SNP) data has made it possible for human genetic association studies to identify common and rare variants underlying complex diseases on a genome-wide scale. A handful of novel genetic variants have been identified, which gives much hope for the future of genetic association studies. In this process, statistical and computational methods play key roles, among which information-based association tests have gained considerable popularity. This paper is intended to give a comprehensive review of the current literature on genetic association analysis cast in the framework of information theory. We focus our review on the following topics: (1) information theoretic approaches in genetic linkage and association studies; (2) entropy-based strategies for optimal SNP subset selection; and (3) the use of information-theoretic criteria in gene clustering and gene regulatory network construction.

14.
PURPOSE OF REVIEW: Recently, genome-wide genetic screening of common DNA sequence variants has proven a successful approach to identify novel genetic contributors to complex traits. This review summarizes recent genome-wide association studies for lipid phenotypes, and evaluates the next steps needed to obtain a full picture of genotype-phenotype correlation and apply these findings to inform clinical practice. RECENT FINDINGS: So far, genome-wide association studies have defined at least 19 genomic regions that contain common DNA single nucleotide polymorphisms associated with LDL cholesterol, HDL cholesterol and/or triglycerides. Of these, eight represent novel loci in humans, whereas 11 genes have been previously implicated in lipoprotein metabolism. Many of the same loci with common variants have already been shown to lead to monogenic lipid disorders in humans and/or mice, suggesting that a spectrum of common and rare alleles at each validated locus contributes to blood lipid concentrations. SUMMARY: At least 19 loci harbor common variations that contribute to blood lipid concentrations in humans. Larger scale genome-wide association studies should identify additional loci, and sequencing of these loci should pinpoint all relevant alleles. With a full catalog of DNA polymorphisms in hand, a panel of lipid-related variants can be studied to provide clinical risk stratification and targeting of therapeutic interventions.

15.
Although tropical wet forests are generally more diverse than dry forests for many faunal groups, few studies have compared bat diversity among dry forests. I compared ground level phyllostomid bat community structure between two tropical dry forests with different precipitation regimes. Parque Nacional Palo Verde in northwestern Costa Rica represents one of the wettest tropical dry forests (rainfall 1500 mm/yr), whereas the Chamela‐Cuixmala Biosphere Reserve on the Pacific coast of central Mexico represents one of the driest (750 mm/yr). Mist net sampling was conducted at the two study sites to compare changes in ground level phyllostomid bat community structure between regions and seasons. Palo Verde was more diverse than Chamela and phyllostomid species showed low similarity between sites (Classic Jaccard = 0.263). The distinct phyllostomid communities observed at these two dry forest sites demonstrate that variants of tropical dry forest can be sufficiently different in structure and composition to affect phyllostomid communities. At both dry forest sites, abundance of the two most common foraging guilds (frugivores and nectarivores) differed between seasons, with greatest numbers of individuals captured coinciding with highest chiropterophilic resource abundance.
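
The classic Jaccard index reported in this abstract is the number of species shared by two sites divided by the number of species found at either site. The sketch below shows the calculation on placeholder species lists; the species names and the resulting value are hypothetical and are not the study's data.

```python
def jaccard(site_a, site_b):
    """Classic Jaccard similarity: shared species / species present at either site."""
    a, b = set(site_a), set(site_b)
    union = a | b
    if not union:
        return 0.0  # assumed convention for two empty species lists
    return len(a & b) / len(union)


# Placeholder species lists for two sites (not taken from the study).
site_one = {"species_1", "species_2", "species_3", "species_4"}
site_two = {"species_3", "species_4", "species_5"}

# Two shared species out of five in total.
similarity = jaccard(site_one, site_two)
```

The index ranges from 0 (no species in common) to 1 (identical species lists), so a value near 0.26 means that only about a quarter of the pooled species occurred at both sites.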

16.
Since the seminal work of Prentice and Pyke, the prospective logistic likelihood has become the standard method of analysis for retrospectively collected case‐control data, in particular for testing the association between a single genetic marker and a disease outcome in genetic case‐control studies. In the study of multiple genetic markers with relatively small effects, especially those with rare variants, various aggregated approaches based on the same prospective likelihood have been developed to integrate subtle association evidence among all the markers considered. Many of the commonly used tests are derived from the prospective likelihood under a common‐random‐effect assumption, which assumes a common random effect for all subjects. We develop the locally most powerful aggregation test based on the retrospective likelihood under an independent‐random‐effect assumption, which allows the genetic effect to vary among subjects. In contrast to the fact that disease prevalence information cannot be used to improve efficiency for the estimation of odds ratio parameters in logistic regression models, we show that it can be utilized to enhance the testing power in genetic association studies. Extensive simulations demonstrate the advantages of the proposed method over the existing ones. A real genome‐wide association study is analyzed for illustration.

17.
Dynamic balance in human locomotion can be assessed through the local dynamic stability (LDS) method. Whereas gait LDS has been used successfully in many settings and applications, little is known about its sensitivity to individual characteristics of healthy adults. Therefore, we reanalyzed a large dataset of accelerometric data measured for 100 healthy adults from 20 to 70 years of age performing 10 min treadmill walking. We sought to assess the extent to which the variations of age, body mass and height, sex, and preferred walking speed (PWS) could influence gait LDS. The random forest (RF) and multivariate adaptive regression splines (MARS) algorithms were selected for their good bias-variance tradeoff and their capabilities to handle nonlinear associations. First, through variable importance measure (VIM), we used RF to evaluate which individual characteristics had the highest influence on gait LDS. Second, we used MARS to detect potential interactions among individual characteristics that may influence LDS. The VIM and MARS results indicated that PWS and age correlated with LDS, whereas no associations were found for sex, body height, and body mass. Further, the MARS model detected an age by PWS interaction: at high PWS, gait stability is constant across age, whereas at low PWS, gait instability increases substantially with age. We conclude that it is advisable to consider the participants’ age as well as their PWS to avoid potential biases in evaluating dynamic balance through LDS.

18.
In modern genetic epidemiology studies, the association between the disease and a genomic region, such as a candidate gene, is often investigated using multiple SNPs. We propose a multilocus test of genetic association that can account for genetic effects that might be modified by variants in other genes or by environmental factors. We consider use of the venerable and parsimonious Tukey's 1-degree-of-freedom model of interaction, which is natural when individual SNPs within a gene are associated with disease through a common biological mechanism; in contrast, many standard regression models are designed as if each SNP has unique functional significance. On the basis of Tukey's model, we propose a novel but computationally simple generalized test of association that can simultaneously capture both the main effects of the variants within a genomic region and their interactions with the variants in another region or with an environmental exposure. We compared performance of our method with that of two standard tests of association, one ignoring gene-gene/gene-environment interactions and the other based on a saturated model of interactions. We demonstrate major power advantages of our method both in analysis of data from a case-control study of the association between colorectal adenoma and DNA variants in the NAT2 genomic region, which are well known to be related to a common biological phenotype, and under different models of gene-gene interactions with use of simulated data.

19.
An individual's disease risk is determined by the compounded action of both common variants, inherited from remote ancestors, that segregated within the population and rare variants, inherited from recent ancestors, that segregated mainly within pedigrees. Next-generation sequencing (NGS) technologies generate high-dimensional data that allow a nearly complete evaluation of genetic variation. Despite their promise, NGS technologies also suffer from remarkable limitations: high error rates, enrichment of rare variants, and a large proportion of missing values, as well as the fact that most current analytical methods are designed for population-based association studies. To meet the analytical challenges raised by NGS, we propose a general framework for sequence-based association studies that can use various types of family and unrelated-individual data sampled from any population structure and a universal procedure that can transform any population-based association test statistic for use in family-based association tests. We develop family-based functional principal-component analysis (FPCA) with or without smoothing, a generalized T², combined multivariate and collapsing (CMC) method, and single-marker association test statistics. Through intensive simulations, we demonstrate that the family-based smoothed FPCA (SFPCA) has the correct type I error rates and much more power than other population-based or family-based association analysis methods to detect association of (1) common variants, (2) rare variants, (3) both common and rare variants, and (4) variants with opposite directions of effect. The proposed statistics are applied to two data sets with pedigree structures. The results show that the smoothed FPCA has a much smaller p value than other statistics.

20.
The success of genome-wide association studies relies on much of the risk of common diseases being due to common genetic variants; but evidence for this is inconclusive. The results of published genome-wide association studies are examined to see what can be learnt about the distribution of disease-associated variants and how this might influence future study design. Although replicated disease-associated variants tend to be very common and frequency is inversely correlated with estimated effect size, our simulations suggest that such observations are the result of power. We find that for studies conducted to date, the frequency and effect size of significantly associated alleles are likely to be similar to those of the underlying disease alleles that they represent. Little of the genetic variation of disease has been explained so far, but current studies are only adequately powered to detect very common alleles unless they greatly increase disease risk. Thus, although the truth of the common disease / common variant hypothesis remains undecided, recent successes suggest that there are many more common genetic disease-associated variants, requiring larger studies to be identified.
