Similar articles
 Found 20 similar articles (search time: 15 ms)
1.
Summary: Ongoing genome-wide association (GWA) studies are a powerful approach for uncovering previously unknown common genetic variants that cause common complex diseases. The discovery of these genetic variants offers an important opportunity for early disease prediction, prevention, and individualized treatment. We describe here a method of combining multiple genetic variants for early disease prediction, based on the optimality theory of the likelihood ratio (LR). This theory shows that the receiver operating characteristic (ROC) curve based on the LR has maximum performance at each cutoff point and that the area under the resulting ROC curve is the highest among all approaches. Through simulations and a real data application, we compared the method with the commonly used logistic regression and classification tree approaches. The three approaches show similar performance if the underlying disease model is known. However, for most common diseases we have little prior knowledge of the disease model, and in this situation the new method has an advantage over the logistic regression and classification tree approaches. We applied the new method to the type 1 diabetes GWA data from the Wellcome Trust Case Control Consortium. Based on five single nucleotide polymorphisms, the test reaches a medium level of classification accuracy. With more genetic findings to be discovered in the future, we believe a predictive genetic test for type 1 diabetes can be successfully constructed and eventually implemented for clinical use.
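The idea of ranking individuals by a likelihood ratio can be sketched in a few lines. This is only an illustration, not the authors' implementation: it assumes the SNPs are independent (so per-SNP genotype probabilities multiply), genotypes are coded 0/1/2, and the frequency tables `case_freqs` and `control_freqs` are hypothetical numbers invented for the example.

```python
from math import prod


def likelihood_ratio(genotypes, case_freqs, control_freqs):
    """LR = P(genotypes | case) / P(genotypes | control).

    Assumes independent SNPs, so the joint probability is the product
    of the per-SNP genotype frequencies (an assumption of this sketch).
    """
    num = prod(case_freqs[i][g] for i, g in enumerate(genotypes))
    den = prod(control_freqs[i][g] for i, g in enumerate(genotypes))
    return num / den


def auc(case_scores, control_scores):
    """Area under the ROC curve, computed as the probability that a
    random case scores higher than a random control (ties count 1/2)."""
    wins = 0.0
    for c in case_scores:
        for k in control_scores:
            if c > k:
                wins += 1.0
            elif c == k:
                wins += 0.5
    return wins / (len(case_scores) * len(control_scores))


# Hypothetical genotype frequencies for two SNPs (rows: SNP, columns: genotype 0/1/2).
case_freqs = [[0.25, 0.50, 0.25], [0.40, 0.40, 0.20]]
control_freqs = [[0.50, 0.40, 0.10], [0.50, 0.40, 0.10]]

# Score one individual carrying genotype 2 at both SNPs.
score = likelihood_ratio((2, 2), case_freqs, control_freqs)
```

Scoring every case and control this way and passing the two score lists to `auc` gives the kind of summary the abstract compares across methods.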

2.
We introduce a novel approach for describing patterns of HIV genetic variation using regression modeling techniques. Parameters are defined for describing genetic variation within and between viral populations by generalizing Simpson's index of diversity. Regression models are specified for these variation parameters and the generalized estimating equation framework is used for estimating both the regression parameters and their corresponding variances. Conditions are described under which the usual asymptotic approximations to the distribution of the estimators are met. This approach provides a formal statistical framework for testing hypotheses regarding the changing patterns of HIV genetic variation over time within an infected patient. The application of these methods for testing biologically relevant hypotheses concerning HIV genetic variation is demonstrated in an example using sequence data from a subset of patients from the Multicenter AIDS Cohort Study.
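For orientation, the classic Simpson's index of diversity that the abstract starts from can be computed as below. This sketch shows only the basic index (one minus the sum of squared variant frequencies), not the generalization or the regression framework the authors describe; the function name and the toy inputs are assumptions made for the example.

```python
from collections import Counter


def simpson_diversity(variants):
    """Probability that two items drawn at random (with replacement)
    from `variants` are different: 1 - sum(p_i ** 2)."""
    counts = Counter(variants)
    n = len(variants)
    return 1.0 - sum((c / n) ** 2 for c in counts.values())


# Toy example: four sequences, two distinct variants in equal proportion.
diversity = simpson_diversity(["A", "A", "B", "B"])
```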

3.
Keith P. Lewis 《Oikos》2004,104(2):305-315
Ecologists rely heavily upon statistics to make inferences concerning ecological phenomena and to make management recommendations. It is therefore important to use the statistical tests that are most appropriate for a given data set. However, inappropriate statistical tests are often used in the analysis of studies with categorical data (i.e. count data or binary data). Since many types of statistical tests have been used in artificial nest studies, a review and comparison of these tests provides an opportunity to demonstrate the importance of choosing the most appropriate statistical approach, both for conceptual reasons and with respect to type I and type II errors.
Artificial nests have routinely been used to study the influences of habitat fragmentation and habitat edges on nest predation. I review the variety of statistical tests used to analyze artificial nest data within the framework of the generalized linear model and argue that logistic regression is the most appropriate and flexible statistical test for analyzing binary data sets. Using artificial nest data from my own studies and an independent data set from the medical literature as examples, I tested equivalent data using a variety of statistical methods. I then compared the p-values and the statistical power of these tests. Results vary greatly among statistical methods. Methods inappropriate for analyzing binary data often fail to yield significant results even when differences between study groups appear large, while logistic regression finds these differences statistically significant. Statistical power is 2–3 times higher for logistic regression than for other tests. I recommend that logistic regression be used to analyze artificial nest data and other data sets with binary data.
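As a small illustration of logistic regression on binary outcomes, consider the simplest case of one binary predictor (for example, edge versus interior nests). In that case the maximum-likelihood fit has a closed form: the slope is the log odds ratio of the 2x2 table. The sketch below uses that fact; the function name and the counts are hypothetical and are not taken from the paper.

```python
from math import log, sqrt


def logistic_2x2(a, b, c, d):
    """Logistic regression with a single binary predictor, from a 2x2 table.

    a, b: events / non-events in the group coded 1
    c, d: events / non-events in the group coded 0
    Returns (intercept, slope, wald_z). All four counts must be positive.
    """
    intercept = log(c / d)                 # log odds of an event in group 0
    slope = log((a * d) / (b * c))         # log odds ratio, group 1 vs group 0
    se = sqrt(1 / a + 1 / b + 1 / c + 1 / d)  # standard error of the slope
    return intercept, slope, slope / se


# Hypothetical counts: 30 of 40 edge nests depredated, 10 of 40 interior nests.
intercept, slope, z = logistic_2x2(30, 10, 10, 30)
```

With more than one predictor, or a continuous one, the fit needs an iterative solver from a statistics package instead of this closed form.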

4.
The investigation of associations between rare genetic variants and diseases or phenotypes has two goals. Firstly, the identification of which genes or genomic regions are associated, and secondly, discrimination of associated variants from background noise within each region. Over the last few years, many new methods have been developed which associate genomic regions with phenotypes. However, classical methods for high-dimensional data have received little attention. Here we investigate whether several classical statistical methods for high-dimensional data: ridge regression (RR), principal components regression (PCR), partial least squares regression (PLS), a sparse version of PLS (SPLS), and the LASSO are able to detect associations with rare genetic variants. These approaches have been extensively used in statistics to identify the true associations in data sets containing many predictor variables. Using genetic variants identified in three genes that were Sanger sequenced in 1998 individuals, we simulated continuous phenotypes under several different models, and we show that these feature selection and feature extraction methods can substantially outperform several popular methods for rare variant analysis. Furthermore, these approaches can identify which variants are contributing most to the model fit, and therefore both goals of rare variant analysis can be achieved simultaneously with the use of regression regularization methods. These methods are briefly illustrated with an analysis of adiponectin levels and variants in the ADIPOQ gene.

5.
Haplotype inference has become an important part of human genetic data analysis due to its functional and statistical advantages over the single-locus approach in linkage disequilibrium mapping. Different statistical methods have been proposed for detecting haplotype-disease associations using unphased multi-locus genotype data, ranging from the early simple gene-counting method to recent work using the generalized linear model. However, these methods are either confined to the case-control design or unable to yield unbiased point and interval estimates of haplotype effects. Based on the popular logistic regression model, we present a new approach for haplotype association analysis of human disease traits. Using haplotype-based parameterization, our model infers the effects of specific haplotypes (point estimation) and constructs confidence intervals for the risks of haplotypes (interval estimation). Based on the estimated parameters, the model calculates haplotype frequency conditional on the trait value for both discrete and continuous traits. Moreover, our model provides an overall significance level for the association between the disease trait and a group or all of the haplotypes. Because it maximizes the likelihood directly in haplotype estimation, our method also facilitates a computer simulation approach for correcting the significance level of an individual haplotype to adjust for multiple testing. We show, by applying the model to an empirical data set, that our method based on the well-known logistic regression model is a useful tool for haplotype association analysis of human disease traits.

6.
In the current post-genomic era, the genetic basis of pig growth can be understood by assessing SNP marker effects and genomic breeding values (GEBV) based on estimates of these growth curve parameters as phenotypes. Although various statistical methods, such as random regression (RR-BLUP) and Bayesian LASSO (BL), have been applied to genomic selection (GS), none of these has yet been used in a growth curve approach. In this work, we compared the accuracies of RR-BLUP and BL using empirical weight-age data from an outbred F2 (Brazilian Piau X commercial) population. The phenotypes were determined by parameter estimates using a nonlinear logistic regression model and the halothane gene was considered as a marker for evaluating the assumptions of the GS methods in relation to the genetic variation explained by each locus. BL yielded more accurate values for all of the phenotypes evaluated and was used to estimate SNP effects and GEBV vectors. The latter allowed the construction of genomic growth curves, which showed substantial genetic discrimination among animals in the final growth phase. The SNP effect estimates allowed identification of the most relevant markers for each phenotype, the positions of which were coincident with reported QTL regions for growth traits.

7.
It was shown recently using experimental data that it is possible under certain conditions to determine whether a person with known genotypes at a number of markers was part of a sample from which only allele frequencies are known. Using population genetic and statistical theory, we show that the power of such identification is, approximately, proportional to the number of independent SNPs divided by the size of the sample from which the allele frequencies are available. We quantify the limits of identification and propose likelihood and regression analysis methods for the analysis of data. We show that these methods have similar statistical properties and have more desirable properties, in terms of type-I error rate and statistical power, than test statistics suggested in the literature.

8.
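Comparison of methods for screening differentially expressed genes from microarray data   Total citations: 1, self-citations: 0, citations by others: 1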
单文娟  童春发  施季森 《遗传》2008,30(12):1640-1646
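Abstract: Using computer-simulated data and real microarray data, eight methods for screening differentially expressed genes were compared in order to assess how well each method performs on gene chip data. Analysis of the simulated data showed that all eight methods identified and detected uniformly distributed differentially expressed genes well. In terms of algorithm, SAM and the Wilcoxon rank-sum test performed better; in terms of data distribution, detection was better for normally distributed data and poorer for chi-square and exponentially distributed data. Analysis of a poplar cDNA microarray showed that SAM, Samroc, and the regression model method gave similar results, whereas the Wilcoxon rank-sum test differed considerably from them.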

9.

Background

For several immune-mediated diseases, immunological analysis will become more complex in the future with datasets in which cytokine and gene expression data play a major role. These data have certain characteristics that require sophisticated statistical analysis such as strategies for non-normal distribution and censoring. Additionally, complex and multiple immunological relationships need to be adjusted for potential confounding and interaction effects.

Objective

We aimed to introduce and apply different methods for statistical analysis of non-normal censored cytokine and gene expression data. Furthermore, we assessed the performance and accuracy of a novel regression approach in order to allow adjusting for covariates and potential confounding.

Methods

For non-normally distributed censored data traditional means such as the Kaplan-Meier method or the generalized Wilcoxon test are described. In order to adjust for covariates the novel approach named Tobit regression on ranks was introduced. Its performance and accuracy for analysis of non-normal censored cytokine/gene expression data was evaluated by a simulation study and a statistical experiment applying permutation and bootstrapping.

Results

If adjustment for covariates is not necessary traditional statistical methods are adequate for non-normal censored data. Comparable with these and appropriate if additional adjustment is required, Tobit regression on ranks is a valid method. Its power, type-I error rate and accuracy were comparable to the classical Tobit regression.

Conclusion

Non-normally distributed censored immunological data require appropriate statistical methods. Tobit regression on ranks meets these requirements and can be used for adjustment for covariates and potential confounding in large and complex immunological datasets.

10.
MOTIVATION: The identification of risk-associated genetic variants in common diseases remains a challenge to the biomedical research community. It has been suggested that common statistical approaches that exclusively measure main effects are often unable to detect interactions between some of these variants. Detecting and interpreting interactions is a challenging open problem from the statistical and computational perspectives. Methods in computing science may improve our understanding of the mechanisms of genetic disease by detecting interactions even in the presence of very low heritabilities. RESULTS: We have implemented a method using Genetic Programming that is able to induce a Decision Tree to detect interactions in genetic variants. This method has a cross-validation strategy for estimating classification and prediction errors and tests for consistencies in the results. To obtain better estimates, we propose a new consistency measure that takes interactions into account and can be used in a genetic programming environment. This method detected five different interaction models with heritabilities as low as 0.008 and with prediction errors similar to the generated errors. AVAILABILITY: Information on the generated data sets and executable code is available upon request.

11.
12.
The timing of when the embryonic left-right (LR) axis is first established and the mechanisms driving this process are subjects of strong debate. While groups have focused on the role of cilia in establishing the LR axis during gastrula and neurula stages, many animals appear to orient the LR axis prior to the appearance of, or without the benefit of, motile cilia. Because of the large amount of data available in the published literature and the similarities in the type of data collected across laboratories, I have examined relationships between the studies that do and do not implicate cilia, the choice of animal model, the kinds of LR patterning defects observed, and the penetrance of LR phenotypes. I found that treatments affecting cilia structure and motility had a higher penetrance for both altered gene expression and improper organ placement compared to treatments that affect processes in early cleavage stage embryos. I also found differences in penetrance that could be attributed to the animal models used; the mouse is highly prone to LR randomization. Additionally, the data were examined to address whether gene expression can be used to predict randomized organ placement. Using regression analysis, gene expression was found to be predictive of organ placement in frogs, but much less so in the other animals examined. Together, these results challenge previous ideas about the conservation of LR mechanisms, with the mouse model being significantly different from fish, frogs, and chick in almost every aspect examined. Additionally, this analysis indicates that there may be missing pieces in the molecular pathways that dictate how genetic information becomes organ positional information in vertebrates; these gaps will be important for future studies to identify, as LR asymmetry is not only a fundamentally fascinating aspect of development but also of considerable biomedical importance.

13.
ABSTRACT: BACKGROUND: There is a need for automated methods to learn general features of the interactions of a ligand class with its diverse set of protein receptors. An appropriate machine learning approach is Inductive Logic Programming (ILP), which automatically generates comprehensible rules in addition to prediction. The development of ILP systems which can learn rules of the complexity required for studies on protein structure remains a challenge. In this work we use a new ILP system, ProGolem, and demonstrate its performance on learning features of hexose-protein interactions. RESULTS: The rules induced by ProGolem detect interactions mediated by aromatics and by planar-polar residues, in addition to less common features such as the aromatic sandwich. The rules also reveal a previously unreported dependency for residues CYS and LEU. They also specify interactions involving aromatic and hydrogen bonding residues. CONCLUSIONS: In addition to confirming literature results, ProGolem's model has a 10-fold cross-validated predictive accuracy that is superior, at the 95% confidence level, to another ILP system previously used to study protein/hexose interactions and is comparable with state-of-the-art statistical learners.

14.
15.
Disease resistance-related traits have gained increasing importance in aquaculture breeding programs worldwide. Currently, genomic information offers new possibilities in breeding to address the improvement of this kind of trait. The turbot is one of the most promising European aquaculture species, and Philasterides dicentrarchi is a scuticociliate parasite causing fatal disease in farmed turbot. An appealing approach to fight against disease is to achieve a more robust broodstock, which could prevent or diminish the devastating effects of scuticociliatosis on farmed individuals. In the present study, a genome scan for quantitative trait loci (QTL) affecting resistance and survival time to P. dicentrarchi in four turbot families was carried out. The objectives were to identify QTL using different statistical approaches [linear regression (LR) and maximum likelihood (ML)] and to locate significantly associated markers for their application in genetic breeding strategies. Several genomic regions controlling resistance and survival time to P. dicentrarchi were detected. When analyzing each family separately, significant QTL for resistance were identified by the LR method in two linkage groups (LG1 and LG9) and for survival time in LG1, while the ML methodology identified QTL for resistance in LG9 and LG23 and for survival time in LG6 and LG23. The analysis of the total data set identified an additional significant QTL for resistance and survival time in LG3 with the LR method. Significant association between disease resistance-related traits and genotypes was detected for several markers, a single one explaining up to 22% of the phenotypic variance. The results obtained will be essential for identifying candidate genes for resistance and for applying them in marker-assisted selection programs to improve turbot production.

16.
The goal of landscape genetics is to detect and explain landscape effects on genetic diversity and structure. Despite the increasing popularity of landscape genetic approaches, the statistical methods for linking genetic and landscape data remain largely untested. This lack of method evaluation makes it difficult to compare studies utilizing different statistics, and compromises the future development and application of the field. To investigate the suitability and comparability of various statistical approaches used in landscape genetics, we simulated data sets corresponding to five landscape-genetic scenarios. We then analyzed these data with eleven methods, and compared the methods based on their statistical power, type-1 error rates, and their overall ability to lead researchers to accurate conclusions about landscape-genetic relationships. Results suggest that some of the most commonly applied techniques (e.g. Mantel and partial Mantel tests) have high type-1 error rates, and that multivariate, non-linear methods are better suited for landscape genetic data analysis. Furthermore, different methods generally show only moderate levels of agreement. Thus, analyzing a data set with only one method could yield method-dependent results, potentially leading to erroneous conclusions. Based on these findings, we give recommendations for choosing optimal combinations of statistical methods, and identify future research needs for landscape genetic data analyses.
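For readers unfamiliar with the Mantel test mentioned in this abstract, a minimal permutation version is sketched below. It is a generic textbook form (Pearson correlation between the upper triangles of two distance matrices, with rows and columns of one matrix permuted together), not the simulation code used in the study; the function names, the one-sided p-value, and the default of 999 permutations are choices made for this example.

```python
import random
from math import sqrt


def _upper(m):
    """Flatten the strict upper triangle of a square matrix."""
    n = len(m)
    return [m[i][j] for i in range(n) for j in range(i + 1, n)]


def _pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def mantel(d1, d2, permutations=999, seed=0):
    """One-sided Mantel test between two symmetric distance matrices.

    Returns (observed_r, p_value). The p-value is the share of
    permutations (plus the observed arrangement) with r >= observed_r.
    """
    rng = random.Random(seed)
    n = len(d1)
    x = _upper(d1)
    observed = _pearson(x, _upper(d2))
    hits = 1  # the observed arrangement counts as one permutation
    idx = list(range(n))
    for _ in range(permutations):
        rng.shuffle(idx)
        # Permute rows and columns of d2 with the same index order.
        permuted = [[d2[idx[i]][idx[j]] for j in range(n)] for i in range(n)]
        if _pearson(x, _upper(permuted)) >= observed:
            hits += 1
    return observed, hits / (permutations + 1)
```

Because each distance enters many pairs, the entries of a distance matrix are not independent observations, which is one reason the abstract reports inflated type-1 error rates for this family of tests.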

17.
Case–control designs are commonly employed in genetic association studies. In addition to the case–control status, data on secondary traits are often collected. Directly regressing secondary traits on genetic variants from a case–control sample often leads to biased estimation. Several statistical methods have been proposed to address this issue. The inverse probability weighting (IPW) approach and the semiparametric maximum-likelihood (SPML) approach are the most commonly used. A new weighted estimating equation (WEE) approach is proposed to provide unbiased estimation of genetic associations with secondary traits, by combining observed and counterfactual outcomes. Compared to the existing approaches, WEE is more robust against biased sampling and disease model misspecification. We conducted simulations to evaluate the performance of the WEE under various models and sampling schemes. The WEE demonstrated robustness in all scenarios investigated, had appropriate type I error, and was as powerful or more powerful than the IPW and SPML approaches. We applied the WEE to an asthma case–control study to estimate the associations between the thymic stromal lymphopoietin gene and two secondary traits: overweight status and serum IgE level. The WEE identified two SNPs associated with overweight in logistic regression, three SNPs associated with serum IgE levels in linear regression, and an additional four SNPs that were missed in linear regression to be associated with the 75th quantile of IgE in quantile regression. The WEE approach provides a general and robust secondary analysis framework, which complements the existing approaches and should serve as a valuable tool for identifying new associations with secondary traits.

18.
Understanding the mechanics of adaptive evolution requires not only knowing the quantitative genetic bases of the traits of interest but also obtaining accurate measures of the strengths and modes of selection acting on these traits. Most recent empirical studies of multivariate selection have employed multiple linear regression to obtain estimates of the strength of selection. We reconsider the motivation for this approach, paying special attention to the effects of nonnormal traits and fitness measures. We apply an alternative statistical method, logistic regression, to estimate the strength of selection on multiple phenotypic traits. First, we argue that the logistic regression model is more suitable than linear regression for analyzing data from selection studies with dichotomous fitness outcomes. Subsequently, we show that estimates of selection obtained from the logistic regression analyses can be transformed easily to values that directly plug into equations describing adaptive microevolutionary change. Finally, we apply this methodology to two published datasets to demonstrate its utility. Because most statistical packages now provide options to conduct logistic regression analyses, we suggest that this approach should be widely adopted as an analytical tool for empirical studies of multivariate selection.

19.
Many human diseases are attributable to complex interactions among genetic and environmental factors. Statistical tools capable of modeling such complex interactions are necessary to improve identification of genetic factors that increase a patient's risk of disease. Logic Forest (LF), a bagging ensemble algorithm based on logic regression (LR), is able to discover interactions among binary variables predictive of response such as the biologic interactions that predispose individuals to disease. However, LF's ability to recover interactions degrades for more infrequently occurring interactions. A rare genetic interaction may occur if, for example, the interaction increases disease risk in a patient subpopulation that represents only a small proportion of the overall patient population. We present an alternative ensemble adaptation of LR, called LBoost, that is based on boosting rather than bagging. We compare the ability of LBoost and LF to identify variable interactions in simulation studies. Results indicate that LBoost is superior to LF for identifying genetic interactions associated with disease that are infrequent in the population. We apply LBoost to a subset of single nucleotide polymorphisms on the PRDX genes from the Cancer Genetic Markers of Susceptibility Breast Cancer Scan to investigate genetic risk for breast cancer. LBoost is publicly available on CRAN as part of the LogicForest package, http://cran.r-project.org/.

20.
MOTIVATION: A central problem in bioinformatics is the assignment of function to sequenced open reading frames (ORFs). The most common approach is based on inferred homology using a statistically based sequence similarity (SIM) method, e.g. PSI-BLAST. Alternative non-SIM based bioinformatic methods are becoming popular. One such method is Data Mining Prediction (DMP). This is based on combining evidence from amino-acid attributes, predicted structure and phylogenic patterns; and uses a combination of Inductive Logic Programming data mining, and decision trees to produce prediction rules for functional class. DMP predictions are more general than is possible using homology. In 2000/1, DMP was used to make public predictions of the function of 1309 Escherichia coli ORFs. Since then biological knowledge has advanced allowing us to test our predictions. RESULTS: We examined the updated (20.02.02) Riley group genome annotation, and examined the scientific literature for direct experimental derivations of ORF function. Both tests confirmed the DMP predictions. Accuracy varied between rules, and with the detail of prediction, but they were generally significantly better than random. For voting rules, accuracies of 75-100% were obtained. Twenty-one of these DMP predictions have been confirmed by direct experimentation. The DMP rules also have interesting biological explanations. DMP is, to the best of our knowledge, the first non-SIM based prediction method to have been tested directly on new data. AVAILABILITY: We have designed the "Genepredictions" database for protein functional predictions. This is intended to act as an open repository for predictions for any organism and can be accessed at http://www.genepredictions.org
