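To compare the statistical performance of burden tests commonly used in rare-variant genetic association studies (CMC, WST, SUM and their extensions) under different genetic scenarios, we simulated rare-variant case-control data sets that varied in sample size, linkage disequilibrium (LD) parameter, number of non-associated noise variants mixed in, and effect of the associated variants. Each burden test was applied to these data sets, and its type I error rate and power were calculated. The type I error rates of all methods were close to 0.05. When the rare-variant effects were in the same direction, power increased for every method except aSUM as the LD parameter increased and the number of non-associated noise variants decreased; when the effects were in different directions, the power of all methods dropped markedly. Except under strong LD, methods that account for effect direction had higher power than methods that do not, and power increased with sample size. The statistical performance of burden tests is thus affected by several factors, including effect size and direction, noise variants and linkage disequilibrium. In practice, the choice of method, unit of aggregation and weights should draw on prior biological information about the genetic variants to improve study power.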
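To make the idea of a burden test concrete, the following is a minimal sketch, not the simulation code of the study above. It assumes genotypes coded as rare-allele counts (0, 1 or 2) per variant, shows a CMC-style collapsing indicator and a weighted-sum-style score with frequency-based weights, and uses a simple label permutation for the p-value. All function names are hypothetical, and the weights are computed once rather than recomputed in every permutation, which is a simplification.

```python
import math
import random


def collapse_indicator(genotypes):
    """CMC-style collapsing: 1 if an individual carries any rare allele in the set."""
    return [1 if any(g > 0 for g in row) else 0 for row in genotypes]


def weighted_sum_scores(genotypes, controls_mask):
    """Weighted-sum-style burden score; variants rarer in controls get larger weights."""
    n_variants = len(genotypes[0])
    controls = [row for row, is_ctrl in zip(genotypes, controls_mask) if is_ctrl]
    weights = []
    for j in range(n_variants):
        minor = sum(row[j] for row in controls)
        # Smoothed allele frequency in controls, so that q is never 0 or 1.
        q = (minor + 1) / (2 * len(controls) + 2)
        weights.append(1 / math.sqrt(q * (1 - q)))
    return [sum(w * g for w, g in zip(weights, row)) for row in genotypes]


def permutation_p_value(scores, is_case, n_perm=1000, seed=0):
    """One-sided permutation p-value for 'cases have a higher mean score'."""
    rng = random.Random(seed)

    def mean_diff(labels):
        cases = [s for s, c in zip(scores, labels) if c]
        ctrls = [s for s, c in zip(scores, labels) if not c]
        return sum(cases) / len(cases) - sum(ctrls) / len(ctrls)

    observed = mean_diff(is_case)
    labels = list(is_case)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(labels)  # break any link between score and case status
        if mean_diff(labels) >= observed:
            hits += 1
    return (hits + 1) / (n_perm + 1)
```

Because the score simply adds up allele counts, variants with opposite effect directions cancel each other out, which is consistent with the loss of power reported above for mixed-direction effects.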

Environmental sequencing shows that plants harbor complex communities of microbes that vary across environments. However, many approaches for mapping plant genetic variation to microbe‐related traits were developed in the relatively simple context of binary host–microbe interactions under controlled conditions. Recent advances in sequencing and statistics make genome‐wide association studies (GWAS) an increasingly promising approach for identifying the plant genetic variation associated with microbes in a community context. This review discusses early efforts on GWAS of the plant phyllosphere microbiome and the outlook for future studies based on human microbiome GWAS. A workflow for GWAS of the phyllosphere microbiome is then presented, with particular attention to how perspectives on the mechanisms, evolution and environmental dependence of plant–microbe interactions will influence the choice of traits to be mapped.

Yang F, Thomas DC. Human Heredity, 2011, 71(4): 209-220
Multiple rare variants have been suggested as accounting for some of the associations with common single nucleotide polymorphisms identified in genome-wide association studies or possibly some of the as yet undiscovered heritability. We consider the power of various approaches to designing substudies aimed at using next-generation sequencing technologies to discover novel variants and to select some subsets that are possibly causal for genotyping in the original case-control study and testing for association using various weighted sum indices. We find that the selection of variants based on the statistical significance of the case-control difference in the subsample yields good power for testing rare variant indices in the main study, and that multivariate models including both the summary index of rare variants and the associated common single nucleotide polymorphisms can distinguish which is the causal factor. By simulation, we explore the effects of varying the size of the discovery subsample, choice of index, and true causal model.

Next-generation sequencing data will soon become routinely available for association studies between complex traits and rare variants. Sequencing data, however, are characterized by the presence of sequencing errors at each individual genotype. This makes it especially challenging to perform association studies of rare variants, which, due to their low minor allele frequencies, can be easily perturbed by genotype errors. In this article, we develop the quality-weighted multivariate score association test (qMSAT), a new procedure that allows powerful association tests between complex traits and multiple rare variants in the presence of sequencing errors. Simulation results based on quality scores from real data show that the qMSAT often outperforms current methods that do not use quality information. In particular, the qMSAT can dramatically increase power over existing methods under moderate sample sizes and relatively low coverage. Moreover, in a study of obesity data, we used the qMSAT to identify two functional regions (MGLL promoter and MGLL 3'-untranslated region) where rare variants are associated with extreme obesity. Due to the high cost of sequencing data, the qMSAT is especially valuable for large-scale studies involving rare variants, as it can potentially increase power without additional experimental cost. qMSAT is freely available at http://qmsat.sourceforge.net/.

Genome-wide association studies (GWAS) have been successful in identifying common genetic variation reproducibly associated with disease. However, most associated variants confer very small risk, and after meta-analysis of large cohorts a large fraction of expected heritability still remains unexplained. A possible explanation is that rare variants currently undetected by GWAS with SNP arrays could contribute a large fraction of risk when present in cases. This concept has spurred great interest in exploring the role of rare variants in disease. As the cost of sequencing continues to plummet, it is becoming feasible to directly sequence case-control samples for testing disease association including rare variants. We have developed a test statistic that allows for association testing among cases and controls using data directly from sequencing reads. In addition, our method allows for random errors in reads. We determine the probability of a true genotype call based on the observed base pair reads using the expectation-maximization algorithm. We apply the SumStat procedure to obtain a single statistic for a group of multiple rare variant loci. We document the validity of our method through simulations. Our results suggest that our statistic maintains the correct type I error rate, even in the presence of differential misclassification for sequence reads, and that it has good power under a number of scenarios. Finally, our SumStat results show power at least as good as the maximum single locus results.

With the development of massively parallel sequencing technologies, there is a substantial need for developing powerful rare variant association tests. Common approaches include burden and non-burden tests. Burden tests assume all rare variants in the target region have effects on the phenotype in the same direction and of similar magnitude. The recently proposed sequence kernel association test (SKAT) (Wu, M. C., and others, 2011. Rare-variant association testing for sequencing data with the SKAT. The American Journal of Human Genetics 89, 82-93), an extension of the C-alpha test (Neale, B. M., and others, 2011. Testing for an unusual distribution of rare variants. PLoS Genetics 7, 161-165), provides a robust test that is particularly powerful in the presence of protective and deleterious variants and null variants, but is less powerful than burden tests when a large number of variants in a region are causal and in the same direction. As the underlying biological mechanisms are unknown in practice and vary from one gene to another across the genome, it is of substantial practical interest to develop a test that is optimal for both scenarios. In this paper, we propose a class of tests that include burden tests and SKAT as special cases, and derive an optimal test within this class that maximizes power. We show that this optimal test outperforms burden tests and SKAT in a wide range of scenarios. The results are illustrated using simulation studies and triglyceride data from the Dallas Heart Study. In addition, we have derived sample size/power calculation formulas for SKAT with a new family of kernels to facilitate designing new sequence association studies.
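One common way to write a family of statistics that contains both a burden test and SKAT as special cases is sketched below. The abstract gives no formulas, so the notation is an assumption: g_{ij} is the rare-allele count of subject i at variant j, y_i the phenotype, \hat{\mu}_i the fitted mean under the null model, w_j a variant weight and \rho a mixing parameter.

```latex
% Illustrative notation only; none of these symbols appear in the abstract.
S_j = \sum_{i=1}^{n} g_{ij}\,(y_i - \hat{\mu}_i), \qquad
Q_{\text{burden}} = \Bigl(\sum_{j=1}^{m} w_j S_j\Bigr)^{2}, \qquad
Q_{\text{SKAT}} = \sum_{j=1}^{m} w_j^{2} S_j^{2},
\\
Q_{\rho} = (1-\rho)\,Q_{\text{SKAT}} + \rho\,Q_{\text{burden}}, \qquad 0 \le \rho \le 1 .
```

Setting \rho = 1 recovers the burden statistic, in which scores of opposite sign cancel before squaring, while \rho = 0 recovers the SKAT statistic, in which each score is squared first and so contributes regardless of direction.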

Genetic epidemiologic studies often involve investigation of the association of a disease with a genomic region in terms of the underlying haplotypes, that is the combination of alleles at multiple loci along homologous chromosomes. In this article, we consider the problem of estimating haplotype-environment interactions from case-control studies when some of the environmental exposures themselves may be influenced by genetic susceptibility. We specify the distribution of the diplotypes (haplotype pair) given environmental exposures for the underlying population based on a novel semiparametric model that allows haplotypes to be potentially related with environmental exposures, while allowing the marginal distribution of the diplotypes to maintain certain population genetics constraints such as Hardy-Weinberg equilibrium. The marginal distribution of the environmental exposures is allowed to remain completely nonparametric. We develop a semiparametric estimating equation methodology and related asymptotic theory for estimation of the disease odds ratios associated with the haplotypes, environmental exposures, and their interactions, parameters that characterize haplotype-environment associations and the marginal haplotype frequencies. The problem of phase ambiguity of genotype data is handled using a suitable expectation-maximization algorithm. We study the finite-sample performance of the proposed methodology using simulated data. An application of the methodology is illustrated using a case-control study of colorectal adenoma, designed to investigate how the smoking-related risk of colorectal adenoma can be modified by "NAT2," a smoking-metabolism gene that may potentially influence susceptibility to smoking itself.

Landscape genomics is an emerging research field that aims to identify the environmental factors that shape adaptive genetic variation and the gene variants that drive local adaptation. Its development has been facilitated by next‐generation sequencing, which allows for screening thousands to millions of single nucleotide polymorphisms in many individuals and populations at reasonable costs. In parallel, data sets describing environmental factors have greatly improved and increasingly become publicly accessible. Accordingly, numerous analytical methods for environmental association studies have been developed. Environmental association analysis identifies genetic variants associated with particular environmental factors and has the potential to uncover adaptive patterns that are not discovered by traditional tests for the detection of outlier loci based on population genetic differentiation. We review methods for conducting environmental association analysis including categorical tests, logistic regressions, matrix correlations, general linear models and mixed effects models. We discuss the advantages and disadvantages of different approaches, provide a list of dedicated software packages and their specific properties, and stress the importance of incorporating neutral genetic structure in the analysis. We also touch on additional important aspects such as sampling design, environmental data preparation, pooled and reduced‐representation sequencing, candidate‐gene approaches, linearity of allele–environment associations and the combination of environmental association analyses with traditional outlier detection tests. We conclude by summarizing expected future directions in the field, such as the extension of statistical approaches, environmental association analysis for ecological gene annotation, and the need for replication and post hoc validation studies.

Biological and empirical evidence suggests that rare variants account for a large proportion of the genetic contributions to complex human diseases. Recent technological advances in high-throughput sequencing platforms have made it possible for researchers to generate comprehensive information on rare variants in large samples. We provide a general framework for association testing with rare variants by combining mutation information across multiple variant sites within a gene and relating the enriched genetic information to disease phenotypes through appropriate regression models. Our framework covers all major study designs (i.e., case-control, cross-sectional, cohort and family studies) and all common phenotypes (e.g., binary, quantitative, and age at onset), and it allows arbitrary covariates (e.g., environmental factors and ancestry variables). We derive theoretically optimal procedures for combining rare mutations and construct suitable test statistics for various biological scenarios. The allele-frequency threshold can be fixed or variable. The effects of the combined rare mutations on the phenotype can be in the same direction or different directions. The proposed methods are statistically more powerful and computationally more efficient than existing ones. An application to a deep-resequencing study of drug targets led to a discovery of rare variants associated with total cholesterol. The relevant software is freely available.

Introduction: Stroke is a multifactorial, heterogeneous and heritable disorder and is considered one of the major diseases. Previous reports have used a variety of designs, including genome-wide association studies (GWAS), replication, case-control, cross-sectional and meta-analysis studies, yet a diagnostic marker is still lacking worldwide. Few studies have been carried out in the Saudi population, and we aimed to investigate the molecular association of single nucleotide polymorphisms (SNPs) identified through GWAS and meta-analysis studies in stroke patients in the Saudi population. Methods: In this case-control study, we enrolled 207 cases and 207 controls, balanced by gender, at King Saud University Hospital in the capital city of Saudi Arabia. A peripheral blood sample (5 mL) was collected in two different vacutainers: 3 mL of coagulated blood was used for lipid analysis (biochemical tests) and 2 mL was used for DNA analysis (molecular tests). Genomic DNA was extracted from the collected blood samples, specific primers were designed for the selected SNPs (SORT1-rs646218 and OLR1-rs11053646 polymorphisms), PCR-RFLP was performed, and randomly selected samples were DNA sequenced to cross-check the results. Results: The rs646218 and rs11053646 polymorphisms were significantly associated under the allele, genotype and dominant models, with and without crude odds ratios (ORs), and in multiple logistic regression analysis (p < 0.05). Correlation analysis between lipid profile and genotypes confirmed a significant relation between triglycerides and the rs646218 and rs11053646 polymorphisms. The rs11053646 polymorphism was also correlated with HDLC (p = 0.04). Genotypes were compared between cases and controls separately in males and in females; for the rs11053646 polymorphism, the comparison of male cases with male controls showed an association for the dominant model and heterozygote genotypes (p < 0.05). Conclusion: The results of the current study confirmed that the SORT1 and OLR1 SNPs were associated with stroke in the Saudi population. These results agree with prior results documented through GWAS and meta-analysis. However, studies in other ethnic populations are still needed in the context of human hereditary diseases.

Marginal tests based on individual SNPs are routinely used in genetic association studies. Studies have shown that haplotype‐based methods may provide more power in disease mapping than methods based on single markers when, for example, multiple disease‐susceptibility variants occur within the same gene. A limitation of haplotype‐based methods is that the number of parameters increases exponentially with the number of SNPs, inducing a commensurate increase in the degrees of freedom and weakening the power to detect associations. To address this limitation, we introduce a hierarchical linkage disequilibrium model for disease mapping, based on a reparametrization of the multinomial haplotype distribution, where every parameter corresponds to the cumulant of each possible subset of a set of loci. This hierarchy present in the parameters enables us to employ flexible testing strategies over a range of parameter sets: from standard single SNP analyses through the full haplotype distribution tests, reducing degrees of freedom and increasing the power to detect associations. We show via extensive simulations that our approach maintains the type I error at nominal level and has increased power under many realistic scenarios, as compared to single SNP and standard haplotype‐based studies. To evaluate the performance of our proposed methodology in real data, we analyze genome‐wide data from the Wellcome Trust Case‐Control Consortium.



Recently, mixed linear models have been used to address the issue of "missing" heritability in traditional genome-wide association studies (GWAS). The models assume that all single-nucleotide polymorphisms (SNPs) are associated with the phenotypes of interest. However, it is more common that only a small proportion of SNPs have significant effects on the phenotypes, while most SNPs have no or very small effects. To incorporate this feature, we propose an efficient Hierarchical Bayesian Model (HBM) that extends the existing mixed models to enforce automatic selection of significant SNPs. The HBM models the SNP effects using a mixture distribution of a point mass at zero and a normal distribution, where the point mass corresponds to the non-associated SNPs.
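The mixture prior described above is often called a spike-and-slab prior. A minimal way to write it is sketched below; the symbols are assumptions for illustration and are not taken from the paper: \beta_j is the effect of SNP j, \pi the proportion of associated SNPs, \delta_0 a point mass at zero and \sigma_\beta^2 the variance of the non-zero effects.

```latex
% Illustrative notation only; the abstract does not define these symbols.
\beta_j \sim (1-\pi)\,\delta_0 + \pi\,\mathcal{N}\bigl(0, \sigma_\beta^{2}\bigr),
\qquad j = 1, \dots, p .
```

Under this form, \pi plays the role of the "proportion of SNPs associated with the phenotype" that the abstract reports estimating, and a SNP is selected when its posterior probability of being drawn from the normal component is high.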


We estimate the HBM using Gibbs sampling. The estimation performance of our method is first demonstrated through two simulation studies. We make the simulation setups realistic by using parameters fitted on the Framingham Heart Study (FHS) data. The simulation studies show that our method can accurately estimate the proportion of SNPs associated with the simulated phenotype and identify these SNPs, and that it adapts better to certain model mis-specifications than the standard mixed models. In addition, we analyze data from the FHS and the Health and Retirement Study (HRS) to study the association between Body Mass Index (BMI) and SNPs on Chromosome 16, and replicate the identified genetic associations. The analysis of the FHS data identifies 0.3% of SNPs on Chromosome 16 that affect BMI, including rs9939609 and rs9939973 on the FTO gene. These two SNPs are in strong linkage disequilibrium with rs1558902 (Rsq = 0.901 for rs9939609 and Rsq = 0.905 for rs9939973), which has been reported to be linked with obesity in previous GWAS. We then replicate the findings using the HRS data: the analysis finds 0.4% of SNPs associated with BMI on Chromosome 16. Furthermore, around 25% of the genes that are identified to be associated with BMI are common between the two studies.


The results demonstrate that the HBM and the associated estimation algorithm offer a powerful tool for identifying significant genetic associations with phenotypes of interest, among a large number of SNPs that are common in modern genetics studies.

Single nucleotide polymorphisms (SNPs) in the thioredoxin‐interacting protein (TXNIP) gene may modulate TXNIP expression and thereby increase the risk of coronary artery disease (CAD). In a two‐stage case–control study with a total of 1818 CAD patients and 1963 controls, we genotyped three SNPs in TXNIP and found that the variant genotypes of SNPs rs7212 [odds ratio (OR) = 1.26, P = 0.001] and rs7211 (OR = 1.23, P = 0.005) were significantly associated with increased CAD risk under a dominant model. In haplotype analyses, compared with the reference haplotype, haplotype ‘G‐T’ had a 1.22‐fold increased risk of CAD (P = 0.003). We also observed the cumulative effects of SNPs rs7212 and rs7211 on CAD risk and the severity of coronary atherosclerosis. Moreover, the gene–environment interactions among the variant genotypes of SNP rs7212, smoking habit, alcohol drinking habit and history of type 2 diabetes were associated with a 3.70‐fold increased risk of CAD (P < 0.001). Subsequent genotype‐phenotype correlation analyses further observed the significant effects of SNP rs7212 on TXNIP mRNA expression, plasma TXNIP and malondialdehyde levels. Taken together, our data suggest that TXNIP SNPs may individually and cumulatively affect CAD risk through a possible mechanism for regulating TXNIP expression and gene–environment interactions.
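For readers unfamiliar with odds ratios under a dominant model, the sketch below shows how a crude (unadjusted) OR and a Woolf-type confidence interval can be computed from a 2x2 table of variant-genotype carriers versus non-carriers. It is an illustration only: the function name and the counts in the usage note are hypothetical, and the ORs reported in the abstract may well have been adjusted for covariates, which this sketch does not do.

```python
import math


def dominant_model_odds_ratio(case_carriers, case_noncarriers,
                              control_carriers, control_noncarriers, z=1.96):
    """Crude odds ratio with a log-scale (Woolf) confidence interval.

    Under a dominant model, a 'carrier' has at least one copy of the variant
    allele (heterozygous or homozygous variant genotype).
    """
    odds_ratio = (case_carriers * control_noncarriers) / (
        case_noncarriers * control_carriers)
    # Standard error of log(OR): square root of the summed reciprocal counts.
    se_log = math.sqrt(1 / case_carriers + 1 / case_noncarriers
                       + 1 / control_carriers + 1 / control_noncarriers)
    lower = math.exp(math.log(odds_ratio) - z * se_log)
    upper = math.exp(math.log(odds_ratio) + z * se_log)
    return odds_ratio, lower, upper
```

With made-up counts such as `dominant_model_odds_ratio(30, 70, 20, 80)`, the odds of carrying the variant are (30/70) in cases and (20/80) in controls, and the function returns their ratio together with the interval bounds.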

Large genomic studies are becoming increasingly common with advances in sequencing technology, and our ability to understand how genomic variation influences phenotypic variation between individuals has never been greater. The exploration of such relationships first requires the identification of associations between molecular markers and phenotypes. Here, we explore the use of Random Forest (RF), a powerful machine‐learning algorithm, in genomic studies to discern loci underlying both discrete and quantitative traits, particularly when studying wild or nonmodel organisms. RF is becoming increasingly used in ecological and population genetics because, unlike traditional methods, it can efficiently analyse thousands of loci simultaneously and account for nonadditive interactions. However, understanding both the power and limitations of Random Forest is important for its proper implementation and the interpretation of results. We therefore provide a practical introduction to the algorithm and its use for identifying associations between molecular markers and phenotypes, discussing such topics as data limitations, algorithm initiation and optimization, as well as interpretation. We also provide short R tutorials as examples, with the aim of providing a guide to the implementation of the algorithm. Topics discussed here are intended to serve as an entry point for molecular ecologists interested in employing Random Forest to identify trait associations in genomic data sets.

Psychiatric phenotypes are multifactorial and polygenic, resulting from the complex interplay of genes and environmental factors that act cumulatively throughout an organism's lifetime. Adverse life events are strong predictors of risk for a number of psychiatric disorders and a number of studies have focused on gene–environment interactions (GxEs) occurring at genetic loci involved in the stress response. One such locus that has received increasing attention is the gene encoding FK506 binding protein 51 (FKBP5), a heat shock protein 90 cochaperone of the steroid receptor complex that among other functions regulates sensitivity of the glucocorticoid receptor. Interactions between FKBP5 gene variants and life stressors alter the risk not only for mood and anxiety disorders, but also for a number of other disease phenotypes. In this review, we will focus on molecular and system‐wide mechanisms of this GxE with the aim of establishing a framework that explains GxE interactions. We will also discuss how an understanding of the biological effects of this GxE may lead to novel therapeutic approaches.

Genome–environment association methods aim to detect genetic markers associated with environmental variables. The detected associations are usually analysed separately to identify the genomic regions involved in local adaptation. However, a recent study suggests that single‐locus associations can be combined and used in a predictive way to estimate environmental variables for new individuals on the basis of their genotypes. Here, we introduce an original approach to predict the environmental range (values and upper and lower limits) of species genotypes from the genetic markers significantly associated with those environmental variables in an independent set of individuals. We illustrate this approach by predicting aridity in a data set of 950 individuals of wild beets and 299 individuals of cultivated beets genotyped at 14,409 random single nucleotide polymorphisms (SNPs). We detected 66 alleles associated with aridity and used them to calculate the fraction (I) of aridity‐associated alleles in each individual. The fraction I correctly predicted the values of aridity in an independent validation set of wild individuals and was then used to predict aridity in the 299 cultivated individuals. Wild individuals had higher median values and a wider range of values of aridity than the cultivated individuals, suggesting that wild individuals have a greater ability to withstand aridity stress and could be used to improve the resistance of cultivated varieties to aridity.
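The per-individual fraction I can be illustrated with a short sketch. The exact definition used in the study is not given in the abstract, so this is one plausible reading: the share of an individual's allele copies, at the associated loci with non-missing genotypes, that match the environment-associated allele. The function name, the data layout and the locus names are hypothetical.

```python
def associated_allele_fraction(genotype, associated_alleles):
    """Fraction of allele copies that match the environment-associated allele.

    genotype: dict mapping locus name -> tuple of the two alleles carried.
    associated_alleles: dict mapping locus name -> associated allele.
    Loci missing from the genotype are skipped (assumed handling of missing data).
    """
    carried = 0
    total = 0
    for locus, allele in associated_alleles.items():
        calls = genotype.get(locus)
        if calls is None:
            continue  # no genotype call at this locus
        total += len(calls)
        carried += sum(1 for a in calls if a == allele)
    return carried / total if total else float("nan")


# Hypothetical individual: heterozygous at snp1, homozygous at snp2, missing snp3.
individual = {"snp1": ("A", "G"), "snp2": ("T", "T")}
associated = {"snp1": "A", "snp2": "T", "snp3": "C"}
fraction_i = associated_allele_fraction(individual, associated)  # 3 of 4 copies
```

A fraction computed this way could then be regressed against the environmental variable in a training set and applied to new individuals, which is the predictive use the abstract describes.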

With recent advances in genotyping and sequencing technologies, many disease susceptibility loci have been identified. However, much of the genetic heritability remains unexplained and the replication rate between independent studies is still low. Meanwhile, there have been increasing efforts on functional annotations of the entire human genome, such as the Encyclopedia of DNA Elements (ENCODE) project and other similar projects. It has been shown that incorporating these functional annotations to prioritize genome-wide association signals may help identify true association signals. However, to our knowledge, the extent of the improvement when functional annotation data are considered has not been studied in the literature. In this article, we propose a statistical framework to estimate the improvement in replication rate with annotation data, and apply it to Crohn's disease and DNase I hypersensitive sites. The results show that with cell line specific functional annotations, the expected replication rate is improved, but only at a modest level.
