首页 | 本学科首页   官方微博 | 高级检索  
相似文献
 共查询到20条相似文献,搜索用时 36 毫秒
1.
Cui Y  Kang G  Sun K  Qian M  Romero R  Fu W 《Genetics》2008,179(1):637-650
Genes are the functional units in most organisms. Compared to genetic variants located outside genes, genic variants are more likely to affect disease risk. The development of the human HapMap project provides an unprecedented opportunity for genetic association studies at the genomewide level for elucidating disease etiology. Currently, most association studies at the single-nucleotide polymorphism (SNP) or the haplotype level rely on the linkage information between SNP markers and disease variants, with which association findings are difficult to replicate. Moreover, variants in genes might not be sufficiently covered by currently available methods. In this article, we present a gene-centric approach via entropy statistics for a genomewide association study to identify disease genes. The new entropy-based approach considers genic variants within one gene simultaneously and is developed on the basis of a joint genotype distribution among genetic variants for an association test. A grouping algorithm based on a penalized entropy measure is proposed to reduce the dimension of the test statistic. Type I error rates and power of the entropy test are evaluated through extensive simulation studies. The results indicate that the entropy test has stable power under different disease models with a reasonable sample size. Compared to single SNP-based analysis, the gene-centric approach has greater power, especially when there is more than one disease variant in a gene. As the genomewide genic SNPs become available, our entropy-based gene-centric approach would provide a robust and computationally efficient way for gene-based genomewide association study.  相似文献   

2.
Recent high-throughput genotyping technologies, such as the Affymetrix 500k array and the Illumina HumanHap 550 beadchip, have driven down the costs of association studies and have enabled the measurement of single-nucleotide polymorphism (SNP) allele frequency differences between case and control populations on a genomewide scale. A key aspect in the efficiency of association studies is the notion of "indirect association," where only a subset of SNPs are collected to serve as proxies for the uncollected SNPs, taking advantage of the correlation structure between SNPs. Recently, a new class of methods for indirect association, multimarker methods, has been proposed. Although the multimarker methods are a considerable advancement, current methods do not fully take advantage of the correlation structure between SNPs and their multimarker proxies. In this article, we propose a novel multimarker indirect-association method, WHAP, that is based on a weighted sum of the haplotype frequency differences. In contrast to traditional indirect-association methods, we show analytically that there is a considerable gain in power achieved by our method compared with both single-marker and multimarker tests, as well as traditional haplotype-based tests. Our results are supported by empirical evaluation across the HapMap reference panel data sets, and a software implementation for the Affymetrix 500k and Illumina HumanHap 550 chips is available for download.  相似文献   

3.
Power and sample size calculations are critical parts of any research design for genetic association. We present a method that utilizes haplotype frequency information and average marker-marker linkage disequilibrium on SNPs typed in and around all genes on a chromosome. The test statistic used is the classic likelihood ratio test applied to haplotypes in case/control populations. Haplotype frequencies are computed through specification of genetic model parameters. Power is determined by computation of the test's non-centrality parameter. Power per gene is computed as a weighted average of the power assuming each haplotype is associated with the trait. We apply our method to genotype data from dense SNP maps across three entire chromosomes (6, 21, and 22) for three different human populations (African-American, Caucasian, Chinese), three different models of disease (additive, dominant, and multiplicative) and two trait allele frequencies (rare, common). We perform a regression analysis using these factors, average marker-marker disequilibrium, and the haplotype diversity across the gene region to determine which factors most significantly affect average power for a gene in our data. Also, as a 'proof of principle' calculation, we perform power and sample size calculations for all genes within 100 kb of the PSORS1 locus (chromosome 6) for a previously published association study of psoriasis. Results of our regression analysis indicate that four highly significant factors that determine average power to detect association are: disease model, average marker-marker disequilibrium, haplotype diversity, and the trait allele frequency. These findings may have important implications for the design of well-powered candidate gene association studies. Our power and sample size calculations for the PSORS1 gene appear consistent with published findings, namely that there is substantial power (>0.99) for most genes within 100 kb of the PSORS1 locus at the 0.01 significance level.  相似文献   

4.
An entropy-based statistic for genomewide association studies   总被引:8,自引:0,他引:8       下载免费PDF全文
Efficient genotyping methods and the availability of a large collection of single-nucleotide polymorphisms provide valuable tools for genetic studies of human disease. The standard chi2 statistic for case-control studies, which uses a linear function of allele frequencies, has limited power when the number of marker loci is large. We introduce a novel test statistic for genetic association studies that uses Shannon entropy and a nonlinear function of allele frequencies to amplify the differences in allele and haplotype frequencies to maintain statistical power with large numbers of marker loci. We investigate the relationship between the entropy-based test statistic and the standard chi2 statistic and show that, in most cases, the power of the entropy-based statistic is greater than that of the standard chi2 statistic. The distribution of the entropy-based statistic and the type I error rates are validated using simulation studies. Finally, we apply the new entropy-based test statistic to two real data sets, one for the COMT gene and schizophrenia and one for the MMP-2 gene and esophageal carcinoma, to evaluate the performance of the new method for genetic association studies. The results show that the entropy-based statistic obtained smaller P values than did the standard chi2 statistic.  相似文献   

5.
Recent studies have indicated that linkage disequilibrium (LD) between single nucleotide polymorphism (SNP) markers can be used to derive a reduced set of tagging SNPs (tSNPs) for genetic association studies. Previous strategies for identifying tSNPs have focused on LD measures or haplotype diversity, but the statistical power to detect disease-associated variants using tSNPs in genetic studies has not been fully characterized. We propose a new approach of selecting tSNPs based on determining the set of SNPs with the highest power to detect association. Two-locus genotype frequencies are used in the power calculations. To show utility, we applied this power method to a large number of SNPs that had been genotyped in Caucasian samples. We demonstrate that a significant reduction in genotyping efforts can be achieved although the reduction depends on genotypic relative risk, inheritance mode and the prevalence of disease in the human population. The tSNP sets identified by our method are remarkably robust to changes in the disease model when small relative risk and additive mode of inheritance are employed. We have also evaluated the ability of the method to detect unidentified SNPs. Our findings have important implications in applying tSNPs from different data sources in association studies.  相似文献   

6.
Liu W  Zhao W  Chase GA 《Human heredity》2006,61(1):31-44
OBJECTIVE: Single nucleotide polymorphisms (SNPs) serve as effective markers for localizing disease susceptibility genes, but current genotyping technologies are inadequate for genotyping all available SNP markers in a typical linkage/association study. Much attention has recently been paid to methods for selecting the minimal informative subset of SNPs in identifying haplotypes, but there has been little investigation of the effect of missing or erroneous genotypes on the performance of these SNP selection algorithms and subsequent association tests using the selected tagging SNPs. The purpose of this study is to explore the effect of missing genotype or genotyping error on tagging SNP selection and subsequent single marker and haplotype association tests using the selected tagging SNPs. METHODS: Through two sets of simulations, we evaluated the performance of three tagging SNP selection programs in the presence of missing or erroneous genotypes: Clayton's diversity based program htstep, Carlson's linkage disequilibrium (LD) based program ldSelect, and Stram's coefficient of determination based program tagsnp.exe. RESULTS: When randomly selected known loci were relabeled as 'missing', we found that the average number of tagging SNPs selected by all three algorithms changed very little and the power of subsequent single marker and haplotype association tests using the selected tagging SNPs remained close to the power of these tests in the absence of missing genotype. When random genotyping errors were introduced, we found that the average number of tagging SNPs selected by all three algorithms increased. In data sets simulated according to the haplotype frequecies in the CYP19 region, Stram's program had larger increase than Carlson's and Clayton's programs. In data sets simulated under the coalescent model, Carlson's program had the largest increase and Clayton's program had the smallest increase. In both sets of simulations, with the presence of genotyping errors, the power of the haplotype tests from all three programs decreased quickly, but there was not much reduction in power of the single marker tests. CONCLUSIONS: Missing genotypes do not seem to have much impact on tagging SNP selection and subsequent single marker and haplotype association tests. In contrast, genotyping errors could have severe impact on tagging SNP selection and haplotype tests, but not on single marker tests.  相似文献   

7.
Recent studies have shown that the human genome has a haplotype block structure, such that it can be divided into discrete blocks of limited haplotype diversity. In each block, a small fraction of single-nucleotide polymorphisms (SNPs), referred to as "tag SNPs," can be used to distinguish a large fraction of the haplotypes. These tag SNPs can potentially be extremely useful for association studies, in that it may not be necessary to genotype all SNPs; however, this depends on how much power is lost. Here we develop a simulation study to quantitatively assess the power loss for a variety of study designs, including case-control designs and case-parental control designs. First, a number of data sets containing case-parental or case-control samples are generated on the basis of a disease model. Second, a small fraction of case and control individuals in each data set are genotyped at all the loci, and a dynamic programming algorithm is used to determine the haplotype blocks and the tag SNPs based on the genotypes of the sampled individuals. Third, the statistical power of tests was evaluated on the basis of three kinds of data: (1) all of the SNPs and the corresponding haplotypes, (2) the tag SNPs and the corresponding haplotypes, and (3) the same number of randomly chosen SNPs as the number of tag SNPs and the corresponding haplotypes. We study the power of different association tests with a variety of disease models and block-partitioning criteria. Our study indicates that the genotyping efforts can be significantly reduced by the tag SNPs, without much loss of power. Depending on the specific haplotype block-partitioning algorithm and the disease model, when the identified tag SNPs are only 25% of all the SNPs, the power is reduced by only 4%, on average, compared with a power loss of approximately 12% when the same number of randomly chosen SNPs is used in a two-locus haplotype analysis. When the identified tag SNPs are approximately 14% of all the SNPs, the power is reduced by approximately 9%, compared with a power loss of approximately 21% when the same number of randomly chosen SNPs is used in a two-locus haplotype analysis. Our study also indicates that haplotype-based analysis can be much more powerful than marker-by-marker analysis.  相似文献   

8.
Association tests that pool minor alleles into a measure of burden at a locus have been proposed for case-control studies using sequence data containing rare variants. However, such pooling tests are not robust to the inclusion of neutral and protective variants, which can mask the association signal from risk variants. Early studies proposing pooling tests dismissed methods for locus-wide inference using nonnegative single-variant test statistics based on unrealistic comparisons. However, such methods are robust to the inclusion of neutral and protective variants and therefore may be more useful than previously appreciated. In fact, some recently proposed methods derived within different frameworks are equivalent to performing inference on weighted sums of squared single-variant score statistics. In this study, we compared two existing methods for locus-wide inference using nonnegative single-variant test statistics to two widely cited pooling tests under more realistic conditions. We established analytic results for a simple model with one rare risk and one rare neutral variant, which demonstrated that pooling tests were less powerful than even Bonferroni-corrected single-variant tests in most realistic situations. We also performed simulations using variants with realistic minor allele frequency and linkage disequilibrium spectra, disease models with multiple rare risk variants and extensive neutral variation, and varying rates of missing genotypes. In all scenarios considered, existing methods using nonnegative single-variant test statistics had power comparable to or greater than two widely cited pooling tests. Moreover, in disease models with only rare risk variants, an existing method based on the maximum single-variant Cochran-Armitage trend chi-square statistic in the locus had power comparable to or greater than another existing method closely related to some recently proposed methods. We conclude that efficient locus-wide inference using single-variant test statistics should be reconsidered as a useful framework for devising powerful association tests in sequence data with rare variants.  相似文献   

9.
The genome-wide association studies (GWAS) designed for next-generation sequencing data involve testing association of genomic variants, including common, low frequency, and rare variants. The current strategies for association studies are well developed for identifying association of common variants with the common diseases, but may be ill-suited when large amounts of allelic heterogeneity are present in sequence data. Recently, group tests that analyze their collective frequency differences between cases and controls shift the current variant-by-variant analysis paradigm for GWAS of common variants to the collective test of multiple variants in the association analysis of rare variants. However, group tests ignore differences in genetic effects among SNPs at different genomic locations. As an alternative to group tests, we developed a novel genome-information content-based statistics for testing association of the entire allele frequency spectrum of genomic variation with the diseases. To evaluate the performance of the proposed statistics, we use large-scale simulations based on whole genome low coverage pilot data in the 1000 Genomes Project to calculate the type 1 error rates and power of seven alternative statistics: a genome-information content-based statistic, the generalized T(2), collapsing method, multivariate and collapsing (CMC) method, individual χ(2) test, weighted-sum statistic, and variable threshold statistic. Finally, we apply the seven statistics to published resequencing dataset from ANGPTL3, ANGPTL4, ANGPTL5, and ANGPTL6 genes in the Dallas Heart Study. We report that the genome-information content-based statistic has significantly improved type 1 error rates and higher power than the other six statistics in both simulated and empirical datasets.  相似文献   

10.
Although genomewide scans have identified several potential chromosomal susceptibility regions in several human populations, finding a causative gene for type 2 diabetes has remained elusive. Others have reported a novel gene, calpain-10 (CAPN10), located in a previously identified region on chromosome 2q37.3, as a putative susceptibility gene for type 2 diabetes. Three single-nucleotide polymorphisms (SNPs) (UCSNP43, UCSNP19, and UCSNP63) were shown to be involved in increased risk of the disease among Mexican Americans. We have tested the association of these three SNPs with type 2 diabetes among the Samoans of Polynesia, who have a very high prevalence of the disease. In the U.S. territory of American Samoa, prevalence is 25% and 15% in men and women, respectively, whereas, in the independent nation of Samoa, prevalence is 3% and 5% in men and women, respectively. In our study sample, which consisted of 172 unrelated affected case subjects and 96 control subjects, we failed to detect any association between case subjects and control subjects in allele frequencies, haplotype frequencies, or haplotype combinations of UCSNP43, -19, and -63. Also, our data showed no evidence of linkage, among 201 affected sib pairs, in the region of chromosome 2 that contains these SNPs. Three plausible scenarios could explain these observations. (1) CAPN10 is a susceptibility gene only in particular ethnic groups; (2) our study lacks power to detect the effects of CAPN10 polymorphisms (but our sample size is comparable to that of earlier reports); or (3) the underlying biological mechanism is too complex and requires further research.  相似文献   

11.
12.
Marginal tests based on individual SNPs are routinely used in genetic association studies. Studies have shown that haplotype‐based methods may provide more power in disease mapping than methods based on single markers when, for example, multiple disease‐susceptibility variants occur within the same gene. A limitation of haplotype‐based methods is that the number of parameters increases exponentially with the number of SNPs, inducing a commensurate increase in the degrees of freedom and weakening the power to detect associations. To address this limitation, we introduce a hierarchical linkage disequilibrium model for disease mapping, based on a reparametrization of the multinomial haplotype distribution, where every parameter corresponds to the cumulant of each possible subset of a set of loci. This hierarchy present in the parameters enables us to employ flexible testing strategies over a range of parameter sets: from standard single SNP analyses through the full haplotype distribution tests, reducing degrees of freedom and increasing the power to detect associations. We show via extensive simulations that our approach maintains the type I error at nominal level and has increased power under many realistic scenarios, as compared to single SNP and standard haplotype‐based studies. To evaluate the performance of our proposed methodology in real data, we analyze genome‐wide data from the Wellcome Trust Case‐Control Consortium.  相似文献   

13.
Family-based association tests for genomewide association scans   总被引:7,自引:1,他引:6       下载免费PDF全文
With millions of single-nucleotide polymorphisms (SNPs) identified and characterized, genomewide association studies have begun to identify susceptibility genes for complex traits and diseases. These studies involve the characterization and analysis of very-high-resolution SNP genotype data for hundreds or thousands of individuals. We describe a computationally efficient approach to testing association between SNPs and quantitative phenotypes, which can be applied to whole-genome association scans. In addition to observed genotypes, our approach allows estimation of missing genotypes, resulting in substantial increases in power when genotyping resources are limited. We estimate missing genotypes probabilistically using the Lander-Green or Elston-Stewart algorithms and combine high-resolution SNP genotypes for a subset of individuals in each pedigree with sparser marker data for the remaining individuals. We show that power is increased whenever phenotype information for ungenotyped individuals is included in analyses and that high-density genotyping of just three carefully selected individuals in a nuclear family can recover >90% of the information available if every individual were genotyped, for a fraction of the cost and experimental effort. To aid in study design, we evaluate the power of strategies that genotype different subsets of individuals in each pedigree and make recommendations about which individuals should be genotyped at a high density. To illustrate our method, we performed genomewide association analysis for 27 gene-expression phenotypes in 3-generation families (Centre d'Etude du Polymorphisme Humain pedigrees), in which genotypes for ~860,000 SNPs in 90 grandparents and parents are complemented by genotypes for ~6,700 SNPs in a total of 168 individuals. In addition to increasing the evidence of association at 15 previously identified cis-acting associated alleles, our genotype-inference algorithm allowed us to identify associated alleles at 4 cis-acting loci that were missed when analysis was restricted to individuals with the high-density SNP data. Our genotype-inference algorithm and the proposed association tests are implemented in software that is available for free.  相似文献   

14.
It was shown recently using experimental data that it is possible under certain conditions to determine whether a person with known genotypes at a number of markers was part of a sample from which only allele frequencies are known. Using population genetic and statistical theory, we show that the power of such identification is, approximately, proportional to the number of independent SNPs divided by the size of the sample from which the allele frequencies are available. We quantify the limits of identification and propose likelihood and regression analysis methods for the analysis of data. We show that these methods have similar statistical properties and have more desirable properties, in terms of type-I error rate and statistical power, than test statistics suggested in the literature.  相似文献   

15.
Population differentiation (PD) and ecological association (EA) tests have recently emerged as prominent statistical methods to investigate signatures of local adaptation using population genomic data. Based on statistical models, these genomewide testing procedures have attracted considerable attention as tools to identify loci potentially targeted by natural selection. An important issue with PD and EA tests is that incorrect model specification can generate large numbers of false‐positive associations. Spurious association may indeed arise when shared demographic history, patterns of isolation by distance, cryptic relatedness or genetic background are ignored. Recent works on PD and EA tests have widely focused on improvements of test corrections for those confounding effects. Despite significant algorithmic improvements, there is still a number of open questions on how to check that false discoveries are under control and implement test corrections, or how to combine statistical tests from multiple genome scan methods. This tutorial study provides a detailed answer to these questions. It clarifies the relationships between traditional methods based on allele frequency differentiation and EA methods and provides a unified framework for their underlying statistical tests. We demonstrate how techniques developed in the area of genomewide association studies, such as inflation factors and linear mixed models, benefit genome scan methods and provide guidelines for good practice while conducting statistical tests in landscape and population genomic applications. Finally, we highlight how the combination of several well‐calibrated statistical tests can increase the power to reject neutrality, improving our ability to infer patterns of local adaptation in large population genomic data sets.  相似文献   

16.
Single-nucleotide polymorphisms (SNPs) determined based on SNP arrays from the international HapMap consortium (HapMap) and the genetic variants detected in the 1000 genomes project (1KGP) can serve as two references for genomewide association studies (GWAS). We conducted comparative analyses to provide a means for assessing concerns regarding SNP array-based GWAS findings as well as for realistically bounding expectations for next generation sequencing (NGS)-based GWAS. We calculated and compared base composition, transitions to transversions ratio, minor allele frequency and heterozygous rate for SNPs from HapMap and 1KGP for the 622 common individuals. We analysed the genotype discordance between HapMap and 1KGP to assess consistency in the SNPs from the two references. In 1KGP, 90.58% of 36,817,799 SNPs detected were not measured in HapMap. More SNPs with minor allele frequencies less than 0.01 were found in 1KGP than HapMap. The two references have low discordance (generally smaller than 0.02) in genotypes of common SNPs, with most discordance from heterozygous SNPs. Our study demonstrated that SNP array-based GWAS findings were reliable and useful, although only a small portion of genetic variances were explained. NGS can detect not only common but also rare variants, supporting the expectation that NGS-based GWAS will be able to incorporate a much larger portion of genetic variance than SNP arrays-based GWAS.  相似文献   

17.
Klei L  Roeder K 《Human genetics》2007,121(5):549-557
Samples consisting of a mix of unrelated cases and controls, small pedigrees, and much larger pedigrees present a unique challenge for association studies. Few methods are available for efficient analysis of such a broad spectrum of data structures. In this paper we introduce a new matching statistic that is well suited to complex data structures and compare it with frequency-based methods available in the literature. To investigate and compare the power of these methods we simulate datasets based on complex pedigrees. We examine the influence of various levels of linkage disequilibrium (LD) of the disease allele with a marker allele (or equivalently a haplotype). For low frequency marker alleles/haplotypes, frequency-based statistics are more powerful in detecting association. In contrast, for high frequency marker alleles, the matching statistic has greater power. The highest power for frequency-based statistics occurs when the disease allele frequency closely matches the frequency of the linked marker allele. In contrast maximum power of the matching statistic always occurs for intermediate marker allele frequency regardless of the disease allele frequency. Moreover, the matching and frequency-based statistics exhibit little correlation. We conclude that these two approaches can be viewed as complementary in finding possible association between a disease and a marker for many different situations.  相似文献   

18.
Genetic risk factors for pain sensitivity may also play a role in susceptibility to chronic pain disorders, in which subjects have low pain thresholds. The aim of this study was to determine if proposed functional single nucleotide polymorphisms (SNPs) in the GTP cyclohydrolase (GCH1) and μ opioid receptor (OPRM1) genes previously associated with pain sensitivity affect susceptibility to chronic widespread pain (CWP). Pain data was collected using body manikins via questionnaire at three time-points over a four year period from subjects aged 25-65 in the North-West of England as part of a population based cohort study, EPIFUND. CWP was defined at each time point using standard criteria. Three SNPs forming a proposed "pain-protective" haplotype in GCH1 (rs10483639, rs3783641 and rs8007267) and two SNPs in OPRM1 (rs1777971 (A118G) and rs563649) were genotyped in cases with persistent CWP (CWP present at ≥2 time-points) and controls who were pain-free at all time-points. The expectation-maximisation algorithm was used to estimate haplotype frequencies. The frequency of the "pain-protective" (CAT - C allele of rs10483639, A allele of rs3783641 and T allele of rs8007267) haplotype was compared to the frequency of the other haplotypes between cases and controls using the χ2 test. Allele frequencies and carriage of the minor allele was compared between cases and controls using χ2 tests for the OPRM1 SNPs. The frequency of the proposed GCH1 "pain-protective" haplotype (CAT) did not significantly differ between cases and controls and no significant associations were observed between the OPRM1 SNPs and CWP. In conclusion, there was no evidence of association between proposed functional SNPs, previously reported to influence pain sensitivity, in GCH1 and OPRM1 with CWP. Further evidence of null association in large independent cohorts is required to truly exclude these SNPs as genetic risk factors for CWP.  相似文献   

19.
GWAS have emerged as popular tools for identifying genetic variants that are associated with disease risk. Standard analysis of a case-control GWAS involves assessing the association between each individual genotyped SNP and disease risk. However, this approach suffers from limited reproducibility and difficulties in detecting multi-SNP and epistatic effects. As an alternative analytical strategy, we propose grouping SNPs together into SNP sets on the basis of proximity to genomic features such as genes or haplotype blocks, then testing the joint effect of each SNP set. Testing of each SNP set proceeds via the logistic kernel-machine-based test, which is based on a statistical framework that allows for flexible modeling of epistatic and nonlinear SNP effects. This flexibility and the ability to naturally adjust for covariate effects are important features of our test that make it appealing in comparison to individual SNP tests and existing multimarker tests. Using simulated data based on the International HapMap Project, we show that SNP-set testing can have improved power over standard individual-SNP analysis under a wide range of settings. In particular, we find that our approach has higher power than individual-SNP analysis when the median correlation between the disease-susceptibility variant and the genotyped SNPs is moderate to high. When the correlation is low, both individual-SNP analysis and the SNP-set analysis tend to have low power. We apply SNP-set analysis to analyze the Cancer Genetic Markers of Susceptibility (CGEMS) breast cancer GWAS discovery-phase data.  相似文献   

20.
设为首页 | 免责声明 | 关于勤云 | 加入收藏

Copyright©北京勤云科技发展有限公司  京ICP备09084417号