Similar Articles
20 similar articles found.
1.
Gene-disease association studies based on case-control designs are often used to identify candidate polymorphisms (markers) conferring disease risk. If a large number of markers are studied, genotyping all markers on all samples is an inefficient use of resources. Here, we propose an alternative two-stage method to identify disease-susceptibility markers. In the first stage, all markers are evaluated on a fraction of the available subjects. The most promising markers are then evaluated on the remaining individuals in the second stage. This approach can be cost effective, since markers unlikely to be associated with the disease can be eliminated in the first stage. Using simulations, we show that whether the markers are independent or correlated, the two-stage approach provides a substantial reduction in the total number of marker evaluations for a minimal loss of power. The power of the two-stage approach is evaluated when a single marker is associated with the disease, and in the presence of multiple disease-susceptibility markers. As a general guideline, simulations over a wide range of parametric configurations indicate that evaluating all the markers on 50% of the individuals in the first stage and evaluating the most promising 10% of the markers on the remaining individuals in the second stage provides near-optimal power while reducing the total number of marker evaluations by 45%.
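As a rough illustration of the 50%/10% guideline, here is a minimal simulation sketch. Everything in it (the normal-approximation shortcut for test statistics, the parameter values, the Bonferroni threshold) is an illustrative assumption, not the authors' code:

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(1)

def two_stage_power(n_markers=10_000, delta=5.0, stage1_frac=0.5,
                    keep_frac=0.10, alpha=0.05, reps=500):
    """Sketch of the 50%/10% two-stage screen.  delta is the expected
    z-statistic of the single associated marker at full sample size;
    z-statistics scale with sqrt(sample fraction) under a local alternative."""
    n_keep = int(keep_frac * n_markers)
    crit = norm.isf(alpha / 2 / n_markers)       # Bonferroni over all markers
    hits = 0
    for _ in range(reps):
        mu = np.zeros(n_markers); mu[0] = delta  # marker 0 is the causal one
        z1 = rng.normal(mu * np.sqrt(stage1_frac), 1.0)
        keep = np.argsort(-np.abs(z1))[:n_keep]  # most promising 10%
        z2 = rng.normal(mu[keep] * np.sqrt(1 - stage1_frac), 1.0)
        # combine both stages; note: selecting on |z1| slightly biases the
        # combined statistic for nulls, which a calibrated design would adjust
        z = np.sqrt(stage1_frac) * z1[keep] + np.sqrt(1 - stage1_frac) * z2
        if 0 in keep and abs(z[keep == 0][0]) > crit:
            hits += 1
    # marker evaluations relative to genotyping every marker on everyone
    evals = stage1_frac + keep_frac * (1 - stage1_frac)
    return hits / reps, evals

power, cost = two_stage_power()
print(f"power ~ {power:.2f} at {cost:.0%} of the one-stage genotyping effort")
```

With these settings the genotyping effort comes to 55% of the one-stage design, i.e. the 45% reduction quoted above.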

2.
Zhao Y, Wang S. Human Heredity, 2009, 67(1): 46-56
Study cost remains the major limiting factor for genome-wide association studies, because a large number of SNPs must be genotyped for a large number of subjects. Both DNA pooling strategies and two-stage designs have been proposed to reduce genotyping costs. In this study, we propose a cost-effective two-stage approach with a DNA pooling strategy. During stage I, all markers are evaluated on a subset of individuals using DNA pooling. The most promising set of markers is then evaluated with individual genotyping for all individuals during stage II. The goal is to determine the optimal parameters (π^p_sample, the proportion of samples used during stage I with DNA pooling; and π^p_marker, the proportion of markers evaluated during stage II with individual genotyping) that minimize the cost of a two-stage DNA pooling design while maintaining a desired overall significance level and achieving a level of power similar to that of a one-stage individual genotyping design. We considered the effects of three factors on optimal two-stage DNA pooling designs. Our results suggest that, under most scenarios considered, the optimal two-stage DNA pooling design may be much more cost-effective than the optimal two-stage individual genotyping design, which uses individual genotyping during both stages.
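A sketch of the optimization this describes: minimize a two-part genotyping cost over (π_sample, π_marker) subject to a power floor. The unit costs, the effect size, and the crude normal-approximation power formula below are placeholder assumptions, not the paper's derivations:

```python
import numpy as np
from scipy.stats import norm

def cost(ps, pm, n=2000, m=500_000, c_pool=0.01, c_geno=0.10):
    """Stage I pools a fraction ps of subjects on all m markers; stage II
    individually genotypes a fraction pm of the markers on all n subjects."""
    return c_pool * m * ps * n + c_geno * pm * m * n

def approx_power(ps, pm, delta=6.0, tau=1.3, alpha=0.05, m=500_000):
    """Crude power: P(causal marker survives the stage-I screen) times
    P(it passes a Bonferroni test in stage II).  tau > 1 inflates the
    stage-I noise to mimic pooling measurement error."""
    p_select = norm.sf(norm.isf(pm) - delta * np.sqrt(ps) / tau)
    p_test = norm.sf(norm.isf(alpha / 2 / (pm * m)) - delta)
    return p_select * p_test

designs = [(ps, pm) for ps in np.arange(0.1, 1.01, 0.05)
           for pm in (0.0005, 0.001, 0.005, 0.01)]
feasible = [(cost(ps, pm), ps, pm) for ps, pm in designs
            if approx_power(ps, pm) >= 0.80]
print(min(feasible))   # cheapest design meeting the power floor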

3.
Two-stage designs in case-control association analysis
Zuo Y, Zou G, Zhao H. Genetics, 2006, 173(3): 1747-1760
DNA pooling is a cost-effective approach for collecting information on marker allele frequency in genetic studies. It is often suggested as a screening tool to identify a subset of candidate markers from a very large number of markers, to be followed up by more accurate and informative individual genotyping. In this article, we investigate several statistical properties and design issues related to this two-stage design, including the selection of the candidate markers for second-stage analysis, the statistical power of this design, and the probability that truly disease-associated markers are ranked among the top after second-stage analysis. We have derived analytical results on the proportion of markers to be selected for second-stage analysis. For example, to detect disease-associated markers with an allele frequency difference of 0.05 between the cases and controls through an initial sample of 1000 cases and 1000 controls, our results suggest that when the measurement errors are small (0.005), approximately 3% of the markers should be selected. For the statistical power to identify disease-associated markers, we find that the measurement errors associated with DNA pooling have little effect on power. This is in contrast to the one-stage pooling scheme, where measurement errors may have a large effect on statistical power. As for the probability that the disease-associated markers are ranked among the top in the second stage, we show that there is a high probability that at least one disease-associated marker is ranked among the top when the allele frequency differences between the cases and controls are at least 0.05 for reasonably large sample sizes, even when the errors associated with DNA pooling in the first stage are not small. Therefore, the two-stage design with DNA pooling as a screening tool offers an efficient strategy in genomewide association studies, even when the measurement errors associated with DNA pooling are nonnegligible. For any disease model, we find that all the statistical results essentially depend on the population allele frequency and the allele frequency differences between the cases and controls at the disease-associated markers. The general conclusions hold whether the second stage uses an entirely independent sample or includes both the samples used in the first stage and an independent set of samples.
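A back-of-envelope check of the "approximately 3%" example: with 1000 cases and 1000 controls, a pooling measurement error of 0.005, and a frequency difference of 0.05, the standardized difference works out to roughly z ≈ 3, which is large enough that a small top fraction retains the causal marker with high probability. The variance decomposition below is a standard normal approximation, not the paper's exact derivation:

```python
import numpy as np
from scipy.stats import norm

n = 1000                       # cases (and controls)
p_ctrl, diff, eps = 0.30, 0.05, 0.005   # eps: pooling measurement error s.d.
# sampling variance of each pool's frequency estimate (2n alleles per pool),
# plus a measurement-error term for each of the two pools
var = (p_ctrl * (1 - p_ctrl) + (p_ctrl + diff) * (1 - p_ctrl - diff)) / (2 * n) \
      + 2 * eps**2
z = diff / np.sqrt(var)
print(f"standardized difference: z = {z:.2f}")

# chance the causal marker survives if the top fraction q of markers is kept
for q in (0.01, 0.03, 0.05, 0.10):
    print(q, f"P(causal selected) ~ {norm.sf(norm.isf(q) - z):.2f}")
```

Keeping the top 3% already retains the causal marker with probability near 0.9 in this approximation, consistent with the abstract's guideline.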

4.
Background: For gene expression or gene association studies with a large number of hypotheses, the number of measurements per marker in a conventional single-stage design is often low due to limited resources. Two-stage designs have been proposed, in which promising hypotheses are identified in a first stage and further investigated in the second stage with larger sample sizes. For two types of two-stage designs proposed in the literature, we derive multiple testing procedures that control the False Discovery Rate (FDR), and demonstrate FDR control by simulation: designs where a fixed number of top-ranked hypotheses are selected, and designs where the selection in the interim analysis is based on an FDR threshold. In contrast to earlier approaches, which use only the second-stage data in the hypothesis tests (pilot approach), the proposed testing procedures are based on the pooled data from both stages (integrated approach). Results: For both selection rules, the multiple testing procedures control the FDR in the considered simulation scenarios. This holds for the case of independent observations across hypotheses as well as for certain correlation structures. Additionally, we show that in scenarios with small effect sizes, the testing procedures based on the pooled data from both stages can give a considerable improvement in power compared to tests based on the second-stage data only. Conclusion: The proposed hypothesis tests provide a tool for FDR control for the considered two-stage designs. Comparing the integrated approaches for both selection rules with the corresponding pilot approaches showed an advantage of the integrated approach in many simulation scenarios.
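To make the pilot/integrated distinction concrete, here is a minimal simulation sketch: both approaches screen on stage-1 z-scores and apply Benjamini-Hochberg to the selected hypotheses, but the pilot test uses stage-2 data only while the integrated test pools both stages. This is a schematic of the mechanics, not the paper's calibrated procedures (which adjust the tests for the interim selection); all sizes are invented:

```python
import numpy as np
from scipy.stats import norm

def bh_reject(pvals, q=0.05):
    """Benjamini-Hochberg step-up; returns a boolean rejection mask."""
    m = len(pvals)
    order = np.argsort(pvals)
    below = pvals[order] <= q * np.arange(1, m + 1) / m
    k = np.max(np.nonzero(below)[0]) + 1 if below.any() else 0
    reject = np.zeros(m, bool)
    reject[order[:k]] = True
    return reject

rng = np.random.default_rng(2)
m, k, w = 5000, 500, np.sqrt(0.5)              # hypotheses, selected, stage weight
mu = np.r_[np.full(100, 2.5), np.zeros(m - 100)]
z1, z2 = rng.normal(mu, 1), rng.normal(mu, 1)  # two equal-sized stages
top = np.argsort(-np.abs(z1))[:k]              # rule 1: fixed number of top hits
pilot = 2 * norm.sf(np.abs(z2[top]))                       # stage-2 data only
integrated = 2 * norm.sf(np.abs(w * z1[top] + w * z2[top]))  # pooled data
print("pilot rejections:     ", bh_reject(pilot).sum())
print("integrated rejections:", bh_reject(integrated).sum())
```

The pooled statistic has roughly √2 times the noncentrality of the pilot statistic here, which is the power gain the Results paragraph refers to.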

5.
Genomewide association studies (GWAS) are being conducted to unravel the genetic etiology of complex diseases, in which complex epistasis may play an important role. A one-stage method, in which interactions are tested using all samples at once, may be computationally problematic, may have low power as the number of markers tested increases, and may not be cost-efficient. A common two-stage method, which uses all samples in both stages, may be a reasonable and powerful approach for detecting interacting genes. In this study, we introduce an alternative two-stage method, in which promising markers are selected using a proportion of the samples in the first stage and interactions are then tested using the remaining samples in the second stage; we call this the mixed two-stage method. We then investigate the power of both the one-stage method and the mixed two-stage method to detect interacting disease loci for a range of two-locus epistatic models in a case-control study design. Our results suggest that the mixed two-stage method may be more powerful than the one-stage method if about 30% of the samples are used for single-locus tests in the first stage and at most 1% of the markers are identified for follow-up interaction tests. In addition, comparing the two two-stage methods, we find that the mixed method loses power relative to the common two-stage method because it uses only part of the sample at each stage.
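The second-stage interaction test could, for instance, be a 1-df likelihood-ratio test in a logistic model. The sketch below is illustrative, not the authors' implementation; it would be run on the held-out ~70% of samples for every pair of markers that survived the single-locus screen:

```python
import numpy as np
import statsmodels.api as sm
from scipy.stats import chi2

def interaction_lrt(g1, g2, y):
    """1-df likelihood-ratio test for a g1 x g2 interaction in a logistic
    model; g1, g2 are genotype codes (0/1/2), y is case-control status."""
    X0 = sm.add_constant(np.column_stack([g1, g2]))        # main effects only
    X1 = sm.add_constant(np.column_stack([g1, g2, g1 * g2]))  # + interaction
    ll0 = sm.Logit(y, X0).fit(disp=0).llf
    ll1 = sm.Logit(y, X1).fit(disp=0).llf
    return chi2.sf(2 * (ll1 - ll0), df=1)

# toy second-stage data with a product-form interaction
rng = np.random.default_rng(8)
n = 1400                                   # the held-out 70% of samples
g1, g2 = rng.binomial(2, 0.4, n), rng.binomial(2, 0.4, n)
logit = -1.0 + 0.4 * g1 * g2
y = rng.binomial(1, 1 / (1 + np.exp(-logit)))
print(interaction_lrt(g1, g2, y))
```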

6.
The cost of experiments aimed at determining linkage between marker loci and quantitative trait loci (QTL) was investigated as a function of marker spacing and the number of individuals scored. It was found that for a variety of experimental designs, fairly wide marker spacings (ca. 50 cM) are optimal or close to optimal for initial studies of marker-QTL linkage, in the sense of minimizing the overall cost of the experiment. Thus, even when large numbers of more or less evenly spaced markers are available, it will not always be cost effective to make full use of this capacity. This is particularly true when costs of rearing and trait evaluation per individual scored are low, as when marker data are obtained on individuals raised and evaluated for quantitative traits as part of existing programs. When costs of rearing and trait evaluation per individual scored are high, however, as in human family data collection carried out primarily for subsequent marker-QTL analyses, or when plants or animals are raised specifically for marker-QTL linkage experiments, the optimal spacing may be rather narrow. It is noteworthy that when the marginal costs of additional markers or individuals are constant, the total resources allocated to a given experiment will determine the total number of individuals sampled, but not the optimal marker spacing.

7.
Kang G, Lin D, Hakonarson H, Chen J. Human Heredity, 2012, 73(3): 139-147
Next-generation sequencing technology provides an unprecedented opportunity to identify rare susceptibility variants. It is not yet financially feasible to perform whole-genome sequencing on a large number of subjects, and a two-stage design has been advocated as a practical option. In stage I, variants are discovered by sequencing the whole genomes of a small number of carefully selected individuals. In stage II, the discovered variants are genotyped in a large number of individuals to assess associations. Individuals with extreme phenotypes are typically selected in stage I. Using simulated data for unrelated individuals, we explore two important aspects of this two-stage design: the efficiency of discovering common and rare single-nucleotide polymorphisms (SNPs) in stage I, and the impact of incomplete SNP discovery in stage I on the power to test associations in stage II. We applied a sum test and a sum of squared score test for gene-based association analyses to evaluate the power of the two-stage design. We obtained the following results from extensive simulation studies and analysis of the GAW17 dataset. When individuals with trait values more extreme than the 99th-99.7th quantile were included in stage I, the two-stage design could achieve the same power as, or even higher than, the one-stage design if the rare causal variants had large effect sizes. In such a design, fewer than half of the total SNPs, including more than half of the causal SNPs, were discovered; these included nearly all SNPs with minor allele frequencies (MAFs) ≥5%, more than half of the SNPs with MAFs between 1% and 5%, and fewer than half of the SNPs with MAFs <1%. Although a one-stage design may be preferable for identifying multiple rare variants with small to moderate effect sizes, our observations support the two-stage design as a cost-effective option for next-generation sequencing studies.
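A toy version of the extreme-selection ingredient together with a simple "sum" (burden-style) test is sketched below; the cut-offs, effect sizes, and the regression-based form of the test are illustrative assumptions, not the GAW17 analysis:

```python
import numpy as np
from scipy.stats import norm

def sum_test(G, y):
    """Burden-style 'sum test': regress y on the count of minor alleles
    across a gene's rare variants (one-df z-test, sketch form)."""
    burden = G.sum(axis=1)
    b, yc = burden - burden.mean(), y - y.mean()
    beta = (b * yc).sum() / (b * b).sum()
    resid = yc - beta * b
    se = np.sqrt((resid @ resid) / (len(y) - 2) / (b * b).sum())
    return 2 * norm.sf(abs(beta / se))

rng = np.random.default_rng(3)
n, m = 5000, 30
G = rng.binomial(2, 0.005, size=(n, m))         # 30 rare variants, MAF 0.5%
y = 0.8 * G[:, :5].sum(1) + rng.normal(size=n)  # 5 causal, large effects
cut_lo, cut_hi = np.quantile(y, [0.01, 0.99])   # sequence only the extremes
sel = (y < cut_lo) | (y > cut_hi)
print("p-value on the extreme subsample:", sum_test(G[sel], y[sel]))
```

Because carriers of large-effect rare alleles are strongly enriched in the upper tail, even the ~2% extreme subsample carries much of the association signal.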

8.
Two-stage designs for experiments with a large number of hypotheses
Motivation: When a large number of hypotheses are investigated, the false discovery rate (FDR) is commonly applied in gene expression analysis or gene association studies. Conventional single-stage designs may lack power due to low sample sizes for the individual hypotheses. We propose two-stage designs where the first stage is used to screen the 'promising' hypotheses, which are further investigated at the second stage with an increased sample size. A multiple test procedure based on sequential individual p-values is proposed to control the FDR for the case of independent normal distributions with known variance. Results: The power of optimal two-stage designs is substantially larger than the power of the corresponding single-stage design with equal costs. Extensions to the case of unknown variances and correlated test statistics are investigated by simulation. Moreover, it is shown that the simple multiple test procedure that uses first-stage data only for screening and derives the test decisions from second-stage data alone is a very powerful option.
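To illustrate why screening helps at fixed cost, here is a small simulation sketch comparing a single-stage design with a two-stage design of equal total measurement budget. Bonferroni thresholds stand in for the paper's FDR-based procedure, and all sizes and effects are invented:

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(7)
m, m1, eff, reps = 2000, 40, 0.9, 400   # hypotheses, true signals, effect/obs

def single_stage(n):
    # z-statistic mean grows as eff * sqrt(n); one-sided Bonferroni test
    z = rng.normal(eff * np.sqrt(n), 1, size=(reps, m1))
    return (norm.sf(z) < 0.05 / m).mean()

def two_stage(n1, keep, n2):
    hits = 0
    for _ in range(reps):
        mu = np.r_[np.full(m1, eff), np.zeros(m - m1)]
        top = np.argsort(-rng.normal(mu * np.sqrt(n1), 1))[:keep]  # screen
        z2 = rng.normal(mu[top] * np.sqrt(n2), 1)  # fresh stage-2 data only
        hits += ((norm.sf(z2) < 0.05 / keep) & (top < m1)).sum()
    return hits / (reps * m1)

# equal budget of 40,000 observations: 20 per hypothesis in one stage, or a
# 10-obs screen plus 100 further obs on the 200 best-looking hypotheses
print("single stage:", single_stage(20))
print("two stage:   ", two_stage(10, 200, 100))
```

Testing the screened hypotheses on independent second-stage data keeps the test decisions valid despite the selection, which is the feature the last sentence of the abstract exploits.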

9.
Before new markers are thoroughly characterized, they are usually screened for high polymorphism on the basis of a small panel of individuals. Four commonly used screening strategies are compared in terms of their power to correctly classify a marker as having heterozygosity of 70% or higher. A small number of typed individuals (say, 10) is shown to provide good power to discriminate between low- and high-heterozygosity markers when the markers have a small number of alleles. Characterizing markers in more detail requires larger sample sizes (e.g., at least 80-100 individuals) if there is to be a high probability of detecting most or all alleles. For linkage analyses involving highly polymorphic markers, the practice of arbitrarily assuming equal gene frequencies can cause serious trouble. In the presence of untyped individuals, when gene frequencies are unequal but are assumed equal in the analysis, recombination-fraction estimates tend to be badly biased, leading to strong false-positive evidence for linkage.
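The two sample-size claims are easy to check by simulation. The sketch below (the allele-frequency vectors are invented examples) estimates (i) the chance that a 10-individual panel classifies a marker as having H ≥ 0.70, and (ii) the chance that every allele is seen at least once as the panel grows:

```python
import numpy as np

rng = np.random.default_rng(4)

def p_classified_high(freqs, n_typed=10, cutoff=0.70, reps=20_000):
    """P(estimated heterozygosity >= cutoff) from 2*n_typed sampled alleles."""
    draws = rng.multinomial(2 * n_typed, freqs, size=reps) / (2 * n_typed)
    return float((1 - (draws**2).sum(axis=1) >= cutoff).mean())

print(p_classified_high([0.3, 0.2, 0.15, 0.15, 0.1, 0.1]))  # true H ~ 0.80
print(p_classified_high([0.5, 0.3, 0.2]))                   # true H = 0.62

# probability that every allele (rarest at 1%) appears at least once
freqs = [0.4, 0.3, 0.15, 0.1, 0.04, 0.01]
for n in (10, 40, 80, 120):
    draws = rng.multinomial(2 * n, freqs, size=5000)
    print(n, (draws > 0).all(axis=1).mean())
```

Ten individuals separate the two markers well, while detecting a 1% allele reliably indeed takes on the order of 80-120 individuals.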

10.
Recent studies indicate that polymorphic genetic markers are potentially helpful in resolving genealogical relationships among individuals in a natural population. Genetic data provide opportunities for paternity exclusion when genotypic incompatibilities are observed among individuals, and the present investigation examines the resolving power of genetic markers in unambiguous positive determination of paternity. Under the assumption that the mother of each offspring in a population is unambiguously known, an analytical expression for the fraction of males excluded from paternity is derived for the case where males and females may be derived from two different gene pools. This theoretical formulation can also be used to predict the fraction of births for each of which all but one male can be excluded from paternity. We show that even when the average probability of exclusion approaches unity, a substantial fraction of births yield equivocal mother-father-offspring determinations. The number of loci needed to raise the frequency of unambiguous determinations to a high level is beyond the scope of current electrophoretic studies in most species. Application of this theory to electrophoretic data on Chamaelirium luteum (L.) shows that among 2255 offspring derived from 273 males and 70 females, only 57 triplets could be unequivocally determined with eight polymorphic protein loci, even though the average combined exclusionary power of these loci was 73%. The distribution of potentially compatible male parents, based on multilocus genotypes, was reasonably well predicted from the allele frequency data available for these loci. We demonstrate that genetic paternity analysis in natural populations cannot be reliably based on exclusionary principles alone. In order to measure the reproductive contributions of individuals in natural populations, more elaborate likelihood principles must be deployed.
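A Monte Carlo sketch of the exclusion logic (a stand-in for the paper's analytical expression; the allele frequencies and locus count are invented): for each simulated mother-child pair, count the fraction of unrelated males that is genotypically excluded across independent loci.

```python
import numpy as np

rng = np.random.default_rng(5)

def genotype(freqs):
    return rng.choice(len(freqs), size=2, p=freqs)

def compatible(mother, child, male):
    """Could `male` have fathered `child`, given `mother`, at one locus?"""
    for paternal in set(child):
        maternal = (set(child) - {paternal}) or set(child)  # homozygote case
        if maternal & set(mother) and paternal in set(male):
            return True
    return False

def mean_exclusion(freqs, n_loci=8, n_trios=200, n_males=100):
    excluded = 0
    for _ in range(n_trios):
        mothers = [genotype(freqs) for _ in range(n_loci)]
        fathers = [genotype(freqs) for _ in range(n_loci)]
        child = [np.array([rng.choice(m), rng.choice(f)])
                 for m, f in zip(mothers, fathers)]
        for _ in range(n_males):
            male = [genotype(freqs) for _ in range(n_loci)]
            excluded += any(not compatible(m, c, g)
                            for m, c, g in zip(mothers, child, male))
    return excluded / (n_trios * n_males)

# e.g. eight identical 4-allele loci; real loci would differ in frequencies
print(mean_exclusion(np.array([0.4, 0.3, 0.2, 0.1])))
```

Even with a high average exclusion fraction, several of the 100 candidate males typically remain compatible for a given birth, which is the paper's central point.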

11.
In quantitative trait locus (QTL) mapping studies, the available financial resources must be spent in such a way that the power of QTL detection is maximized. The objective of this study was to optimize, for three different fixed budgets, the power of QTL detection 1 − β* in recombinant inbred line (RIL) populations derived from a nested design by varying (1) the genetic complexity of the trait, (2) the costs for developing, genotyping, and phenotyping RILs, (3) the total number of RILs, and (4) the number of environments and replications per environment used for phenotyping. Our computer simulations were based on empirical data of 653 single nucleotide polymorphism markers of 26 diverse maize inbred lines, which were selected on the basis of 100 simple sequence repeat markers out of a worldwide sample of 260 maize inbreds to capture the maximum genetic diversity. For the standard cost scenario, the optimum number of test environments (E_opt) ranged across the examined total budgets from 7 to 19 in the scenarios with 25 QTL. In comparison, the E_opt values observed for the scenarios with 50 and 100 QTL were slightly higher. Our finding of differences in 1 − β* estimates between experiments with optimally and sub-optimally allocated resources illustrates the potential to improve the power of QTL detection without increasing the total resources necessary for a QTL mapping experiment. Furthermore, the results of our study indicate that even in studies using the latest genomics tools to dissect quantitative traits, the individuals of the mapping population must be evaluated in a large number of environments with a large number of replications per environment.

12.
Premise of the study: Microsatellite markers from cellulose synthase genes were developed for the Chinese white poplar, Populus tomentosa, to investigate the genetic diversity of wild germplasm resources and to identify favorable alleles significantly associated with wood cellulose content. Methods and Results: Fifteen microsatellite markers were developed in P. tomentosa by deep sequencing of cellulose synthase genes. Polymorphisms were evaluated in 460 individuals from three climatic regions of P. tomentosa, and all 15 markers revealed polymorphic variation. The number of alleles per locus ranged from two to nine with an average of 4.3; the observed and expected heterozygosity per locus varied from 0.029 to 0.962 and from 0.051 to 0.713, respectively. Conclusions: These polymorphic markers will potentially be useful for genetic mapping and in molecular breeding for improvement of wood fiber traits in Populus.

13.
Detection of linkage with a systematic genome scan in nuclear families including an affected sibling pair is an important initial step on the path to cloning susceptibility genes for complex genetic disorders, and it is desirable to optimize the efficiency of such studies. The aim is to maximize power while simultaneously minimizing the total number of genotypings and the probability of type I error. One approach to increasing efficiency, which has been investigated by other workers, is grid tightening: a sample is initially typed using a coarse grid of markers, and promising results are followed up by use of a finer grid. Another approach, not previously considered in detail in the context of an affected-sib-pair genome scan for linkage, is sample splitting: a portion of the sample is typed in the screening stage, and promising results are followed up in the whole sample. In the current study, we used computer simulation to investigate the relative efficiency of two-stage strategies involving combinations of both grid tightening and sample splitting, and found that the optimal strategy incorporates both approaches. In general, typing half the sample of affected pairs with a coarse grid of markers in the screening stage is an efficient strategy under a variety of conditions. If Hardy-Weinberg equilibrium holds, it is most efficient not to type parents in the screening stage. If Hardy-Weinberg equilibrium does not hold (e.g., because of stratification), failure to type parents in the first stage increases the amount of genotyping required, although the overall probability of type I error is not greatly increased, provided the parents are used in the final analysis.

14.
Many studies have shown that segregating quantitative trait loci (QTL) can be detected via linkage to genetic markers. The power to detect a QTL effect on the trait mean, as a function of the number of individuals genotyped for the marker, is increased by selectively genotyping individuals with extreme values of the quantitative trait. Computer simulations were employed to study the effect of various sampling strategies on the statistical power to detect QTL variance effects. If only individuals with extreme phenotypes for the quantitative trait are selected for genotyping, then the power to detect a variance effect is lower than with random sampling. If 0.2 of the total number of individuals genotyped are selected from the center of the distribution, then the power to detect a variance effect is equal to that obtained with random selection. Power to detect a variance effect was maximal when 0.2 to 0.5 of the individuals selected for genotyping were drawn from the tails of the distribution and the remainder from the center.
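The trade-off is easy to reproduce in a few lines. The sketch below (invented effect sizes; Levene's test standing in as the variance-effect test) genotypes n_geno individuals, drawing a fraction from the phenotype tails and the rest from the centre, and estimates power to detect a genotype-dependent variance:

```python
import numpy as np
from scipy.stats import levene

rng = np.random.default_rng(6)

def variance_qtl_power(tail_frac, n=2000, n_geno=400, reps=400):
    hits = 0
    for _ in range(reps):
        g = rng.binomial(2, 0.5, n)                    # biallelic QTL genotype
        y = rng.normal(0, np.where(g == 1, 1.3, 1.0))  # heterozygotes noisier
        order = np.argsort(y)
        k_tail = int(tail_frac * n_geno / 2)           # from each tail
        k_mid = n_geno - 2 * k_tail                    # from the centre
        mid = (n - k_mid) // 2
        pick = np.r_[order[:k_tail], order[n - k_tail:], order[mid:mid + k_mid]]
        groups = [y[pick][g[pick] == a] for a in (0, 1, 2)]
        hits += levene(*[grp for grp in groups if len(grp) > 1]).pvalue < 0.05
    return hits / reps

for f in (0.0, 0.2, 0.5, 1.0):                         # tails only when f = 1
    print(f, variance_qtl_power(f))
```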

15.
The control of natural variation in cytosine methylation in Arabidopsis
Riddle NC, Richards EJ. Genetics, 2002, 161(1): 355-363
The distance of pollen movement is an important determinant of the neighborhood area of plant populations. In earlier studies, we designed a method for estimating the distance of pollen dispersal based on the analysis of the differentiation among the pollen clouds of a sample of females spaced across the landscape. The method was based solely on an estimate of the global level of differentiation among the pollen clouds of the total array of sampled females. Here, we develop novel estimators based on the divergence of pollen clouds for all pairs of females, assuming that an independent estimate of adult population density is available. A simulation study shows that the estimators are all slightly biased, but most have enough precision to be useful, at least with adequate sample sizes. We show that one of the novel pairwise methods provides estimates that are slightly better than the best global estimate, especially when the markers used have low exclusion probability. The new method can also be generalized to the case where there is no prior information on the density of reproductive adults. In that case, we can jointly estimate the density itself and the pollen dispersal distance, given sufficient sample sizes. The bias of this last estimator is larger and its precision lower than for the estimates based on independent density estimates, but the estimate is of some interest, because a meaningful independent estimate of the density of reproducing individuals is difficult to obtain in most cases.

16.
Although approaches for performing genome-wide association studies (GWAS) are well developed, conventional GWAS requires high-density genotyping of large numbers of individuals from a diversity panel. Here we report a method for performing GWAS that does not require genotyping large numbers of individuals. Instead, XP-GWAS (extreme-phenotype GWAS) relies on genotyping pools of individuals from a diversity panel that have extreme phenotypes. The analysis measures allele frequencies in the extreme pools, enabling discovery of associations between genetic variants and traits of interest. The method was evaluated in maize (Zea mays) using the well-characterized kernel row number trait, which was selected to enable comparison between the results of XP-GWAS and conventional GWAS. An exome-sequencing strategy was used to focus sequencing resources on genes and their flanking regions. A total of 0.94 million variants were identified and served as evaluation markers; comparisons among pools showed that 145 of these variants were statistically associated with the kernel row number phenotype. These trait-associated variants were significantly enriched in regions identified by conventional GWAS. XP-GWAS was able to resolve several linked QTL and detect trait-associated variants within a single gene under a QTL peak. XP-GWAS is expected to be particularly valuable for detecting genes or alleles responsible for quantitative variation in species for which extensive genotyping resources are not available, such as wild progenitors of crops, orphan crops, and other poorly characterized species such as those of ecological interest.
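The core computation is a per-variant comparison of allele frequencies between the extreme pools. A minimal sketch (a plain two-proportion z-test on read counts, a simplification rather than the study's actual pipeline) might look like:

```python
import numpy as np
from scipy.stats import norm

def pool_freq_test(alt_high, depth_high, alt_low, depth_low):
    """Two-proportion z-test comparing a variant's allele frequency between
    the high- and low-phenotype pools, using sequencing read counts."""
    p1, p2 = alt_high / depth_high, alt_low / depth_low
    p = (alt_high + alt_low) / (depth_high + depth_low)   # pooled frequency
    se = np.sqrt(p * (1 - p) * (1 / depth_high + 1 / depth_low))
    return 2 * norm.sf(abs((p1 - p2) / se))

# e.g. 120/400 alternate reads in the high pool vs 60/380 in the low pool
print(pool_freq_test(120, 400, 60, 380))
```

In practice one would also model pool sampling noise and unequal individual contributions, which this simple test ignores.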

17.
As the field of phylogeography has matured, it has become clear that analyses of one or a few genes may reveal more about the history of those genes than the populations and species that are the targets of study. To alleviate these concerns, the discipline has moved towards larger analyses of more individuals and more genes, although little attention has been paid to the qualitative or quantitative gains that such increases in scale and scope may yield. Here, we increase the number of individuals and markers by an order of magnitude over previously published work to comprehensively assess the phylogeographical history of a well-studied declining species, the western pond turtle (Emys marmorata). We present a new analysis of 89 independent nuclear SNP markers and one mitochondrial gene sequence scored for rangewide sampling of >900 individuals, and compare these to smaller-scale, rangewide genetic and morphological analyses. Our enlarged SNP data fundamentally revise our understanding of evolutionary history for this lineage. Our results indicate that the gains from greatly increasing both the number of markers and individuals are substantial and worth the effort, particularly for species of high conservation concern such as the pond turtle, where accurate assessments of population history are a prerequisite for effective management.

18.
Multimarker transmission/disequilibrium tests (TDTs) are association tests that are highly robust to population admixture and structure and may be used to identify susceptibility loci in genome-wide association studies. Multimarker TDTs using several markers may increase power by capturing high-degree associations. However, there is also a risk of spurious associations and of power reduction due to the increase in degrees of freedom. In this study we show that associations found by tests built on simple null hypotheses are highly reproducible in a second independent data set regardless of the number of markers. As a test exhibiting this feature to its maximum, we introduce the multimarker 2-groups TDT (mTDT_2G), a test which, under the hypothesis of no linkage, asymptotically follows a χ2 distribution with 1 degree of freedom regardless of the number of markers. The statistic requires the division of parental haplotypes into two groups: a disease-susceptibility and a disease-protective haplotype group. We assessed the test's behavior by performing an extensive simulation study as well as a real-data study using several data sets for two complex diseases. We show that the mTDT_2G test is highly efficient and achieves the highest power among all the tests used, even when the null hypothesis is tested in a second independent data set. Therefore, mTDT_2G turns out to be a very promising multimarker TDT for genome-wide searches for disease susceptibility loci, and it may be used as a preprocessing step in the construction of more accurate genetic models to predict individual susceptibility to complex diseases.
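A schematic of the two-group idea (the grouping rule, the data layout, and all names below are invented for illustration; the published statistic may differ in detail): once haplotypes are split into a susceptibility set S and a protective set, transmissions from informative parents reduce to a McNemar-type count with a single degree of freedom, however many markers make up each haplotype.

```python
import numpy as np
from scipy.stats import chi2

def mtdt_2g(transmitted, untransmitted, risk_group):
    """Two-group multimarker TDT sketch: over parents carrying one S and one
    non-S haplotype, compare how often the S haplotype was transmitted.
    (b - c)^2 / (b + c) is chi-square with 1 df under no linkage."""
    t_in = np.isin(transmitted, list(risk_group))
    u_in = np.isin(untransmitted, list(risk_group))
    b = np.sum(t_in & ~u_in)     # S transmitted, non-S untransmitted
    c = np.sum(~t_in & u_in)     # non-S transmitted, S untransmitted
    stat = (b - c) ** 2 / (b + c) if b + c else 0.0
    return stat, chi2.sf(stat, df=1)

# toy data: multimarker haplotypes encoded as strings over two marker alleles
trans = np.array(["AB", "AB", "ab", "Ab", "AB", "aB"])
untrs = np.array(["ab", "aB", "AB", "ab", "Ab", "ab"])
print(mtdt_2g(trans, untrs, risk_group={"AB", "Ab"}))
```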

19.
Primulina eburnea is a promising candidate for domestication and floriculture, since it is easy to cultivate and has beautiful flowers. An F2 population of 189 individuals was established for the construction of first-generation linkage maps based on expressed sequence tag (EST)-derived single-nucleotide polymorphism markers using the MassARRAY genotyping platform. Of the 232 screened markers, 215 were assigned to 18 linkage groups, in accordance with the haploid chromosome number of the species. The linkage map spanned a total of 3774.7 cM, with an average distance of 17.6 cM between adjacent markers. This linkage map provides a framework for the identification of important genes in breeding programmes.

20.
Genomewide association studies have been advocated as a promising alternative to genomewide linkage scans for the detection of small-effect genes in complex diseases. Comparisons of power and sample size between the two strategies have shown considerable advantages for association studies. These comparisons assume that the set of markers includes the exact disease-related polymorphism. A concern, however, is that the power of an association study decreases when this is not the case, because of discrepant allele frequencies and less-than-maximal disequilibrium between the disease-related polymorphism and its nearest marker. Here, we quantify this concern by comparing the sample sizes needed by the two strategies when the markers exclude the disease-related polymorphism. For affected sib pairs and their parents, we found that incomplete disequilibrium and differing allele frequencies can have a substantial negative impact on the power of association studies, resulting, in some circumstances, in little gain and even in loss of power compared with linkage analysis. We provide some guidelines for choosing between the strategies for the detection of genes for complex diseases.

