Similar literature
1.
Zheng T  Wang H  Lo SH 《Human heredity》2006,62(4):196-212
BACKGROUND: The study of complex traits poses new challenges to current methods that evaluate association between genotypes and a specific trait. Consideration of possible interactions among loci leads to overwhelming dimensions that cannot be handled using current statistical methods. METHODS: In this article, we evaluate a multi-marker screening algorithm, the backward genotype-trait association (BGTA) algorithm for case-control designs, which uses unphased multi-locus genotypes. BGTA carries out a global investigation on a candidate marker set and automatically screens out markers carrying diminutive amounts of information regarding the trait in question. To address the 'too many possible genotypes, too few informative chromosomes' dilemma of a genomic-scale study that consists of hundreds to thousands of markers, we further investigate a BGTA-based marker selection procedure, in which the screening algorithm is repeated on a large number of random marker subsets. Results of these screenings are then aggregated into counts of how often each marker is retained by the BGTA algorithm. Markers with exceptionally high return counts are selected for further analysis. RESULTS AND CONCLUSION: Evaluated using simulations under several disease models, the proposed methods prove to be more powerful in dealing with epistatic traits. We also demonstrate the proposed methods through an application to a study on inflammatory bowel disease.
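The subset-and-aggregate step described in this abstract can be sketched as follows. This is a minimal illustration, not the authors' implementation: `screen` is a hypothetical stand-in for the BGTA screening step (any function that takes a list of markers and returns the ones it retains), and the names `count_returns`, `subset_size`, `n_rounds`, and `toy_screen` are assumptions made for the example.

```python
import random
from collections import Counter

def count_returns(markers, screen, subset_size, n_rounds, seed=0):
    """Count how often each marker survives screening on random subsets.

    `screen` is a caller-supplied placeholder for the real screening
    algorithm; it receives a list of markers and returns those retained.
    """
    rng = random.Random(seed)                  # seeded for reproducibility
    counts = Counter({m: 0 for m in markers})  # every marker starts at zero
    for _ in range(n_rounds):
        subset = rng.sample(markers, subset_size)
        for m in screen(subset):
            counts[m] += 1
    return counts

def toy_screen(subset):
    """Toy screen (assumption): pretend only m1 and m4 carry signal."""
    return [m for m in subset if m in {"m1", "m4"}]

markers = ["m1", "m2", "m3", "m4", "m5"]
counts = count_returns(markers, toy_screen, subset_size=3, n_rounds=200)
# Rank markers by how often they were retained across the random subsets.
ranked = sorted(markers, key=lambda m: counts[m], reverse=True)
```

Markers whose counts stand clearly above the rest would be the ones carried forward; how "exceptionally high" is judged is left open here, as it is in the abstract.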

2.
MOTIVATION: Selecting SNP markers for genome-wide association studies is an important and challenging task. The goal is to minimize the number of markers selected for genotyping in a particular platform and therefore reduce genotyping cost while simultaneously maximizing the information content provided by selected markers. RESULTS: We devised an improved algorithm for tagSNP selection using the pairwise r^2 criterion. We first break down large marker sets into disjoint pieces, where more exhaustive searches can replace the greedy algorithm for tagSNP selection. These exhaustive searches lead to smaller tagSNP sets being generated. In addition, our method evaluates multiple solutions that are equivalent according to the linkage disequilibrium criteria to accommodate additional constraints. Its performance was assessed using HapMap data. AVAILABILITY: A computer program named FESTA has been developed based on this algorithm. The program is freely available and can be downloaded at http://www.sph.umich.edu/csg/qin/FESTA/
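For orientation, the greedy baseline that this abstract improves on can be sketched as a set-cover loop over a pairwise r² threshold. This is not FESTA: the function names are hypothetical, the 0.8 threshold is only a common default, and r² is taken here as the squared Pearson correlation of 0/1/2 genotype codes, which is an assumption made for unphased toy data.

```python
from itertools import combinations

def r_squared(x, y):
    """Squared Pearson correlation between two genotype vectors coded 0/1/2."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0  # monomorphic marker: treat as uncorrelated
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    return sxy * sxy / (sxx * syy)

def ld_neighbors(genotypes, threshold=0.8):
    """Map each SNP to the set of SNPs (itself included) with r^2 >= threshold."""
    nbrs = {s: {s} for s in genotypes}
    for a, b in combinations(list(genotypes), 2):
        if r_squared(genotypes[a], genotypes[b]) >= threshold:
            nbrs[a].add(b)
            nbrs[b].add(a)
    return nbrs

def greedy_tags(genotypes, threshold=0.8):
    """Greedy set cover: pick the SNP that tags the most still-untagged SNPs."""
    nbrs = ld_neighbors(genotypes, threshold)
    untagged = set(nbrs)
    tags = []
    while untagged:
        # sorted() makes tie-breaking deterministic (alphabetical order).
        best = max(sorted(nbrs), key=lambda s: len(nbrs[s] & untagged))
        tags.append(best)
        untagged -= nbrs[best]
    return tags

# Toy data: A, B, C are in perfect LD with each other; D is independent of them.
toy = {
    "A": [0, 1, 2, 0, 1, 2],
    "B": [0, 1, 2, 0, 1, 2],
    "C": [2, 1, 0, 2, 1, 0],
    "D": [0, 0, 1, 2, 2, 1],
}
tags = greedy_tags(toy)  # ['A', 'D']
```

The greedy choice is not guaranteed to be minimal, which is the gap the abstract addresses by searching more exhaustively within disjoint pieces of the marker set.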

3.
Liu W  Zhao W  Chase GA 《Human heredity》2006,61(1):31-44
OBJECTIVE: Single nucleotide polymorphisms (SNPs) serve as effective markers for localizing disease susceptibility genes, but current genotyping technologies are inadequate for genotyping all available SNP markers in a typical linkage/association study. Much attention has recently been paid to methods for selecting the minimal informative subset of SNPs in identifying haplotypes, but there has been little investigation of the effect of missing or erroneous genotypes on the performance of these SNP selection algorithms and subsequent association tests using the selected tagging SNPs. The purpose of this study is to explore the effect of missing genotypes or genotyping errors on tagging SNP selection and subsequent single marker and haplotype association tests using the selected tagging SNPs. METHODS: Through two sets of simulations, we evaluated the performance of three tagging SNP selection programs in the presence of missing or erroneous genotypes: Clayton's diversity based program htstep, Carlson's linkage disequilibrium (LD) based program ldSelect, and Stram's coefficient of determination based program tagsnp.exe. RESULTS: When randomly selected known loci were relabeled as 'missing', we found that the average number of tagging SNPs selected by all three algorithms changed very little and the power of subsequent single marker and haplotype association tests using the selected tagging SNPs remained close to the power of these tests in the absence of missing genotypes. When random genotyping errors were introduced, we found that the average number of tagging SNPs selected by all three algorithms increased. In data sets simulated according to the haplotype frequencies in the CYP19 region, Stram's program had a larger increase than Carlson's and Clayton's programs. In data sets simulated under the coalescent model, Carlson's program had the largest increase and Clayton's program had the smallest increase. In both sets of simulations, in the presence of genotyping errors, the power of the haplotype tests from all three programs decreased quickly, but there was not much reduction in power of the single marker tests. CONCLUSIONS: Missing genotypes do not seem to have much impact on tagging SNP selection and subsequent single marker and haplotype association tests. In contrast, genotyping errors could have a severe impact on tagging SNP selection and haplotype tests, but not on single marker tests.

4.
The search for the association between complex diseases and single nucleotide polymorphisms (SNPs) or haplotypes has recently received great attention. For these studies, it is essential to use a small subset of informative SNPs accurately representing the rest of the SNPs. Informative SNP selection can achieve (1) considerable budget savings by genotyping only a limited number of SNPs and computationally inferring all other SNPs or (2) necessary reduction of the huge SNP sets (obtained, e.g. from Affymetrix) for further fine haplotype analysis. A novel informative SNP selection method for unphased genotype data based on multiple linear regression (MLR) is implemented in the software package MLR-tagging. This software can be used for informative SNP (tag) selection and genotype prediction. The stepwise tag selection algorithm (STSA) selects positions of the given number of informative SNPs based on a genotype sample population. The MLR SNP prediction algorithm predicts a complete genotype based on the values of its informative SNPs, their positions among all SNPs, and a sample of complete genotypes. An extensive experimental study on various datasets including 10 regions from HapMap shows that the MLR prediction combined with stepwise tag selection uses fewer tags than the state-of-the-art method of Halperin et al. (2005). AVAILABILITY: MLR-Tagging software package is publicly available at http://alla.cs.gsu.edu/~software/tagging/tagging.html  相似文献   

5.
Zhang J 《PloS one》2010,5(11):e13734
Identification of a small panel of population structure informative markers can reduce genotyping cost and is useful in various applications, such as ancestry inference in association mapping, forensics and evolutionary theory in population genetics. Traditional methods to ascertain ancestral informative markers usually require prior knowledge of individual ancestry and have difficulty with admixed populations. Recently Principal Components Analysis (PCA) has been employed with success to select SNPs which are highly correlated with top significant principal components (PCs) without use of individual ancestral information. The approach is also applicable to admixed populations. Here we propose a novel approach based on our recent result on summarizing population structure by graph Laplacian eigenfunctions, which differs from PCA in that it is geometric and robust to outliers. Our approach also takes advantage of the a priori sparseness of informative markers in the genome. Through simulation of a ring population and the real global population sample HGDP of 650K SNPs genotyped in 940 unrelated individuals, we validate the proposed algorithm at selecting the most informative markers, a small fraction of which can efficiently recover a similar underlying population structure. Employing a standard Support Vector Machine (SVM) to predict individuals' continental memberships on the HGDP dataset of seven continents, we demonstrate that the SNPs selected by our method are more informative but less redundant than those selected by PCA. Our algorithm is a promising tool in genome-wide association studies and population genetics, facilitating the selection of structure informative markers, efficient detection of population substructure and ancestral inference.

6.
OBJECTIVES: Genetic association studies are usually based upon restricted sets of 'tag' markers selected to represent the total sequence variation. Tag selection is often determined by some threshold for the r^2 coefficients of linkage disequilibrium (LD) between tag and untyped markers, it being widely assumed that power to detect an effect at the untyped sites is retained by typing the tag marker in a sample scaled by the inverse of the selected threshold (1/r^2). However, unless only a single causal variant occurs at a locus, it has been shown [Eur J Hum Genet 2006;14:426-437] that significant power loss can occur if this principle is applied. We sought to investigate whether unexpected loss of power might be an exceptional case or a more general concern. In the absence of detailed knowledge about the genetic architecture at complex disease loci, we developed a mathematical approach to test all possible situations. METHODS: We derived mathematical formulae allowing the calculation of all possible odds ratios (OR) at a tag marker locus given the effect size that would be observed by typing a second locus and the r^2 between the two loci. For a range of allele frequencies, r^2 between loci, and strengths of association at the causal locus (OR from 0.5 to 2) that we consider realistic for complex disease loci, we next determined the sample sizes that would be necessary to give equivalent power to detect association by genotyping tag and causal loci and compared these with the sample sizes predicted by applying 1/r^2. RESULTS: Under most of the hypothetical scenarios we examined, the calculated sample sizes required to maintain power by typing markers that tag the causal locus at even moderately high r^2 (0.8) were greater than those calculated by applying 1/r^2. Even in populations with apparently similar measurements of allele frequency, LD structure, and effect size at the susceptibility allele, the required sample size to detect association with a tag marker can vary substantially. We also show that in apparently similar populations, associations to either allele at the tag site are possible. CONCLUSIONS: Indirect tests of association have less power than predicted by applying 1/r^2 in the majority of hypothetical scenarios we examined. Our findings pertain even for what we consider likely to be larger than average effect sizes in complex diseases (OR = 1.5-2) and even for moderately high r^2 values between the markers. Until a substantial number of disease genes have been identified through methods that are not based on tagging, and therefore biased towards those situations most favourable to tagging, it is impossible to know how the true scenarios are distributed across the range of possible scenarios. Nevertheless, while association designs based upon tag marker selection by necessity are the tool of choice for de novo gene discovery, our data suggest power to initially detect association may often be less than assumed. Moreover, our data suggest that to avoid genuine findings being subsequently discarded by unpredictable losses of power, follow up studies in other samples should be based upon more detailed analyses of the gene rather than simply on the tag SNPs showing association in the discovery study.
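The conventional rule that this abstract puts to the test can be written as N_tag = N_causal / r², where r² is the LD between the tag and the causal marker. Below is a minimal sketch of that rule only, with hypothetical function and parameter names. In light of the abstract's findings, the returned value is better read as an often optimistic lower estimate than as a guarantee of equal power.

```python
import math

def naive_tag_sample_size(n_causal, r2):
    """Sample size at the tag marker under the conventional 1/r^2 scaling rule.

    n_causal: sample size that would suffice if the causal locus were typed.
    r2:       pairwise r^2 between the tag and the causal locus, in (0, 1].
    """
    if not 0 < r2 <= 1:
        raise ValueError("r2 must be in (0, 1]")
    # Round up: a fractional subject cannot be recruited.
    return math.ceil(n_causal / r2)

n_half = naive_tag_sample_size(1000, 0.5)  # r^2 = 0.5 doubles the sample: 2000
```

The abstract's own calculations cover the dependence on allele frequencies and effect sizes that this one-line rule ignores.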

7.
MOTIVATIONS: The tag SNP approach is a valuable tool in whole genome association studies, and a variety of algorithms have been proposed to identify the optimal tag SNP set. Currently, most tag SNP selection is based on two-marker (pairwise) linkage disequilibrium (LD). Recent literature has shown that multiple-marker LD also contains useful information that can further increase the genetic coverage of the tag SNP set. Thus, tag SNP selection methods that incorporate multiple-marker LD are expected to have advantages in terms of genetic coverage and statistical power. RESULTS: We propose a novel algorithm to select tag SNPs in an iterative procedure. In each iteration loop, the SNP that captures the most neighboring SNPs (through pair-wise and multiple-marker LD) is selected as a tag SNP. We optimize the algorithm and computer program to make our approach feasible on today's typical workstations. Benchmarked using HapMap release 21, our algorithm outperforms standard pair-wise LD approach in several aspects. (i) It improves genetic coverage (e.g. by 7.2% for 200 K tag SNPs in HapMap CEU) compared to its conventional pair-wise counterpart, when conditioning on a fixed tag SNP number. (ii) It saves genotyping costs substantially when conditioning on fixed genetic coverage (e.g. 34.1% saving in HapMap CEU at 90% coverage). (iii) Tag SNPs identified using multiple-marker LD have good portability across closely related ethnic groups and (iv) show higher statistical power in association tests than those selected using conventional methods. AVAILABILITY: A computer software suite, multiTag, has been developed based on this novel algorithm. The program is freely available by written request to the author at ke_hao@merck.com  相似文献   

8.
A strategy for using multiple linked markers for genetic counseling.
A strategy for using multiple linked markers for genetic counseling is to test sequentially individual markers until a diagnosis can be made. We show that in order to minimize the number of tests performed per case while diagnosing all informative cases the order in which the markers are to be tested is critical. We describe an algorithm to obtain this order using the parameter "I," the frequency of informative cases. The I value for a specific locus used depends on the marker frequency, association with the disease locus, and also on the informativeness of the marker loci already tested. Realizing that a direct assay for the beta S gene already exists, and that most cases of beta-thalassemia in Mediterraneans can be directly diagnosed using synthetic oligonucleotide probes, we illustrate the above technique by examining nine DNA polymorphisms in the human beta-globin cluster for their ability to diagnose sickle-cell anemia in American blacks and beta-thalassemia in Mediterraneans. This analysis shows that 95.39% of all sickle-cell pregnancies can be diagnosed by testing a subset of only six markers chosen by our algorithm. Furthermore, six markers can also diagnose 88.03% of beta-thalassemia in Greeks and 83.56% of beta-thalassemia in Italians. The test set is different from that suggested by the individual informative frequencies due to nonrandom associations between the restriction sites.  相似文献   

9.
BACKGROUND: Low-cost genotyping of individuals using high-density genomic markers was recently introduced as genomic selection in genetic improvement programs in dairy cattle. Most implementations of genomic selection use only marker information in the models used for prediction of genetic merit. However, in other species it has been shown that only a fraction of the total genetic variance can be explained by markers. Using 5217 bulls in the Nordic Holstein population that were genotyped and had genetic evaluations based on progeny, we partitioned the total additive genetic variance into a genomic component explained by markers and a remaining component explained by familial relationships. The traits analyzed were production and fitness related traits in dairy cattle. Furthermore, we estimated the genomic variance that can be attributed to individual chromosomes and we illustrate methods that can predict the amount of additive genetic variance that can be explained by sets of markers with different density. RESULTS: The amount of additive genetic variance that can be explained by markers was estimated by an analysis of the matrix of genomic relationships. For the traits in the analysis, most of the additive genetic variance can be explained by 44 K informative SNP markers. The same amount of variance can be attributed to individual chromosomes but surprisingly the relation between chromosomal variance and chromosome length was weak. In models including both genomic (marker) and familial (pedigree) effects most (on average 77.2%) of total additive genetic variance was explained by genomic effects while the remainder was explained by familial relationships. CONCLUSIONS: Most of the additive genetic variance for the traits in the Nordic Holstein population can be explained using 44 K informative SNP markers. By analyzing the genomic relationship matrix it is possible to predict the amount of additive genetic variance that can be explained by a reduced (or increased) set of markers. For the population analyzed the improvement of genomic prediction by increasing marker density beyond 44 K is limited.

10.
Identification of population structure can help trace population histories and identify disease genes. Structured association (SA) is a commonly used approach for population structure identification and association mapping. A major issue with SA is that its performance greatly depends on the informativeness and the numbers of ancestral informative markers (AIMs). Present major AIM selection methods mostly require prior individual ancestry information, which is usually not available or uncertain in practice. To address this potential weakness, we herein develop a novel approach for AIM selection based on principal component analysis (PCA), which does not require prior ancestry information of study subjects. Our simulation and real genetic data analysis results suggest that, with equivalent AIMs, PCA-based selected AIMs can significantly increase the accuracy of inferred individual ancestries compared with traditionally randomly selected AIMs. Our method can easily be applied to whole genome data to select a set of highly informative AIMs in population structure, which can then be used to identify potential population structure and correct possible statistical biases caused by population stratification.

11.
If marker alleles that identify a gene for introgression are not completely unique to the different base populations, the trait allele can be lost quickly during the process of backcrossing. This study considers ways to deal with incompletely informative markers in order to retain the desired allele. Selection was based on the probability of the presence of the desired (introgressed) trait allele, which was calculated for each marker genotype, using a single marker or a diallelic or triallelic marker bracket. The percentage of individuals retaining the introgressed allele was calculated over five generations of backcrossing, for selected fractions between 0 and 1, for marker alleles that could occur in both base populations. The best results were obtained with a rather large selected fraction, when all individuals, heterozygous and homozygous for the most desirable allele at the marker loci, were selected. Additional selection against marker homozygotes (which might have the highest probability of carrying the desired-trait allele, but produce uninformative gametes) altered the optimum selected fraction, making the selected fraction more consistently inversely related to a better retention of the desired-trait allele. A marker bracket was found to give a better retention of the desired-trait allele than a single marker and triallelic markers were better than diallelic markers, giving a retention of almost 50%. The earlier that preselection of parents (on informativeness) took place the better the overall result; preselection should occur preferably in the base populations. Preselection could make marker alleles unique to alternative base populations and markers would effectively become fully informative. Selection in the base populations might not be possible or not desirable, for example, because of the available number of individuals. This is unlikely to be a problem when parents are paired up to exclude any common marker alleles.  相似文献   

12.
Two-locus population genetic models are analyzed to evaluate the utility of restriction fragment length polymorphisms for purposes of genetic counseling. It is shown that the linkage disequilibrium between a neutral marker and a tightly linked overdominant mutant will increase rapidly as the mutant moves to its polymorphic equilibrium. The linkage disequilibrium decays for deleterious recessive mutants. Two measures involving the linkage disequilibrium are investigated to determine how much information the transmission of the neutral marker provides about the transmission of the selected gene. In certain kinds of matings, where the parental two-locus genotypes and linkage phases are known, it is possible to determine whether or not a progeny is homozygous for the selected gene on the basis of the fetal genotype at the marker locus. A quantity of primary interest is the fraction of matings between individuals heterozygous for the selected gene in which exact diagnosis can be made in this way. The expected proportion of such matings, taken over all two-locus matings involving heterozygotes at the selected locus, is calculated as a function of the gene frequencies at the two loci and the linkage disequilibrium between them. This expected value is maximized when the linkage disequilibrium is at its maximum in absolute value. Fewer than half of all matings are informative if the linkage disequilibrium is small in magnitude or if the gene frequencies at the two loci are quite different. Consideration is also given to various conditional measures of association that may be useful when the parental two-locus genotypes are unknown. The results suggest that the utility of tightly linked neutral marker genes in predicting the transmission of a selected gene is generally less when selection acts against a recessive gene than for overdominant selection.  相似文献   

13.
Two-stage designs in case-control association analysis
Zuo Y  Zou G  Zhao H 《Genetics》2006,173(3):1747-1760
DNA pooling is a cost-effective approach for collecting information on marker allele frequency in genetic studies. It is often suggested as a screening tool to identify a subset of candidate markers from a very large number of markers to be followed up by more accurate and informative individual genotyping. In this article, we investigate several statistical properties and design issues related to this two-stage design, including the selection of the candidate markers for second-stage analysis, statistical power of this design, and the probability that truly disease-associated markers are ranked among the top after second-stage analysis. We have derived analytical results on the proportion of markers to be selected for second-stage analysis. For example, to detect disease-associated markers with an allele frequency difference of 0.05 between the cases and controls through an initial sample of 1000 cases and 1000 controls, our results suggest that when the measurement errors are small (0.005), approximately 3% of the markers should be selected. For the statistical power to identify disease-associated markers, we find that the measurement errors associated with DNA pooling have little effect on its power. This is in contrast to the one-stage pooling scheme where measurement errors may have large effect on statistical power. As for the probability that the disease-associated markers are ranked among the top in the second stage, we show that there is a high probability that at least one disease-associated marker is ranked among the top when the allele frequency differences between the cases and controls are not <0.05 for reasonably large sample sizes, even though the errors associated with DNA pooling in the first stage are not small. Therefore, the two-stage design with DNA pooling as a screening tool offers an efficient strategy in genomewide association studies, even when the measurement errors associated with DNA pooling are nonnegligible. 
For any disease model, we find that all the statistical results essentially depend on the population allele frequency and the allele frequency differences between the cases and controls at the disease-associated markers. The general conclusions hold whether the second stage uses an entirely independent sample or includes both the samples used in the first stage and an independent set of samples.

14.
A common problem in mapping quantitative trait loci (QTLs) is that marker data are often incomplete. This includes missing data, dominant markers, and partially informative markers, arising in outbred populations. Here we briefly present an iteratively re-weighted least squares method (IRWLS) to incorporate dominant and missing markers for mapping QTLs in four-way crosses under a heterogeneous variance model. The algorithm uses information from all markers in a linkage group to infer the QTL genotype. Monte Carlo simulations indicate that with half dominant markers, QTL detection is almost as efficient as with all co-dominant markers. However, the precision of the estimated QTL parameters generally decreases as more markers become missing or dominant. Notable differences are observed in the standard deviation of the estimated QTL position for varying levels of marker information content. The method is relatively simple so that more complex models including multiple QTLs or fixed effects can be fitted. Finally, the method can be readily extended to QTL mapping in full-sib families. Received: 16 June 1998 / Accepted: 29 September 1998

15.
Two bulking procedures (bulking individuals before and after genotyping) are commonly applied in similarity based studies of genetic distance at the population or higher level, but their effectiveness is largely unknown. In this study, expected population-pairwise similarity for both bulking procedures is derived with dominant and co-dominant diallelic markers. Numerical examples for the derived formulae are given with up to ten individuals randomly selected from each population. The procedure of bulking individuals after genotyping with either marker system is generally more informative than the procedure of bulking individuals before genotyping, because the former incorporates the information from marker alleles of intermediate frequency. Both procedures are effective with 5–10 individuals selected randomly from either population, but the procedure of bulking before genotyping requires a genotyping effort several-fold less than the procedure of bulking after genotyping. For either bulking procedure, a co-dominant marker system is generally more informative than a dominant marker system. Received: 20 October 1999 / Accepted: 11 November 1999  相似文献   

16.
Genetic association studies increasingly rely on the use of linkage disequilibrium (LD) tag SNPs to reduce genotyping costs. We developed a software package TAGster to select, evaluate and visualize LD tag SNPs both for single and multiple populations. We implement several strategies to improve the efficiency of current LD tag SNP selection algorithms: (1) We modify the tag SNP selection procedure of Carlson et al. to improve selection efficiency and further generalize it to multiple populations. (2) We propose a redundant SNP elimination step to speed up the exhaustive tag SNP search algorithm proposed by Qin et al. (3) We present an additional multiple population tag SNP selection algorithm based on the framework of Howie et al., but using our modified exhaustive search procedure. We evaluate these methods using resequenced candidate gene data from the Environmental Genome Project and show improvements in both computational and tagging efficiency. AVAILABILITY: The software package TAGster is freely available at http://www.niehs.nih.gov/research/resources/software/tagster/

17.
Genes that underlie ethnic differences in disease risk can be mapped in affected individuals of mixed descent if the ancestry of the alleles at each marker locus can be assigned to one of the two founding populations. Linkage can be detected by testing for association of the disease with the ancestry of alleles at the marker locus, by conditioning on the admixture (defined as the proportion of genes that have ancestry from the high-risk population) of both parents. With regard to exploiting the effects of admixture, this test is more flexible and powerful than the transmission-disequilibrium test. Under the assumption of a multiplicative model, the statistical power for a given sample size depends only on parental admixture and the risk ratio r between populations that is generated by the locus. The most informative families are those in which mean parental admixture is .2-.7 and in which admixture is similar in both parents. The number of markers required for a genome search depends on the number of generations since admixture and on the information content for ancestry (f) of the markers, defined as a function of allele frequencies in the two founding populations. Simulations using a hidden Markov model suggest that, when admixture has occurred 2-10 generations earlier, a multipoint analysis using 2,000 biallelic markers, with f values of 30%, can extract 70%-90% of the ancestry information for each locus. Sets of such markers could be selected from libraries of single-nucleotide polymorphisms, when these become available.  相似文献   

18.
19.
For the analysis of affected sib pairs (ASPs), a variety of test statistics is applied in genomewide scans with microsatellite markers. Even in multipoint analyses, these statistics might not fully exploit the power of a given sample, because they do not account for incomplete informativity of an ASP. For meta-analyses of linkage and association studies, it has been shown recently that weighting by informativity increases statistical power. With this idea in mind, the first aim of this article was to introduce a new class of tests for ASPs that are based on the mean test. To take into account how much informativity an ASP contributes, we weighted families inversely proportional to their marker informativity. The weighting scheme is obtained by use of the de Finetti representation of the distribution of identity-by-descent values. We derive the limiting distribution of the weighted mean test and demonstrate the validity of the proposed test. We show that it can be much more powerful than the classical mean test in the case of low marker informativity. In the second part of the article, we propose a Monte Carlo simulation approach for evaluating significance among ASPs. We demonstrate the validity of the simulation approach for both the classical and the weighted mean test. Finally, we illustrate the use of the weighted mean test by reanalyzing two published data sets. In both applications, the maximum LOD score of the weighted mean test is 0.6 higher than that of the classical mean test.  相似文献   

20.
An analytical procedure for estimating the risk of X-linked diseases based on presence/absence of a series of restriction sites is presented. Multiple-locus linkage phase of the carrier mother is first inferred from previous offspring, from parents, and by molecular means. Bayesian risk estimates are then obtained using this information and the recombination-segregation distribution. The improvement afforded by using multiple flanking markers rather than a single marker is dramatic. Whereas the upper bound on the probability that a family will be informative using a single diallelic X-linked marker is .5, in the case of m markers, the bound on the probability of an informative family becomes 1 - .5^m. With a single linked marker, the precision in the risk estimate is bounded by the frequency of recombination, whereas the requirement of very tight linkage is relaxed somewhat when multiple flanking markers are used. Recombination interference and multiple-locus linkage disequilibria can further improve the risk estimates, but it is important to understand how the statistical confidence in these parameters affects the reliability of the risk estimates.
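The bound quoted in this abstract rises quickly with the number of markers. A small sketch, reading the bound as 1 - .5^m (one minus one half raised to the power m, which reduces to the stated .5 for a single marker); the function name is hypothetical:

```python
def informative_upper_bound(m):
    """Upper bound on P(family is informative) with m diallelic X-linked markers."""
    if m < 1:
        raise ValueError("m must be a positive integer")
    return 1 - 0.5 ** m

# Bound for a few marker counts.
bounds = {m: informative_upper_bound(m) for m in (1, 2, 3, 5)}
# {1: 0.5, 2: 0.75, 3: 0.875, 5: 0.96875}
```

These are upper bounds only; the probability actually achieved depends on marker allele frequencies and on linkage disequilibrium between the sites.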
