Similar articles
 20 similar articles found; search time 31 ms
1.
The genotyping of closely spaced single-nucleotide polymorphism (SNP) markers frequently yields highly correlated data, owing to extensive linkage disequilibrium (LD) between markers. The extent of LD varies widely across the genome and drives the number of frequent haplotypes observed in small regions. Several studies have illustrated the possibility that LD or haplotype data could be used to select a subset of SNPs that optimize the information retained in a genomic region while reducing the genotyping effort and simplifying the analysis. We propose a method based on the spectral decomposition of the matrices of pairwise LD between markers, and we select markers on the basis of their contributions to the total genetic variation. We also modify Clayton's "haplotype tagging SNP" selection method, which utilizes haplotype information. For both methods, we propose sliding window-based algorithms that allow the methods to be applied to large chromosomal regions. Our procedures require genotype information about a small number of individuals for an initial set of SNPs and selection of an optimum subset of SNPs that could be efficiently genotyped on larger numbers of samples while retaining most of the genetic variation in samples. We identify suitable parameter combinations for the procedures, and we show that a sample size of 50-100 individuals achieves consistent results in studies of simulated data sets in linkage equilibrium and LD. When applied to experimental data sets, both procedures were similarly effective at reducing the genotyping requirement while maintaining the genetic information content throughout the regions. We also show that haplotype-association results that Hosking et al. obtained near CYP2D6 were almost identical before and after marker selection.
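The idea of ranking markers by their contribution to a spectral decomposition of the pairwise LD matrix can be illustrated with a small sketch. It is not the authors' procedure: it assumes phased 0/1 haplotypes, uses r² as the LD measure, and looks only at the leading eigenvector, obtained by power iteration. The function names (`r_squared`, `ld_matrix`, `leading_eigenvector`, `rank_snps`) are hypothetical.

```python
from itertools import combinations


def r_squared(a, b):
    """r^2 between two biallelic SNPs, each given as 0/1 alleles per haplotype."""
    n = len(a)
    pa, pb = sum(a) / n, sum(b) / n
    pab = sum(1 for x, y in zip(a, b) if x == 1 and y == 1) / n
    denom = pa * (1 - pa) * pb * (1 - pb)
    if denom == 0:
        return 0.0  # a monomorphic SNP carries no LD information
    return (pab - pa * pb) ** 2 / denom


def ld_matrix(snps):
    """Symmetric matrix of pairwise r^2 values (diagonal set to 1)."""
    m = len(snps)
    mat = [[1.0] * m for _ in range(m)]
    for i, j in combinations(range(m), 2):
        mat[i][j] = mat[j][i] = r_squared(snps[i], snps[j])
    return mat


def leading_eigenvector(mat, iterations=200):
    """Dominant eigenvector by power iteration (illustrative helper)."""
    m = len(mat)
    v = [m ** -0.5] * m
    for _ in range(iterations):
        w = [sum(mat[i][j] * v[j] for j in range(m)) for i in range(m)]
        norm = sum(x * x for x in w) ** 0.5
        v = [x / norm for x in w]
    return v


def rank_snps(snps):
    """Order SNP indices by squared loading on the leading eigenvector."""
    v = leading_eigenvector(ld_matrix(snps))
    return sorted(range(len(snps)), key=lambda i: v[i] ** 2, reverse=True)


# Toy haplotypes: SNPs 0 and 1 are in complete LD, SNP 2 is independent of both.
toy = [[0, 0, 1, 1], [0, 0, 1, 1], [0, 1, 0, 1]]
ranking = rank_snps(toy)
```

In the toy data the two correlated SNPs dominate the leading component and the independent SNP ranks last. A fuller treatment would use every eigenvector that explains a meaningful share of the variation and would apply the ranking within sliding windows.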

2.
Whole genome scans analyze large sets of genetic markers, mainly single nucleotide polymorphisms, over the entire genome in order to find variants and regions associated with complex traits so these can be further investigated. Analyzing the results of such scans becomes difficult due to multiple testing problems and to the genomic distributions of recombination, linkage disequilibrium and true associations, which generate an extremely complex network of dependences between markers. Here we present Association Cluster Detector (ACD), a simple tool aiming to ease the analysis of the results of whole genome scans. ACD facilitates correction for multiple tests using several standard procedures and implements a sliding-window heuristic method that helps in detecting potentially interesting candidate regions by exploiting the property of non-random distribution of significantly associated markers. AVAILABILITY: The tool can be downloaded from http://www.upf.es/cexs/recerca/bioevo/softanddata.htm
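The two ingredients named here, a standard multiple-testing correction and a sliding window over significant markers, can be sketched as follows. The abstract does not specify ACD's heuristic, so this is only an assumed illustration: Bonferroni stands in for the "standard procedures", and a window is reported when it holds at least `min_hits` significant markers. Both function names and all parameters are hypothetical.

```python
def bonferroni(p_values, alpha=0.05):
    """Flag markers whose p-value passes a Bonferroni-corrected threshold."""
    m = len(p_values)
    return [p <= alpha / m for p in p_values]


def sliding_window_hits(positions, significant, window_bp, min_hits):
    """Find clusters of significant markers.

    positions must be sorted in ascending order (base pairs).
    Returns (first_position, last_position, count) for every window of width
    window_bp that contains at least min_hits significant markers.
    """
    sig_pos = [pos for pos, flag in zip(positions, significant) if flag]
    windows = []
    left = 0
    for right in range(len(sig_pos)):
        # Shrink the window from the left until it fits within window_bp.
        while sig_pos[right] - sig_pos[left] > window_bp:
            left += 1
        count = right - left + 1
        if count >= min_hits:
            windows.append((sig_pos[left], sig_pos[right], count))
    return windows


positions = [100, 200, 300, 5000, 9000]
p_values = [0.001, 0.002, 0.004, 0.5, 0.003]
flags = bonferroni(p_values)
clusters = sliding_window_hits(positions, flags, window_bp=500, min_hits=3)
```

With these toy values the three markers at positions 100 to 300 form the only cluster; the isolated significant marker at 9000 is not reported.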

3.
MOTIVATION: Selecting SNP markers for genome-wide association studies is an important and challenging task. The goal is to minimize the number of markers selected for genotyping in a particular platform and therefore reduce genotyping cost while simultaneously maximizing the information content provided by selected markers. RESULTS: We devised an improved algorithm for tagSNP selection using the pairwise r² criterion. We first break down large marker sets into disjoint pieces, where more exhaustive searches can replace the greedy algorithm for tagSNP selection. These exhaustive searches lead to smaller tagSNP sets being generated. In addition, our method evaluates multiple solutions that are equivalent according to the linkage disequilibrium criteria to accommodate additional constraints. Its performance was assessed using HapMap data. AVAILABILITY: A computer program named FESTA has been developed based on this algorithm. The program is freely available and can be downloaded at http://www.sph.umich.edu/csg/qin/FESTA/
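The decomposition described above can be sketched in a few lines. This is not FESTA's code: it assumes the pairwise r² values are available as a full matrix, and the function names (`components`, `greedy_tags`, `exhaustive_tags`) and the threshold are illustrative. SNPs are first split into disjoint pieces with no above-threshold r² between pieces; each piece can then be handled by a greedy pass or, when small enough, by an exhaustive search for a minimum tag set.

```python
from itertools import combinations


def components(r2, threshold):
    """Split SNP indices into disjoint pieces linked by r^2 >= threshold."""
    m = len(r2)
    seen, parts = set(), []
    for start in range(m):
        if start in seen:
            continue
        seen.add(start)
        stack, part = [start], []
        while stack:
            i = stack.pop()
            part.append(i)
            for j in range(m):
                if j not in seen and r2[i][j] >= threshold:
                    seen.add(j)
                    stack.append(j)
        parts.append(sorted(part))
    return parts


def covers(r2, tag, snp, threshold):
    """A tag covers itself and any SNP with r^2 >= threshold to it."""
    return tag == snp or r2[tag][snp] >= threshold


def greedy_tags(r2, snps, threshold):
    """Repeatedly pick the SNP that covers the most still-untagged SNPs."""
    untagged, tags = set(snps), []
    while untagged:
        best = max(sorted(snps),
                   key=lambda i: sum(covers(r2, i, j, threshold) for j in untagged))
        tags.append(best)
        untagged = {j for j in untagged if not covers(r2, best, j, threshold)}
    return tags


def exhaustive_tags(r2, snps, threshold):
    """Smallest tag set for one piece, found by trying subsets of growing size."""
    for k in range(1, len(snps) + 1):
        for cand in combinations(snps, k):
            if all(any(covers(r2, i, j, threshold) for i in cand) for j in snps):
                return list(cand)
    return []


r2 = [
    [1.0, 0.9, 0.2, 0.0, 0.0],
    [0.9, 1.0, 0.85, 0.0, 0.0],
    [0.2, 0.85, 1.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 1.0, 0.95],
    [0.0, 0.0, 0.0, 0.95, 1.0],
]
pieces = components(r2, threshold=0.8)
tags = [t for piece in pieces for t in exhaustive_tags(r2, piece, 0.8)]
```

The exhaustive search grows exponentially with piece size, which is why splitting the marker set first matters; a practical tool would fall back to the greedy pass for large pieces.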

4.
Individual genome scans tend to have low power and can produce markedly biased estimates of QTL effects. Further, the confidence interval for their location is often prohibitively large for subsequent fine mapping and positional cloning. Given that a large number of genome scans have been conducted, not to mention the large number of variables and subsets tested, it is difficult to confidently rule out type 1 error as an explanation for significant effects even when there is apparent replication in a separate data set. We adapted Empirical Bayes (EB) methods [1] to analyze data from multiple genome scans simultaneously and alleviate each of these problems while still allowing for different QTL population effects across studies. Using simulation, we investigated the effect of using the EB method to update the results of a single study of interest with data from background studies. The method had a stable confidence level over a wide range of parameters defining the background studies and increased the power to detect linkage, even when some of the background studies were null or had QTL effects at other markers. This EB method for incorporating data from multiple studies into genome scan analyses seems promising.

5.
Multiallelic short tandem repeat polymorphisms, or microsatellites, are useful markers in genome wide scans to identify chromosomal regions containing genes underlying disease loci. The biallelic single nucleotide polymorphism (SNP) can be used to fine map previously identified large candidate regions or to test functional candidate genes by association analysis. In the GenomEUtwin project the population based impact of susceptibility genes for six multifactorial traits will be studied. A genome wide panel of informative human microsatellite markers will be analyzed by fluorescent capillary electrophoresis in well characterized twin and population samples. In contrast to microsatellites, selection of the most informative panels of SNPs is hampered by imperfect data on the allele frequencies and population distribution of SNP markers in the databases. Therefore, selection of SNPs requires a substantial amount of bioinformatics, and the SNPs need to be validated experimentally in the relevant populations prior to genotyping large sample sets. In the GenomEUtwin project, large scale genotyping of SNPs will be performed using the SNPstreamUHT and MassARRAY genotyping systems that are based on the primer extension reaction principle combined with fluorescent and mass spectrometric detection, respectively. Production of the genotyping data will be a joint effort by GenomEUtwin partners at the University of Helsinki, the National Public Health Institute in Helsinki, Finland and Uppsala University, Sweden. All genotyping data will be stored in a common database established specifically for the GenomEUtwin project, from where it can be accessed by the twin research centres that provided the samples for genotyping.

6.
Computations for genome scans need to adapt to the increasing use of dense diallelic markers as well as of full-chromosome multipoint linkage analysis with either diallelic or multiallelic markers. Whereas suitable exact-computation tools are available for use with small pedigrees, equivalent exact computation for larger pedigrees remains infeasible. Markov chain Monte Carlo (MCMC)-based methods currently provide the only computationally practical option. To date, no systematic comparison of the performance of MCMC-based programs is available, nor have these programs been systematically evaluated for use with dense diallelic markers. Using simulated data, we evaluate the performance of two MCMC-based linkage-analysis programs (lm_markers from the MORGAN package and SimWalk2) under a variety of analysis conditions. Pedigrees consisted of 14, 52, or 98 individuals in 3, 5, or 6 generations, respectively, with increasing amounts of missing data in larger pedigrees. One hundred replicates of markers and trait data were simulated on a 100-cM chromosome, with up to 10 multiallelic and up to 200 diallelic markers used simultaneously for computation of multipoint LOD scores. Exact computation was available for comparison in most situations, and comparison with a perfectly informative marker or interprogram comparison was available in the remaining situations. Our results confirm the accuracy of both programs in multipoint analysis with multiallelic markers on pedigrees of varied sizes and missing-data patterns, but there are some computational differences. In contrast, for large numbers of dense diallelic markers, only the lm_markers program was able to provide accurate results within a computationally practical time. Thus, programs in the MORGAN package are the first available to provide a computationally practical option for accurate linkage analyses in genome scans with both large numbers of diallelic markers and large pedigrees.

7.
Scanning the genome for association between markers and complex diseases typically requires testing hundreds of thousands of genetic polymorphisms. Testing such a large number of hypotheses exacerbates the trade-off between power to detect meaningful associations and the chance of making false discoveries. Even before the full genome is scanned, investigators often favor certain regions on the basis of the results of prior investigations, such as previous linkage scans. The remaining regions of the genome are investigated simultaneously because genotyping is relatively inexpensive compared with the cost of recruiting participants for a genetic study and because prior evidence is rarely sufficient to rule out these regions as harboring genes with variation conferring liability (liability genes). However, the multiple testing inherent in broad genomic searches diminishes power to detect association, even for genes falling in regions of the genome favored a priori. Multiple testing problems of this nature are well suited for application of the false-discovery rate (FDR) principle, which can improve power. To enhance power further, a new FDR approach is proposed that involves weighting the hypotheses on the basis of prior data. We present a method for using linkage data to weight the association P values. Our investigations reveal that if the linkage study is informative, the procedure improves power considerably. Remarkably, the loss in power is small, even when the linkage study is uninformative. For a class of genetic models, we calculate the sample size required to obtain useful prior information from a linkage study. This inquiry reveals that, among genetic models that are seemingly equal in genetic information, some are much more promising than others for this mode of analysis.
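One common way to weight hypotheses within an FDR procedure is to divide each P value by a positive weight, with weights normalised to average 1, and then apply the Benjamini-Hochberg step-up rule. The sketch below assumes that formulation; the abstract does not say how the weights are derived from linkage data, so the weights here are simply supplied by the caller and `weighted_bh` is a hypothetical name.

```python
def weighted_bh(p_values, weights, alpha=0.05):
    """Benjamini-Hochberg applied to weighted P values p_i / w_i.

    weights must be positive; they are rescaled to have mean 1 so that
    up-weighting favoured regions is paid for by down-weighting the rest.
    Returns a list of booleans, True where the hypothesis is rejected.
    """
    m = len(p_values)
    mean_w = sum(weights) / m
    q = [p / (w / mean_w) for p, w in zip(p_values, weights)]
    order = sorted(range(m), key=lambda i: q[i])
    # Step-up rule: find the largest rank whose weighted P value
    # falls below alpha * rank / m.
    cutoff_rank = 0
    for rank, i in enumerate(order, start=1):
        if q[i] <= alpha * rank / m:
            cutoff_rank = rank
    rejected = [False] * m
    for i in order[:cutoff_rank]:
        rejected[i] = True
    return rejected


p = [0.01, 0.04, 0.03, 0.5]
unweighted = weighted_bh(p, [1.0, 1.0, 1.0, 1.0])
# Hypothetical prior: the third marker lies under a linkage peak.
weighted = weighted_bh(p, [0.5, 0.5, 2.5, 0.5])
```

In this toy case the unweighted procedure rejects only the first hypothesis, while up-weighting the third marker lets both the first and third be rejected, which illustrates how informative prior data can raise power.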

8.
OBJECTIVE: Given the cost and complexity of genome-wide scans, optimization of study design is of critical importance. Available algorithms only partially satisfy this need. We designed a software package called 'POLYMORPHISM' to meet these needs. METHODS: The program is designed to calculate linkage parameters for both 'single-point' and 'two-point' settings that are applicable also to incompletely informative microsatellite markers. In single-point analysis, the heterozygosity, polymorphism information content (PIC) and linkage information content (LIC) statistics based on marker allele frequencies are provided. In two-point analysis, joint PIC values for two markers, the conditional probability of detecting linkage phase, the frequency of double heterozygotes and the expected number of informative meioses are calculated. RESULTS: Results were obtained using S.A.G.E./DESPAIR (Design of Linkage Studies Based on Pairs of Relatives) in addition to applying this program to a Centre d'Etude du Polymorphisme pedigree-derived genotyping data set, which estimated critical parameters used in a two-stage genome scan. A single nucleotide polymorphism (SNP)-based one-stage genomic screen strategy is also considered. CONCLUSIONS: LIC values are crucial for obtaining accurate estimates of those parameters that are important for a two-stage genome screening study. Optimization of the cost-effectiveness of an SNP-based genomic screen strategy is possible by modeling a balance between marker information content and marker density.
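Two of the single-point statistics mentioned above have standard closed forms in the allele frequencies p_i: expected heterozygosity H = 1 - Σ p_i², and PIC = 1 - Σ p_i² - Σ_{i<j} 2 p_i² p_j². The sketch below computes both; it is not taken from the POLYMORPHISM package, the function names are illustrative, and LIC is omitted because the abstract does not define it.

```python
from itertools import combinations


def heterozygosity(freqs):
    """Expected heterozygosity: 1 minus the sum of squared allele frequencies."""
    return 1.0 - sum(p * p for p in freqs)


def pic(freqs):
    """Polymorphism information content for one marker.

    Heterozygosity minus the correction term sum over i<j of 2 * p_i^2 * p_j^2.
    """
    cross = sum(2 * (p * p) * (q * q) for p, q in combinations(freqs, 2))
    return heterozygosity(freqs) - cross


snp = [0.5, 0.5]                          # biallelic marker at equal frequencies
microsatellite = [0.25, 0.25, 0.25, 0.25]  # four equally frequent alleles
```

For the equal-frequency SNP these give H = 0.5 and PIC = 0.375; for the four-allele marker H = 0.75 and PIC = 0.703125, which shows why a single multiallelic marker is more informative than a single SNP and why SNP-based screens compensate with higher marker density.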

9.

Background

The identification of disease-associated genes using single nucleotide polymorphisms (SNPs) has been increasingly reported. In particular, the Affymetrix Mapping 10 K SNP microarray platform uses one PCR primer to amplify the DNA samples and determine the genotype of more than 10,000 SNPs in the human genome. This provides the opportunity for large scale, rapid and cost-effective genotyping assays for linkage analysis. However, the analysis of such datasets is nontrivial because of the large number of markers, and visualizing the linkage scores in the context of genome maps remains less automated using the current linkage analysis software packages. For example, the haplotyping results are commonly represented in the text format.

Results

Here we report the development of a novel software tool called CompareLinkage for automated formatting of the Affymetrix Mapping 10 K genotype data into the "Linkage" format and the subsequent analysis with multi-point linkage software programs such as Merlin and Allegro. The new software has the ability to visualize the results for all these programs in dChip in the context of genome annotations and cytoband information. In addition we implemented a variant of the Lander-Green algorithm in the dChipLinkage module of dChip software (V1.3) to perform parametric linkage analysis and haplotyping of SNP array data. These functions are integrated with the existing modules of dChip to visualize SNP genotype data together with LOD score curves. We have analyzed three families with recessive and dominant diseases using the new software programs and the comparison results are presented and discussed.

Conclusions

The CompareLinkage and dChipLinkage software packages are freely available. They provide the visualization tools for high-density oligonucleotide SNP array data, as well as the automated functions for formatting SNP array data for the linkage analysis programs Merlin and Allegro and calling these programs for linkage analysis. The results can be visualized in dChip in the context of genes and cytobands. In addition, a variant of the Lander-Green algorithm is provided that allows parametric linkage analysis and haplotyping.

10.
MOTIVATION: Haplotype reconstruction is an essential step in genetic linkage and association studies. Although many methods have been developed to estimate haplotype frequencies and reconstruct haplotypes for a sample of unrelated individuals, haplotype reconstruction in large pedigrees with a large number of genetic markers remains a challenging problem. METHODS: We have developed an efficient computer program, HAPLORE (HAPLOtype REconstruction), to identify all haplotype sets that are compatible with the observed genotypes in a pedigree for tightly linked genetic markers. HAPLORE consists of three steps that can serve different needs in applications. In the first step, a set of logic rules is used to reduce the number of compatible haplotypes of each individual in the pedigree as much as possible. After this step, the haplotypes of all individuals in the pedigree can be completely or partially determined. These logic rules are applicable to completely linked markers and they can be used to impute missing data and check genotyping errors. In the second step, a haplotype-elimination algorithm similar to the genotype-elimination algorithms used in linkage analysis is applied to delete incompatible haplotypes derived from the first step. All superfluous haplotypes of the pedigree members will be excluded after this step. In the third step, the expectation-maximization (EM) algorithm combined with the partition and ligation technique is used to estimate haplotype frequencies based on the inferred haplotype configurations through the first two steps. Only compatible haplotype configurations with haplotypes having frequencies greater than a threshold are retained. RESULTS: We test the effectiveness and the efficiency of HAPLORE using both simulated and real datasets. Our results show that the rule-based algorithm is very efficient for completely genotyped pedigrees. In this case, almost all of the families have one unique haplotype configuration. In the presence of missing data, the number of compatible haplotypes can be substantially reduced by HAPLORE, and the program will provide all possible haplotype configurations of a pedigree under different circumstances, if such multiple configurations exist. These inferred haplotype configurations, as well as the haplotype frequencies estimated by the EM algorithm, can be used in genetic linkage and association studies. AVAILABILITY: The program can be downloaded from http://bioinformatics.med.yale.edu.

11.
Understanding the genetics of biological diversification across micro- and macro-evolutionary time scales is a vibrant field of research for molecular ecologists as rapid advances in sequencing technologies promise to overcome former limitations. In palms, an emblematic, economically and ecologically important plant family with high diversity in the tropics, studies of diversification at the population and species levels are still hampered by a lack of genomic markers suitable for the genotyping of large numbers of recently diverged taxa. To fill this gap, we used a whole genome sequencing approach to develop target sequencing for molecular markers in 4,184 genome regions, including 4,051 genes and 133 non-genic putatively neutral regions. These markers were chosen to cover a wide range of evolutionary rates allowing future studies at the family, genus, species and population levels. Special emphasis was given to the avoidance of copy number variation during marker selection. In addition, a set of 149 well-known sequence regions previously used as phylogenetic markers by the palm biological research community were included in the target regions, opening the possibility of combining and jointly analysing already available data sets with genomic data to be produced with this new toolkit. The bait set was effective for species belonging to all three palm sub-families tested (Arecoideae, Ceroxyloideae and Coryphoideae), with high mapping rates, specificity and efficiency. The number of high-quality single nucleotide polymorphisms (SNPs) detected at both the sub-family and population levels facilitates efficient analyses of genomic diversity across micro- and macro-evolutionary time scales.

12.
The extremes of phenotype displayed by the domestic dog, as well as the largest number of naturally occurring inherited diseases in any mammalian species except man (>450), have generated a large interest in genomic linkage mapping in the species. Marker sets for linkage mapping should ideally show both high levels of polymorphism among the target group of animals and an even spacing of markers across the whole genome. Currently a microsatellite marker set known as Minimal Screening Set 2 (MSS2) is widely used. Here, we have extended this marker set by filling in gaps as noted from the marker positions in the CanFam genome assembly (1.0) and the 5000cR radiation hybrid (RH) map. An additional 183 markers have been positioned to increase the coverage of the MSS2 set wherever it contains a gap >9 Mb or 1000(5000) RH units. We have called the marker set derived from the MSS2 set and these 183 markers MSS3. The average physical spacing of markers in the complete 507-marker MSS3 set is 5 Mb, whereas the average heterozygosity of the 183 new markers on a panel of 10 dogs of differing breeds is 0.74. This marker group will allow genome-wide scans in the dog to be conducted at close to 5 cM resolution.

13.
Mapping by admixture linkage disequilibrium (MALD) is a potentially powerful technique for the mapping of complex genetic diseases. The practical requirements of this method include (a) a set of markers spanning the genome that have large allele-frequency differences between the parental ethnicities contributing to the admixed population and (b) an understanding of the extent of admixture in the study population. To this end, a DNA-pooling technique was used to screen microsatellite and diallelic insertion/deletion markers for allele-frequency differences between putative representatives of the parental populations of the admixed Mexican American (MA) and African American (AA) populations. Markers with promising pooled differences were then confirmed by individual genotyping in both the parental and admixed populations. For the MA population, screening of >600 markers identified 151 ethnic-difference markers (EDMs) with delta>0.30 (where delta is the absolute value of each allele-frequency difference between two populations, summed over all marker alleles and divided by two) that are likely to be useful for MALD analysis. For the AA population, analysis of >400 markers identified 97 EDMs. In addition, individual genotyping of these markers in Pima Amerindians, Yavapai Amerindians, European American (EA) individuals, Africans from Zimbabwe, MA individuals, and AA individuals, as well as comparison to the CEPH genotyping set, suggests that the differences between subpopulations of an ethnicity are small for many markers with large interethnic differences. Estimates of admixture that are based on individual genotyping of these markers are consistent with a 60% EA:40% Amerindian contribution to MA populations and with a 20% EA:80% African contribution to AA populations. Taken together, these data suggest that EDMs with large interpopulation and small intrapopulation differences can be readily identified for MALD studies in both AA and MA populations.
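The delta statistic defined in the abstract (half the sum of absolute allele-frequency differences over all alleles of a marker) is simple to compute. The sketch below follows that definition and applies the 0.30 cut-off quoted above; the function names and the example frequencies are made up for illustration.

```python
def delta(freqs_a, freqs_b):
    """Half the summed absolute allele-frequency difference between two populations.

    Each argument maps allele name -> frequency; an allele missing from one
    population is treated as having frequency 0 there.
    """
    alleles = set(freqs_a) | set(freqs_b)
    return sum(abs(freqs_a.get(a, 0.0) - freqs_b.get(a, 0.0)) for a in alleles) / 2


def is_edm(freqs_a, freqs_b, threshold=0.30):
    """Flag a marker as an ethnic-difference marker when delta exceeds the threshold."""
    return delta(freqs_a, freqs_b) > threshold


# Hypothetical diallelic marker in two parental populations.
pop_a = {"A1": 0.8, "A2": 0.2}
pop_b = {"A1": 0.3, "A2": 0.7}
d = delta(pop_a, pop_b)
```

For a diallelic marker delta reduces to the absolute difference in the frequency of either allele (0.5 in this example), and it reaches 1 when the two populations share no alleles at all.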

14.
This paper presents a method of performing model-free LOD-score based linkage analysis on quantitative traits. It is implemented in the QMFLINK program. The method is used to perform a genome screen on the Framingham Heart Study data. A number of markers that show some support for linkage in our study coincide substantially with those implicated in other linkage studies of hypertension. Although the new method needs further testing on additional real and simulated data sets, we can already say that it is straightforward to apply and may offer a useful complementary approach to previously available methods for the linkage analysis of quantitative traits.

15.
Selective genotyping concerns the genotyping of a portion of individuals chosen on the basis of their phenotypic values. Often individuals are selected for genotyping from the high and low extremes of the phenotypic distribution. This procedure yields savings in cost and time by decreasing the total number of individuals genotyped. Previous work by Darvasi et al. (1993) has shown that the power to detect a QTL by genotyping 40-50% of a population is roughly equivalent to genotyping the entire sample. However, these power studies have not accounted for different strategies of analysing the data when phenotypes of individuals in the middle are excluded, nor have they investigated the genome-wide type I error rate under these different strategies or different selection percentages. Further, these simulation studies have not considered markers over the entire genome. In this paper, we present simulation studies of power for the maximum likelihood approach to QTL mapping by Lander & Botstein (1989) in the context of selective genotyping. We calculate the power of selectively genotyping the individuals from the middle of the phenotypic distribution when performing QTL mapping over the whole mouse genome.
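Selection from the two phenotypic extremes, as described at the start of the abstract, can be sketched as below. This is only an illustration of the sampling step, not of the power analysis; `select_extremes` is a hypothetical helper and the split of the selected fraction equally between the two tails is an assumption.

```python
def select_extremes(phenotypes, fraction):
    """Indices of individuals in the lower and upper tails of the phenotype.

    fraction is the total share of the sample to genotype (for example 0.4),
    split equally between the low and high extremes.
    """
    n = len(phenotypes)
    k = int(n * fraction / 2)  # number taken from each tail
    order = sorted(range(n), key=lambda i: phenotypes[i])
    return sorted(order[:k] + order[n - k:])


phenotypes = [5.1, 2.0, 9.3, 4.4, 7.7, 1.2, 6.0, 8.8, 3.3, 5.5]
chosen = select_extremes(phenotypes, fraction=0.4)
```

With ten individuals and a 40% selection fraction, the two lowest and two highest phenotypes are chosen for genotyping and the six individuals in the middle are left ungenotyped.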

16.
Large-scale whole genome association studies are increasingly common, due in large part to recent advances in genotyping technology. With this change in paradigm for genetic studies of complex diseases, it is vital to develop valid, powerful, and efficient statistical tools and approaches to evaluate such data. Despite a dramatic drop in genotyping costs, it is still expensive to genotype thousands of individuals for hundreds of thousands of single nucleotide polymorphisms (SNPs) for large-scale whole genome association studies. A multi-stage (or two-stage) design has been a promising alternative: in the first stage, only a fraction of samples are genotyped and tested using a dense set of SNPs, and only a small subset of markers that show moderate associations with the disease will be genotyped in later stages. Multi-stage designs have also been used in candidate gene association studies, usually in regions that have shown strong signals by linkage studies. To decide which set of SNPs will be genotyped in the next stage, a common practice is to utilize a simple test (such as a χ² test for case-control data) and a liberal significance level without corrections for multiple testing, to ensure that no true signals will be filtered out. In this paper, I have developed a novel SNP selection procedure within the framework of multi-stage designs. Based on data from stage 1, the method explicitly explores correlations (linkage disequilibrium) among SNPs and their possible interactions in determining the disease phenotype. Compared with a regular multi-stage design, the approach can select a much reduced set of SNPs with high discriminative power for later stages. Therefore, not only does it reduce the genotyping cost in later stages, it also increases the statistical power by reducing the number of tests. Combined analysis is proposed to further improve power, and the theoretical significance level of the combined statistic is derived. Extensive simulations have been performed, and results have shown that the procedure can reduce the number of SNPs required in later stages, with improved power to detect associations. The procedure has also been applied to a real data set from a genome-wide association study of the sporadic amyotrophic lateral sclerosis (ALS) disease, and an interesting set of candidate SNPs has been identified.

17.
Premise of the study: Discrepancies in terms of genotyping data are frequently observed when comparing simple sequence repeat (SSR) data sets across genotyping technologies and laboratories. This technical concern introduces biases that hamper any synthetic studies or comparison of genetic diversity between collections. To prevent this for Sorghum bicolor, we developed a control kit of 48 SSR markers. Methods and Results: One hundred seventeen markers were selected along the genome to provide coverage across the length of all 10 sorghum linkage groups. They were tested for polymorphism and reproducibility across two laboratories (Centre de Coopération Internationale en Recherche Agronomique pour le Développement [CIRAD], France, and International Crops Research Institute for the Semi-Arid Tropics [ICRISAT], India) using two commonly used genotyping technologies (polyacrylamide gel-based technology with LI-COR sequencing machines and capillary systems with ABI sequencing apparatus) with DNA samples from a diverse set of 48 S. bicolor accessions. Conclusions: A kit for diversity analysis (http://sat.cirad.fr/sat/sorghum_SSR_kit/) was developed. It contains information on 48 technically robust sorghum microsatellite markers and 10 DNA controls. It can further be used to calibrate sorghum SSR genotyping data acquired with different technologies and compare those to genetic diversity references.

18.
Siberian stone pine, Pinus sibirica Du Tour, is one of the most economically and environmentally important forest-forming species of conifers in Russia. To study these forests a large number of highly polymorphic molecular genetic markers, such as microsatellite loci, are required. Prior to the new high-throughput next generation sequencing (NGS) methods, discovery of microsatellite loci and development of microsatellite markers were very time consuming and laborious. The recently developed draft assembly of the Siberian stone pine genome, sequenced using the NGS methods, allowed us to identify a large number of microsatellite loci in the Siberian stone pine genome and to develop species-specific PCR primers for amplification and genotyping of 70 microsatellite loci. The primers were designed using contigs containing short simple sequence tandem repeats from the Siberian stone pine whole genome draft assembly. Based on the testing of primers for 70 microsatellite loci with tri-, tetra- or pentanucleotide repeats, the 18 most promising, reliable and polymorphic loci were selected that can be used further as molecular genetic markers in population genetic studies of Siberian stone pine.

19.
Robust estimation of allele frequencies in pools of DNA has the potential to reduce genotyping costs and/or increase the number of individuals contributing to a study where hundreds of thousands of genetic markers need to be genotyped in very large population sample sets, such as genome wide association studies. In order to make accurate allele frequency estimations from pooled samples a correction for unequal allele representation must be applied. We have developed the polynomial based probe specific correction (PPC), which is a novel correction algorithm for accurate estimation of allele frequencies in data from high-density microarrays. This algorithm was validated through comparison of allele frequencies from a set of 10 individually genotyped DNAs and frequencies estimated from pools of these 10 DNAs using GeneChip 10K Mapping Xba 131 arrays. Our results demonstrate that when using the PPC to correct for allelic biases the accuracy of the allele frequency estimates increases dramatically.

20.
Understanding how and why populations evolve is of fundamental importance to molecular ecology. Restriction site-associated DNA sequencing (RADseq), a popular reduced representation method, has ushered in a new era of genome-scale research for assessing population structure, hybridization, demographic history, phylogeography and migration. RADseq has also been widely used to conduct genome scans to detect loci involved in adaptive divergence among natural populations. Here, we examine the capacity of those RADseq-based genome scan studies to detect loci involved in local adaptation. To understand what proportion of the genome is missed by RADseq studies, we developed a simple model using different numbers of RAD-tags, genome sizes and extents of linkage disequilibrium (length of haplotype blocks). Under the best-case modelling scenario, we found that RADseq using six- or eight-base pair cutting restriction enzymes would fail to sample many regions of the genome, especially for species with short linkage disequilibrium. We then surveyed recent studies that have used RADseq for genome scans and found that the median density of markers across these studies was 4.08 RAD-tag markers per megabase (one marker per 245 kb). The length of linkage disequilibrium for many species is one to three orders of magnitude less than the marker spacing of the typical recent RADseq study. Thus, we conclude that genome scans based on RADseq data alone, while useful for studies of neutral genetic variation and genetic population structure, will likely miss many loci under selection in studies of local adaptation.
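The arithmetic behind the quoted density, and a rough best-case coverage estimate, can be written out as follows. The coverage formula is an assumption made for illustration, not the authors' model: it supposes that every RAD-tag samples one haplotype block of fixed length and that no two tags fall in the same block. Both function names and the example numbers in the second call are hypothetical.

```python
def marker_spacing_bp(markers_per_mb):
    """Average distance between markers, in base pairs."""
    return 1_000_000 / markers_per_mb


def best_case_coverage(n_tags, ld_block_bp, genome_size_bp):
    """Upper bound on the genome fraction sampled by RAD-tags.

    Assumes each tag tags exactly one LD block of length ld_block_bp
    and that tagged blocks never overlap.
    """
    return min(1.0, n_tags * ld_block_bp / genome_size_bp)


spacing = marker_spacing_bp(4.08)  # roughly 245 kb, as quoted in the abstract
# Hypothetical species: 10,000 tags, 10 kb LD blocks, 1 Gb genome.
coverage = best_case_coverage(10_000, 10_000, 1_000_000_000)
```

Under these assumed numbers at most a tenth of the genome lies within LD of a marker, which is the kind of shortfall the abstract describes for species with short haplotype blocks.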
