首页 | 本学科首页   官方微博 | 高级检索  
相似文献
 共查询到20条相似文献,搜索用时 31 毫秒
1.
Multiple algorithms have been developed for the purpose of calling single nucleotide polymorphisms (SNPs) from Affymetrix microarrays. We extend and validate the algorithm CRLMM, which incorporates HapMap information within an empirical Bayes framework. We find CRLMM to be more accurate than the Affymetrix default programs (BRLMM and Birdseed). Also, we tie our call confidence metric to percent accuracy. We intend that our validation datasets and methods, refered to as SNPaffycomp, serve as standard benchmarks for future SNP calling algorithms.  相似文献   

2.
Current genotype-calling methods such as Robust Linear Model with Mahalanobis Distance Classifier (RLMM) and Corrected Robust Linear Model with Maximum Likelihood Classification (CRLMM) provide accurate calling results for Affymetrix Single Nucleotide Polymorphisms (SNP) chips. However, these methods are computationally expensive as they employ preprocess procedures, including chip data normalization and other sophisticated statistical techniques. In the small sample case the accuracy rate may drop significantly. We develop a new genotype calling method for Affymetrix 100 k and 500 k SNP chips. A two-stage classification scheme is proposed to obtain a fast genotype calling algorithm. The first stage uses unsupervised classification to quickly discriminate genotypes with high accuracy for the majority of the SNPs. And the second stage employs a supervised classification method to incorporate allele frequency information either from the HapMap data or from a self-training scheme. Confidence score is provided for every genotype call. The overall performance is shown to be comparable to that of CRLMM as verified by the known gold standard HapMap data and is superior in small sample cases. The new algorithm is computationally simple and standalone in the sense that a self-training scheme can be used without employing any other training data. A package implementing the calling algorithm is freely available at http://www.sfs.ecnu.edu.cn/teachers/xuj_en.html.  相似文献   

3.
MOTIVATION: The technology to genotype single nucleotide polymorphisms (SNPs) at extremely high densities provides for hypothesis-free genome-wide scans for common polymorphisms associated with complex disease. However, we find that some errors introduced by commonly employed genotyping algorithms may lead to inflation of false associations between markers and phenotype. RESULTS: We have developed a novel SNP genotype calling program, SNiPer-High Density (SNiPer-HD), for highly accurate genotype calling across hundreds of thousands of SNPs. The program employs an expectation-maximization (EM) algorithm with parameters based on a training sample set. The algorithm choice allows for highly accurate genotyping for most SNPs. Also, we introduce a quality control metric for each assayed SNP, such that poor-behaving SNPs can be filtered using a metric correlating to genotype class separation in the calling algorithm. SNiPer-HD is superior to the standard dynamic modeling algorithm and is complementary and non-redundant to other algorithms, such as BRLMM. Implementing multiple algorithms together may provide highly accurate genotyping calls, without inflation of false positives due to systematically miss-called SNPs. A reliable and accurate set of SNP genotypes for increasingly dense panels will eliminate some false association signals and false negative signals, allowing for rapid identification of disease susceptibility loci for complex traits. AVAILABILITY: SNiPer-HD is available at TGen's website: http://www.tgen.org/neurogenomics/data.  相似文献   

4.

Background  

Illumina's Infinium SNP BeadChips are extensively used in both small and large-scale genetic studies. A fundamental step in any analysis is the processing of raw allele A and allele B intensities from each SNP into genotype calls (AA, AB, BB). Various algorithms which make use of different statistical models are available for this task. We compare four methods (GenCall, Illuminus, GenoSNP and CRLMM) on data where the true genotypes are known in advance and data from a recently published genome-wide association study.  相似文献   

5.
Genome-wide association studies (GWAS) examine the entire human genome with the goal of identifying genetic variants (usually single nucleotide polymorphisms (SNPs)) that are associated with phenotypic traits such as disease status and drug response. The discordance of significantly associated SNPs for the same disease identified from different GWAS indicates that false associations exist in such results. In addition to the possible sources of spurious associations that have been investigated and discussed intensively, such as sample size and population stratification, an accurate and reproducible genotype calling algorithm is required for concordant GWAS results from different studies. However, variations of genotype calling of an algorithm and their effects on significantly associated SNPs identified in downstream association analyses have not been systematically investigated. In this paper, the variations of genotype calling using the Bayesian Robust Linear Model with Mahalanobis distance classifier (BRLMM) algorithm and the resulting influence on the lists of significantly associated SNPs were evaluated using the raw data of 270 HapMap samples analysed with the Affymetrix Human Mapping 500K Array Set (Affy500K) by changing algorithmic parameters. Modified were the Dynamic Model (DM) call confidence threshold (threshold) and the number of randomly selected SNPs (size). Comparative analysis of the calling results and the corresponding lists of significantly associated SNPs identified through association analysis revealed that algorithmic parameters used in BRLMM affected the genotype calls and the significantly associated SNPs. Both the threshold and the size affected the called genotypes and the lists of significantly associated SNPs in association analysis. The effect of the threshold was much larger than the effect of the size. Moreover, the heterozygous calls had lower consistency compared to the homozygous calls.  相似文献   

6.
Recent advances in nanofluidic technologies have enabled the use of Integrated Fluidic Circuits (IFCs) for high-throughput Single Nucleotide Polymorphism (SNP) genotyping (GT). In this study, we implemented and validated a relatively low cost nanofluidic system for SNP-GT with and without Specific Target Amplification (STA). As proof of principle, we first validated the effect of input DNA copy number on genotype call rate using well characterised, digital PCR (dPCR) quantified human genomic DNA samples and then implemented the validated method to genotype 45 SNPs in the humpback whale, Megaptera novaeangliae, nuclear genome. When STA was not incorporated, for a homozygous human DNA sample, reaction chambers containing, on average 9 to 97 copies, showed 100% call rate and accuracy. Below 9 copies, the call rate decreased, and at one copy it was 40%. For a heterozygous human DNA sample, the call rate decreased from 100% to 21% when predicted copies per reaction chamber decreased from 38 copies to one copy. The tightness of genotype clusters on a scatter plot also decreased. In contrast, when the same samples were subjected to STA prior to genotyping a call rate and a call accuracy of 100% were achieved. Our results demonstrate that low input DNA copy number affects the quality of data generated, in particular for a heterozygous sample. Similar to human genomic DNA, a call rate and a call accuracy of 100% was achieved with whale genomic DNA samples following multiplex STA using either 15 or 45 SNP-GT assays. These calls were 100% concordant with their true genotypes determined by an independent method, suggesting that the nanofluidic system is a reliable platform for executing call rates with high accuracy and concordance in genomic sequences derived from biological tissue.  相似文献   

7.
A genotype calling algorithm for the Illumina BeadArray platform   总被引:2,自引:0,他引:2  
MOTIVATION: Large-scale genotyping relies on the use of unsupervised automated calling algorithms to assign genotypes to hybridization data. A number of such calling algorithms have been recently established for the Affymetrix GeneChip genotyping technology. Here, we present a fast and accurate genotype calling algorithm for the Illumina BeadArray genotyping platforms. As the technology moves towards assaying millions of genetic polymorphisms simultaneously, there is a need for an integrated and easy-to-use software for calling genotypes. RESULTS: We have introduced a model-based genotype calling algorithm which does not rely on having prior training data or require computationally intensive procedures. The algorithm can assign genotypes to hybridization data from thousands of individuals simultaneously and pools information across multiple individuals to improve the calling. The method can accommodate variations in hybridization intensities which result in dramatic shifts of the position of the genotype clouds by identifying the optimal coordinates to initialize the algorithm. By incorporating the process of perturbation analysis, we can obtain a quality metric measuring the stability of the assigned genotype calls. We show that this quality metric can be used to identify SNPs with low call rates and accuracy. AVAILABILITY: The C++ executable for the algorithm described here is available by request from the authors.  相似文献   

8.
Liu W  Zhao W  Chase GA 《Human heredity》2006,61(1):31-44
OBJECTIVE: Single nucleotide polymorphisms (SNPs) serve as effective markers for localizing disease susceptibility genes, but current genotyping technologies are inadequate for genotyping all available SNP markers in a typical linkage/association study. Much attention has recently been paid to methods for selecting the minimal informative subset of SNPs in identifying haplotypes, but there has been little investigation of the effect of missing or erroneous genotypes on the performance of these SNP selection algorithms and subsequent association tests using the selected tagging SNPs. The purpose of this study is to explore the effect of missing genotype or genotyping error on tagging SNP selection and subsequent single marker and haplotype association tests using the selected tagging SNPs. METHODS: Through two sets of simulations, we evaluated the performance of three tagging SNP selection programs in the presence of missing or erroneous genotypes: Clayton's diversity based program htstep, Carlson's linkage disequilibrium (LD) based program ldSelect, and Stram's coefficient of determination based program tagsnp.exe. RESULTS: When randomly selected known loci were relabeled as 'missing', we found that the average number of tagging SNPs selected by all three algorithms changed very little and the power of subsequent single marker and haplotype association tests using the selected tagging SNPs remained close to the power of these tests in the absence of missing genotype. When random genotyping errors were introduced, we found that the average number of tagging SNPs selected by all three algorithms increased. In data sets simulated according to the haplotype frequecies in the CYP19 region, Stram's program had larger increase than Carlson's and Clayton's programs. In data sets simulated under the coalescent model, Carlson's program had the largest increase and Clayton's program had the smallest increase. In both sets of simulations, with the presence of genotyping errors, the power of the haplotype tests from all three programs decreased quickly, but there was not much reduction in power of the single marker tests. CONCLUSIONS: Missing genotypes do not seem to have much impact on tagging SNP selection and subsequent single marker and haplotype association tests. In contrast, genotyping errors could have severe impact on tagging SNP selection and haplotype tests, but not on single marker tests.  相似文献   

9.
High-throughput SNP genotyping platforms use automated genotype calling algorithms to assign genotypes. While these algorithms work efficiently for individual platforms, they are not compatible with other platforms, and have individual biases that result in missed genotype calls. Here we present data on the use of a second complementary SNP genotype clustering algorithm. The algorithm was originally designed for individual fluorescent SNP genotyping assays, and has been optimized to permit the clustering of large datasets generated from custom-designed Affymetrix SNP panels. In an analysis of data from a 3K array genotyped on 1,560 samples, the additional analysis increased the overall number of genotypes by over 45,000, significantly improving the completeness of the experimental data. This analysis suggests that the use of multiple genotype calling algorithms may be advisable in high-throughput SNP genotyping experiments. The software is written in Perl and is available from the corresponding author.  相似文献   

10.

Background  

Human genome contains millions of common single nucleotide polymorphisms (SNPs) and these SNPs play an important role in understanding the association between genetic variations and human diseases. Many SNPs show correlated genotypes, or linkage disequilibrium (LD), thus it is not necessary to genotype all SNPs for association study. Many algorithms have been developed to find a small subset of SNPs called tag SNPs that are sufficient to infer all the other SNPs. Algorithms based on the r 2 LD statistic have gained popularity because r 2 is directly related to statistical power to detect disease associations. Most of existing r 2 based algorithms use pairwise LD. Recent studies show that multi-marker LD can help further reduce the number of tag SNPs. However, existing tag SNP selection algorithms based on multi-marker LD are both time-consuming and memory-consuming. They cannot work on chromosomes containing more than 100 k SNPs using length-3 tagging rules.  相似文献   

11.
Single nucleotide polymorphisms (SNPs), due to their abundance and low mutation rate, are very useful genetic markers for genetic association studies. However, the current genotyping technology cannot afford to genotype all common SNPs in all the genes. By making use of linkage disequilibrium, we can reduce the experiment cost by genotyping a subset of SNPs, called Tag SNPs, which have a strong association with the ungenotyped SNPs, while are as independent from each other as possible. The problem of selecting Tag SNPs is NP-complete; when there are large number of SNPs, in order to avoid extremely long computational time, most of the existing Tag SNP selection methods first partition the SNPs into blocks based on certain block definitions, then Tag SNPs are selected in each block by brute-force search. The size of the Tag SNP set obtained in this way may usually be reduced further due to the inter-dependency among blocks. This paper proposes two algorithms, TSSA and TSSD, to tackle the block-independent Tag SNP selection problem. TSSA is based on A* search algorithm, and TSSD is a heuristic algorithm. Experiments show that TSSA can find the optimal solutions for medium-sized problems in reasonable time, while TSSD can handle very large problems and report approximate solutions very close to the optimal ones.  相似文献   

12.
MOTIVATION: Preliminary results on the data produced using the Affymetrix large-scale genotyping platforms show that it is necessary to construct improved genotype calling algorithms. There is evidence that some of the existing algorithms lead to an increased error rate in heterozygous genotypes, and a disproportionately large rate of heterozygotes with missing genotypes. Non-random errors and missing data can lead to an increase in the number of false discoveries in genetic association studies. Therefore, the factors that need to be evaluated in assessing the performance of an algorithm are the missing data (call) and error rates, but also the heterozygous proportions in missing data and errors. RESULTS: We introduce a novel genotype calling algorithm (GEL) for the Affymetrix GeneChip arrays. The algorithm uses likelihood calculations that are based on distributions inferred from the observed data. A key ingredient in accurate genotype calling is weighting the information that comes from each probe quartet according to the quality/reliability of the data in the quartet, and prior information on the performance of the quartet. AVAILABILITY: The GEL software is implemented in R and is available by request from the corresponding author at nicolae@galton.uchicago.edu.  相似文献   

13.
14.
Genotyping and annotation of Affymetrix SNP arrays   总被引:1,自引:0,他引:1  
In this paper we develop a new method for genotyping Affymetrix single nucleotide polymorphism (SNP) array. The method is based on (i) using multiple arrays at the same time to determine the genotypes and (ii) a model that relates intensities of individual SNPs to each other. The latter point allows us to annotate SNPs that have poor performance, either because of poor experimental conditions or because for one of the alleles the probes do not behave in a dose–response manner. Generally, our method agrees well with a method developed by Affymetrix. When both methods make a call they agree in 99.25% (using standard settings) of the cases, using a sample of 113 Affymetrix 10k SNP arrays. In the majority of cases where the two methods disagree, our method makes a genotype call, whereas the method by Affymetrix makes a no call, i.e. the genotype of the SNP is not determined. By visualization it is indicated that our method is likely to be correct in majority of these cases. In addition, we demonstrate that our method produces more SNPs that are in concordance with Hardy–Weinberg equilibrium than the method by Affymetrix. Finally, we have validated our method on HapMap data and shown that the performance of our method is comparable to other methods.  相似文献   

15.
Massively parallel sequencing (MPS), since its debut in 2005, has transformed the field of genomic studies. These new sequencing technologies have resulted in the successful identification of causal variants for several rare Mendelian disorders. They have also begun to deliver on their promise to explain some of the missing heritability from genome-wide association studies (GWAS) of complex traits. We anticipate a rapidly growing number of MPS-based studies for a diverse range of applications in the near future. One crucial and nearly inevitable step is to detect SNPs and call genotypes at the detected polymorphic sites from the sequencing data. Here, we review statistical methods that have been proposed in the past five years for this purpose. In addition, we discuss emerging issues and future directions related to SNP detection and genotype calling from MPS data.  相似文献   

16.
High-throughput SNP genotyping platforms use automated genotype calling algo- rithms to assign genotypes. While these algorithms work efficiently for individual platforms, they are not compatible with other platforms, and have individual biases that result in missed genotype calls. Here we present data on the use of a second complementary SNP genotype clustering algorithm. The algorithm was originally designed for individual fluorescent SNP genotyping assays, and has been opti- mized to permit the clustering of large datasets generated from custom-designed Affymetrix SNP panels. In an analysis of data from a 3K array genotyped on 1,560 samples, the additional analysis increased the overall number of genotypes by over 45,000, significantly improving the completeness of the experimental data. This analysis suggests that the use of multiple genotype calling algorithms may be ad- visable in high-throughput SNP genotyping experiments. The software is written in Perl and is available from the corresponding author.  相似文献   

17.
Hao K  Cawley S 《Human heredity》2007,63(3-4):219-228
BACKGROUND: Current biotechnologies are able to achieve high accuracy and call rates. Concerns are raised on how differential performance on various genotypes may bias association tests. Quantitatively, we define differential dropout rate as the ratio of no-call rate among heterozygotes and homozygotes. METHODS: The hazard ofdifferential dropout is examined for population- and family-based association tests through a simulation study. Also, we investigate detection approaches such as Hardy-Weinberg Equilibrium (HWE) and testing for correlation between sample call rate and sample heterozygosity. Finally, we analyze two public datasets and evaluate the magnitudes of differential dropout. RESULTS: In case-control settings, differential dropout has negligible effect on power and odds ratio (OR) estimation. However, the impact on family-based tests range from minor to severe depending on the disease parameters. Such impact is more prominent when disease allele frequency is relatively low (e.g., 5%), where a differential dropout rate of 2.5 can dramatically bias OR estimation and reduce power even at a decent 98% overall call rate and moderate effect size (e.g., OR(true) = 2.11). Both of the two public datasets follow HWE; however, HapMap data carries detectable differential dropout that may endanger family-based studies. CONCLUSIONS: Case-control approach appears to be robust to differential dropout; however, family-based association tests can be heavily biased. Both of the public genotype data show high call rate, but differential dropout is detected in HapMap data. We suggest researchers carefully control this potential confounder even using data of high accuracy and high overall call rate.  相似文献   

18.
High-throughput SNP genotyping platforms use automated genotype calling algo- rithms to assign genotypes. While these algorithms work efficiently for individual platforms, they are not compatible with other platforms, and have individual biases that result in missed genotype calls. Here we present data on the use of a second complementary SNP genotype clustering algorithm. The algorithm was originally designed for individual fluorescent SNP genotyping assays, and has been opti- mized to permit the clustering of large datasets generated from custom-designed Affymetrix SNP panels. In an analysis of data from a 3K array genotyped on 1,560 samples, the additional analysis increased the overall number of genotypes by over 45,000, significantly improving the completeness of the experimental data. This analysis suggests that the use of multiple genotype calling algorithms may be ad- visable in high-throughput SNP genotyping experiments. The software is written in Perl and is available from the corresponding author.  相似文献   

19.

Background

High-throughput sequencing has opened up exciting possibilities in population and conservation genetics by enabling the assessment of genetic variation at genome-wide scales. One approach to reduce genome complexity, i.e. investigating only parts of the genome, is reduced-representation library (RRL) sequencing. Like similar approaches, RRL sequencing reduces ascertainment bias due to simultaneous discovery and genotyping of single-nucleotide polymorphisms (SNPs) and does not require reference genomes. Yet, generating such datasets remains challenging due to laboratory and bioinformatical issues. In the laboratory, current protocols require improvements with regards to sequencing homologous fragments to reduce the number of missing genotypes. From the bioinformatical perspective, the reliance of most studies on a single SNP caller disregards the possibility that different algorithms may produce disparate SNP datasets.

Results

We present an improved RRL (iRRL) protocol that maximizes the generation of homologous DNA sequences, thus achieving improved genotyping-by-sequencing efficiency. Our modifications facilitate generation of single-sample libraries, enabling individual genotype assignments instead of pooled-sample analysis. We sequenced ~1% of the orangutan genome with 41-fold median coverage in 31 wild-born individuals from two populations. SNPs and genotypes were called using three different algorithms. We obtained substantially different SNP datasets depending on the SNP caller. Genotype validations revealed that the Unified Genotyper of the Genome Analysis Toolkit and SAMtools performed significantly better than a caller from CLC Genomics Workbench (CLC). Of all conflicting genotype calls, CLC was only correct in 17% of the cases. Furthermore, conflicting genotypes between two algorithms showed a systematic bias in that one caller almost exclusively assigned heterozygotes, while the other one almost exclusively assigned homozygotes.

Conclusions

Our enhanced iRRL approach greatly facilitates genotyping-by-sequencing and thus direct estimates of allele frequencies. Our direct comparison of three commonly used SNP callers emphasizes the need to question the accuracy of SNP and genotype calling, as we obtained considerably different SNP datasets depending on caller algorithms, sequencing depths and filtering criteria. These differences affected scans for signatures of natural selection, but will also exert undue influences on demographic inferences. This study presents the first effort to generate a population genomic dataset for wild-born orangutans with known population provenance.  相似文献   

20.
Single nucleotide polymorphisms (SNPs) are the most widely used marker in studies to assess associations between genetic variants and complex traits or diseases. They are also becoming increasingly important in the study of the evolution and history of humans and other species. The analysis and processing of SNPs obtained thanks to high-throughput technologies imply the time consuming and costly use of different, complex and usually format-incompatible software. SNPator is a user-friendly web-based SNP data analysis suite that integrates, among many other algorithms, the most common steps of a SNP association study. It frees the user from the need to have large computer facilities and an in depth knowledge of genetic software installation and management. Genotype data is directly read from the output files of the usual genotyping platforms. Phenotypic data on the samples can also be easily uploaded. Many different quality control and analysis procedures can be performed either by using built-in SNPator algorithms or by calling standard genetic software. AVAILABILITY: Access is granted from the SNPator webpage http://www.snpator.org.  相似文献   

设为首页 | 免责声明 | 关于勤云 | 加入收藏

Copyright©北京勤云科技发展有限公司  京ICP备09084417号