Found 20 similar articles (search time: 15 ms)
1.
2.
3.
Cari A. Schmitz Carley, Joseph J. Coombs, David S. Douches, Paul C. Bethke, Jiwan P. Palta, Richard G. Novy, Jeffrey B. Endelman 《TAG. Theoretical and applied genetics. Theoretische und angewandte Genetik》2017,130(4):717-726
Key message
New software to make tetraploid genotype calls from SNP array data was developed, which uses hierarchical clustering and multiple F1 populations to calibrate the relationship between signal intensity and allele dosage.
Abstract
SNP arrays are transforming breeding and genetics research for autotetraploids. To fully utilize these arrays, the relationship between signal intensity and allele dosage must be calibrated for each marker. We developed an improved computational method to automate this process, which is provided as the R package ClusterCall. In the training phase of the algorithm, hierarchical clustering within an F1 population is used to group samples with similar intensity values, and allele dosages are assigned to clusters based on expected segregation ratios. In the prediction phase, multiple F1 populations and the prediction set are clustered together, and the genotype for each cluster is the mode of the training set samples. A concordance metric, defined as the proportion of training set samples equal to the mode, can be used to eliminate unreliable markers and compare different algorithms. Across three potato families genotyped with an 8K SNP array, ClusterCall scored 5729 markers with at least 0.95 concordance (94.6% of its total), compared to 5325 with the software fitTetra (82.5% of its total). The three families were used to predict genotypes for 5218 SNPs in the SolCAP diversity panel, compared with 3521 SNPs in a previous study in which genotypes were called manually. One of the additional markers produced a significant association for vine maturity near a well-known causal locus on chromosome 5. In conclusion, when multiple F1 populations are available, ClusterCall is an efficient method for accurate, autotetraploid genotype calling that enables the use of SNP data for research and plant breeding.
4.
MOTIVATION: Preliminary results on the data produced using the Affymetrix large-scale genotyping platforms show that it is necessary to construct improved genotype calling algorithms. There is evidence that some of the existing algorithms lead to an increased error rate in heterozygous genotypes, and a disproportionately large rate of heterozygotes with missing genotypes. Non-random errors and missing data can lead to an increase in the number of false discoveries in genetic association studies. Therefore, the factors that need to be evaluated in assessing the performance of an algorithm are the missing data (call) and error rates, but also the heterozygous proportions in missing data and errors. RESULTS: We introduce a novel genotype calling algorithm (GEL) for the Affymetrix GeneChip arrays. The algorithm uses likelihood calculations that are based on distributions inferred from the observed data. A key ingredient in accurate genotype calling is weighting the information that comes from each probe quartet according to the quality/reliability of the data in the quartet, and prior information on the performance of the quartet. AVAILABILITY: The GEL software is implemented in R and is available by request from the corresponding author at nicolae@galton.uchicago.edu.
5.
A genotype calling algorithm for Affymetrix SNP arrays (cited by: 11 total, 0 self-citations, 11 by others)
MOTIVATION: A classification algorithm, based on a multi-chip, multi-SNP approach is proposed for Affymetrix SNP arrays. Current procedures for calling genotypes on SNP arrays process all the features associated with one chip and one SNP at a time. Using a large training sample where the genotype labels are known, we develop a supervised learning algorithm to obtain more accurate classification results on new data. The method we propose, RLMM, is based on a robustly fitted, linear model and uses the Mahalanobis distance for classification. The chip-to-chip non-biological variance is reduced through normalization. This model-based algorithm captures the similarities across genotype groups and probes, as well as across thousands of SNPs for accurate classification. In this paper, we apply RLMM to Affymetrix 100 K SNP array data, present classification results and compare them with genotype calls obtained from the Affymetrix procedure DM, as well as to the publicly available genotype calls from the HapMap project.
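The classification step named in this abstract, assigning a sample to the genotype group with the smallest Mahalanobis distance, can be illustrated with a toy sketch. This is not RLMM itself: there is no robust linear model or normalization here, the two-channel intensity pairs are invented, and the function names (`mean_and_cov`, `mahalanobis`, `classify_sample`) are hypothetical.

```python
import math


def mean_and_cov(points):
    """Return the mean and unbiased covariance (sxx, sxy, syy) of 2-D points."""
    n = len(points)
    mx = sum(p[0] for p in points) / n
    my = sum(p[1] for p in points) / n
    sxx = sum((p[0] - mx) ** 2 for p in points) / (n - 1)
    syy = sum((p[1] - my) ** 2 for p in points) / (n - 1)
    sxy = sum((p[0] - mx) * (p[1] - my) for p in points) / (n - 1)
    return (mx, my), (sxx, sxy, syy)


def mahalanobis(point, mean, cov):
    """Mahalanobis distance of a 2-D point from a group with the given mean and covariance."""
    sxx, sxy, syy = cov
    det = sxx * syy - sxy * sxy
    if det <= 0:
        raise ValueError("covariance matrix must be positive definite")
    dx = point[0] - mean[0]
    dy = point[1] - mean[1]
    # Quadratic form with the inverse of [[sxx, sxy], [sxy, syy]].
    d2 = (syy * dx * dx - 2.0 * sxy * dx * dy + sxx * dy * dy) / det
    return math.sqrt(d2)


def classify_sample(point, training):
    """Assign the genotype whose training group is nearest in Mahalanobis distance."""
    models = {g: mean_and_cov(pts) for g, pts in training.items()}
    return min(models, key=lambda g: mahalanobis(point, *models[g]))


# Invented (allele A intensity, allele B intensity) pairs with known labels.
training = {
    "AA": [(2.0, 0.2), (2.2, 0.2), (2.1, 0.3), (2.1, 0.1)],
    "AB": [(1.0, 1.0), (1.2, 1.0), (1.1, 1.1), (1.1, 0.9)],
    "BB": [(0.2, 2.0), (0.2, 2.2), (0.3, 2.1), (0.1, 2.1)],
}
print(classify_sample((2.0, 0.3), training))
```

Unlike plain Euclidean distance, the Mahalanobis distance scales each direction by the spread of the group, so an elongated genotype cloud does not lose samples along its long axis.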
6.
We present a statistical framework for estimation and application of sample allele frequency spectra from New-Generation Sequencing (NGS) data. In this method, we first estimate the allele frequency spectrum using maximum likelihood. In contrast to previous methods, the likelihood function is calculated using a dynamic programming algorithm and numerically optimized using analytical derivatives. We then use a Bayesian method for estimating the sample allele frequency in a single site, and show how the method can be used for genotype calling and SNP calling. We also show how the method can be extended to various other cases including cases with deviations from Hardy-Weinberg equilibrium. We evaluate the statistical properties of the methods using simulations and by application to a real data set.
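As a rough illustration of Bayesian genotype calling at a single site, the sketch below combines per-genotype likelihoods with a Hardy-Weinberg prior. It is a simplification and not the method of the abstract: the allele frequency is taken as a known input rather than estimated from a frequency spectrum, and the function names and the 0.95 posterior cutoff are assumptions chosen for the example.

```python
def hwe_prior(freq):
    """Prior probabilities of carrying 0, 1 or 2 copies of an allele with
    frequency `freq`, assuming Hardy-Weinberg equilibrium."""
    return [(1.0 - freq) ** 2, 2.0 * freq * (1.0 - freq), freq ** 2]


def genotype_posterior(likelihoods, freq):
    """Posterior over the three genotypes: likelihood times prior, normalized."""
    prior = hwe_prior(freq)
    joint = [lik * p for lik, p in zip(likelihoods, prior)]
    total = sum(joint)
    if total == 0:
        raise ValueError("all genotypes have zero posterior mass")
    return [j / total for j in joint]


def call_site_genotype(likelihoods, freq, min_posterior=0.95):
    """Return the most probable genotype (0, 1 or 2 allele copies), or None
    when no genotype reaches the hypothetical confidence cutoff."""
    post = genotype_posterior(likelihoods, freq)
    best = max(range(3), key=post.__getitem__)
    return best if post[best] >= min_posterior else None


# Hypothetical likelihoods P(reads | genotype) for one individual at one site.
print(call_site_genotype([0.01, 0.9, 0.01], freq=0.5))
```

Returning `None` below the cutoff makes the trade-off between call rate and error rate explicit: a stricter cutoff yields fewer but more reliable calls.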
7.
A genotype calling algorithm for the Illumina BeadArray platform (cited by: 2 total, 0 self-citations, 2 by others)
Teo YY, Inouye M, Small KS, Gwilliam R, Deloukas P, Kwiatkowski DP, Clark TG 《Bioinformatics (Oxford, England)》2007,23(20):2741-2746
MOTIVATION: Large-scale genotyping relies on the use of unsupervised automated calling algorithms to assign genotypes to hybridization data. A number of such calling algorithms have been recently established for the Affymetrix GeneChip genotyping technology. Here, we present a fast and accurate genotype calling algorithm for the Illumina BeadArray genotyping platforms. As the technology moves towards assaying millions of genetic polymorphisms simultaneously, there is a need for an integrated and easy-to-use software for calling genotypes. RESULTS: We have introduced a model-based genotype calling algorithm which does not rely on having prior training data or require computationally intensive procedures. The algorithm can assign genotypes to hybridization data from thousands of individuals simultaneously and pools information across multiple individuals to improve the calling. The method can accommodate variations in hybridization intensities which result in dramatic shifts of the position of the genotype clouds by identifying the optimal coordinates to initialize the algorithm. By incorporating the process of perturbation analysis, we can obtain a quality metric measuring the stability of the assigned genotype calls. We show that this quality metric can be used to identify SNPs with low call rates and accuracy. AVAILABILITY: The C++ executable for the algorithm described here is available by request from the authors.
8.
Current genotype-calling methods such as Robust Linear Model with Mahalanobis Distance Classifier (RLMM) and Corrected Robust Linear Model with Maximum Likelihood Classification (CRLMM) provide accurate calling results for Affymetrix Single Nucleotide Polymorphism (SNP) chips. However, these methods are computationally expensive as they employ preprocessing procedures, including chip data normalization and other sophisticated statistical techniques. In the small sample case the accuracy rate may drop significantly. We develop a new genotype calling method for Affymetrix 100 k and 500 k SNP chips. A two-stage classification scheme is proposed to obtain a fast genotype calling algorithm. The first stage uses unsupervised classification to quickly discriminate genotypes with high accuracy for the majority of the SNPs. The second stage employs a supervised classification method to incorporate allele frequency information either from the HapMap data or from a self-training scheme. A confidence score is provided for every genotype call. The overall performance is shown to be comparable to that of CRLMM as verified by the known gold standard HapMap data and is superior in small sample cases. The new algorithm is computationally simple and standalone in the sense that a self-training scheme can be used without employing any other training data. A package implementing the calling algorithm is freely available at http://www.sfs.ecnu.edu.cn/teachers/xuj_en.html.
9.
10.
SNiPer-HD: improved genotype calling accuracy by an expectation-maximization algorithm for high-density SNP arrays (cited by: 2 total, 0 self-citations, 2 by others)
Hua J, Craig DW, Brun M, Webster J, Zismann V, Tembe W, Joshipura K, Huentelman MJ, Dougherty ER, Stephan DA 《Bioinformatics (Oxford, England)》2007,23(1):57-63
MOTIVATION: The technology to genotype single nucleotide polymorphisms (SNPs) at extremely high densities provides for hypothesis-free genome-wide scans for common polymorphisms associated with complex disease. However, we find that some errors introduced by commonly employed genotyping algorithms may lead to inflation of false associations between markers and phenotype. RESULTS: We have developed a novel SNP genotype calling program, SNiPer-High Density (SNiPer-HD), for highly accurate genotype calling across hundreds of thousands of SNPs. The program employs an expectation-maximization (EM) algorithm with parameters based on a training sample set. The algorithm choice allows for highly accurate genotyping for most SNPs. Also, we introduce a quality control metric for each assayed SNP, such that poor-behaving SNPs can be filtered using a metric correlating to genotype class separation in the calling algorithm. SNiPer-HD is superior to the standard dynamic modeling algorithm and is complementary and non-redundant to other algorithms, such as BRLMM. Implementing multiple algorithms together may provide highly accurate genotyping calls, without inflation of false positives due to systematically mis-called SNPs. A reliable and accurate set of SNP genotypes for increasingly dense panels will eliminate some false association signals and false negative signals, allowing for rapid identification of disease susceptibility loci for complex traits. AVAILABILITY: SNiPer-HD is available at TGen's website: http://www.tgen.org/neurogenomics/data.
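To make the idea of EM-based calling with training-derived starting parameters concrete, here is a minimal sketch of EM for a one-dimensional Gaussian mixture with one component per genotype. It is not the SNiPer-HD implementation: the single "allele contrast" value per sample, the initial means standing in for training-set estimates, and every name and default (`em_genotype_clusters`, `init_sd`, `min_sd`) are assumptions made for illustration.

```python
import math


def normal_pdf(x, mean, sd):
    """Density of a normal distribution at x."""
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))


def responsibilities(values, weights, means, sds):
    """E-step: posterior probability of each cluster for each sample."""
    resp = []
    for x in values:
        dens = [w * normal_pdf(x, m, s) for w, m, s in zip(weights, means, sds)]
        total = sum(dens)
        if total == 0.0:
            # Sample is far from every cluster: treat it as uninformative.
            resp.append([1.0 / len(dens)] * len(dens))
        else:
            resp.append([d / total for d in dens])
    return resp


def em_genotype_clusters(values, init_means, init_sd=0.2, n_iter=50, min_sd=0.05):
    """Fit one Gaussian per genotype class and return (means, sds, weights, calls).

    `init_means` plays the role of parameters learned from a training set;
    `calls` holds the index of the most probable cluster for each sample.
    """
    k = len(init_means)
    means = list(init_means)
    sds = [init_sd] * k
    weights = [1.0 / k] * k
    for _ in range(n_iter):
        resp = responsibilities(values, weights, means, sds)
        # M-step: re-estimate each cluster from its weighted samples.
        for j in range(k):
            nj = sum(r[j] for r in resp)
            weights[j] = nj / len(values)
            if nj < 1e-9:
                continue  # empty cluster (e.g. rare genotype): keep its mean and sd
            means[j] = sum(r[j] * x for r, x in zip(resp, values)) / nj
            var = sum(r[j] * (x - means[j]) ** 2 for r, x in zip(resp, values)) / nj
            sds[j] = max(math.sqrt(var), min_sd)  # floor avoids collapsing clusters
    resp = responsibilities(values, weights, means, sds)
    calls = [max(range(k), key=r.__getitem__) for r in resp]
    return means, sds, weights, calls


# Invented contrast values: three samples near each of -1 (AA), 0 (AB) and 1 (BB).
contrast = [-0.9, -1.0, -1.1, 0.0, 0.1, -0.1, 1.0, 0.9, 1.1]
means, sds, weights, calls = em_genotype_clusters(contrast, init_means=[-0.8, 0.0, 0.8])
print(calls)
```

Starting from informed means rather than random ones keeps the cluster-to-genotype labelling fixed, which matters when one genotype class is sparse or absent.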
11.
12.
SUMMARY: The purpose of this work is to provide the modern molecular geneticist with tools to perform more efficient and more accurate analysis of the genotype data they produce. By using Microsoft Excel macros written in Visual Basic, we can translate genotype data into a form readable by the versatile software 'Arlequin', read the Arlequin output, calculate statistics of linkage disequilibrium, and put the results in a format for viewing with the software 'GOLD'. AVAILABILITY: The software is available by FTP at: ftp://xcsg.iarc.fr/cox/Genotype_Transposer/. SUPPLEMENTARY INFORMATION: Detailed instructions and examples are available at: ftp://xcsg.iarc.fr/cox/Genotype&_Transposer/. Arlequin is available at: http://lgb.unige.ch/arlequin/. GOLD is available at: http://www.well.ox.ac.uk/asthma/GOLD/.
13.
Huixiao Hong, Zhenqiang Su, Weigong Ge, Leming Shi, Roger Perkins, Hong Fang, Donna Mendrick, Weida Tong 《Journal of genetics》2010,89(1):55-64
Genome-wide association studies (GWAS) examine the entire human genome with the goal of identifying genetic variants (usually single nucleotide polymorphisms (SNPs)) that are associated with phenotypic traits such as disease status and drug response. The discordance of significantly associated SNPs for the same disease identified from different GWAS indicates that false associations exist in such results. In addition to the possible sources of spurious associations that have been investigated and discussed intensively, such as sample size and population stratification, an accurate and reproducible genotype calling algorithm is required for concordant GWAS results from different studies. However, variations of genotype calling of an algorithm and their effects on significantly associated SNPs identified in downstream association analyses have not been systematically investigated. In this paper, the variations of genotype calling using the Bayesian Robust Linear Model with Mahalanobis distance classifier (BRLMM) algorithm and the resulting influence on the lists of significantly associated SNPs were evaluated using the raw data of 270 HapMap samples analysed with the Affymetrix Human Mapping 500K Array Set (Affy500K) by changing algorithmic parameters. The parameters modified were the Dynamic Model (DM) call confidence threshold (threshold) and the number of randomly selected SNPs (size). Comparative analysis of the calling results and the corresponding lists of significantly associated SNPs identified through association analysis revealed that algorithmic parameters used in BRLMM affected the genotype calls and the significantly associated SNPs. Both the threshold and the size affected the called genotypes and the lists of significantly associated SNPs in association analysis. The effect of the threshold was much larger than the effect of the size. Moreover, the heterozygous calls had lower consistency compared to the homozygous calls.
14.
High Seng Chai, Terry M Therneau, Kent R Bailey, Jean-Pierre A Kocher 《BMC bioinformatics》2010,11(1):356
Background
Microarray measurements are susceptible to a variety of experimental artifacts, some of which give rise to systematic biases that are spatially dependent in a unique way on each chip. It is likely that such artifacts affect many SNP arrays, but the normalization methods used in currently available genotyping algorithms make no attempt at spatial bias correction. Here, we propose an effective single-chip spatial bias removal procedure for Affymetrix 6.0 SNP arrays or platforms with similar design features. This procedure deals with both extreme and subtle biases and is intended to be applied before standard genotype calling algorithms.
15.
A highly integrated monolithic device was developed that automatically carries out a complex series of molecular processes on multiple samples. The device is capable of extracting and concentrating nucleic acids from milliliter aqueous samples and performing microliter chemical amplification, serial enzymatic reactions, metering, mixing and nucleic acid hybridization. The device, which is smaller than a credit card, can manipulate over 10 reagents in more than 60 sequential operations and was tested for the detection of mutations in a 1.6 kb region of the HIV genome from serum samples containing as few as 500 copies of the RNA. The elements in this device are readily linked into complex, flexible and highly parallel analysis networks for high throughput sample preparation or, conversely, for low cost portable DNA analysis instruments in point-of-care medical diagnostics, environmental testing and defensive biological agent detection.
16.
An automated method was developed for continuous, in situ determination of acetylene reduction (N2 fixation) by intact soybean plants (Glycine max [L.]). The culture vessel containing the roots of intact plants grown in sand culture is sealed at the surface and an air-acetylene mixture continuously injected into the root chamber. The effluent gas is automatically sampled and injected into a gas chromatograph. Continuous acetylene assay at intervals as short as 3.5 min may be made over a period of several days, without attention, except for plant watering. Adverse effects of prolonged exposure of the root system to acetylene were mitigated by pulse injection of acetylene for 20 min followed by 40 min of acetylene-free air. Bare root systems can be suspended in a reaction chamber and sprayed with water or nutrient solution; this permits periodic removal of the root system for sampling nodules.
17.
18.
A cell-counting algorithm, developed in Matlab®, was created to efficiently count migrated fluorescently-stained cells on membranes from migration assays. At each concentration of cells used (10,000 and 100,000 cells), images were acquired at 2.5 ×, 5 ×, and 10 × objective magnifications. Automated cell counts strongly correlated with manual counts (r² = 0.99, P < 0.0001 for a total of 47 images), with no difference in the measurements between methods under all conditions. We conclude that our automated method is accurate, more efficient, and free of the variability and potential observer bias normally associated with manual counting.
19.
Callegaro A, Spinelli R, Beltrame L, Bicciato S, Caristina L, Censuales S, De Bellis G, Battaglia C 《Nucleic acids research》2006,34(7):e56
Single nucleotide polymorphisms (SNPs) are often determined using TaqMan real-time PCR assays (Applied Biosystems) and commercial software that assigns genotypes based on reporter probe signals at the end of amplification. Limitations to the large-scale application of this approach include the need for positive controls or operator intervention to set signal thresholds when one allele is rare. In the interest of optimizing real-time PCR genotyping, we developed an algorithm for automatic genotype calling based on the full course of real-time PCR data. Best cycle genotyping algorithm (BCGA), written in the open source language R, is based on the assumptions that classification depends on the time (cycle) of amplification and that it is possible to identify a best discriminating cycle for each SNP assay. The algorithm is unique in that it classifies samples according to the behavior of blanks (no-DNA samples), which cluster with heterozygous samples. This method of classification eliminates the need for positive controls and permits accurate genotyping even in the absence of a genotype class, for example when one allele is rare. Here, we describe the algorithm and test its validity, compared to the standard end-point method and to DNA sequencing.
20.
Degenerate adaptor sequences for detecting PCR duplicates in reduced representation sequencing data improve genotype calling accuracy (cited by: 1 total, 0 self-citations, 1 by others)
RAD‐tag is a powerful tool for high‐throughput genotyping. It relies on PCR amplification of the starting material, following enzymatic digestion and sequencing adaptor ligation. Amplification introduces duplicate reads into the data, which arise from the same template molecule and are statistically nonindependent, potentially introducing errors into genotype calling. In shotgun sequencing, data duplicates are removed by filtering reads starting at the same position in the alignment. However, restriction enzymes target specific locations within the genome, causing reads to start in the same place, and making it difficult to estimate the extent of PCR duplication. Here, we introduce a slight change to the Illumina sequencing adaptor chemistry, appending a unique four‐base tag to the first index read, which allows duplicate discrimination in aligned data. This approach was validated on the Illumina MiSeq platform, using double‐digest libraries of ants (Wasmannia auropunctata) and yeast (Saccharomyces cerevisiae) with known genotypes, producing modest though statistically significant gains in the odds of calling a genotype accurately. More importantly, removing duplicates also corrected for strong sample‐to‐sample variability of genotype calling accuracy seen in the ant samples. For libraries prepared from low‐input degraded museum bird samples (Mixornis gularis), which had low complexity, having been generated from relatively few starting molecules, adaptor tags show that virtually all of the genotypes were called with inflated confidence as a result of PCR duplicates. Quantification of library complexity by adaptor tagging does not significantly increase the difficulty of the overall workflow or its cost, but corrects for differences in quality between samples and permits analysis of low‐input material.
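The duplicate-discrimination idea described above can be sketched as a filter that treats two aligned reads as PCR duplicates only when they share both the alignment start and the degenerate tag. This is an illustration under assumptions, not the authors' pipeline: the `Read` record and its fields are hypothetical, and a real workflow would read tags and positions from alignment files.

```python
from collections import namedtuple

# Hypothetical minimal read record: alignment coordinates plus the four-base tag.
Read = namedtuple("Read", "name chrom pos strand tag")


def remove_pcr_duplicates(reads):
    """Keep the first read seen for each (chrom, pos, strand, tag) combination.

    Reads that start at the same restriction site but carry different tags
    are kept, because they are taken to come from different template molecules.
    """
    seen = set()
    unique = []
    for read in reads:
        key = (read.chrom, read.pos, read.strand, read.tag)
        if key not in seen:
            seen.add(key)
            unique.append(read)
    return unique


def duplication_rate(reads):
    """Fraction of reads discarded as duplicates (0.0 for an empty input)."""
    if not reads:
        return 0.0
    return 1.0 - len(remove_pcr_duplicates(reads)) / len(reads)


reads = [
    Read("r1", "chr1", 100, "+", "ACGT"),
    Read("r2", "chr1", 100, "+", "ACGT"),  # same site and tag as r1: duplicate
    Read("r3", "chr1", 100, "+", "TTGA"),  # same site, different tag: kept
    Read("r4", "chr2", 50, "-", "ACGT"),
]
print([r.name for r in remove_pcr_duplicates(reads)])
```

A four-base tag has only 4^4 = 256 possible values, so at very deep coverage distinct molecules can share a tag by chance; this sketch ignores such collisions and would slightly overestimate duplication in that regime.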