Similar literature
20 similar documents found (search time: 31 ms)
1.
Genomic prediction models are often calibrated using multi-generation data. Over time, as data accumulates, training data sets become increasingly heterogeneous. Differences in allele frequency and linkage disequilibrium patterns between the training and prediction genotypes may limit prediction accuracy. This leads to the question of whether all available data or a subset of it should be used to calibrate genomic prediction models. Previous research on training set optimization has focused on identifying a subset of the available data that is optimal for a given prediction set. However, this approach does not contemplate the possibility that different training sets may be optimal for different prediction genotypes. To address this problem, we recently introduced a sparse selection index (SSI) that identifies an optimal training set for each individual in a prediction set. Using additive genomic relationships, the SSI can provide increased accuracy relative to genomic-BLUP (GBLUP). Non-parametric genomic models using Gaussian kernels (KBLUP) have, in some cases, yielded higher prediction accuracies than standard additive models. Therefore, here we studied whether combining SSIs and kernel methods could further improve prediction accuracy when training genomic models using multi-generation data. Using four years of doubled haploid maize data from the International Maize and Wheat Improvement Center (CIMMYT), we found that when predicting grain yield the KBLUP outperformed the GBLUP, and that using SSI with additive relationships (GSSI) led to 5–17% increases in accuracy, relative to the GBLUP. However, differences in prediction accuracy between the KBLUP and the kernel-based SSI were smaller and not always significant.
Subject terms: Quantitative trait, Genetic models

2.
The objective of this study was to quantify the accuracy of imputing the genotype of parents using information on the genotype of their progeny and a family-based and population-based imputation algorithm. Two separate data sets were used, one containing both dairy and beef animals (n=3122) with high-density genotypes (735 151 single nucleotide polymorphisms (SNPs)) and the other containing just dairy animals (n=5489) with medium-density genotypes (51 602 SNPs). Imputation accuracy of three different genotype density panels was evaluated, representing low (i.e. 6501 SNPs), medium and high density. The full genotypes of sires with genotyped half-sib progeny were masked and subsequently imputed. Genotyped half-sib progeny group sizes were altered from 4 up to 12 and the impact on imputation accuracy was quantified. Up to 157 and 258 sires were used to test the accuracy of imputation in the dairy plus beef data set and the dairy-only data set, respectively. The efficiency and accuracy of imputation were quantified as the proportion of genotypes that could not be imputed, and as both the genotype concordance rate and allele concordance rate. The median proportion of genotypes per animal that could not be imputed in the imputation process decreased as the number of genotyped half-sib progeny increased; values for the medium-density panel ranged from a median of 0.015 with a half-sib progeny group size of 4 to a median of 0.0014 to 0.0015 with a half-sib progeny group size of 8. The accuracy of imputation across different paternal half-sib progeny group sizes was similar in both data sets.
Concordance rates increased considerably as the number of genotyped half-sib progeny increased from four (mean animal allele concordance rate of 0.94 in both data sets for the medium-density genotype panel) to five (mean animal allele concordance rate of 0.96 in both data sets for the medium-density genotype panel), after which they were relatively stable up to a half-sib progeny group size of eight. In the data set with dairy-only animals, sufficient sires with paternal half-sib progeny groups up to 12 were available and the within-animal mean genotype concordance rates continued to increase up to this group size. The accuracy of imputation was worst for the low-density genotypes, especially with smaller half-sib progeny group sizes, but the difference in imputation accuracy between density panels diminished as progeny group size increased; the difference between high and medium-density genotype panels was relatively small across all half-sib progeny group sizes. Where biological material or genotypes are not available on individual animals, at least five progeny can be genotyped (on either a medium or high-density genotyping platform) and the parental alleles imputed with, on average, ⩾96% accuracy.
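
The three quality measures named in this abstract (proportion not imputed, genotype concordance, allele concordance) can be illustrated with a minimal sketch. It assumes genotypes are coded as 0, 1 or 2 copies of one allele, that a genotype the algorithm could not impute is stored as None, and that allele concordance counts 2 minus the absolute difference as the number of correctly called alleles. These coding choices and the function name are illustrative assumptions, not taken from the study.

```python
def concordance(true_genotypes, imputed_genotypes):
    """Return (proportion_not_imputed, genotype_concordance, allele_concordance).

    Genotypes are assumed to be coded 0/1/2 (copies of one allele);
    None marks a genotype that could not be imputed (an assumption).
    """
    if len(true_genotypes) != len(imputed_genotypes):
        raise ValueError("genotype lists must have the same length")
    # Keep only positions where an imputed genotype was produced.
    called = [(t, i) for t, i in zip(true_genotypes, imputed_genotypes)
              if i is not None]
    not_imputed = 1 - len(called) / len(true_genotypes)
    if not called:
        return not_imputed, float("nan"), float("nan")
    # A genotype is concordant only if both alleles agree.
    genotype_matches = sum(t == i for t, i in called)
    # Each genotype carries two alleles; |t - i| of them are wrong.
    allele_matches = sum(2 - abs(t - i) for t, i in called)
    return (not_imputed,
            genotype_matches / len(called),
            allele_matches / (2 * len(called)))


masked_truth = [0, 1, 2, 2, 1]
imputed = [0, 1, 1, None, 1]
print(concordance(masked_truth, imputed))
```

Under this coding, allele concordance is never lower than genotype concordance, because a heterozygote imputed as a homozygote still has one allele right.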

3.
Microsatellite markers have demonstrated their value for performing paternity exclusion and hence exploring mating patterns in plants and animals. Methodology is well established for diploid species, and several software packages exist for elucidating paternity in diploids; however, these issues are not so readily addressed in polyploids due to the increased complexity of the exclusion problem and a lack of available software. We introduce polypatex, an R package for paternity exclusion analysis using microsatellite data in autopolyploid, monoecious or dioecious/bisexual species with a ploidy of 4n, 6n or 8n. Given marker data for a set of offspring, their mothers and a set of candidate fathers, polypatex uses allele matching to exclude candidates whose marker alleles are incompatible with the alleles in each offspring–mother pair. polypatex can analyse marker data sets in which allele copy numbers are known (genotype data) or unknown (allelic phenotype data) – for data sets in which allele copy numbers are unknown, comparisons are made taking into account all possible genotypes that could arise from the compared allele sets. polypatex is a software tool that provides population geneticists with the ability to investigate the mating patterns of autopolyploids using paternity exclusion analysis on data from codominant markers having multiple alleles per locus.
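
The idea of exclusion by allele matching can be sketched for a single locus with allelic phenotype data (allele sets without copy numbers). This is a simplified illustration, not the polypatex algorithm: the function name and rules are assumptions, each parent is taken to contribute ploidy/2 alleles, and double reduction and genotyping error are ignored.

```python
def father_excluded(offspring, mother, father, ploidy=4):
    """Return True if the candidate father is incompatible at this locus.

    Arguments are collections of distinct allele sizes (allelic phenotypes).
    Hypothetical simplified rule set; assumes the mother is the true mother.
    """
    offspring, mother, father = set(offspring), set(mother), set(father)
    half = ploidy // 2
    must_be_paternal = offspring - mother  # alleles the mother cannot supply
    must_be_maternal = offspring - father  # alleles the candidate cannot supply
    if not must_be_paternal <= father:
        return True   # offspring allele found in neither parent
    if len(must_be_paternal) > half:
        return True   # more paternal alleles than one gamete can carry
    if len(must_be_maternal) > half:
        return True   # mother's gamete cannot cover the remainder
    if not offspring & father:
        return True   # candidate shares no allele with the offspring
    return False


print(father_excluded({100, 104, 108}, {100, 104}, {108, 112}))  # False
print(father_excluded({100, 104, 108}, {100, 104}, {112, 116}))  # True
```

A candidate is typically retained only if he is not excluded at any locus, or at no more than a tolerated number of loci.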

4.
Combining two data sets with allele information from overlapping microsatellite markers is often desirable, particularly in population genetic studies where a substantial body of published data exists. When genotyping is performed in different laboratories, allele size calling may not be presumed to be consistent. Our approach solves this problem by assigning allele sizes across studies using maximum-likelihood theory. Using data overlaps in samples and markers, allele shifts between two studies are calculated for each overlapping marker and a single file containing allele frequencies of consistent alleles is produced. The program (combi.pl) is written in PERL and available at http://data40.uni-tz.gwdg.de/~htaeube.
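
A rough sketch of the shift calculation for one overlapping marker follows. It is a match-count stand-in rather than the likelihood used by combi.pl: for every sample genotyped in both studies, the size differences between paired alleles are tallied and the most frequent difference is taken as the shift. The function name and data layout are assumptions made for illustration.

```python
from collections import Counter


def estimate_shift(study_a, study_b):
    """Estimate the allele-size shift (study_b minus study_a) for one marker.

    study_a, study_b: dict mapping sample ID -> tuple of allele sizes (bp).
    Only samples present in both studies contribute.
    """
    votes = Counter()
    for sample in study_a.keys() & study_b.keys():
        # Pair alleles by rank so (150, 154) lines up with (152, 156).
        for a, b in zip(sorted(study_a[sample]), sorted(study_b[sample])):
            votes[b - a] += 1
    if not votes:
        raise ValueError("no overlapping samples for this marker")
    shift, _count = votes.most_common(1)[0]
    return shift


lab_a = {"s1": (150, 154), "s2": (152, 152), "s3": (148, 154)}
lab_b = {"s1": (152, 156), "s2": (154, 154), "s3": (150, 157), "s4": (160, 162)}
print(estimate_shift(lab_a, lab_b))  # 2
```

Once a shift is estimated per marker, alleles from one study can be re-labelled onto the other study's scale before allele frequencies are pooled.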

5.
Publicly available single nucleotide polymorphism (SNP) allele frequencies are an important resource for the selection of genetic markers that may be most useful for gene mapping and association studies. Data mining these allele frequencies through disparate public databases and Websites is time consuming and can result in inconsistent findings. We have developed a web-based software tool, Frequency Finder, to acquire SNP allele frequencies from multiple public data sources and return a summarized result to the user. Our software optimizes and automates the search of candidate markers, decreasing the amount of time it would take to extract pertinent data manually. We have included several methods to output the data, including on-screen and as a compressed text file. We show that Frequency Finder accurately retrieves available frequency data from the available sources. Using this tool, we detect significant differences between Asian, African and Caucasian populations in the allele frequency spectra of 246 097 SNPs. While limited to public databases that provide web-based access to allele frequencies, Frequency Finder provides a single, user-friendly interface for retrieving allele frequencies for large batches of SNPs from multiple data sources.

6.
We have created a high-density SNP resource encompassing 7.87 million polymorphic loci across 49 inbred strains of the laboratory mouse by combining data available from public databases and training a hidden Markov model to impute missing genotypes in the combined data. The strong linkage disequilibrium found in dense sets of SNP markers in the laboratory mouse provides the basis for accurate imputation. Using genotypes from eight independent SNP resources, we empirically validated the quality of the imputed genotypes and demonstrated that they are highly reliable for most inbred strains. The imputed SNP resource will be useful for studies of natural variation and complex traits. It will facilitate association study designs by providing high-density SNP genotypes for large numbers of mouse strains. We anticipate that this resource will continue to evolve as new genotype data become available for laboratory mouse strains. The data are available for bulk download or query at /.

7.
Classifying genotypes into clusters based on DNA fingerprinting, and/or agronomic attributes, for studying genetic and phenotypic diversity is a common practice. Researchers are interested in knowing the minimum number of fragments (and markers) needed for finding the underlying structural patterns of diversity in a population of interest, and using this information in conjunction with the phenotypic attributes to obtain more precise clusters of genotypes. The objectives of this study are to present: (1) a retrospective method of analysis for selecting a minimum number of fragments (and markers) from a study needed to produce the same classification of genotypes as that obtained using all the fragments (and markers), and (2) a classification strategy for genotypes that allows the combination of the minimum set of fragments with available phenotypic attributes. Results obtained on seven experimental data sets made up of different plant species, numbers of individuals per species and numbers of markers showed that the retrospective analysis did indeed find few relevant fragments (and markers) that best discriminated the genotypes. In two data sets, the classification strategy of combining the information on the relevant minimum fragments with the available morpho-agronomic attributes produced compact and well-differentiated groups of genotypes.

8.
Allelic dropout is a commonly observed source of missing data in microsatellite genotypes, in which one or both allelic copies at a locus fail to be amplified by the polymerase chain reaction. Especially for samples with poor DNA quality, this problem causes a downward bias in estimates of observed heterozygosity and an upward bias in estimates of inbreeding, owing to mistaken classifications of heterozygotes as homozygotes when one of the two copies drops out. One general approach for avoiding allelic dropout involves repeated genotyping of homozygous loci to minimize the effects of experimental error. Existing computational alternatives often require replicate genotyping as well. These approaches, however, are costly and are suitable only when enough DNA is available for repeated genotyping. In this study, we propose a maximum-likelihood approach together with an expectation-maximization algorithm to jointly estimate allelic dropout rates and allele frequencies when only one set of nonreplicated genotypes is available. Our method considers estimates of allelic dropout caused by both sample-specific factors and locus-specific factors, and it allows for deviation from Hardy–Weinberg equilibrium owing to inbreeding. Using the estimated parameters, we correct the bias in the estimation of observed heterozygosity through the use of multiple imputations of alleles in cases where dropout might have occurred. With simulated data, we show that our method can (1) effectively reproduce patterns of missing data and heterozygosity observed in real data; (2) correctly estimate model parameters, including sample-specific dropout rates, locus-specific dropout rates, and the inbreeding coefficient; and (3) successfully correct the downward bias in estimating the observed heterozygosity. We find that our method is fairly robust to violations of model assumptions caused by population structure and by genotyping errors from sources other than allelic dropout. 
Because the data sets imputed under our model can be investigated in additional subsequent analyses, our method will be useful for preparing data for applications in diverse contexts in population genetics and molecular ecology.
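
The downward bias in observed heterozygosity described in this abstract can be made concrete with a small calculation. The sketch assumes a single dropout rate shared by all samples and loci and independent dropout of the two allele copies. Both are simplifications, since the abstract's model uses sample-specific and locus-specific rates. A heterozygote is then seen as a heterozygote only when neither copy drops out, and a genotype is missing when both do. The function name is hypothetical.

```python
def expected_observed_heterozygosity(true_het, dropout_rate):
    """Expected heterozygosity among non-missing genotypes under dropout.

    true_het: true proportion of heterozygotes (0..1).
    dropout_rate: probability that any single allele copy fails to amplify
    (assumed identical and independent for every copy).
    """
    if not 0 <= dropout_rate < 1:
        raise ValueError("dropout_rate must be in [0, 1)")
    # Heterozygote stays visible only if both copies amplify.
    p_het_seen = true_het * (1 - dropout_rate) ** 2
    # Any genotype is missing when both copies drop out.
    p_missing = dropout_rate ** 2
    return p_het_seen / (1 - p_missing)


for rate in (0.0, 0.1, 0.2):
    print(rate, expected_observed_heterozygosity(0.6, rate))
```

Algebraically the ratio simplifies to true_het × (1 − d) / (1 + d) for dropout rate d, so the bias grows with the dropout rate even before any inbreeding is considered.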

9.
To support microsatellite data communication, we have developed a convenient method for creating locus‐specific microsatellite allele ladders used to align data from different laboratories. The ladders were constructed by pooling polymerase chain reaction (PCR) products to create a template for amplification. Four ladders were field‐tested in six different laboratories using different genotyping platforms. Despite substantial differences in absolute size estimates of DNA fragments, each laboratory correctly scored unknown sample genotypes according to the ladder designations. The results indicate that our simple preparation method provides reliable allele ladders in a time‐efficient manner for verifying microsatellite genotypes across platforms.

10.
BACKGROUND: Multi-locus sequence typing (MLST) has become the gold standard for population analyses of bacterial pathogens. This method focuses on the sequences of a small number of loci (usually seven) to divide the population and is simple, robust and facilitates comparison of results between laboratories and over time. Over the last decade, researchers and population health specialists have invested substantial effort in building up public MLST databases for nearly 100 different bacterial species, and these databases contain a wealth of important information linked to MLST sequence types such as time and place of isolation, host or niche, serotype and even clinical or drug resistance profiles. Recent advances in sequencing technology mean it is increasingly feasible to perform bacterial population analysis at the whole genome level. This offers massive gains in resolving power and genetic profiling compared to MLST, and will eventually replace MLST for bacterial typing and population analysis. However, given the wealth of data currently available in MLST databases, it is crucial to maintain backwards compatibility with MLST schemes so that new genome analyses can be understood in their proper historical context. RESULTS: We present a software tool, SRST, for quick and accurate retrieval of sequence types from short read sets, using inputs easily downloaded from public databases. SRST assigns alleles using read mapping and an allele assignment score incorporating sequence coverage and variability, to determine the most likely allele. Analysis of over 3,500 loci in more than 500 publicly accessible Illumina read sets showed SRST to be highly accurate at allele assignment. SRST output is compatible with common analysis tools such as eBURST, Clonal Frame or PhyloViz, allowing easy comparison between novel genome data and MLST data. Alignment, fastq and pileup files can also be generated for novel alleles.
CONCLUSIONS: SRST is a novel software tool for accurate assignment of sequence types using short read data. Several uses for the tool are demonstrated, including quality control for high-throughput sequencing projects, plasmid MLST and analysis of genomic data during outbreak investigation. SRST is open-source, requires Python, BWA and SamTools, and is available from http://srst.sourceforge.net.
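
The final step of any MLST workflow, turning a set of per-locus allele numbers into a sequence type, can be sketched as a table lookup. This is not SRST's code and does not reproduce its read-mapping or allele assignment score. The locus names, allele numbers and sequence types below are invented placeholders, and a real scheme would load its profile table from a public MLST database.

```python
def assign_sequence_type(allele_calls, loci, profiles):
    """Look up a sequence type (ST) from per-locus allele numbers.

    allele_calls: dict locus -> allele number (None or absent if uncalled).
    loci: ordered list of locus names defining the scheme.
    profiles: dict mapping tuple of allele numbers -> ST.
    Returns the ST, "novel" for an unseen combination, or None if any
    locus has no allele call.
    """
    key = tuple(allele_calls.get(locus) for locus in loci)
    if None in key:
        return None
    return profiles.get(key, "novel")


# Hypothetical seven-locus scheme with made-up profiles.
LOCI = ["locA", "locB", "locC", "locD", "locE", "locF", "locG"]
PROFILES = {
    (1, 1, 1, 1, 1, 1, 1): 1,
    (1, 2, 1, 3, 1, 1, 4): 7,
}

calls = dict(zip(LOCI, (1, 2, 1, 3, 1, 1, 4)))
print(assign_sequence_type(calls, LOCI, PROFILES))  # 7
```

Keeping "novel" distinct from None separates a new allele combination from an incomplete profile, which matters when the tool is used for quality control.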

11.
Structural variation is an important class of genetic variation in mammals. High-throughput sequencing (HTS) technologies promise to revolutionize copy-number variation (CNV) detection but present substantial analytic challenges. Converging evidence suggests that multiple types of CNV-informative data (e.g. read-depth, read-pair, split-read) need be considered, and that sophisticated methods are needed for more accurate CNV detection. We observed that various sources of experimental biases in HTS confound read-depth estimation, and note that bias correction has not been adequately addressed by existing methods. We present a novel read-depth–based method, GENSENG, which uses a hidden Markov model and negative binomial regression framework to identify regions of discrete copy-number changes while simultaneously accounting for the effects of multiple confounders. Based on extensive calibration using multiple HTS data sets, we conclude that our method outperforms existing read-depth–based CNV detection algorithms. The concept of simultaneous bias correction and CNV detection can serve as a basis for combining read-depth with other types of information such as read-pair or split-read in a single analysis. A user-friendly and computationally efficient implementation of our method is freely available.

12.
Human blood plasma can be obtained relatively noninvasively and contains proteins from most, if not all, tissues of the body. Therefore, an extensive, quantitative catalog of plasma proteins is an important starting point for the discovery of disease biomarkers. In 2005, we showed that different proteomics measurements using different sample preparation and analysis techniques identify significantly different sets of proteins, and that a comprehensive plasma proteome can be compiled only by combining data from many different experiments. Applying advanced computational methods developed for the analysis and integration of very large and diverse data sets generated by tandem MS measurements of tryptic peptides, we have now compiled a high-confidence human plasma proteome reference set with well over twice the identified proteins of previous high-confidence sets. It includes a hierarchy of protein identifications at different levels of redundancy following a clearly defined scheme, which we propose as a standard that can be applied to any proteomics data set to facilitate cross-proteome analyses. Further, to aid in development of blood-based diagnostics using techniques such as selected reaction monitoring, we provide a rough estimate of protein concentrations using spectral counting. We identified 20,433 distinct peptides, from which we inferred a highly nonredundant set of 1929 protein sequences at a false discovery rate of 1%. We have made this resource available via PeptideAtlas, a large, multiorganism, publicly accessible compendium of peptides identified in tandem MS experiments conducted by laboratories around the world.

13.
Dispersal sets the fundamental scales of ecological and evolutionary dynamics and has important implications for population persistence. Patterns of marine dispersal remain poorly understood, partly because dispersal may vary through time and often homogenizes allele frequencies. However, combining multiple types of natural tags can provide more precise dispersal estimates, and biological collections can help to reconstruct dispersal patterns through time. We used single nucleotide polymorphism genotypes and otolith core microchemistry from archived collections of larval summer flounder (Paralichthys dentatus, n = 411) captured between 1989 and 2012 at five locations along the US East coast to reconstruct dispersal patterns through time. Neither genotypes nor otolith microchemistry alone were sufficient to identify the source of larval fish. However, microchemistry identified clusters of larvae (n = 3–33 larvae per cluster) that originated in the same location, and genetic assignment of clusters could be made with substantially more confidence. We found that most larvae probably originated near a biogeographical break (Cape Hatteras) and that larvae were transported in both directions across this break. Larval sources did not shift north through time, despite the northward shift of adult populations in recent decades. Our novel approach demonstrates that summer flounder dispersal is widespread throughout their range, on both intra‐ and intergenerational timescales, and may be a particularly important process for synchronizing population dynamics and maintaining genetic diversity during an era of rapid environmental change. Broadly, our results reveal the value of archived collections and of combining multiple natural tags to understand the magnitude and directionality of dispersal in species with extensive gene flow.

14.
In order to investigate the comparability of microsatellite profiles obtained in different laboratories, ten partners in seven countries analyzed 46 grape cultivars at six loci (VVMD5, VVMD7, VVMD27, VVS2, VrZAG62, and VrZAG79). No effort was made to standardize equipment or protocols. Although some partners obtained very similar results, in other cases different absolute allele sizes and, sometimes, different relative allele sizes were obtained. A strategy for data comparison by means of reference to the alleles detected in well-known cultivars was proposed. For each marker, each allele was designated by a code based on the name of the reference cultivar carrying that allele. Thirty-three cultivars, representing from 13 to 23 alleles per marker, were chosen as references. After the raw data obtained by the different partners were coded, more than 97% of the data were in agreement. Minor discrepancies were attributed to errors, suboptimal amplification and visualization, and mis-scoring of heterozygous versus homozygous allele pairs. We have shown that coded microsatellite data produced in different laboratories with different protocols and conditions can be compared, and that they are suitable for the identification and SSR allele characterization of cultivars. It is proposed that the six markers employed here, already widely used, be adopted as a minimal standard marker set for future grapevine cultivar analyses, and that additional cultivars be characterized by means of the coded reference alleles presented here. The complete database is available at . Cuttings of the 33 reference cultivars are available on request from the Institut National de la Recherche Agronomique Vassal collection (didier.vares@ensam.inra.fr).
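
The coding strategy described above can be sketched as follows: each laboratory records the sizes it observes for the reference alleles, then translates its own raw sizes into shared codes. The codes, sizes and function names here are invented placeholders (the study derives its codes from reference cultivar names), so this is an illustration of the idea rather than the published coding table.

```python
def build_code_table(reference_sizes):
    """Invert one lab's reference measurements into a size -> code table.

    reference_sizes: dict code -> allele size (bp) that THIS lab observes
    for the reference cultivar allele carrying that code.
    """
    return {size: code for code, size in reference_sizes.items()}


def encode(genotype, table):
    """Translate raw allele sizes into lab-independent codes."""
    return tuple(sorted(table.get(size, "unknown") for size in genotype))


# Two hypothetical labs see different absolute AND relative sizes
# for the same three reference alleles at one marker.
lab1 = build_code_table({"REF_A": 226, "REF_B": 232, "REF_C": 240})
lab2 = build_code_table({"REF_A": 228, "REF_B": 234, "REF_C": 243})

print(encode((226, 240), lab1))  # ('REF_A', 'REF_C')
print(encode((228, 243), lab2))  # ('REF_A', 'REF_C')
```

Because each lab calibrates against its own readings of the same reference material, the codes agree even where relative sizes differ between platforms, which a constant offset could not correct.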

15.
16.
Johnson PC, Haydon DT. Genetics 2007, 175(2): 827-842
The importance of quantifying and accounting for stochastic genotyping errors when analyzing microsatellite data is increasingly being recognized. This awareness is motivating the development of data analysis methods that not only take errors into consideration but also recognize the difference between two distinct classes of error, allelic dropout and false alleles. Currently, methods to estimate rates of allelic dropout and false alleles depend upon the availability of error-free reference genotypes or reliable pedigree data, which are often not available. We have developed a maximum-likelihood-based method for estimating these error rates from a single replication of a sample of genotypes. Simulations show it to be both accurate and robust to modest violations of its underlying assumptions. We have applied the method to estimating error rates in two microsatellite data sets. It is implemented in a computer program, Pedant, which estimates allelic dropout and false allele error rates with 95% confidence regions from microsatellite genotype data and performs power analysis. Pedant is freely available at http://www.stats.gla.ac.uk/~paulj/pedant.html.

17.
Landraces of wheat can serve as important potential sources for extending the genetic basis of selection cultivars. Analysis of microsatellites and typing of polymorphism in a representative sample of 347 genotypes, including landraces and selection cultivars, was performed using a set of 38 selected oligonucleotide primer pairs. Each genotype had a unique allele combination at 39 microsatellite loci examined. Classification of genotypes with respect to the level of their similarity was performed using cluster analysis. The data obtained pointed to genetic differentiation of hexaploid wheat. The groups of cultivars, the formation of which was thought to be associated with the main old areas of wheat cultivation in Europe and Asia, were identified. The basis of each of the groups was formed by landraces of common wheat. The differences between the groups identified were associated with multiple changes in the wheat genome and were expressed as quantitative differences in the allele frequencies of microsatellite loci. The results of the study are of interest in terms of understanding the structure of wheat genetic diversity and revealing the pathways of evolution of this crop.

18.
Duplicated loci, for example those associated with major histocompatibility complex (MHC) genes, often have similar DNA sequences that can be coamplified with a pair of primers. This results in genotyping difficulties and inaccurate analyses. Here, we present a method to assign alleles to different loci in amplifications of duplicated loci. This method simultaneously considers several factors that may each affect correct allele assignment. These are the sharing of identical alleles among loci, null alleles, copy number variation, negative amplification, heterozygote excess or heterozygote deficiency, and linkage disequilibrium. The possible multilocus genotypes are extracted from the alleles for each individual and weighted to estimate the allele frequencies. The likelihood of an allele configuration is calculated and is optimized with a heuristic algorithm. Monte‐Carlo simulations and three empirical MHC data sets are used as examples to evaluate the efficacy of our method under different conditions. Our new software, mhc‐typer V1.1, is freely available at https://github.com/huangkang1987/mhc-typer.

19.
20.
Gene content is the number of copies of a particular allele in the genotype of an animal. Gene content can be used to study the additive gene action of a candidate gene. Usually, genotype data are available only for part of the population, and for the rest gene contents have to be calculated based on typed relatives. Methods to calculate expected gene content for animals in large complex pedigrees are relatively complex. In this paper, we propose a practical method to calculate gene content using linear regression. The method does not estimate genotype probabilities, but these can be approximated from gene content assuming Hardy-Weinberg proportions. The approach was compared with other methods on multiple simulated data sets for real bovine pedigrees of 1 082 and 907 903 animals. Different allelic frequencies (0.4 and 0.2) and proportions of missing genotypes (90, 70, and 50%) were considered in the simulation. The simulation showed that the proposed method has a similar capability to predict gene content as the iterative peeling method; however, it requires less time and can be more practical for large pedigrees. The method was also applied to real data on the bovine myostatin locus in a large dual-purpose Belgian Blue pedigree of 235 133 animals. It was demonstrated that the proposed method can be easily adapted for particular pedigrees.
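
The two quantities this abstract relies on can be sketched briefly: gene content as an allele count, and genotype probabilities approximated from a predicted gene content under Hardy-Weinberg proportions. The second function is one plausible reading of that approximation (treating half the predicted gene content as an individual-specific allele frequency). It is an assumption for illustration, not the authors' stated formula, and the regression step itself is not shown.

```python
def gene_content(genotype, allele):
    """Number of copies of `allele` in a diploid genotype, e.g. ("A", "B")."""
    return sum(1 for a in genotype if a == allele)


def genotype_probabilities(predicted_gene_content):
    """Approximate P(0 copies), P(1 copy), P(2 copies) from a gene content.

    Assumes Hardy-Weinberg proportions with allele frequency
    q = predicted_gene_content / 2 (a hypothetical approximation).
    """
    if not 0 <= predicted_gene_content <= 2:
        raise ValueError("gene content must lie between 0 and 2")
    q = predicted_gene_content / 2
    return ((1 - q) ** 2, 2 * q * (1 - q), q ** 2)


print(gene_content(("A", "B"), "A"))   # 1
print(genotype_probabilities(0.8))
```

For an untyped animal, the predicted gene content is generally fractional, and the three probabilities always sum to one by construction.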
