Similar articles
Found 20 similar articles; search took 31 ms
1.
ABSTRACT: BACKGROUND: Single nucleotide polymorphism (SNP) genotyping assays normally give rise to a certain percentage of no-calls; the problem becomes severe when the target organisms, such as cattle, do not have a high-resolution genomic sequence. Missing SNP genotypes, when related to target traits, would confound downstream data analyses such as genome-wide association studies (GWAS). Existing methods for recovering the missing values are successful only to some extent: they are either accurate but not fast enough, or fast but not accurate enough. RESULTS: For a target missing genotype, we take only the SNP loci within a genetic distance vicinity and only the samples within a similarity vicinity into our local imputation process. For missing genotype imputation, the comparative performance evaluations through extensive simulation studies using real human and cattle genotype datasets demonstrated that our nearest neighbor based local imputation method was one of the most efficient methods, and outperformed existing methods except the time-consuming fastPHASE; for missing haplotype allele imputation, the comparative performance evaluations using real mouse haplotype datasets demonstrated that our method was not only one of the most efficient methods, but also one of the most accurate methods. CONCLUSIONS: Given that fastPHASE requires a long imputation time on medium to high density datasets, and that our nearest neighbor based local imputation method only performed slightly worse, yet better than all other methods, one might want to adopt our method as an alternative missing SNP genotype or missing haplotype allele imputation method.
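The local imputation idea in this abstract (restrict both the loci and the samples that contribute to an imputed value) can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the function name `impute_local_knn`, the use of an index window in place of genetic distance, the mismatch-rate similarity, and the majority vote are all assumptions made for the example.

```python
from collections import Counter

MISSING = None  # hypothetical marker for a no-call


def impute_local_knn(genotypes, sample, locus, window=2, k=3):
    """Impute genotypes[sample][locus] from the k most similar samples,
    comparing only loci within `window` index positions of the target locus.
    (Assumption: an index window stands in for a genetic distance vicinity.)"""
    n_loci = len(genotypes[sample])
    lo, hi = max(0, locus - window), min(n_loci, locus + window + 1)
    nearby = [j for j in range(lo, hi) if j != locus]

    def distance(other):
        # Mismatch rate over nearby loci that are called in both samples.
        pairs = [(genotypes[sample][j], genotypes[other][j]) for j in nearby]
        pairs = [(a, b) for a, b in pairs if a is not MISSING and b is not MISSING]
        if not pairs:
            return float("inf")
        return sum(a != b for a, b in pairs) / len(pairs)

    # Donors must have a called genotype at the target locus.
    donors = [i for i in range(len(genotypes))
              if i != sample and genotypes[i][locus] is not MISSING]
    donors.sort(key=distance)

    # Majority vote among the k nearest donors.
    votes = Counter(genotypes[i][locus] for i in donors[:k])
    return votes.most_common(1)[0][0] if votes else MISSING


panel = [
    [0, 1, None, 1, 0],
    [0, 1, 2, 1, 0],
    [0, 1, 2, 1, 0],
    [2, 0, 0, 0, 2],
    [2, 0, 0, 0, 2],
]
imputed = impute_local_knn(panel, sample=0, locus=2, window=2, k=2)
```

Here sample 0 matches samples 1 and 2 on every nearby locus, so their genotype at locus 2 is the one that gets copied.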

2.
Pei YF  Li J  Zhang L  Papasian CJ  Deng HW 《PloS one》2008,3(10):e3551
The power of genetic association analyses is often compromised by missing genotypic data which contributes to lack of significant findings, e.g., in in silico replication studies. One solution is to impute untyped SNPs from typed flanking markers, based on known linkage disequilibrium (LD) relationships. Several imputation methods are available and their usefulness in association studies has been demonstrated, but factors affecting their relative performance in accuracy have not been systematically investigated. Therefore, we investigated and compared the performance of five popular genotype imputation methods, MACH, IMPUTE, fastPHASE, PLINK and Beagle, to assess and compare the effects of factors that affect imputation accuracy rates (ARs). Our results showed that a stronger LD and a lower MAF for an untyped marker produced better ARs for all the five methods. We also observed that a greater number of haplotypes in the reference sample resulted in higher ARs for MACH, IMPUTE, PLINK and Beagle, but had little influence on the ARs for fastPHASE. In general, MACH and IMPUTE produced similar results and these two methods consistently outperformed fastPHASE, PLINK and Beagle. Our study is helpful in guiding application of imputation methods in association analyses when genotype data are missing.

3.

Background  

Metagenomics, or the sequencing and analysis of collective genomes (metagenomes) of microorganisms isolated from an environment, promises direct access to the "unculturable majority". This emerging field offers the potential to lay a solid basis for our understanding of the entire living world. However, taxonomic classification is an essential task in the analysis of metagenomics data sets that is still far from being solved. We present a novel strategy to predict the taxonomic origin of environmental genomic fragments. The proposed classifier combines the idea of the k-nearest neighbor with strategies from kernel-based learning.

4.
Zhang N  Little RJ 《Biometrics》2012,68(3):933-942
Summary We consider the linear regression of outcome Y on regressors W and Z with some values of W missing, when our main interest is the effect of Z on Y, controlling for W. Three common approaches to regression with missing covariates are (i) complete‐case analysis (CC), which discards the incomplete cases, (ii) ignorable likelihood methods, which base inference on the likelihood based on the observed data, assuming the missing data are missing at random (Rubin, 1976b), and (iii) nonignorable modeling, which posits a joint distribution of the variables and missing data indicators. Another simple practical approach that has not received much theoretical attention is to drop the regressor variables containing missing values from the regression modeling (DV, for drop variables). DV does not lead to bias when either (i) the regression coefficient of W is zero or (ii) W and Z are uncorrelated. We propose a pseudo‐Bayesian approach for regression with missing covariates that compromises between the CC and DV estimates, exploiting information in the incomplete cases when the data support DV assumptions. We illustrate favorable properties of the method by simulation, and apply the proposed method to a liver cancer study. Extension of the method to more than one missing covariate is also discussed.

5.
Related individuals share potentially long chromosome segments that trace to a common ancestor. We describe a phasing algorithm (ChromoPhase) that utilizes this characteristic of finite populations to phase large sections of a chromosome. In addition to phasing, our method imputes missing genotypes in individuals genotyped at lower marker density when more densely genotyped relatives are available. ChromoPhase uses a pedigree to collect an individual's (the proband) surrogate parents and offspring and uses genotypic similarity to identify its genomic surrogates. The algorithm then cycles through the relatives and genomic surrogates one at a time to find shared chromosome segments. Once a segment has been identified, any missing information in the proband is filled in with information from the relative. We tested ChromoPhase in a simulated population consisting of 400 individuals at a marker density of 1500/M, which is approximately equivalent to a 50K bovine single nucleotide polymorphism chip. In simulated data, 99.9% of loci were correctly phased and, when imputing from 100 to 1500 markers, more than 87% of missing genotypes were correctly imputed. Performance increased when the number of generations available in the pedigree increased, but was reduced when the sparse genotype contained fewer loci. However, in simulated data, ChromoPhase correctly imputed at least 12% more genotypes than fastPHASE, depending on sparse marker density. We also tested the algorithm in a real Holstein cattle data set to impute 50K genotypes in animals with a sparse 3K genotype. In these data 92% of genotypes were correctly imputed in animals with a genotyped sire. We evaluated the accuracy of genomic predictions with the dense, sparse, and imputed simulated data sets and show that the reduction in genomic evaluation accuracy is modest even with imperfectly imputed genotype data. Our results demonstrate that imputation of missing genotypes, and potentially full genome sequence, using long-range phasing is feasible.

6.
Using surface electromyography (sEMG) signals for efficient recognition of hand gestures has attracted increasing attention during the last decade, with most previous work being focused on recognition of upper arm and gross hand movements and some work on the classification of individual finger movements such as finger typing tasks. However, relatively few investigations can be found in the literature for automatic classification of multiple finger movements such as finger number gestures. This paper focuses on the recognition of number gestures based on a 4-channel wireless sEMG system. We investigate the effects of three popular feature types (i.e. Hudgins’ time–domain features (TD), autocorrelation and cross-correlation coefficients (ACCC) and spectral power magnitudes (SPM)) and four popular classification algorithms (i.e. k-nearest neighbor (k-NN), linear discriminant analysis (LDA), quadratic discriminant analysis (QDA) and support vector machine (SVM)) in offline recognition. Motivated by the good performance of SVM, we further propose combining the three features and employing a new classification method, multiple kernel learning SVM (MKL-SVM). Real sEMG results from six subjects show that all combinations, except k-NN or LDA using ACCC features, can achieve above 91% average recognition accuracy, and the highest accuracy is 97.93% achieved by the proposed MKL-SVM method using the three feature combination (3F). Referring to the offline recognition results, we also implement a real-time recognition system. Our results show that all six subjects can achieve a real-time recognition accuracy higher than 90%. The number gestures are therefore promising for practical applications such as human–computer interaction (HCI).

7.
The von Bertalanffy growth equation (VBGE) is commonly used in ecology and fisheries management to model individual growth of an organism. Generally, a nonlinear regression is used with length-at-age data to recover key life history parameters: L∞ (asymptotic size), k (the growth coefficient), and t0 (a time used to calculate size at age 0). However, age data are often unavailable for many species of interest, which makes the regression impossible. To confront this problem, we have developed a Bayesian model to find L∞ using only length data. We use length-at-age data for female blue shark, Prionace glauca, to test our hypothesis. Preliminary comparisons of the model output and the results of a nonlinear regression using the VBGE show similar estimates of L∞. We also developed a full Bayesian model that fits the VBGE to the same data used in the classical regression and the length-based Bayesian model. Classical regression methods are highly sensitive to missing data points, and our analysis shows that fitting the VBGE in a Bayesian framework is more robust. We investigate the assumptions made with the traditional curve fitting methods, and argue that either the full Bayesian or the length-based Bayesian models are preferable to classical nonlinear regressions. These methods clarify and address assumptions made in classical regressions using von Bertalanffy growth and facilitate more detailed stock assessments of species for which data are sparse.
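For readers unfamiliar with the growth curve named in this abstract, the standard form of the von Bertalanffy equation is length(t) = L_inf * (1 - exp(-k * (t - t0))). The snippet below is only a sketch of that formula; the function name `vbge_length` and the parameter values are illustrative assumptions, not estimates from the study.

```python
import math


def vbge_length(age, l_inf, k, t0):
    """Expected length at `age` under the von Bertalanffy growth equation.

    l_inf: asymptotic size, k: growth coefficient,
    t0: theoretical age at which length is zero.
    """
    return l_inf * (1.0 - math.exp(-k * (age - t0)))


# Illustrative (made-up) parameters, not values from the abstract.
lengths = [vbge_length(age, l_inf=300.0, k=0.1, t0=-1.0) for age in range(0, 21, 5)]
```

The curve starts at zero length when age equals t0 and rises monotonically toward l_inf, which is why l_inf can in principle be informed by length data alone.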

8.
9.
Trait means of marker genotypes are often inconsistent across experiments, thereby hindering the use of regression techniques in marker-assisted selection. Best linear unbiased prediction based on trait and marker data (TM-BLUP) does not require prior information on the mean effects associated with specific marker genotypes and, consequently, may be useful in applied breeding programs. The objective of this paper is to present a flanking-marker, TM-BLUP model that is applicable to interpopulation single crosses that characterize maize (Zea mays L.) breeding programs. The performance of a single cross is modeled as the sum of testcross additive and dominance effects at unmarked quantitative trait loci (QTL) and at marked QTL (MQTL). The TM-BLUP model requires information on the recombination frequencies between flanking markers and the MQTL and on MQTL variances. A tabular method is presented for calculating the conditional probability that MQTL alleles in two inbreds are identical by descent given the observed marker genotypes (G_k^obs) at the kth MQTL. Information on identity by descent of MQTL alleles can then be used to calculate the conditional covariance of MQTL effects between single crosses given G_k^obs. The inverse of the covariance matrix for dominance effects at unmarked QTL and MQTL can be written directly from the inverse of the covariance matrices of the corresponding testcross additive effects. In practice, the computations required in TM-BLUP may be prohibitive. The computational requirements may be reduced with simplified TM-BLUP models wherein dominance effects at MQTL are excluded, only the single crosses that have been tested are included, or information is pooled across several MQTL. Received: 22 June 1997 / Accepted: 25 February 1998

10.
Zheng X  Liu T  Wang J 《Amino acids》2009,37(2):427-433
A complexity-based approach is proposed to predict subcellular location of proteins. Instead of extracting features from protein sequences as done previously, our approach is based on a complexity decomposition of symbol sequences. In the first step, distance between each pair of protein sequences is evaluated by the conditional complexity of one sequence given the other. Subcellular location of a protein is then determined using the k-nearest neighbor algorithm. Using three widely used data sets created by Reinhardt and Hubbard, Park and Kanehisa, and Gardy et al., our approach shows an improvement in prediction accuracy over those based on the amino acid composition and Markov model of protein sequences.

11.
Imputation of high-density genotypes from low- or medium-density platforms is a promising way to enhance the efficiency of whole-genome selection programs at low cost. In this study, we compared the efficiency of three widely used imputation algorithms (fastPHASE, BEAGLE and findhap) using Chinese Holstein cattle with Illumina BovineSNP50 genotypes. A total of 2108 cattle were randomly divided into a reference population and a test population to evaluate the influence of the reference population size. Three bovine chromosomes, BTA1, 16 and 28, were used to represent large, medium and small chromosome size, respectively. We simulated different scenarios by randomly masking 20%, 40%, 80% and 95% single-nucleotide polymorphisms (SNPs) on each chromosome in the test population to mimic different SNP density panels. Illumina Bovine3K and Illumina BovineLD (6909 SNPs) information was also used. We found that the three methods showed comparable accuracy when the proportion of masked SNPs was low. However, the difference became larger when more SNPs were masked. BEAGLE performed the best and was most robust with imputation accuracies >90% in almost all situations. fastPHASE was affected by the proportion of masked SNPs, especially when the masked SNP rate was high. findhap ran the fastest, whereas its accuracies were lower than those of BEAGLE but higher than those of fastPHASE. In addition, enlarging the reference population improved the imputation accuracy for BEAGLE and findhap, but did not affect fastPHASE. Considering imputation accuracy and computational requirements, BEAGLE has been found to be more reliable for imputing genotypes from low- to high-density genotyping platforms.
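The masking protocol described in this abstract (hide a fixed proportion of known SNP genotypes, impute them, then score the imputed calls against the truth) can be sketched as below. The helper names `mask_genotypes` and `imputation_accuracy` are hypothetical, and the sketch treats one animal's genotypes as a flat list; it says nothing about how fastPHASE, BEAGLE or findhap are actually invoked.

```python
import random


def mask_genotypes(genotypes, proportion, seed=0):
    """Return a copy with `proportion` of loci set to None, plus the masked indices."""
    rng = random.Random(seed)
    n_masked = round(len(genotypes) * proportion)
    masked = sorted(rng.sample(range(len(genotypes)), n_masked))
    hidden = list(genotypes)
    for j in masked:
        hidden[j] = None
    return hidden, masked


def imputation_accuracy(truth, imputed, masked):
    """Fraction of masked loci whose imputed genotype equals the true genotype."""
    if not masked:
        return float("nan")
    return sum(truth[j] == imputed[j] for j in masked) / len(masked)


truth = [0, 1, 2, 0, 1, 2, 0, 1, 2, 0]
hidden, masked = mask_genotypes(truth, proportion=0.4)
# A trivial stand-in "imputer" that fills every gap with genotype 0.
filled = [0 if g is None else g for g in hidden]
accuracy = imputation_accuracy(truth, filled, masked)
```

Scoring only the masked loci keeps the accuracy from being inflated by genotypes the imputer never had to guess.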

12.
Diversity Arrays Technology (DArT) is a DNA hybridisation-based molecular marker technique that can detect simultaneously variation at numerous genomic loci without sequence information. This efficiency makes it a potential tool for a quick and powerful assessment of the structure of germplasm collections. This article demonstrates the usefulness of DArT markers for genetic diversity analyses of Musa spp. genotypes. We developed four complexity reduction methods to generate DArT genomic representations and we tested their performance using 48 reference Musa genotypes. For these four complexity reduction methods, DArT markers displayed high polymorphism information content. We selected the two methods which generated the most polymorphic genomic representations (PstI/BstNI 16.8%, PstI/TaqI 16.1%) to analyze a panel of 168 Musa genotypes from two of the most important field collections of Musa in the world: Cirad (Neufchateau, Guadeloupe), and IITA (Ibadan, Nigeria). Since most edible cultivars are derived from two wild species, Musa acuminata (A genome) and Musa balbisiana (B genome), the study is restricted mostly to accessions of these two species and those derived from them. The genomic origin of the markers can help resolve the pedigree of valuable genotypes of unknown origin. A total of 836 markers were identified and used for genotyping. Ten percent of them were specific to the A genome and enabled targeting this genome portion in relatedness analysis among diverse ploidy constitutions. DArT markers revealed genetic relationships among Musa genotypes consistent with those provided by other marker technologies, but at a significantly higher resolution and speed and reduced cost.

13.

Introduction

A common problem in metabolomics data analysis is the existence of a substantial number of missing values, which can complicate, bias, or even prevent certain downstream analyses. One of the most widely-used solutions to this problem is imputation of missing values using a k-nearest neighbors (kNN) algorithm to estimate missing metabolite abundances. kNN implicitly assumes that missing values are uniformly distributed at random in the dataset, but this is typically not true in metabolomics, where many values are missing because they are below the limit of detection of the analytical instrumentation.

Objectives

Here, we explore the impact of nonuniformly distributed missing values (missing not at random, or MNAR) on imputation performance. We present a new model for generating synthetic missing data and a new algorithm, No-Skip kNN (NS-kNN), that accounts for MNAR values to provide more accurate imputations.

Methods

We compare the imputation errors of the original kNN algorithm using two distance metrics, NS-kNN, and a recently developed algorithm KNN-TN, when applied to multiple experimental datasets with different types and levels of missing data.

Results

Our results show that NS-kNN typically outperforms kNN when at least 20–30% of missing values in a dataset are MNAR. NS-kNN also has lower imputation errors than KNN-TN on realistic datasets when at least 50% of missing values are MNAR.

Conclusion

Accounting for the nonuniform distribution of missing values in metabolomics data can significantly improve the results of imputation algorithms. The NS-kNN method imputes missing metabolomics data more accurately than existing kNN-based approaches when used on realistic datasets.
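To make the distinction between the two kinds of missingness concrete, the sketch below generates synthetic missing values in which some abundances vanish because they fall below a limit of detection (missing not at random) and others are removed at random. It is not the synthetic-data model proposed in the paper; the function name `simulate_missing` and its parameters are assumptions for illustration only.

```python
import random


def simulate_missing(values, lod, mcar_rate, seed=0):
    """Blank out abundances below `lod` (MNAR) and a random share of the rest.

    Returns the masked list plus counts of MNAR and randomly removed values.
    """
    rng = random.Random(seed)
    masked, n_mnar, n_random = [], 0, 0
    for v in values:
        if v < lod:
            masked.append(None)  # below the limit of detection
            n_mnar += 1
        elif rng.random() < mcar_rate:
            masked.append(None)  # removed independently of its value
            n_random += 1
        else:
            masked.append(v)
    return masked, n_mnar, n_random


abundances = [5.0, 0.5, 3.0, 0.2, 8.0]
masked, n_mnar, n_random = simulate_missing(abundances, lod=1.0, mcar_rate=0.0)
```

With `mcar_rate` set to zero every gap is MNAR, which is the regime in which the abstract reports the largest advantage for NS-kNN.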

14.
P. Saha  P. J. Heagerty 《Biometrics》2010,66(4):999-1011
Summary Competing risks arise naturally in time‐to‐event studies. In this article, we propose time‐dependent accuracy measures for a marker when we have censored survival times and competing risks. Time‐dependent versions of sensitivity or true positive (TP) fraction naturally correspond to consideration of either cumulative (or prevalent) cases that accrue over a fixed time period, or alternatively to incident cases that are observed among event‐free subjects at any select time. Time‐dependent (dynamic) specificity (1–false positive (FP)) can be based on the marker distribution among event‐free subjects. We extend these definitions to incorporate cause of failure for competing risks outcomes. The proposed estimation for cause‐specific cumulative TP/dynamic FP is based on the nearest neighbor estimation of bivariate distribution function of the marker and the event time. On the other hand, incident TP/dynamic FP can be estimated using a possibly nonproportional hazards Cox model for the cause‐specific hazards and riskset reweighting of the marker distribution. The proposed methods extend the time‐dependent predictive accuracy measures of Heagerty, Lumley, and Pepe (2000, Biometrics 56, 337–344) and Heagerty and Zheng (2005, Biometrics 61, 92–105).
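The cumulative TP / dynamic FP definitions in this abstract can be illustrated with plain empirical proportions. The sketch below deliberately assumes fully observed event times (no censoring), so it is a simplification and not the nearest neighbor estimator the authors propose; the function name and the coding of `causes` (0 for subjects event-free at the end of follow-up) are assumptions.

```python
def cumulative_tp_dynamic_fp(markers, times, causes, cutoff, t, cause):
    """Empirical cause-specific cumulative TP and dynamic FP at horizon t.

    TP: share of subjects failing from `cause` by time t with marker > cutoff.
    FP: share of subjects still event-free after t with marker > cutoff.
    Assumes every event time is observed (no censoring).
    """
    cases = [m for m, s, c in zip(markers, times, causes) if s <= t and c == cause]
    controls = [m for m, s in zip(markers, times) if s > t]
    tp = sum(m > cutoff for m in cases) / len(cases) if cases else float("nan")
    fp = sum(m > cutoff for m in controls) / len(controls) if controls else float("nan")
    return tp, fp


markers = [3.0, 1.0, 2.5, 0.5, 2.0]
times = [1, 2, 3, 6, 7]
causes = [1, 2, 1, 0, 0]
tp, fp = cumulative_tp_dynamic_fp(markers, times, causes, cutoff=1.5, t=5, cause=1)
```

Sweeping `cutoff` over the observed marker values and plotting TP against FP would trace a time-dependent ROC curve for the chosen cause.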

15.
Naveed M  Khan A  Khan AU 《Amino acids》2012,42(5):1809-1823
G protein-coupled receptors (GPCRs) are transmembrane proteins, which transduce signals from extracellular ligands to intracellular G protein. Automatic classification of GPCRs can provide important information for the development of novel drugs in the pharmaceutical industry. In this paper, we propose an evolutionary approach, GPCR-MPredictor, which combines individual classifiers for predicting GPCRs. GPCR-MPredictor is a web predictor that can efficiently predict GPCRs at five levels. The first level determines whether a protein sequence is a GPCR or a non-GPCR. If the predicted sequence is a GPCR, then it is further classified into family, subfamily, sub-subfamily, and subtype levels. In this work, our aim is to analyze the discriminative power of different feature extraction and classification strategies for GPCR prediction and then to use an evolutionary ensemble approach for enhanced prediction performance. Features are extracted using amino acid composition, pseudo amino acid composition, and dipeptide composition of protein sequences. Different classification approaches, such as k-nearest neighbor (KNN), support vector machine (SVM), probabilistic neural networks (PNN), J48, Adaboost, and Naive Bayes, have been used to classify GPCRs. The proposed hierarchical GA-based ensemble classifier exploits the prediction results of SVM, KNN, PNN, and J48 at each level. The GA-based ensemble yields an accuracy of 99.75, 92.45, 87.80, 83.57, and 96.17% at the five levels, on the first dataset. We further perform predictions on a dataset consisting of 8,000 GPCRs at the family, subfamily, and sub-subfamily level, and on two other datasets of 365 and 167 GPCRs at the second and fourth levels, respectively. In comparison with the existing methods, the results demonstrate the effectiveness of our proposed GPCR-MPredictor in classifying GPCR families. It is accessible at .
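Two of the feature types named in this abstract, amino acid composition and dipeptide composition, are simple enough to sketch directly. The code below is a generic illustration of those definitions (20 residue frequencies and 400 ordered-pair frequencies); the function names are hypothetical and it is not the GPCR-MPredictor feature pipeline.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues
INDEX = {residue: i for i, residue in enumerate(AMINO_ACIDS)}


def amino_acid_composition(sequence):
    """Return a 20-element vector of residue frequencies."""
    counts = [0] * 20
    for residue in sequence.upper():
        if residue in INDEX:  # skip non-standard symbols such as X
            counts[INDEX[residue]] += 1
    total = sum(counts)
    return [c / total if total else 0.0 for c in counts]


def dipeptide_composition(sequence):
    """Return a 400-element vector of adjacent residue-pair frequencies."""
    seq = sequence.upper()
    counts = [0] * 400
    for first, second in zip(seq, seq[1:]):
        if first in INDEX and second in INDEX:
            counts[INDEX[first] * 20 + INDEX[second]] += 1
    total = sum(counts)
    return [c / total if total else 0.0 for c in counts]


aac = amino_acid_composition("AACD")
dpc = dipeptide_composition("AACD")
```

Both vectors have a fixed length regardless of protein length, which is what lets classifiers such as KNN or SVM compare sequences of different sizes.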

16.
Patients with coronary artery disease (CAD), including those who have had myocardial infarction (MI), and control subjects have been compared with respect to the distributions of the alleles and genotypes of polymorphic marker G(–455)A of gene FGB encoding the fibrinogen β-chain. The groups studied do not differ significantly with respect to the distributions of G(–455)A alleles and genotypes. This indicates that this marker is not associated with CAD in the Moscow population. Allele A of the G(–455)A polymorphic marker has been found to be associated with an increased fibrinogen content of blood plasma in women with CAD.

17.
Scientists are using acoustic monitoring to assess the impact of altered soundscapes on wildlife communities and human systems. In the soundscape ecology field, monitoring and analysis approaches rely on the interdisciplinary intersection of ecology, acoustics, and computer science. Combining theory and practice of each field in the context of Knowledge Discovery in Databases (KDD), soundscape ecologists provide innovative monitoring solutions for ecologically-driven research questions. We propose a soundscape content analysis framework for improved knowledge outcome with assistance of the new multi-label (ML) concept. Here, we investigated the effectiveness of a ML k-nearest neighbor algorithm (ML-kNN) for labeling concurrent soundscape components within a single recording. We manually labeled 1200 field recordings for the presence of soundscape components and extracted ecological acoustic features, audio profile features, and Gaussian-mixture model features for each recording. Then, we tested the ML-kNN algorithm accuracy with well-established metrics adapted to ML learning. We found that seventeen unique acoustic features could predict a set of biophonic, geophonic, and anthrophonic labels for a single field recording with average precision of 0.767. However, certain labels were predicted incorrectly depending on the time of day and co-occurrence of that label with another label, suggesting further refinement is needed to improve the accuracy of predicted labels. Overall, this ML classification approach could enable researchers to label field recordings more quickly and generate an “alert” system for monitoring changes in a specific sound class. Ultimately, the adaptation of the ML algorithm may provide soundscape ecologists with new metadata labels that are searchable in large databases of soundscape field recordings.

18.
It is not uncommon for biological anthropologists to analyze incomplete bioarcheological or forensic skeleton specimens. As many quantitative multivariate analyses cannot handle incomplete data, missing data imputation or estimation is a common preprocessing practice for such data. Using William W. Howells' Craniometric Data Set and the Goldman Osteometric Data Set, we evaluated the performance of multiple popular statistical methods for imputing missing metric measurements. Results indicated that multiple imputation methods outperformed single imputation methods, such as Bayesian principal component analysis (BPCA). Multiple imputation with Bayesian linear regression implemented in the R package norm2, the Expectation–Maximization (EM) with Bootstrapping algorithm implemented in Amelia, and the Predictive Mean Matching (PMM) method and several of the derivative linear regression models implemented in mice, perform well regarding accuracy, robustness, and speed. Based on the findings of this study, we suggest a practical procedure for choosing appropriate imputation methods.

19.
The successful prediction of thermophilic proteins is useful for designing stable enzymes that are functional at high temperature. We have used the increment of diversity (ID), a novel amino acid composition-based similarity distance, in a 2-class K-nearest neighbor classifier to classify thermophilic and mesophilic proteins, and the resulting KNN-ID classifier was successfully developed to predict thermophilic proteins. Instead of extracting features from protein sequences as done previously, our approach was based on a diversity measure of symbol sequences. The similarity distance between each pair of protein sequences was first calculated to quantitatively measure the similarity between one sequence and the other. The class of the query protein is then determined using the K-nearest neighbor algorithm. Comparisons with multiple recently published methods showed that the KNN-ID proposed in this study outperforms the other methods. The improved predictive performance indicated it is a simple and effective classifier for discriminating thermophilic and mesophilic proteins. Finally, the influence of protein length and protein identity on prediction accuracy was discussed further. The prediction model and dataset used in this article can be freely downloaded from http://wlxy.imu.edu.cn/college/biostation/fuwu/KNN-ID/index.htm.

20.
Genotyping-by-sequencing (GBS) is a rapid and cost-effective genome-wide genotyping technique applicable whether a reference genome is available or not. Due to the cost-coverage trade-off, however, GBS typically produces large amounts of missing marker genotypes, whose imputation therefore becomes both challenging and critical for later analyses. In this work, the performance of four general imputation methods (K-nearest neighbors, Random Forest, singular value decomposition, and mean value) and two genotype-specific methods (“Beagle” and FILLIN) was measured on GBS data from alfalfa (Medicago sativa L., autotetraploid, heterozygous, without reference genome) and rice (Oryza sativa L., diploid, 100 % homozygous, with reference genome). Alfalfa SNPs were aligned on the genome of the closely related species Medicago truncatula L. Benchmarks consisted of progressive data filtering for marker call rate (up to 70 %) and increasing proportions (up to 20 %) of known genotypes masked for imputation. The relative performance was measured as the total proportion of correctly imputed genotypes, globally and within each genotype class (two homozygotes in rice, two homozygotes and one heterozygote in alfalfa). We found that imputation accuracy was robust to increasing missing rates, and consistently higher in rice than in alfalfa. Accuracy was as high as 90–100 % for the major (most frequent) homozygous genotype, but dropped to 80–90 % (rice) and below 30 % (alfalfa) in the minor homozygous genotype. Beagle was the best performing method, both accuracy- and time-wise, in rice. In alfalfa, KNNI and RFI gave the highest accuracies, but KNNI was much faster.
