Similar Documents
20 similar documents retrieved.
1.

Background  

Supervised learning for cancer classification employs a set of design examples to learn how to discriminate between tumors. In practice it is crucial to confirm that the classifier is robust, with good generalization performance on new examples, or at least that it performs better than random guessing. A suggested alternative is to obtain a confidence interval for the error rate using repeated design and test sets selected from the available examples. However, it is known that even in the ideal situation of repeated designs and tests with completely novel samples in each cycle, a small test set leads to a large bias in the estimate of the true variance between design sets. Therefore, different methods for small-sample performance estimation, such as the recently proposed procedure called Repeated Random Sampling (RSS), are also expected to produce heavily biased estimates, which in turn translate into biased confidence intervals. Here we explore such biases and develop a refined algorithm called Repeated Independent Design and Test (RIDT).
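
As a rough illustration of the repeated design/test splitting described above (a generic sketch, not the paper's RIDT algorithm; the data, classifier and split sizes are hypothetical):

```python
# Generic repeated design/test splitting (not the paper's RIDT algorithm).
# All data, model and split sizes below are hypothetical.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=100, n_features=50, random_state=0)

n_repeats, test_size = 200, 20
errors = []
for seed in range(n_repeats):
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, test_size=test_size, stratify=y, random_state=seed)
    clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
    errors.append(np.mean(clf.predict(X_te) != y_te))   # error on this test set

errors = np.array(errors)
# Percentile interval over the repeated splits; as noted above, small test sets
# and reuse of the same limited examples make such intervals biased.
print(f"mean error = {errors.mean():.3f}, "
      f"2.5-97.5 percentile interval = {np.percentile(errors, [2.5, 97.5])}")
```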

2.

Background

The most common application of imputation is to infer genotypes of a high-density panel of markers on animals that are genotyped for a low-density panel. However, the increase in accuracy of genomic predictions resulting from an increase in the number of markers tends to reach a plateau beyond a certain density. Another application of imputation is to increase the size of the training set with un-genotyped animals. This strategy can be particularly successful when a set of closely related individuals is genotyped.

Methods

Imputation on completely un-genotyped dams was performed using known genotypes from the sire of each dam, one offspring and the offspring’s sire. Two methods were applied based on either allele or haplotype frequencies to infer genotypes at ambiguous loci. Results of these methods and of two available software packages were compared. Quality of imputation under different population structures was assessed. The impact of using imputed dams to enlarge training sets on the accuracy of genomic predictions was evaluated for different populations, heritabilities and sizes of training sets.
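
A minimal sketch of family-based imputation with an allele-frequency fallback, assuming a single biallelic locus and using only the offspring and the offspring's sire (so it is only in the spirit of, not identical to, the method described above; all names and numbers are hypothetical):

```python
# Illustrative sketch: expected allele dosage for an un-genotyped dam at one
# biallelic locus. Genotypes are coded as the count of allele A (0, 1 or 2).
# Not the exact method of the study; freq_a is the population frequency of A.

def impute_dam_dosage(offspring, offspring_sire, freq_a):
    """Expected dosage of allele A for an un-genotyped dam."""
    # Step 1: allele the dam transmitted to the offspring (1 = A, 0 = a),
    # deduced by removing the paternal contribution where unambiguous.
    if offspring == 2:                      # offspring AA -> dam gave A
        maternal = 1.0
    elif offspring == 0:                    # offspring aa -> dam gave a
        maternal = 0.0
    else:                                   # offspring Aa
        if offspring_sire == 2:             # sire must have given A -> dam gave a
            maternal = 0.0
        elif offspring_sire == 0:           # sire must have given a -> dam gave A
            maternal = 1.0
        else:                               # heterozygous sire: ambiguous,
            maternal = freq_a               # fall back on the allele frequency
    # Step 2: the dam's non-transmitted allele is unknown here; its expected
    # contribution is the population frequency of A.
    return maternal + freq_a

# Offspring is Aa, offspring's sire is AA, frequency of A is 0.3:
# the dam must have transmitted a, so expected dosage = 0 + 0.3 = 0.3.
print(impute_dam_dosage(offspring=1, offspring_sire=2, freq_a=0.3))
```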

Results

Imputation accuracy ranged from 0.52 to 0.93, depending on the population structure and the method used. The method that used allele frequencies performed better than the method based on haplotype frequencies. Accuracy of imputation was higher for populations with higher levels of linkage disequilibrium and with larger proportions of markers with more extreme allele frequencies. Inclusion of imputed dams in the training set increased the accuracy of genomic predictions. Gains in accuracy ranged from close to zero to 37.14%, depending on the simulated scenario. Generally, the higher the accuracy already obtained with the genotyped training set, the lower the increase in accuracy achieved by adding imputed dams.

Conclusions

Whenever a reference population resembling the family configuration considered here is available, imputation can be used to achieve an extra increase in accuracy of genomic predictions by enlarging the training set with completely un-genotyped dams. This strategy was shown to be particularly useful for populations with lower levels of linkage disequilibrium, for genomic selection on traits with low heritability, and for species or breeds for which the size of the reference population is limited.

3.

Background  

Affymetrix GeneChip technology enables the parallel observation of tens of thousands of genes. It is important that probe set annotations are reliable so that biological inferences can be made about genes that undergo differential expression. Probe sets representing the same gene might be expected to show similar fold changes/z-scores; however, this is in fact not the case.

4.

Background  

The Internal Transcribed Spacer (ITS) regions of fungal ribosomal DNA (rDNA) are highly variable sequences of great importance in distinguishing fungal species by PCR analysis. Previously published PCR primers available for amplifying these sequences from environmental samples provide varying degrees of success at discriminating against plant DNA while maintaining a broad range of compatibility. Typically, it has been necessary to use multiple primer sets to accommodate the range of fungi under study, potentially creating artificial distinctions for fungal sequences that amplify with more than one primer set.

5.

Background  

Fluorescence microscopy is widely used to determine the subcellular location of proteins. Efforts to determine location on a proteome-wide basis create a need for automated methods to analyze the resulting images. Over the past ten years, the feasibility of using machine learning methods to recognize all major subcellular location patterns has been convincingly demonstrated, using diverse feature sets and classifiers. On a well-studied data set of 2D HeLa single-cell images, the best performance to date, 91.5%, was obtained by including a set of multiresolution features. This demonstrates the value of multiresolution approaches to this important problem.

6.

Background  

Knowing the subcellular location of proteins provides clues to their function as well as the interconnectivity of biological processes. Dozens of tools are available for predicting protein location in the eukaryotic cell. Each tool performs well on certain data sets, but their predictions often disagree for a given protein. Since the individual tools each have particular strengths, we set out to integrate them in a way that optimally exploits their potential. The method we present here is applicable to various subcellular locations, but tailored for predicting whether or not a protein is localized in mitochondria. Knowledge of the mitochondrial proteome is relevant to understanding the role of this organelle in global cellular processes.

7.

Background

A century after its discovery, Chagas disease still represents a major neglected tropical threat. Accurate diagnostic tools as well as surrogate markers of parasitological response to treatment are research priorities in the field. The purpose of this study was to evaluate the performance of PCR methods for detection of Trypanosoma cruzi DNA through an external quality evaluation.

Methodology/Findings

An international collaborative study was launched by expert PCR laboratories from 16 countries. Currently used strategies were challenged against serial dilutions of purified DNA from stocks representing T. cruzi discrete typing units (DTU) I, IV and VI (set A), human blood spiked with parasite cells (set B), and Guanidine Hydrochloride-EDTA blood samples from 32 seropositive and 10 seronegative patients from Southern Cone countries (set C). Forty-eight PCR tests were reported for set A and 44 for sets B and C; 28 targeted minicircle DNA (kDNA), 13 satellite DNA (Sat-DNA) and the remainder low copy number sequences. In set A, commercial master mixes and Sat-DNA real-time PCR showed better specificity, but kDNA-PCR was more sensitive for detecting DTU I DNA. In set B, commercial DNA extraction kits presented better specificity than solvent extraction protocols. Sat-DNA PCR tests had higher specificity, with sensitivities of 0.05–0.5 parasites/mL, whereas specific kDNA tests detected 5 × 10−3 par/mL. Sixteen specific and coherent methods showed good performance in both sets A and B (10 fg/µl of DNA from all stocks, 5 par/mL spiked blood). The median sensitivities, specificities and accuracies obtained when the set C samples were tested with the 16 methods judged to perform well on sets A and B varied considerably. Of these, four methods showed the best performance across all three sample sets, detecting at least 10 fg/µl for each DNA stock and 0.5 par/mL, with a sensitivity of 83.3–94.4%, specificity of 85–95%, accuracy of 86.8–89.5% and kappa index of 0.7–0.8 compared to the consensus PCR reports of the 16 good-performing tests, and 63–69%, 100%, 71.4–76.2% and 0.4–0.5, respectively, compared to serodiagnosis. Method LbD2 used solvent extraction followed by SYBR Green-based real-time PCR targeting Sat-DNA; method LbD3 used solvent DNA extraction followed by conventional PCR targeting Sat-DNA. The third method (LbF1) used glass fiber column-based DNA extraction followed by TaqMan real-time PCR targeting Sat-DNA (cruzi 1/cruzi 2 and cruzi 3 TaqMan probe), and the fourth method (LbQ) used solvent DNA extraction followed by conventional hot-start PCR targeting kDNA (primer pair 121/122). These four methods were further evaluated at the coordinating laboratory on a subset of human blood samples, confirming the performance obtained by the participating laboratories.
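
The performance measures quoted above (sensitivity, specificity, accuracy and kappa index against a reference such as serodiagnosis or the consensus of the good-performing tests) can be computed as in the following sketch; the sample results used here are hypothetical:

```python
# Illustrative sketch: sensitivity, specificity, accuracy and Cohen's kappa of
# one PCR test against a reference classification. The result vectors below
# are hypothetical, not data from the study.
import numpy as np

def performance(test, reference):
    test, reference = np.asarray(test, bool), np.asarray(reference, bool)
    tp = np.sum(test & reference)
    tn = np.sum(~test & ~reference)
    fp = np.sum(test & ~reference)
    fn = np.sum(~test & reference)
    n = tp + tn + fp + fn
    sens = tp / (tp + fn)
    spec = tn / (tn + fp)
    acc = (tp + tn) / n
    # Cohen's kappa: observed agreement corrected for chance agreement.
    p_exp = ((tp + fp) * (tp + fn) + (fn + tn) * (fp + tn)) / n**2
    kappa = (acc - p_exp) / (1 - p_exp)
    return sens, spec, acc, kappa

# Hypothetical results for 42 samples (1 = positive, 0 = negative).
pcr_result = [1]*27 + [0]*5 + [0]*9 + [1]*1
serology   = [1]*32 + [0]*10
print(performance(pcr_result, serology))
```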

Conclusion/Significance

This study represents a first crucial step towards international validation of PCR procedures for detection of T. cruzi in human blood samples.

8.

Aim

Climate is considered a major driver of species distributions. Long‐term climatic means are commonly used as predictors in correlative species distribution models (SDMs). However, this coarse temporal resolution does not reflect local conditions that populations experience, such as short‐term weather extremes, which may have a strong impact on population dynamics and local distributions. We here compare the performance of climate‐ and weather‐based predictors in regional SDMs and their influence on future predictions, which are increasingly used in conservation planning.

Location

South‐western Germany.

Methods

We built different SDMs for 20 Orthoptera species based on three predictor sets at a regional scale for current and future climate scenarios. We calculated standard bioclimatic variables as well as yearly and seasonal sets of climate-change indicator variables describing weather extremes. As the impact of extreme events may be stronger for habitat specialists than for generalists, we distinguished species' degrees of specialization. We fitted linear mixed-effects models to identify significant effects of algorithm, predictor set and specialization on model performance, and calculated correlations and geographical niche overlap between spatial predictions.

Results

Current predictions were rather similar among all predictor sets, but highly variable for future climate scenarios. Bioclimatic and seasonal weather predictors performed slightly better than yearly weather predictors, though performance differences were minor. We found no evidence that specialists are more sensitive to weather extremes than generalists.

Main conclusions

For future projections of species distributions, SDM predictor selection should not be based solely on current performance and predictions. As long-term climate and short-term weather predictors represent different environmental drivers of a species' distribution, we argue that diverging future projections should be interpreted as complementary. Although similar current performances and predictions might suggest that the predictor sets are equivalent, favouring one set neglects important aspects of future distributions and might mislead conservation decisions based on them.

9.
10.

Background

It is important to accurately determine the performance of peptide:MHC binding predictions, as this enables users to compare and choose between different prediction methods and provides estimates of the expected error rate. Two common approaches to determine prediction performance are cross-validation, in which all available data are iteratively split into training and testing data, and the use of blind sets generated separately from the data used to construct the predictive method. In the present study, we have compared cross-validated prediction performances generated on our last benchmark dataset from 2009 with prediction performances generated on data subsequently added to the Immune Epitope Database (IEDB) which served as a blind set.
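
A minimal sketch of the two estimation strategies being compared, using synthetic data in place of the IEDB benchmark (the model, features and split are hypothetical):

```python
# Cross-validated vs. blind-set performance on synthetic data (a hypothetical
# stand-in for the IEDB benchmark); a later slice of the data plays the role
# of the blind set added after the predictor was built.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=600, n_features=30, n_informative=10,
                           random_state=1)
X_hist, y_hist = X[:400], y[:400]        # "historic" data used for development
X_blind, y_blind = X[400:], y[400:]      # data added later, used as a blind set

model = RandomForestClassifier(n_estimators=200, random_state=0)

# 5-fold cross-validated AUC on the historic data.
cv_auc = cross_val_score(model, X_hist, y_hist, cv=5, scoring="roc_auc").mean()

# AUC on the blind set after training on all historic data.
model.fit(X_hist, y_hist)
blind_auc = roc_auc_score(y_blind, model.predict_proba(X_blind)[:, 1])

print(f"cross-validated AUC = {cv_auc:.3f}, blind-set AUC = {blind_auc:.3f}")
```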

Results

We found that cross-validated performances systematically overestimated performance on the blind set. This was found not to be due to the presence of similar peptides in the cross-validation dataset. Rather, we found that small size and low sequence/affinity diversity of either training or blind datasets were associated with large differences in cross-validated vs. blind prediction performances. We use these findings to derive quantitative rules of how large and diverse datasets need to be to provide generalizable performance estimates.

Conclusion

It has long been known that cross-validated prediction performance estimates often overestimate performance on independently generated blind set data. We here identify and quantify the specific factors contributing to this effect for MHC-I binding predictions. An increasing number of peptides for which MHC binding affinities are measured experimentally have been selected based on binding predictions and thus are less diverse than historic datasets sampling the entire sequence and affinity space, making them more difficult benchmark data sets. This has to be taken into account when comparing performance metrics between different benchmarks, and when deriving error estimates for predictions based on benchmark performance.

Electronic supplementary material

The online version of this article (doi:10.1186/1471-2105-15-241) contains supplementary material, which is available to authorized users.

11.
I-TASSER server for protein 3D structure prediction

Background  

Prediction of 3-dimensional protein structures from amino acid sequences represents one of the most important problems in computational structural biology. The community-wide Critical Assessment of Structure Prediction (CASP) experiments have been designed to obtain an objective assessment of the state-of-the-art of the field, where I-TASSER was ranked as the best method in the server section of the recent 7th CASP experiment. Our laboratory has since then received numerous requests about the public availability of the I-TASSER algorithm and the usage of the I-TASSER predictions.

12.
13.

Introduction

The clinical significance of a treatment effect demonstrated in a randomized trial is typically assessed by reference to differences in event rates at the group level. An alternative is to make individualized predictions for each patient based on a prediction model. This approach is growing in popularity, particularly for cancer. Despite its intuitive advantages, it remains plausible that some prediction models may do more harm than good. Here we present a novel method for determining whether predictions from a model should be used to apply the results of a randomized trial to individual patients, as opposed to using group level results.

Methods

We propose applying the prediction model to a data set from a randomized trial and examining the results of patients for whom the treatment arm recommended by a prediction model is congruent with allocation. These results are compared with the strategy of treating all patients through use of a net benefit function that incorporates both the number of patients treated and the outcome. We examined models developed using data sets regarding adjuvant chemotherapy for colorectal cancer and Dutasteride for benign prostatic hypertrophy.
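
One simple form such a net benefit function could take is sketched below; the exact function used in the study may differ, and all numbers are hypothetical:

```python
# Illustrative sketch (not necessarily the study's exact function): comparing
# "treat all trial patients" with "treat only patients for whom the model
# recommends treatment", using a net benefit that trades off events avoided
# against the number of patients treated. The threshold w is the absolute risk
# reduction a patient would require before opting for treatment.

def net_benefit(events_avoided_per_100, treated_per_100, w):
    """Events avoided minus a preference-weighted penalty per patient treated."""
    return events_avoided_per_100 - w * treated_per_100

w = 0.05  # patient requires at least a 5-point absolute risk reduction

# Hypothetical trial-level results per 100 patients:
treat_all    = net_benefit(events_avoided_per_100=8, treated_per_100=100, w=w)
model_guided = net_benefit(events_avoided_per_100=7, treated_per_100=60,  w=w)

print(f"treat all: {treat_all:.1f}  model-guided: {model_guided:.1f}")
# Here the model-guided strategy wins (4.0 vs 3.0): it avoids almost as many
# events while sparing 40 of every 100 patients the treatment.
```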

Results

For adjuvant chemotherapy, we found that patients who would opt for chemotherapy even for small risk reductions, and, conversely, those who would require a very large risk reduction, would on average be harmed by using a prediction model; those with intermediate preferences would on average benefit by allowing such information to help their decision making. Use of prediction could, at worst, lead to the equivalent of an additional death or recurrence per 143 patients; at best it could lead to the equivalent of a reduction in the number of treatments of 25% without an increase in event rates. In the Dutasteride case, where the average benefit of treatment is more modest, there is a small benefit of prediction modelling, equivalent to a reduction of one event for every 100 patients given an individualized prediction.

Conclusion

The size of the benefit associated with appropriate clinical implementation of a good prediction model is sufficient to warrant development of further models. However, care is advised in the implementation of prediction modelling, especially for patients who would opt for treatment even if it was of relatively little benefit.

14.

Background  

In conventional PCR, total amplicon yield becomes independent of starting template number as amplification reaches a plateau, and varies significantly among replicate reactions. This paper describes a strategy for reconfiguring PCR so that the signal intensity of a single fluorescent detection probe after PCR thermal cycling reflects genomic composition. The resulting method corrects for product yield variations among replicate amplification reactions, permits resolution of homozygous and heterozygous genotypes based on endpoint fluorescence signal intensities, and readily identifies imbalanced allele ratios equivalent to those arising from gene/chromosomal duplications. Furthermore, the use of only a single colored probe for genotyping enhances the multiplex detection capacity of the assay.

15.

Background  

Gene-gene epistatic interactions likely play an important role in the genetic basis of many common diseases. Recently, machine-learning and data mining methods have been developed for learning epistatic relationships from data. A well-known combinatorial method that has been successfully applied for detecting epistasis is Multifactor Dimensionality Reduction (MDR). Jiang et al. created a combinatorial epistasis learning method called BNMBL to learn Bayesian network (BN) epistatic models. They compared BNMBL to MDR using simulated data sets. Each of these data sets was generated from a model that associates two SNPs with a disease and includes 18 unrelated SNPs. For each data set, BNMBL and MDR were used to score all 2-SNP models, and BNMBL learned significantly more correct models. In real data sets, we ordinarily do not know the number of SNPs that influence phenotype. BNMBL may not perform as well if models containing more than two SNPs are also scored. Furthermore, a number of other BN scoring criteria have been developed, which may detect epistatic interactions even better than BNMBL.

16.

Background

At the current price, the use of high-density single nucleotide polymorphism (SNP) genotyping assays in genomic selection of dairy cattle is limited to applications involving elite sires and dams. The objective of this study was to evaluate the use of low-density assays to predict direct genomic value (DGV) for five milk production traits, an overall conformation trait, a survival index, and two profit index traits (APR, ASI).

Methods

Dense SNP genotypes were available for 42,576 SNP for 2,114 Holstein bulls and 510 cows. A subset of 1,847 bulls born between 1955 and 2004 was used as a training set to fit models with various sets of pre-selected SNP. A group of 297 bulls born between 2001 and 2004 and all cows born between 1992 and 2004 were used to evaluate the accuracy of DGV prediction. Ridge regression (RR) and partial least squares regression (PLSR) were used to derive prediction equations and to rank SNP based on the absolute value of their regression coefficients. Four alternative strategies were applied to select subsets of SNP: either subsets of the highest-ranked SNP for each individual trait, or a single subset of evenly spaced SNP, where SNP were selected based on their rank for ASI, APR or minor allele frequency within intervals of approximately equal length.
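
A rough sketch of the ridge-regression step and the ranking of SNP by the absolute value of their coefficients, on simulated genotypes rather than the study data (all sizes and parameters are hypothetical):

```python
# Illustrative sketch: ridge regression of a trait on SNP genotypes, ranking of
# SNP by |regression coefficient|, and re-fitting a low-density prediction
# equation from the top-ranked subset. Data are simulated, not the study's.
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)
n_animals, n_snp = 500, 5000
X = rng.integers(0, 3, size=(n_animals, n_snp)).astype(float)  # 0/1/2 genotypes
beta = np.zeros(n_snp)
beta[rng.choice(n_snp, 100, replace=False)] = rng.normal(size=100)  # causal SNP
y = X @ beta + rng.normal(scale=5.0, size=n_animals)                # simulated trait

train, test = slice(0, 400), slice(400, None)

full = Ridge(alpha=100.0).fit(X[train], y[train])
rank = np.argsort(-np.abs(full.coef_))       # SNP ranked by |coefficient|
top = rank[:3000]                            # low-density subset

low = Ridge(alpha=100.0).fit(X[train][:, top], y[train])

def accuracy(model, cols):
    pred = model.predict(X[test][:, cols])
    return np.corrcoef(pred, y[test])[0, 1]  # correlation as "accuracy" of DGV

print(f"all SNP: {accuracy(full, slice(None)):.2f}  "
      f"top 3,000 SNP: {accuracy(low, top):.2f}")
```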

Results

RR and PLSR performed very similarly to predict DGV, with PLSR performing better for low-density assays and RR for higher-density SNP sets. When using all SNP, DGV predictions for production traits, which have a higher heritability, were more accurate (0.52-0.64) than for survival (0.19-0.20), which has a low heritability. The gain in accuracy using subsets that included the highest ranked SNP for each trait was marginal (5-6%) over a common set of evenly spaced SNP when at least 3,000 SNP were used. Subsets containing 3,000 SNP provided more than 90% of the accuracy that could be achieved with a high-density assay for cows, and 80% of the high-density assay for young bulls.

Conclusions

Accurate genomic evaluation of the broader bull and cow population can be achieved with a single genotyping assay containing ~3,000 to 5,000 evenly spaced SNP.

17.

Background  

Gene set enrichment analysis (GSEA) is a microarray data analysis method that uses predefined gene sets and gene ranks to identify significant biological changes in microarray data sets. GSEA is especially useful when the gene expression changes in a given microarray data set are minimal or moderate.
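
For orientation, a simplified (unweighted) version of the GSEA running-sum enrichment score can be sketched as follows; the published method additionally weights hits by each gene's correlation with the phenotype and assesses significance by permutation:

```python
# Simplified, unweighted (Kolmogorov-Smirnov-like) GSEA enrichment score.
def enrichment_score(ranked_genes, gene_set):
    gene_set = set(gene_set)
    n = len(ranked_genes)
    n_hits = len(gene_set & set(ranked_genes))
    hit_step = 1.0 / n_hits            # increment when a set member is met
    miss_step = 1.0 / (n - n_hits)     # decrement otherwise
    running, best = 0.0, 0.0
    for gene in ranked_genes:          # walk down the ranked list
        running += hit_step if gene in gene_set else -miss_step
        if abs(running) > abs(best):
            best = running             # maximum deviation from zero
    return best

# Hypothetical ranked list (most up-regulated first) and gene set.
ranked = ["g%d" % i for i in range(1, 21)]
print(enrichment_score(ranked, {"g1", "g3", "g4", "g15"}))
```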

18.

Background  

The choice of probe set algorithms for expression summary in a GeneChip study has a great impact on subsequent gene expression data analysis. Spiked-in cRNAs with known concentration are often used to assess the relative performance of probe set algorithms. Given the fact that the spiked-in cRNAs do not represent endogenously expressed genes in experiments, it becomes increasingly important to have methods to study whether a particular probe set algorithm is more appropriate for a specific dataset, without using such external reference data.

19.
20.

Background

Systems biology has embraced computational modeling in response to the quantitative nature and increasing scale of contemporary data sets. The onslaught of data is accelerating as molecular profiling technology evolves. The Dialogue for Reverse Engineering Assessments and Methods (DREAM) is a community effort to catalyze discussion about the design, application, and assessment of systems biology models through annual reverse-engineering challenges.

Methodology and Principal Findings

We describe our assessments of the four challenges associated with the third DREAM conference, which came to be known as the DREAM3 challenges: signaling cascade identification, signaling response prediction, gene expression prediction, and the DREAM3 in silico network challenge. The challenges, based on anonymized data sets, tested participants in network inference and prediction of measurements. Forty teams submitted 413 predicted networks and measurement test sets. Overall, a handful of best-performer teams were identified, while the majority of teams made predictions that were equivalent to random. Counterintuitively, combining the predictions of multiple teams (including the weaker teams) can in some cases improve predictive power beyond that of any single method.
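
One simple way to combine the predictions of multiple teams, in the spirit of the observation above, is to average the ranks each team assigns to the candidate interactions; the prediction matrix in this sketch is hypothetical:

```python
# Illustrative "rank average" ensemble over several teams' edge-confidence
# predictions. The score matrix below is hypothetical.
import numpy as np
from scipy.stats import rankdata

# Rows: teams, columns: candidate network edges; higher score = more confident.
team_scores = np.array([
    [0.9, 0.2, 0.7, 0.1, 0.4],
    [0.6, 0.3, 0.8, 0.2, 0.1],
    [0.1, 0.4, 0.9, 0.3, 0.8],   # a weaker/contrarian team still contributes
])

# Rank within each team's own prediction, then average across teams.
ranks = np.vstack([rankdata(row) for row in team_scores])
consensus = ranks.mean(axis=0)

print("consensus ranking of edges (best first):", np.argsort(-consensus))
```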

Conclusions

DREAM provides valuable feedback to practitioners of systems biology modeling. Lessons learned from the predictions of the community provide much-needed context for interpreting claims of efficacy of algorithms described in the scientific literature.
