Similar Literature
20 similar documents found
1.
In this study, we used the phenotype simulation package naturalgwas to test the performance of Zhao's Random Forest method in comparison to an uncorrected Random Forest test, latent factor mixed models (LFMM), genome-wide efficient mixed models (GEMMA), and confounder-adjusted linear regression (CATE). We created 400 sets of phenotypes, corresponding to five effect sizes and two, five, 15, or 30 causal loci, simulated from two empirical data sets containing SNPs from Striped Bass representing three and 13 populations. All association methods were evaluated for their ability to detect genotype–phenotype associations based on power, false discovery rates, and number of false positives. Genomic inflation was highest for the uncorrected Random Forest and LFMM tests and lowest for GEMMA and Zhao's Random Forest. All association tests had similar power to detect causal loci, and Zhao's Random Forest had the lowest false discovery rate in all scenarios. To measure the performance of association tests in small data sets with few loci surrounding a causal gene, we also ran the analyses again after removing causal loci from each data set. All association tests were only able to find true positives, defined as loci located within 30 kbp of a causal locus, in 3%–18% of simulations. In contrast, at least one false positive was found in 17%–44% of simulations. Zhao's Random Forest again identified the fewest false positives of all association tests studied. The ability to test the power of association tests for individual empirical data sets can be an extremely useful first step when designing a GWAS study.
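
A minimal sketch of how one simulated replicate could be scored for power, false discovery rate and false-positive count, assuming the 30 kbp window used above to define a true positive. The function name and the (chromosome, position) data structures are illustrative, not the authors' actual pipeline.

```python
# Hypothetical scoring of one simulated GWAS replicate: a hit counts as a true
# positive if it lies within `window` bp of a simulated causal locus.
def score_replicate(significant_hits, causal_loci, window=30_000):
    """significant_hits, causal_loci: lists of (chromosome, position) tuples."""
    def near_causal(hit):
        chrom, pos = hit
        return any(c == chrom and abs(pos - p) <= window for c, p in causal_loci)

    false_pos = [h for h in significant_hits if not near_causal(h)]

    detected = {
        (c, p) for c, p in causal_loci
        if any(h[0] == c and abs(h[1] - p) <= window for h in significant_hits)
    }

    power = len(detected) / len(causal_loci) if causal_loci else float("nan")
    fdr = len(false_pos) / len(significant_hits) if significant_hits else 0.0
    return {"power": power, "FDR": fdr, "n_false_positives": len(false_pos)}
```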

2.
Effects of filtering by Present call on analysis of microarray experiments

Background

Affymetrix GeneChips® are widely used for expression profiling of tens of thousands of genes. The large number of comparisons can lead to false positives. Various methods have been used to reduce false positives, but they have rarely been compared or quantitatively evaluated. Here we describe and evaluate a simple method that uses the detection (Present/Absent) call generated by the Affymetrix microarray suite version 5 software (MAS5) to remove data that is not reliably detected before further analysis, and compare this with filtering by expression level. We explore the effects of various thresholds for removing data in experiments of different size (from 3 to 10 arrays per treatment), as well as their relative power to detect significant differences in expression.

Results

Our approach sets a threshold for the fraction of arrays called Present in at least one treatment group. This method removes a large percentage of probe sets called Absent before carrying out the comparisons, while retaining most of the probe sets called Present. It preferentially retains the more significant probe sets (p ≤ 0.001) and those probe sets that are turned on or off, and improves the false discovery rate. Permutations to estimate false positives indicate that probe sets removed by the filter contribute a disproportionate number of false positives. Filtering by fraction Present is effective when applied to data generated either by the MAS5 algorithm or by other probe-level algorithms, for example RMA (robust multichip average). Experiment size greatly affects the ability to reproducibly detect significant differences, and also impacts the effect of filtering; smaller experiments (3–5 samples per treatment group) benefit from more restrictive filtering (≥50% Present).

Conclusion

Use of a threshold fraction of Present detection calls (derived by MAS5) provided a simple method that effectively eliminated from analysis probe sets that are unlikely to be reliable while preserving the most significant probe sets and those turned on or off; it thereby increased the ratio of true positives to false positives.
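
A minimal sketch of the fraction-Present filter described above, assuming MAS5 detection calls are available as a table of 'P'/'M'/'A' flags (probe sets by arrays) and that treatment-group labels are supplied per array. The 0.5 default mirrors the >=50% Present setting suggested for small experiments; the data layout is an assumption.

```python
import pandas as pd

# Keep a probe set if the fraction of arrays called Present ('P') reaches the
# threshold in at least one treatment group.
def filter_by_present(calls: pd.DataFrame, groups: pd.Series, threshold: float = 0.5):
    """calls: probe sets x arrays of 'P'/'M'/'A'; groups: array -> treatment label."""
    frac_present = (calls == "P").T.groupby(groups).mean().T  # probe sets x groups
    keep = frac_present.max(axis=1) >= threshold
    return calls.loc[keep].index  # probe sets retained for downstream testing
```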

3.
A major challenge in the development of peptide-based vaccines is finding the right immunogenic element, with efficient and long-lasting immunization effects, from the many potential targets encoded by pathogen genomes. Computer models are convenient tools for scanning pathogen genomes to preselect candidate immunogenic peptides for experimental validation. Current methods predict many false positives, a consequence of the low prevalence of true positives. We develop a test-reject method based on the prediction uncertainty estimates determined by Gaussian process regression. This method filters false positives among epitopes predicted from a pathogen genome. The performance of stand-alone Gaussian process regression is compared to other state-of-the-art methods using cross-validation on 11 benchmark data sets. The results show that the Gaussian process method has the same accuracy as the top-performing algorithms. The combination of Gaussian process regression with the proposed test-reject method is used to detect true epitopes from the Vaccinia virus genome. The test rejection increases the prediction accuracy by reducing the number of false positives without sacrificing the method's sensitivity. We show that the Gaussian process in combination with test rejection is an effective method for prediction of T-cell epitopes in large and diverse pathogen genomes, where false positives are of concern.
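
A hedged sketch of the test-reject idea, using scikit-learn's GaussianProcessRegressor as a stand-in for the paper's model. The peptide feature encoding, kernel choice and the standard-deviation threshold are assumptions; the point is only that predictions with high posterior uncertainty are rejected.

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, WhiteKernel

# Fit a GP on peptide feature vectors with measured binding scores, then keep
# only test predictions whose posterior standard deviation is below a threshold.
def predict_with_reject(X_train, y_train, X_test, std_threshold=0.5):
    gp = GaussianProcessRegressor(kernel=RBF() + WhiteKernel(), normalize_y=True)
    gp.fit(X_train, y_train)
    mean, std = gp.predict(X_test, return_std=True)
    accepted = std < std_threshold          # reject uncertain predictions
    return mean[accepted], np.flatnonzero(accepted)
```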

4.
Perola E. Proteins. 2006;64(2):422–435.
In spite of recent improvements in docking and scoring methods, high false-positive rates remain a common issue in structure-based virtual screening. In this study, the distinctive features of false positives in kinase virtual screens were investigated. A series of retrospective virtual screens on kinase targets was performed on specifically designed test sets, each combining true ligands and experimentally confirmed inactive compounds. A systematic analysis of the docking poses generated for the top-ranking compounds highlighted key aspects differentiating true hits from false positives. The most recurring feature in the poses of false positives was the absence of certain key interactions known to be required for kinase binding. A systematic analysis of 444 crystal structures of ligand-bound kinases showed that at least two hydrogen bonds between the ligand and the backbone protein atoms in the kinase hinge region are present in 90% of the complexes, with very little variability across targets. Closer inspection showed that when the two hydrogen bonds are present, one of three preferred hinge-binding motifs is involved in 96.5% of the cases. Less than 10% of the false positives satisfied these two criteria in the minimized docking poses generated by our standard protocol. Ligand conformational artifacts were also shown to contribute to the occurrence of false positives in a number of cases. Application of this knowledge in the form of docking constraints and post-processing filters provided consistent improvements in virtual screening performance on all systems. The false-positive rates were significantly reduced and the enrichment factors increased by an average of twofold. On the basis of these results, a generalized two-step protocol for virtual screening on kinase targets is suggested.
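
A simple post-processing filter in the spirit of the analysis above: discard docking poses that do not make at least two hydrogen bonds to backbone atoms of the hinge residues. The hydrogen-bond records and the hinge-residue set are assumed to come from an upstream structural analysis; their format here is hypothetical, not the paper's protocol.

```python
# Keep a docking pose only if it makes >= min_hbonds hydrogen bonds to backbone
# atoms of the kinase hinge residues; poses failing the filter would be
# discarded or down-weighted during post-processing.
def passes_hinge_filter(hbonds, hinge_residues, min_hbonds=2):
    """hbonds: iterable of dicts like {'residue': 'MET793', 'atom': 'N', 'is_backbone': True}."""
    hinge_hbonds = [
        hb for hb in hbonds
        if hb["is_backbone"] and hb["residue"] in hinge_residues
    ]
    return len(hinge_hbonds) >= min_hbonds
```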

5.
We have constructed a perceptron-type neural network for E. coli promoter prediction and improved its ability to generalize with a new technique for selecting the sequence features shown during training. We have also reconstructed five previous prediction methods and compared the effectiveness of those methods and our neural network. Surprisingly, the simple statistical method of Mulligan et al. performed the best amongst the previous methods. Our neural network was comparable to Mulligan's method when false positives were kept low and better than Mulligan's method when false negatives were kept low. We also showed the correlation between the prediction rates of neural networks achieved by previous researchers and the information content of their data sets.

6.
Broberg P. Genome Biology. 2002;3(9):preprint00-23.

Background  

In the pharmaceutical industry and in academia, substantial efforts are made to make the best use of the promising microarray technology. The data generated by microarrays are more complex than most other biological data currently attracting attention. A method for finding an optimal test statistic with which to rank genes with respect to differential expression is outlined and tested. At the heart of the method lies an estimate of the false negative and false positive rates. Both investing in false positives and missing true positives lead to a waste of resources, and the procedure sets out to minimise these errors. For the calculation of the false positive and false negative rates, a simulation procedure is invoked.

7.
An objective of many functional genomics studies is to estimate treatment-induced changes in gene expression. cDNA arrays interrogate each tissue sample for the levels of mRNA for hundreds to tens of thousands of genes, and the use of this technology leads to a multitude of treatment contrasts. By-gene hypothesis tests evaluate the evidence supporting no effect, but selecting a significance level requires dealing with the multitude of comparisons. The p-values from these tests order the genes such that a p-value cutoff divides the genes into two sets. Ideally one set would contain the affected genes and the other would contain the unaffected genes. However, the set of genes selected as affected will have false positives, i.e., genes that are not affected by treatment. Likewise, the other set of genes, selected as unaffected, will contain false negatives, i.e., genes that are affected. A plot of the observed p-values (1 - p) versus their expectation under a uniform [0, 1] distribution allows one to estimate the number of true null hypotheses. With this estimate, the false positive rates and false negative rates associated with any p-value cutoff can be estimated. When computed for a range of cutoffs, these rates summarize the ability of the study to resolve effects. In our work, we are more interested in selecting most of the affected genes rather than protecting against a few false positives. An optimum cutoff, i.e., the best set given the data, depends upon the relative cost of falsely classifying a gene as affected versus the cost of falsely classifying a gene as unaffected. We select the cutoff by a decision-theoretic method analogous to methods developed for receiver operating characteristic curves. In addition, we estimate the false discovery rate and the false nondiscovery rate associated with any cutoff value. Two functional genomics studies that were designed to assess a treatment effect are used to illustrate how the methods allowed the investigators to determine a cutoff to suit their research goals.
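
A sketch of the bookkeeping described above: estimate the number of true null hypotheses from the p-value distribution and then estimate the false discovery rate for any cutoff. This uses a Storey-style pi0 estimate (p-values above a tuning value lambda are treated as mostly nulls), which is in the same spirit as the uniform-expectation plot in the abstract but is not the authors' exact construction; lambda = 0.5 is an assumption.

```python
import numpy as np

def estimate_m0(pvalues, lam=0.5):
    """Estimated number of true null hypotheses among the tested genes."""
    pvalues = np.asarray(pvalues)
    return min(len(pvalues), np.sum(pvalues > lam) / (1.0 - lam))

def fdr_at_cutoff(pvalues, cutoff, m0=None):
    """Estimated false discovery rate for declaring genes with p <= cutoff affected."""
    pvalues = np.asarray(pvalues)
    if m0 is None:
        m0 = estimate_m0(pvalues)
    n_selected = max(int(np.sum(pvalues <= cutoff)), 1)
    expected_false_pos = m0 * cutoff        # expected nulls falling below the cutoff
    return min(expected_false_pos / n_selected, 1.0)
```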

8.
In vitro experiments with C3H 10T1/2 mouse cells were performed to determine whether Frequency Division Multiple Access (FDMA) or Code Division Multiple Access (CDMA) modulated radiofrequency (RF) radiations induce changes in gene expression. After the cells were exposed to either modulation for 24 h at a specific absorption rate (SAR) of 5 W/kg, RNA was extracted from both exposed and sham-exposed cells for gene expression analysis. As a positive control, cells were exposed to 0.68 Gy of X rays and gene expression was evaluated 4 h after exposure. Gene expression was evaluated using the Affymetrix U74Av2 GeneChip to detect changes in mRNA levels. Each exposure condition was repeated three times. The GeneChip data were analyzed using a two-tailed t test, and the expected number of false positives was estimated from t tests on 20 permutations of the six sham RF-field-exposed samples. For the X-ray-treated samples, there were more than 90 probe sets with expression changes greater than 1.3-fold beyond the number of expected false positives. Approximately one-third of these genes had previously been reported in the literature as being responsive to radiation. In contrast, for both CDMA and FDMA radiation, the number of probe sets with an expression change greater than 1.3-fold was less than or equal to the expected number of false positives. Thus, the 24-h exposures to FDMA or CDMA RF radiation at 5 W/kg had no statistically significant effect on gene expression.
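
A hedged sketch of the permutation control described above: split the six sham samples into every 3-vs-3 pseudo-comparison (there are 20 such ordered splits, matching the 20 permutations quoted), apply the same two-tailed t-test and 1.3-fold-change criteria, and average the counts to estimate the expected number of false positives. The expression-matrix layout, alpha level and overall implementation are assumptions.

```python
import numpy as np
from itertools import combinations
from scipy import stats

def expected_false_positives(expr, alpha=0.05, fold=1.3):
    """expr: array of shape (n_probe_sets, 6) holding the six sham samples."""
    n = expr.shape[1]
    counts = []
    for group_a in combinations(range(n), n // 2):
        group_b = [i for i in range(n) if i not in group_a]
        a, b = expr[:, list(group_a)], expr[:, group_b]
        t, p = stats.ttest_ind(a, b, axis=1)                 # two-tailed t-test per probe set
        ratio = a.mean(axis=1) / b.mean(axis=1)
        change = np.maximum(ratio, 1.0 / ratio) >= fold      # >= 1.3-fold in either direction
        counts.append(np.sum((p < alpha) & change))
    return np.mean(counts)                                   # expected false-positive count
```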

9.
When detecting positive selection in proteins, the prevalence of errors resulting from misalignment and the ability of alignment filters to mitigate such errors are not well understood, but filters are commonly applied to try to avoid false positive results. Focusing on the sitewise detection of positive selection across a wide range of divergence levels and indel rates, we performed simulation experiments to quantify the false positives and false negatives introduced by alignment error and the ability of alignment filters to improve performance. We found that some aligners led to many false positives, whereas others resulted in very few. False negatives were a problem for all aligners, increasing with sequence divergence. Of the aligners tested, PRANK's codon-based alignments consistently performed the best and ClustalW performed the worst. Of the filters tested, GUIDANCE performed the best and Gblocks performed the worst. Although some filters showed good ability to reduce the error rates from ClustalW and MAFFT alignments, none were found to substantially improve the performance of PRANK alignments under most conditions. Our results revealed distinct trends in error rates and power levels for aligners and filters within a biologically plausible parameter space. With the best aligner, a low false positive rate was maintained even with extremely divergent indel-prone sequences. Controls using the true alignment and an optimal filtering method suggested that performance improvements could be gained by improving aligners or filters to reduce the prevalence of false negatives, especially at higher divergence levels and indel rates.

10.
DNA microarray technology provides useful tools for profiling global gene expression patterns in different cell/tissue samples. One major challenge is the large number of genes relative to the number of samples. The use of all genes can suppress or reduce the performance of a classification rule due to the noise of nondiscriminatory genes. Selection of an optimal subset from the original gene set becomes an important pre-step in sample classification. In this study, we propose a family-wise error (FWE) rate approach to selection of discriminatory genes for two-sample or multiple-sample classification. The FWE approach controls the probability of one or more false positives at a prespecified level. A public colon cancer data set is used to evaluate the performance of the proposed approach for the two classification methods: k nearest neighbors (k-NN) and support vector machine (SVM). The gene sets selected by the proposed procedure appear to perform better than, or comparably to, several results reported in the literature using univariate analysis without a multivariate search. In addition, we apply the FWE approach to a toxicogenomic data set with nine treatments (a control and eight metals, As, Cd, Ni, Cr, Sb, Pb, Cu, and AsV) for a total of 55 samples for a multisample classification. Two gene sets are considered: the gene set ωF formed by the ANOVA F-test, and a gene set ωT formed by the union of one-versus-all t-tests. The predicted accuracies are evaluated using internal and external cross-validation. Using the SVM classification, the overall accuracies for predicting the 55 samples into one of the nine treatments are above 80% for internal cross-validation. ωF has slightly higher accuracy rates than ωT. The overall predicted accuracies are above 70% for the external cross-validation; the two gene sets ωT and ωF performed equally well.
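
An illustrative sketch of the two-step procedure: select genes whose ANOVA F-test p-values survive FWE control, then assess an SVM on the selected genes by cross-validation. A Bonferroni adjustment is used here as the simplest FWER-controlling stand-in; the authors' exact FWE procedure, classifier settings and data layout may differ.

```python
import numpy as np
from scipy import stats
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score

def fwe_select_and_classify(expr, labels, fwe_level=0.05, cv=5):
    """expr: samples x genes matrix; labels: class label per sample."""
    labels = np.asarray(labels)
    classes = np.unique(labels)
    # One-way ANOVA F-test per gene across the treatment classes.
    pvals = np.array([
        stats.f_oneway(*[expr[labels == c, g] for c in classes]).pvalue
        for g in range(expr.shape[1])
    ])
    selected = np.flatnonzero(pvals <= fwe_level / expr.shape[1])  # Bonferroni FWE control
    scores = cross_val_score(SVC(kernel="linear"), expr[:, selected], labels, cv=cv)
    return selected, scores.mean()
```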

11.
We aim to compare the performance of Bowtie2, bwa-mem, blastn and blastx when aligning bacterial metagenomes against the Comprehensive Antibiotic Resistance Database (CARD). Simulated reads were used to evaluate the performance of each aligner under the following four performance criteria: correctly mapped, false positives, multi-reads and partials. The optimal alignment approach was applied to samples from two wastewater treatment plants to detect antibiotic resistance genes using next-generation sequencing. blastn mapped with the greatest accuracy among the four sequence alignment approaches considered, followed by Bowtie2. blastx generated the greatest number of false positives and multi-reads when aligned against the CARD. The performance of each alignment tool was also investigated using error-free reads. Although each aligner mapped a greater number of error-free reads compared to Illumina-error reads, in general, the introduction of sequencing errors had little effect on alignment results when aligning against the CARD. Across all performance criteria, blastn was found to be the most favourable alignment tool and was therefore used to assess resistance genes in sewage samples. Beta-lactam and aminoglycoside resistance genes were found to be the most abundant classes in each sample.

Significance and Impact of the Study

Antibiotic resistance genes (ARGs) are pollutants known to persist in wastewater treatment plants among other environments; thus, methods for detecting these genes have become increasingly relevant. Next-generation sequencing has brought about a host of sequence alignment tools that provide a comprehensive look into antimicrobial resistance in environmental samples. However, standardizing practices in ARG metagenomic studies is challenging since results produced by alignment tools can vary significantly. Our study provides sequence alignment results for synthetic and authentic bacterial metagenomes mapped against an ARG database using multiple alignment tools, and identifies best practice for detecting ARGs in environmental samples.

12.
The gap between the number of known protein sequences and structures continues to widen, particularly as a result of sequencing projects for entire genomes. Recently there have been many attempts to generate structural assignments to all genes on sets of completed genomes using fold-recognition methods. We developed a method that detects false positives made by these genome-wide structural assignment experiments by identifying isolated occurrences. The method was tested using two sets of assignments, generated by SUPERFAMILY and PSI-BLAST, on 150 completed genomes. A phylogeny of these genomes was built and a parsimony algorithm was used to identify isolated occurrences by detecting occurrences that cause a gain at leaf level. Isolated occurrences tend to have high e-values, and in both sets of assignments, a sudden increase in isolated occurrences is observed for e-values > 10^-8 for SUPERFAMILY and > 10^-4 for PSI-BLAST. Conditions to predict false positives are based on these results. Independent tests confirm that the predicted false positives are indeed more likely to be incorrectly assigned. Evaluation of the predicted false positives also showed that the accuracy of profile-based fold-recognition methods might depend on secondary structure content and sequence length. We show that false positives generated by fold-recognition methods can be identified by considering structural occurrence patterns on completed genomes; occurrences that are isolated within the phylogeny tend to be less reliable. The method provides a new independent way to examine the quality of fold assignments and may be used to improve the output of any genome-wide fold assignment method.

13.
Sequence-based residue contact prediction plays a crucial role in protein structure reconstruction. In recent years, the combination of evolutionary coupling analysis (ECA) and deep learning (DL) techniques has made tremendous progress in residue contact prediction; thus, a comprehensive assessment of current methods based on a large-scale benchmark data set is much needed. In this study, we evaluate 18 contact predictors on 610 non-redundant proteins and 32 CASP13 targets from a wide range of perspectives. The results show that different methods have different application scenarios: (1) DL methods based on multiple categories of inputs and large training sets are the best choices for low-contact-density proteins such as the intrinsically disordered ones and proteins with shallow multiple sequence alignments (MSAs). (2) With at least 5L (L is sequence length) effective sequences in the MSA, all the methods show the best performance, and methods that rely only on the MSA as input can achieve performance comparable to methods that adopt multi-source inputs. (3) For top L/5 and L/2 predictions, DL methods can predict more hydrophobic interactions while ECA methods predict more salt bridges and disulfide bonds. (4) ECA methods can detect more secondary structure interactions, while DL methods can accurately excavate more contact patterns and prune isolated false positives. In general, multi-input DL methods with large training sets dominate current approaches with the best overall performance. Despite the great success of current DL methods, it must be noted that there is still much room for further improvement: (1) With shallow MSAs, the performance will be greatly affected. (2) Current methods show lower precisions for inter-domain compared with intra-domain contact predictions, as well as very high imbalances in precision among intra-domain predictions. (3) Strong prediction similarities between DL methods indicate that more feature types and diversified models need to be developed. (4) The runtime of most methods can be further optimized.
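
A sketch of the standard top-L/k evaluation referenced above: rank residue pairs by predicted contact probability and compute the precision of the top L/5 (or L/2) long-range predictions against a native contact map. The native map (e.g. derived from Cβ–Cβ distances under 8 Å) and the |i-j| >= 24 long-range cutoff are common conventions assumed here, not necessarily the benchmark's exact settings.

```python
def top_k_precision(pred, native, k_fraction=5, min_separation=24):
    """pred, native: L x L symmetric matrices (probabilities / 0-1 contacts)."""
    L = pred.shape[0]
    # Collect long-range pairs only, i.e. residues separated by >= min_separation.
    pairs = [
        (pred[i, j], native[i, j])
        for i in range(L) for j in range(i + min_separation, L)
    ]
    pairs.sort(key=lambda x: x[0], reverse=True)          # highest predicted probability first
    top = pairs[: max(L // k_fraction, 1)]                # top L/5, L/2, ... predictions
    return sum(truth for _, truth in top) / len(top)      # fraction that are native contacts
```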

14.
Searches using position-specific scoring matrices (PSSMs) have been commonly used in remote homology detection procedures such as PSI-BLAST and RPS-BLAST. A PSSM is typically generated using one of the sequences of a family as the reference sequence. In the case of PSI-BLAST searches, the reference sequence is the same as the query. Recently we have shown that searches against the database of multiple family-profiles, with each one of the members of the family used as a reference sequence, are more effective than searches against the classical database of single family-profiles. Despite a relatively better overall performance compared with common sequence-profile matching procedures, searches against the multiple family-profiles database result in a few false positives and false negatives. Here we show that profile length and the divergence of sequences used in the construction of a PSSM have a major influence on the performance of the multiple-profile search approach. We also show that a simple parameter, the number of a family's PSSMs hit by a query divided by the total number of PSSMs in the family, can effectively distinguish true positives from false positives in the multiple-profile search approach.
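
A minimal sketch of the discriminating parameter described above: for each family, the fraction of that family's PSSMs hit by the query. The list of hit PSSM identifiers and the family-to-PSSM mapping are assumed inputs from the profile search; their format here is hypothetical.

```python
def family_hit_fraction(hits, family_pssms):
    """hits: PSSM ids hit by the query; family_pssms: family -> all its PSSM ids."""
    hit_set = set(hits)
    fractions = {}
    for family, pssms in family_pssms.items():
        fractions[family] = len(hit_set & set(pssms)) / len(pssms)
    return fractions  # higher fractions are expected for true positives
```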

15.
In a wide range of contexts, including predator avoidance, medical decision-making and security screening, decision accuracy is fundamentally constrained by the trade-off between true and false positives. Increased true positives are possible only at the cost of increased false positives; conversely, decreased false positives are associated with decreased true positives. We use an integrated theoretical and experimental approach to show that a group of decision-makers can overcome this basic limitation. Using a mathematical model, we show that a simple quorum decision rule enables individuals in groups to simultaneously increase true positives and decrease false positives. The results from a predator-detection experiment that we performed with humans are in line with these predictions: (i) after observing the choices of the other group members, individuals both increase true positives and decrease false positives, (ii) this effect gets stronger as group size increases, (iii) individuals use a quorum threshold set between the average true- and false-positive rates of the other group members, and (iv) individuals adjust their quorum adaptively to the performance of the group. Our results have broad implications for our understanding of the ecology and evolution of group-living animals and lend themselves to applications in the human domain, such as the design of improved screening methods in medical, forensic, security and business settings.
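
A toy simulation of the quorum effect described above, not the paper's model: each group member makes an independent detection with fixed individual true- and false-positive rates, and the group responds if at least a quorum of members do. With the quorum set between the two individual rates, the group's true positives rise and its false positives fall as group size grows. The rates, group sizes and quorum rule here are illustrative assumptions.

```python
import numpy as np

def group_rates(n, quorum, p_tp=0.7, p_fp=0.3, trials=100_000, seed=0):
    """Monte Carlo estimate of group true- and false-positive rates."""
    rng = np.random.default_rng(seed)
    tp = (rng.random((trials, n)) < p_tp).sum(axis=1) >= quorum  # predator present
    fp = (rng.random((trials, n)) < p_fp).sum(axis=1) >= quorum  # predator absent
    return tp.mean(), fp.mean()

for n in (1, 3, 5, 9):
    quorum = int(np.ceil(n * 0.5))   # threshold set between p_fp and p_tp
    print(n, group_rates(n, quorum)) # true positives increase, false positives decrease with n
```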

16.
A comparison was made of four statistically based schemes for classifying epithelial cells from 243 fine needle aspirates of breast masses as benign or malignant. Two schemes were computer-generated decision trees and two were user-generated. Eleven cytologic characteristics described in the literature as being useful in distinguishing benign from malignant breast aspirates were assessed on a scale of 1 to 10, with 1 being closest to that described as benign and 10 to that described as malignant. The original computer-generated dichotomous decision tree gave 6 false negatives and 12 false positives on the data set; another tree generated from the current data improved performance slightly, with 5 false negatives and 10 false positives. Maximum diagnostic overlap occurred at the cut-point of the original dichotomous tree. The insertion of a third node evaluating additional parameters resulted in one false negative and seven false positives. This performance was matched by summing the scores of the eight characteristics that individually were most effective in separating benign from malignant. We conclude that, while statistically designed, computer-generated dichotomous decision trees identify a starting sequence for applying cytologic characteristics to distinguish between benign and malignant breast aspirates, modifications based on human expert knowledge may result in schemes that improve diagnostic performance.

17.
Experiments with cultured C3H 10T1/2 cells were performed to determine whether exposure to cell phone radiofrequency (RF) radiation induces changes in gene expression. Following a 24-h exposure at a specific absorption rate of 5 W/kg, RNA was extracted from the exposed and sham control cells for microarray analysis on Affymetrix U74Av2 GeneChips. Cells exposed to 0.68 Gy of X-rays with a 4-h recovery were used as positive controls. The number of gene expression changes induced by RF radiation was not greater than the number of false positives expected based on a sham versus sham comparison. In contrast, the X-irradiated samples showed higher numbers of probe sets changing expression level than in the sham versus sham comparison.

18.
FST outlier tests are a potentially powerful way to detect genetic loci under spatially divergent selection. Unfortunately, the extent to which these tests are robust to nonequilibrium demographic histories has been understudied. We developed a landscape genetics simulator to test the effects of isolation by distance (IBD) and range expansion on FST outlier methods. We evaluated the two most commonly used methods for the identification of FST outliers (FDIST2 and BayeScan, which assume samples are evolutionarily independent) and two recent methods (FLK and Bayenv2, which estimate and account for evolutionary nonindependence). Parameterization with a set of neutral loci ('neutral parameterization') always improved the performance of FLK and Bayenv2, while neutral parameterization caused FDIST2 to actually perform worse in the cases of IBD or range expansion. BayeScan was improved when the prior odds on neutrality were increased, regardless of the true odds in the data. On their best performance, however, the widely used methods had high false-positive rates for IBD and range expansion and were outperformed by methods that accounted for evolutionary nonindependence. In addition, default settings in FDIST2 and BayeScan resulted in many false positives suggesting balancing selection. However, all methods did very well if a large set of neutral loci is available to create empirical P-values. We conclude that in species that exhibit IBD or have undergone range expansion, many of the published FST outliers based on FDIST2 and BayeScan are probably false positives, but FLK and Bayenv2 show great promise for accurately identifying loci under spatially divergent selection.
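
A minimal sketch of the empirical-P-value idea the abstract ends on: compare each candidate locus's differentiation statistic (e.g. an FST-based outlier score) to the distribution of the same statistic over a set of presumed-neutral loci. The inputs are assumed arrays of per-locus statistics, and larger statistics are taken to indicate stronger divergent selection.

```python
import numpy as np

def empirical_pvalues(candidate_stats, neutral_stats):
    """One-sided empirical P-values against the neutral-locus distribution."""
    neutral = np.sort(np.asarray(neutral_stats))
    n = len(neutral)
    # Number of neutral loci with a statistic >= each candidate's, with a +1
    # correction so no P-value is exactly zero.
    ranks = n - np.searchsorted(neutral, candidate_stats, side="left")
    return (ranks + 1) / (n + 1)
```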

19.
We generated 61 strains of Escherichia coli in which the expression level of a specific single gene can be changed continuously over a physiologically significant range. In each strain, one auxotrophic gene was deleted from its original position and reinserted at a specific position on the chromosome under the control of the tetA promoter. Therefore, the level of expression of the target gene can be controlled easily by altering the concentrations of inducers, e.g., anhydrotetracycline and doxycycline, in the medium. Protein and mRNA levels and changes in proliferation rate were examined in some of the strains in our collection to determine the ability to control the level of target gene expression over a physiologically significant range. These strains will be useful for extracting omics data sets and for the construction of genome-scale mathematical models, because causality between perturbations in gene expression level and their consequences can be clearly determined.

20.
Indirect tests have detected recombination in mitochondrial DNA (mtDNA) from many animal lineages, including mammals. However, it is possible that features of the molecular evolutionary process without recombination could be incorrectly inferred by indirect tests as being due to recombination. We have identified one such example, which we call "patchy-tachy" (PT), in which different partitions of sequences evolve at different rates and which leads to an excess of false positives for recombination inferred by indirect tests. To explore this phenomenon, we characterized the false positive rates of six widely used indirect tests for recombination using simulations of general models for mtDNA evolution with PT but without recombination. All tests produced 30-99% false positives for recombination, although the conditions that produced the maximal level of false positives differed between the tests. To evaluate the degree to which conditions that exacerbate false positives are found in published sequence data, we turned to 20 animal mtDNA data sets in which recombination is suggested by indirect tests. Using a model where different regions of the sequences were free to evolve at different rates in different lineages, we demonstrated that PT is prevalent in many data sets in which recombination was previously inferred using indirect tests. Taken together, our results argue that PT without recombination is a viable alternative explanation for detection of widespread recombination in animal mtDNA using indirect tests.
