Similar Documents
20 similar documents found (search time: 31 ms)
1.
Markovian analysis is a method to measure optical texture based on gray-level transition probabilities in digitized images. Experiments are described that investigate the classification performance of parameters generated by Markovian analysis. Results using Markov texture parameters show that the selection of the Markov step size strongly affects classification error rates and the number of parameters required to achieve the maximum correct classification rate. Markov texture parameters are shown to achieve high rates of correct classification in discriminating images of normal from abnormal cervical cell nuclei.
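As a concrete illustration of the technique described above (not the authors' code), the sketch below computes gray-level transition probabilities at a configurable step size and derives a few scalar texture parameters from them; the quantization scheme and feature names are our own choices.

```python
import numpy as np

def markov_texture_features(img, step=1, levels=8):
    """Gray-level transition (Markov) texture features for a 2-D image.
    `step` is the Markov step size whose choice the abstract reports
    as strongly affecting classification error rates."""
    # Quantize intensities into a small number of gray levels.
    edges = np.linspace(img.min(), img.max(), levels + 1)[1:-1]
    q = np.digitize(img, edges)
    # Count horizontal transitions i -> j at the given step size.
    counts = np.zeros((levels, levels))
    np.add.at(counts, (q[:, :-step].ravel(), q[:, step:].ravel()), 1)
    # Row-normalize to transition probabilities P(j | i).
    P = counts / np.maximum(counts.sum(axis=1, keepdims=True), 1)
    nz = P[P > 0]
    return {
        "entropy": -np.sum(nz * np.log2(nz)),   # randomness of transitions
        "energy": np.sum(P ** 2),               # uniformity of the matrix
        "diag_mass": np.trace(P) / levels,      # tendency to stay at a level
    }
```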

2.
This paper presents preliminary results of research toward the development of a high-resolution analysis stage for a dual-resolution, image-processing-based prescreening device for cervical cytology. Experiments using both manual and automatic methods for cell segmentation are described. In both cases, 1500 cervical cells were analyzed and classified as normal or abnormal (dysplastic or malignant) using a minimum-Mahalanobis-distance classifier with eight subclasses of normal cells and five subclasses of abnormal cells. With manual segmentation, false-positive and false-negative error rates of 2.98% and 7.73% were obtained. Similar experiments using automatic cell segmentation methods yielded false-positive and false-negative error rates of 3.90% and 11.56%, respectively. In both cases, independent training and testing data were used.
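A minimal sketch of a minimum-Mahalanobis-distance classifier over subclasses, in the spirit of the two-stage design above (8 normal + 5 abnormal subclasses mapped to a normal/abnormal decision); the implementation details (per-subclass covariance, no class priors) are assumptions, not the paper's exact method.

```python
import numpy as np

def fit_subclass_models(X, subclass):
    """Estimate a mean and inverse covariance per subclass.
    Assumes enough samples per subclass for a stable covariance."""
    models = {}
    for s in np.unique(subclass):
        Xs = X[subclass == s]
        models[s] = (Xs.mean(axis=0), np.linalg.inv(np.cov(Xs, rowvar=False)))
    return models

def classify(x, models, subclass_to_class):
    """Assign x to the nearest subclass in Mahalanobis distance,
    then report the parent class (e.g. normal vs. abnormal)."""
    def dist2(s):
        mean, inv_cov = models[s]
        d = x - mean
        return d @ inv_cov @ d
    return subclass_to_class[min(models, key=dist2)]
```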

3.
Abundance trends are the basis for many classifications of threat and recovery status, but they can be a challenge to interpret because of observation error, stochastic variation in abundance (process noise) and temporal autocorrelation in that process noise. To measure the frequency of incorrectly detecting a decline (false-positive or false alarm) and failing to detect a true decline (false-negative), we simulated stable and declining abundance time series across several magnitudes of observation error and autocorrelated process noise. We then empirically estimated the magnitude of observation error and autocorrelated process noise across a broad range of taxa and mapped these estimates onto the simulated parameter space. Based on the taxa we examined, at low classification thresholds (30% decline in abundance) and short observation windows (10 years), false alarms would be expected to occur, on average, about 40% of the time assuming density-independent dynamics, whereas false-negatives would be expected to occur about 60% of the time. However, false alarms and failures to detect true declines were reduced at higher classification thresholds (50% or 80% declines), longer observation windows (20, 40, 60 years), and assuming density-dependent dynamics. The lowest false-positive and false-negative rates are likely to occur for large-bodied, long-lived animal species.
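The simulation design described can be sketched in a few lines: generate a stable or declining log-abundance trajectory with AR(1) process noise plus observation error, then flag a decline from the fitted log-linear trend. Parameter names and the trend-fitting rule below are our assumptions, not the authors' exact protocol.

```python
import numpy as np

rng = np.random.default_rng(0)

def simulate_series(n_years=10, decline=0.3, sigma_proc=0.2,
                    rho=0.5, sigma_obs=0.2):
    """Abundance series with autocorrelated process noise and
    lognormal observation error (density-independent dynamics)."""
    mu = np.log(np.linspace(1.0, 1.0 - decline, n_years))  # true trend
    eps = np.zeros(n_years)
    for t in range(1, n_years):                 # AR(1) process noise
        eps[t] = rho * eps[t - 1] + rng.normal(0, sigma_proc)
    obs = mu + eps + rng.normal(0, sigma_obs, n_years)  # observation error
    return np.exp(obs)

def flags_decline(series, threshold=0.3):
    """Flag 'declining' if the fitted log-linear trend implies at least
    `threshold` proportional loss over the observation window."""
    t = np.arange(len(series))
    slope = np.polyfit(t, np.log(series), 1)[0]
    return 1 - np.exp(slope * (len(series) - 1)) >= threshold

# False-alarm rate: how often a *stable* population is flagged as declining.
stable_runs = [flags_decline(simulate_series(decline=0.0)) for _ in range(2000)]
print("false-positive rate:", np.mean(stable_runs))
```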

4.
Discrete classification is common in Genomic Signal Processing applications, in particular in classification of discretized gene expression data, and in discrete gene expression prediction and the inference of boolean genomic regulatory networks. Once a discrete classifier is obtained from sample data, its performance must be evaluated through its classification error. In practice, error estimation methods must then be employed to obtain reliable estimates of the classification error based on the available data. Both classifier design and error estimation are complicated, in the case of Genomics, by the prevalence of small-sample data sets in such applications. This paper presents a broad review of the methodology of classification and error estimation for discrete data, in the context of Genomics, focusing on the study of performance in small sample scenarios, as well as asymptotic behavior.

Key Words: Genomics, classification, error estimation, discrete histogram rule, sampling distribution, resubstitution, leave-one-out, ensemble methods, coefficient of determination.
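For readers unfamiliar with the discrete setting, here is a minimal sketch of the discrete histogram rule together with its resubstitution and leave-one-out error estimates; the default label for unseen cells is an arbitrary simplification.

```python
import numpy as np
from collections import Counter

def histogram_rule(X, y):
    """Discrete histogram rule: predict the majority label in each cell
    (each distinct discrete feature vector)."""
    votes = {}
    for xi, yi in zip(map(tuple, X), y):
        votes.setdefault(xi, Counter())[yi] += 1
    return {cell: c.most_common(1)[0][0] for cell, c in votes.items()}

def resub_error(X, y):
    """Resubstitution: error of the rule on its own training data."""
    rule = histogram_rule(X, y)
    return np.mean([rule[tuple(xi)] != yi for xi, yi in zip(X, y)])

def loo_error(X, y):
    """Leave-one-out: refit the rule without each point in turn."""
    errs = []
    for i in range(len(y)):
        mask = np.arange(len(y)) != i
        rule = histogram_rule(X[mask], y[mask])
        pred = rule.get(tuple(X[i]), 0)  # unseen cell: arbitrary default
        errs.append(pred != y[i])
    return np.mean(errs)
```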

5.
Controlling the false discovery rate (FDR) has been proposed as an alternative to controlling the genome-wise error rate (GWER) for detecting quantitative trait loci (QTL) in genome scans. The objective here was to implement FDR in the context of regression interval mapping for multiple traits. Data on five traits from an F2 swine breed cross were used. FDR was implemented using tests at every 1 cM (FDR1) and using tests with the highest test statistic for each marker interval (FDRm). For the latter, a method was developed to predict comparison-wise error rates. At low error rates, FDR1 behaved erratically; FDRm was more stable but gave similar significance thresholds and number of QTL detected. At the same error rate, methods to control FDR gave less stringent significance thresholds and more QTL detected than methods to control GWER. Although testing across traits had limited impact on FDR, single-trait testing was recommended because there is no theoretical reason to pool tests across traits for FDR. FDR based on FDRm was recommended for QTL detection in interval mapping because it provides significance tests that are meaningful, yet not overly stringent, such that a more complete picture of QTL is revealed.
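The FDR control referred to here reduces, in its standard form, to a step-up rule on ordered p-values. Below is a sketch of the Benjamini-Hochberg step-up, with one p-value assumed per marker interval in the spirit of FDRm; the regression-interval-mapping specifics of the paper are not reproduced.

```python
import numpy as np

def benjamini_hochberg(pvals, q=0.05):
    """Standard BH step-up; returns a boolean mask of rejected tests.
    For FDRm, `pvals` would hold one p-value per marker interval
    (from the interval's maximum test statistic)."""
    p = np.asarray(pvals)
    m = len(p)
    order = np.argsort(p)
    thresh = q * np.arange(1, m + 1) / m          # BH step-up thresholds
    below = p[order] <= thresh
    k = np.max(np.nonzero(below)[0]) + 1 if below.any() else 0
    reject = np.zeros(m, dtype=bool)
    reject[order[:k]] = True                      # reject the k smallest
    return reject
```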

6.
Pan-cancer research at the molecular level has made great progress, but the molecular classification of cervical squamous cell carcinoma still requires further exploration. To identify potential subtypes of cervical squamous cell carcinoma, this paper proposes a cancer-subtype classification pipeline based on multi-dimensional omics data. Molecules in four data dimensions of The Cancer Genome Atlas (TCGA) cervical squamous cell carcinoma cohort (mRNA expression, microRNA (miRNA) expression, DNA methylation, and copy number variation) were screened using statistical methods; the screened classification features were then clustered integratively, key classification features able to distinguish the subtypes were further selected, and a cervical squamous cell carcinoma classification model was built from these key features. The study provides an analysis pipeline for identifying molecular-level subtypes of cervical squamous cell carcinoma: it yielded two subtypes with significantly different clinical survival and identified eight key classification features. The subtypes and key classification features identified here provide an important reference for early classification of cervical squamous cell carcinoma and for the identification of classification markers.
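A minimal sketch of the kind of pipeline described: screen each omics layer, integrate, and cluster samples into candidate subtypes. The screening statistic (median absolute deviation), z-scoring, and k-means below are stand-ins for the paper's statistical screening and integrative clustering, not its actual methods.

```python
import numpy as np
from sklearn.cluster import KMeans

def screen_features(X, top=200):
    """Keep the most variable features in one omics layer (a simple
    stand-in for statistical screening)."""
    mad = np.median(np.abs(X - np.median(X, axis=0)), axis=0)
    return X[:, np.argsort(mad)[-top:]]

def integrate_and_cluster(layers, n_subtypes=2):
    """Z-score each screened layer, concatenate across omics dimensions,
    and cluster samples into candidate subtypes."""
    Z = np.hstack([(L - L.mean(0)) / (L.std(0) + 1e-9)
                   for L in map(screen_features, layers)])
    return KMeans(n_clusters=n_subtypes, n_init=10,
                  random_state=0).fit_predict(Z)

# layers = [mrna, mirna, methylation, cnv]  # samples x features per layer
# subtype = integrate_and_cluster(layers)
```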

7.
The practical use of classification methods depends, in the majority of cases, on whether newly appearing objects of unknown class membership can be allocated on the basis of a classifier established from a training sample. In the present paper several error estimation methods are presented and compared with each other. A practical example shows how the number of characteristics used in the discrimination affects the classification error: discriminant functions calculated from many characteristics yield a higher error rate than those from fewer characteristics. From a comparison of the error rates and computing times of the individual methods, recommendations for the selection of practicable estimation methods are given.

8.
Many missing-value (MV) imputation methods have been developed for microarray data, but only a few studies have investigated the relationship between MV imputation and classification accuracy. Furthermore, these studies are problematic in fundamental steps such as MV generation and classifier error estimation. In this work, we carry out a model-based study that addresses some of the issues in previous studies. Six popular imputation algorithms, two feature selection methods, and three classification rules are considered. The results suggest that it is beneficial to apply MV imputation when the noise level is high, variance is small, or gene-cluster correlation is strong, under small to moderate MV rates. In these cases, if data quality metrics are available, then it may be helpful to consider the data point with poor quality as missing and apply one of the most robust imputation algorithms to estimate the true signal based on the available high-quality data points. However, at large MV rates, we conclude that imputation methods are not recommended. Regarding the MV rate, our results indicate the presence of a peaking phenomenon: performance of imputation methods actually improves initially as the MV rate increases, but after an optimum point, performance quickly deteriorates with increasing MV rates.
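A toy version of the impute-then-classify pipeline, using k-NN imputation under a missing-completely-at-random assumption; the data, MV mechanism, and classifier are illustrative only. Note that a rigorous study would impute inside each cross-validation fold to avoid information leakage.

```python
import numpy as np
from sklearn.impute import KNNImputer
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(1)

def inject_missing(X, rate):
    """Set a random fraction of entries to NaN (MCAR, a simplification)."""
    X = X.copy()
    X[rng.random(X.shape) < rate] = np.nan
    return X

# Illustrative pipeline: inject MVs, impute, estimate classification error.
X = rng.normal(size=(100, 20))
y = (X[:, 0] + rng.normal(0, 0.5, 100) > 0).astype(int)
X_mv = inject_missing(X, rate=0.1)
X_imp = KNNImputer(n_neighbors=5).fit_transform(X_mv)
err = 1 - cross_val_score(KNeighborsClassifier(3), X_imp, y, cv=5).mean()
print(f"estimated error after imputation: {err:.3f}")
```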

9.
Johnson PC, Haydon DT. Genetics 2007;175(2):827-842
The importance of quantifying and accounting for stochastic genotyping errors when analyzing microsatellite data is increasingly being recognized. This awareness is motivating the development of data analysis methods that not only take errors into consideration but also recognize the difference between two distinct classes of error, allelic dropout and false alleles. Current methods to estimate rates of allelic dropout and false alleles depend upon the availability of error-free reference genotypes or reliable pedigree data, which are often not available. We have developed a maximum-likelihood-based method for estimating these error rates from a single replication of a sample of genotypes. Simulations show it to be both accurate and robust to modest violations of its underlying assumptions. We have applied the method to estimating error rates in two microsatellite data sets. It is implemented in a computer program, Pedant, which estimates allelic dropout and false allele error rates with 95% confidence regions from microsatellite genotype data and performs power analysis. Pedant is freely available at http://www.stats.gla.ac.uk/~paulj/pedant.html.

10.
A random forest method has been selected to perform both gene selection and classification of the microarray data. In this embedded method, selecting the smallest possible set of genes with the lowest error rate is the key factor in achieving the highest classification accuracy. Hence, an improved gene selection method using random forest is proposed to obtain the smallest subset of genes as well as the biggest subset of genes prior to classification. The biggest-subset option is provided to assist researchers who intend to use the informative genes for further research. The enhanced random forest gene selection performs better at selecting both the smallest and the biggest subsets of informative genes with the lowest out-of-bag error rates, and classification on the selected subsets using random forest has led to lower prediction error rates compared to the existing method and other similar available methods.
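A sketch of backward gene elimination guided by random forest out-of-bag (OOB) error, in the spirit of the selection described above; the drop fraction and the tie-breaking toward smaller subsets are our choices, not necessarily the paper's.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def rf_backward_selection(X, y, drop_frac=0.2, min_genes=2):
    """Iteratively drop the least important genes, tracking OOB error;
    return the smallest subset achieving the lowest OOB error seen."""
    idx = np.arange(X.shape[1])
    best_err, best_idx = np.inf, idx
    while True:
        rf = RandomForestClassifier(n_estimators=500, oob_score=True,
                                    random_state=0).fit(X[:, idx], y)
        err = 1 - rf.oob_score_
        if err <= best_err:      # '<=' prefers the later, smaller subset
            best_err, best_idx = err, idx
        if len(idx) <= min_genes:
            break
        n_drop = max(1, int(drop_frac * len(idx)))
        keep = np.argsort(rf.feature_importances_)[n_drop:]  # drop least important
        idx = idx[keep]
    return best_idx, best_err
```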

11.
This paper presents the design of a statistical model for the generation of "artificial specimens" to be used in the development and testing of a high-resolution prescreening system for gynecologic specimen classification. The model is based on two considerations: (1) the nature of the biologic material to be examined and (2) the system to be studied, which in this case is the FAZYTAN cervical prescreening system. Since gynecologic specimens that belong to the same clinical class (Papanicolaou group) have similar compositions of the different cytologic cell types, the simulation model presented is based on the close relationship between the degree of cancer suspiciousness expressed in the clinical diagnostic group and the composition of the cellular samples on a specimen. Statistically, the model considered here is based on an analysis of the single-cell classification (SCC) output process, taking the inherent system properties into account. The statistical information obtained by evaluating large sets of labelled cells is then used to produce artificially generated point distributions in the SCC decision space ("artificial specimens"), which can be used to examine system reactions under controlled conditions. False-positive and false-negative error rates and system operating characteristics can be measured, and the effects of varying cell compositions as well as the relative performance of different specimen classifiers can be investigated. Although the "artificial specimens" thus created allow the investigation of system reactions with respect to a great variety of input processes, they cannot replace experiments on thousands of original specimens to measure system quality under realistic conditions.

12.
Controlling the proportion of false positives in multiple dependent tests
Genome scan mapping experiments involve multiple tests of significance. Thus, controlling the error rate in such experiments is important. Simple extension of classical concepts results in attempts to control the genomewise error rate (GWER), i.e., the probability of even a single false positive among all tests. This results in very stringent comparisonwise error rates (CWER) and, consequently, low experimental power. We here present an approach based on controlling the proportion of false positives (PFP) among all positive test results. The CWER needed to attain a desired PFP level does not depend on the correlation among the tests or on the number of tests as in other approaches. To estimate the PFP it is necessary to estimate the proportion of true null hypotheses. Here we show how this can be estimated directly from experimental results. The PFP approach is similar to the false discovery rate (FDR) and positive false discovery rate (pFDR) approaches. For a fixed CWER, we have estimated PFP, FDR, pFDR, and GWER through simulation under a variety of models to illustrate practical and philosophical similarities and differences among the methods.
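A sketch of the PFP estimate: the expected number of false positives at comparison-wise level alpha, divided by the number of positives observed. The proportion of true nulls is estimated here with a Storey-type tail estimator, which is one possible choice rather than the paper's exact method.

```python
import numpy as np

def estimate_pfp(pvals, alpha, lam=0.5):
    """Estimate the proportion of false positives (PFP) among tests
    rejected at comparison-wise error rate `alpha`.

    pi0 (the fraction of true nulls) is estimated from the p-value
    distribution: p-values above `lam` should be mostly true nulls."""
    p = np.asarray(pvals)
    m = len(p)
    pi0 = min(1.0, np.mean(p > lam) / (1 - lam))
    n_reject = max(1, np.sum(p <= alpha))
    expected_false = pi0 * m * alpha      # expected false positives
    return expected_false / n_reject
```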

13.
Over 4600 exfoliated squamous cervical cells taken from appropriate Papanicolaou samples were classified as normal, mildly dysplastic, moderately dysplastic or severely dysplastic by an experienced cytopathologist. The slides were de-stained and subsequently re-stained with Feulgen Thionin-SO2 stain. Images of the nuclei were then captured, recorded and processed employing an image cytometry device. Automated classification of the cells was carried out using three different methods: discriminant function analysis, a decision tree classifier and a neural network classifier. The discriminant function analysis method, which combined all dysplastic cells into an abnormal group, achieved a combined error rate of less than 0.4% for moderately and severely dysplastic cells, and less than 40% for mildly dysplastic cells. All three methods yielded comparable results, which approached those of human performance.

14.
Recent advances in training deep (multi-layer) architectures have inspired a renaissance in neural network use. For example, deep convolutional networks are becoming the default option for difficult tasks on large datasets, such as image and speech recognition. However, here we show that error rates below 1% on the MNIST handwritten digit benchmark can be replicated with shallow non-convolutional neural networks. This is achieved by training such networks using the ‘Extreme Learning Machine’ (ELM) approach, which also enables a very rapid training time (∼ 10 minutes). Adding distortions, as is common practice for MNIST, reduces error rates even further. Our methods are also shown to be capable of achieving less than 5.5% error rates on the NORB image database. To achieve these results, we introduce several enhancements to the standard ELM algorithm, which individually and in combination can significantly improve performance. The main innovation is to ensure each hidden unit operates only on a randomly sized and positioned patch of each image. This form of random ‘receptive field’ sampling of the input ensures the input weight matrix is sparse, with about 90% of weights equal to zero. Furthermore, combining our methods with a small number of iterations of a single-batch backpropagation method can significantly reduce the number of hidden units required to achieve a particular performance. Our close-to-state-of-the-art results for MNIST and NORB suggest that the ease of use and accuracy of the ELM algorithm for designing a single-hidden-layer neural network classifier should cause it to be given greater consideration, either as a standalone method for simpler problems or as the final classification stage in deep neural networks applied to more difficult problems.
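A compact sketch of an ELM with random receptive-field sampling: each hidden unit receives nonzero input weights only inside a randomly sized and positioned image patch (so the input weight matrix is sparse), and the output weights are solved by least squares. The patch-size limits and tanh activation are our assumptions, not the paper's exact configuration.

```python
import numpy as np

rng = np.random.default_rng(0)

def elm_train(X, Y, n_hidden=1000, img_shape=(28, 28)):
    """Single-hidden-layer ELM with random receptive fields.
    X is (n_samples, h*w) flattened images; Y is one-hot targets."""
    h, w = img_shape
    W = np.zeros((h * w, n_hidden))
    for j in range(n_hidden):
        # Random patch position and size for this hidden unit.
        ph, pw = rng.integers(3, h + 1), rng.integers(3, w + 1)
        r0, c0 = rng.integers(0, h - ph + 1), rng.integers(0, w - pw + 1)
        mask = np.zeros((h, w), dtype=bool)
        mask[r0:r0 + ph, c0:c0 + pw] = True
        W[mask.ravel(), j] = rng.normal(size=ph * pw)  # weights only in patch
    H = np.tanh(X @ W)                                 # random hidden layer
    beta = np.linalg.lstsq(H, Y, rcond=None)[0]        # least-squares readout
    return W, beta

def elm_predict(X, W, beta):
    return np.argmax(np.tanh(X @ W) @ beta, axis=1)
```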

15.
The Newman-Keuls (NK) procedure for testing all pairwise comparisons among a set of treatment means, introduced by Newman (1939) and in a slightly different form by Keuls (1952), was proposed as a reasonable way to alleviate the inflation of error rates when a large number of means are compared. It was proposed before the concepts of different types of multiple error rates were introduced by Tukey (1952a, b; 1953). Although it was popular in the 1950s and 1960s, once control of the familywise error rate (FWER) was accepted generally as an appropriate criterion in multiple testing, and it was realized that the NK procedure does not control the FWER at the nominal level at which it is performed, the procedure gradually fell out of favor. Recently, a more liberal criterion, control of the false discovery rate (FDR), has been proposed as more appropriate in some situations than FWER control. This paper notes that the NK procedure and a nonparametric extension control the FWER within any set of homogeneous treatments. It proves that the extended procedure controls the FDR when there are well-separated clusters of homogeneous means and between-cluster test statistics are independent, and extensive simulation provides strong evidence that the original procedure controls the FDR under the same conditions and some dependent conditions when the clusters are not well separated. Thus, the test has two desirable error-controlling properties, providing a compromise between FDR control with no subgroup FWER control and global FWER control. Yekutieli (2002) developed an FDR-controlling procedure for testing all pairwise differences among means, without any FWER-controlling criteria when there is more than one cluster. The empirical example in Yekutieli's paper was used to compare the Benjamini-Hochberg (1995) method with apparent FDR control in this context, Yekutieli's proposed method with proven FDR control, the Newman-Keuls method that controls FWER within equal clusters with apparent FDR control, and several methods that control FWER globally. The Newman-Keuls method is shown to be intermediate in number of rejections between the FWER-controlling methods and the FDR-controlling methods in this example, although it is not always more conservative than the other FDR-controlling methods.

16.
An analysis has been performed of the visual diagnostic criteria used in cervical cytology, applied to machine-selected cells, in relation to automated classification based on variables that can be recorded in an image system with automated cell search and segmentation, feature extraction and classification. An accuracy of 98% could be obtained with the most suitable statistical methods for discrimination and the most powerful variables recorded in the image system, when compared with the consensus of visual diagnoses based on established cytological criteria for the diagnosis of cancer and precancer of the cervix uteri. The most powerful discriminatory variables in the image system (of 17 recorded) for discrimination between normal and abnormal epithelial cells were, in addition to nuclear extinction, cytoplasmic extinction and cytoplasmic shape. It is concluded that the visual classification of cervical cells is highly accurate with experienced observers and that imaging microscopes can be trained to nearly equal this accuracy with appropriate statistical methods of discrimination. Creating fully automated systems, however, also requires the inclusion of even more effective discriminatory variables and solutions to problems such as automatic cell search, segmentation, artifact rejection, feature extraction, classification and electronic stability in order to become cost-effective.

17.
DNA microarray technology provides useful tools for profiling global gene expression patterns in different cell/tissue samples. One major challenge is the large number of genes relative to the number of samples. The use of all genes can suppress or reduce the performance of a classification rule due to the noise of nondiscriminatory genes. Selection of an optimal subset from the original gene set therefore becomes an important prestep in sample classification. In this study, we propose a family-wise error (FWE) rate approach to selection of discriminatory genes for two-sample or multiple-sample classification. The FWE approach controls the probability of one or more false positives at a prespecified level. A public colon cancer data set is used to evaluate the performance of the proposed approach for two classification methods: k nearest neighbors (k-NN) and support vector machine (SVM). The gene sets selected by the proposed procedure appear to perform better than or comparably to several results reported in the literature using univariate analysis without multivariate search. In addition, we apply the FWE approach to a toxicogenomic data set with nine treatments (a control and eight metals: As, Cd, Ni, Cr, Sb, Pb, Cu, and AsV) for a total of 55 samples for multisample classification. Two gene sets are considered: the gene set omegaF formed by the ANOVA F-test, and the gene set omegaT formed by the union of one-versus-all t-tests. The predicted accuracies are evaluated using internal and external cross-validation. Using SVM classification, the overall accuracies in predicting the 55 samples into one of the nine treatments are above 80% for internal cross-validation; the set omegaF has slightly higher accuracy rates than omegaT. The overall predicted accuracies are above 70% for external cross-validation, where the two gene sets omegaT and omegaF performed equally well.
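A sketch of FWE-controlled gene selection followed by SVM classification; Bonferroni correction of per-gene ANOVA F-tests is used here as one concrete FWE-controlling choice (the paper's procedure may differ), and the data are synthetic.

```python
import numpy as np
from scipy.stats import f_oneway
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score

def fwe_select(X, y, alpha=0.05):
    """Select genes whose per-gene ANOVA F-test survives a Bonferroni
    correction, controlling the FWE rate at `alpha`."""
    groups = [X[y == c] for c in np.unique(y)]
    pvals = np.array([f_oneway(*[g[:, j] for g in groups]).pvalue
                      for j in range(X.shape[1])])
    return np.where(pvals <= alpha / X.shape[1])[0]

# Synthetic usage: three treatments, a few genuinely informative genes.
rng = np.random.default_rng(2)
X = rng.normal(size=(60, 500))
y = np.repeat([0, 1, 2], 20)
X[y == 1, :5] += 2.0                 # make the first five genes informative
selected = fwe_select(X, y)
acc = cross_val_score(SVC(kernel="linear"), X[:, selected], y, cv=5).mean()
print(len(selected), "genes selected; CV accuracy:", round(acc, 3))
```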

18.
This article proposes resampling-based empirical Bayes multiple testing procedures for controlling a broad class of Type I error rates, defined as generalized tail probability (gTP) error rates, gTP(q, g) = Pr(g(V_n, S_n) > q), and generalized expected value (gEV) error rates, gEV(g) = E[g(V_n, S_n)], for arbitrary functions g(V_n, S_n) of the numbers of false positives V_n and true positives S_n. Of particular interest are error rates based on the proportion g(V_n, S_n) = V_n/(V_n + S_n) of Type I errors among the rejected hypotheses, such as the false discovery rate (FDR), FDR = E[V_n/(V_n + S_n)]. The proposed procedures offer several advantages over existing methods. They provide Type I error control for general data generating distributions, with arbitrary dependence structures among variables. Gains in power are achieved by deriving rejection regions based on guessed sets of true null hypotheses and null test statistics randomly sampled from joint distributions that account for the dependence structure of the data. The Type I error and power properties of an FDR-controlling version of the resampling-based empirical Bayes approach are investigated and compared to those of widely-used FDR-controlling linear step-up procedures in a simulation study. The Type I error and power trade-off achieved by the empirical Bayes procedures under a variety of testing scenarios allows this approach to be competitive with or outperform the Storey and Tibshirani (2003) linear step-up procedure, as an alternative to the classical Benjamini and Hochberg (1995) procedure.

19.
20.
Estimating the false discovery rate using nonparametric deconvolution
van de Wiel MA, Kim KI. Biometrics 2007;63(3):806-815
Given a set of microarray data, the problem is to detect differentially expressed genes using a false discovery rate (FDR) criterion. As opposed to common procedures in the literature, we do not base the selection criterion on statistical significance only, but also on the effect size. Therefore, we select only those genes that are significantly more differentially expressed than some f-fold (e.g., f = 2). This corresponds to the use of an interval null domain for the effect size. Based on a simple error model, we discuss a naive estimator for the FDR, interpreted as the probability that the parameter of interest lies in the null domain (e.g., mu < log_2(2) = 1) given that the test statistic exceeds a threshold. We improve the naive estimator by using deconvolution; that is, the density of the parameter of interest is recovered from the data. We study the performance of the methods using simulations and real data.
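One way to realize "significantly more differential than f-fold" is to test the interval null |mu| <= log_2(f) with two one-sided t-tests; this test choice is our illustration only, whereas the paper builds on its own error model and a deconvolution-based FDR estimate.

```python
import numpy as np
from scipy import stats

def select_beyond_fold(logfc, se, df, f=2.0, alpha=0.05):
    """Select genes significantly *more* changed than f-fold by testing
    the interval null |mu| <= log2(f) (two one-sided t-tests).

    logfc : estimated log2 fold changes per gene
    se    : standard errors of those estimates
    df    : degrees of freedom of the t statistics
    """
    delta = np.log2(f)
    t_up = (logfc - delta) / se          # evidence that mu > +delta
    t_dn = (logfc + delta) / se          # evidence that mu < -delta
    p = np.minimum(stats.t.sf(t_up, df), stats.t.cdf(t_dn, df))
    return np.where(p <= alpha)[0]
```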
