Similar articles
 20 similar articles found (search time: 15 ms)
1.

Background

The receiver operating characteristic (ROC) curve is widely used to evaluate classification performance in the biomedical field. Owing to its superiority in dealing with imbalanced and cost-sensitive data, the ROC curve has become a popular metric for evaluating and identifying disease-related genes (features). The existing ROC-based feature selection approaches are simple and effective in evaluating individual features. However, these approaches may fail to find the true target feature subset because they lack an effective means of reducing the redundancy between features, which is essential in machine learning.

Results

In this paper, we propose to assess feature complementarity by measuring the distances between misclassified instances and their nearest misses on the dimensions of pairwise features. If a misclassified instance and its nearest miss on one feature dimension are far apart on another feature dimension, the two features are regarded as complementary to each other. Subsequently, we propose a novel filter feature selection approach based on ROC analysis. The new approach employs an efficient heuristic search strategy to select the features with the highest complementarity. The experimental results on a broad range of microarray data sets show that classifiers built on the feature subset selected by our approach achieve the minimal balanced error rate with a small number of significant features.
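The single-feature ROC evaluation that this entry builds on can be illustrated in a few lines. The sketch below is not the authors' complementarity measure; it only shows, assuming binary labels coded 0/1, how one feature might be scored by its ROC AUC using the pairwise (Mann-Whitney) formulation. The function name `feature_auc`, the gene names, and the toy values are illustrative assumptions.

```python
def feature_auc(values, labels):
    """ROC AUC of one feature used directly as a classification score.

    Computed as the fraction of (positive, negative) pairs in which the
    positive sample has the larger value, counting ties as one half.
    """
    pos = [v for v, y in zip(values, labels) if y == 1]
    neg = [v for v, y in zip(values, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("both classes must be present")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    auc = wins / (len(pos) * len(neg))
    # A feature that ranks the classes in reverse is equally informative.
    return max(auc, 1.0 - auc)


# Toy expression values for two hypothetical genes over four samples.
labels = [0, 0, 1, 1]
gene_a = [1.0, 2.0, 3.0, 4.0]   # separates the classes perfectly
gene_b = [1.0, 3.0, 2.0, 4.0]   # one inverted pair
ranking = sorted({"gene_a": gene_a, "gene_b": gene_b}.items(),
                 key=lambda kv: feature_auc(kv[1], labels), reverse=True)
```

Ranking by individual AUC alone ignores redundancy between features, which is the gap the abstract says the complementarity measure is meant to close.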

Conclusions

Compared with other ROC-based feature selection approaches, our new approach can select fewer features and effectively improve the classification performance.

2.
A scene-segmentation method for two-color digitized images acquired from a Papanicolaou-stained cervical smear is proposed. The method first segments a scene into background, red cytoplasm, blue cytoplasm and nuclear regions by a pixel-wise classification and then merges the segmented regions for both types of cytoplasm into a single region. To create the minimum-distance classifier used for the pixel classification, class median vectors are selected from a two-dimensional histogram formed from the optical densities in the red and green images (scanned at 610 nm and 535 nm, respectively). Reference points defined from knowledge about the two-color images played an important role in selecting the vectors for the red and blue cytoplasm. This method was applied to 33 sets of the two-color images. The resulting segmented regions corresponded well with regions apparent to the human observer. Three different investigations related to the method were carried out; these studies confirmed the suitability of the proposed method.
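A minimum-distance pixel classifier of the kind described above can be sketched as follows. The class reference vectors are made-up numbers, not the median vectors from the paper, and the names `CLASS_VECTORS`, `classify_pixel`, and `segment` are hypothetical; the sketch assumes each pixel is a pair of optical densities (610 nm, 535 nm) and uses Euclidean distance.

```python
import math

# Hypothetical class reference vectors (optical density at 610 nm, 535 nm).
CLASS_VECTORS = {
    "background": (0.05, 0.05),
    "red_cytoplasm": (0.20, 0.45),
    "blue_cytoplasm": (0.45, 0.25),
    "nucleus": (0.90, 0.85),
}

# Both cytoplasm classes are merged into one region after classification.
MERGE = {"red_cytoplasm": "cytoplasm", "blue_cytoplasm": "cytoplasm"}


def classify_pixel(pixel, class_vectors=CLASS_VECTORS):
    """Assign a pixel to the class whose reference vector is nearest."""
    label = min(class_vectors,
                key=lambda name: math.dist(pixel, class_vectors[name]))
    return MERGE.get(label, label)


def segment(image):
    """Classify every pixel of a 2-D grid of (od_610, od_535) tuples."""
    return [[classify_pixel(px) for px in row] for row in image]
```

Keeping the red and blue cytoplasm as separate classes during classification and merging them afterwards mirrors the two-step order given in the abstract.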

3.
The agreement between humans and algorithms on whether an event-related potential (ERP) is present or not and the level of variation in the estimated values of its relevant features are largely unknown. Thus, the aim of this study was to determine the categorical and quantitative agreement between manual and automated methods for single-trial detection and estimation of ERP features. To this end, ERPs were elicited in sixteen healthy volunteers using electrical stimulation at graded intensities below and above the nociceptive withdrawal reflex threshold. Presence/absence of an ERP peak (categorical outcome) and its amplitude and latency (quantitative outcome) in each single trial were evaluated independently by two human observers and two automated algorithms taken from existing literature. Categorical agreement was assessed using percentage positive and negative agreement and Cohen’s κ, whereas quantitative agreement was evaluated using Bland-Altman analysis and the coefficient of variation. Typical values for the categorical agreement between manual and automated methods were derived, as well as reference values for the average and maximum differences that can be expected if one method is used instead of the others. Results showed that the human observers presented the highest categorical and quantitative agreement, and there were significantly large differences between detection and estimation of quantitative features among methods. In conclusion, substantial care should be taken in the selection of the detection/estimation approach, since factors like stimulation intensity and expected number of trials with/without response can play a significant role in the outcome of a study.
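The categorical agreement measures named above can be computed from two binary rating sequences. The sketch below is an illustration under assumptions: ratings are coded 1 (ERP present) and 0 (absent), positive and negative agreement are taken to be the usual proportions of specific agreement, and the function name `agreement` is hypothetical. It does not reproduce the study's analysis.

```python
def agreement(rater_a, rater_b):
    """Return positive agreement, negative agreement and Cohen's kappa
    for two binary (0/1) rating sequences of equal length."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("ratings must be non-empty and of equal length")
    n = len(rater_a)
    both_pos = sum(1 for a, b in zip(rater_a, rater_b) if a == 1 and b == 1)
    both_neg = sum(1 for a, b in zip(rater_a, rater_b) if a == 0 and b == 0)
    disagree = n - both_pos - both_neg

    # Proportions of specific agreement (undefined if a denominator is zero).
    ppa = 2 * both_pos / (2 * both_pos + disagree)
    npa = 2 * both_neg / (2 * both_neg + disagree)

    # Cohen's kappa: observed agreement corrected for chance agreement.
    p_obs = (both_pos + both_neg) / n
    pa, pb = sum(rater_a) / n, sum(rater_b) / n
    p_exp = pa * pb + (1 - pa) * (1 - pb)
    kappa = (p_obs - p_exp) / (1 - p_exp)
    return ppa, npa, kappa
```

For example, `agreement([1, 1, 0, 0], [1, 0, 0, 0])` describes two raters who disagree on one of four trials.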

4.
This paper presents preliminary results of research toward the development of a high resolution analysis stage for a dual resolution image processing-based prescreening device for cervical cytology. Experiments using both manual and automatic methods for cell segmentation are described. In both cases, 1500 cervical cells were analyzed and classified as normal or abnormal (dysplastic or malignant) using a minimum Mahalanobis distance classifier with eight subclasses of normal cells and five subclasses of abnormal cells. With manual segmentation, false positive and false negative error rates of 2.98% and 7.73%, respectively, were obtained. Similar experiments using automatic cell segmentation methods yielded false positive and false negative error rates of 3.90% and 11.56%, respectively. In both cases, independent training and testing data were used.

5.
OBJECTIVE: To segment and quantify microvessels in renal tumor angiogenesis based on a color image analysis method and to improve the accuracy and reproducibility of quantifying microvessel density. STUDY DESIGN: The segmentation task was based on a supervised learning scheme. First, 12 color features (RGB, HSI, I1I2I3 and L*a*b*) were extracted from a training set. The feature selection procedure selected I2L*S features as the best color feature vector. Then we segmented microvessels using the discriminant function made using the minimum error rate classification rule of Bayesian decision theory. In the quantification step, after applying a connected component-labeling algorithm, microvessels with discontinuities were connected and touching microvessels separated. We tested the proposed method on 23 images. RESULTS: The results were evaluated by comparing them with manual quantification of the same images. The comparison revealed that our computerized microvessel counting correlated highly with manual counting by an expert (r = 0.95754). The association between the number of microvessels after the initial segmentation and manual quantification was also assessed using Pearson's correlation coefficient (r = 0.71187). The results indicate that our method is better than conventional computerized image analysis methods. CONCLUSION: Our method correlated highly with quantification by an expert and could become a way to improve the accuracy, feasibility and reproducibility of quantifying microvessel density. We anticipate that it will become a useful diagnostic tool for angiogenesis studies.
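The minimum error rate rule mentioned in the study design assigns each pixel to the class with the largest posterior probability. The sketch below illustrates that rule under strong assumptions that are not taken from the paper: each class is modelled with independent Gaussian features, the three-component vectors stand in for (I2, L*, S), and all names and numbers are hypothetical.

```python
import math


def fit_gaussian(samples):
    """Per-feature mean and variance for a list of feature vectors."""
    n = len(samples)
    dims = range(len(samples[0]))
    mean = [sum(s[d] for s in samples) / n for d in dims]
    # A small constant keeps the variance strictly positive.
    var = [sum((s[d] - mean[d]) ** 2 for s in samples) / n + 1e-9
           for d in dims]
    return mean, var


def log_posterior(x, mean, var, prior):
    """Log prior plus log likelihood under independent Gaussian features."""
    ll = sum(-0.5 * math.log(2 * math.pi * v) - (xi - m) ** 2 / (2 * v)
             for xi, m, v in zip(x, mean, var))
    return math.log(prior) + ll


def train(training_set):
    """training_set maps a class name to its list of feature vectors."""
    total = sum(len(v) for v in training_set.values())
    return {c: (*fit_gaussian(v), len(v) / total)
            for c, v in training_set.items()}


def classify(x, model):
    """Pick the class with the largest posterior (minimum error rate)."""
    return max(model, key=lambda c: log_posterior(x, *model[c]))


# Made-up training pixels for two hypothetical classes.
model = train({
    "vessel": [(0.80, 0.30, 0.60), (0.90, 0.35, 0.70), (0.85, 0.25, 0.65)],
    "tissue": [(0.20, 0.70, 0.20), (0.30, 0.80, 0.25), (0.25, 0.75, 0.30)],
})
```

Working with log posteriors avoids numerical underflow when several feature likelihoods are multiplied.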

6.
Facial neuromuscular dysfunction severely impacts adaptive and expressive behavior and emotional health. Appropriate treatment is aided by quantitative and efficient assessment of facial motion impairment. We validated a newly developed method of quantifying facial motion, automated face analysis (AFA), by comparing it with an established manual marking method, the Maximal Static Response Assay (MSRA). In the AFA, motion of facial features is tracked automatically by computer vision without the need for placement of physical markers or restrictions of rigid head motion. Nine patients (seven women and two men) with a mean age of 39.3 years and various facial nerve disorders (five with Bell's palsy, three with trauma, and one with tumor resection) participated. The patients were videotaped while performing voluntary facial action tasks (brow raise, eye closure, and smile). For comparison with MSRA, physical markers were placed on facial landmarks. Image sequences were digitized into 640 x 480 x 24-bit pixel arrays at 30 frames per second (1 pixel ≈ 0.3 mm). As defined for the MSRA, the coordinates of the center of each marker were manually recorded in the initial and final digitized frames, which correspond to repose and maximal response. For the AFA, these points were tracked automatically in the image sequence. Pearson correlation coefficients were used to evaluate consistency of measurement between manual (the MSRA) and automated (the AFA) tracking methods, and paired t tests were used to assess the mean difference between methods for feature tracking. Feature measures were highly consistent between methods, Pearson's r = 0.96 or higher, p < 0.001 for each of the action tasks. The mean differences between the methods were small; the mean error between methods was comparable to the error within the manual method (less than 1 pixel). The AFA demonstrated strong concurrent validity with the MSRA for pixel-wise displacement. Tracking was fully automated and provided motion vectors, which may be useful in guiding surgical and rehabilitative approaches to restoring facial function in patients with facial neuromuscular disorders.
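The two statistics used to compare the manual and automated measurements, Pearson's r and the paired t statistic, can be written out directly. This is a minimal sketch with made-up displacement values in pixels; the function names are hypothetical, and the p-value lookup is omitted because it would need a t distribution table or an external library.

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def paired_t(x, y):
    """t statistic and degrees of freedom for the mean paired difference."""
    d = [a - b for a, b in zip(x, y)]
    n = len(d)
    mean_d = sum(d) / n
    sd = math.sqrt(sum((v - mean_d) ** 2 for v in d) / (n - 1))
    return mean_d / (sd / math.sqrt(n)), n - 1


# Made-up marker displacements (pixels) from the two tracking methods.
manual = [10.0, 12.0, 15.0, 20.0]
automated = [10.5, 11.5, 15.5, 19.5]
r = pearson_r(manual, automated)
t, df = paired_t(manual, automated)
```

A high r with a t statistic near zero is the pattern the abstract reports: the methods rank the measurements the same way and show no systematic offset.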

7.

Background

With an increasing number of plant genome sequences, it has become important to develop a robust computational method for detecting plant promoters. Although a wide variety of programs are currently available, prediction accuracy of these still requires further improvement. The limitations of these methods can be addressed by selecting appropriate features for distinguishing promoters and non-promoters.

Methods

In this study, we proposed two feature selection approaches based on hexamer sequences: the Frequency Distribution Analyzed Feature Selection Algorithm (FDAFSA) and the Random Triplet Pair Feature Selecting Genetic Algorithm (RTPFSGA). In FDAFSA, adjacent triplet-pairs (hexamer sequences) were selected based on the difference in the frequency of hexamers between promoters and non-promoters. In RTPFSGA, random triplet-pairs (RTPs) were selected by exploiting a genetic algorithm that distinguishes frequencies of non-adjacent triplet pairs between promoters and non-promoters. Then, a support vector machine (SVM), a nonlinear machine-learning algorithm, was used to classify promoters and non-promoters by combining these two feature selection approaches. We referred to this novel algorithm as PromoBot.
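The frequency-difference idea behind FDAFSA can be illustrated with a short sketch. It is not the published algorithm: it assumes overlapping hexamers, ranks them by the absolute difference in relative frequency between the two classes, and uses hypothetical function names and toy sequences.

```python
from collections import Counter


def hexamer_freqs(sequences, k=6):
    """Relative frequency of every overlapping k-mer across the sequences."""
    counts = Counter()
    for seq in sequences:
        seq = seq.upper()
        counts.update(seq[i:i + k] for i in range(len(seq) - k + 1))
    total = sum(counts.values())
    return {kmer: c / total for kmer, c in counts.items()} if total else {}


def top_discriminative(promoters, non_promoters, n=10, k=6):
    """The n k-mers whose frequency differs most between the two classes."""
    p = hexamer_freqs(promoters, k)
    q = hexamer_freqs(non_promoters, k)
    diff = {h: abs(p.get(h, 0.0) - q.get(h, 0.0)) for h in set(p) | set(q)}
    # Sort by decreasing difference, breaking ties alphabetically.
    return sorted(diff, key=lambda h: (-diff[h], h))[:n]
```

The selected hexamers would then serve as the feature vocabulary: each sequence is encoded by the frequencies of those hexamers before being passed to a classifier such as the SVM named in the abstract.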

Results

Promoter sequences were collected from the PlantProm database. Non-promoter sequences were collected from plant mRNA, rRNA, and tRNA of PlantGDB and plant miRNA of miRBase. Then, in order to validate the proposed algorithm, we applied a 5-fold cross validation test. Training data sets were used to select features based on FDAFSA and RTPFSGA, and these features were used to train the SVM. We achieved 89% sensitivity and 86% specificity.

Conclusions

We compared our PromoBot algorithm to five other algorithms. The sensitivity and specificity of PromoBot were comparable to, or better than, those of the algorithms tested. These results show that the two proposed feature selection methods, based on hexamer frequencies and random triplet-pairs, can be successfully incorporated into a supervised machine learning method for the promoter classification problem. As such, we expect that PromoBot can be used to help identify new plant promoters. The source code and analysis results of this work can be provided upon request.

8.
M Seo, S Oh. PLoS ONE, 2012, 7(7): e40419

Background

The goal of feature selection is to select useful features and simultaneously exclude garbage features from a given dataset for classification purposes. This is expected to reduce processing time and improve classification accuracy.

Methodology

In this study, we devised a new feature selection algorithm (CBFS) based on the clearness of features. Feature clearness expresses the separability among classes in a feature. Highly clear features contribute towards obtaining high classification accuracy. CScore is a measure that scores the clearness of each feature and is based on how samples cluster around the centroids of the classes in that feature. We also suggest combining CBFS with other algorithms to improve classification accuracy.
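One simple way to read "clearness" is as the share of samples that lie closest to the centroid of their own class when only a single feature is considered. The sketch below implements that reading as an assumption; it is not the published CScore definition, and the function name `clearness_score` is hypothetical.

```python
def clearness_score(values, labels):
    """Fraction of samples whose nearest class centroid, measured on this
    single feature, belongs to their own class."""
    classes = sorted(set(labels))
    # Centroid of each class = mean of the feature over that class.
    centroid = {}
    for c in classes:
        members = [v for v, y in zip(values, labels) if y == c]
        centroid[c] = sum(members) / len(members)
    hits = sum(1 for v, y in zip(values, labels)
               if min(classes, key=lambda c: abs(v - centroid[c])) == y)
    return hits / len(values)
```

Under this reading a feature scoring 1.0 separates the classes completely on its own, while a score near the chance level suggests the feature carries little class information.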

Conclusions/Significance

From the experiments we confirm that CBFS outperforms recent feature selection algorithms, including FeaLect. CBFS can be applied to microarray gene selection, text categorization, and image classification.

9.

Background

Lately, biomarker discovery has become one of the most significant research issues in the biomedical field. Owing to the presence of high-throughput technologies, genomic data, such as microarray data and RNA-seq, have become widely available. Many kinds of feature selection techniques have been applied to retrieve significant biomarkers from these kinds of data. However, such data tend to be noisy, with high-dimensional features and a small number of samples; thus, conventional feature selection approaches might be problematic in terms of reproducibility.

Results

In this article, we propose a stable feature selection method for high-dimensional datasets. We apply an ensemble L1-norm support vector machine to efficiently reduce irrelevant features, considering the stability of features. We define the stability score for each feature by aggregating the ensemble results, and utilize backward feature elimination on a purified feature set based on this score; therefore, it is possible to acquire an optimal set of features for performance without the need to set a specific threshold. The proposed methodology is evaluated by classifying the binary stage of renal clear cell carcinoma with RNA-seq data.
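Aggregating ensemble results into a per-feature stability score can be sketched as a selection frequency. The code below assumes that each ensemble member (for example, one sparse model fitted on one resample) reports the set of features it kept; the scoring rule, the function names, and the gene labels are illustrative assumptions, not the paper's definition.

```python
from collections import Counter


def stability_scores(selected_sets, all_features):
    """Share of ensemble runs in which each feature was selected."""
    runs = len(selected_sets)
    counts = Counter(f for s in selected_sets for f in set(s))
    return {f: counts[f] / runs for f in all_features}


def backward_elimination_order(scores):
    """Features ordered from least to most stable, i.e. the order in which
    a score-driven backward elimination would drop them."""
    return sorted(scores, key=lambda f: (scores[f], f))


# Hypothetical selections from four ensemble members.
runs = [{"g1", "g2"}, {"g1", "g3"}, {"g1", "g2"}, {"g1"}]
scores = stability_scores(runs, ["g1", "g2", "g3", "g4"])
order = backward_elimination_order(scores)
```

Dropping features in this order and tracking classifier performance after each removal gives a stopping point without fixing a score threshold in advance, which matches the motivation stated in the abstract.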

Conclusion

A comparison with established algorithms, i.e., a fast correlation-based filter, random forest, and an ensemble version of an L2-norm support vector machine-based recursive feature elimination, demonstrated the superior performance of our method in terms of classification as well as stability in general. It is also shown that the proposed approach performs moderately on high-dimensional datasets consisting of a very large number of features and a smaller number of samples. The proposed approach is expected to be applicable to many other studies aimed at biomarker discovery.

10.
Histometric features for the objective grading of prostatic adenocarcinoma in histologic specimens were analyzed in five cases each of well, moderately and poorly differentiated lesions. Tissue sections from the selected cases were stained by the Feulgen method and digitized by a video-based microphotometer. Twenty total fields were recorded for each grade: ten at high resolution (an image sampling of 0.5 micron per pixel) and ten at low resolution (0.8 micron per pixel), with two fields per case recorded at each resolution. The images were segmented by an automated expert system-guided scene segmentation procedure. The performance of that procedure was measured by comparing the automated counts of nuclei in the segmented fields to the visual counts made by a pathologist in the same fields. For well, moderately and poorly differentiated cases, respectively, the nuclear counts made by the expert system at high resolution were 2.7%, 4.2% and 4.7% higher than the visual counts (as estimated from a total of 6,628 nuclei), but 1.2%, 2.5% and 1.1% lower at low resolution (10,329 nuclei). High-resolution features and tissue textural features were computed for each case. The high-resolution features showed good separation between the three groups of cases. The tissue textural features showed consistent separation between well and moderately differentiated cases. The relaxation of the spatial resolution (to 0.8 micron/pixel spacing) did not affect the selection of features, but led to less separation between the data from different grades. In conclusion, the automated system performed satisfactorily in distinguishing sections of prostatic tumors of varying degrees of differentiation.

11.
MOTIVATION: Feature (gene) selection can dramatically improve the accuracy of gene expression profile based sample class prediction. Many statistical methods for feature (gene) selection such as stepwise optimization and Monte Carlo simulation have been developed for tissue sample classification. In contrast to class prediction, few statistical and computational methods for feature selection have been applied to clustering algorithms for pattern discovery. RESULTS: An integrated scheme and corresponding program SamCluster for automatic discovery of sample classes based on gene expression profile is presented in this report. The scheme incorporates the feature selection algorithms based on the calculation of CV (coefficient of variation) and t-test into hierarchical clustering and proceeds as follows. First, the genes with a CV greater than the pre-specified threshold are selected for cluster analysis, which results in two putative sample classes. Then, significantly differentially expressed genes in the two putative sample classes with p-values ≤ 0.01, 0.05, or 0.1 from the t-test are selected for further cluster analysis. The above processes are iterated until two stable sample classes are found. Finally, the consensus sample classes are constructed from the putative classes that are derived from the different CV thresholds, and the best putative sample classes that have the minimum distance between the consensus classes and the putative classes are identified. To evaluate the performance of the feature selection for cluster analysis, the proposed scheme was applied to four expression datasets: COLON, LEUKEMIA72, LEUKEMIA38, and OVARIAN. The results show that there are only 5, 1, 0, and 0 samples that have been misclassified, respectively. We conclude that the proposed scheme, SamCluster, is an efficient method for discovery of sample classes using gene expression profile. AVAILABILITY: The related program SamCluster is available upon request or from the web page http://www.sph.uth.tmc.edu:8052/hgc/Downloads.asp.
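The first filtering step of the scheme, keeping genes whose coefficient of variation exceeds a threshold, can be sketched with the standard library. This is an illustration only, not code from SamCluster: the sample standard deviation is an assumption about how CV is computed, and the gene names, values, and threshold are made up.

```python
import statistics


def coefficient_of_variation(values):
    """Sample standard deviation divided by the absolute mean."""
    mean = statistics.mean(values)
    if mean == 0:
        raise ValueError("CV is undefined for a zero mean")
    return statistics.stdev(values) / abs(mean)


def select_by_cv(expression, threshold):
    """Names of genes whose CV across samples exceeds the threshold."""
    return [gene for gene, values in expression.items()
            if coefficient_of_variation(values) > threshold]


# Made-up expression levels for three hypothetical genes over four samples.
expression = {
    "geneA": [10, 10, 10, 10],   # constant, CV = 0
    "geneB": [2, 10, 2, 10],     # strongly variable
    "geneC": [9, 10, 11, 10],    # mildly variable
}
selected = select_by_cv(expression, threshold=0.5)
```

The genes that pass this filter would then be clustered into two putative classes, after which the t-test step described in the abstract refines the gene set.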

12.
What should be expected from feature selection in small-sample settings
MOTIVATION: High-throughput technologies for rapid measurement of vast numbers of biological variables offer the potential for highly discriminatory diagnosis and prognosis; however, high dimensionality together with small samples creates the need for feature selection, while at the same time making feature-selection algorithms less reliable. Feature selection must typically be carried out from among thousands of gene-expression features and in the context of a small sample (small number of microarrays). Two basic questions arise: (1) Can one expect feature selection to yield a feature set whose error is close to that of an optimal feature set? (2) If a good feature set is not found, should it be expected that good feature sets do not exist? RESULTS: The two questions translate quantitatively into questions concerning conditional expectation. (1) Given the error of an optimal feature set, what is the conditionally expected error of the selected feature set? (2) Given the error of the selected feature set, what is the conditionally expected error of the optimal feature set? We address these questions using three classification rules (linear discriminant analysis, linear support vector machine and k-nearest-neighbor classification) and feature selection via sequential floating forward search and the t-test. We consider three feature-label models and patient data from a study concerning survival prognosis for breast cancer. With regard to the two focus questions, there is similarity across all experiments: (1) One cannot expect to find a feature set whose error is close to optimal, and (2) the inability to find a good feature set should not lead to the conclusion that good feature sets do not exist. 
In practice, the latter conclusion may be more immediately relevant, since when faced with the common occurrence that a feature set discovered from the data does not give satisfactory results, the experimenter can draw no conclusions regarding the existence or nonexistence of suitable feature sets. AVAILABILITY: http://ee.tamu.edu/~edward/feature_regression/

13.

Background

Supervised machine learning methods, when applied to the problem of automated protein-function prediction (AFP), require the availability of both positive examples (i.e., proteins which are known to possess a given protein function) and negative examples (corresponding to proteins not associated with that function). Unfortunately, publicly available proteome and genome data sources such as the Gene Ontology rarely store the functions not possessed by a protein. Thus negative selection, which consists of identifying informative negative examples, is currently a central and challenging problem in AFP. Several heuristics have been proposed through the years to solve this problem; nevertheless, despite their effectiveness, to the best of our knowledge no previous work has studied which protein features are more relevant to this task, that is, which protein features help more in discriminating between reliable and unreliable negatives.

Results

The present work analyses the impact of several features on the selection of negative proteins for the Gene Ontology (GO) terms. The analysis is network-based: it exploits the fact that proteins can be naturally structured in a network, considering the pairwise relationships coming from several sources of data, such as protein-protein and genetic interactions. Overall, the proposed protein features, including local and global graph centrality measures and protein multifunctionality, can be term-aware (i.e., depending on the GO term) and term-unaware (i.e., invariant across the GO terms). We validated the informativeness of each feature utilizing a temporal holdout in three different experiments on yeast, mouse and human proteomes: (i) feature selection to detect which protein features are more helpful for the negative selection; (ii) protein function prediction to verify whether the features considered are also useful to predict GO terms; (iii) negative selection by applying two different negative selection algorithms on proteins represented through the proposed features.

Conclusions

Term-aware features (with some exceptions) proved more informative for problem (i), together with node betweenness, which is the most relevant among term-unaware features. The node positive neighborhood instead is the most predictive feature for the AFP problem, while experiment (iii) showed that the proposed features allow negative selection algorithms to effectively select negative instances in the temporal holdout setting, with better results when nonlinear combinations of features are also exploited.

14.
The automation of single particle selection and tomographic segmentation of asymmetric particles and objects is facilitated by continuing improvement of methods based on the detection of pixel discontinuity. Here, we present the new arbitrary z-crossings approach which can be employed to enhance the accuracy of edge detection algorithms that are based on the second derivative. This is demonstrated using the Laplacian of Gaussian (LoG) filter. In its normal implementation the LoG filter uses a z value of zero to define edge contours. In contrast, the arbitrary z-crossings approach allows the user to adjust z, which causes the subsequently generated contours to tend towards lighter or darker image objects, depending on the sign of z. This functionality has been coupled with an additional feature: the ability to use the major and minor axes of bounding contours to hone automated object selection. In combination, these features significantly enhance the accuracy of particle selection and the speed of tomographic segmentation. Both features have been incorporated into the software package Swarm(PS) in which parameters are automatically adjusted based on user defined target selection.
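The difference between zero-crossings and arbitrary z-crossings is easiest to see in one dimension. The sketch below takes a precomputed filter response (a made-up stand-in for one row of a LoG-filtered image) and reports where it crosses a chosen level z; it is an illustration of the idea only and makes no claim about how Swarm(PS) implements it.

```python
def z_crossings(response, z=0.0):
    """Indices i where the response crosses the level z between
    sample i and sample i + 1."""
    shifted = [r - z for r in response]
    return [i for i in range(len(shifted) - 1)
            if shifted[i] * shifted[i + 1] < 0]


# Made-up second-derivative response across one edge.
response = [0.0, 0.2, 1.0, 0.4, -0.4, -1.0, -0.2, 0.0]

edges_zero = z_crossings(response)            # conventional zero-crossing
edges_pos = z_crossings(response, z=0.5)      # contour shifted one way
edges_neg = z_crossings(response, z=-0.5)     # contour shifted the other way
```

With z = 0 a single crossing marks the edge; a non-zero z moves the detected positions to one side of it, which is the one-dimensional analogue of contours tending towards lighter or darker objects.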

15.
Species misclassification (misidentification) and handling errors have been frequently reported in various plant species conserved at diverse gene banks, which could restrict the use of germplasm for the correct purpose. The objectives of the present study were to (i) determine the extent of genotyping error (reproducibility) on DArTseq-based single-nucleotide polymorphisms (SNPs); (ii) determine the proportion of misclassified accessions across 3134 samples representing three African rice species complex (Oryza glaberrima, O. barthii, and O. longistaminata) and an Asian rice (O. sativa), which are conserved at the AfricaRice gene bank; and (iii) develop species- and sub-species (ecotype)-specific diagnostic SNP markers for rapid and low-cost quality control (QC) analysis. Genotyping error estimated from 15 accessions, each replicated from 2 to 16 times, varied from 0.2 to 3.1%, with an overall average of 0.8%. Using a total of 3134 accessions genotyped with 31,739 SNPs, the proportion of misclassified samples was 3.1% (97 of the 3134 accessions). Excluding the 97 misclassified accessions, we identified a total of 332 diagnostic SNPs that clearly discriminated the three indigenous African species complex from Asian rice (156 SNPs), O. longistaminata accessions from both O. barthii and O. glaberrima (131 SNPs), and O. sativa spp. indica from O. sativa spp. japonica (45 SNPs). Using chromosomal position, minor allele frequency, and polymorphic information content as selection criteria, we recommended a subset of 24 to 36 of the 332 diagnostic SNPs for routine QC genotyping, which would be highly useful in determining the genetic identity of each species and correcting human errors during routine gene bank operations.
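Two of the quantities in this entry lend themselves to a small sketch: the genotyping error between replicates and the notion of a diagnostic SNP. The code below assumes a diagnostic SNP is one whose observed alleles never overlap between two groups and that missing calls are stored as `None`; both the criterion and all names and data are illustrative assumptions rather than the study's pipeline.

```python
def replicate_error_rate(rep1, rep2):
    """Share of SNP calls that differ between two replicates of one
    accession, ignoring positions where either call is missing."""
    pairs = [(x, y) for x, y in zip(rep1, rep2)
             if x is not None and y is not None]
    return sum(x != y for x, y in pairs) / len(pairs)


def diagnostic_snps(genotypes, groups, group_a, group_b):
    """Indices of SNPs whose observed alleles are disjoint between groups.

    genotypes: accession -> list of allele calls (None = missing)
    groups:    accession -> group label
    """
    n_snps = len(next(iter(genotypes.values())))
    result = []
    for i in range(n_snps):
        a = {g[i] for acc, g in genotypes.items()
             if groups[acc] == group_a and g[i] is not None}
        b = {g[i] for acc, g in genotypes.items()
             if groups[acc] == group_b and g[i] is not None}
        if a and b and a.isdisjoint(b):
            result.append(i)
    return result


# Made-up calls for four hypothetical accessions at three SNPs.
genotypes = {
    "acc1": ["A", "C", "G"],
    "acc2": ["A", "T", "G"],
    "acc3": ["G", "C", "G"],
    "acc4": ["G", "T", None],
}
groups = {"acc1": "glaberrima", "acc2": "glaberrima",
          "acc3": "sativa", "acc4": "sativa"}
```

In a QC setting, an accession whose calls at the diagnostic SNPs match the other group would be flagged as a candidate misclassification.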

16.
In order to improve the separation between abnormal cells and noncellular artifacts in the CERVIFIP automated cervical cytology prescreening system, 22 different object texture features were investigated. The features were all statistical parameters of the pixel density histograms or one-dimensional filtered values of central and border regions of the object images. The features were calculated for 231 images (100 cells and 131 artifacts) detected as Suspect Cells by the current CERVIFIP and were then tested in hierarchical and linear discriminant classifiers. After selecting the two best features for use in a hierarchical classifier, 83% correct classification was achieved. One of these features was specifically designed to remove poorly focused objects. With maximum likelihood discrimination using all 22 features, an overall correct classification rate of 90% was obtained.

17.
18.
Driven by genomic somatic variation, tumour tissues are typically heterogeneous, yet unbiased quantitative methods are rarely used to analyse heterogeneity at the protein level. Motivated by this problem, we developed automated image segmentation of images of multiple biomarkers in Ewing sarcoma to generate distributions of biomarkers between and within tumour cells. We further integrate high dimensional data with patient clinical outcomes utilising random survival forest (RSF) machine learning. Using material from cohorts of genetically diagnosed Ewing sarcoma with EWSR1 chromosomal translocations, confocal images of tissue microarrays were segmented with level sets and watershed algorithms. Each cell nucleus and cytoplasm were identified in relation to DAPI and CD99, respectively, and protein biomarkers (e.g. Ki67, pS6, Foxo3a, EGR1, MAPK) localised relative to nuclear and cytoplasmic regions of each cell in order to generate image feature distributions. The image distribution features were analysed with RSF in relation to known overall patient survival from three separate cohorts (185 informative cases). Variation in pre-analytical processing resulted in elimination of a high number of non-informative images that had poor DAPI localisation or biomarker preservation (67 cases, 36%). The distribution of image features for biomarkers in the remaining high quality material (118 cases, 104 features per case) were analysed by RSF with feature selection, and performance assessed using internal cross-validation, rather than a separate validation cohort. A prognostic classifier for Ewing sarcoma with low cross-validation error rates (0.36) was comprised of multiple features, including the Ki67 proliferative marker and a sub-population of cells with low cytoplasmic/nuclear ratio of CD99. 
Through elimination of bias, the evaluation of high-dimensionality biomarker distribution within cell populations of a tumour using random forest analysis in quality controlled tumour material could be achieved. Such an automated and integrated methodology has potential application in the identification of prognostic classifiers based on tumour cell heterogeneity.

19.
IRBM, 2020, 41(4): 229-239
Feature selection algorithms are a cornerstone of machine learning. As the number of samples and of sample properties grows, feature selection algorithms are used to select the significant features. Their general purpose is to select the properties most relevant to the data classes and thereby increase classification performance. Features can therefore be selected on the basis of their classification performance. In this study, we developed a feature selection algorithm, P-Score, based on the classification performance of decision support vectors. The method can work according to two different selection criteria. We tested the classification performance of the features selected with P-Score using three different classifiers. In addition, we compared the performance of P-Score with that of 13 feature selection algorithms from the literature. According to the results of the study, the P-Score feature selection algorithm is a method that can be used in the field of machine learning.

20.
Single particle analysis (SPA) coupled with high-resolution electron cryo-microscopy is emerging as a powerful technique for the structure determination of membrane protein complexes and soluble macromolecular assemblies. Current estimates suggest that approximately 10^4-10^5 particle projections are required to attain a 3 Å resolution 3D reconstruction (symmetry dependent). Selecting this number of molecular projections differing in size, shape and symmetry is a rate-limiting step for the automation of 3D image reconstruction. Here, we present Swarm(PS), a feature rich GUI based software package to manage large scale, semi-automated particle picking projects. The software provides cross-correlation and edge-detection algorithms. Algorithm-specific parameters are transparently and automatically determined through user interaction with the image, rather than by trial and error. Other features include multiple image handling (approximately 10^2), local and global particle selection options, interactive image freezing, automatic particle centering, and full manual override to correct false positives and negatives. Swarm(PS) is user friendly, flexible, extensible, fast, and capable of exporting boxed out projection images, or particle coordinates, compatible with downstream image processing suites.
