Similar literature
 20 similar documents found
1.
SUMMARY: Several papers have been published in which nonlinear machine learning algorithms, e.g. artificial neural networks, support vector machines and decision trees, have been used to model the specificity of the HIV-1 protease and extract specificity rules. We show that the dataset used in these studies is linearly separable and that it is a misuse of nonlinear classifiers to apply them to this problem. The best solution on this dataset is achieved using a linear classifier like the simple perceptron or the linear support vector machine, and it is straightforward to extract rules from these linear models. We identify key residues in peptides that are efficiently cleaved by the HIV-1 protease and list the most prominent rules, relating them to experimental results for the HIV-1 protease. MOTIVATION: Understanding HIV-1 protease specificity is important when designing HIV inhibitors, and several different machine learning algorithms have been applied to the problem. However, little progress has been made in understanding the specificity because nonlinear and overly complex models have been used. RESULTS: We show that the problem is much easier than previously reported and that linear classifiers like the simple perceptron or linear support vector machines are at least as good predictors as nonlinear algorithms. We also show how sets of specificity rules can be generated from the resulting linear classifiers. AVAILABILITY: The datasets used are available at http://www.hh.se/staff/bioinf/
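
The linear-separability claim in this abstract can be illustrated with a small sketch. It assumes an orthogonal (one-hot) encoding of octamer peptides and uses a plain perceptron: if a full pass over the training set produces no mistakes, the data are linearly separable. The peptides, labels and function names below are invented for illustration and are not taken from the paper's dataset.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def encode(peptide):
    """Orthogonal (one-hot) encoding: 20 indicator features per position."""
    vec = []
    for residue in peptide:
        vec.extend(1.0 if residue == aa else 0.0 for aa in AMINO_ACIDS)
    return vec


def train_perceptron(samples, labels, max_epochs=100):
    """Train a simple perceptron; labels must be +1 or -1.

    Returns (weights, bias, converged). converged is True only if one
    full epoch finished without a single misclassification.
    """
    weights = [0.0] * len(samples[0])
    bias = 0.0
    for _ in range(max_epochs):
        mistakes = 0
        for x, y in zip(samples, labels):
            activation = sum(w * xi for w, xi in zip(weights, x)) + bias
            if y * activation <= 0:  # wrong side of (or on) the hyperplane
                weights = [w + y * xi for w, xi in zip(weights, x)]
                bias += y
                mistakes += 1
        if mistakes == 0:
            return weights, bias, True
    return weights, bias, False


# Hypothetical toy peptides (NOT real cleavage data): label +1 = "cleaved".
toy_peptides = ["AAAFAAAA", "GGGLGGGG", "AAAKAAAA", "GGGDGGGG"]
toy_labels = [1, 1, -1, -1]
_, _, toy_converged = train_perceptron([encode(p) for p in toy_peptides], toy_labels)

# XOR is the classic non-separable case, so the perceptron cannot converge.
xor_samples = [[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 1.0]]
xor_labels = [-1, 1, 1, -1]
_, _, xor_converged = train_perceptron(xor_samples, xor_labels)
```

Convergence proves separability, but failure to converge within `max_epochs` is only suggestive, since a separable dataset with a very small margin may need more epochs.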

2.
ABSTRACT: BACKGROUND: Many problems in bioinformatics involve classification based on features such as sequence, structure or morphology. Given multiple classifiers, two crucial questions arise: how does their performance compare, and how can they best be combined to produce a better classifier? A classifier can be evaluated in terms of sensitivity and specificity using benchmark, or gold standard, data, that is, data for which the true classification is known. However, a gold standard is not always available. Here we demonstrate that a Bayesian model for comparing medical diagnostics without a gold standard can be successfully applied in the bioinformatics domain, to genomic scale data sets. We present a new implementation, which unlike previous implementations is applicable to any number of classifiers. We apply this model, for the first time, to the problem of finding the globally optimal logical combination of classifiers. RESULTS: We compared three classifiers of protein subcellular localisation, and evaluated our estimates of sensitivity and specificity against estimates obtained using a gold standard. The method overestimated sensitivity and specificity with only a small discrepancy, and correctly ranked the classifiers. Diagnostic tests for swine flu were then compared on a small data set. Lastly, classifiers for a genome-wide association study of macular degeneration with 541094 SNPs were analysed. In all cases, run times were feasible, and results precise. The optimal logical combination of classifiers was also determined for all three data sets. Code and data are available from http://bioinformatics.monash.edu.au/downloads/. CONCLUSIONS: The examples demonstrate the methods are suitable for both small and large data sets, applicable to the wide range of bioinformatics classification problems, and robust to dependence between classifiers. 
In all three test cases, the globally optimal logical combination of the classifiers was found to be their union, according to three out of four ranking criteria. We propose as a general rule of thumb that the union of classifiers will be close to optimal.
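
As a rough illustration of what a "logical combination" of classifiers means here, the sketch below combines binary predictions by union (OR), intersection (AND) and majority vote. The prediction vectors and the function name `combine` are hypothetical; the Bayesian estimation of sensitivity and specificity without a gold standard described in the abstract is not reproduced.

```python
def combine(predictions, rule):
    """Combine per-classifier 0/1 predictions item by item.

    predictions: list of equal-length lists, one list per classifier.
    rule: "union" (any positive), "intersection" (all positive),
          or "majority" (more than half positive).
    """
    combined = []
    for votes in zip(*predictions):
        if rule == "union":
            combined.append(int(any(votes)))
        elif rule == "intersection":
            combined.append(int(all(votes)))
        elif rule == "majority":
            combined.append(int(sum(votes) * 2 > len(votes)))
        else:
            raise ValueError(f"unknown rule: {rule}")
    return combined


# Hypothetical predictions from three classifiers on six items.
classifier_a = [1, 0, 0, 1, 0, 0]
classifier_b = [0, 1, 0, 0, 0, 1]
classifier_c = [1, 1, 0, 0, 0, 0]
all_predictions = [classifier_a, classifier_b, classifier_c]

union = combine(all_predictions, "union")
intersection = combine(all_predictions, "intersection")
majority = combine(all_predictions, "majority")
```

The union flags an item as positive whenever any classifier does, so it can only raise sensitivity relative to each individual classifier, at the possible cost of specificity.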

3.
A boosting approach for motif modeling using ChIP-chip data   Cited by: 1 (self-citations: 0; cited by others: 1)

4.
OBJECTIVE: To design and analyze an automated diagnostic system for breast carcinoma based on fine needle aspiration (FNA). STUDY DESIGN: FNA is a noninvasive alternative to surgical biopsy for the diagnosis of breast carcinoma. Widespread clinical use of FNA is limited by the relatively poor interobserver reproducibility of the visual interpretation of FNA images. To overcome the reproducibility problem, past research has focused on the development of automated diagnosis systems that yield accurate, reproducible results. While automated diagnosis is, by definition, reproducible, it has yet to achieve diagnostic accuracy comparable to that of surgical biopsy. In this article we describe a sophisticated new diagnostic system in which the mean sensitivity (of FNA diagnosis) approaches that of surgical biopsy. The diagnostic system that we devised analyzes the digital FNA data extracted from FNA images. To achieve high sensitivity, the system needs to solve large, equality-constrained, integer nonlinear optimization problems repeatedly. Powerful techniques from the theory of Lie groups and a novel optimization technique are built into the system to solve the underlying optimization problems effectively. The system is trained using digital data from FNA samples with confirmed diagnosis. To analyze the diagnostic accuracy of the system, more than 8,000 computational experiments were performed using digital FNA data from the Wisconsin Breast Cancer Database. RESULTS: The system has a mean sensitivity of 99.62% and a mean specificity of 93.31%. Statistical analysis shows that at the 95% confidence level, the system can be trusted to correctly diagnose new malignant FNA samples with an accuracy of 99.44-99.8% and new benign FNA samples with an accuracy of 92.43-93.93%. CONCLUSION: The diagnostic system is robust and has higher sensitivity than all other systems reported in the literature. The specificity of the system needs to be improved.

5.
6.
The detection of lung cancer has a special value in the diagnosis of cancer diseases. Based on nine elemental concentrations (i.e., chromium, iron, manganese, aluminum, cadmium, copper, zinc, nickel, and selenium) in urine samples and an ensemble linear discriminant analysis (ELDA), a detection method for lung cancer has been developed. A dataset containing 30 healthy samples and 27 lung cancer samples was used for the experiments. The whole dataset was first split into a training set with 29 samples and a test set with 28 samples. The prediction results from the ELDA classifier were compared with those from a single Fisher's discriminant analysis (FDA). On the test set, the ELDA classifier achieved a sensitivity of 100%, a specificity of 86.7%, and an overall accuracy of 92.9%, while the FDA classifier had a sensitivity of 92.3%, a specificity of 93.3%, and an overall accuracy of 92.9%. The superiority of ELDA to FDA is ascribed to the fact that ELDA can model more nonlinear relationships through the cooperation of several single models, suggesting that ensemble modeling is more advisable in such a task.
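
The reported test-set figures can be checked with a short worked example. The confusion counts below are not stated in the abstract; they are inferred as the only small-integer counts consistent with a 28-sample test set and the reported percentages (13 cancer and 15 healthy samples), and should be read as an assumption.

```python
def diagnostic_metrics(tp, fn, tn, fp):
    """Return sensitivity, specificity and accuracy as percentages (1 decimal)."""
    sensitivity = 100.0 * tp / (tp + fn)
    specificity = 100.0 * tn / (tn + fp)
    accuracy = 100.0 * (tp + tn) / (tp + fn + tn + fp)
    return {
        "sensitivity": round(sensitivity, 1),
        "specificity": round(specificity, 1),
        "accuracy": round(accuracy, 1),
    }


# Inferred (assumed) counts for a test set of 13 cancer + 15 healthy samples.
elda = diagnostic_metrics(tp=13, fn=0, tn=13, fp=2)
fda = diagnostic_metrics(tp=12, fn=1, tn=14, fp=1)
```

Under these assumed counts both classifiers misclassify two of the 28 test samples, which is why the overall accuracy is identical; they differ only in which class the errors fall in.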

7.
《IRBM》2020,41(4):195-204
Objectives: Mammography mass recognition is considered a very challenging pattern recognition problem due to the high similarity between normal and abnormal masses. Therefore, the main objective of this study is to develop an efficient and optimized two-stage recognition model to tackle this recognition task. Material and methods: The developed recognition model combines an ensemble of linear Support Vector Machine (SVM) classifiers with a Reinforcement Learning-based Memetic Particle Swarm Optimizer (RLMPSO) to form the RLMPSO-SVM recognition model. RLMPSO is used to construct a two-stage ensemble of linear SVM classifiers by simultaneously performing SVM parameter tuning, feature selection, and training instance selection. The first stage of the RLMPSO-SVM model is responsible for recognizing input ROI mammography masses as normal or abnormal. The second stage is used to further classify abnormal ROIs as malignant or benign masses. To evaluate the effectiveness of RLMPSO-SVM, a total of 1187 normal ROIs, 111 malignant ROIs, and 135 benign ROIs were randomly selected from DDSM database images. Results: The RLMPSO-SVM model achieved a sensitivity of 97.57% and a specificity of 97.86% for normal vs. abnormal recognition. For malignant vs. benign recognition, it achieved a sensitivity of 97.81% and a specificity of 96.92%. Conclusion: The results indicate that the RLMPSO-SVM recognition model is an effective tool that could assist radiologists in diagnosing abnormalities in mammography images. RLMPSO-SVM significantly outperformed various SVM-based models as well as other computational intelligence models, including multi-layer perceptron, naive Bayes classifier, and k-nearest neighbor.

8.
Proposed molecular classifiers may be overfit to idiosyncrasies of noisy genomic and proteomic data. Cross-validation methods are often used to obtain estimates of classification accuracy, but both simulations and case studies suggest that, when inappropriate methods are used, bias may ensue. Bias can be bypassed and generalizability can be tested by external (independent) validation. We evaluated 35 studies that have reported on external validation of a molecular classifier. We extracted information on study design and methodological features, and compared the performance of molecular classifiers in internal cross-validation versus external validation for 28 studies where both had been performed. We demonstrate that the majority of studies pursued cross-validation practices that are likely to overestimate classifier performance. Most studies were markedly underpowered to detect a 20% decrease in sensitivity or specificity between internal cross-validation and external validation [median power was 36% (IQR, 21-61%) and 29% (IQR, 15-65%), respectively]. The median reported classification performance for sensitivity and specificity was 94% and 98%, respectively, in cross-validation and 88% and 81% for independent validation. The relative diagnostic odds ratio was 3.26 (95% CI 2.04-5.21) for cross-validation versus independent validation. Finally, we reviewed all studies (n = 758) which cited those in our study sample, and identified only one instance of additional subsequent independent validation of these classifiers. In conclusion, these results document that many cross-validation practices employed in the literature are potentially biased and genuine progress in this field will require adoption of routine external validation of molecular classifiers, preferably in much larger studies than in current practice.

9.
Cascaded multiple classifiers for secondary structure prediction   Cited by: 11 (self-citations: 0; cited by others: 11)
We describe a new classifier for protein secondary structure prediction that is formed by cascading together different types of classifiers using neural networks and linear discrimination. The new classifier achieves an accuracy of 76.7% (assessed by a rigorous full Jack-knife procedure) on a new nonredundant dataset of 496 nonhomologous sequences (obtained from G.J. Barton and J.A. Cuff). This database was especially designed to train and test protein secondary structure prediction methods, and it uses a more stringent definition of homologous sequence than in previous studies. We show that it is possible to design classifiers that discriminate well among the three classes (H, E, C), with an accuracy of up to 78% for beta-strands, using only a local window and resampling techniques. This indicates that the importance of long-range interactions for the prediction of beta-strands has probably been overestimated previously.

10.
It is a challenging task to predict with high reliability whether plant genomic sequences contain a polyadenylation (polyA) site or not. In this paper, we solve the task by means of a systematic machine-learning procedure applied to a dataset of 1000 Arabidopsis thaliana sequences flanking polyA sites. Our procedure consists of three steps. In the first step, we extract informative features from the sequences using the highly informative k-mer windows approach. Experiments with five classifiers show that the best performance is approximately 83%. In the second step, we improve performance to 95% by reducing the number of features using linear discriminant analysis, followed by applying the linear discriminant classifier. In the third step, we apply the transductive confidence machines approach and the receiver operating characteristic isometrics approach. The resulting two classifiers enable presetting any desired performance by dealing carefully with sequences for which it is unclear whether they contain polyA sites or not. For example, in our case study, we obtain 99% performance by leaving 26% of the sequences unclassified, and 100% performance by leaving 40% of the sequences unclassified. This is clearly useful for experimental verification of putative polyA sites in the laboratory. The novel methods in our machine-learning procedure should find applications in several areas of bioinformatics.
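
The trade-off between performance and the fraction of unclassified sequences can be sketched with a plain reject option. This is a simplified stand-in based on two score thresholds, not the transductive confidence machines or ROC isometrics methods named in the abstract; the scores, labels and function names are hypothetical.

```python
def classify_with_reject(scores, lower, upper):
    """Return 1, 0, or None (left unclassified) for each score in [0, 1]."""
    decisions = []
    for score in scores:
        if score >= upper:
            decisions.append(1)
        elif score <= lower:
            decisions.append(0)
        else:
            decisions.append(None)  # too uncertain: abstain
    return decisions


def summarize(decisions, labels):
    """Return (accuracy on classified items, fraction left unclassified)."""
    classified = [(d, y) for d, y in zip(decisions, labels) if d is not None]
    rejected_fraction = 1 - len(classified) / len(labels)
    if not classified:
        return None, rejected_fraction
    accuracy = sum(d == y for d, y in classified) / len(classified)
    return accuracy, rejected_fraction


# Hypothetical classifier scores and true labels (1 = contains a polyA site).
toy_scores = [0.95, 0.90, 0.60, 0.55, 0.45, 0.40, 0.10, 0.05]
toy_truth = [1, 1, 0, 1, 1, 0, 0, 0]

no_reject = summarize(classify_with_reject(toy_scores, 0.5, 0.5), toy_truth)
with_reject = summarize(classify_with_reject(toy_scores, 0.3, 0.7), toy_truth)
```

Widening the gap between the two thresholds leaves more items unclassified but raises accuracy on the items that are classified, which mirrors the 99%/26% and 100%/40% figures in spirit only.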

11.
Atrial fibrillation (AF) and atrial flutter (AFL) are the two common atrial arrhythmias encountered in clinical practice. The electrocardiogram (ECG) is widely used to diagnose these abnormalities. Conventional linear time and frequency domain methods cannot decipher the hidden complexity present in these signals. The ECG is inherently a non-linear, non-stationary and non-Gaussian signal. Non-linear models can provide improved results and capture minute variations present in the time series. Higher order spectra (HOS) is a non-linear dynamical method that is highly robust to noise. In the present study, the performances of two methods are compared: (i) 3rd order HOS cumulants and (ii) HOS bispectrum. The 3rd order cumulant and bispectrum coefficients are subjected to dimensionality reduction using independent component analysis (ICA) and classified using classification and regression tree (CART), random forest (RF), artificial neural network (ANN) and k-nearest neighbor (KNN) classifiers to select the best classifier. The ICA components of cumulant coefficients provided an average accuracy, sensitivity, specificity and positive predictive value (PPV) of 99.50%, 100%, 99.22% and 99.72%, respectively, using the KNN classifier. Similarly, the ICA components of HOS bispectrum coefficients yielded an average accuracy, sensitivity, specificity and PPV of 97.65%, 98.16%, 98.75% and 99.53%, respectively, using KNN. Thus, ICA performed on the 3rd order HOS cumulants coupled with the KNN classifier performed better than the HOS bispectrum method. The proposed methodology is robust and can be used in mass screening of cardiac patients.

12.
MOTIVATION: A major problem of pattern classification is estimation of the Bayes error when only small samples are available. One way to estimate the Bayes error is to design a classifier based on some classification rule applied to sample data, estimate the error of the designed classifier, and then use this estimate as an estimate of the Bayes error. Relative to the Bayes error, the expected error of the designed classifier is biased high, and this bias can be severe with small samples. RESULTS: This paper provides a correction for the bias by subtracting a term derived from the representation of the estimation error. It does so for Boolean classifiers, these being defined on binary features. Although the general theory applies to any Boolean classifier, a model is introduced to reduce the number of parameters. A key point is that the expected correction is conservative. Properties of the corrected estimate are studied via simulation. The correction applies to binary predictors because they are mathematically identical to Boolean classifiers. In this context the correction is adapted to the coefficient of determination, which has been used to measure nonlinear multivariate relations between genes and design genetic regulatory networks. An application using gene-expression data from a microarray experiment is provided on the website http://gspsnap.tamu.edu/smallsample/ (user: 'smallsample', password: 'smallsample').

13.
Clinical trials increasingly employ medical imaging data in conjunction with supervised classifiers, where the latter require large amounts of training data to accurately model the system. Yet, a classifier selected at the start of the trial based on smaller and more accessible datasets may yield inaccurate and unstable classification performance. In this paper, we aim to address two common concerns in classifier selection for clinical trials: (1) predicting expected classifier performance for large datasets based on error rates calculated from smaller datasets and (2) the selection of appropriate classifiers based on expected performance for larger datasets. We present a framework for comparative evaluation of classifiers using only limited amounts of training data by using random repeated sampling (RRS) in conjunction with a cross-validation sampling strategy. Extrapolated error rates are subsequently validated via comparison with leave-one-out cross-validation performed on a larger dataset. The ability to predict error rates as dataset size increases is demonstrated on synthetic data as well as on three different computational imaging tasks: detecting cancerous image regions in prostate histopathology, differentiating high and low grade cancer in breast histopathology, and detecting cancerous metavoxels in prostate magnetic resonance spectroscopy. For each task, the relationships between 3 distinct classifiers (k-nearest neighbor, naive Bayes, Support Vector Machine) are explored. Further quantitative evaluation in terms of interquartile range (IQR) suggests that, for all three datasets, our approach consistently yields error rates with lower variability (mean IQRs of 0.0070, 0.0127, and 0.0140) than a traditional RRS approach (mean IQRs of 0.0297, 0.0779, and 0.305) that does not employ cross-validation sampling.

14.
《Genomics》2020,112(5):3089-3096
Automatic classification of glaucoma from fundus images is a vital diagnostic tool for Computer-Aided Diagnosis (CAD) systems. In this work, a novel fused feature extraction technique and ensemble classifier fusion are proposed for the diagnosis of glaucoma. The proposed method comprises three stages. Initially, the fundus images are subjected to preprocessing, followed by feature extraction and feature fusion by Intra-Class and Extra-Class Discriminative Correlation Analysis (IEDCA). The feature fusion approach eliminates between-class correlation while retaining sufficient Feature Dimension (FD) for Correlation Analysis (CA). The fused features are then fed individually to the classifiers, namely Support Vector Machine (SVM), Random Forest (RF) and K-Nearest Neighbor (KNN). Finally, a classifier fusion scheme is designed that combines the decisions of the ensemble of classifiers using the Consensus-based Combining Method (CCM). CCM-based classifier fusion adjusts the weights iteratively after comparing the outputs of all the classifiers. The proposed fusion classifier improves accuracy and convergence compared with the individual algorithms. A classification accuracy of 99.2% is achieved by the two-level hybrid fusion approach. The method is evaluated on the public High Resolution Fundus (HRF) and DRIVE datasets with cross-dataset validation.

15.
MOTIVATION: Microbial diversity is still largely unknown in most environments, such as soils. In order to get access to this microbial 'black-box', the development of powerful tools such as microarrays is necessary. However, the reliability of this approach relies on probe efficiency, in particular sensitivity, specificity and explorative power, in order to obtain an image of the microbial communities that is close to reality. RESULTS: We propose a new probe design algorithm that is able to select microarray probes targeting SSU rRNA at any phylogenetic level. This original approach, implemented in a program called 'PhylArray', designs a combination of degenerate and non-degenerate probes for each target taxon. Comparative experimental evaluations indicate that probes designed with PhylArray yield a higher sensitivity and specificity than those designed by conventional approaches. Applying the combined PhylArray/GoArrays strategy helps to optimize the hybridization performance of short probes. Finally, hybridizations with environmental targets have shown that the use of the PhylArray strategy can draw attention to even previously unknown bacteria.

16.

Background  

Experimental examinations of biofluids to measure concentrations of proteins or their fragments or metabolites are being explored as a means of detecting disease early, distinguishing diseases with similar symptoms, and assessing drug treatment efficacy. Many studies have produced classifiers with a high sensitivity and specificity, and it has been argued that accurate results necessarily imply some underlying biology-based features in the classifier. The simplest test of this conjecture is to examine datasets designed to contain no information with classifiers used in many published studies.

17.
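OBJECTIVE: To assess the clinical value of transvaginal sonography (TVS), hysterosalpingography (HSG) and saline infusion sonohysterography (SIS), used alone and in combination, for diagnosing endometrial polyps (EP) in infertile patients. METHODS: The study included 206 infertile patients who underwent hysteroscopy combined with diagnostic curettage or pathological examination. The EP screening results of each method were analyzed retrospectively, and the validity, reliability and predictive values of each method were evaluated. RESULTS: Of the 206 infertile patients, 60 were confirmed to have EP, a positive rate of 29.1%. Among the three methods, TVS had the highest sensitivity (70.0%), the lowest specificity (73.3%), the lowest false-negative (missed diagnosis) rate (30.0%), the highest false-positive (misdiagnosis) rate (26.7%), the highest Youden index (43.3%), the smallest negative likelihood ratio (0.409) and the highest negative predictive value (85.6%). SIS had the lowest sensitivity (38.7%) and the highest false-negative rate (61.3%), but the highest specificity (93.3%), the lowest false-positive rate (6.7%), the largest positive likelihood ratio (4.284), the highest positive predictive value (66.6%) and the lowest Youden index (32.0%). For HSG, all of these measures fell between those of TVS and SIS. TVS and SIS showed low agreement with the gold standard, with Kappa values below 0.4; HSG showed the highest agreement rate (86.2%), with a Kappa value of 0.647. The three methods combined gave a sensitivity of 89.3%, a false-negative rate of 10.7%, a specificity of 91.4%, a false-positive rate of 8.6%, a Youden index of 80.7%, a positive likelihood ratio of 10.384, a negative likelihood ratio of 0.117, an agreement rate of 89.3%, a Kappa value of 0.792, a positive predictive value of 83.3% and a negative predictive value of 94.6%. CONCLUSION: For infertile patients with suspected endometrial polyps in the uterine cavity, TVS, HSG or SIS used alone each has low sensitivity, a high false-negative rate and poor agreement with the gold standard, whereas the combination of the three methods shows good validity, reliability and predictive values for diagnosing EP in infertile patients.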
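
Several of the indices reported in this abstract follow directly from sensitivity and specificity. The sketch below uses the standard definitions (Youden index = sensitivity + specificity - 1; positive likelihood ratio = sensitivity / (1 - specificity); negative likelihood ratio = (1 - sensitivity) / specificity) and checks them against the combined-test and TVS figures. The function name is illustrative, and the abstract does not state which formulas the authors used.

```python
def derived_indices(sensitivity, specificity):
    """Return (Youden index, LR+, LR-) from sensitivity and specificity.

    Both inputs are proportions in [0, 1]; results are rounded to 3 decimals.
    """
    youden = sensitivity + specificity - 1
    lr_positive = sensitivity / (1 - specificity)
    lr_negative = (1 - sensitivity) / specificity
    return round(youden, 3), round(lr_positive, 3), round(lr_negative, 3)


# Combined TVS + HSG + SIS: sensitivity 89.3%, specificity 91.4%.
combined = derived_indices(0.893, 0.914)

# TVS alone: sensitivity 70.0%, specificity 73.3%.
tvs = derived_indices(0.700, 0.733)
```

The combined-test values reproduce the reported Youden index (80.7%), positive likelihood ratio (10.384) and negative likelihood ratio (0.117), and the TVS values reproduce its reported Youden index (43.3%) and negative likelihood ratio (0.409).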

18.
Accurate diagnosis in suspected ischaemic stroke can be difficult. We explored the urinary proteome in patients with stroke (n = 69), compared to controls (n = 33), and developed a biomarker model for the diagnosis of stroke. We performed capillary electrophoresis coupled online to micro-time-of-flight mass spectrometry. Potentially disease-specific peptides were identified and a classifier based on these was generated using support vector machine-based software. Candidate biomarkers were sequenced by liquid chromatography-tandem mass spectrometry. We developed two biomarker-based classifiers, employing 14 biomarkers (nominal p-value <0.004) or 35 biomarkers (nominal p-value <0.01). When tested on a blinded test set of 47 independent samples, the classification factor was significantly different between groups; for the 35 biomarker model, the median value of the classifier was 0.49 (IQR -0.30 to 1.25) in cases compared to -1.04 (IQR -1.86 to -0.09) in controls, p<0.001. The 35 biomarker classifier gave a sensitivity of 56%, a specificity of 93% and an AUC on ROC analysis of 0.86. This study supports the potential for urinary proteomic biomarker models to assist with the diagnosis of acute stroke in those with mild symptoms. We now plan to refine the test further and explore its clinical utility in large prospective clinical trials.

19.
Recognition of protein-DNA binding sites in genomic sequences is a crucial step for discovering biological functions of genomic sequences. Explosive growth in availability of sequence information has resulted in a demand for binding site detection methods with high specificity. The motivation of the work presented here is to address this demand by a systematic approach based on Maximum Likelihood Estimation. A general framework is developed in which a large class of binding site detection methods can be described in a uniform and consistent way. Protein-DNA binding is determined by binding energy, which is an approximately linear function within the space of sequence words. All matrix-based binding word detectors can be regarded as different linear classifiers which attempt to estimate the linear separation implied by the binding energy function. The standard approaches of consensus sequences and profile matrices are described using this framework. A maximum likelihood approach for determining this linear separation leads to a novel matrix type, called the binding matrix. The binding matrix is the most specific matrix-based classifier which is consistent with the input set of known binding words. It achieves significant improvements in specificity compared to other matrices. This is demonstrated using 95 sets of experimentally determined binding words provided by the TRANSFAC database.
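
The statement that matrix-based detectors are linear classifiers can be made concrete with a small sketch: scoring a word against a position-specific matrix gives the same number as a dot product between the flattened matrix and a one-hot encoding of the word. The matrix values, threshold and function names below are invented for illustration; this is not the binding matrix derived in the paper.

```python
BASES = "ACGT"

# Hypothetical 3-position scoring matrix: one dict of per-base scores per position.
toy_matrix = [
    {"A": 1.0, "C": -1.0, "G": -1.0, "T": 0.0},
    {"A": -1.0, "C": 2.0, "G": 0.0, "T": -1.0},
    {"A": 0.0, "C": -1.0, "G": 1.5, "T": -1.0},
]


def matrix_score(word, matrix):
    """Classic matrix scoring: sum the entry for each base at each position."""
    return sum(column[base] for base, column in zip(word, matrix))


def one_hot(word):
    """Encode a word as 4 indicator features per position (A, C, G, T order)."""
    return [1.0 if base == b else 0.0 for base in word for b in BASES]


def flatten(matrix):
    """Flatten the matrix into a weight vector in the same feature order."""
    return [column[b] for column in matrix for b in BASES]


def linear_score(word, matrix):
    """Same score written as a dot product w . x, i.e. a linear classifier."""
    return sum(w * x for w, x in zip(flatten(matrix), one_hot(word)))


def is_binding_word(word, matrix, threshold=2.0):
    """Classify by thresholding the linear score (threshold is an assumption)."""
    return linear_score(word, matrix) >= threshold
```

For example, `matrix_score("ACG", toy_matrix)` and `linear_score("ACG", toy_matrix)` both evaluate to 4.5, so the choice of matrix and threshold fully determines a separating hyperplane in the one-hot word space.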

20.
OBJECTIVE: To analyze smears of 197 thyroid follicular tumors (adenoma and carcinoma). STUDY DESIGN: Several types of artificial neural networks (ANN) of various designs were used for diagnosis of thyroid follicular tumors. The typical complex of cytologic features, some nuclear morphometric parameters (area, perimeter, shape factor) and density features of chromatin texture (mean value and SD of gray levels) were defined for each tumor. RESULTS: The ANN was trained on cytologic features characteristic of thyroid follicular adenoma and follicular carcinoma. At subsequent testing, the correct cytologic diagnosis was established in 93% (25 of 27) of cases. Morphometry increased the accuracy of diagnosis for follicular tumors to 97% (75 of 78) of cases. ANN correctly distinguished an adenoma from a carcinoma in 87% (73 of 84) of cases when using color microscopic images of tumors. CONCLUSION: The use of ANN raised the sensitivity of cytologic diagnosis of follicular tumors to 90%, compared with 56% for the usual cytologic method. The automatic classification of thyroid follicular tumors by means of ANN is promising.
