
1.

Predicting Classifier Performance with Limited Training Data: Applications to Computer-Aided Diagnosis in Breast and Prostate Cancer

Ajay Basavanhally Satish Viswanath Anant Madabhushi 《PloS one》2015,10(5)

Clinical trials increasingly employ medical imaging data in conjunction with supervised classifiers, where the latter require large amounts of training data to accurately model the system. Yet, a classifier selected at the start of the trial based on smaller and more accessible datasets may yield inaccurate and unstable classification performance. In this paper, we aim to address two common concerns in classifier selection for clinical trials: (1) predicting expected classifier performance for large datasets based on error rates calculated from smaller datasets and (2) the selection of appropriate classifiers based on expected performance for larger datasets. We present a framework for comparative evaluation of classifiers using only limited amounts of training data by using random repeated sampling (RRS) in conjunction with a cross-validation sampling strategy. Extrapolated error rates are subsequently validated via comparison with leave-one-out cross-validation performed on a larger dataset. The ability to predict error rates as dataset size increases is demonstrated on both synthetic data and three different computational imaging tasks: detecting cancerous image regions in prostate histopathology, differentiating high and low grade cancer in breast histopathology, and detecting cancerous metavoxels in prostate magnetic resonance spectroscopy. For each task, the relationships between 3 distinct classifiers (k-nearest neighbor, naive Bayes, Support Vector Machine) are explored. Further quantitative evaluation in terms of interquartile range (IQR) suggests that, for all three datasets, our approach consistently yields error rates with lower variability (mean IQRs of 0.0070, 0.0127, and 0.0140) than a traditional RRS approach that does not employ cross-validation sampling (mean IQRs of 0.0297, 0.0779, and 0.305).
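As a rough illustration of extrapolating error rates from small training sets, the sketch below fits a power-law learning curve to error rates measured at a few training-set sizes and predicts the error at a larger size. The model form err(n) = a * n^(-alpha), the omission of an asymptotic error term, and the function names are assumptions made for this example; the abstract does not specify the extrapolation model.

```python
import math

def fit_power_law(sizes, errors):
    """Fit err(n) = a * n**(-alpha) by least squares in log-log space.

    Assumed model for illustration only; it ignores any non-zero
    asymptotic error that a real classifier would have.
    """
    xs = [math.log(n) for n in sizes]
    ys = [math.log(e) for e in errors]
    x_mean = sum(xs) / len(xs)
    y_mean = sum(ys) / len(ys)
    sxx = sum((x - x_mean) ** 2 for x in xs)
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    slope = sxy / sxx                      # equals -alpha
    a = math.exp(y_mean - slope * x_mean)  # intercept back-transformed
    return a, -slope

def predict_error(a, alpha, n):
    """Extrapolate the fitted curve to training-set size n."""
    return a * n ** (-alpha)

# Synthetic error rates generated from a known curve (not data from the paper).
sizes = [20, 40, 80, 160]
errors = [0.5 * n ** -0.5 for n in sizes]

a, alpha = fit_power_law(sizes, errors)
extrapolated = predict_error(a, alpha, 400)
```

In practice each point on the curve would be a summary (for example the median) of error rates over many repeated subsamples, and the spread of those repeats could be reported as an IQR.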

2.

Multi-class protein fold recognition using support vector machines and neural networks (total citations: 25; self-citations: 0; citations by others: 25)

Ding CH Dubchak I 《Bioinformatics (Oxford, England)》2001,17(4):349-358

MOTIVATION: Protein fold recognition is an important approach to structure discovery without relying on sequence similarity. We study this approach with new multi-class classification methods and examine many issues important for a practical recognition system. RESULTS: Most current discriminative methods for protein fold prediction use the one-against-others method, which has the well-known 'False Positives' problem. We investigated two new methods: the unique one-against-others and the all-against-all methods. Both improve prediction accuracy by 14-110% on a dataset containing 27 SCOP folds. We used the Support Vector Machine (SVM) and the Neural Network (NN) learning methods as base classifiers. SVMs converge fast and lead to high accuracy. When scores of multiple parameter datasets are combined, majority voting reduces noise and increases recognition accuracy. We examined many issues involved with a large number of classes, including dependencies of prediction accuracy on the number of folds and on the number of representatives in a fold. Overall, recognition systems achieve 56% fold prediction accuracy on a protein test dataset, where most of the proteins have below 25% sequence identity with the proteins used in training.
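The all-against-all idea can be sketched as one binary classifier per pair of classes, with the final label chosen by majority vote. The toy nearest-centroid pairwise classifiers, the class names, and the function names below are hypothetical stand-ins; the paper's base classifiers are SVMs and neural networks.

```python
from collections import Counter
from itertools import combinations

def all_against_all_predict(pairwise_classifiers, classes, x):
    """Return the class that wins the most pairwise contests for sample x."""
    votes = Counter()
    for a, b in combinations(classes, 2):
        winner = pairwise_classifiers[(a, b)](x)  # returns either a or b
        votes[winner] += 1
    # Ties are broken by first-counted order; a real system needs a rule.
    return votes.most_common(1)[0][0]

# Hypothetical 1-D class centroids standing in for trained classifiers.
centroids = {"fold_A": 0.0, "fold_B": 5.0, "fold_C": 10.0}

def make_pairwise(a, b):
    """Toy binary classifier: pick whichever centroid is closer to x."""
    return lambda x: a if abs(x - centroids[a]) <= abs(x - centroids[b]) else b

classes = sorted(centroids)
classifiers = {(a, b): make_pairwise(a, b) for a, b in combinations(classes, 2)}

prediction = all_against_all_predict(classifiers, classes, 4.2)
```

With K classes this scheme trains K(K-1)/2 classifiers, which is 351 for the 27 folds mentioned above.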

3.

The recognition of multi-class protein folds by adding average chemical shifts of secondary structure elements

Zhenxing Feng Xiuzhen Hu Zhuo Jiang Hangyu Song Muhammad Aqeel Ashraf 《Saudi Journal of Biological Sciences》2016,23(2):189-197

The recognition of protein folds is an important step in the prediction of protein structure and function. Recently, an increasing number of researchers have sought to improve the methods for protein fold recognition. Following the construction of a dataset consisting of 27 protein fold classes by Ding and Dubchak in 2001, prediction algorithms, parameters and the construction of new datasets have improved for the prediction of protein folds. In this study, we reorganized a dataset consisting of 76 fold classes constructed by Liu et al. and used the values of the increment of diversity, average chemical shifts of secondary structure elements and secondary structure motifs as feature parameters in the recognition of multi-class protein folds. With the combined feature vector as the input parameter for the Random Forests algorithm and ensemble classification strategy, we propose a novel method to identify the 76 protein fold classes. The overall accuracy of the test dataset using an independent test was 66.69%; when the training and test sets were combined, with 5-fold cross-validation, the overall accuracy was 73.43%. This method was further used to predict the test dataset and the corresponding structural classification of the first 27-protein fold class dataset, resulting in overall accuracies of 79.66% and 93.40%, respectively. Moreover, when the training set and test sets were combined, the accuracy using 5-fold cross-validation was 81.21%. Additionally, this approach resulted in improved prediction results using the 27-protein fold class dataset constructed by Ding and Dubchak.

4.

Accurate molecular classification of cancer using simple rules (total citations: 1; self-citations: 0; citations by others: 1)

Xiaosheng Wang Osamu Gotoh 《BMC medical genomics》2009,2(1):1-23

Background

One intractable problem with using microarray data analysis for cancer classification is how to reduce the extremely high-dimensionality gene feature data to remove the effects of noise. Feature selection is often used to address this problem by selecting informative genes from among thousands or tens of thousands of genes. However, most of the existing methods of microarray-based cancer classification utilize too many genes to achieve accurate classification, which often hampers the interpretability of the models. For a better understanding of the classification results, it is desirable to develop simpler rule-based models with as few marker genes as possible.

Methods

We screened a small number of informative single genes and gene pairs on the basis of their depended degrees proposed in rough sets. Applying the decision rules induced by the selected genes or gene pairs, we constructed cancer classifiers. We tested the efficacy of the classifiers by leave-one-out cross-validation (LOOCV) of training sets and classification of independent test sets.

Results

We applied our methods to five cancerous gene expression datasets: leukemia (acute lymphoblastic leukemia [ALL] vs. acute myeloid leukemia [AML]), lung cancer, prostate cancer, breast cancer, and leukemia (ALL vs. mixed-lineage leukemia [MLL] vs. AML). Accurate classification outcomes were obtained by utilizing just one or two genes. Some genes that correlated closely with the pathogenesis of relevant cancers were identified. In terms of both classification performance and algorithm simplicity, our approach outperformed or at least matched existing methods.

Conclusion

In cancerous gene expression datasets, a small number of genes, even one or two if selected correctly, is capable of achieving an ideal cancer classification effect. This finding also means that very simple rules may perform well for cancerous class prediction.

5.

Feature Selection and Classification of MAQC-II Breast Cancer and Multiple Myeloma Microarray Gene Expression Data

Qingzhong Liu Andrew H. Sung Zhongxue Chen Jianzhong Liu Xudong Huang Youping Deng 《PloS one》2009,4(12)

Microarray data has a high dimension of variables but available datasets usually have only a small number of samples, thereby making the study of such datasets interesting and challenging. In the task of analyzing microarray data for the purpose of, e.g., predicting gene-disease association, feature selection is very important because it provides a way to handle the high dimensionality by exploiting information redundancy induced by associations among genetic markers. Judicious feature selection in microarray data analysis can result in significant reduction of cost while maintaining or improving the classification or prediction accuracy of learning machines that are employed to sort out the datasets. In this paper, we propose a gene selection method called Recursive Feature Addition (RFA), which combines supervised learning and statistical similarity measures. We compare our method with the following gene selection methods:

Support Vector Machine Recursive Feature Elimination (SVMRFE)
Leave-One-Out Calculation Sequential Forward Selection (LOOCSFS)
Gradient based Leave-one-out Gene Selection (GLGS)

To evaluate the performance of these gene selection methods, we employ several popular learning classifiers on the MicroArray Quality Control phase II on predictive modeling (MAQC-II) breast cancer dataset and the MAQC-II multiple myeloma dataset. Experimental results show that gene selection is strictly paired with the learning classifier. Overall, our approach outperforms the other compared methods. The biological functional analysis based on the MAQC-II breast cancer dataset convinced us to apply our method for phenotype prediction. Additionally, learning classifiers also play important roles in the classification of microarray data, and our experimental results indicate that the Nearest Mean Scale Classifier (NMSC) is a good choice due to its prediction reliability and its stability across the three performance measurements: testing accuracy, MCC values, and AUC errors.

6.

Genetic algorithms applied to multi-class prediction for the analysis of gene expression data (total citations: 9; self-citations: 0; citations by others: 9)

Ooi CH Tan P 《Bioinformatics (Oxford, England)》2003,19(1):37-44

MOTIVATION: An important challenge in the use of large-scale gene expression data for biological classification occurs when the expression dataset being analyzed involves multiple classes. Key issues that need to be addressed under such circumstances are the efficient selection of good predictive gene groups from datasets that are inherently 'noisy', and the development of new methodologies that can enhance the successful classification of these complex datasets. METHODS: We have applied genetic algorithms (GAs) to the problem of multi-class prediction. A GA-based gene selection scheme is described that automatically determines the members of a predictive gene group, as well as the optimal group size, that maximizes classification success using a maximum likelihood (MLHD) classification method. RESULTS: The GA/MLHD-based approach achieves higher classification accuracies than other published predictive methods on the same multi-class test dataset. It also permits substantial feature reduction in classifier genesets without compromising predictive accuracy. We propose that GA-based algorithms may represent a powerful new tool in the analysis and exploration of complex multi-class gene expression data. AVAILABILITY: Supplementary information, data sets and source codes are available at http://www.omniarray.com/bioinformatics/GA.

7.

Multiclass cancer classification by support vector machines with class-wise optimized genes and probability estimates

Ashish Anand 《Journal of theoretical biology》2009,259(3):533-229

We investigate the multiclass classification of cancer microarray samples. In contrast to the classification of two cancer types from gene expression data, multiclass classification of more than two cancer types is a relatively hard and less studied problem. We used class-wise optimized genes with a corresponding one-versus-all support vector machine (OVA-SVM) classifier to maximize the utilization of selected genes. The final prediction was made using probability scores from all classifiers. We used three different methods of estimating probability from the decision value. Among the three probability methods, Platt's approach was more consistent, whereas the isotonic approach performed better for datasets with unequal proportions of samples in different classes. A probability-based decision not only gives a true and fair comparison between different one-versus-all (OVA) classifiers but also makes it possible to use them for any post-analysis. Several ensemble experiments, an example of post-analysis, of the three probability methods were implemented to study their effect on classification accuracy. We observe that the ensemble did help improve the predictive accuracy on cancer datasets, especially those involving unbalanced samples. A four-fold external stratified cross-validation experiment was performed on the six multiclass cancer datasets to obtain unbiased estimates of prediction accuracy. Analysis of class-wise frequently selected genes on two cancer datasets demonstrated that the approach was able to select important and relevant genes consistent with the literature. This study demonstrates successful implementation of the framework of class-wise feature selection and multiclass classification for prediction of cancer subtypes on six datasets.
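To illustrate why probability scores give a fairer comparison between one-versus-all classifiers than raw decision values, the sketch below maps each classifier's decision value to a probability with a Platt-style sigmoid, P = 1 / (1 + exp(A*f + B)), and picks the most probable class. The class names, decision values, and (A, B) parameters are invented for the example; in practice A and B are fitted per classifier on held-out data.

```python
import math

def platt_probability(decision_value, A, B):
    """Platt-style sigmoid mapping an SVM decision value to a probability."""
    return 1.0 / (1.0 + math.exp(A * decision_value + B))

def predict_ova(decision_values, platt_params):
    """Pick the class whose one-versus-all classifier is most confident."""
    probs = {
        c: platt_probability(decision_values[c], *platt_params[c])
        for c in decision_values
    }
    best = max(probs, key=probs.get)
    return best, probs

# Hypothetical decision values and already-fitted sigmoid parameters.
decision_values = {"type1": 0.4, "type2": 1.1, "type3": -0.7}
platt_params = {"type1": (-4.0, 0.0), "type2": (-1.0, 0.0), "type3": (-2.0, 0.0)}

best_class, probabilities = predict_ova(decision_values, platt_params)
raw_winner = max(decision_values, key=decision_values.get)
```

Here the raw decision values favor type2, while the calibrated probabilities favor type1, because the two classifiers' decision values live on different scales.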

8.

Are clusters found in one dataset present in another dataset? (total citations: 4; self-citations: 0; citations by others: 4)

Kapp AV Tibshirani R 《Biostatistics (Oxford, England)》2007,8(1):9-31

In many microarray studies, a cluster defined on one dataset is sought in an independent dataset. If the cluster is found in the new dataset, the cluster is said to be "reproducible" and may be biologically significant. Classifying a new datum to a previously defined cluster can be seen as predicting which of the previously defined clusters is most similar to the new datum. If the new data classified to a cluster are similar, molecularly or clinically, to the data already present in the cluster, then the cluster is reproducible and the corresponding prediction accuracy is high. Here, we take advantage of the connection between reproducibility and prediction accuracy to develop a validation procedure for clusters found in datasets independent of the one in which they were characterized. We define a cluster quality measure called the "in-group proportion" (IGP) and introduce a general procedure for individually validating clusters. Using simulations and real breast cancer datasets, the IGP is compared to four other popular cluster quality measures (homogeneity score, separation score, silhouette width, and weighted average discrepant pairs score). Moreover, simulations and the real breast cancer datasets are used to compare the four versions of the validation procedure, which all use the IGP but differ in the way in which the null distributions are generated. We find that the IGP is the best measure of prediction accuracy, and that one version of the validation procedure is more widely applicable than the other three. An implementation of this algorithm is in a package called "clusterRepro" available through The Comprehensive R Archive Network (http://cran.r-project.org).
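A minimal sketch of an in-group proportion, taken here to be the fraction of samples assigned to a cluster whose nearest neighbor is assigned to the same cluster. The use of Euclidean distance and the toy points are assumptions for illustration; this is not the clusterRepro implementation, and it omits the null-distribution step of the validation procedure.

```python
import math

def in_group_proportion(points, labels, cluster):
    """Fraction of samples in `cluster` whose nearest neighbor shares the label."""
    members = [i for i, lab in enumerate(labels) if lab == cluster]
    hits = 0
    for i in members:
        # Nearest neighbor among all other samples (Euclidean, by assumption).
        nearest = min(
            (j for j in range(len(points)) if j != i),
            key=lambda j: math.dist(points[i], points[j]),
        )
        if labels[nearest] == cluster:
            hits += 1
    return hits / len(members)

# Toy data: the last point is labeled "b" but sits next to cluster "a".
points = [(0, 0), (0, 1), (1, 0), (5, 5), (5, 6), (1.2, 0.2)]
labels = ["a", "a", "a", "b", "b", "b"]

igp_a = in_group_proportion(points, labels, "a")
igp_b = in_group_proportion(points, labels, "b")
```

The misplaced sample lowers the score of both clusters: it is its own miss in "b" and it becomes the nearest neighbor of one member of "a".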

9.

Network-based logistic regression integration method for biomarker identification

Ke Zhang Wei Geng Shuqin Zhang 《BMC systems biology》2018,12(9):135

Background

Many mathematical and statistical models and algorithms have been proposed for biomarker identification in recent years. However, the biomarkers inferred from different datasets suffer from a lack of reproducibility due to the heterogeneity of the data generated from different platforms or laboratories. This motivates us to develop robust biomarker identification methods by integrating multiple datasets.

Methods

In this paper, we developed an integrative method for classification based on logistic regression. Different constant terms are set in the logistic regression model to measure the heterogeneity of the samples. By minimizing the differences of the constant terms within the same dataset, both the homogeneity within the same dataset and the heterogeneity in multiple datasets can be kept. The model is formulated as an optimization problem with a network penalty measuring the differences of the constant terms. The L₁ penalty, elastic penalty and network related penalties are added to the objective function for the biomarker discovery purpose. Algorithms based on proximal Newton method are proposed to solve the optimization problem.

Results

We first applied the proposed method to the simulated datasets. Both the AUC of the prediction and the biomarker identification accuracy are improved. We then applied the method to two breast cancer gene expression datasets. By integrating both datasets, the prediction AUC is improved over directly merging the datasets and over MetaLasso. It is also comparable to the best AUC obtained when identifying biomarkers in an individual dataset. The biomarkers identified using the network-related penalty for variables were further analyzed. Meaningful subnetworks enriched by breast cancer were identified.

Conclusion

A network-based integrative logistic regression model is proposed in the paper. It improves both the prediction and biomarker identification accuracy.


10.

Genomic analyses based on pulmonary adenocarcinoma in situ reveal early lung cancer signature

Dan Li William Yang Yifan Zhang Jack Y Yang Renchu Guan Dong Xu Mary Qu Yang 《BMC medical genomics》2018,11(5):106

Background

Non-small cell lung cancer (NSCLC) accounts for more than about 80% of lung cancers. The early stages of NSCLC can be treated with complete resection with a good prognosis. However, most cases are detected at a late stage of the disease. The average survival rate of patients with invasive lung cancer is only about 4%. Adenocarcinoma in situ (AIS) is an intermediate subtype of lung adenocarcinoma that exhibits early stage growth patterns but can develop into invasion.

Methods

In this study, we used RNA-seq data from normal, AIS, and invasive lung cancer tissues to identify a gene module that represents the distinguishing characteristics of AIS as AIS-specific genes. Two differential expression analysis algorithms were employed to identify the AIS-specific genes. Then, the best-performing subset of AIS-specific genes for early lung cancer prediction was selected by random forest. Finally, the performance of early lung cancer prediction was assessed using random forest, support vector machine (SVM) and artificial neural networks (ANNs) on four independent early lung cancer datasets, including one tumor-educated blood platelets (TEPs) dataset.

Results

Based on the differential expression analysis, 107 AIS-specific genes that consisted of 93 protein-coding genes and 14 long non-coding RNAs (lncRNAs) were identified. The significant functions associated with these genes include angiogenesis and ECM-receptor interaction, which are highly related to cancer development and contribute to the smoking-free lung cancers. Moreover, 12 of the AIS-specific lncRNAs are involved in lung cancer progression by potentially regulating the ECM-receptor interaction pathway. The feature selection by random forest identified 20 of the AIS-specific genes as early stage lung cancer signatures using the dataset obtained from The Cancer Genome Atlas (TCGA) lung adenocarcinoma samples. Of the 20 signatures, two were lncRNAs, BLACAT1 and CTD-2527I21.15 which have been reported to be associated with bladder cancer, colorectal cancer and breast cancer. In blind classification for three independent tissue sample datasets, these signature genes consistently yielded about 98% accuracy for distinguishing early stage lung cancer from normal cases. However, the prediction accuracy for the blood platelets samples was only 64.35% (sensitivity 78.1%, specificity 50.59%, and AUROC 0.747).

Conclusions

The comparison of AIS with normal and invasive tumor revealed disease-specific genes and offered new insights into the mechanism underlying AIS progression into an invasive tumor. These genes can also serve as signatures for early diagnosis of lung cancer with high accuracy. The expression profile of gene signatures identified from tissue cancer samples yielded remarkable early cancer prediction for tissue samples but relatively lower accuracy for blood platelet samples.


11.

A Composite Model for Subgroup Identification and Prediction via Bicluster Analysis

Hung-Chia Chen Wen Zou Tzu-Pin Lu James J. Chen 《PloS one》2014,9(10)

Background

A major challenge in the analysis of large and complex biomedical data is to develop an approach for 1) identifying distinct subgroups in the sampled populations, 2) characterizing the relationships among subgroups, and 3) developing a prediction model to classify subgroup memberships of new samples by finding a set of predictors. Each subgroup can represent different pathogen serotypes of microorganisms, different tumor subtypes in cancer patients, or different genetic makeups of patients related to treatment response.

Methods

This paper proposes a composite model for subgroup identification and prediction using biclusters. A biclustering technique is first used to identify a set of biclusters from the sampled data. For each bicluster, a subgroup-specific binary classifier is built to determine whether a particular sample is inside or outside the bicluster. A composite model, which consists of all binary classifiers, is constructed to classify samples into several disjoint subgroups. The proposed composite model depends neither on any specific biclustering algorithm or bicluster pattern nor on any particular classification algorithm.

Results

The composite model was shown to have an overall accuracy of 97.4% for a synthetic dataset consisting of four subgroups. The model was applied to two datasets where the sample’s subgroup memberships were known. The procedure showed 83.7% accuracy in discriminating lung cancer adenocarcinoma and squamous carcinoma subtypes, and was able to identify 5 serotypes and several subtypes with about 94% accuracy in a pathogen dataset.

Conclusion

The composite model presents a novel approach to developing a biclustering-based classification model from unlabeled sampled data. The proposed approach combines unsupervised biclustering and supervised classification techniques to classify samples into disjoint subgroups based on their associated attributes, such as genotypic factors, phenotypic outcomes, efficacy/safety measures, or responses to treatments. The procedure is useful for identification of unknown species or new biomarkers for targeted therapy.

12.

A high-accuracy protein structural class prediction algorithm using predicted secondary structural information

Tian Liu Cangzhi Jia 《Journal of theoretical biology》2010,267(3):272-275

One major problem with existing algorithms for the prediction of protein structural classes is their low accuracy for proteins from the α/β and α+β classes. In this study, three novel features were rationally designed to model the differences between proteins from these two classes. In combination with other rationally designed features, an 11-dimensional vector prediction method was proposed. By means of this method, the overall prediction accuracy based on the 25PDB dataset was 1.5% higher than that of the previous best-performing method, MODAS. Furthermore, the prediction accuracy for proteins from the α+β class based on the 25PDB dataset was 5% higher than that of the previous best-performing method, SCPRED. The prediction accuracies obtained with the D675 and FC699 datasets were also improved.

13.

Injury prediction and vulnerability assessment using strain and susceptibility measures of the deep white matter

Wei Zhao Yunliang Cai Zhigang Li Songbai Ji 《Biomechanics and modeling in mechanobiology》2017,16(5):1709-1727

Reliable prediction and diagnosis of concussion is important for its effective clinical management. Previous model-based studies largely employ peak responses from a single element in a pre-selected anatomical region of interest (ROI) and utilize a single training dataset for injury prediction. A more systematic and rigorous approach is necessary to scrutinize the entire white matter (WM) ROIs as well as ROI-constrained neural tracts. To this end, we evaluated injury prediction performances of the 50 deep WM regions using predictor variables based on strains obtained from simulating the 58 reconstructed American National Football League head impacts. To objectively evaluate performance, repeated random subsampling was employed to split the impacts into independent training and testing datasets (39 and 19 cases, respectively, with 100 trials). Univariate logistic regressions were conducted based on training datasets to compute the area under the receiver operating characteristic curve (AUC), while accuracy, sensitivity, and specificity were reported based on testing datasets. Two tract-wise injury susceptibilities were identified as the best overall via pair-wise permutation test. They had comparable AUC, accuracy, and sensitivity, with the highest values occurring in superior longitudinal fasciculus (SLF; 0.867–0.879, 84.4–85.2, and 84.1–84.6%, respectively). Using metrics based on WM fiber strain, the most vulnerable ROIs included genu of corpus callosum, cerebral peduncle, and uncinate fasciculus, while genu and main body of corpus callosum, and SLF were among the most vulnerable tracts. Even for one un-concussed athlete, injury susceptibility of the right cingulum (hippocampus) was elevated. These findings highlight the unique injury discriminatory potentials of computational models and may provide important insight into how best to incorporate WM structural anisotropy for investigation of brain injury.
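The repeated random subsampling described above (58 impacts split into 39 training and 19 testing cases over 100 trials) can be sketched as follows. The function name, the fixed seed, and the use of plain index lists are choices made for this example, not details from the paper.

```python
import random

def repeated_random_subsampling(n_cases, n_train, n_trials, seed=0):
    """Return n_trials independent (train, test) index splits."""
    rng = random.Random(seed)  # seeded so the splits are reproducible
    indices = list(range(n_cases))
    splits = []
    for _ in range(n_trials):
        shuffled = rng.sample(indices, k=n_cases)  # a random permutation
        train = sorted(shuffled[:n_train])
        test = sorted(shuffled[n_train:])
        splits.append((train, test))
    return splits

# Split sizes taken from the abstract: 58 cases, 39 train / 19 test, 100 trials.
splits = repeated_random_subsampling(n_cases=58, n_train=39, n_trials=100)
first_train, first_test = splits[0]
```

A model would then be fitted on each training split and scored on the matching test split, with the performance metrics summarized across the 100 trials.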

14.

SegNet and Salp Water Optimization-driven Deep Belief Network for Segmentation and Classification of Brain Tumor

《Gene expression patterns : GEP》2022

Classification of brain tumors in Magnetic Resonance Imaging (MRI) images is widely used in treatment planning, early diagnosis, and outcome evaluation. It is very difficult to classify and diagnose tumors from several images. Thus, an automatic prediction strategy is essential for classifying brain tumors as malignant, core, edema, or benign. In this research, a novel approach using a Salp Water Optimization-based Deep Belief Network (SWO-based DBN) is introduced to classify brain tumors. At the initial stage, the input image is pre-processed to remove the artifacts present in the input image. Following pre-processing, the segmentation is executed by SegNet, where the SegNet is trained using the proposed SWO. Moreover, Convolutional Neural Network (CNN) features are extracted for further processing. Finally, the introduced SWO-based DBN technique categorizes the brain tumor with respect to the extracted features. The output of the introduced SegNet + SWO-based DBN is then used for brain tumor segmentation and classification. The developed technique produced better results, with the highest values of accuracy at 0.933, specificity at 0.880, and sensitivity at 0.938 on the BRATS 2018 dataset, and accuracy at 0.921, specificity at 0.853, and sensitivity at 0.928 on the BRATS 2020 dataset.

15.

The predictive accuracy of secondary chemical shifts is more affected by protein secondary structure than solvent environment

Marie-Laurence Tremblay Aaron W. Banks Jan K. Rainey 《Journal of biomolecular NMR》2010,46(4):257-270

Biomolecular NMR spectroscopy frequently employs estimates of protein secondary structure using secondary chemical shift (Δδ) values, measured as the difference between experimental and random coil chemical shifts (RCCS). Most published random coil data have been determined in aqueous conditions, reasonable for non-membrane proteins, but potentially less relevant for membrane proteins. Two new RCCS sets are presented here, determined in dimethyl sulfoxide (DMSO) and chloroform:methanol:water (4:4:1 by volume) at 298 K. A web-based program, CS-CHEMeleon, has been implemented to determine the accuracy of secondary structure assessment by calculating and comparing Δδ values for various RCCS datasets. Using CS-CHEMeleon, Δδ predicted versus experimentally determined secondary structures were compared for large datasets of membrane and non-membrane proteins as a function of RCCS dataset, Δδ threshold, nucleus, localized parameter averaging and secondary structure type. Optimized Δδ thresholds are presented both for published and for the DMSO and chloroform:methanol:water derived RCCS tables. Despite obvious RCCS variations between datasets, prediction of secondary structure was consistently similar. Strikingly, predictive accuracy seems to be most dependent upon the type of secondary structure, with helices being the most accurately predicted by Δδ values using five different RCCS tables. We suggest caution when using Δδ-based restraints in structure calculations as the underlying dataset may be biased. Comparative assessment of multiple RCCS datasets should be performed, and resulting Δδ-based restraints weighted appropriately relative to other experimental restraints.
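The secondary chemical shift defined above, Δδ = δ(experimental) − δ(random coil), and a simple threshold test can be sketched as below. The shift values, the single random coil reference, and the 0.7 ppm threshold are placeholders invented for the example; they are not taken from the paper's RCCS tables or its optimized thresholds, and this is not how CS-CHEMeleon is implemented.

```python
def secondary_shifts(experimental, random_coil):
    """Per-residue secondary chemical shift: experimental minus random coil (ppm)."""
    return [round(exp - rc, 3) for exp, rc in zip(experimental, random_coil)]

def exceeds_threshold(deltas, threshold):
    """Flag residues whose |secondary shift| reaches the chosen threshold."""
    return [abs(d) >= threshold for d in deltas]

# Placeholder shifts (ppm) for three residues; values are illustrative only.
experimental = [55.1, 54.8, 52.6]
random_coil = [52.5, 52.5, 52.5]

deltas = secondary_shifts(experimental, random_coil)
flags = exceeds_threshold(deltas, threshold=0.7)  # hypothetical threshold
```

Swapping in a different random coil table changes every Δδ value, which is why the abstract recommends comparing several RCCS datasets before using Δδ-based restraints.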

16.

Faster and more accurate global protein function assignment from protein interaction networks using the MFGO algorithm

Sun S Zhao Y Jiao Y Yin Y Cai L Zhang Y Lu H Chen R Bu D 《FEBS letters》2006,580(7):1891-1896

MOTIVATION: Predicting protein function accurately is an important issue in the post-genomic era. To achieve this goal, several approaches have been proposed to deduce the function of unclassified proteins through sequence similarity, co-expression profiles, and other information. Among these methods, the global optimization method (GOM) is an interesting and powerful tool that assigns functions to unclassified proteins based on their positions in a physical interactions network [Vazquez, A., Flammini, A., Maritan, A. and Vespignani, A. (2003) Global protein function prediction from protein-protein interaction networks, Nat. Biotechnol., 21, 697-700]. To boost both the accuracy and speed of GOM, a new prediction method, MFGO (modified and faster global optimization), is presented in this paper, which employs a local optimal repetition method to reduce calculation time and takes account of topological structure information to achieve a more accurate prediction. CONCLUSION: MFGO was tested on four protein interaction datasets (the Vazquez, YP, DIP-core, and SPK datasets) and compared with the popular MR (majority rule) and GOM methods. Experimental results confirm MFGO's improvement in both speed and accuracy. In particular, the MFGO method has a distinctive advantage in accurately predicting functions for proteins with few neighbors. Moreover, the robustness of the approach was validated both on a dataset containing a high percentage of unknown proteins and on a dataset disturbed through random insertion and deletion. The analysis shows that a moderate amount of misplaced interactions does not preclude a reliable function assignment.
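For context, the majority rule (MR) baseline mentioned above can be sketched as assigning an unclassified protein the function that is most common among its annotated interaction partners. The toy network, protein names, and function labels are hypothetical, and this is only the baseline idea, not the MFGO or GOM algorithm.

```python
from collections import Counter

def majority_rule(network, known_functions, protein):
    """Return the most common function among the protein's annotated neighbors."""
    counts = Counter(
        known_functions[neighbor]
        for neighbor in network.get(protein, ())
        if neighbor in known_functions
    )
    if not counts:
        return None  # no annotated neighbors, so no prediction is made
    return counts.most_common(1)[0][0]

# Hypothetical interaction network (adjacency lists) and annotations.
network = {
    "P1": ["P2", "P3", "P4", "P5"],
    "P6": ["P7"],
}
known_functions = {"P2": "transport", "P3": "transport", "P4": "signaling"}

predicted_p1 = majority_rule(network, known_functions, "P1")
predicted_p6 = majority_rule(network, known_functions, "P6")
```

The second call shows the weakness that the abstract highlights: a protein with few or no annotated neighbors gets no usable vote under a purely local rule.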

17.

Improving cancer classification accuracy using gene pairs

Chopra P Lee J Kang J Lee S 《PloS one》2010,5(12):e14305

Recent studies suggest that the deregulation of pathways, rather than individual genes, may be critical in triggering carcinogenesis. The pathway deregulation is often caused by the simultaneous deregulation of more than one gene in the pathway. This suggests that robust gene pair combinations may exploit the underlying bio-molecular reactions that are relevant to the pathway deregulation and thus they could provide better biomarkers for cancer, as compared to individual genes. In order to validate this hypothesis, in this paper, we used gene pair combinations, called doublets, as input to the cancer classification algorithms, instead of the original expression values, and we showed that the classification accuracy was consistently improved across different datasets and classification algorithms. We validated the proposed approach using nine cancer datasets and five classification algorithms including Prediction Analysis for Microarrays (PAM), C4.5 Decision Trees (DT), Naive Bayesian (NB), Support Vector Machine (SVM), and k-Nearest Neighbor (k-NN).

18.

A Stochastic Simulation Framework for the Prediction of Strategic Noise Mapping and Occupational Noise Exposure Using the Random Walk Approach

Lim Ming Han Zaiton Haron Khairulzan Yahya Suhaimi Abu Bakar Mohamad Ngasri Dimon 《PloS one》2015,10(4)

Strategic noise mapping provides important information for noise impact assessment and noise abatement. However, producing reliable strategic noise mapping in a dynamic, complex working environment is difficult. This study proposes the implementation of the random walk approach as a new stochastic technique to simulate noise mapping and to predict the noise exposure level in a workplace. A stochastic simulation framework and software, namely RW-eNMS, were developed to facilitate the random walk approach in noise mapping prediction. This framework considers the randomness and complexity of machinery operation and noise emission levels. It also assesses the impact of noise on the workers and the surrounding environment. For data validation, three case studies were conducted to check the accuracy of the prediction data and to determine the efficiency and effectiveness of this approach. The results showed high prediction accuracy, with the majority of absolute differences below 2 dBA; the predicted noise doses were also mostly within the range of the measurements. Therefore, the random walk approach was effective in dealing with environmental noise. It could predict strategic noise mapping to facilitate noise monitoring and noise control in workplaces.

19.

Prediction of protein-protein interactions using random decision forest framework (Total citations: 13; self-citations: 0; citations by others: 13)

Chen XW Liu M 《Bioinformatics (Oxford, England)》2005,21(24):4394-4400

MOTIVATION: Protein interactions are of biological interest because they orchestrate a number of cellular processes such as metabolic pathways and immunological recognition. Domains are the building blocks of proteins; therefore, proteins are assumed to interact as a result of their interacting domains. Many domain-based models for protein interaction prediction have been developed, and preliminary results have demonstrated their feasibility. Most of the existing domain-based methods, however, consider only single-domain pairs (one domain from one protein) and assume independence between domain-domain interactions. RESULTS: In this paper, we introduce a domain-based random forest of decision trees to infer protein interactions. Our proposed method is capable of exploring all possible domain interactions and making predictions based on all the protein domains. Experimental results on a Saccharomyces cerevisiae dataset demonstrate that our approach can predict protein-protein interactions with higher sensitivity (79.78%) and specificity (64.38%) than the maximum likelihood approach. Furthermore, our model can be used to infer interactions not only for single-domain pairs but also for multiple domain pairs.
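For readers unfamiliar with the two figures quoted above, the following sketch shows how sensitivity and specificity are conventionally computed from binary predictions. The label vectors are hypothetical and unrelated to the paper's data; the function name `sensitivity_specificity` is an assumption made for illustration.

```python
def sensitivity_specificity(y_true, y_pred):
    """Return (sensitivity, specificity) for binary labels, where 1 marks
    an interacting pair and 0 a non-interacting pair."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    sensitivity = tp / (tp + fn)  # fraction of true interactions recovered
    specificity = tn / (tn + fp)  # fraction of non-interactions rejected
    return sensitivity, specificity


# Hypothetical labels: 4 interacting pairs and 6 non-interacting pairs.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 0, 1, 1]

sens, spec = sensitivity_specificity(y_true, y_pred)
print(f"sensitivity={sens:.4f}, specificity={spec:.4f}")
```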

20.

A Time-Series-Based Feature Extraction Approach for Prediction of Protein Structural Class

Ravi Gupta Ankush Mittal Kuldip Singh 《EURASIP Journal on Bioinformatics and Systems Biology》2008,2008(1):235451

This paper presents a novel feature vector based on physicochemical properties of amino acids for the prediction of protein structural classes. The proposed method is divided into three stages. First, a discrete time-series representation of protein sequences is built using a physicochemical scale. Next, a wavelet-based time-series technique is proposed for extracting features from the mapped amino acid sequence, and a fixed-length feature vector for classification is constructed. The proposed feature space summarizes the variance information of ten different biological properties of amino acids. Finally, an optimized support vector machine model is constructed for prediction of each protein structural class. The proposed approach is evaluated using leave-one-out cross-validation tests on two standard datasets. Comparison of our results with existing approaches shows that the overall accuracy achieved by our approach is better than that of existing methods.
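The first two stages described above can be sketched roughly as follows. The abstract does not name the wavelet, the decomposition depth, or the ten properties, so this sketch makes its own assumptions: a single property (the Kyte-Doolittle hydropathy scale), a plain Haar transform, and three levels. The function names and the example sequence are hypothetical, and the code should not be read as the authors' implementation.

```python
import statistics

# Kyte-Doolittle hydropathy values (one property, chosen for illustration).
HYDROPATHY = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def to_series(sequence):
    """Stage 1: map each residue to its property value."""
    return [HYDROPATHY[residue] for residue in sequence]


def haar_step(signal):
    """One Haar level: pairwise scaled sums (approximation) and differences
    (detail). A trailing unpaired sample is dropped."""
    n = len(signal) - len(signal) % 2
    scale = 2 ** 0.5
    approx = [(signal[i] + signal[i + 1]) / scale for i in range(0, n, 2)]
    detail = [(signal[i] - signal[i + 1]) / scale for i in range(0, n, 2)]
    return approx, detail


def wavelet_variance_features(sequence, levels=3):
    """Stage 2: variance of the detail coefficients at each level, giving a
    vector of fixed length `levels` regardless of sequence length."""
    signal = to_series(sequence)
    features = []
    for _ in range(levels):
        if len(signal) < 2:
            features.append(0.0)  # sequence too short for this level
            continue
        signal, detail = haar_step(signal)
        features.append(statistics.pvariance(detail))
    return features


features = wavelet_variance_features("MKVLAAGIVGLLLAQ")  # hypothetical peptide
print(features)
```

Repeating the same computation for each of several property scales and concatenating the results would give a longer fixed-length vector, which could then be passed to a classifier such as a support vector machine.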