Similar literature
20 similar articles found (search time: 15 ms)
1.
Advances in DNA microarray technologies have made gene expression profiles a significant candidate for identifying different types of cancer. Traditional learning-based cancer identification methods use labeled samples to train a classifier, but they are inconvenient in practice because labels are expensive to obtain in clinical cancer research. This paper proposes a semi-supervised projective non-negative matrix factorization method (Semi-PNMF) that learns an effective classifier from both labeled and unlabeled samples, thus boosting subsequent cancer classification performance. In particular, Semi-PNMF jointly learns a non-negative subspace from concatenated labeled and unlabeled samples and indicates classes by the positions of the maximum entries of their coefficients. Because Semi-PNMF incorporates statistical information from the large volume of unlabeled samples into the learned subspace, it can learn more representative subspaces and boost classification performance. We developed a multiplicative update rule (MUR) to optimize Semi-PNMF and proved its convergence. Experimental results on two multiclass cancer gene expression profile datasets show that Semi-PNMF outperforms the representative methods.
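
The class-assignment step described above (classes indicated by the position of the maximum coefficient) can be sketched in a few lines. This is only an illustration of that final step, not the Semi-PNMF optimization itself; the function name `assign_classes` and the coefficient values in `H` are hypothetical.

```python
def assign_classes(coefficients):
    """Return, for each sample, the index of its largest coefficient."""
    labels = []
    for row in coefficients:
        if not row:
            raise ValueError("each sample needs at least one coefficient")
        # The position of the maximum entry is read as the class index.
        # Ties resolve to the lowest index.
        labels.append(max(range(len(row)), key=lambda k: row[k]))
    return labels


# Hypothetical non-negative coefficients: 4 samples, 3 classes.
H = [
    [0.9, 0.1, 0.0],
    [0.2, 0.7, 0.1],
    [0.0, 0.3, 0.6],
    [0.5, 0.4, 0.1],
]
print(assign_classes(H))
```

In the method as described, the coefficients would come from projecting each sample onto the learned non-negative subspace; here they are simply typed in.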

2.
For current computational intelligence techniques, a major challenge is how to learn new concepts in a changing environment. Traditional learning schemes cannot adequately address this problem because they lack a dynamic data selection mechanism. In this paper, inspired by the human learning process, a novel classification algorithm based on an incremental semi-supervised support vector machine (SVM) is proposed. Through analysis of the prediction confidence of samples and of the data distribution in a changing environment, a “soft-start” approach, a data selection mechanism and a data cleaning mechanism are designed, which together complete the construction of our incremental semi-supervised learning system. Notably, the design of the proposed algorithm effectively reduces the computational complexity. In addition, the possible appearance of new labeled samples during learning is analyzed in detail. The results show that our algorithm does not rely on a model of the sample distribution, has an extremely low rate of introducing wrong semi-labeled samples, and can effectively use the unlabeled samples to enrich the knowledge system of the classifier and improve accuracy. Moreover, our method has outstanding generalization performance and can overcome concept drift in a changing environment.

3.
MOTIVATION: Gene expression profiling is a powerful approach to identify genes that may be involved in a specific biological process on a global scale. For example, gene expression profiling of mutant animals that lack or contain an excess of certain cell types is a common way to identify genes that are important for the development and maintenance of given cell types. However, it is difficult for traditional computational methods, including unsupervised and supervised learning methods, to detect relevant genes from a large collection of expression profiles with high sensitivity and specificity. Unsupervised methods group similar gene expressions together while ignoring important prior biological knowledge. Supervised methods utilize training data from prior biological knowledge to classify gene expression. However, for many biological problems, little prior knowledge is available, which limits the prediction performance of most supervised methods. RESULTS: We present a Bayesian semi-supervised learning method, called BGEN, that improves upon supervised and unsupervised methods by both capturing relevant expression profiles and using prior biological knowledge from literature and experimental validation. Unlike currently available semi-supervised learning methods, this new method trains a kernel classifier based on labeled and unlabeled gene expression examples. The semi-supervised trained classifier can then be used to efficiently classify the remaining genes in the dataset. Moreover, we model the confidence of microarray probes and probabilistically combine multiple probe predictions into gene predictions. We apply BGEN to identify genes involved in the development of a specific cell lineage in the C. elegans embryo, and to further identify the tissues in which these genes are enriched. Compared to K-means clustering and SVM classification, BGEN achieves higher sensitivity and specificity. We confirm certain predictions by biological experiments. 
AVAILABILITY: The results are available at http://www.csail.mit.edu/~alanqi/projects/BGEN.html.

4.
《IRBM》2023,44(3):100747
Objectives

The accurate preoperative segmentation of the uterus and uterine fibroids from magnetic resonance images (MRI) is an essential step for diagnosis and real-time ultrasound guidance during high-intensity focused ultrasound (HIFU) surgery. Conventional supervised methods are effective techniques for image segmentation. Recently, semi-supervised segmentation approaches have been reported in the literature. One popular technique for semi-supervised methods is to use pseudo-labels to artificially annotate unlabeled data. However, many existing pseudo-label generation schemes rely on a fixed threshold to generate a confidence map, regardless of the proportion of unlabeled to labeled data.

Materials and Methods

To address this issue, we propose a novel semi-supervised framework called Confidence-based Threshold Adaptation Network (CTANet) to improve the quality of pseudo-labels. Specifically, we propose an online pseudo-labeling method that automatically adjusts the threshold, producing high-confidence annotations for unlabeled data and boosting segmentation accuracy. To further improve the network's generalization across the diversity of different patients, we design a novel mixup strategy that regularizes the network on each layer in the decoder part and introduces a consistency regularization loss between the outputs of two sub-networks in CTANet.

Results

We compare our method with several state-of-the-art semi-supervised segmentation methods on the same uterine fibroids dataset containing 297 patients. The performance is evaluated by the Dice similarity coefficient, the precision, and the recall. The results show that our method outperforms other semi-supervised learning methods. Moreover, for the same training set, our method approaches the segmentation performance of a fully supervised U-Net (100% annotated data) while using only a quarter of the annotated data (25% annotated data, 75% unannotated data).

Conclusion

Experimental results are provided to illustrate the effectiveness of the proposed semi-supervised approach. The proposed method can contribute to multi-class segmentation of uterine regions from MRI for HIFU treatment.
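
To make the contrast between a fixed and an adjusted pseudo-label threshold concrete, here is a minimal sketch. The abstract does not state how CTANet adapts its threshold, so `adaptive_threshold` below is a hypothetical stand-in rule (the batch's mean top-class confidence, clipped to a range) and not the published method; all names and numbers are illustrative assumptions.

```python
def pseudo_label(probabilities, threshold):
    """Keep predictions whose top class probability reaches the threshold.

    Returns a list of (sample_index, class_index) pairs.
    """
    selected = []
    for i, probs in enumerate(probabilities):
        best = max(range(len(probs)), key=lambda k: probs[k])
        if probs[best] >= threshold:
            selected.append((i, best))
    return selected


def adaptive_threshold(probabilities, base=0.5, ceiling=0.95):
    """Hypothetical rule (NOT the CTANet rule): use the mean top-class
    confidence of the batch, clipped to [base, ceiling]."""
    top = [max(p) for p in probabilities]
    mean_conf = sum(top) / len(top)
    return min(ceiling, max(base, mean_conf))


# Hypothetical softmax outputs for four unlabeled samples, two classes.
probs = [[0.97, 0.03], [0.55, 0.45], [0.20, 0.80], [0.65, 0.35]]

fixed = pseudo_label(probs, 0.9)      # fixed threshold keeps one sample
thr = adaptive_threshold(probs)       # mean top confidence of this batch
adaptive = pseudo_label(probs, thr)   # adjusted threshold keeps two samples
print(fixed, round(thr, 4), adaptive)
```

With a fixed threshold of 0.9 only the most confident sample is pseudo-labeled; a threshold that follows the batch admits more samples when the model is generally less certain, which is the trade-off the abstract points to.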

5.

Background

The prognosis of cancer recurrence is an important research area in bioinformatics and is challenging due to the small sample sizes compared to the vast number of genes. There have been several attempts to predict cancer recurrence. Most studies employed a supervised approach, which uses only a few labeled samples. Semi-supervised learning can be a great alternative to solve this problem. There have been few attempts based on manifold assumptions to reveal the detailed roles of identified cancer genes in recurrence.

Results

In order to predict cancer recurrence, we proposed a novel semi-supervised learning algorithm based on a graph regularization approach. We transformed the gene expression data into a graph structure for semi-supervised learning and integrated protein interaction data with the gene expression data to select functionally-related gene pairs. Then, we predicted the recurrence of cancer by applying a regularization approach to the constructed graph containing both labeled and unlabeled nodes.

Conclusions

The average improvement rate of accuracy for three different cancer datasets was 24.9% compared to existing supervised and semi-supervised methods. We performed functional enrichment on the gene networks used for learning and found that those gene networks are significantly associated with cancer-recurrence-related biological functions. Our algorithm was developed with standard C++ and is available in Linux and MS Windows formats in the STL library. The executable program is freely available at: http://embio.yonsei.ac.kr/~Park/ssl.php.
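
The idea of predicting labels on a graph that contains both labeled and unlabeled nodes can be illustrated with a simple stand-in technique. The sketch below is plain harmonic label propagation with clamped labels, not the authors' regularization algorithm; the graph, node names and weights are hypothetical.

```python
def propagate_labels(adjacency, labels, iterations=100):
    """Harmonic label propagation on a weighted undirected graph.

    adjacency: dict node -> dict neighbor -> weight (symmetric).
    labels:    dict node -> 0.0 or 1.0 for the labeled nodes only.
    Returns a dict node -> score in [0, 1].
    """
    # Unlabeled nodes start from an uninformative score of 0.5.
    scores = {node: labels.get(node, 0.5) for node in adjacency}
    for _ in range(iterations):
        updated = {}
        for node, neighbors in adjacency.items():
            if node in labels:
                updated[node] = labels[node]  # labeled nodes stay clamped
                continue
            total = sum(neighbors.values())
            if total == 0:
                updated[node] = scores[node]  # isolated node: no evidence
            else:
                # Weighted average of the neighbors' current scores.
                updated[node] = sum(
                    w * scores[m] for m, w in neighbors.items()
                ) / total
        scores = updated
    return scores


# Hypothetical chain A - B - C - D with unit weights.
graph = {
    "A": {"B": 1.0},
    "B": {"A": 1.0, "C": 1.0},
    "C": {"B": 1.0, "D": 1.0},
    "D": {"C": 1.0},
}
known = {"A": 1.0, "D": 0.0}  # e.g. recurrence vs. no recurrence
scores = propagate_labels(graph, known)
print({node: round(s, 3) for node, s in scores.items()})
```

On this chain the unlabeled nodes settle at 2/3 and 1/3, reflecting their distance from each labeled end. In the setting of the abstract, edges would come from expression similarity combined with protein interaction data.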

6.
The wealth of interaction information provided in biomedical articles has motivated the implementation of text mining approaches to automatically extract biomedical relations. This paper presents an unsupervised method based on pattern clustering and sentence parsing for biomedical relation extraction. The pattern clustering algorithm is based on a polynomial kernel method, which identifies interaction words from unlabeled data; these interaction words are then used in relation extraction between entity pairs. Dependency parsing and phrase structure parsing are combined for relation extraction. Based on the semi-supervised KNN algorithm, we extend the proposed unsupervised approach to a semi-supervised approach by combining pattern clustering, dependency parsing and phrase structure parsing rules. We evaluated the approaches on two different tasks: (1) protein–protein interaction extraction, and (2) gene–suicide association extraction. The evaluation of task (1) on the benchmark dataset (AImed corpus) showed that our proposed unsupervised approach outperformed three supervised methods, which are rule-based, SVM-based, and kernel-based, respectively. The proposed semi-supervised approach is superior to the existing semi-supervised methods. The evaluation of gene–suicide association extraction on a smaller dataset from the Genetic Association Database and a larger dataset from publicly available PubMed showed that the proposed unsupervised and semi-supervised methods achieved much higher F-scores than a co-occurrence-based method.

7.
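Because gene expression data have high feature dimensionality and few samples, the Fisher classifier does not perform very well on such data. This paper proposes Fisher-List, an improved version of the Fisher algorithm. What distinguishes the algorithm is that it determines a decision threshold for each class, and each threshold contains both population-level sample information and information from certain individual samples that are critical to classification. Experiments show that the new algorithm outperforms Fisher, LogitBoost, AdaBoost, k-nearest neighbors, decision trees, and support vector machines in classifying gene expression data.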

8.
Protein-protein interaction hot spots, as revealed by alanine scanning mutagenesis, make dominant contributions to the free energy of binding. Since mutagenesis experiments are expensive and time-consuming, the development of computational methods to identify hot spots is becoming increasingly important. In this study, by using a new combination of sequence, structure and energy features, we propose an iterative semi-supervised algorithm, SemiHS, which incorporates unlabeled data to improve the accuracy of hot spot prediction when sufficient training data is unavailable and to overcome the imbalanced data problem. We evaluate the predictive power of SemiHS on a labeled set of 265 alanine-mutated interface residues in 17 complexes and a large unlabeled set of 2465 interface residues with 10-fold cross validation, and obtain an AUC score of 0.85, with a sensitivity of 0.70 and a specificity of 0.87, which are better than those of the existing methods. Moreover, we validate the proposed method by an independent test and obtain encouraging results.
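
For readers less familiar with the reported metrics, sensitivity and specificity can be computed from binary predictions as sketched below. The labels and predictions are made up for illustration and have nothing to do with the SemiHS data; the function name is an assumption.

```python
def sensitivity_specificity(y_true, y_pred):
    """Compute (sensitivity, specificity) for binary labels (1 = hot spot)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # true negatives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
    if tp + fn == 0 or tn + fp == 0:
        raise ValueError("need at least one positive and one negative label")
    sensitivity = tp / (tp + fn)  # fraction of true hot spots recovered
    specificity = tn / (tn + fp)  # fraction of non-hot spots rejected
    return sensitivity, specificity


# Hypothetical labels and predictions for seven interface residues.
y_true = [1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 0, 0, 1]
sens, spec = sensitivity_specificity(y_true, y_pred)
print(round(sens, 3), round(spec, 3))
```

Reporting both values matters for imbalanced data such as hot spot sets, where a classifier can reach high accuracy by predicting the majority class alone.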

9.
10.

Background

Biomedical extraction based on supervised machine learning still faces the problem that a limited labeled dataset does not saturate the learning method. Many supervised learning algorithms for bio-event extraction have been affected by data sparseness.

Methods

In this study, a semi-supervised method that combines labeled data with large-scale unlabeled data is presented to improve the performance of biomedical event extraction. We propose a rich set of feature vectors, including a variety of syntactic and semantic features, such as N-gram features, walk subsequence features, predicate argument structure (PAS) features, and in particular some new features derived from a strategy named Event Feature Coupling Generalization (EFCG). The EFCG algorithm creates useful event recognition features by exploiting the correlation between two sorts of original features extracted from the labeled data, where the correlation is computed with the help of massive amounts of unlabeled data. The EFCG approach aims to solve the data sparseness problem caused by a limited tagged corpus, and enables the new features to cover much more event-related information with better generalization properties.

Results

The effectiveness of our event extraction system is evaluated on the datasets from the BioNLP Shared Task 2011 and PubMed. Experimental results demonstrate the state-of-the-art performance in the fine-grained biomedical information extraction task.

Conclusions

Limited labeled data can be combined with unlabeled data to tackle the data sparseness problem by means of our EFCG approach, and the classification capability of the model was enhanced by establishing a rich feature set from both labeled and unlabeled datasets. This semi-supervised learning approach could therefore go far towards improving the performance of the event extraction system. To the best of our knowledge, it was the first attempt at combining labeled and unlabeled data for tasks related to biomedical event extraction.

11.
Hastie T  Tibshirani R  Eisen MB  Alizadeh A  Levy R  Staudt L  Chan WC  Botstein D  Brown P 《Genome biology》2000,1(2):research0003.1-research0003.21

Background  

Large gene expression studies, such as those conducted using DNA arrays, often provide millions of different pieces of data. To address the problem of analyzing such data, we describe a statistical method, which we have called 'gene shaving'. The method identifies subsets of genes with coherent expression patterns and large variation across conditions. Gene shaving differs from hierarchical clustering and other widely used methods for analyzing gene expression studies in that genes may belong to more than one cluster, and the clustering may be supervised by an outcome measure. The technique can be 'unsupervised', that is, the genes and samples are treated as unlabeled, or partially or fully supervised by using known properties of the genes or samples to assist in finding meaningful groupings.

12.

Background

Predicting drug-protein interactions from heterogeneous biological data sources is a key step for in silico drug discovery. The difficulty of this prediction task lies in the rarity of known drug-protein interactions and the myriad unknown interactions to be predicted. To meet this challenge, we present a manifold regularization semi-supervised learning method that uses both labeled and unlabeled information, which often gives better results than using the labeled data alone. Furthermore, our semi-supervised learning method integrates known drug-protein interaction network information as well as chemical structure and genomic sequence data.

Results

Using the proposed method, we predicted certain drug-protein interactions on the enzyme, ion channel, GPCR, and nuclear receptor data sets. Some of them are confirmed by the latest publicly available drug target databases such as KEGG.

Conclusions

We report encouraging results of using our method for drug-protein interaction network reconstruction which may shed light on the molecular interaction inference and new uses of marketed drugs.

13.
In this paper, we develop a novel semi-supervised learning algorithm called active hybrid deep belief networks (AHD) to address the semi-supervised sentiment classification problem with deep learning. First, we construct the first several hidden layers using restricted Boltzmann machines (RBM), which can quickly reduce the dimension and abstract the information of the reviews. Second, we construct the following hidden layers using convolutional restricted Boltzmann machines (CRBM), which can abstract the information of the reviews effectively. Third, the constructed deep architecture is fine-tuned by gradient-descent-based supervised learning with an exponential loss function. Finally, an active learning method is combined with the proposed deep architecture. We conducted several experiments on five sentiment classification datasets and showed that AHD is competitive with previous semi-supervised learning algorithms. Experiments were also conducted to verify the effectiveness of the proposed method with different numbers of labeled and unlabeled reviews.

14.
DNA microarray gene expression and microarray-based comparative genomic hybridization (aCGH) have been widely used for biomedical discovery. Because of the large number of genes and the complex nature of biological networks, various analysis methods have been proposed. One such method is "gene shaving," a procedure which identifies subsets of the genes with coherent expression patterns and large variation across samples. Since combining genomic information from multiple sources can improve classification and prediction of diseases, in this paper we propose a new method, "ICA gene shaving" (ICA, independent component analysis), for jointly analyzing gene expression and copy number data. First, we used ICA to analyze joint measurements, gene expression and copy number, of a biological system and to project the data onto statistically independent biological processes. Next, we used these results to identify patterns of variation in the data and then applied an iterative shaving method. We investigated the properties of our proposed method by analyzing both simulated and real data. Using simulated data, we demonstrated the robustness of our method to noise. Using breast cancer data, we showed that our method is superior to the Generalized Singular Value Decomposition (GSVD) gene shaving method for identifying genes associated with breast cancer.

15.
We propose a new method for tumor classification from gene expression data, which consists of three main steps. First, the original DNA microarray gene expression data are modeled by independent component analysis (ICA). Second, the most discriminant eigenassays extracted by ICA are selected by the sequential floating forward selection technique. Finally, a support vector machine is used to classify the modeled data. To show the validity of the proposed method, we applied it to classify three DNA microarray datasets involving various human normal and tumor tissue samples. The experimental results show that the method is efficient and feasible.

16.
《Genomics》2020,112(1):114-126
Gene expression data are expected to contribute greatly to efficient cancer diagnosis and prognosis. Such data comprise a large number of measured genes, only a few of which carry valuable information for distinguishing classes of samples. Recently, several researchers have proposed gene selection methods based on metaheuristic algorithms for analysing and interpreting gene expression data. However, because of the large number of selected genes relative to the limited number of patient samples, and the complex interactions between genes, many gene selection methods struggle to find the most relevant and reliable genes. Hence, in this paper, a hybrid filter/wrapper method, called rMRMR-MBA, is proposed for the gene selection problem. In this method, robust Minimum Redundancy Maximum Relevancy (rMRMR) is used as a filter to select the most promising genes, and a modified bat algorithm (MBA) is used as the search engine in a wrapper approach to identify a small set of informative genes. The performance of the proposed method has been evaluated using ten gene expression datasets. For performance evaluation, the convergence behaviour of MBA is studied with and without TRIZ optimisation operators. For comparative evaluation, the results of the proposed rMRMR-MBA were compared against ten state-of-the-art methods using the same datasets. The comparative study demonstrates that the proposed method produced better results in terms of classification accuracy and number of selected genes on two out of ten datasets and competitive results on the remaining datasets. In short, the proposed method achieves high classification accuracy and can be considered a promising contribution to the gene selection domain.

17.
18.
19.
Fung ES  Ng MK 《Bioinformation》2007,2(5):230-234
One application of discriminant analysis to microarray data is to classify patient and normal samples based on gene expression values. The analysis is especially important in medical trials and in the diagnosis of cancer subtypes. The main contribution of this paper is a simple Fisher-type discriminant method for gene selection in microarray data. In the new algorithm, we calculate a weight for each gene and use the weight values as an indicator to identify the subsets of relevant genes that categorize patient and normal samples. An l2-l1 norm minimization method is incorporated into the discriminant process to automatically compute the weights of all genes in the samples. Experiments on two microarray data sets show that the new algorithm can generate classification results as good as those of other classification methods and can effectively determine relevant genes for classification purposes. In this study, we demonstrate the gene selection ability and the computational effectiveness of the proposed algorithm. Experimental results are given to illustrate the usefulness of the proposed model.

20.
Rapidly growing public gene expression databases contain a wealth of data for building an unprecedentedly detailed picture of human biology and disease. These data come from many diverse measurement platforms, which makes integrating them difficult. Although RNA-sequencing (RNA-seq) is attracting the most attention, at present the rate of new microarray studies submitted to public databases far exceeds the rate of new RNA-seq studies. There is clearly a need for methods that make it easier to combine data from different technologies. In this paper, we propose a new method for processing RNA-seq data that yields gene expression estimates that are much more similar to corresponding estimates from microarray data, hence greatly improving cross-platform comparability. The method, which we call PREBS, is based on estimating the expression from RNA-seq reads overlapping the microarray probe regions, and processing these estimates with standard microarray summarisation algorithms. Using paired microarray and RNA-seq samples from the TCGA LAML data set, we show that PREBS expression estimates derived from RNA-seq are more similar to microarray-based expression estimates than those from other RNA-seq processing methods. In an experiment to retrieve paired microarray samples from a database using an RNA-seq query sample, gene signatures defined based on PREBS expression estimates were found to be much more accurate than those from other methods. PREBS also allows new ways of using RNA-seq data, such as expression estimation for microarray probe sets. An implementation of the proposed method is available in the Bioconductor package “prebs.”
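
The first step described above, counting RNA-seq reads that overlap microarray probe regions, can be sketched as a simple interval-overlap count. This is an illustration under stated assumptions only: it is not the API of the Bioconductor package, it ignores strand and chromosome handling, and the coordinates and probe names are hypothetical. The subsequent microarray summarisation step is not shown.

```python
def count_overlapping_reads(reads, probes):
    """Count reads overlapping each probe region on a single chromosome.

    reads:  list of (start, end) half-open intervals.
    probes: dict probe_id -> (start, end) half-open interval.
    Returns a dict probe_id -> number of overlapping reads.
    """
    counts = {}
    for probe_id, (p_start, p_end) in probes.items():
        # Two half-open intervals overlap when each starts before
        # the other ends.
        counts[probe_id] = sum(
            1 for r_start, r_end in reads
            if r_start < p_end and p_start < r_end
        )
    return counts


# Hypothetical aligned reads and probe regions.
reads = [(100, 150), (140, 190), (300, 350)]
probes = {
    "probe_a": (145, 170),
    "probe_b": (200, 225),
    "probe_c": (340, 365),
}
counts = count_overlapping_reads(reads, probes)
print(counts)
```

The per-probe counts play the role that probe intensities play on a microarray, which is what lets standard summarisation algorithms be applied afterwards. For genome-scale data an interval tree or sorted sweep would replace the quadratic loop used here.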
