Similar literature
 20 similar articles found (search time: 62 ms)
1.
MOTIVATION: Discriminant analysis is an effective tool for the classification of experimental units into groups. Here, we consider the typical problem of classifying subjects according to phenotypes via gene expression data and propose a method that incorporates variable selection into the inferential procedure, for the identification of the important biomarkers. To achieve this goal, we build upon a conjugate normal discriminant model, both linear and quadratic, and include a stochastic search variable selection procedure via an MCMC algorithm. Furthermore, we incorporate into the model prior information on the relationships among the genes as described by a gene-gene network. We use a Markov random field (MRF) prior to map the network connections among genes. Our prior model assumes that neighboring genes in the network are more likely to have a joint effect on the relevant biological processes. RESULTS: We use simulated data to assess the performance of our method. In particular, we compare the MRF prior to a situation where independent Bernoulli priors are chosen for the individual predictors. We also illustrate the method on benchmark datasets for gene expression. Our simulation studies show that employing the MRF prior improves selection accuracy. In real data applications, in addition to identifying markers and improving prediction accuracy, we show how the integration of existing biological knowledge into the prior model results in an increased ability to identify genes with strong discriminatory power and also aids the interpretation of the results.
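
The contrast between an MRF prior and independent Bernoulli priors can be illustrated with a small sketch. The abstract does not give the functional form of the prior, so the code below assumes one common parameterization, in which the conditional log-odds of selecting a gene is `d + f * (number of selected neighbors)`. The function name, the toy network, and the values of `d` and `f` are all hypothetical.

```python
import math

def inclusion_probability(gene, selected, neighbors, d=-2.0, f=0.5):
    """Conditional prior probability that `gene` is selected.

    Assumed form (not taken from the abstract): the log-odds of selection
    is d + f * (number of network neighbors that are currently selected).
    Setting f = 0 gives an independent Bernoulli prior for every gene.
    """
    n_selected_neighbors = sum(1 for g in neighbors.get(gene, ()) if g in selected)
    score = d + f * n_selected_neighbors
    return 1.0 / (1.0 + math.exp(-score))

# Hypothetical toy network: gene A is linked to B and C, gene D is isolated.
network = {"A": ["B", "C"], "B": ["A"], "C": ["A"], "D": []}
selected = {"B", "C"}

p_connected = inclusion_probability("A", selected, network)          # two selected neighbors
p_isolated = inclusion_probability("D", selected, network)           # no neighbors
p_bernoulli = inclusion_probability("A", selected, network, f=0.0)   # network ignored
```

Under this assumed form, a gene whose neighbors are already selected receives a higher prior inclusion probability than an isolated gene, while `f = 0` removes the network effect entirely.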

2.

Background

Predicting disease causative genes (or simply, disease genes) has played a critical role in understanding the genetic basis of human diseases and in providing disease treatment guidelines. Various computational methods have been proposed for disease gene prediction. With the increasing availability of biological information for genes, there is strong motivation to leverage these data sources and extract useful information for accurately predicting disease genes.

Results

We present an integrative framework called N2VKO to predict disease genes. First, we learn node embeddings for genes from a protein-protein interaction (PPI) network by adapting the well-known representation learning method node2vec. Second, we combine the learned node embeddings with various biological annotations as a rich feature representation for genes, and subsequently build binary classification models for disease gene prediction. Finally, as the data for disease gene prediction are usually imbalanced (i.e. the number of causative genes for a specific disease is much smaller than the number of non-causative genes), we address this data imbalance by applying oversampling techniques to improve the prediction performance. Comprehensive experiments demonstrate that our proposed N2VKO significantly outperforms four state-of-the-art methods for disease gene prediction across seven diseases.

Conclusions

In this study, we show that node embeddings learned from PPI networks work well for disease gene prediction, while integrating node embeddings with other biological annotations further improves the performance of classification models. Moreover, oversampling techniques for imbalance correction further enhance the prediction performance. In addition, a literature search of the predicted disease genes also shows the effectiveness of our proposed N2VKO framework for disease gene prediction.
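
The imbalance-correction step can be sketched as follows. The abstract does not say which oversampling technique N2VKO uses, so this example substitutes the simplest variant, random oversampling with replacement, as a stand-in. The function name `random_oversample` and the toy feature vectors are hypothetical.

```python
import random

def random_oversample(features, labels, seed=0):
    """Duplicate minority-class rows at random until all classes are the same size.

    This is plain random oversampling, used here only as an illustration;
    it is not claimed to be the technique used by N2VKO.
    """
    rng = random.Random(seed)
    by_class = {}
    for x, y in zip(features, labels):
        by_class.setdefault(y, []).append(x)
    target = max(len(rows) for rows in by_class.values())
    out_x, out_y = [], []
    for y, rows in by_class.items():
        # Draw extra rows with replacement from the same class.
        extra = [rng.choice(rows) for _ in range(target - len(rows))]
        for x in rows + extra:
            out_x.append(x)
            out_y.append(y)
    return out_x, out_y

# Hypothetical toy data: four non-causative genes (0) and one causative gene (1).
X = [[0.1, 0.2], [0.3, 0.1], [0.2, 0.4], [0.5, 0.5], [0.9, 0.8]]
y = [0, 0, 0, 0, 1]
X_bal, y_bal = random_oversample(X, y)
```

In practice, oversampling is usually applied to the training split only, so that duplicated examples do not leak into the evaluation data.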

3.
Many cell activities are organized as a network, and genes are clustered into co-expressed groups if they have the same or closely related biological function or if they are co-regulated. In this study, we assumed that a strong candidate disease gene is more likely to be close to gene groups whose members are all coordinately differentially expressed than to individual differentially expressed genes. On this basis, we developed a novel disease gene prioritization method, GroupRank, which integrates gene co-expression and differential expression information generated from microarray data with a PPI network. GroupRank ranks a candidate gene highly if it is differentially expressed between disease and control or is close to differentially co-expressed groups in the PPI network. We tested our method on data sets of lung, kidney, leukemia and breast cancer. The results revealed that GroupRank could efficiently prioritize disease genes, with a significantly improved AUC value in comparison to the previous method, which does not consider co-expressed gene groups in the PPI network. Moreover, functional analyses of the major contributing gene group in the prioritization of kidney cancer genes verified that GroupRank not only ranks disease genes efficiently but could also help us identify and understand possible mechanisms in important physiological and pathological processes of disease.

4.
MOTIVATION: Gene expression profiling is a powerful approach to identify genes that may be involved in a specific biological process on a global scale. For example, gene expression profiling of mutant animals that lack or contain an excess of certain cell types is a common way to identify genes that are important for the development and maintenance of given cell types. However, it is difficult for traditional computational methods, including unsupervised and supervised learning methods, to detect relevant genes from a large collection of expression profiles with high sensitivity and specificity. Unsupervised methods group similar gene expressions together while ignoring important prior biological knowledge. Supervised methods utilize training data from prior biological knowledge to classify gene expression. However, for many biological problems, little prior knowledge is available, which limits the prediction performance of most supervised methods. RESULTS: We present a Bayesian semi-supervised learning method, called BGEN, that improves upon supervised and unsupervised methods by both capturing relevant expression profiles and using prior biological knowledge from literature and experimental validation. Unlike currently available semi-supervised learning methods, this new method trains a kernel classifier based on labeled and unlabeled gene expression examples. The semi-supervised trained classifier can then be used to efficiently classify the remaining genes in the dataset. Moreover, we model the confidence of microarray probes and probabilistically combine multiple probe predictions into gene predictions. We apply BGEN to identify genes involved in the development of a specific cell lineage in the C. elegans embryo, and to further identify the tissues in which these genes are enriched. Compared to K-means clustering and SVM classification, BGEN achieves higher sensitivity and specificity. We confirm certain predictions by biological experiments. 
AVAILABILITY: The results are available at http://www.csail.mit.edu/~alanqi/projects/BGEN.html.

5.
Chronic obstructive pulmonary disease (COPD) is a major cause of morbidity and mortality worldwide. Irreversible airflow limitation, both progressive and associated with an inflammatory response of the lungs to noxious particles or gases, is a hallmark of the disease. Cigarette smoking is the most important environmental risk factor for COPD; nevertheless, only approximately 20–30% of smokers develop symptomatic disease. Epidemiological studies, case-control studies in relatives of patients with COPD, and twin studies suggest that COPD is a genetically complex disease with environmental factors and many involved genes interacting together. Two major strategies have been employed to identify the genes and the polymorphisms that likely contribute to the development of complex diseases: association studies and linkage analyses. Biologically plausible pathogenetic mechanisms are prerequisites to focus the search for genes of known function in association studies. Protease-antiprotease imbalance, generation of oxidative stress, and chronic inflammation are recognized as the principal mechanisms leading to irreversible airflow obstruction and parenchymal destruction in the lung. Therefore, genes which have been implicated in the pathogenesis of COPD are involved in antiproteolysis, antioxidant barrier and metabolism of xenobiotic substances, inflammatory response to cigarette smoke, airway hyperresponsiveness, and pulmonary vascular remodelling. Significant associations with COPD-related phenotypes have been reported for polymorphisms in genes coding for matrix metalloproteinases, microsomal epoxide hydrolase, glutathione-S-transferases, heme oxygenase, tumor necrosis factor, interleukins 1, 8, and 13, vitamin D-binding protein and β-2-adrenergic receptor (ADRB2). However, adequately powered replication studies failed to confirm most of the previously observed associations.
Genome-wide linkage analyses provide us with a novel tool to identify the general locations of COPD susceptibility genes, and should be followed by association analyses of positional candidate genes from COPD pathophysiology, positional candidate genes selected from gene expression studies, or dense single nucleotide polymorphism panels across regions of linkage. Haplotype analyses of genes with multiple polymorphic sites in linkage disequilibrium, such as the ADRB2 gene, provide another promising field that has yet to be explored in patients with COPD. In the present article we review the current knowledge about gene polymorphisms that have been recently linked to the risk of developing COPD and/or may account for variations in the disease course.

6.
Prognostic and diagnostic biomarker discovery is one of the key issues for a successful stratification of patients according to clinical risk factors. For this purpose, statistical classification methods, such as support vector machines (SVM), are frequently used tools. Different groups have recently shown that the use of prior biological knowledge significantly improves the classification results in terms of accuracy as well as reproducibility and interpretability of gene lists. Here, we introduce pathClass, a collection of different SVM-based classification methods for improved gene selection and classification performance. The methods contained in pathClass do not merely rely on gene expression data but also exploit the information that is carried in gene network data. AVAILABILITY: pathClass is open source and freely available as an R package on the CRAN repository at http://cran.r-project.org.

7.
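This study used bioinformatics methods to identify molecular markers of chronic obstructive pulmonary disease (COPD) and differentially expressed genes shared by COPD and lung squamous cell carcinoma (LUSC), to find predictors of lung cancer in patients with COPD, and to discover new therapeutic targets. Three gene microarray datasets were selected from the GEO database to mine differentially expressed genes (DEGs) and potential biomarkers in small airway epithelial cells (SAEC) of COPD patients. Gene Ontology (GO) and Kyoto Encyclopedia of Genes and Genomes (KEGG) enrichment analyses were used to predict the functions of the DEGs and the metabolic pathways in which they participate. A PPI network was then constructed for the DEGs, and Cytoscape was used to screen submodules and hub genes. The hub genes were analyzed in the TCGA database for differential expression in LUSC and for correlations among the differentially expressed genes. In total, 52 upregulated and 24 downregulated genes were obtained. The enriched pathways were mainly metabolism of xenobiotics by cytochrome P450, chemical carcinogenesis, arachidonic acid metabolism, and thyroid hormone synthesis. Two functional modules and 10 hub genes were screened from the PPI network with Cytoscape, and further validation showed that 5 of these genes were also differentially expressed in LUSC samples in the TCGA database. We therefore infer that SPP1, ALDH3A1, SPRR3, KRT6A and SPRR1B may be molecular markers of COPD and DEGs shared by COPD and LUSC, which lays a good foundation for studying the pathogenesis of COPD and LUSC and the potential relationship between the two.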

8.

Background

Non-small cell lung cancer (NSCLC) accounts for more than about 80% of lung cancers. Early-stage NSCLC can be treated by complete resection with a good prognosis. However, most cases are detected at a late stage of the disease. The average survival rate of patients with invasive lung cancer is only about 4%. Adenocarcinoma in situ (AIS) is an intermediate subtype of lung adenocarcinoma that exhibits early-stage growth patterns but can progress to invasive disease.

Methods

In this study, we used RNA-seq data from normal, AIS, and invasive lung cancer tissues to identify a gene module that represents the distinguishing characteristics of AIS, which we term AIS-specific genes. Two differential expression analysis algorithms were employed to identify the AIS-specific genes. Then, the best-performing subset of AIS-specific genes for early lung cancer prediction was selected by random forest. Finally, the performance of early lung cancer prediction was assessed using random forest, support vector machine (SVM) and artificial neural networks (ANNs) on four independent early lung cancer datasets, including one tumor-educated blood platelets (TEPs) dataset.

Results

Based on the differential expression analysis, 107 AIS-specific genes that consisted of 93 protein-coding genes and 14 long non-coding RNAs (lncRNAs) were identified. The significant functions associated with these genes include angiogenesis and ECM-receptor interaction, which are highly related to cancer development and contribute to the smoking-free lung cancers. Moreover, 12 of the AIS-specific lncRNAs are involved in lung cancer progression by potentially regulating the ECM-receptor interaction pathway. The feature selection by random forest identified 20 of the AIS-specific genes as early stage lung cancer signatures using the dataset obtained from The Cancer Genome Atlas (TCGA) lung adenocarcinoma samples. Of the 20 signatures, two were lncRNAs, BLACAT1 and CTD-2527I21.15 which have been reported to be associated with bladder cancer, colorectal cancer and breast cancer. In blind classification for three independent tissue sample datasets, these signature genes consistently yielded about 98% accuracy for distinguishing early stage lung cancer from normal cases. However, the prediction accuracy for the blood platelets samples was only 64.35% (sensitivity 78.1%, specificity 50.59%, and AUROC 0.747).

Conclusions

The comparison of AIS with normal and invasive tumor tissue revealed disease-specific genes and offered new insights into the mechanism underlying AIS progression into an invasive tumor. These genes can also serve as signatures for early diagnosis of lung cancer with high accuracy. The expression profile of gene signatures identified from tissue cancer samples yielded remarkable early cancer prediction for tissue samples, but relatively lower accuracy for blood platelet samples.

9.

Background

Polygenic diseases are usually caused by the dysfunction of multiple genes. Unravelling such disease genes is crucial to fully understanding the genetic landscape of diseases at the molecular level. With the advent of the ‘omic’ data era, network-based methods have markedly boosted disease gene discovery. However, how to make better use of different types of data for the prediction of disease genes remains a challenge.

Results

In this study, we improved the performance of disease gene prediction by integrating the similarity of disease phenotype, biological function and network topology. First, for each phenotype, a phenotype-specific network was constructed by mapping the phenotype similarity information of the given phenotype onto the protein-protein interaction (PPI) network. Then, we developed a gene gravity-like algorithm to score candidate genes based not only on topological similarity but also on functional similarity. We tested the proposed network and algorithm by conducting leave-one-out and leave-10%-out cross validation and compared them with state-of-the-art algorithms. The results favored both the phenotype-specific network and the gene gravity-like algorithm. Finally, we tested the predictive capacity of the proposed algorithms on a test gene set derived from the DisGeNET database. Potential disease genes of three polygenic diseases, obesity, prostate cancer and lung cancer, were also predicted by the proposed methods. We found that the predicted disease genes are highly consistent with literature and database evidence.

Conclusions

The good performance of phenotype-specific networks indicates that phenotype similarity information has a positive effect on the prediction of disease genes. The proposed gene gravity-like algorithm outperforms the Random Walk with Restart (RWR) algorithm, indicating the predictive capacity gained by combining topological similarity with functional similarity. Our work offers insight into the discovery of disease genes by fusing multiple similarities of genes and diseases.

10.

Background

One of the major goals in gene and protein expression profiling of cancer is to identify biomarkers and build classification models for prediction of disease prognosis or treatment response. Many traditional statistical methods, based on microarray gene expression data alone and individual genes' discriminatory power, often fail to identify biologically meaningful biomarkers thus resulting in poor prediction performance across data sets. Nonetheless, the variables in multivariable classifiers should synergistically interact to produce more effective classifiers than individual biomarkers.

Results

We developed an integrated approach, namely network-constrained support vector machine (netSVM), for cancer biomarker identification with an improved prediction performance. The netSVM approach is specifically designed for network biomarker identification by integrating gene expression data and protein-protein interaction data. We first evaluated the effectiveness of netSVM using simulation studies, demonstrating its improved performance over state-of-the-art network-based methods and gene-based methods for network biomarker identification. We then applied the netSVM approach to two breast cancer data sets to identify prognostic signatures for prediction of breast cancer metastasis. The experimental results show that: (1) network biomarkers identified by netSVM are highly enriched in biological pathways associated with cancer progression; (2) prediction performance is much improved when tested across different data sets. Specifically, many genes related to apoptosis, cell cycle, and cell proliferation, which are hallmark signatures of breast cancer metastasis, were identified by the netSVM approach. More importantly, several novel hub genes, biologically important with many interactions in PPI network but often showing little change in expression as compared with their downstream genes, were also identified as network biomarkers; the genes were enriched in signaling pathways such as TGF-beta signaling pathway, MAPK signaling pathway, and JAK-STAT signaling pathway. These signaling pathways may provide new insight to the underlying mechanism of breast cancer metastasis.

Conclusions

We have developed a network-based approach for cancer biomarker identification, netSVM, resulting in an improved prediction performance with network biomarkers. We have applied the netSVM approach to breast cancer gene expression data to predict metastasis in patients. Network biomarkers identified by netSVM reveal potential signaling pathways associated with breast cancer metastasis, and help improve the prediction performance across independent data sets.

11.
Polymorphonuclear leukocytes (PMNs) are major effector cells in the chronic airway inflammation in chronic obstructive pulmonary disease (COPD). PMN degranulation is associated with degradation of extracellular matrix and tissue damage. Hck is an essential molecule in the signaling pathway regulating PMN degranulation. We hypothesized that polymorphisms affect the expression level of Hck, which, in turn, modulates PMN mediator release and tissue damage and influences the development of COPD. Here we systematically investigated genetic tag polymorphisms of the Hck gene, the Hck mRNA and protein expression pattern in PMNs, and PMN mediator release (myeloperoxidase) in 60 healthy white subjects, and assessed their association using several genetic models. The association of genetic polymorphisms with COPD-related phenotypes was determined in the Lung Health Study (LHS) cohort. We identified a novel 15 bp insertion/deletion polymorphism (8,656 L/S) in intron 1 of the Hck gene, which was associated with differential expression of Hck protein and PMN myeloperoxidase release. In the LHS cohort, there was a significant interaction between the 8,656 L/S polymorphism and smoking on baseline lung function, and 8,656 L/S was associated with bronchodilator response. These data suggest that the insertion/deletion polymorphism could be a functional polymorphism of the Hck gene, and may contribute to COPD pathogenesis and modify COPD-related phenotypes.

12.
13.
Chronic Obstructive Pulmonary Disease (COPD) is a complex disease. Genetic, epigenetic, and environmental factors are known to contribute to COPD risk and disease progression. Therefore we developed a systematic approach to identify key regulators of COPD that integrates genome-wide DNA methylation, gene expression, and phenotype data in lung tissue from COPD and control samples. Our integrative analysis identified 126 key regulators of COPD. We identified EPAS1 as the only key regulator whose downstream genes significantly overlapped with multiple gene sets associated with COPD disease severity. EPAS1 is distinct in comparison with other key regulators in terms of methylation profile and downstream target genes. Genes predicted to be regulated by EPAS1 were enriched for biological processes including signaling, cell communications, and system development. We confirmed that EPAS1 protein levels are lower in human COPD lung tissue compared to non-disease controls and that Epas1 gene expression is reduced in mice chronically exposed to cigarette smoke. As EPAS1 downstream genes were significantly enriched for hypoxia responsive genes in endothelial cells, we tested EPAS1 function in human endothelial cells. EPAS1 knockdown by siRNA in endothelial cells impacted genes that significantly overlapped with EPAS1 downstream genes in lung tissue including hypoxia responsive genes, and genes associated with emphysema severity. Our first integrative analysis of genome-wide DNA methylation and gene expression profiles illustrates that not only does DNA methylation play a ‘causal’ role in the molecular pathophysiology of COPD, but it can be leveraged to directly identify novel key mediators of this pathophysiology.

14.
Toxic liver injury causes necrosis and fibrosis, which may lead to cirrhosis and liver failure. Despite recent progress in understanding the mechanism of liver fibrosis, our knowledge of the molecular-level details of this disease is still incomplete. The elucidation of networks and pathways associated with liver fibrosis can provide insight into the underlying molecular mechanisms of the disease, as well as identify potential diagnostic or prognostic biomarkers. Towards this end, we analyzed rat gene expression data from a range of chemical exposures that produced observable periportal liver fibrosis as documented in DrugMatrix, a publicly available toxicogenomics database. We identified genes relevant to liver fibrosis using standard differential expression and co-expression analyses, and then used these genes in pathway enrichment and protein-protein interaction (PPI) network analyses. We identified a PPI network module associated with liver fibrosis that includes known liver fibrosis-relevant genes, such as tissue inhibitor of metalloproteinase-1, galectin-3, connective tissue growth factor, and lipocalin-2. We also identified several new genes, such as perilipin-3, legumain, and myocilin, which were associated with liver fibrosis. We further analyzed the expression pattern of the genes in the PPI network module across a wide range of 640 chemical exposure conditions in DrugMatrix and identified early indications of liver fibrosis for carbon tetrachloride and lipopolysaccharide exposures. Although it is well known that carbon tetrachloride and lipopolysaccharide can cause liver fibrosis, our network analysis was able to link these compounds to potential fibrotic damage before histopathological changes associated with liver fibrosis appeared. 
These results demonstrate that our approach is capable of identifying early-stage indicators of liver fibrosis and underscore its potential to aid in predictive toxicity, biomarker identification, and the identification of disease-relevant pathways in general.

15.
To identify non-invasive gene expression markers for chronic obstructive pulmonary disease (COPD), we performed genome-wide expression profiling of peripheral blood samples from 12 subjects with significant airflow obstruction and an equal number of non-obstructed controls. RNA was isolated from Peripheral Blood Mononuclear Cells (PBMCs) and gene expression was assessed using Affymetrix U133 Plus 2.0 arrays. Tests for gene expression changes that discriminate between COPD cases (FEV1 < 70% predicted, FEV1/FVC < 0.7) and controls (FEV1 > 80% predicted, FEV1/FVC > 0.7) were performed using Significance Analysis of Microarrays (SAM) and Bayesian Analysis of Differential Gene Expression (BADGE). Using either test at high stringency (SAM median FDR = 0 or BADGE p < 0.01) we identified differential expression for 45 known genes. Correlation of gene expression with lung function measurements (FEV1 & FEV1/FVC), using both Pearson and Spearman correlation coefficients (p < 0.05), identified a set of 86 genes. A total of 16 markers showed evidence of significant correlation (p < 0.05) with quantitative traits and differential expression between cases and controls. We further compared our peripheral gene expression markers with those we previously identified from lung tissue of the same cohort. Two genes, RP9 and NAPE-PLD, were identified as decreased in COPD cases compared to controls in both lung tissue and blood. These results contribute to our understanding of gene expression changes in the peripheral blood of patients with COPD and may provide insight into potential mechanisms involved in the disease.

16.
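Zhang Sijia, Cai Ting, Zhang Shun. 《生物信息学》 (Chinese Journal of Bioinformatics), 2022, 20(4): 247-256
Based on an association analysis of SNP mutation data and mRNA expression profiles, we constructed a molecular subtyping method for liver cancer, compared prognosis across subtypes, and further investigated the mechanisms of occurrence and development of the different subtypes. First, SNP mutation data and mRNA expression data for 359 patients with hepatocellular carcinoma were collected from the TCGA database. The Wilcoxon rank-sum test was used to screen for genes differentially expressed after mutation, and the bioinformatics tools String and Cytoscape were used to construct a protein-protein interaction network of the differentially expressed genes and to select the 10 hub genes with the highest connectivity. Using the Consensus Cluster Plus package, an NMF molecular subtyping model was built on the mRNA expression levels of the hub genes, and survival data were then used to assess the prognosis of patients in each subtype. Finally, weighted gene co-expression network analysis (WGCNA) was used to identify modules associated with the molecular subtypes, and pathway enrichment was performed on the genes of key modules in order to compare the gene expression profiles of the different subtypes. Results: The NMF model divided liver cancer into two subtypes, high-risk and low-risk, with the CDKN2A and FOXO1 genes contributing strongly to the subtyping. Survival analysis showed that patients in the low-risk group survived significantly better than those in the high-risk group. The high-risk group was enriched for multiple signaling pathways related to tumor cell invasion, metastasis and recurrence, whereas the low-risk group was associated with the cell cycle and pancreatic secretion. In the absence of prior information, the ... constructed in this study based on the expression levels of hub genes that are significantly differentially expressed after mutation...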

17.
MOTIVATION: The inference of genes that are truly associated with inherited human diseases from a set of candidates resulting from genetic linkage studies has been one of the most challenging tasks in human genetics. Although several computational approaches have been proposed to prioritize candidate genes relying on protein-protein interaction (PPI) networks, these methods can usually cover less than half of known human genes. RESULTS: We propose to rely on the biological process domain of the gene ontology to construct a gene semantic similarity network and then use the network to infer disease genes. We show that the constructed network covers about 50% more genes than a typical PPI network. By analyzing the gene semantic similarity network with the PPI network, we show that gene pairs tend to have higher semantic similarity scores if the corresponding proteins are closer to each other in the PPI network. By analyzing the gene semantic similarity network with a phenotype similarity network, we show that semantic similarity scores of genes associated with similar diseases are significantly different from those of genes selected at random, and that genes with higher semantic similarity scores tend to be associated with diseases with higher phenotype similarity scores. We further use the gene semantic similarity network with a random walk with restart model to infer disease genes. Through a series of large-scale leave-one-out cross-validation experiments, we show that the gene semantic similarity network can achieve not only higher coverage but also higher accuracy than the PPI network in the inference of disease genes.
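
A random walk with restart can be sketched as a short fixed-point iteration. The example below is a minimal illustration on an unweighted, undirected toy graph. The abstract's semantic similarity network would carry edge weights, and the restart probability used in the paper is not stated, so the value 0.7, the function name, and the gene identifiers here are assumptions.

```python
def random_walk_with_restart(adjacency, seeds, restart=0.7, tol=1e-10, max_iter=1000):
    """Iterate p <- (1 - restart) * W p + restart * p0 until convergence.

    `adjacency` maps each node to a list of neighbors, and every neighbor
    must also appear as a key. W spreads a node's probability evenly over
    its neighbors. p0 places equal mass on the seed (known disease) genes.
    """
    nodes = sorted(adjacency)
    p0 = {n: (1.0 / len(seeds) if n in seeds else 0.0) for n in nodes}
    p = dict(p0)
    for _ in range(max_iter):
        nxt = {n: restart * p0[n] for n in nodes}
        for u in nodes:
            nbrs = adjacency[u]
            if not nbrs:
                continue  # an isolated node has nowhere to send its mass
            share = (1.0 - restart) * p[u] / len(nbrs)
            for v in nbrs:
                nxt[v] += share
        change = sum(abs(nxt[n] - p[n]) for n in nodes)
        p = nxt
        if change < tol:
            break
    return p

# Hypothetical toy network: a chain g1 - g2 - g3 - g4 with g1 as the known disease gene.
graph = {"g1": ["g2"], "g2": ["g1", "g3"], "g3": ["g2", "g4"], "g4": ["g3"]}
seed_genes = {"g1"}
scores = random_walk_with_restart(graph, seed_genes)

# Candidate genes are ranked by their steady-state probability.
ranked = sorted((g for g in graph if g not in seed_genes), key=scores.get, reverse=True)
```

Candidates closer to the seed gene receive higher steady-state scores, which is the property that makes the walk usable for prioritization.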

18.
Chronic obstructive pulmonary disease (COPD) is a risk factor for the development of lung cancer. The aim of this study was to identify early diagnosis biomarkers for lung squamous cell carcinoma (SQCC) in COPD patients and to determine the potential pathogenetic mechanisms. The GSE12472 data set was downloaded from the Gene Expression Omnibus database. Differentially co‐expressed links (DLs) and differentially expressed genes (DEGs) in both COPD and normal tissues, or in both SQCC + COPD and COPD samples were used to construct a dynamic network associated with high‐risk genes for the SQCC pathogenetic process. Enrichment analysis was performed based on Gene Ontology annotations and Kyoto Encyclopedia of Genes and Genomes pathway analysis. We used the gene expression data and the clinical information to identify the co‐expression modules based on weighted gene co‐expression network analysis (WGCNA). In total, 205 dynamic DEGs, 5034 DLs and one pathway including CDKN1A, TP53, RB1 and MYC were found to have correlations with the pathogenetic progress. The pathogenetic mechanisms shared by both SQCC and COPD are closely related to oxidative stress, the immune response and infection. WGCNA identified 11 co‐expression modules, where magenta and black were correlated with the “time to distant metastasis.” And the “surgery due to” was closely related to the brown and blue modules. In conclusion, a pathway that includes TP53, CDKN1A, RB1 and MYC may play a vital role in driving COPD towards SQCC. Inflammatory processes and the immune response participate in COPD‐related carcinogenesis.

19.
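Extraction of human tumor informative genes based on SVM and mean impact value   Cited 1 time (self-citations: 0; citations by others: 1)
Selecting informative genes for tumor classification from gene expression profiles is an important means of discovering tumor-specific genes and exploring tumor gene expression patterns. Tumor diagnosis using classification information derived from gene expression profiles is an important research direction in bioinformatics and is expected to become a fast and effective method of molecular tumor diagnosis in clinical medicine. Because tumor gene expression data are high-dimensional, small in sample size and noisy, we propose an algorithm that combines a support vector machine with the mean impact value to find tumor informative genes. Its advantage is that it can find multiple informative gene subsets that contain as few genes as possible while retaining as much classification power as possible. Binary tumor datasets were used to verify the feasibility and effectiveness of the algorithm; for the colon cancer sample set, only 3 genes were needed to achieve 100% leave-one-out cross-validation accuracy. To avoid the effect of different partitions of the sample set on classification performance, a full-fold cross-validation method was further used to evaluate the classification performance of each informative gene subset and to select more reliable subsets. Compared with other tumor classification methods, the experimental results show clear advantages in the number of informative genes and in classification performance.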

20.
Microarray data provide information on the expression levels of thousands of genes in a cell in a single experiment. Numerous efforts have been made to use gene expression profiles to improve the precision of tumor classification. In the present study we used the benchmark colon cancer data set for analysis. Feature selection was done using the t‐statistic. A comparative study of the class prediction accuracy of 3 different classifiers, namely support vector machine (SVM), neural nets and logistic regression, was performed using the top 10 genes ranked by the t‐statistic. SVM turned out to be the best classifier for this dataset based on area under the receiver operating characteristic curve (AUC) and total accuracy. Logistic Regression ranks as the next best classifier, followed by Multi Layer Perceptron (MLP). The top 10 genes selected by us for classification are all well documented for their variable expression in colon cancer. We conclude that SVM together with t‐statistic based feature selection is an efficient and viable alternative to popular techniques.
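
Ranking genes by a two-sample t-statistic can be sketched in a few lines. The abstract does not say whether the pooled-variance (Student) or unequal-variance (Welch) form was used, so this example assumes the Welch form. The function names, gene names, and expression values are hypothetical toy data, not the colon cancer benchmark.

```python
import math
from statistics import mean, variance

def welch_t(a, b):
    """Two-sample t-statistic with unequal variances (Welch form, assumed here)."""
    return (mean(a) - mean(b)) / math.sqrt(variance(a) / len(a) + variance(b) / len(b))

def rank_genes(expression, labels, top_k=10):
    """Return the top_k genes ordered by absolute t-statistic between the two classes."""
    scores = {}
    for gene, values in expression.items():
        tumor = [v for v, lab in zip(values, labels) if lab == 1]
        normal = [v for v, lab in zip(values, labels) if lab == 0]
        scores[gene] = abs(welch_t(tumor, normal))
    return sorted(scores, key=scores.get, reverse=True)[:top_k]

# Hypothetical toy data: three tumor samples (1) followed by three normal samples (0).
labels = [1, 1, 1, 0, 0, 0]
expression = {
    "geneA": [5.1, 4.9, 5.3, 1.0, 1.2, 0.9],   # strongly different between classes
    "geneB": [2.0, 2.5, 1.8, 2.1, 2.4, 1.9],   # essentially unchanged
    "geneC": [3.0, 3.4, 3.2, 2.5, 2.9, 2.6],   # moderately different
}
top_genes = rank_genes(expression, labels, top_k=2)
```

The selected genes would then be passed to a classifier such as an SVM. To avoid optimistic accuracy estimates, the ranking is normally recomputed inside each cross-validation fold.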
