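Tumor classification based on tumor gene expression profiles is an important research topic in bioinformatics. Traditional methods for extracting tumor feature information mostly rely on informative-gene selection, but screening genes inevitably loses some classification information. This paper proposes a tumor-subtype feature extraction method based on adjacency matrix decomposition: a Gaussian-weighted adjacency matrix is first constructed from the tumor gene expression profile data, singular value decomposition is then applied to the adjacency matrix, and finally the feature row vectors of the resulting orthogonal matrix are fed as classification features into a support vector machine for recognition. Experiments using leave-one-out validation on a gene expression dataset covering two leukemia subtypes demonstrate the feasibility and effectiveness of the method.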
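
The first step of the method above, building a Gaussian-weighted adjacency matrix over samples, can be sketched as follows. The abstract does not give the weight formula, so the form exp(-||x_i - x_j||^2 / (2 sigma^2)), the parameter name `sigma`, and the toy values are all assumptions made for illustration. The later SVD and support vector machine steps would normally use a numerical library and are not shown.

```python
import math


def gaussian_adjacency(samples, sigma):
    """Return the n x n matrix W with W[i][j] = exp(-||x_i - x_j||^2 / (2 * sigma^2)).

    samples: list of expression vectors, one per sample (assumed layout).
    sigma:   kernel width (assumed; the source does not specify it).
    """
    n = len(samples)
    weights = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i, n):
            sq_dist = sum((a - b) ** 2 for a, b in zip(samples[i], samples[j]))
            w = math.exp(-sq_dist / (2.0 * sigma ** 2))
            weights[i][j] = weights[j][i] = w  # the matrix is symmetric
    return weights


# Hypothetical toy data: 3 samples x 4 genes.
profiles = [
    [1.0, 2.0, 0.5, 3.0],
    [1.1, 2.1, 0.4, 2.9],
    [4.0, 0.2, 3.5, 0.1],
]
W = gaussian_adjacency(profiles, sigma=1.0)
```

Samples with similar profiles (the first two rows here) receive a weight close to 1, while dissimilar samples receive a weight close to 0.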

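Extraction of human tumor informative genes based on SVM and mean impact value   Total citations: 1 (self-citations: 0, citations by others: 1)
Selecting informative genes for tumor classification from gene expression profiles is an important means of discovering tumor-specific expressed genes and exploring tumor gene expression patterns. Tumor diagnosis using classification information derived from gene expression profiles is an important research direction in bioinformatics and is expected to become a fast and effective method of molecular tumor diagnosis in clinical medicine. Given that tumor gene expression data are high-dimensional, small in sample size and noisy, an algorithm is proposed that combines a support vector machine with the mean impact value to find informative tumor genes; its advantage is that it can find multiple informative gene subsets with as few genes and as strong a classification ability as possible. Binary-class tumor datasets were used to verify the feasibility and effectiveness of the algorithm; for the colon cancer sample set, only 3 genes are needed to reach 100% leave-one-out cross-validation accuracy. To avoid the influence of different partitions of the sample set on classification performance, full-fold cross-validation (全折交叉验证) was further used to evaluate each informative gene subset and select the more reliable ones. Compared with other tumor classification methods, the experimental results show clear advantages in both the number of informative genes and classification performance.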

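Analyzing gene expression profile data to advance tumor diagnosis and treatment is becoming a hot topic in biomedicine. A method combining entropy information processing with principal component analysis (PCA) is therefore proposed. Entropy information is first used for a coarse selection from the ultra-high-dimensional gene expression data, giving a feature gene subset; because the subset still contains correlated genes, PCA is then applied to remove the remaining redundancy; finally, the resulting non-redundant, orthogonal gene features are tested on real data. The results show that the method effectively removes irrelevant and redundant information from tumor samples while retaining as much tumor classification information as possible. Compared with other tumor classification methods it has a fairly clear advantage in accuracy, which confirms that the method is effective and feasible.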

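Glioma is a serious intracranial tumor with high recurrence, high mortality and a low cure rate. Using gene microarray data to identify feature genes associated with glioma provides a useful reference for clinical diagnosis and biomedical research on the disease. For glioma data, the authors propose an ensemble random-forest-like feature gene selection method. Supervised singular value decomposition is first applied to reduce dimensionality and coarsely select genes; a random-forest-like feature selection method then selects the feature genes. Experimental results show that the method adapts well to different classifiers and has a clear advantage in classification rate over other methods. More importantly, 39 of the top 50 selected feature genes are closely related to glioma or to biological processes of tumor cells, confirming that the method not only maintains a high classification rate but also ensures that the selected genes are biologically meaningful, and is therefore feasible and practical.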

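Two feature gene identification methods based on decision forests   Total citations: 1 (self-citations: 0, citations by others: 1)
DNA chips can provide expression profile data for thousands of genes. Finding feature genes that discriminate disease and filtering out disease-irrelevant genes is the key problem in gene expression profile analysis. Exploiting the ensemble advantage of the decision forest method, two decision-forest-based methods for identifying feature genes are proposed. The decision forest first filters out, at a given significance level, most genes unrelated to the disease class; a statistical frequency method and a perturbation method then refine the initially selected feature genes according to each feature's contribution to classification. Finally, a neural network is used as an external classifier to evaluate the selected feature gene subsets, and the methods are applied to experimental expression data for 2000 genes in 40 colon cancer tissues and 22 normal tissues. The results show that the feature genes selected by both methods have high disease-discriminating ability and that both methods can obtain the optimal feature gene subset; the decision-forest-based statistical frequency method outperforms the perturbation method.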

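An ensemble decision method for mining complex disease-related genes from DNA chip data   Total citations: 11 (self-citations: 2, citations by others: 9)
The rapid development of DNA chip technology makes it possible to measure expression profiles of thousands of genes simultaneously, giving life scientists the possibility of elucidating the nature of life from an entirely new perspective. At present, most gene expression profile analysis concentrates on classifying diseases such as cancer and identifying disease subtypes. Mining from these profiles the genes that reflect the essential characteristics of a disease is a more challenging research task in the post-genome era, but gene mining has been neglected for lack of suitable data mining techniques. We propose a novel ensemble decision method for feature gene mining that aims to address three important biological problems: biological classification and disease subtyping, in-depth mining of genes related to complex diseases, and goal-driven construction of gene networks. We successfully applied the method to a set of colon cancer DNA expression profile data; the results show that this feature gene mining technique is highly valuable for analyzing DNA chip data and mining genes related to complex diseases.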

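Random forests: an important feature gene selection method for tumors   Total citations: 2 (self-citations: 0, citations by others: 2)
Feature selection techniques are widely used in bioinformatics, and random forests (RF) are one important feature selection method. RF was used to select feature genes from 5 gene expression datasets, including gastric, colon and lung cancer; the selected genes were combined with a support vector machine (SVM) to classify the original datasets, and the feature gene selection and classification results were given a preliminary analysis. RF was also compared with significance analysis of microarrays (SAM) and ReliefF; the results show that the feature genes selected by random forests contain more classification information and give higher classification accuracy. Together with the method's own advantages in classification, random forests can be widely used as a reliable means of analyzing gene expression profile data.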

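Comparative analysis of gene co-expression networks based on Hom in reveals a certain correlation between the human gene co-expression network and protein interaction data. Classifying the genes in the overlapping region of the two networks with Gene Ontology shows that the encoded proteins are concentrated mainly in the response-to-stimulus pathway. Mapping the protein interaction network within this pathway yielded two independent functional modules. Gene classification and key-gene analysis of the modules show that they correspond to responses to endogenous and exogenous stimuli, respectively. This study helps promote research that uses the steadily growing public nucleic acid data to mine protein interactions.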

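Exploiting sequence differences among pathogenic bacteria to detect specific genes and loci allows the classification and characteristics of pathogens to be identified quickly, which is of fundamental significance and considerable value for rapid diagnosis and source tracing of infectious diseases. This study covers 103 pathogenic bacterial species responsible for important infectious diseases in China, and aims to find the homologous genes unique to each taxonomic rank and to select from them candidate genes suitable for pathogen identification and typing. Using bioinformatics and genomics methods, 836,415 genes from 275 pathogen strains with available whole-genome sequences were compared, and the sets of homologous genes unique to each taxonomic rank (phylum, class, order, family, genus) were determined. The homologous gene sets were functionally annotated with the COG functional classification, and the variation of conserved gene functions across taxonomic ranks was analyzed. In total, 19,563 homologous gene sets suitable for identification and typing were found across the ranks (2,891 at phylum, 1,016 at class, 3,601 at order, 10,130 at family and 1,925 at genus level). Functional analysis of the homologous genes shows that the genes suitable for pathogen identification differ considerably in function between taxonomic ranks, and that the functions of homologous genes in Gram-positive and Gram-negative pathogens also differ across ranks. These results provide a theoretical basis for designing the probes and chips used to detect pathogens widespread in China and will speed up the screening of target probes. The study is also the first to closely link worldwide whole-genome data with the pathogens involved in major infectious diseases in China, offering a new way to screen candidate genes and loci for regional, targeted pathogen detection and surveillance using functional genomics. The results also have some application value in bacterial metagenomics research.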

王蕊平  王年  苏亮亮  陈乐 《生物信息学》2011,9(2):164-166,170
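The existence of massive data is a hallmark of the modern information society, and effectively selecting sample classification features from among thousands of genes is of great significance for cancer diagnosis and treatment. Local non-negative matrix factorization is used to extract features from cancer gene expression profile data. The gene expression data are first screened; a local non-negative matrix is then constructed and factorized to obtain low-dimensional feature vectors that fully characterize the samples; finally, a support vector machine classifies the feature vectors. The results demonstrate the feasibility and effectiveness of the method.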

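Selecting feature genes for clustering using the Gene Ontology functional classification system   Total citations: 3 (self-citations: 0, citations by others: 3)
Using two gene expression datasets, genes with variable expression were selected by the variance of each gene's expression values and used to cluster samples. In general, using the top 10% of genes by variance as feature genes is enough to cluster disease samples well. For different diseases, the feature genes carrying clustering information are distributed differently. On this basis, the Gene Ontology (GO) functional classification system is used to further screen feature genes for clustering. Disease-related functional categories are found by testing whether expression-variable genes aggregate non-randomly within each GO functional category, and cluster analysis is then performed on the expression-variable genes in the relevant categories. The experimental results show that further screening expression-variable genes with the gene function system maintains or improves clustering accuracy and gives the clustering results a clear biological meaning. In addition, some genes possibly related to lymphoma and leukemia were found.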
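
The variance-based preselection described above (keeping the top 10% of genes by expression variance) can be sketched as follows. The function name, the dictionary layout, the use of population variance, and the toy values are assumptions for illustration; the abstract does not say how the variance was computed or how ties were handled.

```python
import statistics


def top_variance_genes(expression, fraction=0.10):
    """Return the gene names with the largest expression variance.

    expression: dict mapping gene name -> list of values across samples
                (assumed layout).
    fraction:   share of genes to keep; 0.10 mirrors the "top 10%" above.
    """
    ranked = sorted(
        expression,
        key=lambda gene: statistics.pvariance(expression[gene]),
        reverse=True,
    )
    # The small offset guards against float products such as 2.9999999.
    keep = max(1, int(len(ranked) * fraction + 1e-9))
    return ranked[:keep]


# Hypothetical toy data: 10 genes x 4 samples, variance grows with the index.
toy = {f"g{i}": [5.0, 5.0 + 0.1 * i, 5.0 - 0.1 * i, 5.0] for i in range(10)}
selected = top_variance_genes(toy)
```

With 10 genes and the default fraction, exactly one gene is kept, and in this toy set it is the one with the widest spread.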

Fung ES  Ng MK 《Bioinformation》2007,2(5):230-234
One application of discriminant analysis to microarray data is to classify patient and normal samples based on gene expression values. The analysis is especially important in medical trials and in the diagnosis of cancer subtypes. The main contribution of this paper is a simple Fisher-type discriminant method for gene selection in microarray data. In the new algorithm, we calculate a weight for each gene and use the weight values as an indicator to identify the subsets of relevant genes that categorize patient and normal samples. An l2-l1 norm minimization method is applied in the discriminant process to automatically compute the weights of all genes in the samples. Experiments on two microarray data sets show that the new algorithm can generate classification results as good as those of other classification methods, and can effectively determine relevant genes for classification purposes. In this study, we demonstrate the gene selection ability and the computational effectiveness of the proposed algorithm. Experimental results are given to illustrate the usefulness of the proposed model.

BACKGROUND: A common task in analyzing microarray data is to determine which genes are differentially expressed across two (or more) kinds of tissue samples or samples submitted under experimental conditions. Several statistical methods have been proposed to accomplish this goal, generally based on measures of distance between classes. It is well known that biological samples are heterogeneous because of factors such as molecular subtypes or genetic background that are often unknown to the experimenter. For instance, in experiments which involve molecular classification of tumors it is important to identify significant subtypes of cancer. Bimodal or multimodal distributions often reflect the presence of mixtures of subsamples. Consequently, there can be genes differentially expressed in sample subgroups which are missed if the usual statistical approaches are used. In this paper we propose a new graphical tool which identifies not only genes with up and down regulation, but also genes with differential expression in different subclasses, which are usually missed if current statistical methods are used. This tool is based on two measures of distance between samples, namely the overlapping coefficient (OVL) between two densities and the area under the receiver operating characteristic (ROC) curve. The methodology proposed here was implemented in the open-source R software. RESULTS: This method was applied to a publicly available dataset, as well as to a simulated dataset. We compared our results with the ones obtained using some of the standard methods for detecting differentially expressed genes, namely Welch t-statistic, fold change (FC), rank products (RP), average difference (AD), weighted average difference (WAD), moderated t-statistic (modT), intensity-based moderated t-statistic (ibmT), significance analysis of microarrays (samT) and area under the ROC curve (AUC). On both datasets all differentially expressed genes with bimodal or multimodal distributions were not selected by all standard selection procedures. We also compared our results with (i) area between ROC curve and rising area (ABCR) and (ii) the test for not proper ROC curves (TNRC). We found our methodology more comprehensive, because it detects both bimodal and multimodal distributions and different variances can be considered on both samples. Another advantage of our method is that we can analyze graphically the behavior of different kinds of differentially expressed genes. CONCLUSION: Our results indicate that the arrow plot represents a new flexible and useful tool for the analysis of gene expression profiles from microarrays.
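
The two distance measures named in this abstract, OVL and AUC, can be illustrated with a small sketch. The abstract defines OVL between two densities but does not say how the densities are estimated, so the shared-bin histogram estimate below is an assumption, as are the function names and toy values. The original work was implemented in R; this is not that implementation.

```python
def auc(group_a, group_b):
    """Probability that a value from group_b exceeds one from group_a.

    Ties count as one half (the Mann-Whitney form of the empirical AUC).
    """
    wins = 0.0
    for a in group_a:
        for b in group_b:
            if b > a:
                wins += 1.0
            elif b == a:
                wins += 0.5
    return wins / (len(group_a) * len(group_b))


def ovl_histogram(group_a, group_b, bins=10):
    """Overlap of two histograms on shared bins (assumed density estimate)."""
    lo = min(min(group_a), min(group_b))
    hi = max(max(group_a), max(group_b))
    width = (hi - lo) / bins or 1.0  # avoid zero width when all values match

    def proportions(values):
        counts = [0] * bins
        for v in values:
            idx = min(int((v - lo) / width), bins - 1)
            counts[idx] += 1
        return [c / len(values) for c in counts]

    p, q = proportions(group_a), proportions(group_b)
    return sum(min(pi, qi) for pi, qi in zip(p, q))


# Hypothetical gene: controls sit in the middle, cases split into two modes.
control = [4.8, 4.9, 5.0, 5.1, 5.2, 5.0]
case = [2.0, 2.1, 1.9, 8.0, 8.1, 7.9]
gene_auc = auc(control, case)            # 0.5: AUC alone sees no difference
gene_ovl = ovl_histogram(control, case)  # 0.0: the distributions do not overlap
```

The toy gene shows why the two measures are paired: a bimodal case group gives an AUC of 0.5, which looks like no effect, while the low OVL reveals that the groups barely overlap.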



Breast cancer is a heterogeneous disease usually including four molecular subtypes such as luminal A, luminal B, HER2-enriched, and triple-negative breast cancer (TNBC). TNBC is more aggressive than other breast cancer subtypes. Despite major advances in ER-positive or HER2-amplified breast cancer, there is no targeted agent currently available for TNBC, so it is urgent to identify new potential therapeutic targets for TNBC.


We first used microarray analysis to compare gene expression profiling between TNBC and non-TNBC. Furthermore an integrated analysis was conducted based on our own and published data, leading to more robust, reproducible and accurate predictions. Additionally, we performed qRT-PCR in breast cancer cell lines to verify the findings in integrated analysis.


After searching Gene Expression Omnibus database (GEO), two microarray studies were obtained according to the inclusion criteria. The integrated analysis was conducted, including 30 samples of TNBC and 77 samples of non-TNBC. 556 genes were found to be consistently differentially expressed (344 up-regulated genes and 212 down-regulated genes in TNBC). Functional annotation for these differentially expressed genes (DEGs) showed that the most significantly enriched Gene Ontology (GO) term for molecular functions was protein binding (GO: 0005515, P = 6.09E-21), while that for biological processes was signal transduction (GO: 0007165, P = 9.46E-08), and that for cellular component was cytoplasm (GO: 0005737, P = 2.09E-21). The most significant pathway was Pathways in cancer (P = 6.54E-05) based on Kyoto Encyclopedia of Genes and Genomes (KEGG). DUSP1 (Degree = 21), MYEOV2 (Degree = 15) and UQCRQ (Degree = 14) were identified as the significant hub proteins in the protein-protein interaction (PPI) network. Five genes were selected to perform qRT-PCR in seven breast cancer cell lines, and qRT-PCR results showed that the expression pattern of selected genes in TNBC lines and non-TNBC lines was nearly consistent with that in the integrated analysis.


This study may help to understand the pathogenesis of different breast cancer subtypes, contributing to the successful identification of therapeutic targets for TNBC.



To perform a meta-analysis of gene expression microarray data from animal studies of lung injury, and to identify an injury-specific gene expression signature capable of predicting the development of lung injury in humans.


We performed a microarray meta-analysis using 77 microarray chips across six platforms, two species and different animal lung injury models exposed to lung injury with and/or without mechanical ventilation. Individual gene chips were classified and grouped based on the strategy used to induce lung injury. Effect size (change in gene expression) was calculated between non-injurious and injurious conditions, comparing two main strategies to pool chips: (1) one-hit and (2) two-hit lung injury models. A random effects model was used to integrate the individual effect sizes calculated from each experiment. Classification models were built using the gene expression signatures generated by the meta-analysis to predict the development of lung injury in human lung transplant recipients.


Two injury-specific lists of differentially expressed genes generated from our meta-analysis of lung injury models were validated using external data sets and prospective data from animal models of ventilator-induced lung injury (VILI). Pathway analysis of gene sets revealed that both new and previously implicated VILI-related pathways are enriched with differentially regulated genes. A classification model based on gene expression signatures identified in animal models of lung injury predicted development of primary graft failure (PGF) in lung transplant recipients with greater than 80% accuracy, based upon injury profiles from transplant donors. We also found that better classifier performance can be achieved by using meta-analysis to identify differentially expressed genes than by using single study-based differential analysis.


Taken together, our data suggest that microarray analysis of gene expression data allows for the detection of "injury" gene predictors that can classify lung injury samples and identify patients at risk for clinically relevant lung injury complications.
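
The abstract above states that a random effects model was used to integrate per-experiment effect sizes but does not name the estimator. The sketch below assumes the widely used DerSimonian-Laird estimator for the between-study variance; the function name, inputs and toy numbers are hypothetical and are not taken from the study.

```python
import math


def random_effects_pool(effects, variances):
    """Pool effect sizes with DerSimonian-Laird weights (assumed estimator).

    effects:   per-study effect sizes for one gene.
    variances: matching within-study variances.
    Returns (pooled effect, standard error, between-study variance tau^2).
    """
    k = len(effects)
    w = [1.0 / v for v in variances]  # fixed-effect (inverse variance) weights
    fixed = sum(wi * yi for wi, yi in zip(w, effects)) / sum(w)
    q = sum(wi * (yi - fixed) ** 2 for wi, yi in zip(w, effects))  # Cochran's Q
    c = sum(w) - sum(wi ** 2 for wi in w) / sum(w)
    tau2 = max(0.0, (q - (k - 1)) / c) if c > 0 else 0.0
    w_star = [1.0 / (v + tau2) for v in variances]  # random-effects weights
    pooled = sum(wi * yi for wi, yi in zip(w_star, effects)) / sum(w_star)
    se = math.sqrt(1.0 / sum(w_star))
    return pooled, se, tau2


# Hypothetical example: two experiments that disagree strongly on one gene.
pooled, se, tau2 = random_effects_pool([0.0, 1.0], [0.01, 0.01])
```

When the experiments disagree more than their within-study variances explain, tau^2 becomes positive and the standard error of the pooled effect widens accordingly.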

Gene expression studies have been widely used in an effort to identify signatures that can predict clinical progression of cancer. In this study we focused instead on identifying gene expression differences between breast tumors and adjacent normal tissue, and between different subtypes of tumor classified by clinical marker status. We collected a set of 20 breast cancer tissues, matched with the adjacent pathologically normal tissue from the same patient. The cancer samples, representing each subtype of breast cancer identified by estrogen receptor ER(+/-) and Her2(+/-) status and divided into four subgroups (ER+/Her2+, ER+/Her2-, ER-/Her2+, and ER-/Her2-), were hybridized on Affymetrix HG-133 Plus 2.0 microarrays. By comparing cancer samples with their matched normal controls we identified 3537 differentially expressed genes overall, using data analysis methods from Bioconductor. Among the genes common to the four subgroups, we found 151 regulated genes, some of them encoding known targets for breast cancer treatment. Genes unique to each of the four subgroups instead suggested gene regulation dependent on the ER/Her2 marker selection. In conclusion, the results indicate that microarray studies using robust analysis of matched tumor and normal samples from the same patients can be used to identify genes differentially expressed in breast cancer tumor subtypes even when small numbers of samples are considered, and can further elucidate molecular features of breast cancer.

Data from gene expression arrays are influenced by many experimental parameters that lead to variations not simply accessible by standard quantification methods. To compare measurements from gene expression array experiments, quantitative data are commonly normalised using reference genes or global normalisation methods based on mean or median values. These methods are based on the assumption that (i) selected reference genes are expressed at a standard level in all experiments or (ii) the mean or median signal of expression will give a quantitative reference for each individual experiment. We introduce here a new ranking diagram, with which we can show how the different normalisation methods compare, and how they are influenced by variations in measurements (noise) that occur in every experiment. Furthermore, we show that an upper trimmed mean provides a simple and robust method for normalisation of larger sets of experiments by comparative analysis.
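
A minimal sketch of normalisation by an upper trimmed mean follows. The abstract does not define the statistic, so the sketch assumes it means the mean of the signals left after discarding a fixed fraction of the highest values; the function names, the default `trim_fraction` and the toy array are hypothetical.

```python
def upper_trimmed_mean(values, trim_fraction=0.05):
    """Mean after discarding the highest `trim_fraction` of values (assumed)."""
    ordered = sorted(values)
    dropped = int(len(ordered) * trim_fraction)
    keep = max(1, len(ordered) - dropped)
    kept = ordered[:keep]
    return sum(kept) / len(kept)


def normalise(values, trim_fraction=0.05):
    """Scale one array so that its upper trimmed mean becomes 1."""
    scale = upper_trimmed_mean(values, trim_fraction)
    return [v / scale for v in values]


# Hypothetical array with one saturated spot that would distort a plain mean.
signals = [1.0, 2.0, 3.0, 4.0, 100.0]
reference = upper_trimmed_mean(signals, trim_fraction=0.2)
scaled = normalise(signals, trim_fraction=0.2)
```

In the toy array the plain mean is 22, dominated by a single extreme spot, whereas the trimmed reference stays at 2.5.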

张思嘉  蔡挺  张顺 《生物信息学》2022,20(4):247-256
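Based on an association analysis of SNP mutation data and mRNA expression profiles, a molecular subtyping method for liver cancer was constructed, the prognoses of the subtypes were compared, and the mechanisms of occurrence and development of the different subtypes were further studied. First, SNP mutation data and mRNA expression data of 359 hepatocellular carcinoma patients were collected from the TCGA database; genes differentially expressed after mutation were screened with the Wilcoxon rank-sum test, a protein-protein interaction network of the differentially expressed genes was built with the bioinformatics tools String and Cytoscape, and the 10 hub genes with the highest connectivity were selected. Using the Consensus Cluster Plus package, an NMF molecular subtyping model was built on the mRNA expression levels of the hub genes, and the prognosis of patients in each subtype was assessed with survival data. Finally, weighted gene co-expression network analysis (WGCNA) was used to identify modules associated with the molecular subtypes, and pathway enrichment was performed on the genes of key modules to compare gene expression profiles between subtypes. Results: the NMF model divided liver cancer into high-risk and low-risk subtypes, with CDKN2A and FOXO1 contributing strongly to the subtyping. Survival analysis showed that low-risk patients survived significantly better than high-risk patients; the high-risk group was enriched in several signaling pathways related to tumor cell invasion, metastasis and recurrence, while the low-risk group was associated with the cell cycle and pancreatic secretion. Without prior information, the molecular subtyping built on expression levels of hub genes that are significantly differentially expressed after mutation offers some guidance for prognostic assessment of liver cancer patients; CDKN2A and FOXO1 mutations are adverse prognostic factors, and developing targeted drugs against them may provide new treatment strategies for liver cancer patients.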

DNA microarray experiments have generated large amounts of gene expression measurements across different conditions. One crucial step in the analysis of these data is to detect differentially expressed genes. Some parametric methods, including the two-sample t-test (T-test) and variations of it, have been used. Alternatively, a class of non-parametric algorithms, such as the Wilcoxon rank sum test (WRST), significance analysis of microarrays (SAM) of Tusher et al. (2001), the empirical Bayesian (EB) method of Efron et al. (2001), etc., have been proposed. Most available popular methods are based on the t-statistic. Due to the quality of the statistic that they use to describe the difference between groups of data, there are situations when these methods are inefficient, especially when the data follow multi-modal distributions. For example, some genes may display different expression patterns in the same cell type, say, tumor or normal, to form some subtypes. Most available methods are likely to miss these genes. We developed a new non-parametric method for selecting differentially expressed genes by relative entropy, called SDEGRE, which combines relative entropy and kernel density estimation and can detect all types of differences between two groups of samples. The significance of whether a gene is differentially expressed or not can be estimated by resampling-based permutations. We illustrate our method on two data sets from Golub et al. (1999) and Alon et al. (1999). Comparing the results with those of the T-test, the WRST and the SAM, we identified novel differentially expressed genes which previous biological studies show to be of biological significance, while they were not detected by the other three methods. The results also show that the genes selected by SDEGRE have a better capability to distinguish the two cell types.
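
The combination described above (relative entropy between kernel density estimates, with significance from permutations) can be sketched roughly as follows. This is not the SDEGRE implementation: the Gaussian kernel, the fixed `bandwidth`, the grid integration, the `eps` guard, the direction of the divergence and all names are assumptions chosen to keep the example short.

```python
import math
import random


def kde(values, grid, bandwidth):
    """Gaussian kernel density estimate of `values` at each grid point."""
    norm = 1.0 / (len(values) * bandwidth * math.sqrt(2.0 * math.pi))
    return [
        norm * sum(math.exp(-0.5 * ((x - v) / bandwidth) ** 2) for v in values)
        for x in grid
    ]


def relative_entropy(group_a, group_b, bandwidth=0.5, points=200, eps=1e-12):
    """Numerical KL divergence D(p_a || p_b) between the estimated densities."""
    pooled = list(group_a) + list(group_b)
    lo = min(pooled) - 3.0 * bandwidth
    hi = max(pooled) + 3.0 * bandwidth
    step = (hi - lo) / (points - 1)
    grid = [lo + i * step for i in range(points)]
    p = kde(group_a, grid, bandwidth)
    q = kde(group_b, grid, bandwidth)
    # eps keeps the logarithm finite where a density underflows to zero.
    return sum(pi * math.log((pi + eps) / (qi + eps)) * step for pi, qi in zip(p, q))


def permutation_p_value(group_a, group_b, n_perm=200, seed=0):
    """Share of label permutations whose divergence reaches the observed one."""
    rng = random.Random(seed)
    observed = relative_entropy(group_a, group_b)
    pooled = list(group_a) + list(group_b)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)
        perm_a, perm_b = pooled[:len(group_a)], pooled[len(group_a):]
        if relative_entropy(perm_a, perm_b) >= observed:
            hits += 1
    return (hits + 1) / (n_perm + 1)


# Hypothetical expression values of one gene in two cell types.
normal = [1.0, 1.2, 0.9, 1.1, 1.3, 0.8]
tumor = [5.0, 5.2, 4.9, 5.1, 5.3, 4.8]
divergence = relative_entropy(normal, tumor)
```

Because the divergence compares whole density shapes rather than means, the same score also reacts to differences in spread or modality, which is the property the abstract highlights.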

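A preliminary study of the relationship between gene expression level and synonymous codon usage   Total citations: 3 (self-citations: 0, citations by others: 3)
A self-consistent information clustering method is proposed for predicting gene expression levels and synonymous codon usage. Synonymous codons are divided into optimal, non-optimal and rare codons, and the usage frequencies of the three are regarded as the main factors regulating gene expression level. On this basis, the method was used to predict gene expression levels and codon usage in two organisms, E. coli and yeast. Highly and lowly expressed genes separate clearly, and expression levels fall into four grades: very highly expressed genes (VH), highly expressed genes (H), moderately low expressed genes (LM) and lowly expressed genes (LL);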
