Similar articles
 Found 20 similar articles (search time: 203 ms)
1.
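Discovery of association rules in protein sequences and its applications   Total citations: 2, self-citations: 0, citations by others: 2
As the machine learning algorithms used in protein sequence-structure analysis grow more complex, interpreting their results and the discovery process becomes correspondingly harder, so simple and theoretically sound methods are needed. By introducing association rule discovery, an approach that is simple in principle, theoretically sound, and yields results of strong practical significance, tens of thousands of patterns were found in protein sequences. Worked examples show how these patterns can be applied to protein sequence analysis, for instance conserved region discovery and secondary structure prediction. Based on these results, a library of secondary structure rules and a simple secondary structure prediction algorithm were built; experiments show that about 81% of secondary structures can be predicted by at least one association rule.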

2.
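Cancer is usually driven by the accumulation of genetic alterations, and effectively identifying cancer driver mutations remains a major challenge. Most existing methods identify driver genes either by comparing the mutation rate observed in a genomic region with the rate expected under the background mutation rate (BMR) or by functional impact tests; the driver genes found this way are essentially genes that show a statistical anomaly. Moreover, differences in driver genes between the subtypes of a cancer that already has a clear classification have not been studied. This paper introduces an association rule algorithm to search for effective rules stating that a mutation in a given gene predisposes a patient to a given subtype of low-grade glioma. The algorithm relates mutation data to cancer outcome, and the resulting rules are filtered and evaluated with three metrics (support, confidence, and lift) to predict candidate driver genes and between-subtype differences in driver genes. Finally, using somatic mutation data from 491 low-grade glioma cases, 22 driver genes associated with the outcome were obtained, together with the subtypes they belong to; sensitivity and false-positive results were better than those of existing single algorithms, and all 22 genes have important biological functions. A subtype identification method for low-grade glioma based on the 22 genes was also built; the overall model accuracy reached 98.99%, and the method can effectively distinguish the three subtypes.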
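The three rule metrics named in this abstract (support, confidence, lift) have standard definitions, which the following minimal sketch computes for a single rule. The gene names, subtype labels, and the `rule_metrics` helper are hypothetical and purely illustrative; the abstract does not describe the authors' implementation or thresholds.

```python
def rule_metrics(transactions, antecedent, consequent):
    """Return (support, confidence, lift) for the rule antecedent -> consequent."""
    antecedent, consequent = frozenset(antecedent), frozenset(consequent)
    n = len(transactions)
    # Count transactions containing the antecedent, the consequent, and both.
    n_a = sum(1 for t in transactions if antecedent <= t)
    n_c = sum(1 for t in transactions if consequent <= t)
    n_ac = sum(1 for t in transactions if (antecedent | consequent) <= t)
    support = n_ac / n                           # P(A and C)
    confidence = n_ac / n_a if n_a else 0.0      # P(C | A)
    lift = confidence / (n_c / n) if n_c else 0.0  # P(C | A) / P(C)
    return support, confidence, lift


# Hypothetical toy data: one set per patient, holding mutated genes and a subtype label.
patients = [
    frozenset({"GENE_A", "subtype_1"}),
    frozenset({"GENE_A", "subtype_1"}),
    frozenset({"GENE_A", "GENE_B", "subtype_2"}),
    frozenset({"GENE_B", "subtype_2"}),
]
support, confidence, lift = rule_metrics(patients, {"GENE_A"}, {"subtype_1"})
print(support, confidence, lift)
```

A lift above 1 indicates that the mutation and the subtype co-occur more often than they would if independent, which is why lift is commonly used alongside support and confidence when filtering rules.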

3.
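Renal tumors are among the common tumors of the urinary system, and their incidence is rising year by year. Differentially expressed genes were screened from Affymetrix hgu133b gene chip data, and the weighted gene co-expression network analysis algorithm was applied to build a co-expression network of the differentially expressed genes in renal tumors. Association patterns among genes differentially expressed between normal and tumor kidney tissue were analyzed; the module most strongly associated with tumorigenesis was selected, and hub genes were screened from it. Finally, Gene Ontology enrichment analysis was performed on the hub genes. Cellular senescence is one of the main mechanisms that suppress tumorigenesis; the analysis shows that the hub genes PLA2R1 and TBX3 are related to cellular senescence and may have an important influence on renal tumor formation. This result is consistent with the published finding that PLA2R1 suppresses renal tumorigenesis by promoting cellular senescence.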

4.
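An association rule mining algorithm for GO functional classes and differential gene expression   Total citations: 1, self-citations: 0, citations by others: 1
To suit the hierarchical structure of the Gene Ontology gene function classification system, the association rule mining algorithm Apriori was modified and a software tool, RuleGO, was developed for "mining combinations of GO functions associated with differential gene expression". RuleGO takes as input the set of differentially expressed genes and the set of non-differentially expressed genes from a gene expression profile, and outputs association rules between combined characteristic functional classes and differential expression. These rules help explain the underlying causes of differential expression, such as disease pathogenesis and drug mechanisms of action. RuleGO and OntoExpress were applied to colon cancer and adenocarcinoma expression profile datasets. The results show that RuleGO finds more characteristic functional classes associated with differential expression than OntoExpress does, and in particular reveals combined characteristic functional classes that OntoExpress cannot find. The results also show that when the required confidence and support of the rules are set high, generally only combined functional classes meet the requirements. This suggests that single functional classification units taken from a single perspective are not well suited to gene expression profile analysis, and that considering combinations of functional classification units may be more meaningful.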
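For orientation, the unmodified Apriori algorithm that RuleGO starts from can be sketched as a level-wise search for frequent itemsets. This is the textbook version only, not RuleGO's hierarchy-aware modification, which the abstract does not detail; the term labels and the `"DE"` marker for differential expression are hypothetical.

```python
from itertools import combinations


def apriori(transactions, min_support):
    """Return {itemset: support} for every itemset with support >= min_support."""
    transactions = [frozenset(t) for t in transactions]
    n = len(transactions)

    def support(itemset):
        return sum(1 for t in transactions if itemset <= t) / n

    # Level 1 candidates: every single item.
    current = {frozenset([item]) for t in transactions for item in t}
    frequent = {}
    k = 1
    while current:
        # Keep only the candidates that meet the support threshold.
        level = {}
        for candidate in current:
            s = support(candidate)
            if s >= min_support:
                level[candidate] = s
        frequent.update(level)
        k += 1
        # Join step: merge frequent (k-1)-itemsets into k-itemsets.
        candidates = {a | b for a in level for b in level if len(a | b) == k}
        # Prune step: every (k-1)-subset of a candidate must itself be frequent.
        current = {
            c for c in candidates
            if all(frozenset(s) in level for s in combinations(c, k - 1))
        }
    return frequent


# Hypothetical toy data: one set per gene, holding its annotation terms
# and "DE" if the gene is differentially expressed.
genes = [
    {"term_1", "term_2", "DE"},
    {"term_1", "term_2", "DE"},
    {"term_1", "DE"},
    {"term_2"},
]
frequent = apriori(genes, min_support=0.5)
for itemset, s in sorted(frequent.items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), s)
```

Rules such as `{term_1, term_2} -> DE` would then be read off the frequent itemsets and filtered by confidence.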

5.
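Current status of genome-wide gene-gene interaction research   Total citations: 2, self-citations: 0, citations by others: 2
沈佳薇, 胡晓菡, 师咏勇. 《遗传》 (Hereditas (Beijing)), 2011, 33(8): 820-828
Complex diseases are currently prevalent worldwide and greatly affect human health. Studies have found that the traits of complex diseases are influenced by interactions among multiple loci. Current genome-wide association studies (GWAS) resolve only the contribution of individual SNP loci to disease susceptibility, and relying on this strategy alone cannot yield a fundamental breakthrough in the search for the causes of complex diseases. Gene-gene interaction may be one of the main factors in the pathogenesis of complex diseases. In response, researchers have proposed a number of algorithms for testing gene interactions, including penalized logistic regression, multifactor dimensionality reduction, the set-association approach, Bayesian networks, and random forests. The article first reviews these methods and points out their shortcomings, including excessive computational complexity, being hypothesis-driven, overfitting of the data, and insensitivity to low-dimensional data. It then briefly describes a GPU-based algorithm for studying gene interactions developed in the authors' laboratory; the algorithm has low complexity, requires no assumptions, is free of marginal effects, is highly stable and fast, and is suitable for genome-wide gene-gene interaction computation.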

6.
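A critical value management system is, in essence, about collecting, managing, analyzing, and issuing alerts on laboratory result data. Rule-based techniques from computing, such as knowledge representation and inference, therefore need to be built into the system so that it can represent and reason about test results from the LIS, moving IT systems toward intelligent operation. This article discusses embedding a rule algorithm in a clinical laboratory critical value management system to improve the intelligence and applicability of the information system. To address the shortcomings of laboratory critical value management systems, a rule algorithm engine was designed and developed, achieving intelligent and accurate alerting of laboratory critical values.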

7.
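Surface electromyography (sEMG) signals are one-dimensional, non-stationary bioelectric time series recorded by sensors on the surface of the relevant muscle groups. They reflect not only the activity of the neuromuscular system but also, just as importantly, information about the movement of the corresponding limb. Pattern recognition is the foundation and key of EMG applications. To help select suitable algorithms for sEMG-based pattern recognition, this paper reviews human motion recognition algorithms based on sEMG, mainly fuzzy pattern recognition, linear discriminant analysis, artificial neural networks, and support vector machines. Fuzzy pattern recognition can adaptively extract fuzzy rules, is insensitive to the initial rules, and suits bioelectric signals such as sEMG that never strictly repeat. Linear discriminant analysis reduces data dimensionality and is computationally simple, but is not suited to large datasets. Artificial neural networks can describe both linear and nonlinear input-output mappings of the training samples, can solve complex classification problems, and have strong learning ability. Support vector machines have clear advantages on small-sample, nonlinear, high-dimensional data and are fast to compute. Comparing the strengths and weaknesses of these methods provides a reference and basis for choosing pattern recognition algorithms for such problems in the future.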

8.
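An artificial neural network algorithm for simulating root images from plant stem and leaf images   Total citations: 2, self-citations: 0, citations by others: 2
Using a wild licorice production area in Inner Mongolia as the test site, a BP artificial neural network model linking licorice stem-leaf images to root images was built with a combined wavelet and neural network algorithm for image association. The model is fast and handles complex image data easily. Simulating root images from plant stem and leaf images enables quantitative biomass assessment of tuberous plants and crop yield estimation.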

9.
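Objective: To analyze prescribing patterns in commonly used mind-calming (安神) Chinese patent medicines. Methods: Prescriptions of mind-calming medicines were collected from 《新编国家中成药》, a prescription database was built with the Traditional Chinese Medicine Inheritance Support System, and methods such as the Apriori association rule algorithm and complex-system entropy clustering were used to determine the frequency of each herb in the prescriptions and the association rules among herbs. Results: High-frequency herbs included 茯苓 (Poria), 甘草 (licorice root), 当归 (Angelica sinensis root), 麦冬 (Ophiopogon root), and 朱砂 (cinnabar). High-frequency herb combinations included "当归, 茯苓", "茯苓, 炒酸枣仁", and "甘草, 茯苓". Association rules with high confidence included "牛黄, 朱砂" and "酸枣仁, 茯苓". New prescriptions included "茯苓, 炒酸枣仁, 熟地黄, 五味子, 丹参, 麦冬, 生地黄". Conclusion: The herbs in mind-calming Chinese patent medicine prescriptions mostly act to nourish the blood and settle the mind, tonify qi and nourish yin, and tranquilize with heavy, settling substances.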

10.
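Application of Bayesian clustering to knowledge mining in gene expression profiles   Total citations: 1, self-citations: 0, citations by others: 1
A new clustering algorithm based on a Bayesian model is introduced for the analysis of large-scale gene expression profile data. Starting from the biological background, the theoretical basis and advantages of applying the algorithm to large-scale gene expression profiles are studied, and the algorithm is used to re-mine knowledge from two public gene expression datasets. The results show that, compared with other clustering algorithms, this algorithm has a marked advantage in knowledge discovery. The biological knowledge mined also offers some guidance for experimental design by researchers in the field.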

11.
Mining gene expression databases for association rules   Total citations: 16, self-citations: 0, citations by others: 16

12.
In the medical domain, rule-based classification models are valuable because they produce a comprehensible model that accounts for its predictions: it is desirable to know not only the classification decisions but also what leads to them. In this paper, we propose a novel dynamic quantitative rule-based classification model, DQB, which integrates quantitative association rule mining with the Artificial Bee Colony (ABC) algorithm to give users an accurate, understandable, and interpretable classifier built from class quantitative association rules. As far as we know, this is the first attempt to apply the ABC algorithm to mining quantitative rule-based classifier models, and the first attempt to use quantitative rule-based classification models for classifying microarray gene expression profiles. We also developed a new dynamic local search strategy, DLS, which improves the local search of the ABC algorithm. The performance of the proposed model was compared with well-known quantitative-based classification methods and bio-inspired meta-heuristic classification algorithms on six gene expression profiles from binary and multi-class cancer datasets. From the results, it can be concluded that DQB achieves a considerable increase in classification accuracy over other algorithms available in the literature and provides an interpretable model for biologists. This confirms the significance of the proposed algorithm in constructing a rule-based classifier model, and accordingly shows that these rules capture high-quality, meaningful knowledge extracted from the training set: all subsets of quantitative rules report close to 100% classification accuracy with a minimum number of genes.
Remarkably, to the best of our knowledge, several of the genes discovered have not been reported in any past studies. Regarding applicability, based on the results acquired from microarray gene expression analysis, we conclude that DQB can be adopted in different real-world applications with some modifications.

13.
MOTIVATION: Association rule analysis methods are important techniques applied to gene expression data for finding expression relationships between genes. However, previous methods implicitly assume that all genes have similar importance, or they ignore the individual importance of each gene, and the relation intensity between any two items has never been taken into consideration. We therefore propose the REMMAR (RElational-based Multiple Minimum supports Association Rules) algorithm to tackle this problem. The method adjusts the minimum relation support (MRS) for each gene pair depending on the regulatory relation intensity, in order to discover more important association rules with stronger biological meaning. RESULTS: In the case study of this research, REMMAR used the shortest distance between any two genes in the Saccharomyces cerevisiae gene regulatory network (GRN) as the relation intensity to discover association rules from two S. cerevisiae gene expression datasets. In experimental evaluation, REMMAR generated more rules with stronger relation intensity and filtered out rules without biological meaning in the protein-protein interaction network (PPIN). Furthermore, according to a literature survey of the discovered rules, the proposed method had higher precision (100%) than the reference Apriori method (87.5%). The proposed REMMAR algorithm can therefore discover association rules with stronger biological relationships that traditional methods miss, assisting biologists in complicated genetic exploration.

14.
Li BQ, Huang T, Liu L, Cai YD, Chou KC. PLoS ONE, 2012, 7(4): e33393
One of the most important and challenging problems in biomedicine and genomics is how to identify disease genes. In this study, we developed a computational method to identify colorectal cancer-related genes based on (i) gene expression profiles and (ii) shortest path analysis of functional protein association networks. The former has long been used to select differentially expressed genes as disease genes, while the latter has been widely used to study the mechanisms of diseases. With the existing protein-protein interaction data from STRING (Search Tool for the Retrieval of Interacting Genes), a weighted functional protein association network was constructed. By means of the mRMR (Maximum Relevance Minimum Redundancy) approach, six genes were identified that can distinguish colorectal tumors from normal adjacent colonic tissues by their gene expression profiles. Using the shortest path approach, we then found an additional 35 genes, some of which have been reported to be relevant to colorectal cancer and some of which are very likely to be relevant to it. Interestingly, the genes identified from both the gene expression profiles and the functional protein association network include more cancer genes than the genes identified from the gene expression profiles alone. These genes also had greater functional similarity to the reported colorectal cancer genes than the genes identified from the gene expression profiles alone. All of this indicates that the method presented in this paper is quite promising. The method may become a useful tool, or at least play a complementary role to the existing method, for identifying colorectal cancer genes. It has not escaped our notice that the method can be applied to identify the genes of other diseases as well.
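The shortest path step on a weighted network can be illustrated with Dijkstra's algorithm, sketched below. This is an assumption about how such an analysis might be set up, not the authors' code: the abstract does not say which shortest path algorithm was used or how STRING scores were turned into edge weights, and the protein names and weights here are hypothetical.

```python
import heapq


def shortest_path(graph, source, target):
    """Dijkstra's algorithm; graph maps node -> {neighbor: non-negative weight}."""
    dist = {source: 0.0}
    prev = {}
    heap = [(0.0, source)]
    visited = set()
    while heap:
        d, node = heapq.heappop(heap)
        if node in visited:
            continue  # stale heap entry for a node that is already settled
        visited.add(node)
        if node == target:
            break
        for neighbor, weight in graph.get(node, {}).items():
            candidate = d + weight
            if candidate < dist.get(neighbor, float("inf")):
                dist[neighbor] = candidate
                prev[neighbor] = node
                heapq.heappush(heap, (candidate, neighbor))
    if target not in dist:
        return float("inf"), []
    # Walk the predecessor links back from the target to recover the path.
    path = [target]
    while path[-1] != source:
        path.append(prev[path[-1]])
    return dist[target], path[::-1]


# Hypothetical undirected network; a lower weight stands for a stronger association.
network = {
    "P1": {"P2": 1.0, "P3": 4.0},
    "P2": {"P1": 1.0, "P3": 1.5, "P4": 5.0},
    "P3": {"P1": 4.0, "P2": 1.5, "P4": 1.0},
    "P4": {"P2": 5.0, "P3": 1.0},
}
distance, path = shortest_path(network, "P1", "P4")
print(distance, path)
```

In this toy network the intermediate nodes on the path between two known genes (here `P2` and `P3`) are the kind of candidates a shortest path analysis would surface.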

15.
Model-based clustering is a popular tool for summarizing high-dimensional data. With the number of high-throughput large-scale gene expression studies still on the rise, the need for effective data-summarizing tools has never been greater. By grouping genes according to a common experimental expression profile, we may gain new insight into the biological pathways that steer biological processes of interest. Clustering of gene profiles can also assist in assigning functions to genes that have not yet been functionally annotated. In this paper, we propose two model selection procedures for model-based clustering. Model selection in model-based clustering has to date focused on the identification of data dimensions that are relevant for clustering. However, in more complex data structures, with multiple experimental factors, such an approach does not provide easily interpreted clustering outcomes. We propose a mixture model with multiple levels that provides sparse representations both "within" and "between" cluster profiles. We explore various flexible "within-cluster" parameterizations and discuss how efficient parameterizations can greatly enhance the objective interpretability of the generated clusters. Moreover, we allow for a sparse "between-cluster" representation with a different number of clusters at different levels of an experimental factor of interest. This enhances interpretability of clusters generated in multiple-factor contexts. Interpretable cluster profiles can assist in detecting biologically relevant groups of genes that may be missed with less efficient parameterizations. We use our multilevel mixture model to mine a proliferating cell line expression data set for annotational context and regulatory motifs. We also investigate the performance of the multilevel clustering approach on several simulated data sets.

16.
MOTIVATION: Analysis of gene expression data can provide insights into the positive and negative co-regulation of genes. However, existing methods such as association rule mining are computationally expensive and the quality and quantities of the rules are sensitive to the support and confidence values. In this paper, we introduce the concept of positive and negative co-regulated gene cluster (PNCGC) that more accurately reflects the co-regulation of genes, and propose an efficient algorithm to extract PNCGCs. RESULTS: We experimented with the Yeast dataset and compared our resulting PNCGCs with the association rules generated by the Apriori mining algorithm. Our results show that our PNCGCs identify some missing co-regulations of association rules, and our algorithm greatly reduces the large number of rules involving uncorrelated genes generated by the Apriori scheme. AVAILABILITY: The software is available upon request.

17.
Gene expression microarray experiments frequently generate datasets with multiple values missing. However, most of the analysis, mining, and classification methods for gene expression data require a complete matrix of gene array values. Therefore, the accurate estimation of missing values in such datasets has been recognized as an important issue, and several imputation algorithms have already been proposed to the biological community. Most of these approaches, however, are not particularly suitable for time series expression profiles. In view of this, we propose a novel imputation algorithm, which is specially suited for the estimation of missing values in gene expression time series data. The algorithm uses Dynamic Time Warping (DTW) distance to measure the similarity between time expression profiles, and then selects, for each gene expression profile with missing values, a dedicated set of candidate profiles for estimation. Three different DTW-based imputation (DTWimpute) algorithms have been considered: position-wise, neighborhood-wise, and two-pass imputation. These were initially prototyped in Perl, and their accuracy was evaluated on yeast expression time series data using several different parameter settings. The experiments have shown that the two-pass algorithm consistently outperforms the neighborhood-wise and position-wise algorithms, in particular for datasets with a higher level of missing entries. The two-pass DTWimpute algorithm was further benchmarked against the weighted K-Nearest Neighbors algorithm, which is widely used in the biological community, and appeared superior to it. Motivated by these findings, which clearly indicate the added value of DTW techniques for missing value estimation in time series data, we have built an optimized C++ implementation of the two-pass DTWimpute algorithm.
The software also provides for a choice between three different initial rough imputation methods.
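The DTW distance that this abstract builds on can be sketched with the classic dynamic programming recurrence. This is the textbook formulation, assuming absolute difference as the local cost and no warping window; the abstract does not specify which DTW variant DTWimpute uses, and the two profiles below are made-up values.

```python
def dtw_distance(a, b):
    """Dynamic Time Warping distance between two numeric sequences."""
    n, m = len(a), len(b)
    inf = float("inf")
    # cost[i][j] holds the minimal accumulated cost of aligning a[:i] with b[:j].
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = abs(a[i - 1] - b[j - 1])
            cost[i][j] = local + min(
                cost[i - 1][j],      # a[i-1] is matched to b[j-1] again
                cost[i][j - 1],      # b[j-1] is matched to a[i-1] again
                cost[i - 1][j - 1],  # both sequences advance together
            )
    return cost[n][m]


# Hypothetical expression profiles: the second is a time-stretched copy of the first.
profile_x = [1.0, 2.0, 3.0]
profile_y = [1.0, 2.0, 2.0, 3.0]
print(dtw_distance(profile_x, profile_y))
```

Because DTW lets one time point align with several, two profiles with the same shape but shifted or stretched timing come out as close, which is the property that makes it attractive for picking candidate profiles in time series data.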

18.

Background

Microarray gene expression data are accumulating in public databases. The expression profiles contain valuable information for understanding human gene expression patterns. However, the effective use of public microarray data requires integrating the expression profiles from heterogeneous sources.

Results

In this study, we have compiled a compendium of microarray expression profiles of various human tissue samples. The microarray raw data generated in different research laboratories have been obtained and combined into a single dataset after data normalization and transformation. To demonstrate the usefulness of the integrated microarray data for studying human gene expression patterns, we have analyzed the dataset to identify potential tissue-selective genes. A new method has been proposed for genome-wide identification of tissue-selective gene targets using both microarray intensity values and detection calls. The candidate genes for brain, liver and testis-selective expression have been examined, and the results suggest that our approach can select some interesting gene targets for further experimental studies.

Conclusion

A computational approach has been developed in this study for combining microarray expression profiles from heterogeneous sources. The integrated microarray data can be used to investigate tissue-selective expression patterns of human genes.

19.

Background  

Data clustering analysis has been extensively applied to extract information from gene expression profiles obtained with DNA microarrays. To this end, existing clustering approaches, mainly developed in computer science, have been adapted to microarray data analysis. However, previous studies revealed that microarray datasets have very diverse structures, some of which may not be correctly captured by current clustering methods. We therefore approached the problem from a new starting point, and developed a clustering algorithm designed to capture dataset-specific structures at the beginning of the process.

20.