Similar articles
 17 similar articles found (search time: 234 ms)
1.
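Phylogenetic profiling is a widely studied, non-homology-based method for the functional annotation of biological macromolecules. To address shortcomings of existing algorithms, the method is improved in two respects: first, weighted phylogenetic profiles are constructed; second, an improved clustering algorithm is used to analyze profile similarity. One hundred Escherichia coli K12 proteins downloaded from NCBI served as experimental data, and similar profiles were analyzed with the improved algorithm, the classical hierarchical clustering algorithm, and the K-means clustering algorithm. The results show that the proposed algorithm clusters similar profiles markedly more accurately than the other two algorithms.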

2.
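The phylogenetic profile algorithm, an effective method for large-scale functional annotation of genomes, has been applied successfully to the functional annotation of prokaryotic genomes. By analyzing a key step of the phylogenetic profile method, the clustering of similar profiles, a statistical-modeling-based method for clustering similar phylogenetic profiles is proposed. Experiments show that the method maintains high coverage while effectively improving the overall speed of the algorithm, and that its accuracy increases with the number of phylogenetic profiles used in modeling.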

3.
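Population stratification is an effective approach to better understanding complex biological problems such as human physical and mental health. Clustering is a method for defining enterotypes by grouping samples to reduce complexity, but the traditional K-means clustering algorithm offers no way to determine the value of K. This paper improves on the traditional K-means algorithm and validates the improvement on public datasets. Experiments show that the improved algorithm resolves the problem of choosing K and significantly improves the stability, accuracy, and quality of the clustering results. Applied to gut microbiota OTU data, the improved model not only effectively distinguishes similarities among samples from type 2 diabetes patients but also identifies the OTUs that contribute most to the heterogeneity of the microbiota structure, offering a new approach to the clinical problem of type 2 diabetes.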

4.
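A similarity measure for protein-protein interactions is proposed and combined with a similarity measure for gene expression data to define a fused distance metric, which is then used to improve the existing K-means clustering method. Tests on real data show that the improved K-means method performs better than several other commonly used clustering methods, indicating that incorporating protein interaction data makes gene expression clustering results more biologically meaningful.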
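
The abstract does not say how the two measures are fused. As a rough illustration only, the sketch below assumes a convex combination of an expression dissimilarity (from Pearson correlation) and an interaction dissimilarity. The function names, the `alpha` weight, and the `ppi_similarity` input in [0, 1] are all hypothetical and are not taken from the paper.

```python
import math

def pearson(x, y):
    """Pearson correlation of two equal-length expression vectors.

    Assumes neither vector is constant (otherwise the denominator is zero).
    """
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def fused_distance(expr_a, expr_b, ppi_similarity, alpha=0.5):
    """Hypothetical fused distance between two genes.

    expr_a, expr_b : expression vectors of the two genes
    ppi_similarity : assumed interaction-based similarity in [0, 1]
    alpha          : assumed mixing weight given to the expression term
    """
    d_expr = (1.0 - pearson(expr_a, expr_b)) / 2.0  # rescaled to [0, 1]
    d_ppi = 1.0 - ppi_similarity
    return alpha * d_expr + (1.0 - alpha) * d_ppi
```

A distance of this form could replace the Euclidean distance in the assignment step of K-means, although the centroid update would then need a matching definition.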

5.
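Application of cluster analysis in the flavomycin fermentation process   Total citations: 2, self-citations: 0, citations by others: 2
[Objective] To apply cluster analysis to the optimization of shake-flask fermentation conditions for flavomycin. [Methods] Hierarchical clustering, K-means clustering, and fuzzy C-means clustering were compared on shake-flask data from different batches of flavomycin fermentation; fuzzy C-means clustering outperformed the other algorithms and was selected for the cluster analysis of the shake-flask fermentation data. [Results] Fuzzy C-means clustering was then used to select a high-quality group of samples, and these samples were used to optimize the ranges of the control parameters for flavomycin shake-flask fermentation. [Conclusion] This demonstrates that cluster analysis is of practical use in optimizing fermentation processes.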

6.
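The gut microbiota is associated with many major human diseases, so studying gut microbiota data under different conditions is important. Because microbiota data are zero-inflated, the geometric mean of pairwise ratios (GMPR) method is used to normalize them. Using a type 2 diabetes dataset as an example, this study proposes an improved Spectrum algorithm. First, a feature-weighted similarity matrix is used, so that the weight each feature value carries within a sample is not ignored. Second, the Laplacian matrix is replaced with a Hessian matrix to avoid the sensitivity problem of traditional spectral clustering, and the ISODATA clustering algorithm replaces the original K-means algorithm so that the number of cluster centers K can be adjusted effectively. The results show that, on the type 2 diabetes data, GMPR with the improved Spectrum algorithm achieves a normalized mutual information (NMI) of 0.423, a Davies-Bouldin index (DBI) of 4.751, a Calinski-Harabasz index (CH) of 25.541, a Rand index (RI) of 0.835, and an adjusted Rand index (ARI) of 0.019, an improvement over the unmodified algorithm. The algorithm can also identify structural differences in the gut microbiota among different patient groups and uncover key bacteria in the gut microbiome.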
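
The GMPR step can be sketched as follows. This is a simplified reading of the method, not the published implementation: for each pair of samples it takes the median count ratio over taxa that are nonzero in both, then takes the geometric mean of those medians as the sample's size factor. Skipping self-comparisons and pairs with no shared taxa, and returning NaN when nothing is comparable, are assumptions of this sketch.

```python
import math
from statistics import median

def gmpr_size_factors(counts):
    """Simplified GMPR size factors.

    counts: list of samples, each a list of taxon counts in the same taxon order.
    Returns one size factor per sample.
    """
    factors = []
    for i, sample_i in enumerate(counts):
        log_medians = []
        for j, sample_j in enumerate(counts):
            if i == j:
                continue  # assumption: self-comparison is skipped
            # Ratios only over taxa observed in both samples (handles zero inflation).
            ratios = [a / b for a, b in zip(sample_i, sample_j) if a > 0 and b > 0]
            if ratios:
                log_medians.append(math.log(median(ratios)))
        if log_medians:
            # Geometric mean of the pairwise median ratios.
            factors.append(math.exp(sum(log_medians) / len(log_medians)))
        else:
            factors.append(float("nan"))  # no comparable sample
    return factors

def normalize(counts):
    """Divide each sample's counts by its size factor."""
    factors = gmpr_size_factors(counts)
    return [[c / f for c in sample] for sample, f in zip(counts, factors)]
```

For three samples `[[10, 0, 20, 30], [20, 40, 0, 60], [5, 10, 10, 0]]`, the pairwise medians for the first sample are 0.5 and 2, so its size factor is their geometric mean, 1.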

7.
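Application of hierarchical cluster analysis to pattern recognition of bacterial whole-cell fatty acids   Total citations: 1, self-citations: 1, citations by others: 1
Cluster analysis was performed on whole-cell fatty acid gas chromatograms, obtained by capillary column gas chromatography, of 34 strains of Moraxella and related genera and 13 strains of Legionella pneumophila, using the Euclidean distance coefficient and the exponential correlation coefficient in combination with eight common hierarchical clustering algorithms. The dendrograms produced by the eight hierarchical clustering algorithms with the Euclidean distance coefficient were compared. The results show that Moraxella and Legionella pneumophila can be clearly distinguished. Within Moraxella, two new species isolated in China can also be clearly distinguished from the current main type strains of the genus. Of the two similarity coefficients, the Euclidean distance coefficient gave the better clustering results; of the eight hierarchical clustering algorithms, the complete-linkage (furthest-neighbor) method and the group-average method gave the better results.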
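
The two linkage rules singled out above differ only in how the distance between two clusters is defined. The following naive agglomerative sketch uses Euclidean distance and supports both rules; the function names and the toy input are illustrative and are not from the paper.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def agglomerate(points, n_clusters, linkage="complete"):
    """Naive agglomerative clustering; returns clusters as lists of point indices.

    linkage="complete": cluster distance is the largest pairwise distance.
    linkage="average":  cluster distance is the mean pairwise distance.
    """
    if linkage not in ("complete", "average"):
        raise ValueError("linkage must be 'complete' or 'average'")
    clusters = [[i] for i in range(len(points))]

    def cluster_distance(c1, c2):
        d = [euclidean(points[i], points[j]) for i in c1 for j in c2]
        return max(d) if linkage == "complete" else sum(d) / len(d)

    while len(clusters) > n_clusters:
        # Find the closest pair of clusters and merge them.
        pairs = ((i, j) for i in range(len(clusters))
                 for j in range(i + 1, len(clusters)))
        a, b = min(pairs, key=lambda ij: cluster_distance(clusters[ij[0]],
                                                          clusters[ij[1]]))
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]  # b > a, so index a is unaffected
    return [sorted(c) for c in clusters]
```

Recording the merge order and merge distances instead of stopping at `n_clusters` would give the information needed to draw a dendrogram.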

8.
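Application of Bayesian clustering to knowledge mining of gene expression profiles   Total citations: 1, self-citations: 0, citations by others: 1
A novel clustering algorithm based on a Bayesian model is introduced for the analysis of large-scale gene expression profile data. Starting from the biological background, the theoretical basis and advantages of applying the algorithm to large-scale gene expression profiles are examined, and the algorithm is applied to re-mine knowledge from two public gene expression datasets. The results show that, compared with other clustering algorithms, the algorithm has significant advantages in knowledge discovery. The biological knowledge it uncovers may also inform the experimental design of researchers in the field.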

9.
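The reasons why commonly used normalization methods cause misclassification in tumor gene microarrays are analyzed, and a class-mean-based normalization method is proposed. The method normalizes gene expression profiles in both directions and interleaves normalization with clustering, using the clustering results to correct the reference expression levels. Five tumor microarray datasets were selected, and hierarchical clustering and K-means clustering were used, at different variance levels, to compare cluster analyses of gene expression data processed with the common normalization and with the class-mean-based normalization. The experimental results show that the class-mean-based normalization method effectively improves the quality of clustering results for tumor gene expression profiles.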

10.
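Phylogenetic trees, also known as evolutionary trees or trees of life, first appeared in Darwin's book on evolution, after which the reconstruction of phylogenetic trees became widely accepted by biologists. This article describes the basic workflow for constructing phylogenetic trees, analyzes and compares in detail the four classes of algorithms currently used to construct them (distance methods, maximum parsimony, maximum likelihood, and Bayesian methods), and introduces the features of some commonly used software for constructing and analyzing phylogenetic trees (PHYLIP, MEGA, MrBayes).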

11.
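The k-means clustering algorithm is an iterative algorithm widely used in the cluster analysis of gene expression data. It usually represents relationships between genes by distance, which fails to reflect the interdependence among genes effectively. To overcome this shortcoming, a k-modes clustering algorithm based on information theory is proposed. A pseudo-F statistic is also introduced: it allows partially overlapping points in the space to be classified effectively, and it indicates the optimal number of clusters, compensating for the shortcomings of the k-modes clustering method. The result is an effective algorithm that achieves better clustering.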
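
For orientation, the usual Euclidean form of the pseudo-F statistic is the ratio of between-cluster to within-cluster dispersion, each divided by its degrees of freedom. The abstract does not give the variant used with k-modes, so the sketch below is the standard numeric version, stated as an assumption. It needs at least two clusters, fewer clusters than points, and nonzero within-cluster scatter.

```python
def pseudo_f(points, labels):
    """Pseudo-F statistic: (BSS / (k - 1)) / (WSS / (n - k)).

    points: list of numeric vectors; labels: one cluster label per point.
    """
    n = len(points)
    dim = len(points[0])
    clusters = {}
    for p, lab in zip(points, labels):
        clusters.setdefault(lab, []).append(p)
    k = len(clusters)

    def centroid(ps):
        return [sum(p[d] for p in ps) / len(ps) for d in range(dim)]

    def sqdist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    grand = centroid(points)
    # Between-cluster sum of squares, weighted by cluster size.
    between = sum(len(ps) * sqdist(centroid(ps), grand) for ps in clusters.values())
    # Within-cluster sum of squares.
    within = 0.0
    for ps in clusters.values():
        c = centroid(ps)
        within += sum(sqdist(p, c) for p in ps)
    return (between / (k - 1)) / (within / (n - k))
```

To choose the number of clusters, one would run the clustering for several candidate values of k and prefer the value with the largest statistic.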

12.
Inference from clustering with application to gene-expression microarrays.   Total citations: 7, self-citations: 0, citations by others: 7
There are many algorithms to cluster sample data points based on nearness or a similarity measure. Often the implication is that points in different clusters come from different underlying classes, whereas those in the same cluster come from the same class. Stochastically, the underlying classes represent different random processes. The inference is that clusters represent a partition of the sample points according to which process they belong. This paper discusses a model-based clustering toolbox that evaluates cluster accuracy. Each random process is modeled as its mean plus independent noise, sample points are generated, the points are clustered, and the clustering error is the number of points clustered incorrectly according to the generating random processes. Various clustering algorithms are evaluated based on process variance and the key issue of the rate at which algorithmic performance improves with increasing numbers of experimental replications. The model means can be selected by hand to test the separability of expected types of biological expression patterns. Alternatively, the model can be seeded by real data to test the expected precision of that output or the extent of improvement in precision that replication could provide. In the latter case, a clustering algorithm is used to form clusters, and the model is seeded with the means and variances of these clusters. Other algorithms are then tested relative to the seeding algorithm. Results are averaged over various seeds. Output includes error tables and graphs, confusion matrices, principal-component plots, and validation measures. Five algorithms are studied in detail: K-means, fuzzy C-means, self-organizing maps, hierarchical Euclidean-distance-based and correlation-based clustering. The toolbox is applied to gene-expression clustering based on cDNA microarrays using real data. Expression profile graphics are generated and error analysis is displayed within the context of these profile graphics. 
A large amount of generated output is available over the web.
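
The evaluation loop described in this abstract (generate points as class mean plus independent noise, cluster them, count misclustered points) can be sketched in part as below. The clustering step itself is omitted; `simulate` and `clustering_error` are hypothetical names. Because cluster labels are arbitrary, the error is taken under the best one-to-one mapping from clusters to classes, which is an assumption about how "clustered incorrectly" is scored.

```python
import random
from itertools import permutations

def simulate(means, sigma, n_per_class, seed=0):
    """Generate points as class mean plus independent Gaussian noise."""
    rng = random.Random(seed)
    points, classes = [], []
    for cls, mean in enumerate(means):
        for _ in range(n_per_class):
            points.append([m + rng.gauss(0.0, sigma) for m in mean])
            classes.append(cls)
    return points, classes

def clustering_error(true_classes, cluster_labels):
    """Number of misclustered points under the best cluster-to-class mapping.

    Brute force over permutations, so only suitable for a few clusters.
    """
    classes = sorted(set(true_classes))
    clusters = sorted(set(cluster_labels))
    if len(clusters) > len(classes):
        raise ValueError("more clusters than classes is not handled in this sketch")
    best = len(true_classes)
    for perm in permutations(classes, len(clusters)):
        mapping = dict(zip(clusters, perm))
        errors = sum(1 for t, c in zip(true_classes, cluster_labels)
                     if mapping[c] != t)
        best = min(best, errors)
    return best
```

Repeating this for increasing `n_per_class` or decreasing `sigma` would show how the error of a given clustering algorithm changes with replication and process variance.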

13.
We propose an algorithm for selecting and clustering genes according to their time-course or dose-response profiles using gene expression data. The proposed algorithm is based on the order-restricted inference methodology developed in statistics. We describe the methodology for time-course experiments although it is applicable to any ordered set of treatments. Candidate temporal profiles are defined in terms of inequalities among mean expression levels at the time points. The proposed algorithm selects genes when they meet a bootstrap-based criterion for statistical significance and assigns each selected gene to the best fitting candidate profile. We illustrate the methodology using data from a cDNA microarray experiment in which a breast cancer cell line was stimulated with estrogen for different time intervals. In this example, our method was able to identify several biologically interesting genes that previous analyses failed to reveal.

14.
Previous studies in gene expression profiling have sought to identify groups of genes that characterize colorectal carcinoma. Despite the success of these attempts to identify groups of genes involved in the progression of colorectal carcinoma, their methods either require subjective interpretation of the number of clusters or lack stability across different runs of the algorithms, which limits their usefulness. In this study, we propose an enhanced algorithm that provides stability and robustness in identifying differentially expressed genes in an expression profile analysis. Our proposed algorithm uses multiple clustering algorithms under the consensus clustering framework. The results of the experiment show that the robustness of our method provides a consistent structure of clusters, similar to the structure found in the previous study. Furthermore, our algorithm outperforms any single clustering algorithm in terms of the cluster quality score.
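
One common way to combine several clusterings is a co-association matrix: for each pair of items, the fraction of runs that place them in the same cluster. The sketch below builds that matrix and then groups items whose co-association exceeds a threshold. It illustrates the general idea only; the abstract does not describe the authors' procedure, and the 0.5 threshold and connected-components grouping are assumptions.

```python
def co_association(labelings):
    """Fraction of labelings in which each pair of items shares a cluster."""
    n = len(labelings[0])
    counts = [[0] * n for _ in range(n)]
    for labels in labelings:
        for i in range(n):
            for j in range(n):
                if labels[i] == labels[j]:
                    counts[i][j] += 1
    runs = len(labelings)
    return [[c / runs for c in row] for row in counts]

def consensus_clusters(matrix, threshold=0.5):
    """Group items connected by co-association above the threshold."""
    n = len(matrix)
    labels = [-1] * n
    current = 0
    for start in range(n):
        if labels[start] != -1:
            continue
        labels[start] = current
        stack = [start]
        while stack:  # depth-first walk over the thresholded graph
            i = stack.pop()
            for j in range(n):
                if labels[j] == -1 and matrix[i][j] > threshold:
                    labels[j] = current
                    stack.append(j)
        current += 1
    return labels
```

Each input labeling can come from a different algorithm or a different random restart, which is what lets the consensus smooth out run-to-run instability.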

15.
Kernel density smoothing techniques have been used in classification or supervised learning of gene expression profile (GEP) data, but their applications to clustering or unsupervised learning of those data have not been explored and assessed. Here we report a kernel density clustering method for analysing GEP data and compare its performance with the three most widely used clustering methods: hierarchical clustering, K-means clustering, and multivariate mixture model-based clustering. Using several methods to measure agreement, between-cluster isolation, and within-cluster coherence, such as the Adjusted Rand Index, the Pseudo F test, the r(2) test, and the profile plot, we have assessed the effectiveness of kernel density clustering for recovering clusters, and its robustness against noise on clustering both simulated and real GEP data. Our results show that the kernel density clustering method has excellent performance in recovering clusters from simulated data and in grouping large real expression profile data sets into compact and well-isolated clusters, and that it is the most robust clustering method for analysing noisy expression profile data compared to the other three methods assessed.
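
Of the agreement measures listed, the Adjusted Rand Index is easy to compute from the contingency table of two partitions. The sketch below follows the standard pair-counting formula; returning 1.0 in the degenerate case where the denominator is zero is a convention chosen here.

```python
from collections import Counter
from math import comb

def adjusted_rand_index(labels_a, labels_b):
    """Adjusted Rand Index between two labelings of the same items."""
    n = len(labels_a)
    # Pairs of items that are together in both partitions.
    sum_cells = sum(comb(c, 2) for c in Counter(zip(labels_a, labels_b)).values())
    # Pairs together in each partition separately.
    sum_a = sum(comb(c, 2) for c in Counter(labels_a).values())
    sum_b = sum(comb(c, 2) for c in Counter(labels_b).values())
    expected = sum_a * sum_b / comb(n, 2)
    max_index = (sum_a + sum_b) / 2
    if max_index == expected:
        return 1.0  # degenerate case, e.g. both partitions are a single cluster
    return (sum_cells - expected) / (max_index - expected)
```

The index is 1 for identical partitions regardless of label names, near 0 for random agreement, and can be negative when agreement is worse than chance.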

16.
Microarray-CGH (comparative genomic hybridization) experiments are used to detect and map chromosomal imbalances. A CGH profile can be viewed as a succession of segments that represent homogeneous regions in the genome whose representative sequences share the same relative copy number on average. Segmentation methods constitute a natural framework for the analysis, but they do not provide a biological status for the detected segments. We propose a new model for this segmentation/clustering problem, combining a segmentation model with a mixture model. We present a new hybrid algorithm called dynamic programming-expectation maximization (DP-EM) to estimate the parameters of the model by maximum likelihood. This algorithm combines DP and the EM algorithm. We also propose a model selection heuristic to select the number of clusters and the number of segments. An example of our procedure is presented, based on publicly available data sets. We compare our method to segmentation methods and to hidden Markov models, and we show that the new segmentation/clustering model is a promising alternative that can be applied in the more general context of signal processing.

17.
MOTIVATION: With the advent of microarray chip technology, large data sets are emerging containing the simultaneous expression levels of thousands of genes at various time points during a biological process. Biologists are attempting to group genes based on the temporal pattern of their expression levels. While the use of hierarchical clustering (UPGMA) with correlation 'distance' has been the most common in the microarray studies, there are many more choices of clustering algorithms in pattern recognition and statistics literature. At the moment there do not seem to be any clear-cut guidelines regarding the choice of a clustering algorithm to be used for grouping genes based on their expression profiles. RESULTS: In this paper, we consider six clustering algorithms (of various flavors!) and evaluate their performances on a well-known publicly available microarray data set on sporulation of budding yeast and on two simulated data sets. Among other things, we formulate three reasonable validation strategies that can be used with any clustering algorithm when temporal observations or replications are present. We evaluate each of these six clustering methods with these validation measures. While the 'best' method is dependent on the exact validation strategy and the number of clusters to be used, overall Diana appears to be a solid performer. Interestingly, the performance of correlation-based hierarchical clustering and model-based clustering (another method that has been advocated by a number of researchers) appear to be on opposite extremes, depending on what validation measure one employs. Next it is shown that the group means produced by Diana are the closest and those produced by UPGMA are the farthest from a model profile based on a set of hand-picked genes. Availability: S+ codes for the partial least squares based clustering are available from the authors upon request. All other clustering methods considered have S+ implementation in the library MASS. 
S+ codes for calculating the validation measures are available from the authors upon request. The sporulation data set is publicly available at http://cmgm.stanford.edu/pbrown/sporulation
