Similar literature
 Found 20 similar articles; search time: 0 ms
1.
Cluster analysis of gene-wide expression data from DNA microarray hybridization studies has proved to be a useful tool for identifying biologically relevant groupings of genes and constructing gene regulatory networks. The motivation for considering mutual information is its capacity to measure a general dependence among gene random variables. We propose a novel clustering strategy based on minimizing mutual information among gene clusters. Simulated annealing is employed to solve the optimization problem. Bootstrap techniques are employed to get more accurate estimates of mutual information when the data sample size is small. Moreover, we propose to combine the mutual information criterion and traditional distance criteria such as the Euclidean distance and the fuzzy membership metric in designing the clustering algorithm. The performances of the new clustering methods are compared with those of some existing methods, using both synthesized data and experimental data. It is seen that the clustering algorithm based on a combined metric of mutual information and fuzzy membership achieves the best performance. The supplemental material is available at www.gspsnap.tamu.edu/gspweb/zxb/glioma_zxb.
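The dependence measure this abstract relies on can be illustrated with a plug-in estimate of the mutual information between two discretized expression profiles. This is only a minimal sketch: the function names and the equal-width binning are assumptions made for illustration, and the abstract's cluster-level objective, simulated annealing, and bootstrap correction are not reproduced.

```python
import math
from collections import Counter


def discretize(values, bins=3):
    """Equal-width binning of a continuous profile (an illustrative choice)."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0] * len(values)
    width = (hi - lo) / bins
    # The maximum value would land in bin index `bins`, so clamp it.
    return [min(int((v - lo) / width), bins - 1) for v in values]


def mutual_information(x, y):
    """Plug-in estimate of I(X;Y) in bits for two discrete sequences."""
    if len(x) != len(y) or not x:
        raise ValueError("x and y must be non-empty and of equal length")
    n = len(x)
    count_x = Counter(x)
    count_y = Counter(y)
    count_xy = Counter(zip(x, y))
    mi = 0.0
    for (a, b), c in count_xy.items():
        # p(a,b) * log2( p(a,b) / (p(a) * p(b)) ), written with raw counts
        mi += (c / n) * math.log2(c * n / (count_x[a] * count_y[b]))
    return mi
```

With small samples the plug-in estimate is biased, which is presumably why the abstract turns to bootstrap techniques.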

2.
MOTIVATION: Arrays allow measurements of the expression levels of thousands of mRNAs to be made simultaneously. The resulting data sets are information rich but require extensive mining to enhance their usefulness. Information theoretic methods are capable of assessing similarities and dissimilarities between data distributions and may be suited to the analysis of gene expression experiments. The purpose of this study was to investigate information theoretic data mining approaches to discover temporal patterns of gene expression from array-derived gene expression data. RESULTS: The Kullback-Leibler divergence, an information-theoretic distance that measures the relative dissimilarity between two data distribution profiles, was used in conjunction with an unsupervised self-organizing map algorithm. Two published, array-derived gene expression data sets were analyzed. The patterns obtained with the KL clustering method were found to be superior to those obtained with the hierarchical clustering algorithm using the Pearson correlation distance measure. The biological significance of the results was also examined. AVAILABILITY: Software code is available by request from the authors. All programs were written in ANSI C and Matlab (Mathworks Inc., Natick, MA).
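As a rough illustration of the distance described above, the sketch below computes the Kullback-Leibler divergence between two expression profiles after normalizing each to sum to one. The normalization step, the pseudocount, and the symmetrized variant are assumptions for illustration; the abstract does not say how profiles were turned into distributions, and the self-organizing map is not shown.

```python
import math


def normalize(profile, pseudocount=1e-9):
    """Turn a non-negative profile into a probability distribution.

    The pseudocount (an assumed value) keeps every entry strictly positive.
    """
    shifted = [v + pseudocount for v in profile]
    total = sum(shifted)
    return [v / total for v in shifted]


def kl_divergence(p, q):
    """D(p || q) in bits; terms with p_i == 0 contribute nothing."""
    if len(p) != len(q):
        raise ValueError("distributions must have the same length")
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def symmetric_kl(p, q):
    """Symmetrized divergence, usable as a dissimilarity between profiles."""
    return kl_divergence(p, q) + kl_divergence(q, p)
```

The divergence is not symmetric, so a clustering routine that needs a symmetric dissimilarity would typically use something like `symmetric_kl`.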

3.
Comparing chromosomal gene order in two or more related species is an important approach to studying the forces that guide genome organization and evolution. Linked clusters of similar genes found in related genomes are often used to support arguments of evolutionary relatedness or functional selection. However, as the gene order and the gene complement of sister genomes diverge progressively due to large scale rearrangements, horizontal gene transfer, gene duplication and gene loss, it becomes increasingly difficult to determine whether observed similarities in local genomic structure are indeed remnants of common ancestral gene order, or are merely coincidences. A rigorous comparative genomics requires principled methods for distinguishing chance commonalities, within or between genomes, from genuine historical or functional relationships. In this paper, we construct tests for significant groupings against null hypotheses of random gene order, taking incomplete clusters, multiple genomes, and gene families into account. We consider both the significance of individual clusters of prespecified genes and the overall degree of clustering in whole genomes.

4.
5.
6.
7.
Multiconstrained gene clustering based on generalized projections (cited by: 1; self-citations: 0; citations by others: 1)

Background  

Gene clustering for annotating gene functions is one of the fundamental issues in bioinformatics. The best clustering solution is often regularized by multiple constraints such as gene expressions, Gene Ontology (GO) annotations and gene network structures. How to integrate multiple constraints into an optimal clustering solution remains an unsolved problem.

8.
In this article, we introduce an exploratory framework for learning patterns of conditional co-expression in gene expression data. The main idea behind the proposed approach consists of estimating how the information content shared by a set of M nodes in a network (where each node is associated with an expression profile) varies upon conditioning on a set of L conditioning variables (in the simplest case represented by a separate set of expression profiles). The method is non-parametric and is based on the concept of statistical co-information, which, unlike conventional correlation-based techniques, is not restricted in scope to linear conditional dependency patterns. Moreover, such conditional co-expression relationships can potentially indicate regulatory interactions that do not manifest themselves when only pair-wise relationships are considered. A moment-based approximation of the co-information measure is derived that efficiently gets around the problem of estimating high-dimensional multivariate probability density functions from the data, a task usually not viable due to the intrinsic sample size limitations that characterize expression level measurements. By applying the proposed exploratory method, we analyzed a whole genome microarray assay of the eukaryote Saccharomyces cerevisiae and were able to learn statistically significant patterns of conditional co-expression. A selection of such interactions that carry a meaningful biological interpretation is discussed.

9.

Background  

Explicit evolutionary models are required in maximum-likelihood and Bayesian inference, the two methods that are overwhelmingly used in phylogenetic studies of DNA sequence data. Appropriate selection of nucleotide substitution models is important because the use of incorrect models can mislead phylogenetic inference. To better understand the performance of different model-selection criteria, we used 33,600 simulated data sets to analyse the accuracy, precision, dissimilarity, and biases of the hierarchical likelihood-ratio test, Akaike information criterion, Bayesian information criterion, and decision theory.
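Two of the criteria named above have simple closed forms, AIC = 2k - 2 ln L and BIC = k ln(n) - 2 ln L, where k is the number of free parameters, n the sample size, and ln L the maximized log-likelihood. The sketch below ranks candidate models by either score. The model names and log-likelihood values are made up for illustration and do not come from the study; the hierarchical likelihood-ratio test and the decision-theoretic criterion are not shown.

```python
import math


def aic(log_likelihood, num_params):
    """Akaike information criterion: 2k - 2 ln L (lower is better)."""
    return 2 * num_params - 2 * log_likelihood


def bic(log_likelihood, num_params, sample_size):
    """Bayesian information criterion: k ln(n) - 2 ln L (lower is better)."""
    return num_params * math.log(sample_size) - 2 * log_likelihood


def rank_models(fits, sample_size, criterion="aic"):
    """Return model names sorted from best (lowest score) to worst.

    `fits` maps a model name to a (log_likelihood, num_params) pair.
    """
    def score(item):
        _, (lnl, k) = item
        return aic(lnl, k) if criterion == "aic" else bic(lnl, k, sample_size)

    return [name for name, _ in sorted(fits.items(), key=score)]


# Hypothetical fits for three nested models; the numbers are invented.
example_fits = {
    "simple": (-1050.0, 0),
    "medium": (-1020.0, 4),
    "complex": (-1018.5, 8),
}
print(rank_models(example_fits, sample_size=500, criterion="bic"))
```

Because BIC's penalty grows with ln(n), it tends to favor simpler models than AIC on large alignments.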

10.

Background

Clustering is a widely used technique for analysis of gene expression data. Most clustering methods group genes based on distances, while few methods group genes according to the similarities of the distributions of their expression levels. Furthermore, as biological annotation resources have accumulated, an increasing number of genes have been annotated into functional categories. As a result, evaluating the performance of clustering methods in terms of the functional consistency of the resulting clusters is of great interest.

Results

In this paper, we propose the WDCM (Weibull Distribution-based Clustering Method), a robust approach for clustering gene expression data, in which the expression levels of individual genes are treated as random variables following unique Weibull distributions. Our WDCM is based on the concept that genes with similar expression profiles have similar distribution parameters, and thus the genes are clustered via the Weibull distribution parameters. We used the WDCM to cluster three cancer gene expression data sets from lung cancer, B-cell follicular lymphoma and bladder carcinoma and obtained well-clustered results. We compared the performance of WDCM with k-means and Self Organizing Map (SOM) using functional annotation information given by the Gene Ontology (GO). The results showed that the functional annotation ratios of WDCM are higher than those of the other methods. We also utilized the external measure Adjusted Rand Index to validate the performance of the WDCM. The comparative results demonstrate that the WDCM provides better clustering performance than the k-means and SOM algorithms. The merit of the proposed WDCM is that it can be applied to cluster incomplete gene expression data without imputing the missing values. Moreover, the robustness of WDCM is also evaluated on the incomplete data sets.

Conclusions

The results demonstrate that our WDCM produces clusters with more consistent functional annotations than the other methods. The WDCM is also verified to be robust and is capable of clustering gene expression data containing a small quantity of missing values.
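The Adjusted Rand Index used as the external validation measure in this abstract can be computed from the contingency table of two labelings. The sketch below is a minimal standard-library version under the usual pair-counting definition; the function name is an assumption, and it does not reproduce the WDCM itself.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(labels_true, labels_pred):
    """Adjusted Rand Index between two labelings of the same items."""
    n = len(labels_true)
    if n != len(labels_pred) or n < 2:
        raise ValueError("need two labelings of the same length (>= 2)")
    # Pairs of items that share a cluster in both labelings.
    contingency = Counter(zip(labels_true, labels_pred))
    sum_cells = sum(comb(c, 2) for c in contingency.values())
    # Pairs of items that share a cluster within each labeling separately.
    sum_true = sum(comb(c, 2) for c in Counter(labels_true).values())
    sum_pred = sum(comb(c, 2) for c in Counter(labels_pred).values())
    expected = sum_true * sum_pred / comb(n, 2)
    max_index = (sum_true + sum_pred) / 2
    if max_index == expected:
        # Degenerate case, e.g. both labelings put everything in one cluster.
        return 1.0
    return (sum_cells - expected) / (max_index - expected)
```

The index is 1 for identical partitions regardless of how the cluster labels are named, and close to 0 (possibly negative) for partitions that agree no better than chance.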

11.
An image segmentation process was derived from an image model that assumed that cell images represent objects having characteristic relationships, limited shape properties and definite local color features. These assumptions allowed the design of a region-growing process in which the color features were used to iteratively aggregate image points in alternation with a test of the convexity of the aggregate obtained. The combination of both local and global criteria allowed the self-adaptation of the algorithm to segmentation difficulties and led to a self-assessment of the adequacy of the final segmentation result. The quality of the segmentation was evaluated by visual control of the match between cell images and the corresponding segmentation masks proposed by the algorithm. A comparison between this region-growing process and the conventional gray-level thresholding is illustrated. A field test involving 700 bone marrow cells, randomly selected from May-Grünwald-Giemsa-stained smears, allowed the evaluation of the efficiency, effectiveness and confidence of the algorithm: 96% of the cells were evaluated as correctly segmented by the algorithm's self-assessment of adequacy, with a 98% confidence. The principles of the other major segmentation algorithms are also reviewed.

12.
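Gene expression profiles measured under a limited number of experimental conditions support effective prediction only for gene functional classes related to those conditions, so the range of predictable functional classes needs to be restricted. Accordingly, Gene Ontology (GO) is first used to select condition-related functional classes that are enriched in differentially expressed genes. A support vector machine classifier is then used to refine the annotation of genes that have so far been annotated only to the parent node of a condition-related functional class, predicting whether they belong to that class. Applied to a yeast gene expression data set, and after highly imbalanced training sets were removed, the approach achieved an average precision above 71% and an average recall above 47%.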

13.
14.
Summary. Procedures for selecting among parental varieties to be used in the synthesis of composites are discussed. In addition to the criterion based on the mean and variance of composites of the same size (k) proposed by Cordoso (1976), we suggest the index Ij=w1vj+w2 j or Ij=(2/k) Ij for a preliminary selection among parental varieties. We show that by increasing k (size of the composite) Ij tends to gj, the general combining ability effect. Such a criterion is particularly important when n, the number of parental varieties, is large, so that the number of possible composites ($N_c = 2^n - n - 1$) becomes too large to be handled when using the common prediction procedures. Yield data from a 9 × 9 variety diallel cross were used for illustration.
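The count of possible composites is read here as 2^n - n - 1, that is, every subset of the n parental varieties with at least two members (all 2^n subsets minus the empty set and the n single-variety sets). The short sketch below, with assumed function names, shows how quickly that count grows and checks it against the sum of binomial coefficients.

```python
from math import comb


def count_composites(n):
    """Number of composites of size k >= 2 that n parental varieties allow."""
    return 2 ** n - n - 1


def count_composites_by_size(n):
    """Break the total down by composite size k = 2, ..., n."""
    return {k: comb(n, k) for k in range(2, n + 1)}


for n in (9, 15, 20):
    print(n, count_composites(n))
```

For the 9 parental varieties of the diallel cross mentioned above this gives 502 composites, while 20 varieties would already give more than a million, which is the motivation for a preliminary selection index.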

15.

Background  

Clustering the information content of large high-dimensional gene expression datasets has widespread application in "omics" biology. Unfortunately, the underlying structure of these natural datasets is often fuzzy, and the computational identification of data clusters generally requires knowledge about cluster number and geometry.

16.

Background  

Time-course microarray experiments can produce useful data which can help in understanding the underlying dynamics of the system. Clustering is an important stage in microarray data analysis where the data is grouped together according to certain characteristics. The majority of clustering techniques are based on distance or visual similarity measures which may not be suitable for clustering of temporal microarray data where the sequential nature of time is important. We present a Granger causality based technique to cluster temporal microarray gene expression data, which measures the interdependence between two time-series by statistically testing if one time-series can be used for forecasting the other time-series or not.

17.
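[Objective] The biosynthetic gene cluster of the peptide compounds Surugamides (sgm) contains four non-ribosomal peptide synthetase (NRPS) genes, surA–D, which are responsible for two NRPS biosynthetic pathways. Previous reports have confirmed that surA is associated with the product Surugamide A and surB with sgm F, but there is as yet no experimental evidence assigning functions to surC and surD. Building on earlier work, this study aims to further confirm that surA and surD are responsible for Surugamide A biosynthesis, providing a theoretical basis for engineering the Surugamides biosynthetic pathway and for studying the recognition mechanism between its NRPS proteins. [Methods] Actinomycetes were isolated from sponges and assigned to taxonomic units by 16S rRNA gene sequence alignment. Genome sequences were analyzed with the online database antiSMASH to find natural-product biosynthetic gene clusters. Compound structures were identified by UPLC-Q-TOF-MS and 13C NMR. The constructed homologous-recombination double-crossover plasmids were introduced into the Streptomyces host, and gene deletion or replacement mutants were then screened. [Results] The Surugamides biosynthetic gene cluster was found in the genome of S. albidoflavus LHW3101, a Streptomyces strain isolated from a sponge (胄甲海绵), and the compounds sgm A and sgm F were confirmed in its fermentation products. The mutant RJ9, in which surB and surC are both deleted, was constructed; RJ9 no longer produced sgm F but still produced Surugamide A. Along with the deletion of surB and surC, the strong constitutive promoter ermEp* was introduced upstream of surD, and RJ9 produced Surugamide A at about twice the level of the wild-type strain. [Conclusion] The surB and surC genes were confirmed to be unrelated to the sgm A product. Introducing a strong promoter upstream of surD significantly increased the yield of Surugamide A, suggesting that surD is associated with the sgm A product. Combined with the previously reported evidence that surA is associated with Surugamide A, this further supports the inference that surA and surD are responsible for Surugamide A biosynthesis.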

18.
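Southwestern Yunnan has rich sympodial (clumping) bamboo forest landscapes and rare endemic bamboo species, but the unclear distribution and stock of bamboo resources and the lack of monitoring techniques greatly limit their development and use. Based on Sentinel-2A imagery, three machine learning classifiers (back-propagation neural network, support vector machine, and random forest) were used to extract clumping bamboo forest in Cangyuan County and to assess classification accuracy, and Google Earth imagery and DEM data were used to analyze the spatial and topographic characteristics of the bamboo distribution. The results show that random forest outperformed the support vector machine and the back-propagation neural network, with an overall accuracy of 90%, a Kappa coefficient of 0.87, and a user's accuracy for bamboo forest of 81%. Cangyuan County has 138.07 km2 of bamboo forest, mainly distributed around towns and villages, roads, water systems, and cropland, and dominated by bamboo planted beside houses, villages, roads, and water ("four-sides" bamboo, 四旁竹) and by protective bamboo forest. The 10 m resolution of Sentinel-2A was sufficient to extract the spatially scattered clumping bamboo forest well. Bamboo forest in Cangyuan County is mainly distributed at elevations of 900–2000 m, mostly on gentle and moderate slopes. The results can provide data support for the development and use of bamboo resources in Cangyuan County, and the method can serve as a reference for remote sensing monitoring of large clumping bamboo.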
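Overall accuracy, the Kappa coefficient, and user's accuracy, the three figures reported in this abstract, are all derived from a classification confusion matrix. The sketch below shows the standard formulas under the assumption that rows hold the mapped (classified) classes and columns the reference classes; that layout and the function names are illustrative, and the example matrix in the test is made up, not taken from the study.

```python
def overall_accuracy(matrix):
    """Fraction of samples on the diagonal of a square confusion matrix."""
    total = sum(sum(row) for row in matrix)
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    return correct / total


def cohen_kappa(matrix):
    """Cohen's kappa: agreement beyond what chance alone would produce."""
    total = sum(sum(row) for row in matrix)
    observed = overall_accuracy(matrix)
    row_totals = [sum(row) for row in matrix]
    col_totals = [sum(col) for col in zip(*matrix)]
    # Chance agreement from the row and column marginals.
    expected = sum(r * c for r, c in zip(row_totals, col_totals)) / total ** 2
    if expected == 1.0:
        return 1.0  # all samples in one class for both map and reference
    return (observed - expected) / (1.0 - expected)


def users_accuracy(matrix, class_index):
    """Share of samples mapped to a class that truly belong to it.

    Assumes rows are the mapped classes (an illustrative convention).
    """
    return matrix[class_index][class_index] / sum(matrix[class_index])
```

Kappa is lower than overall accuracy whenever part of the agreement could arise by chance, which is why the two figures are usually reported together.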

19.
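The k-means clustering algorithm is an iterative relocation algorithm widely used in cluster analysis of gene expression data. It usually represents relationships between genes by distance, which cannot effectively reflect the interdependence between genes. To address this, a k-modes clustering algorithm based on information theory is proposed that overcomes this shortcoming. In addition, the pseudo-F statistic is introduced: on the one hand, it allows points that partially overlap in space to be classified effectively; on the other hand, it gives the optimal number of clusters, making up for a deficiency of k-modes clustering. Together these make the method a very effective algorithm that achieves better clustering results.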
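The abstract does not give the form of its pseudo-F statistic. Assuming it is the common Calinski-Harabasz form, F = [B / (k - 1)] / [W / (n - k)], with B the between-cluster and W the within-cluster sum of squares, the sketch below scores a labeling of numeric points. Choosing the number of clusters then amounts to clustering for several values of k and keeping the k with the largest score. The function name is an assumption, and the information-theoretic k-modes step is not shown.

```python
def pseudo_f(points, labels):
    """Calinski-Harabasz pseudo-F for points (lists of floats) and labels."""
    n = len(points)
    clusters = {}
    for point, label in zip(points, labels):
        clusters.setdefault(label, []).append(point)
    k = len(clusters)
    if k < 2 or n <= k:
        raise ValueError("need at least 2 clusters and more points than clusters")
    dim = len(points[0])

    def mean(rows):
        return [sum(r[d] for r in rows) / len(rows) for d in range(dim)]

    def sqdist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    grand_mean = mean(points)
    # Between-cluster sum of squares, weighted by cluster size.
    between = sum(len(rows) * sqdist(mean(rows), grand_mean)
                  for rows in clusters.values())
    # Within-cluster sum of squares around each cluster centroid.
    within = 0.0
    for rows in clusters.values():
        centroid = mean(rows)
        within += sum(sqdist(p, centroid) for p in rows)
    if within == 0:
        return float("inf")  # perfectly tight clusters
    return (between / (k - 1)) / (within / (n - k))
```

A higher value indicates clusters that are compact relative to their separation.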

20.
C A Orengo, N P Brown, W R Taylor. Proteins, 1992, 14(2): 139-167
A fast method is described for searching and analyzing the protein structure databank. It uses secondary structure followed by residue matching to compare protein structures and is developed from a previous structural alignment method based on dynamic programming. Linear representations of secondary structures are derived and their features compared to identify equivalent elements in two proteins. The secondary structure alignment then constrains the residue alignment, which compares only residues within aligned secondary structures and with similar buried areas and torsional angles. The initial secondary structure alignment improves accuracy and provides a means of filtering out unrelated proteins before the slower residue alignment stage. It is possible to search or sort the protein structure databank very quickly using just secondary structure comparisons. A search through 720 structures with a probe protein of 10 secondary structures required 1.7 CPU hours on a Sun 4/280. Alternatively, combined secondary structure and residue alignments, with a cutoff on the secondary structure score to remove pairs of unrelated proteins from further analysis, took 10.1 CPU hours. The method was applied in searches on different classes of proteins and to cluster a subset of the databank into structurally related groups. Relationships were consistent with known families of protein structure.
