Similar literature
 20 similar documents found (search time: 15 ms)
1.

Background  

In recent years, clustering algorithms have been effectively applied in molecular biology for gene expression data analysis. With the help of clustering algorithms such as K-means, hierarchical clustering, and SOM, genes are partitioned into groups based on the similarity between their expression profiles. In this way, functionally related genes are identified. As the amount of laboratory data in molecular biology grows exponentially each year due to advanced technologies such as microarrays, new efficient and effective clustering methods must be developed to process this growing amount of biological data.

2.
3.
Information theory is a branch of mathematics that overlaps with communications, biology, and medical engineering. Entropy is a measure of the uncertainty in a set of information. In this study, the entropy of each gene and of its exon sets was calculated in orders one to four. Based on the relative entropy of genes and exons, the Kullback-Leibler divergence was calculated. The Kullback-Leibler distances obtained for the genes and exon sets were then used as input to seven clustering algorithms: single, complete, average, weighted, centroid, median, and K-means. To aggregate the clustering results, the AdaBoost algorithm was used. Finally, the results of the AdaBoost algorithm were examined with the GeneMANIA prediction server to explore them from a gene annotation point of view. All calculations were performed using MATLAB (2015). An investigation of the genes' metabolic pathways based on the gene annotations showed that the proposed clustering method yielded correct, logical, and fast results. At the same time, the method avoids the disadvantages of alignment, allows genes to be considered with their actual length and content, and does not require much memory for long sequences. We believe that the proposed method could be used alongside other competitive gene clustering methods to group biologically relevant sets of genes. The proposed method can also be seen as a predictive method for genes with weak genomic annotations.
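The entropy and Kullback-Leibler steps described above can be sketched as follows. This is a minimal illustration, not the authors' MATLAB code: it assumes that "order k" entropy means the Shannon entropy of overlapping k-mer frequencies, and the function names and the `pseudo` smoothing constant are hypothetical choices made for this example.

```python
import math
from collections import Counter


def kmer_distribution(seq, k):
    """Relative frequencies of the overlapping k-mers in a sequence."""
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    total = sum(counts.values())
    return {kmer: n / total for kmer, n in counts.items()}


def entropy(dist):
    """Shannon entropy of a distribution, in bits."""
    return -sum(p * math.log2(p) for p in dist.values() if p > 0)


def kl_divergence(p, q, pseudo=1e-9):
    """Kullback-Leibler divergence D(p || q), in bits.

    k-mers missing from q are given a tiny pseudo-probability
    (an assumption of this sketch) so the logarithm stays finite.
    """
    return sum(
        pv * math.log2(pv / max(q.get(kmer, 0.0), pseudo))
        for kmer, pv in p.items()
        if pv > 0
    )


if __name__ == "__main__":
    gene = "ATGGCGTACGTTAGCATGGCA"   # made-up sequences for illustration
    exon = "ATGGCGTACG"
    for order in range(1, 5):        # orders one to four
        p = kmer_distribution(gene, order)
        q = kmer_distribution(exon, order)
        print(order, entropy(p), kl_divergence(p, q))
```

The divergence is not symmetric, so a pairwise distance matrix for clustering would need a symmetrized variant such as D(p || q) + D(q || p); the abstract does not say which form was used.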

4.
The Cheng and Church algorithm is an important biclustering approach. In this paper, the space-extension process in the second stage of the Cheng and Church algorithm is improved, and the selection of two important parameters is discussed. Results from applying the improved algorithm to gene expression profile analysis show that, compared with the Cheng and Church algorithm, the quality of the clustering results is clearly enhanced, the mined expression patterns are better, and the data show strongly consistent fluctuation across conditions, while the computational time does not increase significantly.
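For context, the Cheng and Church algorithm scores a candidate bicluster by its mean squared residue, and a submatrix is accepted when this score falls below a threshold. The sketch below computes that score only; it does not reproduce the improved extension step from the paper, and the function name and argument layout are assumptions for illustration.

```python
def mean_squared_residue(matrix, rows, cols):
    """Mean squared residue of the submatrix selected by rows and cols.

    Each residue is a_ij - (row mean) - (column mean) + (overall mean);
    a perfectly additive submatrix scores 0.
    """
    sub = [[matrix[i][j] for j in cols] for i in rows]
    n_rows, n_cols = len(rows), len(cols)
    row_means = [sum(r) / n_cols for r in sub]
    col_means = [sum(sub[i][j] for i in range(n_rows)) / n_rows
                 for j in range(n_cols)]
    overall = sum(row_means) / n_rows
    total = sum(
        (sub[i][j] - row_means[i] - col_means[j] + overall) ** 2
        for i in range(n_rows)
        for j in range(n_cols)
    )
    return total / (n_rows * n_cols)


if __name__ == "__main__":
    expression = [
        [1.0, 2.0, 3.0, 9.0],
        [2.0, 3.0, 4.0, 0.0],
        [4.0, 5.0, 6.0, 5.0],
    ]
    # The first three columns form an additive pattern; the fourth breaks it.
    print(mean_squared_residue(expression, [0, 1, 2], [0, 1, 2]))
    print(mean_squared_residue(expression, [0, 1, 2], [0, 1, 2, 3]))
```

A low score means that the selected genes rise and fall together across the selected conditions, which is the kind of consistency the abstract refers to.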

5.
In this work, we introduce in the first part new developments in Principal Component Analysis (PCA) and in the second part a new method to select variables (genes in our application). Our focus is on problems where the values taken by each variable do not all have the same importance and where the data may be contaminated with noise and contain outliers, as is the case with microarray data. The usual PCA is not appropriate for this kind of problem. In this context, we propose the use of a new correlation coefficient as an alternative to Pearson's. This leads to a so-called weighted PCA (WPCA). In order to illustrate the features of our WPCA and compare it with the usual PCA, we consider the problem of analyzing gene expression data sets. In the second part of this work, we propose a new PCA-based algorithm to iteratively select the most important genes in a microarray data set. We show that this algorithm produces better results when our WPCA is used instead of the usual PCA. Furthermore, by using Support Vector Machines, we show that it can compete with the Significance Analysis of Microarrays algorithm.

6.

Background

The generation of multiple sequence alignments (MSAs) is a crucial step for many bioinformatic analyses. Thus improving MSA accuracy and identifying potential errors in MSAs is important for a wide range of post-genomic research. We present a novel method called MergeAlign which constructs consensus MSAs from multiple independent MSAs and assigns an alignment precision score to each column.

Results

Using conventional benchmark tests we demonstrate that on average MergeAlign MSAs are more accurate than MSAs generated using any single matrix of sequence substitution. We show that MergeAlign column scores are related to alignment precision and hence provide an ab initio method of estimating alignment precision in the absence of curated reference MSAs. Using two novel and independent alignment performance tests that utilise a large set of orthologous gene families we demonstrate that increasing MSA performance leads to an increase in the performance of downstream phylogenetic analyses.

Conclusion

Using multiple tests of alignment performance we demonstrate that this novel method has broad general application in biological research.

7.
A new method for segregation and linkage analysis, with pedigree data, is described. Reversible jump Markov chain Monte Carlo methods are used to implement a sampling scheme in which the Markov chain can jump between parameter subspaces corresponding to models with different numbers of quantitative-trait loci (QTL's). Joint estimation of QTL number, position, and effects is possible, avoiding the problems that can arise from misspecification of the number of QTL's in a linkage analysis. The method is illustrated by use of a data set simulated for the 9th Genetic Analysis Workshop; this data set had several oligogenic traits, generated by use of a 1,497-member pedigree. The mixing characteristics of the method appear to be good, and the method correctly recovers the simulated model from the test data set. The approach appears to have great potential both for robust linkage analysis and for the answering of more general questions regarding the genetic control of complex traits.

8.
Effective probabilistic modeling approaches have been developed to find motifs of biological function in DNA sequences. However, the problem of automated model choice remains largely open and becomes more essential as the number of sequences to be analyzed is constantly increasing. Here we propose a reversible jump Markov chain Monte Carlo algorithm for estimating both parameters and model dimension of a Bayesian hidden semi-Markov model dedicated to bacterial promoter motif discovery. Bacterial promoters are complex motifs composed of two boxes separated by a spacer of variable but constrained length and occurring close to the protein translation start site. The algorithm allows simultaneous estimations of the width of the boxes, of the support size of the spacer length distribution, and of the order of the Markovian model used for the "background" nucleotide composition. The application of this method on three sequence sets points out the good behavior of the algorithm and the biological relevance of the estimated promoter motifs.

9.
We performed multipoint linkage analysis of the electrophysiological trait ECB21 on chromosome 4 in the full pedigrees provided by the Collaborative Study on the Genetics of Alcoholism (COGA). Three Markov chain Monte Carlo (MCMC)-based approaches were applied to the provided and re-estimated genetic maps and to five different marker panels consisting of microsatellite (STRP) and/or SNP markers at various densities. We found evidence of linkage near the GABRB1 STRP using all methods, maps, and marker panels. Difficulties encountered with SNP panels included convergence problems and demanding computations.

10.
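The k-means clustering algorithm is an iterative relocation algorithm widely used for cluster analysis of gene expression data. It usually represents the relationships between genes with distance measures, but it cannot effectively reflect the interdependence among genes. To address this, an information-theory-based k-modes clustering algorithm is proposed that overcomes this shortcoming. In addition, a pseudo-F statistic is introduced: on the one hand, it allows points that partially overlap in space to be classified effectively; on the other hand, it gives the optimal number of clusters, compensating for a weakness of k-modes clustering. Together these make the algorithm very effective and yield better clustering results.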
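To illustrate how a pseudo-F statistic can rank candidate clusterings, here is a minimal sketch. It assumes the standard Calinski-Harabasz form for numeric data, F = (B / (k - 1)) / (W / (n - k)), where B and W are the between-cluster and within-cluster sums of squares; the abstract does not state the exact form used with k-modes, so the formula and the function name are assumptions.

```python
def pseudo_f(points, labels):
    """Pseudo-F statistic (Calinski-Harabasz form) for a labelled clustering."""
    n, dim = len(points), len(points[0])
    clusters = {}
    for point, label in zip(points, labels):
        clusters.setdefault(label, []).append(point)
    k = len(clusters)
    if k < 2 or n <= k:
        raise ValueError("need at least 2 clusters and more points than clusters")

    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    grand_mean = [sum(p[d] for p in points) / n for d in range(dim)]
    between = within = 0.0
    for members in clusters.values():
        centroid = [sum(p[d] for p in members) / len(members) for d in range(dim)]
        between += len(members) * sq_dist(centroid, grand_mean)
        within += sum(sq_dist(p, centroid) for p in members)
    if within == 0.0:
        return float("inf")  # every point sits exactly on its centroid
    return (between / (k - 1)) / (within / (n - k))


if __name__ == "__main__":
    data = [(0.0, 0.0), (0.0, 2.0), (10.0, 0.0), (10.0, 2.0)]
    print(pseudo_f(data, [0, 0, 1, 1]))  # compact, well-separated grouping
    print(pseudo_f(data, [0, 1, 0, 1]))  # grouping that mixes the two blobs
```

To choose the number of clusters, one would run the clustering for several values of k and keep the k with the largest pseudo-F.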

11.
12.
Zhou C, Wakefield J. Biometrics, 2006, 62(2): 515-525
In recent years there has been great interest in making inference for gene expression data collected over time. In this article, we describe a Bayesian hierarchical mixture model for partitioning such data. While conventional approaches cluster the observed data, we assume a nonparametric, random walk model, and partition on the basis of the parameters of this model. The model is flexible and can be tuned to the specific context, respects the order of observations within each curve, acknowledges measurement error, and allows prior knowledge on parameters to be incorporated. The number of partitions may also be treated as unknown, and inferred from the data, in which case computation is carried out via a birth-death Markov chain Monte Carlo algorithm. We first examine the behavior of the model on simulated data, along with a comparison with more conventional approaches, and then analyze meiotic expression data collected over time on fission yeast genes.

13.
MOTIVATION: Cellular processes cause changes over time. Observing and measuring those changes over time allows insights into the how and why of regulation. The experimental platform for doing the appropriate large-scale experiments to obtain time-courses of expression levels is provided by microarray technology. However, the proper way of analyzing the resulting time course data is still very much an issue under investigation. The inherent time dependencies in the data suggest that clustering techniques which reflect those dependencies yield improved performance. RESULTS: We propose to use Hidden Markov Models (HMMs) to account for the horizontal dependencies along the time axis in time course data and to cope with the prevalent errors and missing values. The HMMs are used within a model-based clustering framework. We are given a number of clusters, each represented by one Hidden Markov Model from a finite collection encompassing typical qualitative behavior. Then, our method finds in an iterative procedure cluster models and an assignment of data points to these models that maximizes the joint likelihood of clustering and models. Partially supervised learning--adding groups of labeled data to the initial collection of clusters--is supported. A graphical user interface allows querying an expression profile dataset for time course similar to a prototype graphically defined as a sequence of levels and durations. We also propose a heuristic approach to automate determination of the number of clusters. We evaluate the method on published yeast cell cycle and fibroblasts serum response datasets, and compare them, with favorable results, to the autoregressive curves method.

14.
We present a fast, versatile and adaptive multiscale algorithm for analyzing a wide variety of DNA microarray data. Its primary application is in normalization of array data as well as subsequent identification of 'enriched targets', e.g. differentially expressed genes in expression profiling arrays and enriched sites in ChIP-on-chip experimental data. We show how to accommodate the unique characteristics of ChIP-on-chip data, where the set of 'enriched targets' is large and asymmetric and its proportion of the whole data varies locally. SUPPLEMENTARY INFORMATION: Supplementary figures, related preprint, free software as well as our raw DNA microarray data with PCR validations are available at http://www.math.umn.edu/~lerman/supp/bioinfo06 as well as Bioinformatics online.

15.
One of the most challenging areas in human genetics is the dissection of quantitative traits. In this context, the efficient use of available data is important, including, when possible, use of large pedigrees and many markers for gene mapping. In addition, methods that jointly perform linkage analysis and estimation of the trait model are appealing because they combine the advantages of a model-based analysis with the advantages of methods that do not require prespecification of model parameters for linkage analysis. Here we review a Markov chain Monte Carlo approach for such joint linkage and segregation analysis, which allows analysis of oligogenic traits in the context of multipoint linkage analysis of large pedigrees. We provide an outline for practitioners of the salient features of the method, interpretation of the results, effect of violation of assumptions, and an example analysis of a two-locus trait to illustrate the method.

16.
An improved algorithm for clustering gene expression data   (Total citations: 1; self-citations: 0; citations by others: 1)
MOTIVATION: Recent advancements in microarray technology allow simultaneous monitoring of the expression levels of a large number of genes over different time points. Clustering is an important tool for analyzing such microarray data, typical properties of which are its inherent uncertainty, noise and imprecision. In this article, a two-stage clustering algorithm, which employs a recently proposed variable string length genetic scheme and a multiobjective genetic clustering algorithm, is proposed. It is based on the novel concept of points having significant membership to multiple classes. An iterated version of the well-known Fuzzy C-Means is also utilized for clustering. RESULTS: The significant superiority of the proposed two-stage clustering algorithm as compared to the average linkage method, Self Organizing Map (SOM) and a recently developed weighted Chinese restaurant-based clustering method (CRC), widely used methods for clustering gene expression data, is established on a variety of artificial and publicly available real life data sets. The biological relevance of the clustering solutions is also analyzed.
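The idea of points with significant membership in multiple classes can be seen in plain Fuzzy C-Means, sketched below. This is the textbook alternating update, not the iterated or genetic variant the article proposes; the function name, the fixed iteration count, and the caller-supplied initial centers are assumptions made to keep the example deterministic.

```python
import math


def fuzzy_c_means(points, centers, m=2.0, n_iter=50):
    """Textbook Fuzzy C-Means with fuzzifier m > 1.

    Returns (centers, memberships), where memberships[j][i] is the degree
    to which point j belongs to cluster i; each row sums to 1.
    """
    centers = [list(c) for c in centers]
    dim = len(points[0])
    exponent = 2.0 / (m - 1.0)
    memberships = []
    for _ in range(n_iter):
        # Membership update: u_ji = 1 / sum_k (d_ji / d_jk)^(2 / (m - 1)).
        memberships = []
        for p in points:
            dists = [math.dist(p, c) for c in centers]
            if any(d == 0.0 for d in dists):
                # Point coincides with a center: give it all the membership.
                hits = [d == 0.0 for d in dists]
                row = [1.0 / sum(hits) if h else 0.0 for h in hits]
            else:
                row = [1.0 / sum((d_i / d_k) ** exponent for d_k in dists)
                       for d_i in dists]
            memberships.append(row)
        # Center update: mean of the points weighted by u^m.
        for i in range(len(centers)):
            weights = [row[i] ** m for row in memberships]
            total = sum(weights)
            if total > 0.0:
                centers[i] = [
                    sum(w * p[d] for w, p in zip(weights, points)) / total
                    for d in range(dim)
                ]
    return centers, memberships


if __name__ == "__main__":
    profiles = [(0.0, 0.0), (0.0, 1.0), (10.0, 0.0), (10.0, 1.0), (5.0, 0.5)]
    centers, u = fuzzy_c_means(profiles, centers=[(0.0, 0.0), (10.0, 0.0)])
    for point, row in zip(profiles, u):
        print(point, [round(v, 3) for v in row])
```

In this toy data set the last point lies midway between the two groups, so its membership is split between both clusters, whereas a hard clustering would have to assign it to one of them.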

17.
18.
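With the widespread use of DNA chip technology, gene expression data analysis has become one of the research hotspots in the life sciences. This paper outlines the types of gene expression clustering techniques, the classification and characteristics of the algorithms, and the visualization and annotation of results; describes some popular and novel algorithms; introduces 17 recent software packages and online web service tools; and discusses research trends in software tools.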

19.
Qu P, Qu Y. Biometrics, 2000, 56(4): 1249-1255
After continued treatment with an insecticide, resistant strains will arise within a population of susceptible insects. It is important to know whether there are any resistant strains, what their proportions are, and what their median lethal doses are for the insecticide. Lwin and Martin (1989, Biometrics 45, 721-732) propose a probit mixture model and use the EM algorithm to obtain the maximum likelihood estimates for the parameters. This approach has difficulties in estimating the confidence intervals and in testing the number of components. We propose a Bayesian approach to obtaining the credible intervals for the location and scale of the tolerances in each component and for the mixture proportions by using data augmentation and the Gibbs sampler. We use the Bayes factor for model selection and for determining the number of components. We illustrate the method with data published in Lwin and Martin (1989).

20.
A mixture Markov regression model is proposed to analyze heterogeneous time series data. Mixture quasi‐likelihood is formulated to model time series with mixture components and exogenous variables. The parameters are estimated by quasi‐likelihood estimating equations. A modified EM algorithm is developed for the mixture time series model. The model and proposed algorithm are tested on simulated data and applied to mosquito surveillance data in Peel Region, Canada.
