Similar Literature
20 similar documents found.
1.
MOTIVATION: We consider the problem of clustering a population of Comparative Genomic Hybridization (CGH) data samples using similarity-based clustering methods. A key requirement for clustering is to avoid using the noisy aberrations in the CGH samples. RESULTS: We develop a dynamic programming algorithm to identify a small set of important genomic intervals called markers. The advantage of using these markers is that the potentially noisy genomic intervals are excluded during the clustering process. We also develop two clustering strategies using these markers. The first, a prototype-based approach, maximizes the support for the markers. The second, a similarity-based approach, introduces a new similarity measure called RSim and refines clusters with the aim of maximizing the RSim measure between the samples in the same cluster. Our results demonstrate that the markers we found represent the aberration patterns of cancer types well and that they improve the quality of clustering significantly. AVAILABILITY: All software developed in this paper and all the datasets used are available from the authors upon request.

2.
Array comparative genomic hybridization (aCGH) is a laboratory technique to measure chromosomal copy number changes. A clear biological interpretation of the measurements is obtained by mapping them onto an ordinal scale with the categories loss/normal/gain of a copy. The pattern of gains and losses harbors a level of tumor specificity. Here, we present WECCA (weighted clustering of called aCGH data), a method for weighted clustering of samples on the basis of the ordinal aCGH data. Two similarity measures, particularly suited for ordinal data, are proposed for use in the clustering and are generalized to deal with weighted observations. In addition, a new form of linkage, especially suited for ordinal data, is introduced. In a simulation study, we show that the proposed clustering method is competitive with clustering on the continuous data. We illustrate WECCA with an application to a breast cancer data set, where WECCA finds a clustering that relates better to survival than the original one.
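As a rough illustration of clustering samples on called aCGH data, the sketch below (Python) uses a simple agreement similarity — the fraction of loci with identical loss/normal/gain calls — together with average-linkage hierarchical clustering from SciPy. The similarity, the toy call matrix and the two-cluster cut are assumptions made for illustration; they are not WECCA's own weighted similarities or linkage.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform

# Toy called aCGH matrix: rows = samples, columns = loci,
# entries coded as -1 (loss), 0 (normal), +1 (gain).  Illustrative only.
calls = np.array([
    [-1, -1, 0, 1, 1, 0],
    [-1,  0, 0, 1, 1, 0],
    [ 0,  0, 0, 0, 1, 1],
    [ 0,  1, 0, 0, 1, 1],
])

def agreement_similarity(a, b):
    """Fraction of loci on which two samples carry the same call.

    A simple stand-in for the ordinal similarities proposed in the paper,
    not their actual definition."""
    return np.mean(a == b)

n = calls.shape[0]
sim = np.array([[agreement_similarity(calls[i], calls[j]) for j in range(n)]
                for i in range(n)])

# Hierarchical clustering expects a condensed distance matrix.
dist = squareform(1.0 - sim, checks=False)
tree = linkage(dist, method="average")
labels = fcluster(tree, t=2, criterion="maxclust")
print(labels)  # e.g. [1 1 2 2]: the two loss-bearing samples group together
```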

3.
Microarray-CGH (comparative genomic hybridization) experiments are used to detect and map chromosomal imbalances. A CGH profile can be viewed as a succession of segments that represent homogeneous regions in the genome whose representative sequences share the same relative copy number on average. Segmentation methods constitute a natural framework for the analysis, but they do not provide a biological status for the detected segments. We propose a new model for this segmentation/clustering problem, combining a segmentation model with a mixture model. We present a new hybrid algorithm called dynamic programming-expectation maximization (DP-EM) to estimate the parameters of the model by maximum likelihood. This algorithm combines DP and the EM algorithm. We also propose a model selection heuristic to select the number of clusters and the number of segments. An example of our procedure is presented, based on publicly available data sets. We compare our method to segmentation methods and to hidden Markov models, and we show that the new segmentation/clustering model is a promising alternative that can be applied in the more general context of signal processing.
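The paper couples dynamic programming with EM; the segmentation half on its own can be sketched as the classical dynamic program that splits a profile into K contiguous segments minimizing the within-segment sum of squares (a homoscedastic Gaussian assumption). The simulated profile and the choice of K below are illustrative, and the mixture/EM step that assigns a biological status to each segment is not shown.

```python
import numpy as np

def best_segmentation(y, K):
    """Partition y into K contiguous segments minimizing the total
    within-segment sum of squared deviations (dynamic programming).

    Returns the segment start indices and the optimal cost.  A sketch of
    the segmentation step only; the paper couples it with an EM-estimated
    mixture to label each segment."""
    n = len(y)
    csum = np.concatenate([[0.0], np.cumsum(y)])
    csum2 = np.concatenate([[0.0], np.cumsum(np.asarray(y, float) ** 2)])

    def sse(i, j):
        # Sum of squared deviations of y[i:j] around its mean.
        s, s2, m = csum[j] - csum[i], csum2[j] - csum2[i], j - i
        return s2 - s * s / m

    # cost[k][j] = best cost of splitting y[:j] into k segments.
    cost = np.full((K + 1, n + 1), np.inf)
    cut = np.zeros((K + 1, n + 1), dtype=int)
    cost[0][0] = 0.0
    for k in range(1, K + 1):
        for j in range(k, n + 1):
            for i in range(k - 1, j):
                c = cost[k - 1][i] + sse(i, j)
                if c < cost[k][j]:
                    cost[k][j], cut[k][j] = c, i
    # Backtrack the segment start positions.
    bounds, j = [], n
    for k in range(K, 0, -1):
        j = cut[k][j]
        bounds.append(j)
    return sorted(bounds), cost[K][n]

# Example: a noisy profile with three copy-number levels.
rng = np.random.default_rng(0)
y = np.concatenate([rng.normal(0.0, 0.1, 30),
                    rng.normal(0.6, 0.1, 20),
                    rng.normal(-0.5, 0.1, 25)])
print(best_segmentation(y, K=3)[0])  # expected near [0, 30, 50]
```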

4.
Quantile smoothing of array CGH data
MOTIVATION: Plots of array Comparative Genomic Hybridization (CGH) data often show special patterns: stretches of constant level (copy number) with sharp jumps between them. There can also be much noise. Classic smoothing algorithms do not work well, because they introduce too much rounding. To remedy this, we introduce a fast and effective smoothing algorithm based on penalized quantile regression. It can compute arbitrary quantile curves, but we concentrate on the median to show the trend and the lower and upper quartile curves showing the spread of the data. Two-fold cross-validation is used for optimizing the weight of the penalties. RESULTS: Simulated data and a published dataset are used to show the capabilities of the method to detect the segments of changed copy numbers in array CGH data.
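A minimal sketch of the underlying idea, assuming the penalized quantile regression is posed as check loss plus an L1 penalty on adjacent fitted values and solved as a linear program with SciPy; the paper's implementation, penalty weighting and cross-validation are not reproduced, and the data below are simulated.

```python
import numpy as np
from scipy.optimize import linprog

def quantile_smooth(y, tau=0.5, lam=2.0):
    """Fit a quantile curve z to y by minimizing
    sum of check-loss(y_i - z_i) + lam * sum |z_{i+1} - z_i|,
    written as a linear program (variables: z, residual parts, jump parts)."""
    y = np.asarray(y, float)
    n = len(y)
    nz, nr, nd = n, n, n - 1
    # Variable layout: [z (n) | r+ (n) | r- (n) | d+ (n-1) | d- (n-1)]
    c = np.concatenate([np.zeros(nz), tau * np.ones(nr), (1 - tau) * np.ones(nr),
                        lam * np.ones(nd), lam * np.ones(nd)])
    A_eq = np.zeros((n + n - 1, len(c)))
    b_eq = np.zeros(n + n - 1)
    # z_i + r+_i - r-_i = y_i  (residual split into positive/negative parts)
    for i in range(n):
        A_eq[i, i] = 1.0
        A_eq[i, nz + i] = 1.0
        A_eq[i, nz + nr + i] = -1.0
        b_eq[i] = y[i]
    # z_{i+1} - z_i - d+_i + d-_i = 0  (jump split into positive/negative parts)
    for i in range(n - 1):
        r = n + i
        A_eq[r, i + 1] = 1.0
        A_eq[r, i] = -1.0
        A_eq[r, nz + 2 * nr + i] = -1.0
        A_eq[r, nz + 2 * nr + nd + i] = 1.0
    bounds = [(None, None)] * nz + [(0, None)] * (2 * nr + 2 * nd)
    res = linprog(c, A_eq=A_eq, b_eq=b_eq, bounds=bounds, method="highs")
    return res.x[:nz]

# Simulated profile: two copy-number levels plus noise and a few outliers.
rng = np.random.default_rng(1)
y = np.concatenate([rng.normal(0.0, 0.2, 40), rng.normal(0.8, 0.2, 40)])
y[[5, 60]] += 3.0
median_curve = quantile_smooth(y, tau=0.5, lam=2.0)
print(np.round(median_curve[::10], 2))  # roughly 0 for the first half, 0.8 after
```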

5.

Background  

Array-based comparative genomic hybridization (CGH) is a commonly-used approach to detect DNA copy number variation in genome-wide screens. Several statistical methods have been proposed to define genomic segments with different copy numbers in cancer tumors. However, most tumors are heterogeneous and show variation in DNA copy numbers across tumor cells. The challenge is to reveal the copy number profiles of the subpopulations in a tumor and to estimate the percentage of each subpopulation.

6.
Current advances in next-generation sequencing techniques have allowed researchers to conduct comprehensive research on the microbiome and human diseases, with recent studies identifying associations between the human microbiome and health outcomes for a number of chronic conditions. However, microbiome data structure, characterized by sparsity and skewness, presents challenges to building effective classifiers. To address this, we present an innovative approach for distance-based classification using mixture distributions (DCMD). The method aims to improve classification performance using microbiome community data, where the predictors are composed of sparse and heterogeneous count data. This approach models the inherent uncertainty in sparse counts by estimating a mixture distribution for the sample data and representing each observation as a distribution, conditional on observed counts and the estimated mixture, which are then used as inputs for distance-based classification. The method is implemented in a k-means classification and k-nearest-neighbours framework. We develop two distance metrics that produce optimal results. The performance of the model is assessed using simulated and human microbiome study data, with results compared against a number of existing machine learning and distance-based classification approaches. The proposed method is competitive when compared to the other machine learning approaches, and shows a clear improvement over commonly used distance-based classifiers, underscoring the importance of modelling sparsity for achieving optimal results. The range of applicability and robustness make the proposed method a viable alternative for classification using sparse microbiome count data. The source code is available at https://github.com/kshestop/DCMD for academic use.
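DCMD represents each observation as a distribution under an estimated mixture before computing distances; the sketch below shows only the kind of plain distance-based k-nearest-neighbour baseline such a method is compared against, using a Bray-Curtis distance on raw counts. The toy taxa table and the choice of distance are assumptions, not part of the DCMD method.

```python
import numpy as np
from scipy.spatial.distance import cdist

def knn_predict(train_counts, train_labels, test_counts, k=3):
    """Plain k-nearest-neighbour vote using Bray-Curtis distance on count
    vectors -- a baseline of the kind DCMD is compared against, not DCMD."""
    d = cdist(test_counts, train_counts, metric="braycurtis")
    idx = np.argsort(d, axis=1)[:, :k]
    votes = train_labels[idx]
    return np.array([np.bincount(v).argmax() for v in votes])

# Tiny synthetic microbiome-like table: rows = samples, columns = taxa counts.
rng = np.random.default_rng(2)
healthy = rng.poisson([5, 0.2, 3, 0.1, 8], size=(10, 5))
disease = rng.poisson([0.5, 6, 0.3, 4, 8], size=(10, 5))
X = np.vstack([healthy, disease])
y = np.array([0] * 10 + [1] * 10)
X_test = np.vstack([rng.poisson([5, 0.2, 3, 0.1, 8], size=(3, 5)),
                    rng.poisson([0.5, 6, 0.3, 4, 8], size=(3, 5))])
print(knn_predict(X, y, X_test, k=3))  # expected close to [0 0 0 1 1 1]
```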

7.

Background  

Microarray-CGH experiments are used to detect and map chromosomal imbalances, by hybridizing targets of genomic DNA from a test and a reference sample to sequences immobilized on a slide. These probes are genomic DNA sequences (BACs) that are mapped on the genome. The signal has a spatial coherence that can be handled by specific statistical tools. Segmentation methods seem to be a natural framework for this purpose. A CGH profile can be viewed as a succession of segments that represent homogeneous regions in the genome whose BACs share the same relative copy number on average. We model a CGH profile by a random Gaussian process whose distribution parameters are affected by abrupt changes at unknown coordinates. Two major problems arise: to determine which parameters are affected by the abrupt changes (the mean and the variance, or the mean only), and to select the number of segments in the profile.

8.
Classification and feature selection algorithms for multi-class CGH data
Recurrent chromosomal alterations provide cytological and molecular positions for the diagnosis and prognosis of cancer. Comparative genomic hybridization (CGH) has been useful in understanding these alterations in cancerous cells. CGH datasets consist of samples that are represented by large dimensional arrays of intervals. Each sample consists of long runs of intervals with losses and gains. In this article, we develop novel SVM-based methods for classification and feature selection of CGH data. For classification, we developed a novel similarity kernel that is shown to be more effective than the standard linear kernel used in SVM. For feature selection, we propose a novel method based on the new kernel that iteratively selects features that provide the maximum benefit for classification. We compared our methods against the best wrapper-based and filter-based approaches that have been used for feature selection of large dimensional biological data. Our results on datasets generated from the Progenetix database suggest that our methods are considerably superior to existing methods. AVAILABILITY: All software developed in this article can be downloaded from http://plaza.ufl.edu/junliu/feature.tar.gz.
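The paper's kernel is defined on runs of gained and lost intervals; as a hedged stand-in, the sketch below uses a simple matching kernel (the count of intervals on which two samples share the same aberration status), which is a valid kernel because it is an inner product of gain/loss indicator encodings, and plugs it into scikit-learn's SVC through kernel='precomputed'. The call matrices are illustrative, not the paper's kernel or data.

```python
import numpy as np
from sklearn.svm import SVC

def status_kernel(A, B):
    """K[i, j] = number of intervals where sample A_i and B_j share the same
    aberration status (both gain or both loss).  Equivalent to an inner
    product of one-hot gain/loss encodings, hence a valid kernel."""
    gain_a, loss_a = (A > 0).astype(float), (A < 0).astype(float)
    gain_b, loss_b = (B > 0).astype(float), (B < 0).astype(float)
    return gain_a @ gain_b.T + loss_a @ loss_b.T

# Toy CGH call matrices: -1 loss, 0 normal, +1 gain (rows = samples).
X_train = np.array([[1, 1, 0, -1, 0, 0],
                    [1, 0, 0, -1, -1, 0],
                    [0, 0, 1, 0, 1, 1],
                    [0, -1, 1, 0, 1, 1]])
y_train = np.array([0, 0, 1, 1])
X_test = np.array([[1, 1, 0, -1, 0, -1],
                   [0, 0, 1, 1, 1, 0]])

clf = SVC(kernel="precomputed", C=1.0)
clf.fit(status_kernel(X_train, X_train), y_train)
print(clf.predict(status_kernel(X_test, X_train)))  # expected [0 1]
```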

9.
Robust smooth segmentation approach for array CGH data analysis
MOTIVATION: Array comparative genomic hybridization (aCGH) provides a genome-wide technique to screen for copy number alteration. The existing segmentation approaches for analyzing aCGH data are based on modeling data as a series of discrete segments with unknown boundaries and unknown heights. Although the biological process of copy number alteration is discrete, in reality a variety of biological and experimental factors can cause the signal to deviate from a stepwise function. To take this into account, we propose a smooth segmentation (smoothseg) approach. METHODS: To achieve a robust segmentation, we use a doubly heavy-tailed random-effect model. The first heavy-tailed structure on the errors deals with outliers in the observations, and the second deals with possible jumps in the underlying pattern associated with different segments. We develop a fast and reliable computational procedure based on the iterative weighted least-squares algorithm with band-limited matrix inversion. RESULTS: Using simulated and real data sets, we demonstrate how smoothseg can aid in identification of regions with genomic alteration and in classification of samples. For the real data sets, smoothseg leads to smaller false discovery rate and classification error rate than the circular binary segmentation (CBS) algorithm. In a realistic simulation setting, smoothseg is better than wavelet smoothing and CBS in identification of regions with genomic alterations and better than CBS in classification of samples. For comparative analyses, we demonstrate that segmenting the t-statistics performs better than segmenting the data. AVAILABILITY: The R package smoothseg to perform smooth segmentation is available from http://www.meb.ki.se/~yudpaw.

10.

Background  

In two-channel competitive genomic hybridization microarray experiments, the ratio of the two fluorescent signal intensities at each spot on the microarray is commonly used to infer the relative amounts of the test and reference sample DNA levels. This ratio may be influenced by systematic measurement effects from non-biological sources that can introduce biases in the estimated ratios. These biases should be removed before drawing conclusions about the relative levels of DNA. The performance of existing gene expression microarray normalization strategies has not been evaluated for removing the systematic biases encountered in array-based comparative genomic hybridization (CGH), which aims to detect single-copy gains and losses, typically in samples with heterogeneous cell populations that yield only slight shifts in signal ratios. The purpose of this work is to establish a framework for correcting the systematic sources of variation in high density CGH array images, while maintaining the true biological variations.
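As background to the kind of systematic, intensity-dependent bias such a framework targets, the sketch below computes per-spot log2 ratios from the two channels and subtracts a running-median trend over spot intensity. This is a generic illustration of ratio normalization on simulated data, not the normalization framework developed in the paper.

```python
import numpy as np

def normalize_log_ratios(test_intensity, ref_intensity, window=101):
    """Compute log2(test/ref) per spot and subtract an intensity-dependent
    running-median trend (a crude analogue of loess normalization).
    Illustrative only."""
    m = np.log2(test_intensity) - np.log2(ref_intensity)          # log ratio
    a = 0.5 * (np.log2(test_intensity) + np.log2(ref_intensity))  # mean intensity
    order = np.argsort(a)
    m_sorted = m[order]
    trend = np.empty_like(m)
    half = window // 2
    for rank, spot in enumerate(order):
        lo, hi = max(0, rank - half), min(len(m), rank + half + 1)
        trend[spot] = np.median(m_sorted[lo:hi])
    return m - trend

# Simulated array: an intensity-dependent dye bias plus a small region of
# true single-copy gain (log2 ratio ~ 0.58).
rng = np.random.default_rng(3)
ref = rng.lognormal(mean=8, sigma=1, size=2000)
true = np.zeros(2000)
true[900:1000] = 0.58
bias = 0.1 * (np.log2(ref) - np.log2(ref).mean())   # systematic trend
test = ref * 2 ** (true + bias + rng.normal(0, 0.1, 2000))
corrected = normalize_log_ratios(test, ref)
# Gained region should sit near 0.58, the rest near 0 after correction.
print(round(corrected[900:1000].mean(), 2), round(corrected[:900].mean(), 2))
```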

11.
Exploring the pathogenesis of cancer using CGH data and tree models
Li Xiaobo, Chen Jian, Lü Bingjian, Lai Maode. Hereditas (Beijing), 2008, 30(4): 407-412
Comparative genomic hybridization (CGH) is mainly used to detect chromosomal deletions and amplifications in tumors. A large body of experimental data has accumulated to date, making genome-wide analysis of the mechanisms of tumorigenesis possible. In bioinformatics, tree models are commonly used to study the history of biological formation and evolution, and evolutionary relationships between species are often represented by phylogenetic trees. Tree models can likewise serve as a powerful bioinformatics tool for analyzing CGH data and exploring the pathogenesis of cancer. This article introduces two common tree models, branching trees and distance trees, describes the basic principles and methods for reconstructing them, analyzes several technical issues to be considered when building tree models, and reviews their applications in tumor research. As a generalization of the single-path linear model, the tree model of tumor progression overcomes the shortcomings of earlier single-path linear models and can, in principle, more accurately capture the multi-gene, multi-path, multi-stage pattern of tumor initiation and progression, shedding light on the molecular mechanisms of tumor development from different angles. Besides tumor CGH data, the model can also be applied to many other types of data, including the high-resolution data produced by array CGH (array-CGH) and related techniques.

12.

Background

While the theory of enzyme kinetics is fundamental to analyzing and simulating biochemical systems, the derivation of rate equations for complex mechanisms of enzyme-catalyzed reactions is cumbersome and error prone. Therefore, a number of algorithms and related computer programs have been developed to assist in such derivations. Yet although a number of algorithms, programs, and software packages are reported in the literature, one or more significant limitations are associated with each of these tools. Furthermore, none is freely available for download and use by the community.

Results

We have implemented an algorithm based on the schematic method of King and Altman (KA) that employs the topological theory of linear graphs for systematic generation of valid reaction patterns in a GUI-based stand-alone computer program called KAPattern. The underlying algorithm allows for the assumption of steady state, rapid equilibrium binding, and/or irreversibility for individual steps in catalytic mechanisms. The program can automatically generate MathML and MATLAB output files that users can easily incorporate into simulation programs.

Conclusion

A computer program, called KAPattern, for generating rate equations for complex enzyme systems is freely available and can be accessed at http://www.biocoda.org.
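King-Altman reaction patterns for an n-state mechanism correspond to the spanning trees of the graph of enzyme states, so a brute-force sketch can enumerate them for a small mechanism given as an edge list. The four-state mechanism below is made up for illustration, and KAPattern's graph-theoretic algorithm is far more efficient than this exhaustive enumeration.

```python
from itertools import combinations

def spanning_trees(nodes, edges):
    """Enumerate all spanning trees of an undirected graph by brute force.
    Each spanning tree corresponds to one King-Altman pattern."""
    n = len(nodes)
    trees = []
    for subset in combinations(edges, n - 1):
        # Union-find check: n-1 edges form a spanning tree iff they are acyclic.
        parent = {v: v for v in nodes}
        def find(v):
            while parent[v] != v:
                parent[v] = parent[parent[v]]
                v = parent[v]
            return v
        acyclic = True
        for u, v in subset:
            ru, rv = find(u), find(v)
            if ru == rv:
                acyclic = False
                break
            parent[ru] = rv
        if acyclic:
            trees.append(subset)
    return trees

# Hypothetical four-state mechanism sketched as a cycle with a chord
# (states E, EA, EAB, EQ); the edge list is made up for illustration.
states = ["E", "EA", "EAB", "EQ"]
edges = [("E", "EA"), ("EA", "EAB"), ("EAB", "EQ"), ("EQ", "E"), ("EA", "EQ")]
patterns = spanning_trees(states, edges)
print(len(patterns))        # 8 patterns for this 4-state, 5-edge graph
for p in patterns[:2]:
    print(p)
```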

13.

Background  

Existing methods for analyzing bacterial CGH data from two-color arrays are based on log-ratios only, a paradigm inherited from expression studies. We propose an alternative approach, where microarray signals are used in a different way and sequence identity is predicted using a supervised learning approach.

14.
MOTIVATION: In recent years, a range of techniques for analysis and segmentation of array comparative genomic hybridization (aCGH) data have been proposed. For array designs in which clones are of unequal lengths, are unevenly spaced or overlap, the discrete-index view typically adopted by such methods may be questionable or open to improvement. RESULTS: We describe a continuous-index hidden Markov model for aCGH data as well as a Monte Carlo EM algorithm to estimate its parameters. It is shown that for a dataset from the BT-474 cell line analysed on 32K BAC tiling microarrays, this model yields considerably better model fit in terms of lag-1 residual autocorrelations compared to a discrete-index HMM, and it is also shown how to use the model, e.g. for estimation of change points on the base-pair scale and for estimation of conditional state probabilities across the genome. In addition, the model is applied to the Glioblastoma Multiforme data used in the comparative study by Lai et al. (Lai, W.R. et al. (2005) Comparative analysis of algorithms for identifying amplifications and deletions in array CGH data. Bioinformatics, 21, 3763-3770.), giving results similar to theirs but with certain features highlighted in the continuous-index setting.

15.
Modeling recurrent DNA copy number alterations in array CGH data
MOTIVATION: Recurrent DNA copy number alterations (CNA) measured with array comparative genomic hybridization (aCGH) reveal important molecular features of human genetics and disease. Studying aCGH profiles from a phenotypic group of individuals can determine important recurrent CNA patterns that suggest a strong correlation to the phenotype. Computational approaches to detecting recurrent CNAs from a set of aCGH experiments have typically relied on discretizing the noisy log ratios and subsequently inferring patterns. We demonstrate that this can have the effect of filtering out important signals present in the raw data. In this article we develop statistical models that jointly infer CNA patterns and the discrete labels by borrowing statistical strength across samples. RESULTS: We propose extending single sample aCGH HMMs to the multiple sample case in order to infer shared CNAs. We model recurrent CNAs as a profile encoded by a master sequence of states that generates the samples. We show how to improve on two basic models by performing joint inference of the discrete labels and providing sparsity in the output. We demonstrate on synthetic ground truth data and real data from lung cancer cell lines how these two important features of our model improve results over baseline models. We include standard quantitative metrics and a qualitative assessment on which to base our conclusions. AVAILABILITY: http://www.cs.ubc.ca/~sshah/acgh.

16.
CNVDetector is a program for locating copy number variations (CNVs) in a single genome. CNVDetector has several merits: (i) it can deal with array comparative genomic hybridization data even if the noise is not normally distributed; (ii) it has a linear time kernel; (iii) its parameters can be easily selected; (iv) it evaluates the statistical significance of each CNV call. AVAILABILITY: CNVDetector (for the Windows platform) can be downloaded from http://www.csie.ntu.edu.tw/~kmchao/tools/CNVDetector/. The manual of CNVDetector is also available.

17.
A method for calling gains and losses in array CGH data
Array CGH is a powerful technique for genomic studies of cancer. It enables one to carry out genome-wide screening for regions of genetic alterations, such as chromosome gains and losses, or localized amplifications and deletions. In this paper, we propose a new algorithm 'Cluster along chromosomes' (CLAC) for the analysis of array CGH data. CLAC builds hierarchical clustering-style trees along each chromosome arm (or chromosome), and then selects the 'interesting' clusters by controlling the False Discovery Rate (FDR) at a certain level. In addition, it provides a consensus summary across a set of arrays, as well as an estimate of the corresponding FDR. We illustrate the method using an application of CLAC on a lung cancer microarray CGH data set as well as a BAC array CGH data set of aneuploid cell strains.
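The final selection step, keeping only clusters that pass an FDR threshold, can be illustrated generically with the Benjamini-Hochberg procedure on a vector of per-cluster p-values. The p-values below are invented, and CLAC's own FDR estimate is based on a consensus across arrays rather than this textbook procedure.

```python
import numpy as np

def benjamini_hochberg(pvals, q=0.05):
    """Return a boolean mask of the p-values declared significant at FDR q
    using the Benjamini-Hochberg step-up procedure."""
    p = np.asarray(pvals, float)
    m = len(p)
    order = np.argsort(p)
    thresholds = q * (np.arange(1, m + 1) / m)
    passed = p[order] <= thresholds
    keep = np.zeros(m, dtype=bool)
    if passed.any():
        k = np.max(np.nonzero(passed)[0])      # largest i with p_(i) <= q*i/m
        keep[order[:k + 1]] = True
    return keep

# Hypothetical per-cluster p-values from scoring candidate gain/loss regions.
pvals = [0.0003, 0.002, 0.009, 0.04, 0.08, 0.2, 0.45, 0.7]
print(benjamini_hochberg(pvals, q=0.05))  # first three clusters retained
```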

18.
Besides the problem of searching for effective methods for data analysis, there are additional problems in handling data of high uncertainty. Uncertainty problems often arise in the analysis of ecological data, e.g. in the cluster analysis of ecological data. Conventional clustering methods based on Boolean logic ignore the continuous nature of ecological variables and the uncertainty of ecological data. That can result in misclassification or misinterpretation of the data structure. Clusters with fuzzy boundaries reflect better the continuous character of ecological features. The problem, however, is that common clustering methods (like the fuzzy c-means method) are designed only for treating crisp data; that is, they provide a fuzzy partition only for crisp data (e.g. exact measurement data). This paper presents the extension and implementation of the method of fuzzy clustering of fuzzy data proposed by Yang and Liu [Yang, M.-S. and Liu, H.-H., 1999. Fuzzy clustering procedures for conical fuzzy vector data. Fuzzy Sets and Systems, 106, 189-200.]. The imprecise data can be defined as multidimensional fuzzy sets with boundaries that are not sharply formed (in the form of so-called conical fuzzy vectors). They can then be used for fuzzy clustering together with crisp data. That can be particularly useful when information about the variances which describe the accuracy of the data is not available and probabilistic approaches are impossible. The method proposed by Yang has been extended and implemented for the Fuzzy Clustering System EcoFucs developed at the University of Kiel. As an example, the paper presents a fuzzy cluster analysis of chemicals according to their ecotoxicological properties. The uncertainty and imprecision of ecotoxicological data are very high because of the use of various data sources, various investigation tests and the difficulty of comparing these data. The implemented method can be very helpful in searching for an adequate partition of ecological data into clusters with similar properties.
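For reference, the standard fuzzy c-means iteration for crisp data, which is the algorithm the paper generalizes to conical fuzzy vectors, can be sketched in a few lines (random toy data, two clusters and fuzzifier m = 2 assumed):

```python
import numpy as np

def fuzzy_c_means(X, c=2, m=2.0, n_iter=100, seed=0):
    """Standard fuzzy c-means for crisp data: alternate membership and
    centroid updates.  The paper's method generalizes this scheme to
    imprecise (conical fuzzy vector) observations."""
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    U = rng.dirichlet(np.ones(c), size=n)          # memberships, rows sum to 1
    for _ in range(n_iter):
        W = U ** m
        centers = (W.T @ X) / W.sum(axis=0)[:, None]
        # Squared distances of every point to every centre.
        d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        d2 = np.maximum(d2, 1e-12)
        inv = d2 ** (-1.0 / (m - 1.0))
        U = inv / inv.sum(axis=1, keepdims=True)
    return U, centers

# Two blobs of "chemicals" described by two crisp ecotoxicological features.
rng = np.random.default_rng(4)
X = np.vstack([rng.normal([0, 0], 0.3, size=(20, 2)),
               rng.normal([2, 2], 0.3, size=(20, 2))])
U, centers = fuzzy_c_means(X, c=2)
print(np.round(centers, 1))   # roughly the two blob centres (0,0) and (2,2)
```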

19.
A new result report for Mascot search results is described. A greedy set cover algorithm is used to create a minimal set of proteins, which is then grouped into families on the basis of shared peptide matches. Protein families with multiple members are represented by dendrograms, generated by hierarchical clustering using the score of the nonshared peptide matches as a distance metric. The peptide matches to the proteins in a family can be compared side by side to assess the experimental evidence for each protein. If the evidence for a particular family member is considered inadequate, the dendrogram can be cut to reduce the number of distinct family members.
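A minimal sketch of the greedy set-cover step, assuming proteins are given as sets of matched peptide identifiers; the family grouping and dendrogram cutting described above are separate steps not shown here.

```python
def greedy_protein_cover(protein_to_peptides):
    """Greedy set cover: repeatedly pick the protein explaining the most
    not-yet-covered peptide matches, until every peptide is accounted for.
    Returns the selected proteins in order of selection."""
    uncovered = set().union(*protein_to_peptides.values())
    chosen = []
    while uncovered:
        # Protein covering the largest number of remaining peptides.
        best = max(protein_to_peptides,
                   key=lambda p: len(protein_to_peptides[p] & uncovered))
        gained = protein_to_peptides[best] & uncovered
        if not gained:           # nothing left can be explained
            break
        chosen.append(best)
        uncovered -= gained
    return chosen

# Hypothetical search result: peptides matched to each protein accession.
matches = {
    "ALBU_HUMAN": {"pep1", "pep2", "pep3", "pep4"},
    "ALBU_BOVIN": {"pep2", "pep3", "pep4"},          # subset of ALBU_HUMAN
    "TRFE_HUMAN": {"pep5", "pep6"},
    "K2C1_HUMAN": {"pep7"},
}
print(greedy_protein_cover(matches))  # ['ALBU_HUMAN', 'TRFE_HUMAN', 'K2C1_HUMAN']
```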

20.
Gene-Ontology-based clustering of gene expression data
The expected correlation between genetic co-regulation and affiliation with a common biological process does not necessarily emerge when numerical clustering algorithms are applied to gene expression data. GO-Cluster uses the tree structure of the Gene Ontology database as a framework for numerical clustering, thus allowing a simple visualization of gene expression data at various levels of the ontology tree. AVAILABILITY: The 32-bit Windows application is freely available at http://www.mpibpc.mpg.de/go-cluster/
