Similar articles
 Found 20 similar articles (search time: 0 ms)
1.
MOTIVATION: In recent years, a range of techniques for analysis and segmentation of array comparative genomic hybridization (aCGH) data have been proposed. For array designs in which clones are of unequal lengths, are unevenly spaced or overlap, the discrete-index view typically adopted by such methods may be questionable or open to improvement. RESULTS: We describe a continuous-index hidden Markov model for aCGH data as well as a Monte Carlo EM algorithm to estimate its parameters. It is shown that for a dataset from the BT-474 cell line analysed on 32K BAC tiling microarrays, this model yields considerably better model fit in terms of lag-1 residual autocorrelations compared to a discrete-index HMM, and it is also shown how to use the model for e.g. estimation of change points on the base-pair scale and for estimation of conditional state probabilities across the genome. In addition, the model is applied to the Glioblastoma Multiforme data used in the comparative study by Lai et al. (Lai, W.R. et al. (2005) Comparative analysis of algorithms for identifying amplifications and deletions in array CGH data. Bioinformatics, 21, 3763-3770.), giving results similar to theirs but with certain features highlighted in the continuous-index setting.
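
To illustrate what a continuous index can mean for an HMM, the sketch below computes transition probabilities that depend on the genomic distance between two clones rather than on their position in the array. It is only a minimal two-state example: the abstract does not state the model's parameterization, so the function name `two_state_transition`, the two rates and the use of a two-state continuous-time Markov chain are assumptions made for illustration.

```python
import math

def two_state_transition(rate_01, rate_10, distance):
    """Transition matrix of a two-state continuous-time Markov chain.

    rate_01 and rate_10 are hypothetical switching rates per base pair;
    distance is the gap between two clones in base pairs.
    """
    total = rate_01 + rate_10
    decay = math.exp(-total * distance)
    # Stationary probabilities of state 0 and state 1.
    pi0, pi1 = rate_10 / total, rate_01 / total
    return [
        [pi0 + pi1 * decay, pi1 * (1.0 - decay)],
        [pi0 * (1.0 - decay), pi1 + pi0 * decay],
    ]

# Nearby clones almost surely share a state; distant clones are nearly independent.
near = two_state_transition(1e-6, 3e-6, 1_000)
far = two_state_transition(1e-6, 3e-6, 10_000_000)
```

With unevenly spaced clones, each pair of neighbours gets its own matrix, which is the practical difference from a discrete-index HMM that reuses one matrix for every step.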

2.
3.
Microarray-CGH (comparative genomic hybridization) experiments are used to detect and map chromosomal imbalances. A CGH profile can be viewed as a succession of segments that represent homogeneous regions in the genome whose representative sequences share the same relative copy number on average. Segmentation methods constitute a natural framework for the analysis, but they do not provide a biological status for the detected segments. We propose a new model for this segmentation/clustering problem, combining a segmentation model with a mixture model. We present a new hybrid algorithm called dynamic programming-expectation maximization (DP-EM) to estimate the parameters of the model by maximum likelihood. This algorithm combines DP and the EM algorithm. We also propose a model selection heuristic to select the number of clusters and the number of segments. An example of our procedure is presented, based on publicly available data sets. We compare our method to segmentation methods and to hidden Markov models, and we show that the new segmentation/clustering model is a promising alternative that can be applied in the more general context of signal processing.

4.

Background  

Microarray-CGH experiments are used to detect and map chromosomal imbalances, by hybridizing targets of genomic DNA from a test and a reference sample to sequences immobilized on a slide. These probes are genomic DNA sequences (BACs) that are mapped on the genome. The signal has a spatial coherence that can be handled by specific statistical tools. Segmentation methods seem to be a natural framework for this purpose. A CGH profile can be viewed as a succession of segments that represent homogeneous regions in the genome whose BACs share the same relative copy number on average. We model a CGH profile by a random Gaussian process whose distribution parameters are affected by abrupt changes at unknown coordinates. Two major problems arise: determining which parameters are affected by the abrupt changes (the mean and the variance, or the mean only), and selecting the number of segments in the profile.
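
The segmentation idea described above can be sketched with a standard dynamic program. This is a minimal illustration and not the authors' implementation: it assumes changes in the mean only, a common variance, and a number of segments fixed in advance, and the function name `segment_mean` is hypothetical.

```python
import numpy as np

def segment_mean(y, n_segments):
    """Least-squares split of y into n_segments pieces with constant mean.

    Returns the segment boundaries [0, b1, ..., len(y)].
    """
    y = np.asarray(y, dtype=float)
    n = len(y)
    # Prefix sums give the squared error of any segment in O(1).
    s1 = np.concatenate(([0.0], np.cumsum(y)))
    s2 = np.concatenate(([0.0], np.cumsum(y ** 2)))

    def cost(i, j):
        # Squared error of y[i:j] around its own mean.
        return s2[j] - s2[i] - (s1[j] - s1[i]) ** 2 / (j - i)

    # best[k, j]: minimal cost of splitting y[:j] into k segments.
    best = np.full((n_segments + 1, n + 1), np.inf)
    back = np.zeros((n_segments + 1, n + 1), dtype=int)
    best[0, 0] = 0.0
    for k in range(1, n_segments + 1):
        for j in range(k, n + 1):
            for i in range(k - 1, j):
                c = best[k - 1, i] + cost(i, j)
                if c < best[k, j]:
                    best[k, j] = c
                    back[k, j] = i
    # Trace the optimal boundaries back from the end of the profile.
    bounds = [n]
    for k in range(n_segments, 0, -1):
        bounds.append(int(back[k, bounds[-1]]))
    return bounds[::-1]

profile = [0.0] * 5 + [1.0] * 4 + [0.0] * 6
boundaries = segment_mean(profile, 3)
```

Choosing `n_segments` is the model-selection problem the abstract mentions; this sketch leaves it to the caller.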

5.
Quantile smoothing of array CGH data   Total citations: 4, self-citations: 0, citations by others: 4
MOTIVATION: Plots of array Comparative Genomic Hybridization (CGH) data often show special patterns: stretches of constant level (copy number) with sharp jumps between them. There can also be much noise. Classic smoothing algorithms do not work well, because they introduce too much rounding. To remedy this, we introduce a fast and effective smoothing algorithm based on penalized quantile regression. It can compute arbitrary quantile curves, but we concentrate on the median to show the trend, and on the lower and upper quartile curves to show the spread of the data. Two-fold cross-validation is used for optimizing the weight of the penalties. RESULTS: Simulated data and a published dataset are used to show the capabilities of the method to detect the segments of changed copy numbers in array CGH data.

6.
Robust smooth segmentation approach for array CGH data analysis   Total citations: 2, self-citations: 0, citations by others: 2
MOTIVATION: Array comparative genomic hybridization (aCGH) provides a genome-wide technique to screen for copy number alteration. The existing segmentation approaches for analyzing aCGH data are based on modeling data as a series of discrete segments with unknown boundaries and unknown heights. Although the biological process of copy number alteration is discrete, in reality a variety of biological and experimental factors can cause the signal to deviate from a stepwise function. To take this into account, we propose a smooth segmentation (smoothseg) approach. METHODS: To achieve a robust segmentation, we use a doubly heavy-tailed random-effect model. The first heavy-tailed structure on the errors deals with outliers in the observations, and the second deals with possible jumps in the underlying pattern associated with different segments. We develop a fast and reliable computational procedure based on the iterative weighted least-squares algorithm with band-limited matrix inversion. RESULTS: Using simulated and real data sets, we demonstrate how smoothseg can aid in identification of regions with genomic alteration and in classification of samples. For the real data sets, smoothseg leads to a smaller false discovery rate and classification error rate than the circular binary segmentation (CBS) algorithm. In a realistic simulation setting, smoothseg is better than wavelet smoothing and CBS in identification of regions with genomic alterations and better than CBS in classification of samples. For comparative analyses, we demonstrate that segmenting the t-statistics performs better than segmenting the data. AVAILABILITY: The R package smoothseg to perform smooth segmentation is available from http://www.meb.ki.se/~yudpaw.

7.
8.
9.
Array comparative genomic hybridization (aCGH) is a laboratory technique to measure chromosomal copy number changes. A clear biological interpretation of the measurements is obtained by mapping these onto an ordinal scale with categories loss/normal/gain of a copy. The pattern of gains and losses harbors a level of tumor specificity. Here, we present WECCA (weighted clustering of called aCGH data), a method for weighted clustering of samples on the basis of the ordinal aCGH data. Two similarities to be used in the clustering and particularly suited for ordinal data are proposed, which are generalized to deal with weighted observations. In addition, a new form of linkage, especially suited for ordinal data, is introduced. In a simulation study, we show that the proposed cluster method is competitive with clustering using the continuous data. We illustrate WECCA using an application to a breast cancer data set, where WECCA finds a clustering that relates better to survival than the original one.

10.

Background  

In two-channel competitive genomic hybridization microarray experiments, the ratio of the two fluorescent signal intensities at each spot on the microarray is commonly used to infer the relative amounts of the test and reference sample DNA levels. This ratio may be influenced by systematic measurement effects from non-biological sources that can introduce biases in the estimated ratios. These biases should be removed before drawing conclusions about the relative levels of DNA. The performance of existing gene expression microarray normalization strategies has not been evaluated for removing systematic biases encountered in array-based comparative genomic hybridization (CGH), which aims to detect single copy gains and losses typically in samples with heterogeneous cell populations resulting in only slight shifts in signal ratios. The purpose of this work is to establish a framework for correcting the systematic sources of variation in high density CGH array images, while maintaining the true biological variations.

11.
CNVDetector is a program for locating copy number variations (CNVs) in a single genome. CNVDetector has several merits: (i) it can deal with the array comparative genomic hybridization data even if the noise is not normally distributed; (ii) it has a linear time kernel; (iii) its parameters can be easily selected; (iv) it evaluates the statistical significance of each CNV call. AVAILABILITY: CNVDetector (for Windows platform) can be downloaded from http://www.csie.ntu.edu.tw/~kmchao/tools/CNVDetector/. The manual of CNVDetector is also available.

12.
We describe a hidden Markov model, HMMSTR, for general protein sequence based on the I-sites library of sequence-structure motifs. Unlike the linear hidden Markov models used to model individual protein families, HMMSTR has a highly branched topology and captures recurrent local features of protein sequences and structures that transcend protein family boundaries. The model extends the I-sites library by describing the adjacencies of different sequence-structure motifs as observed in the protein database and, by representing overlapping motifs in a much more compact form, achieves a great reduction in parameters. The HMM attributes a considerably higher probability to coding sequence than does an equivalent dipeptide model, predicts secondary structure with an accuracy of 74.3%, backbone torsion angles better than any previously reported method and the structural context of beta strands and turns with an accuracy that should be useful for tertiary structure prediction.

13.
We propose a hidden Markov model for multivariate continuous longitudinal responses with covariates that accounts for three different types of missing pattern: (I) partially missing outcomes at a given time occasion, (II) completely missing outcomes at a given time occasion (intermittent pattern), and (III) dropout before the end of the period of observation (monotone pattern). The missing-at-random (MAR) assumption is formulated to deal with the first two types of missingness, while to account for the informative dropout, we rely on an extra absorbing state. Estimation of the model parameters is based on the maximum likelihood method that is implemented by an expectation-maximization (EM) algorithm relying on suitable recursions. The proposal is illustrated by a Monte Carlo simulation study and an application based on historical data on primary biliary cholangitis.

14.

Background  

Array-based comparative genomic hybridization (CGH) is a commonly used approach to detect DNA copy number variation in genome-wide screens. Several statistical methods have been proposed to define genomic segments with different copy numbers in cancer tumors. However, most tumors are heterogeneous and show variation in DNA copy numbers across tumor cells. The challenge is to reveal the copy number profiles of the subpopulations in a tumor and to estimate the percentage of each subpopulation.

15.
MOTIVATION: Array comparative genomic hybridization (CGH) allows detection and mapping of copy number of DNA segments. A challenge is to make inferences about the copy number structure of the genome. Several statistical methods have been proposed to determine genomic segments with different copy number levels. However, to date, no comprehensive comparison of various characteristics of these methods exists. Moreover, the segmentation results have not been utilized in downstream analyses. RESULTS: We describe a comparison of three popular and publicly available methods for the analysis of array CGH data and we demonstrate how segmentation results may be utilized in downstream analyses such as testing and classification, yielding higher power and prediction accuracy. Since the methods operate on individual chromosomes, we also propose a novel procedure for merging segments across the genome, which results in an interpretable set of copy number levels and thus facilitates identification of copy number alterations in each genome. AVAILABILITY: http://www.bioconductor.org

16.
A hidden Markov model for progressive multiple alignment   Total citations: 4, self-citations: 0, citations by others: 4
MOTIVATION: Progressive algorithms are widely used heuristics for the production of alignments among multiple nucleic-acid or protein sequences. Probabilistic approaches providing measures of global and/or local reliability of individual solutions would constitute valuable developments. RESULTS: We present here a new method for multiple sequence alignment that combines an HMM approach, a progressive alignment algorithm, and a probabilistic evolution model describing the character substitution process. Our method works by iterating pairwise alignments according to a guide tree and defining each ancestral sequence from the pairwise alignment of its child nodes, thus progressively constructing a multiple alignment. Our method allows for the computation of the minimum posterior probability of each column, and we show that this value correlates with the correctness of the result, hence providing an efficient means by which unreliably aligned columns can be filtered out from a multiple alignment.

17.
Protein-protein interactions play a defining role in protein function. Identifying the sites of interaction in a protein is a critical problem for understanding its functional mechanisms, as well as for drug design. To predict sites within a protein chain that participate in protein complexes, we have developed a novel method based on the Hidden Markov Model, which combines several biological characteristics of the sequences neighboring a target residue: structural information, accessible surface area, and transition probability among amino acids. We have evaluated the method using 5-fold cross-validation on 139 unique proteins and demonstrated precision of 66% and recall of 61% in identifying interfaces. These results are better than those achieved by other methods used for identification of interfaces.

18.
A method for calling gains and losses in array CGH data   Total citations: 11, self-citations: 0, citations by others: 11
Array CGH is a powerful technique for genomic studies of cancer. It enables one to carry out genome-wide screening for regions of genetic alterations, such as chromosome gains and losses, or localized amplifications and deletions. In this paper, we propose a new algorithm 'Cluster along chromosomes' (CLAC) for the analysis of array CGH data. CLAC builds hierarchical clustering-style trees along each chromosome arm (or chromosome), and then selects the 'interesting' clusters by controlling the False Discovery Rate (FDR) at a certain level. In addition, it provides a consensus summary across a set of arrays, as well as an estimate of the corresponding FDR. We illustrate the method using an application of CLAC on a lung cancer microarray CGH data set as well as a BAC array CGH data set of aneuploid cell strains.

19.
Surveillance systems tracking health patterns in animals have potential for early warning of infectious disease in humans, yet there are many challenges that remain before this can be realized. Specifically, there remains the challenge of detecting early warning signals for diseases that are not known or are not part of routine surveillance for named diseases. This paper reports on the development of a hidden Markov model for analysis of frontline veterinary sentinel surveillance data from Sri Lanka. Field veterinarians collected data on syndromes and diagnoses using mobile phones. A model for submission patterns accounts for both sentinel-related and disease-related variability. Models for commonly reported cattle diagnoses were estimated separately. Region-specific weekly average prevalence was estimated for each diagnosis and partitioned into normal and abnormal periods. Visualization of state probabilities was used to indicate areas and times of unusual disease prevalence. The analysis suggests that hidden Markov modelling is a useful approach for surveillance datasets from novel populations and/or with little historical baseline data.

20.

Background  

Hidden Markov models and hidden Boltzmann models are employed in computational biology and a variety of other scientific fields for a variety of analyses of sequential data. Whether the associated algorithms are used to compute an actual probability or, more generally, an odds ratio or some other score, a frequent requirement is that the error statistics of a given score be known. What is the chance that random data would achieve that score or better? What is the chance that a real signal would achieve a given score threshold?
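
The first question above can be approximated by simulation. The sketch below is a naive Monte Carlo estimate and is not taken from the article, whose method the abstract does not describe. The names `empirical_p_value`, `score_fn` and `sample_null` are hypothetical, and the coin-flip null model is only a stand-in for real sequence data.

```python
import random

def empirical_p_value(observed, score_fn, sample_null, n_draws=1000, seed=0):
    """Estimate the chance that random data scores at least `observed`.

    sample_null(rng) must return one random data set drawn from the null model.
    The +1 terms keep the estimate strictly positive.
    """
    rng = random.Random(seed)
    hits = sum(score_fn(sample_null(rng)) >= observed for _ in range(n_draws))
    return (hits + 1) / (n_draws + 1)

# Toy null model: ten fair coin flips, scored by the number of heads.
def ten_flips(rng):
    return [rng.randint(0, 1) for _ in range(10)]

p_typical = empirical_p_value(5, sum, ten_flips)
p_extreme = empirical_p_value(10, sum, ten_flips)
```

An estimate of this kind cannot resolve probabilities much smaller than `1 / n_draws`, which is why very high scores usually call for something more refined than plain sampling.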
