Similar articles
 20 similar articles found (search time: 31 ms)
1.
MOTIVATION: In recent years, a range of techniques for analysis and segmentation of array comparative genomic hybridization (aCGH) data have been proposed. For array designs in which clones are of unequal lengths, are unevenly spaced or overlap, the discrete-index view typically adopted by such methods may be questionable or open to improvement. RESULTS: We describe a continuous-index hidden Markov model for aCGH data as well as a Monte Carlo EM algorithm to estimate its parameters. It is shown that for a dataset from the BT-474 cell line analysed on 32K BAC tiling microarrays, this model yields considerably better model fit in terms of lag-1 residual autocorrelations compared to a discrete-index HMM, and it is also shown how to use the model for e.g. estimation of change points on the base-pair scale and for estimation of conditional state probabilities across the genome. In addition, the model is applied to the Glioblastoma Multiforme data used in the comparative study by Lai et al. (Lai,W.R. et al. (2005) Comparative analysis of algorithms for identifying amplifications and deletions in array CGH data. Bioinformatics, 21, 3763-3770.) giving results similar to theirs but with certain features highlighted in the continuous-index setting.
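To illustrate what a continuous index buys, the following minimal sketch computes the transition matrix of a two-state continuous-index Markov chain across a gap of arbitrary length, so that unevenly spaced clones get different transition probabilities. It is not the model of the paper above: the two-state restriction, the function name and the rate parameters are assumptions made for illustration only.

```python
import math


def two_state_transition(rate_01, rate_10, distance_bp):
    """Transition matrix of a two-state continuous-index Markov chain.

    rate_01 and rate_10 are hypothetical per-base-pair switching rates
    (state 0 -> 1 and state 1 -> 0); distance_bp is the gap between two loci.
    """
    total = rate_01 + rate_10
    decay = math.exp(-total * distance_bp)
    pi0 = rate_10 / total  # stationary probability of state 0
    pi1 = rate_01 / total  # stationary probability of state 1
    return [
        [pi0 + pi1 * decay, pi1 - pi1 * decay],
        [pi0 - pi0 * decay, pi1 + pi0 * decay],
    ]
```

For a gap of zero the matrix is the identity, and for very long gaps each row approaches the stationary distribution, which is the behaviour a discrete-index HMM with a single fixed transition matrix cannot express.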

2.
MOTIVATION: Array Comparative Genomic Hybridization (CGH) can reveal chromosomal aberrations in the genomic DNA. These amplifications and deletions at the DNA level are important in the pathogenesis of cancer and other diseases. While a large number of approaches have been proposed for analyzing the large array CGH datasets, the relative merits of these methods in practice are not clear. RESULTS: We compare 11 different algorithms for analyzing array CGH data. These include both segment detection methods and smoothing methods, based on diverse techniques such as mixture models, Hidden Markov Models, maximum likelihood, regression, wavelets and genetic algorithms. We compute the Receiver Operating Characteristic (ROC) curves using simulated data to quantify sensitivity and specificity for various levels of signal-to-noise ratio and different sizes of abnormalities. We also characterize their performance on chromosomal regions of interest in a real dataset obtained from patients with Glioblastoma Multiforme. While comparisons of this type are difficult due to possibly sub-optimal choice of parameters in the methods, they nevertheless reveal general characteristics that are helpful to the biological investigator.
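As a rough illustration of the ROC evaluation mentioned above, the sketch below turns per-probe scores and known simulated labels into (false positive rate, true positive rate) pairs. The function name and the input layout are assumptions; the paper's own simulation and scoring procedure is not reproduced here.

```python
def roc_points(scores, labels):
    """Return (false positive rate, true positive rate) pairs, one per threshold.

    scores: per-probe aberration scores from some method (hypothetical input).
    labels: 1 for truly aberrant probes in the simulation, 0 otherwise.
    """
    positives = sum(labels)
    negatives = len(labels) - positives
    points = [(0.0, 0.0)]
    # Sweep thresholds from strict to lenient; a probe is called if score >= threshold.
    for threshold in sorted(set(scores), reverse=True):
        calls = [s >= threshold for s in scores]
        tp = sum(1 for c, y in zip(calls, labels) if c and y == 1)
        fp = sum(1 for c, y in zip(calls, labels) if c and y == 0)
        points.append((fp / negatives, tp / positives))
    return points
```

Sensitivity is the true positive rate and specificity is one minus the false positive rate, so each returned pair corresponds to one sensitivity/specificity trade-off.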

3.
Array-based comparative genomic hybridization (Array-CGH) is an important technology in molecular biology for the detection of DNA copy number polymorphisms between closely related genomes. Hidden Markov Models (HMMs) are popular tools for the analysis of Array-CGH data, but current methods are only based on first-order HMMs having constrained abilities to model spatial dependencies between measurements of closely adjacent chromosomal regions. Here, we develop parsimonious higher-order HMMs enabling the interpolation between a mixture model ignoring spatial dependencies and a higher-order HMM exhaustively modeling spatial dependencies. We apply parsimonious higher-order HMMs to the analysis of Array-CGH data of the accessions C24 and Col-0 of the model plant Arabidopsis thaliana. We compare these models against first-order HMMs and other existing methods using a reference of known deletions and sequence deviations. We find that parsimonious higher-order HMMs clearly improve the identification of these polymorphisms. Moreover, we perform a functional analysis of identified polymorphisms revealing novel details of genomic differences between C24 and Col-0. Additional model evaluations are done on widely considered Array-CGH data of human cell lines indicating that parsimonious HMMs are also well-suited for the analysis of non-plant specific data. All these results indicate that parsimonious higher-order HMMs are useful for Array-CGH analyses. An implementation of parsimonious higher-order HMMs is available as part of the open source Java library Jstacs (www.jstacs.de/index.php/PHHMM).

4.
Genomic microarrays in the spotlight   Total citations: 18, self-citations: 0, citations by others: 18
Microarray-based comparative genomic hybridization (array-CGH) has emerged as a revolutionary platform, enabling the high-resolution detection of DNA copy number aberrations. In this article we outline the use and limitations of genomic clones, cDNA clones and PCR products as targets for genomic microarray construction. Furthermore, the applications and future aspects of these arrays for DNA copy number analysis in research and diagnostics, epigenetic profiling and gene annotation are discussed. These recent developments of genomic microarrays mark only the beginning of a new generation of high-resolution and high-throughput tools for genetic analysis.

5.
BioHMM: a heterogeneous hidden Markov model for segmenting array CGH data   Total citations: 3, self-citations: 0, citations by others: 3
SUMMARY: We have developed a new method (BioHMM) for segmenting array comparative genomic hybridization data into states with the same underlying copy number. By utilizing a heterogeneous hidden Markov model, BioHMM incorporates relevant biological factors (e.g. the distance between adjacent clones) in the segmentation process.

6.
MOTIVATION: The identification of DNA copy number changes provides insights that may advance our understanding of initiation and progression of cancer. Array-based comparative genomic hybridization (array-CGH) has emerged as a technique allowing high-throughput genome-wide scanning for chromosomal aberrations. A number of statistical methods have been proposed for the analysis of array-CGH data. In this article, we consider a fused quantile regression model based on three motivations: (1) quantile regression may provide a more comprehensive picture for the ratio profile of copy numbers than the standard mean regression approach; (2) for simplicity, most available methods assume uniform spacing between neighboring clones, while incorporating the information of physical locations of clones may be helpful and (3) most current methods have a set of tuning parameters that must be carefully tuned, which introduces complexity to the implementation. RESULTS: We formulate the detection of regions of gains and losses in a fused regularized quantile regression framework, incorporating physical locations of clones. We derive an efficient algorithm that computes the entire solution path for the resulting optimization problem, and we propose a simple estimate for the complexity of the fitted model, which leads to convenient selection of the tuning parameter. Three published array-CGH datasets are used to demonstrate our approach. AVAILABILITY: R code is available at http://www.stat.lsa.umich.edu/~jizhu/code/cgh/. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
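The sketch below shows one way a fused quantile regression objective of this kind could be evaluated for a candidate fit. The check loss is the standard quantile regression loss; dividing each jump by the physical gap between clones is only an assumed way of incorporating clone locations, and the function and parameter names are hypothetical rather than taken from the authors' R code.

```python
def check_loss(residual, tau):
    """Standard quantile (check) loss for quantile level tau in (0, 1)."""
    return residual * (tau - (1.0 if residual < 0 else 0.0))


def fused_quantile_objective(y, fit, positions, tau, lam):
    """Check loss of the fit plus a fused penalty on jumps between neighbours.

    Assumption: each jump |fit[i+1] - fit[i]| is down-weighted by the physical
    distance positions[i+1] - positions[i]; the paper's weighting may differ.
    """
    loss = sum(check_loss(yi - fi, tau) for yi, fi in zip(y, fit))
    penalty = sum(
        abs(fit[i + 1] - fit[i]) / (positions[i + 1] - positions[i])
        for i in range(len(fit) - 1)
    )
    return loss + lam * penalty
```

With tau = 0.5 the check loss is half the absolute error, so the objective reduces to a median-based fused fit; a larger lam favours fewer, longer segments.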

7.
Development of physical genomic maps is facilitated by identification of overlapping recombinant DNA clones containing long chromosomal DNA inserts. To simplify the analysis required to determine which clones in a genomic library overlap one another, we partitioned Aspergillus nidulans cosmid libraries into chromosome-specific subcollections. The eight A. nidulans chromosomes were resolved by pulsed field gel electrophoresis and hybridized to filter replicas of cosmid libraries. The subcollections obtained appeared to be representative of the chromosomes based on the correspondence between subcollection size and chromosome length. A sufficient number of clones was obtained in each chromosome-specific subcollection to predict the overlap and assembly of individual clones into a limited number of contiguous regions. This approach should be applicable to many organisms whose genomes can be resolved by pulsed field gel electrophoresis.

8.
Most of the gene prediction algorithms for prokaryotes are based on Hidden Markov Models or similar machine-learning approaches, which imply the optimization of a high number of parameters. The present paper presents a novel method for the classification of coding and non-coding regions in prokaryotic genomes, based on a suitably defined compression index of a DNA sequence. The main features of this new method are the non-parametric logic and the construction of a dictionary of words extracted from the sequences. These dictionaries can be very useful to perform further analyses on the genomic sequences themselves. The proposed approach has been applied to some prokaryotic complete genomes, obtaining optimal scores of correctly recognized coding and non-coding regions. Several false-positive and false-negative cases have been investigated in detail, which have revealed that this approach can fail in the presence of highly structured coding regions (e.g., genes coding for modular proteins) or quasi-random non-coding regions (e.g., regions hosting non-functional fragments of copies of functional genes; regions hosting promoters or other protein-binding sequences). We perform an overall comparison with other gene-finder software, since at this step we are not interested in building another gene-finder system, but only in exploring the possibility of the suggested approach.

9.
DNA sequence copy number has been shown to be associated with cancer development and progression. Array-based comparative genomic hybridization (aCGH) is a recent development that seeks to identify the copy number ratio at large numbers of markers across the genome. Due to experimental and biological variations across chromosomes and hybridizations, current methods are limited to analyses of single chromosomes. We propose a more powerful approach that borrows strength across chromosomes and hybridizations. We assume a Gaussian mixture model, with a hidden Markov dependence structure and with random effects to allow for intertumoral variation, as well as intratumoral clonal variation. For ease of computation, we base estimation on a pseudolikelihood function. The method produces quantitative assessments of the likelihood of genetic alterations at each clone, along with a graphical display for simple visual interpretation. We assess the characteristics of the method through simulation studies and analysis of a brain tumor aCGH data set. We show that the pseudolikelihood approach is superior to existing methods both in detecting small regions of copy number alteration and in accurately classifying regions of change when intratumoral clonal variation is present. Software for this approach is available at http://www.biostat.harvard.edu/~betensky/papers.html.

10.
Crowley EM. Biopolymers, 2001, 58(2): 165-174
A goal of the human genome project is to determine the entire sequence of DNA (3 × 10^9 base pairs) found in chromosomes. The massive amounts of data produced by this project require interpretation. A Bayesian model is developed for locating regulatory regions in a DNA sequence. Regulatory regions are areas of DNA to which specific proteins bind and control whether or not a gene is transcribed to produce templates for protein synthesis. Each human cell contains the same DNA sequence. Thus the particular function of different cells is determined by the genes that are transcribed in that cell. A Hidden Markov chain is used to model whether a small interval of the DNA is in a regulatory region or not. This can be regarded as a changepoint problem where the changepoints are the start of a regulatory or nonregulatory region. The data consists of protein-binding elements, which are short subsequences, or "words," in the DNA sequence. Although these words can occur anywhere in the sequence, a larger number are expected in regulatory regions. Therefore, regulatory regions are detected by locating clusters of words. For a particular DNA sequence, the model automatically selects those words that best predict regions of interest. Markov chain Monte Carlo methods are used to explore the posterior distribution of the Hidden Markov chain. The model is tested by means of simulations, and applied to several DNA sequences.

11.
Mapping Genomic Library Clones Using Oligonucleotide Arrays   Total citations: 1, self-citations: 0, citations by others: 1
We have developed a high-density DNA probe array and accompanying biochemical and informatic methods to order clones from genomic libraries. This approach involves a series of enzymatic steps for capturing a set of short dispersed sequence markers scattered throughout a high-molecular-weight DNA. By this process, all the ambiguous sequences lying adjacent to a given Type IIS restriction site are ligated between two DNA adapters. These markers, once amplified and labeled by PCR, can be hybridized and detected on a high-density oligonucleotide array bearing probes complementary to all possible markers. The array is synthesized using light-directed combinatorial chemistry. For each clone in a genomic library, a characteristic set of sequence markers can be determined. On the basis of the similarity between the marker sets for each pair of clones, their relative overlap can be measured. The library can be sequentially ordered into a contig map using this overlap information. This new methodology does not require gel-based methods or prior sequence information and involves manipulations that should allow for easy adaptation to automated processing and data collection.

12.
Copy number variation (CNV) has been reported to be associated with disease and various cancers. Hence, identifying the accurate position and the type of CNV is currently a critical issue. There are many tools for detecting CNV regions, constructing haplotype phases on CNV regions, or estimating the numerical copy numbers. However, none of them can do all three tasks at the same time. This paper presents a method based on a Hidden Markov Model to detect parent-specific copy number change on both chromosomes with signals from SNP arrays. A haplotype tree is constructed with dynamic branch merging to model the transition of the copy number status of the two alleles assessed at each SNP locus. The emission models are constructed for the genotypes formed with the two haplotypes. The proposed method can provide the segmentation points of the CNV regions as well as the haplotype phasing for the allelic status on each chromosome. The estimated copy numbers are provided as fractional numbers, which can accommodate the somatic mutation in cancer specimens that usually consist of heterogeneous cell populations. The algorithm is evaluated on simulated data and the previously published regions of CNV of the 270 HapMap individuals. The results were compared with five popular methods: PennCNV, genoCN, COKGEN, QuantiSNP and cnvHap. The application on oral cancer samples demonstrates how the proposed method can facilitate clinical association studies. The proposed algorithm exhibits comparable sensitivity of the CNV regions to the best algorithm in our genome-wide study and demonstrates the highest detection rate in SNP dense regions. In addition, we provide better haplotype phasing accuracy than similar approaches. The clinical association carried out with our fractional estimate of copy numbers in the cancer samples provides better detection power than that with integer copy number states.

13.
Microarray-CGH (comparative genomic hybridization) experiments are used to detect and map chromosomal imbalances. A CGH profile can be viewed as a succession of segments that represent homogeneous regions in the genome whose representative sequences share the same relative copy number on average. Segmentation methods constitute a natural framework for the analysis, but they do not provide a biological status for the detected segments. We propose a new model for this segmentation/clustering problem, combining a segmentation model with a mixture model. We present a new hybrid algorithm called dynamic programming-expectation maximization (DP-EM) to estimate the parameters of the model by maximum likelihood. This algorithm combines DP and the EM algorithm. We also propose a model selection heuristic to select the number of clusters and the number of segments. An example of our procedure is presented, based on publicly available data sets. We compare our method to segmentation methods and to hidden Markov models, and we show that the new segmentation/clustering model is a promising alternative that can be applied in the more general context of signal processing.

14.
SUMMARY: Array CGH is a high-throughput technique designed to detect genomic alterations linked to the development and progression of cancer. The technique yields fluorescence ratios that characterize DNA copy number change in tumor versus healthy cells. Classification of tumors based on aCGH profiles is of scientific interest but the analysis of these data is complicated by the large number of highly correlated measures. In this article, we develop a supervised Bayesian latent class approach for classification that relies on a hidden Markov model to account for the dependence in the intensity ratios. Supervision means that classification is guided by a clinical endpoint. Posterior inferences are made about class-specific copy number gains and losses. We demonstrate our technique on a study of brain tumors, for which our approach is capable of identifying subsets of tumors with different genomic profiles, and differentiates classes by survival much better than unsupervised methods.

15.
16.
Array-based comparative genomic hybridization (aCGH) enables the measurement of DNA copy number across thousands of locations in a genome. The main goals of analyzing aCGH data are to identify the regions of copy number variation (CNV) and to quantify the amount of CNV. Although there are many methods for analyzing single-sample aCGH data, the analysis of multi-sample aCGH data is a relatively new area of research. Further, many of the current approaches for analyzing multi-sample aCGH data do not appropriately utilize the additional information present in the multiple samples. We propose a procedure called the Fused Lasso Latent Feature Model (FLLat) that provides a statistical framework for modeling multi-sample aCGH data and identifying regions of CNV. The procedure involves modeling each sample of aCGH data as a weighted sum of a fixed number of features. Regions of CNV are then identified through an application of the fused lasso penalty to each feature. Some simulation analyses show that FLLat outperforms single-sample methods when the simulated samples share common information. We also propose a method for estimating the false discovery rate. An analysis of an aCGH data set obtained from human breast tumors, focusing on chromosomes 8 and 17, shows that FLLat and Significance Testing of Aberrant Copy number (an alternative, existing approach) identify similar regions of CNV that are consistent with previous findings. However, through the estimated features and their corresponding weights, FLLat is further able to discern specific relationships between the samples, for example, identifying 3 distinct groups of samples based on their patterns of CNV for chromosome 17.
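The two ingredients named above, samples as weighted sums of features and a fused lasso penalty on each feature, can be sketched as a single objective function. This is an illustration under assumptions: the squared-error fit term, the matrix layout, the names and the two tuning parameters lam1 and lam2 are hypothetical, and any constraint the authors place on the weights is omitted.

```python
def fllat_style_objective(Y, B, Theta, lam1, lam2):
    """Squared reconstruction error plus a fused lasso penalty on each feature.

    Y:     L x S matrix (L probe locations, S samples), as nested lists.
    B:     L x J matrix whose columns are the J latent features.
    Theta: J x S matrix of per-sample weights.
    """
    n_loc, n_samp, n_feat = len(Y), len(Y[0]), len(Theta)
    error = 0.0
    for l in range(n_loc):
        for s in range(n_samp):
            recon = sum(B[l][j] * Theta[j][s] for j in range(n_feat))
            error += (Y[l][s] - recon) ** 2
    penalty = 0.0
    for j in range(n_feat):
        # Sparsity term: most locations of a feature should be zero.
        penalty += lam1 * sum(abs(B[l][j]) for l in range(n_loc))
        # Fusion term: neighbouring locations should share the same value.
        penalty += lam2 * sum(abs(B[l][j] - B[l - 1][j]) for l in range(1, n_loc))
    return error + penalty
```

The fusion term is what makes each estimated feature piecewise constant, so its nonzero stretches can be read as candidate CNV regions shared across samples.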

17.
Laboratory selection is a powerful approach for engineering new traits in metabolic engineering applications. This approach is limited because determining the genetic basis of improved strains can be difficult using conventional methods. We have recently reported a new method that enables the measurement of fitness for all clones contained within comprehensive genomic libraries, thus enabling the genome-scale mapping of fitness altering genes. Here, we demonstrate a strategy for relating these measurements to the individual phenotypes selected for in a particular environment. We first provide a mathematical framework for decomposing fitness into selectable phenotypes. We then employed this framework to predict that single-batch selections would enrich primarily for library clones with increased growth rate, serial-batch would enrich for a broad collection of clones enhanced via a combination of increased growth rate and/or reduced lag times, and that overlap among selected clones would be minimal. We used the SCalar Analysis of Library Enrichments (SCALEs) method to test these predictions. We mapped all genomic regions for which increased copy number conferred a selective advantage to Escherichia coli when cultured via single- or serial-batch in the presence of 1-naphthol. We identified a surprisingly large collection (163 total) of tolerance regions, including all previously identified solvent tolerance genes in E. coli. We show that the majority of the identified regions were unique to the different selection strategies examined and that such differences were indeed due to differences among enriched clones in growth rate and lag times over the solvent concentrations examined. The combination of a framework for decomposing overall fitness into selectable phenotypes along with a genome-scale method for mapping genes to such phenotypes lays the groundwork for improving the rational design of laboratory selections.

18.
19.
There is an increasing interest in using single nucleotide polymorphism (SNP) genotyping arrays for profiling chromosomal rearrangements in tumors, as they allow simultaneous detection of copy number and loss of heterozygosity with high resolution. Critical issues such as signal baseline shift due to aneuploidy, normal cell contamination, and the presence of GC content bias have been reported to dramatically alter SNP array signals and complicate accurate identification of aberrations in cancer genomes. To address these issues, we propose a novel Global Parameter Hidden Markov Model (GPHMM) to unravel tangled genotyping data generated from tumor samples. In contrast to other HMM methods, a distinct feature of GPHMM is that the issues mentioned above are quantitatively modeled by global parameters and integrated within the statistical framework. We developed an efficient EM algorithm for parameter estimation. We evaluated performance on three data sets and show that GPHMM can correctly identify chromosomal aberrations in tumor samples containing as few as 10% cancer cells. Furthermore, we demonstrated that the estimation of global parameters in GPHMM provides information about the biological characteristics of tumor samples and the quality of genotyping signal from SNP array experiments, which is helpful for data quality control and outlier detection in cohort studies.

20.