Similar literature
Found 20 similar articles; search took 15 ms.
1.
DNA copy number variants (CNVs), which alter the copy number of a particular DNA segment in the genome, play an important role in human phenotypic variability and disease susceptibility. A number of CNVs overlapping with genes have been shown to confer risk for a variety of human diseases, highlighting the relevance of addressing the variability of CNVs at a higher resolution. So far, it has not been possible to deterministically infer the allelic composition of the different haplotypes present within CNV regions. We have developed a novel computational method, called PiCNV, which resolves the haplotype sequence composition within CNV regions in nuclear families based on SNP genotyping microarray data. The algorithm can i) phase normal and CNV-carrying haplotypes in the copy number variable regions, ii) resolve the allelic copies of rearranged DNA sequence within the haplotypes, and iii) infer the heritability of identified haplotypes in trios or larger nuclear families. To our knowledge this is the first program available that can deterministically phase null, mono-, di-, tri- and tetraploid genotypes in CNV loci. We applied our method to study the composition and inheritance of haplotypes in CNV regions of 30 HapMap Yoruban trios and 34 Estonian families. For 93.6% of the CNV loci, PiCNV unambiguously phased normal and CNV-carrying haplotypes and followed their transmission in the corresponding families. Furthermore, allelic composition analysis identified the co-occurrence of alternative allelic copies within 66.7% of haplotypes carrying copy number gains. We also observed less frequent transmission of CNV-carrying haplotypes from parents to children compared to normal haplotypes, and identified several de novo deletions and duplications in the offspring.

2.
Tom Druet, Michel Georges. Genetics 2010, 184(3): 789-798
Faithful reconstruction of haplotypes from diploid marker data (phasing) is important for many kinds of genetic analyses, including mapping of trait loci, prediction of genomic breeding values, and identification of signatures of selection. In human genetics, phasing most often exploits population information (linkage disequilibrium), while in animal genetics the primary source of information is familial (Mendelian segregation and linkage). We herein develop and evaluate a method that simultaneously exploits both sources of information. It builds on hidden Markov models that were initially developed to exploit population information only. We demonstrate that the approach improves the accuracy of allele phasing as well as imputation of missing genotypes. Reconstructed haplotypes are assigned to hidden states that are shown to correspond to clusters of genealogically related chromosomes. We show that these cluster states can directly be used to fine map QTL. The method is computationally effective at handling large data sets based on high-density SNP panels.
Array technology now allows genotyping of large cohorts for thousands to millions of single nucleotide polymorphisms (SNPs), which are becoming available for a growing list of organisms including human and domestic animals. Among other applications, these advances permit systematic scanning of the genome to map trait loci by association (e.g., Wellcome Trust Case Control Consortium 2007; Charlier et al. 2008), to predict genomic breeding values for complex traits (Meuwissen et al. 2001; Goddard and Hayes 2009), or to identify signatures of selection (e.g., Voight et al. 2006).
Present-day genotyping platforms do not directly provide information about linkage phase; i.e., co-inherited alleles at adjacent heterozygous markers (haplotypes) are not identified as such. As haplotype information may considerably empower genetic analyses, indirect phasing strategies have been devised: haplotypes can be reconstructed from unphased genotypes using either familial information (Mendelian segregation and linkage) and/or population information (linkage disequilibrium, LD, and surrogate parents) (e.g., Windig and Meuwissen 2004; Scheet and Stephens 2006; Kong et al. 2008).
Haplotype-based approaches are routinely applied in animal genetics for combined linkage and LD mapping of QTL (e.g., Meuwissen and Goddard 2000; Blott et al. 2003). In these studies, phasing has so far relied on familial information provided by the extended pedigrees typical of livestock (e.g., Windig and Meuwissen 2004). This approach, however, leaves a nonnegligible proportion of genotypes unphased, especially for the less connected individuals. After phasing, identity-by-descent (IBD) probabilities conditional on haplotype data, which are needed for QTL mapping, are computed for all chromosome pairs, using familial as well as population information (hence combined linkage and LD mapping, L + LD) (e.g., Meuwissen and Goddard 2001). However, the use of high-density SNP chips and the analysis of ever larger cohorts render the computation of pairwise IBD probabilities a bottleneck.
We herein propose a more efficient, heuristic approach based on hidden Markov models (HMM). It simultaneously phases and sorts haplotypes in clusters that can be used directly for mapping or other purposes. The proposed method exploits familial as well as population information, and imputes missing genotypes. We herein describe the accuracy of the proposed method and its use for L + LD mapping of QTL.

3.
Inference of haplotypes is important in genetic epidemiology studies. However, all large genotype data sets contain errors, owing to the use of inexpensive, fallible genotyping machines and to shortcomings in genotype-scoring software, and these errors can have an enormous impact on haplotype inference. In this article, we propose two novel strategies to reduce the impact of genotyping errors on haplotype inference. The first method makes use of double sampling. For each individual, the “GenoSpectrum”, which consists of all possible genotypes and their corresponding likelihoods, is computed. The second method is a genotype clustering algorithm based on multi‐genotyping data, which also assigns a “GenoSpectrum” to each individual. We then describe two hybrid EM algorithms (called DS‐EM and MG‐EM) that perform haplotype inference based on the “GenoSpectrum” of each individual obtained from double sampling and multi‐genotyping data. Both simulated data sets and a quasi real‐data set demonstrate that our proposed methods perform well in different situations and outperform the conventional EM algorithm and the HMM algorithm proposed by Sun, Greenwood, and Neal (2007, Genetic Epidemiology 31, 937–948) when the genotype data sets have errors.

4.
Is it possible to learn and create a first Hidden Markov Model (HMM) without programming skills or understanding the algorithms in detail? In this concise tutorial, we present the HMM through the two general questions it was initially developed to answer and describe its elements. The HMM elements include variables, hidden and observed parameters, the vector of initial probabilities, and the transition and emission probability matrices. Then, we suggest a set of ordered steps for modeling the variables and illustrate them with a simple exercise of modeling and predicting transmembrane segments in a protein sequence. Finally, we show how to interpret the results of the algorithms for this particular problem. To guide the process of information input and explicit solution of the basic HMM algorithms that answer the HMM questions posed, we developed an educational webserver called HMMTeacher. Additional solved HMM modeling exercises can be found in the user’s manual and answers to frequently asked questions. HMMTeacher is available at https://hmmteacher.mobilomics.org, mirrored at https://hmmteacher1.mobilomics.org. A repository with the code of the tool and the webpage is available at https://gitlab.com/kmilo.f/hmmteacher.
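The HMM elements named in this abstract (initial probability vector, transition matrix, emission matrix) can be made concrete with a small sketch. The Python below is not taken from HMMTeacher; it is an illustrative log-space Viterbi decoder over a hypothetical two-state model in which "M" stands for a membrane segment, "O" for any other region, and each residue is reduced to "H" (hydrophobic) or "P" (polar). All state names, symbols, and probabilities are assumptions chosen only for illustration.

```python
import math

def viterbi(obs, states, start_p, trans_p, emit_p):
    """Return the most probable hidden-state path for obs (log-space Viterbi)."""
    # score[s]: best log-probability of any path that ends in state s
    score = {s: math.log(start_p[s]) + math.log(emit_p[s][obs[0]]) for s in states}
    back = []  # one dict of back-pointers per observation after the first
    for symbol in obs[1:]:
        prev, score, pointers = score, {}, {}
        for s in states:
            # best predecessor of s, given the scores of the previous step
            best = max(states, key=lambda r: prev[r] + math.log(trans_p[r][s]))
            score[s] = (prev[best] + math.log(trans_p[best][s])
                        + math.log(emit_p[s][symbol]))
            pointers[s] = best
        back.append(pointers)
    # trace back from the best final state
    path = [max(states, key=lambda s: score[s])]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    return path[::-1]

# Hypothetical toy model; every number here is an assumption, not fitted to data.
STATES = ("O", "M")
START_P = {"O": 0.8, "M": 0.2}                      # vector of initial probabilities
TRANS_P = {"O": {"O": 0.9, "M": 0.1},               # transition matrix
           "M": {"O": 0.1, "M": 0.9}}
EMIT_P = {"O": {"H": 0.3, "P": 0.7},                # emission matrix
          "M": {"H": 0.8, "P": 0.2}}

if __name__ == "__main__":
    print("".join(viterbi("PPHHHHHHPP", STATES, START_P, TRANS_P, EMIT_P)))
```

Working in log space avoids numerical underflow on long sequences; it requires every probability in the model to be non-zero, which the toy values above satisfy.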

5.
MicroRNAs (miRNAs) are a class of small single-stranded RNAs of about 22 nt that serve as important negative gene regulators. In animals, miRNAs mainly repress protein translation by binding to the 3′ UTR regions of mRNAs with imperfect complementary pairing. Although bioinformatics investigations have resulted in a number of target prediction tools, all of these share a common shortcoming: a high false positive rate. Therefore, it is important to further filter the predicted targets. In this paper, based on the miRNA:target duplex, we construct a second-order Hidden Markov Model, implement the Baum-Welch training algorithm, and apply this model to further process predicted targets. The classifier is trained on 244 positive and 49 negative miRNA:target interaction pairs and achieves a sensitivity of 72.54%, a specificity of 55.10%, and an accuracy of 69.62% in 10-fold cross-validation experiments. To further verify the applicability of the algorithm, previously collected datasets, comprising 195 positive and 38 negative pairs, are used to test it, with consistent results. We believe that our method will provide some guidance for experimental biologists, especially in choosing miRNA targets for validation.
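The three reported percentages can be related to one another with the standard confusion-matrix definitions. The Python sketch below is only a consistency check: the abstract gives the totals (244 positives, 49 negatives) but not the per-class counts, so the values 177 true positives and 27 true negatives are back-calculated assumptions that happen to reproduce the reported figures when the folds are pooled. The authors may instead have averaged over folds.

```python
def classification_metrics(tp, fn, tn, fp):
    """Return (sensitivity, specificity, accuracy) from confusion-matrix counts."""
    sensitivity = tp / (tp + fn)                 # fraction of positives recovered
    specificity = tn / (tn + fp)                 # fraction of negatives rejected
    accuracy = (tp + tn) / (tp + fn + tn + fp)   # fraction of all pairs correct
    return sensitivity, specificity, accuracy

# Assumed counts (not stated in the abstract): 177 of 244 positives and
# 27 of 49 negatives classified correctly.
sens, spec, acc = classification_metrics(tp=177, fn=244 - 177, tn=27, fp=49 - 27)

if __name__ == "__main__":
    print(f"sensitivity={sens:.2%} specificity={spec:.2%} accuracy={acc:.2%}")
```

Because positives outnumber negatives roughly five to one, the accuracy is dominated by the sensitivity term, which is why it sits much closer to 72.54% than to 55.10%.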

6.
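The generalized hidden Markov model (GHMM) is an important model for gene finding, but its computational cost is far greater than that of a conventional hidden Markov model, so much so that it cannot be used directly for gene finding. Exploiting the structural features of prokaryotic genes, we propose an efficient simplified algorithm whose computational cost is a linear function of sequence length. On this basis, we built GeneMiner, a gene-finding program for prokaryotic genes; tests on real data show that the algorithm is effective.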

7.
Summary: Array CGH is a high‐throughput technique designed to detect genomic alterations linked to the development and progression of cancer. The technique yields fluorescence ratios that characterize DNA copy number change in tumor versus healthy cells. Classification of tumors based on aCGH profiles is of scientific interest, but the analysis of these data is complicated by the large number of highly correlated measures. In this article, we develop a supervised Bayesian latent class approach for classification that relies on a hidden Markov model to account for the dependence in the intensity ratios. Supervision means that classification is guided by a clinical endpoint. Posterior inferences are made about class‐specific copy number gains and losses. We demonstrate our technique on a study of brain tumors, for which our approach is capable of identifying subsets of tumors with different genomic profiles, and differentiates classes by survival much better than unsupervised methods.

8.

Background  

Nuclear localization signals (NLSs) are stretches of residues within a protein that are important for the regulated nuclear import of the protein. Of the many import pathways that exist in yeast, the best characterized is termed the 'classical' NLS pathway. The classical NLS contains specific patterns of basic residues and computational methods have been designed to predict the location of these motifs on proteins. The consensus sequences, or patterns, for the other import pathways are less well-understood.

9.
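It has long been observed that DNA copy number variation is associated with specific chromosomal rearrangements and genomic abnormalities, but its link to disease has only recently become known. We review progress on the principles of copy number variation, the latest research methods, and studies of its association with complex diseases; summarize the open problems in copy number variation research; and discuss future research priorities and the questions that remain to be solved.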

10.
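HumGene is a human gene prediction program based on the generalized hidden Markov model (GHMM). Exploiting the structural features of human genes, it uses probabilistic models to build an independent submodel for each specific region of the gene structure and obtains a globally unified scoring index, which gives the overall framework of the system a degree of extensibility. A new simplified algorithm effectively reduces the computational complexity. We describe the components of the software, test it, and compare its results with those of other similar programs.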

11.
Large rare copy number variants (CNVs) have been recognized as significant genetic risk factors for the development of schizophrenia (SCZ). However, due to their low frequency (1:150 to 1:1000) among patients, large sample sizes are needed to detect an association between specific CNVs and SCZ. So far, the majority of genome-wide CNV analyses have focused on reporting only CNVs that reached a significant P-value within the study cohort and merely confirmed the frequency of already-established risk-carrying CNVs. As a result, CNVs with a very low frequency that might be relevant for SCZ susceptibility are lost to secondary analyses. In this study, we provide a concise collection of high-quality CNVs in a large German sample consisting of 1,637 patients with SCZ or schizoaffective disorder and 1,627 controls. All individuals were genotyped on Illumina's BeadChips and putative CNVs were identified using QuantiSNP and PennCNV. Only those CNVs that were detected by both programs and spanned ≥30 consecutive SNPs were included in the data collection and downstream analyses (2,366 CNVs, 0.73 CNVs per individual). The genome-wide analysis did not reveal a specific association between a previously unknown CNV and SCZ. However, the group of CNVs previously reported to be associated with SCZ was more frequent in our patients than in the controls. The publication of our dataset will serve as a unique, easily accessible, high-quality CNV data collection for other research groups. The dataset could be useful for the identification of new disease-relevant CNVs that are currently overlooked due to their very low frequency and the lack of power to detect them in individual studies.

12.
13.
Endometriosis is a complex gynecological condition that affects 6–10% of women in their reproductive years and is defined by the presence of endometrial glands and stroma outside the uterus. Twin, family, and genome-wide association (GWA) studies have confirmed a genetic role, yet only a small part of the genetic risk can be explained by SNP variation. Copy number variants (CNVs) account for a greater portion of human genetic variation than SNPs and include more recent mutations of large effect. CNVs, likely to be prominent in conditions with decreased reproductive fitness, have not previously been examined as a genetic contributor to endometriosis. Here we employ a high-density genotyping microarray in a genome-wide survey of CNVs in a case-control population that includes 2,126 surgically confirmed endometriosis cases and 17,974 population controls of European ancestry. We apply stringent quality filters to reduce the false positive rate common to many CNV-detection algorithms from 77.7% to 7.3% without noticeable reduction in the true positive rate. We detected no differences in the CNV landscape between cases and controls at the global level, which showed an average of 1.92 CNVs per individual with an average size of 142.3 kb. At the local level we identify 22 CNV regions at the nominal significance threshold (P < 0.05), which is greater than the 8.15 CNV regions expected based on permutation analysis (P < 0.001). Three CNVs passed a genome-wide P-value threshold of 9.3×10⁻⁴: a deletion at SGCZ on 8p22 (P = 7.3×10⁻⁴, OR = 8.5, CI = 2.3–31.7), a deletion in MALRD1 on 10p12.31 (P = 5.6×10⁻⁴, OR = 14.1, CI = 2.7–90.9), and a deletion at 11q14.1 (P = 5.7×10⁻⁴, OR = 33.8, CI = 3.3–1651). Two SNPs within the 22 CNV regions show significant genotypic association with endometriosis after adjusting for multiple testing: rs758316 in DPP6 on 7q36.2 (P = 0.0045) and rs4837864 in ASTN2 on 9q33.1 (P = 0.0002). Together, the CNV loci are detected in 6.9% of affected women compared to 2.1% in the general population.

14.
This paper proposes the use of hidden Markov time series models for the analysis of the behaviour sequences of one or more animals under observation. These models have advantages over the Markov chain models commonly used for behaviour sequences, as they can allow for time-trend or expansion to several subjects without sacrificing parsimony. Furthermore, they provide an alternative to higher-order Markov chain models if a first-order Markov chain is unsatisfactory as a model. To illustrate the use of such models, we fit multivariate and univariate hidden Markov models allowing for time-trend to data from an experiment investigating the effects of feeding on the locomotory behaviour of locusts (Locusta migratoria).

15.

Background  

Sequencing of environmental DNA (often called metagenomics) has shown tremendous potential to uncover the vast number of unknown microbes that cannot be cultured and sequenced by traditional methods. Because the output from metagenomic sequencing is a large set of reads of unknown origin, clustering reads together that were sequenced from the same species is a crucial analysis step. Many effective approaches to this task rely on sequenced genomes in public databases, but these genomes are a highly biased sample that is not necessarily representative of environments interesting to many metagenomics projects.

16.
Urban-scale traffic monitoring plays a vital role in reducing traffic congestion. Owing to its low cost and wide coverage, floating car data (FCD) serves as a novel approach to collecting traffic data. However, sparse probe data make up the vast majority of the data available on arterial roads in most urban environments. To overcome the problem of data sparseness, this paper proposes a hidden Markov model (HMM)-based traffic estimation model, in which the traffic condition on a road segment is treated as a hidden state that can be estimated from the conditions of road segments with similar traffic characteristics. An algorithm based on clustering and pattern mining, rather than on adjacency relationships, is proposed to find clusters of road segments with similar traffic characteristics. A multi-clustering strategy is adopted to achieve a trade-off between clustering accuracy and coverage. Finally, the proposed model is designed and implemented on the basis of a real-time algorithm. Results of experiments based on real FCD confirm the applicability, accuracy, and efficiency of the model. In addition, the results indicate that the model is practicable for traffic estimation on urban arterials and works well even when more than 70% of the probe data are missing.

17.
rrndb: the Ribosomal RNA Operon Copy Number Database (cited 4 times in total: 0 self-citations, 4 citations by others)
The Ribosomal RNA Operon Copy Number Database (rrndb) is an Internet-accessible database containing annotated information on rRNA operon copy number among prokaryotes. Gene redundancy is uncommon in prokaryotic genomes, yet the rRNA genes can vary from one to as many as 15 copies. Despite the widespread use of 16S rRNA gene sequences for identification of prokaryotes, information on the number and sequence of individual rRNA genes in a genome is not readily accessible. In an attempt to understand the evolutionary implications of rRNA operon redundancy, we have created a phylogenetically arranged report on rRNA gene copy number for a diverse collection of prokaryotic microorganisms. Each entry (organism) in the rrndb contains detailed information linked directly to external websites including the Ribosomal Database Project, GenBank, PubMed and several culture collections. Data contained in the rrndb will be valuable to researchers investigating microbial ecology and evolution using 16S rRNA gene sequences. The rrndb web site is directly accessible on the WWW at http://rrndb.cme.msu.edu.

18.
19.
20.
Chromosome structural changes with nonrecurrent endpoints associated with genomic disorders offer windows into the mechanism of origin of copy number variation (CNV). A recent report of nonrecurrent duplications associated with Pelizaeus-Merzbacher disease identified three distinctive characteristics. First, the majority of events can be seen to be complex, showing discontinuous duplications mixed with deletions, inverted duplications, and triplications. Second, junctions at endpoints show microhomology of 2–5 base pairs (bp). Third, endpoints occur near pre-existing low copy repeats (LCRs). Using these observations and evidence from DNA repair in other organisms, we derive a model of microhomology-mediated break-induced replication (MMBIR) for the origin of CNV and, ultimately, of LCRs. We propose that breakage of replication forks in stressed cells that are deficient in homologous recombination induces an aberrant repair process with features of break-induced replication (BIR). Under these circumstances, single-strand 3′ tails from broken replication forks will anneal with microhomology on any single-stranded DNA nearby, priming low-processivity polymerization with multiple template switches generating complex rearrangements, and eventual re-establishment of processive replication.
