A large number of new genomic features are being discovered using high-throughput techniques. The next challenge is to automatically map them to the reference genome for further analysis and functional annotation. We have developed a tool that can be used to map important genomic features to the latest version of the human genome and also to annotate new features. These genomic features can come from many different source types, including miRNAs, microarray primers or probes, ChIP-on-chip data, CpG islands and SNPs, to name a few. A standalone version and web interface for the tool can be accessed through: http://populationhealth.qimr.edu.au/cgi-bin/webFOG/index.cgi. The project details and source code are also available at http://www.bioinformatics.org/webfog.

Graphs such as de Bruijn graphs and OLC (overlap-layout-consensus) graphs have been widely adopted for the de novo assembly of genomic short reads. This work studies another important problem in the field: how graphs can be used for high-performance compression of large-scale sequencing data. We present a novel graph definition, named the Hamming-Shifting graph, to address this problem. The definition originates from the technological characteristics of next-generation sequencing machines and aims to link all pairs of distinct reads that have a small Hamming distance, a small shifting offset, or both. We compute multiple lexicographically minimal k-mers to index the reads for an efficient search of the lightest-weight edges, and we prove a very high probability of successfully detecting these edges. The resulting graph creates a full mutual reference of the reads to cascade a code-minimized transfer of every child read for optimal compression. We conducted compression experiments on the minimum spanning forest of this extremely sparse graph and achieved a 10-30% greater file size reduction compared to the best compression results using existing algorithms. As future work, the separation and connectivity degrees of these giant graphs can be used as economical measurements or protocols for quick quality assessment of wet-lab machines, for sufficiency control of genomic library preparation, and for accurate de novo genome assembly.
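The indexing idea in this abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: it uses a single minimal k-mer per read (the abstract mentions multiple), considers only Hamming distance and ignores shifting offsets, and assumes equal-length reads. The function names and the values of `k` and `max_dist` are hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def minimal_kmer(read: str, k: int) -> str:
    """Return the lexicographically smallest k-mer of a read."""
    return min(read[i:i + k] for i in range(len(read) - k + 1))


def hamming(a: str, b: str) -> int:
    """Count mismatching positions between two equal-length reads."""
    if len(a) != len(b):
        raise ValueError("reads must have equal length")
    return sum(x != y for x, y in zip(a, b))


def candidate_edges(reads, k=4, max_dist=2):
    """Bucket reads by minimal k-mer, then link bucket-mates within max_dist.

    Returns a sorted list of (i, j, distance) tuples with i < j.
    """
    buckets = defaultdict(list)
    for idx, read in enumerate(reads):
        buckets[minimal_kmer(read, k)].append(idx)

    edges = set()
    for members in buckets.values():
        # Only reads sharing a minimal k-mer are compared, which avoids
        # the quadratic all-against-all comparison.
        for i, j in combinations(members, 2):
            d = hamming(reads[i], reads[j])
            if d <= max_dist:
                edges.add((i, j, d))
    return sorted(edges)


if __name__ == "__main__":
    reads = ["AACGTTGC", "AACGTTGA", "TTTTGGGG"]
    print(candidate_edges(reads))  # [(0, 1, 1)]
```

Two similar reads can still land in different buckets when a mismatch changes their minimal k-mer, which is presumably why the abstract indexes each read by several minimal k-mers rather than one.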

Genomic sequences obtained through high-throughput sequencing are not uniformly distributed across the genome. For example, sequencing data of total genomic DNA show significant, yet unexpected enrichments on promoters and exons. This systematic bias is a particular problem for techniques such as chromatin immunoprecipitation, where the signal for a target factor is plotted across genomic features. We have focused on data obtained from Illumina's Genome Analyser platform, where at least three factors contribute to sequence bias: GC content, mappability of sequencing reads, and regional biases that might be generated by local structure. We show that relying on input control as a normalizer is not generally appropriate due to sample to sample variation in bias. To correct sequence bias, we present BEADS (bias elimination algorithm for deep sequencing), a simple three-step normalization scheme that successfully unmasks real binding patterns in ChIP-seq data. We suggest that this procedure be done routinely prior to data interpretation and downstream analyses.

Unsequenced bacterial strains can be characterized by comparing their genomic DNA to a sequenced reference genome of the same species. This comparative genomic approach, also called genomotyping, is leading to an increased understanding of bacterial evolution and pathogenesis. It is efficiently accomplished by comparative genomic hybridization on custom-designed cDNA microarrays. The microarray experiment results in fluorescence intensities for reference and sample genome for each gene. The log-ratio of these intensities is usually compared to a cut-off, classifying each gene of the sample genome as a candidate for an absent or present gene with respect to the reference genome. Reducing the usually high rate of false positives in the list of candidates for absent genes is decisive for both time and costs of the experiment. We propose a novel method to improve efficiency of genomotyping experiments in this sense, by rotating the normalized intensity data before setting up the list of candidate genes. We analyze simulated genomotyping data and also re-analyze an experimental data set for comparison and illustration. We approximately halve the proportion of false positives in the list of candidate absent genes for the example comparative genomic hybridization experiment as well as for the simulation experiments.
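The conventional cut-off step described in this abstract can be sketched in a few lines of Python. This shows only the baseline log-ratio classification, not the rotation method the authors propose, which the abstract does not specify. The function name, the base-2 logarithm and the cut-off value of -1.0 are assumptions chosen for illustration.

```python
import math


def classify_genes(intensities, cutoff=-1.0):
    """Classify each gene as 'absent' or 'present' in the sample genome.

    intensities maps gene name -> (sample_intensity, reference_intensity).
    A gene is a candidate absent gene when log2(sample / reference)
    falls below the (hypothetical) cutoff.
    """
    calls = {}
    for gene, (sample, reference) in intensities.items():
        log_ratio = math.log2(sample / reference)
        calls[gene] = "absent" if log_ratio < cutoff else "present"
    return calls


if __name__ == "__main__":
    data = {"geneA": (100.0, 800.0), "geneB": (900.0, 1000.0)}
    print(classify_genes(data))  # {'geneA': 'absent', 'geneB': 'present'}
```

With a fixed cut-off like this, noisy intensities near the threshold produce the false positives that the abstract sets out to reduce.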

Comparison of genomic maps is hampered by errors and ambiguities introduced by mapping technology, incorrectly resolved paralogy, small samples of markers, and extensive genome rearrangement. We design an analysis to remove or resolve most of these problems and to extract corrected data where markers occur in consecutive strips in both genomes. To do this, we introduce the notion of prestrip, an efficient way of generating these and a compatibility analysis culminating in a maximum weighted clique (MWC) search. The output can be directly analyzed with genome rearrangement algorithms, allowing the restoration of some of the data not incorporated into the clique solution. We investigate the trade-off between criteria for discarding excessive prestrips to make MWC feasible in terms of retaining as many markers as possible in the solution and producing an economical rearrangement analysis. We explore these questions through simulation and through comparison of the rice and sorghum genomes.

We report on new techniques we have developed for reconstructing phylogenies on whole genomes. Our mathematical techniques include new polynomial-time methods for bounding the inversion length of a candidate tree and new polynomial-time methods for estimating genomic distances which greatly improve the accuracy of neighbor-joining analyses. We demonstrate the power of these techniques through an extensive performance study based on simulating genome evolution under a wide range of model conditions. Combining these new tools with standard approaches (fast reconstruction with neighbor-joining, exploration of all possible refinements of strict consensus trees, etc.) has allowed us to analyze datasets that were previously considered computationally impractical. In particular, we have conducted a complete phylogenetic analysis of a subset of the Campanulaceae family, confirming various conjectures about the relationships among members of the subset and about the principal mechanism of evolution for their chloroplast genome. We give representative results of the extensive experimentation we conducted on both real and simulated datasets in order to validate and characterize our approaches. We find that our techniques provide very accurate reconstructions of the true tree topology even when the data are generated by processes that include a significant fraction of transpositions and when the data are close to saturation.

Next-generation sequencing techniques produce enormous amounts of data, but their analysis and visualization remain a big challenge. To address this, we have developed Genome Annotator Light (GAL), a Docker-based package for genome analysis and data visualization. GAL integrates several existing tools and in-house programs inside a Docker container for systematic analysis and visualization of genomes through a web browser. GAL takes a variety of input types, ranging from raw FASTA files to fully annotated files, processes them through a standard annotation pipeline and visualizes the results in a web browser. Comparative genomic analysis is performed automatically within a given taxonomic class. GAL creates an interactive genome browser with clickable genomic feature tracks, a local BLAST-able database, a query page, on-the-fly downstream data analysis using EMBOSS, etc. Overall, GAL is extremely convenient, portable and platform independent. Fully integrated web resources can be easily created and deployed, e.g. www.eumicrobedb.org/cglab, for our in-house genomes. GAL is freely available at https://hub.docker.com/u/cglabiicb/.

MOTIVATION: Many bioinformatic approaches exist for finding novel genes within genomic sequence data. Traditionally, homology search-based methods are often the first approach employed in determining whether a novel gene exists that is similar to a known gene. Unfortunately, distantly related genes or motifs often are difficult to find using single query-based homology search algorithms against large sequence datasets such as the human genome. Therefore, the motivation behind this work was to develop an approach to enhance the sensitivity of traditional single query-based homology algorithms against genomic data without losing search selectivity. RESULTS: We demonstrate that by searching against a genome fragmented into all possible reading frames, the sensitivity of homology-based searches is enhanced without degrading its selectivity. Using the ETS-domain, bromodomain and acetyl-CoA acetyltransferase gene as queries, we were able to demonstrate that direct protein-protein searches using BLAST2P or FASTA3 against a human genome segmented among all possible reading frames and translated was substantially more sensitive than traditional protein-DNA searches against a raw genomic sequence using an application such as TBLAST2N. Receiver operating characteristic analysis was employed to demonstrate that the algorithms remained selective, while comparisons of the algorithms showed that the protein-protein searches were more sensitive in identifying hits. Therefore, through the overprediction of reading frames by this method and the increased sensitivity of protein-protein based homology search algorithms, a genome can be deeply mined, potentially finding hits overlooked by protein-DNA searches against raw genomic data.
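Translating a sequence in all six reading frames, the preprocessing step this abstract relies on, can be sketched as follows. This is a minimal illustration using the standard genetic code, not the authors' pipeline. The function names and the use of "X" for codons containing non-ACGT characters are assumptions, and the sketch does not split frames at stop codons or handle genome-scale input.

```python
from itertools import product

# Standard genetic code in the conventional TCAG ordering
# (TTT, TTC, TTA, TTG, TCT, ...); '*' marks a stop codon.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]


def translate(seq: str) -> str:
    """Translate complete codons from position 0; unknown codons become 'X'."""
    return "".join(
        CODON_TABLE.get(seq[i:i + 3], "X") for i in range(0, len(seq) - 2, 3)
    )


def six_frame_translations(seq: str) -> dict:
    """Map frame labels '+1'..'+3' and '-1'..'-3' to protein strings."""
    seq = seq.upper()
    frames = {}
    for strand, strand_seq in (("+", seq), ("-", reverse_complement(seq))):
        for offset in range(3):
            frames[f"{strand}{offset + 1}"] = translate(strand_seq[offset:])
    return frames


if __name__ == "__main__":
    for frame, protein in six_frame_translations("ATGGCCTAA").items():
        print(frame, protein)
```

Searching a protein query against such translated frames turns a protein-DNA search into a protein-protein search, which is the comparison the abstract reports as more sensitive.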

Lyssaviruses are RNA viruses with single-strand, negative-sense genomes responsible for rabies-like diseases in mammals. To date, genomic and evolutionary studies have most often utilized partial genome sequences, particularly of the nucleoprotein and glycoprotein genes, with little consideration of genome-scale evolution. Herein, we report the first genomic and evolutionary analysis using complete genome sequences of all recognised lyssavirus genotypes, including 14 new complete genomes of field isolates from 6 genotypes and one genotype that is completely sequenced for the first time. In doing so we significantly increase the extent of genome sequence data available for these important viruses. Our analysis of these genome sequence data reveals that all lyssaviruses have the same genomic organization. A phylogenetic analysis reveals strong geographical structuring, with the greatest genetic diversity in Africa, and an independent origin for the two known genotypes that infect European bats. We also suggest that multiple genotypes may exist within the diversity of viruses currently classified as 'Lagos Bat'. In sum, we show that rigorous phylogenetic techniques based on full length genome sequence provide the best discriminatory power for genotype classification within the lyssaviruses.

High-throughput genome sequencing continues to accelerate the rate at which complete genomes are available for biological research. Many of these new genome sequences have little or no genome annotation currently available and hence rely upon computational predictions of protein coding genes. Evidence of translation from proteomic techniques could facilitate experimental validation of protein coding genes, but the techniques for whole genome searching with MS/MS data have not been adequately developed to date. Here we describe GENQUEST, a novel method using peptide isoelectric focusing and accurate mass to greatly reduce the peptide search space, making fast, accurate, and sensitive whole human genome searching possible on common desktop computers. In an initial experiment, almost all exonic peptides identified in a protein database search were identified when searching genomic sequence. Many peptides identified exclusively in the genome searches were incorrectly identified or could not be experimentally validated, highlighting the importance of orthogonal validation. Experimentally validated peptides exclusive to the genomic searches can be used to reannotate protein coding genes. GENQUEST represents an experimental tool that can be used by the proteomics community at large for validating computational approaches to genome annotation.

We propose a network-based approach for surmising the spatial organization of genomes from high-throughput interaction data. Our strategy is based on methods for inferring architectural features of networks. Specifically, we employ a community detection algorithm to partition networks of genomic interactions. These community partitions represent an intuitive interpretation of genomic organization from interaction data. Furthermore, they are able to recapitulate known aspects of the spatial organization of the Saccharomyces cerevisiae genome, such as the rosette conformation of the genome, the clustering of centromeres, as well as tRNAs, and telomeres. We also demonstrate that simple architectural features of genomic interaction networks, such as cliques, can give meaningful insight into the functional role of the spatial organization of the genome. We show that there is a correlation between inter-chromosomal clique size and replication timing, as well as cohesin enrichment. Together, our network-based approach represents an effective and intuitive framework for interpreting high-throughput genomic interaction data. Importantly, there is a great potential for this strategy, given the rich literature and extensive set of existing tools in the field of network analysis.
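The clique features mentioned in this abstract can be illustrated with a small pure-Python sketch that enumerates maximal cliques in an interaction network using the basic Bron-Kerbosch recursion. The abstract does not say which clique-finding or community detection algorithm the authors used, so this is an assumed stand-in, and the locus names in the example are made up.

```python
from collections import defaultdict


def build_adjacency(edges):
    """Turn a list of undirected (u, v) interactions into {node: set(neighbours)}."""
    adjacency = defaultdict(set)
    for u, v in edges:
        adjacency[u].add(v)
        adjacency[v].add(u)
    return dict(adjacency)


def maximal_cliques(adjacency):
    """Yield every maximal clique as a sorted list (basic Bron-Kerbosch)."""

    def expand(clique, candidates, excluded):
        if not candidates and not excluded:
            # Nothing can extend the clique, so it is maximal.
            yield sorted(clique)
            return
        for node in list(candidates):
            neighbours = adjacency[node]
            yield from expand(
                clique | {node}, candidates & neighbours, excluded & neighbours
            )
            # Move the node from candidates to excluded so the same
            # clique is not reported twice.
            candidates = candidates - {node}
            excluded = excluded | {node}

    yield from expand(set(), set(adjacency), set())


if __name__ == "__main__":
    # Hypothetical interactions between four loci.
    interactions = [
        ("cen1", "cen2"), ("cen1", "cen3"), ("cen2", "cen3"), ("cen3", "tel4"),
    ]
    for clique in sorted(maximal_cliques(build_adjacency(interactions))):
        print(len(clique), clique)
```

Clique size computed this way is the kind of per-group feature that could then be correlated with replication timing or cohesin enrichment, as the abstract describes.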

The Distributed Annotation System (DAS) is a protocol for easy sharing and integration of biological annotations. In order to visualize feature annotations in a genomic context a client is required. Here we present myKaryoView, a simple light-weight DAS tool for visualization of genomic annotation. myKaryoView has been specifically configured to help analyse data derived from personal genomics, although it can also be used as a generic genome browser visualization. Several well-known data sources are provided to facilitate comparison of known genes and normal variation regions. The navigation experience is enhanced by simultaneous rendering of different levels of detail across chromosomes. A simple interface is provided to allow searches for any SNP, gene or chromosomal region. User-defined DAS data sources may also be added when querying the system. We demonstrate myKaryoView capabilities for adding user-defined sources with a set of genetic profiles of family-related individuals downloaded directly from 23andMe. myKaryoView is a web tool for visualization of genomic data specifically designed for direct-to-consumer genomic data that uses publicly available data distributed throughout the Internet. It does not require data to be held locally and it is capable of rendering any feature as long as it conforms to DAS specifications. Configuration and addition of sources to myKaryoView can be done through the interface. Here we show a proof of principle of myKaryoView's ability to display personal genomics data with 23andMe genome data sources. The tool is available at: http://mykaryoview.com.

We develop techniques to estimate the statistical significance of gap-free alignments between two genomic DNA sequences, using human-mouse alignments as an example. The sequences are assumed to be sufficiently similar that some but not all of the neutrally evolving regions (i.e., those under no evolutionary constraint) can be reliably aligned. Our goal is to model the situation in which the neutral rate of evolution, and hence the extent of the aligning intervals, varies across the genome. In some cases, this permits the weaker of two matches to be judged as less likely to have arisen by chance, provided it lies in a genomic interval with a high level of background divergence. We employ a hidden Markov model to capture variations in divergence rates and assign probability values to gap-free alignments using techniques of Dembo and Karlin, which are related to those used for the same purpose by BLAST. Our methods are illustrated in detail using a 1.49 Mb genomic region. Results obtained from the analysis of human chromosome 22 using these techniques are also provided.
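The idea of using a hidden Markov model to capture regional variation in divergence can be illustrated with a toy two-state Viterbi decoder. This sketch is not the authors' model and does not compute the Dembo-Karlin significance values: the state names, the match/mismatch alphabet ("M"/"X") and all probabilities are invented for illustration.

```python
import math


def viterbi(observations, states, start_p, trans_p, emit_p):
    """Return the most probable hidden-state path, computed in log space."""
    scores = {
        s: math.log(start_p[s]) + math.log(emit_p[s][observations[0]])
        for s in states
    }
    backpointers = []
    for obs in observations[1:]:
        new_scores, pointers = {}, {}
        for s in states:
            # Best predecessor for state s at this position.
            prev = max(states, key=lambda p: scores[p] + math.log(trans_p[p][s]))
            new_scores[s] = (
                scores[prev] + math.log(trans_p[prev][s]) + math.log(emit_p[s][obs])
            )
            pointers[s] = prev
        scores = new_scores
        backpointers.append(pointers)

    # Trace back from the best final state.
    state = max(states, key=lambda s: scores[s])
    path = [state]
    for pointers in reversed(backpointers):
        state = pointers[state]
        path.append(state)
    return path[::-1]


# Hypothetical parameters: 'low' and 'high' background divergence regions
# emitting alignment columns that are matches ('M') or mismatches ('X').
STATES = ("low", "high")
START = {"low": 0.5, "high": 0.5}
TRANS = {"low": {"low": 0.9, "high": 0.1}, "high": {"low": 0.1, "high": 0.9}}
EMIT = {"low": {"M": 0.9, "X": 0.1}, "high": {"M": 0.6, "X": 0.4}}

if __name__ == "__main__":
    columns = "MMMMMMMMXXMXXXMXXX"
    print(viterbi(columns, STATES, START, TRANS, EMIT))
```

Once each interval is assigned a background divergence level, an alignment score can be judged against the level of its own interval rather than a genome-wide average, which is the intuition behind the abstract's remark that a weaker match can be the more significant one.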

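The reference genome is the core framework of modern functional genomics, and the genomic technologies built on it have greatly advanced research on the discovery of plant genetic variation and the cloning of functional genes over the past 20 years. However, a growing number of studies have found that a single reference genome, or a small number of them, cannot fully represent all of the genomic variation within a species or a particular population. Its application in functional genomics research is therefore severely limited and can even lead to erroneous results. The pan-genome refers to ...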

Twyford AD, Ennos RA. Heredity 2012, 108(3):179-189
Hybridization has a major role in evolution, from the introgression of important phenotypic traits between species to the creation of new species through hybrid speciation. Molecular studies of hybridization aim to understand the class of hybrids and the frequency of introgression, detect the signature of ancient hybridization, and understand the behaviour of introgressed loci in their new genomic background. This often involves a large investment in the design and application of molecular markers, leading to a compromise between the depth and breadth of genomic data. New techniques designed to assay a large sub-section of the genome, in association with next-generation sequencing (NGS) technologies, will allow genome-wide hybridization and introgression studies in organisms with no prior sequence data. These detailed genotypic data will unite the breadth of sampling of loci characteristic of population genetics with the depth of sequence information associated with molecular phylogenetics. In this review, we assess the theoretical and methodological constraints that limit our understanding of natural hybridization, and promote the use of NGS for detecting hybridization and introgression between non-model organisms. We also make recommendations for the ways in which emerging techniques, such as pooled barcoded amplicon sequencing and restriction site-associated DNA tags, should be used to overcome current limitations and enhance our understanding of this evolutionarily significant process.

The recent advances in chromosome conformation capture (3C)-based molecular methods and optical super-resolution (SR) techniques offer powerful tools to investigate three-dimensional (3D) genomic structure in prokaryotic cells and eukaryotic cell nuclei. In this review, we focus on the progress made during the last decade in this exciting field. We first briefly introduce genome organization at the chromosome, domain and sub-domain levels; we then provide a short introduction to various super-resolution microscopy techniques that can be employed to detect 3D genome structure. We also review the progress of quantitative and visualization tools for evaluating and visualizing chromatin interactions in the 3D genome derived from Hi-C data. We close by arguing that imaging methods and 3C-based molecular methods are not mutually exclusive: they are complementary and can be combined to study 3D genome organization.

With the availability of the nearly complete genomic sequence of C. elegans, the first multicellular organism to be sequenced, molecular biology has definitely entered the postgenomic era. Annotation of the genomic sequence, which refers to identifying the genes and other biologically relevant sections of the genome, is an important and nontrivial next step. A first-pass annotation will be necessarily incomplete but will drive further biological experiments, which in turn will help to annotate the genome better. Given the scale of the genome sequence analysis, it is clear that the annotation should be automated as much as possible without sacrificing the quality of analysis. In this work, we outline our approach to identifying the protein kinases of C. elegans from the genomic sequence. We describe new tools we have developed for analysis, management and visualization of genomic data. By developing modular and scalable solutions, this study has provided a framework for future analysis of the Drosophila and human genomes.

Scanning of the human genome by use of affected relative pairs and dense sets of highly polymorphic markers, or by emerging techniques such as genomic mismatch scanning (GMS), is making it possible to identify the genetic etiology of a disease through detection of susceptibility loci. We present a general statistical model and test to detect disease genes, using affected relative pairs and either markers or GMS technologies in a genome search. An exact test and a large-sample normal approximation are provided that control for the elevated probability of false detection of linkage in a genome search. The approach can be used to determine the sample size needed to obtain a prespecified power to detect a disease gene in the presence of etiologic heterogeneity for a single class or mixture of relative classes, with any number of markers or clones, marker PIC values, or mapping function. The approach is used to examine differences in performance of markers and GMS technologies in a common statistical framework and to provide practical information for designing studies of complex traits.
