Similar articles
20 similar articles found (search time: 31 ms)
1.
Motivated by the trend of genome sequencing without completing the sequence of the whole genomes, a problem on filling an incomplete multichromosomal genome (or scaffold) I with respect to a complete target genome G was studied. The objective is to minimize the resulting genomic distance between I' and G, where I' is the corresponding filled scaffold. We call this problem the one-sided scaffold filling problem. In this paper, we conduct a systematic study of the scaffold filling problem under the breakpoint distance and its variants, for both unichromosomal and multichromosomal genomes (with and without gene repetitions). When the input genome contains no gene repetition (i.e., is a fragment of a permutation), we show that the two-sided scaffold filling problem (i.e., G is also incomplete) is polynomially solvable for unichromosomal genomes under the breakpoint distance and for multichromosomal genomes under the genomic (or DCJ, Double-Cut-and-Join) distance. However, when the input genome contains some repeated genes, even the one-sided scaffold filling problem becomes NP-complete when the similarity measure is the maximum number of adjacencies between two sequences. For this problem, we also present efficient constant-factor approximation algorithms: factor 2 for the general case and factor 1.33 for the one-sided case.
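The similarity measure named in this abstract, the number of adjacencies shared by two sequences, can be illustrated with a minimal sketch. It assumes unsigned genes on a single linear chromosome, treats an adjacency as an unordered pair of neighbouring genes, and counts shared adjacencies as a multiset intersection; the function names are hypothetical, and this is not the paper's approximation algorithm, only the quantity it optimizes.

```python
from collections import Counter


def adjacencies(seq):
    """Multiset of unordered adjacent gene pairs in seq (assumption: unsigned genes)."""
    return Counter(tuple(sorted(pair)) for pair in zip(seq, seq[1:]))


def common_adjacencies(a, b):
    """Size of the multiset intersection of the adjacencies of a and b."""
    return sum((adjacencies(a) & adjacencies(b)).values())
```

For example, `common_adjacencies("abcd", "acbd")` counts only the pair b-c, which occurs in both sequences. Filling a scaffold then amounts to choosing where to insert the missing genes so that this count is as large as possible.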

2.
MOTIVATION: One of the major features of genomic DNA sequences, distinguishing them from texts in most spoken or artificial languages, is their high repetitiveness. Variation in the repetitiveness of genomic texts reflects the presence and density of different biologically important messages. Thus, deviation from an expected number of repeats in both directions indicates a possible presence of a biological signal. Linguistic complexity corresponds to repetitiveness of a genomic text, and potential regulatory sites may be discovered through construction of typical patterns of complexity distribution. RESULTS: We developed software for fast calculation of linguistic sequence complexity of DNA sequences. Our program utilizes suffix trees to compute the number of subwords present in genomic sequences, thereby allowing calculation of linguistic complexity in time linear in genome size. The measure of linguistic complexity was applied to the complete genome of Haemophilus influenzae. Maps of complexity along the entire genome were obtained using sliding windows of 40, 100, and 2000 nucleotides. This approach provided an efficient way to detect simple sequence repeats in this genome. In addition, local profiles of complexity distribution around the starts of translation were constructed for 21 complete prokaryotic genomes. We hypothesize that complexity profiles correspond to evolutionary relationships between organisms. We found principal differences in profiles of the GC-rich and other (non-GC-rich) genomes. We also found characteristic differences in profiles of AT genomes, which probably reflect individual species variations in translational regulation. AVAILABILITY: The program is available upon request from Alexander Bolshoy or at http://csweb.haifa.ac.il/library/#complex.
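To make the sliding-window idea concrete, here is a small sketch of linguistic complexity. It assumes the definition "number of distinct subwords divided by the maximum possible number for a string of that length over a four-letter alphabet". It counts subwords naively with sets, so it runs in quadratic time or worse per window, unlike the suffix-tree program the abstract describes. The function names are hypothetical.

```python
def linguistic_complexity(seq, alphabet_size=4):
    """Ratio of observed distinct subwords to the maximum possible number."""
    n = len(seq)
    observed = 0
    maximum = 0
    for k in range(1, n + 1):
        # Distinct subwords of length k actually present in seq.
        observed += len({seq[i:i + k] for i in range(n - k + 1)})
        # At most alphabet_size**k words exist, and only n-k+1 positions fit.
        maximum += min(alphabet_size ** k, n - k + 1)
    return observed / maximum if maximum else 0.0


def complexity_profile(seq, window, step=1):
    """Complexity of each window of the given length along seq."""
    return [
        linguistic_complexity(seq[i:i + window])
        for i in range(0, len(seq) - window + 1, step)
    ]
```

A homopolymer window such as `"AAAA"` scores low and a window with no repeated subword such as `"ACGT"` scores 1.0, which is why dips in the profile point to simple sequence repeats.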

3.

Background  

Traditional genome alignment methods consider sequence alignment as a variation of the string edit distance problem, and perform alignment by matching characters of the two sequences. They are often computationally expensive and unable to deal with low information regions. Furthermore, they lack a well-principled objective function to measure the performance of sets of parameters. Since genomic sequences carry genetic information, this article proposes that the information content of each nucleotide in a position should be considered in sequence alignment. An information-theoretic approach for pairwise genome local alignment, namely XMAligner, is presented. Instead of comparing sequences at the character level, XMAligner considers a pair of nucleotides from two sequences to be related if their mutual information in context is significant. The information content of nucleotides in sequences is measured by a lossless compression technique.

4.
With the increasing quantities of Brassica genomic data being entered into the public domain and in preparation for the complete Brassica genome sequencing effort, there is a growing requirement for the structuring and detailed bioinformatic analysis of Brassica genomic information within a user-friendly database. At the Plant Biotechnology Centre, Melbourne, Australia, we have developed a series of tools and computational pipelines to assist in the processing and structuring of genomic data, to aid its application to agricultural biotechnology research. These tools include a sequence database, ASTRA, a sequence processing pipeline incorporating annotation against GenBank, SwissProt and Arabidopsis Gene Ontology (GO) data and tools for molecular marker discovery and comparative genome analysis. All sequences are mined for simple sequence repeat (SSR) molecular markers using 'SSR primer' and mapped onto the complete Arabidopsis thaliana genome by sequence comparison. The database may be queried using a text-based search of sequence annotation or GO terms, BLAST comparison against resident sequences, or by the position of candidate orthologues within the Arabidopsis genome. Tools have also been developed and applied to the discovery of single nucleotide polymorphism (SNP) molecular markers and the in silico mapping of Brassica BAC end sequences onto the Arabidopsis genome. Planned extensions to this resource include the integration of gene expression data and the development of an EnsEMBL-based genome viewer.

5.
The accelerating growth of public microbial genomic data imposes a substantial burden on the research community that uses such resources. Building databases of non-redundant reference sequences from massive microbial genomic data based on clustering analysis is essential. However, existing clustering algorithms perform poorly on long genomic sequences. In this article, we present Gclust, a parallel program for clustering complete or draft genomic sequences, where clustering is accelerated with a novel parallelization strategy and a fast sequence comparison algorithm using sparse suffix arrays (SSAs). Moreover, genome identity measures between two sequences are calculated based on their maximal exact matches (MEMs). In this paper, we demonstrate the high speed and clustering quality of Gclust by examining four genome sequence datasets. Gclust is freely available for non-commercial use at https://github.com/niu-lab/gclust. We also introduce a web server for clustering user-uploaded genomes at http://niulab.scgrid.cn/gclust.

6.
Lyssaviruses are RNA viruses with single-strand, negative-sense genomes responsible for rabies-like diseases in mammals. To date, genomic and evolutionary studies have most often utilized partial genome sequences, particularly of the nucleoprotein and glycoprotein genes, with little consideration of genome-scale evolution. Herein, we report the first genomic and evolutionary analysis using complete genome sequences of all recognised lyssavirus genotypes, including 14 new complete genomes of field isolates from 6 genotypes and one genotype that is completely sequenced for the first time. In doing so we significantly increase the extent of genome sequence data available for these important viruses. Our analysis of these genome sequence data reveals that all lyssaviruses have the same genomic organization. A phylogenetic analysis reveals strong geographical structuring, with the greatest genetic diversity in Africa, and an independent origin for the two known genotypes that infect European bats. We also suggest that multiple genotypes may exist within the diversity of viruses currently classified as 'Lagos Bat'. In sum, we show that rigorous phylogenetic techniques based on full length genome sequence provide the best discriminatory power for genotype classification within the lyssaviruses.

7.
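Raccoon dog and arctic fox amdoparvovirus (RFAV) is a novel amdoparvovirus species that naturally infects raccoon dogs and blue foxes. This study aimed to sequence the complete RFAV genome and to predict and analyze the molecular characteristics of its terminal hairpin sequences. Using segmented cloning, we obtained three complete RFAV genome sequences of 4832 nt, 4827 nt, and 4830 nt, designated RFAV-Y9J, RFAV-RD15, and RFAV-HS-R, respectively. The secondary structures of the RFAV terminal sequences were predicted with online software and compared for homology with the terminal sequences of Aleutian mink disease virus (AMDV). The results show that the 3′-terminal genomic sequences are highly conserved both between and within amdoparvovirus species, and all contain a 116-nt Y-shaped hairpin structure. The 5′ termini of strains RFAV-Y9J and RFAV-RD15 contain U-shaped hairpin structures of 310 nt and 305 nt, respectively; the 5′-terminal genomic sequences are highly conserved within RFAV and within AMDV but vary considerably between the two species. This study is the first to completely sequence the 3′ and 5′ termini of RFAV. It provides an effective method for amplifying the terminal sequences of other amdoparvovirus species and lays a foundation for constructing an infectious clone of the full-length RFAV genome.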

8.
The recently published complete DNA sequence of the bacterium Thermotoga maritima provides evidence, based on protein sequence conservation, for lateral gene transfer between Archaea and Bacteria. We introduce a new method of periodicity analysis of DNA sequences, based on structural parameters, which brings independent evidence for the lateral gene transfer in the genome of T. maritima. The structural analysis relates the Archaea-like DNA sequences to the genome of Pyrococcus horikoshii. Analysis of 24 complete genomic DNA sequences shows different periodicity patterns for organisms of different origin. The typical genomic periodicity for Bacteria is 11 bp whilst it is 10 bp for Archaea. Eukaryotes have more complex spectra but the dominant period in the yeast Saccharomyces cerevisiae is 10.2 bp. These periodicities are most likely reflective of differences in chromatin structure.

9.
In this paper, we introduce a probabilistic measure for computing the similarity between two biological sequences without alignment. The computation of the similarity measure is based on the Kullback-Leibler divergence of two constructed Markov models. We first validate the method by clustering nine chromosomes from three species. Second, we give the results of a similarity search based on our new method. Last, we apply the measure to the construction of a phylogenetic tree of 48 HEV genome sequences. Our results indicate that the weighted relative entropy is an efficient and powerful alignment-free measure for the analysis of sequences at the genomic scale.
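As an illustration of comparing sequences through Markov models, the sketch below fits a first-order model to each sequence and sums the Kullback-Leibler divergence over transition rows. Several choices here are assumptions rather than the paper's method: the model order, the add-one pseudocount, the equal weighting of rows (the abstract does not specify the weighting behind its "weighted relative entropy"), and the symmetrization. All names are hypothetical.

```python
import math
from collections import Counter


def markov_model(seq, alphabet="ACGT", pseudocount=1.0):
    """First-order transition probabilities P(b | a) with additive smoothing."""
    counts = Counter(zip(seq, seq[1:]))
    model = {}
    for a in alphabet:
        row_total = sum(counts[(a, b)] for b in alphabet) + pseudocount * len(alphabet)
        for b in alphabet:
            model[(a, b)] = (counts[(a, b)] + pseudocount) / row_total
    return model


def kl_divergence(p, q):
    """Sum over contexts a of KL(P(.|a) || Q(.|a)); rows weighted equally (assumption)."""
    return sum(p[key] * math.log(p[key] / q[key]) for key in p)


def symmetric_divergence(s, t):
    """Symmetrized divergence between the models fitted to sequences s and t."""
    p, q = markov_model(s), markov_model(t)
    return kl_divergence(p, q) + kl_divergence(q, p)
```

The pseudocount keeps every probability positive, so the logarithm is always defined even when a transition never occurs in one of the sequences.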

10.
Genomics, 2019, 111(6): 1574-1582
Given the vast amount of genomic data, alignment-free sequence comparison methods are required due to their low computational complexity. k-mer based methods can improve comparison accuracy by extracting an effective feature of the genome sequences. The aim of this paper is to extract k-mer intervals of a sequence as a feature of a genome for high comparison accuracy. In the proposed method, we calculated the distance between genome sequences by comparing the distribution of k-mer intervals. Then, we identified the classification results using phylogenetic trees. We used viral, mitochondrial (MT), microbial and mammalian genome sequences to perform classification for various genome sets. We confirmed that the proposed method provides a better classification result than other k-mer based methods. Furthermore, the proposed method could efficiently be applied to long sequences such as human and mouse genomes.
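One way to read "distribution of k-mer intervals" is sketched below: record the gap between consecutive occurrences of the same k-mer, pool the gaps over all k-mers into a normalized histogram, and compare two sequences by the L1 distance between their histograms. The pooling and the L1 distance are assumptions made for illustration, since the abstract does not give the exact feature or metric, and the function names are hypothetical.

```python
from collections import Counter


def kmer_interval_distribution(seq, k):
    """Normalized histogram of gaps between consecutive occurrences of each k-mer."""
    last_seen = {}
    intervals = Counter()
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        if kmer in last_seen:
            intervals[i - last_seen[kmer]] += 1
        last_seen[kmer] = i
    total = sum(intervals.values())
    return {gap: n / total for gap, n in intervals.items()} if total else {}


def interval_distance(s, t, k):
    """L1 distance between the interval distributions of s and t (assumed metric)."""
    p = kmer_interval_distribution(s, k)
    q = kmer_interval_distribution(t, k)
    return sum(abs(p.get(g, 0.0) - q.get(g, 0.0)) for g in set(p) | set(q))
```

A pairwise matrix of such distances could then be passed to any standard tree-building method to obtain the kind of phylogenetic tree the abstract evaluates.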

11.
As sequencing technology improves, an increasing number of projects aim to generate full genome sequence, even for nonmodel taxa. These projects may be feasibly conducted at lower read depths if the alignment can be aided by previously developed genomic resources from a closely related species. We investigated the feasibility of constructing a complete mitochondrial (mt) genome without preamplification or other targeting of the sequence. Here we present a full mt genome sequence (16,463 nucleotides) for the bighorn sheep (Ovis canadensis) generated through alignment of SOLiD short-read sequences to a reference genome. Average read depth was 1240, and each base was covered by at least 36 reads. We then conducted a phylogenomic analysis with 27 other bovid mitogenomes, which placed bighorn sheep firmly in the Ovis clade. These results show that it is possible to generate a complete mitogenome by skimming a low-coverage genomic sequencing library. This technique will become increasingly applicable as the number of taxa with some level of genome sequence rises.

12.
13.
Next‐generation sequencing (NGS) is emerging as an efficient and cost‐effective tool in population genomic analyses of nonmodel organisms, allowing simultaneous resequencing of many regions of multi‐genomic DNA from multiplexed samples. Here, we detail our synthesis of protocols for targeted resequencing of mitochondrial and nuclear loci by generating indexed genomic libraries for multiplexing up to 100 individuals in a single sequencing pool, and then enriching the pooled library using custom DNA capture arrays. Our use of DNA sequence from one species to capture and enrich the sequencing libraries of another species (i.e. cross‐species DNA capture) indicates that efficient enrichment occurs when sequences are up to about 12% divergent, allowing us to take advantage of genomic information in one species to sequence orthologous regions in related species. In addition to a complete mitochondrial genome on each array, we have included between 43 and 118 nuclear loci for low‐coverage sequencing of between 18 kb and 87 kb of DNA sequence per individual for single nucleotide polymorphism discovery from 50 to 100 individuals in a single sequencing lane. Using this method, we have generated a total of over 500 whole mitochondrial genomes from seven cetacean species and green sea turtles. The greater variation detected in mitogenomes relative to short mtDNA sequences is helping to resolve genetic structure ranging from geographic to species‐level differences. These NGS and analysis techniques have allowed for simultaneous population genomic studies of mtDNA and nDNA with greater genomic coverage and phylogeographic resolution than has previously been possible in marine mammals and turtles.

14.
Numerous QTL for a variety of phenotypic traits in dairy and beef cattle have been mapped on bovine chromosome 6 (BTA6). The complete and validated information on the molecular genome organization is an essential prerequisite for the conclusive identification of the causative sequence variation underlying the QTL. In our study we describe efforts to improve the genomic sequence map assembly of BTA6 by filling-in gaps and by suggesting sequence contig rearrangements. This is achieved by the generation and in silico mapping of BAC-end sequences (BESs) from clones containing sequences placed on our high-resolution radiation hybrid (RH) map of BTA6 onto the genome sequence map. Linking high-resolution RH mapping with in silico mapping of BESs on BTA6 enabled the detection of discrepancies in chromosomal assignments of genome sequence contigs and improved the resolution of non-conclusive assignments on the genome sequence assembly. Furthermore, 37% of BESs enabled chromosomal assignment of contigs previously unassigned. Anchoring of 66% of BESs onto HSA4 confirmed the synteny of the respective region of BTA6 including the known evolutionary breakpoints. The BESs will play an important role in the ongoing efforts to complete the sequence of the bovine genome and will also provide a source for the identification of new polymorphic sites in the genome sequence to resolve QTL-containing intervals.

15.
Barley yellow dwarf disease (BYD) was first recognized as an aphid-transmitted virus disease by Oswald and Houston[1] in 1951. Barley yellow dwarf viruses (BYDVs) are now classified as members of the plant virus family Luteoviridae. They are phloem-limited and obligately transmitted in a circulative/persistent manner by several species of cereal aphids, and they can cause significant economic losses worldwide through damage to barley, wheat, and oats. In China, BYDVs cause mainly yello…

16.

Background  

The recent determination of complete chloroplast (cp) genomic sequences of various plant species has enabled numerous comparative analyses as well as advances in plant and genome evolutionary studies. In angiosperms, the complete cp genome sequences of about 70 species have been determined, whereas those of only three gymnosperm species, Cycas taitungensis, Pinus thunbergii, and Pinus koraiensis have been established. The lack of information regarding the gene content and genomic structure of gymnosperm cp genomes may severely hamper further progress of plant and cp genome evolutionary studies. To address this need, we report here the complete nucleotide sequence of the cp genome of Cryptomeria japonica, the first in the Cupressaceae sensu lato of gymnosperms, and provide a comparative analysis of their gene content and genomic structure that illustrates the unique genomic features of gymnosperms.

17.
The complete nucleotide sequence of the genomic RNA of BYDV-GAV was determined. It comprised 5685 nucleotides and contained six open reading frames and four untranslated regions. The size and organization of the BYDV-GAV genome were similar to those of BYDV PAV-aus. The nucleotide and deduced amino acid sequences of the six ORFs were aligned and compared with those of other luteoviruses. The results showed that there was a high degree of identity between BYDV-GAV and MAV-PS1 in all ORFs except ORF5 and ORF6, which had only 87.4% and 70.2% identities respectively. The reported genomic nucleotide sequence of MAV was shorter than that of BYDV-GAV, but the comparison of the genomic nucleotide sequences for MAV-PS1 and GAV showed 90.4% sequence identity for the same region of the genome. According to the level of sequence similarities, BYDV-GAV should be closely related to BYDV-MAV.

18.
We introduce a novel, linguistic-like method of genome analysis. We propose a natural approach to characterizing genomic sequences based on occurrences of fixed length words from a predefined, sufficiently large set of words (strings over the alphabet {A, C, G, T} ). A measure based on this approach is called compositional spectrum and is actually a histogram of imperfect word occurrences. Our results assert that the compositional spectrum is an overall characteristic of a long sequence i.e., a complete genome or an uninterrupted part of a chromosome. This attribute is manifested in the similarity of spectra obtained on different stretches of the same genome, and simultaneously in a broad range of dissimilarities between spectral representations of different genomes. High flexibility characterizes this approach due to imperfect matching and as a result sets of relatively long words can be considered. The proposed approach may have various applications in intra- and intergenomic sequence comparisons.
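The phrase "histogram of imperfect word occurrences" can be made concrete with a short sketch. It assumes that an imperfect occurrence means a window within a fixed Hamming distance of the word and that the word set is simply supplied by the caller; the abstract specifies neither the mismatch model nor how the word set is chosen, and the names are hypothetical. The scan is brute force, not an optimized implementation.

```python
def hamming(a, b):
    """Number of mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def compositional_spectrum(seq, words, max_mismatches=1):
    """For each word, count the windows of seq within max_mismatches of it."""
    spectrum = []
    for word in words:
        length = len(word)
        hits = sum(
            1
            for i in range(len(seq) - length + 1)
            if hamming(seq[i:i + length], word) <= max_mismatches
        )
        spectrum.append(hits)
    return spectrum
```

Spectra computed on two stretches of sequence with the same word set can then be compared with any vector distance, which is how similarity within a genome and dissimilarity between genomes would show up.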

19.
A system to use bovine EST data in conjunction with human genomic sequence to improve the bovine linkage map over the entire genome or on specific chromosomes was evaluated. Bovine EST sequence was used to provide primer sequences corresponding to bovine genes, while human genomic sequence directed primer design to flank introns and produce amplicons of appropriate size for efficient direct sequencing. The sequence tagged sites (STS) produced in this way from the four sires of the MARC reference families were examined for single nucleotide polymorphisms (SNPs) that could be used to map the corresponding genes. With this approach, along with a primer/extension mass spectrometry SNP genotyping assay, 100 ESTs were placed on the bovine genetic linkage map. The first 70 were chosen at random from bovine EST–human genomic comparisons. An additional 30 ESTs were successfully mapped to bovine Chromosome 19 (BTA19), and comparison of the resulting BTA19 map to the position of the corresponding human orthologs on the HSA17 draft sequences revealed differences in the spacing and order of genes. Over 80% of successful amplicons contained SNPs, indicating that this is an efficient approach to generating EST-associated genetic markers. We have demonstrated the feasibility of constructing a linkage map based on SNPs associated with ESTs and the plausibility of utilizing EST, comparative mapping information, and human sequence data to target regions of the bovine genome for SNP marker development.

20.
Expressed sequence tag projects have currently produced over 400 000 partial gene sequences from more than 30 nematode species and the full genomic sequences of selected nematodes are being determined. In addition, functional analyses in the model nematode Caenorhabditis elegans have addressed the role of almost all genes predicted by the genome sequence. This recent explosion in the amount of available nematode DNA sequences, coupled with new gene function data, provides an unprecedented opportunity to identify pre-validated drug targets through efficient mining of nematode genomic databases. This article describes the various information sources available and strategies that can expedite this process.
