Similar articles
 20 similar articles found
1.
Prophage loci often remain under-annotated or even unrecognized in prokaryotic genome sequencing projects. A PHP application, Prophage Finder, has been developed and implemented to predict prophage loci, based upon clusters of phage-related gene products encoded within DNA sequences. This application provides results detailing several facets of these clusters to facilitate rapid prediction and analysis of prophage sequences. Prophage Finder was tested using previously annotated prokaryotic genomic sequences with manually curated prophage loci as benchmarks. Additional analyses from Prophage Finder searches of several draft prokaryotic genome sequences are available through the Web site (http://bioinformatics.uwp.edu/~phage/DOEResults.php) to illustrate the potential of this application.
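The idea of predicting a locus from a cluster of phage-related hits can be sketched as follows. This is only an illustration of the general clustering idea, not Prophage Finder's actual algorithm: the function name `cluster_hits`, the parameters `max_gap` and `min_hits`, and their default values are all assumptions made for this example.

```python
from typing import List, Tuple


def cluster_hits(
    positions: List[int], max_gap: int = 5000, min_hits: int = 3
) -> List[Tuple[int, int, int]]:
    """Group hit positions into candidate loci.

    Returns (start, end, hit_count) for each run of hits in which
    consecutive hits lie at most `max_gap` bases apart and the run holds
    at least `min_hits` hits. Both thresholds are hypothetical.
    """
    clusters = []
    run: List[int] = []
    for pos in sorted(positions):
        # A gap larger than max_gap closes the current run.
        if run and pos - run[-1] > max_gap:
            if len(run) >= min_hits:
                clusters.append((run[0], run[-1], len(run)))
            run = []
        run.append(pos)
    # Flush the final run.
    if len(run) >= min_hits:
        clusters.append((run[0], run[-1], len(run)))
    return clusters
```

In practice the input positions would come from a similarity search against phage proteins; here they are simply integers supplied by the caller.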

2.

Background  

Pathogenicity islands (PAIs), distinct genomic segments of pathogens encoding virulence factors, represent a subgroup of genomic islands (GIs) that have been acquired through horizontal gene transfer events. Up to now, computational approaches for identifying PAIs have focused on the detection of genomic regions that differ from the rest of the genome only in their base composition and codon usage. These approaches often lead to the identification of genomic islands, rather than PAIs.
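The composition-based approach described above can be illustrated with a sliding-window GC scan. This is a minimal sketch under stated assumptions: the function names, the window size, the step and the deviation threshold are all hypothetical, and codon usage is not considered.

```python
from typing import List, Tuple


def gc_fraction(seq: str) -> float:
    """Fraction of G and C bases in seq (0.0 for an empty string)."""
    if not seq:
        return 0.0
    seq = seq.upper()
    return (seq.count("G") + seq.count("C")) / len(seq)


def atypical_windows(
    genome: str, window: int = 1000, step: int = 500, threshold: float = 0.10
) -> List[Tuple[int, float]]:
    """Return (start, gc) for every window whose GC fraction differs from
    the genome-wide GC fraction by more than `threshold`."""
    genome_gc = gc_fraction(genome)
    flagged = []
    for start in range(0, len(genome) - window + 1, step):
        gc = gc_fraction(genome[start:start + window])
        if abs(gc - genome_gc) > threshold:
            flagged.append((start, gc))
    return flagged
```

As the abstract notes, a scan of this kind flags compositionally atypical regions in general, so by itself it points to genomic islands rather than specifically to PAIs.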

3.

Background  

The binding of regulatory proteins to their specific DNA targets determines the accurate expression of the neighboring genes. The in silico prediction of new binding sites in completely sequenced genomes is a key aspect in the deeper understanding of gene regulatory networks. Several algorithms have been described to discriminate against false positives in the prediction of new binding targets; however, none of them has so far been implemented to assist the detection of binding sites at the genomic scale.

4.
Liu Z, Ma Q, Cao J, Gao X, Ren J, Xue Y. Molecular BioSystems, 2011, 7(10): 2737-2740
Recent experiments revealed the prokaryotic ubiquitin-like protein (PUP) to be a signal for the selective degradation of proteins in Mycobacterium tuberculosis (Mtb). By covalently conjugating the PUP, pupylation functions as a critical post-translational modification (PTM) conserved in actinomycetes. Here, we designed a novel computational tool, GPS-PUP, for the prediction of pupylation sites, which was shown to have a promising performance. From small-scale and large-scale studies we collected 238 potentially pupylated substrates for which the exact pupylation sites were still not determined. As an example application, we predicted at least one potential pupylation site in ~85% of these proteins. Furthermore, through functional analysis, we observed that pupylation can target various substrates so as to regulate a broad array of biological processes, such as the response to stress, sulfate and proton transport, and metabolism. The prediction and analysis results prove to be useful for further experimental investigation. The GPS-PUP 1.0 is freely available at: .

5.

Background

Most microarray studies are made using labelling with one or two dyes, which allows the hybridization of one or two samples on the same slide. In such experiments, the most frequently used dyes are Cy3 and Cy5. Recent improvements in the technology (dye-labelling, scanner, and image analysis) allow hybridization of up to four samples simultaneously. The two additional dyes are Alexa488 and Alexa494. The triple-target or four-target technology is very promising, since it allows more flexibility in the design of experiments, an increase in the statistical power when comparing gene expressions induced by different conditions, and a reduced number of slides. However, few methods have been proposed for the statistical analysis of such data. Moreover, the lowess correction of the global dye effect is available only for two-color experiments, and even if its application can be derived, it does not allow simultaneous correction of the raw data.

Results

We propose a two-step normalization procedure for triple-target experiments. First the dye bleeding is evaluated and corrected if necessary. Then the signal in each channel is normalized using a generalized lowess procedure to correct a global dye bias. The normalization procedure is validated using triple-self experiments and by comparing the results of triple-target and two-color experiments. Although the focus is on triple-target microarrays, the proposed method can be used to normalize p differently labelled targets co-hybridized on the same array, for any value of p greater than 2.

Conclusion

The proposed normalization procedure is effective: the technical biases are reduced, the number of false positives is under control in the analysis of differentially expressed genes, and the triple-target experiments are more powerful than the corresponding two-color experiments. There is room for improving the microarray experiments by simultaneously hybridizing more than two samples.

6.
7.
Pandit SB, Srinivasan N. Proteins, 2003, 52(4): 585-597
The members of the family of G-proteins are characterized by their ability to bind and hydrolyze guanosine triphosphate (GTP) to guanosine diphosphate (GDP). Despite a common biochemical function of GTP hydrolysis shared among the members of the family of G-proteins, they are associated with diverse biological roles. The current work describes the identification and detailed analysis of the putative G-proteins encoded in the completely sequenced prokaryotic genomes. Inferences on the biological roles of these G-proteins have been obtained by their classification into known functional subfamilies. We have identified 497 G-proteins in 42 genomes. Seven small GTP-binding protein homologues have been identified in prokaryotes with at least two of the diagnostic sequence motifs of G-proteins conserved. The translation factors have the largest representation (234 sequences) and are found to be ubiquitous, which is consistent with their critical role in protein synthesis. The GTP_OBG subfamily comprises 79 sequences in our dataset. A total of 177 sequences belong to the subfamily of GTPases of unknown function, and 154 of these could be associated with domains of known functions such as cell cycle regulation and tRNA modification. The large GTP-binding proteins and the alpha-subunit of heterotrimeric G-proteins are not detected in the genomes of the prokaryotes surveyed.

8.
9.

Background  

The conservation of gene order among prokaryotic genomes can provide valuable insight into gene function, protein interactions, or events by which genomes have evolved. Although some tools are available for visualizing and comparing the order of genes between genomes of study, few support an efficient and organized analysis between large numbers of genomes. The Prokaryotic Sequence homology Analysis Tool (PSAT) is a web tool for comparing gene neighborhoods among multiple prokaryotic genomes.

10.

Background  

Horizontal gene transfer (HGT) is considered a strong evolutionary force shaping the content of microbial genomes in a substantial manner. It is the difference in speed enabling the rapid adaptation to changing environmental demands that distinguishes HGT from gene genesis, duplications or mutations. For a precise characterization, algorithms are needed that identify transfer events with high reliability. Frequently, the transferred pieces of DNA have a considerable length, comprise several genes and are called genomic islands (GIs) or more specifically pathogenicity or symbiotic islands.

11.
MRD is a database system for accessing microsatellite repeat information for genomes such as archaea, eubacteria, and other eukaryotic genomes whose sequence information is available in the public domain. MRD stores information about simple tandemly repeated k-mer sequences where k = 1 to 6, i.e. monomer to hexamer. The web interface allows users to search for a repeat of interest and to learn about the association of the repeat with genes and genomic regions in the specific organism. The data contain the abundance and distribution of microsatellites in the coding and non-coding regions of the genome. The exact location of repeats with respect to genomic regions of interest (such as UTR, exon, intron or intergenic regions), whichever is applicable to the organism, is highlighted. MRD is available on the World Wide Web at and/or . The database is designed as an open-ended system to accommodate the microsatellite repeat information of other genomes whose complete sequences will become available in the public domain in future.
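To make the notion of simple tandemly repeated k-mers (k = 1 to 6) concrete, the following sketch scans a DNA string with a regular expression. It is an illustration only and does not reproduce how MRD was built: the function names and the `min_repeats` threshold are assumptions, and motifs that are themselves repeats of a shorter unit (for example `AA`) are skipped so that each repeat is reported once under its shortest motif.

```python
import re
from typing import List, Tuple


def is_primitive(motif: str) -> bool:
    """True if motif is not itself a repetition of a shorter unit."""
    n = len(motif)
    for d in range(1, n):
        if n % d == 0 and motif[:d] * (n // d) == motif:
            return False
    return True


def find_microsatellites(
    seq: str, min_repeats: int = 3, max_k: int = 6
) -> List[Tuple[int, str, int]]:
    """Return (start, motif, repeat_count) for tandem repeats of k-mers,
    k = 1..max_k, that occur at least `min_repeats` times in a row."""
    seq = seq.upper()
    hits = []
    for k in range(1, max_k + 1):
        # One k-mer followed by at least (min_repeats - 1) copies of itself.
        pattern = re.compile(r"([ACGT]{%d})\1{%d,}" % (k, min_repeats - 1))
        pos = 0
        while True:
            m = pattern.search(seq, pos)
            if m is None:
                break
            motif = m.group(1)
            if is_primitive(motif):
                hits.append((m.start(), motif, len(m.group(0)) // k))
                pos = m.end()
            else:
                # Skip this start only, so a repeat beginning inside the
                # rejected match is still found.
                pos = m.start() + 1
    return sorted(hits)
```

For example, `find_microsatellites("TTGACACACACGGAAAAT")` reports the dinucleotide repeat `AC` starting at index 3 and the mononucleotide run of `A` starting at index 13. Positions are 0-based, and associating repeats with genes or genomic regions would need annotation data that this sketch does not model.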

12.

Background  

Selenocysteine and pyrrolysine are the 21st and 22nd amino acids, which are genetically encoded by stop codons. Since a number of microbial genomes have been completely sequenced to date, it is tempting to ask whether a 23rd amino acid is left undiscovered in these genomes. Recently, a computational study addressed this question and reported that no tRNA gene for an unknown amino acid was found in the available genome sequences. However, the performance of the tRNA prediction program on an unknown tRNA family, which may have atypical sequence and structure, is unclear, thereby rendering the result inconclusive. A protein-level study will provide independent insight into the novel amino acid.

13.

Background  

Improvements in DNA sequencing technology and methodology have led to the rapid expansion of databases comprising DNA sequence, gene and genome data. Lower operational costs and heightened interest resulting from initial intriguing novel discoveries from genomics are also contributing to the accumulation of these data sets. A major challenge is to analyze and to mine data from these databases, especially whole genomes. There is a need for computational tools that look globally at genomes for data mining.

14.
Detecting uber-operons in prokaryotic genomes
Che D, Li G, Mao F, Wu H, Xu Y. Nucleic Acids Research, 2006, 34(8): 2418-2427

15.
Insertion sequences (ISs) are small DNA segments that are often capable of moving neighbouring genes. Over 1500 different ISs have been identified to date. They can have large and spectacular effects in shaping and reshuffling the bacterial genome. Recent studies have provided dramatic examples of such IS activity, including massive IS expansion during the emergence of some pathogenic bacterial species and the intimate involvement of ISs in assembling genes into complex plasmid structures. However, a global understanding of their impact on bacterial genomes requires detailed knowledge of their distribution across the eubacterial and archaeal kingdoms, understanding their partition between chromosomes and extra-chromosomal elements (e.g. plasmids and viruses) and the factors which influence this, and appreciation of the different transposition mechanisms in action, the target preferences and the host factors that influence transposition. In addition, defective (non-autonomous) elements, which can be complemented by related active elements in the same cell, are often overlooked in genome annotations but also contribute to the evolution of genome organisation.

16.
Prokaryotic genomics is shifting towards comparative approaches to unravel how and why genomes change over time. Both phylogenetic and population genetics approaches are required to dissect the relative roles of selection and drift under these conditions. Lineages evolve adaptively by selection of changes in extant genomes, and the way this occurs is being explored from a systemic and evolutionary perspective to understand how mutations relate to gene repertoire changes and how both are contextualized in cellular networks. Through an increased appreciation of genome dynamics in given ecological contexts, a more detailed picture of the genetic basis of prokaryotic evolution is emerging.

17.
18.
We present ParaDB (http://abi.marseille.inserm.fr/paradb/), a new database for large-scale paralogy studies in vertebrate genomes. We intended to collect all information (sequence, mapping and phylogenetic data) needed to map and detect new paralogous regions, previously defined as Paralogons. The AceDB database software was used to generate graphical objects and to organize data. General data were automatically collated from public sources (Ensembl, GadFly and RefSeq). ParaDB provides access to data derived from whole genome sequences (Homo sapiens, Mus musculus and Drosophila melanogaster): cDNA and protein sequences, positional information, bibliographical links. In addition, we provide BLAST results for each protein sequence, InParanoid orthologs and 'In-Paralogs' data, previously established paralogy data, and, to compare vertebrates and Drosophila, orthology data.

19.
Single amplified genomes and genomes assembled from metagenomes have enabled the exploration of uncultured microorganisms at an unprecedented scale. However, both these types of products are plagued by contamination. Since these genomes are now being generated in a high-throughput manner and sequences from them are propagating into public databases to drive novel scientific discoveries, rigorous quality controls and decontamination protocols are urgently needed. Here, we present ProDeGe (Protocol for fully automated Decontamination of Genomes), the first computational protocol for fully automated decontamination of draft genomes. ProDeGe classifies sequences into two classes (clean and contaminant) using a combination of homology and feature-based methodologies. On average, 84% of sequence from the non-target organism is removed from the data set (specificity) and 84% of the sequence from the target organism is retained (sensitivity). The procedure operates successfully at a rate of ~0.30 CPU core hours per megabase of sequence and can be applied to any type of genome sequence.

Recent technological advancements have enabled the large-scale sampling of genomes from uncultured microbial taxa, through the high-throughput sequencing of single amplified genomes (SAGs; Rinke et al., 2013; Swan et al., 2013) and assembly and binning of genomes from metagenomes (GMGs; Cuvelier et al., 2010; Sharon and Banfield, 2013). The importance of these products in assessing community structure and function has been established beyond doubt (Kalisky and Quake, 2011). Multiple Displacement Amplification (MDA) and sequencing of single cells has been immensely successful in capturing rare and novel phyla, generating valuable references for phylogenetic anchoring.
However, efforts to conduct MDA and sequencing in a high-throughput manner have been heavily impaired by contamination from DNA introduced by the environmental sample, as well as introduced during the MDA or sequencing process (Woyke et al., 2011; Engel et al., 2014; Field et al., 2014). Similarly, metagenome binning and assembly often carries various errors and artifacts depending on the methods used (Nielsen et al., 2014). Even cultured isolate genomes have been shown to lack immunity to contamination with other species (Parks et al., 2014; Mukherjee et al., 2015). As sequencing of these genome product types rapidly increases, contaminant sequences are finding their way into public databases as reference sequences. It is therefore extremely important to define standardized and automated protocols for quality control and decontamination, which would go a long way towards establishing quality standards for all microbial genome product types.

Current procedures for decontamination and quality control of genome sequences in single cells and metagenome bins are heavily manual and can consume hours/megabase when performed by expert biologists. Supervised decontamination typically involves homology-based inspection of ribosomal RNA sequences and protein coding genes, as well as visual analysis of k-mer frequency plots and guanine–cytosine content (Clingenpeel, 2015). Manual decontamination is also possible through the software SmashCell (Harrington et al., 2010), which contains a tool for visual identification of contaminants from a self-organizing map and corresponding U-matrix. Another existing software tool, DeconSeq (Schmieder and Edwards, 2011), automatically removes contaminant sequences; however, the contaminant databases are required input.
The former lacks automation, whereas the latter requires prior knowledge of contaminants, rendering both applications impractical for high-throughput decontamination.

Here, we introduce ProDeGe, the first fully automated computational protocol for decontamination of genomes. ProDeGe uses a combination of homology-based and sequence composition-based approaches to separate contaminant sequences from the target genome draft. It has been pre-calibrated to discard at least 84% of the contaminant sequence, which results in retention of a median 84% of the target sequence. The standalone software is freely available at http://prodege.jgi-psf.org//downloads/src and can be run on any system that has Perl, R (R Core Team, 2014), Prodigal (Hyatt et al., 2010) and NCBI Blast (Camacho et al., 2009) installed. A graphical viewer allowing further exploration of data sets and exporting of contigs accompanies the web application for ProDeGe at http://prodege.jgi-psf.org, which is open to the wider scientific community as a decontamination service (Supplementary Figure S1).

The assembly and corresponding NCBI taxonomy of the data set to be decontaminated are required inputs to ProDeGe (Figure 1a). Contigs are annotated with genes, following which eukaryotic contamination is removed based on homology of genes at the nucleotide level using the eukaryotic subset of NCBI's Nucleotide database as the reference. For detecting prokaryotic contamination, a curated database of reference contigs from the set of high-quality genomes within the Integrated Microbial Genomes (IMG; Markowitz et al., 2014) system is used as the reference. This ensures that errors in public reference databases due to poor quality of sequencing, assembly and annotation do not negatively impact the decontamination process.
Contigs determined as belonging to the target organism based on nucleotide level homology to sequences in the above database are defined as ‘Clean’, whereas those aligned to other organisms are defined as ‘Contaminant’. Contigs whose origin cannot be determined based on alignment are classified as ‘Undecided’. Classified clean and contaminated contigs are used to calibrate the separation in the subsequent 5-mer based binning module, which classifies undecided contigs as ‘Clean’ or ‘Contaminant’ using principal components analysis (PCA) of 5-mer frequencies. This parameter can also be specified by the user. When data sets do not have taxonomy deeper than phylum level, or a single confident taxonomic bin cannot be detected using sequence alignment, solely 9-mer based binning is used due to more accurate overall classification. In the absence of a user-defined cutoff, a pre-calibrated cutoff for 80% or more specificity separates the clean contigs from contaminated sequences in the resulting PCA of the 9-mer frequency matrix. Details on ProDeGe's custom database, evaluation of the performance of the system and exploration of the parameter space to calibrate ProDeGe for a highly accurate classification rate are provided in the Supplementary Material.

Figure 1. (a) Schematic overview of the ProDeGe engine. (b) Features of data sets used to validate ProDeGe: SAGs from the Arabidopsis endophyte sequencing project, MDM project, public data sets found in IMG but not sequenced at the JGI, as well as genomes from metagenomes.
All the data and results can be found in Supplementary Table S3.

The performance of ProDeGe was evaluated using 182 manually screened SAGs (Figure 1b, Supplementary Table S1) from two studies whose data sets are publicly available within the IMG system: genomes of 107 SAGs from an Arabidopsis endophyte sequencing project and 75 SAGs from the Microbial Dark Matter (MDM) project* (only 75/201 SAGs from the MDM project had 1:1 mapping between contigs in the unscreened and the manually screened versions, hence these were used; Rinke et al., 2013). Manual curation of these SAGs demonstrated that the use of ProDeGe prevented 5311 potentially contaminated contigs in these data sets from entering public databases. Figure 2a demonstrates the sensitivity vs specificity plot of ProDeGe results for the above data sets. Most of the data points in Figure 2a cluster in the top right of the box, reflecting a median retention of 89% of the clean sequence (sensitivity) and a median rejection of 100% of the sequence of contaminant origin (specificity). In addition, on average, 84% of the bases of a data set are accurately classified. ProDeGe performs best when the target organism has sequenced homologs at the class level or deeper in its high-quality prokaryotic nucleotide reference database. If the target organism's taxonomy is unknown or not deeper than domain level, or there are few contigs with taxonomic assignments, a target bin cannot be assessed and thus ProDeGe removes contaminant contigs using sequence composition only. The few samples in Figure 2a that demonstrate a higher rate of false positives (lower specificity) and/or reduced sensitivity typically occur when the data set contains few contaminant contigs or ProDeGe incorrectly assumes that the largest bin is the target bin. Some data sets contain a higher proportion of contamination than target sequence and ProDeGe's performance can suffer under this condition.
However, under all other conditions, ProDeGe demonstrates high speed, specificity and sensitivity (Figure 2). In addition, ProDeGe demonstrates better performance in overall classification when nucleotides are considered than when contigs are considered, illustrating that longer contigs are more accurately classified (Supplementary Table S1).

Figure 2. ProDeGe accuracy and performance scatterplots of 182 manually curated single amplified genomes (SAGs), where each symbol represents one SAG data set. (a) Accuracy shown by sensitivity (proportion of bases confirmed ‘Clean’) vs specificity (proportion of bases confirmed ‘Contaminant’) from the Endophyte and Microbial Dark Matter (MDM) data sets. Symbol size reflects input data set size in megabases. Most points cluster in the top right of the plot, showing ProDeGe's high accuracy. Median and average overall results are shown in Supplementary Table S1. (b) ProDeGe completion time in central processing unit (CPU) core hours for the 182 SAGs. ProDeGe operates successfully at an average rate of 0.30 CPU core hours per megabase of sequence. Principal components analysis (PCA) of a 9-mer frequency matrix costs more computationally than PCA of a 5-mer frequency matrix used with blast-binning. The lack of known taxonomy for the MDM data sets prevents blast-binning, thus showing longer finishing times than the endophyte data sets, which have known taxonomy for use in blast-binning.

All SAGs used in the evaluation of ProDeGe were assembled using SPAdes (Bankevich et al., 2012). In-house testing has shown that reads assembled with SPAdes from different strains or even slightly divergent species of the same genera may be combined into the same contig (Personal communications, KT and Robert Bowers). Ideally, the DNA in a well that gets sequenced belongs to a single cell.
In the best case, contaminant sequences need to be at least from a different species to be recognized as such by the homology-based screening stage. In the absence of closely related sequenced organisms, contaminant sequences need to be at least from a different genus to be recognized as such by the composition-based screening stage (Supplementary Material). Thus, there is little risk of ProDeGe separating sequences from clonal populations or strains. We have found species- and genus-level contamination in MDA samples to be rare.

To evaluate the quality of publicly available uncultured genomes, ProDeGe was used to screen 185 SAGs and 14 GMGs (Figure 1b). Compared with CheckM (Parks et al., 2014), a tool which calculates an estimate of genome sequence contamination using marker genes, ProDeGe generally marks a higher proportion of sequence as ‘Contaminant’ (Supplementary Table S2). This is because ProDeGe has been calibrated to perform at high specificity levels. The command line version of ProDeGe allows users to conduct their own calibration and specify a user-defined distance cutoff. Further, CheckM only outputs the proportion of contamination, but ProDeGe actually labels each contig as ‘Clean’ or ‘Contaminant’ during the process of automated removal.

The web application for ProDeGe allows users to export clean and contaminant contigs, examine contig gene calls with their corresponding taxonomies, and discover contig clusters in the first three components of their k-dimensional space. Non-linear approaches for dimensionality reduction of k-mer vectors are gaining popularity (van der Maaten and Hinton, 2008), but we observed no systematic advantage of using t-Distributed Stochastic Neighbor Embedding over PCA (Supplementary Figure S2).

ProDeGe is the first step towards establishing a standard for quality control of genomes from both cultured and uncultured microorganisms.
It is valuable for preventing the dissemination of contaminated sequence data into public databases, avoiding resulting misleading analyses. The fully automated nature of the pipeline relieves scientists of hours of manual screening, producing reliably clean data sets and enabling the high-throughput screening of data sets for the first time. ProDeGe, therefore, represents a critical component in our toolkit during an era of next-generation DNA sequencing and cultivation-independent microbial genomics.
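Two quantities used throughout the ProDeGe description, per-contig k-mer frequency vectors and base-weighted sensitivity and specificity, can be sketched in a few lines. This is a hedged illustration and not ProDeGe's code: the function names and the input layout are assumptions, and the PCA and cutoff calibration steps are deliberately left out.

```python
from collections import Counter
from itertools import product
from typing import List, Tuple


def kmer_frequencies(seq: str, k: int = 5) -> List[float]:
    """Relative frequency of every DNA k-mer in seq, listed in
    lexicographic order (AA..A first). K-mers containing characters
    other than A, C, G, T are ignored."""
    seq = seq.upper()
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    total = sum(counts[km] for km in kmers)
    return [counts[km] / total if total else 0.0 for km in kmers]


def base_weighted_accuracy(
    contigs: List[Tuple[int, bool, bool]]
) -> Tuple[float, float]:
    """contigs holds (length, truly_clean, predicted_clean) per contig.

    Sensitivity: fraction of target (clean) bases that are retained.
    Specificity: fraction of contaminant bases that are removed.
    A class with no bases yields 0.0 (a convention chosen for this sketch).
    """
    clean_total = clean_kept = contam_total = contam_removed = 0
    for length, truly_clean, predicted_clean in contigs:
        if truly_clean:
            clean_total += length
            if predicted_clean:
                clean_kept += length
        else:
            contam_total += length
            if not predicted_clean:
                contam_removed += length
    sensitivity = clean_kept / clean_total if clean_total else 0.0
    specificity = contam_removed / contam_total if contam_total else 0.0
    return sensitivity, specificity
```

Weighting by bases rather than by contig count mirrors the text's observation that classification looks better when nucleotides are considered, since long contigs dominate the totals. A frequency matrix built from `kmer_frequencies` (one row per contig) would be the natural input to a PCA routine from a numerical library.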

20.
We describe FrameD, a program that predicts coding regions in prokaryotic and matured eukaryotic sequences. Initially targeted at gene prediction in bacterial GC-rich genomes, the gene model used in FrameD also allows genes to be predicted in the presence of frameshifts and partially undetermined sequences, which makes it very suitable for gene prediction and frameshift correction in unfinished sequences such as EST and EST cluster sequences. Like recent eukaryotic gene prediction programs, FrameD also includes the ability to take into account protein similarity information both in its prediction and its graphical output. Its performance is evaluated on different bacterial genomes. The web site (http://genopole.toulouse.inra.fr/bioinfo/FrameD/FD) allows direct prediction, sequence correction and translation and the ability to learn new models for new organisms.
