Similar literature (20 results)
1.
Arbuscular mycorrhizal fungi (AMF) are plant root symbionts that play key roles in plant growth and soil fertility. They are obligate biotrophic fungi that form coenocytic multinucleated hyphae and spores. Numerous studies have shown that diverse microorganisms live on the surface of and inside their mycelia, resulting in a metagenome when whole-genome sequencing (WGS) data are obtained from sequencing AMF cultivated in vivo. The metagenome contains not only the AMF sequences, but also those from associated microorganisms. In this study, we introduce a novel bioinformatics program, Spore-associated Symbiotic Microbes (SeSaMe), designed for taxonomic classification of short sequences obtained by next-generation DNA sequencing. A genus-specific usage bias database was created based on amino acid usage and codon usage of a three consecutive codon DNA 9-mer encoding an amino acid trimer in a protein secondary structure. The program distinguishes between coding sequence (CDS) and non-CDS, and classifies a query sequence into a genus group out of 54 genera used as reference. The mean percentages of correct predictions of the CDS and the non-CDS test sets at the genus level were 71% and 50% for bacteria, 68% and 73% for fungi (excluding AMF), and 49% and 72% for AMF (Rhizophagus irregularis), respectively. SeSaMe provides not only a means for estimating taxonomic diversity and abundance but also the gene reservoir of the reference taxonomic groups associated with AMF. Therefore, it enables users to study the symbiotic roles of associated microorganisms. It is also applicable to other microorganisms as well as soil metagenomes. SeSaMe is freely available at www.fungalsesame.org.
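
The 9-mer unit described above (three consecutive codons encoding an amino acid trimer) can be illustrated with a minimal Python sketch. It only counts in-frame 9-mers in a single coding sequence; the function name `codon_trimer_counts` and the frame-0 assumption are illustrative and are not taken from SeSaMe itself, whose database construction and scoring are not described in the abstract.

```python
from collections import Counter


def codon_trimer_counts(cds: str) -> Counter:
    """Count in-frame 9-mers (three consecutive codons) in a coding sequence.

    Assumption: the sequence starts in reading frame 0.
    """
    cds = cds.upper()
    counts = Counter()
    # Step one codon (3 nt) at a time so every window stays in frame.
    for i in range(0, len(cds) - 8, 3):
        window = cds[i:i + 9]
        # Skip windows containing ambiguous bases such as N.
        if set(window) <= set("ACGT"):
            counts[window] += 1
    return counts


# Toy coding sequence: ATG GCT GCT GCT GCT TAA
counts = codon_trimer_counts("ATGGCTGCTGCTGCTTAA")
```

Normalising such counts per genus would give one possible usage-bias table, but that step is an assumption about how a database of this kind could be built.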

2.
In the process of making full-length cDNA, predicting protein coding regions helps both in the preliminary analysis of genes and in any succeeding process. However, unfinished cDNA contains artifacts, including many sequencing errors, which hinder the correct evaluation of coding sequences. Predictions for short sequences are especially difficult because they provide little information for evaluating coding potential. In this paper, we describe ANGLE, a new program for predicting coding sequences in low-quality cDNA. To achieve error-tolerant prediction, ANGLE uses a machine-learning approach, which represents coding sequences better by making maximal use of the limited information in input sequences. Our method utilizes not only codon usage but also protein structure information, which is difficult to use in stochastic model-based algorithms, and optimizes the limited information from a short segment when deciding coding potential, with the result that predictive accuracy does not depend on the length of an input sequence. The performance of ANGLE is compared with ESTSCAN on four datasets, each with a different error rate (one frame-shift error or one substitution error per 200-500 nucleotides), and on one dataset which has no errors. ANGLE outperforms ESTSCAN by 9.26% in average Matthews correlation coefficient on the short-sequence dataset (< 1000 bases). On the long-sequence dataset, ANGLE achieves comparable performance.
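
For reference, the Matthews correlation coefficient used in the comparison above can be computed from a confusion matrix as in the following Python sketch. The function name `matthews_cc` and the convention of returning 0.0 when the denominator vanishes are assumptions of this sketch, not details taken from the ANGLE paper.

```python
import math


def matthews_cc(tp: int, tn: int, fp: int, fn: int) -> float:
    """Matthews correlation coefficient from confusion-matrix counts.

    Returns 0.0 when any marginal total is zero (a common convention,
    assumed here), since the coefficient is undefined in that case.
    """
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0:
        return 0.0
    return (tp * tn - fp * fn) / denom


perfect = matthews_cc(tp=50, tn=50, fp=0, fn=0)
chance = matthews_cc(tp=25, tn=25, fp=25, fn=25)
```

The coefficient ranges from -1 (total disagreement) through 0 (chance level) to 1 (perfect prediction), which makes it suitable for comparing predictors on datasets with unequal numbers of coding and non-coding positions.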

3.
Advances in next-generation sequencing technology have facilitated metagenomics research that attempts to determine directly the whole collection of genetic material within an environmental sample (i.e. the metagenome). Identification of genes directly from short reads has become an important yet challenging problem in annotating metagenomes, since the assembly of metagenomes is often not available. Gene predictors developed for whole genomes (e.g. Glimmer) and recently developed for metagenomic sequences (e.g. MetaGene) show a significant decrease in performance as the sequencing error rates increase, or as reads get shorter. We have developed a novel gene prediction method, FragGeneScan, which combines sequencing error models and codon usages in a hidden Markov model to improve the prediction of protein-coding regions in short reads. The performance of FragGeneScan was comparable to Glimmer and MetaGene for complete genomes. But for short reads, FragGeneScan consistently outperformed MetaGene (accuracy improved ∼62% for reads of 400 bases with 1% sequencing errors, and ∼18% for short reads of 100 bases that are error free). When applied to metagenomes, FragGeneScan recovered substantially more genes than MetaGene predicted (>90% of the genes identified by homology search), and many novel genes with no homologs in current protein sequence databases.

4.
Octamer sequencing technology (OST) is a primer-directed sequencing strategy in which an individual octamer primer is selected from a pre-synthesized octamer primer library and used to sequence a DNA fragment. However, selecting candidate primers from such a library is time-consuming and can be a bottleneck in the sequencing process. To accelerate the sequencing process and to obtain high-quality sequencing data, a computer program, electronic OST or eOST, was developed to automatically identify candidate primers from an octamer primer library. eOST integrates the base calling software PHRED to provide a quality assessment for target sequences and identifies potential primer binding sites located within a high-quality target region. To increase the sequencing success rate, eOST includes a simple dynamic folding algorithm to automatically calculate the free energy and predict the secondary structure within the template in the vicinity of the octamer-binding site. Several parameters were found to be important, including the base quality threshold, the window size of the template sequence segment, and the threshold ΔG value. OST, coupled with the eOST software, can be used to sequence short DNA fragments or in the finishing assembly stage of large-scale sequencing of genomic DNA.

5.
6.
As part of our studies on the molecular mechanisms of mutation by carcinogens we have synthesized 12 oligonucleotides (15-mers) containing an O6-alkylguanine residue at a preselected position for use as primers in the enzymatic synthesis of biologically active DNA. Ten of these oligonucleotides are derived from a minus strand sequence carrying the modified nucleotide in the third codon of gene G of bacteriophage φX174 DNA. Two others are derived from plus strand sequences carrying the modification in the 12th codon of the human Ha-ras protooncogene. During this work several potentially serious side reactions, which could complicate interpretation of mutagenesis data, were observed. This paper describes a detailed study of these reactions. Since we were unable to avoid undesirable side products, we developed simple chromatographic methods for detecting and removing them.

7.
Non-coding RNAs (crRNAs) produced from clustered regularly interspaced short palindromic repeats (CRISPR) loci and CRISPR-associated (Cas) proteins of the prokaryotic CRISPR-Cas systems form complexes that interfere with the spread of transmissible genetic elements through Cas-catalysed cleavage of foreign genetic material matching the guide crRNA sequences. The easily programmable targeting of nucleic acids enabled by these ribonucleoproteins has facilitated the implementation of CRISPR-based molecular biology tools for in vivo and in vitro modification of DNA and RNA targets. Despite the diversity of DNA-targeting Cas nucleases so far identified, native and engineered derivatives of the Streptococcus pyogenes SpCas9 are the most widely used for genome engineering, at least in part due to their catalytic robustness and the requirement of an exceptionally short motif (5′-NGG-3′ PAM) flanking the target sequence. However, the large size of the SpCas9 variants impairs the delivery of the tool to eukaryotic cells and smaller alternatives are desirable. Here, we identify in a metagenome a new CRISPR-Cas9 system associated with a smaller Cas9 protein (EHCas9) that targets DNA sequences flanked by 5′-NGG-3′ PAMs. We develop a simplified EHCas9 tool that specifically cleaves DNA targets and is functional for genome editing applications in prokaryotes and eukaryotic cells.

8.
The nucleotide sequence of the cellulase gene celC, encoding endoglucanase C of Clostridium thermocellum, has been determined. The coding region of 1032 bp was identified by comparison with the N-terminal amino acid (aa) sequence of endoglucanase C purified from Escherichia coli. The ATG start codon is preceded by an AGGAGG sequence typical of ribosome-binding sites in Gram-positive bacteria. The derived amino acid sequence corresponds to a protein of Mr 40,439. Amino acid analysis and apparent Mr of endoglucanase C are consistent with the amino acid sequence as derived from the DNA sequencing data. A proposed N-terminal 21-aa residue leader (signal) sequence differs from other prokaryotic signal peptides and is non-functional in E. coli. Most of the protein bears no resemblance to the endoglucanases A, B, and D of the same organism. However, a short region of homology between endoglucanases A and C was identified, which is similar to the established active sites of lysozymes and to related sequences of fungal cellulases.

9.

Background

Metagenomics is a cultivation-independent approach that enables the study of the genomic composition of microbes present in an environment. Metagenomic samples are routinely sequenced using next-generation sequencing technologies that generate short nucleotide reads. Proteins identified from these reads are mostly of partial length. On the other hand, de novo assembly of a large metagenomic dataset is computationally demanding and the assembled contigs are often fragmented, resulting in the identification of protein sequences that are also of partial length and incomplete. Annotation of an incomplete protein sequence often proceeds by identifying its homologs in a database of reference sequences. Identifying the homologs of incomplete sequences is a challenge and can result in substandard annotation of proteins from metagenomic datasets. To address this problem, we recently developed a homology detection algorithm named GRASP (Guided Reference-based Assembly of Short Peptides) that identifies the homologs of a given reference protein sequence in a database of short peptide metagenomic sequences. GRASP was developed to implement a simultaneous alignment and assembly algorithm for annotation of short peptides identified on metagenomic reads. The program achieves significantly improved recall rate at the cost of computational efficiency. In this article, we adopted three techniques to speed up the original version of GRASP, including the pre-construction of extension links, local assembly of individual seeds, and the implementation of query-level parallelism.

Results

The resulting new program, GRASPx, achieves >30X speedup compared to its predecessor GRASP. At the same time, we show that the performance of GRASPx is consistent with that of GRASP, and that both of them significantly outperform other popular homology-search tools including the BLAST and FASTA suites. GRASPx was also applied to a human saliva metagenome dataset and shows superior performance for both recall and precision rates.

Conclusions

In this article we present GRASPx, a fast and accurate homology-search program implementing a simultaneous alignment and assembly framework. GRASPx can be used for more comprehensive and accurate annotation of short peptides. GRASPx is freely available at http://graspx.sourceforge.net/.

10.
Sequencing by hybridization is a method for reconstructing a DNA sequence based on its k-mer content. This content, called the spectrum of the sequence, can be obtained from hybridization with a universal DNA chip. However, even with a sequencing chip containing all $4^9$ 9-mers and assuming no hybridization errors, only about 400-bases-long sequences can be reconstructed unambiguously. Drmanac et al. (1989) suggested sequencing long DNA targets by obtaining spectra of many short overlapping fragments of the target, inferring their relative positions along the target, and then computing spectra of subfragments that are short enough to be uniquely recoverable. Drmanac et al. do not treat the realistic case of errors in the hybridization process. In this paper, we study the effect of such errors. We show that the probability of ambiguous reconstruction in the presence of (false negative) errors is close to the probability in the errorless case. More precisely, the ratio between these probabilities is $1 + O\left(\frac{p}{(1-p)^4} \cdot \frac{1}{d}\right)$, where d is the average length of subfragments, and p is the probability of a false negative. We also obtain lower and upper bounds for the probability of unambiguous reconstruction based on an errorless spectrum. For realistic chip sizes, these bounds are tighter than those given by Arratia et al. (1996). Finally, we report results on simulations with real DNA sequences, showing that even in the presence of 50% false negative errors, a target of cosmid length can be recovered with less than 0.1% miscalled bases.
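
The notions of a spectrum and of false negative errors used above can be made concrete with a short Python sketch. It computes the errorless k-mer spectrum of a toy sequence and then drops each k-mer independently with probability p. The independent-drop error model and the names `spectrum` and `with_false_negatives` are assumptions made for illustration; the paper's reconstruction algorithm is not reproduced here.

```python
import random


def spectrum(seq: str, k: int) -> set[str]:
    """Return the set of all k-mers occurring in seq (the errorless spectrum)."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def with_false_negatives(spec: set[str], p: float, rng: random.Random) -> set[str]:
    """Drop each k-mer independently with probability p.

    Assumed error model: every probe fails to hybridize with the same
    probability, independently of the others.
    """
    # Iterate in sorted order so the result is reproducible for a given seed.
    return {kmer for kmer in sorted(spec) if rng.random() >= p}


chip_size = 4 ** 9  # number of distinct 9-mers on a universal chip
full = spectrum("ACGTACG", 3)
noisy = with_false_negatives(full, 0.5, random.Random(0))
```

Because a spectrum is a set, the repeated 3-mer ACG in the toy sequence is recorded only once, which is one source of the ambiguity in reconstruction that the abstract discusses.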

11.
Lowary and Widom selected from random sequences those which form exceptionally stable nucleosomes, including clone 601, the current champion of strong nucleosome (SN) sequences. This unique sequence database (LW sequences) carries sequence elements which confer stability on the nucleosomes formed on the sequences and thus may serve as a source of information on the structure of the “ideal” or close to ideal nucleosome DNA sequence. An important clue is also provided by the crystallographic study of Vasudevan and coauthors on clone 601 nucleosomes. It demonstrated that YR·YR dinucleotide stacks (primarily TA·TA) follow one another at distances of 10 or 11 bases or multiples thereof, such that they are all located on the interface between DNA and the histone octamer. Combining this important information with alignment of the YR-containing 10-mers and 11-mers from LW sequences, the bendability matrices of the stable nucleosome DNA are derived. The matrices suggest that the periodically repeated TA (YR), RR, and YY dinucleotides are the main sequence features of the SNs. This consensus coincides with the one for recently discovered SNs with visibly periodic DNA sequences. Thus, the experimentally observed stable LW nucleosomes and SNs derived computationally appear to represent the same entity: exceptionally stable SNs.

12.

Background

In silico, secretome proteins can be predicted from completely sequenced genomes using various available algorithms that identify membrane-targeting sequences. For the metasecretome (the collection of surface, secreted and transmembrane proteins from environmental microbial communities) this approach is impractical, considering that the metasecretome open reading frames (ORFs) comprise only 10% to 30% of the total metagenome and are poorly represented in the dataset due to the overall low coverage of the metagenomic gene pool, even in large-scale projects.

Results

By combining secretome-selective phage display and next-generation sequencing, we focused the sequence analysis of a complex rumen microbial community on the metasecretome component of the metagenome. This approach achieved high enrichment (29-fold) of secreted fibrolytic enzymes from the plant-adherent microbial community of the bovine rumen. In particular, we identified hundreds of heretofore rare modules belonging to cellulosomes, cell-surface complexes specialised for recognition and degradation of the plant fibre.

Conclusions

As a method, metasecretome phage display combined with next-generation sequencing has the power to sample the diversity of low-abundance surface and secreted proteins that would otherwise require exceptionally large metagenomic sequencing projects. As a resource, the metasecretome display library, backed by the dataset obtained by next-generation sequencing, is ready for i) affinity selection by standard phage display methodology and ii) easy purification of displayed proteins as part of the virion for individual functional analysis.

Electronic supplementary material

The online version of this article (doi:10.1186/1471-2164-15-356) contains supplementary material, which is available to authorized users.

13.
14.
The nucleotide sequence of the protective antigen (PA) gene from Bacillus anthracis and the 5' and 3' flanking sequences were determined. PA is one of three proteins comprising anthrax toxin, and its nucleotide sequence is the first to be reported from B. anthracis. The open reading frame (ORF) is 2319 bp long, of which 2205 bp encode the 735 amino acids of the secreted protein. This region is preceded by 29 codons, which appear to encode a signal peptide having characteristics in common with those of other secreted proteins. A consensus TATAAT sequence was located at the putative -10 promoter site. A Shine-Dalgarno site similar to that found in genes of other Bacillus sp. was located 7 bp upstream from the ATG start codon. The codon usage for the PA gene reflected its high A + T (69%) base composition and differed from those of genes for bacterial proteins from most other sequences examined. The TAA translation stop codon was followed by an inverted repeat forming a potential termination signal. In addition, a 192-codon ORF of unknown significance, theoretically encoding a 21.6-kDa protein, preceded the 5' end of the PA gene.

15.
Using the M13 dideoxy sequencing technique, we have established the DNA sequences of colicins E2 and E3 which encompass the receptor-binding and the catalytic domains of each of the nucleases, and their immunity (imm) genes. The imm gene of plasmid ColE2-P9 is 255 bp long and is separated from the end of the col gene by a dinucleotide. This gene pair is arranged similarly in plasmid ColE3-CA38 except that the intergenic space is 9 bp and the E3 imm gene is one codon shorter than its E2 counterpart. Comparisons of the E2 and E3 imm sequences indicate considerable divergence whereas the receptor-binding domains of both colicins are highly conserved. The two nuclease domains appear to share some sequence homology. A possible evolutionary relationship between colicin E3 and other microbial extracellular ribonucleases is also suggested from the sequence alignment analysis.

16.

Background

Ribosomal 16S DNA sequences are an essential tool for identifying and classifying microbes. High-throughput DNA sequencing now makes it economically possible to produce very large datasets of 16S rDNA sequences in short time periods, necessitating new computer tools for analyses. Here we describe FastGroup, a Java program designed to dereplicate libraries of 16S rDNA sequences. By dereplication we mean to: 1) compare all the sequences in a data set to each other, 2) group similar sequences together, and 3) output a representative sequence from each group. In this way, duplicate sequences are removed from a library.

Results

FastGroup was tested using a library of single-pass, bacterial 16S rDNA sequences cloned from coral-associated bacteria. We found that the optimal strategy for dereplicating these sequences was to: 1) trim ambiguous bases from the 5' end of the sequences and all sequence 3' of the conserved Bact517 site, 2) match the sequences from the 3' end, and 3) group sequences ≥97% identical to each other.

Conclusions

The FastGroup program simplifies the dereplication of 16S rDNA sequence libraries and prepares the raw sequences for subsequent analyses.
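
The three dereplication steps described in this entry (compare, group, output a representative) can be sketched in Python as follows. This is not FastGroup's implementation, which is a Java program whose internals are not given in the abstract. The ungapped comparison anchored at the 3' end, the greedy first-come grouping, and the function names are all assumptions made for illustration.

```python
def identity_from_3prime(a: str, b: str) -> float:
    """Fraction of matching bases when a and b are aligned at their 3' ends.

    Assumption: an ungapped comparison over the shorter sequence's length.
    """
    n = min(len(a), len(b))
    if n == 0:
        return 0.0
    # zip stops at the shorter sequence, so exactly n positions are compared.
    matches = sum(x == y for x, y in zip(reversed(a), reversed(b)))
    return matches / n


def dereplicate(seqs: list[str], threshold: float = 0.97) -> list[str]:
    """Greedy grouping: keep a sequence only if no earlier representative
    is at least `threshold` identical to it."""
    representatives: list[str] = []
    for s in seqs:
        if not any(identity_from_3prime(s, r) >= threshold for r in representatives):
            representatives.append(s)
    return representatives


a = "ACGT" * 25                    # 100-nt toy sequence
b = a[:50] + "T" + a[51:]          # one substitution, 99% identical to a
c = "TTTT" * 25                    # unrelated toy sequence
reps = dereplicate([a, b, c])
```

With the default threshold of 0.97, the sequence with a single substitution joins the group of the first sequence, so only two representatives remain.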

17.
Base composition, codon usage and amino acid usage were analyzed in 529 orthologous sequences of Aquifex aeolicus and Bacillus subtilis, two bacteria with different optimal growth temperatures. The two bacteria do not differ significantly in overall GC composition, but their GC(1+2) and GC3 levels vary significantly. Purine content and GC3 composition are significantly higher in the coding sequences of Aquifex aeolicus than in their Bacillus subtilis counterparts. Correspondence analyses of codon and amino acid usage reveal that variation in base composition influences codon and amino acid usage. Two selection pressures acting at the nucleotide level (GC3 and purine enrichment) cause amino acid usage to vary differently in different protein secondary structures. Our results suggest that adaptation of amino acid usage in the coil structures of Aquifex aeolicus proteins is under the control of both purine increment and GC3 composition, whereas adaptation of the amino acids in the helical regions of thermophilic bacteria is strongly influenced by purine content. Evolutionary perspectives on the temperature adaptation of the DNA and protein molecules of these two bacteria are discussed on the basis of these results.
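
The GC(1+2) and GC3 quantities compared in this abstract are the GC fractions at the first two codon positions and at the third codon position, respectively. A minimal Python sketch of how they could be computed for one coding sequence follows. The function name and the decision to ignore trailing bases that do not complete a codon are assumptions of this sketch.

```python
def gc_by_codon_position(cds: str) -> tuple[float, float]:
    """Return (GC(1+2), GC3) for an in-frame coding sequence.

    Trailing bases that do not form a complete codon are ignored
    (an assumption of this sketch).
    """
    cds = cds.upper()
    usable = len(cds) - len(cds) % 3
    n_codons = usable // 3
    if n_codons == 0:
        raise ValueError("sequence contains no complete codon")
    gc = [0, 0, 0]  # G/C counts at codon positions 1, 2 and 3
    for i in range(usable):
        if cds[i] in "GC":
            gc[i % 3] += 1
    gc12 = (gc[0] + gc[1]) / (2 * n_codons)
    gc3 = gc[2] / n_codons
    return gc12, gc3


# Toy sequence: ATG GCC AAG
gc12, gc3 = gc_by_codon_position("ATGGCCAAG")
```

In the toy sequence every third position is G or C, so GC3 is 1.0, while only two of the six first and second positions are G or C.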

18.
Annotation of protein functions plays an important role in understanding life at the molecular level. High‐throughput sequencing produces massive numbers of raw protein sequences, and only about 1% of them have been manually annotated with functions. Experimental annotations of functions are expensive, time‐consuming and do not keep up with the rapid growth of the sequence numbers. This motivates the development of computational approaches that predict protein functions. A novel deep learning framework, DeepFunc, is proposed which accurately predicts protein functions from protein sequence‐ and network‐derived information. More precisely, DeepFunc uses a long and sparse binary vector to encode information concerning domains, families, and motifs collected from the InterPro tool that is associated with the input protein sequence. This vector is processed with two neural layers to obtain a low‐dimensional vector which is combined with topological information extracted from protein–protein interactions (PPIs) and functional linkages. The combined information is processed by a deep neural network that predicts protein functions. DeepFunc is empirically and comparatively tested on a benchmark testing dataset and the Critical Assessment of protein Function Annotation algorithms (CAFA) 3 dataset. The experimental results demonstrate that DeepFunc outperforms current methods on the testing dataset and that it secures the highest Fmax = 0.54 and AUC = 0.94 on the CAFA3 dataset.

19.
20.
High-quality data about protein structures and their gene sequences are essential to the understanding of the relationship between protein folding and protein coding sequences. We first constructed the EcoPDB database, a high-quality database of Escherichia coli genes and their corresponding PDB structures. Based on EcoPDB, we present a novel information-theoretic approach to investigate the correlation between cysteine synonymous codon usages and local amino acids flanking cysteines, the correlation between cysteine synonymous codon usages and synonymous codon usages of local amino acids flanking cysteines, and the correlation between cysteine synonymous codon usages and the disulfide bonding states of cysteines in the E. coli genome. The results indicate that the nearest neighboring residues and their synonymous codons of the C-terminus have the greatest influence on the usages of the synonymous codons of cysteines, and that the usage of the synonymous codons has a specific correlation with the disulfide bond formation of cysteines in proteins. The correlations may result from the regulation mechanism of protein structures at the gene sequence level and reflect the biological function restriction that cysteines pair to form disulfide bonds. The results may also be helpful in identifying residues that are important for synonymous codon selection of cysteines to introduce disulfide bridges in protein engineering and molecular biology. The approach presented in this paper can also be used as a complementary computational method and applied to analyse synonymous codon usages in other model organisms.
