Similar articles
 20 similar articles found (search time: 62 ms)
1.
Based on the well-known k-mer model, we propose a k-mer natural vector model for representing a genetic sequence based on the numbers and distributions of k-mers in the sequence. We show that there exists a one-to-one correspondence between a genetic sequence and its associated k-mer natural vector. The k-mer natural vector method can be used easily and quickly to perform phylogenetic analysis of genetic sequences without requiring evolutionary models or human intervention. Whole or partial genomes can be handled more effectively with our proposed method. We apply it to the phylogenetic analysis of genetic sequences, and the results obtained demonstrate that the k-mer natural vector method is a very powerful tool for analysing and annotating genetic sequences and determining evolutionary relationships, in terms of both accuracy and efficiency.

2.
Oligomers of length k, or k-mers, are convenient and widely used features for modeling the properties and functions of DNA and protein sequences. However, k-mers suffer from the inherent limitation that if the parameter k is increased to resolve longer features, the probability of observing any specific k-mer becomes very small, and k-mer counts approach a binary variable, with most k-mers absent and a few present once. Thus, any statistical learning approach using k-mers as features becomes susceptible to noisy training set k-mer frequencies once k becomes large. To address this problem, we introduce alternative feature sets using gapped k-mers, a new classifier, gkm-SVM, and a general method for robust estimation of k-mer frequencies. To make the method applicable to large-scale genome-wide applications, we develop an efficient tree data structure for computing the kernel matrix. We show that compared to our original kmer-SVM and alternative approaches, our gkm-SVM predicts functional genomic regulatory elements and tissue-specific enhancers with significantly improved accuracy, increasing the precision by up to a factor of two. We then show that gkm-SVM consistently outperforms kmer-SVM on human ENCODE ChIP-seq datasets, and further demonstrate the general utility of our method using a Naïve-Bayes classifier. Although developed for regulatory sequence analysis, these methods can be applied to any sequence classification problem.
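To make the idea of gapped k-mer features concrete, here is a minimal sketch, not the gkm-SVM implementation. It assumes a gapped k-mer is a window of length l in which only k positions are kept and the rest are masked with "-"; the function name gapped_kmer_counts and the parameter names l and k are illustrative. The brute-force enumeration stands in for the tree data structure mentioned in the abstract, which is not reproduced here.

```python
from collections import Counter
from itertools import combinations


def gapped_kmer_counts(seq, l, k):
    """Count gapped k-mers: length-l windows with k kept positions, others masked as '-'.

    Hypothetical helper for illustration; it enumerates every choice of kept
    positions directly instead of using an efficient tree structure.
    """
    counts = Counter()
    position_sets = [set(kept) for kept in combinations(range(l), k)]
    for start in range(len(seq) - l + 1):
        window = seq[start:start + l]
        for kept in position_sets:
            pattern = "".join(ch if i in kept else "-" for i, ch in enumerate(window))
            counts[pattern] += 1
    return counts
```

Because each window contributes to several masked patterns, a single mismatch between two sequences still leaves many shared features, which is one way to see why the counts are less sparse than those of contiguous k-mers.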

3.
Several proteins and genes are members of families that share a common evolutionary history. To outline evolutionary relationships and recognize conserved patterns, sequence comparison has emerged as a key process. The current work critically investigates the role of k-mers in the composition vector method for comparing genome sequences. Composition vector methods are generally applied with different choices of k; for some values of k the results are satisfactory, but for others they are not. The proposed work applies the standard composition vector method with 3-mers, together with a special information-based similarity index as the distance measure. It establishes that the combination of 3-mers and the information-based similarity index gives satisfactory results in all cases, especially for the comparison of whole genome sequences. These choices provide a sort of unified approach to the comparison of genome sequences.

4.

Background

NGS data contains many machine-induced errors. The most advanced methods for the error correction heavily depend on the selection of solid k-mers. A solid k-mer is a k-mer frequently occurring in NGS reads. The other k-mers are called weak k-mers. A solid k-mer does not likely contain errors, while a weak k-mer most likely contains errors. An intensively investigated problem is to find a good frequency cutoff f0 to balance the numbers of solid and weak k-mers. Once the cutoff is determined, a more challenging but less-studied problem is to: (i) remove a small subset of solid k-mers that are likely to contain errors, and (ii) add a small subset of weak k-mers, that are likely to contain no errors, into the remaining set of solid k-mers. Identification of these two subsets of k-mers can improve the correction performance.

Results

We propose to use a Gamma distribution to model the frequencies of erroneous k-mers and a mixture of Gaussian distributions to model correct k-mers, and combine them to determine f0. To identify the two special subsets of k-mers, we use the z-score of k-mers, which measures the number of standard deviations a k-mer’s frequency is from the mean. These statistically solid k-mers are then used to construct a Bloom filter for error correction. Tested on both real and synthetic NGS data sets, our method is markedly superior to the state-of-the-art methods.

Conclusion

The z-score is adequate to distinguish solid k-mers from weak k-mers, and is particularly useful for pinpointing solid k-mers that have very low frequency. Applying the z-score to k-mers can markedly improve error correction accuracy.
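The z-score described in the abstract above can be illustrated with a small sketch. The abstract does not say which mean and standard deviation the score is taken against, so this example assumes the global mean and population standard deviation over all observed k-mer counts; that choice, and the names kmer_counts and kmer_zscores, are assumptions made for illustration only.

```python
from collections import Counter
from statistics import mean, pstdev


def kmer_counts(reads, k):
    """Count every k-mer occurring in a collection of reads."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


def kmer_zscores(counts):
    """Return (count - mean) / std for each k-mer.

    Assumption: the reference distribution is the set of all observed
    k-mer counts; the published method may use a different reference.
    """
    values = list(counts.values())
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        return {kmer: 0.0 for kmer in counts}
    return {kmer: (c - mu) / sigma for kmer, c in counts.items()}
```

A k-mer with a strongly negative score occurs far less often than is typical, which makes it a candidate weak k-mer under this simplified reading.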

5.
Zhao Liang, Xie Jin, Bai Lin, Chen Wen, Wang Mingju, Zhang Zhonglei, Wang Yiqi, Zhao Zhe, Li Jinyan 《BMC genomics》2018,19(10):1-10
Background

NGS data contains many machine-induced errors. The most advanced methods for the error correction heavily depend on the selection of solid k-mers. A solid k-mer is a k-mer frequently occurring in NGS reads. The other k-mers are called weak k-mers. A solid k-mer does not likely contain errors, while a weak k-mer most likely contains errors. An intensively investigated problem is to find a good frequency cutoff f0 to balance the numbers of solid and weak k-mers. Once the cutoff is determined, a more challenging but less-studied problem is to: (i) remove a small subset of solid k-mers that are likely to contain errors, and (ii) add a small subset of weak k-mers, that are likely to contain no errors, into the remaining set of solid k-mers. Identification of these two subsets of k-mers can improve the correction performance.

Results

We propose to use a Gamma distribution to model the frequencies of erroneous k-mers and a mixture of Gaussian distributions to model correct k-mers, and combine them to determine f0. To identify the two special subsets of k-mers, we use the z-score of k-mers, which measures the number of standard deviations a k-mer’s frequency is from the mean. These statistically solid k-mers are then used to construct a Bloom filter for error correction. Tested on both real and synthetic NGS data sets, our method is markedly superior to the state-of-the-art methods.

Conclusion

The z-score is adequate to distinguish solid k-mers from weak k-mers, and is particularly useful for pinpointing solid k-mers that have very low frequency. Applying the z-score to k-mers can markedly improve error correction accuracy.


6.
Methods for discovery of local similarities and estimation of evolutionary distance by identifying k-mers (contiguous subsequences of length k) common to two sequences are described. Given unaligned sequences of length L, these methods have O(L) time complexity. The ability of compressed amino acid alphabets to extend these techniques to distantly related proteins was investigated. The performance of these algorithms was evaluated for different alphabets and choices of k using a test set of 1848 pairs of structurally alignable sequences selected from the FSSP database. Distance measures derived from k-mer counting were found to correlate well with percentage identity derived from sequence alignments. Compressed alphabets were seen to improve performance in local similarity discovery, but no evidence was found of improvements when applied to distance estimates. The performance of our local similarity discovery method was compared with the fast Fourier transform (FFT) used in MAFFT, which has O(L log L) time complexity. A method that achieves coverage comparable to FFT is described here and is more than an order of magnitude faster. We suggest using k-mer distance for fast, approximate phylogenetic tree construction, and show that a speed improvement of more than three orders of magnitude can be achieved relative to standard distance methods, which require alignments.
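As a rough illustration of an alignment-free k-mer distance, the sketch below counts the k-mers two sequences share and normalizes by the number of k-mers in the shorter sequence. This normalization is one common definition and is an assumption here, since the abstract does not give the formula; the function names are illustrative. Hash-based counting keeps the work linear in sequence length on average.

```python
from collections import Counter


def kmer_profile(seq, k):
    """Multiset of k-mers in seq, built in a single pass."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def kmer_similarity(x, y, k):
    """Fraction of k-mers shared by x and y (assumed normalization)."""
    denom = min(len(x), len(y)) - k + 1
    if denom <= 0:
        raise ValueError("both sequences must be at least k long")
    # Counter intersection keeps the minimum count of each shared k-mer.
    shared = sum((kmer_profile(x, k) & kmer_profile(y, k)).values())
    return shared / denom


def kmer_distance(x, y, k):
    """Distance in [0, 1]: 0 for identical k-mer content, 1 for none shared."""
    return 1.0 - kmer_similarity(x, y, k)
```

A compressed alphabet, as investigated in the abstract, could be applied by mapping each residue to its class before calling kmer_profile.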

7.
The taxonomic analysis of sequencing data has become important in many areas of life sciences. However, currently available tools for that purpose either consume large amounts of RAM or yield insufficient quality and robustness. Here, we present kASA, a k-mer based tool capable of identifying and profiling metagenomic DNA or protein sequences with high computational efficiency and a user-definable memory footprint. We ensure both high sensitivity and precision by using an amino acid-like encoding of k-mers together with a range of multiple k’s. Custom algorithms and data structures optimized for external memory storage enable a full-scale taxonomic analysis without compromise on laptop, desktop, and HPCC.

8.

Background

A basic task in bioinformatics is the counting of k-mers in genome sequences. Existing k-mer counting tools are most often optimized for small k < 32 and suffer from excessive memory resource consumption or degrading performance for large k. However, given the technology trend towards long reads of next-generation sequencers, support for large k becomes increasingly important.

Results

We present the open source k-mer counting software Gerbil that has been designed for the efficient counting of k-mers for k ≥ 32. Our software is the result of an intensive process of algorithm engineering. It implements a two-step approach. In the first step, genome reads are loaded from disk and redistributed to temporary files. In a second step, the k-mers of each temporary file are counted via a hash table approach. In addition to its basic functionality, Gerbil can optionally use GPUs to accelerate the counting step. In a set of experiments with real-world genome data sets, we show that Gerbil is able to efficiently support both small and large k.

Conclusions

While Gerbil’s performance is comparable to existing state-of-the-art open source k-mer counting tools for small k < 32, it vastly outperforms its competitors for large k, thereby enabling new applications which require large values of k.
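The two-step approach described in the Results above (redistribute first, then count each part with a hash table) can be sketched as follows. This is an in-memory illustration only, not Gerbil's implementation: Python lists stand in for the temporary files, zlib.crc32 is an arbitrary choice of partitioning hash, and the function names are hypothetical.

```python
import zlib
from collections import Counter


def partition_kmers(reads, k, num_buckets):
    """Step 1: distribute k-mers into buckets (stand-ins for temporary files)."""
    buckets = [[] for _ in range(num_buckets)]
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            index = zlib.crc32(kmer.encode("ascii")) % num_buckets
            buckets[index].append(kmer)
    return buckets


def count_buckets(buckets):
    """Step 2: count each bucket independently with a hash table, then merge."""
    total = Counter()
    for bucket in buckets:
        total.update(Counter(bucket))
    return total
```

Because identical k-mers always hash to the same bucket, each bucket can be counted on its own, so only one bucket's hash table needs to be held in memory at a time.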

9.

Background

A study was undertaken to resolve preliminary conflicting results on the proliferation of leukemia cells observed with different c-myc antisense oligonucleotides.

Results

RNase H-active, chimeric methylphosphonodiester / phosphodiester antisense oligodeoxynucleotides targeting bases 1147–1166 of c-myc mRNA downregulated c-Myc protein and induced apoptosis and cell cycle arrest respectively in cultures of MOLT-4 and KYO1 human leukemia cells. In contrast, an RNase H-inactive, morpholino antisense oligonucleotide analogue 28-mer, simultaneously targeting the exon 2 splice acceptor site and initiation codon, reduced c-Myc protein to barely detectable levels but did not affect cell proliferation in these or other leukemia lines. The RNase H-active oligodeoxynucleotide 20-mers contained the phosphodiester linked motif CGTTG, which as an apoptosis inducing CpG oligodeoxynucleotide 5-mer of sequence type CGNNN (N = A, G, C, or T) had potent activity against MOLT-4 cells. The 5-mer mimicked the antiproliferative effects of the 20-mer in the absence of any antisense activity against c-myc mRNA, while the latter still reduced expression of c-myc in a subline of MOLT-4 cells that had been selected for resistance to CGTTA, but in this case the oligodeoxynucleotide failed to induce apoptosis or cell cycle arrest.

Conclusions

We conclude that the biological activity of the chimeric c-myc antisense 20-mers resulted from a non-antisense mechanism related to the CGTTG motif contained within the sequence, and not through downregulation of c-myc. Although the oncogene may have been implicated in the etiology of the original leukemias, expression of c-myc is apparently no longer required to sustain continuous cell proliferation in these culture lines.

10.
The 245 bp chromosomal origin, oriC, of Escherichia coli contains two iterated motifs. Three 13-mers tandemly repeated at one end of the origin and four 9-mers in a nearby segment of oriC are highly conserved in enteric bacteria, as is the distance separating these two sequence clusters. Mutant origins were constructed with altered spacing of the 9-mers relative to the 13-mers. Loss or addition of even a single base drastically reduced replication, both in vivo and in vitro. Spacing mutant origins bound effectively to DnaA protein but failed to support efficient open complex formation. These results suggest that interaction with the 9-mers positions at least one subunit of DnaA to recognize directly the nearest 13-mer for DNA melting.

11.
To produce a large quantity of the angiotensin-converting-enzyme(ACE)-inhibiting peptide YG-1, which consists of ten amino acids derived from yeast glyceraldehyde-3-phosphate dehydrogenase, a high-level expression was explored with tandem multimers of the YG-1 gene in Escherichia coli. The genes encoding YG-1 were tandemly multimerized to 9-mers, 18-mers and 27-mers, in which each of the repeating units in the tandem multimers was connected to the neighboring genes by a DNA linker encoding Pro-Gly-Arg for the cleavage of multimers by clostripain. The multimers were cloned into the expression vector pET-21b, and expressed in E. coli BL21(DE3) with isopropyl β-d-thiogalactopyranoside induction. The expressed multimeric peptides encoded by the 9-mer, 18-mer and 27-mer accumulated intracellularly as inclusion bodies and comprised about 67%, 25% and 15% of the total proteins in E. coli respectively. The multimeric peptides expressed as inclusion bodies were cleaved with clostripain, and active monomers were purified to homogeneity by reversed-phase high-performance liquid chromatography. In total, 105 mg pure recombinant YG-1 was obtained from 1 l E. coli culture harboring pETYG9, which contained the 9-mer of the YG-1 gene. The recombinant YG-1 was identical to the natural YG-1 in molecular mass, amino acid sequence and ACE-inhibiting activity. Received: 6 January 1998 / Received revision: 23 February 1998 / Accepted: 24 February 1998

12.
We propose a tetrahedral Gray code that facilitates visualization of genome information on the surfaces of a tetrahedron, where the relative abundance of each k-mer in the genomic sequence is represented by a color of the corresponding cell of a triangular lattice. For biological significance, the code is designed such that the k-mers corresponding to any adjacent pair of cells differ from each other by only one nucleotide. We present a simple procedure to draw such a pattern on the development surfaces of a tetrahedron. The thus constructed tetrahedral Gray code can demonstrate evolutionary conservation and variation of the genome information of many organisms at a glance. We also apply the tetrahedral Gray code to the honey bee (Apis mellifera) genome to analyze its methylation structure. The results indicate that the honey bee genome exhibits CpG overrepresentation in spite of its methylation ability and that two conserved motifs, CTCGAG and CGCGCG, in the unmethylated regions are responsible for the overrepresentation of CpG.

13.

Background  

The empirical frequencies of DNA k-mers in whole genome sequences provide an interesting perspective on genomic complexity, and the availability of large segments of genomic sequence from many organisms means that analysis of k-mers with non-trivial lengths is now possible.

14.
Girotto Samuele, Comin Matteo, Pizzi Cinzia 《BMC bioinformatics》2018,19(15):441-38

Background

Spaced-seeds, i.e. patterns in which some fixed positions are allowed to be wild-cards, play a crucial role in several bioinformatics applications involving substrings counting and indexing, by often providing better sensitivity with respect to k-mers based approaches. K-mers based approaches are usually fast, being based on efficient hashing and indexing that exploits the large overlap between consecutive k-mers. Spaced-seeds hashing is not as straightforward, and it is usually computed from scratch for each position in the input sequence. Recently, the FSH (Fast Spaced seed Hashing) approach was proposed to improve the time required for computation of the spaced seed hashing of DNA sequences with a speed-up of about 1.5 with respect to standard hashing computation.

Results

In this work we propose a novel algorithm, Fast Indexing for Spaced seed Hashing (FISH), based on the indexing of small blocks that can be combined to obtain the hashing of spaced seeds of any length. The method exploits the fast computation of the hashing of runs of consecutive 1s in the spaced seeds, which essentially correspond to k-mers whose length equals that of the run.

Conclusions

We ran several experiments, on NGS data from simulated and synthetic metagenomic experiments, to assess the time required to compute the hashing for each position in each read with respect to several spaced seeds. In our experiments, FISH computes the hashing values of spaced seeds with a speedup over the traditional approach of between 1.9x and 6.03x, depending on the structure of the spaced seeds.
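For orientation, the baseline that the Background above calls computing the hash "from scratch for each position" can be sketched as follows. This is not FISH or FSH. It assumes a seed written as a string of 1s (positions that count) and 0s (wild-cards) and a 2-bit encoding of A, C, G and T; both conventions and the function name are illustrative.

```python
# Assumed 2-bit nucleotide encoding.
ENCODING = {"A": 0, "C": 1, "G": 2, "T": 3}


def spaced_seed_hashes(seq, seed):
    """Hash every window of seq under a spaced seed such as '1101'.

    Each window is hashed from scratch by concatenating the 2-bit codes of
    the positions marked '1'; positions marked '0' are ignored.
    """
    care = [i for i, flag in enumerate(seed) if flag == "1"]
    hashes = []
    for start in range(len(seq) - len(seed) + 1):
        h = 0
        for offset in care:
            h = (h << 2) | ENCODING[seq[start + offset]]
        hashes.append(h)
    return hashes
```

For example, spaced_seed_hashes("ACGT", "101") returns [2, 7]: the window ACG keeps A and G, and the window CGT keeps C and T. With a seed of all 1s the inner loop reduces to ordinary k-mer hashing, where consecutive windows overlap heavily and a rolling update is possible; that overlap is what the spaced case loses.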

15.
《Genomics》2020,112(3):2233-2240
MicroRNA-like small RNAs (milRNAs), 21–22 nucleotides in length, are a type of small non-coding RNA first found in Neurospora crassa in 2010. Identifying milRNAs in species without genomic information is a difficult problem. Here, knowledge-based energy features are developed to identify milRNAs by incorporating the k-mer scheme and a distance-dependent pair potential. Compared with the k-mer scheme, the features developed here alleviate the inherent curse of dimensionality that arises in the k-mer scheme once k becomes large. In addition, milRNApredictor, built on the novel features, performs comparably to the k-mer scheme and achieves a sensitivity of 74.21% and a specificity of 75.72% in 10-fold cross-validation. Furthermore, for novel miRNA prediction, the results of milRNApredictor overlap strongly with those of the state-of-the-art mirnovo. However, milRNApredictor is simpler to use, with reduced requirements for input data and dependencies. Taken together, milRNApredictor can be used for de novo identification of fungal milRNAs and other very short small RNAs of non-model organisms.

16.
Lighter is a fast, memory-efficient tool for correcting sequencing errors. Lighter avoids counting k-mers. Instead, it uses a pair of Bloom filters, one holding a sample of the input k-mers and the other holding k-mers likely to be correct. As long as the sampling fraction is adjusted in inverse proportion to the depth of sequencing, Bloom filter size can be held constant while maintaining near-constant accuracy. Lighter is parallelized, uses no secondary storage, and is both faster and more memory-efficient than competing approaches while achieving comparable accuracy.

Electronic supplementary material

The online version of this article (doi:10.1186/s13059-014-0509-9) contains supplementary material, which is available to authorized users.
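Since the Lighter abstract relies on Bloom filters rather than exact k-mer counts, a minimal Bloom filter sketch may help. This is a generic illustration, not Lighter's code: the class name, the use of salted hashlib.blake2b digests as the hash family, and the parameter names are all assumptions.

```python
import hashlib


class BloomFilter:
    """Fixed-size Bloom filter for strings such as k-mers (illustrative)."""

    def __init__(self, num_bits, num_hashes):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, item):
        # One salted BLAKE2b digest per hash function (assumed hash family).
        data = item.encode("ascii")
        for seed in range(self.num_hashes):
            salt = seed.to_bytes(8, "little")
            digest = hashlib.blake2b(data, digest_size=8, salt=salt).digest()
            yield int.from_bytes(digest, "little") % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        # May report false positives, never false negatives.
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(item))
```

The memory used depends only on num_bits, not on how many k-mers are inserted, which is consistent with the abstract's point that filter size can be held constant when the sampling fraction is lowered as sequencing depth grows.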

17.
The minimal replication origin of the broad-host-range plasmid RK2, oriV, contains five iterons which are binding sites for the plasmid-encoded replication initiation protein TrfA, four DnaA boxes, which bind the host DnaA protein, and an AT-rich region containing four 13-mer sequences. In this study, 26 mutants with altered sequence and/or spacing of 13-mer motifs have been constructed and analysed for replication activity in vivo and in vitro. The data show that the replacement of oriV 13-mers by similar but not identical 13-mer sequences from Escherichia coli oriC inactivates the origin. In addition, interchanging the positions of the oriV 13-mers results in greatly reduced activity. Mutants with T/A substitutions are also inactive. Furthermore, introduction of single-nucleotide substitutions demonstrates very restricted sequence requirements depending on the 13-mer position. Only two of the mutants are host specific, functional in Pseudomonas aeruginosa but not in E. coli. Our experiments demonstrate considerable complexity in the plasmid AT-rich region architecture required for functionality. It is evident that low internal stability of this region is not the only feature contributing to origin activity. Our studies suggest a requirement for sequence-specific protein interactions within the 13-mers during assembly of replication complexes at the plasmid origin.

18.
In this study, a simple 4^k-dimensional feature representation vector is proposed to reconstruct phylogenetic trees, where k is the length of a word. The vector is composed of elements that characterize the relative difference between a biological sequence and a sequence generated by an independent random process. In addition, the variance of the vector obtained by averaging every column of the feature representation matrix is used to determine an appropriate word length. In our experiments, reliable results can always be generated when the word length is <7, which also appears to have lower computational complexity. Phylogenetic trees of 24 transferrins and 48 Hepatitis E viruses reconstructed at word length 6 are in good agreement with previous studies, which shows that our method is efficient and powerful.
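One possible reading of the feature vector described above is sketched below. The abstract does not give the formula, so this example assumes each element is (observed - expected) / expected, where the observed value is the frequency of a word of length k and the expected value is the product of its single-nucleotide frequencies (the independent random model). The function name and this formula are assumptions, not the authors' definition.

```python
from collections import Counter
from itertools import product


def relative_difference_vector(seq, k, alphabet="ACGT"):
    """One element per word of length k over the alphabet (4^k for DNA).

    Assumed formula: (observed - expected) / expected, with expected taken
    from an independent model built on single-nucleotide frequencies.
    """
    n_words = len(seq) - k + 1
    if n_words <= 0:
        raise ValueError("sequence must be at least k long")
    base_freq = {b: seq.count(b) / len(seq) for b in alphabet}
    word_counts = Counter(seq[i:i + k] for i in range(n_words))
    vector = []
    for word in product(alphabet, repeat=k):
        expected = 1.0
        for base in word:
            expected *= base_freq[base]
        observed = word_counts["".join(word)] / n_words
        # Words containing a nucleotide absent from seq get 0 by convention.
        vector.append((observed - expected) / expected if expected > 0 else 0.0)
    return vector
```

Vectors from different sequences share the same word order, so they can be compared directly with any vector distance before building a tree.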

19.
Understanding the mechanisms that coordinate replication initiation with subsequent segregation of chromosomes is an important biological problem. Here we report two replication-control mechanisms mediated by a chromosome segregation protein, ParB2, encoded by chromosome II of the model multichromosome bacterium, Vibrio cholerae. We find by the ChIP-chip assay that ParB2, a centromere binding protein, spreads beyond the centromere and covers a replication inhibitory site (a 39-mer). Unexpectedly, without nucleation at the centromere, ParB2 could also bind directly to a related 39-mer. The 39-mers are the strongest inhibitors of chromosome II replication and they mediate inhibition by binding the replication initiator protein. ParB2 thus appears to promote replication by out-competing initiator binding to the 39-mers using two mechanisms: spreading into one and direct binding to the other. We suggest that both these are novel mechanisms to coordinate replication initiation with segregation of chromosomes.

20.
The opening of the three tandem 13-mers (iterons) in the replication origin (oriC) of Escherichia coli by DnaA protein, assisted by protein HU or IHF (Hwang, D. S., and Kornberg, A. (1992) J. Biol. Chem. 267, 23083-23086), represents an essential early stage in the initiation of chromosomal replication (Bramhill, D., and Kornberg, A. (1988) Cell 54, 915-918). We now show by mutational alterations of the 13-mer region that oriC function, both in vitro and in vivo, requires AT-richness in the left 13-mer and sequence specificity in the middle and right 13-mers. Interactions of DnaA protein with the middle and right 13-mers are crucial for the opening of the region. Binding of the protein to the top strand of the 13-mers appeared to maintain single-strandedness in the bottom strand. IciA protein, the inhibitor of initiation, binds the three 13-mers and blocks the opening of the region. The degrees of inhibition by IciA protein of 13-mer opening and of oriC plasmid replication observed with mutant forms of the 13-mers could be correlated with the binding affinity of IciA protein. Whereas the binding of IciA protein to the 13-mers did not affect the binding of DnaA protein to its four 9-mer boxes, interaction of DnaA protein with the 13-mers was blocked. The selective interactions of DnaA and IciA proteins with the 13-mer region appear to be components of the on/off switch that controls initiation of E. coli chromosomal replication.
