
1.

K-mer natural vector and its application to the phylogenetic analysis of genetic sequences

Jia Wen Raymond H.F. Chan Shek-Chung Yau Rong L. He Stephen S.T. Yau 《Gene》2014

Based on the well-known k-mer model, we propose a k-mer natural vector model for representing a genetic sequence based on the numbers and distributions of k-mers in the sequence. We show that there exists a one-to-one correspondence between a genetic sequence and its associated k-mer natural vector. The k-mer natural vector method can be easily and quickly used to perform phylogenetic analysis of genetic sequences without requiring evolutionary models or human intervention. Whole or partial genomes can be handled more effectively with our proposed method. It is applied to the phylogenetic analysis of genetic sequences, and the results obtained demonstrate that the k-mer natural vector method is a very powerful tool for analysing and annotating genetic sequences and determining evolutionary relationships, in terms of both accuracy and efficiency.
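To make "numbers and distributions of k-mers" concrete, the following is a minimal sketch rather than the authors' implementation. It records, for each k-mer, its count, its mean start position, and a normalized second central moment of its positions. The abstract does not give the formula, so the normalization used here (dividing by the count times the number of k-mer positions) and the function name `kmer_natural_vector` are assumptions.

```python
from collections import defaultdict


def kmer_natural_vector(seq, k):
    """Return {kmer: (count, mean_position, second_moment)} for one sequence.

    Assumption: the second moment is normalized by count * number of k-mer
    positions; the abstract does not specify the normalization.
    """
    total = len(seq) - k + 1  # number of k-mer start positions
    positions = defaultdict(list)
    for i in range(total):
        positions[seq[i:i + k]].append(i + 1)  # 1-based start positions

    vector = {}
    for kmer, pos in positions.items():
        n = len(pos)
        mu = sum(pos) / n
        d2 = sum((p - mu) ** 2 for p in pos) / (n * total)
        vector[kmer] = (n, mu, d2)
    return vector
```

Two sequences could then be compared by a distance between their vectors (with absent k-mers treated as zeros), which is one way to build a distance matrix for tree construction without an alignment.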

2.

Phylogenetic analysis of protein sequences based on a novel k-mer natural vector method

《Genomics》2019,111(6):1298-1305

Based on the k-mer model for protein sequences, a novel k-mer natural vector method is proposed to characterize the features of k-mers in a protein sequence, in which the numbers and distributions of k-mers are considered. It is proved that the relationship between a protein sequence and its k-mer natural vector is one-to-one. Phylogenetic analysis of protein sequences can therefore be easily performed without requiring evolutionary models or human intervention. In addition, there is no criterion for choosing a suitable k, and k has a great influence on the results obtained as well as on computational complexity. In this paper, a compound k-mer natural vector is used to quantify each protein sequence. The results obtained from phylogenetic analysis of three protein datasets demonstrate that our new method can precisely describe the evolutionary relationships of proteins and greatly improve computing efficiency.

3.

Mining statistically-solid k-mers for accurate NGS error correction

Liang Zhao Jin Xie Lin Bai Wen Chen Mingju Wang Zhonglei Zhang Yiqi Wang Zhe Zhao Jinyan Li 《BMC genomics》2018,19(10):912

Background

NGS data contains many machine-induced errors. The most advanced error correction methods depend heavily on the selection of solid k-mers. A solid k-mer is a k-mer that occurs frequently in NGS reads. The other k-mers are called weak k-mers. A solid k-mer is unlikely to contain errors, while a weak k-mer most likely contains errors. An intensively investigated problem is to find a good frequency cutoff f₀ to balance the numbers of solid and weak k-mers. Once the cutoff is determined, a more challenging but less-studied problem is to: (i) remove a small subset of solid k-mers that are likely to contain errors, and (ii) add a small subset of weak k-mers that are likely to contain no errors to the remaining set of solid k-mers. Identification of these two subsets of k-mers can improve the correction performance.

Results

We propose to use a Gamma distribution to model the frequencies of erroneous k-mers and a mixture of Gaussian distributions to model correct k-mers, and combine them to determine f₀. To identify the two special subsets of k-mers, we use the z-score of k-mers, which measures the number of standard deviations a k-mer’s frequency is from the mean. These statistically-solid k-mers are then used to construct a Bloom filter for error correction. Our method is markedly superior to the state-of-the-art methods, as tested on both real and synthetic NGS data sets.

Conclusion

The z-score is adequate to distinguish solid k-mers from weak k-mers, and is particularly useful for pinpointing solid k-mers that have very low frequency. Applying the z-score to k-mers can markedly improve error correction accuracy.
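The z-score described above can be sketched in a few lines. This is an illustrative example, not the authors' code. The abstract does not say which population the mean and standard deviation are taken over, so this sketch assumes all distinct k-mers observed in the reads; `kmer_counts` and `kmer_zscores` are hypothetical names.

```python
from collections import Counter
from statistics import mean, pstdev


def kmer_counts(reads, k):
    """Count every k-mer occurring in a list of reads."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


def kmer_zscores(counts):
    """Map each k-mer to (frequency - mean) / standard deviation.

    Assumption: mean and deviation are computed over all distinct k-mers.
    """
    freqs = list(counts.values())
    mu = mean(freqs)
    sigma = pstdev(freqs)
    if sigma == 0:
        return {kmer: 0.0 for kmer in counts}
    return {kmer: (c - mu) / sigma for kmer, c in counts.items()}
```

A k-mer just below the frequency cutoff but with a relatively high z-score would be a candidate for promotion to the solid set, and a k-mer just above the cutoff with a low z-score a candidate for removal.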


4.

Mining statistically-solid k-mers for accurate NGS error correction

Zhao Liang Xie Jin Bai Lin Chen Wen Wang Mingju Zhang Zhonglei Wang Yiqi Zhao Zhe Li Jinyan 《BMC genomics》2018,19(10):1-10


5.

Robust k-mer frequency estimation using gapped k-mers

Mahmoud Ghandi Morteza Mohammad-Noori Michael A. Beer 《Journal of mathematical biology》2014,69(2):469-500


6.

milRNApredictor: Genome-free prediction of fungi milRNAs by incorporating k-mer scheme and distance-dependent pair potential

《Genomics》2020,112(3):2233-2240

MicroRNA-like small RNAs (milRNAs), 21–22 nucleotides in length, are a type of small non-coding RNA first found in Neurospora crassa in 2010. Identifying milRNAs of species without genomic information is a difficult problem. Here, knowledge-based energy features are developed to identify milRNAs by incorporating a k-mer scheme and a distance-dependent pair potential. Compared with the k-mer scheme, the features developed here can alleviate the inherent curse of dimensionality of the k-mer scheme once k becomes large. In addition, milRNApredictor, built on the novel features, performs comparably to the k-mer scheme and achieves a sensitivity of 74.21% and a specificity of 75.72% based on 10-fold cross-validation. Furthermore, for novel miRNA prediction, there is high overlap between the results from milRNApredictor and the state-of-the-art mirnovo. However, milRNApredictor is simpler to use, with reduced requirements for input data and dependencies. Taken together, milRNApredictor can be used for de novo identification of fungal milRNAs and other very short small RNAs of non-model organisms.

7.

Local homology recognition and distance measures in linear time using compressed amino acid alphabets (total citations: 1; self-citations: 0; citations by others: 1)

Edgar RC 《Nucleic acids research》2004,32(1):380-385

Methods for discovery of local similarities and estimation of evolutionary distance by identifying k-mers (contiguous subsequences of length k) common to two sequences are described. Given unaligned sequences of length L, these methods have O(L) time complexity. The ability of compressed amino acid alphabets to extend these techniques to distantly related proteins was investigated. The performance of these algorithms was evaluated for different alphabets and choices of k using a test set of 1848 pairs of structurally alignable sequences selected from the FSSP database. Distance measures derived from k-mer counting were found to correlate well with percentage identity derived from sequence alignments. Compressed alphabets were seen to improve performance in local similarity discovery, but no evidence was found of improvements when applied to distance estimates. The performance of our local similarity discovery method was compared with the fast Fourier transform (FFT) used in MAFFT, which has O(L log L) time complexity. Our method achieves coverage comparable to FFT and is more than an order of magnitude faster. We suggest using k-mer distance for fast, approximate phylogenetic tree construction, and show that a speed improvement of more than three orders of magnitude can be achieved relative to standard distance methods, which require alignments.
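The idea of scoring two unaligned sequences by their shared k-mers, optionally after mapping residues to a smaller alphabet, can be illustrated with the sketch below. It is an assumption-laden example: the residue grouping in `GROUPS` is hypothetical and chosen only for illustration, and the similarity (shared k-mer count divided by the number of k-mers in the shorter sequence) is one plausible normalization, not necessarily the one evaluated in the paper.

```python
from collections import Counter

# Hypothetical grouping of amino acids into classes, for illustration only.
GROUPS = ["AVLIMC", "FWYH", "STNQ", "KR", "DE", "G", "P"]
COMPRESS = {aa: str(i) for i, group in enumerate(GROUPS) for aa in group}


def compress(seq):
    """Rewrite a protein sequence in the compressed alphabet ('X' if unknown)."""
    return "".join(COMPRESS.get(aa, "X") for aa in seq)


def kmer_similarity(x, y, k):
    """Fraction of k-mers shared by x and y, counted with multiplicity."""
    cx = Counter(x[i:i + k] for i in range(len(x) - k + 1))
    cy = Counter(y[i:i + k] for i in range(len(y) - k + 1))
    shared = sum((cx & cy).values())  # per-k-mer minimum of the two counts
    denom = min(len(x), len(y)) - k + 1
    return shared / denom if denom > 0 else 0.0
```

Each sequence is scanned once and the counts are compared through hash lookups, which is where the linear running time in the sequence length comes from. For example, `kmer_similarity(compress(a), compress(b), 3)` scores two proteins `a` and `b` in the reduced alphabet.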

8.

Optimal choice of k-mer in composition vector method for genome sequence comparison

Subhram Das Tamal Deb Nilanjan Dey Amira S. Ashour D.K. Bhattacharya D.N. Tibarewala 《Genomics》2018,110(5):263-273

Several proteins and genes are members of families that share a common evolutionary history. To outline evolutionary relationships and recognize conserved patterns, sequence comparison has emerged as a key process. The current work critically investigates the role of k in the composition vector method for comparing genome sequences. Generally, composition vector methods using k-mers are applied with different values of k to compare genome sequences. For some values of k the results are satisfactory, but for others they are not. In the proposed work, the standard composition vector method is carried out using 3-mers. In addition, a special type of information-based similarity index is used as a distance measure. The work establishes that 3-mers combined with the information-based similarity index provide satisfactory results in all cases, especially for the comparison of whole genome sequences. These selections provide a kind of unified approach to the comparison of genome sequences.

9.

Gerbil: a fast and memory-efficient k-mer counter with GPU-support

Marius Erbert Steffen Rechner Matthias Müller-Hannemann 《Algorithms for molecular biology : AMB》2017,12(1):9

Background

A basic task in bioinformatics is the counting of k-mers in genome sequences. Existing k-mer counting tools are most often optimized for small k < 32 and suffer from excessive memory resource consumption or degrading performance for large k. However, given the technology trend towards long reads of next-generation sequencers, support for large k becomes increasingly important.

Results

We present the open source k-mer counting software Gerbil that has been designed for the efficient counting of k-mers for k ≥ 32. Our software is the result of an intensive process of algorithm engineering. It implements a two-step approach. In the first step, genome reads are loaded from disk and redistributed to temporary files. In a second step, the k-mers of each temporary file are counted via a hash table approach. In addition to its basic functionality, Gerbil can optionally use GPUs to accelerate the counting step. In a set of experiments with real-world genome data sets, we show that Gerbil is able to efficiently support both small and large k.

Conclusions

While Gerbil’s performance is comparable to existing state-of-the-art open source k-mer counting tools for small k < 32, it vastly outperforms its competitors for large k, thereby enabling new applications which require large values of k.
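The two-step approach described in the Results (redistribute first, then count each part with a hash table) can be sketched as follows. This is a toy illustration, not Gerbil's implementation: in-memory lists stand in for the temporary files, and assigning each k-mer to a bucket by a CRC32 checksum is an assumption made for this sketch, since the abstract does not say how reads are redistributed.

```python
import zlib
from collections import Counter


def partition_kmers(reads, k, num_buckets):
    """Step 1: distribute k-mers into buckets (stand-ins for temporary files)."""
    buckets = [[] for _ in range(num_buckets)]
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            # Assumption: bucket chosen by a checksum of the k-mer itself.
            buckets[zlib.crc32(kmer.encode()) % num_buckets].append(kmer)
    return buckets


def count_kmers_two_step(reads, k, num_buckets=4):
    """Step 2: count each bucket independently with a hash table."""
    totals = {}
    for bucket in partition_kmers(reads, k, num_buckets):
        # Every occurrence of a k-mer lands in the same bucket, so the
        # per-bucket counts never overlap and can simply be merged.
        totals.update(Counter(bucket))
    return totals
```

Because identical k-mers always fall into the same bucket, each bucket can be counted on its own with a hash table no larger than that bucket requires, which keeps peak memory bounded by the largest bucket rather than the whole data set.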


10.

A compact, in vivo screen of all 6-mers reveals drivers of tissue-specific expression and guides synthetic regulatory element design

Robin P Smith Samantha J Riesenfeld Alisha K Holloway Qiang Li Karl K Murphy Natalie M Feliciano Lorenzo Orecchia Nir Oksenberg Katherine S Pollard Nadav Ahituv 《Genome biology》2013,14(7):R72


11.

Genomic DNA k-mer spectra: models and modalities

Benny Chor David Horn Nick Goldman Yaron Levy Tim Massingham 《Genome biology》2009,10(10):R108

Background

The empirical frequencies of DNA k-mers in whole genome sequences provide an interesting perspective on genomic complexity, and the availability of large segments of genomic sequence from many organisms means that analysis of k-mers with non-trivial lengths is now possible.

12.

Spectrum structures and biological functions of 8-mers in the human genome

Yun Jia Hong Li Jingfeng Wang Hu Meng Zhenhua Yang 《Genomics》2019,111(3):483-491

The spectra of k-mer frequencies can reveal the structures and evolution of genome sequences. We confirmed that the trimodal spectrum of 8-mers in human genome sequences is distinguished only by the CG2, CG1 and CG0 8-mer sets, containing 2, 1 or 0 CpG, respectively. This phenomenon is called the independent selection law. The three types of CG 8-mers were considered as different functional elements. We conjectured that (1) nucleosome binding motifs are mainly characterized by CG1 8-mers and (2) the core structural units of CpG island (CGI) sequences are predominantly characterized by CG2 8-mers. To validate our conjectures, nucleosome-occupied sequences and CGI sequences were extracted, and sequence parameters were then constructed from the information in each of the three CG 8-mer sets. ROC analysis showed that CG1 8-mers are preferentially found in nucleosome-occupied segments (AUC > 0.7) and CG2 8-mers are preferentially found in CGI sequences (AUC > 0.99). This validates our conjectures in principle.
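Splitting 8-mers into sets by the number of CpG dinucleotides they contain is simple to sketch. The example below is illustrative only; `cpg_count` and `group_by_cpg` are hypothetical names, and how the paper treats 8-mers with more than two CpGs is not stated in the abstract, so such 8-mers are simply grouped under their own count here.

```python
from collections import defaultdict


def cpg_count(kmer):
    """Number of CpG dinucleotides ('CG') in a k-mer.

    'CG' cannot overlap itself, so a plain substring count is exact.
    """
    return kmer.count("CG")


def group_by_cpg(kmers):
    """Group k-mers by CpG count, e.g. 0 -> CG0 set, 1 -> CG1 set."""
    groups = defaultdict(list)
    for kmer in kmers:
        groups[cpg_count(kmer)].append(kmer)
    return dict(groups)
```

Plotting a histogram of genome-wide frequencies separately for each group is one way to inspect whether the three modes of the spectrum line up with the CG0, CG1 and CG2 sets.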

13.

Efficient computation of spaced seed hashing with block indexing

Girotto Samuele Comin Matteo Pizzi Cinzia 《BMC bioinformatics》2018,19(15):441-38

Background

Spaced-seeds, i.e. patterns in which some fixed positions are allowed to be wild-cards, play a crucial role in several bioinformatics applications involving substrings counting and indexing, by often providing better sensitivity with respect to k-mers based approaches. K-mers based approaches are usually fast, being based on efficient hashing and indexing that exploits the large overlap between consecutive k-mers. Spaced-seeds hashing is not as straightforward, and it is usually computed from scratch for each position in the input sequence. Recently, the FSH (Fast Spaced seed Hashing) approach was proposed to improve the time required for computation of the spaced seed hashing of DNA sequences with a speed-up of about 1.5 with respect to standard hashing computation.

Results

In this work we propose a novel algorithm, Fast Indexing for Spaced seed Hashing (FISH), based on the indexing of small blocks that can be combined to obtain the hashing of spaced seeds of any length. The method exploits the fast computation of the hashing of runs of consecutive 1s in the spaced seeds, each of which basically corresponds to a k-mer of the length of the run.

Conclusions

We ran several experiments on NGS data from simulated and synthetic metagenomic experiments to assess the time required to compute the hashing for each position in each read with respect to several spaced seeds. In our experiments, FISH computes the hashing values of spaced seeds with a speedup over the traditional approach of between 1.9x and 6.03x, depending on the structure of the spaced seeds.
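For orientation, the baseline that the Background refers to, where the hash of a spaced seed is computed from scratch at every position, can be sketched as below. This is not FISH or FSH; it is only the straightforward approach they improve on. The 2-bit nucleotide encoding and the name `spaced_seed_hashes` are assumptions made for this example.

```python
# Assumed 2-bit encoding of nucleotides.
ENCODE = {"A": 0, "C": 1, "G": 2, "T": 3}


def spaced_seed_hashes(seq, seed):
    """Hash every window of seq under a spaced seed such as '1101'.

    A '1' marks a position that contributes to the hash and a '0' marks a
    wild-card. Each window is hashed from scratch.
    """
    care = [i for i, c in enumerate(seed) if c == "1"]
    hashes = []
    for start in range(len(seq) - len(seed) + 1):
        h = 0
        for offset in care:
            h = (h << 2) | ENCODE[seq[start + offset]]
        hashes.append(h)
    return hashes
```

With a seed made only of 1s this reduces to ordinary k-mer hashing, where consecutive windows overlap in all but one symbol and the hash can be updated incrementally. The wild-card positions break that overlap, which is why block-based methods treat each run of 1s as a short k-mer.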


14.

Taxonomic analysis of metagenomic data with kASA

Silvio Weging Andreas Gogol-Döring Ivo Grosse 《Nucleic acids research》2021,49(12):e68

The taxonomic analysis of sequencing data has become important in many areas of life sciences. However, currently available tools for that purpose either consume large amounts of RAM or yield insufficient quality and robustness. Here, we present kASA, a k-mer based tool capable of identifying and profiling metagenomic DNA or protein sequences with high computational efficiency and a user-definable memory footprint. We ensure both high sensitivity and precision by using an amino acid-like encoding of k-mers together with a range of multiple k’s. Custom algorithms and data structures optimized for external memory storage enable a full-scale taxonomic analysis without compromise on laptop, desktop, and HPCC.

15.

All-atom computer simulations of amyloid fibrils disaggregation

Wang J Tan C Chen HF Luo R 《Biophysical journal》2008,95(11):5037-5047

Amyloidlike fibrils are found in many fatal diseases, including Alzheimer's disease, type II diabetes mellitus, transmissible spongiform encephalopathies, and prion diseases. These diseases are linked to proteins that have partially unfolded, misfolded, and aggregated into amyloidlike fibrils. The kinetics of amyloidlike fibril aggregation is still hotly debated and remains an important open question. We have utilized the GNNQQNY crystal structure and high-temperature molecular dynamics simulation in explicit solvent to study the disaggregation mechanism of the GNNQQNY fibrils and to infer its likely aggregation pathways. A hexamer model and a 12-mer model, both with two parallel β-sheets separated by a dry side-chain interface, were adopted in our computational analysis. A cumulative time of 1 μs was simulated for the hexamer model at five different temperatures (298 K, 348 K, 398 K, 448 K, and 498 K), and a cumulative time of 2.1 μs was simulated for the 12-mer model at four temperatures (298 K, 398 K, 448 K, and 498 K). Our disaggregation landscape and kinetics analyses indicate that tetramers probably act as the transition state in both the hexamer and the 12-mer simulations. In addition, the 12-mer simulations show that the initial aggregation nucleus consists of eight peptides. Furthermore, the landscape is rather flat from 8-mers to 12-mers, indicating the absence of major barriers once the initial aggregation nucleus forms. Thus, the likely aggregation pathway is from monomers to the initial nucleus of 8-mers with tetramers as the transition state. Transition state structure analysis shows that the two dominant transition state conformations are tetramers in the 3-1 and 2-2 arrangements. The predominant nucleus conformations are in peptide arrangements maximizing dry side-chain contacts. Landscape and kinetics analyses also indicate that the parallel β-sheets form earlier than the dry side-chain contacts during aggregation. These results provide further insight into early fibril aggregation.

16.

Unexpected Diversity of Cellular Immune Responses against Nef and Vif in HIV-1-Infected Patients Who Spontaneously Control Viral Replication

Leandro F. Tarosso Mariana M. Sauer Sabri Sanabani Maria Teresa Giret Helena I. Tomiyama John Sidney Shari M. Piaskowski Ricardo S. Diaz Ester C. Sabino Alessandro Sette Jorge Kalil-Filho David I. Watkins Esper G. Kallas 《PloS one》2010,5(7)

Background

HIV-1-infected individuals who spontaneously control viral replication represent an example of successful containment of the AIDS virus. Understanding the anti-viral immune responses in these individuals may help in vaccine design. However, immune responses against HIV-1 are normally analyzed using HIV-1 consensus B 15-mers that overlap by 11 amino acids. Unfortunately, this method may underestimate the real breadth of the cellular immune responses against the autologous sequence of the infecting virus.

Methodology and Principal Findings

Here we compared cellular immune responses against nef and vif-encoded consensus B 15-mer peptides to responses against HLA class I-predicted minimal optimal epitopes from consensus B and autologous sequences in six patients who have controlled HIV-1 replication. Interestingly, our analysis revealed that three of our patients had broader cellular immune responses against HLA class I-predicted minimal optimal epitopes from either autologous viruses or from the HIV-1 consensus B sequence, when compared to responses against the 15-mer HIV-1 type B consensus peptides.

Conclusion and Significance

This suggests that the cellular immune responses against HIV-1 in controller patients may be broader than we had previously anticipated.
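The peptide sets mentioned in the Background, 15-mers that overlap by 11 amino acids, amount to a sliding window that advances 4 residues at a time. The sketch below shows that tiling under stated assumptions: the function name `overlapping_peptides` is hypothetical, and residues at the C-terminus that do not fill a complete window are left uncovered here, whereas real peptide sets may handle the tail differently.

```python
def overlapping_peptides(protein, length=15, overlap=11):
    """Tile a protein into fixed-length peptides sharing `overlap` residues."""
    step = length - overlap
    if step <= 0:
        raise ValueError("overlap must be smaller than peptide length")
    return [
        protein[i:i + length]
        for i in range(0, len(protein) - length + 1, step)
    ]
```

Each residue in the interior of the protein is covered by several peptides, but an optimal epitope whose sequence in the patient's own virus differs from the consensus will not be represented by any of them, which is the limitation the study examines.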

17.

Learning “graph-mer” Motifs that Predict Gene Expression Trajectories in Development

Xuejing Li Casandra Panea Chris H. Wiggins Valerie Reinke Christina Leslie 《PLoS computational biology》2010,6(4)


18.

Conservation and regulatory associations of a wide affinity range of mouse transcription factor binding sites

Savina A. Jaeger Esther T. Chan Michael F. Berger Rolf Stottmann Timothy R. Hughes Martha L. Bulyk 《Genomics》2010,95(4):185-195


19.

Open complex formation by DnaA initiation protein at the Escherichia coli chromosomal origin requires the 13-mers precisely spaced relative to the 9-mers (total citations: 2; self-citations: 1; citations by others: 1)

Julia Hsu David Bramhill Chris M. Thompson 《Molecular microbiology》1994,11(5):903-911

The 245 bp chromosomal origin, oriC, of Escherichia coli contains two iterated motifs. Three 13-mers tandemly repeated at one end of the origin and four 9-mers in a nearby segment of oriC are highly conserved in enteric bacteria, as is the distance separating these two sequence clusters. Mutant origins were constructed with altered spacing of the 9-mers relative to the 13-mers. Loss or addition of even a single base drastically reduced replication, both in vivo and in vitro. Spacing mutant origins bound effectively to DnaA protein but failed to support efficient open complex formation. These results suggest that interaction with the 9-mers positions at least one subunit of DnaA to recognize directly the nearest 13-mer for DNA melting.

20.

Lighter: fast and memory-efficient sequencing error correction without counting

Li Song Liliana Florea Ben Langmead 《Genome biology》2014,15(11)

Lighter is a fast, memory-efficient tool for correcting sequencing errors. Lighter avoids counting k-mers. Instead, it uses a pair of Bloom filters, one holding a sample of the input k-mers and the other holding k-mers likely to be correct. As long as the sampling fraction is adjusted in inverse proportion to the depth of sequencing, Bloom filter size can be held constant while maintaining near-constant accuracy. Lighter is parallelized, uses no secondary storage, and is both faster and more memory-efficient than competing approaches while achieving comparable accuracy.
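The sampling idea can be illustrated with a small Bloom filter and a sampling fraction that shrinks as sequencing depth grows. This is a sketch under stated assumptions, not Lighter's code: the SHA-256-based hashing, the names `BloomFilter`, `sampling_fraction` and `sample_kmers`, and the constants `base_fraction` and `base_depth` are all hypothetical. The second filter, which holds k-mers likely to be correct, is not shown because the abstract does not describe how those k-mers are selected.

```python
import hashlib
import random


class BloomFilter:
    """Fixed-size bit array with several hash functions; no false negatives."""

    def __init__(self, num_bits, num_hashes):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, item):
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item):
        for p in self._positions(item):
            self.bits[p // 8] |= 1 << (p % 8)

    def __contains__(self, item):
        return all(self.bits[p // 8] & (1 << (p % 8))
                   for p in self._positions(item))


def sampling_fraction(depth, base_fraction=0.1, base_depth=70):
    """Fraction inversely proportional to depth (constants are hypothetical)."""
    return min(1.0, base_fraction * base_depth / depth)


def sample_kmers(reads, k, fraction, bloom, rng):
    """Insert each k-mer occurrence into the filter with probability `fraction`."""
    for read in reads:
        for i in range(len(read) - k + 1):
            if rng.random() < fraction:
                bloom.add(read[i:i + k])
```

Scaling the fraction inversely with depth keeps the expected number of sampled occurrences of a genuine k-mer roughly the same at any depth, so the number of distinct k-mers stored, and therefore the filter size, need not grow with the data.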

Electronic supplementary material

The online version of this article (doi:10.1186/s13059-014-0509-9) contains supplementary material, which is available to authorized users.