Found 20 similar articles.
1.
Burden CJ Jing J Wilson SR 《Statistical applications in genetics and molecular biology》2012,11(1):Article 3
The D(2) statistic, defined as the number of matches of words of some pre-specified length k, is a computationally fast alignment-free measure of biological sequence similarity. However, there is some debate about its suitability for this purpose, as the variability in D(2) may be dominated by terms that reflect only the noise within each of the single sequences. We examine the extent of the problem and the effectiveness of overcoming it by using two mean-centred variants of this statistic, D(2)* and D(2)c. We conclude that all three statistics are potentially useful measures of sequence similarity, for which reasonably accurate p-values can be estimated under a null hypothesis of sequences composed of independently and identically distributed letters. We show that D(2) and D(2)c, and to a somewhat lesser extent D(2)*, perform well in tests to classify moderate-length query sequences as putative cis-regulatory modules.
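A minimal sketch of the plain D(2) count described in this abstract, assuming overlapping words and exact string matches. The function names are illustrative and not taken from the paper, and the mean-centred variants D(2)* and D(2)c are not shown because the abstract does not define them.

```python
from collections import Counter

def kmer_counts(seq, k):
    """Count every overlapping word of length k in seq."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))

def d2(seq_a, seq_b, k):
    """Number of k-word matches between two sequences.

    Counting all matching position pairs is the same as summing,
    over every word w, count_a(w) * count_b(w).
    """
    counts_a = kmer_counts(seq_a, k)
    counts_b = kmer_counts(seq_b, k)
    # Counter returns 0 for words that never occur in seq_b.
    return sum(n * counts_b[w] for w, n in counts_a.items())
```

For example, d2("AAAA", "AAA", 2) counts the word AA three times in the first sequence and twice in the second, giving 6 matches.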
2.
MOTIVATION: Sequencing capacity is currently growing more rapidly than CPU speed, leading to an analysis bottleneck in many genome projects. Alignment-free sequence analysis methods tend to be more efficient than their alignment-based counterparts. They may, therefore, be important in the long run for keeping sequence analysis abreast of sequencing. RESULTS: We derive and implement an alignment-free estimator of the number of pairwise mismatches. Our implementation of this estimator, pim, is based on an enhanced suffix array and inherits the superior time and memory efficiency of this data structure. Simulations demonstrate that the estimator is accurate if mutations are distributed randomly along the chromosome. While real data often deviate from this ideal, the estimator remains useful for identifying regions of low genetic diversity using a sliding window approach. We demonstrate this by applying it to the complete genomes of 37 strains of Drosophila melanogaster, and to the genomes of two closely related Drosophila species, D. simulans and D. sechellia. In both cases, we detect the diversity minimum and discuss its biological implications.
3.
The data deluge in the post-genomic era demands the development of novel data mining tools. Existing molecular phylogeny analyses (MPAs) developed for individual gene/protein sequences are alignment-based. However, the size of genomic data and the uncertainties associated with alignments necessitate the development of alignment-free methods for MPA. Derivation of distances between sequences is an important step in both alignment-dependent and alignment-free methods. Various alignment-free distance measures based on oligonucleotide frequencies, information content, compression techniques, etc. have been proposed. However, these distance measures do not account for the relative order of components, viz. nucleotides or amino acids. A new distance measure, based on the concept of the 'return time distribution' (RTD) of k-mers, is proposed, which accounts for sequence composition and the relative order of the components. Statistical parameters of RTDs are used to derive a distance function. The resultant distance matrix is used for clustering and phylogeny using Neighbor-joining. Its performance for MPA and subtyping was evaluated using simulated data generated by block-bootstrap, receiver operating characteristics and leave-one-out cross-validation methods. The proposed method was successfully applied for MPA of the family Flaviviridae and subtyping of Dengue viruses. It is observed that the method retains resolution for classification and subtyping of viruses at varying levels of sequence similarity and taxonomic hierarchy.
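A rough sketch of the return-time idea in this abstract, under several assumptions that the abstract does not confirm: the return time is taken as the difference between the start positions of successive occurrences of a k-mer, the "statistical parameters" are taken to be the mean and population standard deviation, k-mers with no repeat contribute zeros, and the distance is Euclidean. All function names are illustrative.

```python
import math
from itertools import product
from statistics import mean, pstdev

def return_times(seq, k):
    """Map each k-mer to the gaps between its successive occurrences."""
    last_seen = {}
    gaps = {}
    for i in range(len(seq) - k + 1):
        word = seq[i:i + k]
        if word in last_seen:
            gaps.setdefault(word, []).append(i - last_seen[word])
        last_seen[word] = i
    return gaps

def rtd_vector(seq, k, alphabet="ACGT"):
    """Feature vector: (mean, std dev) of return times for every k-mer."""
    gaps = return_times(seq, k)
    vec = []
    for letters in product(alphabet, repeat=k):
        g = gaps.get("".join(letters), [])
        # Assumption: k-mers that never recur contribute (0, 0).
        vec.extend([mean(g), pstdev(g)] if g else [0.0, 0.0])
    return vec

def rtd_distance(seq_a, seq_b, k):
    """Euclidean distance between two RTD feature vectors (assumed metric)."""
    va, vb = rtd_vector(seq_a, k), rtd_vector(seq_b, k)
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(va, vb)))
```

A matrix of such pairwise distances could then be passed to any Neighbor-joining implementation, as the abstract describes.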
4.
5.
In order to compare different genome sequences, an alignment-free method is proposed. First, we present a new graphical representation of DNA sequences without degeneracy, which is conducive to intuitive comparison of sequences. Then, a new numerical characterization based on this representation is introduced to quantitatively depict the intrinsic nature of genome sequences, and is treated as a 10-dimensional vector. Alignment-free comparison of sequences is performed by computing the distances between the vectors of the corresponding numerical characterizations, which define the evolutionary relationship. Two data sets of DNA sequences were constructed to assess the performance on sequence comparison. The results illustrate the validity of the method. The new numerical characterization provides a powerful tool for genome comparison.
6.
Cangzhi Jia Tian Liu Xiangde Zhang Haoyue Fu Qing Yang 《Journal of biomolecular structure & dynamics》2013,31(6):763-769
Protein sequences are treated as stochastic processes on the basis of a reduced amino acid alphabet of 10 types of amino acids. The realization of a stochastic process is described by an associated transition probability matrix that corresponds to the process uniquely. New distances between transition probability matrices are then defined for sequence similarity analysis. Two separate datasets are prepared and tested to assess the validity of the method. The results demonstrate that the new method is powerful and efficient.
7.
《Genomics, Proteomics & Bioinformatics》2015,(5)
A faithful phylogeny and an objective taxonomy for prokaryotes should agree with each other and ultimately follow the genome data. With the number of sequenced genomes reaching tens of thousands, both tree inference and detailed comparison with taxonomy are great challenges. We now provide one solution in the latest Release 3.0 of the alignment-free and whole-genome-based web server CVTree3. The server resides in a cluster of 64 cores and is equipped with an interactive, collapsible, and expandable tree display. It is capable of comparing the tree branching order with prokaryotic classification at all taxonomic ranks from domains down to species and strains. CVTree3 allows for inquiry by taxon names and trial on lineage modifications. In addition, it reports a summary of monophyletic and non-monophyletic taxa at all ranks and produces print-quality subtree figures. After giving an overview of retrospective verification of the CVTree approach, the power of the new server is described for the mega-classification of prokaryotes and determination of taxonomic placement of some newly sequenced genomes. A few discrepancies between CVTree and 16S rRNA analyses are also summarized with regard to possible taxonomic revisions. CVTree3 is freely accessible to all users at http://tlife.fudan.edu.cn/cvtree3/ without login requirements.
8.
9.
Perez-Bello A Munteanu CR Ubeira FM De Magalhães AL Uriarte E González-Díaz H 《Journal of theoretical biology》2009,256(3):458-466
The importance of promoter sequences in the functional regulation of several important mycobacterial pathogens creates the need to design simple and fast theoretical models that can predict them. This work proposes two DNA promoter QSAR models based on pseudo-folding lattice network (LN) and star-graph (SG) topological indices. In addition, a comparative study with the previous RNA electrostatic parameters of thermodynamically driven secondary structure folding representations has been carried out. The best model of this work was obtained with only two LN stochastic electrostatic potentials and is characterized by accuracy, selectivity and specificity of 90.87%, 82.96% and 92.95%, respectively. In addition, we point out the dependence of the SG results on the DNA sequence codification and propose a QSAR model based on codons and only three SG spectral moments.
10.
A method is proposed to represent and analyze complete genome sequences (52 species from prokaryotes and eukaryotes), based on the frequencies of amino acid pairs (bigrams) separated by a given number of other residues. For each of the species analyzed, it allows us to construct over-abundant and over-deficient occurrence profiles, summarizing amino acid bigram frequencies over the entire genome. The method deals efficiently with the sparseness of statistical representations of individual sequences, and describes every gene sequence in the same way, independently of its length and of the genome size. The frequency of over-abundant and over-deficient occurrences of bigrams presents a singular periodicity around 3.5 peptide bonds, suggesting a relation with the alpha-helical secondary structure.
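A minimal sketch of the gapped-bigram counting step described in this abstract, assuming that "separated by a given number of other residues" means pairing position i with position i + gap + 1. The function name is illustrative, and the over-abundant and over-deficient profiles are not shown because the abstract does not define how they are computed.

```python
from collections import Counter

def gapped_bigram_freqs(seq, gap):
    """Relative frequencies of residue pairs (seq[i], seq[i + gap + 1]).

    gap is the number of residues lying between the two members of a pair;
    gap=0 gives ordinary adjacent bigrams.
    """
    step = gap + 1
    pairs = Counter((seq[i], seq[i + step]) for i in range(len(seq) - step))
    total = sum(pairs.values())
    # Sequences too short to contain any pair yield an empty profile.
    return {pair: n / total for pair, n in pairs.items()} if total else {}
```

Because the result is a set of relative frequencies, profiles from sequences of different lengths can be compared directly, which matches the length independence the abstract emphasises.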
11.
Background
An Escherichia coli O104:H4 outbreak in Germany in summer 2011 caused 53 deaths, over 4000 individual infections across Europe, and considerable economic, social and political impact. This outbreak was the first in a position to exploit rapid, benchtop high-throughput sequencing (HTS) technologies and crowdsourced data analysis early in its investigation, establishing a new paradigm for rapid response to disease threats. We describe a novel strategy for design of diagnostic PCR primers that exploited this rapid draft bacterial genome sequencing to distinguish between E. coli O104:H4 outbreak isolates and other pathogenic E. coli isolates, including the historical hæmolytic uræmic syndrome (HUSEC) E. coli HUSEC041 O104:H4 strain, which possesses the same serotype as the outbreak isolates.
Methodology/Principal Findings
Primers were designed using a novel alignment-free strategy against eleven draft whole genome assemblies of E. coli O104:H4 German outbreak isolates from the E. coli O104:H4 Genome Analysis Crowd-Sourcing Consortium website, and a negative sequence set containing 69 E. coli chromosome and plasmid sequences from public databases. Validation in vitro against 21 ‘positive’ E. coli O104:H4 outbreak and 32 ‘negative’ non-outbreak EHEC isolates indicated that individual primer sets exhibited 100% sensitivity for outbreak isolates, with false positive rates of between 9% and 22%. A minimal combination of two primers discriminated between outbreak and non-outbreak E. coli isolates with 100% sensitivity and 100% specificity.
Conclusions/Significance
Draft genomes of isolates of disease outbreak bacteria enable high throughput primer design and enhanced diagnostic performance in comparison to traditional molecular assays. Future outbreak investigations will be able to harness HTS rapidly to generate draft genome sequences and diagnostic primer sets, greatly facilitating epidemiology and clinical diagnostics. We expect that high throughput primer design strategies will enable faster, more precise responses to future disease outbreaks of bacterial origin, and help to mitigate their societal impact.
12.
Bajić VB 《Briefings in bioinformatics》2000,1(3):214-228
The abundance of computer software for different types of prediction in DNA and protein sequence analyses raises the problem of adequate ranking of prediction program quality. A single measure of success of predictor software, which adequately ranks the predictors, does not exist. A typical example of such an incomplete measure is the so-called correlation coefficient. This paper provides an overview and short analysis of several different measures of prediction quality. Frequently, some of these measures give results contradictory to each other even when they relate to the same prediction scores. This may lead to confusion. In order to overcome some of the problems, a few new measures are proposed including some variants of a 'generalised distance from the ideal predictor score'; these are based on topological properties, rather than on statistics. In order to provide a sort of a balanced ranking, the averaged score measure (ASM) is introduced. The ASM provides a possibility for the selection of the predictor that probably has the best overall performance. The method presented in the paper applies to the ranking problem of any prediction software whose results can be properly represented in a true positive-false positive framework, thus providing a natural set-up for linear biological sequence analysis.
13.
This review summarises the presentations and discussions that took place during a European Science Foundation-funded workshop whose purpose was to gain current perspectives on the mutational mechanisms of simple sequence repeats and the contribution of localised hypermutation in such repeats to bacterial pathogenesis. In vitro biophysical and biochemical assays of mutational mechanisms were covered as well as genetic studies in various eukaryotic and prokaryotic organisms. Presentations on bacterial pathogenesis elaborated investigations of the use of repeats for typing of strains, epidemiological investigations of mutation rates and functions of loci whose expression is controlled by simple sequence repeats. This review tabulates current perspectives on the cis- and trans-acting factors for mutation of simple sequence repeats and the orientations of mononucleotide repeats in some bacterial species that utilise repeats for adaptation.
14.
Proteoglycans (PGs) are among the most structurally complex biomacromolecules in nature. They are present in all animal cells and frequently exert their critical biological functions through interactions with protein ligands and receptors. PGs are composed of a core protein to which one or multiple heterogeneous and polydisperse glycosaminoglycan (GAG) chains are attached. Proteins, including the protein core of PGs, are now routinely sequenced either directly using proteomics or indirectly using molecular biology through their encoding DNA. The sequencing of the GAG component of PGs poses a considerably more difficult challenge because of the relatively underdeveloped state of glycomics and because the control of their biosynthesis in the endoplasmic reticulum and the Golgi is poorly understood and not believed to be template driven. Recently, the GAG chain of the simplest PG has been suggested to have a defined sequence based on its top-down Fourier transform mass spectral sequencing. This review examines the advances made over the past decade in the sequencing of GAG chains and the challenges the field faces in sequencing complex PGs having critical biological functions in developmental biology and pathogenesis.
15.
16.
Malde K 《Bioinformatics (Oxford, England)》2008,24(7):897-900
Motivation: The nucleotide sequencing process produces not only the sequence of nucleotides, but also associated quality values. Quality values provide valuable information, but are primarily used only for trimming sequences and generally ignored in subsequent analyses. Results: This article describes how the scoring schemes of standard alignment algorithms can be modified to take into account quality values to produce improved alignments and statistically more accurate scores. A prototype implementation is also provided, and used to post-process a set of BLAST results. Quality-adjusted alignment is a natural extension of standard alignment methods, and can be implemented with only a small constant factor performance penalty. The method can also be applied to related methods including heuristic search algorithms like BLAST and FASTA. Availability: Software is available at http://malde.org/~ketil/qaa. Contact: ketil.malde{at}imr.no Supplementary information: Supplementary data are available at Bioinformatics online.
Associate Editor: Limsoon Wong
17.
We investigated protein sequence/structure correlation by constructing a space of protein sequences, based on methods developed previously for constructing a space of protein structures. The space is constructed by using a representation of the amino acids as vectors of 10 property factors that encode almost all of their physical properties. Each sequence is represented by a distribution of overlapping sequence fragments. A distance between any two sequences can be calculated. By attaching a weight to each factor, intersequence distances can be varied. We optimize the correlation between corresponding distances in the sequence and structure spaces. The optimal correlation between the sequence and structure spaces is significantly better than that which results from correlating randomly generated sequences, having the overall composition of the database, with the structure space. However, sets of randomly generated sequences, each of which approximates the composition of the real sequence it replaces, produce correlations with the structure space that are as good as that observed for the actual protein sequences. A connection is proposed with previous studies of the protein folding code. It is shown that the most important property factors for the correlation of the sequence and structure spaces are related to helix/bend preference, side chain bulk, and beta-structure preference.
18.
19.
20.