Similar articles
 20 similar articles found (search time: 15 ms)
1.
Comparative biosequence metrics   Total citations: 27 (self-citations: 0, citations by others: 27)
Summary: The sequence alignment algorithms of Needleman and Wunsch (1970) and Sellers (1974) are compared. Although the former maximizes similarity and the latter minimizes differences, the two procedures are proven to be equivalent. The equivalence relations necessary for each procedure to give the same result are: (1) the weight assigned to a gap in the Sellers algorithm exceeds that in the Needleman-Wunsch algorithm by exactly half the length of the gap times the maximum match value; and (2) for any pair of aligned elements, the degree of similarity assigned by the Needleman-Wunsch algorithm plus the degree of dissimilarity assigned by the Sellers algorithm equals a constant. The utility of the algorithms is independent of the nature of the elements in the sequence and could include anything from geological sequences to the amino acid sequences of proteins. Examples are provided using known nucleotide sequences, one of which shows two sequences to be analogous rather than homologous.
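
The two equivalence relations above can be checked numerically. The following is a minimal sketch, not the authors' code: it assumes linear gap weights, a maximum match value M = 2, and a simple match/mismatch scoring function (all illustrative choices). Under those assumptions the optimal similarity and the optimal distance should sum to M(n + m)/2 for sequences of lengths n and m.

```python
M = 2  # assumed maximum match value (illustrative)

def similarity(x, y):
    # Needleman-Wunsch style similarity of two aligned elements.
    return M if x == y else 0

def dissimilarity(x, y):
    # Relation (2): similarity + dissimilarity is the constant M.
    return M - similarity(x, y)

def needleman_wunsch(a, b, gap):
    """Maximum similarity of a global alignment; each gap position costs `gap`."""
    prev = [-gap * j for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        cur = [-gap * i]
        for j in range(1, len(b) + 1):
            cur.append(max(prev[j - 1] + similarity(a[i - 1], b[j - 1]),
                           prev[j] - gap,
                           cur[j - 1] - gap))
        prev = cur
    return prev[-1]

def sellers(a, b, gap):
    """Minimum distance of a global alignment; each gap position costs `gap`."""
    prev = [gap * j for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        cur = [gap * i]
        for j in range(1, len(b) + 1):
            cur.append(min(prev[j - 1] + dissimilarity(a[i - 1], b[j - 1]),
                           prev[j] + gap,
                           cur[j - 1] + gap))
        prev = cur
    return prev[-1]

a, b = "GATTACA", "GCATGCT"
gap_nw = 1                     # assumed Needleman-Wunsch gap weight per position
gap_sellers = gap_nw + M // 2  # relation (1): larger by half the gap length times M
s = needleman_wunsch(a, b, gap_nw)
d = sellers(a, b, gap_sellers)
print(s, d, s + d == M * (len(a) + len(b)) // 2)
```

The identity holds for every individual alignment, which is why the alignment that maximizes similarity is also the one that minimizes distance.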

2.
Novel methods are discussed for using fast Fourier transforms for DNA or protein sequence comparison. These methods are also intended as a contribution to the more general computer science problem of text search. These methods extend the capabilities of previous FFT methods and show that these methods are capable of considerable refinement. In particular, novel methods are given which (1) enable the detection of clusters of matching letters, (2) facilitate the insertion of gaps to enhance sequence similarity, and (3) accommodate varying densities of letters in the input sequences. These methods use Fourier analysis in two distinct ways. (1) Fast Fourier transforms are used to facilitate rapid computation. (2) Fourier expansions are used to form an 'image' of the sequence comparison.

3.

Background  

Ambiguity is a problem in biosequence analysis that arises in various analysis tasks solved via dynamic programming, and in particular, in the modeling of families of RNA secondary structures with stochastic context-free grammars. Several types of analysis are invalidated by the presence of ambiguity. As this problem inherits undecidability (as we show here) from the corresponding problem for context-free languages, there is no complete algorithmic solution to the problem of ambiguity checking.

4.
Digital signal processing methods for biosequence comparison.   Total citations: 1 (self-citations: 1, citations by others: 0)
A method is discussed for DNA or protein sequence comparison using a finite field fast Fourier transform, a digital signal processing technique; and statistical methods are discussed for analyzing the output of this algorithm. This method compares two sequences of length N in computing time proportional to N log N, compared to N^2 for methods currently used. This method makes it feasible to compare very long sequences. An example is given to show that the method correctly identifies sites of known homology.
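
The idea of counting letter matches at every relative shift with a Fourier transform can be sketched as follows. This is an illustration under stated assumptions, not the method of the abstract: it swaps the finite field transform for an ordinary complex-valued recursive FFT, and the function names `fft` and `match_counts` are hypothetical.

```python
import cmath

def fft(x, invert=False):
    """Recursive radix-2 FFT; len(x) must be a power of two (unnormalized)."""
    n = len(x)
    if n == 1:
        return list(x)
    even = fft(x[0::2], invert)
    odd = fft(x[1::2], invert)
    sign = 1 if invert else -1
    out = [0j] * n
    for k in range(n // 2):
        t = cmath.exp(sign * 2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + t
        out[k + n // 2] = even[k] - t
    return out

def match_counts(a, b):
    """Entry k counts positions j with b[j] == a[j + s], where s = k - (len(b) - 1)."""
    n = len(a) + len(b) - 1
    size = 1
    while size < n:
        size *= 2
    total = [0.0] * size
    for letter in set(a) & set(b):
        # Indicator vectors for this letter; b is reversed so that the
        # convolution computed below is a cross-correlation.
        ia = [1.0 if c == letter else 0.0 for c in a] + [0.0] * (size - len(a))
        ib = [1.0 if c == letter else 0.0 for c in reversed(b)] + [0.0] * (size - len(b))
        fa, fb = fft(ia), fft(ib)
        conv = fft([p * q for p, q in zip(fa, fb)], invert=True)
        for k in range(size):
            total[k] += conv[k].real / size
    return [round(v) for v in total[:n]]

def match_counts_naive(a, b):
    """Quadratic-time reference used only to check the FFT version."""
    return [sum(1 for j in range(len(b))
                if 0 <= j + s < len(a) and a[j + s] == b[j])
            for s in range(-(len(b) - 1), len(a))]

print(match_counts("ACGTACGT", "GTAC"))
```

Each letter costs a constant number of transforms of length proportional to N, so the total work grows as N log N for a fixed alphabet, against N^2 for the naive comparison of all shifts.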

5.
Fast algorithms for pairwise biosequence similarity search frequently use filtering and indexing strategies to identify potential matches between a query sequence and a database. For the most part, these strategies are not informed by the substitution score matrices commonly used by comparison algorithms to assign numerical scores to pairs of aligned residues. Consequently, although many filtering strategies offer strong formal guarantees about their ability to detect pairs of sequences differing by few substitutions, these methods can make no guarantee of detecting pairs with high similarity scores. We describe a general technique, score simulation, to help resolve the tension between existing filtering techniques and the use of score matrices. Score simulation, using score matrices, maps ungapped similarity search problems to the simpler problem of finding pairs of strings that differ by few substitutions. Score simulation leads to indexing schemes for biosequences that permit efficient ungapped similarity search with arbitrary score matrices while maintaining strong formal guarantees of sensitivity. We introduce the LSH-ALL-PAIRS-SIM algorithm for finding local similarities in large biosequence collections and show that it is both computationally feasible and sensitive in practice.

6.
A few models have appeared in recent years that consider not only the way substitutions occur through evolutionary history at each site of a genome, but also the way the process changes from one site to the next. These models combine phylogenetic models of molecular evolution, which apply to individual sites, and hidden Markov models, which allow for changes from site to site. Besides improving the realism of ordinary phylogenetic models, they are potentially very powerful tools for inference and prediction--for example, for gene finding or prediction of secondary structure. In this paper, we review progress on combined phylogenetic and hidden Markov models and present some extensions to previous work. Our main result is a simple and efficient method for accommodating higher-order states in the HMM, which allows for context-dependent models of substitution--that is, models that consider the effects of neighboring bases on the pattern of substitution. We present experimental results indicating that higher-order states, autocorrelated rates, and multiple functional categories all lead to significant improvements in the fit of a combined phylogenetic and hidden Markov model, with the effect of higher-order states being particularly pronounced.

7.
We introduce a metric for local sequence alignments that has utility for accelerating optimal alignment searches without loss of sensitivity. The metric's triangle inequality property permits identification of redundant database entries guaranteed to have optimal alignments to the query sequence that fall below a specified score threshold, thereby permitting comparisons to these entries to be skipped. We prove the existence of the metric for a variety of scoring systems, including the most commonly used ones, and show that a triangle inequality can be established as well for nucleotide-to-protein sequence comparisons. We discuss a database clustering and search strategy that takes advantage of the triangle inequality. The strategy permits moderate but significant acceleration of searches against the widely used "nr" protein database. It also provides a theoretically based method for database clustering in general and provides a standard against which to compare heuristic clustering strategies.
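
How a triangle inequality lets a search skip database entries can be sketched in a few lines. This is only an illustration under assumptions: it uses unit-cost edit distance as a stand-in metric rather than the local-alignment metric of the abstract, a single cluster with one representative, and hypothetical names (`build_index`, `search`, `radius`).

```python
def edit_distance(a, b):
    """Unit-cost edit distance, used here as a stand-in metric."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

def build_index(rep, members):
    """Precompute each member's distance to the cluster representative."""
    return [(m, edit_distance(rep, m)) for m in members]

def search(query, rep, index, radius):
    """Return members within `radius` of the query and the number of skipped comparisons."""
    d_qr = edit_distance(query, rep)
    hits, skipped = [], 0
    for member, d_rm in index:
        # Triangle inequality: d(query, member) >= |d(query, rep) - d(rep, member)|.
        if abs(d_qr - d_rm) > radius:
            skipped += 1  # cannot be a hit, so the comparison is not performed
            continue
        if edit_distance(query, member) <= radius:
            hits.append(member)
    return hits, skipped

rep = "ACGTACGT"
members = ["ACGTACGA", "ACGTTCGT", "TTTTTTTTTTTTTTTT", "ACG"]
index = build_index(rep, members)
hits, skipped = search("ACGTACGG", rep, index, radius=1)
print(hits, skipped)
```

Because the bound is exact rather than heuristic, the pruned search returns the same hits as an exhaustive one, which is the "without loss of sensitivity" property the abstract describes.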

8.
Quantifying similarity and dissimilarity of spike trains is an important requisite for understanding neural codes. Spike metrics constitute a class of approaches to this problem. In contrast to most signal-processing methods, spike metrics operate on time series of all-or-none events, and are, thus, particularly appropriate for extracellularly recorded neural signals. The spike metric approach can be extended to multineuronal recordings, mitigating the 'curse of dimensionality' typically associated with analyses of multivariate data. Spike metrics have been usefully applied to the analysis of neural coding in a variety of systems, including vision, audition, olfaction, taste and electric sense.
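
The abstract does not name a specific spike metric. As one commonly used example, assumed here for illustration, a spike-time distance of the Victor-Purpura type charges 1 for inserting or deleting a spike and q·|Δt| for shifting a spike by Δt. The function name and the parameter `q` are illustrative.

```python
def spike_time_distance(train_a, train_b, q):
    """Edit-style distance between two sorted lists of spike times.

    Inserting or deleting a spike costs 1; moving a spike by dt costs q * |dt|.
    """
    prev = [float(j) for j in range(len(train_b) + 1)]
    for i, ta in enumerate(train_a, 1):
        cur = [float(i)]
        for j, tb in enumerate(train_b, 1):
            cur.append(min(prev[j] + 1.0,                   # delete spike ta
                           cur[j - 1] + 1.0,                # insert spike tb
                           prev[j - 1] + q * abs(ta - tb))) # shift ta onto tb
        prev = cur
    return prev[-1]

# With a small q, shifting is cheap; with a large q, delete-and-insert wins.
print(spike_time_distance([1.0], [1.5], q=1.0))
print(spike_time_distance([1.0], [1.5], q=10.0))
```

At q = 0 the distance reduces to the difference in spike counts, and as q grows it becomes increasingly sensitive to precise spike timing.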

9.
10.
Although landscape configuration and landscape composition metrics are correlated theoretically and empirically, the effectiveness of configuration metrics from composition metrics has not been explicitly investigated. This study explored to what extent substantial information of configuration metrics increases from certain easily calculated and extensively used composition metrics and how strongly the effectiveness is influenced by different factors. The effectiveness of 12 landscape configuration metrics from the percentage of landscape (PLAND) of each land-use class and patch density (PD) was evaluated through the coefficient of determination (R^2) of multivariate stepwise linear regression analysis of 150 town-based landscape samples from three regions. The different landscape configuration metrics from PLAND and PD presented significantly different performances in terms of effectiveness [the contagion index and aggregation index possess minimal information, and the effective mesh size (MESH) and area-weighted mean patch fractal dimension possess abundant information]. Furthermore, the effectiveness of configuration metrics showed different responses to changing cell sizes and different land-use categorization in different regions (interspersion and juxtaposition index, patch cohesion index, and MESH exhibited large variations in R^2 among the different regions). No single, uniform, consistent characteristic of effectiveness was determined across different factors. This new approach to understanding the effectiveness of configuration metrics helps clarify landscape metrics and is fundamental to landscape metric assessment.

11.
R. D. Routledge. Oecologia, 1979, 43(1): 121-124
Summary: Pielou's (1972) measures of niche width and overlap are related to ecological components of diversity. This relation is exploited to derive modified niche metrics with improved characteristics.

12.
The study of genetic sequences is of great importance in biology and medicine. Mathematics is playing an important role in the study of genetic sequences and, generally, in bioinformatics. In this paper, we extend the work concerning the Fuzzy Polynucleotide Space (FPS) introduced in Torres, A., Nieto, J.J., 2003. The fuzzy polynucleotide space: basic properties. Bioinformatics 19(5): 587–592, and Nieto, J.J., Torres, A., Vazquez-Trasande, M.M., 2003. A metric space to study differences between polynucleotides. Appl. Math. Lett. 27: 1289–1294, by studying distances between nucleotides and some complete genomes using several metrics. We also present new results concerning the notions of similarity, difference and equality between polynucleotides. The results are encouraging since they demonstrate how the notions of distance and similarity between polynucleotides in the FPS can be employed in the analysis of genetic material.

13.
Numerous metrics describing landscape patterns have been used to explain landscape-scale habitat selection by birds. The myriad metrics, their complexity, and inconsistent responses to them by birds have led to a lack of clear recommendations for managing land for desired species. The amount of a target land cover type in the landscape (percentage cover) often has been a useful indicator of the likelihood of species occurrence or of habitat selection; is it also a more adequate and parsimonious measure for explaining species distributions than patch size or more complex measures of landscape configuration? We examined responses of 6 woodland-interior bird species to the percentage tree cover within prescribed areas and to patch size, edge density, and other metrics. We examined responses in 2 landscapes: a mixed woodland-savanna and an eastern deciduous forest. For these 6 species, percentage tree cover explained bird occurrence as well as or better than other measures in both study areas. We then repeated the analysis on a larger group of woodland species, including those associated with woodland edges. The bird species we studied had varied responses to landscape metrics, but percentage tree cover was the strongest explanatory variable overall. Although percentage cover estimated from remotely sensed data is an inexact representation of habitat in the landscape, it does appear to be reliable and easy to conceptualize, relative to other measures. We suggest that, at least for woodland habitat, percentage cover is a broadly useful measure that can be helpful in pragmatic questions of explaining responses to landscapes or in anticipating responses to landscape change. © 2011 The Wildlife Society.

14.
We derive a new metric of community similarity that takes into account the phylogenetic relatedness among species. This metric, phylogenetic community dissimilarity (PCD), can be partitioned into two components, a nonphylogenetic component that reflects shared species between communities (analogous to Sørensen's similarity metric) and a phylogenetic component that reflects the evolutionary relationships among nonshared species. Therefore, even if a species is not shared between two communities, it will increase the similarity of the two communities if it is phylogenetically related to species in the other community. We illustrate PCD with data on fish and aquatic macrophyte communities from 59 temperate lakes. Dissimilarity between fish communities associated with environmental differences between lakes often has a phylogenetic component, whereas this is not the case for macrophyte communities. With simulations, we then compare PCD with two other metrics of phylogenetic community similarity, Π_ST and UniFrac. Of the three metrics, PCD was best at identifying environmental drivers of community dissimilarity, showing lower variability and greater statistical power. Thus, PCD is a statistically powerful metric that separates the effects of environmental drivers on compositional versus phylogenetic components of community structure.
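
For orientation, the classic Sørensen similarity that the nonphylogenetic component is said to be analogous to can be written in a few lines. This sketch shows only that textbook index on presence/absence data, not PCD itself; the species lists are made up for illustration.

```python
def sorensen_similarity(community_a, community_b):
    """Sørensen index: twice the shared species over the summed species counts."""
    a, b = set(community_a), set(community_b)
    if not a and not b:
        return 1.0  # convention assumed here for two empty communities
    return 2 * len(a & b) / (len(a) + len(b))

lake_1 = {"perch", "pike", "bass"}           # hypothetical species lists
lake_2 = {"pike", "bass", "trout", "carp"}
print(sorensen_similarity(lake_1, lake_2))   # 2 shared species out of 3 + 4
```

An index of this kind treats all nonshared species as equally different, which is the limitation the phylogenetic component of PCD is described as addressing.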

15.
16.
In contrast to clock time, which is extrinsic, universal and reversible, age is an intrinsic, directed measure of the state of a particular system. It is proposed that if the dynamical equations of a given system are cast into canonical form, a time scale intrinsic to that system can be derived. The metric which converts a given intrinsic time to clock time is derived in terms of the given system's constitutive parameters. Age becomes a question of similitude, two systems being in corresponding states (i.e. at the same age) at identical instants of intrinsic time (not clock time). It is further proposed that there is an intrinsic time associated with any dissipative process and that the coupling coefficients, L_ik, of irreversible thermodynamics are metrics which scale the passage of intrinsic time to clock time as measured by a standard harmonic oscillator. Thus, in addition to the long-standing conjecture that entropy production determines the direction of time's arrow, there also is a sense in which it determines the rate of its flow.

17.

Background  

Identifying coevolving positions in protein sequences has myriad applications, ranging from understanding and predicting the structure of single molecules to generating proteome-wide predictions of interactions. Algorithms for detecting coevolving positions can be classified into two categories: tree-aware, which incorporate knowledge of phylogeny, and tree-ignorant, which do not. Tree-ignorant methods are frequently orders of magnitude faster, but are widely held to be insufficiently accurate because of a confounding of shared ancestry with coevolution. We conjectured that by using a null distribution that appropriately controls for the shared-ancestry signal, tree-ignorant methods would exhibit equivalent statistical power to tree-aware methods. Using a novel t-test transformation of coevolution metrics, we systematically compared four tree-aware and five tree-ignorant coevolution algorithms, applying them to myoglobin and myosin. We further considered the influence of sequence recoding using reduced-state amino acid alphabets, a common tactic employed in coevolutionary analyses to improve both statistical and computational performance.

18.
We explore functional connectivity in nine subjects measured with 1.5T fMRI-BOLD in a longitudinal study of recovery from unilateral stroke affecting the motor area (Small et al., 2002). We found that several measures of complexity of covariance matrices show strong correlations with behavioral measures of recovery. In Schmah et al. (2010), we applied Linear and Quadratic Discriminants (LD and QD) computed on a principal components (PC) subspace to classify the fMRI volumes into "early" and "late" sessions. We demonstrated excellent classification accuracy with QD but not LD, indicating that potentially important differences in functional connectivity exist between the early and late sessions. Motivated by McIntosh et al. (2008), who showed that EEG brain-signal variability and behavioral performance both increased with age during development, we investigated complexity of the covariance matrix for this longitudinal stroke recovery data set. We used three complexity measures: the sphericity index described by Abdi (2010); "unsupervised dimensionality", which is the number of PCs that minimizes unsupervised generalization error of a covariance matrix (Hansen et al., 1999); and "QD dimensionality", which is the number of PCs that maximizes the classification accuracy of QD. Although these approaches measure different kinds of complexity, all showed strong correlations with one or more behavioral tests: nine-hole peg test, hand grip test and pinch test. We could not demonstrate that either sphericity or unsupervised dimensionality was significantly different for the "early" and "late" sessions using a paired Wilcoxon test. However, the amount of relative behavioral improvement was correlated with sphericity of the overall covariance matrix (pooled across all sessions), as well as with the divergence of the eigenspectra between the "early" and "late" covariance matrices. 
Complexity measures that use the number of PCs (which optimize QD classification or unsupervised generalization) were correlated with the behavioral performance of the final session, but not with the relative improvement. These are suggestive, but limited, results given the sample size, restricted behavioral measurements and older 1.5T BOLD data sets. Nevertheless, they indicate one potentially fruitful direction for future data-driven fMRI studies of stroke recovery in larger, better-characterized longitudinal stroke data sets recorded at higher field strength. Finally, we produced sensitivity maps (Kjems et al., 2002) corresponding to both linear and quadratic discriminants for the "early" vs. "late" classification. These maps measure the influence of each voxel on the class assignments for a given classifier. Differences between the scaled sensitivity maps for the linear and quadratic discriminants indicate brain regions involved in changes in functional connectivity. These regions are highly variable across subjects, but include the cerebellum and the motor area contralateral to the lesion.

19.
Levenshtein dissimilarity measures are used to compare sequences in application areas including coding theory, computer science and macromolecular biology. In general, they measure sequence dissimilarity by the length of a shortest weighted sequence of insertions, deletions and substitutions required to transform one sequence into another. Those Levenshtein dissimilarity measures based on insertions and deletions are analyzed by a model involving valuations on a partially ordered set. The model reveals structural relationships among poset, valuation and dissimilarity measure. As a consequence, certain Levenshtein dissimilarity measures are shown to be metrics characterized by betweenness properties and computable in terms of well-known measures of sequence similarity. This work was supported in part by the Natural Sciences and Engineering Research Council of Canada under Grant A-4142.
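
A simple instance of a dissimilarity "computable in terms of a well-known measure of sequence similarity" is the unit-cost insertion/deletion distance, which equals |a| + |b| - 2·LCS(a, b), where LCS is the length of a longest common subsequence. The sketch below checks that identity; the unit weights are an assumption made for illustration, since the abstract covers weighted measures more generally.

```python
def indel_distance(a, b):
    """Minimum number of single-element insertions and deletions turning a into b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            best = min(prev[j] + 1, cur[j - 1] + 1)  # delete ca or insert cb
            if ca == cb:
                best = min(best, prev[j - 1])        # keep a matching element
            cur.append(best)
        prev = cur
    return prev[-1]

def lcs_length(a, b):
    """Length of a longest common subsequence of a and b."""
    prev = [0] * (len(b) + 1)
    for ca in a:
        cur = [0]
        for j, cb in enumerate(b, 1):
            cur.append(prev[j - 1] + 1 if ca == cb else max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]

a, b = "kitten", "sitting"
print(indel_distance(a, b), len(a) + len(b) - 2 * lcs_length(a, b))
```

Every element outside a common subsequence must be deleted from one sequence or inserted from the other, so maximizing the common subsequence minimizes the number of edits.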

20.
The purpose of our work was to develop heuristics for visualizing and interpreting gene-environment interactions (GEIs) and to assess the dependence of candidate visualization metrics on biological and study-design factors. Two information-theoretic metrics, the k-way interaction information (KWII) and the total correlation information (TCI), were investigated. The effectiveness of the KWII and TCI to detect GEIs in a diverse range of simulated data sets and a Crohn disease data set was assessed. The sensitivity of the KWII and TCI spectra to biological and study-design variables was determined. Head-to-head comparisons with the relevance-chain, multifactor dimensionality reduction, and the pedigree disequilibrium test (PDT) methods were obtained. The KWII and TCI spectra, which are graphical summaries of the KWII and TCI for each subset of environmental and genotype variables, were found to detect each known GEI in the simulated data sets. The patterns in the KWII and TCI spectra were informative for factors such as case-control misassignment, locus heterogeneity, allele frequencies, and linkage disequilibrium. The KWII and TCI spectra were found to have excellent sensitivity for identifying the key disease-associated genetic variations in the Crohn disease data set. In head-to-head comparisons with the relevance-chain, multifactor dimensionality reduction, and PDT methods, the results from visual interpretation of the KWII and TCI spectra performed satisfactorily. The KWII and TCI are promising metrics for visualizing GEIs. They are capable of detecting interactions among numerous single-nucleotide polymorphisms and environmental variables for a diverse range of GEI models.
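
The abstract does not give formulas for the two metrics. The sketch below assumes the commonly used entropy-based definitions: TCI as the sum of the single-variable entropies minus the joint entropy, and KWII as an alternating sum of the entropies of all subsets. The data layout (rows of discrete values, columns addressed by index) and the toy XOR data set are illustrative assumptions.

```python
from collections import Counter
from itertools import combinations
from math import log2

def entropy(rows, cols):
    """Empirical joint entropy (bits) of the selected columns."""
    counts = Counter(tuple(row[c] for c in cols) for row in rows)
    n = len(rows)
    return -sum((k / n) * log2(k / n) for k in counts.values())

def tci(rows, cols):
    """Total correlation: sum of single-variable entropies minus the joint entropy."""
    return sum(entropy(rows, (c,)) for c in cols) - entropy(rows, cols)

def kwii(rows, cols):
    """K-way interaction information as an alternating sum over all subsets."""
    total = 0.0
    for size in range(1, len(cols) + 1):
        sign = (-1) ** (len(cols) - size)
        for subset in combinations(cols, size):
            total += sign * entropy(rows, subset)
    return -total

# Toy data: two independent binary factors and an outcome equal to their XOR,
# so that no pair of variables is informative but the triple is.
rows = [(a, b, a ^ b) for a in (0, 1) for b in (0, 1)]
print(kwii(rows, (0, 1)), kwii(rows, (0, 1, 2)), tci(rows, (0, 1, 2)))
```

For two variables this KWII reduces to mutual information, and the XOR example shows why a pure interaction is invisible to pairwise measures yet appears in the three-way term.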
