Similar articles
 20 similar articles found
1.
Statistical and learning techniques are becoming increasingly popular for different tasks in bioinformatics. Many of the most powerful statistical and learning techniques are applicable to points in a Euclidean space but not directly applicable to discrete sequences such as protein sequences. One way to apply these techniques to protein sequences is to embed the sequences into a Euclidean space and then apply these techniques to the embedded points. In this work we introduce a biologically motivated sequence embedding, the homology kernel, which takes into account intuitions from local alignment, sequence homology, and predicted secondary structure. This embedding allows us to directly apply learning techniques to protein sequences. We apply the homology kernel in several ways. We demonstrate how the homology kernel can be used for protein family classification and outperforms state-of-the-art methods for remote homology detection. We show that the homology kernel can be used for secondary structure prediction and is competitive with popular secondary structure prediction methods. Finally, we show how the homology kernel can be used to incorporate information from homologous sequences in local sequence alignment.  相似文献   

2.
Mathematical spaces are widely used in the sciences for representing quantitative and qualitative relations between objects or individuals. Phenotype spaces—spaces whose elements represent phenotypes—are frequently applied in morphometrics, evolutionary quantitative genetics, and systematics. In many applications, several quantitative measurements are taken as the orthogonal axes of a Euclidean vector space. We show that incommensurable units, geometric dependencies between measurements, and arbitrary spacing of measurements do not warrant a Euclidean geometry for phenotype spaces. Instead, we propose that most phenotype spaces have an affine structure. This has profound consequences for the meaningfulness of biological statements derived from a phenotype space, as they should be invariant relative to the transformations determining the structure of the phenotype space. Meaningful geometric relations in an affine space are incidence, linearity, parallel lines, distances along parallel lines, intermediacy, and ratios of volumes. Biological hypotheses should be phrased and tested in terms of these fundamental geometries, whereas the interpretation of angles and of phenotypic distances in different directions should be avoided. We present meaningful notions of phenotypic variance and other statistics for an affine phenotype space. Furthermore, we connect our findings to standard examples of morphospaces such as Raup’s space of coiled shells and Kendall’s shape space.  相似文献   

3.
In this paper we present modern approaches to the classification of hydrobiological samples based on various metrics of species-structure similarity: Euclidean distance, the Renkonen index, and the cosine of the angle between species-abundance vectors. We use the cophenetic correlation coefficient, Gower distance, and a Shepard-like plot to justify the choice of clustering method. To choose the optimal number of clusters, we apply approaches based on silhouette widths and on binary matrices representing partitions. An analysis of the spatial structure of zooplankton communities in the small Linda River shows that average agglomerative clustering is an optimal algorithm for objects of this type. A comparison of cluster analyses based on different similarity metrics shows that the most adequate classification is obtained with the cosine of the angle between species-abundance vectors and with the Renkonen index, whereas the classification based on Euclidean distances is less successful from the biological point of view. The approaches outlined in this paper allow researchers to make quantitative decisions about key elements of classification, greatly reducing the subjectivity of cluster analysis results.
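The three similarity metrics named in this abstract can be sketched in a few lines. This is only an illustration under stated assumptions: the abundance counts are invented, the function names are hypothetical, and the Renkonen index is taken in its usual form as the sum, over species, of the smaller of the two relative abundances.

```python
import math

def renkonen(a, b):
    """Renkonen similarity: sum over species of the smaller relative abundance."""
    total_a, total_b = sum(a), sum(b)
    return sum(min(x / total_a, y / total_b) for x, y in zip(a, b))

def cosine(a, b):
    """Cosine of the angle between two abundance vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def euclidean(a, b):
    """Ordinary Euclidean distance between two abundance vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# Hypothetical counts of three species in three samples (not data from the paper).
sample_a = [10, 20, 70]
sample_b = [100, 200, 700]  # same proportions as sample_a, ten times the density
sample_c = [70, 20, 10]     # same density as sample_a, reversed dominance

for name, other in (("b", sample_b), ("c", sample_c)):
    print(name, renkonen(sample_a, other), cosine(sample_a, other),
          euclidean(sample_a, other))
```

In this toy case the Renkonen index and the cosine treat samples a and b as identical in structure, while the Euclidean distance places a closer to c because it is dominated by total density. That is one plausible reason, not stated in the abstract, why a Euclidean classification could look less convincing biologically.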

4.
Genome sequencing and microarray technology produce ever-increasing amounts of complex data that need analysis. Visualization is an effective analytical technique that exploits the ability of the human brain to process large amounts of data. Here, we review traditional visualization methods based on clustering and tree representation, and also describe an alternative approach that involves projecting objects onto a Euclidean space in a way that reflects their structural or functional distances. Data are visualized without preclustering and can be dynamically explored by the user using ‘virtual-reality’. We illustrate this approach with two case studies from protein topology and gene expression.  相似文献   

5.
A common problem in classification is the assignment of objects to meaningful clusters given their relative positions in reduced ordination space. We examine the distinctiveness of such putative clusters. Clusters are enlarged by simulation, and the relative frequency distributions of interpoint distances in the "enlarged" clusters are then compared with the distribution of the original points in the original space. This procedure may be repeated for all reasonable clustering hypotheses until the best fit is found. We suggest that one way of looking at the distinctness of clusters is to look at the distribution of interpoint distances for various hypothetical clusters from which the data might have been sampled. The best clustering, given the data, is the one that best matches the original distribution of interpoint distances.

6.
The web can be regarded as an ecosystem of digital resources connected and shaped by the successive collective behaviors of users. Knowing how people allocate limited attention across different resources is of great importance. To answer this question, we embed the most popular Chinese web sites into a high-dimensional Euclidean space based on the open flow network model of the collective attention flows of a large number of Chinese users, which considers both the connection topology of hyperlinks between the sites and the collective behaviors of the users. With these tools, we rank the web sites and compare their centralities based on flow distances with other metrics. We also study the patterns of attention flow allocation and find that a large number of web sites concentrate in the central area of the embedding space, while only a small fraction are dispersed in the periphery. The entire embedding space can be separated into 3 regions (core, interim, and periphery). The sites in the core (1%) receive a large share of the attention flows (40%), the sites in the interim (34%) attract another 40%, and the remaining sites (65%) take only 20% of the flows. Moreover, we cluster the web sites into 4 groups according to their positions in the space and find that web sites with similar contents and topics are grouped together. In short, by incorporating the open flow network model, we can clearly see how collective attention is allocated and flows across different web sites, and how web sites are connected to each other.

7.
Mojie Duan  Minghai Li  Li Han  Shuanghong Huo 《Proteins》2014,82(10):2585-2596
Dimensionality reduction is widely used in searching for the intrinsic reaction coordinates for protein conformational changes. We find that dimensionality-reduction methods using the pairwise root-mean-square deviation (RMSD) as the local distance metric face a challenge. We use Isomap as an example to illustrate the problem. We believe that there is an implied assumption for the dimensionality-reduction approaches that aim to preserve the geometric relations between the objects: both the original space and the reduced space have the same kind of geometry, such as Euclidean geometry vs. Euclidean geometry or spherical geometry vs. spherical geometry. When the protein free energy landscape is mapped onto a 2D plane or 3D space, the reduced space is Euclidean, thus the original space should also be Euclidean. For a protein with N atoms, its conformation space is a subset of the 3N-dimensional Euclidean space R^{3N}. We formally define the protein conformation space as the quotient space of R^{3N} by the equivalence relation of rigid motions. Whether the quotient space is Euclidean or not depends on how it is parameterized. When the pairwise RMSD is employed as the local distance metric, implicit representations are used for the protein conformation space, leading to no direct correspondence to a Euclidean set. We have demonstrated that an explicit Euclidean-based representation of protein conformation space and the local distance metric associated with it improve the quality of dimensionality reduction in the tetra-peptide and β-hairpin systems. Proteins 2014; 82:2585–2596. © 2014 Wiley Periodicals, Inc.

8.
An enumeration procedure for symbol sequences is proposed. A relationship between the Hamming distance for symbol sequences and the Euclidean distance for the corresponding enumerations is established, and a more universal Hamming-transformed Euclidean measure is constructed. A distribution function of amino acid substitutions and some of its point estimators (consensus, subconsensus, sample mean, sample central moments, and asymmetry coefficient) are introduced. Hamming-transformed Euclidean measures between consensuses, subconsensuses, and sample means are calculated for ten HIV-1 taxa of gp120 V3 regions. It is demonstrated that these taxa have a complicated pattern, which is significant for their classification.
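The abstract does not spell out the enumeration procedure, so the sketch below does not reproduce it. It only illustrates one well-known link between the two distances, assuming a one-hot encoding of each position: the squared Euclidean distance between the encodings is exactly twice the Hamming distance. The two peptide strings and all function names are hypothetical.

```python
ALPHABET = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard amino acids

def hamming(s, t):
    """Number of positions at which two equal-length sequences differ."""
    if len(s) != len(t):
        raise ValueError("sequences must have equal length")
    return sum(a != b for a, b in zip(s, t))

def one_hot(seq, alphabet=ALPHABET):
    """Encode a sequence as a flat 0/1 vector, one block of 20 entries per position."""
    return [1.0 if residue == letter else 0.0
            for residue in seq for letter in alphabet]

def squared_euclidean(u, v):
    """Squared Euclidean distance between two numeric vectors."""
    return sum((x - y) ** 2 for x, y in zip(u, v))

# Two invented ten-residue fragments that differ at two positions.
seq_1 = "CTRPNNNTRK"
seq_2 = "CTRPGNNTRR"

h = hamming(seq_1, seq_2)
d2 = squared_euclidean(one_hot(seq_1), one_hot(seq_2))
print(h, d2)
```

Each mismatching position contributes two unit differences to the encoded vectors, which is where the factor of two comes from.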

9.
Object Oriented Data Analysis is a new area in statistics that studies populations of general data objects. In this article we consider populations of tree-structured objects as our focus of interest. We develop improved analysis tools for data lying in a binary tree space analogous to classical Principal Component Analysis methods in Euclidean space. Our extensions of PCA are analogs of one dimensional subspaces that best fit the data. Previous work was based on the notion of tree-lines.  相似文献   

10.
Inter-temporal decisions involve assigning values to various payoffs occurring at different temporal distances. Past research has used different approaches to study these decisions made by humans and animals, for instance by assuming that people discount future payoffs at a constant rate (e.g., exponential discounting) or at a variable rate (e.g., hyperbolic discounting). In this research, we question the widely assumed, but seldom questioned, notion across many of the existing approaches that the decision space, where the decision-maker perceives time and monetary payoffs, is a Euclidean space. By relaxing the rigid assumption of Euclidean space, we propose that the decision space is a more flexible Riemannian space of constant negative curvature. We test our proposal by deriving a discount function that uses the distance in the negative-curvature space instead of the Euclidean temporal distance. The distance function includes the perceived values of both time and money, unlike past work, which has considered time alone. By doing so we are able to explain many of the empirical findings in the inter-temporal decision-making literature. We provide converging evidence for our proposal by estimating the curvature of the decision space with a manifold learning algorithm and showing that the characteristics (i.e., metric properties) of the decision space resemble those of the negative-curvature space rather than the Euclidean space. We conclude by presenting new theoretical predictions derived from our proposal and implications for how non-normative behavior is defined.
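For readers unfamiliar with the two baseline models mentioned in this abstract, a minimal sketch follows. It uses the textbook forms V = A·exp(-k·t) and V = A/(1 + k·t); the rate values and function names are assumptions chosen for illustration, and the paper's own curvature-based discount function is not reproduced here.

```python
import math

def exponential_discount(amount, delay, rate):
    """Present value under a constant discount rate."""
    return amount * math.exp(-rate * delay)

def hyperbolic_discount(amount, delay, k):
    """Present value under hyperbolic discounting, where the effective rate falls with delay."""
    return amount / (1.0 + k * delay)

def one_step_ratio(discount, delay, param, amount=100.0):
    """Fraction of value retained when the payoff is pushed back by one more time unit."""
    return discount(amount, delay + 1, param) / discount(amount, delay, param)

# Hypothetical parameters: 10% per period for both models.
for delay in (0, 5, 20):
    print(delay,
          round(one_step_ratio(exponential_discount, delay, 0.1), 4),
          round(one_step_ratio(hyperbolic_discount, delay, 0.1), 4))
```

The exponential ratio is the same at every delay, whereas the hyperbolic ratio climbs toward 1, so an extra unit of waiting matters less the further away the payoff already is.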

11.
The four-dimensional spherical emotional space was constructed by multidimensional scaling of visually perceived differences between emotional expressions of schematic faces. In this spherical model, Euclidean distances between the points representing the schematic faces are directly proportional to the perceived differences between emotional expressions. Three angles of the four-dimensional sphere correspond to specific characteristics of emotions: emotional modality (joy, fear, anger, etc.), intensity of emotions, and emotional fullness (saturation). At the same time, Cartesian coordinates represent excitations in the neuronal channels encoding line orientations. It was shown that the structure of the emotional space is similar to the structure of color space, i.e., emotional modality corresponds to color hue, emotional intensity to brightness, and emotional fullness to color saturation. The evidence obtained suggests common mechanisms of information coding in the visual system.

12.

Background  

An algorithm is presented to compute a multiple structure alignment for a set of proteins and to generate a consensus (pseudo) protein which captures common substructures present in the given proteins. The algorithm represents each protein as a sequence of triples of coordinates of the alpha-carbon atoms along the backbone. It then computes iteratively a sequence of transformation matrices (i.e., translations and rotations) to align the proteins in space and generate the consensus. The algorithm is a heuristic in that it computes an approximation to the optimal alignment that minimizes the sum of the pairwise distances between the consensus and the transformed proteins.  相似文献   

13.
The minimal folding pathway or trajectory for a biopolymer can be defined as the transformation that minimizes the total distance traveled between a folded and an unfolded structure. This involves generalizing the usual Euclidean distance from points to one-dimensional objects such as a polymer. We apply this distance here to find minimal folding pathways for several candidate protein fragments, including the helix, the β-hairpin, and a nonplanar structure where chain noncrossing is important. Comparing the distances traveled with root mean-squared distance and mean root-squared distance, we show that chain noncrossing can have large effects on the kinetic proximity of apparently similar conformations. Structures that are aligned to the β-hairpin by minimizing mean root-squared distance, a quantity that closely approximates the true distance for long chains, show globally different orientation than structures aligned by minimizing root mean-squared distance.  相似文献   

14.
MOTIVATION: Alignment-free metrics were recently reviewed by the authors but have not, until now, been the object of a comparative study. This paper compares the classification accuracy of the word composition metrics reviewed there. It also presents a new distance definition between protein sequences, the W-metric, which bridges between alignment metrics, such as scores produced by the Smith-Waterman algorithm, and methods based solely on L-tuple composition, such as Euclidean distance and Information content. RESULTS: The comparative study reported here used the SCOP/ASTRAL protein structure hierarchical database and assessed the discriminant value of alternative sequence dissimilarity measures by calculating areas under the Receiver Operating Characteristic curves. Although alignment methods resulted in very good classification accuracy at family and superfamily levels, alignment-free distances, in particular Standard Euclidean Distance, are as good as alignment algorithms when sequence similarity is smaller, such as for recognition of fold or class relationships. This observation justifies their use for pre-filtering homologous proteins, since word-statistics techniques are computed much faster than alignment methods. AVAILABILITY: All MATLAB code used to generate the data is available upon request to the authors. Additional material is available at http://bioinformatics.musc.edu/wmetric
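A distance based solely on L-tuple composition can be sketched as follows. This is the plain Euclidean distance between overlapping word counts, written in Python rather than the authors' MATLAB; it is neither the Standard Euclidean Distance (which additionally rescales each word count) nor the W-metric, and the function names and toy sequences are hypothetical.

```python
import math
from collections import Counter

def tuple_counts(seq, L):
    """Count every overlapping word of length L in the sequence."""
    return Counter(seq[i:i + L] for i in range(len(seq) - L + 1))

def composition_distance(s, t, L=2):
    """Euclidean distance between the L-tuple count vectors of two sequences."""
    counts_s, counts_t = tuple_counts(s, L), tuple_counts(t, L)
    words = set(counts_s) | set(counts_t)
    # A Counter returns 0 for words it has never seen, so absent words need no special case.
    return math.sqrt(sum((counts_s[w] - counts_t[w]) ** 2 for w in words))

# Invented toy sequences; no alignment is computed at any point.
print(composition_distance("AAAA", "AAAB", L=2))
print(composition_distance("MKVLAAGIV", "MKVLSAGIV", L=2))
```

Because only word counts are compared, the cost grows linearly with sequence length, which is consistent with the abstract's point that word statistics are much cheaper than alignment.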

15.
Single-cell Hi-C (scHi-C) sequencing technologies allow us to investigate three-dimensional chromatin organization at the single-cell level. However, we still need computational tools to deal with the sparsity of the contact maps from single cells and to embed single cells in a lower-dimensional Euclidean space. This embedding helps us understand relationships between the cells in different dimensions, such as cell-cycle dynamics and cell differentiation. We present an open-source computational toolbox, scHiCTools, for analyzing single-cell Hi-C data comprehensively and efficiently. The toolbox provides two methods for screening single cells, three common methods for smoothing scHi-C data, three efficient methods for calculating the pairwise similarity of cells, three methods for embedding single cells, three methods for clustering cells, and a built-in function to visualize the cell embedding in a two-dimensional or three-dimensional plot. scHiCTools, written in Python3, is compatible with different platforms, including Linux, macOS, and Windows.

16.
Ordination is a powerful method for analysing complex data sets but has been largely ignored in sequence analysis. This paper shows how to use principal coordinates analysis to find low-dimensional representations of distance matrices derived from aligned sets of sequences. The method takes a matrix of Euclidean distances between all pairs of sequences and finds a coordinate space where the distances are exactly preserved. The main problem is to find a measure of distance between aligned sequences that is Euclidean. The simplest distance function is the square root of the percentage difference (as measured by identities) between two sequences, where one ignores any positions in the alignment where there is a gap in any sequence. If one does not ignore positions with a gap, the distances cannot be guaranteed to be Euclidean but the deleterious effects are trivial. Two examples of using the method are shown. A set of 226 aligned globins were analysed and the resulting ordination very successfully represents the known patterns of relationship between the sequences. In the other example, a set of 610 aligned 5S rRNA sequences were analysed. Sequence ordinations complement phylogenetic analyses. They should not be viewed as a complete alternative.
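The distance function described in this abstract can be sketched directly. The code below computes the square root of the fraction of differing positions over the columns in which no sequence has a gap; using a fraction instead of a percentage only rescales every distance by the same constant. The alignment, the gap symbol, and the function names are assumptions, and the principal coordinates step itself (an eigendecomposition of the centred distance matrix) is left out.

```python
import math

def gapless_columns(alignment, gap="-"):
    """Indices of alignment columns in which no sequence has a gap."""
    return [i for i in range(len(alignment[0]))
            if all(seq[i] != gap for seq in alignment)]

def sqrt_difference(a, b, columns):
    """Square root of the fraction of the given columns at which a and b differ."""
    mismatches = sum(a[i] != b[i] for i in columns)
    return math.sqrt(mismatches / len(columns))

def distance_matrix(alignment):
    """Pairwise distances over the shared gap-free columns of the alignment."""
    columns = gapless_columns(alignment)
    return [[sqrt_difference(a, b, columns) for b in alignment] for a in alignment]

# Invented three-sequence alignment; columns 4 and 6 contain a gap and are ignored.
alignment = [
    "ACDEFG-H",
    "ACDQFGWH",
    "SCDQ-GWH",
]
for row in distance_matrix(alignment):
    print([round(d, 3) for d in row])
```

The resulting matrix is the kind of input that a principal coordinates routine from a numerical library would take.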

17.
An algorithm is presented to compute a multiple structure alignment for a set of proteins and to generate a consensus (pseudo) protein for the set. The algorithm is a heuristic in that it computes an approximation to the optimal multiple structure alignment that minimizes the sum of the pairwise distances between the protein structures. The algorithm chooses an input protein as the initial consensus and computes a correspondence between the protein structures (which are represented as sets of unit vectors) using an approach analogous to the center-star method for multiple sequence alignment. From this correspondence, a set of rotation matrices (optimal for the given correspondence) is derived to align the structures and derive the new consensus. The process is iterated until the sum of pairwise distances converges. The computation of the optimal rotations is itself an iterative process that both makes use of the current consensus and generates simultaneously a new one. This approach is based on an interesting result that allows the sum of all pairwise distances to be represented compactly as distances to the consensus. Experimental results on several protein families are presented, showing that the algorithm converges quite rapidly.  相似文献   

18.
MOTIVATION: Biological objects tend to cluster into discrete groups. Objects within a group typically possess similar properties. It is important to have fast and efficient tools for grouping objects that result in biologically meaningful clusters. Protein sequences reflect biological diversity and offer an extraordinary variety of objects for polishing clustering strategies. Grouping of sequences should reflect their evolutionary history and their functional properties. Visualization of relationships between sequences is of no less importance. Tree-building methods are typically used for such visualization. An alternative concept for visualization is a multidimensional sequence space. In this space, proteins are defined as points and distances between the points reflect the relationships between the proteins. Such a space can also be a basis for model-based clustering strategies that typically produce results correlating better with biological properties of proteins. RESULTS: We developed an approach to classification of biological objects that combines evolutionary measures of their similarity with a model-based clustering procedure. We apply the methodology to amino acid sequences. In the first step, given a multiple sequence alignment, we estimate evolutionary distances between proteins measured in expected numbers of amino acid substitutions per site. These distances are additive and are suitable for evolutionary tree reconstruction. In the second step, we find the best-fit approximation of the evolutionary distances by Euclidean distances and thus represent each protein by a point in a multidimensional space. The Euclidean space may be projected into two or three dimensions, and the projections can be used to visualize relationships between proteins. In the third step, we find a non-parametric estimate of the probability density of the points and cluster the points that belong to the same local maximum of this density into a group. The number of groups is controlled by a sigma-parameter that determines the shape of the density estimate and the number of maxima in it. The grouping procedure outperforms commonly used methods such as UPGMA and single linkage clustering.

19.
A consensus index method comprises a consensus method and a consensus index that are defined on a common set of objects (e.g. classifications). For each profile of objects, the consensus method returns a consensus object representing information or structure shared among profile objects, while the consensus index returns a quantitative measure of agreement among profile objects. Since the relationship between consensus method and consensus index is poorly understood, we propose simple axioms prescribing it in the most general terms. Many taxonomic consensus index methods violate these axioms because their consensus indices measure consensus object invariants rather than profile agreement. We propose paradigms to obtain consensus index methods that measure agreement and satisfy the axioms. These paradigms salvage concepts underlying consensus index methods violating the axioms. This work was supported in part by the Faculty of Science at Memorial University of Newfoundland, and by the Natural Sciences and Engineering Research Council of Canada under Grant A-4142.

20.
Species dispersal studies provide valuable information in biological research. Restricted dispersal may give rise to a non-random distribution of genotypes in space. Detection of spatial genetic structure may therefore provide valuable insight into dispersal. Spatial structure has been treated via autocorrelation analysis with several univariate statistics whose results can depend on the sampling design. New geostatistical approaches (variogram-based analysis) have been proposed to overcome this problem. However, modelling parametric variograms can be difficult in practice. We introduce a non-parametric variogram-based method for autocorrelation analysis between DNA samples that have been genotyped by means of multilocus-multiallele molecular markers. The method addresses two important aspects of fine-scale spatial genetic analyses: the identification of a non-random distribution of genotypes in space, and the estimation of the magnitude of any non-random structure. The method uses a plot of the squared Euclidean genetic distances vs. spatial distances between pairs of DNA samples as an empirical variogram. The underlying spatial trend in the plot is fitted by non-parametric smoothing (LOESS, local regression). Finally, the predicted LOESS values are explained by segmented regressions (SR) to obtain classical spatial values such as the extent of autocorrelation. For illustration we use multivariate and single-locus genetic distances calculated from a microsatellite data set for which autocorrelation was previously reported. The LOESS/SR method produced a good fit, yielding an autocorrelation value similar to the published one for these data. The fit by LOESS/SR was simpler to obtain than the parametric analysis, since initial parameter values are not required during the trend estimation process. The LOESS/SR method offers a new alternative for spatial analysis.
