Similar literature (20 results)
1.
In statistical mechanics, the equilibrium properties of a physical system of particles can be calculated as the statistical average over the accessible microstates of the system. In general, these calculations are computationally intractable because they involve summations over an exponentially large number of microstates. Clustering algorithms are one of the methods used to approximate these sums numerically. The most basic clustering algorithms first subdivide the system into a set of smaller subsets (clusters). Interactions between particles within each cluster are then treated exactly, while all interactions between different clusters are ignored. These smaller clusters have far fewer microstates, which makes the summation over them tractable. These algorithms have previously been used for biomolecular computations but remain relatively unexplored in this context. Presented here is a theoretical analysis of the error and computational complexity of the two most basic clustering algorithms previously applied in the context of biomolecular electrostatics. We derive a tight, computationally inexpensive error bound for the equilibrium state of a particle computed via these clustering algorithms. For some practical applications, the root mean square error, which can be significantly lower than the error bound, may be more important. We show that there is a strong empirical relationship between the error bound and the root mean square error, suggesting that the error bound could be used as a computationally inexpensive metric for predicting the accuracy of clustering algorithms in practical applications. An example of error analysis for such an application (computation of the average charge of ionizable amino acids in proteins) is given, demonstrating that the clustering algorithm can be accurate enough for practical purposes.
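
The basic scheme in this abstract can be illustrated with a toy model. The sketch below is not the authors' code. It assumes binary sites (for example, protonated or not), hypothetical site energies `h` and pair couplings `J`, and energies expressed in units of kT. It compares the exact Boltzmann average over all 2^n microstates with the clustered approximation, in which couplings between clusters are simply dropped.

```python
import itertools
import math

def average_occupancy(h, J, sites):
    """Exact Boltzmann-averaged occupancy of each site in `sites`.

    h[i] is the energy of occupying site i and J[i][j] the pair coupling
    (both in units of kT). Only couplings among `sites` are included.
    """
    weights_sum = 0.0
    occupancy = {i: 0.0 for i in sites}
    for state in itertools.product((0, 1), repeat=len(sites)):
        x = dict(zip(sites, state))
        energy = sum(h[i] * x[i] for i in sites)
        energy += sum(J[i][j] * x[i] * x[j]
                      for a, i in enumerate(sites) for j in sites[a + 1:])
        w = math.exp(-energy)
        weights_sum += w
        for i in sites:
            occupancy[i] += w * x[i]
    return {i: occupancy[i] / weights_sum for i in sites}

def clustered_occupancy(h, J, clusters):
    """Approximation: treat each cluster exactly, ignore inter-cluster couplings."""
    result = {}
    for cluster in clusters:
        result.update(average_occupancy(h, J, cluster))
    return result

# Hypothetical 4-site system: strong couplings inside {0, 1} and {2, 3},
# weak couplings between the two groups.
h = [-1.0, 0.5, 0.2, -0.3]
J = [[0.0, 2.0, 0.1, 0.0],
     [2.0, 0.0, 0.0, 0.1],
     [0.1, 0.0, 0.0, 1.5],
     [0.0, 0.1, 1.5, 0.0]]

exact = average_occupancy(h, J, [0, 1, 2, 3])
approx = clustered_occupancy(h, J, [[0, 1], [2, 3]])
max_error = max(abs(exact[i] - approx[i]) for i in range(4))
```

The exact sum visits 2^4 = 16 states, while the clustered version visits 2 × 2^2 = 8. The gap widens exponentially with system size, and the price is the error introduced by the ignored couplings, which `max_error` measures for this toy case.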

2.
Previous gene expression profiling studies have sought to identify groups of genes that characterize colorectal carcinoma. Despite their success in identifying groups of genes involved in the progression of the disease, these methods either require subjective interpretation of the number of clusters or lack stability across different runs of the algorithms, both of which limit their usefulness. In this study, we propose an enhanced algorithm that provides stability and robustness in identifying differentially expressed genes in an expression profile analysis. Our proposed algorithm uses multiple clustering algorithms under the consensus clustering framework. The experimental results show that our method robustly produces a consistent cluster structure, similar to the structure found in the previous study. Furthermore, our algorithm outperforms any single clustering algorithm in terms of the cluster quality score.
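
The abstract does not spell out how the base clusterings are combined, so the following is only a minimal sketch of one common consensus idea, the co-association matrix. It assumes each base algorithm returns one label per gene. The function names, the 0.5 threshold, and the example labelings are all hypothetical.

```python
def co_association(labelings):
    """Fraction of labelings in which each pair of items shares a cluster."""
    n = len(labelings[0])
    counts = [[0.0] * n for _ in range(n)]
    for labels in labelings:
        for i in range(n):
            for j in range(n):
                if labels[i] == labels[j]:
                    counts[i][j] += 1.0
    return [[c / len(labelings) for c in row] for row in counts]

def consensus_clusters(labelings, threshold=0.5):
    """Group items linked by co-association above `threshold` (connected components)."""
    matrix = co_association(labelings)
    n = len(matrix)
    assigned = [None] * n
    next_label = 0
    for start in range(n):
        if assigned[start] is not None:
            continue
        stack = [start]
        assigned[start] = next_label
        while stack:
            i = stack.pop()
            for j in range(n):
                if assigned[j] is None and matrix[i][j] > threshold:
                    assigned[j] = next_label
                    stack.append(j)
        next_label += 1
    return assigned

# Three hypothetical base clusterings of six genes; label values are arbitrary.
runs = [
    [0, 0, 0, 1, 1, 1],
    [1, 1, 0, 0, 0, 0],   # this run places gene 2 with the second group
    [2, 2, 2, 5, 5, 5],
]
consensus = consensus_clusters(runs)   # -> [0, 0, 0, 1, 1, 1]
```

Because only pairwise agreement is counted, the arbitrary label values used by each run do not matter, and a single dissenting run is outvoted.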

3.
With improvements in computer speed and algorithm efficiency, MD simulations are sampling ever larger numbers of molecular and biomolecular conformations. Being able to sift these conformations qualitatively and quantitatively into meaningful groups is a difficult and important task, especially when considering the structure-activity paradigm. Here we present a study that combines two popular techniques, principal component (PC) analysis and clustering, for revealing major conformational changes that occur in molecular dynamics (MD) simulations. Specifically, we explored how clustering different PC subspaces affects the resulting clusters compared with clustering the complete trajectory data. As a case example, we used the trajectory data from an explicitly solvated simulation of a bacterial L11·23S ribosomal subdomain, which is a target of thiopeptide antibiotics. Clustering was performed, using K-means and average-linkage algorithms, on data involving the first two to the first five PC subspace dimensions. For the average-linkage algorithm we found that data-point membership, cluster shape, and cluster size depended on the selected PC subspace data. In contrast, K-means provided very consistent results regardless of the selected subspace. Since we present results on a single model system, generalization concerning the clustering of different PC subspaces of other molecular systems is currently premature. However, our hope is that this study illustrates a) the complexities in selecting the appropriate clustering algorithm, b) the complexities in interpreting and validating the results, and c) that combining PC analysis with subsequent clustering can yield valuable dynamic and conformational information.

4.
We assess the progress in biomolecular modeling and simulation, focusing on structure prediction and dynamics, by presenting the field’s history, metrics for its rise in popularity, early expressed expectations, and current significant applications. The increases in computational power combined with improvements in algorithms and force fields have led to considerable success, especially in protein folding, specificity of ligand/biomolecule interactions, and interpretation of complex experimental phenomena (e.g. NMR relaxation, protein-folding kinetics and multiple conformational states) through the generation of structural hypotheses and pathway mechanisms. Although far from a general automated tool, structure prediction is notable for proteins and RNA that preceded the experiment, especially by knowledge-based approaches. Thus, despite early unrealistic expectations and the realization that computer technology alone will not quickly bridge the gap between experimental and theoretical time frames, ongoing improvements to enhance the accuracy and scope of modeling and simulation are propelling the field onto a productive trajectory to become a full partner with experiment and a field in its own right.

5.
A multiscale simulation method for protein folding is proposed. It uses an atomic representation of protein and solvent and combines genetic algorithms, which determine the key protein structures from a global view, with molecular dynamics simulations, which reveal the local folding pathways, thus providing an integrated landscape of protein folding. The method is found to be superior to previously investigated global search algorithms or dynamics simulations alone. For secondary structure formation of a selected peptide, RN24, the structures and dynamics produced by this method agree well with the corresponding experimental results. The three most populated conformations observed are the hairpin, β-sheet and α-helix. The energetic barriers separating these three structures are comparable to the kinetic energy of the atoms of the peptide, implying that the transition between these states can be easily triggered by kinetic perturbations, mainly through electrostatic interactions between charged atoms. Transitions between α-helix and β-sheet must cross at least two energy barriers and may become trapped in the energetic well of the hairpin. It is proposed that the structure of proteins is jointly governed by thermodynamic and dynamic factors; free energy is not the sole determinant of protein stability.

6.
A class of novel explicit analytic solutions for a system of n+1 coupled partial differential equations governing biomolecular mass transfer and reaction in living organisms is proposed, evaluated, and analyzed. The solution process uses Laplace and Hankel transforms and results in a recursive convolution of an exponentially scaled Gaussian with modified Bessel functions. The solution is developed for a wide range of biomolecular binding kinetics, from pure diffusion to multiple binding reactions. The proposed approach provides solutions for both Dirac and Gaussian laser beam (or fluorescence-labeled biomacromolecule) profiles during the course of a Fluorescence Recovery After Photobleaching (FRAP) experiment. We demonstrate that previous models are simplified forms of our theory for special cases. Model analysis indicates that at the early stages of the transport process, biomolecular dynamics is governed by pure diffusion. At large times, the dominant mass transfer process is effective diffusion. Analysis of the sensitivity equations, derived analytically and verified by finite difference differentiation, indicates that experimental biologists should use the full space-time profile (instead of the averaged time series) obtained at the early stages of the fluorescence microscopy experiments to extract meaningful physiological information from the protocol. Such a small time frame requires improved bioinstrumentation relative to that in use today. Our mathematical analysis highlights several limitations of the FRAP protocol and provides strategies to improve it. The proposed model can be used to study biomolecular dynamics in molecular biology, targeted drug delivery in normal and cancerous tissues, motor-driven axonal transport in normal and abnormal nervous systems, and kinetics of diffusion-controlled reactions between enzyme and substrate, and to validate numerical simulators of biological mass transport processes in vivo.

7.
Taxonomy-independent analysis plays an essential role in microbial community analysis. Hierarchical clustering is one of the most widely employed approaches to finding operational taxonomic units, the basis for many downstream analyses. Most existing algorithms have quadratic space and computational complexities, and thus can be used only for small or medium-scale problems. We propose a new online learning-based algorithm that simultaneously addresses the space and computational issues of prior work. The basic idea is to partition a sequence space into a set of subspaces using a partition tree constructed using a pseudometric, then recursively refine a clustering structure in these subspaces. The technique relies on new methods for fast closest-pair searching and efficient dynamic insertion and deletion of tree nodes. To avoid exhaustive computation of pairwise distances between clusters, we represent each cluster of sequences as a probabilistic sequence, and define a set of operations to align these probabilistic sequences and compute genetic distances between them. We present analyses of space and computational complexity, and demonstrate the effectiveness of our new algorithm using a human gut microbiota data set with over one million sequences. The new algorithm exhibits a quasilinear time and space complexity comparable to greedy heuristic clustering algorithms, while achieving a similar accuracy to the standard hierarchical clustering algorithm.

8.
In recent years, cryo-electron microscopy (cryo-EM) has established itself as a key method in structural biology, permitting the structural characterization of large biomolecular complexes in various functional states. The data obtained through single-particle cryo-EM has recently seen a leap in resolution thanks to landmark advances in experimental and computational techniques, resulting in sub-nanometer resolution structures being obtained routinely. The remaining gap between these data and revealing the mechanisms of molecular function can be closed through hybrid modeling tools that incorporate known atomic structures into the cryo-EM data. One such tool, molecular dynamics flexible fitting (MDFF), uses molecular dynamics simulations to combine structures from X-ray crystallography with cryo-EM density maps to derive atomic models of large biomolecular complexes. The structures furnished by MDFF can be used subsequently in computational investigations aimed at revealing the dynamics of the complexes under study. In the present work, recent applications of MDFF are presented, including the interpretation of cryo-EM data of the ribosome at different stages of translation and the structure of a membrane-curvature-inducing photosynthetic complex.

9.
Successful clustering algorithms are highly dependent on parameter settings. The clustering performance degrades significantly unless parameters are properly set, and yet it is difficult to set these parameters a priori. To address this issue, we propose a unique splitting-while-merging clustering framework, named “splitting merging awareness tactics” (SMART), which does not require any a priori knowledge of either the number of clusters or even the possible range of this number. Unlike existing self-splitting algorithms, which over-cluster the dataset into a large number of clusters and then merge some similar clusters, our framework can split and merge clusters automatically during the process and produces the most reliable clustering results by intrinsically integrating many clustering techniques and tasks. The SMART framework is implemented with two distinct clustering paradigms in two algorithms: competitive learning and finite mixture model. Nevertheless, within the proposed SMART framework, many other algorithms can be derived for different clustering paradigms. The minimum message length algorithm is integrated into the framework as the clustering selection criterion. The usefulness of the SMART framework and its algorithms is tested on demonstration datasets and simulated gene expression datasets. Moreover, two real microarray gene expression datasets are studied using this approach. Across many performance metrics, all numerical results show that SMART is superior to the existing self-splitting algorithms and traditional algorithms it was compared with. The three main properties of the proposed SMART framework are: (1) it needs no parameters dependent on the respective dataset and no a priori knowledge about the datasets, (2) it is extendible to many different applications, and (3) it offers superior performance compared with counterpart algorithms.

10.
Validating clustering for gene expression data (cited 24 times; self-citations: 0; cited by others: 24)
MOTIVATION: Many clustering algorithms have been proposed for the analysis of gene expression data, but little guidance is available to help choose among them. We provide a systematic framework for assessing the results of clustering algorithms. Clustering algorithms attempt to partition the genes into groups exhibiting similar patterns of variation in expression level. Our methodology is to apply a clustering algorithm to the data from all but one experimental condition. The remaining condition is used to assess the predictive power of the resulting clusters: meaningful clusters should exhibit less variation in the remaining condition than clusters formed by chance. RESULTS: We successfully applied our methodology to compare six clustering algorithms on four gene expression data sets. We found our quantitative measures of cluster quality to be positively correlated with external standards of cluster quality.
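
The leave-one-condition-out idea in this abstract can be sketched as follows. This is an illustration, not the authors' implementation. The abstract only says that good clusters show "less variation" in the held-out condition, so the root-mean-square deviation from the cluster mean used here is an assumption. The toy clusterers and the data are hypothetical stand-ins.

```python
def leave_one_out_score(data, cluster_fn, left_out):
    """Within-cluster spread in the left-out condition (lower is better).

    data: list of gene rows, each a list of expression values per condition.
    cluster_fn: maps a list of reduced rows to a list of cluster labels.
    """
    # Cluster using every condition except the one held out.
    reduced = [row[:left_out] + row[left_out + 1:] for row in data]
    labels = cluster_fn(reduced)
    # Collect the held-out values of the genes in each cluster.
    groups = {}
    for label, row in zip(labels, data):
        groups.setdefault(label, []).append(row[left_out])
    squared = 0.0
    for values in groups.values():
        mean = sum(values) / len(values)
        squared += sum((v - mean) ** 2 for v in values)
    return (squared / len(data)) ** 0.5

def sign_clusterer(rows):
    """Toy stand-in for a real algorithm: split genes by the sign of their mean."""
    return [int(sum(r) / len(r) > 0) for r in rows]

def single_cluster(rows):
    """Baseline that puts every gene in one cluster."""
    return [0] * len(rows)

# Hypothetical data: 4 genes x 3 conditions, two clear expression patterns.
data = [[2.0, 2.1, 1.9], [1.8, 2.0, 2.2], [-2.0, -1.9, -2.1], [-2.2, -2.0, -1.8]]
informative = sum(leave_one_out_score(data, sign_clusterer, c) for c in range(3))
baseline = sum(leave_one_out_score(data, single_cluster, c) for c in range(3))
```

Summing the score over all held-out conditions gives one number per algorithm. Here `informative` comes out smaller than `baseline`, which is the sense in which a meaningful clustering predicts the unseen condition.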

11.
In recent years, significant effort has been devoted to predicting protein functions from protein interaction data generated by high-throughput techniques. However, predicting protein functions correctly and reliably remains a challenge. Recently, many computational methods have been proposed for predicting protein functions. Among these methods, clustering-based methods are the most promising. The existing methods, however, mainly focus on protein relationship modeling and on prediction algorithms that statically predict functions from the clusters related to the unannotated proteins. In fact, clustering itself is a dynamic process, and function prediction should take this dynamic feature into consideration. Unfortunately, this dynamic feature of clustering is ignored in the existing prediction methods. In this paper, we propose an innovative progressive clustering-based prediction method that traces the functions of relevant annotated proteins across all clusters generated through the progressive clustering of proteins. A set of prediction criteria is proposed to predict functions of unannotated proteins from all relevant clusters and traced functions. The method was evaluated on real protein interaction datasets, and the results demonstrated the effectiveness of the proposed method compared with representative existing methods.

12.
Biophysical Journal, 2021, 120(22): 4944-4954
E-cadherins play a critical role in the formation of cell-cell adhesions for several physiological functions, including tissue development, repair, and homeostasis. The formation of clusters of E-cadherins involves extracellular adhesive (trans-) and lateral (cis-) associations between E-cadherin ectodomains and stabilization through intracellular binding to the actomyosin cytoskeleton. This binding provides force to the adhesion and is required for mechanotransduction. However, the exact role of cytoskeletal force on the clustering of E-cadherins is not well understood. To gain insights into this mechanism, we developed a computational model based on Brownian dynamics. In the model, E-cadherins transit between structural and functional states; they are able to bind and unbind other E-cadherins on the same and/or opposite cell(s) through trans- and cis-interactions while also creating dynamic links with the actomyosin cytoskeleton. Our results show that actomyosin force governs the fraction of E-cadherins in clusters and the size and number of clusters. For low forces (below 10 pN), a large number of small E-cadherin clusters form with fewer than five E-cadherins each. At higher forces, the probability of forming fewer but larger clusters increases. These findings support the idea that force reinforces cell-cell adhesions, which is consistent with differences in cluster size previously observed between apical and lateral junctions of epithelial tissues.

13.
Inferring the structure of populations has many applications for genetic research. In addition to providing information for evolutionary studies, it can be used to account for the bias induced by population stratification in association studies. To this end, many algorithms have been proposed to cluster individuals into genetically homogeneous sub-populations. Parametric algorithms such as Structure are very popular, but their underlying complexity and high computational cost led to the development of faster parametric alternatives such as Admixture. Non-parametric approaches are an alternative to these methods. In this category, AWclust has proven efficient but fails to properly identify population structure for complex datasets. We present in this article a new clustering algorithm called Spectral Hierarchical clustering for the Inference of Population Structure (SHIPS), based on a divisive hierarchical clustering strategy, allowing a progressive investigation of population structure. This method takes genetic data as input to cluster individuals into homogeneous sub-populations and uses the gap statistic to estimate the optimal number of such sub-populations. SHIPS was applied to a set of simulated discrete and admixed datasets and to real SNP datasets, namely data from the HapMap and Pan-Asian SNP consortia. The programs Structure, Admixture, AWclust and PCAclust were also investigated in a comparison study. SHIPS and the parametric approach Structure were the most accurate when applied to simulated datasets, both in terms of individual assignments and estimation of the correct number of clusters. The analysis of the results on the real datasets highlighted that the clusterings of SHIPS were the most consistent with the population labels or those produced by the Admixture program. The performance of SHIPS when applied to SNP data, along with its relatively low computational cost and its ease of use, makes this method a promising solution for inferring fine-scale genetic patterns.

14.
One of the most intriguing dynamics in biological systems is the emergence of clustering, in the sense that individuals self-organize into separate agglomerations in physical or behavioral space. Several theories have been developed to explain clustering in, for instance, multi-cellular organisms, ant colonies, bee hives, flocks of birds, schools of fish, and animal herds. A persistent puzzle, however, is the clustering of opinions in human populations, particularly when opinions vary continuously, such as the degree to which citizens are in favor of or against a vaccination program. Existing continuous opinion formation models predict "monoculture" in the long run, unless subsets of the population are perfectly separated from each other. Yet, social diversity is a robust empirical phenomenon, although perfect separation is hardly possible in an increasingly connected world. Considering randomness has not overcome the theoretical shortcomings so far. Small perturbations of individual opinions trigger social influence cascades that inevitably lead to monoculture, while larger noise disrupts opinion clusters and results in rampant individualism without any social structure. Our solution to the puzzle builds on recent empirical research, combining the integrative tendencies of social influence with the disintegrative effects of individualization. A key element of the new computational model is an adaptive kind of noise. We conduct computer simulation experiments demonstrating that with this kind of noise a third phase besides individualism and monoculture becomes possible, characterized by the formation of metastable clusters with diversity between and consensus within clusters. When clusters are small, individualization tendencies are too weak to prohibit a fusion of clusters. When clusters grow too large, however, individualization increases in strength, which promotes their splitting. In summary, the new model can explain cultural clustering in human societies. 
Strikingly, model predictions are not only robust to "noise"; randomness is actually the central mechanism that sustains pluralism and clustering.

15.
Although many numerical clustering algorithms have been applied to gene expression data analysis, the essential step is still biological interpretation by manual inspection. The correlation between genetic co-regulation and affiliation to a common biological process is what biologists expect. Here, we introduce some clustering algorithms that are based on a graph structure constituted by biological knowledge. After applying a widely used dataset, we compared the resulting clusters of two of these algorithms in terms of the homogeneity of clusters, coherence of annotation, and matching ratio. The results show that the clusters of knowledge-guided analysis are the kernel parts of the clusters of the Gene Ontology (GO)-Cluster software, which contain the genes that are most correlated in expression and most consistent with biological functions. Moreover, knowledge-guided analysis seems much more applicable than GO-Cluster in a larger dataset.

16.
In this paper, we present a novel approach, Bio-IEDM (biomedical information extraction and data mining), that integrates text mining and predictive modeling to analyze biomolecular networks from biomedical literature databases. Our method consists of two phases. In phase 1, we discuss an efficient semisupervised learning approach that automatically extracts biological relationships, such as protein-protein and protein-gene interactions, from the biomedical literature databases to construct the biomolecular network. Our method automatically learns patterns from a few user-supplied seed tuples and then extracts new tuples from the biomedical literature based on the discovered patterns. The derived biomolecular network forms a large scale-free network graph. In phase 2, we present a novel clustering algorithm that analyzes the biomolecular network graph to identify biologically meaningful subnetworks (communities). The clustering algorithm considers the characteristics of scale-free network graphs and is based on the local density of a vertex and its neighborhood functions, which can be used to find more meaningful clusters at different density levels. The experimental results indicate that our approach is very effective in extracting biological knowledge from a huge collection of biomedical literature. The integration of data mining and information extraction provides a promising direction for analyzing biomolecular networks.

17.
Finding subtypes of heterogeneous diseases is the biggest challenge in the area of biology. Often, clustering is used to provide a hypothesis for the subtypes of a heterogeneous disease. However, there are usually discrepancies between the clusterings produced by different algorithms. This work introduces a simple method which provides the most consistent clusters across three different clustering algorithms for a melanoma and a breast cancer data set. The method is validated by showing that the Silhouette, Dunn's and Davies-Bouldin's cluster validation indices are better for the proposed algorithm than those obtained by k-means and another consensus clustering algorithm. The hypotheses of the consensus clusters on both data sets are corroborated by clear genetic markers and 100 percent classification accuracy. In Bittner et al.'s melanoma data set, a previously hypothesized primary cluster is recognized as the largest consensus cluster and a new partition of this cluster into two subclusters is proposed. In van't Veer et al.'s breast cancer data set, the previously proposed "basal" and "luminal A" subtypes are clearly recognized as the two predominant clusters. Furthermore, a new hypothesis is provided about the existence of two subgroups within the "basal" subtype in this data set. The clusters of van't Veer's data set are also validated by the high classification accuracy obtained on the data set of van de Vijver et al.

18.
The biomolecules in and around a living cell – proteins, nucleic acids, lipids and carbohydrates – continuously sample myriad conformational states that are thermally accessible at physiological temperatures. Simultaneously, a given biomolecule also samples (and is sampled by) a rapidly fluctuating local environment comprising other biopolymers, small molecules, water, ions, etc. that diffuse to within a few nanometres, leading to inter-molecular contacts that stitch together large supramolecular assemblies. Indeed, all biological systems can be viewed as dynamic networks of molecular interactions. As a complement to experimentation, molecular simulation offers a uniquely powerful approach to analyse biomolecular structure, mechanism and dynamics; this is possible because the molecular contacts that define a complicated biomolecular system are governed by the same physical principles (forces and energetics) that characterise individual small molecules, and these simpler systems are relatively well-understood. With modern algorithms and computing capabilities, simulations are now an indispensable tool for examining biomolecular assemblies in atomic detail, from the conformational motion in an individual protein to the diffusional dynamics and inter-molecular collisions in the early stages of formation of cellular-scale assemblies such as the ribosome. This text introduces the physicochemical foundations of molecular simulations and docking, largely from the perspective of biomolecular interactions.

19.
EXCAVATOR: a computer program for efficiently mining gene expression data (cited 1 time; self-citations: 0; cited by others: 1)
Xu D, Olman V, Wang L, Xu Y. Nucleic Acids Research, 2003, 31(19): 5582-5589
Massive amounts of gene expression data are generated using microarrays for functional studies of genes, and gene expression data clustering is a useful tool for studying the functional relationships among genes in a biological process. We have developed a computer package, EXCAVATOR, for clustering gene expression profiles based on our new framework for representing gene expression data as a minimum spanning tree. EXCAVATOR uses a number of rigorous and efficient clustering algorithms. This program has a number of unique features, including capabilities for: (i) data-constrained clustering; (ii) identification of genes with similar expression profiles to pre-specified seed genes; (iii) cluster identification from a noisy background; (iv) computational comparison between different clustering results of the same data set. EXCAVATOR can be run from a Unix/Linux/DOS shell, from a Java interface or from a Web server. The clustering results can be visualized as colored figures and 2-dimensional plots. Moreover, EXCAVATOR provides a wide range of options for data formats, distance measures, objective functions, clustering algorithms, methods to choose the number of clusters, etc. The effectiveness of EXCAVATOR has been demonstrated on several experimental data sets. Its performance compares favorably against the popular K-means clustering method in terms of clustering quality and computing time.
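
The minimum-spanning-tree representation mentioned above lends itself to a very simple clustering rule: build the tree, then cut its longest edges. The sketch below shows only that textbook variant. It is not EXCAVATOR's code, and the package's own objective functions may differ. The Euclidean distance, the function names, and the example points are assumptions made for illustration.

```python
import math

def minimum_spanning_tree(points):
    """Prim's algorithm on the complete Euclidean graph; returns (weight, i, j) edges."""
    n = len(points)
    in_tree = [False] * n
    best = [(math.inf, -1)] * n   # cheapest known link from the tree to each vertex
    best[0] = (0.0, -1)
    edges = []
    for _ in range(n):
        u = min((i for i in range(n) if not in_tree[i]), key=lambda i: best[i][0])
        in_tree[u] = True
        if best[u][1] >= 0:
            edges.append((best[u][0], best[u][1], u))
        for v in range(n):
            if not in_tree[v]:
                d = math.dist(points[u], points[v])
                if d < best[v][0]:
                    best[v] = (d, u)
    return edges

def mst_clusters(points, k):
    """Cut the k-1 longest MST edges and label the resulting components."""
    kept = sorted(minimum_spanning_tree(points))[:len(points) - k]
    parent = list(range(len(points)))

    def find(i):
        # Union-find root lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for _, i, j in kept:
        parent[find(i)] = find(j)
    roots = {}
    return [roots.setdefault(find(i), len(roots)) for i in range(len(points))]

# Hypothetical 2-D "expression profiles" forming two well-separated groups.
points = [(0.0, 0.0), (0.1, 0.2), (0.2, 0.1), (5.0, 5.0), (5.1, 5.2), (4.9, 5.1)]
labels = mst_clusters(points, 2)   # -> [0, 0, 0, 1, 1, 1]
```

Once the tree is built, clustering reduces to choosing which tree edges to remove, so different numbers of clusters can be explored without recomputing pairwise distances.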

20.
The rapidly expanding body of available genomic and protein structural data provides a rich resource for understanding protein dynamics with biomolecular simulation. While computational infrastructure has grown rapidly, simulations on an omics scale are not yet widespread, primarily because software infrastructure to enable simulations at this scale has not kept pace. It should now be possible to study protein dynamics across entire (super)families, exploiting both available structural biology data and conformational similarities across homologous proteins. Here, we present a new tool for enabling high-throughput simulation in the genomics era. Ensembler takes any set of sequences—from a single sequence to an entire superfamily—and shepherds them through various stages of modeling and refinement to produce simulation-ready structures. This includes comparative modeling to all relevant PDB structures (which may span multiple conformational states of interest), reconstruction of missing loops, addition of missing atoms, culling of nearly identical structures, assignment of appropriate protonation states, solvation in explicit solvent, and refinement and filtering with molecular simulation to ensure stable simulation. The output of this pipeline is an ensemble of structures ready for subsequent molecular simulations using computer clusters, supercomputers, or distributed computing projects like Folding@home. Ensembler thus automates much of the time-consuming process of preparing protein models suitable for simulation, while allowing scalability up to entire superfamilies. A particular advantage of this approach can be found in the construction of kinetic models of conformational dynamics—such as Markov state models (MSMs)—which benefit from a diverse array of initial configurations that span the accessible conformational states to aid sampling. 
We demonstrate the power of this approach by constructing models for all catalytic domains in the human tyrosine kinase family, using all available kinase catalytic domain structures from any organism as structural templates. Ensembler is free and open source software licensed under the GNU General Public License (GPL) v2. It is compatible with Linux and OS X. The latest release can be installed via the conda package manager, and the latest source can be downloaded from https://github.com/choderalab/ensembler.
