首页 | 本学科首页   官方微博 | 高级检索  
相似文献
 共查询到20条相似文献,搜索用时 46 毫秒
1.
MOTIVATION: In 2001 and 2002, we published two papers (Bioinformatics, 17, 282-283, Bioinformatics, 18, 77-82) describing an ultrafast protein sequence clustering program called cd-hit. This program can efficiently cluster a huge protein database with millions of sequences. However, the applications of the underlying algorithm are not limited to only protein sequences clustering, here we present several new programs using the same algorithm including cd-hit-2d, cd-hit-est and cd-hit-est-2d. Cd-hit-2d compares two protein datasets and reports similar matches between them; cd-hit-est clusters a DNA/RNA sequence database and cd-hit-est-2d compares two nucleotide datasets. All these programs can handle huge datasets with millions of sequences and can be hundreds of times faster than methods based on the popular sequence comparison and database search tools, such as BLAST.  相似文献   

2.
Successful clustering algorithms are highly dependent on parameter settings. The clustering performance degrades significantly unless parameters are properly set, and yet, it is difficult to set these parameters a priori. To address this issue, in this paper, we propose a unique splitting-while-merging clustering framework, named “splitting merging awareness tactics” (SMART), which does not require any a priori knowledge of either the number of clusters or even the possible range of this number. Unlike existing self-splitting algorithms, which over-cluster the dataset to a large number of clusters and then merge some similar clusters, our framework has the ability to split and merge clusters automatically during the process and produces the the most reliable clustering results, by intrinsically integrating many clustering techniques and tasks. The SMART framework is implemented with two distinct clustering paradigms in two algorithms: competitive learning and finite mixture model. Nevertheless, within the proposed SMART framework, many other algorithms can be derived for different clustering paradigms. The minimum message length algorithm is integrated into the framework as the clustering selection criterion. The usefulness of the SMART framework and its algorithms is tested in demonstration datasets and simulated gene expression datasets. Moreover, two real microarray gene expression datasets are studied using this approach. Based on the performance of many metrics, all numerical results show that SMART is superior to compared existing self-splitting algorithms and traditional algorithms. Three main properties of the proposed SMART framework are summarized as: (1) needing no parameters dependent on the respective dataset or a priori knowledge about the datasets, (2) extendible to many different applications, (3) offering superior performance compared with counterpart algorithms.  相似文献   

3.
Peak lists are commonly used in NMR as input data for various software tools such as automatic assignment and structure calculation programs. Inconsistencies of chemical shift referencing among different peak lists or between peak and chemical shift lists can cause severe problems during peak assignment. Here we present a simple and robust tool to achieve self-consistency of the chemical shift referencing among a set of peak lists. The Peakmatch algorithm matches a set of peak lists to a specified reference peak list, neither of which have to be assigned. The chemical shift referencing offset between two peak lists is determined by optimizing an assignment-free match score function using either a complete grid search or downhill simplex optimization. It is shown that peak lists from many different types of spectra can be matched reliably as long as they contain at least two corresponding dimensions. Using a simulated peak list, the Peakmatch algorithm can also be used to obtain the optimal agreement between a chemical shift list and experimental peak lists. Combining these features makes Peakmatch a useful tool that can be applied routinely before automatic assignment or structure calculation in order to obtain an optimized input data set.  相似文献   

4.
MOTIVATION: The study of sequence space, and the deciphering of the structure of protein families and subfamilies, has up to now been required for work in comparative genomics and for the prediction of protein function. With the emergence of structural proteomics projects, it is becoming increasingly important to be able to select protein targets for structural studies that will appropriately cover the space of protein sequences, functions and genomic distribution. These problems are the motivation for the development of methods for clustering protein sequences and building families of potentially orthologous sequences, such as those proposed here. RESULTS: First we developed a clustering strategy (Ncut algorithm) capable of forming groups of related sequences by assessing their pairwise relationships. The results presented for the ras super-family of proteins are similar to those produced by other clustering methods, but without the need for clustering the full sequence space. The Ncut clusters are then used as the input to a process of reconstruction of groups with equilibrated genomic composition formed by closely-related sequences. The results of applying this technique to the data set used in the construction of the COG database are very similar to those derived by the human experts responsible for this database. AVAILABILITY: The analysis of different systems, including the COG equivalent 21 genomes are available at http://www.pdg.cnb.uam.es/GenoClustering.html.  相似文献   

5.
MOTIVATION: Protein sequence clustering has been widely exploited to facilitate in-depth analysis of protein functions and families. For some applications of protein sequence clustering, it is highly desirable that a hierarchical structure, also referred to as dendrogram, which shows how proteins are clustered at various levels, is generated. However, as the sizes of contemporary protein databases continue to grow at rapid rates, it is of great interest to develop some summarization mechanisms so that the users can browse the dendrogram and/or search for the desired information more effectively. RESULTS: In this paper, the design of a novel incremental clustering algorithm aimed at generating summarized dendrograms for analysis of protein databases is described. The proposed incremental clustering algorithm employs a statistics-based model to summarize the distributions of the similarity scores among the proteins in the database and to control formation of clusters. Experimental results reveal that, due to the summarization mechanism incorporated, the proposed incremental clustering algorithm offers the users highly concise dendrograms for analysis of protein clusters with biological significance. Another distinction of the proposed algorithm is its incremental nature. As the sizes of the contemporary protein databases continue to grow at fast rates, due to the concern of efficiency, it is desirable that cluster analysis of a protein database can be carried out incrementally, when the protein database is updated. Experimental results with the Swiss-Prot protein database reveal that the time complexity for carrying out incremental clustering with k new proteins added into the database containing n proteins is O(n2betalogn), where beta congruent with 0.865, provided that k < n. AVAILABILITY: The Linux executable is available on the following supplementary page.  相似文献   

6.
In this paper, we describe a new brute force algorithm for building the -Nearest Neighbor Graph (k-NNG). The k-NNG algorithm has many applications in areas such as machine learning, bio-informatics, and clustering analysis. While there are very efficient algorithms for data of low dimensions, for high dimensional data the brute force search is the best algorithm. There are two main parts to the algorithm: the first part is finding the distances between the input vectors, which may be formulated as a matrix multiplication problem; the second is the selection of the k-NNs for each of the query vectors. For the second part, we describe a novel graphics processing unit (GPU)-based multi-select algorithm based on quick sort. Our optimization makes clever use of warp voting functions available on the latest GPUs along with user-controlled cache. Benchmarks show significant improvement over state-of-the-art implementations of the k-NN search on GPUs.  相似文献   

7.
A new more efficient variant of a recently developed algorithm for unsupervised fuzzy clustering is introduced. A Weighted Incremental Neural Network (WINN) is introduced and used for this purpose. The new approach is called FC-WINN (Fuzzy Clustering using WINN). The WINN algorithm produces a net of nodes connected by edges, which reflects and preserves the topology of the input data set. Additional weights, which are proportional to the local densities in input space, are associated with the resulting nodes and edges to store useful information about the topological relations in the given input data set. A fuzziness factor, proportional to the connectedness of the net, is introduced in the system. A watershed-like procedure is used to cluster the resulting net. The number of the resulting clusters is determined by this procedure. Only two parameters must be chosen by the user for the FC-WINN algorithm to determine the resolution and the connectedness of the net. Other parameters that must be specified are those which are necessary for the used incremental neural network, which is a modified version of the Growing Neural Gas algorithm (GNG). The FC-WINN algorithm is computationally efficient when compared to other approaches for clustering large high-dimensional data sets.  相似文献   

8.
Clustering protein sequences--structure prediction by transitive homology.   总被引:2,自引:0,他引:2  
MOTIVATION: It is widely believed that for two proteins Aand Ba sequence identity above some threshold implies structural similarity due to a common evolutionary ancestor. Since this is only a sufficient, but not a necessary condition for structural similarity, the question remains what other criteria can be used to identify remote homologues. Transitivity refers to the concept of deducing a structural similarity between proteins A and C from the existence of a third protein B, such that A and B as well as B and C are homologues, as ascertained if the sequence identity between A and B as well as that between B and C is above the aforementioned threshold. It is not fully understood if transitivity always holds and whether transitivity can be extended ad infinitum. RESULTS: We developed a graph-based clustering approach, where transitivity plays a crucial role. We determined all pair-wise similarities for the sequences in the SwissProt database using the Smith-Waterman local alignment algorithm. This data was transformed into a directed graph, where protein sequences constitute vertices. A directed edge was drawn from vertex A to vertex B if the sequences A and B showed similarity, scaled with respect to the self-similarity of A, above a fixed threshold. Transitivity was important in the clustering process, as intermediate sequences were used, limited though by the requirement of having directed paths in both directions between proteins linked over such sequences. The length dependency-implied by the self-similarity-of the scaling of the alignment scores appears to be an effective criterion to avoid clustering errors due to multi-domain proteins. To deal with the resulting large graphs we have developed an efficient library. Methods include the novel graph-based clustering algorithm capable of handling multi-domain proteins and cluster comparison algorithms. Structural Classification of Proteins (SCOP) was used as an evaluation data set for our method, yielding a 24% improvement over pair-wise comparisons in terms of detecting remote homologues. AVAILABILITY: The software is available to academic users on request from the authors. CONTACT: e.bolten@science-factory.com; schliep@zpr.uni-koeln.de; s.schneckener@science-factory.com; d.schomburg@uni-koeln.de; schrader@zpr.uni-koeln.de. SUPPLEMENTARY INFORMATION: http://www.zaik.uni-koeln.de/~schliep/ProtClust.html.  相似文献   

9.
Traditional k-means and most k-means variants are still computationally expensive for large datasets, such as microarray data, which have large datasets with large dimension size d. In k-means clustering, we are given a set of n data points in d-dimensional space Rd and an integer k. The problem is to determine a set of k points in Rd, called centers, so as to minimize the mean squared distance from each data point to its nearest center. In this work, we develop a novel k-means algorithm, which is simple but more efficient than the traditional k-means and the recent enhanced k-means. Our new algorithm is based on the recently established relationship between principal component analysis and the k-means clustering. We provided the correctness proof for this algorithm. Results obtained from testing the algorithm on three biological data and six non-biological data (three of these data are real, while the other three are simulated) also indicate that our algorithm is empirically faster than other known k-means algorithms. We assessed the quality of our algorithm clusters against the clusters of a known structure using the Hubert-Arabie Adjusted Rand index (ARIHA). We found that when k is close to d, the quality is good (ARIHA>0.8) and when k is not close to d, the quality of our new k-means algorithm is excellent (ARIHA>0.9). In this paper, emphases are on the reduction of the time requirement of the k-means algorithm and its application to microarray data due to the desire to create a tool for clustering and malaria research. However, the new clustering algorithm can be used for other clustering needs as long as an appropriate measure of distance between the centroids and the members is used. This has been demonstrated in this work on six non-biological data.  相似文献   

10.
Investigation of patterns in beta diversity has received increased attention over the last years particularly in light of new ecological theories such as the metapopulation paradigm and metacommunity theory. Traditionally, beta diversity patterns can be described by cluster analysis (i.e. dendrograms) that enables the classification of samples. Clustering algorithms define the structure of dendrograms, consequently assessing their performance is crucial. A common, although not always appropriate approach for assessing algorithm suitability is the cophenetic correlation coefficient c. Alternatively the 2-norm has been recently proposed as an increasingly informative method for evaluating the distortion engendered by clustering algorithms. In the present work, the 2-norm is applied for the first time on field data and is compared with the cophenetic correlation coefficient using a set of 105 pairwise combinations of 7 clustering methods (e.g. UPGMA) and 15 (dis)similarity/distance indices (e.g. Jaccard index). In contrast to the 2-norm, cophenetic correlation coefficient does not provide a clear indication on the efficiency of the clustering algorithms for all combinations. The two approaches were not always in agreement in the choice of the most faithful algorithm. Additionally, the 2-norm revealed that UPGMA is the most efficient clustering algorithm and Ward's the least. The present results suggest that goodness-of-fit measures such as the 2-norm should be applied prior to clustering analyses for reliable beta diversity measures.  相似文献   

11.

Motivation

It has been proposed that clustering clinical markers, such as blood test results, can be used to stratify patients. However, the robustness of clusters formed with this approach to data pre-processing and clustering algorithm choices has not been evaluated, nor has clustering reproducibility. Here, we made use of the NHANES survey to compare clusters generated with various combinations of pre-processing and clustering algorithms, and tested their reproducibility in two separate samples.

Method

Values of 44 biomarkers and 19 health/life style traits were extracted from the National Health and Nutrition Examination Survey (NHANES). The 1999–2002 survey was used for training, while data from the 2003–2006 survey was tested as a validation set. Twelve combinations of pre-processing and clustering algorithms were applied to the training set. The quality of the resulting clusters was evaluated both by considering their properties and by comparative enrichment analysis. Cluster assignments were projected to the validation set (using an artificial neural network) and enrichment in health/life style traits in the resulting clusters was compared to the clusters generated from the original training set.

Results

The clusters obtained with different pre-processing and clustering combinations differed both in terms of cluster quality measures and in terms of reproducibility of enrichment with health/life style properties. Z-score normalization, for example, dramatically improved cluster quality and enrichments, as compared to unprocessed data, regardless of the clustering algorithm used. Clustering diabetes patients revealed a group of patients enriched with retinopathies. This could indicate that routine laboratory tests can be used to detect patients suffering from complications of diabetes, although other explanations for this observation should also be considered.

Conclusions

Clustering according to classical clinical biomarkers is a robust process, which may help in patient stratification. However, optimization of the pre-processing and clustering process may be still required.  相似文献   

12.
Vorolign, a fast and flexible structural alignment method for two or more protein structures is introduced. The method aligns protein structures using double dynamic programming and measures the similarity of two residues based on the evolutionary conservation of their corresponding Voronoi-contacts in the protein structure. This similarity function allows aligning protein structures even in cases where structural flexibilities exist. Multiple structural alignments are generated from a set of pairwise alignments using a consistency-based, progressive multiple alignment strategy. RESULTS: The performance of Vorolign is evaluated for different applications of protein structure comparison, including automatic family detection as well as pairwise and multiple structure alignment. Vorolign accurately detects the correct family, superfamily or fold of a protein with respect to the SCOP classification on a set of difficult target structures. A scan against a database of >4000 proteins takes on average 1 min per target. The performance of Vorolign in calculating pairwise and multiple alignments is found to be comparable with other pairwise and multiple protein structure alignment methods. AVAILABILITY: Vorolign is freely available for academic users as a web server at http://www.bio.ifi.lmu.de/Vorolign  相似文献   

13.
Four of the most common limitations of the many available clustering methods are: i) the lack of a proper strategy to deal with outliers; ii) the need for a good a priori estimate of the number of clusters to obtain reasonable results; iii) the lack of a method able to detect when partitioning of a specific data set is not appropriate; and iv) the dependence of the result on the initialization. Here we propose Cross-clustering (CC), a partial clustering algorithm that overcomes these four limitations by combining the principles of two well established hierarchical clustering algorithms: Ward’s minimum variance and Complete-linkage. We validated CC by comparing it with a number of existing clustering methods, including Ward’s and Complete-linkage. We show on both simulated and real datasets, that CC performs better than the other methods in terms of: the identification of the correct number of clusters, the identification of outliers, and the determination of real cluster memberships. We used CC to cluster samples in order to identify disease subtypes, and on gene profiles, in order to determine groups of genes with the same behavior. Results obtained on a non-biological dataset show that the method is general enough to be successfully used in such diverse applications. The algorithm has been implemented in the statistical language R and is freely available from the CRAN contributed packages repository.  相似文献   

14.
An algorithm is presented for the fast and accurate definition of protein structural domains from coordinate data without prior knowledge of the number or type of domains. The algorithm explicitly locates domains that comprise one or two continuous segments of protein chain. Domains that include more than two segments are also located. The algorithm was applied to a nonredundant database of 230 protein structures and the results compared to domain definitions obtained from the literature, or by inspection of the coordinates on molecular graphics. For 70% of the proteins, the derived domains agree with the reference definitions, 18% show minor differences and only 12% (28 proteins) show very different definitions. Three screens were applied to identify the derived domains least likely to agree with the subjective definition set. These screens revealed a set of 173 proteins, 97% of which agree well with the subjective definitions. The algorithm represents a practical domain identification tool that can be run routinely on the entire structural database. Adjustment of parameters also allows smaller compact units to be identified in proteins.  相似文献   

15.
We propose a new particle swarm optimization algorithm for problems where objective functions are subject to zero-mean, independent, and identically distributed stochastic noise. While particle swarm optimization has been successfully applied to solve many complex deterministic nonlinear optimization problems, straightforward applications of particle swarm optimization to noisy optimization problems are subject to failure because the noise in objective function values can lead the algorithm to incorrectly identify positions as the global/personal best positions. Instead of having the entire swarm follow a global best position based on the sample average of objective function values, the proposed new algorithm works with a set of statistically global best positions that include one or more positions with objective function values that are statistically equivalent, which is achieved using a combination of statistical subset selection and clustering analysis. The new PSO algorithm can be seamlessly integrated with adaptive resampling procedures to enhance the capability of PSO to cope with noisy objective functions. Numerical experiments demonstrate that the new algorithm is able to consistently find better solutions than the canonical particle swarm optimization algorithm in the presence of stochastic noise in objective function values with different resampling procedures.  相似文献   

16.
Workflow Information Storage Toolkit (WIST) is a set of application programming interfaces and web applications that allow for the rapid development of customized laboratory information management systems (LIMS). WIST provides common LIMS input components, and allows them to be arranged and configured using a flexible language that specifies each component's visual and semantic characteristics. WIST includes a complete set of web applications for adding, editing and viewing data, as well as a powerful setup tool that can build new LIMS modules by analyzing existing database schema. Availability and implementation: WIST is implemented in Perl and may be obtained from http://vimss.sf.net under the BSD license.  相似文献   

17.
A novel algorithm for unsupervised fuzzy clustering is introduced. The algorithm uses a so-called Weighted Fixed Neural Network (WFNN) to store important and useful information about the topological relations in a given data set. The algorithm produces a weighted connected net, of weighted nodes connected by weighted edges, which reflects and preserves the topology of the input data set. The weights of the nodes and the edges in the resulting net are proportional to the local densities of data samples in input space. The connectedness of the net can be changed, and the higher the connectedness of the net is chosen, the fuzzier the system becomes. The new algorithm is computationally efficient when compared to other existing methods for clustering multi-dimensional data, such as color images.  相似文献   

18.
Tomography emerges as a powerful methodology for determining the complex architectures of biological specimens that are better regarded from the structural point of view as singular entities. However, once the structure of a sufficiently large number of singular specimens is solved, quite possibly structural patterns start to emerge. This latter situation is addressed here, where the clustering of a set of 3D reconstructions using a novel quantitative approach is presented. In general terms, we propose a new variant of a self-organizing neural network for the unsupervised classification of 3D reconstructions. The novelty of the algorithm lies in its rigorous mathematical formulation that, starting from a large set of noisy input data, finds a set of "representative" items, organized onto an ordered output map, such that the probability density of this set of representative items resembles at its possible best the probability density of the input data. In this study, we evaluate the feasibility of application of the proposed neural approach to the problem of identifying similar 3D motifs within tomograms of insect flight muscle. Our experimental results prove that this technique is suitable for this type of problem, providing the electron microscopy community with a new tool for exploring large sets of tomogram data to find complex patterns.  相似文献   

19.
Members of a superfamily of proteins could result from divergent evolution of homologues with insignificant similarity in the amino acid sequences. A superfamily relationship is detected commonly after the three-dimensional structures of the proteins are determined using X-ray analysis or NMR. The SUPFAM database described here relates two homologous protein families in a multiple sequence alignment database of either known or unknown structure. The present release (1.1), which is the first version of the SUPFAM database, has been derived by analysing Pfam, which is one of the commonly used databases of multiple sequence alignments of homologous proteins. The first step in establishing SUPFAM is to relate Pfam families with the families in PALI, which is an alignment database of homologous proteins of known structure that is derived largely from SCOP. The second step involves relating Pfam families which could not be associated reliably with a protein superfamily of known structure. The profile matching procedure, IMPALA, has been used in these steps. The first step resulted in identification of 1280 Pfam families (out of 2697, i.e. 47%) which are related, either by close homologous connection to a SCOP family or by distant relationship to a SCOP family, potentially forming new superfamily connections. Using the profiles of 1417 Pfam families with apparently no structural information, an all-against-all comparison involving a sequence-profile match using IMPALA resulted in clustering of 67 homologous protein families of Pfam into 28 potential new superfamilies. Expansion of groups of related proteins of yet unknown structural information, as proposed in SUPFAM, should help in identifying ‘priority proteins’ for structure determination in structural genomics initiatives to expand the coverage of structural information in the protein sequence space. For example, we could assign 858 distinct Pfam domains in 2203 of the gene products in the genome of Mycobacterium tubercolosis. Fifty-one of these Pfam families of unknown structure could be clustered into 17 potentially new superfamilies forming good targets for structural genomics. SUPFAM database can be accessed at http://pauling.mbu.iisc.ernet.in/~supfam.  相似文献   

20.
MSAT     
This article describes the development of a new method for multiple sequence alignment based on fold-level protein structure alignments, which provides an improvement in accuracy compared with the most commonly used sequence-only-based techniques. This method integrates the widely used, progressive multiple sequence alignment approach ClustalW with the Topology of Protein Structure (TOPS) topology-based alignment algorithm. The TOPS approach produces a structural alignment for the input protein set by using a topology-based pattern discovery program, providing a set of matched sequence regions that can be used to guide a sequence alignment using ClustalW. The resulting alignments are more reliable than a sequence-only alignment, as determined by 20-fold cross-validation with a set of 106 protein examples from the CATH database, distributed in seven superfold families. The method is particularly effective for sets of proteins that have similar structures at the fold level but low sequence identity. The aim of this research is to contribute towards bridging the gap between protein sequence and structure analysis, in the hope that this can be used to assist the understanding of the relationship between sequence, structure and function. The tool is available at http://balabio.dcs.gla.ac.uk/msat/.  相似文献   

设为首页 | 免责声明 | 关于勤云 | 加入收藏

Copyright©北京勤云科技发展有限公司  京ICP备09084417号