
1.

A highly efficient multi-core algorithm for clustering extremely large datasets

Johann M Kraus Hans A Kestler 《BMC bioinformatics》2010,11(1):169

Background

In recent years, the demand for computational power in computational biology has increased due to rapidly growing data sets from microarray and other high-throughput technologies. This demand is likely to increase. Standard algorithms for analyzing data, such as cluster algorithms, need to be parallelized for fast processing. Unfortunately, most approaches for parallelizing algorithms largely rely on network communication protocols connecting and requiring multiple computers. One answer to this problem is to utilize the intrinsic capabilities in current multi-core hardware to distribute the tasks among the different cores of one computer.

2.

Performance assessment of kernel density clustering for gene expression profile data

Shu G Zeng B Chen YP Smith OH 《Comparative and Functional Genomics》2003,4(3):287-299

Kernel density smoothing techniques have been used in classification or supervised learning of gene expression profile (GEP) data, but their applications to clustering or unsupervised learning of those data have not been explored and assessed. Here we report a kernel density clustering method for analysing GEP data and compare its performance with the three most widely-used clustering methods: hierarchical clustering, K-means clustering, and multivariate mixture model-based clustering. Using several methods to measure agreement, between-cluster isolation, and within-cluster coherence, such as the Adjusted Rand Index, the Pseudo F test, the r^2 test, and the profile plot, we have assessed the effectiveness of kernel density clustering for recovering clusters, and its robustness against noise on clustering both simulated and real GEP data. Our results show that the kernel density clustering method has excellent performance in recovering clusters from simulated data and in grouping large real expression profile data sets into compact and well-isolated clusters, and that it is the most robust clustering method for analysing noisy expression profile data compared to the other three methods assessed.
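
The Adjusted Rand Index named in the abstract above measures agreement between two partitions of the same items, corrected for chance agreement. The following is a minimal sketch of the standard pair-counting formula, assuming flat clusterings given as equal-length label lists; the function name is illustrative and is not taken from the paper.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(labels_a, labels_b):
    """Chance-corrected agreement between two flat clusterings."""
    if len(labels_a) != len(labels_b):
        raise ValueError("label lists must have the same length")
    total_pairs = comb(len(labels_a), 2)
    if total_pairs == 0:
        return 1.0  # fewer than two items: nothing to disagree on

    # Pairs of items placed together in both clusterings.
    together_both = sum(comb(c, 2) for c in Counter(zip(labels_a, labels_b)).values())
    # Pairs placed together in each clustering separately.
    together_a = sum(comb(c, 2) for c in Counter(labels_a).values())
    together_b = sum(comb(c, 2) for c in Counter(labels_b).values())

    expected = together_a * together_b / total_pairs
    maximum = (together_a + together_b) / 2
    if maximum == expected:
        return 1.0  # degenerate case: both partitions are trivial
    return (together_both - expected) / (maximum - expected)
```

A value of 1 means the two partitions are identical up to relabeling, and values near 0 mean agreement no better than chance.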

3.

Hierarchical clustering algorithm for comprehensive orthologous-domain classification in multiple genomes


Uchiyama I 《Nucleic acids research》2006,34(2):647-658

Ortholog identification is a crucial first step in comparative genomics. Here, we present a rapid method of ortholog grouping which is effective enough to allow the comparison of many genomes simultaneously. The method takes as input all-against-all similarity data and classifies genes based on the traditional hierarchical clustering algorithm UPGMA. In the course of clustering, the method detects domain fusion or fission events, and splits clusters into domains if required. The subsequent procedure splits the resulting trees such that intra-species paralogous genes are divided into different groups so as to create plausible orthologous groups. As a result, the procedure can split genes into the domains minimally required for ortholog grouping. The procedure, named DomClust, was tested using the COG database as a reference. When comparing several clustering algorithms combined with the conventional bidirectional best-hit (BBH) criterion, we found that our method generally showed better agreement with the COG classification. By comparing the clustering results generated from datasets of different releases, we also found that our method showed relatively good stability in comparison to the BBH-based methods.
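
The abstract above builds on UPGMA, which repeatedly merges the two closest clusters and updates distances as size-weighted averages. The sketch below shows only that textbook procedure; it assumes the input has already been converted to pairwise distances (the abstract speaks of similarity data) and does not attempt the domain-splitting or paralog-splitting steps of DomClust. The function name and the data layout are hypothetical.

```python
from itertools import combinations


def upgma(names, dist):
    """Plain UPGMA.

    names: list of hashable item labels.
    dist:  dict mapping frozenset({a, b}) to the distance between a and b.
    Returns the merges in order as (cluster_a, cluster_b, distance) tuples,
    where a merged cluster is represented by the tuple (cluster_a, cluster_b).
    """
    sizes = {name: 1 for name in names}  # current clusters and their sizes
    d = dict(dist)                       # working copy, extended as we merge
    merges = []
    while len(sizes) > 1:
        # Pick the closest pair among the current clusters.
        a, b = min(combinations(sizes, 2), key=lambda pair: d[frozenset(pair)])
        distance = d[frozenset((a, b))]
        size_a, size_b = sizes.pop(a), sizes.pop(b)
        merged = (a, b)
        # Distance from the new cluster to every other cluster is the
        # average over all member pairs, i.e. a size-weighted mean.
        for c in sizes:
            d[frozenset((merged, c))] = (
                size_a * d[frozenset((a, c))] + size_b * d[frozenset((b, c))]
            ) / (size_a + size_b)
        sizes[merged] = size_a + size_b
        merges.append((a, b, distance))
    return merges
```

For three items with d(A,B)=2, d(A,C)=6, and d(B,C)=8, the sketch first merges A and B, then joins C at the averaged distance (6+8)/2.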

4.

MCAM: multiple clustering analysis methodology for deriving hypotheses and insights from high-throughput proteomic datasets

Naegle KM Welsch RE Yaffe MB White FM Lauffenburger DA 《PLoS computational biology》2011,7(7):e1002119

Advances in proteomic technologies continue to substantially accelerate capability for generating experimental data on protein levels, states, and activities in biological samples. For example, studies on receptor tyrosine kinase signaling networks can now capture the phosphorylation state of hundreds to thousands of proteins across multiple conditions. However, little is known about the function of many of these protein modifications, or the enzymes responsible for modifying them. To address this challenge, we have developed an approach that enhances the power of clustering techniques to infer functional and regulatory meaning of protein states in cell signaling networks. We have created a new computational framework for applying clustering to biological data in order to overcome the typical dependence on specific a priori assumptions and expert knowledge concerning the technical aspects of clustering. Multiple clustering analysis methodology ('MCAM') employs an array of diverse data transformations, distance metrics, set sizes, and clustering algorithms, in a combinatorial fashion, to create a suite of clustering sets. These sets are then evaluated based on their ability to produce biological insights through statistical enrichment of metadata relating to knowledge concerning protein functions, kinase substrates, and sequence motifs. We applied MCAM to a set of dynamic phosphorylation measurements of the ERBB network to explore the relationships between algorithmic parameters and the biological meaning that could be inferred and report on interesting biological predictions. Further, we applied MCAM to multiple phosphoproteomic datasets for the ERBB network, which allowed us to compare independent and incomplete overlapping measurements of phosphorylation sites in the network. We report specific and global differences of the ERBB network stimulated with different ligands and with changes in HER2 expression.
Overall, we offer MCAM as a broadly-applicable approach for analysis of proteomic data which may help increase the current understanding of molecular networks in a variety of biological problems.

5.

Robust multi-scale clustering of large DNA microarray datasets with the consensus algorithm

Grotkjaer T Winther O Regenberg B Nielsen J Hansen LK 《Bioinformatics (Oxford, England)》2006,22(1):58-67


6.

Integrating bioinformatics approaches for a comprehensive interpretation of metabolomics datasets

《Current opinion in biotechnology》2018


7.

A biclustering algorithm for extracting bit-patterns from binary datasets

Rodriguez-Baena DS Perez-Pulido AJ Aguilar-Ruiz JS 《Bioinformatics (Oxford, England)》2011,27(19):2738-2745


8.

Automated programming for bioinformatics algorithm deployment

Alterovitz G Jiwaji A Ramoni MF 《Bioinformatics (Oxford, England)》2008,24(3):450-451

Many bioinformatics solutions suffer from the lack of usable interface/platform from which results can be analyzed and visualized. Overcoming this hurdle would allow for more widespread dissemination of bioinformatics algorithms within the biological and medical communities. The algorithms should be accessible without extensive technical support or programming knowledge. Here, we propose a dynamic wizard platform that provides users with a Graphical User Interface (GUI) for most Java bioinformatics library toolkits. The application interface is generated in real-time based on the original source code. This platform lets developers focus on designing algorithms and biologists/physicians on testing hypotheses and analyzing results. AVAILABILITY: The open source code can be downloaded from: http://bcl.med.harvard.edu/proteomics/proj/APBA/.

9.

Adjusted geometric-mean: a novel performance measure for imbalanced bioinformatics datasets learning

Batuwita R Palade V 《Journal of bioinformatics and computational biology》2012,10(4):1250003

One common and challenging problem faced by many bioinformatics applications, such as promoter recognition, splice site prediction, RNA gene prediction, drug discovery and protein classification, is the imbalance of the available datasets. In most of these applications, the positive data examples are largely outnumbered by the negative data examples, which often leads to the development of sub-optimal prediction models having a high negative recognition rate (Specificity = SP) and a low positive recognition rate (Sensitivity = SE). When class imbalance learning methods are applied, the SE is usually increased at the expense of some reduction in the SP. In this paper, we point out that in these data-imbalanced bioinformatics applications, the goal of applying class imbalance learning methods would be to increase the SE as much as possible while keeping the reduction of SP as low as possible. We explain that the existing performance measures used in class imbalance learning can still produce sub-optimal models with respect to this classification goal. In order to overcome these problems, we introduce a new performance measure called Adjusted Geometric-mean (AGm). The experimental results obtained on ten real-world imbalanced bioinformatics datasets demonstrate that the AGm metric can achieve a lower rate of reduction of SP than the existing performance metrics, when increasing the SE through class imbalance learning methods. This characteristic of the AGm metric makes it more suitable for achieving the proposed classification goal in imbalanced bioinformatics datasets learning.
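
To make the SE and SP trade-off in the abstract above concrete, the sketch below computes both rates from confusion-matrix counts, together with the plain geometric mean of SE and SP, a standard measure in class imbalance learning. The abstract does not give the AGm formula, so it is not reproduced here; the function names and the example counts are illustrative assumptions.

```python
from math import sqrt


def sensitivity_specificity(tp, fn, tn, fp):
    """Return (SE, SP) from confusion-matrix counts."""
    se = tp / (tp + fn) if (tp + fn) else 0.0  # positive recognition rate
    sp = tn / (tn + fp) if (tn + fp) else 0.0  # negative recognition rate
    return se, sp


def geometric_mean(se, sp):
    """Plain geometric mean of SE and SP (not the adjusted AGm variant)."""
    return sqrt(se * sp)


# Hypothetical imbalanced test set: 50 positives, 950 negatives.
se, sp = sensitivity_specificity(tp=10, fn=40, tn=940, fp=10)
accuracy = (10 + 940) / 1000
gm = geometric_mean(se, sp)
```

In this made-up example the accuracy is 0.95 even though only 20% of the positives are recognized, which is the kind of sub-optimal model the abstract describes; the geometric mean stays below 0.5 and so exposes the low SE.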

10.

On an improved clustering algorithm based on node density for WSN routing protocol

Chang Luyao Li Fan Niu Xinzheng Zhu Jiahui 《Cluster computing》2022,25(4):3005-3017

To collect data while balancing energy consumption, wireless sensor networks (WSNs) need to be divided into clusters. Dividing the network into clusters gives it a hierarchical organizational structure, which balances the network load and prolongs the life cycle of the system. In a clustering routing algorithm, the quality of the clustering algorithm directly affects the result of cluster division. This paper proposes an algorithm that selects cluster heads based on node distribution density and then allocates the remaining nodes, addressing the random cluster-head election and uneven clustering of the traditional LEACH protocol clustering algorithm in WSNs. Experiments show that the algorithm can realize the rapid selection of cluster heads and division of clusters, which is effective for node clustering and is conducive to equalizing energy consumption.


11.

Automated SNP genotype clustering algorithm to improve data completeness in high-throughput SNP genotyping datasets from custom arrays

Smith EM Littrell J Olivier M 《Genomics, Proteomics & Bioinformatics》2007,5(3-4):256-259

High-throughput SNP genotyping platforms use automated genotype calling algorithms to assign genotypes. While these algorithms work efficiently for individual platforms, they are not compatible with other platforms, and have individual biases that result in missed genotype calls. Here we present data on the use of a second complementary SNP genotype clustering algorithm. The algorithm was originally designed for individual fluorescent SNP genotyping assays, and has been optimized to permit the clustering of large datasets generated from custom-designed Affymetrix SNP panels. In an analysis of data from a 3K array genotyped on 1,560 samples, the additional analysis increased the overall number of genotypes by over 45,000, significantly improving the completeness of the experimental data. This analysis suggests that the use of multiple genotype calling algorithms may be advisable in high-throughput SNP genotyping experiments. The software is written in Perl and is available from the corresponding author.

12.

A biased random-key genetic algorithm for data clustering

P. Festa 《Mathematical biosciences》2013

Cluster analysis aims at finding subsets (clusters) of a given set of entities, which are homogeneous and/or well separated.

13.

A self-healing clustering algorithm for underwater sensor networks

Chenn-Jung Huang Yu-Wu Wang Chin-Fa Lin Yu-To Chen Heng-Ming Chen Hung-Yen Shen You-Jia Chen I-Fan Chen Kai-Wen Hu Dian-Xiu Yang 《Cluster computing》2011,14(1):91-99

Underwater wireless sensor networks (UWSNs) are a novel networking paradigm for exploring aqueous environments. The characteristics of mobile UWSNs, such as low communication bandwidth, large propagation delay, floating node mobility, and high error probability, are significantly different from those of terrestrial wireless sensor networks. Energy-efficient communication protocols are thus urgently demanded in mobile UWSNs. In this paper, we develop a novel clustering algorithm that combines the ideas of energy-efficient cluster-based routing and application-specific data aggregation to achieve good performance in terms of system lifetime and application-perceived quality. The proposed clustering technique organizes sensor nodes into direction-sensitive clusters, with one node acting as the head of each cluster, in order to fit the unique characteristic of up/down transmission direction in UWSNs. Meanwhile, the concept of self-healing is adopted to avoid excessively frequent re-clustering owing to the disruption of individual clusters. The self-healing mechanism significantly enhances the robustness of clustered UWSNs. The experimental results verify the effectiveness and feasibility of the proposed algorithm.

14.

CLU: A new algorithm for EST clustering

Ptitsyn A Hide W 《BMC bioinformatics》2005,6(Z2):S3

Background

The continuous flow of EST data remains one of the richest sources for discoveries in modern biology. The first step in EST data mining is usually associated with EST clustering, the process of grouping of original fragments according to their annotation, similarity to known genomic DNA or each other. Clustered EST data, accumulated in databases such as UniGene, STACK and TIGR Gene Indices have proven to be crucial in research areas from gene discovery to regulation of gene expression.

Results

We have developed a new nucleotide sequence matching algorithm and its implementation for clustering EST sequences. The program is based on the original CLU match detection algorithm, which has improved performance over the widely used d2_cluster. The CLU algorithm automatically ignores low-complexity regions like poly-tracts and short tandem repeats.

Conclusion

CLU represents a new generation of EST clustering algorithm with improved performance over current approaches. An early implementation can be applied in small and medium-size projects. The CLU program is available on an open source basis free of charge. It can be downloaded from http://compbio.pbrc.edu/pti


15.

A distributed clustering algorithm for large-scale dynamic networks

Thibault Bernard Alain Bui Laurence Pilard Devan Sohier 《Cluster computing》2012,15(4):335-350

We propose an algorithm that builds and maintains clusters over a network subject to mobility. This algorithm is fully decentralized and makes all the different clusters grow concurrently. The algorithm uses circulating tokens that collect data and move according to a random walk traversal scheme. Each token's task consists of (i) creating a cluster from the nodes it discovers and (ii) managing the cluster's expansion; all decisions affecting the cluster are taken only by a node that owns the token. The size of each cluster is kept above m nodes (m is a parameter of the algorithm). The obtained clustering is locally optimal in the sense that, with only a local view of each cluster, it computes the largest possible number of clusters (i.e. the sizes of the clusters are as close to m as possible). This algorithm is designed as a decentralized control algorithm for large scale networks and is mobility-adaptive: after a series of topological changes, the algorithm converges to a clustering. This recomputation only affects nodes in clusters where topological changes happened, and in adjacent clusters.

16.

A scalable method for integration and functional analysis of multiple microarray datasets (Cited by: 6; self-citations: 0; citations by others: 6)

Huttenhower C Hibbs M Myers C Troyanskaya OG 《Bioinformatics (Oxford, England)》2006,22(23):2890-2897

MOTIVATION: The diverse microarray datasets that have become available over the past several years represent a rich opportunity and challenge for biological data mining. Many supervised and unsupervised methods have been developed for the analysis of individual microarray datasets. However, integrated analysis of multiple datasets can provide a broader insight into genetic regulation of specific biological pathways under a variety of conditions. RESULTS: To aid in the analysis of such large compendia of microarray experiments, we present Microarray Experiment Functional Integration Technology (MEFIT), a scalable Bayesian framework for predicting functional relationships from integrated microarray datasets. Furthermore, MEFIT predicts these functional relationships within the context of specific biological processes. All results are provided in the context of one or more specific biological functions, which can be provided by a biologist or drawn automatically from catalogs such as the Gene Ontology (GO). Using MEFIT, we integrated 40 Saccharomyces cerevisiae microarray datasets spanning 712 unique conditions. In tests based on 110 biological functions drawn from the GO biological process ontology, MEFIT provided a 5% or greater performance increase for 54 functions, with a 5% or more decrease in performance in only two functions.

17.

On dimension reduction of clustering results in structural bioinformatics

《Biochimica et Biophysica Acta - Proteins and Proteomics》2014,1844(12):2277-2283

OPTICS is a density-based clustering algorithm that performs well in a wide variety of applications. For a set of input objects, the algorithm creates a reachability plot that can either be used to produce cluster membership assignments, or interpreted itself as an expressive two-dimensional representation of the clustering structure of the input set, even if the input set is embedded in higher dimensions. The focus of this work is a visualization method that can be applied for comparing two independent hierarchical clusterings by assigning colors to all entries of the input database. We give two applications related to macromolecular structural properties: the first is a sequence-based clustering of the SwissProt database that is evaluated using NCBI taxonomy identifiers, and the second involves clustering locations of specific atoms in the serine protease enzyme family, where the clusters are evaluated using SCOP structural classifications.

18.

An automatic bandwidth selector for kernel density estimation (Cited by: 4; self-citations: 0; citations by others: 4)

CHIU SHEAN-TSONG 《Biometrika》1992,79(4):771-782


19.

A new DNA sequence entropy-based Kullback-Leibler algorithm for gene clustering

Houshang Dehghanzadeh Mostafa Ghaderi-Zefrehei Seyed Ziaeddin Mirhoseini Saeid Esmaeilkhaniyan Ishaku Lemu Haruna Hamed Amirpour Najafabadi 《Journal of applied genetics》2020,61(2):231-238

Information theory is a branch of mathematics that overlaps with communications, biology, and medical engineering. Entropy is a measure of the uncertainty in a set of information. In this study, the entropy of each gene and of its exon sets was calculated in orders one to four. Based on the relative entropy of genes and exons, the Kullback-Leibler divergence was calculated. The Kullback-Leibler distances obtained for the gene and exon sets were then used as input to 7 clustering algorithms: single, complete, average, weighted, centroid, median, and K-means. To aggregate the results of clustering, the AdaBoost algorithm was used. Finally, the results of the AdaBoost algorithm were examined with the GeneMANIA prediction server to explore them from a gene annotation point of view. All calculations were performed using the MATLAB Engineering Software (2015). Examining the metabolic pathways of the clustered genes against their gene annotations showed that the proposed clustering method yielded correct, logical, and fast results. At the same time, the method avoids the disadvantages of alignment, allows genes to be considered with their actual length and content, and does not require large amounts of memory for long sequences. We believe that the proposed method could be used alongside other competitive gene clustering methods to group biologically relevant sets of genes. The proposed method can also be seen as a predictive method for genes with weak genomic annotations.
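
The entropy and Kullback-Leibler steps described in the abstract above can be illustrated with a short sketch. It rests on several assumptions that the abstract does not confirm: "order k" is read as the distribution of k-mers (k consecutive nucleotides), a pseudocount is added so that the divergence stays finite, and all names are hypothetical. The authors worked in MATLAB; Python is used here only for illustration.

```python
from collections import Counter
from itertools import product
from math import log2


def kmer_distribution(seq, k, pseudocount=1.0):
    """Smoothed probability of every DNA k-mer in seq (assumed reading of 'order k')."""
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    total = sum(counts[m] for m in kmers) + pseudocount * len(kmers)
    return {m: (counts[m] + pseudocount) / total for m in kmers}


def entropy(p):
    """Shannon entropy in bits."""
    return -sum(v * log2(v) for v in p.values() if v > 0)


def kl_divergence(p, q):
    """Kullback-Leibler divergence D(p || q) in bits.

    Requires q[m] > 0 wherever p[m] > 0, which a positive pseudocount guarantees.
    """
    return sum(p[m] * log2(p[m] / q[m]) for m in p if p[m] > 0)
```

Because D(p || q) is not symmetric, a clustering step that needs a symmetric distance would typically combine D(p || q) and D(q || p), for example by summing them; whether the paper does so is not stated in the abstract.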

20.

GenClust: A genetic algorithm for clustering gene expression data

Vito Di Gesú Raffaele Giancarlo Giosué Lo Bosco Alessandra Raimondi Davide Scaturro 《BMC bioinformatics》2005,6(1):289

Background

Clustering is a key step in the analysis of gene expression data, and in fact, many classical clustering algorithms are used, or more innovative ones have been designed and validated for the task. Despite the widespread use of artificial intelligence techniques in bioinformatics and, more generally, data analysis, there are very few clustering algorithms based on the genetic paradigm, yet that paradigm has great potential in finding good heuristic solutions to a difficult optimization problem such as clustering.