Similar articles
 Found 20 similar articles (search time: 320 ms)
1.
MOTIVATION: Currently, the most popular approach to analyzing genome-wide expression data is clustering. One of the major drawbacks of most existing clustering methods is that the number of clusters has to be specified a priori. Furthermore, purely unsupervised algorithms totally ignore prior biological knowledge. Moreover, most current tools lack an effective framework for tight integration of unsupervised and supervised learning for the analysis of high-dimensional expression data, and only very few multi-class supervised approaches are designed with provision for effectively utilizing multiple functional class labeling. RESULTS: The paper adapts a novel Self-Organizing Map called the supervised Network Self-Organized Map (sNet-SOM) to the peculiarities of multi-labeled gene expression data. The sNet-SOM determines the number of clusters adaptively with a dynamic extension process. This process is driven by an inhomogeneous measure that tries to balance unsupervised, supervised and model complexity criteria. Nodes within a rectangular grid are grown at the boundary nodes, weights are rippled from the internal nodes towards the outer nodes of the grid, and whole columns are inserted within the map. The appropriate level of expansion is determined automatically. Multiple sNet-SOM models are constructed dynamically, each for a different unsupervised/supervised balance, and model selection criteria are used to select the optimal one. The results indicate that sNet-SOM yields performance competitive with other recently proposed approaches for supervised classification at a significantly reduced computational cost, and that it provides extensive potential for exploratory analysis within the analysis framework. Furthermore, it explores simple design decisions that are easier to comprehend and computationally efficient.

2.
Ants, the most abundant taxa among canopy‐dwelling animals in tropical rainforests, are mostly represented by territorially dominant arboreal ants (TDAs) whose territories are distributed in a mosaic pattern (arboreal ant mosaics). Large TDA colonies regulate insect herbivores, with implications for forestry and agronomy. What generates these mosaics in vegetal formations, which are dynamic, still needs to be better understood. So, from empirical research based on 3 Cameroonian tree species (Lophira alata, Ochnaceae; Anthocleista vogelii, Gentianaceae; and Barteria fistulosa, Passifloraceae), we used the Self‐Organizing Map (SOM, neural network) to illustrate the succession of TDAs as their host trees grow and age. The SOM separated the trees by species and by size for L. alata, which can reach 60 m in height and live several centuries. An ontogenic succession of TDAs from sapling to mature trees is shown, and some ecological traits are highlighted for certain TDAs. Also, because the SOM permits the analysis of data with many zeroes with no effect of outliers on the overall scatterplot distributions, we obtained ecological information on rare species. Finally, the SOM permitted us to show that functional groups cannot be selected at the genus level as congeneric species can have very different ecological niches, something particularly true for Crematogaster spp., which include a species specifically associated with B. fistulosa, nondominant species and TDAs. Therefore, the SOM permitted the complex relationships between TDAs and their growing host trees to be analyzed, while also providing new information on the ecological traits of the ant species involved.

3.

Background

Clustering is a widely used technique for the analysis of gene expression data. Most clustering methods group genes based on distances, while few methods group genes according to the similarity of the distributions of their expression levels. Furthermore, as biological annotation resources have accumulated, an increasing number of genes have been annotated into functional categories. As a result, evaluating the performance of clustering methods in terms of the functional consistency of the resulting clusters is of great interest.

Results

In this paper, we propose the WDCM (Weibull Distribution-based Clustering Method), a robust approach for clustering gene expression data in which the expression levels of individual genes are treated as random variables following unique Weibull distributions. WDCM is based on the concept that genes with similar expression profiles have similar distribution parameters, and thus the genes are clustered via their Weibull distribution parameters. We used WDCM to cluster three cancer gene expression data sets, from lung cancer, B-cell follicular lymphoma and bladder carcinoma, and obtained well-clustered results. We compared the performance of WDCM with k-means and the Self-Organizing Map (SOM) using functional annotation information given by the Gene Ontology (GO). The results showed that the functional annotation ratios of WDCM are higher than those of the other methods. We also used an external measure, the Adjusted Rand Index, to validate the performance of WDCM. The comparative results demonstrate that WDCM provides better clustering performance than the k-means and SOM algorithms. A merit of the proposed WDCM is that it can be applied to cluster incomplete gene expression data without imputing the missing values. Moreover, the robustness of WDCM is also evaluated on incomplete data sets.

Conclusions

The results demonstrate that WDCM produces clusters with more consistent functional annotations than the other methods. WDCM is also verified to be robust and is capable of clustering gene expression data containing a small quantity of missing values.
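The abstract above validates clusterings with the Adjusted Rand Index. As a minimal sketch of how that external measure is computed from two label assignments (the function name and the handling of degenerate inputs are assumptions made for illustration, not taken from the paper):

```python
from collections import Counter
from math import comb


def adjusted_rand_index(labels_a, labels_b):
    """Adjusted Rand Index between two flat clusterings of the same items."""
    if len(labels_a) != len(labels_b):
        raise ValueError("label lists must have the same length")
    n = len(labels_a)
    if n < 2:
        # Assumption: with fewer than two items there are no pairs to disagree on.
        return 1.0

    # Contingency counts: items shared by cluster i of A and cluster j of B.
    cell_counts = Counter(zip(labels_a, labels_b))
    a_counts = Counter(labels_a)
    b_counts = Counter(labels_b)

    sum_cells = sum(comb(c, 2) for c in cell_counts.values())
    sum_a = sum(comb(c, 2) for c in a_counts.values())
    sum_b = sum(comb(c, 2) for c in b_counts.values())

    expected = sum_a * sum_b / comb(n, 2)  # agreement expected by chance
    max_index = (sum_a + sum_b) / 2        # agreement of a perfect match
    if max_index == expected:
        # Degenerate case, e.g. both clusterings put everything in one cluster.
        return 1.0
    return (sum_cells - expected) / (max_index - expected)
```

The index is 1 for identical partitions (regardless of how the clusters are named), close to 0 for chance-level agreement, and can be negative when agreement is worse than chance.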

4.
We report on the application of the Self-Organizing Map (SOM) classification method to the task of categorizing texts according to their register and the style of their author. The SOM has been selected as its performance in various data-mining applications has been found to be highly successful. Here, the method is evaluated against the task of clustering textual data which are corpora of texts written in the Greek language; the parameters used depict linguistically important structural properties of the texts. The experiments reported indicate that the SOM results are equivalent to those generated by statistical methods.

5.
Self-Organising Map (SOM) clustering methods applied to the monthly and seasonal averaged flowering intensity records of eight Eucalypt species are shown to successfully quantify, visualise and model synchronisation of multivariate time series. The SOM algorithm converts complex, nonlinear relationships between high-dimensional data into simple networks and a map based on the most likely patterns in the multiplicity of time series on which it is trained. Monthly- and seasonal-based SOMs identified three synchronous species groups (clusters): E. camaldulensis, E. melliodora, E. polyanthemos; E. goniocalyx, E. microcarpa, E. macrorhyncha; and E. leucoxylon, E. tricarpa. The main factor in synchronisation (clustering) appears to be the season in which flowering commences. SOMs also identified the asynchronous relationship among the eight species. Hence, the likelihood of the production, or not, of hybrids between sympatric species is also identified. The SOM pattern-based correlation values mirror earlier synchrony statistics gleaned from Moran correlations obtained from the raw flowering records. Synchronisation of flowering is shown to be a complex mechanism that incorporates all the flowering characteristics: flowering duration, timing of peak flowering, of start and finishing of flowering, as well as possibly specific climate drivers for flowering. SOMs can accommodate all this complexity and we advocate their use by phenologists and ecologists as a powerful, accessible and interpretable tool for visualisation and clustering of multivariate time series and for synchrony studies.

6.
7.
The wealth of interaction information provided in biomedical articles has motivated the implementation of text mining approaches to automatically extract biomedical relations. This paper presents an unsupervised method based on pattern clustering and sentence parsing for biomedical relation extraction. The pattern clustering algorithm is based on a polynomial kernel method, which identifies interaction words from unlabeled data; these interaction words are then used in relation extraction between entity pairs. Dependency parsing and phrase structure parsing are combined for relation extraction. Based on the semi-supervised KNN algorithm, we extend the proposed unsupervised approach to a semi-supervised approach by combining pattern clustering, dependency parsing and phrase structure parsing rules. We evaluated the approaches on two different tasks: (1) protein–protein interaction extraction, and (2) gene–suicide association extraction. The evaluation of task (1) on the benchmark dataset (AImed corpus) showed that our proposed unsupervised approach outperformed three supervised methods, which are rule based, SVM based, and kernel based, respectively. The proposed semi-supervised approach is superior to the existing semi-supervised methods. The evaluation of gene–suicide association extraction on a smaller dataset from the Genetic Association Database and a larger dataset from publicly available PubMed showed that the proposed unsupervised and semi-supervised methods achieved much higher F-scores than a co-occurrence based method.

8.
We employed a multi-scale clustering methodology known as “data cloud geometry” to extract functional connectivity patterns derived from a functional magnetic resonance imaging (fMRI) protocol. The method was applied to correlation matrices of 106 regions of interest (ROIs) in 29 individuals with autism spectrum disorders (ASD), and 29 individuals with typical development (TD) while they completed a cognitive control task. Connectivity clustering geometry was examined at both “fine” and “coarse” scales. At the coarse scale, the connectivity clustering geometry produced 10 valid clusters with a coherent relationship to neural anatomy. A supervised learning algorithm employed fine scale information about clustering motif configurations and prevalence, and coarse scale information about intra- and inter-regional connectivity; the algorithm correctly classified ASD and TD participants with sensitivity of and specificity of . Most of the predictive power of the logistic regression model resided at the level of the fine-scale clustering geometry, suggesting that cellular versus systems level disturbances are more prominent in individuals with ASD. This article provides validation for this multi-scale geometric approach to extracting brain functional connectivity pattern information and for its use in classification of ASD.

9.
Multiple sequence alignments (MSAs) are one of the most important sources of information in sequence analysis. Many methods have been proposed to detect, extract and visualize their most significant properties. To the same extent that site-specific methods like sequence logos successfully visualize site conservations and sequence-based methods like clustering approaches detect relationships between sequences, both types of methods fail at revealing informational elements of MSAs at the level of sequence–site interactions, i.e. finding clusters of sequences and sites responsible for their clustering, which together account for a high fraction of the overall information of the MSA. To fill this gap, we present here a method that combines the Fisher score-based embedding of sequences from a profile hidden Markov model (pHMM) with correspondence analysis. This method is capable of detecting and visualizing group-specific or conflicting signals in an MSA and allows for a detailed explorative investigation of alignments of any size tractable by pHMMs. Applications of our methods are exemplified on an alignment of the Neisseria surface antigen LP2086, where it is used to detect sites of recombinatory horizontal gene transfer and on the vitamin K epoxide reductase family to distinguish between evolutionary and functional signals.

10.
Tephrochronology uses recognizable volcanic ash layers (from airborne pyroclastic deposits, or tephras) in geological strata to set unique time references for paleoenvironmental events across wide geographic areas. This involves the detection of tephra layers which sometimes are not evident to the naked eye, including the so-called cryptotephras. Tests that are expensive, time-consuming, and/or destructive are often required. Destructive testing for tephra layers of cores from difficult regions, such as Antarctica, which are useful sources of other kinds of information beyond tephras, is always undesirable. Here we propose hyperspectral imaging of cores, Self-Organizing Map (SOM) clustering of the preprocessed spectral signatures, and spatial analysis of the classified images as a convenient, fast, non-destructive method for tephra detection. We test the method in five sediment cores from three Antarctic lakes, and show its potential for detection of tephras and cryptotephras.

11.
An improved algorithm for clustering gene expression data   Cited by: 1 (self-citations: 0, citations by others: 1)
MOTIVATION: Recent advancements in microarray technology allow simultaneous monitoring of the expression levels of a large number of genes over different time points. Clustering is an important tool for analyzing such microarray data, typical properties of which are its inherent uncertainty, noise and imprecision. In this article, a two-stage clustering algorithm, which employs a recently proposed variable string length genetic scheme and a multiobjective genetic clustering algorithm, is proposed. It is based on the novel concept of points having significant membership to multiple classes. An iterated version of the well-known Fuzzy C-Means is also utilized for clustering. RESULTS: The significant superiority of the proposed two-stage clustering algorithm over the average linkage method, the Self-Organizing Map (SOM) and a recently developed weighted Chinese restaurant-based clustering method (CRC), all widely used methods for clustering gene expression data, is established on a variety of artificial and publicly available real-life data sets. The biological relevance of the clustering solutions is also analyzed.
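To make the idea of a point having significant membership to multiple classes concrete, the following sketch computes standard Fuzzy C-Means memberships for a single point given fixed cluster centers. It is not the two-stage algorithm of the paper; the function names and the 0.3 cutoff used to call a membership "significant" are hypothetical choices made for illustration.

```python
import math


def fcm_memberships(point, centers, m=2.0):
    """Fuzzy C-Means membership of one point in each cluster (fuzzifier m > 1)."""
    dists = [math.dist(point, center) for center in centers]
    # A point lying exactly on a center belongs fully to it
    # (shared equally if several centers coincide with the point).
    on_center = [i for i, d in enumerate(dists) if d == 0.0]
    if on_center:
        share = 1.0 / len(on_center)
        return [share if i in on_center else 0.0 for i in range(len(centers))]
    power = 2.0 / (m - 1.0)
    # u_i = 1 / sum_k (d_i / d_k)^(2 / (m - 1)); memberships sum to 1.
    return [1.0 / sum((d_i / d_k) ** power for d_k in dists) for d_i in dists]


def significant_clusters(point, centers, threshold=0.3, m=2.0):
    """Indices of clusters with membership >= threshold (hypothetical cutoff)."""
    memberships = fcm_memberships(point, centers, m)
    return [i for i, u in enumerate(memberships) if u >= threshold]
```

For two centers at (0, 0) and (1, 0), a point halfway between them gets membership 0.5 in each and is flagged for both clusters, whereas a point at (0.25, 0) gets memberships 0.9 and 0.1 and is flagged only for the first.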

12.
13.
Although remarkable progress in metagenomic sequencing of various environmental samples has been made, large numbers of fragment sequences have been registered in the international DNA databanks, primarily without information on gene function and phylotype, and thus with limited usefulness. Industrially useful biological activity is often carried out by a set of genes, such as those constituting an operon. In this connection, metagenomic approaches have a weakness because such gene sets are usually split up, since the sequences obtained by metagenome analyses are fragmented into 1-kb or much shorter segments. Therefore, even when a set of genes responsible for an industrially useful function is found in one metagenome library, it is usually difficult to know whether a single genome harbors the entire gene set or whether different genomes have individual genes. By modifying the Self-Organizing Map (SOM), we previously developed BLSOM for oligonucleotide composition, which allowed classification (self-organization) of sequence fragments according to genomes. Because BLSOM could reassociate genomic fragments according to genomes, BLSOM may ameliorate the abovementioned weakness of metagenome analyses. Here, we have developed a strategy for clustering of metagenomic sequences according to phylotypes and genomes, by testing a gene set contributing to environmental preservation.

14.
Given a set of related proteins, two important problems in biology are the inference of protein subsets such that members of one subset share a common function and the identification of protein regions that possess functional significance. The former is typically approached by hierarchical bottom-up clustering based on pairwise sequence similarity and various linkage rules. The latter is typically approached in a supervised manner, based on global multiple sequence alignment. However, the two problems are inextricably linked, since functional subsets are usually characterized by distinctive functional regions. This paper introduces CASTOR, an automatic and unsupervised system that addresses both problems simultaneously and efficiently. It identifies protein regions that are likely to have functional significance by discovering and refining statistically significant motifs. It infers likely functional protein subsets and their relationships based on the presence of the discovered motifs in a top-down and recursive manner, allowing the identification of both hierarchical and nonhierarchical subset relationships. This is, to our knowledge, the first system that approaches both problems simultaneously in a top-down, systematic manner. CASTOR's performance is evaluated against the G-protein coupled receptor superfamily. The identified protein regions lead to a taxonomical organization of this superfamily that is in remarkable agreement with a biologically motivated one and which outperforms those produced by bottom-up clustering methods. We also find that conventional hierarchical representations may fail to accurately describe the complexity of evolutionary development responsible for the final organization of a complex protein family. In particular, many functional relationships governing distant subfamilies of such a protein family may not be represented hierarchically.

15.
A wheeled mobile mechanism with a passive and/or active linkage mechanism for rough terrain environments is developed and evaluated. A wheeled mobile mechanism with high mobility in rough terrain needs a sophisticated system to adapt to various environments. We focus on the development of a switching controller system for wheeled mobile robots in rough terrain. This system consists of two sub-systems: an environment recognition system using link angles and an adaptive control system. In the environment recognition system, we introduce a Self-Organizing Map (SOM) for clustering link angles. In the adaptive controllers, we introduce neural networks to calculate the inverse model of the wheeled mobile robot. The environment recognition system can recognize the environment in which the robot travels, and the adjustable controllers are tuned by experimental results for each environment. The dual sub-system switching controller system is experimentally evaluated. The system recognizes its environment and adapts by switching the adjustable controllers. This system demonstrates superior performance to a well-tuned single PID controller.

16.
Benthic macroinvertebrates are considered to be one of the most representative taxa in assessing the ecological integrity of aquatic ecosystems. Data for benthic macroinvertebrates collected using the Surber sampler were used for analysis at different sampling sites across different levels of pollution. Species Abundance Distribution (SAD) and Self-Organizing Map (SOM) were utilized in combination to reveal both consistency and variability in community compositions under natural and anthropogenic conditions. According to the SOM, benthic macroinvertebrates were clustered into different seasonal groups (e.g., “summer”, “autumn–winter”) at the less polluted site. SADs of the sampled communities, however, were overall stable across different seasons except for the period from late spring to summer (i.e., low level of abundance for the mid-ranked species in SADs) due to heavy rainfall in the Monsoon climate. As the degree of pollution increased, seasonality decreased in both SOMs and SADs. In all seasons, the SAD curves were fitted to a lognormal distribution for the less polluted site, while the polluted site was in accordance with a geometric series. The parameters in the SAD models were not significantly different across different seasons. Species in the highest ranks in the SADs were persistently dominant regardless of season, while densities of the mid-ranked species were variable in different seasons at the less and intermediately polluted sites. At the severely polluted site, a few selected tolerant species showed persistently high densities, and variability of densities across seasons was minimized. Species groups clustered using the SOM also presented stronger persistence in SADs, and were useful for addressing diverse patterns of species composition and for outlining species associations at different sampling sites through ordination and clustering.
The combined use of SOM and SAD is highly suitable for presenting community properties and ecological integrity in stream ecosystems in response to natural variability and anthropogenic disturbances.
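The geometric series mentioned above as the fit for the polluted site is one of the classical rank-abundance models. A minimal sketch of its expected abundances, assuming the usual parameterisation in which each species pre-empts a fixed fraction k of the remaining resources (the function name and the example values are illustrative and not taken from the study):

```python
def geometric_series_abundances(total, n_species, k):
    """Expected abundance per rank under the geometric series model.

    n_i = total * C * k * (1 - k)**(i - 1), with C = 1 / (1 - (1 - k)**n_species)
    so that the abundances of all n_species ranks sum to `total`.
    """
    if not 0.0 < k < 1.0:
        raise ValueError("k must lie strictly between 0 and 1")
    if n_species < 1:
        raise ValueError("n_species must be at least 1")
    norm = 1.0 - (1.0 - k) ** n_species
    return [total * k * (1.0 - k) ** (rank - 1) / norm
            for rank in range(1, n_species + 1)]
```

Because each rank is a constant fraction (1 - k) of the previous one, the model appears as a straight line on a log-abundance versus rank plot, which is the pattern typically associated with strongly dominated, species-poor communities.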

17.
MOTIVATION: Grouping genes having similar expression patterns is called gene clustering, which has proved to be a useful tool for extracting underlying biological information from gene expression data. Many clustering procedures have shown success in microarray gene clustering; most of them belong to the family of heuristic clustering algorithms. Model-based algorithms are alternative clustering algorithms, which are based on the assumption that the whole set of microarray data is a finite mixture of a certain type of distribution with different parameters. Application of model-based algorithms to unsupervised clustering has been reported. Here, for the first time, we demonstrate the use of a model-based algorithm in supervised clustering of microarray data. RESULTS: We applied the proposed methods to real gene expression data and simulated data. We showed that the supervised model-based algorithm is superior to the unsupervised method and the support vector machine (SVM) method. AVAILABILITY: The program written in the SAS language implementing methods I-III in this report is available upon request. The SVM software is available at the website http://svm.sdsc.edu/cgi-bin/nph-SVMsubmit.cgi

18.

Background  

There are many important clustering questions in computational biology for which no satisfactory method exists. Automated clustering algorithms, when applied to large, multidimensional datasets such as flow cytometry data, prove unsatisfactory in terms of speed, problems with local minima or cluster shape bias. Model-based approaches are restricted by the assumptions of the fitting functions. Furthermore, model-based clustering requires serial clustering for all cluster numbers within a user-defined interval. The final cluster number is then selected by various criteria. These supervised serial clustering methods are time consuming, and different criteria frequently result in different optimal cluster numbers. Various unsupervised heuristic approaches that have been developed, such as affinity propagation, are too expensive to be applied to datasets on the order of 10^6 points, which are often generated by high-throughput experiments.

19.
We introduce an unsupervised competitive learning rule, called the extended Maximum Entropy learning Rule (eMER), for topographic map formation. Unlike Kohonen's Self-Organizing Map (SOM) algorithm, the presence of a neighborhood function is not a prerequisite for achieving topology-preserving mappings, but instead it is intended: (1) to speed up the learning process and (2) to perform nonparametric regression. We show that, when the neighborhood function vanishes, the neural weight density at convergence approaches a linear function of the input density so that the map can be regarded as a nonparametric model of the input density. We apply eMER to density estimation and compare its performance with that of the SOM algorithm and the variable kernel method. Finally, we apply the ‘batch’ version of eMER to nonparametric projection pursuit regression and compare its performance with that of back-propagation learning, projection pursuit learning, constrained topological mapping, and the Heskes and Kappen approach. Received: 12 August 1996 / Accepted in revised form: 9 April 1997
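For contrast with eMER, the role of the neighborhood function in Kohonen's SOM can be seen in a minimal one-dimensional sketch. This illustrates the standard SOM update only, not eMER itself; the function name, the Gaussian neighborhood, the linear decay schedules and all default parameter values are assumptions chosen for the example.

```python
import math
import random


def train_som_1d(data, n_nodes=5, epochs=200, lr0=0.5, sigma0=2.0, seed=0):
    """Train a 1-D chain of SOM nodes on scalar inputs; returns the node weights."""
    rng = random.Random(seed)
    lo, hi = min(data), max(data)
    weights = [rng.uniform(lo, hi) for _ in range(n_nodes)]
    samples = list(data)
    for epoch in range(epochs):
        frac = epoch / epochs
        lr = lr0 * (1.0 - frac)                   # learning rate decays linearly
        sigma = max(sigma0 * (1.0 - frac), 1e-3)  # neighborhood width shrinks
        rng.shuffle(samples)
        for x in samples:
            # Competitive step: the winner is the node closest to the input.
            winner = min(range(n_nodes), key=lambda i: abs(weights[i] - x))
            # Cooperative step: nodes near the winner on the chain move too,
            # scaled by a Gaussian neighborhood function of lattice distance.
            for i in range(n_nodes):
                h = math.exp(-((i - winner) ** 2) / (2.0 * sigma ** 2))
                weights[i] += lr * h * (x - weights[i])
    return weights
```

In this update, it is the neighborhood term h that pulls lattice neighbors toward the same inputs and so produces the topographic ordering. As sigma shrinks toward zero, h vanishes for every node except the winner and the rule reduces to plain competitive learning.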

20.
An ensemble framework for clustering protein-protein interaction networks   Cited by: 3 (self-citations: 0, citations by others: 3)
MOTIVATION: Protein-Protein Interaction (PPI) networks are believed to be important sources of information related to biological processes and complex metabolic functions of the cell. The presence of biologically relevant functional modules in these networks has been theorized by many researchers. However, the application of traditional clustering algorithms for extracting these modules has not been successful, largely due to the presence of noisy false positive interactions as well as specific topological challenges in the network. RESULTS: In this article, we propose an ensemble clustering framework to address this problem. For base clustering, we introduce two topology-based distance metrics to counteract the effects of noise. We develop a PCA-based consensus clustering technique, designed to reduce the dimensionality of the consensus problem and yield informative clusters. We also develop a soft consensus clustering variant to assign multifaceted proteins to multiple functional groups. We conduct an empirical evaluation of different consensus techniques using topology-based, information theoretic and domain-specific validation metrics and show that our approaches can provide significant benefits over other state-of-the-art approaches. Our analysis of the consensus clusters obtained demonstrates that ensemble clustering can (a) produce improved biologically significant functional groupings; and (b) facilitate soft clustering by discovering multiple functional associations for proteins. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
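A common building block of ensemble clustering is the co-association matrix, which records how often each pair of items is placed in the same cluster across the base clusterings. The sketch below shows that generic construction only; it is not the PCA-based consensus technique of the paper, and the function name is an assumption.

```python
def co_association(clusterings):
    """Fraction of base clusterings in which each pair of items shares a cluster.

    `clusterings` is a non-empty list of label lists, one per base clustering,
    all covering the same items in the same order.
    """
    if not clusterings:
        raise ValueError("at least one base clustering is required")
    n = len(clusterings[0])
    counts = [[0] * n for _ in range(n)]
    for labels in clusterings:
        if len(labels) != n:
            raise ValueError("all clusterings must cover the same items")
        for i in range(n):
            for j in range(n):
                if labels[i] == labels[j]:
                    counts[i][j] += 1
    runs = len(clusterings)
    return [[count / runs for count in row] for row in counts]
```

A consensus clustering can then be derived from this matrix, for example by treating one minus each entry as a distance for hierarchical clustering. Pairs with intermediate values are the natural candidates for soft assignment to more than one group.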
