Similar articles
 20 similar articles found (search time: 0 ms)
1.
Analysis of gene expression data using self-organizing maps.   Total citations: 29 (self-citations: 0; citations by others: 29)
DNA microarray technologies together with rapidly increasing genomic sequence information are leading to an explosion in available gene expression data. Currently there is a great need for efficient methods to analyze and visualize these massive data sets. A self-organizing map (SOM) is an unsupervised neural network learning algorithm which has been successfully used for the analysis and organization of large data files. Here we have applied the SOM algorithm to published yeast gene expression data and show that the SOM is an excellent tool for the analysis and visualization of gene expression profiles.
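As a rough illustration of the SOM algorithm this abstract refers to, here is a minimal sketch of the classic online update rule in plain Python. The function names, the rectangular grid, the Gaussian neighborhood, and the linear decay schedules are assumptions made for this example; the abstract does not describe the authors' actual configuration.

```python
import math
import random


def best_matching_unit(weights, x):
    """Return the grid position of the node whose weight vector is closest to x."""
    return min(weights, key=lambda n: sum((a - b) ** 2 for a, b in zip(weights[n], x)))


def train_som(data, rows, cols, epochs=50, lr0=0.5, sigma0=None, seed=0):
    """Train a small rectangular SOM with the online (one sample at a time) rule."""
    rng = random.Random(seed)
    dim = len(data[0])
    sigma0 = sigma0 or max(rows, cols) / 2.0
    # One weight vector per map node, initialised from randomly chosen samples.
    weights = {(r, c): list(rng.choice(data)) for r in range(rows) for c in range(cols)}
    total_steps = epochs * len(data)
    step = 0
    for _ in range(epochs):
        order = list(data)
        rng.shuffle(order)
        for x in order:
            frac = step / total_steps
            lr = lr0 * (1.0 - frac)                  # learning rate decays to zero
            sigma = sigma0 * (1.0 - frac) + 0.01     # neighborhood radius shrinks
            bmu = best_matching_unit(weights, x)
            for node, w in weights.items():
                d2 = (node[0] - bmu[0]) ** 2 + (node[1] - bmu[1]) ** 2
                h = math.exp(-d2 / (2.0 * sigma * sigma))  # Gaussian neighborhood
                for i in range(dim):
                    w[i] += lr * h * (x[i] - w[i])   # pull node toward the sample
            step += 1
    return weights
```

Each expression profile can then be assigned to a map node with `best_matching_unit(weights, profile)`, so that profiles with similar shapes end up on the same or neighboring nodes.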

2.
Novel tools are needed for efficient analysis and visualization of the massive data sets associated with metabolomics. Here, we describe a batch-learning self-organizing map (BL-SOM) for metabolome informatics that makes the learning process and resulting map independent of the order of data input. This approach was successfully used in analyzing and organizing the metabolome data for Arabidopsis thaliana cells cultured under salt stress. Our 6 × 4 matrix presented patterns of metabolite levels at different time periods. A negative correlation was found between the levels of amino acids and metabolites related to glycolysis metabolism in response to this stress. Therefore, BL-SOM could be an excellent tool for clustering and visualizing high-dimensional, complex metabolome data in a single map.
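The order independence mentioned in this abstract can be illustrated with a generic batch SOM update, in which every node is recomputed from all samples at once. The sketch below is not the BL-SOM implementation itself: the diagonal bounding-box initialisation, the neighborhood schedule, and all names are assumptions chosen so that nothing depends on sample order. `math.fsum` is used because it returns an exactly rounded sum regardless of the order of its inputs.

```python
import math


def sq_dist(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((u - v) ** 2 for u, v in zip(a, b))


def batch_som(data, rows, cols, iterations=20):
    """Batch SOM: each iteration recomputes every node from all samples at once."""
    dim = len(data[0])
    lo = [min(x[i] for x in data) for i in range(dim)]
    hi = [max(x[i] for x in data) for i in range(dim)]
    nodes = [(r, c) for r in range(rows) for c in range(cols)]
    n = len(nodes)
    # Deterministic initialisation (an assumption of this sketch): spread the
    # nodes evenly along the diagonal of the data's bounding box.
    weights = {
        node: [lo[i] + (hi[i] - lo[i]) * k / max(n - 1, 1) for i in range(dim)]
        for k, node in enumerate(nodes)
    }
    for it in range(iterations):
        sigma = max(rows, cols) / 2.0 * (1.0 - it / iterations) + 0.1
        # Assign every sample to its best-matching node using the current map.
        bmus = [min(nodes, key=lambda m: sq_dist(weights[m], x)) for x in data]
        new_weights = {}
        for node in nodes:
            # Neighborhood weight of each sample with respect to this node.
            h = [
                math.exp(-((node[0] - b[0]) ** 2 + (node[1] - b[1]) ** 2)
                         / (2.0 * sigma * sigma))
                for b in bmus
            ]
            total = math.fsum(h)
            new_weights[node] = [
                math.fsum(hj * x[i] for hj, x in zip(h, data)) / total
                for i in range(dim)
            ]
        weights = new_weights
    return weights
```

Because the update is a neighborhood-weighted average over the whole data set, shuffling or reversing the input list leaves the resulting map unchanged, which is not true of the sample-by-sample online rule.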

3.
Many clustering methods require that the number of clusters believed present in a given data set be specified a priori, and a number of methods for estimating the number of clusters have been developed. However, the selection of the number of clusters is well recognized as a difficult and open problem and there is a need for methods which can shed light on specific aspects of the data. This paper adopts a model for clustering based on a specific structure for a similarity matrix. Publicly available gene expression data sets are analyzed to illustrate the method and the performance of our method is assessed by simulation.

4.
5.
With the rapid advances of various single-cell technologies, an increasing number of single-cell datasets are being generated, and computational tools for aligning the datasets, which make subsequent integration or meta-analysis possible, have become critical. Typically, single-cell datasets from different technologies cannot be directly combined or concatenated, due to innate differences in the data, such as the number of measured parameters and the distributions. Even datasets generated by the same technology are often affected by batch effects. A computational approach for aligning different datasets and hence identifying related clusters will be useful for data integration and interpretation in large-scale single-cell experiments. Our proposed algorithm, called JSOM, a variation of the self-organizing map, aligns two related datasets that contain similar clusters by constructing two maps (low-dimensional, discretized representations of the datasets) that jointly evolve according to both datasets. Here we applied the JSOM algorithm to flow cytometry, mass cytometry, and single-cell RNA sequencing datasets. The resulting JSOM maps not only align the related clusters in the two datasets but also preserve the topology of the datasets so that the maps can be used for further analysis, such as clustering.

6.

Background  

One of the goals of global metabolomic analysis is to identify metabolic markers that are hidden within a large background of data originating from high-throughput analytical measurements. Metabolite-based clustering is an unsupervised approach for marker identification based on grouping similar concentration profiles of putative metabolites. A major problem of this approach is that in general there is no prior information about an adequate number of clusters.

7.
8.
Next-generation sequencing (NGS) technologies have enabled the application of broad-scale sequencing in microbial biodiversity and metagenome studies. Biodiversity is usually targeted by classifying 16S ribosomal RNA genes, while metagenomic approaches target metabolic genes. However, both approaches remain isolated as long as the taxonomic and functional information cannot be interrelated. Techniques like self-organizing maps (SOMs) have been applied to cluster metagenomes into taxon-specific bins in order to link biodiversity with functions, but have not yet been applied to broad-scale NGS-based metagenomics. Here, we provide a novel implementation, demonstrate its potential and practicability, and provide a web-based service for public usage. Evaluation with published data sets mimicking varyingly complex habitats resulted in classification specificities and sensitivities of close to 100% to above 90% from phylum to genus level for assemblies exceeding 8 kb for low and medium complexity data. When applied to five real-world metagenomes of medium complexity from direct pyrosequencing of marine subsurface waters, classifications of assemblies above 2.5 kb were in good agreement with fluorescence in situ hybridizations, indicating that biodiversity was mostly retained within the metagenomes, and confirming high classification specificities. This was validated by two protein-based classification (PBC) methods. SOMs were able to retrieve the relevant taxa down to the genus level, while surpassing PBCs in resolution. In order to make the approach accessible to a broad audience, we implemented a feature-rich web-based SOM application named TaxSOM, which is freely available at http://www.megx.net/toolbox/taxsom. TaxSOM can classify reads or assemblies exceeding 2.5 kb with high accuracy and thus assists in linking biodiversity and functions in metagenome studies, which is a precondition for studying microbial ecology in a holistic fashion.

9.
Epidemiological processes leave a fingerprint in the pattern of genetic structure of virus populations. Here, we provide a new method to infer epidemiological parameters directly from viral sequence data. The method is based on phylogenetic analysis using a birth-death model (BDM) rather than the commonly used coalescent as the model for the epidemiological transmission of the pathogen. Using the BDM has the advantage that transmission and death rates are estimated independently, which makes it possible, for the first time, to estimate the basic reproductive number of the pathogen using only sequence data, without further assumptions such as the average duration of infection. We apply the method to genetic data of the HIV-1 epidemic in Switzerland.

10.
11.
The self-organizing map (SOM) is an unsupervised learning method based on neural computation that has found wide application. However, the learning process sometimes exhibits multistable states, in which the map becomes trapped in an undesirable disordered state containing topological defects. These topological defects severely degrade the performance of the SOM. In order to overcome this problem, we propose to introduce an asymmetric neighborhood function for the SOM algorithm. Compared with the conventional symmetric one, the asymmetric neighborhood function accelerates the ordering process even in the presence of a defect. However, this asymmetry tends to generate a distorted map. This can be suppressed by an improved method of the asymmetric neighborhood function. For the one-dimensional SOM, numerical results show that the number of steps required for perfect ordering is reduced from O(N^3) to O(N^2). We also discuss the ordering process of a twisted state in the two-dimensional SOM, which cannot be rectified by the ordinary symmetric neighborhood function.

12.
13.
Using the data on proteins encoded in complete genomes, combined with a rigorous theory of the sampling process, we estimate the total number of protein folds and families, as well as the number of folds and families in each genome. The total number of folds in globular, water-soluble proteins is estimated at about 1000, with structural information currently available for about one-third of that number. The sequenced genomes of unicellular organisms encode from approximately 25%, for the minimal genomes of the Mycoplasmas, to 70-80%, for larger genomes such as Escherichia coli and yeast, of the total number of folds. The number of protein families with significant sequence conservation was estimated to be between 4000 and 7000, with structures available for about 20% of these.

14.
Mahé C, Chevret S. Biometrics, 1999, 55(4): 1078-1084
Multivariate failure time data are frequently encountered in longitudinal studies when subjects may experience several events or when individuals are grouped into clusters. To take into account the dependence of the failure times within the unit (the individual or the cluster) as well as censoring, two multivariate generalizations of the Cox proportional hazards model are commonly used. The marginal hazard model is used when the purpose is to estimate mean regression parameters, while the frailty model is retained when the purpose is to assess the degree of dependence within the unit. We propose a new approach based on the combination of the two aforementioned models to estimate both these quantities. This two-step estimation procedure is quicker and simpler to implement than the EM algorithm used in frailty model estimation. Simulation results are provided to illustrate robustness, consistency, and large-sample properties of the estimators. Finally, this method is exemplified on a diabetic retinopathy study in order to assess the effect of photocoagulation in delaying the onset of blindness as well as the dependence between the blindness times of a patient's two eyes.

15.
Many external and internal validity measures have been proposed to estimate the number of clusters in gene expression data, but as a rule they do not consider the stability of the groupings produced by a clustering algorithm. Based on the approach of assessing the predictive power or stability of a partitioning, we propose a new measure of cluster validation and a selection procedure to determine a suitable number of clusters. The validity measure is based on an estimate of the "clearness" of the consensus matrix, which is the result of a resampling clustering scheme, or consensus clustering. According to the proposed selection procedure, the stable clustering result is determined with reference to the validity measure for the null hypothesis encoding the absence of clusters. The final number of clusters is selected by analyzing the distance between the validity plots for the initial and permuted data sets. We applied the selection procedure to estimate the clustering results on several datasets. The proposed procedure produced an accurate and robust estimate of the number of clusters, which is in agreement with biological knowledge and gold standards of cluster quality.

16.
Background modeling and foreground detection are key parts of any computer vision system. These problems have been addressed in the literature with several probabilistic approaches based on mixture models. Here we propose a new kind of probabilistic background model that is based on probabilistic self-organising maps. In this way, the background pixels are modeled with more flexibility. In addition, a statistical correlation measure is used to test the similarity among nearby pixels, so as to enhance the detection performance by providing feedback to the process. Several well-known benchmark videos have been used to assess the relative performance of our proposal with respect to traditional neural and non-neural methods, with favourable results, both qualitatively and quantitatively. A statistical analysis of the differences among methods demonstrates that our method is significantly better than its competitors. Thus, a strong alternative to classical methods is presented.

17.
The number of families in the urban fox population of Sapporo, Japan, was estimated from two sets of data reported by the public to the government: records of road-killed foxes (information-A) and records of complaints about foxes (information-B). We assumed that fox populations consist of families that have exclusive home ranges, i.e., territories, during the period between gestation and dispersal. The urban area was then divided into hexagons that correspond to the territories. The locations from the two sets of records during the territorial period were plotted on the map. The number of fox families for which information-A and/or B was reported was estimated by counting the number of hexagons that include a record. The total number of families was estimated by using a double-observation method. We adopted Chapman's unbiased estimator, which is based on the hypergeometric distribution that corresponds to the conditional likelihood. We demonstrated the possibility of estimating the abundance of animals from government data such as road kill and complaints if the animals have territories. Electronic supplementary material  The online version of this article (doi:) contains supplementary material, which is available to authorized users.
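The double-observation step in this abstract can be sketched with the standard form of Chapman's estimator, N = (n1 + 1)(n2 + 1) / (m + 1) - 1, where n1 and n2 are the numbers of hexagons with a record in each data source and m is the number of hexagons recorded in both. The function names and the representation of hexagons as ID sets are assumptions made for this example, not details taken from the paper.

```python
def chapman_estimate(n1, n2, m):
    """Chapman's estimator of total population size from two observation sets.

    n1: units seen in the first source, n2: units seen in the second source,
    m: units seen in both sources.
    """
    if m < 0 or m > min(n1, n2):
        raise ValueError("m must lie between 0 and min(n1, n2)")
    return (n1 + 1) * (n2 + 1) / (m + 1) - 1


def estimate_families(hexagons_a, hexagons_b):
    """Estimate the number of families from hexagon IDs seen in each source.

    hexagons_a: IDs of hexagons containing a road-kill record (hypothetical input).
    hexagons_b: IDs of hexagons containing a complaint record (hypothetical input).
    """
    a, b = set(hexagons_a), set(hexagons_b)
    return chapman_estimate(len(a), len(b), len(a & b))
```

Adding one to each count keeps the estimate finite even when no hexagon appears in both sources, which the simpler ratio n1 * n2 / m does not.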

18.

Background  

Clustering techniques are routinely used in gene expression data analysis to organize the massive data. Clustering techniques arrange a large number of genes or assays into a few clusters while maximizing the intra-cluster similarity and inter-cluster separation. While clustering of genes facilitates learning the functions of uncharacterized genes using their association with known genes, clustering of assays reveals disease stages and subtypes. Many clustering algorithms require the user to specify the number of clusters a priori. A wrong specification of the number of clusters generally leads either to failure to detect novel clusters (disease subtypes) or to unnecessary splitting of natural clusters.

19.
The use of self-organizing maps to analyze data often depends on finding effective methods to visualize the SOM's structure. In this paper we propose a new way to perform that visualization using a variant of Andrews' Curves. We also show that the interaction between these two methods allows us to find sub-clusters within identified clusters. Perhaps more importantly, using the SOM to pre-process data by identifying gross features enables us to use Andrews' Curves on data sets which would previously have been too large for the methodology. Finally, we show how a three-way interaction between the human user and these two methods can be a valuable exploratory data analysis tool.
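For orientation, the standard Andrews' curve maps a vector x = (x1, x2, x3, ...) to the function f(t) = x1/sqrt(2) + x2 sin(t) + x3 cos(t) + x4 sin(2t) + x5 cos(2t) + ... on the interval from -pi to pi. The sketch below evaluates that standard form; the paper uses a variant that the abstract does not specify, so this is only an illustration, and the function names are assumptions.

```python
import math


def andrews_curve(x, t):
    """Evaluate the standard Andrews' curve of vector x at angle t (radians)."""
    value = x[0] / math.sqrt(2.0)
    for k, coeff in enumerate(x[1:], start=1):
        freq = (k + 1) // 2  # frequencies go 1, 1, 2, 2, 3, 3, ...
        # Odd positions use sine terms, even positions use cosine terms.
        value += coeff * (math.sin(freq * t) if k % 2 == 1 else math.cos(freq * t))
    return value


def sample_curve(x, points=100):
    """Sample the curve of x at evenly spaced angles from -pi to pi."""
    ts = [-math.pi + 2.0 * math.pi * i / (points - 1) for i in range(points)]
    return [(t, andrews_curve(x, t)) for t in ts]
```

Applying `sample_curve` to each SOM node's weight vector, rather than to every raw sample, is one way to keep the number of plotted curves manageable, in the spirit of the pre-processing idea described in the abstract.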

20.
Murakoshi K. Bio Systems, 2005, 80(1): 37-40
Overfitting in multilayer perceptron (MLP) training is a serious problem. The purpose of this study is to avoid overfitting in on-line learning. To overcome the overfitting problem, we have investigated feeling-of-knowing (FOK) using self-organizing maps (SOMs). We propose MLPs with FOK using the SOMs method to overcome the overfitting problem. In this method, the learning process advances according to the degree of FOK calculated using SOMs. The mean square error obtained for the test set using the proposed method is significantly less than that in a conventional MLP method. Consequently, the proposed method avoids overfitting.
