Similar literature
1.
Microarray technology plays an important role in drawing useful biological conclusions by analyzing thousands of gene expressions simultaneously. Image analysis in particular is a key step in microarray analysis, and its accuracy strongly depends on segmentation. Pioneering work on clustering-based segmentation has shown that the k-means clustering algorithm and the moving k-means clustering algorithm are two commonly used methods in microarray image processing. However, they often give unsatisfactory results because real microarray images contain noise, artifacts and spots that vary in size, shape and contrast. To improve segmentation accuracy, in this article we present a combined clustering-based segmentation approach that may be more reliable and is able to segment spots automatically. First, the method starts with a very simple but effective contrast enhancement operation to improve the image quality. Then, automatic gridding based on the maximum between-class variance is applied to separate the spots into independent areas. Next, within each spot region, moving k-means clustering is first conducted to separate the spot from the background, and then k-means clustering is additionally applied to those spots for which the entire boundary could not be obtained. Finally, a refinement step is used to replace false segmentations and the inseparable ones of missing spots. In addition, quantitative comparisons between the improved method and four other segmentation algorithms (edge detection, thresholding, k-means clustering and moving k-means clustering) are carried out on cDNA microarray images from six different data sets.
Experiments on six different data sets, 1) Stanford Microarray Database (SMD), 2) Gene Expression Omnibus (GEO), 3) Baylor College of Medicine (BCM), 4) Swiss Institute of Bioinformatics (SIB), 5) Joe DeRisi’s individual tiff files (DeRisi), and 6) University of California, San Francisco (UCSF), indicate that the improved approach is more robust and sensitive to weak spots. More importantly, it can obtain higher segmentation accuracy in the presence of noise, artifacts and weakly expressed spots compared with the other four methods.
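The gridding step above is described as using the maximum between-class variance, which is the criterion behind Otsu-style thresholding. The following is a minimal sketch of that criterion only, not the authors' gridding procedure. It assumes a flat list of integer intensities in the range 0 to 255, and the function name `otsu_threshold` is hypothetical.

```python
def otsu_threshold(values, levels=256):
    """Return the threshold t that maximizes the between-class variance.

    Intensities <= t form class 0 (e.g. background); the rest form class 1.
    `values` is assumed to be a flat list of integers in [0, levels).
    """
    hist = [0] * levels
    for v in values:
        hist[v] += 1
    total = len(values)
    sum_all = sum(i * h for i, h in enumerate(hist))

    best_t, best_var = 0, -1.0
    w0 = 0      # number of samples in class 0 so far
    sum0 = 0    # intensity sum of class 0 so far
    for t in range(levels):
        w0 += hist[t]
        sum0 += t * hist[t]
        w1 = total - w0
        if w0 == 0 or w1 == 0:
            continue  # one class is empty, variance is undefined
        mu0 = sum0 / w0
        mu1 = (sum_all - sum0) / w1
        # Between-class variance up to the constant factor 1/total**2.
        var_between = w0 * w1 * (mu0 - mu1) ** 2
        if var_between > best_var:
            best_var, best_t = var_between, t
    return best_t
```

For a clearly bimodal input such as fifty samples at 10 and fifty at 200, any threshold between the two modes maximizes the criterion, and the sketch returns the first such value.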

2.

Motivation

In cluster analysis, establishing the validity of specific solutions, algorithms, and procedures presents a significant challenge because there is no null hypothesis to test and no 'right answer'. It has been noted that a replicable classification is not necessarily a useful one, but a useful one that characterizes some aspect of the population must be replicable. By replicable we mean reproducible across multiple samplings from the same population. Methodologists have suggested that the validity of clustering methods should be based on classifications that yield reproducible findings beyond chance levels. We used this approach to determine the performance of commonly used clustering algorithms and the degree of replicability achieved using several microarray datasets.

Methods

We considered four commonly used iterative partitioning algorithms (Self Organizing Maps (SOM), K-means, Clustering LARge Applications (CLARA), and Fuzzy C-means) and evaluated their performance on 37 microarray datasets, with sample sizes ranging from 12 to 172. We assessed the reproducibility of each clustering algorithm by measuring the strength of the relationship between the clustering outputs of subsamples of the 37 datasets. Cluster stability was quantified using Cramér's V² from a k × k table. Cramér's V² is equivalent to the squared canonical correlation coefficient between two sets of nominal variables. Potential scores range from 0 to 1, with 1 denoting perfect reproducibility.
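As an illustration of the stability score, the sketch below computes Cramér's V² from a contingency table that cross-tabulates two clusterings of the same samples. It uses the standard definition V² = χ² / (n · (min(rows, columns) − 1)). The function name and the list-of-lists table layout are assumptions made for this example, not details taken from the study.

```python
def cramers_v_squared(table):
    """Cramér's V² for a contingency table given as a list of rows.

    table[i][j] is the number of samples placed in cluster i by the
    first clustering and in cluster j by the second clustering.
    """
    n = sum(sum(row) for row in table)
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]

    chi2 = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            if expected > 0:
                chi2 += (observed - expected) ** 2 / expected

    k = min(len(row_totals), len(col_totals))
    return chi2 / (n * (k - 1))
```

Because the score depends only on the table's structure, two clusterings that agree up to a relabelling of clusters still score 1, which is the behaviour a stability measure needs.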

Results

All four clustering routines show increased stability with larger sample sizes. K-means and SOM showed a gradual increase in stability with increasing sample size. CLARA and Fuzzy C-means, however, yielded low stability scores until sample sizes approached 30 and then gradually increased thereafter. Average stability never exceeded 0.55 for the four clustering routines, even at a sample size of 50. These findings suggest several plausible scenarios: (1) microarray datasets lack natural clustering structure, thereby producing low stability scores on all four methods; (2) the algorithms studied do not produce reliable results; and/or (3) sample sizes typically used in microarray research may be too small to support derivation of reliable clustering results. Further research should be directed towards evaluating the stability of more clustering algorithms on more datasets, especially those with larger sample sizes, with larger numbers of clusters considered.

3.
Agglomerative hierarchical clustering becomes infeasible when applied to large datasets due to its O(N²) storage requirements. We present a multi-stage agglomerative hierarchical clustering (MAHC) approach aimed at large datasets of speech segments. The algorithm is based on an iterative divide-and-conquer strategy. The data is first split into independent subsets, each of which is clustered separately. This reduces the storage required for sequential implementations, and allows concurrent computation on parallel computing hardware. The resultant clusters are merged and subsequently re-divided into subsets, which are passed to the following iteration. We show that MAHC can match and even surpass the performance of the exact implementation when applied to datasets of speech segments.

4.
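Gene chips (microarrays) can measure the expression of different genes during different biological processes, so they have attracted attention in medical diagnosis and pathological analysis and are now widely applied. Measurements show that different genes are expressed differently at different stages of disease progression, which makes it possible to obtain the expression characteristics of different genes over the course of a disease. In this paper, we present a cluster analysis of gene expression characteristics during breast cancer metastasis, and we improve the k-means clustering algorithm so that it searches for the number of clusters automatically and helps mitigate the tendency of k-means results to become trapped in local minima. A comparison of the average clustering error shows that kr-means outperforms the k-means algorithm. The results may serve as a reference for breast cancer diagnosis and pathological analysis, can be applied to the preparation of small gene-detection chips, and can be used to construct gene regulatory network diagrams.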

5.
The central purpose of this study is to further evaluate the quality of the performance of a new algorithm. The study provides additional evidence on this algorithm, which was designed to increase the overall efficiency of the original k-means clustering technique: the Fast, Efficient, and Scalable k-means algorithm (FES-k-means). The FES-k-means algorithm uses a hybrid approach that comprises the k-d tree data structure that enhances the nearest neighbor query, the original k-means algorithm, and an adaptation rate proposed by Mashor. This algorithm was tested using two real datasets and one synthetic dataset. It was employed twice on all three datasets: once on data trained by the innovative MIL-SOM method and then on the actual untrained data in order to evaluate its competence. This two-step approach of data training prior to clustering provides a solid foundation for knowledge discovery and data mining, otherwise unclaimed by clustering methods alone. The benefits of this method are that it produces clusters similar to the original k-means method at a much faster rate, as shown by runtime comparison data, and that it provides efficient analysis of large geospatial data with implications for disease mechanism discovery. From a disease mechanism discovery perspective, it is hypothesized that the linear-like pattern of elevated blood lead levels discovered in the city of Chicago may be spatially linked to the city's water service lines.

6.
A new integrated image analysis package with quantitative quality control schemes is described for cDNA microarray technology. The package employs an iterative algorithm that utilizes both intensity characteristics and spatial information of the spots on a microarray image for signal–background segmentation and defines five quality scores for each spot to record irregularities in spot intensity, size and background noise levels. A composite score qcom is defined based on these individual scores to give an overall assessment of spot quality. Using qcom we demonstrate that the inherent variability in intensity ratio measurements is closely correlated with spot quality, namely spots with higher quality give less variable measurements and vice versa. In addition, gauging data by qcom can improve data reliability dramatically and efficiently. We further show that the variability in ratio measurements drops exponentially with increasing qcom and, for the majority of spots at the high quality end, this improvement is mainly due to an improvement in correlation between the two dyes. Based on these studies, we discuss the potential of quantitative quality control for microarray data and the possibility of filtering and normalizing microarray data using a quality metrics-dependent scheme.

7.
8.
Partitioning closely related genes into clusters has become an important element of practically all statistical analyses of microarray data. A number of computer algorithms have been developed for this task. Although these algorithms have demonstrated their usefulness for gene clustering, some basic problems remain. This paper describes our work on extracting functional keywords from MEDLINE for a set of genes that are isolated for further study from microarray experiments based on their differential expression patterns. The sharing of functional keywords among genes is used as a basis for clustering in a new approach called BEA-PARTITION in this paper. Functional keywords associated with genes were extracted from MEDLINE abstracts. We modified the Bond Energy Algorithm (BEA), which is widely accepted in psychology and database design but is virtually unknown in bioinformatics, to cluster genes by functional keyword associations. The results showed that BEA-PARTITION and hierarchical clustering algorithm outperformed k-means clustering and self-organizing map by correctly assigning 25 of 26 genes in a test set of four known gene groups. To evaluate the effectiveness of BEA-PARTITION for clustering genes identified by microarray profiles, 44 yeast genes that are differentially expressed during the cell cycle and have been widely studied in the literature were used as a second test set. Using established measures of cluster quality, the results produced by BEA-PARTITION had higher purity, lower entropy, and higher mutual information than those produced by k-means and self-organizing map. Whereas BEA-PARTITION and the hierarchical clustering produced similar quality of clusters, BEA-PARTITION provides clear cluster boundaries compared to the hierarchical clustering. BEA-PARTITION is simple to implement and provides a powerful approach to clustering genes or to any clustering problem where starting matrices are available from experimental observations.
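Two of the cluster-quality measures named above, purity and entropy, can be sketched as follows. This is a generic illustration under common textbook definitions, not the paper's exact formulas. Entropy here is the size-weighted average of each cluster's class entropy in bits, and some authors normalize it differently. The function name is hypothetical.

```python
import math
from collections import Counter


def purity_and_entropy(cluster_labels, class_labels):
    """Return (purity, entropy) of a clustering against known classes.

    purity  : fraction of items that belong to the majority class of
              their cluster (higher is better, maximum 1).
    entropy : size-weighted mean of per-cluster class entropy in bits
              (lower is better, minimum 0).
    """
    n = len(cluster_labels)
    counts = {}  # cluster label -> Counter of class labels
    for cluster, klass in zip(cluster_labels, class_labels):
        counts.setdefault(cluster, Counter())[klass] += 1

    purity = sum(max(c.values()) for c in counts.values()) / n

    entropy = 0.0
    for c in counts.values():
        size = sum(c.values())
        h = -sum((m / size) * math.log2(m / size) for m in c.values())
        entropy += (size / n) * h
    return purity, entropy
```

A clustering that reproduces the known groups exactly gives purity 1 and entropy 0, while a single cluster holding two equally sized classes gives purity 0.5 and entropy 1 bit.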

9.

Background  

Data clustering analysis has been extensively applied to extract information from gene expression profiles obtained with DNA microarrays. To this aim, existing clustering approaches, mainly developed in computer science, have been adapted to microarray data analysis. However, previous studies revealed that microarray datasets have very diverse structures, some of which may not be correctly captured by current clustering methods. We therefore approached the problem from a new starting point, and developed a clustering algorithm designed to capture dataset-specific structures at the beginning of the process.

10.
Application of independent component analysis to microarrays   Total citations: 4, self-citations: 1, citations by others: 3
We apply linear and nonlinear independent component analysis (ICA) to project microarray data into statistically independent components that correspond to putative biological processes, and to cluster genes according to over- or under-expression in each component. We test the statistical significance of enrichment of gene annotations within clusters. ICA outperforms other leading methods, such as principal component analysis, k-means clustering and the Plaid model, in constructing functionally coherent clusters on microarray datasets from Saccharomyces cerevisiae, Caenorhabditis elegans and human.

11.
Diffusion of thiocyanate (SCN⁻) and thiocyanic acid (HSCN) (pK = −1.8) through lipid bilayer membranes was studied as a function of pH. Membranes were made of egg phosphatidylcholine or phosphatidylcholine plus cholesterol (1:1 mol ratio) dissolved in decane or tetradecane. Tracer fluxes and electrical conductances were used to estimate the permeabilities to HSCN and SCN⁻. Over the pH range 1.0 to 3.3 only HSCN crosses the membrane at a significant rate. The relation between the total SCN flux ($J_A$), concentrations and permeabilities is: $1/J_A = 1/\left(P^{ul}([A^-]+[HA])\right) + 1/\left(P^{m}_{HA}[HA]\right)$, where $[A^-]$ and $[HA]$ are the concentrations of SCN⁻ and HSCN, $P^{ul}$ is the permeability coefficient of the unstirred layer, and $P^{m}_{HA}$ is the membrane permeability to HSCN. By fitting this equation to the data we find that $P^{m}_{HA}$ = 2.6 cm·s⁻¹ and $P^{ul}$ = 9.0 × 10⁻⁴ cm·s⁻¹. Conductance measurements indicate that $P^{m}_{A^-}$ is 5 × 10⁻⁹ cm·s⁻¹. Addition of cholesterol to phosphatidylcholine (1:1 mol ratio) reduces $P^{m}_{HA}$ by a factor of 0.4 but has no effect on $P^{m}_{A^-}$. SCN⁻ is a potent inhibitor of acid secretion in gastric mucosa, but the mechanism of SCN⁻ action is unknown. Our results suggest that SCN⁻ acts by combining with H⁺ in the mucosal unstirred layer (secretory pits) and diffusing back into the cells as HSCN, thus dissipating the proton gradient across the secretory membrane. A similar mechanism of action is proposed for some other inhibitors of gastric acid secretion, e.g. nitrite (NO₂⁻), cyanate (CNO⁻) and NH₄⁺.
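The flux relation above treats the unstirred layer and the membrane as two resistances in series. The sketch below evaluates it numerically with the fitted permeabilities from the abstract. Two things are assumptions of this example rather than statements from the source: the split of total thiocyanate into HSCN and SCN⁻ is taken from the Henderson-Hasselbalch relation with pK = −1.8, and concentrations are taken to be in mol·cm⁻³ so that the flux comes out in mol·cm⁻²·s⁻¹. The function names are hypothetical.

```python
def hscn_fraction(ph, pk=-1.8):
    """Fraction of total thiocyanate present as HSCN at a given pH.

    Assumes Henderson-Hasselbalch: [HA] / [A-] = 10 ** (pK - pH).
    """
    ratio = 10 ** (pk - ph)
    return ratio / (1.0 + ratio)


def total_flux(total_conc, ph, p_ul=9.0e-4, p_m_ha=2.6, pk=-1.8):
    """Total SCN flux J_A from 1/J_A = 1/(P_ul*([A-]+[HA])) + 1/(P_m_HA*[HA]).

    total_conc is [A-] + [HA]; p_ul and p_m_ha default to the fitted
    values quoted in the abstract (cm/s).
    """
    ha = total_conc * hscn_fraction(ph, pk)
    unstirred_resistance = 1.0 / (p_ul * total_conc)
    membrane_resistance = 1.0 / (p_m_ha * ha)
    return 1.0 / (unstirred_resistance + membrane_resistance)
```

Under these assumptions the flux rises as the pH falls, because more of the thiocyanate is protonated, and it can never exceed the unstirred-layer limit `p_ul * total_conc`.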

12.
Successful clustering algorithms are highly dependent on parameter settings. The clustering performance degrades significantly unless parameters are properly set, and yet, it is difficult to set these parameters a priori. To address this issue, in this paper, we propose a unique splitting-while-merging clustering framework, named “splitting merging awareness tactics” (SMART), which does not require any a priori knowledge of either the number of clusters or even the possible range of this number. Unlike existing self-splitting algorithms, which over-cluster the dataset to a large number of clusters and then merge some similar clusters, our framework has the ability to split and merge clusters automatically during the process and produces the most reliable clustering results, by intrinsically integrating many clustering techniques and tasks. The SMART framework is implemented with two distinct clustering paradigms in two algorithms: competitive learning and finite mixture model. Nevertheless, within the proposed SMART framework, many other algorithms can be derived for different clustering paradigms. The minimum message length algorithm is integrated into the framework as the clustering selection criterion. The usefulness of the SMART framework and its algorithms is tested on demonstration datasets and simulated gene expression datasets. Moreover, two real microarray gene expression datasets are studied using this approach. Across many performance metrics, all numerical results show that SMART is superior to the existing self-splitting algorithms and traditional algorithms it was compared with. Three main properties of the proposed SMART framework are summarized as: (1) needing no parameters dependent on the respective dataset or a priori knowledge about the datasets, (2) extendible to many different applications, (3) offering superior performance compared with counterpart algorithms.

13.

Background  

Visualization tools allow researchers to obtain a global view of the interrelationships between the probes or experiments of a gene expression (e.g. microarray) data set. Some existing methods include hierarchical clustering and k-means. In recent years, others have proposed applying minimum spanning trees (MST) for microarray clustering. Although MST-based clustering is formally equivalent to the dendrograms produced by hierarchical clustering under certain conditions, visually they can be quite different.
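To make the MST idea concrete, here is a minimal sketch of one common form of MST-based clustering: build a minimum spanning tree over the samples, delete the k − 1 longest edges, and read off the connected components. This is the variant that corresponds to single-linkage hierarchical clustering, and it is not necessarily the method used in the study. It assumes Euclidean distance between points given as tuples, and the function name is hypothetical.

```python
import math


def mst_clusters(points, k):
    """Cluster points into k groups by cutting the k-1 longest MST edges.

    Returns one representative index per point; points sharing a
    representative are in the same cluster.
    """
    n = len(points)
    # Prim's algorithm on the complete Euclidean graph, O(n^2).
    in_tree = [False] * n
    best = [math.inf] * n   # cheapest known edge into the tree
    parent = [-1] * n
    best[0] = 0.0
    edges = []              # (length, u, v) for each MST edge
    for _ in range(n):
        u = min((i for i in range(n) if not in_tree[i]), key=lambda i: best[i])
        in_tree[u] = True
        if parent[u] >= 0:
            edges.append((best[u], parent[u], u))
        for v in range(n):
            if not in_tree[v]:
                d = math.dist(points[u], points[v])
                if d < best[v]:
                    best[v], parent[v] = d, u

    # Keep the n-k shortest edges, i.e. drop the k-1 longest ones.
    edges.sort()
    root = list(range(n))

    def find(x):
        while root[x] != x:
            root[x] = root[root[x]]  # path halving
            x = root[x]
        return x

    for _, a, b in edges[: n - k]:
        root[find(a)] = find(b)
    return [find(i) for i in range(n)]
```

For example, with two well-separated pairs of points and `k=2`, the single long edge joining the pairs is the one removed, so each pair ends up in its own cluster.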

14.
The Shigella flexneri outer membrane (OM) protease IcsP (SopA) is a member of the enterobacterial Omptin family of proteases which cleaves the polarly localised OM protein IcsA that is essential for Shigella virulence. Unlike IcsA however, the specific localisation of IcsP on the cell surface is unknown. To determine the distribution of IcsP, a haemagglutinin (HA) epitope was inserted into the non-essential IcsP OM loop 5 using Splicing by Overlap Extension (SOE) PCR, and IcsPHA was characterised. Quantum Dot (QD) immunofluorescence (IF) surface labelling of IcsPHA was then undertaken. Quantitative fluorescence analysis of S. flexneri 2a 2457T treated with and without tunicamycin to deplete lipopolysaccharide (LPS) O antigen (Oag) showed that IcsPHA was asymmetrically distributed on the surface of septating and non-septating cells, and that this distribution was masked by LPS Oag in untreated cells. Double QD IF labelling of IcsPHA and IcsA showed that IcsPHA preferentially localised to the new pole of non-septating cells and to the septum of septating cells. The localisation of IcsPHA in a rough LPS S. flexneri 2457T strain (with no Oag) was also investigated and a similar distribution of IcsPHA was observed. Complementation of the rough LPS strain with rmlD resulted in restored LPS Oag chain expression and loss of IcsPHA detection, providing further support for LPS Oag masking of surface proteins. Our data present for the first time the distribution of the Omptin OM protease IcsP, relative to IcsA, and the effect of LPS Oag masking on its detection.

15.

Background  

Cells dynamically adapt their gene expression patterns in response to various stimuli. This response is orchestrated into a number of gene expression modules consisting of co-regulated genes. A growing pool of publicly available microarray datasets allows the identification of modules by monitoring expression changes over time. These time-series datasets can be searched for gene expression modules by one of the many clustering methods published to date. For an integrative analysis, several time-series datasets can be joined into a three-dimensional gene-condition-time dataset, to which standard clustering or biclustering methods are, however, not applicable. We thus devise a probabilistic clustering algorithm for gene-condition-time datasets.

16.

Background

Biclustering algorithms can find a number of co-expressed genes under a set of experimental conditions. Recently, differential co-expression bicluster mining has been used to infer meaningful patterns in two microarray datasets, such as those from normal and cancer cells.

Methods

In this paper, we propose an algorithm, DECluster, to mine Differential co-Expression biCluster in two discretized microarray datasets. Firstly, DECluster produces the differential co-expressed genes from each pair of samples in two microarray datasets, and constructs a differential weighted undirected sample–sample relational graph. Secondly, the differential biclusters are generated in the above differential weighted undirected sample–sample relational graph. In order to mine maximal differential co-expression biclusters efficiently, we design several pruning techniques for generating maximal biclusters without candidate maintenance.

Results

The experimental results show that our algorithm is more efficient than existing methods. The performance of DECluster is evaluated by empirical p-value and gene ontology. The results show that our algorithm can find more statistically significant and biologically meaningful differential co-expression biclusters than other algorithms.

Conclusions

Our proposed algorithm can find more statistically significant and biologically meaningful biclusters in two microarray datasets than the other two algorithms.

17.
MOTIVATION: A promising and reliable approach to annotating gene function is to cluster genes using not only gene expression data but also literature information, especially gene networks. RESULTS: We present a systematic method for gene clustering that combines these two totally different types of data, particularly focusing on network modularity, a global feature of gene networks. Our method is based on learning a probabilistic model, which we call a hidden modular random field, in which the relation between hidden variables directly represents a given gene network. Our learning algorithm, which minimizes an energy function that takes the network modularity into account, is time-efficient in practice even though it uses a global network property. We evaluated our method using a metabolic network and microarray expression data, varying the microarray datasets, the parameters of our model and the gold standard clusters. Experimental results showed that our method outperformed four other competing methods, including k-means and existing graph partitioning methods, with statistical significance in all cases. Further detailed analysis showed that our method could group a set of genes into a cluster that corresponds to the folate metabolic pathway while other methods could not. From these results, we can say that our method is highly effective for gene clustering and annotating gene function.

18.
Four of the most common limitations of the many available clustering methods are: i) the lack of a proper strategy to deal with outliers; ii) the need for a good a priori estimate of the number of clusters to obtain reasonable results; iii) the lack of a method able to detect when partitioning of a specific data set is not appropriate; and iv) the dependence of the result on the initialization. Here we propose Cross-clustering (CC), a partial clustering algorithm that overcomes these four limitations by combining the principles of two well established hierarchical clustering algorithms: Ward’s minimum variance and Complete-linkage. We validated CC by comparing it with a number of existing clustering methods, including Ward’s and Complete-linkage. We show on both simulated and real datasets that CC performs better than the other methods in terms of the identification of the correct number of clusters, the identification of outliers, and the determination of real cluster memberships. We used CC to cluster samples in order to identify disease subtypes, and to cluster gene profiles in order to determine groups of genes with the same behavior. Results obtained on a non-biological dataset show that the method is general enough to be successfully used in such diverse applications. The algorithm has been implemented in the statistical language R and is freely available from the CRAN contributed packages repository.

19.
20.
Surface water contamination from agricultural and urban runoff and wastewater discharges from industrial and municipal activities is of major concern to people worldwide. Classical models can be insufficient to visualise the results because the water quality variables used to describe dynamic pollution sources are complex, multivariable, and nonlinearly related. Artificial intelligence techniques with the ability to analyse multivariate water quality data by means of a sophisticated visualisation capacity can offer an alternative to current models. In this study, the Kohonen self-organising feature maps (SOM) neural network was initially applied to analyse the complex nonlinear relationships among multivariable surface water quality variables using the component planes of the variables to determine the complex behaviour of water quality parameters. The dependencies between water quality variables were extracted and interpreted using the pattern analysis visualised in component planes. For further investigation, the k-means clustering algorithm was used to determine the optimal number of clusters by partitioning the maps and utilising the Davies–Bouldin clustering index, leading to seven groups or clusters corresponding to water quality variables. The results reveal that the concentrations of Na, K, Cl, NH4-N, NO2-N, o-PO4, component planes of organic matter (pV), and dissolved oxygen (DO) were significantly affected by seasonal changes, and that the SOM technique is an efficient tool with which to analyse and determine the complex behaviour of multidimensional surface water quality data. These results suggest that this technique could also be applied to other environmentally sensitive areas such as air and groundwater pollution.
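The Davies–Bouldin index mentioned above scores a partition by comparing within-cluster scatter with between-centroid distance, and lower values indicate more compact, better separated clusters. The sketch below follows the standard definition with Euclidean distance. It is a generic illustration, not the study's code. It assumes at least two clusters with distinct centroids, and the function name is hypothetical.

```python
import math


def davies_bouldin(points, labels):
    """Davies–Bouldin index of a labelled set of points (lower is better).

    For each cluster i, take the worst-case ratio
    (s_i + s_j) / d(c_i, c_j) over all other clusters j, where s is the
    mean distance of a cluster's points to its centroid c; then average.
    """
    groups = {}
    for point, label in zip(points, labels):
        groups.setdefault(label, []).append(point)

    centroids, scatter = {}, {}
    for label, members in groups.items():
        centroid = tuple(sum(axis) / len(members) for axis in zip(*members))
        centroids[label] = centroid
        scatter[label] = sum(math.dist(p, centroid) for p in members) / len(members)

    keys = list(groups)
    total = 0.0
    for i in keys:
        total += max(
            (scatter[i] + scatter[j]) / math.dist(centroids[i], centroids[j])
            for j in keys
            if j != i
        )
    return total / len(keys)
```

To choose the number of clusters in the way the abstract describes, one would run k-means for several values of k and keep the partition with the smallest index.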
