Found 20 similar articles (search time: 0 ms)
1.
Redestig H Repsilber D Sohler F Selbig J 《Biometrical journal. Biometrische Zeitschrift》2007,49(2):214-229
Clustering of microarray gene expression data is performed routinely, for genes as well as for samples. Clustering of genes can reveal functional relationships between genes; clustering of samples, on the other hand, is important for finding e.g. disease subtypes, relevant patient groups for stratification or related treatments. Usually this is done by first filtering the genes for high variance under the assumption that they carry most of the information needed for separating different sample groups. If this assumption is violated, important groupings in the data might be lost. Furthermore, classical clustering methods do not facilitate the biological interpretation of the results. Therefore, we propose to methodologically integrate the clustering algorithm with prior biological information. This is different from other approaches as knowledge about classes of genes can be directly used to ease the interpretation of the results and possibly boost clustering performance. Our approach computes dendrograms that resemble decision trees with gene classes used to split the data at each node, which can help to find biologically meaningful differences between the sample groups. We have tested the proposed method both on simulated and real data and conclude that it is useful as a complementary method, especially when the assumptions of few differentially expressed genes and an informative mapping of genes to different classes are met.
2.
This paper suggests a model building methodology for dealing with new processes. The methodology, called Hybrid Fuzzy Neural Networks (HFNN), combines unsupervised fuzzy clustering and supervised neural networks in order to create simple and flexible models. Fuzzy clustering was used to define relevant domains on the input space. Then, sets of multilayer perceptrons (MLP) were trained (one for each domain) to map input-output relations, creating, in the process, a set of specified sub-models. The estimated output of the model was obtained by fusing the different sub-model outputs weighted by their predicted possibilities. On-line reinforcement learning enabled improvement of the model. The determination of the optimal number of clusters is fundamental to the success of the HFNN approach. The effectiveness of several validity measures was compared to the generalization capability of the model and information criteria. The validity measures were tested with fermentation simulations and real fermentations of a yeast-like fungus, Aureobasidium pullulans. The results outline the limitations of these criteria. The learning capability of the HFNN was tested with the fermentation data. The results underline the advantages of HFNN over a single neural network.
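The fusion step described in this abstract can be illustrated with a minimal sketch. It is not the authors' HFNN implementation: the weights here are standard fuzzy c-means memberships (an assumption standing in for the "predicted possibilities"), the inputs are scalars, and the sub-models are placeholder callables where the paper trains one MLP per domain. The names `fuzzy_memberships` and `fuse` are hypothetical.

```python
def fuzzy_memberships(x, centers, m=2.0):
    """Fuzzy c-means style memberships of scalar x in each cluster (they sum to 1).

    Assumption: the standard FCM membership formula with fuzzifier m > 1.
    """
    dists = [abs(x - c) for c in centers]
    hits = sum(1 for d in dists if d == 0.0)
    if hits:
        # x sits exactly on one or more centers: share full membership among them.
        return [1.0 / hits if d == 0.0 else 0.0 for d in dists]
    power = 2.0 / (m - 1.0)
    return [1.0 / sum((d_i / d_j) ** power for d_j in dists) for d_i in dists]


def fuse(x, centers, sub_models, m=2.0):
    """Combine sub-model outputs, each weighted by the membership of x in its domain."""
    weights = fuzzy_memberships(x, centers, m)
    return sum(w * model(x) for w, model in zip(weights, sub_models))


if __name__ == "__main__":
    # Two hypothetical domains centred at 0 and 10, with constant stand-in sub-models.
    models = [lambda x: 1.0, lambda x: 5.0]
    print(fuse(5.0, [0.0, 10.0], models))  # halfway between the two domains
```

Soft weighting of this kind lets the combined model change smoothly as an input moves from one domain to the next, rather than switching abruptly between sub-models.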
3.
Clustering 16S rRNA for OTU prediction: a method of unsupervised Bayesian clustering Total citations: 1, self-citations: 0, citations by others: 1
MOTIVATION: With the advancements of next-generation sequencing technology, it is now possible to study samples directly obtained from the environment. In particular, 16S rRNA gene sequences have been frequently used to profile the diversity of organisms in a sample. However, such studies still struggle to determine both the number of operational taxonomic units (OTUs) and their relative abundance in a sample. RESULTS: To address these challenges, we propose an unsupervised Bayesian clustering method termed Clustering 16S rRNA for OTU Prediction (CROP). CROP can find clusters based on the natural organization of data without setting a hard cut-off threshold (3%/5%) as required by hierarchical clustering methods. By applying our method to several datasets, we demonstrate that CROP is robust against sequencing errors and that it produces more accurate results than conventional hierarchical clustering methods. Availability and Implementation: Source code freely available at the following URL: http://code.google.com/p/crop-tingchenlab/, implemented in C++ and supported on Linux and MS Windows.
4.
5.
Background
Clustering the information content of large high-dimensional gene expression datasets has widespread application in "omics" biology. Unfortunately, the underlying structure of these natural datasets is often fuzzy, and the computational identification of data clusters generally requires knowledge about cluster number and geometry.
6.
Background
Human cancers are complex ecosystems composed of cells with distinct molecular signatures. Such intratumoral heterogeneity poses a major challenge to cancer diagnosis and treatment. Recent advancements of single-cell techniques such as scRNA-seq have brought unprecedented insights into cellular heterogeneity. Subsequently, a challenging computational problem is to cluster high dimensional noisy datasets with substantially fewer cells than the number of genes.
Methods
In this paper, we introduced a consensus clustering framework conCluster, for cancer subtype identification from single-cell RNA-seq data. Using an ensemble strategy, conCluster fuses multiple basic partitions to consensus clusters.
Results
Applied to real cancer scRNA-seq datasets, conCluster can more accurately detect cancer subtypes than the widely used scRNA-seq clustering methods. Further, we conducted co-expression network analysis for the identified melanoma subtypes.
Conclusions
Our analysis demonstrates that these subtypes exhibit distinct gene co-expression networks and significant gene sets with different functional enrichment.
7.
8.
9.
Fluorescence time-lapse microscopy has become a powerful tool in the study of many biological processes at the single-cell level. In particular, movies depicting the temporal dependence of gene expression provide insight into the dynamics of its regulation; however, there are many technical challenges to obtaining and analyzing fluorescence movies of single cells. We describe here a simple protocol using a commercially available microfluidic culture device to generate such data, and a MATLAB-based software package with a graphical user interface (GUI) to quantify the fluorescence images. The software segments and tracks cells, enables the user to visually curate errors in the data, and automatically assigns lineage and division times. The GUI further analyzes the time series to produce whole cell traces as well as their first and second time derivatives. While the software was designed for S. cerevisiae, its modularity and versatility should allow it to serve as a platform for studying other cell types with few modifications.
10.
Various methods have been used to identify cultivars of olive trees; herein we used different bioinformatics algorithms to propose new tools to classify 10 olive cultivars based on RAPD and ISSR genetic marker datasets generated from PCR reactions. Five RAPD markers (OPA0a21, OPD16a, OP01a1, OPD16a1 and OPA0a8) and five ISSR markers (UBC841a4, UBC868a7, UBC841a14, U12BC807a and UBC810a13) were selected as the most important markers by all attribute weighting models. K-Medoids unsupervised clustering run on the SVM dataset was fully able to assign each olive cultivar to the right class. All 176 trees induced by the decision tree models were meaningful, and the UBC841a4 attribute clearly distinguished between foreign and domestic olive cultivars with 100% accuracy. Predictive machine learning algorithms (SVM and Naïve Bayes) were also able to predict the right class of olive cultivars with 100% accuracy. For the first time, our results showed that data mining techniques can be effectively used to distinguish between plant cultivars, and that the machine learning based systems proposed in this study can predict new olive cultivars with the best possible accuracy.
11.
Wang Y Wu R Cho KR Shedden KA Barder TJ Lubman DM 《Molecular & cellular proteomics : MCP》2006,5(1):43-52
A two-dimensional liquid mapping method was used to map the protein expression of eight ovarian serous carcinoma cell lines and three immortalized ovarian surface epithelial cell lines. Maps were produced using pI as the separation parameter in the first dimension and hydrophobicity based upon reversed-phase HPLC separation in the second dimension. The method can be reproducibly used to produce protein expression maps over a pH range from 4.0 to 8.5. A dynamic programming method was used to correct for minor shifts in peaks during the HPLC gradient between sample runs. The resulting corrected maps can then be compared using hierarchical clustering to produce dendrograms indicating the relationship between different cell lines. It was found that several of the ovarian surface epithelial cell lines clustered together, whereas specific groups of serous carcinoma cell lines clustered with each other. Although there is limited information on the current biology of these cell lines, it was shown that the protein expression of certain cell lines is closely related to each other. Other cell lines, including one ovarian clear cell carcinoma cell line, two endometrioid carcinoma cell lines, and three breast epithelial cell lines, were also mapped for comparison to show that their protein profiles cluster differently than the serous samples and to study how they cluster relative to each other. In addition, comparisons can be made between proteins differentially expressed between cell lines that may serve as markers of ovarian serous carcinomas. The automation of the method allows reproducible comparison of many samples, and the use of differential analysis limits the number of proteins that might require further analysis by mass spectrometry techniques.
12.
Gronwald W Moussa S Elsner R Jung A Ganslmeier B Trenner J Kremer W Neidig KP Kalbitzer HR 《Journal of biomolecular NMR》2002,23(4):271-287
Automated assignment of NOESY spectra is a prerequisite for automated structure determination of biological macromolecules. With the program KNOWNOE we present a novel, knowledge based approach to this problem. KNOWNOE is devised to work directly with the experimental spectra without interference of an expert. Besides making use of routines already implemented in AUREMOL, it contains as a central part a knowledge driven Bayesian algorithm for solving ambiguities in the NOE assignments. These ambiguities mainly arise from chemical shift degeneration which allows multiple assignments of cross peaks. Using a set of 326 protein NMR structures, statistical tables in the form of atom-pairwise volume probability distributions (VPDs) were derived. VPDs for all assignment possibilities relevant to the assignments of interproton NOEs were calculated. With these data for a given cross peak with $N$ possible assignments $A_i$ ($i = 1, \ldots, N$) the conditional probabilities $P(A_i, a \mid V_0)$ can be calculated that the assignment $A_i$ determines essentially all ($a$-times) of the cross peak volume $V_0$. An assignment $A_k$ with a probability $P(A_k, a \mid V_0)$ higher than 0.8 is transiently considered as unambiguously assigned. With a list of unambiguously assigned peaks a set of structures is calculated. These structures are used as input for a next cycle of iteration where a distance threshold $D_{max}$ is dynamically reduced. The program KNOWNOE was tested on NOESY spectra of a medium size protein, the cold shock protein (TmCsp) from Thermotoga maritima. The results show that a high quality structure of this protein can be obtained by automated assignment of NOESY spectra which is at least as good as the structure obtained from manual data evaluation.
13.
Dawy Z Goebel B Hagenauer J Andreoli C Meitinger T Mueller JC 《IEEE/ACM transactions on computational biology and bioinformatics / IEEE, ACM》2006,3(1):47-56
Finding the causal genetic regions underlying complex traits is one of the main aims in human genetics. In the context of complex diseases, which are believed to be controlled by multiple contributing loci of largely unknown effect and position, it is especially important to develop general yet sensitive methods for gene mapping. We discuss the use of Shannon's information theory for population-based gene mapping of discrete and quantitative traits and for marker clustering. Various measures of mutual information were employed in order to develop a comprehensive framework for gene mapping analyses. An algorithm aimed at finding so-called relevance chains of causal markers is proposed. Moreover, entropy measures are used in conjunction with multidimensional scaling to visualize clusters of genetic markers. The relevance chain algorithm successfully detected the two causal regions in a simulated scenario. The approach has also been applied to a published clinical study on autoimmune (Graves') disease. Results were consistent with those of standard statistical methods, but identified an additional locus of interest in the promoter region of the associated gene CTLA4. The developed software is freely available at http://www.lnt.ei.tum.de/download/InfoGeneMap/.
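As a rough illustration of the basic quantity behind this framework, the sketch below estimates the Shannon mutual information between a discrete marker and a discrete trait from paired observations. It is only the plain plug-in estimator, not the authors' relevance chain algorithm or their software; the function name `mutual_information` and the toy genotype and trait vectors are assumptions made for this example.

```python
import math
from collections import Counter


def mutual_information(xs, ys):
    """Plug-in estimate of I(X;Y) in bits from two paired discrete sequences."""
    if len(xs) != len(ys) or not xs:
        raise ValueError("xs and ys must be non-empty and of equal length")
    n = len(xs)
    count_x = Counter(xs)
    count_y = Counter(ys)
    count_xy = Counter(zip(xs, ys))
    mi = 0.0
    for (x, y), c in count_xy.items():
        # p(x,y) * log2( p(x,y) / (p(x) * p(y)) ), written in terms of raw counts.
        mi += (c / n) * math.log2(c * n / (count_x[x] * count_y[y]))
    return mi


if __name__ == "__main__":
    # Hypothetical toy data: marker genotype and affection status per individual.
    genotype = [0, 0, 1, 1]
    trait = [0, 0, 1, 1]
    print(mutual_information(genotype, trait))
```

A marker that is independent of the trait scores zero, while a marker that fully determines a balanced binary trait scores one bit, which is what makes the measure usable for ranking candidate loci.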
14.
15.
An unsupervised automatic method for sorting neuronal spike waveforms in awake and freely moving animals Total citations: 1, self-citations: 0, citations by others: 1
Aksenova TI Chibirova OK Dryga OA Tetko IV Benabid AL Villa AE 《Methods (San Diego, Calif.)》2003,30(2):178-187
The present study introduces an approach to automatic classification of extracellularly recorded action potentials of neurons. The classification of spike waveform is considered a pattern recognition problem of special segments of signal that correspond to the appearance of spikes. The spikes generated by one neuron should be recognized as members of the same class. The spike waveforms are described by the nonlinear oscillating model as an ordinary differential equation with perturbation, thus characterizing the signal distortions in both amplitude and phase. It is shown that the use of local variables reduces the problem of spike recognition to the separation of a mixture of normal distributions in the transformed feature space. We have developed an unsupervised iteration-learning algorithm that estimates the number of classes and their centers according to the distance between spike trajectories in phase space. This algorithm scans the learning set to evaluate spike trajectories with maximal probability density in their neighborhood. Following the learning, the procedure of minimal distance is used to perform spike recognition. Estimation of trajectories in phase space requires calculation of the first- and second-order derivatives, and integral operators with piecewise polynomial kernels were used. This provided the computational efficiency of the developed approach for real-time application as required by recordings in behaving animals and in human neurosurgical operations. The new method of spike sorting was tested on simulated and real data and performed better than other approaches currently used in neurophysiology.
16.
In this paper, an unsupervised learning algorithm is developed. Two versions of an artificial neural network, termed a differentiator, are described. It is shown that our algorithm is a dynamic variation of the competitive learning found in most unsupervised learning systems. These systems are frequently used for solving certain pattern recognition tasks such as pattern classification and k-means clustering. Using computer simulation, it is shown that dynamic competitive learning outperforms simple competitive learning methods in solving cluster detection and centroid estimation problems. The simulation results demonstrate that high quality clusters are detected by our method in a short training time. Either a distortion function or the minimum spanning tree method of clustering is used to verify the clustering results. By taking full advantage of all the information presented in the course of training in the differentiator, we demonstrate a powerful adaptive system capable of learning continuously changing patterns.
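For readers unfamiliar with the baseline this abstract compares against, here is a minimal sketch of simple (winner-take-all) competitive learning used for centroid estimation. It is not the "differentiator" network or the dynamic variant the paper proposes; the function name, the fixed learning rate, and the explicit initial weights are assumptions chosen to keep the example deterministic.

```python
def competitive_learning(points, init_weights, epochs=100, lr=0.1):
    """Simple winner-take-all competitive learning.

    points       -- sequence of equal-length numeric tuples
    init_weights -- one starting weight vector per unit (hypothetical choice)
    Returns the final weight vectors, which act as centroid estimates.
    """
    weights = [list(w) for w in init_weights]
    for _ in range(epochs):
        for p in points:
            # The unit closest to the input (squared Euclidean distance) wins.
            winner = min(
                range(len(weights)),
                key=lambda j: sum((w - x) ** 2 for w, x in zip(weights[j], p)),
            )
            # Only the winner moves, by a fraction lr of its distance to the input.
            weights[winner] = [w + lr * (x - w) for w, x in zip(weights[winner], p)]
    return weights


if __name__ == "__main__":
    data = [(0.0,), (1.0,), (10.0,), (11.0,)]
    print(competitive_learning(data, [(2.0,), (8.0,)]))
```

With a constant learning rate the weights keep oscillating slightly around each cluster's centre instead of settling exactly on the mean, and a unit that never wins is never updated.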
17.
A PCR method for detection of bifidobacteria in raw milk and raw milk cheese: comparison with culture-based methods Total citations: 2, self-citations: 0, citations by others: 2
Delcenserie V Bechoux N China B Daube G Gavini F 《Journal of microbiological methods》2005,61(1):55-67
Bifidobacteria are well known for their beneficial effects on health and are used as probiotics in food and pharmaceutical products. As they form one of the most important groups in both human and animal feces, their use as fecal indicator organisms in raw milk products has recently been proposed. Bifidobacteria species isolated in humans are different from those isolated in animals. It should therefore be possible to determine contamination origin (human or animal). A method of detecting the Bifidobacterium genus was developed by PCR targeting the hsp60 gene. The genus Bifidobacterium was identified by PCR amplification of a 217-bp hsp60 gene fragment. The degenerate primer pair specific to the Bifidobacterium genus was tested for its specificity on 127 strains. Sensitivity was measured on artificially contaminated samples. Food can, however, be a difficult matrix for PCR testing since it contains PCR inhibitors, so an internal PCR control was used: an artificially created DNA fragment of 315 bp was constructed. The PCR detection method was tested on raw milk and cheese samples and compared with three culture-based methods, which comprised enrichment and isolation steps. The enrichment step used Brain Heart Infusion medium with propionic acid, iron citrate and yeast extract, supplemented with mupirocin (BHMup) or not (BH), and the isolation step used Columbia blood agar medium, supplemented with mupirocin (CMup) or not (C). The method using mupirocin at both enrichment and isolation steps and the PCR method performed from the culture in BHMup enrichment medium were shown to be the most efficient. No significant difference was observed in raw milk samples between PCR from BHMup and the culture-based method BHMup/CMup, while a significant difference was noticed between the same methods in raw milk cheese samples, which would favor using PCR.
The results suggested that PCR on the hsp60 gene was convenient for a rapid detection of bifidobacteria in raw milk and raw milk cheese samples and that bifidobacteria, which are always present throughout raw milk cheese production, could be efficiently used as fecal indicators.
18.
Jayanta Kumar Das Pabitra Pal Choudhury Neelambuj Chaturvedi Mohd Tayyab Sk. Sarif Hassan 《Genomics》2019,111(4):549-559
This article introduces an alignment-free clustering method to cluster all 66 sequentially diverse DOR protein sequences. Two different methods are discussed: one uses the twenty standard amino acids (without grouping) and the other uses a chemical grouping of amino acids (with grouping). Two grayscale images (representing two protein sequences by order pair frequency matrices) are compared to find the similarity index using a morphology technique. We achieved correlation coefficients of 0.9734 and 0.9403 for the without-grouping and with-grouping methods, respectively, against the ClustalW result on the ND5 dataset, which are much better than those of some existing alignment-free methods. Based on the similarity index, the 66 DORs are clustered into three classes (Highest, Moderate and Lowest), which are seen to be the best fit for the 66 DOR protein sequences. OR83b is the distinguished olfactory receptor expressed in divergent insect populations, which is substantiated by our investigation.
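One plausible reading of the "order pair frequency matrix" is a 20 x 20 table counting how often each amino acid is immediately followed by each other amino acid. The sketch below builds such a table under that assumption; the abstract does not spell out the exact definition, and the grayscale conversion and morphology-based comparison are not reproduced here. The function name and the alphabet ordering are hypothetical.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the twenty standard residues, one-letter codes


def pair_frequency_matrix(seq, alphabet=AMINO_ACIDS):
    """Count ordered adjacent residue pairs (a followed by b) in a protein sequence.

    Assumption: 'ordered pair' means two consecutive positions in the sequence.
    Pairs containing a symbol outside the alphabet are skipped.
    """
    index = {a: i for i, a in enumerate(alphabet)}
    n = len(alphabet)
    counts = [[0] * n for _ in range(n)]
    for a, b in zip(seq, seq[1:]):
        if a in index and b in index:
            counts[index[a]][index[b]] += 1
    return counts


if __name__ == "__main__":
    matrix = pair_frequency_matrix("ACAC")
    print(matrix[0][1], matrix[1][0])  # counts for A->C and C->A
```

Passing a reduced alphabet of group labels, after mapping each residue to its chemical group, would give the smaller matrix that corresponds to the "with grouping" variant.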
19.
20.
Wise MJ 《Bioinformatics (Oxford, England)》2002,18(Z1):S38-S45
POPPs is a suite of inter-related software tools which allow the user to discover what is statistically 'unusual' in the composition of an unknown protein, or to automatically cluster proteins into families based on peptide composition. Finally, the user can search for related proteins based on peptide composition. Statistically based peptide composition provides a view of proteins that is, to some extent, orthogonal to that provided by sequence. In a test study, the POPPs suite is able to regroup into their families sets of approximately 100 randomised Pfam protein domains. The POPPs suite is used to explore the diverse set of late embryogenesis abundant (LEA) proteins.