Similar Documents
20 similar documents found.
1.
Feature selection is widely established as one of the fundamental computational techniques in mining microarray data. Due to the lack of categorized information in practice, unsupervised feature selection is more practically important but correspondingly more difficult. Motivated by cluster ensemble techniques, which combine multiple clustering solutions into a consensus solution of higher accuracy and stability, recent efforts in unsupervised feature selection proposed to use these consensus solutions as oracles. However, these methods depend both on the particular cluster ensemble algorithm used and on knowledge of the true cluster number, which makes them unsuitable when the true cluster number is not available, as is common in practice. In view of these problems, a new unsupervised feature ranking method is proposed that evaluates the importance of features based on consensus affinity. Different from previous works, our method compares the affinity of each feature between pairs of instances against the consensus matrix of clustering solutions. As a result, it removes the need to know the true number of clusters and the dependence on a particular cluster ensemble approach. Experiments on real gene expression data sets demonstrate significant improvement in feature ranking results compared to several state-of-the-art techniques.
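A minimal illustration of the consensus-affinity idea (not the authors' exact algorithm): build a co-association consensus matrix from several k-means runs with varying cluster numbers, then score each feature by how well its own pairwise affinities agree with that consensus. The affinity kernel, the range of k values, and the correlation-based scoring are all assumptions of this sketch.

```python
# Sketch only: consensus matrix from repeated k-means, then per-feature scoring
# by agreement between the feature's own affinity matrix and the consensus.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics.pairwise import rbf_kernel

def consensus_matrix(X, k_values=(2, 3, 4, 5), runs_per_k=5, seed=0):
    n = X.shape[0]
    C = np.zeros((n, n))
    rng = np.random.RandomState(seed)
    total = 0
    for k in k_values:
        for _ in range(runs_per_k):
            labels = KMeans(n_clusters=k, n_init=10,
                            random_state=rng.randint(1 << 30)).fit_predict(X)
            C += (labels[:, None] == labels[None, :]).astype(float)
            total += 1
    return C / total  # fraction of runs in which each pair of instances co-clusters

def rank_features_by_consensus_affinity(X, C):
    scores = []
    iu = np.triu_indices(X.shape[0], k=1)
    for j in range(X.shape[1]):
        A = rbf_kernel(X[:, [j]])  # pairwise affinity induced by this feature alone
        scores.append(np.nan_to_num(np.corrcoef(A[iu], C[iu])[0, 1]))
    return np.argsort(scores)[::-1]  # features most consistent with the consensus first
```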

2.
We present CLIFF, an algorithm for clustering biological samples using gene expression microarray data. This clustering problem is difficult for several reasons, in particular the sparsity of the data, the high dimensionality of the feature (gene) space, and the fact that many features are irrelevant or redundant. Our algorithm iterates between two computational processes, feature filtering and clustering. Given a reference partition that approximates the correct clustering of the samples, our feature filtering procedure ranks the features according to their intrinsic discriminability, relevance to the reference partition, and irredundancy to other relevant features, and uses this ranking to select the features to be used in the following round of clustering. Our clustering algorithm, which is based on the concept of a normalized cut, clusters the samples into a new reference partition on the basis of the selected features. On a well-studied problem involving 72 leukemia samples and 7130 genes, we demonstrate that CLIFF outperforms standard clustering approaches that do not consider the feature selection issue, and produces a result that is very close to the original expert labeling of the sample set.

3.
MOTIVATION: Feature selection methods aim to reduce the complexity of data and to uncover the most relevant biological variables. In reality, information in biological datasets is often incomplete as a result of untrustworthy samples and missing values. The reliability of selection methods may therefore be questioned. METHOD: Information loss is incorporated into a perturbation scheme, testing which features are stable under it. This method is applied to data analysis by unsupervised feature filtering (UFF). The latter has been shown to be a very successful method in analysis of gene-expression data. RESULTS: We find that the UFF quality degrades smoothly with information loss. It remains successful even under substantial damage. Our method allows for selection of a best imputation method on a dataset treated by UFF. More importantly, scoring features according to their stability under information loss is shown to be correlated with biological importance in cancer studies. This scoring may lead to novel biological insights.

4.
The fractal dimension of subsets of time series data can be used to modulate the extent of filtering to which the data is subjected. In general, such fractal filtering makes it possible to retain large transient shifts in baseline with very little decrease in amplitude, while the baseline noise itself is markedly reduced (Strahle, W.C. (1988) Electron. Lett. 24, 1248-1249). The fractal filter concept is readily applicable to single channel data in which there are numerous opening/closing events and flickering. Using a simple recursive filter of the form \(Y_n = wY_{n-1} + (1 - w)X_n\), where \(X_n\) is the data, \(Y_n\) the filtered result, and \(w\) a weighting factor with \(0 < w < 1\), we adjusted \(w\) as a function of the fractal dimension \(D\) of data subsets. Linear and ogive functions of \(D\) were used to modify \(w\). Of these, the ogive function \(w = [1 + p(1.5 - D)]^{-1}\) (where \(p\) affects the amount of filtering) is most useful for removing extraneous noise while retaining opening/closing events.
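As a rough illustration of the filtering scheme described above, the sketch below applies the recursive filter with a per-sample weight obtained from the ogive function. The fractal-dimension estimator (Katz's method), the window size, and the value of p are assumptions of this sketch, not taken from the original paper.

```python
# Hedged sketch: fractal-weighted recursive filter with a Katz-type
# fractal-dimension estimator on a trailing window (estimator and parameters assumed).
import numpy as np

def katz_fd(segment):
    """Katz fractal dimension of a 1-D segment (one common estimator)."""
    pts = np.column_stack((np.arange(len(segment)), segment))
    dists = np.linalg.norm(np.diff(pts, axis=0), axis=1)
    L = dists.sum()                                   # total curve length
    d = np.linalg.norm(pts - pts[0], axis=1).max()    # max distance from first point
    n = len(segment) - 1
    if n < 2 or L == 0 or d == 0:
        return 1.0
    denom = np.log10(n) + np.log10(d / L)
    return 1.0 if denom == 0 else np.log10(n) / denom

def fractal_filter(x, window=32, p=4.0):
    """Y_n = w*Y_{n-1} + (1 - w)*X_n with w set per sample from the ogive
    function w = [1 + p*(1.5 - D)]^-1, D estimated on a trailing window."""
    x = np.asarray(x, dtype=float)
    y = np.empty_like(x)
    y[0] = x[0]
    for i in range(1, len(x)):
        seg = x[max(0, i - window):i + 1]
        D = katz_fd(seg)
        w = 1.0 / (1.0 + p * (1.5 - D))
        w = np.clip(w, 1e-3, 0.999)                   # guard: keep 0 < w < 1
        y[i] = w * y[i - 1] + (1.0 - w) * x[i]
    return y
```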

5.
6.
Li Xin, Wu Yufeng. BMC Bioinformatics 2023, 23(8): 1-16
Background

Structural variation (SV), which ranges from 50 bp to \(\sim\) 3 Mb in size, is an important type of genetic variation. Deletion is a type of SV in which a part of a chromosome or a sequence of DNA is lost during DNA replication. Three types of signals, namely discordant read pairs, read depth, and split reads, are commonly used for SV detection from high-throughput sequencing data. Many tools have been developed for detecting SVs using one or more of these signals.

Results

In this paper, we develop a new method called EigenDel for detecting germline submicroscopic genomic deletions. EigenDel first takes advantage of discordant read pairs and clipped reads to obtain initial deletion candidates, and then clusters similar candidates using unsupervised learning methods. After that, EigenDel uses a carefully designed approach to call true deletions from each cluster. We conduct various experiments to evaluate the performance of EigenDel on low-coverage sequencing data.

Conclusions

Our results show that EigenDel outperforms other major methods in balancing accuracy and sensitivity as well as in reducing bias. EigenDel can be downloaded from https://github.com/lxwgcool/EigenDel.
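The abstract does not detail the clustering step, so the sketch below is purely hypothetical: candidate deletions are described by illustrative features (length, discordant-pair support, clipped-read support), grouped with k-means, and a placeholder rule keeps one representative call per cluster. None of this reproduces EigenDel's actual pipeline.

```python
# Hypothetical sketch only; feature columns, clustering choice, and the calling
# rule are illustrative and not EigenDel's actual method.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

def cluster_deletion_candidates(candidates, n_clusters=4):
    """candidates: array of shape (n, 3) with assumed columns
    [deletion_length, n_discordant_pairs, n_clipped_reads]."""
    candidates = np.asarray(candidates, dtype=float)
    X = StandardScaler().fit_transform(np.log1p(candidates))
    labels = KMeans(n_clusters=n_clusters, n_init=10, random_state=0).fit_predict(X)
    calls = []
    for c in range(n_clusters):
        members = candidates[labels == c]
        # placeholder "call" rule: keep the candidate with the most read support
        calls.append(members[np.argmax(members[:, 1] + members[:, 2])])
    return labels, np.array(calls)
```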


7.
Brillouin imaging relies on the reliable extraction of subtle spectral information from hyperspectral datasets. To date, the mainstream practice has been to use line fitting of spectral features to retrieve the average peak shift and linewidth parameters. Good results, however, depend heavily on a sufficient signal-to-noise ratio and may not be attainable in complex samples that consist of spectral mixtures. In this work, we therefore propose the use of various multivariate algorithms that can perform supervised or unsupervised analysis of the hyperspectral data, with which we explore advanced image analysis applications, namely unmixing, classification and segmentation, in a phantom and in live cells. The resulting images are shown to provide more contrast and detail, and are obtained on a timescale ∼10² faster than fitting. The estimated spectral parameters are consistent with those calculated from pure fitting.
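One multivariate route consistent with the unmixing application mentioned above is non-negative matrix factorization of the hyperspectral stack; the sketch below is a hedged example of that choice, with the data layout and component count assumed rather than taken from the paper.

```python
# Hedged sketch: unsupervised spectral unmixing of a hyperspectral stack with NMF
# (one possible multivariate method; shapes and component count are assumptions).
import numpy as np
from sklearn.decomposition import NMF

def unmix_hyperspectral(cube, n_components=3):
    """cube: array of shape (rows, cols, n_channels) with non-negative spectra."""
    rows, cols, n_channels = cube.shape
    X = cube.reshape(-1, n_channels)                  # pixels x spectral channels
    model = NMF(n_components=n_components, init="nndsvda", max_iter=500)
    abundances = model.fit_transform(X)               # per-pixel component weights
    endmembers = model.components_                    # component spectra
    abundance_maps = abundances.reshape(rows, cols, n_components)
    return abundance_maps, endmembers
```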

8.
9.

Background

Although post-traumatic stress disorder (PTSD) is primarily a mental disorder, it can cause additional symptoms that do not seem to be directly related to the central nervous system, which PTSD is assumed to directly affect. PTSD-mediated heart disease is one such secondary disorder. Despite the significant correlations between PTSD and heart disease, the spatial separation between the heart and the brain (where PTSD is primarily active) has prevented researchers from elucidating the mechanisms that bridge the two disorders. Our purpose was to identify genes linking PTSD and heart disease.

Methods

In this study, gene expression profiles of various murine tissues observed under various types of stress or without stress were analyzed in an integrated manner using tensor decomposition (TD).

Results

Based upon the obtained features, ~400 genes were identified as candidate genes that may mediate heart diseases associated with PTSD. Various gene enrichment analyses supported the biological reliability of the identified genes. Ten genes encoding protein-, DNA-, or mRNA-interacting proteins (ILF2, ILF3, ESR1, ESR2, RAD21, HTT, ATF2, NR3C1, TP53, and TP63) were found to be likely to regulate expression of most of these ~400 genes and therefore are candidate primary genes that cause PTSD-mediated heart diseases. Approximately 400 genes in the heart were also found to be strongly affected by various drugs whose known adverse effects are related to heart diseases and/or fear memory conditioning; these data support the reliability of our findings.

Conclusions

TD-based unsupervised feature extraction turned out to be a useful method for gene selection and successfully identified possible genes causing PTSD-mediated heart diseases.
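A rough sketch of the TD-based unsupervised feature extraction workflow in the spirit described above: Tucker decomposition of an expression tensor, ranking of genes by a gene-mode factor vector, and selection with Gaussian-assumed P-values adjusted by Benjamini-Hochberg. The tensor layout, the chosen component, the rank, and the thresholds are all illustrative assumptions, and the third-party `tensorly` package is used for the decomposition.

```python
# Illustrative sketch only: tensor layout, selected component, rank, and
# thresholds are assumptions, not the study's exact settings.
import numpy as np
from tensorly.decomposition import tucker
from scipy.stats import chi2
from statsmodels.stats.multitest import multipletests

def td_feature_extraction(tensor, gene_mode=2, component=0, rank=(3, 3, 10), alpha=0.01):
    """tensor: e.g. a tissues x conditions x genes expression array."""
    core, factors = tucker(np.asarray(tensor, dtype=float), rank=rank)
    u = factors[gene_mode][:, component]      # gene-mode factor (singular) vector
    z = (u - u.mean()) / u.std()
    pvals = chi2.sf(z ** 2, df=1)             # P-values under a Gaussian null
    reject = multipletests(pvals, alpha=alpha, method="fdr_bh")[0]
    return np.where(reject)[0]                # indices of selected genes
```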

10.
High-dimensional clustering analysis is a challenging problem in statistics and machine learning, with broad applications such as the analysis of microarray data and RNA-seq data. In this paper, we propose a new clustering procedure called spectral clustering with feature selection (SC-FS), in which we first obtain an initial estimate of the labels via spectral clustering, then select a small fraction of features with the largest R-squared with these labels (that is, the proportion of variation explained by the group labels), and conduct clustering again using the selected features. Under mild conditions, we prove that the proposed method identifies all informative features with high probability and achieves the minimax optimal clustering error rate for the sparse Gaussian mixture model. Applications of SC-FS to four real-world datasets demonstrate its usefulness in clustering high-dimensional data.
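A minimal sketch of the SC-FS procedure as described above: initial labels from spectral clustering, per-feature R-squared with those labels, and a second round of clustering on the top-ranked features. The fraction of features kept and the use of k-means for the second round are assumptions of this sketch.

```python
# Sketch of SC-FS as described in the abstract; keep_fraction and the second
# clustering algorithm (k-means) are assumptions.
import numpy as np
from sklearn.cluster import SpectralClustering, KMeans

def r_squared_per_feature(X, labels):
    grand_mean = X.mean(axis=0)
    sst = ((X - grand_mean) ** 2).sum(axis=0)
    sse = np.zeros(X.shape[1])
    for g in np.unique(labels):
        Xg = X[labels == g]
        sse += ((Xg - Xg.mean(axis=0)) ** 2).sum(axis=0)
    return 1.0 - sse / np.maximum(sst, 1e-12)    # variation explained by group labels

def sc_fs(X, n_clusters, keep_fraction=0.05):
    init_labels = SpectralClustering(n_clusters=n_clusters,
                                     affinity="nearest_neighbors",
                                     random_state=0).fit_predict(X)
    r2 = r_squared_per_feature(X, init_labels)
    keep = np.argsort(r2)[::-1][:max(1, int(keep_fraction * X.shape[1]))]
    final_labels = KMeans(n_clusters=n_clusters, n_init=10,
                          random_state=0).fit_predict(X[:, keep])
    return final_labels, keep
```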

11.
The study presents a recursive least-squares estimation method with an exponential forgetting factor for noise removal in functional near-infrared spectroscopy data and for extraction of hemodynamic responses (HRs) from the measured data. The HR is modeled in linear regression form, in which the expected HR, the first and second derivatives of the expected HR, short-separation measurement data, three physiological noises, and the baseline drift are included as components of the regression vector. The proposed method is applied to left-motor-cortex experiments on right thumb and little finger movements in five healthy male participants. The algorithm is evaluated with respect to its performance improvement in terms of contrast-to-noise ratio in comparison with the Kalman filter, low-pass filtering, and the independent component method. The experimental results show that the proposed model achieves reductions of 77% and 99%, respectively, in the number of channels exhibiting higher contrast-to-noise ratios in oxy-hemoglobin and deoxy-hemoglobin. The approach is robust in obtaining consistent HR data and can be applied to both offline and online noise removal.
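A hedged sketch of recursive least-squares with an exponential forgetting factor for the regression model described above. Assembling the regression vector (expected HR, its first and second derivatives, a short-separation channel, physiological nuisance terms, and drift) is left to the caller; the initialization constants are illustrative.

```python
# Standard RLS with exponential forgetting; the regression vectors Phi are
# assembled by the caller (their composition is an assumption of this sketch).
import numpy as np

def rls_forgetting(Phi, y, lam=0.99, delta=100.0):
    """Phi: (n_samples, n_regressors) regression vectors; y: measured signal.
    Returns the coefficient trajectory of shape (n_samples, n_regressors)."""
    n, p = Phi.shape
    theta = np.zeros(p)
    P = delta * np.eye(p)                    # large initial covariance
    history = np.empty((n, p))
    for t in range(n):
        phi = Phi[t]
        Pphi = P @ phi
        k = Pphi / (lam + phi @ Pphi)        # gain vector
        e = y[t] - phi @ theta               # prediction error
        theta = theta + k * e
        P = (P - np.outer(k, Pphi)) / lam    # covariance update with forgetting
        history[t] = theta
    return history
```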

12.
The ability to predict human phenotypes and identify biomarkers of disease from metagenomic data is crucial for the development of therapeutics for microbiome-associated diseases. However, metagenomic data is commonly affected by technical variables unrelated to the phenotype of interest, such as sequencing protocol, which can make it difficult to predict phenotype and find biomarkers of disease. Supervised methods to correct for background noise, originally designed for gene expression and RNA-seq data, are commonly applied to microbiome data but may be limited because they cannot account for unmeasured sources of variation. Unsupervised approaches address this issue, but current methods are limited because they are ill-equipped to deal with the unique aspects of microbiome data, which is compositional, highly skewed, and sparse. We perform a comparative analysis of different denoising transformations in combination with supervised correction methods, as well as an unsupervised principal component correction approach that is presently used in other domains but has not yet been applied to microbiome data. We find that the unsupervised principal component correction approach is comparable to the supervised approaches in reducing false discovery of biomarkers, with the added benefit of not needing to know the sources of variation a priori. In prediction tasks, however, it appears to improve prediction only when technical variables contribute the majority of the variance in the data. As new and larger metagenomic datasets become increasingly available, background noise correction will become essential for generating reproducible microbiome analyses.
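A hedged sketch of an unsupervised principal-component correction for compositional count data: a centered log-ratio transform with a pseudocount, followed by removal of the top principal components as putative technical variation. The transform choice, pseudocount, and number of components removed are assumptions, not necessarily those evaluated in the study.

```python
# Sketch only: CLR transform then removal of top PCs as presumed technical noise.
import numpy as np
from sklearn.decomposition import PCA

def clr(counts, pseudocount=0.5):
    x = np.log(np.asarray(counts, dtype=float) + pseudocount)
    return x - x.mean(axis=1, keepdims=True)         # centre each sample's log-abundances

def remove_top_pcs(counts, n_remove=2):
    X = clr(counts)
    Xc = X - X.mean(axis=0)
    pca = PCA(n_components=n_remove).fit(Xc)
    technical = pca.transform(Xc) @ pca.components_  # reconstruction from top PCs
    return Xc - technical                            # corrected data for downstream analysis
```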

13.
14.
MOTIVATION: In this paper, we propose using the Kalman filter (KF) as a pre-processing step in microarray-based molecular diagnosis. Incorporating the expression covariance between genes is important in such classification problems, since this represents the functional relationships that govern tissue state. Failing to fulfil such requirements may result in biologically implausible class prediction models. Here, we show that employing the KF to remove noise (while retaining meaningful covariance and thus being able to estimate the underlying biological state from microarray measurements) yields linearly separable data suitable for most classification algorithms. RESULTS: We demonstrate the utility and performance of the KF as a robust disease-state estimator on publicly available binary and multi-class microarray datasets in combination with the most widely used classification methods to date. Moreover, using popular graphical representation schemes we show that our filtered datasets also have an improved visualization capability.
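As a generic illustration only (not the paper's state-space model), the sketch below applies a scalar random-walk Kalman filter to each gene's measurements to suppress observation noise before classification; the process and observation variances are illustrative tuning parameters.

```python
# Generic sketch: per-gene scalar Kalman filter with a random-walk state;
# q and r are illustrative, not values from the paper.
import numpy as np

def kalman_filter_gene(x, q=0.01, r=1.0):
    """x: 1-D array of noisy measurements for one gene across ordered samples."""
    n = len(x)
    xhat = np.empty(n)
    xhat[0], P = x[0], 1.0
    for t in range(1, n):
        P_pred = P + q                       # predict (random-walk state)
        K = P_pred / (P_pred + r)            # Kalman gain
        xhat[t] = xhat[t - 1] + K * (x[t] - xhat[t - 1])
        P = (1 - K) * P_pred                 # update error covariance
    return xhat

def kalman_preprocess(X, q=0.01, r=1.0):
    """X: genes x samples matrix; returns the filtered matrix."""
    return np.vstack([kalman_filter_gene(row, q, r) for row in X])
```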

15.
The recent increase in data accuracy from high-resolution accelerometers offers substantial potential for improved understanding and prediction of animal movements. However, current approaches used for analysing these multivariable datasets typically require existing knowledge of the animals' behaviors to inform the behavioral classification process. These methods are thus not well suited to the many cases where limited knowledge of the different behaviors performed exists. Here, we introduce the use of an unsupervised learning algorithm. To illustrate the method's capability, we analyse data collected using a combination of GPS and accelerometers on two seabird species: razorbills (Alca torda) and common guillemots (Uria aalge). We applied the unsupervised learning algorithm Expectation Maximization to characterize latent behavioral states both above and below water, at both the individual and group level. The application of this flexible approach yielded significant new insights into the foraging strategies of the two study species, both above and below the surface of the water. In addition to general behavioral modes such as flying and floating, as well as descending and ascending phases within the water column, this approach allowed an exploration of previously unstudied and important behaviors such as searching and prey chasing/capture events. We propose that this unsupervised learning approach provides an ideal tool for the systematic analysis of the complex multivariable movement data that are increasingly being obtained with accelerometer tags across species. In particular, we recommend its application in cases where current knowledge of the behaviors performed is limited and existing supervised learning approaches may have limited utility.
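A minimal sketch of the EM-based state assignment, assuming per-window summary features (for example, mean dynamic acceleration per axis and depth) have already been computed; a Gaussian mixture fitted by Expectation Maximization then assigns each window to a latent behavioural state. The features and the number of states are assumptions of this sketch.

```python
# Sketch: Gaussian mixture (fitted by EM) over assumed per-window summary features.
import numpy as np
from sklearn.mixture import GaussianMixture
from sklearn.preprocessing import StandardScaler

def infer_behavioural_states(window_features, n_states=5):
    """window_features: (n_windows, n_features) array of per-window summaries."""
    X = StandardScaler().fit_transform(np.asarray(window_features, dtype=float))
    gmm = GaussianMixture(n_components=n_states, covariance_type="full",
                          n_init=5, random_state=0)
    states = gmm.fit_predict(X)              # hard assignment to latent states
    posteriors = gmm.predict_proba(X)        # soft membership, useful for uncertainty
    return states, posteriors
```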

16.
Clustering of microarray gene expression data is performed routinely, for genes as well as for samples. Clustering of genes can reveal functional relationships between genes; clustering of samples, on the other hand, is important for finding, for example, disease subtypes, relevant patient groups for stratification, or related treatments. Usually this is done by first filtering the genes for high variance, under the assumption that they carry most of the information needed for separating different sample groups. If this assumption is violated, important groupings in the data might be lost. Furthermore, classical clustering methods do not facilitate the biological interpretation of the results. We therefore propose to methodologically integrate the clustering algorithm with prior biological information. This differs from other approaches in that knowledge about classes of genes can be used directly to ease the interpretation of the results and possibly boost clustering performance. Our approach computes dendrograms that resemble decision trees, with gene classes used to split the data at each node, which can help to find biologically meaningful differences between the sample groups. We have tested the proposed method on both simulated and real data and conclude that it is useful as a complementary method, especially when the assumptions of few differentially expressed genes and an informative mapping of genes to different classes are met.

17.
18.
Surface water contamination from agricultural and urban runoff and from wastewater discharges from industrial and municipal activities is of major concern to people worldwide. Classical models can be insufficient to visualise the results because the water quality variables used to describe dynamic pollution sources are complex, multivariable, and nonlinearly related. Artificial intelligence techniques, with their ability to analyse multivariate water quality data and their sophisticated visualisation capacity, can offer an alternative to current models. In this study, the Kohonen self-organising feature map (SOM) neural network was first applied to analyse the complex nonlinear relationships among surface water quality variables, using the component planes of the variables to determine the behaviour of the water quality parameters. The dependencies between water quality variables were extracted and interpreted using the pattern analysis visualised in the component planes. For further investigation, the k-means clustering algorithm was used to partition the maps, with the optimal number of clusters determined by the Davies–Bouldin clustering index, leading to seven groups or clusters corresponding to water quality variables. The results reveal that the concentrations of Na, K, Cl, NH4-N, NO2-N, o-PO4, the component planes of organic matter (pV), and dissolved oxygen (DO) were significantly affected by seasonal changes, and that the SOM technique is an efficient tool with which to analyse and determine the complex behaviour of multidimensional surface water quality data. These results suggest that this technique could also be applied to other environmentally sensitive areas such as air and groundwater pollution.
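A hedged sketch of the two-stage SOM plus k-means approach described above: train a SOM on the standardised water quality variables, then partition the SOM codebook with k-means, choosing the number of clusters by the Davies-Bouldin index. The grid size, iteration count, and k range are assumptions; the third-party `minisom` package is used for the SOM.

```python
# Sketch: SOM codebook partitioned by k-means, k chosen by Davies-Bouldin index.
import numpy as np
from minisom import MiniSom
from sklearn.cluster import KMeans
from sklearn.metrics import davies_bouldin_score
from sklearn.preprocessing import StandardScaler

def som_kmeans(data, grid=(10, 10), k_range=range(2, 11), n_iter=5000):
    X = StandardScaler().fit_transform(np.asarray(data, dtype=float))
    som = MiniSom(grid[0], grid[1], X.shape[1], sigma=1.0,
                  learning_rate=0.5, random_seed=0)
    som.random_weights_init(X)
    som.train_random(X, n_iter)
    codebook = som.get_weights().reshape(-1, X.shape[1])   # one vector per SOM node
    best_k, best_db, best_labels = None, np.inf, None
    for k in k_range:
        labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(codebook)
        db = davies_bouldin_score(codebook, labels)          # lower is better
        if db < best_db:
            best_k, best_db, best_labels = k, db, labels
    return som, best_k, best_labels
```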

19.
The process of building a new database relevant to some field of study in biomedicine involves transforming, integrating and cleansing multiple data sources, as well as adding new material and annotations. This paper reviews some of the requirements of a general solution to this data integration problem. Several representative technologies and approaches to data integration in biomedicine are surveyed. Then some interesting features that separate the more general data integration technologies from the more specialised ones are highlighted.

20.