首页 | 本学科首页   官方微博 | 高级检索  
相似文献
 共查询到20条相似文献,搜索用时 31 毫秒
1.
Non-negative matrix factorization is a useful tool for reducing the dimension of large datasets. This work considers simultaneous non-negative matrix factorization of multiple sources of data. In particular, we perform the first study that involves more than two datasets. We discuss the algorithmic issues required to convert the approach into a practical computational tool and apply the technique to new gene expression data quantifying the molecular changes in four tissue types due to different dosages of an experimental panPPAR agonist in mouse. This study is of interest in toxicology because, whilst PPARs form potential therapeutic targets for diabetes, it is known that they can induce serious side-effects. Our results show that the practical simultaneous non-negative matrix factorization developed here can add value to the data analysis. In particular, we find that factorizing the data as a single object allows us to distinguish between the four tissue types, but does not correctly reproduce the known dosage level groups. Applying our new approach, which treats the four tissue types as providing distinct, but related, datasets, we find that the dosage level groups are respected. The new algorithm then provides separate gene list orderings that can be studied for each tissue type, and compared with the ordering arising from the single factorization. We find that many of our conclusions can be corroborated with known biological behaviour, and others offer new insights into the toxicological effects. Overall, the algorithm shows promise for early detection of toxicity in the drug discovery process.  相似文献   

2.
Uncovering community structures is important for understanding networks. Currently, several nonnegative matrix factorization algorithms have been proposed for discovering community structure in complex networks. However, these algorithms exhibit some drawbacks, such as unstable results and inefficient running times. In view of the problems, a novel approach that utilizes an initialized Bayesian nonnegative matrix factorization model for determining community membership is proposed. First, based on singular value decomposition, we obtain simple initialized matrix factorizations from approximate decompositions of the complex network’s adjacency matrix. Then, within a few iterations, the final matrix factorizations are achieved by the Bayesian nonnegative matrix factorization method with the initialized matrix factorizations. Thus, the network’s community structure can be determined by judging the classification of nodes with a final matrix factor. Experimental results show that the proposed method is highly accurate and offers competitive performance to that of the state-of-the-art methods even though it is not designed for the purpose of modularity maximization.  相似文献   

3.
MOTIVATION: Identifying different cancer classes or subclasses with similar morphological appearances presents a challenging problem and has important implication in cancer diagnosis and treatment. Clustering based on gene-expression data has been shown to be a powerful method in cancer class discovery. Non-negative matrix factorization is one such method and was shown to be advantageous over other clustering techniques, such as hierarchical clustering or self-organizing maps. In this paper, we investigate the benefit of explicitly enforcing sparseness in the factorization process. RESULTS: We report an improved unsupervised method for cancer classification by the use of gene-expression profile via sparse non-negative matrix factorization. We demonstrate the improvement by direct comparison with classic non-negative matrix factorization on the three well-studied datasets. In addition, we illustrate how to identify a small subset of co-expressed genes that may be directly involved in cancer.  相似文献   

4.
We present a new algorithm, based on the multidimensional QR factorization, to remove redundancy from a multiple structural alignment by choosing representative protein structures that best preserve the phylogenetic tree topology of the homologous group. The classical QR factorization with pivoting, developed as a fast numerical solution to eigenvalue and linear least-squares problems of the form Ax=b, was designed to re-order the columns of A by increasing linear dependence. Removing the most linear dependent columns from A leads to the formation of a minimal basis set which well spans the phase space of the problem at hand. By recasting the problem of redundancy in multiple structural alignments into this framework, in which the matrix A now describes the multiple alignment, we adapted the QR factorization to produce a minimal basis set of protein structures which best spans the evolutionary (phase) space. The non-redundant and representative profiles obtained from this procedure, termed evolutionary profiles, are shown in initial results to outperform well-tested profiles in homology detection searches over a large sequence database. A measure of structural similarity between homologous proteins, Q(H), is presented. By properly accounting for the effect and presence of gaps, a phylogenetic tree computed using this metric is shown to be congruent with the maximum-likelihood sequence-based phylogeny. The results indicate that evolutionary information is indeed recoverable from the comparative analysis of protein structure alone. Applications of the QR ordering and this structural similarity metric to analyze the evolution of structure among key, universally distributed proteins involved in translation, and to the selection of representatives from an ensemble of NMR structures are also discussed.  相似文献   

5.

Background

The clinical investigation of human brain tumors often starts with a non-invasive imaging study, providing information about the tumor extent and location, but little insight into the biochemistry of the analyzed tissue. Magnetic Resonance Spectroscopy can complement imaging by supplying a metabolic fingerprint of the tissue. This study analyzes single-voxel magnetic resonance spectra, which represent signal information in the frequency domain. Given that a single voxel may contain a heterogeneous mix of tissues, signal source identification is a relevant challenge for the problem of tumor type classification from the spectroscopic signal.

Methodology/Principal Findings

Non-negative matrix factorization techniques have recently shown their potential for the identification of meaningful sources from brain tissue spectroscopy data. In this study, we use a convex variant of these methods that is capable of handling negatively-valued data and generating sources that can be interpreted as tumor class prototypes. A novel approach to convex non-negative matrix factorization is proposed, in which prior knowledge about class information is utilized in model optimization. Class-specific information is integrated into this semi-supervised process by setting the metric of a latent variable space where the matrix factorization is carried out. The reported experimental study comprises 196 cases from different tumor types drawn from two international, multi-center databases. The results indicate that the proposed approach outperforms a purely unsupervised process by achieving near perfect correlation of the extracted sources with the mean spectra of the tumor types. It also improves tissue type classification.

Conclusions/Significance

We show that source extraction by unsupervised matrix factorization benefits from the integration of the available class information, so operating in a semi-supervised learning manner, for discriminative source identification and brain tumor labeling from single-voxel spectroscopy data. We are confident that the proposed methodology has wider applicability for biomedical signal processing.  相似文献   

6.
According to the WHO, pollution is a worldwide public health problem. In Colombia, low-cost strategies for air quality monitoring have been implemented using wireless sensor networks (WSNs), which achieve a better spatial resolution than traditional sensor networks for a lower operating cost. Nevertheless, one of the recurrent issues of WSNs is the missing data due to environmental and location conditions, hindering data collection. Consequently, WSNs should have effective mechanisms to recover missing data, and matrix factorization (MF) has shown to be a solid alternative to solve this problem. This study proposes a novel MF technique with a neural network architecture (i.e., deep matrix factorization or DMF) to estimate missing particulate matter (PM) data in a WSN in Aburrá Valley, Colombia. We found that the model that included spatial-temporal features (using embedding layers) captured the behavior of the pollution measured at each node more efficiently, thus producing better estimations than standard matrix factorization and other variations of the model proposed here.  相似文献   

7.
MOTIVATION: Modern machine learning methods based on matrix decomposition techniques, like independent component analysis (ICA) or non-negative matrix factorization (NMF), provide new and efficient analysis tools which are currently explored to analyze gene expression profiles. These exploratory feature extraction techniques yield expression modes (ICA) or metagenes (NMF). These extracted features are considered indicative of underlying regulatory processes. They can as well be applied to the classification of gene expression datasets by grouping samples into different categories for diagnostic purposes or group genes into functional categories for further investigation of related metabolic pathways and regulatory networks. RESULTS: In this study we focus on unsupervised matrix factorization techniques and apply ICA and sparse NMF to microarray datasets. The latter monitor the gene expression levels of human peripheral blood cells during differentiation from monocytes to macrophages. We show that these tools are able to identify relevant signatures in the deduced component matrices and extract informative sets of marker genes from these gene expression profiles. The methods rely on the joint discriminative power of a set of marker genes rather than on single marker genes. With these sets of marker genes, corroborated by leave-one-out or random forest cross-validation, the datasets could easily be classified into related diagnostic categories. The latter correspond to either monocytes versus macrophages or healthy vs Niemann Pick C disease patients.  相似文献   

8.
MOTIVATION: When working with large-scale protein interaction data, an important analysis task is the assignment of pairs of proteins to groups that correspond to higher order assemblies. Previously a common approach to this problem has been to apply standard hierarchical clustering methods to identify such a groups. Here we propose a new algorithm for aggregating a diverse collection of matrix factorizations to produce a more informative clustering, which takes the form of a 'soft' hierarchy of clusters. RESULTS: We apply the proposed Ensemble non-negative matrix factorization (NMF) algorithm to a high-quality assembly of binary protein interactions derived from two proteome-wide studies in yeast. Our experimental evaluation demonstrates that the algorithm lends itself to discovering small localized structures in this data, which correspond to known functional groupings of complexes. In addition, we show that the algorithm also supports the assignment of putative functions for previously uncharacterized proteins, for instance the protein YNR024W, which may be an uncharacterized component of the exosome.  相似文献   

9.
Due to the exponential growth of information, recommender systems have been a widely exploited technique to solve the problem of information overload effectively. Collaborative filtering (CF) is the most successful and extensively employed recommendation approach. However, current CF methods recommend suitable items for users mainly by user-item matrix that contains the individual preference of users for items in a collection. So these methods suffer from such problems as the sparsity of the available data and low accuracy in predictions. To address these issues, borrowing the idea of cognition degree from cognitive psychology and employing the regularized matrix factorization (RMF) as the basic model, we propose a novel drifting cognition degree-based RMF collaborative filtering method named CogTime_RMF that incorporates both user-item matrix and users’ drifting cognition degree with time. Moreover, we conduct experiments on the real datasets MovieLens 1 M and MovieLens 100 k, and the method is compared with three similarity based methods and three other latest matrix factorization based methods. Empirical results demonstrate that our proposal can yield better performance over other methods in accuracy of recommendation. In addition, results show that CogTime_RMF can alleviate the data sparsity, particularly in the circumstance that few ratings are observed.  相似文献   

10.
In this study we propose a new feature extraction algorithm, dNMF (discriminant non-negative matrix factorization), to learn subtle class-related differences while maintaining an accurate generative capability. In addition to the minimum representation error for the standard NMF (non-negative matrix factorization) algorithm, the dNMF algorithm also results in higher between-class variance for discriminant power. The multiplicative NMF learning algorithm has been modified to cope with this additional constraint. The cost function was carefully designed so that the extraction of feature coefficients from a single testing pattern with pre-trained feature vectors resulted in a quadratic convex optimization problem in non-negative space for uniqueness. It also resolves issues related to the previous discriminant NMF algorithms. The developed dNMF algorithm has been applied to the emotion recognition task for speech, where it needs to emphasize the emotional differences while de-emphasizing the dominant phonetic components. The dNMF algorithm successfully extracted subtle emotional differences, demonstrated much better recognition performance and showed a smaller representation error from an emotional speech database.  相似文献   

11.

Background  

External stimulations of cells by hormones, cytokines or growth factors activate signal transduction pathways that subsequently induce a re-arrangement of cellular gene expression. The analysis of such changes is complicated, as they consist of multi-layered temporal responses. While classical analyses based on clustering or gene set enrichment only partly reveal this information, matrix factorization techniques are well suited for a detailed temporal analysis. In signal processing, factorization techniques incorporating data properties like spatial and temporal correlation structure have shown to be robust and computationally efficient. However, such correlation-based methods have so far not be applied in bioinformatics, because large scale biological data rarely imply a natural order that allows the definition of a delayed correlation function.  相似文献   

12.
Nonnegative tensor factorization for continuous EEG classification   总被引:1,自引:0,他引:1  
In this paper we present a method for continuous EEG classification, where we employ nonnegative tensor factorization (NTF) to determine discriminative spectral features and use the Viterbi algorithm to continuously classify multiple mental tasks. This is an extension of our previous work on the use of nonnegative matrix factorization (NMF) for EEG classification. Numerical experiments with two data sets in BCI competition, confirm the useful behavior of the method for continuous EEG classification.  相似文献   

13.
This paper summarizes approaches to developing mathematics that can act as a language for immunogenetics. The need for this has been documented by showing inadequacies of the standard symbolism. Apparent distinctions in symbolizing and conceptualizing factors involved in immunogenetics are seen to disappear in the mathematical models presented here. One model, a three-fold Boolean matrix factorization, subsumes all approaches to the idea of specificity and yet is general enough to incorporate data beyond that found only in a reaction matrix.  相似文献   

14.
Muscle synergies have been investigated during different types of human movement using nonnegative matrix factorization. However, there are not any reports available on the reliability of the method. To evaluate between-day reliability, 21 subjects performed bench press, in two test sessions separated by approximately 7 days. The movement consisted of 3 sets of 8 repetitions at 60% of the three repetition maximum in bench press. Muscle synergies were extracted from electromyography data of 13 muscles, using nonnegative matrix factorization. To evaluate between-day reliability, we performed a cross-correlation analysis and a cross-validation analysis, in which the synergy components extracted in the first test session were recomputed, using the fixed synergy components from the second test session. Two muscle synergies accounted for >90% of the total variance, and reflected the concentric and eccentric phase, respectively. The cross-correlation values were strong to very strong (r-values between 0.58 and 0.89), while the cross-validation values ranged from substantial to almost perfect (ICC3, 1 values between 0.70 and 0.95). The present findings revealed that the same general structure of the muscle synergies was present across days and the extraction of muscle synergies is thus deemed reliable.  相似文献   

15.
The increasing availability of temporal network data is calling for more research on extracting and characterizing mesoscopic structures in temporal networks and on relating such structure to specific functions or properties of the system. An outstanding challenge is the extension of the results achieved for static networks to time-varying networks, where the topological structure of the system and the temporal activity patterns of its components are intertwined. Here we investigate the use of a latent factor decomposition technique, non-negative tensor factorization, to extract the community-activity structure of temporal networks. The method is intrinsically temporal and allows to simultaneously identify communities and to track their activity over time. We represent the time-varying adjacency matrix of a temporal network as a three-way tensor and approximate this tensor as a sum of terms that can be interpreted as communities of nodes with an associated activity time series. We summarize known computational techniques for tensor decomposition and discuss some quality metrics that can be used to tune the complexity of the factorized representation. We subsequently apply tensor factorization to a temporal network for which a ground truth is available for both the community structure and the temporal activity patterns. The data we use describe the social interactions of students in a school, the associations between students and school classes, and the spatio-temporal trajectories of students over time. We show that non-negative tensor factorization is capable of recovering the class structure with high accuracy. In particular, the extracted tensor components can be validated either as known school classes, or in terms of correlated activity patterns, i.e., of spatial and temporal coincidences that are determined by the known school activity schedule.  相似文献   

16.
Predicting what items will be selected by a target user in the future is an important function for recommendation systems. Matrix factorization techniques have been shown to achieve good performance on temporal rating-type data, but little is known about temporal item selection data. In this paper, we developed a unified model that combines Multi-task Non-negative Matrix Factorization and Linear Dynamical Systems to capture the evolution of user preferences. Specifically, user and item features are projected into latent factor space by factoring co-occurrence matrices into a common basis item-factor matrix and multiple factor-user matrices. Moreover, we represented both within and between relationships of multiple factor-user matrices using a state transition matrix to capture the changes in user preferences over time. The experiments show that our proposed algorithm outperforms the other algorithms on two real datasets, which were extracted from Netflix movies and Last.fm music. Furthermore, our model provides a novel dynamic topic model for tracking the evolution of the behavior of a user over time.  相似文献   

17.
In pharmaceutical sciences, a crucial step of the drug discovery process is the identification of drug-target interactions. However, only a small portion of the drug-target interactions have been experimentally validated, as the experimental validation is laborious and costly. To improve the drug discovery efficiency, there is a great need for the development of accurate computational approaches that can predict potential drug-target interactions to direct the experimental verification. In this paper, we propose a novel drug-target interaction prediction algorithm, namely neighborhood regularized logistic matrix factorization (NRLMF). Specifically, the proposed NRLMF method focuses on modeling the probability that a drug would interact with a target by logistic matrix factorization, where the properties of drugs and targets are represented by drug-specific and target-specific latent vectors, respectively. Moreover, NRLMF assigns higher importance levels to positive observations (i.e., the observed interacting drug-target pairs) than negative observations (i.e., the unknown pairs). Because the positive observations are already experimentally verified, they are usually more trustworthy. Furthermore, the local structure of the drug-target interaction data has also been exploited via neighborhood regularization to achieve better prediction accuracy. We conducted extensive experiments over four benchmark datasets, and NRLMF demonstrated its effectiveness compared with five state-of-the-art approaches.  相似文献   

18.
Advances in molecular “omics” technologies have motivated new methodologies for the integration of multiple sources of high-content biomedical data. However, most statistical methods for integrating multiple data matrices only consider data shared vertically (one cohort on multiple platforms) or horizontally (different cohorts on a single platform). This is limiting for data that take the form of bidimensionally linked matrices (eg, multiple cohorts measured on multiple platforms), which are increasingly common in large-scale biomedical studies. In this paper, we propose bidimensional integrative factorization (BIDIFAC) for integrative dimension reduction and signal approximation of bidimensionally linked data matrices. Our method factorizes data into (a) globally shared, (b) row-shared, (c) column-shared, and (d) single-matrix structural components, facilitating the investigation of shared and unique patterns of variability. For estimation, we use a penalized objective function that extends the nuclear norm penalization for a single matrix. As an alternative to the complicated rank selection problem, we use results from the random matrix theory to choose tuning parameters. We apply our method to integrate two genomics platforms (messenger RNA and microRNA expression) across two sample cohorts (tumor samples and normal tissue samples) using the breast cancer data from the Cancer Genome Atlas. We provide R code for fitting BIDIFAC, imputing missing values, and generating simulated data.  相似文献   

19.
The success of high-throughput sequencing has lead to an increasing number of projects which sequence large populations of a species. Storage and analysis of sequence data is a key challenge in these projects, because of the sheer size of the datasets. Compression is one simple technology to deal with this challenge. Referential factorization and compression schemes, which store only the differences between input sequence and a reference sequence, gained lots of interest in this field. Highly-similar sequences, e.g., Human genomes, can be compressed with a compression ratio of 1,000:1 and more, up to two orders of magnitude better than with standard compression techniques. Recently, it was shown that the compression against multiple references from the same species can boost the compression ratio up to 4,000:1. However, a detailed analysis of using multiple references is lacking, e.g., for main memory consumption and optimality. In this paper, we describe one key technique for the referential compression against multiple references: The factorization of sequences. Based on the notion of an optimal factorization, we propose optimization heuristics and identify parameter settings which greatly influence 1) the size of the factorization, 2) the time for factorization, and 3) the required amount of main memory. We evaluate a total of 30 setups with a varying number of references on data from three different species. Our results show a wide range of factorization sizes (optimal to an overhead of up to 300%), factorization speed (0.01 MB/s to more than 600 MB/s), and main memory usage (few dozen MB to dozens of GB). Based on our evaluation, we identify the best configurations for common use cases. Our evaluation shows that multi-reference factorization is much better than single-reference factorization.  相似文献   

20.
We consider the statistical analysis of population structure using genetic data. We show how the two most widely used approaches to modeling population structure, admixture-based models and principal components analysis (PCA), can be viewed within a single unifying framework of matrix factorization. Specifically, they can both be interpreted as approximating an observed genotype matrix by a product of two lower-rank matrices, but with different constraints or prior distributions on these lower-rank matrices. This opens the door to a large range of possible approaches to analyzing population structure, by considering other constraints or priors. In this paper, we introduce one such novel approach, based on sparse factor analysis (SFA). We investigate the effects of the different types of constraint in several real and simulated data sets. We find that SFA produces similar results to admixture-based models when the samples are descended from a few well-differentiated ancestral populations and can recapitulate the results of PCA when the population structure is more “continuous,” as in isolation-by-distance models.  相似文献   

设为首页 | 免责声明 | 关于勤云 | 加入收藏

Copyright©北京勤云科技发展有限公司  京ICP备09084417号