共查询到20条相似文献,搜索用时 15 毫秒
1.
2.
MOTIVATION: A promising and reliable approach to annotate gene function is clustering genes not only by using gene expression data but also literature information, especially gene networks. RESULTS: We present a systematic method for gene clustering by combining these totally different two types of data, particularly focusing on network modularity, a global feature of gene networks. Our method is based on learning a probabilistic model, which we call a hidden modular random field in which the relation between hidden variables directly represents a given gene network. Our learning algorithm which minimizes an energy function considering the network modularity is practically time-efficient, regardless of using the global network property. We evaluated our method by using a metabolic network and microarray expression data, changing with microarray datasets, parameters of our model and gold standard clusters. Experimental results showed that our method outperformed other four competing methods, including k-means and existing graph partitioning methods, being statistically significant in all cases. Further detailed analysis showed that our method could group a set of genes into a cluster which corresponds to the folate metabolic pathway while other methods could not. From these results, we can say that our method is highly effective for gene clustering and annotating gene function. 相似文献
3.
Habegger L Sboner A Gianoulis TA Rozowsky J Agarwal A Snyder M Gerstein M 《Bioinformatics (Oxford, England)》2011,27(2):281-283
SUMMARY: The advent of next-generation sequencing for functional genomics has given rise to quantities of sequence information that are often so large that they are difficult to handle. Moreover, sequence reads from a specific individual can contain sufficient information to potentially identify and genetically characterize that person, raising privacy concerns. In order to address these issues, we have developed the Mapped Read Format (MRF), a compact data summary format for both short and long read alignments that enables the anonymization of confidential sequence information, while allowing one to still carry out many functional genomics studies. We have developed a suite of tools (RSEQtools) that use this format for the analysis of RNA-Seq experiments. These tools consist of a set of modules that perform common tasks such as calculating gene expression values, generating signal tracks of mapped reads and segmenting that signal into actively transcribed regions. Moreover, the tools can readily be used to build customizable RNA-Seq workflows. In addition to the anonymization afforded by MRF, this format also facilitates the decoupling of the alignment of reads from downstream analyses. Availability and implementation: RSEQtools is implemented in C and the source code is available at http://rseqtools.gersteinlab.org/. 相似文献
4.
Background
Analysis of microarray and other high-throughput data on the basis of gene sets, rather than individual genes, is becoming more important in genomic studies. Correspondingly, a large number of statistical approaches for detecting gene set enrichment have been proposed, but both the interrelations and the relative performance of the various methods are still very much unclear. 相似文献5.
6.
7.
8.
Shah M Corbeil J 《IEEE/ACM transactions on computational biology and bioinformatics / IEEE, ACM》2011,8(1):14-26
We propose a general theoretical framework for analyzing differentially expressed genes and behavior patterns from two homogenous short time-course data. The framework generalizes the recently proposed Hilbert-Schmidt Independence Criterion (HSIC)-based framework adapting it to the time-series scenario by utilizing tensor analysis for data transformation. The proposed framework is effective in yielding criteria that can identify both the differentially expressed genes and time-course patterns of interest between two time-series experiments without requiring to explicitly cluster the data. The results, obtained by applying the proposed framework with a linear kernel formulation, on various data sets are found to be both biologically meaningful and consistent with published studies. 相似文献
9.
10.
Dimension reduction strategies for analyzing global gene expression data with a response 总被引:5,自引:0,他引:5
The analysis of global gene expression data from microarrays is breaking new ground in genetics research, while confronting modelers and statisticians with many critical issues. In this paper, we consider data sets in which a categorical or continuous response is recorded, along with gene expression, on a given number of experimental samples. Data of this type are usually employed to create a prediction mechanism for the response based on gene expression, and to identify a subset of relevant genes. This defines a regression setting characterized by a dramatic under-resolution with respect to the predictors (genes), whose number exceeds by orders of magnitude the number of available observations (samples). We present a dimension reduction strategy that, under appropriate assumptions, allows us to restrict attention to a few linear combinations of the original expression profiles, and thus to overcome under-resolution. These linear combinations can then be used to build and validate a regression model with standard techniques. Moreover, they can be used to rank original predictors, and ultimately to select a subset of them through comparison with a background 'chance scenario' based on a number of independent randomizations. We apply this strategy to publicly available data on leukemia classification. 相似文献
11.
Microarray technology is increasingly being applied in biological and medical research to address a wide range of problems. Cluster analysis has proven to be a very useful tool for investigating the structure of microarray data. This paper presents a program for clustering microarray data, which is based on the so call path-distance. The algorithm gives in each step a partition in two clusters and no prior assumptions on the structure of clusters are required. It assigns each object (gene or sample) to only one cluster and gives the global optimum for the function that quantifies the adequacy of a given partition of the sample into k clusters. The program was tested on experimental data sets, showing the robustness of the algorithm. 相似文献
12.
Metagenomic studies sequence DNA directly from environmental samples to explore the structure and function of complex microbial
and viral communities. Individual, short pieces of sequenced DNA (“reads”) are classified into (putative) taxonomic or metabolic
groups which are analyzed for patterns across samples. Analysis of such read matrices is at the core of using metagenomic
data to make inferences about ecosystem structure and function. Non-negative matrix factorization (NMF) is a numerical technique
for approximating high-dimensional data points as positive linear combinations of positive components. It is thus well suited
to interpretation of observed samples as combinations of different components. We develop, test and apply an NMF-based framework
to analyze metagenomic read matrices. In particular, we introduce a method for choosing NMF degree in the presence of overlap,
and apply spectral-reordering techniques to NMF-based similarity matrices to aid visualization. We show that our method can
robustly identify the appropriate degree and disentangle overlapping contributions using synthetic data sets. We then examine
and discuss the NMF decomposition of a metabolic profile matrix extracted from 39 publicly available metagenomic samples,
and identify canonical sample types, including one associated with coral ecosystems, one associated with highly saline ecosystems
and others. We also identify specific associations between pathways and canonical environments, and explore how alternative
choices of decompositions facilitate analysis of read matrices at a finer scale. 相似文献
13.
14.
Background
DNA methylation has been identified to be widely associated to complex diseases. Among biological platforms to profile DNA methylation in human, the Illumina Infinium HumanMethylation450 BeadChip (450K) has been accepted as one of the most efficient technologies. However, challenges exist in analysis of DNA methylation data generated by this technology due to widespread biases.Results
Here we proposed a generalized framework for evaluating data analysis methods for Illumina 450K array. This framework considers the following steps towards a successful analysis: importing data, quality control, within-array normalization, correcting type bias, detecting differentially methylated probes or regions and biological interpretation.Conclusions
We evaluated five methods using three real datasets, and proposed outperform methods for the Illumina 450K array data analysis. Minfi and methylumi are optimal choice when analyzing small dataset. BMIQ and RCP are proper to correcting type bias and the normalized result of them can be used to discover DMPs. R package missMethyl is suitable for GO term enrichment analysis and biological interpretation.15.
A large number of biclustering methods have been proposed to detect patterns in gene expression data. All these methods try to find some type of biclusters but no one can discover all the types of patterns in the data. Furthermore, researchers have to design new algorithms in order to find new types of biclusters/patterns that interest biologists. In this paper, we propose a novel approach for biclustering that, in general, can be used to discover all computable patterns in gene expression data. The method is based on the theory of Kolmogorov complexity. More precisely, we use Kolmogorov complexity to measure the randomness of submatrices as the merit of biclusters because randomness naturally consists in a lack of regularity, which is a common property of all types of patterns. On the basis of algorithmic probability measure, we develop a Markov Chain Monte Carlo algorithm to search for biclusters. Our method can also be easily extended to solve the problems of conventional clustering and checkerboard type biclustering. The preliminary experiments on simulated as well as real data show that our approach is very versatile and promising. 相似文献
16.
RNA-Seq and microarray platforms have emerged as important tools for detecting changes in gene expression and RNA processing
in biological samples. We present ExpressionPlot, a software package consisting of a default back end, which prepares raw
sequencing or Affymetrix microarray data, and a web-based front end, which offers a biologically centered interface to browse,
visualize, and compare different data sets. Download and installation instructions, a user's manual, discussion group, and
a prototype are available at . 相似文献
17.
Javier Garcia-Garcia Emre Guney Ramon Aragues Joan Planas-Iglesias Baldo Oliva 《BMC bioinformatics》2010,11(1):56
Background
The analysis and usage of biological data is hindered by the spread of information across multiple repositories and the difficulties posed by different nomenclature systems and storage formats. In particular, there is an important need for data unification in the study and use of protein-protein interactions. Without good integration strategies, it is difficult to analyze the whole set of available data and its properties. 相似文献18.
19.
20.
A method is described that allows the assessment of treelikeness of phylogenetic distance data before tree estimation. This method is related to statistical geometry as introduced by Eigen, Winkler-Oswatitsch, and Dress (1988 [Proc. Natl. Acad. Sci. USA. 85:5913-5917]), and in essence, displays a measure for treelikeness of quartets in terms of a histogram that we call a delta plot. This allows identification of nontreelike data and analysis of noisy data sets arising from processes such as, for example, parallel evolution, recombination, or lateral gene transfer. In addition to an overall assessment of treelikeness, individual taxa can be ranked by reference to the treelikeness of the quartets to which they belong. Removal of taxa on the basis of this ranking results in an increase in accuracy of tree estimation. Recombinant data sets are simulated, and the method is shown to be capable of identifying single recombinant taxa on the basis of distance information alone, provided the parents of the recombinant sequence are sufficiently divergent and the mixture of tree histories is not strongly skewed toward a single tree. delta Plots and taxon rankings are applied to three biological data sets using distances derived from sequence alignment, gene order, and fragment length polymorphism. 相似文献