Found 20 similar documents; search took 62 ms
1.
Harri T Kiiveri 《BMC bioinformatics》2008,9(1):195
Background
With the advent of high-throughput biotechnology data acquisition platforms such as microarrays, SNP chips and mass spectrometers, data sets with many more variables than observations are now routinely being collected. Finding relationships between response variables of interest and variables in such data sets is an important problem akin to finding needles in a haystack. Whilst methods for a number of response types have been developed, a general approach has been lacking.
2.
Canonical correlation analysis (CCA) describes the associations between two sets of variables by maximizing the correlation between linear combinations of the variables in each dataset. However, in high-dimensional settings where the number of variables exceeds the sample size or when the variables are highly correlated, traditional CCA is no longer appropriate. This paper proposes a method for sparse CCA. Sparse estimation produces linear combinations of only a subset of variables from each dataset, thereby increasing the interpretability of the canonical variates. We consider the CCA problem from a predictive point of view and recast it into a regression framework. By combining an alternating regression approach together with a lasso penalty, we induce sparsity in the canonical vectors. We compare the performance with other sparse CCA techniques in different simulation settings and illustrate its usefulness on a genomic dataset.
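The alternating-regression idea in this abstract can be illustrated with a minimal sketch. It is not the authors' algorithm: the function names (`soft_threshold`, `lasso_cd`, `sparse_cca`), the penalty values, the SVD-based starting point and the unit-variance normalization are all assumptions chosen for illustration, and only the first pair of canonical vectors is estimated.

```python
import numpy as np

def soft_threshold(z, t):
    """Soft-thresholding operator used by the lasso update."""
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def lasso_cd(A, b, lam, n_iter=200):
    """Coordinate descent for min_w (1/2n)||b - A w||^2 + lam * ||w||_1."""
    n, p = A.shape
    w = np.zeros(p)
    col_sq = (A ** 2).sum(axis=0) / n
    r = b.copy()  # running residual b - A w
    for _ in range(n_iter):
        for j in range(p):
            if col_sq[j] == 0.0:
                continue
            rho = A[:, j] @ r / n + col_sq[j] * w[j]
            w_new = soft_threshold(rho, lam) / col_sq[j]
            r += A[:, j] * (w[j] - w_new)  # keep residual in sync
            w[j] = w_new
    return w

def sparse_cca(X, Y, lam_x=0.1, lam_y=0.1, n_alt=50, tol=1e-6):
    """First pair of sparse canonical vectors via alternating lasso regressions.

    Assumes no column of X or Y is constant. Returns (u, v), which apply to
    the column-standardized versions of X and Y.
    """
    X = (X - X.mean(0)) / X.std(0)
    Y = (Y - Y.mean(0)) / Y.std(0)
    # Hypothetical starting point: leading right singular vector of X'Y.
    v = np.linalg.svd(X.T @ Y, full_matrices=False)[2][0]
    u = np.zeros(X.shape[1])
    for _ in range(n_alt):
        u_old = u
        t = Y @ v
        t /= t.std()                 # canonical variate of Y, unit variance
        u = lasso_cd(X, t, lam_x)    # regress it on X with a lasso penalty
        if not u.any():
            break                    # penalty too strong: everything shrunk to 0
        u = u / (X @ u).std()
        s = X @ u                    # canonical variate of X, unit variance
        v = lasso_cd(Y, s, lam_y)    # regress it on Y with a lasso penalty
        if not v.any():
            break
        v = v / (Y @ v).std()
        if np.linalg.norm(u - u_old) < tol:
            break
    return u, v
```

Calling `sparse_cca(X, Y)` on two matrices with the same number of rows returns weight vectors in which most entries are exactly zero when the penalties are large enough; in practice the penalties would be tuned, for example by cross-validation.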
3.
Background
Comparison of data produced on different microarray platforms often shows surprising discordance. It is not clear whether this discrepancy is caused by noisy data or by improper probe matching between platforms. We investigated whether the significant level of inconsistency between results produced by alternative gene expression microarray platforms could be reduced by stringent sequence matching of microarray probes. We mapped the short oligo probes of the Affymetrix platform onto cDNA clones of the Stanford microarray platform. Affymetrix probes were reassigned to redefined probe sets if they mapped to the same cDNA clone sequence, regardless of the original manufacturer-defined grouping. The NCI-60 gene expression profiles produced by Affymetrix HuFL platform were recalculated using these redefined probe sets and compared to previously published cDNA measurements of the same panel of RNA samples.
4.
Haley R. Eidem Jacob L. Steenwyk Jennifer H. Wisecaver John A. Capra Patrick Abbot Antonis Rokas 《BMC medical genomics》2018,11(1):107
Background
The integration of high-quality, genome-wide analyses offers a robust approach to elucidating genetic factors involved in complex human diseases. Even though several methods exist to integrate heterogeneous omics data, most biologists still manually select candidate genes by examining the intersection of lists of candidates stemming from analyses of different types of omics data that have been generated by imposing hard (strict) thresholds on quantitative variables, such as P-values and fold changes, increasing the chance of missing potentially important candidates.
Methods
To better facilitate the unbiased integration of heterogeneous omics data collected from diverse platforms and samples, we propose a desirability function framework for identifying candidate genes with strong evidence across data types as targets for follow-up functional analysis. Our approach is targeted towards disease systems with sparse, heterogeneous omics data, so we tested it on one such pathology: spontaneous preterm birth (sPTB).
Results
We developed the software integRATE, which uses desirability functions to rank genes both within and across studies, identifying well-supported candidate genes according to the cumulative weight of biological evidence rather than based on imposition of hard thresholds of key variables. Integrating 10 sPTB omics studies identified both genes in pathways previously suspected to be involved in sPTB as well as novel genes never before linked to this syndrome. integRATE is available as an R package on GitHub (https://github.com/haleyeidem/integRATE).
Conclusions
Desirability-based data integration is a solution most applicable in biological research areas where omics data is especially heterogeneous and sparse, allowing for the prioritization of candidate genes that can be used to inform more targeted downstream functional analyses.
5.
Richard Shippy Timothy J Sendera Randall Lockner Chockalingam Palaniappan Tamma Kaysser-Kranich George Watts John Alsobrook 《BMC genomics》2004,5(1):61-15
Background
Despite the widespread use of microarrays, much ambiguity regarding data analysis, interpretation and correlation of the different technologies exists. There is a considerable amount of interest in correlating results obtained between different microarray platforms. To date, only a few cross-platform evaluations have been published and unfortunately, no guidelines have been established on the best methods of making such correlations. To address this issue we conducted a thorough evaluation of two commercial microarray platforms to determine an appropriate methodology for making cross-platform correlations.
6.
Charlotte Soneson Henrik Lilljebjörn Thoas Fioretos Magnus Fontes 《BMC bioinformatics》2010,11(1):191
Background
With the rapid development of new genetic measurement methods, several types of genetic alterations can be quantified in a high-throughput manner. While the initial focus has been on investigating each data set separately, there is an increasing interest in studying the correlation structure between two or more data sets. Multivariate methods based on Canonical Correlation Analysis (CCA) have been proposed for integrating paired genetic data sets. The high dimensionality of microarray data imposes computational difficulties, which have been addressed for instance by studying the covariance structure of the data, or by reducing the number of variables prior to applying the CCA. In this work, we propose a new method for analyzing high-dimensional paired genetic data sets, which mainly emphasizes the correlation structure and still permits efficient application to very large data sets. The method is implemented by translating a regularized CCA to its dual form, where the computational complexity depends mainly on the number of samples instead of the number of variables. The optimal regularization parameters are chosen by cross-validation. We apply the regularized dual CCA, as well as a classical CCA preceded by a dimension-reducing Principal Components Analysis (PCA), to a paired data set of gene expression changes and copy number alterations in leukemia.
7.
Robert A van den Berg Iven Van Mechelen Tom F Wilderjans Katrijn Van Deun Henk AL Kiers Age K Smilde 《BMC bioinformatics》2009,10(1):340
Background
In contemporary biology, complex biological processes are increasingly studied by collecting and analyzing measurements of the same entities that are collected with different analytical platforms. Such data comprise a number of data blocks that are coupled via a common mode. The goal of collecting this type of data is to discover biological mechanisms that underlie the behavior of the variables in the different data blocks. The simultaneous component analysis (SCA) family of data analysis methods is suited for this task. However, an SCA may be hampered by the data blocks being subjected to different amounts of measurement error, or noise. To unveil the true mechanisms underlying the data, it could be fruitful to take noise heterogeneity into consideration in the data analysis. Maximum likelihood based SCA (MxLSCA-P) was developed for this purpose. In a previous simulation study it outperformed normal SCA-P. This previous study, however, did not mimic typical functional genomics data sets in many respects, such as data blocks coupled via the experimental mode, more variables than experimental units, and medium to high correlations between variables. Here, we present a new simulation study in which the usefulness of MxLSCA-P compared to ordinary SCA-P is evaluated within a typical functional genomics setting. Subsequently, the performance of the two methods is evaluated by analysis of a real life Escherichia coli metabolomics data set.
8.
Background
Although prognostic biomarkers specific for particular cancers have been discovered, microarray analysis of gene expression profiles, supported by integrative analysis algorithms, helps to identify common factors in molecular oncology. Similarities of Ordered Gene Lists (SOGL) is a recently proposed approach to meta-analysis suitable for identifying features shared by two data sets. Here we extend the idea of SOGL to the detection of significant prognostic marker genes from microarrays of multiple data sets. Three data sets for leukemia and the other six for different solid tumors are used to demonstrate our method, using established statistical techniques.
9.
Ki-Yeol Kim Dong Hyuk Ki Ha Jin Jeong Hei-Cheul Jeung Hyun Cheol Chung Sun Young Rha 《BMC bioinformatics》2007,8(1):218
Background
With microarray technology, variability in experimental environments such as RNA sources, microarray production, or the use of different platforms, can cause bias. Such systematic differences present a substantial obstacle to the analysis of microarray data, resulting in inconsistent and unreliable information. Therefore, one of the most pressing challenges in the field of microarray technology is how to integrate results from different microarray experiments or combine data sets prior to the specific analysis.
10.
Background
As the canonical code is not universal, different theories about its origin and organization have appeared. The optimization or level of adaptation of the canonical genetic code was measured taking into account the harmful consequences resulting from point mutations leading to the replacement of one amino acid for another. There are two basic theories to measure the level of optimization: the statistical approach, which compares the canonical genetic code with many randomly generated alternative ones, and the engineering approach, which compares the canonical code with the best possible alternative.
11.
Maurizio Callari Matteo Dugo Valeria Musella Edoardo Marchesi Giovanna Chiorino Maurizia Mello Grand Marco Alessandro Pierotti Maria Grazia Daidone Silvana Canevari Loris De Cecco 《PloS one》2012,7(9)
Background
Microarray technology applied to microRNA (miRNA) profiling is a promising tool in many research fields; nevertheless, independent studies characterizing the same pathology have often reported poorly overlapping results. miRNA analysis methods have only recently been systematically compared, and only in a few cases using clinical samples.
Methodology/Principal Findings
We investigated the inter-platform reproducibility of four miRNA microarray platforms (Agilent, Exiqon, Illumina, and Miltenyi), comparing nine paired tumor/normal colon tissues. The most concordant and selected discordant miRNAs were further studied by quantitative RT-PCR. Globally, a poor overlap among differentially expressed miRNAs identified by each platform was found. Nevertheless, for eight miRNAs high agreement in differential expression among the four platforms and comparability to qRT-PCR was observed. Furthermore, most of the miRNA sets identified by each platform are coherently enriched in data from the other platforms and the great majority of colon cancer associated miRNA sets derived from the literature were validated in our data, independently from the platform. Computational integration of miRNA and gene expression profiles suggested that anti-correlated predicted target genes of differentially expressed miRNAs are commonly enriched in cancer-related pathways and in genes involved in glycolysis and nutrient transport.
Conclusions
Technical and analytical challenges in measuring miRNAs still remain and further research is required in order to increase consistency between different microarray-based methodologies. However, a better inter-platform agreement was found by looking at miRNA sets instead of single miRNAs and through a miRNAs – gene expression integration approach.
12.
Background
There is a large amount of microarray data accumulating in public databases, providing various data waiting to be analyzed jointly. Powerful kernel-based methods are commonly used in microarray analyses with support vector machines (SVMs) to approach a wide range of classification problems. However, the standard vectorial data kernel family (linear, RBF, etc.), which takes vectorial data as input, often fails in prediction if the data come from different platforms or laboratories, due to the low gene overlaps or consistencies between the different datasets.
13.
Background
Several preprocessing algorithms for Affymetrix gene expression microarrays have been developed, and their performance on spike-in data sets has been evaluated previously. However, a comprehensive comparison of preprocessing algorithms on samples taken under research conditions has not been performed.
Methodology/Principal Findings
We used TaqMan RT-PCR arrays as a reference to evaluate the accuracy of expression values from Affymetrix microarrays in two experimental data sets: one comprising 84 genes in 36 colon biopsies, and the other comprising 75 genes in 29 cancer cell lines. We evaluated consistency using the Pearson correlation between measurements obtained on the two platforms. Also, we introduce the log-ratio discrepancy as a more relevant measure of discordance between gene expression platforms. Of nine preprocessing algorithms tested, PLIER+16 produced expression values that were most consistent with RT-PCR measurements, although the difference in performance between most of the algorithms was not statistically significant.
Conclusions/Significance
Our results support the choice of PLIER+16 for the preprocessing of clinical Affymetrix microarray data. However, other algorithms performed similarly and are probably also good choices.
14.
Woochang Hwang Young-Rae Cho Aidong Zhang Murali Ramanathan 《Algorithms for molecular biology : AMB》2006,1(1):24-11
Background
The sparse connectivity of protein-protein interaction data sets makes identification of functional modules challenging. The purpose of this study is to critically evaluate a novel clustering technique for clustering and detecting functional modules in protein-protein interaction networks, termed STM.
15.
Background
Gene clustering has been widely used to group genes with similar expression patterns in microarray data analysis. Subsequent enrichment analysis using predefined gene sets can provide clues on which functional themes or regulatory sequence motifs are associated with individual gene clusters. In spite of the potential utility, gene clustering and enrichment analysis have been used in separate platforms; thus, the development of an integrative algorithm linking both methods is highly challenging.
16.
Background
For fermentation process and strain improvement, where one wants to screen a large number of conditions and strains, robust and scalable high-throughput cultivation systems are crucial. Often, the time lag between bench-scale cultivations and production largely depends on approximate estimation of scalable physiological traits. Microtiter plate (MTP) based screening platforms have lately become an attractive alternative to shake flasks mainly because of the ease of automation. However, there are very few reports on applications for filamentous organisms, and efforts towards systematic validation of physiological behavior compared to larger scale are sparse. Moreover, available small-scale screening approaches are typically constrained by evaluating only an end point snapshot of phenotypes.
17.
18.
19.
Background
Random forest (RF) is a machine-learning method that generally works well with high-dimensional problems and allows for nonlinear relationships between predictors; however, the presence of correlated predictors has been shown to impact its ability to identify strong predictors. The Random Forest-Recursive Feature Elimination algorithm (RF-RFE) mitigates this problem in smaller data sets, but this approach has not been tested in high-dimensional omics data sets.
Results
We integrated 202,919 genotypes and 153,422 methylation sites in 680 individuals, and compared the abilities of RF and RF-RFE to detect simulated causal associations, which included simulated genotype–methylation interactions, between these variables and triglyceride levels. Results show that RF was able to identify strong causal variables with a few highly correlated variables, but it did not detect other causal variables.
Conclusions
Although RF-RFE decreased the importance of correlated variables, in the presence of many correlated variables, it also decreased the importance of causal variables, making both hard to detect. These findings suggest that RF-RFE may not scale to high-dimensional data.
20.