Found 20 similar articles (search time: 62 ms)
1.
Microarray technology is increasingly applied in biological and medical research to address a wide range of problems. Cluster analysis has proven to be a very useful tool for investigating the structure of microarray data. This paper presents a program for clustering microarray data based on the so-called path-distance. At each step the algorithm splits the data into two clusters, and it requires no prior assumptions about the structure of the clusters. It assigns each object (gene or sample) to exactly one cluster and gives the global optimum of the function that quantifies the adequacy of a given partition of the sample into k clusters. The program was tested on experimental data sets, showing the robustness of the algorithm.
2.
Microarray technology is increasingly applied in biological and medical research to address a wide range of problems. Cluster analysis has proven to be a very useful tool for investigating the structure of microarray data. This paper presents a program for clustering microarray data based on the so-called path-distance. At each step the algorithm splits the data into two clusters, and it requires no prior assumptions about the structure of the clusters. It assigns each object (gene or sample) to exactly one cluster and gives the global optimum of the function that quantifies the adequacy of a given partition of the sample into k clusters. The program was tested on experimental data sets, showing the robustness of the algorithm.
3.
Xiao-Qin Xia, Zhenyu Jia, Steffen Porwollik, Fred Long, Claudia Hoemme, Kai Ye, Carsten Müller-Tidow, Michael McClelland, Yipeng Wang 《Nucleic acids research》2010,38(11):e121
Most current microarray oligonucleotide probe design strategies are based on probe design factors (PDFs), which include probe hybridization free energy (PHFE), probe minimum folding energy (PMFE), dimer score, hairpin score, homology score and complexity score. The impact of these PDFs on probe performance was evaluated using four sets of microarray comparative genome hybridization (aCGH) data, which included two array manufacturing methods and the genomes of two species. Since most of the hybridizing DNA is equimolar in CGH data, such data are ideal for testing the general hybridization properties of almost all candidate oligonucleotides. In all our data sets, PDFs related to probe secondary structure (PMFE, hairpin score and dimer score) are the most significant factors linearly correlated with probe hybridization intensities. PHFE, homology score and complexity score correlate significantly with probe specificities, but in a non-linear fashion. We developed a new PDF, pseudo probe binding energy (PPBE), by iteratively fitting dinucleotide positional weights and dinucleotide stacking energies until the average residual sum of squares for the model was minimized. PPBE showed a better correlation with probe sensitivity and a better specificity than all other PDFs, although training data are required to construct a PPBE model prior to designing new oligonucleotide probes. The physical properties that are measured by PPBE are as yet unknown but include a platform-dependent component. A practical way to use these PDFs for probe design is to set cutoff thresholds to filter out poor-quality probes. Programs and correlation parameters from this study are freely available to facilitate the design of DNA microarray oligonucleotide probes.
4.
Leo Lahti, Aurora Torrente, Laura L. Elo, Alvis Brazma, Johan Rung 《Nucleic acids research》2013,41(10):e110
Rapid accumulation of large and standardized microarray data collections is opening up novel opportunities for holistic characterization of genome function. The limited scalability of current preprocessing techniques has, however, formed a bottleneck for full utilization of these data resources. Although short oligonucleotide arrays constitute a major source of genome-wide profiling data, scalable probe-level techniques have been available only for few platforms based on pre-calculated probe effects from restricted reference training sets. To overcome these key limitations, we introduce a fully scalable online-learning algorithm for probe-level analysis and pre-processing of large microarray atlases involving tens of thousands of arrays. In contrast to the alternatives, our algorithm scales up linearly with respect to sample size and is applicable to all short oligonucleotide platforms. The model can use the most comprehensive data collections available to date to pinpoint individual probes affected by noise and biases, providing tools to guide array design and quality control. This is the only available algorithm that can learn probe-level parameters based on sequential hyperparameter updates at small consecutive batches of data, thus circumventing the extensive memory requirements of the standard approaches and opening up novel opportunities to take full advantage of contemporary microarray collections.
5.
The major goal of two-color cDNA microarray experiments is to measure the relative gene expression level (i.e., relative amount of mRNA) of each gene between samples in studies of gene expression. More specifically, given an N-sample experiment, we need all N(N - 1)/2 relative expression levels of all sample pairs of each gene for identification of the differentially expressed genes and for clustering of gene expression patterns. However, the intensities observed from two-color cDNA microarray experiments do not simply represent the relative gene expression level. They are composed of signal (gene expression level), noise, and other factors. In discussions on the experimental design of two-color cDNA microarray experiments, little attention has been given to the fact that different combinations of test and control samples will produce microarray intensity data with varying intrinsic composition of factors. As a consequence, not all experimental designs for two-color cDNA microarray experiments are able to provide all possible relative gene expression levels. This phenomenon has never been addressed. To obtain all possible relative gene expression levels, a novel method for evaluating two-color cDNA microarray experimental designs is needed so that an accurate choice can be made. In this study, we propose a model-based approach to illustrate how the factor composition of microarray intensities changes with different experimental designs in two-color cDNA microarray experiments. By analyzing 12 experimental designs (including 5 general forms), we demonstrate that not all experimental designs are able to provide all possible relative gene expression levels due to the differences in factor composition.
Our results indicate that whether an experimental design can provide all possible relative expression levels of all sample pairs for each gene should be the first criterion to be considered in an evaluation of experimental designs for two-color cDNA microarray experiments.
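The N(N - 1)/2 count quoted in the abstract above is simply the number of unordered sample pairs. A minimal sketch that enumerates them is shown here; the function name and the sample labels are hypothetical, and the sketch does not reproduce the paper's model of factor composition.

```python
from itertools import combinations


def sample_pairs(samples):
    """Return every unordered pair of samples: N * (N - 1) / 2 pairs for N samples."""
    return list(combinations(samples, 2))


# Hypothetical labels for a 4-sample experiment.
samples = ["A", "B", "C", "D"]
pairs = sample_pairs(samples)
print(len(pairs))  # 6, i.e. 4 * 3 / 2
print(pairs[0])    # ('A', 'B')
```

A design could then be checked, pair by pair, against whatever criterion decides that a relative expression level is obtainable; that criterion is specific to the paper and is left out here.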
6.
Sample phenotype clusters in high-density oligonucleotide microarray data sets are revealed using Isomap, a nonlinear algorithm (total citations: 2, self-citations: 0, citations by others: 2)
Background
Life processes are determined by the organism's genetic profile and multiple environmental variables. However, the interaction between these factors is inherently non-linear [1]. Microarray data are one representation of the nonlinear interactions among genes and between genes and environmental factors. Still, most microarray studies use linear methods for the interpretation of nonlinear data. In this study, we apply Isomap, a nonlinear method of dimensionality reduction, to analyze three independent large Affymetrix high-density oligonucleotide microarray data sets.
7.
Xiaobo Zhou, Xiaodong Wang, Edward R Dougherty, Daniel Russ, Edward Suh 《Journal of computational biology》2004,11(1):147-161
Cluster analysis of gene-wide expression data from DNA microarray hybridization studies has proved to be a useful tool for identifying biologically relevant groupings of genes and constructing gene regulatory networks. The motivation for considering mutual information is its capacity to measure a general dependence among gene random variables. We propose a novel clustering strategy based on minimizing mutual information among gene clusters. Simulated annealing is employed to solve the optimization problem. Bootstrap techniques are employed to get more accurate estimates of mutual information when the data sample size is small. Moreover, we propose to combine the mutual information criterion and traditional distance criteria such as the Euclidean distance and the fuzzy membership metric in designing the clustering algorithm. The performances of the new clustering methods are compared with those of some existing methods, using both synthesized data and experimental data. It is seen that the clustering algorithm based on a combined metric of mutual information and fuzzy membership achieves the best performance. The supplemental material is available at www.gspsnap.tamu.edu/gspweb/zxb/glioma_zxb.
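To make the mutual information criterion mentioned above concrete, the following is a minimal sketch of the plug-in estimate for two discrete variables. It assumes expression values have already been discretized into labels; the function name is hypothetical, and the paper's bootstrap estimator and simulated-annealing search are not reproduced.

```python
from collections import Counter
from math import log2


def mutual_information(x, y):
    """Plug-in estimate of I(X; Y) in bits for two equal-length label sequences."""
    n = len(x)
    count_x = Counter(x)
    count_y = Counter(y)
    count_xy = Counter(zip(x, y))
    mi = 0.0
    for (a, b), c in count_xy.items():
        p_ab = c / n
        # Ratio of the joint probability to the product of the marginals.
        mi += p_ab * log2(p_ab / ((count_x[a] / n) * (count_y[b] / n)))
    return mi


# Identical labelings share all their information; independent ones share none.
print(mutual_information([0, 0, 1, 1], [0, 0, 1, 1]))
print(mutual_information([0, 0, 1, 1], [0, 1, 0, 1]))
```

With small samples this plug-in estimate is biased, which is consistent with the abstract's use of bootstrap techniques to refine it.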
8.
9.
Mfuzz: a software package for soft clustering of microarray data (total citations: 1, self-citations: 0, citations by others: 1)
For the analysis of microarray data, clustering techniques are frequently used. Most such methods are based on hard clustering of data, wherein one gene (or sample) is assigned to exactly one cluster. Hard clustering, however, suffers from several drawbacks such as sensitivity to noise and information loss. In contrast, soft clustering methods can assign a gene to several clusters. They can overcome shortcomings of conventional hard clustering techniques and offer further advantages. Thus, we constructed an R package termed Mfuzz implementing soft clustering tools for microarray data analysis. The additional package Mfuzzgui provides a convenient Tcl/Tk-based graphical user interface. AVAILABILITY: The R packages Mfuzz and Mfuzzgui are available at http://itb1.biologie.hu-berlin.de/~futschik/software/R/Mfuzz/index.html. Their distribution is subject to the GPL version 2 license.
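The difference between hard and soft assignment can be illustrated with the standard fuzzy c-means membership formula, in which each point receives a graded membership in every cluster. This is a sketch in Python under that assumption, not the Mfuzz R API; the function name, the fuzzifier default `m=2.0`, and the example coordinates are all hypothetical.

```python
from math import dist


def fuzzy_memberships(point, centers, m=2.0):
    """Fuzzy c-means memberships of one point; the values sum to 1 (requires m > 1)."""
    d = [dist(point, c) for c in centers]
    # A point lying exactly on one or more centers belongs fully to them.
    if any(di == 0.0 for di in d):
        hits = [1.0 if di == 0.0 else 0.0 for di in d]
        total = sum(hits)
        return [h / total for h in hits]
    power = 2.0 / (m - 1.0)
    # u_i = 1 / sum_k (d_i / d_k) ** (2 / (m - 1))
    return [1.0 / sum((di / dk) ** power for dk in d) for di in d]


# A point one unit from the first center and two units from the second.
print(fuzzy_memberships((1.0, 0.0), [(0.0, 0.0), (3.0, 0.0)]))
```

A hard clustering would report only the index of the largest membership and discard the rest, which is the information loss the abstract refers to.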
10.
With microarray technology becoming more prevalent in recent years, it is now common for several laboratories to employ the same microarray technology to identify differentially expressed genes that are related to the same phenomenon in the same species. Although experimental specifics may be similar, each laboratory will typically produce a slightly different list of statistically significant genes, which calls into question the validity of each gene list (i.e. which list is best). A statistically-based meta-analytic approach to microarray analysis systematically combines results from the different laboratories to provide a single estimate of the degree of differential expression for each gene. This approach provides a more precise view of genes that are of significant interest, while simultaneously allowing for differences between laboratories. The widely-used Affymetrix oligonucleotide array and its software are of particular interest because the results are naturally suited to a meta-analysis. A simulation model based on the Affymetrix platform is developed to examine the adaptive nature of the meta-analytic approach and to illustrate the utility of such an approach in combining microarray results across laboratories.
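The abstract above does not state which pooling rule is used. As one common possibility, a fixed-effect inverse-variance average combines per-laboratory estimates for a single gene into one value. The sketch below assumes that rule; the function name and the numbers are hypothetical, and a model that allows for between-laboratory differences would add a further variance component.

```python
def combine_fixed_effect(estimates, variances):
    """Inverse-variance weighted mean of per-laboratory estimates and its variance."""
    weights = [1.0 / v for v in variances]
    total = sum(weights)
    pooled = sum(w * e for w, e in zip(weights, estimates)) / total
    return pooled, 1.0 / total


# Hypothetical log-fold-change estimates for one gene from three laboratories.
pooled, pooled_var = combine_fixed_effect([1.2, 0.8, 1.0], [0.04, 0.09, 0.05])
print(pooled, pooled_var)
```

The pooled variance is smaller than any single laboratory's variance, which is the sense in which combining results gives a more precise estimate.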
11.
MOTIVATION: Principal Component Analysis (PCA) is one of the most popular dimensionality reduction techniques for the analysis of high-dimensional datasets. However, in its standard form, it does not take into account any error measures associated with the data points beyond a standard spherical noise. This indiscriminate nature provides one of its main weaknesses when applied to biological data with inherently large variability, such as expression levels measured with microarrays. Methods now exist for extracting credibility intervals from the probe-level analysis of cDNA and oligonucleotide microarray experiments. These credibility intervals are gene and experiment specific, and can be propagated through an appropriate probabilistic downstream analysis. RESULTS: We propose a new model-based approach to PCA that takes into account the variances associated with each gene in each experiment. We develop an efficient EM-algorithm to estimate the parameters of our new model. The model provides significantly better results than standard PCA, while remaining computationally reasonable. We show how the model can be used to 'denoise' a microarray dataset leading to improved expression profiles and tighter clustering across profiles. The probabilistic nature of the model means that the correct number of principal components is automatically obtained.
12.
A rainbow trout high-density oligonucleotide microarray was constructed using all tentative consensus (TC) sequences that are publicly available from all international rainbow trout Oncorhynchus mykiss genomic research projects through the Rainbow Trout Gene Index database. The new array contains 60-mer oligonucleotide probes representing 37 394 unique TC sequences and 1417 control spots. The array (4 × 44 format) was manufactured according to the design by Agilent Technologies using the inkjet-based SurePrint technology (design number 016320). The performance of the new microarray platform was evaluated by analysing gene expression associated with vitellogenesis-induced muscle atrophy in rainbow trout. This microarray will open new avenues of research that will aid in the development of novel strategies for genetic improvement for economically important traits benefiting the salmonid aquaculture industries.
13.
Motivation
In cluster analysis, the validity of specific solutions, algorithms, and procedures presents significant challenges because there is no null hypothesis to test and no 'right answer'. It has been noted that a replicable classification is not necessarily a useful one, but a useful one that characterizes some aspect of the population must be replicable. By replicable we mean reproducible across multiple samplings from the same population. Methodologists have suggested that the validity of clustering methods should be based on classifications that yield reproducible findings beyond chance levels. We used this approach to determine the performance of commonly used clustering algorithms and the degree of replicability achieved using several microarray datasets.
Methods
We considered four commonly used iterative partitioning algorithms (Self Organizing Maps (SOM), K-means, Clustering LARge Applications (CLARA), and Fuzzy C-means) and evaluated their performances on 37 microarray datasets, with sample sizes ranging from 12 to 172. We assessed reproducibility of the clustering algorithm by measuring the strength of relationship between clustering outputs of subsamples of 37 datasets. Cluster stability was quantified using Cramér's V² from a k × k table. Cramér's V² is equivalent to the squared canonical correlation coefficient between two sets of nominal variables. Potential scores range from 0 to 1, with 1 denoting perfect reproducibility.
Results
All four clustering routines show increased stability with larger sample sizes. K-means and SOM showed a gradual increase in stability with increasing sample size. CLARA and Fuzzy C-means, however, yielded low stability scores until sample sizes approached 30 and then gradually increased thereafter. Average stability never exceeded 0.55 for the four clustering routines, even at a sample size of 50. These findings suggest several plausible scenarios: (1) microarray datasets lack natural clustering structure, thereby producing low stability scores on all four methods; (2) the algorithms studied do not produce reliable results; and/or (3) sample sizes typically used in microarray research may be too small to support derivation of reliable clustering results. Further research should be directed towards evaluating the stability of more clustering algorithms on more datasets, especially those with larger sample sizes and larger numbers of clusters.
14.
Jennifer G. Mulle, Viren C. Patel, Stephen T. Warren, Madhuri R. Hegde, David J. Cutler, Michael E. Zwick 《PloS one》2010,5(3)
DNA-based microarrays are increasingly central to biomedical research. Selecting oligonucleotide sequences that will behave consistently across experiments is essential to the design, production and performance of DNA microarrays. Here our aim was to improve on probe design parameters by empirically and systematically evaluating probe performance in a multivariate context. We used experimental data from 19 array CGH hybridizations to assess the probe performance of 385,474 probes tiled in the Duchenne muscular dystrophy (DMD) region of the X chromosome. Our results demonstrate that probe melting temperature, single nucleotide polymorphisms (SNPs), and homocytosine motifs all have a strong effect on probe behavior. These findings, when incorporated into future microarray probe selection algorithms, may improve microarray performance for a wide variety of applications.
15.
16.
17.
18.
Background
Exploratory factor analysis is a commonly used statistical technique in metabolic syndrome research to uncover latent structure amongst metabolic variables. The application of factor analysis requires methodological decisions that reflect the hypothesis of the metabolic syndrome construct. These decisions often make the output harder to interpret. We propose two alternative techniques developed from cluster analysis which can achieve a clinically relevant structure, whilst maintaining intuitive advantages of clustering methodology.
Methods
Two advanced clustering techniques, the VARCLUS and matroid methods, are discussed and implemented on a metabolic syndrome data set to analyze the structure of ten metabolic risk factors. The subjects were selected from the Normative Aging Study based in Boston, Massachusetts. The sample included a total of 847 men aged between 21 and 81 years who provided complete data on selected risk factors during the period 1987 to 1991.
Results
Four core components were identified by the clustering methods. These are labelled obesity, lipids, insulin resistance and blood pressure. The exploratory factor analysis with oblique rotation suggested an overlap of the loadings identified on the insulin resistance and obesity factors. The VARCLUS and matroid analyses separated these components and were able to demonstrate associations between individual risk factors.
Conclusions
An oblique rotation can be selected to reflect the clinical concept of a single underlying syndrome; however, the results are often difficult to interpret. Factor loadings must be considered along with correlations between the factors. The correlated components produced by the VARCLUS and matroid analyses do not overlap, which allows for a simpler application of the methodologies and interpretation of the results. These techniques encourage consistency in the interpretation whilst remaining faithful to the construct under study.
19.
Gwinn MR, Keshava C, Olivero OA, Humsi JA, Poirier MC, Weston A 《Omics : a journal of integrative biology》2005,9(4):334-350
Microarrays are used to study gene expression in a variety of biological systems. A number of different platforms have been developed, but few studies exist that have directly compared the performance of one platform with another. The goal of this study was to determine array variation by analyzing the same RNA samples with three different array platforms. Using gene expression responses to benzo[a]pyrene exposure in normal human mammary epithelial cells (NHMECs), we compared the results of gene expression profiling using three microarray platforms: photolithographic oligonucleotide arrays (Affymetrix), spotted oligonucleotide arrays (Amersham), and spotted cDNA arrays (NCI). While most previous reports comparing microarrays have analyzed pre-existing data from different platforms, this comparison study used the same sample assayed on all three platforms, allowing for analysis of variation from each array platform. In general, poor correlation was found with corresponding measurements from each platform. Each platform yielded different gene expression profiles, suggesting that while microarray analysis is a useful discovery tool, further validation is needed to extrapolate results for broad use of the data. Also, microarray variability needs to be taken into consideration, not only in the data analysis but also in specific probe selection for each array type.
20.
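Tumor classification and feature gene selection based on gene expression profiles (total citations: 20, self-citations: 0, citations by others: 20)
Based on an analysis of the characteristics of gene expression profile data, a strategy is proposed for using such data for the molecular classification of tumors and for selecting feature genes of the corresponding subtypes. The strategy comprises three steps: first, an unsupervised gene-filtering algorithm is applied to reduce noise in the data used for classification; second, a probabilistic model is proposed to model the class structure among the samples; finally, based on the clustering results, a relative-entropy method is used to obtain the genes that contribute most to the classification as feature genes. The strategy was applied to re-mine two published data sets. The results show that it not only recovers the information obtainable by other methods but also provides finer and more biologically significant information, demonstrating a clear advantage.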
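The relative-entropy step in the strategy above can be sketched as follows. The abstract does not say how the per-subtype distributions are formed, so this example assumes each gene's expression has been binned into a discrete histogram per subtype and ranks genes by the Kullback-Leibler divergence between the two histograms. The function name, gene names, and histogram values are hypothetical.

```python
from math import log2


def relative_entropy(p, q):
    """D(p || q) in bits for two discrete distributions on the same bins.

    Assumes q[i] > 0 wherever p[i] > 0.
    """
    return sum(pi * log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)


# Hypothetical binned expression histograms (low, medium, high) in two subtypes.
gene_histograms = {
    "gene_a": ([0.8, 0.1, 0.1], [0.1, 0.1, 0.8]),
    "gene_b": ([0.4, 0.3, 0.3], [0.3, 0.3, 0.4]),
}

# Genes whose distributions differ most between subtypes rank first.
ranked = sorted(
    gene_histograms,
    key=lambda g: relative_entropy(*gene_histograms[g]),
    reverse=True,
)
print(ranked)
```

Relative entropy is not symmetric, so a practical ranking might average D(p || q) and D(q || p); that choice is left open here.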