Similar literature
1.
Microarray technology is increasingly being applied in biological and medical research to address a wide range of problems. Cluster analysis has proven to be a very useful tool for investigating the structure of microarray data. This paper presents a program for clustering microarray data that is based on the so-called path-distance. At each step the algorithm produces a partition into two clusters, and no prior assumptions about the structure of the clusters are required. It assigns each object (gene or sample) to exactly one cluster and gives the global optimum of the function that quantifies the adequacy of a given partition of the sample into k clusters. The program was tested on experimental data sets, which showed the robustness of the algorithm.

2.
Irigoien I, Fernandez E, Vives S, Arenas C. Genetika, 2008, 44(8): 1137-1140
Microarray technology is increasingly being applied in biological and medical research to address a wide range of problems. Cluster analysis has proven to be a very useful tool for investigating the structure of microarray data. This paper presents a program for clustering microarray data that is based on the so-called path-distance. At each step the algorithm produces a partition into two clusters, and no prior assumptions about the structure of the clusters are required. It assigns each object (gene or sample) to exactly one cluster and gives the global optimum of the function that quantifies the adequacy of a given partition of the sample into k clusters. The program was tested on experimental data sets, which showed the robustness of the algorithm.

3.
Most current microarray oligonucleotide probe design strategies are based on probe design factors (PDFs), which include probe hybridization free energy (PHFE), probe minimum folding energy (PMFE), dimer score, hairpin score, homology score and complexity score. The impact of these PDFs on probe performance was evaluated using four sets of microarray comparative genome hybridization (aCGH) data, which included two array manufacturing methods and the genomes of two species. Since most of the hybridizing DNA is equimolar in CGH data, such data are ideal for testing the general hybridization properties of almost all candidate oligonucleotides. In all our data sets, the PDFs related to probe secondary structure (PMFE, hairpin score and dimer score) are the factors most significantly linearly correlated with probe hybridization intensities. PHFE, homology score and complexity score correlate significantly with probe specificities, but in a non-linear fashion. We developed a new PDF, pseudo probe binding energy (PPBE), by iteratively fitting dinucleotide positional weights and dinucleotide stacking energies until the average residual sum of squares for the model was minimized. PPBE showed a better correlation with probe sensitivity and a better specificity than all other PDFs, although training data are required to construct a PPBE model prior to designing new oligonucleotide probes. The physical properties that are measured by PPBE are as yet unknown but include a platform-dependent component. A practical way to use these PDFs for probe design is to set cutoff thresholds that filter out poor-quality probes. Programs and correlation parameters from this study are freely available to facilitate the design of DNA microarray oligonucleotide probes.

4.
Rapid accumulation of large and standardized microarray data collections is opening up novel opportunities for holistic characterization of genome function. The limited scalability of current preprocessing techniques has, however, formed a bottleneck for full utilization of these data resources. Although short oligonucleotide arrays constitute a major source of genome-wide profiling data, scalable probe-level techniques have been available for only a few platforms, based on pre-calculated probe effects from restricted reference training sets. To overcome these key limitations, we introduce a fully scalable online-learning algorithm for probe-level analysis and pre-processing of large microarray atlases involving tens of thousands of arrays. In contrast to the alternatives, our algorithm scales linearly with sample size and is applicable to all short oligonucleotide platforms. The model can use the most comprehensive data collections available to date to pinpoint individual probes affected by noise and biases, providing tools to guide array design and quality control. This is the only available algorithm that can learn probe-level parameters through sequential hyperparameter updates on small consecutive batches of data, thus circumventing the extensive memory requirements of the standard approaches and opening up novel opportunities to take full advantage of contemporary microarray collections.

5.
The major goal of two-color cDNA microarray experiments is to measure the relative gene expression level (i.e., relative amount of mRNA) of each gene between samples in studies of gene expression. More specifically, given an N-sample experiment, we need all N(N - 1)/2 relative expression levels of all sample pairs of each gene for identification of the differentially expressed genes and for clustering of gene expression patterns. However, the intensities observed from two-color cDNA microarray experiments do not simply represent the relative gene expression level. They are composed of signal (gene expression level), noise, and other factors. In discussions on the experimental design of two-color cDNA microarray experiments, little attention has been given to the fact that different combinations of test and control samples will produce microarray intensity data with varying intrinsic composition of factors. As a consequence, not all experimental designs for two-color cDNA microarray experiments are able to provide all possible relative gene expression levels. This phenomenon has never been addressed. To obtain all possible relative gene expression levels, a novel method for evaluating two-color cDNA microarray experimental designs is needed so that an accurate choice can be made. In this study, we propose a model-based approach to illustrate how the factor composition of microarray intensities changes with different experimental designs in two-color cDNA microarray experiments. By analyzing 12 experimental designs (including 5 general forms), we demonstrate that not all experimental designs are able to provide all possible relative gene expression levels due to the differences in factor composition. Our results indicate that whether an experimental design can provide all possible relative expression levels of all sample pairs for each gene should be the first criterion to be considered in an evaluation of experimental designs for two-color cDNA microarray experiments.
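The N(N - 1)/2 count in the abstract above can be made concrete with a small sketch. The sample names and the loop design below are hypothetical, and the check only reports which pairs share an array directly; it does not reproduce the paper's model-based evaluation of factor composition.

```python
from itertools import combinations


def required_pairs(samples):
    """All unordered sample pairs; an N-sample experiment has N(N - 1)/2."""
    return set(combinations(sorted(samples), 2))


def directly_hybridized(design):
    """Pairs that share an array, for a design given as (sample, sample) tuples."""
    return {tuple(sorted(pair)) for pair in design}


# Hypothetical four-sample loop design: each tuple is one two-color array.
samples = ["A", "B", "C", "D"]
loop_design = [("A", "B"), ("B", "C"), ("C", "D"), ("D", "A")]

needed = required_pairs(samples)
missing = needed - directly_hybridized(loop_design)
print(len(needed), sorted(missing))
```

With four samples there are six pairs to recover, and this loop design leaves two of them without a shared array.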

6.

Background  

Life processes are determined by the organism's genetic profile and multiple environmental variables. However, the interaction between these factors is inherently non-linear [1]. Microarray data is one representation of the nonlinear interactions among genes and between genes and environmental factors. Still, most microarray studies use linear methods to interpret nonlinear data. In this study, we apply Isomap, a nonlinear method of dimensionality reduction, to analyze three independent large Affymetrix high-density oligonucleotide microarray data sets.

7.
Cluster analysis of gene-wide expression data from DNA microarray hybridization studies has proved to be a useful tool for identifying biologically relevant groupings of genes and constructing gene regulatory networks. The motivation for considering mutual information is its capacity to measure a general dependence among gene random variables. We propose a novel clustering strategy based on minimizing mutual information among gene clusters. Simulated annealing is employed to solve the optimization problem. Bootstrap techniques are employed to obtain more accurate estimates of mutual information when the data sample size is small. Moreover, we propose combining the mutual information criterion with traditional distance criteria, such as the Euclidean distance and the fuzzy membership metric, in designing the clustering algorithm. The performance of the new clustering methods is compared with that of some existing methods, using both synthesized and experimental data. The clustering algorithm based on a combined metric of mutual information and fuzzy membership achieves the best performance. The supplemental material is available at www.gspsnap.tamu.edu/gspweb/zxb/glioma_zxb.
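As a rough illustration of the dependence measure this abstract relies on, the sketch below computes a plug-in estimate of mutual information between two discretized expression profiles. The function name and the assumption that profiles are already discretized are illustrative; the paper's bootstrap estimator and simulated-annealing search are not reproduced here.

```python
from collections import Counter
from math import log


def mutual_information(x, y):
    """Plug-in estimate of I(X; Y) in nats for two equal-length discrete sequences."""
    if len(x) != len(y) or not x:
        raise ValueError("x and y must be non-empty and of equal length")
    n = len(x)
    px, py, pxy = Counter(x), Counter(y), Counter(zip(x, y))
    # Sum p(a, b) * log(p(a, b) / (p(a) * p(b))) over observed value pairs.
    return sum(
        (c / n) * log((c / n) / ((px[a] / n) * (py[b] / n)))
        for (a, b), c in pxy.items()
    )
```

Two identical binary profiles give log 2 nats, while profiles whose values are independent in the sample give zero.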

8.
9.
Mfuzz: a software package for soft clustering of microarray data   Cited by: 1 (self-citations: 0, citations by others: 1)
For the analysis of microarray data, clustering techniques are frequently used. Most such methods are based on hard clustering of data, wherein one gene (or sample) is assigned to exactly one cluster. Hard clustering, however, suffers from several drawbacks such as sensitivity to noise and information loss. In contrast, soft clustering methods can assign a gene to several clusters. They can overcome shortcomings of conventional hard clustering techniques and offer further advantages. Thus, we constructed an R package termed Mfuzz implementing soft clustering tools for microarray data analysis. The additional package Mfuzzgui provides a convenient Tcl/Tk-based graphical user interface. AVAILABILITY: The R packages Mfuzz and Mfuzzgui are available at http://itb1.biologie.hu-berlin.de/~futschik/software/R/Mfuzz/index.html. Their distribution is subject to the GPL version 2 license.
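To show what assigning a gene to several clusters can look like, here is a minimal sketch of the standard fuzzy c-means membership formula. It is written in Python for illustration and is not the Mfuzz R API; the function name, the fuzzifier default m = 2 and the example coordinates are assumptions.

```python
from math import dist


def fuzzy_memberships(point, centers, m=2.0):
    """Fuzzy c-means membership of one point in each cluster (values sum to 1)."""
    d = [dist(point, c) for c in centers]
    if any(v == 0.0 for v in d):
        # The point coincides with a centre: share membership among those centres.
        hits = [1.0 if v == 0.0 else 0.0 for v in d]
        return [h / sum(hits) for h in hits]
    power = 2.0 / (m - 1.0)
    # u_i = 1 / sum_k (d_i / d_k) ** (2 / (m - 1))
    return [1.0 / sum((di / dk) ** power for dk in d) for di in d]


print(fuzzy_memberships((1.0, 0.0), [(0.0, 0.0), (4.0, 0.0)]))
```

A hard clustering would report only the nearer centre; the memberships keep the information that the point also lies partly toward the second one.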

10.
Meta-analysis combines Affymetrix microarray results across laboratories   Cited by: 3 (self-citations: 0, citations by others: 3)
With microarray technology becoming more prevalent in recent years, it is now common for several laboratories to employ the same microarray technology to identify differentially expressed genes that are related to the same phenomenon in the same species. Although experimental specifics may be similar, each laboratory will typically produce a slightly different list of statistically significant genes, which calls into question the validity of each gene list (i.e., which list is best). A statistically based meta-analytic approach to microarray analysis systematically combines results from the different laboratories to provide a single estimate of the degree of differential expression for each gene. This approach provides a more precise view of genes that are of significant interest, while simultaneously allowing for differences between laboratories. The widely used Affymetrix oligonucleotide array and its software are of particular interest because the results are naturally suited to a meta-analysis. A simulation model based on the Affymetrix platform is developed to examine the adaptive nature of the meta-analytic approach and to illustrate the utility of such an approach in combining microarray results across laboratories.
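One common way to obtain a single estimate per gene from several laboratories is inverse-variance weighting. The sketch below assumes a fixed-effect model and hypothetical per-laboratory estimates; the abstract does not state which meta-analytic model the authors used, and a model that allows for between-laboratory differences would add a further variance term.

```python
def combine_fixed_effect(estimates, variances):
    """Inverse-variance weighted mean of per-laboratory estimates for one gene.

    Returns (pooled estimate, variance of the pooled estimate).
    """
    if len(estimates) != len(variances) or not estimates:
        raise ValueError("need one variance per estimate")
    weights = [1.0 / v for v in variances]
    total = sum(weights)
    pooled = sum(w * e for w, e in zip(weights, estimates)) / total
    return pooled, 1.0 / total


# Hypothetical log-ratio estimates for one gene from two laboratories.
print(combine_fixed_effect([1.0, 3.0], [1.0, 1.0]))
```

The pooled variance is smaller than either laboratory's own, which is the sense in which combining results gives a more precise view.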

11.
MOTIVATION: Principal Component Analysis (PCA) is one of the most popular dimensionality reduction techniques for the analysis of high-dimensional datasets. However, in its standard form, it does not take into account any error measures associated with the data points beyond a standard spherical noise. This indiscriminate treatment is one of its main weaknesses when applied to biological data with inherently large variability, such as expression levels measured with microarrays. Methods now exist for extracting credibility intervals from the probe-level analysis of cDNA and oligonucleotide microarray experiments. These credibility intervals are gene and experiment specific, and can be propagated through an appropriate probabilistic downstream analysis. RESULTS: We propose a new model-based approach to PCA that takes into account the variances associated with each gene in each experiment. We develop an efficient EM algorithm to estimate the parameters of our new model. The model provides significantly better results than standard PCA, while remaining computationally reasonable. We show how the model can be used to 'denoise' a microarray dataset, leading to improved expression profiles and tighter clustering across profiles. The probabilistic nature of the model means that the correct number of principal components is automatically obtained.

12.
A rainbow trout high-density oligonucleotide microarray was constructed using all tentative consensus (TC) sequences that are publicly available from all international rainbow trout Oncorhynchus mykiss genomic research projects through the Rainbow Trout Gene Index database. The new array contains 60-mer oligonucleotide probes representing 37 394 unique TC sequences and 1417 control spots. The array (4 × 44 format) was manufactured according to the design by Agilent Technologies using the inkjet-based SurePrint technology (design number 016320). The performance of the new microarray platform was evaluated by analysing gene expression associated with vitellogenesis-induced muscle atrophy in rainbow trout. This microarray will open new avenues of research that will aid in the development of novel strategies for genetic improvement of economically important traits, benefiting the salmonid aquaculture industries.

13.

Motivation

In cluster analysis, the validity of specific solutions, algorithms, and procedures presents significant challenges because there is no null hypothesis to test and no 'right answer'. It has been noted that a replicable classification is not necessarily a useful one, but a useful one that characterizes some aspect of the population must be replicable. By replicable we mean reproducible across multiple samplings from the same population. Methodologists have suggested that the validity of clustering methods should be based on classifications that yield reproducible findings beyond chance levels. We used this approach to determine the performance of commonly used clustering algorithms and the degree of replicability achieved using several microarray datasets.

Methods

We considered four commonly used iterative partitioning algorithms (Self Organizing Maps (SOM), K-means, Clustering LARge Applications (CLARA), and Fuzzy C-means) and evaluated their performance on 37 microarray datasets, with sample sizes ranging from 12 to 172. We assessed the reproducibility of each clustering algorithm by measuring the strength of the relationship between clustering outputs of subsamples of the 37 datasets. Cluster stability was quantified using Cramér's V² from a k × k table. Cramér's V² is equivalent to the squared canonical correlation coefficient between two sets of nominal variables. Potential scores range from 0 to 1, with 1 denoting perfect reproducibility.
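The stability score described above can be sketched as follows, assuming the two clusterings being compared have already been cross-tabulated into a contingency table of counts. The function name and example tables are illustrative and are not taken from the study.

```python
def cramers_v_squared(table):
    """Cramér's V² for a contingency table given as a list of rows of counts."""
    n = sum(sum(row) for row in table)
    row_tot = [sum(row) for row in table]
    col_tot = [sum(col) for col in zip(*table)]
    chi2 = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_tot[i] * col_tot[j] / n
            if expected > 0:
                chi2 += (observed - expected) ** 2 / expected
    k = min(len(row_tot), len(col_tot))
    # V² = chi-square / (n * (min(rows, cols) - 1)); requires at least a 2 x 2 table.
    return chi2 / (n * (k - 1))


print(cramers_v_squared([[5, 0], [0, 5]]))  # two clusterings that agree exactly
print(cramers_v_squared([[2, 2], [2, 2]]))  # two clusterings with no association
```

Because the statistic depends only on the table's pattern of association, swapping the cluster labels of one clustering leaves the score unchanged.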

Results

All four clustering routines show increased stability with larger sample sizes. K-means and SOM showed a gradual increase in stability with increasing sample size. CLARA and Fuzzy C-means, however, yielded low stability scores until sample sizes approached 30 and then gradually increased thereafter. Average stability never exceeded 0.55 for the four clustering routines, even at a sample size of 50. These findings suggest several plausible scenarios: (1) microarray datasets lack natural clustering structure, thereby producing low stability scores on all four methods; (2) the algorithms studied do not produce reliable results; and/or (3) sample sizes typically used in microarray research may be too small to support derivation of reliable clustering results. Further research should be directed towards evaluating the stability of more clustering algorithms on more datasets, especially those with larger sample sizes and with larger numbers of clusters considered.

14.
DNA-based microarrays are increasingly central to biomedical research. Selecting oligonucleotide sequences that will behave consistently across experiments is essential to the design, production and performance of DNA microarrays. Here our aim was to improve on probe design parameters by empirically and systematically evaluating probe performance in a multivariate context. We used experimental data from 19 array CGH hybridizations to assess the probe performance of 385,474 probes tiled in the Duchenne muscular dystrophy (DMD) region of the X chromosome. Our results demonstrate that probe melting temperature, single nucleotide polymorphisms (SNPs), and homocytosine motifs all have a strong effect on probe behavior. These findings, when incorporated into future microarray probe selection algorithms, may improve microarray performance for a wide variety of applications.

15.
16.
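High-risk human papillomavirus (HPV) is the main causative agent of cervical cancer. Using bioinformatics software such as Arraydesigner 2.0 and BLAST, the whole-genome sequences of 10 HPV types were analyzed, and 60-mer HPV type-specific oligonucleotide probes with high specificity and similar melting temperature (Tm) and GC content were designed for the preparation of an HPV detection chip. The effectiveness of the probes for the four most common HPV types (HPV6, 11, 16 and 18) was preliminarily verified. The results show that the designed probes have good type specificity and can be applied to HPV detection and typing.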
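The two probe properties that this abstract asks to be similar can be approximated from sequence alone. The sketch below is an illustration only: the Tm formula is a common rough approximation based on length and GC count, not the calculation used by the design software named in the abstract, and the function names, tolerance and sequences are hypothetical.

```python
def gc_content(seq):
    """Fraction of G and C bases in a probe sequence."""
    seq = seq.upper()
    return (seq.count("G") + seq.count("C")) / len(seq)


def approx_tm(seq):
    """Rough Tm in degrees Celsius: 64.9 + 41 * (GC count - 16.4) / length."""
    seq = seq.upper()
    gc = seq.count("G") + seq.count("C")
    return 64.9 + 41.0 * (gc - 16.4) / len(seq)


def within_tm_window(probes, target_tm, tolerance=2.0):
    """Keep candidate probes whose approximate Tm lies near the target."""
    return [p for p in probes if abs(approx_tm(p) - target_tm) <= tolerance]


# Hypothetical 60-mer candidates, not real HPV sequences.
candidates = ["AT" * 15 + "GC" * 15, "AT" * 30]
print([round(approx_tm(p), 1) for p in candidates])
print(len(within_tm_window(candidates, target_tm=74.0)))
```

Filtering on a Tm window like this is one simple way to keep a probe set thermodynamically uniform before specificity is checked by sequence comparison.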

17.
18.

Background

Exploratory factor analysis is a commonly used statistical technique in metabolic syndrome research to uncover latent structure amongst metabolic variables. The application of factor analysis requires methodological decisions that reflect the hypothesis of the metabolic syndrome construct. These decisions often make the output more complex to interpret. We propose two alternative techniques developed from cluster analysis which can achieve a clinically relevant structure, whilst maintaining the intuitive advantages of clustering methodology.

Methods

Two advanced clustering techniques, the VARCLUS and matroid methods, are discussed and implemented on a metabolic syndrome data set to analyze the structure of ten metabolic risk factors. The subjects were selected from the Normative Aging Study based in Boston, Massachusetts. The sample included a total of 847 men aged between 21 and 81 years who provided complete data on selected risk factors during the period 1987 to 1991.

Results

Four core components were identified by the clustering methods. These are labelled obesity, lipids, insulin resistance and blood pressure. The exploratory factor analysis with oblique rotation suggested an overlap of the loadings identified on the insulin resistance and obesity factors. The VARCLUS and matroid analyses separated these components and were able to demonstrate associations between individual risk factors.

Conclusions

An oblique rotation can be selected to reflect the clinical concept of a single underlying syndrome; however, the results are often difficult to interpret. Factor loadings must be considered along with correlations between the factors. The correlated components produced by the VARCLUS and matroid analyses do not overlap, which allows for a simpler application of the methodologies and interpretation of the results. These techniques encourage consistency in the interpretation whilst remaining faithful to the construct under study.

19.
Microarrays are used to study gene expression in a variety of biological systems. A number of different platforms have been developed, but few studies have directly compared the performance of one platform with another. The goal of this study was to determine array variation by analyzing the same RNA samples with three different array platforms. Using gene expression responses to benzo[a]pyrene exposure in normal human mammary epithelial cells (NHMECs), we compared the results of gene expression profiling using three microarray platforms: photolithographic oligonucleotide arrays (Affymetrix), spotted oligonucleotide arrays (Amersham), and spotted cDNA arrays (NCI). While most previous reports comparing microarrays have analyzed pre-existing data from different platforms, this comparison study used the same sample assayed on all three platforms, allowing for analysis of variation from each array platform. In general, corresponding measurements from the different platforms were poorly correlated. Each platform yielded different gene expression profiles, suggesting that while microarray analysis is a useful discovery tool, further validation is needed to extrapolate results for broad use of the data. Also, microarray variability needs to be taken into consideration, not only in the data analysis but also in specific probe selection for each array type.

20.
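Tumor classification and feature gene selection based on gene expression profiles   Cited by: 20 (self-citations: 0, citations by others: 20)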
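Based on an analysis of the characteristics of gene expression profile data, a strategy is proposed for applying such data to the molecular classification of tumors and to the selection of feature genes for the corresponding subtypes. The strategy comprises three steps: first, an unsupervised gene-filtering algorithm is used to reduce noise in the data used for the classification; second, a probabilistic model is proposed to model the class structure in the samples; finally, based on the clustering results, a relative-entropy method is used to identify the genes that contribute most to the classification as feature genes. The strategy was applied to re-mine two published data sets. The results show that it not only recovers the information obtainable by other methods but also provides finer and more biologically meaningful information, demonstrating a clear advantage.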
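The relative-entropy step in this strategy can be illustrated with a small sketch. It assumes that each gene's expression has already been discretized into the same bins for two sample classes, and it ranks genes by a symmetrized divergence; the function names, the symmetrization and the example numbers are assumptions, since the abstract does not give the exact scoring rule.

```python
from math import log


def relative_entropy(p, q):
    """Kullback-Leibler divergence D(p || q) in nats for discrete distributions.

    Assumes q is non-zero wherever p is non-zero.
    """
    return sum(pi * log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def rank_genes(profiles):
    """Order genes by symmetrized divergence between their two class distributions."""
    def score(gene):
        p, q = profiles[gene]
        return relative_entropy(p, q) + relative_entropy(q, p)
    return sorted(profiles, key=score, reverse=True)


# Hypothetical binned expression distributions (low, high) in two tumor subtypes.
profiles = {
    "g1": ([0.5, 0.5], [0.5, 0.5]),   # same distribution in both subtypes
    "g2": ([0.9, 0.1], [0.2, 0.8]),   # clearly different between subtypes
}
print(rank_genes(profiles))
```

Genes whose distributions differ most between the classes come first, which matches the idea of selecting the genes that contribute most to the classification.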

