Similar literature
Found 20 similar documents (search time: 15 ms)
1.
Clustering is an important tool in microarray data analysis. This unsupervised learning technique is commonly used to reveal structures hidden in large gene expression data sets. The vast majority of clustering algorithms applied so far produce hard partitions of the data, i.e. each gene is assigned to exactly one cluster. Hard clustering is favourable if clusters are well separated. However, this is generally not the case for microarray time-course data, where gene clusters frequently overlap. Additionally, hard clustering algorithms are often highly sensitive to noise. To overcome the limitations of hard clustering, we applied soft clustering, which offers several advantages for researchers. First, it generates accessible internal cluster structures, i.e. it indicates how well corresponding clusters represent genes. This can be used for a more targeted search for regulatory elements. Second, the overall relation between clusters, and thus a global clustering structure, can be defined. Additionally, soft clustering is more robust to noise, and a priori pre-filtering of genes can be avoided. This prevents the exclusion of biologically relevant genes from the data analysis. Soft clustering was implemented here using the fuzzy c-means algorithm. Procedures to find optimal clustering parameters were developed. A software package for soft clustering has been developed based on the open-source statistical language R. The package, called Mfuzz, is freely available.
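
The soft clustering idea in this abstract can be illustrated with a minimal, pure-Python sketch of the standard fuzzy c-means iteration. This is not the Mfuzz implementation (Mfuzz is an R package); the function name `fuzzy_c_means`, its parameters, and the random initialisation are assumptions made for illustration only.

```python
import math
import random


def fuzzy_c_means(points, c, m=2.0, iters=100, seed=0):
    """Return (centers, memberships) for a plain fuzzy c-means run.

    points: list of equal-length lists of floats (one row per gene).
    c: number of clusters; m: fuzzifier (must be > 1).
    All names here are illustrative, not taken from Mfuzz.
    """
    rng = random.Random(seed)
    dim = len(points[0])
    # Start from random memberships that sum to 1 for every point.
    u = []
    for _ in points:
        row = [rng.random() + 1e-9 for _ in range(c)]
        total = sum(row)
        u.append([v / total for v in row])

    centers = []
    for _ in range(iters):
        # Centers are membership-weighted means (weights raised to m).
        centers = []
        for k in range(c):
            weights = [u[i][k] ** m for i in range(len(points))]
            wsum = sum(weights)
            centers.append([
                sum(w * p[d] for w, p in zip(weights, points)) / wsum
                for d in range(dim)
            ])
        # Memberships follow from relative distances to every center.
        for i, p in enumerate(points):
            dists = [max(math.dist(p, ctr), 1e-12) for ctr in centers]
            for k in range(c):
                u[i][k] = 1.0 / sum(
                    (dists[k] / dj) ** (2.0 / (m - 1.0)) for dj in dists
                )
    return centers, u
```

Each row of the returned membership matrix sums to 1, so a gene lying between two clusters receives intermediate values for both instead of being forced into one of them, which is the property the abstract contrasts with hard partitions.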

2.
3.
4.
5.
The endosymbiont theory proposes that chloroplasts have originated from ancestral cyanobacteria through a process of engulfment and subsequent symbiotic adaptation. The molecular data for testing this theory have mainly been the nucleotide sequence of rRNAs and of photosystem component genes. In order to provide additional data in this area, we have isolated genomic clones of Synechocystis DNA containing the ribosomal protein gene cluster rplJL. The nucleotide sequence of this cluster and flanking regions was determined and the derived amino acid sequences were compared to the available homologous sequences from other eubacteria and chloroplasts. In Escherichia coli these two genes are part of a larger cluster, i.e., rplKAJL-rpoBC. In Synechocystis, the genes for the RNA polymerase subunit (rpoBC) are shown to be widely separated from the r-protein genes. The Synechocystis gene arrangement is similar to that in the chloroplast system, where the rpoBC1C2 and rplKAJL clusters are separated and located in two cell compartments, the chloroplast and the nucleus, respectively.

6.
MOTIVATION: Current Self-Organizing Map (SOM) approaches to gene expression pattern clustering require the user to predefine the number of clusters likely to be expected. Hierarchical clustering methods used in this area do not provide unique partitioning of data. We describe an unsupervised dynamic hierarchical self-organizing approach, which suggests an appropriate number of clusters, to perform class discovery and marker gene identification in microarray data. In the process of class discovery, the proposed algorithm identifies corresponding sets of predictor genes that best distinguish one class from other classes. The approach integrates merits of hierarchical clustering with robustness against noise known from self-organizing approaches. RESULTS: The proposed algorithm applied to DNA microarray data sets of two types of cancers has demonstrated its ability to produce the most suitable number of clusters. Further, the corresponding marker genes identified through the unsupervised algorithm also have a strong biological relationship to the specific cancer class. The algorithm tested on leukemia microarray data, which contains three leukemia types, was able to determine three major clusters and one minor cluster. Prediction models built for the four clusters indicate that the prediction strength for the smaller cluster is generally low, and it is therefore labelled an uncertain cluster. Further analysis shows that the uncertain cluster can be subdivided further, and the subdivisions are related to two of the original clusters. Another test performed using colon cancer microarray data has automatically derived two clusters, which is consistent with the number of classes in the data (cancerous and normal). AVAILABILITY: JAVA software of the dynamic SOM tree algorithm is available upon request for academic use. SUPPLEMENTARY INFORMATION: A comparison of rectangular and hexagonal topologies for GSOM is available from http://www.mame.mu.oz.au/mechatronics/journalinfo/Hsu2003supp.pdf

7.
8.
We develop a quantitative method for analyzing repetitions of identical short oligomers in coding and noncoding DNA sequences. We analyze sequences presently available in GenBank separately for primate, mammal, vertebrate, rodent, invertebrate and plant taxonomic partitions. We find that some oligomers "cluster" more than they would if randomly distributed, while other oligomers "repel" each other. To quantify this degree of clustering, we define clustering measures. We find that (i) clustering significantly differs in coding and noncoding DNA; (ii) in most cases, monomers, dimers and tetramers cluster in noncoding DNA but appear to repel each other in coding DNA; (iii) the degree of clustering for different sources (primates, invertebrates, and plants) is more conserved among these sources in the case of coding DNA than in the case of noncoding DNA; (iv) in contrast to other oligomers, trimers always prefer to cluster; and (v) clustering of each particular oligomer is conserved within the same organism.

9.
Arabidopsis thaliana has a relatively small genome of approximately 130 Mb containing about 10% repetitive DNA. Genome sequencing studies reveal a gene-rich genome, predicted to contain approximately 25,000 genes spaced on average every 4.5 kb. Between 10 and 20% of the predicted genes occur as clusters of related genes, indicating that local sequence duplication and subsequent divergence generates a significant proportion of gene families. In addition to gene families, repetitive sequences comprise individual and small clusters of two to three retroelements and other classes of smaller repeats. The clustering of highly repetitive elements is a striking feature of the A. thaliana genome emerging from sequence and other analyses.

10.
Validating clustering for gene expression data   Total citations: 24 (self-citations: 0, citations by others: 24)
MOTIVATION: Many clustering algorithms have been proposed for the analysis of gene expression data, but little guidance is available to help choose among them. We provide a systematic framework for assessing the results of clustering algorithms. Clustering algorithms attempt to partition the genes into groups exhibiting similar patterns of variation in expression level. Our methodology is to apply a clustering algorithm to the data from all but one experimental condition. The remaining condition is used to assess the predictive power of the resulting clusters: meaningful clusters should exhibit less variation in the remaining condition than clusters formed by chance. RESULTS: We successfully applied our methodology to compare six clustering algorithms on four gene expression data sets. We found our quantitative measures of cluster quality to be positively correlated with external standards of cluster quality.
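
The leave-one-condition-out idea described in this abstract might be sketched as follows. The abstract does not give the exact quality measure, so the within-cluster root-mean-square deviation used here is an assumption, as are the function names and the `cluster_fn` callback (any function that maps a gene-by-condition matrix to one label per gene).

```python
import math


def within_cluster_rmsd(values, labels):
    """Root-mean-square deviation of values from their cluster means."""
    groups = {}
    for v, lab in zip(values, labels):
        groups.setdefault(lab, []).append(v)
    sq = 0.0
    for members in groups.values():
        mean = sum(members) / len(members)
        sq += sum((v - mean) ** 2 for v in members)
    return math.sqrt(sq / len(values))


def leave_one_condition_out(data, cluster_fn):
    """Score a clustering function by holding out each condition in turn.

    data: list of rows (genes), each a list of expression values,
        one per condition.
    cluster_fn: hypothetical callback taking the reduced matrix and
        returning one cluster label per gene.
    Returns the sum over conditions of the held-out within-cluster RMSD
    (an assumed measure); lower means the clusters predict the unseen
    condition better.
    """
    n_conditions = len(data[0])
    total = 0.0
    for e in range(n_conditions):
        # Cluster on every condition except e ...
        reduced = [row[:e] + row[e + 1:] for row in data]
        labels = cluster_fn(reduced)
        # ... then measure how tightly the clusters group condition e.
        held_out = [row[e] for row in data]
        total += within_cluster_rmsd(held_out, labels)
    return total
```

Comparing the score of a candidate algorithm with the score of a random labelling that has the same number of clusters gives a rough version of the "less variation than clusters formed by chance" check.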

11.
12.
Clustering analysis has a growing role in the study of co-expressed genes for gene discovery. Conventional binary and fuzzy clustering do not embrace the biological reality that some genes may be irrelevant for a problem and not be assigned to a cluster, while other genes may participate in several biological functions and should simultaneously belong to multiple clusters. Also, these algorithms cannot generate tight clusters that focus on their cores or wide clusters that overlap and contain all possibly relevant genes. In this paper, a new clustering paradigm is proposed. In this paradigm, all three eventualities are possible: a gene being exclusively assigned to a single cluster, being assigned to multiple clusters, and not being assigned to any cluster. These possibilities are realised through the primary novelty of the introduction of tunable binarization techniques. Results from multiple clustering experiments are aggregated to generate one fuzzy consensus partition matrix (CoPaM), which is then binarized to obtain the final binary partitions. This is referred to as Binarization of Consensus Partition Matrices (Bi-CoPaM). The method has been tested with a set of synthetic datasets and a set of five real yeast cell-cycle datasets. The results demonstrate its validity in generating relevant tight, wide, and complementary clusters that can meet requirements of different gene discovery studies.
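
The aggregate-then-binarize step described in this abstract can be sketched as below. The abstract does not specify the binarization techniques, so the simple threshold rule and the parameter `delta` are assumptions, as are the function names; the sketch also assumes the clusters of the individual runs have already been matched to one another, a step not shown here.

```python
def consensus_matrix(partitions):
    """Average several membership matrices (clusters x genes).

    partitions: list of matrices, one per clustering run, each a list of
    rows (one per cluster) holding one membership value per gene.
    Assumes cluster k means the same thing in every run.
    """
    n_runs = len(partitions)
    n_clusters = len(partitions[0])
    n_genes = len(partitions[0][0])
    return [
        [sum(p[k][g] for p in partitions) / n_runs for g in range(n_genes)]
        for k in range(n_clusters)
    ]


def binarize_by_threshold(copam, delta):
    """Assign gene g to cluster k whenever its consensus value >= delta.

    A high delta yields tight clusters and leaves ambiguous genes
    unassigned; a low delta yields wide, overlapping clusters.
    """
    return [[1 if value >= delta else 0 for value in row] for row in copam]
```

With two runs that disagree about one gene, that gene gets a consensus value of 0.5 in both clusters: `delta = 0.5` places it in both, while `delta = 0.6` leaves it out of every cluster, which mirrors the three eventualities the abstract lists.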

13.
The genomic organization of the histone genes of the newt Notophthalmus viridescens is described. Genes for the five proteins are clustered on a 9.0 kb segment of cloned DNA which is part of a homogeneous family of sequences containing 600–800 members per haploid genome. The 9.0 kb histone gene clusters are not adjacent in the genome, but are separated from neighboring clusters by up to 50 kb or more of cluster spacer sequences; some or all of these spacer sequences are members of a predominantly centromeric satellite DNA with a 225 bp repeating unit.

14.
15.
Alteration of gene expression in response to regulatory molecules or mutations could lead to different diseases. MicroRNAs (miRNAs) have been discovered to be involved in regulation of gene expression and a wide variety of diseases. In a tripartite biological network of human miRNAs, their predicted target genes and the diseases caused by altered expressions of these genes, valuable knowledge about the pathogenicity of miRNAs, involved genes and related disease classes can be revealed by co-clustering miRNAs, target genes and diseases simultaneously. By considering multi-type members, tripartite co-clustering can lead to more informative results than traditional co-clustering with only two kinds of members, and it passes the hidden relational information along the relation chain. Here we report a spectral co-clustering algorithm for k-partite graphs to find clusters with heterogeneous members. We use the method to explore the potential relationships among miRNAs, genes and diseases. The clusters obtained from the algorithm have significantly higher density than randomly selected clusters, which means members in the same cluster are more likely to have common connections. Results also show that miRNAs in the same family based on the hairpin sequences tend to belong to the same cluster. We also validate the clustering results by checking the correlation of enriched gene functions and disease classes in the same cluster. Finally, widely studied miR-17-92 and its paralogs are analyzed as a case study to reveal that genes and diseases co-clustered with the miRNAs are in accordance with current research findings.

16.
Summary: Invertebrate actins resemble vertebrate cytoplasmic actins, and the distinction between muscle and cytoplasmic actins in invertebrates is not as well established as for vertebrate actins. However, Bombyx and Drosophila have actin genes specifically expressed in muscles. To investigate whether the distinction between muscle and cytoplasmic actins evidenced by gene expression analysis is related to the sequence of corresponding genes, we compare the sequences of actin genes of these two insect species and of other Metazoa. We find that insect muscle actins form a family of related proteins characterized by about 10 muscle-specific amino acids. Insect muscle actins have clearly diverged from cytoplasmic actins and form a monophyletic group emerging from a cluster of closely related proteins including insect and vertebrate cytoplasmic actins and actins of mollusc, cestode, and nematode. We propose that muscle-specific actin genes have appeared independently at least twice during the evolution of animals: insect muscle actin genes have emerged from an ancestral cytoplasmic actin gene within the arthropod phylum, whereas vertebrate muscle actin genes evolved within the chordate lineage as previously described. Offprint requests to: N. Mounier

17.
The genome of avian erythroblastosis virus contains two independently expressed genetic loci (v-erbA and v-erbB) whose activities are probably responsible for oncogenesis by the virus. Both loci are closely related to nucleotide sequences found in the DNA and RNA of chickens and other vertebrates. We have isolated and characterized chicken DNA homologous to v-erbA and v-erbB. The two viral genes are represented by separate domains within chicken DNA (c-erbA and c-erbB), which are separated by a minimum of 12 kilobases (kb) of DNA and may not be linked at all. The nucleotide sequences shared by the viral and cellular erb loci are colinear, but the cellular loci are interrupted by multiple intervening sequences of various lengths. Polyribosomes prepared from normal chicken embryos contain two polyadenylated RNAs transcribed from c-erbA and two transcribed from c-erbB. The evident coding regions of these RNAs represent an unusually small fraction of the lengths of the RNAs, as if the 3′ untranslated domains of the RNAs might be exceptionally large (3–11 kb). These findings indicate that the c-erb loci are normal vertebrate genes rather than genes of cryptic endogenous retroviruses, and that they may have a role in the metabolism of normal cells. It appears that the viral erb genes, like most other retrovirus oncogenes, have been copied from cellular genes. In the viral genome, the two genes are devoid of introns, but they remain independently expressed loci, and they remain colinear with the coding domains of their cellular progenitors.

18.
Fuzzy C-means method for clustering microarray data   Total citations: 9 (self-citations: 0, citations by others: 9)
MOTIVATION: Clustering analysis of data from DNA microarray hybridization studies is essential for identifying biologically relevant groups of genes. Partitional clustering methods such as K-means or self-organizing maps assign each gene to a single cluster. However, these methods do not provide information about the influence of a given gene on the overall shape of clusters. Here we apply a fuzzy partitioning method, Fuzzy C-means (FCM), to attribute cluster membership values to genes. RESULTS: A major problem in applying the FCM method for clustering microarray data is the choice of the fuzziness parameter m. We show that the commonly used value m = 2 is not appropriate for some data sets, and that optimal values for m vary widely from one data set to another. We propose an empirical method, based on the distribution of distances between genes in a given data set, to determine an adequate value for m. By setting threshold levels for the membership values, genes which are tightly associated with a given cluster can be selected. Using a yeast cell cycle data set as an example, we show that this selection increases the overall biological significance of the genes within the cluster. AVAILABILITY: Supplementary text and Matlab functions are available at http://www-igbmc.u-strasbg.fr/fcm/
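
To make the role of the fuzziness parameter m and of the membership threshold concrete, here is a small sketch using the standard FCM membership formula for a single gene. It does not reproduce the authors' empirical method for choosing m or their Matlab functions; the function names `memberships` and `core_genes` are hypothetical.

```python
def memberships(distances, m):
    """Standard FCM membership of one gene, given its distance to each centroid.

    distances: positive distances from the gene to every cluster centroid.
    m: fuzziness parameter (> 1); values near 1 approach a hard assignment,
       large values push all memberships toward 1 / number_of_clusters.
    """
    exponent = 2.0 / (m - 1.0)
    return [
        1.0 / sum((d_k / d_j) ** exponent for d_j in distances)
        for d_k in distances
    ]


def core_genes(membership_rows, cluster, threshold):
    """Indices of genes whose membership in `cluster` is at least `threshold`."""
    return [
        i for i, row in enumerate(membership_rows)
        if row[cluster] >= threshold
    ]
```

For a gene at distances 1 and 2 from two centroids, m = 2 gives memberships of 0.8 and 0.2; lowering m sharpens the assignment and raising it blurs it, which is why a single default value of m may not suit every data set.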

19.
Using less stringent hybridization conditions and cloned viral DNA probes representing the avian sarcoma virus gag, pol, env, and long terminal repeat (LTR) gene sequences, we detected related sequences in two avian species purportedly lacking all endogenous avian leukosis viruses, the ev- chicken and the Japanese quail. The blot hybridization patterns obtained with the various probes suggest the presence of between 40 and 100 copies of retrovirus-related sequences in the genomes of these two species. An ev- chicken genomic DNA library was prepared and screened with gag-specific and pol-specific DNA probes. Several different clones were obtained from this library and characterized. Analysis of these clones revealed that the retrovirus-related gene sequences are linked in the order LTR-gag-pol-env-LTR, a structure indicative of a complete provirus. These data indicate the presence of previously unidentified endogenous retrovirus species in avian cells, suggesting that under the appropriate conditions of hybridization additional, more distantly evolved families of endogenous retrovirus genes may be identified in vertebrate species.

20.