Similar literature
 20 similar articles found (search time: 109 ms)
1.
Effective similarity measures for expression profiles   Total citations: 3 (self-citations: 0, citations by others: 3)
It is commonly accepted that genes with similar expression profiles are functionally related. However, there are many ways to measure the similarity of expression profiles, and it is not clear a priori which is the most effective. Moreover, so far no clear distinction has been made regarding the type of functional link between genes suggested by microarray data. Similarly expressed genes can be part of the same complex as interacting partners; they can participate in the same pathway without interacting directly; they can perform similar functions; or they can simply have similar regulatory sequences. Here we conduct a study of the notion of functional link as implied by expression data. We analyze different similarity measures of gene expression profiles and assess their usefulness and robustness in detecting biological relationships by comparing the similarity scores with results obtained from databases of interacting proteins, promoter signals and cellular pathways, as well as through sequence comparisons. We also introduce variations on similarity measures that are based on statistical analysis and better discriminate between genes that are functionally close and those that are functionally distant. Our tools can be used to assess other similarity measures for expression profiles, and are accessible at biozon.org/tools/expression/

2.
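This study investigates how protein-protein interactions in yeast relate to gene expression profiles and protein subcellular localization. First, a positive set of protein interactions, a negative set, a negative set of randomly paired proteins, and a mixed set were constructed. Then, for all protein pairs in the four data sets, a cross-quantitative analysis of these high-throughput data was carried out by comparing the distributions of their distance-based gene co-expression and the co-localization rates of those protein pairs with known subcellular localization. The results show that, compared with non-interacting protein pairs, interacting protein pairs have more similar gene expression profiles and are more likely to share the same subcellular localization. The results also reveal overall trends in how these protein features are related.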

3.
The relationship between the similarity of expression patterns for a pair of genes and interaction of the proteins they encode is demonstrated both for the simple genome of the bacteriophage T7 and the considerably more complex genome of the yeast Saccharomyces cerevisiae. Statistical analysis of large-scale gene expression and protein interaction data shows that protein pairs encoded by co-expressed genes interact with each other more frequently than with random proteins. Furthermore, the mean similarity of expression profiles is significantly higher for respective interacting protein pairs than for random ones. Such coupled analysis of gene expression and protein interaction data may allow evaluation of the results of large-scale gene expression and protein interaction screens as demonstrated for several publicly available datasets. The role of this link between expression and interaction in the evolution from monomeric to oligomeric protein structures is also discussed.

4.
MicroRNAs are a class of small non-protein coding RNAs that play an important role in the regulation of gene expression. Most studies on the identification of microRNA-mRNA pairs utilize the correlation coefficient as a measure of association. The use of correlation coefficient is appropriate if the expression data are available for several conditions and, for a given condition, both microRNA and mRNA expression profiles are obtained from the same set of individuals. However, there are many instances where one of the requirements is not satisfied. Therefore, there is a need for new measures of association to identify the microRNA-mRNA pairs of interest and we present two such measures. The first measure requires expression data for multiple conditions but, for a given condition, the microRNA and mRNA expression may be obtained from different individuals. The new measure, unlike the correlation coefficient, is suitable for analyzing large data sets which are obtained by combining several independent studies on microRNAs and mRNAs. Our second measure is able to handle expression data that correspond to just two conditions but, for a given condition, the microRNA and mRNA expression must be obtained from the same set of individuals. This measure, unlike the correlation coefficient, is appropriate for analyzing data sets with a small number of conditions. We apply our new measures of association to multiple myeloma data sets, which cannot be analyzed using the correlation coefficient, and identify several microRNA-mRNA pairs involved in apoptosis and cell proliferation.

5.
Distance-based clustering of CGH data   Total citations: 1 (self-citations: 0, citations by others: 1)
MOTIVATION: We consider the problem of clustering a population of Comparative Genomic Hybridization (CGH) data samples. The goal is to develop a systematic way of placing patients with similar CGH imbalance profiles into the same cluster. Our expectation is that patients with the same cancer types will generally belong to the same cluster as their underlying CGH profiles will be similar. RESULTS: We focus on distance-based clustering strategies. We do this in two steps. (1) Distances of all pairs of CGH samples are computed. (2) CGH samples are clustered based on this distance. We develop three pairwise distance/similarity measures, namely raw, cosine and sim. Raw measure disregards correlation between contiguous genomic intervals. It compares the aberrations in each genomic interval separately. The remaining measures assume that consecutive genomic intervals may be correlated. Cosine maps pairs of CGH samples into vectors in a high-dimensional space and measures the angle between them. Sim measures the number of independent common aberrations. We test our distance/similarity measures on three well known clustering algorithms, bottom-up, top-down and k-means with and without centroid shrinking. Our results show that sim consistently performs better than the remaining measures. This indicates that the correlation of neighboring genomic intervals should be considered in the structural analysis of CGH datasets. The combination of sim with top-down clustering emerged as the best approach. AVAILABILITY: All software developed in this article and all the datasets are available from the authors upon request. CONTACT: juliu@cise.ufl.edu.
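
As a rough illustration of the angle-based idea in the abstract above, the following sketch computes a plain cosine similarity between two samples. The encoding (+1 for gain, -1 for loss, 0 for no change per genomic interval), the function name and the sample vectors are assumptions made for this example; the abstract does not describe how the authors map CGH samples to vectors, so this is not their cosine measure.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length numeric vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        # Convention chosen for this sketch: a sample with no aberrations
        # is treated as having zero similarity to everything.
        return 0.0
    return dot / (norm_a * norm_b)

# Hypothetical encoding: +1 = gain, -1 = loss, 0 = no change per interval.
sample_a = [1, 1, 0, -1, 0]
sample_b = [1, 0, 0, -1, 0]
sample_c = [-1, -1, 0, 1, 0]

print(cosine_similarity(sample_a, sample_b))  # similar aberration profiles
print(cosine_similarity(sample_a, sample_c))  # opposite aberration profiles
```

A distance for clustering could then be taken as one minus this similarity, which is again a choice made here rather than something stated in the abstract.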

6.
This research analyzes some aspects of the relationship between gene expression, gene function, and gene annotation. Many recent studies are implicitly based on the assumption that gene products that are biologically and functionally related would maintain this similarity both in their expression profiles as well as in their gene ontology (GO) annotation. We analyze how accurate this assumption proves to be using real publicly available data. We also aim to validate a measure of semantic similarity for GO annotation. We use the Pearson correlation coefficient and its absolute value as a measure of similarity between expression profiles of gene products. We explore a number of semantic similarity measures (Resnik, Jiang, and Lin) and compute the similarity between gene products annotated using the GO. Finally, we compute correlation coefficients to compare gene expression similarity against GO semantic similarity. Our results suggest that the Resnik similarity measure outperforms the others and seems better suited for use in gene ontology. We also deduce that there seems to be correlation between semantic similarity in the GO annotation and gene expression for the three GO ontologies. We show that this correlation is negligible up to a certain semantic similarity value; then, for higher similarity values, the relationship trend becomes almost linear. These results can be used to augment the knowledge provided by clustering algorithms and in the development of bioinformatic tools for finding and characterizing gene products.
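
The abstract above uses the Pearson correlation coefficient and its absolute value as expression-profile similarities. A minimal sketch of both follows; the function names and the three toy profiles are illustrative assumptions, not data or code from the study.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient between two equal-length profiles."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("profiles must have the same length (at least 2)")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant profile")
    return cov / math.sqrt(var_x * var_y)

def abs_pearson(x, y):
    """Absolute correlation: treats anti-correlated profiles as similar."""
    return abs(pearson(x, y))

# Toy expression profiles over four conditions (made-up values).
gene_a = [1.0, 2.0, 3.0, 4.0]
gene_b = [2.0, 4.0, 6.0, 8.0]   # rises with gene_a
gene_c = [8.0, 6.0, 4.0, 2.0]   # falls as gene_a rises

print(pearson(gene_a, gene_b), pearson(gene_a, gene_c), abs_pearson(gene_a, gene_c))
```

The absolute value matters when anti-correlated genes (for example, a repressor and its target) should still count as related, which the signed coefficient would score as maximally dissimilar.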

7.
Owing to duplication events in its progenitor, more than 90% of the genes in the Arabidopsis thaliana genome are members of multigene families. A set of 2108 gene families, each consisting of precisely two unlinked paralogous genes, was identified in the nuclear genome of A. thaliana on the basis of sequence similarity. A systematic method for the creation of double knock‐out lines for such gene pairs, designated as DUPLO lines, was established and 200 lines are now publicly available. Their initial phenotypic characterisation led to the identification of seven lines with defects that emerge only in the adult stage. A further six lines display seedling lethality and 23 lines were lethal before germination. Another 14 lines are known to show phenotypes under non‐standard conditions or at the molecular level. Knock‐out of gene pairs with very similar coding sequences or expression profiles is more likely to produce a mutant phenotype than inactivation of gene pairs with dissimilar profiles or sequences. High coding sequence similarity and highly similar expression profiles are only weakly correlated, implying that promoter and coding regions of these gene pairs display different degrees of diversification.

8.
9.
MOTIVATION: The inference of genes that are truly associated with inherited human diseases from a set of candidates resulting from genetic linkage studies has been one of the most challenging tasks in human genetics. Although several computational approaches have been proposed to prioritize candidate genes relying on protein-protein interaction (PPI) networks, these methods can usually cover less than half of known human genes. RESULTS: We propose to rely on the biological process domain of the gene ontology to construct a gene semantic similarity network and then use the network to infer disease genes. We show that the constructed network covers about 50% more genes than a typical PPI network. By analyzing the gene semantic similarity network with the PPI network, we show that gene pairs tend to have higher semantic similarity scores if the corresponding proteins are closer to each other in the PPI network. By analyzing the gene semantic similarity network with a phenotype similarity network, we show that semantic similarity scores of genes associated with similar diseases are significantly different from those of genes selected at random, and that genes with higher semantic similarity scores tend to be associated with diseases with higher phenotype similarity scores. We further use the gene semantic similarity network with a random walk with restart model to infer disease genes. Through a series of large-scale leave-one-out cross-validation experiments, we show that the gene semantic similarity network can achieve not only higher coverage but also higher accuracy than the PPI network in the inference of disease genes.
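
The random walk with restart model mentioned in the abstract above can be sketched on a small weighted network as follows. The four-node network, its edge weights, the restart probability of 0.7 and all names are assumptions for illustration; the abstract does not give the authors' parameters or implementation.

```python
def random_walk_with_restart(adj, seeds, restart=0.7, tol=1e-10, max_iter=1000):
    """Iterate p <- (1 - restart) * W p + restart * p0 until convergence.

    adj maps each node to a dict {neighbor: weight}; W spreads a node's
    probability to its neighbors in proportion to the edge weights.
    """
    nodes = sorted(adj)
    p0 = {n: (1.0 / len(seeds) if n in seeds else 0.0) for n in nodes}
    p = dict(p0)
    for _ in range(max_iter):
        nxt = {n: restart * p0[n] for n in nodes}
        for u in nodes:
            total = sum(adj[u].values())
            if total == 0:
                continue  # isolated node: nothing to spread
            for v, w in adj[u].items():
                nxt[v] += (1.0 - restart) * p[u] * w / total
        delta = sum(abs(nxt[n] - p[n]) for n in nodes)
        p = nxt
        if delta < tol:
            break
    return p

# Hypothetical gene network; weights stand in for semantic similarity scores.
network = {
    "A": {"B": 0.9},
    "B": {"A": 0.9, "C": 0.5},
    "C": {"B": 0.5, "D": 0.4},
    "D": {"C": 0.4},
}

scores = random_walk_with_restart(network, seeds={"A"})
for gene, score in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(gene, round(score, 4))
```

Candidate genes would then be ranked by their steady-state score relative to the seed (known disease) genes; nodes closer to the seeds in the network receive higher scores.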

10.
Studies of gene expression profiles in response to external perturbation generate repeated measures data that generally follow nonlinear curves. To explore the evolution of such profiles across a gene family, we introduce phylogenetic repeated measures (PR) models. These models draw strength from 2 forms of correlation in the data. Through gene duplication, the family's evolutionary relatedness induces the first form. The second is the correlation across time points within taxonic units, individual genes in this example. We borrow a Brownian diffusion process along a given phylogenetic tree to account for the relatedness and co-opt a repeated measures framework to model the latter. Through simulation studies, we demonstrate that repeated measures models outperform the previously available approaches that consider the longitudinal observations or their differences as independent and identically distributed by using deviance information criteria as Bayesian model selection tools; PR models that borrow phylogenetic information also perform better than nonphylogenetic repeated measures models when appropriate. We then analyze the evolution of gene expression in the yeast kinase family using splines to estimate nonlinear behavior across 3 perturbation experiments. Again, the PR models outperform previous approaches and afford the prediction of ancestral expression profiles. To demonstrate PR model applicability more generally, we conclude with a short examination of variation in brain development across 4 primate species.

11.
Zhang S, Chang Z, Li Z, DuanMu H, Li Z, Li K, Liu Y, Qiu F, Xu Y. Gene, 2012, 497(1): 58-65
Phenotypic similarity is correlated with a number of measures of gene function, such as relatedness at the level of direct protein-protein interaction. The phenotypic effect of a deleted or mutated gene, which is one part of gene annotation, has attracted broad attention. However, few measures exist for studying phenotypic similarity with data from the Human Phenotype Ontology (HPO) database; therefore, more such measures should be developed and investigated. We used five semantic similarity-based measures (Jiang and Conrath, Lin, Schlicker, Yu and Wu) to calculate the human phenotypic similarity between genes (PSG) with data from the HPO database, and evaluated their accuracy with information on protein-protein interaction, protein complex, protein family, gene function or DNA sequence. Compared with randomly selected gene pairs, the results of these methods were statistically significant (all P<0.001). Furthermore, we assessed the performance of these five measures by receiver operating characteristic (ROC) curve analysis, and found that most of them performed better than the previous methods. This work shows that these semantic similarity-based measures for calculating PSG are effective for hierarchically structured data. Our study contributes to the development and optimization of novel algorithms for PSG calculation and provides more alternative methods to researchers as well as tools and directions for PSG study.

12.
GESTs (gene expression similarity and taxonomy similarity), a gene functional prediction approach previously proposed by us, is based on gene expression similarity and concept similarity of functional classes defined in Gene Ontology (GO). In this paper, we extend this method to protein-protein interaction data by introducing several methods to filter the neighbors in protein interaction networks for a protein of unknown function(s). Unlike other conventional methods, the proposed approach automatically selects the most appropriate functional classes as specific as possible during the learning process, and calls on genes annotated to nearby classes to support the predictions to some small-sized specific classes in GO. Based on the yeast protein-protein interaction information from MIPS and a dataset of gene expression profiles, we assess the performances of our approach for predicting protein functions to “biology process” by three measures particularly designed for functional classes organized in GO. Results show that our method is powerful for widely predicting gene functions with very specific functional terms. Based on the GO database published in December 2004, we predict some proteins whose functions were unknown at that time, and some of the predictions have been confirmed by the new SGD annotation data published in April, 2006.

13.
One of the most important objects in bioinformatics is a gene product (protein or RNA). For many gene products, functional information is summarized in a set of Gene Ontology (GO) annotations. For these genes, it is reasonable to include similarity measures based on the terms found in the GO or other taxonomy. In this paper, we introduce several novel measures for computing the similarity of two gene products annotated with GO terms. The fuzzy measure similarity (FMS) has the advantage that it takes into consideration the context of both complete sets of annotation terms when computing the similarity between two gene products. When the two gene products are not annotated by common taxonomy terms, we propose a method that avoids a zero similarity result. To account for the variations in the annotation reliability, we propose a similarity measure based on the Choquet integral. These similarity measures provide extra tools for the biologist in search of functional information for gene products. Initial testing on a group of 194 sequences representing three protein families shows a higher correlation of the FMS and Choquet similarities with the BLAST sequence similarities than traditional similarity measures such as pairwise average or pairwise maximum.
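
The two traditional baselines named in the abstract above, pairwise average and pairwise maximum, can be sketched as below. Both score two gene products by applying a term-to-term similarity to every pair of annotation terms, one from each product. The term labels (T1 to T3), the similarity table and the function names are hypothetical; a real analysis would derive term similarities from the ontology. The FMS and Choquet measures themselves are not shown because the abstract does not define them.

```python
# Hypothetical term-to-term similarities for three made-up annotation terms.
TERM_SIM = {
    frozenset(("T1", "T2")): 0.4,
    frozenset(("T1", "T3")): 0.2,
    frozenset(("T2", "T3")): 0.6,
}

def term_similarity(s, t):
    """Look up a symmetric similarity; identical terms score 1.0."""
    if s == t:
        return 1.0
    return TERM_SIM.get(frozenset((s, t)), 0.0)

def pairwise_scores(terms_a, terms_b):
    """Similarity of every (term of A, term of B) combination."""
    return [term_similarity(s, t) for s in terms_a for t in terms_b]

def pairwise_average(terms_a, terms_b):
    scores = pairwise_scores(terms_a, terms_b)
    return sum(scores) / len(scores)

def pairwise_maximum(terms_a, terms_b):
    return max(pairwise_scores(terms_a, terms_b))

product_a = ["T1", "T2"]
product_b = ["T2", "T3"]
print(pairwise_average(product_a, product_b))
print(pairwise_maximum(product_a, product_b))
```

Note how a single shared term drives the maximum to 1.0 regardless of the other annotations, while the average is pulled down by unrelated term pairs; both ignore the context of the full annotation sets, which is the limitation the abstract says FMS addresses.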

14.
Two genes are said to be coexpressed if their expression levels have a similar spatial or temporal pattern. Ever since the profiling of gene microarrays has been in progress, computational modeling of coexpression has acquired a major focus. As a result, several similarity/distance measures have evolved over time to quantify coexpression similarity/dissimilarity between gene pairs. Of these, correlation coefficient has been established to be a suitable quantifier of pairwise coexpression. In general, correlation coefficient is good for symbolizing linear dependence, but not for nonlinear dependence. In spite of this drawback, it outperforms many other existing measures in modeling the dependency in biological data. In this paper, for the first time, we point out a significant weakness of the existing similarity/distance measures, including the standard correlation coefficient, in modeling pairwise coexpression of genes. A novel measure, called BioSim, which assumes values between -1 and +1 corresponding to negative and positive dependency and 0 for independency, is introduced. The computation of BioSim is based on the aggregation of stepwise relative angular deviation of the expression vectors considered. The proposed measure is analytically suitable for modeling coexpression as it accounts for the features of expression similarity, expression deviation and also the relative dependence. It is demonstrated how the proposed measure is better able to capture the degree of coexpression between a pair of genes as compared to several other existing ones. The efficacy of the measure is statistically analyzed by integrating it with several module-finding algorithms based on coexpression values and then applying it on synthetic and biological data. The annotation results of the coexpressed genes as obtained from gene ontology establish the significance of the introduced measure. 
By further extending the BioSim measure, it has been shown that one can effectively identify the variability in the expression patterns over multiple phenotypes. We have also extended BioSim to figure out pairwise differential expression pattern and coexpression dynamics. The significance of these studies is shown based on the analysis over several real-life data sets. The computation of the measure by focusing on stepwise time points also makes it effective to identify partially coexpressed genes. On the whole, we put forward a complete framework for coexpression analysis based on the BioSim measure.

15.
We applied a new approach based on Mantel statistics to analyze the Genetic Analysis Workshop 14 simulated data with prior knowledge of the answers. The method was developed in order to improve the power of a haplotype sharing analysis for gene mapping in complex disease. The new statistic correlates genetic similarity and phenotypic similarity across pairs of haplotypes from case-control studies. The genetic similarity is measured as the shared length between haplotype pairs around a genetic marker. The phenotypic similarity is measured as the mean corrected cross-product based on the respective phenotypes. Cases with phenotype P1 and unrelated controls were drawn from the population of Danacaa. Power to detect main effects was compared to the χ²-test for association based on 3-marker haplotypes and a global permutation test for haplotype association to test for main effects. Power to detect gene × gene interaction was compared to unconditional logistic regression. The results suggest that the Mantel statistics might be more powerful than alternative tests.
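
For orientation, a generic Mantel-type permutation test can be sketched as follows: the statistic sums the products of two pairwise similarity matrices over all pairs, and significance is estimated by relabeling the rows and columns of one matrix. This is the textbook form, not the haplotype-sharing statistic of the abstract above; the matrices, function names and permutation count are assumptions for this example.

```python
import random

def mantel_statistic(g, p, order=None):
    """Sum of g[i][j] * p[order[i]][order[j]] over all pairs i < j."""
    n = len(g)
    idx = order if order is not None else list(range(n))
    return sum(
        g[i][j] * p[idx[i]][idx[j]]
        for i in range(n)
        for j in range(i + 1, n)
    )

def mantel_test(g, p, n_perm=999, seed=0):
    """One-sided permutation p-value for a large cross-product statistic."""
    rng = random.Random(seed)
    observed = mantel_statistic(g, p)
    order = list(range(len(g)))
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(order)  # relabel the units of the second matrix
        if mantel_statistic(g, p, order) >= observed:
            hits += 1
    return observed, (hits + 1) / (n_perm + 1)

# Made-up symmetric similarity matrices for four units.
genetic = [
    [0, 3, 1, 0],
    [3, 0, 2, 1],
    [1, 2, 0, 4],
    [0, 1, 4, 0],
]
phenotypic = [row[:] for row in genetic]  # identical here, for illustration

observed, p_value = mantel_test(genetic, phenotypic)
print(observed, p_value)
```

Permuting unit labels rather than individual matrix entries keeps the dependence among pairs that share a unit, which is why the Mantel approach is used instead of an ordinary correlation test on the pairwise values.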

16.
The hypothesis of isolation by distance (IBD) predicts that genetic differentiation between populations increases with geographic distance. However, gene flow is governed by numerous factors and the correlation between genetic differentiation and geographic distance is never simply linear. In this study, we analyze the interaction between the effects of geographic distance and of wild or domesticated status of the host plant on genetic differentiation in the bean beetle Acanthoscelides obvelatus. Geographic distance explained most of the among-population genetic differentiation. However, IBD varied depending on the kind of population pairs for which the correlation between genetic differentiation and geographic distance was examined. Whereas pairs of beetle populations associated with wild beans showed significant IBD (P < 10^-4), no IBD was found when pairs of beetle populations on domesticated beans were examined (P = 0.2992). This latter result can be explained by long-distance migrations of beetles on domesticated plants resulting from human exchanges of bean seeds. Beetle populations associated with wild beans were also significantly more likely than those on domesticated plants to contain rare alleles. However, at the population level, beetles on cultivated beans were similar in allelic richness to those on wild beans. This similarity in allelic richness combined with differences in other aspects of the genetic diversity (i.e., IBD, allelic diversity) is compatible with strongly contrasting effects of migration and drift. This novel indirect effect of human actions on gene flow of a serious pest of a domesticated plant has important implications for the spread of new adaptations such as resistance to pesticides.

17.
Genetic similarity within pairs of individuals was examined using both 10 polymorphic microsatellite loci and multi-locus DNA fingerprinting profiles in a semi-isolated population of great reed warblers at Lake Kvismaren, south Central Sweden, in 1987-1993. The population was founded by a few individuals in 1978, followed by a gradual increase in numbers until 1988, since when the population has remained relatively stable with about 60 breeding birds. We have previously found that high genetic similarity between pair-mates in the population during the early part of the study period reduced egg hatching success, and hence reproductive success. The measures of pairwise genetic similarity, microsatellite allele sharing and DNA fingerprinting band sharing, were highly correlated with pedigree-based relatedness. Both microsatellite and DNA fingerprinting similarities between pair-mates declined significantly over the study period, and the pattern was most pronounced in the DNA fingerprinting data. Analyses restricted to the microsatellite data showed that the average annual microsatellite similarity between pairwise combinations of individuals, as well as individual homozygosity in males, declined significantly over the study period, and that several immigrants carrying novel alleles entered the population during the study. Hence, the temporal decline in genetic similarity of mates in the population is probably a consequence of increased immigration, facilitated by the recent expansion of the species in the region. These results suggest that the population has now recovered genetically, or is in the process of recovering, from a recent founder event.

18.
GESTs (gene expression similarity and taxonomy similarity), a gene functional prediction approach previously proposed by us, is based on gene expression similarity and concept similarity of functional classes defined in Gene Ontology (GO). In this paper, we extend this method to protein-protein interaction data by introducing several methods to filter the neighbors in protein interaction networks for a protein of unknown function(s). Unlike other conventional methods, the proposed approach automatically selects the most appropriate functional classes as specific as possible during the learning process, and calls on genes annotated to nearby classes to support the predictions to some small-sized specific classes in GO. Based on the yeast protein-protein interaction information from MIPS and a dataset of gene expression profiles, we assess the performances of our approach for predicting protein functions to “biology process” by three measures particularly designed for functional classes organized in GO. Results show that our method is powerful for widely predicting gene functions with very specific functional terms. Based on the GO database published in December 2004, we predict some proteins whose functions were unknown at that time, and some of the predictions have been confirmed by the new SGD annotation data published in April, 2006.

19.
20.
In landscape genetics, isolation-by-distance (IBD) is regarded as a baseline pattern that is obtained without additional effects of landscape elements on gene flow. However, the configuration of suitable habitat patches determines deme topology, which in turn should affect rates of gene flow. IBD patterns can be characterized either by monotonically increasing pairwise genetic differentiation (for example, FST) with increasing interdeme geographic distance (case-I pattern) or by monotonically increasing pairwise genetic differentiation up to a certain geographical distance beyond which no correlation is detectable anymore (case-IV pattern). We investigated if landscape configuration influenced the rate at which a case-IV pattern changed to a case-I pattern. We also determined at what interdeme distance the highest correlation was measured between genetic differentiation and geographic distance and whether this distance corresponded to the maximum migration distance. We set up a population genetic simulation study and assessed the development of IBD patterns for several habitat configurations and maximum migration distances. We show that the rate and likelihood of the transition of case-IV to case-I FST–distance relationships was strongly influenced by habitat configuration and maximum migration distance. We also found that the maximum correlation between genetic differentiation and geographic distance was not related to the maximum migration distance and was measured across all deme pairs in a case-I pattern and, for a case-IV pattern, at the distance where the FST–distance curve flattens out. We argue that in landscape genetics, separate analyses should be performed to either assess IBD or the landscape effects on gene flow.
