Similar articles
1.
Static expression experiments analyze samples from many individuals. These samples are often snapshots of the progression of a certain disease such as cancer. This raises an intriguing question: Can we determine a temporal order for these samples? Such an ordering can lead to better understanding of the dynamics of the disease and to the identification of genes associated with its progression. In this paper we formally prove, for the first time, that under a model for the dynamics of the expression levels of a single gene, it is indeed possible to recover the correct ordering of the static expression datasets by solving an instance of the traveling salesman problem (TSP). In addition, we devise an algorithm that combines a TSP heuristic and probabilistic modeling for inferring the underlying temporal order of the microarray experiments. This algorithm constructs probabilistic continuous curves to represent expression profiles leading to accurate temporal reconstruction for human data. Applying our method to cancer expression data we show that the ordering derived agrees well with survival duration. A classifier that utilizes this ordering improves upon other classifiers suggested for this task. The set of genes displaying consistent behavior for the determined ordering are enriched for genes associated with cancer progression.
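
The idea of recovering a temporal order by solving a TSP instance can be illustrated with a minimal sketch. This is not the authors' algorithm: it assumes Euclidean distance between expression vectors, uses brute-force search (so it only suits a handful of samples), and the toy data and the function names `path_length` and `order_samples` are hypothetical.

```python
from itertools import permutations
import math

def path_length(order, samples):
    """Total Euclidean length of the path visiting samples in the given order."""
    return sum(math.dist(samples[a], samples[b]) for a, b in zip(order, order[1:]))

def order_samples(samples):
    """Return the visiting order with the shortest open path (brute force)."""
    best = min(permutations(range(len(samples))),
               key=lambda p: path_length(p, samples))
    # A path and its reverse have the same length, so the direction of time
    # cannot be decided from distances alone; pick one direction canonically.
    return list(best) if best[0] < best[-1] else list(best[::-1])

# Hypothetical toy data: five samples, two genes, taken from a smooth
# progression and then shuffled.
samples = [(0.9, 2.1), (0.1, 0.2), (0.5, 1.0), (0.3, 0.6), (0.7, 1.6)]
order = order_samples(samples)
print(order)
```

For realistic sample counts the brute-force search would have to be replaced by a TSP heuristic, which is the route the abstract describes.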

2.
This paper concerns the discovery of patterns in gene expression matrices, in which each element gives the expression level of a given gene in a given experiment. Most existing methods for pattern discovery in such matrices are based on clustering genes by comparing their expression levels in all experiments, or clustering experiments by comparing their expression levels for all genes. Our work goes beyond such global approaches by looking for local patterns that manifest themselves when we focus simultaneously on a subset G of the genes and a subset T of the experiments. Specifically, we look for order-preserving submatrices (OPSMs), in which the expression levels of all genes induce the same linear ordering of the experiments (we show that the OPSM search problem is NP-hard in the worst case). Such a pattern might arise, for example, if the experiments in T represent distinct stages in the progress of a disease or in a cellular process and the expression levels of all genes in G vary across the stages in the same way. We define a probabilistic model in which an OPSM is hidden within an otherwise random matrix. Guided by this model, we develop an efficient algorithm for finding the hidden OPSM in the random matrix. In data generated according to the model, the algorithm recovers the hidden OPSM with a very high success rate. Application of the methods to breast cancer data seems to reveal significant local patterns.
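
To make the OPSM definition concrete, the sketch below checks which genes support a given ordering of a subset T of the experiments. It only verifies a candidate pattern; it does not implement the search algorithm from the paper. The matrix values and the names `supports_order` and `supporting_genes` are hypothetical.

```python
def supports_order(row, column_order):
    """True if the row's values strictly increase along column_order."""
    values = [row[c] for c in column_order]
    return all(a < b for a, b in zip(values, values[1:]))

def supporting_genes(matrix, column_order):
    """Indices of genes (rows) whose expression induces the given ordering."""
    return [g for g, row in enumerate(matrix) if supports_order(row, column_order)]

# Hypothetical expression matrix: rows are genes, columns are experiments.
matrix = [
    [0.2, 0.9, 0.5, 0.1],
    [1.0, 3.0, 2.0, 0.5],
    [0.7, 0.1, 0.4, 0.9],
    [5.0, 9.0, 7.0, 6.0],
]
# Candidate pattern: T = {0, 1, 2}, ordered as experiment 0 < 2 < 1.
genes = supporting_genes(matrix, [0, 2, 1])
print(genes)
```

Here genes 0, 1, and 3 together with experiments 0, 2, and 1 form an order-preserving submatrix; the hard part, as the abstract notes, is finding such G and T without knowing the ordering in advance.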

3.
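This study makes a first attempt at constructing a linear differential equation model of the gene expression regulatory network associated with breast cancer metastasis, and analyzes the reliability and biological significance of the model. Using gene chip technology, gene expression profiles were compared between breast cancer tissue with lymph node metastasis and the corresponding lymph node metastatic tissue in 30 cases, and the differentially expressed genes were used to build the regulatory network model with linear differential equation methods. A total of 27 differentially expressed genes were identified, of which 14 were markedly upregulated (Ratio > 3) and 13 were markedly downregulated (Ratio < 0.33). Analysis of the biological significance of the key nodes and pathways in the model, together with the mathematical properties of the network, gives preliminary indications that the regulatory network is reliable and that the development of breast cancer metastasis is related to malignant cell transformation caused by abnormalities in multiple genes and multiple pathways.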

4.
The classification of tissue samples based on gene expression data is an important problem in medical diagnosis of diseases such as cancer. In gene expression data, the number of genes is usually very high (in the thousands) compared to the number of data samples (in the tens or low hundreds); that is, the data dimension is large compared to the number of data points (such data is said to be undersampled). To cope with performance and accuracy problems associated with high dimensionality, it is commonplace to apply a preprocessing step that transforms the data to a space of significantly lower dimension with limited loss of the information present in the original data. Linear discriminant analysis (LDA) is a well-known technique for dimension reduction and feature extraction, but it is not applicable to undersampled data due to singularity problems associated with the matrices in the underlying representation. This paper presents a dimension reduction and feature extraction scheme, called uncorrelated linear discriminant analysis (ULDA), for undersampled problems and illustrates its utility on gene expression data. ULDA employs the generalized singular value decomposition method to handle undersampled data, and the features that it produces in the transformed space are uncorrelated, which makes it attractive for gene expression data. The properties of ULDA are established rigorously and extensive experimental results on gene expression data are presented to illustrate its effectiveness in classifying tissue samples. These results provide a comparative study of various state-of-the-art classification methods on well-known gene expression data sets.

5.
Implantation of rat prostate cancer cells into the normal rat prostate results in tumor-stimulating changes in the tumor-bearing organ, for example growth of the vasculature, an altered extracellular matrix, and influx of inflammatory cells. To investigate this response further, we compared prostate morphology and the gene expression profile of tumor-bearing normal rat prostate tissue (termed tumor-instructed/indicating normal tissue (TINT)) with that of prostate tissue from controls. Dunning rat AT-1 prostate cancer cells were injected into rat prostate and tumors were established after 10 days. As controls we used intact animals and animals injected with heat-killed AT-1 cells or cell culture medium. None of the controls showed morphological TINT-changes. A rat Illumina whole-genome expression array was used to analyze gene expression in AT-1 tumors, TINT, and in medium-injected prostate tissue. We identified 423 upregulated genes and 38 downregulated genes (p<0.05, ≥2-fold change) in TINT relative to controls. Quantitative RT-PCR analysis verified key TINT-changes, and they were not detected in controls. Expression of some genes was changed in a manner similar to that in the tumor, whereas other changes were exclusive to TINT. Ontological analysis using GeneGo software showed that the TINT gene expression profile was coupled to processes such as inflammation, immune response, and wounding. Many of the genes whose expression is altered in TINT have well-established roles in tumor biology, and the present findings indicate that they may also function by adapting the surrounding tumor-bearing organ to the needs of the tumor. Even though a minor tumor cell contamination in TINT samples cannot be ruled out, our data suggest that there are tumor-induced changes in gene expression in the normal tumor-bearing organ which can probably not be explained by tumor cell contamination. It is important to validate these changes further, as they could hypothetically serve as novel diagnostic and prognostic markers of prostate cancer.

6.
7.
A central step in the analysis of gene expression data is the identification of groups of genes that exhibit similar expression patterns. Clustering and ordering the genes into homogeneous groups using gene expression data has been shown to be useful in functional annotation, tissue classification, regulatory motif identification, and other applications. Although there is a rich literature on gene ordering in the hierarchical clustering framework for gene expression analysis, to the best of the authors' knowledge there is no work addressing and evaluating the importance of gene ordering in the partitive clustering framework. Outside the framework of hierarchical clustering, different gene ordering algorithms are applied to the whole data set, and the domain of partitive clustering is still unexplored with gene ordering approaches. A new hybrid method is proposed for ordering genes in each of the clusters obtained from a partitive clustering solution, using microarray gene expressions. Two existing algorithms for optimally ordering cities in the travelling salesman problem (TSP), namely FRAG_GALK and Concorde, are hybridized individually with the self-organizing map to show the importance of gene ordering in the partitive clustering framework. We validated our hybrid approach using yeast and fibroblast data and showed that our approach improves the result quality of the partitive clustering solution by identifying subclusters within big clusters, grouping functionally correlated genes within clusters, minimizing the summation of gene expression distances, and maximizing the biological gene ordering using MIPS categorization. Moreover, the new hybrid approach finds comparable or sometimes superior biological gene order in less computation time than that obtained by optimal leaf ordering in a hierarchical clustering solution.

8.

Background  

The search for cluster structure in microarray datasets is a basic problem for the so-called "-omic sciences". A difficult problem in clustering is how to handle data with a manifold structure, i.e. data that is not shaped in the form of compact clouds of points but instead forms arbitrary shapes or paths embedded in a high-dimensional space, as could be the case for some gene expression datasets.

9.
Recent large-scale sequencing studies have revealed that cancer genomes contain variable numbers of somatic point mutations distributed across many genes. These somatic mutations most likely include passenger mutations that are not cancer causing and pathogenic driver mutations in cancer genes. Establishing a significant presence of driver mutations in such data sets is of biological interest. Whereas current techniques from phylogeny are applicable to large data sets composed of singly mutated samples, recently exemplified with a p53 mutation database, methods for smaller data sets containing individual samples with multiple mutations need to be developed. By constructing distinct models of both the mutation process and selection pressure upon the cancer samples, exact statistical tests to examine this problem are devised. Tests to examine the significance of selection toward missense, nonsense, and splice site mutations are derived, along with tests assessing variation in selection between functional domains. Maximum-likelihood methods facilitate parameter estimation, including levels of selection pressure and minimum numbers of pathogenic mutations. These methods are illustrated with 25 breast cancers screened across the coding sequences of 518 kinase genes, revealing 90 base substitutions in 71 genes. Significant selection pressure upon truncating mutations was established. Furthermore, an estimated minimum of 29.8 mutations were pathogenic.

10.
Phylogenomic studies aim to build phylogenies from large sets of homologous genes. Such "genome-sized" data require fast methods, because of the typically large numbers of taxa examined. In this framework, distance-based methods are useful for exploratory studies and building a starting tree to be refined by a more powerful maximum likelihood (ML) approach. However, estimating evolutionary distances directly from concatenated genes gives poor topological signal as genes evolve at different rates. We propose a novel method, named super distance matrix (SDM), which follows the same line as average consensus supertree (ACS; Lapointe and Cucumel, 1997) and combines the evolutionary distances obtained from each gene into a single distance supermatrix to be analyzed using a standard distance-based algorithm. SDM deforms the source matrices, without modifying their topological message, to bring them as close as possible to each other; these deformed matrices are then averaged to obtain the distance supermatrix. We show that this problem is equivalent to the minimization of a least-squares criterion subject to linear constraints. This problem has a unique solution which is obtained by solving a linear system. As this system is sparse, solving it in practice requires O(n^a k^a) time, where n is the number of taxa, k the number of matrices, and a < 2, which allows the distance supermatrix to be quickly obtained. Several uses of SDM are proposed, from fast exploratory studies to more accurate approaches requiring heavier computing time. Using simulations, we show that SDM is a relevant alternative to the standard matrix representation with parsimony (MRP) method, notably when the taxa sets of the different genes have low overlap. We also show that SDM can be used to build an excellent starting tree for an ML approach, which both reduces the computing time and increases the topological accuracy. We use SDM to analyze the data set of Gatesy et al. (2002, Syst. Biol. 51: 652-664) that involves 48 genes of 75 placental mammals. The results indicate that these genes have strong rate heterogeneity and confirm the simulation conclusions.

11.
SUMMARY: The fundamental problem of gene selection via cDNA data is to identify which genes are differentially expressed across different kinds of tissue samples (e.g. normal and cancer). cDNA data contain a large number of variables (genes) and the sample size is usually relatively small, so the selection process can be unstable. Therefore, models which incorporate sparsity in terms of variables (genes) are desirable for this kind of problem. This paper proposes a two-level hierarchical Bayesian model for variable selection which assumes a prior that favors sparseness. We adopt a Markov chain Monte Carlo (MCMC) based computation technique to simulate the parameters from the posteriors. The method is applied to leukemia data from a previous study and a published dataset on breast cancer. SUPPLEMENTARY INFORMATION: http://stat.tamu.edu/people/faculty/bmallick.html.

12.

Background  

A phylogenetic network is a generalization of a phylogenetic tree that allows conflicting signals or alternative evolutionary histories to be represented in a single diagram. There are several methods for constructing these networks, some of which are based on distances among taxa. In practice, distance-based methods run faster than the other methods. Neighbor-Net (N-Net) is a distance-based method: it produces a circular ordering from a distance matrix and then constructs a collection of weighted splits from that ordering. The program SplitsTree uses these weighted splits to draw a phylogenetic network. In general, finding an optimal circular ordering is an NP-hard problem. N-Net is a heuristic, based on the neighbor-joining algorithm, for approximating the optimal circular ordering.

13.
Xenotropic murine leukemia virus (MLV)-related virus (XMRV) has been amplified from human prostate cancer and chronic fatigue syndrome (CFS) patient samples. Other studies failed to replicate these findings and suggested PCR contamination with a prostate cancer cell line, 22Rv1, as a likely source. MLV-like sequences have also been detected in CFS patients in longitudinal samples 15 years apart. Here, we tested whether sequence data from these samples are consistent with viral evolution. Our phylogenetic analyses strongly reject a model of within-patient evolution and demonstrate that the sequences from the first and second time points represent distinct endogenous murine retroviruses, suggesting contamination.

14.
15.
The geometric shape is traditionally used to calculate phytoplankton cell measurements (e.g. biovolume), but it can also play an important role in determining community distributions. Little is known about how geometric shapes relate to other morphological traits or to the environment. We explored whether shapes and related morphological traits are selected by environmental forcing. For this, samples were collected seasonally at 21 stations in coastal-marine waters of the Salento Peninsula (Italy). Phytoplankton taxa were classified in terms of geometric shape, biovolume (organism size) and surface-to-volume ratio (S:V). The relationship between greatest axial linear dimension (GALD) and S:V was assessed for each shape. A Canonical Correspondence Analysis (CCA) was performed to evaluate phytoplankton shape distribution on temporal and spatial scales. Phytoplankton community was characterized by high morphological diversity. GALD and S:V were inversely related in most of the shapes. CCA showed that phytoplankton shape distribution was influenced more by seasonal than by spatial variation: elongated shapes characterized the cold period; rounded and combined shapes the warmer period. Most of the shapes showed conservatism of the S:V and trade-off with the size. Geometric shapes represent an interesting feature to be considered in trait-based approaches to study phytoplankton distributions in aquatic ecosystems.

16.
Neuroligins are postsynaptic cell-adhesion proteins that associate with their presynaptic partners, the neurexins. Using small-angle X-ray scattering, we determined the shapes of the extracellular region of several neuroligin isoforms in solution. We conclude that the neuroligins dimerize via the characteristic four-helix bundle observed in cholinesterases, and that the connecting sequence between the globular lobes of the dimer and the cell membrane is elongated, projecting away from the dimer interface. X-ray scattering and neutron contrast variation data show that two neurexin monomers, separated by 107 Å, bind at symmetric locations on opposite sides of the long axis of the neuroligin dimer. Using these data, we developed structural models that delineate the spatial arrangements of different neuroligin domains and their partnering molecules. As mutations of neurexin and neuroligin genes appear to be linked to autism, these models provide a structural framework for understanding altered recognition by these proteins in neurodevelopmental disorders.

17.
A stochastic Markov chain model for metastatic progression is developed for primary lung cancer based on a network construction of metastatic sites with dynamics modeled as an ensemble of random walkers on the network. We calculate a transition matrix, with entries (transition probabilities) interpreted as random variables, and use it to construct a circular bi-directional network of primary and metastatic locations based on postmortem tissue analysis of 3827 autopsies on untreated patients documenting all primary tumor locations and metastatic sites from this population. The resulting 50 potential metastatic sites are connected by directed edges with distributed weightings, where the site connections and weightings are obtained by calculating the entries of an ensemble of transition matrices so that the steady-state distribution obtained from the long-time limit of the Markov chain dynamical system corresponds to the ensemble metastatic distribution obtained from the autopsy data set. We condition our search for a transition matrix on an initial distribution of metastatic tumors obtained from the data set. Through an iterative numerical search procedure, we adjust the entries of a sequence of approximations until a transition matrix with the correct steady-state is found (up to a numerical threshold). Since this constrained linear optimization problem is underdetermined, we characterize the statistical variance of the ensemble of transition matrices calculated using the means and variances of their singular value distributions as a diagnostic tool. We interpret the ensemble averaged transition probabilities as (approximately) normally distributed random variables. The model allows us to simulate and quantify disease progression pathways and timescales of progression from the lung position to other sites and we highlight several key findings based on the model.
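
The "long-time limit" mentioned above is the stationary distribution of the transition matrix. As a minimal sketch (the three-site matrix is invented for illustration, and `step` and `steady_state` are hypothetical names, not the paper's code), it can be approximated by repeatedly applying the matrix to an initial distribution:

```python
def step(dist, P):
    """One Markov step: new_j = sum_i dist_i * P[i][j] (rows of P sum to 1)."""
    n = len(P)
    return [sum(dist[i] * P[i][j] for i in range(n)) for j in range(n)]

def steady_state(P, start, tol=1e-12, max_iter=10_000):
    """Iterate the chain until the distribution stops changing."""
    dist = list(start)
    for _ in range(max_iter):
        nxt = step(dist, P)
        if max(abs(a - b) for a, b in zip(dist, nxt)) < tol:
            return nxt
        dist = nxt
    raise RuntimeError("power iteration did not converge")

# Hypothetical 3-site network; all probability mass starts at site 0,
# playing the role of the primary tumor location.
P = [
    [0.50, 0.50, 0.00],
    [0.25, 0.50, 0.25],
    [0.00, 0.50, 0.50],
]
pi = steady_state(P, [1.0, 0.0, 0.0])
print([round(x, 6) for x in pi])
```

The paper works in the opposite direction: the target steady state is known from autopsy data and the transition matrix is the unknown. The sketch only shows the forward computation used to check a candidate matrix.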

18.
With the growing surge of biological measurements, the problem of integrating and analyzing different types of genomic measurements has become an immediate challenge for elucidating events at the molecular level. In order to address the problem of integrating different data types, we present a framework that locates variation patterns in two biological inputs based on the generalized singular value decomposition (GSVD). In this work, we jointly examine gene expression and copy number data and iteratively project the data on different decomposition directions defined by the projection angle θ in the GSVD. With the proper choice of θ, we locate similar and dissimilar patterns of variation between both data types. We discuss the properties of our algorithm using simulated data and conduct a case study with biologically verified results. Ultimately, we demonstrate the efficacy of our method on two genome-wide breast cancer studies to identify genes with large variation in expression and copy number across numerous cell line and tumor samples. Our method identifies genes that are statistically significant in both input measurements. The proposed method is useful for a wide variety of joint copy number and expression-based studies. Supplementary information is available online, including software implementations and experimental data.

19.
Matrix correlation represents an innovative methodology to evaluate the explanatory power of several hypotheses by measuring their correspondence with observed morphological variation. In this paper, we view the origins of Patagonians from a matrix correlation approach. Personal and published data on nonmetric cranial traits were used to estimate a biological distance matrix involving five major groups from Patagonia and two from the northwest and northeast regions of Argentina. To evaluate correspondence with other important factors, we used a geographic distance matrix and four design matrices, representing several patterns of settlement and differentiation. Biological distance was found to be strongly associated with spatial separation; the correlation between geography and nonmetric cranial distances was highly significant. When geographic distance is held constant, correlation between a model representing high levels of heterogeneity between the samples and morphological (nonmetric) variation becomes highly significant.
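
A matrix correlation of this kind is commonly computed as a Mantel-style test: correlate the off-diagonal entries of two distance matrices, then judge significance by permuting the group labels of one matrix. The sketch below assumes that approach and covers only the simple (not the partial) correlation; the distance values and the names `upper_triangle`, `pearson`, and `mantel` are hypothetical.

```python
from itertools import combinations, permutations
import math

def upper_triangle(d, perm=None):
    """Off-diagonal entries above the diagonal, optionally with labels permuted."""
    n = len(d)
    p = perm if perm is not None else range(n)
    return [d[p[i]][p[j]] for i, j in combinations(range(n), 2)]

def pearson(x, y):
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def mantel(a, b):
    """Observed correlation and exact permutation p-value (small matrices only)."""
    x = upper_triangle(a)
    r_obs = pearson(x, upper_triangle(b))
    perms = list(permutations(range(len(a))))
    # Count relabelings of b that correlate with a at least as well as observed.
    hits = sum(pearson(x, upper_triangle(b, p)) >= r_obs - 1e-12 for p in perms)
    return r_obs, hits / len(perms)

# Hypothetical data: four groups on a line; geographic distance is exactly
# twice the biological distance, so the correlation is perfect.
bio = [[abs(i - j) for j in range(4)] for i in range(4)]
geo = [[2 * abs(i - j) for j in range(4)] for i in range(4)]
r, p_value = mantel(bio, geo)
print(r, p_value)
```

With only four groups the smallest attainable p-value is 2/24, because reversing the line leaves every distance unchanged. Studies with more groups would sample random permutations instead of enumerating all of them.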

20.
MOTIVATION: Clustering has been used as a popular technique for finding groups of genes that show similar expression patterns under multiple experimental conditions. Many clustering methods have been proposed for clustering gene-expression data, including hierarchical clustering, k-means clustering and the self-organizing map (SOM). However, the conventional methods are limited in their ability to identify clusters of different shapes because they use a fixed distance norm when calculating the distance between genes. The fixed distance norm imposes a fixed geometrical shape on the clusters regardless of the actual data distribution. Thus, different distance norms are required for handling the different shapes of clusters. RESULTS: We present the Gustafson-Kessel (GK) clustering method for microarray gene-expression data. To detect clusters of different shapes in a dataset, we use an adaptive distance norm that is calculated from a fuzzy covariance matrix (F) of each cluster, in which the eigenstructure of F is used as an indicator of the shape of the cluster. Moreover, the GK method is less prone to falling into local minima than k-means and SOM because it makes decisions through the use of membership degrees of a gene to clusters. The algorithmic procedure is accomplished by the alternating optimization technique, which iteratively improves a sequence of sets of clusters until no further improvement is possible. To test the performance of the GK method, we applied the GK method and well-known conventional methods to three recently published yeast datasets, and compared the performance of each method using the Saccharomyces Genome Database annotations. The clustering results of the GK method are more significantly relevant to the biological annotations than those of the other methods, demonstrating its effectiveness and potential for clustering gene-expression data. AVAILABILITY: The software was developed in Java and can be executed on any platform on which a JVM (Java Virtual Machine) is running. It is available from the authors upon request. SUPPLEMENTARY INFORMATION: Supplementary data are available at http://dragon.kaist.ac.kr/gk.
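
The adaptive distance norm can be sketched for the two-dimensional case. The code assumes the standard Gustafson-Kessel formulation, in which the squared distance is (x - v)^T A (x - v) with A = (rho * det F)^(1/n) * F^(-1); the abstract does not spell this formula out, so it is an assumption here. The data points and the names `fuzzy_covariance` and `gk_distance_sq` are hypothetical, and this is not the authors' Java software.

```python
def fuzzy_covariance(points, memberships, center, m=2.0):
    """2x2 fuzzy covariance F of one cluster, weighted by membership**m."""
    w = [u ** m for u in memberships]
    total = sum(w)
    d = [(p[0] - center[0], p[1] - center[1]) for p in points]
    fxx = sum(wi * dx * dx for wi, (dx, dy) in zip(w, d)) / total
    fxy = sum(wi * dx * dy for wi, (dx, dy) in zip(w, d)) / total
    fyy = sum(wi * dy * dy for wi, (dx, dy) in zip(w, d)) / total
    return [[fxx, fxy], [fxy, fyy]]

def gk_distance_sq(x, center, F, rho=1.0):
    """Squared GK distance in 2-D: (x-v)^T A (x-v), A = (rho*det F)^(1/2) F^-1."""
    det = F[0][0] * F[1][1] - F[0][1] * F[1][0]
    scale = (rho * det) ** 0.5           # exponent 1/n with n = 2 dimensions
    a00 = scale * F[1][1] / det          # entries of scale * inverse(F)
    a01 = -scale * F[0][1] / det
    a11 = scale * F[0][0] / det
    dx, dy = x[0] - center[0], x[1] - center[1]
    return a00 * dx * dx + 2 * a01 * dx * dy + a11 * dy * dy

# Hypothetical cluster elongated along the first axis, full memberships.
points = [(2.0, 0.0), (-2.0, 0.0), (0.0, 1.0), (0.0, -1.0)]
F = fuzzy_covariance(points, [1.0, 1.0, 1.0, 1.0], (0.0, 0.0))
along = gk_distance_sq((2.0, 0.0), (0.0, 0.0), F)
across = gk_distance_sq((0.0, 2.0), (0.0, 0.0), F)
print(F, along, across)
```

Two points at the same Euclidean distance from the center get different GK distances: the one lying along the cluster's long axis counts as closer, which is how the norm adapts to cluster shape.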
