Similar documents
 20 similar documents found (search time: 15 ms)
1.

Method

Genome-wide expression profiling is a widely used approach for characterizing heterogeneous populations of cells, tissues, biopsies, or other biological specimens. The exploratory analysis of such data typically relies on generic unsupervised methods, e.g. principal component analysis (PCA) or hierarchical clustering. However, generic methods fail to exploit prior knowledge about the molecular functions of genes. Here, I introduce GO-PCA, an unsupervised method that combines PCA with nonparametric GO enrichment analysis in order to systematically search for sets of genes that are both strongly correlated and closely functionally related. These gene sets are then used to automatically generate expression signatures with functional labels, which collectively aim to provide a readily interpretable representation of biologically relevant similarities and differences. The robustness of the results can be assessed by bootstrapping.

Results

I first applied GO-PCA to datasets containing diverse hematopoietic cell types from human and mouse, respectively. In both cases, GO-PCA generated a small number of signatures that represented the majority of lineages present, and whose labels reflected their respective biological characteristics. I then applied GO-PCA to human glioblastoma (GBM) data, and recovered signatures associated with four out of five previously defined GBM subtypes. My results demonstrate that GO-PCA is a powerful and versatile exploratory method that reduces an expression matrix containing thousands of genes to a much smaller set of interpretable signatures. In this way, GO-PCA aims to facilitate hypothesis generation, design of further analyses, and functional comparisons across datasets.

2.
Variation partitioning by (partial) constrained ordination is a popular method for exploratory data analysis, but applications are mostly restricted to simple ecological questions involving only two or three sets of explanatory variables, such as climate and soil, because the complexity of calculations and results increases rapidly with the number of explanatory variable sets. A unique algorithm is shown to exist for partitioning the variation in a set of response variables on n sets of explanatory variables, and it is shown how the 2^n - 1 non-overlapping components of variation can be calculated. Methods for evaluation and presentation of variation partitioning results are reviewed, and a recursive algorithm is proposed for distributing the many small components of variation over simpler components. Several issues related to the use and usefulness of variation partitioning with n sets of explanatory variables are discussed with reference to a worked example.
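The count of 2^n - 1 components follows from the fact that each non-overlapping component corresponds to one non-empty combination of the n explanatory variable sets (variation unique to one set, or shared by exactly those sets). A minimal Python sketch of that enumeration is given here; the function name and the three set names are hypothetical, and the sketch only lists the component labels, it does not compute the fractions of variation or reproduce the algorithm of the paper.

```python
from itertools import combinations


def variation_components(set_names):
    """Return every non-empty combination of explanatory variable sets."""
    components = []
    for size in range(1, len(set_names) + 1):
        # All combinations of exactly `size` sets, in input order.
        components.extend(combinations(set_names, size))
    return components


# Hypothetical explanatory variable sets, for illustration only.
names = ["climate", "soil", "topography"]
for component in variation_components(names):
    print(" & ".join(component))
```

With n = 3 the loop lists seven labels, and each additional set roughly doubles the count, which illustrates why results become hard to present as n grows.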

3.
Several demographic factors can produce family-structured patches within natural plant populations, particularly limited seed and pollen dispersal and small effective density. In this paper, we used computer simulations to examine how seed dispersal, density, and spatial distribution of adult trees and seedlings can explain the spatial genetic structure (SGS) of natural regeneration after a single reproductive event in a small population. We then illustrated the results of our simulations using genetic (isozymes and chloroplast microsatellites) and demographic experimental data from an Abies alba (silver fir) intensive study plot located in the Southern French Alps (Mont Ventoux). Simulations showed that the structuring effect of limited dispersal on seedling SGS can largely be counterbalanced by high effective density or a clumped spatial distribution of adult trees. In addition, the clumping of natural regeneration far from adult trees, which is common in temperate forest communities where gap dynamics are predominant, further decreases SGS intensity. Contrary to our simulation results, low adult tree density, aggregated spatial distribution of seedlings, and limited seed dispersal did not generate a significant SGS in our A. alba experimental plot. Although some level of long distance pollen and seed flow could explain this lack of SGS, our experimental data confirm the role of spatial aggregation (both in adult trees and in seedlings far from adult trees) in reducing SGS in natural populations.

4.
To date, numerous feature selection techniques have been proposed and have found wide application in genomics and proteomics. For instance, feature/gene selection has proven to be useful for biomarker discovery from microarray and mass spectrometry data. While supervised feature selection has been explored extensively, there are only a few unsupervised methods that can be applied to exploratory data analysis. In this paper, we address the problem of unsupervised feature selection. First, we extend Laplacian linear discriminant analysis (LLDA) to unsupervised cases. Second, we propose a novel algorithm for computing LLDA, which is efficient in the case of high dimensionality and small sample size as in microarray data. Finally, an unsupervised feature selection method, called LLDA-based Recursive Feature Elimination (LLDA-RFE), is proposed. We apply LLDA-RFE to several public data sets of cancer microarrays and compare its performance with those of Laplacian score and SVD-entropy, two state-of-the-art unsupervised methods, and with that of Fisher score, a supervised filter method. Our results demonstrate that LLDA-RFE outperforms Laplacian score and shows favorable performance against SVD-entropy. It performs even better than Fisher score for some of the data sets, despite the fact that LLDA-RFE is fully unsupervised.

5.
Most of the conventional feature selection algorithms have a drawback whereby a weakly ranked gene that could perform well in terms of classification accuracy with an appropriate subset of genes will be left out of the selection. Considering this shortcoming, we propose a feature selection algorithm in gene expression data analysis of sample classifications. The proposed algorithm first divides genes into subsets, the sizes of which are relatively small (roughly of size h), then selects informative smaller subsets of genes (of size r < h) from a subset and merges the chosen genes with another gene subset (of size r) to update the gene subset. We repeat this process until all subsets are merged into one informative subset. We illustrate the effectiveness of the proposed algorithm by analyzing three distinct gene expression data sets. Our method shows promising classification accuracy for all the test data sets. We also show the relevance of the selected genes in terms of their biological functions.
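The divide, select and merge loop described above can be sketched roughly as follows. This is one possible reading of the abstract, not the authors' implementation: the function names, the exhaustive search over r-gene combinations, the way kept genes are pooled with the next chunk, and the `score_subset` callable (which in practice might be a cross-validated classification accuracy) are all assumptions made for illustration.

```python
from itertools import combinations


def best_subset(pool, r, score_subset):
    """Exhaustively pick the r-gene subset of `pool` with the highest score."""
    r = min(r, len(pool))
    return list(max(combinations(pool, r), key=score_subset))


def merge_select(genes, h, r, score_subset):
    """Fold chunks of about h genes into one selected subset of size r."""
    if not 0 < r < h:
        raise ValueError("expected 0 < r < h")
    # Divide the gene list into consecutive chunks of at most h genes.
    chunks = [genes[i:i + h] for i in range(0, len(genes), h)]
    selected = []
    for chunk in chunks:
        # Pool the genes kept so far with the next chunk, then reselect.
        selected = best_subset(selected + chunk, r, score_subset)
    return selected


# Hypothetical per-gene weights and an additive toy score, for illustration.
weights = {"g1": 0.9, "g2": 0.1, "g3": 0.8, "g4": 0.2, "g5": 0.7, "g6": 0.3}


def toy_score(subset):
    return sum(weights[g] for g in subset)


print(merge_select(list(weights), h=3, r=2, score_subset=toy_score))
```

Because subsets are scored jointly rather than gene by gene, a scoring function that rewards combinations could retain a gene that ranks poorly on its own, which is the shortcoming the abstract sets out to address. The exhaustive search is only practical for small h and r.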

6.
The effective extraction of information from multidimensional data sets derived from phenotyping experiments is a growing challenge in biology. Data visualization tools are important resources that can aid in exploratory data analysis of complex data sets. Phenotyping experiments of model organisms produce data sets in which a large number of phenotypic measures are collected for each individual in a group. A critical initial step in the analysis of such multidimensional data sets is the exploratory analysis of data distribution and correlation. To facilitate the rapid visualization and exploratory analysis of multidimensional complex trait data, we have developed a user-friendly, web-based software tool called Phenostat. Phenostat is composed of a dynamic graphical environment that allows the user to inspect the distribution of multiple variables in a data set simultaneously. Individuals can be selected by directly clicking on the graphs and thus displaying their identity, highlighting corresponding values in all graphs, allowing their inclusion or exclusion from the analysis. Statistical analysis is provided by R package functions. Phenostat is particularly suited for rapid distribution and correlation analysis of subsets of data. An analysis of behavioral and physiologic data stemming from a large mouse phenotyping experiment using Phenostat reveals previously unsuspected correlations. Phenostat is freely available to academic institutions and nonprofit organizations and can be used from our website at .

7.
The relative location of proteins and lipids in particles of medicinal leech salivary gland secretion (SGS) is revealed for the first time. Their sizes and morphology are described. Using scanning electron microscopy and transmission electron microscopy, it was determined that SGS consists of particles of different sizes and forms. This picture is supported by confocal laser scanning microscopy of SGS preparations treated with fluorescein isothiocyanate. After incubation with nonionic detergents (Brij 35 and Tween 20), transmission electron microscopy revealed the dissociation of fragments composing protein-lipid particles (PLP), and in this case an increase in free protein concentration determined by a modification of the Lowry method was observed. Perylene probing of lipids in SGS preparations showed that they are concentrated mainly inside PLP and are almost absent on the surface. Cholesterol was detected during SGS probing using the cholesteryl-Bodipy (hydrophobic fluorescent analog of cholesterol) on surface sections during confocal analysis of electron microphotographs of SGS. This analysis detected PLP structures in SGS resembling caveolae full of cholesterol. SGS, previously frozen at −70°C, transformed into a multitude of similar small particles visualized by transmission electron microscopy, whose fixed distribution resembled water crystal structure.

8.
As the data resulting from modern genotyping tools are astoundingly complex, genotyping studies require great care in the sampling design, genotyping, data analysis and interpretation. Such care is necessary because, with data sets containing thousands of loci, small biases can easily become strongly significant patterns. Such biases may already be present in routine tasks that are part of almost every genotyping study. Here, I discuss seven common mistakes that are frequently encountered in the genotyping literature: (i) giving more attention to genotyping than to sampling, (ii) failing to perform or report experimental randomization in the laboratory, (iii) equating geopolitical borders with biological borders, (iv) testing significance of clustering output, (v) misinterpreting Mantel's r statistic, (vi) only interpreting a single value of k and (vii) forgetting that only a small portion of the genome will be associated with climate. For each of these issues, I suggest how to avoid the mistake. Overall, I argue that genotyping studies would benefit from establishing a more rigorous experimental design, involving proper sampling design, randomization and better distinction of a priori hypotheses and exploratory analyses.

9.
Spatial genetic structure (SGS) results from the interplay of several demographical processes that are difficult to tease apart. In this study, we explore the specific effects of seed and pollen dispersal and of early postdispersal mortality on the SGS of a seedling cohort (N = 786) recruiting within and around an expanding pedunculate oak (Quercus robur) stand. Using data on dispersal (derived from parentage analysis) and mortality (monitored in the field through two growing seasons), we decompose the overall SGS of the cohort into its components by contrasting the SGS of dispersed (i.e. growing away from their mother tree) vs. nondispersed (i.e. growing beneath their mother tree) and initial vs. surviving seedlings. Patterns differ strongly between nondispersed and dispersed seedlings. Nondispersed seedlings are largely responsible for the positive kinship values observed at short distances in the studied population, whereas dispersed seedlings determine the overall SGS at distances beyond c. 30 m. The paternal alleles of nondispersed seedlings show weak yet significantly positive kinships up to c. 15 m, indicating some limitations in pollen flow that should further promote pedigree structures at short distances. Seedling mortality does not alter SGS, except for a slight increase in the nondispersed group. Field data reveal that mortality in this group is negatively density-dependent, probably because of small-scale variation in light conditions. Finally, we observe a remarkable similarity between the SGS of the dispersed seedlings and that of the adults, which probably reflects dispersal processes during the initial expansion of the population. Overall, this study demonstrates that incorporating individual-level complementary information into analyses can greatly improve the detail and confidence of ecological inferences drawn from SGS.

10.
One of the first steps in analyzing high-dimensional functional genomics data is an exploratory analysis of such data. Cluster Analysis and Principal Component Analysis are then usually the methods of choice. Despite their versatility they also have a severe drawback: they do not always generate simple and interpretable solutions. On the basis of the observation that functional genomics data often contain both informative and non-informative variation, we propose a method that finds sets of variables containing informative variation. This informative variation is subsequently expressed in easily interpretable simplivariate components. We present a new implementation of the recently introduced simplivariate models. In this implementation, the informative variation is described by multiplicative models that can adequately represent the relations between functional genomics data. One simulated and two real-life metabolomics data sets show good performance of the method.

11.
Studies of fine-scale spatial genetic structure (SGS) in wind-pollinated trees have shown that SGS is generally weak and extends over relatively short distances (less than 30-40 m) from individual trees. However, recent simulations have shown that detection of SGS is heavily dependent on both the choice of molecular markers and the strategy used to sample the studied population. Published studies may not always have used sufficient markers and/or individuals for the accurate estimation of SGS. To assess the extent of SGS within a population of the wind-pollinated tree Fagus sylvatica, we genotyped 200 trees at six microsatellite or simple sequence repeat (SSR) loci and 250 amplified fragment length polymorphisms (AFLP) and conducted spatial analyses of pairwise kinship coefficients. We re-sampled our data set over individuals and over loci to determine the effect of reducing the sample size and number of loci used for SGS estimation. We found that SGS estimated from AFLP markers extended nearly four times further than has been estimated before using other molecular markers in this species, indicating a persistent effect of restricted gene flow at small spatial scales. However, our SSR-based estimate was in agreement with other published studies. Spatial genetic structure in F. sylvatica and similar wind-pollinated trees may therefore be substantially larger than has been estimated previously. Although 100-150 AFLP loci and 150-200 individuals appear sufficient for adequately estimating SGS in our analysis, 150-200 individuals and six SSR loci may still be too few to provide a good estimation of SGS in this species.

12.
13.
This study explores the capability of an extended sequential Gaussian simulation algorithm that incorporates categorical land use information (SGS-CI) for simulating the spatial variability of soil total nitrogen (TN) contents and assessing the associated spatial uncertainty. A total of 402 soil TN samples from a county-scale region, together with the categorical land use map of the study area, were used to perform sequential simulations comparing the SGS-CI algorithm with the conventional SGS algorithm, and 135 validation samples were used to assess the improvement of SGS-CI over SGS in prediction accuracy and uncertainty reduction. Results showed that the validation data were more strongly correlated with the optimal prediction (i.e., E-type estimates) data of SGS-CI than with those of SGS, and the mean error and the root mean square error of the optimal prediction using SGS-CI were smaller than those using SGS. SGS-CI also performed slightly better than SGS in uncertainty modeling in terms of accuracy plots and goodness statistic G. In addition, because demands for soil total nitrogen by different crops are usually different in agricultural practice, we showed that SGS-CI could be used to assess spatial uncertainty of deficiency or abundance degrees of soil TN based on demands of different crops in different land use types. Therefore, SGS-CI may provide an effective method for improving prediction accuracy and reducing uncertainty in soil TN prediction.
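The two accuracy measures named above, mean error and root mean square error, can be computed from paired predictions and validation values as in the following minimal sketch. The function names, the sign convention (predicted minus observed) and the sample values are assumptions for illustration; they are not taken from the study.

```python
import math


def mean_error(predicted, observed):
    """Average signed difference (predicted minus observed); measures bias."""
    if len(predicted) != len(observed) or not observed:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(p - o for p, o in zip(predicted, observed)) / len(observed)


def rmse(predicted, observed):
    """Root mean square error; penalizes large deviations more strongly."""
    if len(predicted) != len(observed) or not observed:
        raise ValueError("inputs must be non-empty and of equal length")
    squared = sum((p - o) ** 2 for p, o in zip(predicted, observed))
    return math.sqrt(squared / len(observed))


# Made-up soil TN values (g/kg) at four validation points.
observed = [1.20, 0.95, 1.40, 1.10]
predicted = [1.15, 1.00, 1.30, 1.12]
print(mean_error(predicted, observed), rmse(predicted, observed))
```

A mean error near zero indicates little systematic over- or under-prediction, while a smaller root mean square error indicates predictions that lie closer to the validation values overall.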

14.
Since recombination leads to the generation of mosaic genomes that violate the assumption of traditional phylogenetic methods that sequence evolution can be accurately described by a single tree, results and conclusions based on phylogenetic analysis of data sets including recombinant sequences can be severely misleading. Many methods are able to adequately detect recombination between diverse sequences, for example between different HIV-1 subtypes. More problematic is the identification of recombinants among closely related sequences such as a viral population within a host. We describe a simple algorithmic procedure that enables detection of intra-host recombinants based on split-decomposition networks and a robust statistical test for recombination. By applying this algorithm to several published HIV-1 data sets we conclude that intra-host recombination was significantly underestimated in previous studies and that up to one-third of the env sequences longitudinally sampled from a given subject can be of recombinant origin. The results show that our procedure can be a valuable exploratory tool for detection of recombinant sequences before phylogenetic analysis, and also suggest that HIV-1 recombination in vivo is far more frequent and significant than previously thought.

15.
Large scale gene duplication is a major force driving the evolution of genetic functional innovation. Whole genome duplications are widely believed to have played an important role in the evolution of the maize, yeast, and vertebrate genomes. The use of evolutionary trees to analyze the history of gene duplication and estimate duplication times provides a powerful tool for studying this process. Many studies in the molecular evolution literature have used this approach on small data sets, using analyses performed by hand. The rapid growth of genetic sequence data will soon allow similar studies on a genomic scale, but such studies will be limited unless the analysis can be automated. Even existing data sets admit alternative hypotheses that would be too tedious to consider without automation. In this paper, we describe a program called NOTUNG that facilitates large scale analysis, using both rooted and unrooted trees. When tested on trees analyzed in the literature, NOTUNG consistently yielded results that agree with the assessments in the original publications. Thus, NOTUNG provides a basic building block for inferring duplication dates from gene trees automatically and can also be used as an exploratory analysis tool for evaluating alternative hypotheses.

16.
Obtaining satisfactory results with neural networks depends on the availability of large data samples. The use of small training sets generally reduces performance. Most classical Quantitative Structure-Activity Relationship (QSAR) studies for a specific enzyme system have been performed on small data sets. We focus on the neuro-fuzzy prediction of biological activities of HIV-1 protease inhibitory compounds when inferring from small training sets. We propose two computational intelligence prediction techniques which are suitable for small training sets, at the expense of some computational overhead. Both techniques are based on the FAMR model. The FAMR is a Fuzzy ARTMAP (FAM) incremental learning system used for classification and probability estimation. During the learning phase, each sample pair is assigned a relevance factor proportional to the importance of that pair. The two algorithms proposed in this paper are: 1) The GA-FAMR algorithm, which is new and consists of two stages: a) In the first stage, we use a genetic algorithm (GA) to optimize the relevances assigned to the training data. This improves the generalization capability of the FAMR. b) In the second stage, we use the optimized relevances to train the FAMR. 2) The Ordered FAMR, which is derived from a known algorithm. Instead of optimizing relevances, it optimizes the order of data presentation using the algorithm of Dagher et al. In our experiments, we compare these two algorithms with an algorithm not based on the FAM, the FS-GA-FNN introduced in [4], [5]. We conclude that when inferring from small training sets, both techniques are efficient in terms of generalization capability and execution time. The computational overhead introduced is compensated by better accuracy. Finally, the proposed techniques are used to predict the biological activities of newly designed potential HIV-1 protease inhibitors.

17.
Path analyses of selection   Cited by: 1 in total (self-citations: 0, citations by others: 1)
Identifying the targets and causal mechanisms of phenotypic selection in natural populations remains an important challenge for evolutionary biologists. Path analysis is a statistical modeling approach that may aid in meeting this challenge. We describe several types of path model that are relevant to the analysis of selection, and review some recent empirical studies that apply path models to issues in pollination biology, phenotypic integration and selection on morphometric and ontogenetic traits. Path analysis may play two roles in the analysis of selection: first, as an exploratory analysis suggesting possible targets of selection, which are then tested by direct experimentation; and second, as a means of evaluating the relative importance of different causal pathways of selection, once the likely targets of selection have been established.

18.
Fine-scale spatial genetic structure (SGS) in natural tree populations is largely a result of restricted pollen and seed dispersal. Understanding the link between limitations to dispersal in gene vectors and SGS is of key interest to biologists and the availability of highly variable molecular markers has facilitated fine-scale analysis of populations. However, estimation of SGS may depend strongly on the type of genetic marker and sampling strategy (of both loci and individuals). To explore sampling limits, we created a model population with simulated distributions of dominant and codominant alleles, resulting from natural regeneration with restricted gene flow. SGS estimates from subsamples (simulating collection and analysis with amplified fragment length polymorphism (AFLP) and microsatellite markers) were correlated with the 'real' estimate (from the full model population). For both marker types, sampling ranges were evident, with lower limits below which estimation was poorly correlated and upper limits above which sampling became inefficient. Lower limits (correlation of 0.9) were 100 individuals, 10 loci for microsatellites and 150 individuals, 100 loci for AFLPs. Upper limits were 200 individuals, five loci for microsatellites and 200 individuals, 100 loci for AFLPs. The limits indicated by simulation were compared with data sets from real species. Instances where sampling effort had been either insufficient or inefficient were identified. The model results should form practical boundaries for studies aiming to detect SGS. However, greater sample sizes will be required in cases where SGS is weaker than for our simulated population, for example, in species with effective pollen/seed dispersal mechanisms.

19.
Comparative analyses of spatial genetic structure (SGS) among species, populations, or cohorts give insight into the genetic consequences of seed dispersal in plants. We analysed SGS of a weedy tree in populations with known and unknown recruitment histories to first establish patterns in populations with single vs. multiple founders, and then to infer possible recruitment scenarios in populations with unknown histories. We analysed SGS in six populations of the colonizing tree Albizia julibrissin Durazz. (Fabaceae) in Athens, Georgia. Study sites included two large populations with multiple, known founders, two small populations with a single, known founder, and two large populations with unknown recruitment histories. Eleven allozyme loci were used to genotype 1385 individuals. Insights about the effects of colonization history from the SGS analyses were obtained from correlograms and Sp statistics. Distinct differences in patterns of SGS were identified between populations with multiple founders vs. a single founder. We observed significant, positive SGS, which decayed with increasing distance in the populations with multiple colonists, but little to no SGS in populations founded by one colonist. Because relatedness among individuals is estimated relative to a local reference population, which usually consists of those individuals sampled in the study population, SGS in populations with high background relatedness, such as those with a single founder, may be obscured. We performed additional analyses using a regional reference population and, in populations with a single founder, detected significant, positive SGS at all distances, indicating that these populations consist of highly related descendants and receive little seed immigration. Subsequent analyses of SGS in size cohorts in the four large study populations showed significant SGS in both juveniles and adults, probably because of a relative lack of intraspecific demographic thinning. SGS in populations of this colonizing tree is pronounced and persistent and is determined by the number and relatedness of founding individuals and adjacent seed sources. Patterns of SGS in populations with known histories may be used to indirectly infer possible colonization scenarios for populations where it is unknown.

20.
To improve the accuracy of tree reconstruction, phylogeneticists are extracting increasingly large multigene data sets from sequence databases. Determining whether a database contains at least k genes sampled from at least m species is an NP-complete problem. However, the skewed distribution of sequences in these databases permits all such data sets to be obtained in reasonable computing times even for large numbers of sequences. We developed an exact algorithm for obtaining the largest multigene data sets from a collection of sequences. The algorithm was then tested on a set of 100,000 protein sequences of green plants and used to identify the largest multigene ortholog data sets having at least 3 genes and 6 species. The distribution of sizes of these data sets forms a hollow curve, and the largest are surprisingly small, ranging from 62 genes by 6 species, to 3 genes by 65 species, with more symmetrical data sets of around 15 taxa by 15 genes. These upper bounds to sequence concatenation have important implications for building the tree of life from large sequence databases.
