Similar Articles
20 similar articles found (search time: 18 ms)
1.
Microarrays allow researchers to measure the expression of thousands of genes in a single experiment. Before statistical comparisons can be made, the data must be assessed for quality and normalisation procedures must be applied, of which many have been proposed. Methods of comparing the normalised data are also abundant, and no clear consensus has yet been reached. The purpose of this paper was to compare those methods used by the EADGENE network on a very noisy simulated data set. With the a priori knowledge of which genes are differentially expressed, it is possible to compare the success of each approach quantitatively. Use of an intensity-dependent normalisation procedure was common, as was correction for multiple testing. Most of the variation in performance resulted from differing approaches to data quality and the use of different statistical tests. Very few of the methods used any kind of background correction. A number of approaches achieved a success rate of 95% or above, with relatively small numbers of false positives and negatives. Applying stringent spot selection criteria and elimination of data did not improve the false positive rate and greatly increased the false negative rate. However, most approaches performed well, and it is encouraging that widely available techniques can achieve such good results on a very noisy data set.
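Two steps common to most of the compared pipelines, per-array normalisation and multiple-testing correction, can be sketched in a few lines of Python. This is a minimal illustration: the median-centring step is a deliberately crude stand-in for the intensity-dependent (e.g. loess) procedures the participants actually used, and the function names are ours.

```python
import numpy as np

def median_centre(log_ratios):
    """Crude per-array normalisation: centre the log-ratios at zero."""
    x = np.asarray(log_ratios, dtype=float)
    return x - np.median(x)

def benjamini_hochberg(pvals):
    """Benjamini-Hochberg step-up adjustment for multiple testing."""
    p = np.asarray(pvals, dtype=float)
    m = len(p)
    order = np.argsort(p)
    scaled = p[order] * m / np.arange(1, m + 1)
    # enforce monotonicity from the largest p-value downwards
    adjusted = np.minimum.accumulate(scaled[::-1])[::-1]
    out = np.empty(m)
    out[order] = np.clip(adjusted, 0.0, 1.0)
    return out
```

With four hypothetical p-values, `benjamini_hochberg([0.01, 0.02, 0.03, 0.5])` adjusts the first three to 0.04, leaving all three significant at a 5% false discovery rate.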

2.
Martins H, Villesen P. PLoS ONE 2011, 6(3): e14745

Background

Endogenous retroviruses (ERVs) are genetic fossils of ancient retroviral integrations that remain in the genome of many organisms. Most loci are rendered non-functional by mutations, but several intact retroviral genes are known in mammalian genomes. Some have been adopted by the host species, while the beneficial roles of others remain unclear. Besides the obvious possible immunogenic impact from transcribing intact viral genes, endogenous retroviruses have also become an interesting and useful tool to study phylogenetic relationships. The determination of the integration time of these viruses has been based upon the assumption that both 5′ and 3′ Long Terminal Repeat (LTR) sequences are identical at the time of integration, but evolve separately afterwards. Previous approaches have used either a constant evolutionary rate or a range of rates for these viral loci, and only single-species data. Here we show the advantages of using different approaches.

Results

We show that there are strong advantages in using multiple species data and state-of-the-art phylogenetic analysis. We incorporate both simple phylogenetic information and Markov chain Monte Carlo (MCMC) methods to date the integrations of these viruses, based on a relaxed molecular clock over a Bayesian phylogeny model, and apply them to several selected ERV sequences in primates. These methods treat each ERV locus as having a distinct evolutionary rate for each LTR, and make use of consensual speciation time intervals between primates to calibrate the relaxed molecular clocks.

Conclusions

The use of a fixed rate produces results that vary considerably with ERV family and the actual evolutionary rate of the sequence, and should be avoided whenever multi-species phylogenetic data are available. For genome-wide studies, the simple phylogenetic approach constitutes a better alternative, while still being computationally feasible.
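For contrast with the relaxed-clock machinery, the classical fixed-rate dating that the paper argues against fits in a single function: the two LTRs are identical at integration and diverge independently afterwards, so the observed LTR-LTR divergence accumulates along two lineages. The function name and the numbers in the example are illustrative, not taken from the paper.

```python
def ltr_integration_age(divergence, rate_per_site_per_myr):
    """
    Fixed-rate estimate of an ERV integration age (in Myr).
    The 5' and 3' LTRs are identical at integration and then diverge
    independently, so the observed pairwise LTR divergence d accrues
    along two branches: T = d / (2 * r).
    """
    return divergence / (2.0 * rate_per_site_per_myr)

# e.g. 2.6% LTR divergence at 0.13% substitutions/site/Myr -> 10 Myr
age = ltr_integration_age(0.026, 0.0013)
```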

3.
Fusion genes formed by chromosomal rearrangements are common drivers of cancer. Recent innovations in the field of next-generation sequencing (NGS) have seen a dynamic shift from traditional fusion detection approaches, such as visual characterization by fluorescence, to more precise multiplexed methods. There are many different NGS-based approaches to fusion gene detection, and deciding on the most appropriate method can be difficult. Beyond the experimental approach, consideration needs to be given to factors such as the ease of implementation, processing time, associated costs, and the level of expertise required for data analysis. Here, the different NGS-based methods for fusion gene detection, the basic principles underlying the techniques, and the benefits and limitations of each approach are reviewed. This article concludes with a discussion of how NGS will impact fusion gene detection in a clinical context and where the next innovations are emerging.

4.
The investigation of associations between rare genetic variants and diseases or phenotypes has two goals. Firstly, the identification of which genes or genomic regions are associated, and secondly, discrimination of associated variants from background noise within each region. Over the last few years, many new methods have been developed which associate genomic regions with phenotypes. However, classical methods for high-dimensional data have received little attention. Here we investigate whether several classical statistical methods for high-dimensional data: ridge regression (RR), principal components regression (PCR), partial least squares regression (PLS), a sparse version of PLS (SPLS), and the LASSO are able to detect associations with rare genetic variants. These approaches have been extensively used in statistics to identify the true associations in data sets containing many predictor variables. Using genetic variants identified in three genes that were Sanger sequenced in 1998 individuals, we simulated continuous phenotypes under several different models, and we show that these feature selection and feature extraction methods can substantially outperform several popular methods for rare variant analysis. Furthermore, these approaches can identify which variants are contributing most to the model fit, and therefore both goals of rare variant analysis can be achieved simultaneously with the use of regression regularization methods. These methods are briefly illustrated with an analysis of adiponectin levels and variants in the ADIPOQ gene.
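As a sketch of how a regularization method can serve both goals at once (region-level association and ranking of the variants driving the fit), here is closed-form ridge regression on toy rare-variant data. The dimensions, allele frequency, and effect sizes below are illustrative, not those of the ADIPOQ analysis.

```python
import numpy as np

# Toy data: 500 individuals, 50 rare variants, 3 of them causal
rng = np.random.default_rng(0)
n, p = 500, 50
X = (rng.random((n, p)) < 0.05).astype(float)   # carrier indicators
beta = np.zeros(p)
beta[:3] = 3.0                                  # causal effect sizes
y = X @ beta + rng.normal(size=n)               # continuous phenotype

def ridge(X, y, lam):
    """Closed-form ridge regression: (X'X + lam*I)^{-1} X'y."""
    return np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)

coef = ridge(X, y, lam=1.0)
# Variants contributing most to the model fit have the largest |coef|
ranking = np.argsort(-np.abs(coef))
```

In this toy setting the three causal variants dominate the ranking, which is the sense in which a single regularized fit addresses both goals simultaneously.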

5.
MOTIVATION: Gene expression experiments provide a fast and systematic way to identify disease markers relevant to clinical care. In this study, we address the problem of robust identification of differentially expressed genes from microarray data. Differentially expressed genes, or discriminator genes, are genes with significantly different expression in two user-defined groups of microarray experiments. We compare three model-free approaches: (1) nonparametric t-test, (2) Wilcoxon (or Mann-Whitney) rank sum test, and (3) a heuristic method based on high Pearson correlation to a perfectly differentiating gene ('ideal discriminator method'). We systematically assess the performance of each method based on simulated and biological data under varying noise levels and p-value cutoffs. RESULTS: All methods exhibit very low false positive rates and identify a large fraction of the differentially expressed genes in simulated data sets with noise level similar to that of actual data. Overall, the rank sum test appears most conservative, which may be advantageous when the computationally identified genes need to be tested biologically. However, if a more inclusive list of markers is desired, a higher p-value cutoff or the nonparametric t-test may be appropriate. When applied to data from lung tumor and lymphoma data sets, the methods identify biologically relevant differentially expressed genes that allow clear separation of groups in question. Thus the methods described and evaluated here provide a convenient and robust way to identify differentially expressed genes for further biological and clinical analysis.
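Of the three model-free tests compared, the rank sum test is the easiest to reproduce. A sketch using scipy on simulated two-group data (the gene counts, replicate numbers, and shift sizes here are ours, not the paper's):

```python
import numpy as np
from scipy.stats import ranksums

# 100 genes, 8 arrays per group; the first 10 genes are truly shifted
rng = np.random.default_rng(1)
group_a = rng.normal(0.0, 1.0, size=(100, 8))
group_b = rng.normal(0.0, 1.0, size=(100, 8))
group_b[:10] += 3.0

# One Wilcoxon rank-sum p-value per gene
pvals = np.array([ranksums(a, b).pvalue
                  for a, b in zip(group_a, group_b)])
called = np.flatnonzero(pvals < 0.01)   # genes called differentially expressed
```

With a 3-standard-deviation shift, nearly all of the ten true positives are recovered at this cutoff, while the discrete null distribution of the rank sum statistic keeps the false positive count low, consistent with the "most conservative" characterisation above.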

6.
With tens of billions of dollars spent each year on the development of drugs to treat human diseases, and with fewer and fewer applications for investigational new drugs filed each year despite this massive spending, questions now abound on what changes to the drug discovery paradigm can be made to achieve greater success. The high rate of failure of drug candidates in clinical development, where the great majority of these drugs fail due to lack of efficacy, speaks directly to the need for more innovative approaches to study the mechanisms of disease and drug discovery. Here we review systems biology approaches that have been devised over the last several years to understand the biology of disease at a more holistic level. By integrating a diversity of data like DNA variation, gene expression, protein–protein interaction, DNA–protein binding, and other types of molecular phenotype data, more comprehensive networks of genes both within and between tissues can be constructed to paint a more complete picture of the molecular processes underlying physiological states associated with disease. These more integrative, systems-level methods lead to networks that are demonstrably predictive, which in turn provides a deeper context within which single genes operate such as those identified from genome-wide association studies or those targeted for therapeutic intervention. The more comprehensive views of disease that result from these methods have the potential to dramatically enhance the way in which novel drug targets are identified and developed, ultimately increasing the probability of success for taking new drugs through clinical development. We highlight a number of the integrative approaches via examples that have resulted not only in the identification of novel genes for diabetes and cardiovascular disease, but in more comprehensive networks as well that describe the context in which the disease genes operate.

7.
Do LH, Bier E. Bioinformation 2011, 6(2): 83-85
Redundancy among sequence identifiers is a recurring problem in bioinformatics. Here, we present a rapid and efficient method of fingerprinting identifiers to ascertain whether two or more aliases are identical. A number of tools and approaches have been developed to resolve differing names for the same genes and proteins; however, these methods each have their own limitations associated with their various goals. We have taken a different approach to the aliasing problem by simplifying the way aliases are stored and curated, with the objective of simultaneously achieving speed and flexibility. Our approach (Booly-hashing) is to link identifiers with their corresponding hash keys derived from unique fingerprints such as gene or protein sequences. This tool has proven invaluable for designing a new data integration platform known as Booly, and has wide applicability to situations in which a dedicated efficient aliasing system is required. Compared with other aliasing techniques, the Booly-hashing methodology provides 1) reduced run time complexity, 2) increased flexibility (aliasing of other data types, e.g. pharmaceutical drugs), 3) no required assumptions regarding gene clusters or hierarchies, and 4) simplicity in data addition, updating, and maintenance. The new Booly-hashing aliasing model has been incorporated as a central component of the Booly data integration platform we have recently developed and should be broadly applicable to other situations in which an efficient, streamlined aliasing system is required. This aliasing tool and database, which allows users to quickly group the same genes and proteins together, can be accessed at: http://booly.ucsd.edu/alias. AVAILABILITY: The database is available for free at http://booly.ucsd.edu/alias.
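The core idea, keying aliases by a hash of a canonicalised sequence rather than by name, can be sketched in a few lines. This is our reconstruction of the general scheme, not the Booly implementation itself; in particular, the use of SHA-256 and the canonicalisation rules are our assumptions.

```python
import hashlib

def fingerprint(sequence):
    """Hash a canonicalised sequence into a fixed-length key."""
    canon = "".join(sequence.split()).upper()   # strip whitespace, uppercase
    return hashlib.sha256(canon.encode("ascii")).hexdigest()

# Alias table: hash key -> set of identifiers sharing that fingerprint
aliases = {}

def register(name, sequence):
    aliases.setdefault(fingerprint(sequence), set()).add(name)

register("geneA", "acgtacgt")
register("GENE-A-alias", "ACGT ACGT\n")   # same sequence, different alias
```

Because lookup is by hash key, deciding whether two aliases name the same entity is a constant-time dictionary operation, with no gene-cluster or hierarchy assumptions, which is the flexibility the abstract emphasises.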

8.
Over the last decade, many analytical methods and tools have been developed for microarray data. The detection of differentially expressed genes (DEGs) among different treatment groups is often a primary purpose of microarray data analysis. In addition, association studies investigating the relationship between genes and a phenotype of interest such as survival time are also popular in microarray data analysis. Phenotype association analysis provides a list of phenotype-associated genes (PAGs). However, it is sometimes necessary to identify genes that are both DEGs and PAGs. We consider the joint identification of DEGs and PAGs in microarray data analyses. The first approach we used was a naïve approach that detects DEGs and PAGs separately and then identifies the genes in an intersection of the list of PAGs and DEGs. The second approach we considered was a hierarchical approach that detects DEGs first and then chooses PAGs from among the DEGs, or vice versa. In this study, we propose a new model-based approach for the joint identification of DEGs and PAGs. Unlike the previous two-step approaches, the proposed method identifies genes that are DEGs and PAGs simultaneously. This method uses standard regression models but adopts a different null hypothesis from that of ordinary regression models, which allows us to perform joint identification in one step. The proposed model-based methods were evaluated using experimental data and simulation studies. The proposed methods were used to analyze a microarray experiment in which the main interest lies in detecting genes that are both DEGs and PAGs, where DEGs are identified between two diet groups and PAGs are associated with four phenotypes reflecting the expression of leptin, adiponectin, insulin-like growth factor 1, and insulin. The model-based approaches identified a larger number of genes that are both DEGs and PAGs than the other methods. Simulation studies showed that they have more power than other methods.
Through analysis of data from experimental microarrays and simulation studies, the proposed model-based approach was shown to provide a more powerful result than the naïve approach and the hierarchical approach. Since our approach is model-based, it is very flexible and can easily handle different types of covariates.
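The flavour of a one-step joint test can be conveyed with an ordinary partial F-test of the null hypothesis that a gene has neither a group effect nor a phenotype association. This is an illustrative stand-in, not the authors' exact model-based formulation; all names and simulated effect sizes are ours.

```python
import numpy as np
from scipy import stats

def joint_test(expr, group, phenotype):
    """
    Joint one-step test of H0: no group effect AND no phenotype
    association, via a partial F-test of expr ~ 1 + group + phenotype
    against the intercept-only model.
    """
    n = len(expr)
    X_full = np.column_stack([np.ones(n), group, phenotype])
    X_null = np.ones((n, 1))

    def rss(X):
        beta, *_ = np.linalg.lstsq(X, expr, rcond=None)
        resid = expr - X @ beta
        return resid @ resid

    q, df = 2, n - X_full.shape[1]
    F = ((rss(X_null) - rss(X_full)) / q) / (rss(X_full) / df)
    return stats.f.sf(F, q, df)

# A simulated gene that is both a DEG and a PAG
rng = np.random.default_rng(2)
group = np.repeat([0.0, 1.0], 20)
pheno = rng.normal(size=40)
gene = 2.0 * group + 1.5 * pheno + rng.normal(scale=0.5, size=40)
p = joint_test(gene, group, pheno)
```

A gene rejects this joint null only if the combined fit of group and phenotype explains the expression, which is the one-step analogue of requiring membership in both lists.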

9.
10.
Goal, Scope and Background The paper describes different ecotoxicity effect indicator methods/approaches. The approaches cover three main groups, viz. PNEC approaches, PAF approaches and damage approaches. Ecotoxicity effect indicators used in life cycle impact assessment (LCIA) are typically modelled to the level of impact, indicating the potential impact on 'ecosystem health'. The few existing indicators that are modelled all the way to damage are poorly developed, and even though relevant alternatives from risk assessment exist (e.g. recovery time and mean extinction time), these are unfortunately at a very early stage of development, and only a few attempts have been made to include them in LCIA. Methods The approaches are described and evaluated against a set of assessment criteria comprising compatibility with the methodological requirements of LCIA, environmental relevance, reproducibility, data demand, data availability, quantification of uncertainty, transparency and spatial differentiation. Results and Discussion The results of the evaluation of the two impact approaches (i.e. PNEC and PAF) show both pros and cons for each of them. The assessment factor-based PNEC approaches have a low data demand and use only the lowest data (e.g. lowest NOEC value). Because they were developed for tiered risk assessment, and hence make use of conservative assessment factors, they are not optimal, in their present form, for use in the comparative framework of LCIA, where best estimates are sought. The PAF approaches have a higher data demand but use all data and can be based on effect data (PNEC is no-effect-based), thus making these approaches non-conservative and more suitable for LCIA. However, indiscriminate use of ecotoxicity data tends to make the PAF approaches no more environmentally relevant than the assessment factor-based PNEC approaches. The PAF approaches, however, can at least in theory be linked to damage modelling. 
All the approaches for damage modelling that are included here have a high environmental relevance but very low data availability, apart from the 'media recovery' approach, which depends directly on the fate model. They are all at a very early stage of development. Conclusions, Recommendations and Outlook An analysis of the different PAF approaches shows that the crucial point is the principles and data according to which the hazardous concentration to 50% of the included species (i.e. HC50) is estimated. The ability to calculate many characterisation factors for ecotoxicity is important for this impact category to be included in LCIA in a proper way. However, access to effect data for the relevant chemicals is typically limited. So, besides the coupling to damage modelling, the main challenge within the further development and improvement of ecotoxicity effect indicators is to find an optimal method to estimate HC50 based on little data.
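As a minimal illustration of the HC50 estimation problem raised in the outlook: under a log-normal (or symmetric log-logistic) species sensitivity distribution, the median hazardous concentration reduces to the geometric mean of the species' EC50 values. The function below is a sketch under that distributional assumption; real estimators must additionally cope with the scarce-data situation the abstract describes.

```python
import numpy as np

def hc50(ec50s):
    """
    HC50: the concentration hazardous to 50% of the tested species.
    Under a log-normal species sensitivity distribution the median is
    the geometric mean of the species' EC50 values (same units in/out).
    """
    return 10 ** np.mean(np.log10(np.asarray(ec50s, dtype=float)))
```

For example, species EC50s of 1, 10 and 100 mg/L give an HC50 of 10 mg/L, whereas an arithmetic mean (37 mg/L) would be pulled upward by the least sensitive species.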

11.
Phylogenomic studies aim to build phylogenies from large sets of homologous genes. Such "genome-sized" data require fast methods, because of the typically large numbers of taxa examined. In this framework, distance-based methods are useful for exploratory studies and building a starting tree to be refined by a more powerful maximum likelihood (ML) approach. However, estimating evolutionary distances directly from concatenated genes gives poor topological signal as genes evolve at different rates. We propose a novel method, named super distance matrix (SDM), which follows the same line as average consensus supertree (ACS; Lapointe and Cucumel, 1997) and combines the evolutionary distances obtained from each gene into a single distance supermatrix to be analyzed using a standard distance-based algorithm. SDM deforms the source matrices, without modifying their topological message, to bring them as close as possible to each other; these deformed matrices are then averaged to obtain the distance supermatrix. We show that this problem is equivalent to the minimization of a least-squares criterion subject to linear constraints. This problem has a unique solution which is obtained by resolving a linear system. As this system is sparse, its practical resolution requires O(naka) time, where n is the number of taxa, k the number of matrices, and a < 2, which allows the distance supermatrix to be quickly obtained. Several uses of SDM are proposed, from fast exploratory studies to more accurate approaches requiring heavier computing time. Using simulations, we show that SDM is a relevant alternative to the standard matrix representation with parsimony (MRP) method, notably when the taxa sets of the different genes have low overlap. We also show that SDM can be used to build an excellent starting tree for an ML approach, which both reduces the computing time and increases the topological accuracy. We use SDM to analyze the data set of Gatesy et al. (2002, Syst. Biol. 51: 652-664) that involves 48 genes of 75 placental mammals. The results indicate that these genes have strong rate heterogeneity and confirm the simulation conclusions.

12.
MOTIVATION: There is a very large and growing level of effort toward improving the platforms, experiment designs, and data analysis methods for microarray expression profiling. Along with a growing richness in the approaches there is a growing confusion among most scientists as to how to make objective comparisons and choices between them for different applications. There is a need for a standard framework for the microarray community to compare and improve analytical and statistical methods. RESULTS: We report on a microarray data set comprising 204 in-situ synthesized oligonucleotide arrays, each hybridized with two-color cDNA samples derived from 20 different human tissues and cell lines. Design of the approximately 24 000 60mer oligonucleotides that report approximately 2500 known genes on the arrays, and design of the hybridization experiments, were carried out in a way that supports the performance assessment of alternative data processing approaches and of alternative experiment and array designs. We also propose standard figures of merit for success in detecting individual differential expression changes or expression levels, and for detecting similarities and differences in expression patterns across genes and experiments. We expect this data set and the proposed figures of merit will provide a standard framework for much of the microarray community to compare and improve many analytical and statistical methods relevant to microarray data analysis, including image processing, normalization, error modeling, combining of multiple reporters per gene, use of replicate experiments, and sample referencing schemes in measurements based on expression change. AVAILABILITY/SUPPLEMENTARY INFORMATION: Expression data and supplementary information are available at http://www.rii.com/publications/2003/HE_SDS.htm

13.
Hierarchical Bayes models for cDNA microarray gene expression
cDNA microarrays are used in many contexts to compare mRNA levels between samples of cells. Microarray experiments typically give us expression measurements on 1000-20 000 genes, but with few replicates for each gene. Traditional methods using means and standard deviations to detect differential expression are not satisfactory in this context. A handful of alternative statistics have been developed, including several empirical Bayes methods. In the present paper we present two full hierarchical Bayes models for detecting differential gene expression, of which one (D) describes our microarray data very well. We also compare the full Bayes and empirical Bayes approaches with respect to model assumptions, false discovery rates and computer running time. The proposed models are compared to existing empirical Bayes models in a simulation study and for a set of data (Yuen et al., 2002), where 27 genes have been categorized by quantitative real-time PCR. It turns out that the existing empirical Bayes methods have at least as good performance as the full Bayes ones.
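The empirical Bayes alternatives the paper compares against typically moderate gene-wise variances by shrinking them toward a common prior, which is what rescues mean/standard-deviation statistics when replicates are few. A minimal sketch of that degrees-of-freedom-weighted shrinkage (our notation, not the paper's full Bayes models):

```python
import numpy as np

def moderated_variance(sample_vars, d, d0, s0_sq):
    """
    Empirical-Bayes variance moderation: shrink each gene's sample
    variance (residual df d) toward a prior value s0_sq (prior df d0)
    via a df-weighted average, as used in moderated t-statistics.
    """
    s2 = np.asarray(sample_vars, dtype=float)
    return (d0 * s0_sq + d * s2) / (d0 + d)
```

With d = d0 = 2 and a prior variance of 3, a gene whose sample variance is 1 is moderated to 2; across many genes, the extreme variance estimates that destabilise ordinary t-statistics are pulled toward the prior.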

14.
Analyzing gene expression data in terms of gene sets: methodological issues
MOTIVATION: Many statistical tests have been proposed in recent years for analyzing gene expression data in terms of gene sets, usually from Gene Ontology. These methods are based on widely different methodological assumptions. Some approaches test differential expression of each gene set against differential expression of the rest of the genes, whereas others test each gene set on its own. Also, some methods are based on a model in which the genes are the sampling units, whereas others treat the subjects as the sampling units. This article aims to clarify the assumptions behind different approaches and to indicate a preferential methodology of gene set testing. RESULTS: We identify some crucial assumptions which are needed by the majority of methods. P-values derived from methods that use a model which takes the genes as the sampling unit are easily misinterpreted, as they are based on a statistical model that does not resemble the biological experiment actually performed. Furthermore, because these models are based on a crucial and unrealistic independence assumption between genes, the P-values derived from such methods can be wildly anti-conservative, as a simulation experiment shows. We also argue that methods that competitively test each gene set against the rest of the genes create an unnecessary rift between single gene testing and gene set testing.

15.
MOTIVATION: The field of microarray data analysis is shifting emphasis from methods for identifying differentially expressed genes to methods for identifying differentially expressed gene categories. The latter approaches utilize a priori information about genes to group genes into categories and enhance the interpretation of experiments aimed at identifying expression differences across treatments. While almost all of the existing approaches for identifying differentially expressed gene categories are practically useful, they suffer from a variety of drawbacks. Perhaps most notably, many popular tools are based exclusively on gene-specific statistics that cannot detect many types of multivariate expression change. RESULTS: We have developed a nonparametric multivariate method for identifying gene categories whose multivariate expression distribution differs across two or more conditions. We illustrate our approach and compare its performance to several existing procedures via the analysis of a real data set and a unique data-based simulation study designed to capture the challenges and complexities of practical data analysis. We show that our method has good power for differentiating between differentially expressed and non-differentially expressed gene categories, and we utilize a resampling based strategy for controlling the false discovery rate when testing multiple categories. AVAILABILITY: R code (www.r-project.org) for implementing our approach is available from the first author by request.
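The flavour of a resampling-based category test can be conveyed with a simple permutation scheme. The sum-of-squared-mean-differences statistic below is an illustrative multivariate statistic of our choosing, not the authors' method; it aggregates evidence across all genes in a category rather than testing genes one at a time.

```python
import numpy as np

def category_perm_test(expr, labels, n_perm=499, seed=0):
    """
    Permutation p-value for one gene category (genes x samples matrix,
    binary condition labels). The statistic sums squared between-
    condition mean differences over all genes in the category, so
    coordinated multivariate change is detected, not only large
    single-gene shifts.
    """
    rng = np.random.default_rng(seed)
    labels = np.asarray(labels)

    def stat(lab):
        diff = expr[:, lab == 0].mean(axis=1) - expr[:, lab == 1].mean(axis=1)
        return float(np.sum(diff ** 2))

    observed = stat(labels)
    hits = sum(stat(rng.permutation(labels)) >= observed
               for _ in range(n_perm))
    return (hits + 1) / (n_perm + 1)   # add-one permutation p-value

# Toy category: 10 genes, 6 + 6 samples, condition 1 shifted upwards
rng = np.random.default_rng(3)
labels = np.repeat([0, 1], 6)
expr = rng.normal(size=(10, 12))
expr[:, labels == 1] += 2.0
p = category_perm_test(expr, labels)
```

Running such a test per category and then applying a resampling-based FDR procedure across categories mirrors the overall strategy the abstract describes.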

16.
Various methodological approaches using molecular sequence data have been developed and applied across several fields, including phylogeography, conservation biology, virology and human evolution. The aim of these approaches is to obtain predictive estimates of population history from DNA sequence data that can then be used for hypothesis testing with empirical data. This recent work provides opportunities to evaluate hypotheses of constant population size through time, of population growth or decline, of the rate of growth or decline, and of migration and growth in subdivided populations. At the core of many of these approaches is the extraction of information from the structure of phylogenetic trees to infer the demographic history of a population, and underlying nearly all methods is coalescent theory. With the increasing availability of DNA sequence data, it is important to review the different ways in which information can be extracted from DNA sequence data to estimate demographic parameters.

17.
18.
Hess J, Goldman N. PLoS ONE 2011, 6(8): e22783
Phylogenomic approaches to the resolution of inter-species relationships have become well established in recent years. Often these involve concatenation of many orthologous genes found in the respective genomes followed by analysis using standard phylogenetic models. Genome-scale data promise increased resolution by minimising sampling error, yet are associated with well-known but often inappropriately addressed caveats arising through data heterogeneity and model violation. These can lead to the reconstruction of highly-supported but incorrect topologies. With the aim of obtaining a species tree for 18 species within the ascomycetous yeasts, we have investigated the use of appropriate evolutionary models to address inter-gene heterogeneities and the scalability and validity of supermatrix analysis as the phylogenetic problem becomes more difficult and the number of genes analysed approaches truly phylogenomic dimensions. We have extended a widely-known early phylogenomic study of yeasts by adding additional species to increase diversity and augmenting the number of genes under analysis. We have investigated sophisticated maximum likelihood analyses, considering not only a concatenated version of the data but also partitioned models where each gene constitutes a partition and parameters are free to vary between the different partitions (thereby accounting for variation in the evolutionary processes at different loci). We find considerable increases in likelihood using these complex models, arguing for the need for appropriate models when analyzing phylogenomic data. Using these methods, we were able to reconstruct a well-supported tree for 18 ascomycetous yeasts spanning about 250 million years of evolution.

19.
Microarray experiments are being increasingly used in molecular biology. A common task is to detect genes with differential expression across two experimental conditions, such as two different tissues or the same tissue at two time points of biological development. To take proper account of statistical variability, some statistical approaches based on the t-statistic have been proposed. In constructing the t-statistic, one needs to estimate the variance of gene expression levels. With a small number of replicated array experiments, the variance estimation can be challenging. For instance, although the sample variance is unbiased, it may have large variability, leading to a large mean squared error. For duplicated array experiments, a new approach based on simple averaging has recently been proposed in the literature. Here we consider two more general approaches based on nonparametric smoothing. Our goal is to assess the performance of each method empirically. The three methods are applied to a colon cancer data set containing 2,000 genes. Using two arrays, we compare the variance estimates obtained from the three methods. We also consider their impact on the t-statistics. Our results indicate that the three methods give variance estimates close to each other. Due to its simplicity and generality, we recommend the use of the smoothed sample variance for data with a small number of replicates.
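A minimal version of nonparametric variance smoothing: order genes by average expression and replace each gene's sample variance with a running median over its neighbours, so that information is pooled across genes with similar intensity. The window size and the use of a median are our choices for illustration, not the specific smoothers evaluated in the paper.

```python
import numpy as np

def smoothed_variances(mean_expr, sample_vars, window=50):
    """
    Replace each gene's sample variance with a running median over
    genes of similar average expression, stabilising t-statistics
    when only a few replicates are available.
    """
    order = np.argsort(mean_expr)
    v = np.asarray(sample_vars, dtype=float)[order]
    half = window // 2
    smoothed = np.array([np.median(v[max(0, i - half):i + half + 1])
                         for i in range(len(v))])
    out = np.empty_like(smoothed)
    out[order] = smoothed       # restore the original gene order
    return out
```

The smoothed variances then replace the raw gene-wise variances in the denominator of the t-statistic, trading a little bias for a large reduction in variability when replicate numbers are small.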

20.
Welch JJ. Genetics 2006, 173(2): 821-837
When polymorphism and divergence data are available for multiple loci, extended forms of the McDonald-Kreitman test can be used to estimate the average proportion of the amino acid divergence due to adaptive evolution--a statistic denoted alpha. But such tests are subject to many biases. Most serious is the possibility that high estimates of alpha reflect demographic changes rather than adaptive substitution. Testing for between-locus variation in alpha is one possible way of distinguishing between demography and selection. However, such tests have yielded contradictory results, and their efficacy is unclear. Estimates of alpha from the same model organisms have also varied widely. This study clarifies the reasons for these discrepancies, identifying several method-specific biases in widely used estimators and assessing the power of the methods. As part of this process, a new maximum-likelihood estimator is introduced. This estimator is applied to a newly compiled data set of 115 genes from Drosophila simulans, each with orthologs from D. melanogaster and D. yakuba. In this way, it is estimated that alpha is approximately 0.4 +/- 0.1, a value that does not vary substantially between different loci or over different periods of divergence. The implications of these results are discussed.
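The point estimator at the heart of McDonald-Kreitman-style tests can be written directly from the nonsynonymous/synonymous divergence counts (Dn, Ds) and polymorphism counts (Pn, Ps); under neutrality the two ratios match and alpha is zero. The counts in the example are invented for illustration, and this simple form ignores the method-specific biases the paper analyses.

```python
def mk_alpha(Dn, Ds, Pn, Ps):
    """
    Proportion of amino-acid divergence attributed to positive
    selection in the McDonald-Kreitman framework:
        alpha = 1 - (Ds * Pn) / (Dn * Ps)
    Counts: Dn/Ds = nonsynonymous/synonymous fixed differences,
            Pn/Ps = nonsynonymous/synonymous polymorphisms.
    """
    return 1.0 - (Ds * Pn) / (Dn * Ps)

# Hypothetical locus: excess nonsynonymous divergence -> alpha = 0.75
alpha = mk_alpha(Dn=20, Ds=10, Pn=5, Ps=10)
```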
