Similar Articles

20 similar articles found (search time: 31 ms)
1.
Technologies that have emerged from the genome project have dramatically increased our ability to generate data on the way in which organisms respond to their environments, how they execute their programmes of development and growth, and how these are altered in the development of disease states. However, our ability to analyse these large datasets has not kept pace with our ability to generate them and consequently new strategies must be developed to address the issues associated with their analysis. One approach that we have employed quite successfully is to look at data from microarrays (or proteomics or metabolomics experiments) not as independent datasets, but rather as elements of a much larger body of biological information across various scales that must be integrated with, and interpreted within, the context of such ancillary data. Here we outline the general approach and provide three examples from published studies of the way in which we have applied this strategy.

2.

Missing values occur widely in mass spectrometry metabolomic datasets and can originate from a number of sources, both technical and biological. Currently, little is known about these data: about their distributions across datasets, the need (or not) to consider them in the data processing pipeline, and, most importantly, the optimal way of assigning them values prior to univariate or multivariate data analysis. Here, we address all of these issues using direct infusion Fourier transform ion cyclotron resonance mass spectrometry data. We show that missing data are widespread, accounting for ca. 20% of data points and affecting up to 80% of all variables, and that they do not occur randomly but rather as a function of signal intensity and mass-to-charge ratio. We demonstrate that missing data estimation algorithms have a major effect on the outcome of data analysis when comparing differences between biological sample groups, including by t-test, ANOVA and principal component analysis. Furthermore, results varied significantly across the eight algorithms that we assessed for their ability to impute known, but labelled as missing, entries. Based on all of our findings we identified k-nearest neighbour imputation (KNN) as the optimal missing value estimation approach for our direct infusion mass spectrometry datasets. The wider significance of this study, however, is that it highlights the importance of missing metabolite levels in the data processing pipeline and offers an approach to identify optimal ways of treating missing data in metabolomics experiments.
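The k-nearest-neighbour imputation favoured above can be sketched in a few lines. This is a minimal illustration of the general technique, not the authors' actual pipeline: distances between samples are computed over mutually observed features, and each missing value is filled with the mean of that feature across the k closest samples that observe it. The toy matrix and the choice of k are invented for the example.

```python
import numpy as np

def knn_impute(X, k=3):
    """Fill NaNs in each row using its k nearest rows (RMS distance over
    mutually observed columns), averaging the neighbours' values."""
    X = np.asarray(X, dtype=float)
    out = X.copy()
    n = X.shape[0]
    for i in range(n):
        miss = np.isnan(X[i])
        if not miss.any():
            continue
        dists = np.full(n, np.inf)
        for j in range(n):
            if j == i:
                continue
            shared = ~np.isnan(X[i]) & ~np.isnan(X[j])
            if shared.any():
                dists[j] = np.sqrt(np.mean((X[i, shared] - X[j, shared]) ** 2))
        order = np.argsort(dists)
        for c in np.where(miss)[0]:
            # k nearest neighbours that actually observe column c
            donors = [j for j in order
                      if np.isfinite(dists[j]) and not np.isnan(X[j, c])][:k]
            if donors:
                out[i, c] = np.mean(X[donors, c])
    return out

# toy metabolite intensity matrix (rows = samples, columns = features)
X = np.array([[1.0, 2.0, 3.0],
              [1.1, np.nan, 3.1],
              [0.9, 2.2, 2.9],
              [5.0, 9.0, 1.0]])
X_filled = knn_impute(X, k=2)
```

Here the missing entry in the second sample is filled from its two nearest complete neighbours, which are far closer to it than the fourth, outlying sample.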

3.

4.
5.

Background

Large-scale gene expression studies have not yielded the expected insight into genetic networks that control complex processes. These anticipated discoveries have been limited not by technology, but by a lack of effective strategies to investigate the data in a manageable and meaningful way. Previous work suggests that using a pre-determined seed-network of gene relationships to query large-scale expression datasets is an effective way to generate candidate genes for further study and network expansion or enrichment. Based on the evolutionary conservation of gene relationships, we test the hypothesis that a seed network derived from studies of retinal cell determination in the fly, Drosophila melanogaster, will be an effective way to identify novel candidate genes for their role in mouse retinal development.

Methodology/Principal Findings

Our results demonstrate that a number of gene relationships regulating retinal cell differentiation in the fly are identifiable as pairwise correlations between genes from developing mouse retina. In addition, we demonstrate that our extracted seed-network of correlated mouse genes is an effective tool for querying datasets and provides a context to generate hypotheses. Our query identified 46 genes correlated with our extracted seed-network members. Approximately 54% of these candidates had been previously linked to the developing brain and 33% had been previously linked to the developing retina. Five of six candidate genes investigated further were validated by experiments examining spatial and temporal protein expression in the developing retina.

Conclusions/Significance

We present an effective strategy for pursuing a systems biology approach that utilizes an evolutionary comparative framework between two model organisms, fly and mouse. Future implementation of this strategy will be useful to determine the extent of network conservation, not just gene conservation, between species and will facilitate the use of prior biological knowledge to develop rational systems-based hypotheses.

6.
7.
The ability to generate large molecular datasets for phylogenetic studies benefits biologists, but such data expansion introduces numerous analytical problems. A typical molecular phylogenetic study implicitly assumes that sequences evolve under stationary, reversible and homogeneous conditions, but this assumption is often violated in real datasets. When an analysis of large molecular datasets results in unexpected relationships, it often reflects violation of phylogenetic assumptions, rather than a correct phylogeny. Molecular evolutionary phenomena such as base compositional heterogeneity and among‐site rate variation are known to affect phylogenetic inference, resulting in incorrect phylogenetic relationships. The ability of methods to overcome such bias has not been measured on real and complex datasets. We investigated how base compositional heterogeneity and among‐site rate variation affect phylogenetic inference in the context of a mitochondrial genome phylogeny of the insect order Coleoptera. We show statistically that our dataset is affected by base compositional heterogeneity regardless of how the data are partitioned or recoded. Among‐site rate variation is shown by comparing topologies generated using models of evolution with and without a rate variation parameter in a Bayesian framework. When compared for their effectiveness in dealing with systematic bias, standard phylogenetic methods tend to perform poorly, and parsimony without any data transformation performs worst. Two methods designed specifically to overcome systematic bias, LogDet and a Bayesian method implementing variable composition vectors, can overcome some level of base compositional heterogeneity, but are still affected by among‐site rate variation. A large degree of variation in both noise and phylogenetic signal among all three codon positions is observed. We caution and argue that more data exploration is imperative, especially when many genes are included in an analysis.
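The standard statistical check for base compositional heterogeneity is a chi-square test of nucleotide counts across taxa against the pooled frequencies. The sketch below illustrates that idea on an invented count table; it is not the paper's exact test, and the toy numbers are chosen only to show a GC-rich outlier inflating the statistic.

```python
import numpy as np

def composition_chi2(counts):
    """Chi-square statistic for base-compositional homogeneity across taxa.

    counts: taxa x 4 matrix of A, C, G, T counts. Under homogeneity every
    taxon shares the pooled base frequencies; a large statistic (against
    (n_taxa - 1) * 3 degrees of freedom) flags heterogeneity.
    """
    counts = np.asarray(counts, dtype=float)
    row = counts.sum(axis=1, keepdims=True)
    col = counts.sum(axis=0, keepdims=True)
    expected = row * col / counts.sum()
    stat = np.sum((counts - expected) ** 2 / expected)
    dof = (counts.shape[0] - 1) * (counts.shape[1] - 1)
    return stat, dof

# two taxa with identical composition and one GC-rich outlier (A, C, G, T)
counts = np.array([[250, 250, 250, 250],
                   [250, 250, 250, 250],
                   [100, 400, 400, 100]])
stat, dof = composition_chi2(counts)
```

A homogeneous table would give a statistic near zero; the outlier taxon here drives it far above any conventional critical value for 6 degrees of freedom.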

8.
Malaria parasites must undergo a round of sexual reproduction in the blood meal of a mosquito vector to be transmitted between hosts. Developing a transmission-blocking intervention to prevent parasites from mating is a major goal of biomedicine, but its effectiveness could be compromised if parasites can compensate by simply adjusting their sex allocation strategies. Recently, the application of evolutionary theory for sex allocation has been supported by experiments demonstrating that malaria parasites adjust their sex ratios in response to infection genetic diversity, precisely as predicted. Theory also predicts that parasites should adjust sex allocation in response to host immunity. Whilst data are supportive, the assumptions underlying this prediction - that host immune responses have differential effects on the mating ability of males and females - have not yet been tested. Here, we combine experimental work with theoretical models in order to investigate whether the development and fertility of male and female parasites is affected by innate immune factors and develop new theory to predict how parasites' sex allocation strategies should evolve in response to the observed effects. Specifically, we demonstrate that reactive nitrogen species impair gametogenesis of males only, but reduce the fertility of both male and female gametes. In contrast, tumour necrosis factor-α does not influence gametogenesis in either sex but impairs zygote development. Therefore, our experiments demonstrate that immune factors have complex effects on each sex, ranging from reducing the ability of gametocytes to develop into gametes, to affecting the viability of offspring. We incorporate these results into theory to predict how the evolutionary trajectories of parasite sex ratio strategies are shaped by sex differences in gamete production, fertility and offspring development. We show that medical interventions targeting offspring development are more likely to be 'evolution-proof' than interventions directed at killing males or females. Given the drive to develop medical interventions that interfere with parasite mating, our data and theoretical models have important implications.

9.
MOTIVATION: Protein-protein interactions have proved to be a valuable starting point for understanding the inner workings of the cell. Computational methodologies have been built which both predict interactions and use interaction datasets in order to predict other protein features. Such methods require gold standard positive (GSP) and negative (GSN) interaction sets. Here we examine and demonstrate the usefulness of homologous interactions in predicting good quality positive and negative interaction datasets. RESULTS: We generate GSP interaction sets as subsets from experimental data using only interaction and sequence information. We can therefore produce sets for several species (many of which at present have no identified GSPs). Comprehensive error rate testing demonstrates the power of the method. We also show how the use of our datasets significantly improves the predictive power of algorithms for interaction prediction and function prediction. Furthermore, we generate GSN interaction sets for yeast and examine the use of homology along with other protein properties such as localization, expression and function. Using a novel method to assess the accuracy of a negative interaction set, we find that the best single selector for negative interactions is a lack of co-function. However, an integrated method using all the characteristics shows significant improvement over any current method for identifying GSN interactions. The nature of homologous interactions is also examined and we demonstrate that interologs are found more commonly within species than across species. CONCLUSION: GSP sets built using our homologous verification method are demonstrably better than standard sets in terms of predictive ability. We can build such GSP sets for several species. When generating GSNs we show a combination of protein features and lack of homologous interactions gives the highest quality interaction sets. 
AVAILABILITY: GSP and GSN datasets for all the studied species can be downloaded from http://www.stats.ox.ac.uk/~deane/HPIV.

10.
Ethics education has now been raised to a professional level, and in professional fields such as engineering and medicine it is a required part of the curriculum. Many science curricula, however, still do not include it as a requirement. This raises a question: is science a professional curriculum? And if so, should scientists such as zoologists be familiar with the ethical codes and standards governing their work? Zoologists are sensitive to issues exposed in medicine, including how we treat animals and how, or whether, we pursue genetic engineering. From an ethical standpoint, however, the implementation of ethics education is more practical than either of these concerns. This paper develops these points further and assesses the need for, and feasibility of, adding ethics education to science curricula. In today's society, animal scientists are respected professionals who face daily decisions that can profoundly affect our living environment. For this reason, animal scientists must command ethical standards and be capable of making decisions consistent with them; only then can the teaching of zoology ensure the sustained professional development of zoologists.

11.
Relaxed phylogenetics and dating with confidence
In phylogenetics, the unrooted model of phylogeny and the strict molecular clock model are two extremes of a continuum. Despite their dominance in phylogenetic inference, it is evident that both are biologically unrealistic and that the real evolutionary process lies between these two extremes. Fortunately, intermediate models employing relaxed molecular clocks have been described. These models open the gate to a new field of “relaxed phylogenetics.” Here we introduce a new approach to performing relaxed phylogenetic analysis. We describe how it can be used to estimate phylogenies and divergence times in the face of uncertainty in evolutionary rates and calibration times. Our approach also provides a means for measuring the clocklikeness of datasets and comparing this measure between different genes and phylogenies. We find no significant rate autocorrelation among branches in three large datasets, suggesting that autocorrelated models are not necessarily suitable for these data. In addition, we place these datasets on the continuum of clocklikeness between a strict molecular clock and the alternative unrooted extreme. Finally, we present analyses of 102 bacterial, 106 yeast, 61 plant, 99 metazoan, and 500 primate alignments. From these we conclude that our method is phylogenetically more accurate and precise than the traditional unrooted model while adding the ability to infer a timescale to evolution.

12.
Two-body inter-residue contact potentials for proteins have often been extracted and extensively used for threading. Here, we have developed a new scheme to derive four-body contact potentials as a way to consider protein interactions in a more cooperative model. We use several datasets of protein native structures to demonstrate that around 500 chains are sufficient to provide a good estimate of these four-body contact potentials by obtaining convergent threading results. We also have deliberately chosen two sets of protein native structures differing in resolution, one with all chains' resolution better than 1.5 Å and the other with 94.2% of the structures having a resolution worse than 1.5 Å, to investigate whether potentials from well-refined protein datasets perform better in threading. However, potentials from well-refined proteins did not generate statistically significant better threading results. Our four-body contact potentials can discriminate well between native structures and partially unfolded or deliberately misfolded structures. Compared with another set of four-body contact potentials derived by using a Delaunay tessellation algorithm, our four-body contact potentials appear to offer a better characterization of the interactions between backbones and side chains and provide better threading results, somewhat complementary to those found using other potentials.
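Knowledge-based contact potentials of this kind are conventionally extracted in inverse-Boltzmann form, E = -ln(f_obs / f_exp). The sketch below shows that general recipe for residue quadruplets; it is not the paper's scheme. In particular, the reference state used here (the product of single-residue background frequencies, i.e. an independence assumption) and the toy data are illustrative choices only.

```python
import math
from collections import Counter

def contact_potentials(observed_quads, background_freqs):
    """Inverse-Boltzmann potential for residue quadruplets:
    E(quad) = -ln(f_obs / f_exp), where f_exp is the product of
    single-residue background frequencies. More negative = more favourable."""
    counts = Counter(observed_quads)
    total = sum(counts.values())
    potentials = {}
    for quad, c in counts.items():
        f_obs = c / total
        f_exp = 1.0
        for res in quad:
            f_exp *= background_freqs[res]
        potentials[quad] = -math.log(f_obs / f_exp)
    return potentials

# toy counts: a hydrophobic quadruplet seen far more often than a glycine one
bg = {r: 0.2 for r in "ALVIG"}  # uniform background over five residue types
quads = [("A", "L", "V", "I")] * 9 + [("G", "G", "G", "G")]
potentials = contact_potentials(quads, bg)
```

The frequently observed quadruplet receives the lower (more favourable) energy, which is the property a threading score exploits.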

13.
Microbiome studies are often limited by a lack of statistical power due to small sample sizes and a large number of features. This problem is exacerbated in correlative studies of multi-omic datasets. Statistical power can be increased by finding and summarizing modules of correlated observations, which is one dimensionality reduction method. Additionally, modules provide biological insight as correlated groups of microbes can have relationships among themselves. To address these challenges, we developed SCNIC: Sparse Cooccurrence Network Investigation for compositional data. SCNIC is open-source software that can generate correlation networks and detect and summarize modules of highly correlated features. Modules can be formed using either the Louvain Modularity Maximization (LMM) algorithm or a Shared Minimum Distance algorithm (SMD) that we newly describe here and relate to LMM using simulated data. We applied SCNIC to two published datasets and we achieved increased statistical power and identified microbes that not only differed across groups, but also correlated strongly with each other, suggesting shared environmental drivers or cooperative relationships among them. SCNIC provides an easy way to generate correlation networks, identify modules of correlated features and summarize them for downstream statistical analysis. Although SCNIC was designed considering properties of microbiome data, such as compositionality and sparsity, it can be applied to a variety of data types including metabolomics data and used to integrate multiple data types. SCNIC allows for the identification of functional microbial relationships at scale while increasing statistical power through feature reduction.
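The core idea of module-based feature reduction can be shown compactly. The sketch below is a simplified stand-in for SCNIC, not its LMM or SMD algorithms: features become nodes, edges connect pairs whose correlation exceeds a threshold, and modules are the connected components of that graph. The feature names and threshold are invented for the example.

```python
import numpy as np

def correlation_modules(X, names, r_min=0.8):
    """Group features (columns of X) whose pairwise Pearson correlation
    exceeds r_min, via connected components found with union-find."""
    X = np.asarray(X, dtype=float)
    p = X.shape[1]
    R = np.corrcoef(X, rowvar=False)
    parent = list(range(p))

    def find(a):
        while parent[a] != a:
            parent[a] = parent[parent[a]]  # path halving
            a = parent[a]
        return a

    for i in range(p):
        for j in range(i + 1, p):
            if R[i, j] > r_min:
                parent[find(i)] = find(j)

    groups = {}
    for i in range(p):
        groups.setdefault(find(i), []).append(names[i])
    return sorted(groups.values(), key=len, reverse=True)

# toy abundance table: features otuA and otuB track each other, otuC does not
rng = np.random.default_rng(0)
a = rng.normal(size=50)
X = np.column_stack([a, a + rng.normal(scale=0.05, size=50), rng.normal(size=50)])
modules = correlation_modules(X, ["otuA", "otuB", "otuC"], r_min=0.8)
```

Downstream tests would then run on one summarized value per module (for example the per-sample sum of a module's features) instead of on every feature, which is where the power gain comes from.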

14.
With growing computational capabilities of parallel machines, scientific simulations are being performed at finer spatial and temporal scales, leading to a data explosion. The growing sizes are making it extremely hard to store, manage, disseminate, analyze, and visualize these datasets, especially as neither the memory capacity of parallel machines, memory access speeds, nor disk bandwidths are increasing at the same rate as the computing power. Sampling can be an effective technique to address the above challenges, but it is extremely important to ensure that dataset characteristics are preserved, and the loss of accuracy is within acceptable levels. In this paper, we address the data explosion problems by developing a novel sampling approach, and implementing it in a flexible system that supports server-side sampling and data subsetting. We observe that to allow subsetting over scientific datasets, data repositories are likely to use an indexing technique. Among these techniques, we see that bitmap indexing can not only effectively support subsetting over scientific datasets, but can also help create samples that preserve both value and spatial distributions over scientific datasets. We have developed algorithms for using bitmap indices to sample datasets. We have also shown how only a small amount of additional metadata stored with bitvectors can help assess loss of accuracy with a particular subsampling level. Some of the other properties of this novel approach include: (1) sampling can be flexibly applied to a subset of the original dataset, which may be specified using a value-based and/or a dimension-based subsetting predicate, and (2) no data reorganization is needed, once bitmap indices have been generated. We have extensively evaluated our method with different types of datasets and applications, and demonstrated the effectiveness of our approach.
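The essence of bitmap-guided, distribution-preserving sampling can be sketched briefly. This is an illustration of the idea, not the paper's system: one bitvector is built per value bin, and sampling the same fraction of set bits from each bin preserves the value distribution. The bin edges and sampling fraction are invented for the example.

```python
import numpy as np

def build_bitmap_index(values, bins):
    """One boolean bitvector per value bin: bit i is set iff record i
    falls into that bin."""
    idx = np.digitize(values, bins)
    return {b: idx == b for b in np.unique(idx)}

def stratified_sample(index, fraction, seed=0):
    """Draw `fraction` of the records within each bin's bitvector, so the
    sample preserves the value distribution of the full dataset."""
    rng = np.random.default_rng(seed)
    chosen = []
    for bitvec in index.values():
        rows = np.flatnonzero(bitvec)
        k = max(1, int(round(fraction * len(rows))))
        chosen.append(rng.choice(rows, size=k, replace=False))
    return np.sort(np.concatenate(chosen))

# toy dataset: 80 low-valued records and 20 high-valued records
values = np.concatenate([np.full(80, 1.0), np.full(20, 10.0)])
index = build_bitmap_index(values, bins=[0.0, 5.0, 20.0])
sample = stratified_sample(index, fraction=0.1)
```

Because the index already partitions records by value, no data reorganization is needed to draw such a sample, which mirrors property (2) claimed above.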

15.
The increasing ability to generate large-scale, quantitative proteomic data has brought with it the challenge of analyzing such data to discover the sequence elements that underlie systems-level protein behavior. Here we show that short, linear protein motifs can be efficiently recovered from proteome-scale datasets such as sub-cellular localization, molecular function, half-life, and protein abundance data using an information theoretic approach. Using this approach, we have identified many known protein motifs, such as phosphorylation sites and localization signals, and discovered a large number of candidate elements. We estimate that ~80% of these are novel predictions in that they do not match a known motif in both sequence and biological context, suggesting that post-translational regulation of protein behavior is still largely unexplored. These predicted motifs, many of which display preferential association with specific biological pathways and non-random positioning in the linear protein sequence, provide focused hypotheses for experimental validation.
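The information-theoretic idea can be illustrated with mutual information between motif presence and a categorical protein attribute: motifs whose presence is informative about the attribute score high. This is a minimal sketch of that scoring principle, not the authors' method; the sequences, labels, and the "KRPR" motif are invented for the example.

```python
import math
from collections import Counter

def motif_mutual_information(sequences, labels, motif):
    """Mutual information (in bits) between presence of a short linear
    motif and a categorical attribute such as sub-cellular localization."""
    n = len(sequences)
    present = [motif in s for s in sequences]
    joint = Counter(zip(present, labels))
    p_x = Counter(present)
    p_y = Counter(labels)
    mi = 0.0
    for (x, y), c in joint.items():
        pxy = c / n
        mi += pxy * math.log2(pxy / ((p_x[x] / n) * (p_y[y] / n)))
    return mi

# toy proteome: the motif occurs only in the nuclear proteins
seqs = ["MKKRPR", "AAKRPRG", "MLLLLV", "GAVILM"]
locs = ["nucleus", "nucleus", "membrane", "membrane"]
mi = motif_mutual_information(seqs, locs, "KRPR")
```

With perfect association between motif presence and localization, the score reaches its maximum of one bit for a binary attribute; in practice one would scan all candidate k-mers and rank them by this score.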

16.
Three metrics of species diversity – species richness, the Shannon index and the Simpson index – are still widely used in ecology, despite decades of valid critiques leveled against them. Developing a robust diversity metric has been challenging because, unlike many variables ecologists measure, the diversity of a community often cannot be estimated in an unbiased way based on a random sample from that community. Over the past decade, ecologists have begun to incorporate two important tools for estimating diversity: coverage and Hill diversity. Coverage is a method for equalizing samples that is, on theoretical grounds, preferable to other commonly used methods such as equal-effort sampling, or rarefying datasets to equal sample size. Hill diversity comprises a spectrum of diversity metrics and is based on three key insights. First, species richness and variants of the Shannon and Simpson indices are all special cases of one general equation. Second, richness, Shannon and Simpson can be expressed on the same scale and in units of species. Third, there is no way to eliminate the effect of relative abundance from estimates of any of these diversity metrics, including species richness. Rather, a researcher must choose the relative sensitivity of the metric towards rare and common species, a concept which we describe as 'leverage'. In this paper we explain coverage and Hill diversity, provide guidelines for how to use them together to measure species diversity, and demonstrate their use with examples from our own data. We show why researchers will obtain more robust results when they estimate the Hill diversity of equal-coverage samples, rather than using other methods such as equal-effort sampling or traditional sample rarefaction.
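The "one general equation" behind Hill diversity is the Hill number of order q, qD = (Σ p_i^q)^(1/(1-q)), with the q = 1 limit exp(Shannon entropy); q = 0 recovers richness and q = 2 the inverse Simpson index, with larger q giving more leverage to common species. A minimal implementation (the toy abundances are invented for the example):

```python
import numpy as np

def hill_diversity(counts, q):
    """Hill number of order q: (sum p_i^q)^(1/(1-q)), measured in
    effective numbers of species. q = 0 -> richness, q = 1 -> exp(Shannon
    entropy), q = 2 -> inverse Simpson index."""
    p = np.asarray(counts, dtype=float)
    p = p[p > 0]
    p = p / p.sum()
    if q == 1:
        return float(np.exp(-np.sum(p * np.log(p))))  # limit as q -> 1
    return float(np.sum(p ** q) ** (1.0 / (1.0 - q)))

counts = [50, 30, 15, 4, 1]            # abundances of five species
richness = hill_diversity(counts, 0)   # = 5 species
shannon_d = hill_diversity(counts, 1)  # effective species, Shannon leverage
simpson_d = hill_diversity(counts, 2)  # effective species, Simpson leverage
```

Because all three values are in the same units (effective species), they illustrate the second insight above; for an uneven community the values shrink as q grows, while a perfectly even community gives the same number at every q.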

17.
18.
Reliable analyses can help wildlife managers make good decisions, which are particularly critical for controversial decisions such as wolf (Canis lupus) harvest. Creel and Rotella (2010) recently predicted substantial population declines in Montana wolf populations due to harvest, in contrast to predictions made by Montana Fish, Wildlife and Parks (MFWP). We replicated their analyses considering only those years in which field monitoring was consistent, and we considered the effect of annual variation in recruitment on wolf population growth. Rather than assuming constant rates, we used model selection methods to evaluate and incorporate models of factors driving recruitment and human-caused mortality rates in wolf populations in the Northern Rocky Mountains. Using data from 27 area-years of intensive wolf monitoring, we show that variation in both recruitment and human-caused mortality affect annual wolf population growth rates and that human-caused mortality rates have increased with the sizes of wolf populations. We document that recruitment rates have decreased over time, and we speculate that rates have decreased with increasing population sizes and/or that the ability of current field resources to document recruitment rates has recently become less successful as the number of wolves in the region has increased. Estimates of positive wolf population growth in Montana from our top models are consistent with field observations and estimates previously made by MFWP for 2008–2010, whereas the predictions for declining wolf populations of Creel and Rotella (2010) are not. Familiarity with limitations of raw data, obtained first-hand or through consultation with scientists who collected the data, helps generate more reliable inferences and conclusions in analyses of publicly available datasets. Additionally, development of efficient monitoring methods for wolves is a pressing need, so that analyses such as ours will be possible in future years when fewer resources will be available for monitoring. © 2011 The Wildlife Society.

19.
Community ecology is tasked with the considerable challenge of predicting the structure, and properties, of emerging ecosystems. It requires the ability to understand how and why species interact, as this will allow the development of mechanism‐based predictive models, and as such to better characterize how ecological mechanisms act locally on the existence of inter‐specific interactions. Here we argue that the current conceptualization of species interaction networks is ill‐suited for this task. Instead, we propose that future research must start to account for the intrinsic variability of species interactions, then scale up from here onto complex networks. This can be accomplished simply by recognizing that there exists intra‐specific variability, in traits or properties related to the establishment of species interactions. By shifting the scale towards population‐based processes, we show that this new approach will improve our predictive ability and mechanistic understanding of how species interact over large spatial or temporal scales.

Synthesis: Although species interactions are the backbone of ecological communities, we have little insights on how (and why) they vary through space and time. In this article, we build on existing empirical literature to show that the same species may happen to interact in different ways when their local abundances vary, their trait distribution changes, or when the environment affects either of these factors. We discuss how these findings can be integrated in existing frameworks for the analysis and simulation of species interactions.

20.
Shadforth I, Crowther D, Bessant C. Proteomics 2005, 5(16): 4082-4095
Current proteomics experiments can generate vast quantities of data very quickly, but this has not been matched by data analysis capabilities. Although there have been a number of recent reviews covering various aspects of peptide and protein identification methods using MS, comparisons of which methods are either the most appropriate for, or the most effective at, their proposed tasks are not readily available. As the need for high-throughput, automated peptide and protein identification systems increases, the creators of such pipelines need to be able to choose algorithms that are going to perform well both in terms of accuracy and computational efficiency. This article therefore provides a review of the currently available core algorithms for PMF, database searching using MS/MS, sequence tag searches and de novo sequencing. We also assess the relative performances of a number of these algorithms. As there is limited reporting of such information in the literature, we conclude that there is a need for the adoption of a system of standardised reporting on the performance of new peptide and protein identification algorithms, based upon freely available datasets. We go on to present our initial suggestions for the format and content of these datasets.
