Similar Articles
Found 20 similar articles.
1.

Background  

Uncertainty often affects molecular biology experiments and data for different reasons. Heterogeneity of gene or protein expression within the same tumor tissue is an example of biological uncertainty that should be taken into account when molecular markers are used in decision making. Tissue Microarray (TMA) experiments allow large-scale profiling of tissue biopsies, investigating protein patterns that characterize specific disease states. TMA studies involve multiple sampling of the same patient, and therefore multiple measurements of the same protein target, to account for possible biological heterogeneity. The aim of this paper is to provide and validate a classification model that takes into consideration the uncertainty associated with measuring replicate samples.

2.
MOTIVATION: Biologists often employ clustering techniques in the explorative phase of microarray data analysis to discover relevant biological groupings. Given the availability of numerous clustering algorithms in the machine-learning literature, a user might want to select the one that performs best for a particular data set or application. While various validation measures have been proposed over the years to judge the quality of the clusters produced by a given clustering algorithm, including their biological relevance, a given algorithm can unfortunately perform poorly under one validation measure while outperforming many other algorithms under another. A manual synthesis of results from multiple validation measures is nearly impossible in practice, especially when a large number of clustering algorithms are to be compared using several measures. An automated and objective way of reconciling the rankings is needed. RESULTS: Using a Monte Carlo cross-entropy algorithm, we successfully combine the ranks of a set of clustering algorithms under consideration via a weighted aggregation that optimizes a distance criterion. The proposed weighted rank aggregation allows a far more objective and automated assessment of clustering results than simple visual inspection. We illustrate our procedure using one simulated and three real gene expression data sets from various platforms, ranking a total of eleven clustering algorithms using a combined examination of 10 different validation measures. The aggregate rankings were found for a given number of clusters k and also for an entire range of k. AVAILABILITY: R code for all validation measures and rank aggregation is available from the authors upon request. SUPPLEMENTARY INFORMATION: Supplementary information is available at http://www.somnathdatta.org/Supp/RankCluster/supp.htm.
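The weighted aggregation idea can be illustrated with a simple stand-in: a Borda-style weighted mean-rank consensus. This is not the paper's Monte Carlo cross-entropy optimization (which searches for weights minimizing a distance criterion), and the algorithm names and ranks below are hypothetical.

```python
def weighted_rank_aggregate(rank_lists, weights):
    """Borda-style aggregation: each algorithm's consensus score is the
    weighted sum of its ranks under the individual validation measures;
    lower scores rank higher in the consensus list."""
    algos = sorted(rank_lists[0])
    score = {a: sum(w * ranks[a] for w, ranks in zip(weights, rank_lists))
             for a in algos}
    return sorted(algos, key=lambda a: score[a])

# ranks of three clustering algorithms under two validation measures
silhouette = {"kmeans": 1, "hier": 2, "som": 3}
dunn       = {"kmeans": 2, "hier": 1, "som": 3}
consensus = weighted_rank_aggregate([silhouette, dunn], [0.7, 0.3])
```

With these weights, "kmeans" wins the consensus because the more heavily weighted measure ranks it first; the real method would instead optimize the weights against a distance criterion.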

3.
4.

Background  

A cluster analysis is the most commonly performed procedure (often regarded as a first step) on a set of gene expression profiles. In most cases, a post hoc analysis is done to see if the genes in the same clusters can be functionally correlated. While past successes of such analyses have been reported in a number of microarray studies (most of which used the standard hierarchical clustering, UPGMA, with one minus Pearson's correlation coefficient as the measure of dissimilarity), such groupings can often be misleading. More importantly, a systematic evaluation of the entire set of clusters produced by such unsupervised procedures is necessary, since they also contain genes that are seemingly unrelated or may have more than one common function. Here we quantify the performance of a given unsupervised clustering algorithm applied to a given microarray study in terms of its ability to produce biologically meaningful clusters using a reference set of functional classes. Such a reference set may come from prior biological knowledge specific to a microarray study or may be formed using the growing Gene Ontology (GO) databases for the annotated genes of the relevant species.

5.
A central challenge in computational modeling of biological systems is the determination of model parameters. Typically, only a fraction of the parameters (such as kinetic rate constants) are experimentally measured, while the rest are fitted. The fitting process is usually based on experimental time-course measurements of observables, which are used to assign parameter values that minimize some measure of the error between these measurements and the corresponding model predictions. The measurements, which can come from immunoblotting assays, fluorescent markers, etc., tend to be very noisy and are taken at a limited number of time points. In this work we present a new approach to the problem of parameter selection for biological models. We show how a dynamic recursive estimator, known as the extended Kalman filter, can be used to arrive at estimates of the model parameters. The proposed method proceeds in three steps. First, we use a variation of the Kalman filter that is particularly well suited to biological applications to obtain a first guess for the unknown parameters. Second, we employ an a posteriori identifiability test to check the reliability of the estimates. Finally, we solve an optimization problem to refine the first guess if it is not accurate enough. The final estimates are guaranteed to be statistically consistent with the measurements. Furthermore, we show how the same tools can be used to discriminate among alternative models of the same biological process. We demonstrate these ideas by applying our methods to two examples, namely a model of the heat shock response in E. coli and a model of a synthetic gene regulation system. The methods presented are quite general and may be applied to a wide class of biological systems where noisy measurements are used for parameter estimation or model selection.
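The core trick, treating unknown parameters as extra state variables in an extended Kalman filter, can be sketched on a toy model (not the paper's heat-shock or gene-regulation models): exponential decay dx/dt = -k*x with unknown rate k, Euler-discretized, with data generated from the same discretized model.

```python
import numpy as np

def ekf_decay_rate(ys, dt, x0=1.0, k0=0.1):
    """Joint state-parameter EKF for x[t+1] = x[t] - k*x[t]*dt.
    The unknown rate k is appended to the state and estimated online."""
    z = np.array([x0, k0])                 # augmented state [x, k]
    P = np.diag([0.1, 1.0])                # initial uncertainty (large on k)
    Q = np.diag([1e-6, 1e-6])              # small process noise keeps k adaptable
    r = 1e-4                               # measurement noise variance
    for y in ys:
        x, k = z
        z = np.array([x - k * x * dt, k])  # predict (k modeled as constant)
        F = np.array([[1 - k * dt, -x * dt],
                      [0.0, 1.0]])         # Jacobian of the transition
        P = F @ P @ F.T + Q
        S = P[0, 0] + r                    # innovation variance (observe x only)
        K = P[:, 0] / S                    # Kalman gain
        z = z + K * (y - z[0])             # update with measurement y
        P = P - np.outer(K, P[0, :])       # (I - K H) P with H = [1, 0]
    return z[1]

# simulate noiseless data from the same discretized model with k = 0.5
dt, k_true = 0.01, 0.5
xs = [1.0]
for _ in range(1000):
    xs.append(xs[-1] * (1 - k_true * dt))
k_est = ekf_decay_rate(xs[1:], dt)
```

The filter starts with a wrong guess (k = 0.1) and converges toward the true rate as the cross-covariance between x and k accumulates through the Jacobian term -x*dt.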

6.
Comparing a protein's concentrations across two or more treatments is the focus of many proteomics studies. A frequent source of measurements for these comparisons is a mass spectrometry (MS) analysis of a protein's peptide ions separated by liquid chromatography (LC) following its enzymatic digestion. Alas, LC-MS identification and quantification of equimolar peptides can vary significantly due to their unequal digestion, separation, and ionization. This unequal measurability of peptides, the largest source of LC-MS nuisance variation, stymies confident comparison of a protein's concentration across treatments. Our objective is to introduce a mixed-effects statistical model for comparative LC-MS proteomics studies. We describe LC-MS peptide abundance with a linear model featuring pivotal terms that account for unequal peptide LC-MS measurability. We advance fitting this model to an often incomplete LC-MS data set with REstricted Maximum Likelihood (REML) estimation, producing estimates of model goodness-of-fit, treatment effects, standard errors, confidence intervals, and protein relative concentrations. We illustrate the model with an experiment featuring a known dilution series of a filamentous ascomycete fungus Trichoderma reesei protein mixture. For 781 of the 1546 T. reesei proteins with sufficient data coverage, the fitted mixed-effects models capably described the LC-MS measurements. The LC-MS measurability terms effectively accounted for this major source of uncertainty. Ninety percent of the relative concentration estimates were within 0.5-fold of the true relative concentrations. Akin to the common ratio method, this model also produced biased estimates, albeit less biased. Bias decreased significantly, both absolutely and relative to the ratio method, as the number of observed peptides per protein increased. 
Mixed-effects statistical modeling offers a flexible, well-established methodology for comparative proteomics studies integrating common experimental designs with LC-MS sample processing plans. It favorably accounts for the unequal LC-MS measurability of peptides and produces informative quantitative comparisons of a protein's concentration across treatments with objective measures of uncertainties.
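The idea of absorbing unequal peptide measurability into model terms can be sketched with a fixed-effects least-squares stand-in (not the paper's REML mixed-effects fit): simulate log2 peptide abundances with peptide-specific offsets, then recover the treatment effect with peptide dummies in the design matrix. All numbers are illustrative.

```python
import numpy as np

# Simulated log2 abundances for one protein, two treatments (A, B),
# six peptides with unequal measurability, three replicates each.
rng = np.random.default_rng(0)
n_pep, n_rep = 6, 3
pep_eff = rng.normal(0.0, 1.0, n_pep)     # peptide-specific measurability offsets
true_log2_fc = 1.0                        # protein is 2-fold up in treatment B

rows, y = [], []
for t_ind, t_eff in enumerate([0.0, true_log2_fc]):
    for p in range(n_pep):
        for _ in range(n_rep):
            y.append(t_eff + pep_eff[p] + rng.normal(0.0, 0.1))
            x = np.zeros(1 + n_pep)
            x[0] = t_ind                  # treatment indicator (0 = A, 1 = B)
            x[1 + p] = 1.0                # dummy for this peptide's offset
            rows.append(x)

X, y = np.array(rows), np.array(y)
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
log2_fc_hat = beta[0]                     # estimated treatment effect
```

Because each peptide's offset is estimated jointly with the treatment effect, the unequal measurability cancels out of the fold-change estimate, which is the point the abstract makes about the model's measurability terms.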

7.
MOTIVATION: Genome-wide gene expression measurements, as currently determined by the microarray technology, can be represented mathematically as points in a high-dimensional gene expression space. Genes interact with each other in regulatory networks, restricting the cellular gene expression profiles to a certain manifold, or surface, in gene expression space. To obtain knowledge about this manifold, various dimensionality reduction methods and distance metrics are used. For data points distributed on curved manifolds, a sensible distance measure would be the geodesic distance along the manifold. In this work, we examine whether an approximate geodesic distance measure captures biological similarities better than the traditionally used Euclidean distance. RESULTS: We computed approximate geodesic distances, determined by the Isomap algorithm, for one set of lymphoma and one set of lung cancer microarray samples. Compared with the ordinary Euclidean distance metric, this distance measure produced more instructive, biologically relevant visualizations when applying multidimensional scaling. This suggests the Isomap algorithm as a promising tool for the interpretation of microarray data. Furthermore, the results demonstrate the benefit and importance of taking nonlinearities in gene expression data into account.
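The Isomap-style geodesic approximation can be sketched in a few lines: build a k-nearest-neighbor graph, then take all-pairs shortest paths. The semicircle example below is a stand-in for a curved expression manifold, where the geodesic (arc length, about pi) differs sharply from the straight-line Euclidean distance (2).

```python
import numpy as np

def geodesic_distances(X, k=2):
    """Isomap-style approximate geodesics: symmetric k-NN graph plus
    Floyd-Warshall all-pairs shortest paths (fine for small n)."""
    n = len(X)
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    G = np.full((n, n), np.inf)
    np.fill_diagonal(G, 0.0)
    for i in range(n):
        for j in np.argsort(D[i])[1:k + 1]:   # skip self at position 0
            G[i, j] = G[j, i] = D[i, j]
    for m in range(n):                        # Floyd-Warshall relaxation
        G = np.minimum(G, G[:, m:m + 1] + G[m:m + 1, :])
    return G

# points on a semicircle: the geodesic between the endpoints follows
# the arc, while the Euclidean distance cuts straight across
theta = np.linspace(0.0, np.pi, 20)
X = np.c_[np.cos(theta), np.sin(theta)]
G = geodesic_distances(X, k=2)
```

Production Isomap implementations use sparse shortest-path routines instead of Floyd-Warshall, and feed the geodesic matrix into classical multidimensional scaling, as the abstract describes.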

8.
Genomic rearrangement operations can be very useful to infer the phylogenetic relationship of gene orders representing species. We study the problem of finding potential ancestral gene orders for the gene orders of given taxa, such that the corresponding rearrangement scenario has a minimal number of reversals, and where each of the reversals has to preserve the common intervals of the given input gene orders. Common intervals identify sets of genes that occur consecutively in all input gene orders. The problem of finding such an ancestral gene order is called the preserving reversal median problem (pRMP). A tree-based data structure for the representation of the common intervals of all input gene orders is used in our exact algorithm TCIP for solving the pRMP. It is known that the minimum number of reversals to transform one gene order into another can be computed in polynomial time, whereas the corresponding problem with the restriction that common intervals should not be destroyed is already NP-hard. It is shown theoretically that TCIP can solve a large class of pRMP instances in polynomial time. Empirically we show the good performance of TCIP on biological and artificial data.

9.
If perturbing two genes together has a stronger or weaker effect than expected, they are said to genetically interact. Genetic interactions are important because they help map gene function, and functionally related genes have similar genetic interaction patterns. Mapping quantitative (positive and negative) genetic interactions on a global scale has recently become possible. These data clearly show groups of genes connected by predominantly positive or negative interactions, termed monochromatic groups. These groups often correspond to functional modules, like biological processes or complexes, or connections between modules. However, it is not yet known how these patterns relate globally to known functional modules. Here we systematically study the monochromatic nature of known biological processes using the largest quantitative genetic interaction data set available, which includes fitness measurements for ~5.4 million gene pairs in the yeast Saccharomyces cerevisiae. We find that only 10% of biological processes, as defined by Gene Ontology annotations, and less than 1% of inter-process connections are monochromatic. Further, we show that protein complexes are responsible for a surprisingly large fraction of these patterns. This suggests that complexes play a central role in shaping the monochromatic landscape of biological processes. Altogether this work shows that both positive and negative monochromatic patterns are found in known biological processes and in their connections, and that protein complexes play an important role in these patterns. The monochromatic processes, complexes, and connections we find chart a hierarchical and modular map of sensitive and redundant biological systems in the yeast cell that will be useful for gene function prediction and comparison across phenotypes and organisms. Furthermore, the analysis methods we develop are applicable to other species for which genetic interactions will progressively become more available.
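The notion of monochromaticity reduces to a simple sign-purity check on the interaction scores within a module. A toy version (the 0.9 cutoff and the scores below are illustrative, not the paper's criterion):

```python
def monochromatic_purity(signs):
    """Fraction of a module's genetic interactions sharing the majority sign."""
    pos = sum(1 for s in signs if s > 0)
    return max(pos, len(signs) - pos) / len(signs)

def is_monochromatic(signs, threshold=0.9):
    """Call a module monochromatic if sign purity meets an illustrative cutoff."""
    return monochromatic_purity(signs) >= threshold

# interaction scores for gene pairs within a hypothetical process:
# mostly positive, one negative
process_a = [0.3, 0.5, 0.2, 0.4, -0.1]
```

Real analyses additionally assess whether the observed purity exceeds what random sign assignment would produce, typically via permutation testing.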

10.
If biological questions are to be answered using quantitative proteomics, it is essential to design experiments which have sufficient power to be able to detect changes in expression. Sample subpooling is a strategy that can be used to reduce the variance but still allow studies to encompass biological variation. Underlying sample pooling strategies is the biological averaging assumption that the measurements taken on the pool are equal to the average of the measurements taken on the individuals. This study finds no evidence of a systematic bias triggered by sample pooling for DIGE and that pooling can be useful in reducing biological variation. For the first time in quantitative proteomics, the two sources of variance were decoupled and it was found that technical variance predominates for mouse brain, while biological variance predominates for human brain. A power analysis found that as the number of individuals pooled increased, then the number of replicates needed declined but the number of biological samples increased. Repeat measures of biological samples decreased the numbers of samples required but increased the number of gels needed. An example cost benefit analysis demonstrates how researchers can optimise their experiments while taking into account the available resources.
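The variance arithmetic behind the pooling trade-off can be sketched directly: under the biological-averaging assumption, pooling n individuals divides the biological variance per gel by n, while technical variance is paid on every gel. The function and numbers below are a hypothetical illustration, not the paper's power analysis.

```python
import math

def gels_needed(bio_var, tech_var, n_pooled, target_se):
    """Number of gel replicates so the standard error of the group mean
    meets a target, when each gel measures a pool of n_pooled individuals.
    Assumes biological averaging: pooled biological variance = bio_var / n."""
    per_gel_var = bio_var / n_pooled + tech_var
    return math.ceil(per_gel_var / target_se**2)

# illustrative numbers: biological variance 4, technical variance 1
no_pooling = gels_needed(4.0, 1.0, n_pooled=1, target_se=0.5)
pooled_4   = gels_needed(4.0, 1.0, n_pooled=4, target_se=0.5)
```

Pooling four individuals per gel cuts the gel count (from 20 to 8 here) at the cost of consuming more biological samples per gel, which mirrors the abstract's finding.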

11.

Background  

The functions of human cells are carried out by biomolecular networks, which include proteins, genes, and regulatory sites within DNA that encode and control protein expression. Models of biomolecular network structure and dynamics can be inferred from high-throughput measurements of gene and protein expression. We build on our previously developed fuzzy logic method for bridging quantitative and qualitative biological data to address the challenges of noisy, low-resolution high-throughput measurements, e.g., from gene expression microarrays. We employ an evolutionary search algorithm to accelerate the search for hypothetical fuzzy biomolecular network models consistent with a biological data set. We also develop a method to estimate the probability of a potential network model fitting a set of data by chance. The resulting metric provides an estimate of both model quality and dataset quality, identifying data that are too noisy to identify meaningful correlations between the measured variables.

12.
Real-time quantitative measurement of fluorescence resonance energy transfer efficiency
Fluorescence resonance energy transfer (FRET) is widely used to study intermolecular distances and interactions. Combined with fluorescence microscopy, it can quantitatively capture spatiotemporal information about proteins, lipids, DNA, and RNA in living organisms. With the development of green fluorescent protein (GFP), FRET fluorescence microscopy has made it possible to measure the dynamic properties of molecules in living cells in real time. A simple method is proposed for quantitatively measuring FRET efficiency and the distance between donor and acceptor, requiring only a single set of filters and the measurement of one ratio; spectral crosstalk is eliminated using the emission spectra of the donor and acceptor. The method is simple and fast, enables real-time quantitative measurement of FRET efficiency and donor-acceptor distance, and is especially suitable for GFP-based donor-acceptor pairs.

13.
Two-dimensional SDS-PAGE gel electrophoresis using post-run staining is widely used to measure the abundances of thousands of protein spots simultaneously. Usually, the protein abundances of two or more biological groups are compared using biological and technical replicates. After gel separation and staining, the spots are detected, spot volumes are quantified, and spots are matched across gels. There are almost always many missing values in the resulting data set. The missing values arise either because the corresponding proteins have very low abundances (or are absent) or because of experimental errors such as incomplete/over-focusing in the first dimension or varying run times in the second dimension, as well as faulty spot detection and matching. In this study, we show that the probability for a spot to be missing can be modeled by a logistic regression function of the logarithm of the volume. Furthermore, we present an algorithm that takes a set of gels with technical and biological replicates as input and estimates the average protein abundances in the biological groups from the number of missing spots and measured volumes of the present spots using a maximum likelihood approach. Confidence intervals for abundances and p-values for differential expression between two groups are calculated using bootstrap sampling. The algorithm is compared to two standard approaches, one that discards missing values and one that sets all missing values to zero. We have evaluated this approach in two different gel data sets of different biological origin. An R program implementing the algorithm is freely available at http://bioinfo.thep.lu.se/MissingValues2Dgels.html.
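The first step of that approach, modeling the missingness probability as a logistic function of log volume, can be sketched with a simulation and a Newton (IRLS) fit. The simulation parameters are illustrative, not values from the paper.

```python
import numpy as np

# Simulate 2D-gel spot log-volumes; low-abundance spots are more likely
# to go missing: logit p(missing) = a - b * log_vol, with a=6, b=1.
rng = np.random.default_rng(1)
log_vol = rng.normal(8.0, 1.5, 2000)
p_miss = 1.0 / (1.0 + np.exp(-(6.0 - 1.0 * log_vol)))
missing = (rng.random(2000) < p_miss).astype(float)

# fit the logistic regression of `missing` on log volume by Newton/IRLS
X = np.c_[np.ones_like(log_vol), log_vol]
w = np.zeros(2)                               # [intercept, slope]
for _ in range(25):
    p = 1.0 / (1.0 + np.exp(-X @ w))
    grad = X.T @ (missing - p)                # score vector
    H = (X * (p * (1 - p))[:, None]).T @ X    # observed information
    w += np.linalg.solve(H, grad)
intercept_hat, slope_hat = w
```

A clearly negative fitted slope confirms the abstract's premise: the smaller the spot volume, the higher the probability the spot is missing, which is what allows missingness itself to carry abundance information in the full likelihood.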

14.
Asymmetry of Early Paleozoic trilobites
Asymmetry in fossils can arise through a variety of biological and geological mechanisms. If geological sources of asymmetry can be minimized or factored out, it might be possible to assess biological sources of asymmetry. Fluctuating asymmetry (FA), a general measure of developmental precision, is documented for nine species of lower Paleozoic trilobites. Taphonomic analyses suggest that the populations studied for each taxon span relatively short time intervals that are approximately equal in duration. Tectonic deformation may have affected the specimens studied, since deviations from normal distributions are common. Several measures of FA were applied to 3–5 homologous measures in each taxon. Measurement error was assessed by analysis of variance (ANOVA) for repeated measurements of individual specimens and by analysis of the statistical moments of the distributions of asymmetry measures. Measurement error was significantly smaller than the difference between measures taken on each side of a specimen. However, the distribution of differences between sides often deviated from a mean of zero, or was skewed or kurtotic. Regression of levels of FA against geologic age revealed no statistically significant changes in levels of asymmetry through time. Geological and taphonomic effects make it difficult to identify asymmetry due to biological factors. Although fluctuating asymmetry is a function of both intrinsic and extrinsic factors, the results suggest that early Cambrian trilobites possessed genetic or developmental mechanisms used to maintain developmental stability comparable to those of younger trilobites. Although the measures are biased by time averaging and deviations from the normal distribution, these data do not lend strong support to 'genomic' hypotheses that have been suggested to control the tempo of the Cambrian radiation.

15.
Ko H, Hogan JW, Mayer KH. Biometrics. 2003;59(1):152-162.
Several recently completed and ongoing studies of the natural history of HIV infection have generated a wealth of information about its clinical progression and how this progression is altered by therapeutic interventions and environmental factors. Natural history studies typically follow prospective cohort designs, and enroll large numbers of participants for long-term prospective follow-up (up to several years). Using data from the HIV Epidemiology Research Study (HERS), a six-year natural history study that enrolled 871 HIV-infected women starting in 1993, we investigate the therapeutic effect of highly active antiretroviral therapy regimens (HAART) on CD4 cell count using the marginal structural modeling framework and associated estimation procedures based on inverse-probability weighting (developed by Robins and colleagues). To evaluate treatment effects from a natural history study, specialized methods are needed because treatments are not randomly prescribed and, in particular, the treatment-response relationship can be confounded by variables that are time-varying. Our analysis uses CD4 data on all follow-up visits over a two-year period, and includes sensitivity analyses to investigate potential biases attributable to unmeasured confounding. Strategies for selecting ranges of a sensitivity parameter are given, as are intervals for treatment effect that reflect uncertainty attributable both to sampling and to lack of knowledge about the nature and existence of unmeasured confounding. To our knowledge, this is the first use in "real data" of Robins's sensitivity analysis for unmeasured confounding (Robins, 1999a, Synthese 121, 151-179). The findings from our analysis are consistent with recent treatment guidelines set by the U.S. Panel of the International AIDS Society (Carpenter et al., 2000, Journal of the American Medical Association 280, 381-391).
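The core of inverse-probability weighting can be sketched in a single-time-point toy example (the real marginal structural model handles time-varying treatment and confounders): a confounder drives both treatment assignment and outcome, so the naive treated-vs-untreated contrast is biased, while reweighting by the inverse probability of received treatment recovers the true effect. All variables and effect sizes below are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 20000
C = rng.normal(size=n)                        # confounder (e.g. baseline health)
p_treat = 1.0 / (1.0 + np.exp(-C))            # treatment probability depends on C
A = rng.random(n) < p_treat                   # treatment actually received
Y = 2.0 * A + 3.0 * C + rng.normal(size=n)    # true treatment effect = 2.0

# naive contrast is confounded: treated subjects have higher C on average
naive = Y[A].mean() - Y[~A].mean()

# weight each subject by 1 / P(received treatment they actually got | C)
w = np.where(A, 1.0 / p_treat, 1.0 / (1.0 - p_treat))
ipw = (np.average(Y[A], weights=w[A])
       - np.average(Y[~A], weights=w[~A]))
```

Here the treatment model is known; in practice it is estimated (e.g. by logistic regression on the confounder history), and stabilized weights are usually preferred to tame extreme values.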

16.
We develop a new regression algorithm, cMIKANA, for inference of gene regulatory networks from combinations of steady-state and time-series gene expression data. Using simulated gene expression datasets to assess the accuracy of reconstructing gene regulatory networks, we show that steady-state and time-series data sets can successfully be combined to identify gene regulatory interactions using the new algorithm. Inferring gene networks from combined data sets was found to be advantageous when using noisy measurements collected with either lower sampling rates or a limited number of experimental replicates. We illustrate our method by applying it to a microarray gene expression dataset from human umbilical vein endothelial cells (HUVECs) which combines time series data from treatment with growth factor TNF and steady state data from siRNA knockdown treatments. Our results suggest that the combination of steady-state and time-series datasets may provide better prediction of RNA-to-RNA interactions, and may also reveal biological features that cannot be identified from dynamic or steady state information alone. Finally, we consider the experimental design of genomics experiments for gene regulatory network inference and show that network inference can be improved by incorporating steady-state measurements with time-series data.

17.
Studies of biological variables such as those based on blood chemistry often have measurements taken over time at closely spaced intervals for groups of individuals. Natural scientific questions may then relate to the first time that the underlying population curve crosses a threshold (onset) and to how long it stays above the threshold (duration). In this paper we give general confidence regions for these population quantities. The regions are based on the intersection-union principle and may be applied to totally nonparametric, semiparametric, or fully parametric models where level-α tests exist pointwise at each time point. A key advantage of the approach is that no modeling of the correlation over time is required.

18.
A large number of biological pathways have been elucidated recently, and there is a need for methods to analyze these pathways. One class of methods compares pathways semantically in order to discover parts that are evolutionarily conserved between species or to discover intraspecies similarities. Such methods usually require that the topologies of the pathways being compared are known, i.e. that a query pathway is being aligned to a model pathway. However, sometimes the query only consists of an unordered set of gene products. Previous methods for mapping sets of gene products onto known pathways have not been based on semantic comparison of gene products using ontologies or other abstraction hierarchies. Therefore, we here propose an approach that uses a similarity function defined in Gene Ontology (GO) terms to find semantic alignments when comparing paths in biological pathways where the nodes are gene products. A known pathway graph is used as a model, and an evolutionary algorithm (EA) is used to evolve putative paths from a set of experimentally determined gene products. The method uses a measure of GO term similarity to calculate a match score between gene products, and the fitness value of each candidate path alignment is derived from these match scores. A statistical test is used to assess the significance of evolved alignments. The performance of the method has been tested using regulatory pathways for S. cerevisiae and M. musculus.

19.
Gene selection and classification of microarray data using random forest

Background  

Selection of relevant genes for sample classification is a common task in most gene expression studies, where researchers try to identify the smallest possible set of genes that can still achieve good predictive performance (for instance, for future use with diagnostic purposes in clinical practice). Many gene selection approaches use univariate (gene-by-gene) rankings of gene relevance and arbitrary thresholds to select the number of genes, can only be applied to two-class problems, and use gene selection ranking criteria unrelated to the classification algorithm. In contrast, random forest is a classification algorithm well suited for microarray data: it shows excellent performance even when most predictive variables are noise, can be used when the number of variables is much larger than the number of observations and in problems involving more than two classes, and returns measures of variable importance. Thus, it is important to understand the performance of random forest with microarray data and its possible use for gene selection.

20.
A lengthening in meal duration can be used to measure an increase in orofacial mechanical hyperalgesia, with similarities to the guarding behavior of humans with orofacial pain. To measure meal duration, unrestrained rats are continuously kept in sound-attenuated, computerized feeding modules for days to weeks to record feeding behavior. These sound-attenuated chambers are equipped with chow pellet dispensers. The dispenser has a pellet trough with a photobeam placed at the bottom of the trough; when a rodent removes a pellet from the feeder trough, this beam is no longer blocked, signaling the computer to drop another pellet. The computer records the date and time when pellets were taken from the trough, and from these data the experimenter can calculate the meal parameters. When calculating meal parameters, a meal was defined based on previous work: a meal ends after the animal does not eat for 10 min, and the minimum meal size was set at 3 pellets. The meal duration, meal number, food intake, meal size, and inter-meal interval can then be calculated by the software for any time period the operator desires. Of the feeding parameters that can be calculated, meal duration has been shown to be a continuous noninvasive biological marker of orofacial nociception in male rats and mice and in female rats. Meal duration measurements are quantitative, require no training or animal manipulation, require cortical participation, and do not compete with other experimentally induced behaviors. These factors distinguish this assay from other operant or reflex methods for recording orofacial nociception.
