Similar Documents
20 similar documents found (search time: 421 ms)
1.
Pooling biospecimens and limits of detection: effects on ROC curve analysis
Frequently, epidemiological studies deal with two restrictions in the evaluation of biomarkers: cost and instrument sensitivity. Costs can hamper the evaluation of the effectiveness of new biomarkers. In addition, many assays are affected by a limit of detection (LOD), depending on the instrument sensitivity. Two common strategies used to cut costs include taking a random sample of the available samples and pooling biospecimens. We compare the two sampling strategies when an LOD effect exists. These strategies are compared by examining the efficiency of receiver operating characteristic (ROC) curve analysis, specifically the estimation of the area under the ROC curve (AUC) for normally distributed markers. We propose and examine a method to estimate AUC when dealing with data from pooled and unpooled samples where an LOD is in effect. In conclusion, pooling is the most efficient cost-cutting strategy when the LOD affects less than 50% of the data. However, when much more than 50% of the data are affected, utilization of the pooling design is not recommended.
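The binormal AUC underlying this comparison has a closed form, and pooling preserves it: averaging g specimens keeps the group means and divides the variance by g. Below is a minimal sketch of that logic (ignoring the LOD censoring the paper actually studies; all function names and sample sizes are illustrative, not the authors'):

```python
import math
import random

def norm_cdf(x):
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def auc_binormal(mu0, var0, mu1, var1):
    # binormal AUC: P(case > control) = Phi((mu1 - mu0) / sqrt(var0 + var1))
    return norm_cdf((mu1 - mu0) / math.sqrt(var0 + var1))

def pool_specimens(values, g, rng):
    # physically pooling g specimens is modelled as averaging their marker values
    vals = list(values)
    rng.shuffle(vals)
    return [sum(vals[i:i + g]) / g for i in range(0, len(vals) - g + 1, g)]

def auc_from_pools(pools0, pools1, g):
    # pooled means estimate the individual means; multiplying the pooled
    # variance by g recovers the individual-level variance
    def mean_var(xs):
        m = sum(xs) / len(xs)
        return m, sum((x - m) ** 2 for x in xs) / (len(xs) - 1)
    m0, v0 = mean_var(pools0)
    m1, v1 = mean_var(pools1)
    return auc_binormal(m0, g * v0, m1, g * v1)
```

With controls ~ N(0, 1) and cases ~ N(1, 1), the true AUC is Φ(1/√2) ≈ 0.76, and the pooled-data estimator recovers it from a quarter as many assays.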

2.
MOTIVATION: Many biomedical experiments are carried out by pooling individual biological samples. However, pooling samples can potentially hide biological variance and give false confidence concerning the data significance. In the context of microarray experiments for detecting differentially expressed genes, recent publications have addressed the problem of the efficiency of sample pooling, and some approximate formulas were provided for the power and sample size calculations. It is desirable to have exact formulas for these calculations and have the approximate results checked against the exact ones. We show that the difference between the approximate and the exact results can be large. RESULTS: In this study, we have characterized quantitatively the effect of pooling samples on the efficiency of microarray experiments for the detection of differential gene expression between two classes. We present exact formulas for calculating the power of microarray experimental designs involving sample pooling and technical replications. The formulas can be used to determine the total number of arrays and biological subjects required in an experiment to achieve the desired power at a given significance level. The conditions under which pooled design becomes preferable to non-pooled design can then be derived given the unit cost associated with a microarray and that with a biological subject. This paper thus serves to provide guidance on sample pooling and cost-effectiveness. The formulation in this paper is outlined in the context of performing microarray comparative studies, but its applicability is not limited to microarray experiments. It is also applicable to a wide range of biomedical comparative studies where sample pooling may be involved.
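As a rough illustration of the kind of power calculation discussed above, here is a normal-approximation sketch (not the paper's exact formulas): the variance of a class mean splits into a biological term, averaged over all subjects in the class, and a technical term, averaged over all arrays.

```python
import math

def norm_cdf(x):
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def norm_ppf(q, lo=-10.0, hi=10.0):
    # standard normal quantile via bisection on the CDF
    for _ in range(100):
        mid = (lo + hi) / 2.0
        if norm_cdf(mid) < q:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0

def power_pooled(delta, sigma_bio, sigma_tech, subjects_per_pool,
                 pools_per_class, arrays_per_pool, alpha=0.001):
    # variance of one class mean: biological variance shrinks with the total
    # number of subjects pooled, technical variance with the total arrays
    var_class_mean = (sigma_bio ** 2 / (subjects_per_pool * pools_per_class)
                      + sigma_tech ** 2 / (pools_per_class * arrays_per_pool))
    se = math.sqrt(2.0 * var_class_mean)       # two-class difference
    z = norm_ppf(1.0 - alpha / 2.0)
    return 1.0 - norm_cdf(z - delta / se)
```

Holding the number of arrays fixed, pooling more subjects per pool buys power only through the biological-variance term, which is exactly the trade-off against subject and array costs described above.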

3.
Pooling biospecimens is a well-accepted sampling strategy in biomedical research to reduce the cost of measuring biomarkers, and has been shown in the case of normally distributed data to yield more efficient estimation. In this paper we examine the efficiency of pooling, in the context of the information matrix related to estimators of unknown parameters, when the biospecimens being pooled yield incomplete observations due to the instruments' limit of detection. Our investigation of three sampling strategies shows that, for a range of values of the detection limit, pooling is the most efficient sampling procedure. For certain other values of the detection limit, pooling can perform poorly.

4.
It has become increasingly common in epidemiological studies to pool specimens across subjects to achieve accurate quantitation of biomarkers and certain environmental chemicals. In this article, we consider the problem of fitting a binary regression model when an important exposure is subject to pooling. We take a regression calibration approach and derive several methods, including plug-in methods that use a pooled measurement and other covariate information to predict the exposure level of an individual subject, and normality-based methods that make further adjustments by assuming normality of calibration errors. Within each class we propose two ways to perform the calibration (covariate augmentation and imputation). These methods are shown in simulation experiments to effectively reduce the bias associated with the naive method that simply substitutes a pooled measurement for all individual measurements in the pool. In particular, the normality-based imputation method performs reasonably well in a variety of settings, even under skewed distributions of calibration errors. The methods are illustrated using data from the Collaborative Perinatal Project.

5.
As medical research and technology advance, new biomarkers are continually found and predictive models proposed to improve the diagnostic performance for diseases. Therefore, in addition to the existing biomarkers and predictive models, how to assess new biomarkers becomes an important research problem. Many classification performance measures, which are usually based on performance across the whole range of cut-off values, have been applied directly to this type of problem. However, in medical diagnosis, some cut-off points are more important than others, such as those within the range of high specificity. Thus, just as the partial area under the ROC curve relates to the area under the ROC curve, we study the partial integrated discriminant improvement (pIDI) for evaluating the predictive ability of a newly added marker at a prespecified range of cut-offs. Theoretical properties of the estimator of the proposed measure are reported. The performance of this new measure is then compared with that of the partial area under an ROC curve. Numerical results using synthesized data are presented, and a liver cancer dataset is used for demonstration purposes.

6.
Weinberg CR, Umbach DM. Biometrics 1999;55(3):718-726
Assays can be so expensive that interesting hypotheses become impractical to study epidemiologically. One need not, however, perform an assay for everyone providing a biological specimen. We propose pooling equal-volume aliquots from randomly grouped sets of cases and randomly grouped sets of controls, and then assaying the smaller number of pooled samples. If an effect modifier is of concern, the pooling can be done within strata defined by that variable. For covariates assessed on individuals (e.g., questionnaire data), set-based counterparts are calculated by adding the values for the individuals in each set. The pooling set then becomes the unit of statistical analysis. We show that, with appropriate specification of a set-based logistic model, standard software yields a valid estimated exposure odds ratio, provided the multiplicative formulation is correct. Pooling minimizes the depletion of irreplaceable biological specimens and can enable additional exposures to be studied economically. Statistical power suffers very little compared with the usual, individual-based analysis. In settings where high assay costs constrain the number of people an investigator can afford to study, specimen pooling can make it possible to study more people and hence improve the study's statistical power with no increase in cost.
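The set-based design described here is easy to simulate. The sketch below forms equal-sized sets within cases and within controls, sums the exposure over each set, and fits an ordinary two-parameter logistic model to the sets; under the multiplicative model the slope on the set sum estimates the individual-level log odds ratio. The sample sizes, set size, and the Newton-Raphson fitter are illustrative choices, not the authors':

```python
import math
import random

def fit_logistic(xs, ys, iters=25):
    # two-parameter logistic regression fitted by Newton-Raphson
    b0 = b1 = 0.0
    for _ in range(iters):
        g0 = g1 = h00 = h01 = h11 = 0.0
        for x, y in zip(xs, ys):
            p = 1.0 / (1.0 + math.exp(-(b0 + b1 * x)))
            w = p * (1.0 - p)
            g0 += y - p
            g1 += (y - p) * x
            h00 += w
            h01 += w * x
            h11 += w * x * x
        det = h00 * h11 - h01 * h01
        b0 += (h11 * g0 - h01 * g1) / det
        b1 += (h00 * g1 - h01 * g0) / det
    return b0, b1

rng = random.Random(1)
true_beta = 0.7          # individual-level log odds ratio to recover
cases, controls = [], []
while len(cases) < 900 or len(controls) < 900:
    x = rng.gauss(0.0, 1.0)
    p_case = 1.0 / (1.0 + math.exp(-(-1.0 + true_beta * x)))
    (cases if rng.random() < p_case else controls).append(x)
cases, controls = cases[:900], controls[:900]

def set_sums(xs, k):
    # set-based counterpart of the covariate: sum over the k set members
    return [sum(xs[i:i + k]) for i in range(0, len(xs), k)]

k = 3
pool_x = set_sums(cases, k) + set_sums(controls, k)
pool_y = [1] * (len(cases) // k) + [0] * (len(controls) // k)
_, beta_hat = fit_logistic(pool_x, pool_y)
```

The pooling set is the unit of analysis, so 1800 subjects contribute only 600 logistic observations, yet the slope stays close to the individual-level log odds ratio.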

7.
Vieland VJ, Wang K, Huang J. Human Heredity 2001;51(4):199-208
The development of rigorous methods for evaluating the overall strength of evidence for genetic linkage based on multiple sets of data is becoming increasingly important in connection with genomic screens for complex disorders. We consider here what happens when we attempt to increase power to detect linkage by pooling multiple independently collected sets of families under conditions of variable levels of locus heterogeneity across samples. We show that power can be substantially reduced in pooled samples when compared to the most informative constituent subsamples considered alone, in spite of the increased sample size afforded by pooling. We demonstrate that for affected sib pair data, a simple adaptation of the lod score (which we call the compound lod), which allows for intersample admixture differences can afford appreciably higher power than the ordinary heterogeneity lod; and also, that a statistic we have proposed elsewhere, the posterior probability of linkage, performs at least as well as the compound lod while having considerable computational advantages. The companion paper (this issue, pp 217-225) shows further that in application to multiple data sets, familiar model-free methods are in some sense equivalent to ordinary lod scores based on data pooling, and that they therefore will also suffer dramatic losses in power for pooled data in the presence of locus heterogeneity and other complicating factors.

8.
Two-stage designs in case-control association analysis
Zuo Y, Zou G, Zhao H. Genetics 2006;173(3):1747-1760
DNA pooling is a cost-effective approach for collecting information on marker allele frequency in genetic studies. It is often suggested as a screening tool to identify a subset of candidate markers from a very large number of markers to be followed up by more accurate and informative individual genotyping. In this article, we investigate several statistical properties and design issues related to this two-stage design, including the selection of the candidate markers for second-stage analysis, statistical power of this design, and the probability that truly disease-associated markers are ranked among the top after second-stage analysis. We have derived analytical results on the proportion of markers to be selected for second-stage analysis. For example, to detect disease-associated markers with an allele frequency difference of 0.05 between the cases and controls through an initial sample of 1000 cases and 1000 controls, our results suggest that when the measurement errors are small (0.005), approximately 3% of the markers should be selected. For the statistical power to identify disease-associated markers, we find that the measurement errors associated with DNA pooling have little effect on its power. This is in contrast to the one-stage pooling scheme, where measurement errors may have a large effect on statistical power. As for the probability that the disease-associated markers are ranked among the top in the second stage, we show that there is a high probability that at least one disease-associated marker is ranked among the top when the allele frequency differences between the cases and controls are no smaller than 0.05 for reasonably large sample sizes, even though the errors associated with DNA pooling in the first stage are not small. Therefore, the two-stage design with DNA pooling as a screening tool offers an efficient strategy in genomewide association studies, even when the measurement errors associated with DNA pooling are nonnegligible.
For any disease model, we find that all the statistical results essentially depend on the population allele frequency and the allele frequency differences between the cases and controls at the disease-associated markers. The general conclusions hold whether the second stage uses an entirely independent sample or includes both the samples used in the first stage and an independent set of samples.
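A small simulation along the lines of the quoted numbers (1000 cases and 1000 controls, an allele-frequency difference of 0.05, measurement error 0.005, top ~3% of markers carried forward) can be sketched as follows; the generative model is a deliberate simplification, not the paper's:

```python
import random

rng = random.Random(7)
n_markers, n_assoc, n_subjects = 1000, 10, 1000
measurement_err = 0.005

p_ctrl = [rng.uniform(0.2, 0.8) for _ in range(n_markers)]
assoc = set(range(n_assoc))   # by construction, the first 10 markers are associated
p_case = [p + (0.05 if i in assoc else 0.0) for i, p in enumerate(p_ctrl)]

def pooled_freq(p, n, err_sd):
    # pooled-DNA estimate: binomial sampling over 2n alleles plus assay error
    count = sum(1 for _ in range(2 * n) if rng.random() < p)
    return count / (2 * n) + rng.gauss(0.0, err_sd)

diffs = [abs(pooled_freq(p_case[i], n_subjects, measurement_err)
             - pooled_freq(p_ctrl[i], n_subjects, measurement_err))
         for i in range(n_markers)]

# stage 1 screen: keep roughly 3% of markers for individual genotyping
n_keep = n_markers * 3 // 100
selected = sorted(range(n_markers), key=lambda i: diffs[i], reverse=True)[:n_keep]
hits = len(assoc & set(selected))
```

Even with nonzero pooling error, most of the truly associated markers survive the 3% screen, far above the 0.3 expected by chance.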

9.
A construction of pooling designs with some happy surprises.
The screening of data sets for "positive data objects" is essential to modern technology. A (group) test that indicates whether a positive data object is in a specific subset or pool of the dataset can greatly facilitate the identification of all the positive data objects. A collection of tested pools is called a pooling design. Pooling designs are standard experimental tools in many biotechnical applications. In this paper, we use the (linear) subspace relation coupled with the general concept of a "containment matrix" to construct pooling designs with surprisingly high degrees of error correction (detection). Error-correcting pooling designs are important to biotechnical applications where error rates often are as high as 15%. What is also surprising is that the rank of the pooling design containment matrix is independent of the number of positive data objects in the dataset.
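The error-free core of any pooling design is the decoding step: every item appearing in a negative pool is cleared, and whatever remains is a candidate positive. A toy sketch with a 3x3 grid design (one pool per row and per column of a containment matrix; this is not the subspace construction of the paper, just the basic group-testing mechanism it builds on):

```python
def run_pools(containment, positives):
    # a pool tests positive iff it contains at least one positive item
    return [any(row[j] for j in positives) for row in containment]

def decode(containment, results):
    # items appearing in a negative pool are cleared; the rest are candidates
    n_items = len(containment[0])
    candidates = set(range(n_items))
    for row, positive in zip(containment, results):
        if not positive:
            candidates -= {j for j in range(n_items) if row[j]}
    return candidates

# 3x3 grid design on 9 items: 6 pools instead of 9 individual tests
grid = ([[1 if j // 3 == r else 0 for j in range(9)] for r in range(3)] +
        [[1 if j % 3 == c else 0 for j in range(9)] for c in range(3)])
```

A single positive item is recovered exactly from 6 tests; richer constructions like the paper's buy tolerance for multiple positives and for erroneous pool results.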

10.
Population genomic approaches,which take advantages of high-throughput genotyping,are powerful yet costly methods to scan for selective sweeps.DNA-pooling strategies have been widely used for association studies because it is a cost-effective alternative to large-scale individual genotyping.Here,we performed an SNP-MaP(single nucleotide polymorphism microarrays and pooling)analysis using samples from Eurasia to evaluate the efficiency of pooling strategy in genome-wide scans for selection.By conducting simulations of allelotype data,we first demonstrated that the boxplot with average heterozygosity(HET)is a promising method to detect strong selective sweeps with a moderate level of pooling error.Based on this,we used a sliding window analysis of HET to detect the large contiguous regions(LCRs)putatively under selective sweeps from Eurasia datasets.This survey identified 63 LCRs in a European population.These signals were further supported by the integrated haplotype score(iHS)test using HapMap Ⅱ data.We also confirrned the European-specific signatures of positive selection from several previously identified genes(KEL,TRPV5,TRPV6,EPHB6).In summary,our results not only revealed the high credibility of SNP-MaP strategy in scanning for selective sweeps,but also provided an insight into the population differentiation.  相似文献   

11.
Human serum glycomics is a promising method for finding cancer biomarkers but often lacks the tools for streamlined data analysis. The Glycolyzer software incorporates a suite of analytic tools capable of identifying informative glycan peaks out of raw mass spectrometry data. As a demonstration of its utility, the program was used to identify putative biomarkers for epithelial ovarian cancer from a human serum sample set. A randomized, blocked, and blinded experimental design was used on a discovery set consisting of 46 cases and 48 controls. Retrosynthetic glycan libraries were used for data analysis and several significant candidate glycan biomarkers were discovered via hypothesis testing. The significant glycans were attributed to a glycan family based on glycan composition relationships and incorporated into a linear classifier motif test. The motif test was then applied to the discovery set to evaluate the disease state discrimination performance. The test provided strongly predictive results based on receiver operating characteristic curve analysis. The area under the receiver operating characteristic curve was 0.93. Using the Glycolyzer software, we were able to identify a set of glycan biomarkers that highly discriminate between cases and controls, and are ready to be formally validated in subsequent studies.
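The reported 0.93 is the area under an empirical ROC curve, which can always be computed as a Mann-Whitney statistic over case/control score pairs; a self-contained sketch (the O(n·m) loop is fine at discovery-set scale):

```python
def empirical_auc(case_scores, control_scores):
    # AUC equals the Mann-Whitney probability P(case score > control score),
    # with ties counted as one half
    wins = ties = 0
    for x in case_scores:
        for y in control_scores:
            if x > y:
                wins += 1
            elif x == y:
                ties += 1
    return (wins + 0.5 * ties) / (len(case_scores) * len(control_scores))
```

An AUC of 0.93 therefore means: a randomly chosen case outscores a randomly chosen control 93% of the time.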

12.
Assessment of arbovirus vector infection rates using variable size pooling
Pool testing of vector samples for arboviruses is widely used in surveillance programmes. The proportion of infected mosquitoes (Diptera: Culicidae) is often estimated from the minimum infection rate (MIR), based on the assumption of only one infected mosquito per positive pool. This assumption becomes problematic when pool size is large and/or infection rate is high. By relaxing this constraint, maximum likelihood estimation (MLE) is more useful for the wide range of infection levels that may be encountered in the field. We demonstrate the difference between these two estimation approaches using West Nile virus (WNV) surveillance data from vectors collected by gravid traps in Chicago during 2002. The MLE of infection rates of Culex mosquitoes was as high as 60 per 1000 at the peak of transmission in August, whereas the MIR was less than 30 per 1000. More importantly, we demonstrate the roles of various pooling strategies in better estimation of infection rates, based on simulation studies with hypothetical mosquito samples of 18 pools. Variable size pooling (with serial pool sizes of 5, 10, 20, 30, 40 and 50 individuals) performed consistently better than constant size pooling of 50 individuals. We conclude that variable pool size coupled with MLE is critical for accurate estimates of mosquito infection rates in WNV epidemic seasons.
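The MIR/MLE distinction is concrete. MIR assumes one infected insect per positive pool, while the MLE maximizes the pooled-binomial likelihood and so accommodates variable pool sizes. A sketch using ternary search on the unimodal log-likelihood (for equal pools of size m with k of n positive, the closed form 1 − (1 − k/n)^(1/m) serves as a check; the 18-pool numbers are invented):

```python
import math

def mir(pools):
    # pools: list of (pool_size, tested_positive)
    # minimum infection rate: assumes one infected mosquito per positive pool
    positives = sum(1 for _, positive in pools if positive)
    return positives / sum(size for size, _ in pools)

def loglik(p, pools):
    ll = 0.0
    for size, positive in pools:
        q = (1.0 - p) ** size                 # probability the pool is negative
        ll += math.log(1.0 - q) if positive else math.log(q)
    return ll

def mle(pools, lo=1e-9, hi=0.999):
    # the pooled-binomial log-likelihood is unimodal in p: ternary search
    for _ in range(200):
        m1 = lo + (hi - lo) / 3.0
        m2 = hi - (hi - lo) / 3.0
        if loglik(m1, pools) < loglik(m2, pools):
            lo = m1
        else:
            hi = m2
    return (lo + hi) / 2.0
```

With 6 positive pools out of 18 pools of 50, the MLE exceeds the MIR, and the gap widens as infection rates rise — the situation the abstract describes for epidemic seasons.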

13.
Association tests that pool minor alleles into a measure of burden at a locus have been proposed for case-control studies using sequence data containing rare variants. However, such pooling tests are not robust to the inclusion of neutral and protective variants, which can mask the association signal from risk variants. Early studies proposing pooling tests dismissed methods for locus-wide inference using nonnegative single-variant test statistics based on unrealistic comparisons. However, such methods are robust to the inclusion of neutral and protective variants and therefore may be more useful than previously appreciated. In fact, some recently proposed methods derived within different frameworks are equivalent to performing inference on weighted sums of squared single-variant score statistics. In this study, we compared two existing methods for locus-wide inference using nonnegative single-variant test statistics to two widely cited pooling tests under more realistic conditions. We established analytic results for a simple model with one rare risk and one rare neutral variant, which demonstrated that pooling tests were less powerful than even Bonferroni-corrected single-variant tests in most realistic situations. We also performed simulations using variants with realistic minor allele frequency and linkage disequilibrium spectra, disease models with multiple rare risk variants and extensive neutral variation, and varying rates of missing genotypes. In all scenarios considered, existing methods using nonnegative single-variant test statistics had power comparable to or greater than two widely cited pooling tests. Moreover, in disease models with only rare risk variants, an existing method based on the maximum single-variant Cochran-Armitage trend chi-square statistic in the locus had power comparable to or greater than another existing method closely related to some recently proposed methods. 
We conclude that efficient locus-wide inference using single-variant test statistics should be reconsidered as a useful framework for devising powerful association tests in sequence data with rare variants.
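The masking phenomenon discussed above is easy to reproduce: with one rare risk variant and one rare protective variant of equal total carrier frequency, a burden statistic can be exactly zero while single-variant statistics remain large. A toy sketch with made-up genotype counts (not data from the study):

```python
import math

def two_sample_z(x, y):
    # normal-approximation z statistic for a difference in means
    mx, my = sum(x) / len(x), sum(y) / len(y)
    vx = sum((v - mx) ** 2 for v in x) / (len(x) - 1)
    vy = sum((v - my) ** 2 for v in y) / (len(y) - 1)
    return (mx - my) / math.sqrt(vx / len(x) + vy / len(y))

# toy locus: variant 0 is a rare risk variant (enriched in cases),
# variant 1 a rare protective variant (enriched in controls);
# 1000 cases and 1000 controls, carriers coded 0/1
case_geno = [[1, 0]] * 30 + [[0, 1]] * 5 + [[0, 0]] * 965
ctrl_geno = [[1, 0]] * 10 + [[0, 1]] * 25 + [[0, 0]] * 965

# burden test: pool minor alleles into one per-subject score
z_burden = two_sample_z([sum(g) for g in case_geno], [sum(g) for g in ctrl_geno])

# single-variant tests, to be Bonferroni-corrected over the two variants
z_single = [two_sample_z([g[j] for g in case_geno], [g[j] for g in ctrl_geno])
            for j in range(2)]
```

Here both groups carry 35 minor alleles in total, so the burden signal vanishes entirely, while each single-variant statistic comfortably clears a Bonferroni threshold — the masking argument made above.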

14.
MOTIVATION: If there is insufficient RNA from the tissues under investigation from one organism, then it is common practice to pool RNA. An important question is to determine whether pooling introduces biases, which can lead to inaccurate results. In this article, we describe two biases related to pooling, from a theoretical as well as a practical point of view. RESULTS: We model and quantify the respective parts of the pooling bias due to the log transform as well as the bias due to biological averaging of the samples. We also evaluate the impact of the bias on the statistical differential analysis of Affymetrix data.
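The log-transform part of the bias is Jensen's inequality: the log of a pooled (averaged) sample exceeds the average of the individual logs, by about σ²/2 when expression is log-normal. A quick numerical check (the parameters are arbitrary):

```python
import math
import random

rng = random.Random(3)
# log-normal expression values for one gene across 10,000 subjects
values = [math.exp(rng.gauss(5.0, 1.0)) for _ in range(10000)]

mean_of_logs = sum(math.log(v) for v in values) / len(values)  # individual design
log_of_mean = math.log(sum(values) / len(values))              # pooled (averaged) RNA
bias = log_of_mean - mean_of_logs    # Jensen gap; ~ sigma^2 / 2 = 0.5 here
```

The gap is systematic, not sampling noise, so pooled and individual designs estimate different quantities on the log scale unless the bias is modelled.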

15.
Liquid chromatography-mass spectrometry (LC-MS)-based proteomics is becoming an increasingly important tool in characterizing the abundance of proteins in biological samples of various types and across conditions. Effects of disease or drug treatments on protein abundance are of particular interest for the characterization of biological processes and the identification of biomarkers. Although state-of-the-art instrumentation is available to make high-quality measurements and commercially available software is available to process the data, the complexity of the technology and data presents challenges for bioinformaticians and statisticians. Here, we describe a pipeline for the analysis of quantitative LC-MS data. Key components of this pipeline include experimental design (sample pooling, blocking, and randomization) as well as deconvolution and alignment of mass chromatograms to generate a matrix of molecular abundance profiles. An important challenge in LC-MS-based quantitation is to be able to accurately identify and assign abundance measurements to members of protein families. To address this issue, we implement a novel statistical method for inferring the relative abundance of related members of protein families from tryptic peptide intensities. This pipeline has been used to analyze quantitative LC-MS data from multiple biomarker discovery projects. We illustrate our pipeline here with examples from two of these studies, and show that the pipeline constitutes a complete workable framework for LC-MS-based differential quantitation. Supplementary material is available at http://iec01.mie.utoronto.ca/~thodoros/Bukhman/.

16.
Zhao Y, Wang S. Human Heredity 2009;67(1):46-56
Study cost remains the major limiting factor for genome-wide association studies due to the necessity of genotyping a large number of SNPs for a large number of subjects. Both DNA pooling strategies and two-stage designs have been proposed to reduce genotyping costs. In this study, we propose a cost-effective, two-stage approach with a DNA pooling strategy. During stage I, all markers are evaluated on a subset of individuals using DNA pooling. The most promising set of markers is then evaluated with individual genotyping for all individuals during stage II. The goal is to determine the optimal parameters (π_sample, the proportion of samples used during stage I with DNA pooling, and π_marker, the proportion of markers evaluated during stage II with individual genotyping) that minimize the cost of a two-stage DNA pooling design while maintaining a desired overall significance level and achieving a level of power similar to that of a one-stage individual genotyping design. We considered the effects of three factors on optimal two-stage DNA pooling designs. Our results suggest that, under most scenarios considered, the optimal two-stage DNA pooling design may be much more cost-effective than the optimal two-stage individual genotyping design, which uses individual genotyping during both stages.
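The optimization described here can be caricatured as a grid search over (π_sample, π_marker) against rough cost and power models. Everything below — the cost constants, the thresholds, and the independence approximation multiplying power across stages — is an assumption of this sketch, not the paper's model:

```python
import math

def norm_cdf(x):
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def norm_ppf(q):
    lo, hi = -10.0, 10.0
    for _ in range(100):
        mid = (lo + hi) / 2.0
        if norm_cdf(mid) < q:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0

def best_design(N=2000, M=10000, delta=0.05, p=0.3, err=0.005,
                c_pool=0.25, c_geno=1.0, target_power=0.8):
    # Stage 1: pooled allele frequencies on pi_s*N subjects per group,
    # keeping the top pi_m fraction of markers.  Stage 2: individual
    # genotyping of those markers on all N subjects, Bonferroni-corrected.
    best = None
    for pi_s in (0.25, 0.5, 0.75, 1.0):
        for pi_m in (0.002, 0.005, 0.01, 0.02, 0.05):
            n1 = pi_s * N
            # SE of the case-control allele-frequency difference per stage
            se1 = math.sqrt(2.0 * p * (1.0 - p) / (2.0 * n1) + 2.0 * err ** 2)
            se2 = math.sqrt(2.0 * p * (1.0 - p) / (2.0 * N))
            z1 = norm_ppf(1.0 - pi_m / 2.0)                # stage 1 screen
            z2 = norm_ppf(1.0 - 0.05 / (2.0 * pi_m * M))   # stage 2 Bonferroni
            power = ((1.0 - norm_cdf(z1 - delta / se1)) *
                     (1.0 - norm_cdf(z2 - delta / se2)))   # independence approx.
            cost = 2.0 * c_pool * M + c_geno * pi_m * M * N
            if power >= target_power and (best is None or cost < best[0]):
                best = (cost, pi_s, pi_m, power)
    return best
```

Even in this crude model the feasible two-stage pooling designs cost a small fraction of genotyping every subject at every marker, which is the qualitative conclusion of the abstract.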

17.
Background: Identifying biomarkers for accurate diagnosis and prognosis of diseases is important for the prevention of disease development. The molecular networks that describe the functional relationships among molecules provide a global view of the complex biological systems. With the molecular networks, the molecular mechanisms underlying diseases can be unveiled, which helps identify biomarkers in a systematic way. Results: In this survey, we report the recent progress on identifying biomarkers based on the topology of molecular networks, and we categorize those biomarkers into three groups, including node biomarkers, edge biomarkers and network biomarkers. These distinct types of biomarkers can be detected under different conditions depending on the data available. Conclusions: The biomarkers identified based on molecular networks can provide more accurate diagnosis and prognosis. The pros and cons of different types of biomarkers as well as future directions to improve the methods for identifying biomarkers are also discussed.

18.
Traditionally, biomarkers of aging are classified as either pro-longevity or antilongevity. Using longitudinal data sets from the large-scale inbred mouse strain study at the Jackson Laboratory Nathan Shock Center, we describe a protocol to identify two kinds of biomarkers: those with prognostic implication for lifespan and those with longitudinal evidence. Our protocol also identifies biomarkers for which, at first sight, there is conflicting evidence. Conflict resolution is possible by postulating a role switch. In these cases, high biomarker values are, for example, antilongevity in early life and pro-longevity in later life. Role-switching biomarkers correspond to features that must, for example, be minimized early, but maximized later, for optimal longevity. The clear-cut pro-longevity biomarkers we found reflect anti-inflammatory, anti-immunosenescent or anti-anaemic mechanisms, whereas clear-cut antilongevity biomarkers reflect inflammatory mechanisms. Many highly significant blood biomarkers relate to immune system features, indicating a shift from adaptive to innate processes, whereas most role-switching biomarkers relate to blood serum features and whole-body phenotypes. Our biomarker classification approach is applicable to any combination of longitudinal studies with life expectancy data, and it provides insights beyond a simplified scheme of biomarkers for long or short lifespan.

19.
When designing microarray experiments, scientists are often confronted with the question of pooling because of financial constraints, and discussion of the validity of pooling tends toward recommending sub-pooling. Since complete pooling protocols can be considered part of sub-pooling designs, gene expression data from three complete pooling experiments were analyzed. Data from completely pooled versus individual mRNA samples of rat brain tissue were compared to answer the question of whether the pooled sample represents the individual samples in small-sized experiments. Our analytic approach provided clear results concerning the Affymetrix MAS 5.0 signal and detection call parameters. Despite a strong similarity of arrays within experimental groups, the individual signals were evidently not appropriately represented in the pooled sample for slightly more than half of all the genes considered. Our analysis reveals problems in cases of small complete pooling designs with fewer than six subjects pooled.

20.
This paper proposes a deep learning model based on convolutional and recurrent neural networks that identifies circular RNA splice sites in the human genome from genomic sequence data. First, using preprocessed nucleotide sequences, we designed combinations of 2 network depths, 8 convolution kernel sizes, and 3 long short-term memory (LSTM) parameter settings, for a total of 16 models in 8 groups. Second, we further tested mean pooling and max pooling in the pooling layer, and added GC content as a feature to improve the model's predictive ability. Finally, we applied the model to experimentally validated circular RNAs from human seminal plasma. The results show that the model with a 32×4 convolution kernel, depth 1, and an LSTM parameter of 32 achieved the highest recognition rate: 0.9824 on the training set, 0.95 accuracy on the test set, and 83% correct identification on the experimentally validated data. The model thus performs well at identifying human circular RNA splice sites.
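The preprocessing this abstract mentions — one-hot encoding of the nucleotide sequence plus a GC-content feature — can be sketched as follows (the network itself is omitted; function names are illustrative):

```python
def one_hot(seq):
    # encode a nucleotide string as a list of 4-dim one-hot vectors (A, C, G, T)
    index = {'A': 0, 'C': 1, 'G': 2, 'T': 3}
    encoded = []
    for base in seq.upper():
        vec = [0.0] * 4
        if base in index:            # unknown bases such as 'N' stay all-zero
            vec[index[base]] = 1.0
        encoded.append(vec)
    return encoded

def gc_content(seq):
    # auxiliary scalar feature fed to the model alongside the one-hot sequence
    s = seq.upper()
    return (s.count('G') + s.count('C')) / max(len(s), 1)
```

The resulting length×4 matrix is the natural input shape for convolution kernels like the 32×4 kernel the abstract reports, with GC content appended as an extra feature.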

