Similar Articles
20 similar articles found.
1.
If biological questions are to be answered using quantitative proteomics, experiments must be designed with sufficient power to detect changes in expression. Sample subpooling is a strategy that can reduce variance while still allowing studies to encompass biological variation. Underlying sample pooling strategies is the biological-averaging assumption: that measurements taken on the pool equal the average of the measurements taken on the individuals. This study finds no evidence of a systematic bias introduced by sample pooling for DIGE, and shows that pooling can usefully reduce biological variation. For the first time in quantitative proteomics, the two sources of variance were decoupled; technical variance was found to predominate for mouse brain, while biological variance predominates for human brain. A power analysis found that as the number of individuals pooled increased, the number of replicates needed declined but the number of biological samples required increased. Repeated measures of biological samples decreased the number of samples required but increased the number of gels needed. An example cost-benefit analysis demonstrates how researchers can optimise their experiments while taking the available resources into account.
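The pooling trade-off described above can be sketched numerically. This is a minimal illustration under a normal approximation, with invented effect-size and variance values (the paper's actual estimates are not reproduced here): pooling divides the biological variance by the pool size but leaves the technical variance untouched.

```python
# Sketch of the pooling power trade-off, normal approximation.
# delta, sigma_b2, sigma_t2 and all numbers below are illustration values.
import math

def norm_cdf(z):
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def power_pooled(delta, sigma_b2, sigma_t2, n_pool, n_reps, z_alpha=1.959964):
    """Approximate power of a two-group comparison in which each group is
    measured on n_reps replicate gels, each run on a pool of n_pool
    individuals. Pooling shrinks biological variance only."""
    var_gel = sigma_b2 / n_pool + sigma_t2
    se = math.sqrt(2.0 * var_gel / n_reps)   # SE of the group difference
    return norm_cdf(delta / se - z_alpha)

# Bigger pools -> more power from the same number of gels, at the cost
# of recruiting more biological samples.
for n_pool in (1, 3, 6):
    print(n_pool, round(power_pooled(1.0, 1.5, 0.5, n_pool, 4), 3))
```

With these made-up variances, power rises monotonically with pool size for a fixed number of gels, which is the qualitative behaviour the abstract reports.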

2.
MOTIVATION: Gene expression profiling experiments in cell lines and animal models characterized by specific genetic or molecular perturbations have yielded sets of genes annotated by the perturbation. These gene sets can serve as a reference base for interrogating other expression datasets. For example, a new dataset in which a specific pathway gene set appears to be enriched, in terms of multiple genes in that set showing expression changes, can then be annotated by that reference pathway. In this paper we introduce a formal statistical method to measure the enrichment of each sample in an expression dataset. This allows us to assay the natural variation of pathway activity in observed gene expression datasets from clinical cancer and other studies. RESULTS: Validation of the method and illustrations of the biological insights gleaned are demonstrated on cell line data, mouse models, and cancer-related datasets. Using oncogenic pathway signatures, we show that gene sets built from a model system are indeed enriched in the model system. We employ ASSESS for molecular classification by pathways, which provides an accurate classifier that can be interpreted at the level of pathways instead of individual genes. Finally, ASSESS can be used for cross-platform expression models, where data on the same type of cancer are integrated over different platforms into a space of enrichment scores. AVAILABILITY: Versions are available in Octave and Java (with a graphical user interface). Software can be downloaded at http://people.genome.duke.edu/assess.
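A toy version of per-sample enrichment in the spirit of the method above: a Kolmogorov-Smirnov-style running sum over a ranked gene list. This is an illustrative stand-in, not the authors' exact ASSESS statistic; the gene names are invented.

```python
# Toy single-sample enrichment score: walk down the ranked list, step up
# on gene-set hits and down on misses, and report the signed extreme.
def enrichment_score(ranked_genes, gene_set):
    n_hit = sum(g in gene_set for g in ranked_genes)
    n_miss = len(ranked_genes) - n_hit
    running, extreme = 0.0, 0.0
    for g in ranked_genes:
        running += 1.0 / n_hit if g in gene_set else -1.0 / n_miss
        if abs(running) > abs(extreme):
            extreme = running
    return extreme  # signed maximum deviation of the running sum

ranked = ["g1", "g2", "g3", "g4", "g5", "g6"]   # most to least expressed
print(enrichment_score(ranked, {"g1", "g2"}))   # set clustered at the top
print(enrichment_score(ranked, {"g3", "g6"}))   # set scattered down the list
```

A set concentrated at the top of the ranking yields the maximal score, while a scattered set drifts near zero, which is the intuition behind calling a pathway "enriched" in a sample.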

3.
MOTIVATION: Microarrays can simultaneously measure the expression levels of many genes and are widely applied to study complex biological problems at the genetic level. To contain costs, instead of obtaining a microarray on each individual, mRNA from several subjects can be first pooled and then measured with a single array. mRNA pooling is also necessary when there is not enough mRNA from each subject. Several studies have investigated the impact of pooling mRNA on inferences about gene expression, but have typically modeled the process of pooling as if it occurred on some transformed scale. This assumption is unrealistic. RESULTS: We propose modeling the gene expression levels in a pool as a weighted average of the mRNA expression of all individuals in the pool on the original measurement scale, where the weights correspond to individual sample contributions to the pool. Based on these improved statistical models, we develop the appropriate F statistics to test for differentially expressed genes. We present formulae to calculate the power of various statistical tests under different strategies for pooling mRNA and compare the resulting power estimates to those that would be obtained by following the approach proposed by Kendziorski et al. (2003). We find that the Kendziorski estimate tends to exceed true power and that the estimate we propose, while somewhat conservative, is less biased. We argue that it is possible to design a study that includes mRNA pooling at a significantly reduced cost but with little loss of information.
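The modeling point above is easy to demonstrate: the pool's expression is a weighted average of individual mRNA levels on the ORIGINAL scale, so the log of the pool is not the average of the individual logs. All numbers here are hypothetical.

```python
# Weighted-average pooling on the raw scale vs averaging on the log scale.
import math

expr = [120.0, 80.0, 400.0]      # hypothetical raw expression levels
weights = [0.5, 0.25, 0.25]      # unequal contributions to the pool

pool_raw = sum(w * x for w, x in zip(weights, expr))               # pooled signal
log_of_pool = math.log2(pool_raw)                                  # what the array measures
avg_of_logs = sum(w * math.log2(x) for w, x in zip(weights, expr)) # per-subject average

print(round(log_of_pool, 3), round(avg_of_logs, 3))
```

The two quantities differ (by Jensen's inequality the log of the weighted mean is at least the weighted mean of the logs), which is why modeling pooling on a transformed scale is unrealistic.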

4.
ABSTRACT: BACKGROUND: Statistical analyses of whole genome expression data require functional information about genes in order to yield meaningful biological conclusions. The Gene Ontology (GO) and Kyoto Encyclopedia of Genes and Genomes (KEGG) are common sources of functionally grouped gene sets. For bacteria, the SEED and MicrobesOnline provide alternative, complementary sources of gene sets. To date, no comprehensive evaluation of the data obtained from these resources has been performed. RESULTS: We define a series of gene set consistency metrics directly related to the most common classes of statistical analyses for gene expression data, and then perform a comprehensive analysis of 3581 Affymetrix gene expression arrays across 17 diverse bacteria. We find that gene sets obtained from GO and KEGG demonstrate lower consistency than those obtained from the SEED and MicrobesOnline, regardless of gene set size. CONCLUSIONS: Despite the widespread use of GO and KEGG gene sets in bacterial gene expression data analysis, the SEED and MicrobesOnline provide more consistent sets for a wide variety of statistical analyses of such data. Increased use of the SEED and MicrobesOnline gene sets in the analysis of bacterial gene expression data may improve the statistical power and utility of such analyses.

5.
MOTIVATION: Many biomedical experiments are carried out by pooling individual biological samples. However, pooling samples can potentially hide biological variance and give false confidence concerning the data significance. In the context of microarray experiments for detecting differentially expressed genes, recent publications have addressed the problem of the efficiency of sample pooling, and some approximate formulas were provided for the power and sample size calculations. It is desirable to have exact formulas for these calculations and to check the approximate results against the exact ones. We show that the difference between the approximate and the exact results can be large. RESULTS: In this study, we have characterized quantitatively the effect of pooling samples on the efficiency of microarray experiments for the detection of differential gene expression between two classes. We present exact formulas for calculating the power of microarray experimental designs involving sample pooling and technical replications. The formulas can be used to determine the total number of arrays and biological subjects required in an experiment to achieve the desired power at a given significance level. The conditions under which a pooled design becomes preferable to a non-pooled design can then be derived, given the unit cost associated with a microarray and that associated with a biological subject. This paper thus serves to provide guidance on sample pooling and cost-effectiveness. The formulation in this paper is outlined in the context of performing microarray comparative studies, but its applicability is not limited to microarray experiments. It is also applicable to a wide range of biomedical comparative studies where sample pooling may be involved.
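The cost comparison discussed above can be sketched as a small search over designs. This is a hedged sketch using a normal approximation rather than the paper's exact formulas; the effect size, variance components, and unit costs are all invented illustration values.

```python
# Find the cheapest (pool size, arrays per group) design reaching a
# power target, then compare two cost regimes. Normal approximation;
# all parameter values are made up for illustration.
import math

def norm_cdf(z):
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def power(delta, sigma_b2, sigma_t2, pool_size, n_arrays):
    var = sigma_b2 / pool_size + sigma_t2     # per-array variance
    se = math.sqrt(2.0 * var / n_arrays)
    return norm_cdf(delta / se - 1.959964)

def min_cost_design(target_power, cost_array, cost_subject):
    """Cheapest (total_cost, pool_size, arrays_per_group) meeting target_power."""
    best = None
    for pool in range(1, 11):
        for arrays in range(2, 200):
            if power(1.0, 1.0, 0.3, pool, arrays) >= target_power:
                cost = 2 * (arrays * cost_array + pool * arrays * cost_subject)
                if best is None or cost < best[0]:
                    best = (cost, pool, arrays)
                break   # smallest array count for this pool size
    return best

print(min_cost_design(0.8, 500, 10))   # arrays expensive -> pooling pays
print(min_cost_design(0.8, 10, 500))   # subjects expensive -> no pooling
```

When arrays dominate the budget the optimum pools several subjects per array; when subjects dominate, the non-pooled design wins, mirroring the condition the abstract derives.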

6.
Zhao Y, Wang S. Human Heredity 2009;67(1):46-56
Study cost remains the major limiting factor for genome-wide association studies due to the necessity of genotyping a large number of SNPs for a large number of subjects. Both DNA pooling strategies and two-stage designs have been proposed to reduce genotyping costs. In this study, we propose a cost-effective two-stage approach with a DNA pooling strategy. During stage I, all markers are evaluated on a subset of individuals using DNA pooling. The most promising set of markers is then evaluated with individual genotyping for all individuals during stage II. The goal is to determine the optimal parameters (π_sample, the proportion of samples used during stage I with DNA pooling; and π_marker, the proportion of markers evaluated during stage II with individual genotyping) that minimize the cost of a two-stage DNA pooling design while maintaining a desired overall significance level and achieving a level of power similar to that of a one-stage individual genotyping design. We considered the effects of three factors on optimal two-stage DNA pooling designs. Our results suggest that, under most scenarios considered, the optimal two-stage DNA pooling design may be much more cost-effective than the optimal two-stage individual genotyping design, which uses individual genotyping during both stages.
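An illustrative cost model for the two-stage pooling design described above. All unit costs are invented, pool-construction overhead is ignored, and the power and significance constraints the paper optimises under are omitted for brevity; this only shows why the design can be dramatically cheaper.

```python
# Stage I: every marker assayed on pooled DNA (two pools as a stand-in
# for case/control); stage II: the top pi_marker fraction of markers
# individually genotyped on all subjects. Unit costs are invented.
def two_stage_cost(n_subjects, n_markers, pi_sample, pi_marker,
                   cost_pool_assay=0.05, cost_ind_genotype=0.10):
    # pi_sample controls how many subjects enter the stage-I pools; in
    # this simplified model it does not change the assay cost itself.
    stage1 = n_markers * cost_pool_assay * 2
    stage2 = pi_marker * n_markers * n_subjects * cost_ind_genotype
    return stage1 + stage2

one_stage = 500_000 * 0.10 * 2_000        # individual genotyping throughout
two_stage = two_stage_cost(2_000, 500_000, 0.5, 0.01)
print(two_stage, one_stage)
```

Carrying only 1% of markers into individual genotyping cuts the toy budget by roughly two orders of magnitude, which is the intuition behind optimising π_sample and π_marker jointly.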

7.
Dense maps of short-tandem-repeat polymorphisms (STRPs) have allowed genome-wide searches for genes involved in a great variety of diseases with genetic influences, including common complex diseases. Generally for this purpose, marker sets with a 10 cM spacing are genotyped in hundreds of individuals. We have performed power simulations to estimate the maximum possible intermarker distance that still allows for sufficient power. In this paper we further report on modifications of previously published protocols, resulting in a powerful screening set containing 229 STRPs with an average spacing of 18.3 cM. A complete genome scan using our protocol requires only 80 multiplex PCR reactions which are all carried out using one set of conditions and which do not contain overlapping marker allele sizes. The multiplex PCR reactions are grouped by sets of chromosomes, which enables on-line statistical analysis of a set of chromosomes, as sets of chromosomes are being genotyped. A genome scan following this modified protocol can be performed using a maximum amount of 2.5 micrograms of genomic DNA per individual, isolated from either blood or from mouth swabs.

8.
Nam S, Park T. PLoS ONE 2012;7(4):e31685
Colorectal cancer (CRC) has one of the highest incidences among all cancers. The majority of CRCs are sporadic cancers that occur in individuals without family histories of CRC or inherited mutations. Unfortunately, whole-genome expression studies of sporadic CRCs are limited. A recent study used microarray techniques to identify a predictor gene set indicative of susceptibility to early-onset CRC. However, the molecular mechanisms of the predictor gene set were not fully investigated in the previous study. To understand the functional roles of the predictor gene set, in the present study we applied a subpathway-based statistical model to the microarray data from the previous study and identified mechanisms that are reasonably associated with the predictor gene set. Interestingly, significant subpathways belonging to 2 KEGG pathways (focal adhesion; natural killer cell-mediated cytotoxicity) were found to be involved in the early-onset CRC patients. We also showed that the 2 pathways were functionally involved in the predictor gene set using a text-mining technique. Entry of a single member of the predictor gene set triggered a focal adhesion pathway, which confers anti-apoptosis in the early-onset CRC patients. Furthermore, intensive inspection of the predictor gene set in terms of the 2 pathways suggested that some entries of the predictor gene set were implicated in immunosuppression along with epithelial-mesenchymal transition (EMT) in the early-onset CRC patients. In addition, we compared our subpathway-based statistical model with a gene set-based statistical model, MIT Gene Set Enrichment Analysis (GSEA). Our method showed better performance than GSEA in that it was more consistent with a well-known cancer-related pathway set. Thus, the biological interpretation generated by our subpathway-based approach seems quite reasonable and warrants further experimental study of early-onset CRC in terms of dedifferentiation or differentiation, processes underscored in EMT and immunosuppression.

9.
MOTIVATION: If there is insufficient RNA from the tissues under investigation from one organism, then it is common practice to pool RNA. An important question is to determine whether pooling introduces biases, which can lead to inaccurate results. In this article, we describe two biases related to pooling, from a theoretical as well as a practical point of view. RESULTS: We model and quantify the respective parts of the pooling bias due to the log transform as well as the bias due to biological averaging of the samples. We also evaluate the impact of the bias on the statistical differential analysis of Affymetrix data.
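The log-transform component of the pooling bias described above is easy to exhibit by simulation: averaging samples before taking logs (a pool) is systematically larger than averaging the individual logs, by Jensen's inequality. The distribution parameters below are arbitrary illustration values, not estimates from the paper.

```python
# Monte-Carlo sketch of the log-transform pooling bias.
import math
import random

random.seed(0)

def mean(xs):
    return sum(xs) / len(xs)

n_per_pool, n_pools = 5, 4000
bias_samples = []
for _ in range(n_pools):
    xs = [random.lognormvariate(5.0, 0.6) for _ in range(n_per_pool)]
    log_of_pool = math.log(mean(xs))                 # pool, then log
    mean_of_logs = mean([math.log(x) for x in xs])   # log, then average
    bias_samples.append(log_of_pool - mean_of_logs)

bias = mean(bias_samples)
print(round(bias, 3))   # strictly positive on every draw, by Jensen
```

Every simulated pool shows a positive gap, so a differential analysis that treats pooled log-signals as averaged individual log-signals inherits a systematic offset.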

10.
Genotyping errors are present in almost all genetic data and can affect biological conclusions of a study, particularly for studies based on individual identification and parentage. Many statistical approaches can incorporate genotyping errors, but usually need accurate estimates of error rates. Here, we used a new microsatellite data set developed for brown rockfish (Sebastes auriculatus) to estimate genotyping error using three approaches: (i) repeat genotyping 5% of samples, (ii) comparing unintentionally recaptured individuals and (iii) Mendelian inheritance error checking for known parent-offspring pairs. In each data set, we quantified genotyping error rate per allele due to allele drop-out and false alleles. Genotyping error rate per locus revealed an average overall genotyping error rate by direct count of 0.3%, 1.5% and 1.7% (0.002, 0.007 and 0.008 per-allele error rate) from replicate genotypes, known parent-offspring pairs and unintentionally recaptured individuals, respectively. By direct-count error estimates, the recapture and known parent-offspring data sets revealed an error rate four times greater than estimated using repeat genotypes. There was no evidence of correlation between error rates and locus variability for all three data sets, and errors appeared to occur randomly over loci in the repeat genotypes, but not in recaptures and parent-offspring comparisons. Furthermore, there was no correlation in locus-specific error rates between any two of the three data sets. Our data suggest that repeat genotyping may underestimate true error rates and may not estimate locus-specific error rates accurately. We therefore suggest using methods for error estimation that correspond to the overall aim of the study (e.g. known parent-offspring comparisons in parentage studies).
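A minimal sketch of the "direct count" per-allele error rate used above: compare replicate genotypes individual by individual at one locus and count mismatched alleles over all allele comparisons. The genotype data and the simple sorted-allele matching rule are invented for illustration.

```python
# Per-allele error rate by direct count between two replicate typings.
def allele_error_rate(genotypes1, genotypes2):
    """Each genotype is a pair of allele sizes; the two lists are the
    same individuals typed twice at one microsatellite locus."""
    mismatches = comparisons = 0
    for g1, g2 in zip(genotypes1, genotypes2):
        a, b = sorted(g1), sorted(g2)
        comparisons += 2
        mismatches += sum(x != y for x, y in zip(a, b))
    return mismatches / comparisons

rep1 = [(150, 154), (150, 150), (146, 154), (154, 158)]
rep2 = [(150, 154), (150, 154), (146, 154), (154, 158)]  # one allele differs
print(allele_error_rate(rep1, rep2))  # -> 0.125 (1 mismatched allele of 8)
```

The same counting logic applied to recaptured individuals or parent-offspring pairs gives the alternative estimates the abstract compares.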

11.
Pedigree reconstruction using genotypic markers has become an important tool for the study of natural populations. The nonstandard nature of the underlying statistical problems has led to the necessity of developing specialized statistical and computational methods. In this article, a new version of pedigree reconstruction tools (PRT 2.0) is presented. The software implements algorithms proposed in Almudevar & Field (Journal of Agricultural Biological and Environmental Statistics, 4, 1999, 136) and Almudevar (Biometrics, 57, 2001a, 757) for the reconstruction of single generation sibling groups (SG). A wider range of enumeration algorithms is included, permitting improved computational performance. In particular, an iterative version of the algorithm designed for larger samples is included in a fully automated form. The new version also includes expanded simulation utilities, as well as extensive reporting, including half-sibling compatibility, parental genotype estimates and flagging of potential genotype errors. A number of alternative algorithms are described and demonstrated. A comparative discussion of the underlying methodologies is presented. Although important aspects of this problem remain open, we argue that a number of methodologies including maximum likelihood estimation (COLONY 1.2 and 2.0) and the set cover formulation (KINALYZER) exhibit undesirable properties in the sibling reconstruction problem. There is considerable evidence that large sets of individuals not genetically excluded as siblings can be inferred to be a true sibling group, but it is also true that unrelated individuals may be genetically compatible with a true sibling group by chance. Such individuals may be identified on a statistical basis. PRT 2.0, based on these sound statistical principles, is able to efficiently match or exceed the highest reported accuracy rates, particularly for larger SG. The new version is available at http://www.urmc.rochester.edu/biostat/people/faculty/almudevar.cfm.

12.
The rapid acceleration of genetic data collection in biomedical settings has recently resulted in the rise of genetic compendiums filled with rich longitudinal disease data. One common feature of these data sets is their plethora of interval-censored outcomes. However, very few tools are available for the analysis of genetic data sets with interval-censored outcomes, and in particular, there is a lack of methodology for set-based inference. Set-based inference is used to associate a gene, biological pathway, or other genetic construct with outcomes and is one of the most popular strategies in genetics research. This work develops three such tests for interval-censored settings, beginning with a variance components test for interval-censored outcomes, the interval-censored sequence kernel association test (ICSKAT). We also provide the interval-censored version of the Burden test, and we then integrate ICSKAT and Burden to construct the optimal combination test, ICSKATO. These tests unlock set-based analysis of interval-censored data sets with analogs of three highly popular set-based tools commonly applied to continuous and binary outcomes. Simulation studies illustrate the advantages of the developed methods over ad hoc alternatives, including protection of the Type I error rate at very low levels and increased power. The proposed approaches are applied to the investigation that motivated this study, an examination of the genes associated with bone mineral density deficiency and fracture risk.
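The Burden-style collapsing idea referenced above can be sketched in a few lines: the variants in a set are summed into one weighted score per subject, and that single score is then tested against the outcome. The genotype matrix below is invented, and the interval-censored likelihood machinery of the paper is not reproduced here.

```python
# Collapse a variant set into one burden score per subject.
def burden_scores(genotype_matrix, weights=None):
    """genotype_matrix[i][j] = minor-allele count (0/1/2) of subject i at
    variant j; weights default to 1 (unweighted burden)."""
    n_variants = len(genotype_matrix[0])
    w = weights or [1.0] * n_variants
    return [sum(wj * gij for wj, gij in zip(w, row)) for row in genotype_matrix]

G = [[0, 1, 0],
     [2, 1, 1],
     [0, 0, 0],
     [1, 2, 1]]
print(burden_scores(G))  # -> [1.0, 4.0, 0.0, 4.0]
```

A Burden test regresses the outcome on this score, so it is powerful when most variants act in the same direction; SKAT-type variance-component tests, by contrast, remain powerful under mixed effect directions, which motivates the optimal combination.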

13.
14.
Whole-genome association studies present many new statistical and computational challenges due to the large quantity of data obtained. One of these challenges is haplotype inference; methods for haplotype inference designed for small data sets from candidate-gene studies do not scale well to the large number of individuals genotyped in whole-genome association studies. We present a new method and software for inference of haplotype phase and missing data that can accurately phase data from whole-genome association studies, and we present the first comparison of haplotype-inference methods for real and simulated data sets with thousands of genotyped individuals. We find that our method outperforms existing methods in terms of both speed and accuracy for large data sets with thousands of individuals and densely spaced genetic markers, and we use our method to phase a real data set of 3,002 individuals genotyped for 490,032 markers in 3.1 days of computing time, with 99% of masked alleles imputed correctly. Our method is implemented in the Beagle software package, which is freely available.

15.
The information content of previously published peptide sets was compared with that of smaller sets of peptides selected according to statistical designs. It was found that minimum analogue peptide sets (MAPS) constructed by factorial or fractional factorial designs in physicochemical properties contained substantial structure-activity information. Although five to six times smaller than the originally published peptide sets, the MAPS resulted in QSAR models able to predict biological activity. The QSARs derived from a MAPS of nine dipeptides and from a set of 58 dipeptides inhibiting angiotensin-converting enzyme were compared and found to be of equal strength. Furthermore, for a set of bitter-tasting dipeptides it was found that an incomplete MAPS of 10 dipeptides gave just as good a model as one based on a set of 48 dipeptides. By comparison, other non-designed sets of peptides gave QSARs with poor predictive power. It was also demonstrated how MAPS centered on a lead peptide can be constructed so as to specifically explore the physicochemical and biological properties in the vicinity of the lead. It was concluded that small, information-rich peptide sets (MAPS) can be constructed on the basis of statistical designs with principal properties of amino acids as design variables.

16.
In cluster randomized trials, intact social units such as schools, worksites or medical practices - rather than individuals themselves - are randomly allocated to intervention and control conditions, while the outcomes of interest are observed on individuals within each cluster. Such trials are becoming increasingly common in health promotion and health services research. Attrition is a common occurrence in randomized trials, and a standard approach for dealing with the resulting missing values is imputation. We consider imputation strategies for missing continuous outcomes, focusing on trials with a completely randomized design in which fixed cohorts from each cluster are enrolled prior to random assignment. We compare five imputation strategies with respect to the Type I and Type II error rates of the adjusted two-sample t-test for the intervention effect. Cluster mean imputation is compared with multiple imputation, using either within-cluster data or data pooled across clusters in each intervention group. In the case of pooling across clusters, we distinguish between standard multiple imputation procedures, which do not account for intracluster correlation, and a specialized procedure which does account for intracluster correlation but is not yet available in standard statistical software packages. A simulation study is used to evaluate the influence of cluster size, number of clusters, degree of intracluster correlation, and variability among cluster follow-up rates. We show that cluster mean imputation yields valid inferences and, given its simplicity, may be an attractive option in some large community intervention trials subject to individual-level attrition only; however, it may yield less powerful inferences than alternative procedures which pool across clusters, especially when cluster sizes are small and cluster follow-up rates are highly variable. When pooling across clusters, the imputation procedure should generally take intracluster correlation into account to obtain valid inferences; however, as long as the intracluster correlation coefficient is small, standard multiple imputation procedures may yield acceptable Type I error rates; moreover, these procedures may yield more powerful inferences than the specialized procedure, especially when the number of available clusters is small. Within-cluster multiple imputation is shown to be the least powerful of the procedures considered.
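Cluster mean imputation, the simplest strategy compared above, replaces each missing continuous outcome with the observed mean of its own cluster. A toy version with invented data:

```python
# Cluster mean imputation: fill each missing value (None) with the
# observed mean of the same cluster.
def cluster_mean_impute(clusters):
    """clusters: list of lists of continuous outcomes; None = attrition."""
    imputed = []
    for cluster in clusters:
        observed = [y for y in cluster if y is not None]
        m = sum(observed) / len(observed)
        imputed.append([m if y is None else y for y in cluster])
    return imputed

data = [[4.0, 5.0, None], [2.0, None, 3.0]]
print(cluster_mean_impute(data))  # -> [[4.0, 5.0, 4.5], [2.0, 2.5, 3.0]]
```

Because every filled value equals its cluster's observed mean, cluster-level means (the unit of analysis in the adjusted t-test) are unchanged, which is the intuition behind the validity result for this method.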

17.
Genetic similarities within and between human populations
The proportion of human genetic variation due to differences between populations is modest, and individuals from different populations can be genetically more similar than individuals from the same population. Yet sufficient genetic data can permit accurate classification of individuals into populations. Both findings can be obtained from the same data set, using the same number of polymorphic loci. This article explains why. Our analysis focuses on the frequency, omega, with which a pair of random individuals from two different populations is genetically more similar than a pair of individuals randomly selected from any single population. We compare omega to the error rates of several classification methods, using data sets that vary in number of loci, average allele frequency, populations sampled, and polymorphism ascertainment strategy. We demonstrate that classification methods achieve higher discriminatory power than omega because of their use of aggregate properties of populations. The number of loci analyzed is the most critical variable: with 100 polymorphisms, accurate classification is possible, but omega remains sizable, even when using populations as distinct as sub-Saharan Africans and Europeans. Phenotypes controlled by a dozen or fewer loci can therefore be expected to show substantial overlap between human populations. This provides empirical justification for caution when using population labels in biomedical settings, with broad implications for personalized medicine, pharmacogenetics, and the meaning of race.
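The omega statistic described above can be computed directly on toy data: it is the frequency with which a random between-population pair of individuals is more similar than a random within-population pair. The tiny genotypes below are invented; real analyses use dozens to hundreds of polymorphic loci.

```python
# Toy omega: fraction of (between-pair, within-pair) comparisons where
# the between-population pair shares strictly more alleles.
from itertools import combinations

def similarity(a, b):
    return sum(x == y for x, y in zip(a, b))   # shared alleles across loci

pop_a = [(0, 0, 0, 0), (0, 0, 0, 1), (0, 1, 1, 0)]
pop_b = [(1, 1, 1, 1), (1, 1, 0, 1), (0, 1, 1, 1)]

within = [similarity(x, y) for pop in (pop_a, pop_b)
          for x, y in combinations(pop, 2)]
between = [similarity(x, y) for x in pop_a for y in pop_b]

omega = sum(b > w for b in between for w in within) / (len(between) * len(within))
print(round(omega, 3))
```

Even with clearly distinct populations, omega stays above zero with so few loci, while adding loci drives it down; that is the article's point about classification accuracy and individual-pair similarity coexisting.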

18.
Recent studies have shown that the human genome has a haplotype block structure, such that it can be divided into discrete blocks of limited haplotype diversity. In each block, a small fraction of single-nucleotide polymorphisms (SNPs), referred to as "tag SNPs," can be used to distinguish a large fraction of the haplotypes. These tag SNPs can potentially be extremely useful for association studies, in that it may not be necessary to genotype all SNPs; however, this depends on how much power is lost. Here we develop a simulation study to quantitatively assess the power loss for a variety of study designs, including case-control designs and case-parental control designs. First, a number of data sets containing case-parental or case-control samples are generated on the basis of a disease model. Second, a small fraction of case and control individuals in each data set are genotyped at all the loci, and a dynamic programming algorithm is used to determine the haplotype blocks and the tag SNPs based on the genotypes of the sampled individuals. Third, the statistical power of tests was evaluated on the basis of three kinds of data: (1) all of the SNPs and the corresponding haplotypes, (2) the tag SNPs and the corresponding haplotypes, and (3) the same number of randomly chosen SNPs as the number of tag SNPs and the corresponding haplotypes. We study the power of different association tests with a variety of disease models and block-partitioning criteria. Our study indicates that the genotyping efforts can be significantly reduced by the tag SNPs, without much loss of power. Depending on the specific haplotype block-partitioning algorithm and the disease model, when the identified tag SNPs are only 25% of all the SNPs, the power is reduced by only 4%, on average, compared with a power loss of approximately 12% when the same number of randomly chosen SNPs is used in a two-locus haplotype analysis. When the identified tag SNPs are approximately 14% of all the SNPs, the power is reduced by approximately 9%, compared with a power loss of approximately 21% when the same number of randomly chosen SNPs is used in a two-locus haplotype analysis. Our study also indicates that haplotype-based analysis can be much more powerful than marker-by-marker analysis.
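Tag-SNP selection can be sketched greedily: pick SNPs until every SNP in the region is "covered" by a chosen tag with pairwise correlation above a threshold. The study above uses a dynamic-programming block partition rather than this simplified greedy cover, and the 0/1 haplotype data below are invented.

```python
# Greedy tag-SNP cover by pairwise r^2 (simplified stand-in for a
# block-partitioning algorithm).
def r2(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return (cov * cov) / (vx * vy) if vx and vy else 0.0

def greedy_tags(snps, threshold=0.8):
    untagged = set(range(len(snps)))
    tags = []
    while untagged:
        best = max(untagged,
                   key=lambda i: sum(r2(snps[i], snps[j]) >= threshold
                                     for j in untagged))
        tags.append(best)
        untagged -= {j for j in untagged if r2(snps[best], snps[j]) >= threshold}
    return tags

# four SNPs (columns over six haplotypes); SNPs 0/1 and 2/3 are duplicates
snps = [(0, 0, 1, 1, 0, 1), (0, 0, 1, 1, 0, 1),
        (1, 0, 0, 1, 1, 0), (1, 0, 0, 1, 1, 0)]
print(len(greedy_tags(snps)))   # two tags capture all four SNPs
```

Genotyping only the tags preserves most of the haplotype information in the block, which is why the power loss the study measures is small relative to choosing the same number of SNPs at random.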

19.
Tao H, Berno AJ, Cox DR, Frazer KA. PLoS ONE 2007;2(8):e697
Efforts to develop effective therapeutic treatments for promoting fast wound healing after injury to the epidermis are hindered by a lack of understanding of the factors involved. Re-epithelialization is an essential step of wound healing involving the migration of epidermal keratinocytes over the wound site. Here, we examine genetic variants in the keratin-1 (KRT1) locus for association with migration rates of human epidermal keratinocytes (HEK) isolated from different individuals. Although the role of intermediate filament genes, including KRT1, in wound activated keratinocytes is well established, this is the first study to examine if genetic variants in humans contribute to differences in the migration rates of these cells. Using an in vitro scratch wound assay we observe quantifiable variation in HEK migration rates in two independent sets of samples; 24 samples in the first set and 17 samples in the second set. We analyze genetic variants in the KRT1 interval and identify SNPs significantly associated with HEK migration rates in both samples sets. Additionally, we show in the first set of samples that the average migration rate of HEK cells homozygous for one common haplotype pattern in the KRT1 interval is significantly faster than that of HEK cells homozygous for a second common haplotype pattern. Our study demonstrates that genetic variants in the KRT1 interval contribute to quantifiable differences in the migration rates of keratinocytes isolated from different individuals. Furthermore we show that in vitro cell assays can successfully be used to deconstruct complex traits into simple biological model systems for genetic association studies.

20.
Selective DNA pooling is an advanced methodology for linkage mapping of quantitative trait loci (QTL) in farm animals. The principle is based on densitometric estimates of marker allele frequency in pooled DNA samples of phenotypically extreme individuals from half-sib, backcross and F(2) experimental designs. This methodology provides a rapid and efficient analysis of a large number of individuals with short tandem repeat markers, which are essential for detecting QTL through a genome-wide search. Several strategies involving whole-genome scanning with high statistical power have been developed for systematic searches to detect quantitative trait loci and linked loci of complex traits. In recent studies, considerable success has been achieved in mapping several QTLs in Israeli-Holstein cattle using selective DNA pooling. This paper outlines recently developed strategies for linkage mapping of QTL based on selective DNA pooling, with emphasis on the theoretical prerequisites for detecting linked QTLs, applications, a general theory for experimental half-sib designs, statistical power, and the feasibility of identifying marker-linked QTL in dairy cattle. The study reveals that selective DNA pooling in dairy cattle can best be exploited for genome-wide detection of linked loci with small and large QTL effects, applied to a moderately sized half-sib family of about 500 animals.
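The marker test implied above can be sketched as a two-proportion comparison: a marker's estimated allele frequency in the pool of phenotypically high offspring versus the pool of low offspring. Pool sizes here are counts of chromosomes, densitometric measurement error is ignored, and all numbers are invented.

```python
# Two-proportion z-test on pooled allele-frequency estimates from the
# high- and low-phenotype tails.
import math

def pool_z(freq_high, freq_low, n_high, n_low):
    p = (freq_high * n_high + freq_low * n_low) / (n_high + n_low)
    se = math.sqrt(p * (1 - p) * (1 / n_high + 1 / n_low))
    return (freq_high - freq_low) / se

print(round(pool_z(0.70, 0.50, 200, 200), 2))   # large frequency shift
print(round(pool_z(0.52, 0.50, 200, 200), 2))   # negligible shift
```

A marker linked to a QTL shifts in frequency between the extreme pools (large |z|), while an unlinked marker does not; scanning many markers this way is what makes the genome-wide search cheap relative to individual genotyping.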
