Similar Articles
20 similar articles found (search time: 15 ms)
1.
Wang J. Genetics 2012, 191(1):183-194
Quite a few methods have been proposed to infer sibship and parentage among individuals from their multilocus marker genotypes. They are all based on Mendelian laws, either qualitatively (exclusion methods) or quantitatively (likelihood methods), have different optimization criteria, and use different algorithms in searching for the optimal solution. The full-likelihood method assigns sibship and parentage relationships among all sampled individuals jointly. It is by far the most accurate method, but is computationally prohibitive for large data sets with many individuals and many loci. In this article I propose a new likelihood-based method that is computationally efficient enough to handle large data sets. The method uses the sum of the log likelihoods of pairwise relationships in a configuration as the score to measure its plausibility, where the log likelihoods of pairwise relationships are calculated only once and stored for repeated use. By analyzing several empirical and many simulated data sets, I show that the new method is more accurate than pairwise-likelihood and exclusion-based methods, but slightly less accurate than the full-likelihood method. However, the new method is computationally much more efficient than the full-likelihood method; when both sexes are polygamous and markers have genotyping errors, it can be several orders of magnitude faster. The new method can handle large samples with thousands of individuals, with the number of markers limited only by computer memory.
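A minimal Python sketch of the scoring idea (our construction, not the author's code): pairwise log likelihoods are computed once, cached, and each candidate configuration is then scored by summing the cached values over all pairs.

import itertools

def score_configuration(partition, pair_loglik):
    """Score a sibship partition as the sum of pairwise log likelihoods.

    partition: list of sets of individual IDs (putative full-sib families).
    pair_loglik: dict mapping (frozenset({i, j}), relationship) to a log
    likelihood; these values are computed once from the marker genotypes
    and reused for every candidate configuration.
    """
    individuals = [ind for family in partition for ind in family]
    score = 0.0
    for i, j in itertools.combinations(individuals, 2):
        same = any(i in family and j in family for family in partition)
        rel = "fullsib" if same else "unrelated"
        score += pair_loglik[(frozenset((i, j)), rel)]
    return score

# Toy cache for three individuals (log-likelihood values are illustrative only).
cache = {
    (frozenset(("a", "b")), "fullsib"): -2.1,
    (frozenset(("a", "b")), "unrelated"): -3.0,
    (frozenset(("a", "c")), "fullsib"): -2.8,
    (frozenset(("a", "c")), "unrelated"): -2.2,
    (frozenset(("b", "c")), "fullsib"): -2.9,
    (frozenset(("b", "c")), "unrelated"): -2.3,
}
print(score_configuration([{"a", "b"}, {"c"}], cache))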

2.
Meta-analysis is a statistical methodology for combining information from diverse sources so that a more reliable and efficient conclusion can be reached. It can be conducted either by synthesizing study-level summary statistics or by drawing inference from an overarching model for individual participant data (IPD), if available. The latter is often viewed as the “gold standard.” For random-effects models, however, it is not fully understood whether the use of IPD actually gains efficiency over summary statistics. In this paper, we examine the relative efficiency of the two methods under a general likelihood inference setting. We show theoretically and numerically that summary-statistics-based analysis is at most as efficient as IPD analysis, provided that the random effects follow a Gaussian distribution and maximum likelihood estimation is used to obtain the summary statistics. More specifically, (i) the two methods are equivalent in an asymptotic sense; and (ii) summary-statistics-based inference can incur an appreciable loss of efficiency if the sample sizes are not sufficiently large. Our results are established under the assumption that the between-study heterogeneity parameter remains constant regardless of the sample sizes, which differs from a previous study. Our findings are confirmed by analyses of simulated data sets and a real-world study of alcohol interventions.
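For orientation, the Gaussian random-effects model underlying this comparison can be written in standard form (our notation, not necessarily the authors'):

\hat{\theta}_i = \theta + b_i + \varepsilon_i, \qquad b_i \sim N(0, \tau^2), \qquad \varepsilon_i \sim N(0, \sigma_i^2),

where \hat{\theta}_i is the summary estimate from study i, \tau^2 is the between-study heterogeneity (held fixed in the paper's asymptotics), and \sigma_i^2 is the within-study variance, which shrinks as the study's sample size grows.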

3.
Basik M, Mousses S, Trent J. BioTechniques 2003, 35(3):580-2, 584, 586 passim
New technologies have greatly increased scientists' ability to investigate the complex molecular interactions that occur in cancer development and to identify genetic alterations and drug targets. However, these new capabilities have not accelerated drug development; rather, they may be contributing to increased research and development costs, because the large number of new drug targets discovered through genomics must each be investigated in detail to characterize their putative functional involvement in the disease process. One solution to this bottleneck in functional analysis is the use of high-throughput technologies to produce efficient processes that can rapidly handle the flood of information at every stage of disease research. This review examines the use of new and emerging DNA, tissue, and live-cell transfection microarray technologies that can be used to discover, validate, and translate information resulting from the completion of the Human Genome Project.

4.
Group testing is frequently used to reduce the costs of screening large numbers of individuals for infectious diseases or other binary characteristics in low-prevalence settings. In many applications, the goals include both identifying individuals as positive or negative and estimating the probability of positivity. The identification aspect leads to additional tests, known as “retests,” being performed beyond those for the initial groups of individuals. In this paper, we investigate how regression models can be fit to estimate the probability of positivity while also incorporating the extra information from these retests. We present simulation evidence showing that significant gains in efficiency occur by incorporating retesting information, and we further examine which testing protocols are the most efficient to use. Our investigations also demonstrate that some group testing protocols can actually lead to more efficient estimates than individual testing when diagnostic tests are imperfect. The proposed methods are applied retrospectively to chlamydia screening data from the Infertility Prevention Project. We demonstrate that significant cost savings could occur through the use of particular group testing protocols.
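As background for how pooling enters such models, the probability that a pool of k independent specimens tests positive, given individual positivity probabilities p_1, ..., p_k, assay sensitivity S_e, and specificity S_p, takes the standard form (our notation; the paper's likelihood additionally incorporates the retest outcomes):

P(\text{pool positive}) = S_e\left(1 - \prod_{i=1}^{k}(1 - p_i)\right) + (1 - S_p)\prod_{i=1}^{k}(1 - p_i),

with p_i = g^{-1}(\mathbf{x}_i'\boldsymbol{\beta}) under a binary regression model with link function g.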

5.
A common goal in ecology and wildlife management is to determine the causes of variation in population dynamics over long periods of time and across large spatial scales. Several statistical challenges must nevertheless be overcome to make appropriate inference about spatio-temporal variation in population dynamics, such as autocorrelation among data points, excess zeros, and observation error in count data. To address these issues, many scientists and statisticians have recommended the use of Bayesian hierarchical models. Unfortunately, hierarchical statistical models remain somewhat difficult to use, because of the quantitative background needed to implement them and the computational demands of estimating parameters with Markov chain Monte Carlo algorithms. Fortunately, new tools have recently been developed that make it more feasible for wildlife biologists to fit sophisticated hierarchical Bayesian models (i.e., Integrated Nested Laplace Approximation, ‘INLA’). We present a case study using two important game species in North America, the lesser and greater scaup, to demonstrate how INLA can be used to estimate the parameters of a hierarchical model that decouples observation error from process variation and accounts for unknown sources of excess zeros as well as spatial and temporal dependence in the data. Ultimately, our goal is to make unbiased inference about spatial variation in population trends over time.
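A generic sketch of such a model, assuming a zero-inflated Poisson process layer and a binomial observation layer (our notation, not the authors' exact specification), is:

y_{st} \mid N_{st} \sim \mathrm{Binomial}(N_{st}, p), \qquad N_{st} \sim \mathrm{ZIP}(\lambda_{st}, \pi), \qquad \log \lambda_{st} = \mathbf{x}_{st}'\boldsymbol{\beta} + u_s + v_t,

where y_{st} is the count observed at site s in year t, p is the detection probability, \pi the excess-zero probability, and u_s and v_t spatially and temporally structured random effects; INLA exploits the latent Gaussian structure of u and v to approximate the posteriors without MCMC.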

6.
7.
To address the global extinction crisis, both efficient use of existing conservation funding and new sources of funding are vital. Private sponsorship of charismatic ‘flagship’ species conservation represents an important source of new funding, but has been criticized as inefficient. However, the ancillary benefits of privately sponsored flagship species conservation via actions benefiting other species have not been quantified, nor have the benefits of incorporating such sponsorship into objective prioritization protocols. Here, we use a comprehensive dataset of conservation actions for the 700 most threatened species in New Zealand to examine the potential biodiversity gains from national private flagship species sponsorship programmes. We find that private funding for flagship species can clearly result in additional species and phylogenetic diversity conserved, via conservation actions shared with other species. When private flagship species funding is incorporated into a prioritization protocol to preferentially sponsor shared actions, expected gains can be more than doubled. However, these gains are consistently smaller than the expected gains in a hypothetical scenario where private funding could be optimally allocated among all threatened species. We recommend integrating private sponsorship of flagship species into objective prioritization protocols to sponsor efficient actions that maximize biodiversity gains, or, wherever possible, encouraging private donations for broader biodiversity goals.

8.
The genetic basis of complex diseases is expected to be highly heterogeneous, with complex interactions among multiple disease loci and environmental factors. Because interactions among large numbers of genetic loci are inherently high-dimensional, efficient statistical approaches for handling such high-order epistatic complexity have not been well developed. In this article, we introduce a new approach for testing genetic epistasis at multiple loci using an entropy-based statistic in a case-only design. The entropy-based statistic asymptotically follows a χ2 distribution. Computer simulations show that the entropy-based approach has better control of type I error and higher power than the standard χ2 test. Motivated by a schizophrenia data set, we propose a method for measuring and testing the relative entropy of a clinical phenotype, through which one can test the contribution of multiple disease loci, or their interaction, to a clinical phenotype. A sequential forward selection procedure is proposed to construct a genetic interaction network, illustrated through a tree-based diagram. The network information clearly shows the relative importance of a set of genetic loci for a clinical phenotype. To show the utility of the new entropy-based approach, we apply it to two real data sets: a schizophrenia data set and a published malaria data set. Our approach provides a fast and testable framework for studying genetic epistasis in a case-only design.
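The link between entropy-based statistics and the χ2 distribution can be illustrated with mutual information: twice the sample size times the plug-in mutual information of a two-locus genotype table is the G statistic, which is asymptotically χ2 distributed under independence. A minimal Python sketch of this general idea (illustrative counts; not the paper's exact statistic):

import numpy as np
from scipy.stats import chi2

def entropy_interaction_test(table):
    """table: counts of cases cross-classified by genotypes at two loci.
    Returns (G, p): G = 2N * mutual information, asymptotically chi-square
    with (r-1)(c-1) degrees of freedom under no interaction."""
    table = np.asarray(table, dtype=float)
    n = table.sum()
    p = table / n
    pr = p.sum(axis=1, keepdims=True)   # row (locus 1) marginals
    pc = p.sum(axis=0, keepdims=True)   # column (locus 2) marginals
    with np.errstate(divide="ignore", invalid="ignore"):
        terms = np.where(p > 0, p * np.log(p / (pr * pc)), 0.0)
    g = 2.0 * n * terms.sum()
    df = (table.shape[0] - 1) * (table.shape[1] - 1)
    return g, chi2.sf(g, df)

# Toy 3x3 genotype table for two biallelic loci (counts are illustrative only).
counts = [[30, 20, 10], [25, 40, 15], [10, 18, 32]]
print(entropy_interaction_test(counts))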

9.
More accurate and precise phenotyping strategies are necessary to empower high-resolution linkage mapping and genome-wide association studies and to train genomic selection models in plant improvement. Within this framework, the objective of modern phenotyping is to increase the accuracy, precision, and throughput of phenotypic estimation at all levels of biological organization while reducing costs and minimizing labor through automation, remote sensing, improved data integration, and experimental design. Much like the efforts to optimize genotyping during the 1980s and 1990s, designing effective phenotyping initiatives today requires multi-faceted collaborations among biologists, computer scientists, statisticians, and engineers. Robust phenotyping systems are needed to characterize the full suite of genetic factors that contribute to quantitative phenotypic variation across cells, organs, and tissues; developmental stages; years; environments; species; and research programs. Next-generation phenotyping generates significantly more data than before and requires novel data management, access, and storage systems; increased use of ontologies to facilitate data integration; and new statistical tools for enhancing experimental design and extracting biologically meaningful signal from environmental and experimental noise. To ensure relevance, the implementation of efficient and informative phenotyping experiments also requires familiarity with diverse germplasm resources, population structures, and target populations of environments. Today, phenotyping is quickly emerging as the major operational bottleneck limiting the power of genetic analysis and genomic prediction. The challenge for the next generation of quantitative geneticists and plant breeders is not only to understand the genetic basis of complex trait variation, but also to use that knowledge to efficiently synthesize twenty-first-century crop varieties.

10.
Over recent years, many statisticians and researchers have highlighted that statistical inference would benefit from a better use and understanding of hypothesis testing, p-values, and statistical significance. We highlight three recommendations in the context of the biochemical sciences. First recommendation: to improve the biological interpretation of biochemical data, do not use p-values (or similar test statistics) as thresholded values to select biomolecules. Second recommendation: to improve comparison among studies and to achieve robust knowledge, report data completely. Third recommendation: report statistical analyses completely, with exact numbers (not asterisks or inequalities). Owing to the high number of variables, better use of statistics is of special importance in omics studies.

11.
The establishment of cause and effect relationships is a fundamental objective of scientific research. Many lines of evidence can be used to make cause–effect inferences. When statistical data are involved, alternative explanations for the statistical relationship need to be ruled out. These include chance (apparent patterns due to random factors), confounding effects (a relationship between two variables because they are each associated with an unmeasured third variable), and sampling bias (effects due to preexisting properties of compared groups). The gold standard for managing these issues is a controlled randomized experiment. In disciplines such as biological anthropology, where controlled experiments are not possible for many research questions, causal inferences are made from observational data. Methods that statisticians recommend for this difficult objective have not been widely adopted in the biological anthropology literature. Issues involved in using statistics to make valid causal inferences from observational data are discussed.

12.
Empirical evidence supporting the commonality of gene × gene interactions, coupled with frequent failure to replicate results from previous association studies, has prompted statisticians to develop methods to handle this important subject. Nonparametric methods have generated intense interest because of their capacity to handle high-dimensional data. Genome-wide association analysis of large-scale SNP data is challenging mathematically and computationally. In this paper, we describe major issues and questions arising from this challenge, along with methodological implications. Data reduction and pattern recognition methods seem to be the new frontiers in efforts to detect gene × gene interactions comprehensively. Currently, there is no single method that is recognized as the 'best' for detecting, characterizing, and interpreting gene × gene interactions. Instead, a combination of approaches, with the aim of balancing their specific strengths, may be the optimal way to investigate gene × gene interactions in human data.

13.
Whole-genome association studies (WGAS) bring new computational, as well as analytic, challenges to researchers. Many existing genetic-analysis tools are not designed to handle such large data sets in a convenient manner and do not necessarily exploit the new opportunities that whole-genome data bring. To address these issues, we developed PLINK, an open-source C/C++ WGAS tool set. With PLINK, large data sets comprising hundreds of thousands of markers genotyped for thousands of individuals can be rapidly manipulated and analyzed in their entirety. As well as providing tools to make the basic analytic steps computationally efficient, PLINK also supports some novel approaches to whole-genome data that take advantage of whole-genome coverage. We introduce PLINK and describe the five main domains of function: data management, summary statistics, population stratification, association analysis, and identity-by-descent estimation. In particular, we focus on the estimation and use of identity-by-state and identity-by-descent information in the context of population-based whole-genome studies. This information can be used to detect and correct for population stratification and to identify extended chromosomal segments that are shared identical by descent between very distantly related individuals. Analysis of the patterns of segmental sharing has the potential to map disease loci that contain multiple rare variants in a population-based linkage analysis.

14.
Generally speaking, ecologists are interested in explaining ecological relationships, describing patterns and processes, and making spatial or temporal predictions. These tasks can be accomplished by modeling the relationship between output values (responses) and a set of features (explanatory variables). Modeling ecological data is challenging, however, because both response and predictor variables may be continuous or discrete. The ecological relationships to be explained are typically nonlinear, and the explanatory variables often interact in complex ways. Missing values in response and explanatory variables are not uncommon, and outliers frequently occur in ecological data. Moreover, ecologists usually want models that are both easy to build and easy to interpret. A variety of statistical methods are typically used to address the distinct ecological problems arising in these diverse settings, including (multiple) logistic regression, linear models, survival models, analysis of variance, and so on. Random forests are a single, effective method that can handle all of these problems: they can perform classification, clustering, regression, and survival analysis; assess variable importance; detect outliers; and impute missing data. Given these algorithmic strengths, this paper summarizes applications of random forests in ecology, outlines the modeling process, and demonstrates the method's main features through a case study of modeling the distribution of Yunnan pine (Pinus yunnanensis). By introducing the general terminology, concepts, and modeling ideas of random forests, we aim to help readers grasp the essentials of applying the method; random forests can be expected to see wider application and further development in ecological research.
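A minimal Python sketch of the workflow described here, using scikit-learn on hypothetical presence/absence data in the spirit of the Yunnan pine case study (the data and predictors are invented for illustration):

import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
# Hypothetical species-distribution data: rows are sites, columns are
# environmental predictors (e.g., temperature, precipitation, elevation).
X = rng.normal(size=(500, 3))
# Toy presence/absence response with a nonlinear signal in two predictors.
y = ((X[:, 0] > 0) & (X[:, 1] ** 2 < 1)).astype(int)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
rf = RandomForestClassifier(n_estimators=500, oob_score=True, random_state=0)
rf.fit(X_train, y_train)
print("OOB accuracy:", rf.oob_score_)          # internal out-of-bag estimate
print("Test accuracy:", rf.score(X_test, y_test))
print("Variable importances:", rf.feature_importances_)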

15.

Background  

The biomedical community is developing new methods of data analysis to more efficiently process the massive data sets produced by microarray experiments. Systematic and global mathematical approaches that can be readily applied to a large number of experimental designs are fundamental for correctly handling these otherwise overwhelming data sets.

16.
Shortreed and Ertefaie introduced a clever propensity score variable selection approach for estimating average causal effects: the outcome-adaptive lasso (OAL). OAL aims to select desirable covariates (confounders and predictors of outcome) to build an unbiased and statistically efficient propensity score estimator. By design, a potential limitation of OAL is how it handles collinearity, which is often encountered in high-dimensional data; as seen in Shortreed and Ertefaie, OAL's performance degrades as the correlation between covariates increases. In this note, we propose the generalized OAL (GOAL), which combines the strengths of the adaptively weighted L1 penalty and the elastic net to better handle the selection of correlated covariates. We propose two versions of GOAL that differ in their algorithms. We compared OAL and GOAL in simulation scenarios that mimic those examined by Shortreed and Ertefaie. Although all approaches performed equivalently with independent covariates, both GOAL versions outperformed OAL in low and high dimensions with correlated covariates.
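The general shape of a penalty combining adaptive L1 weights with an elastic-net ridge term, which GOAL builds on, can be sketched as (our notation, not necessarily the paper's exact criterion):

\hat{\boldsymbol{\beta}} = \arg\min_{\boldsymbol{\beta}} \left\{ -\ell(\boldsymbol{\beta}) + \lambda_1 \sum_{j} \hat{w}_j \lvert \beta_j \rvert + \lambda_2 \sum_{j} \beta_j^2 \right\},

where the adaptive weights \hat{w}_j are constructed from outcome-model coefficients so that confounders and predictors of outcome are penalized lightly, and the quadratic term stabilizes selection among highly correlated covariates.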

17.
We present methods for imputing data for ungenotyped markers and for inferring haplotype phase in large data sets of unrelated individuals and parent-offspring trios. Our methods make use of known haplotype phase when it is available, and our methods are computationally efficient so that the full information in large reference panels with thousands of individuals is utilized. We demonstrate that substantial gains in imputation accuracy accrue with increasingly large reference panel sizes, particularly when imputing low-frequency variants, and that unphased reference panels can provide highly accurate genotype imputation. We place our methodology in a unified framework that enables the simultaneous use of unphased and phased data from trios and unrelated individuals in a single analysis. For unrelated individuals, our imputation methods produce well-calibrated posterior genotype probabilities and highly accurate allele-frequency estimates. For trios, our haplotype-inference method is four orders of magnitude faster than the gold-standard PHASE program and has excellent accuracy. Our methods enable genotype imputation to be performed with unphased trio or unrelated reference panels, thus accounting for haplotype-phase uncertainty in the reference panel. We present a useful measure of imputation accuracy, allelic R2, and show that this measure can be estimated accurately from posterior genotype probabilities. Our methods are implemented in version 3.0 of the BEAGLE software package.
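One way to read allelic R2 is as the squared correlation between the true allele dosage and the dosage implied by the posterior genotype probabilities. The sketch below computes that quantity directly against a known truth (our simplification: BEAGLE estimates the measure from the posterior probabilities alone, without access to the truth):

import numpy as np

def allelic_r2(true_dosage, posterior_probs):
    """true_dosage: array of 0/1/2 allele counts per individual.
    posterior_probs: (n, 3) array of posterior probabilities for genotypes 0/1/2.
    Returns the squared Pearson correlation between true and expected dosage."""
    expected = posterior_probs @ np.array([0.0, 1.0, 2.0])
    r = np.corrcoef(true_dosage, expected)[0, 1]
    return r ** 2

# Toy example with three individuals (probabilities are illustrative only).
probs = np.array([[0.9, 0.1, 0.0], [0.2, 0.7, 0.1], [0.0, 0.2, 0.8]])
print(allelic_r2(np.array([0, 1, 2]), probs))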

18.

Background  

The use of current high-throughput genetic, genomic, and post-genomic data leads to the simultaneous evaluation of a large number of statistical hypotheses and, consequently, to the multiple-testing problem. As an alternative to the overly conservative family-wise error rate (FWER), the false discovery rate (FDR) has emerged over the last ten years as more appropriate for handling this problem. One drawback of the FDR, however, is that it refers to a rejection region as a whole, attributing the same value to statistics close to the boundary of the region and to those far from it. As a result, the local FDR has recently been proposed to quantify the probability that a given null hypothesis is true.
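For orientation, the two quantities can be contrasted with the standard definitions (our notation):

\mathrm{FDR} = E\left[\frac{V}{\max(R, 1)}\right], \qquad \mathrm{fdr}(z) = \frac{\pi_0 f_0(z)}{f(z)},

where V is the number of true null hypotheses rejected, R the total number of rejections, \pi_0 the proportion of true nulls, f_0 the null density of the statistic, and f its mixture density; the local FDR thus depends on the observed value z itself rather than on the rejection region as a whole.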

19.
A stepwise algorithm for finding minimum evolution trees
A stepwise algorithm for reconstructing minimum evolution (ME) trees from evolutionary distance data is proposed. In each step, a taxon that potentially has a neighbor (another taxon connected to it through a single interior node) is chosen, and its true neighbor is then searched for iteratively. For m taxa, at most (m-1)!/2 trees are examined, and the tree with the minimum sum of branch lengths (S) is chosen as the final tree. This algorithm provides simple strategies for restricting the tree space searched and allows efficient ways of dynamically computing the ordinary least squares estimates of S for the topologies examined. Using computer simulation, we found that the efficiency of the ME method in recovering the correct tree is similar to that of the neighbor-joining method (Saitou and Nei 1987). A more exhaustive search is unlikely to improve the efficiency of the ME method in finding the correct tree, because the correct tree is almost always included in the tree space searched by this stepwise algorithm. The new algorithm also finds trees whose S values may not be significantly different from that of the ME tree when the correct tree contains very small interior branches or when the pairwise distance estimates have large sampling errors. These topologies form a set of plausible alternatives to the ME tree and can be compared with each other using statistical tests based on the minimum evolution principle. The new algorithm makes it possible to use the ME method for large data sets.
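A minimal numpy sketch of the ordinary least squares step for a single topology (our construction, not the paper's implementation): branch lengths are the OLS regression of pairwise distances on the branch-incidence matrix of the topology, and S is their sum.

import numpy as np

# Unrooted 4-taxon topology ((1,2),(3,4)): branches e1..e4 external, e5 internal.
# Rows: pairs (1,2),(1,3),(1,4),(2,3),(2,4),(3,4); a 1 marks each branch lying
# on the path between the pair.
A = np.array([
    [1, 1, 0, 0, 0],
    [1, 0, 1, 0, 1],
    [1, 0, 0, 1, 1],
    [0, 1, 1, 0, 1],
    [0, 1, 0, 1, 1],
    [0, 0, 1, 1, 0],
], dtype=float)
d = np.array([0.30, 0.45, 0.50, 0.49, 0.54, 0.35])  # toy pairwise distances

b, *_ = np.linalg.lstsq(A, d, rcond=None)  # OLS branch-length estimates
S = b.sum()                                # minimum-evolution score of topology
print(b, S)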

20.
Commonly used methods for inferring phylogenies were designed before the emergence of high-throughput sequencing and generally cannot accommodate the challenges associated with noisy, diploid sequencing data. In many applications, diploid genomes are still treated as haploid through the use of ambiguity characters, while the uncertainty in genotype calling that arises from the sequencing technology is ignored. To address this problem, we describe two new probabilistic approaches for estimating genetic distances, distAngsd-geno and distAngsd-nuc, both implemented in a software suite named distAngsd. These methods are specifically designed for next-generation sequencing data, utilize the full information in the data, and take uncertainty in genotype calling into account. Through extensive simulations, we show that these new methods are markedly more accurate and have more stable statistical behavior than other currently available methods for estimating genetic distances, even for very low-depth data with high error rates.
