Similar literature
1.
Qihuang Zhang, Grace Y. Yi. Biometrics, 2023, 79(2): 1089-1102
Zero-inflated count data arise frequently from genomics studies. Analysis of such data is often based on a mixture model which facilitates excess zeros in combination with a Poisson distribution, and various inference methods have been proposed under such a model. Those analysis procedures, however, are challenged by the presence of measurement error in count responses. In this article, we propose a new measurement error model to describe error-contaminated count data. We show that ignoring the measurement error effects in the analysis may generally lead to invalid inference results, and meanwhile, we identify situations where ignoring measurement error can still yield consistent estimators. Furthermore, we propose a Bayesian method to address the effects of measurement error under the zero-inflated Poisson model and discuss the identifiability issues. We develop a data-augmentation algorithm that is easy to implement. Simulation studies are conducted to evaluate the performance of the proposed method. We apply our method to analyze the data arising from a prostate adenocarcinoma genomic study.

2.
Count data often exhibit more zeros than predicted by common count distributions like the Poisson or negative binomial. In recent years, there has been considerable interest in methods for analyzing zero-inflated count data in longitudinal or other correlated data settings. A common approach has been to extend zero-inflated Poisson models to include random effects that account for correlation among observations. However, these models have been shown to have a few drawbacks, including interpretability of regression coefficients and numerical instability of fitting algorithms even when the data arise from the assumed model. To address these issues, we propose a model that parameterizes the marginal associations between the count outcome and the covariates as easily interpretable log relative rates, while including random effects to account for correlation among observations. One of the main advantages of this marginal model is that it allows a basis upon which we can directly compare the performance of standard methods that ignore zero inflation with that of a method that explicitly takes zero inflation into account. We present simulations of these various model formulations in terms of bias and variance estimation. Finally, we apply the proposed approach to analyze toxicological data of the effect of emissions on cardiac arrhythmias.

3.
Species-occurrence data sets tend to contain a large proportion of zero values, i.e., absence values (zero-inflated). Statistical inference using such data sets is likely to be inefficient or lead to incorrect conclusions unless the data are treated carefully. In this study, we propose a new modeling method to overcome the problems caused by zero-inflated data sets that involves a regression model and a machine-learning technique. We combined a generalized linear model (GLM), which is widely used in ecology, and bootstrap aggregation (bagging), a machine-learning technique. We established distribution models of Vincetoxicum pycnostelma (a vascular plant) and Ninox scutulata (an owl), both of which are endangered and have zero-inflated distribution patterns, using our new method and traditional GLM and compared model performances. At the same time, we modeled four theoretical data sets that contained different ratios of presence/absence values using new and traditional methods and also compared model performances. For distribution models, our new method showed good performance compared to traditional GLMs. After bagging, area under the curve (AUC) values were almost the same as with traditional methods, but sensitivity values were higher. Additionally, our new method showed high sensitivity values compared to the traditional GLM when modeling a theoretical data set containing a large proportion of zero values. These results indicate that our new method has high predictive ability with presence data when analyzing zero-inflated data sets. Generally, predicting presence data is more difficult than predicting absence data. Our new modeling method has potential for advancing species distribution modeling.

4.
Phenotypes measured in counts are commonly observed in nature. Statistical methods for mapping quantitative trait loci (QTL) underlying count traits are documented in the literature. The majority of them assume that the count phenotype follows a Poisson distribution with appropriate techniques being applied to handle data dispersion. When a count trait has a genetic basis, "naturally occurring" zero status also reflects the underlying gene effects. Simply ignoring or mishandling the zero data may lead to wrong QTL inference. In this article, we propose an interval mapping approach for mapping QTL underlying count phenotypes containing many zeros. The effects of QTLs on the zero-inflated count trait are modelled through the zero-inflated generalized Poisson regression mixture model, which can handle the zero inflation and Poisson dispersion in the same distribution. We implement the approach using the EM algorithm with the Newton-Raphson algorithm embedded in the M-step, and provide a genome-wide scan for testing and estimating the QTL effects. The performance of the proposed method is evaluated through extensive simulation studies. Extensions to composite and multiple interval mapping are discussed. The utility of the developed approach is illustrated through a mouse F2 intercross data set. Significant QTLs are detected to control mouse cholesterol gallstone formation.

5.
Count phenotypes with excessive zeros are often observed in the biological world. Researchers have studied many statistical methods for mapping the quantitative trait loci (QTLs) of zero-inflated count phenotypes. However, most of the existing methods consist of finding the approximate positions of the QTLs on the chromosome by genome-wide scanning. Additionally, most of the existing methods use the EM algorithm for parameter estimation. In this paper, we propose a Bayesian interval mapping scheme of QTLs for zero-inflated count data. The method takes advantage of a zero-inflated generalized Poisson (ZIGP) regression model to study the influence of QTLs on the zero-inflated count phenotype. The MCMC algorithm is used to estimate the effects and position parameters of QTLs. We use the Haldane map function to realize the conversion between recombination rate and map distance. Monte Carlo simulations are conducted to test the applicability and advantage of the proposed method. The effects of QTLs on the formation of mouse cholesterol gallstones were demonstrated by analyzing a mouse data set.
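
The abstract above mentions the Haldane map function for converting between recombination rate and map distance. As a minimal sketch of that standard conversion (not the authors' code; the function names are hypothetical, and map distance is assumed to be in Morgans), it could look like this:

```python
import math


def haldane_recombination(distance_morgans):
    """Recombination fraction r for a map distance d (in Morgans).

    Haldane's map function: r = 0.5 * (1 - exp(-2 d)).
    """
    if distance_morgans < 0:
        raise ValueError("map distance must be non-negative")
    return 0.5 * (1.0 - math.exp(-2.0 * distance_morgans))


def haldane_distance(recombination):
    """Map distance d (in Morgans) for a recombination fraction r.

    Inverse of Haldane's map function: d = -0.5 * ln(1 - 2 r),
    defined for 0 <= r < 0.5.
    """
    if not 0.0 <= recombination < 0.5:
        raise ValueError("recombination fraction must lie in [0, 0.5)")
    return -0.5 * math.log(1.0 - 2.0 * recombination)
```

The recombination fraction approaches 0.5 as the distance grows, which is why the inverse is only defined below 0.5.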

6.
Background: The recently emerged technology of methylated RNA immunoprecipitation sequencing (MeRIP-seq) sheds light on the study of RNA epigenetics. This new bioinformatics question calls for effective and robust peak-calling algorithms to detect mRNA methylation sites from MeRIP-seq data. Methods: We propose a Bayesian hierarchical model to detect methylation sites from MeRIP-seq data. Our modeling approach includes several important characteristics. First, it models the zero-inflated and over-dispersed counts by deploying a zero-inflated negative binomial model. Second, it incorporates a hidden Markov model (HMM) to account for the spatial dependency of neighboring read enrichment. Third, our Bayesian inference allows the proposed model to borrow strength in parameter estimation, which greatly improves the model stability when dealing with MeRIP-seq data with a small number of replicates. We use Markov chain Monte Carlo (MCMC) algorithms to simultaneously infer the model parameters in a de novo fashion. The R Shiny demo is available at the authors' website and the R/C++ code is available at https://github.com/liqiwei2000/BaySeqPeak. Results: In simulation studies, the proposed method outperformed the competing methods exomePeak and MeTPeak, especially when an excess of zeros was present in the data. In real MeRIP-seq data analysis, the proposed method identified methylation sites that were more consistent with biological knowledge, and had better spatial resolution compared to the other methods. Conclusions: In this study, we develop a Bayesian hierarchical model to identify methylation peaks in MeRIP-seq data. The proposed method has a competitive edge over existing methods in terms of accuracy, robustness and spatial resolution.

7.
8.
In some occupational health studies, observations occur in both exposed and unexposed individuals. If the levels of all exposed individuals have been detected, a two-part zero-inflated log-normal model is usually recommended, which assumes that the data have a probability mass at zero for unexposed individuals and a continuous response for values greater than zero for exposed individuals. However, many quantitative exposure measurements are subject to left censoring due to values falling below assay detection limits. A zero-inflated log-normal mixture model is suggested in this situation since unexposed zeros are not distinguishable from those exposed with values below detection limits. In the context of this mixture distribution, the information contributed by values falling below a fixed detection limit is used only to estimate the probability of being unexposed. We consider sample size and statistical power calculation when comparing the median of exposed measurements to a regulatory limit. We calculate the required sample size for the data presented in a recent paper comparing the benzene TWA exposure data to a regulatory occupational exposure limit. A simulation study is conducted to investigate the performance of the proposed sample size calculation methods.

9.
Binomial regression models are commonly applied to proportion data such as those relating to the mortality and infection rates of diseases. However, it is often the case that the responses may exhibit excessive zeros; in such cases a zero-inflated binomial (ZIB) regression model can be applied instead. In practice, it is essential to test if there are excessive zeros in the outcome to help choose an appropriate model. The binomial models can yield biased inference if there are excessive zeros, while ZIB models may be unnecessarily complex and hard to interpret, and even face convergence issues, if there are no excessive zeros. In this paper, we develop a new test for testing zero inflation in binomial regression models by directly comparing the number of observed zeros with what would be expected under the binomial regression model. A closed form of the test statistic, as well as the asymptotic properties of the test, is derived based on estimating equations. Our systematic simulation studies show that the new test performs very well in most cases, and outperforms the classical Wald, likelihood ratio, and score tests, especially in controlling type I errors. Two real data examples are also included for illustrative purposes.
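
The core comparison described above, observed zeros against the number expected under a binomial model, can be illustrated with a short sketch. This is only the raw difference, not the paper's test statistic or its asymptotic variance; the function names are hypothetical, and the success probabilities are assumed to come from an already fitted binomial regression.

```python
def expected_binomial_zeros(trials, probs):
    """Expected number of zero responses under independent binomials.

    For observation i with n_i trials and success probability p_i,
    P(Y_i = 0) = (1 - p_i) ** n_i; the expectation is the sum over i.
    """
    return sum((1.0 - p) ** n for n, p in zip(trials, probs))


def zero_excess(responses, trials, probs):
    """Observed minus expected zero count (positive suggests extra zeros)."""
    observed = sum(1 for y in responses if y == 0)
    return observed - expected_binomial_zeros(trials, probs)
```

A formal test would additionally need a standard error for this difference that accounts for the estimation of the regression coefficients, which the sketch does not attempt.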

10.
Hall DB. Biometrics, 2000, 56(4): 1030-1039
In a 1992 Technometrics paper, Lambert (1992, 34, 1-14) described zero-inflated Poisson (ZIP) regression, a class of models for count data with excess zeros. In a ZIP model, a count response variable is assumed to be distributed as a mixture of a Poisson(lambda) distribution and a distribution with point mass of one at zero, with mixing probability p. Both p and lambda are allowed to depend on covariates through canonical link generalized linear models. In this paper, we adapt Lambert's methodology to an upper bounded count situation, thereby obtaining a zero-inflated binomial (ZIB) model. In addition, we add to the flexibility of these fixed effects models by incorporating random effects so that, e.g., the within-subject correlation and between-subject heterogeneity typical of repeated measures data can be accommodated. We motivate, develop, and illustrate the methods described here with an example from horticulture, where both upper bounded count (binomial-type) and unbounded count (Poisson-type) data with excess zeros were collected in a repeated measures designed experiment.
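
The ZIP mixture described in this abstract can be written down directly. The following is a minimal sketch, assuming that p is the weight on the point mass at zero and 1 - p the weight on the Poisson(lambda) component; the function names are hypothetical and covariates are left out.

```python
import math


def zip_pmf(k, lam, p):
    """P(Y = k) for a zero-inflated Poisson.

    With probability p the response is a structural zero; with
    probability 1 - p it is drawn from Poisson(lam).
    """
    if k < 0:
        return 0.0
    poisson = math.exp(-lam) * lam ** k / math.factorial(k)
    if k == 0:
        return p + (1.0 - p) * poisson
    return (1.0 - p) * poisson


def zip_mean(lam, p):
    """Mean of the mixture: (1 - p) * lam."""
    return (1.0 - p) * lam


def zip_variance(lam, p):
    """Variance of the mixture: (1 - p) * lam * (1 + p * lam)."""
    return (1.0 - p) * lam * (1.0 + p * lam)
```

Because the variance exceeds the mean whenever p > 0, zero inflation by itself already produces overdispersion relative to a plain Poisson model.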

11.
The advent of high-throughput metagenomic sequencing has prompted the development of efficient taxonomic profiling methods that measure the presence, abundance and phylogeny of organisms in a wide range of environmental samples. Multivariate sequence-derived abundance data further has the potential to enable inference of ecological associations between microbial populations, but several technical issues need to be accounted for, like the compositional nature of the data, its extreme sparsity and overdispersion, as well as the frequent need to operate in under-determined regimes. The ecological network reconstruction problem is frequently cast into the paradigm of Gaussian Graphical Models (GGMs) for which efficient structure inference algorithms are available, like the graphical lasso and neighborhood selection. Unfortunately, GGMs or variants thereof cannot properly account for the extremely sparse patterns occurring in real-world metagenomic taxonomic profiles. In particular, structural zeros (as opposed to sampling zeros) corresponding to true absences of biological signals fail to be properly handled by most statistical methods. We present here a zero-inflated log-normal graphical model (available at https://github.com/vincentprost/Zi-LN) specifically aimed at handling such "biological" zeros, and demonstrate significant performance gains over state-of-the-art statistical methods for the inference of microbial association networks, with most notable gains obtained when analyzing taxonomic profiles displaying sparsity levels on par with real-world metagenomic datasets.

12.
Disease mapping models are widely used to model disease incidence with spatial correlation. In disease mapping models, zero inflation is an important issue, which often occurs in disease incidence datasets with high proportions of zero disease counts. It originates from limited survey coverage or inadequate testing equipment, which leaves some regions with no observed patients. The excessive zeros recorded in the disease incidence dataset then distort the true distribution of disease incidence and lead to inaccurate estimates. To address this issue, a zero-inflated disease mapping model is developed in this work. In this model, a zero-inflated process using Bernoulli indicators is assumed to characterize whether the zero inflation occurs for each region. For regions without zero inflation, a coherent and generative disease mapping model is applied for mapping the spatially correlated disease incidence. Independent spatial random effects are incorporated in both processes to account for the spatial patterns of zero inflation and disease incidence. External covariates are also considered in both processes to better explain the disease count data. To estimate the model, a Markov chain Monte Carlo algorithm is proposed. We evaluate model performance via a variety of simulation experiments. Finally, a Lyme disease dataset of Virginia is analyzed to illustrate the application of the proposed model.

13.
The assessment of population trends is a key point in wildlife conservation. Survey data collected over long periods may not be comparable due to the presence of environmental biases (i.e. inadequate representation of the variability of environmental covariates in the study area). Moreover, count data may be affected by both overdispersion (i.e. the variance is larger than the mean) and excess of zero counts (potentially leading to zero inflation). The aim of this study was to define a modelling procedure to assess long-term population trends that addressed these three issues and to shed light on the effects of environmental bias, overdispersion, and zero inflation on trend estimates. To test our procedure, we used six bird species whose data were collected in northern Italy from 1992 to 2019. We designed a multi-step approach. First, using generalised additive models (GAMs), we implemented a full factorial design of models (eight models per species) that did or did not account for environmental bias (including or not including environmental covariates, respectively), overdispersion (using a negative binomial distribution or a Poisson distribution, respectively), and zero inflation (using or not using zero-inflated models, respectively). Models were ranked according to the Akaike Information Criterion. Second, annual population indices (median and 95% confidence interval of the number of breeding pairs per point count) were predicted through a parametric bootstrap procedure. Third, long-term population trends were assessed and tested for significance by fitting weighted least squares linear regression models to the predicted annual indices. To evaluate the effect of environmental bias, overdispersion, and zero inflation on trend estimates, an average discrepancy index was calculated for each model group.
The results showed that environmental bias was the most important driver in determining different trend estimates, although overlooking overdispersion and zero inflation could lead to misleading results. For five species, zero-inflated GAMs were the best models to predict annual population indices. Our findings suggested a mutual interaction between zero inflation and overdispersion, with overdispersion arising in non-zero-inflated models. Moreover, for species having flocking foraging and/or colonial breeding behaviours, overdispersed and zero-inflated models may be more adequate. In conclusion, properly handling environmental bias, which may affect several data sets coming from long-term monitoring programs, is crucial to obtain reliable estimates of population trends. Furthermore, the extent to which overdispersion and zero inflation may affect trend estimates should be assessed by comparing different models, rather than presumed from statistical assumptions.

14.
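The occurrence of lightning-caused fires is closely related to meteorological factors. This paper uses two models suited to the structure of forest fire occurrence data in the Daxing'anling (Greater Khingan) region, the negative binomial (NB) and zero-inflated negative binomial (ZINB) models, to model the relationship between lightning fire occurrence and meteorological factors in the Daxing'anling forest region from 1980 to 2005, and compares them with the ordinary least squares (OLS) regression method used in previous studies. The models were fitted with the SAS and R-Project statistical software to obtain the parameter estimates. The results show that the NB and ZINB models fit the data well, that the meteorological factors in the models are highly significant, and that both models predict the number of lightning fires well. Using AIC, the Vuong test, and other methods, the fit and predictive performance of the NB and ZINB models were compared further; the results show that the ZINB model outperforms the NB model in both data fitting and prediction. An optimal model of the relationship between forest fire occurrence and meteorological factors in the Daxing'anling region is proposed.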

15.
Bivariate time series of counts with excess zeros relative to the Poisson process are common in many bioscience applications. Failure to account for the extra zeros in the analysis may result in biased parameter estimates and misleading inferences. A class of bivariate zero-inflated Poisson autoregression models is presented to accommodate the zero-inflation and the inherent serial dependency between successive observations. An autoregressive correlation structure is assumed in the random component of the compound regression model. Parameter estimation is achieved via an EM algorithm, by maximizing an appropriate log-likelihood function to obtain residual maximum likelihood estimates. The proposed method is applied to analyze a bivariate series from an occupational health study, in which the zero-inflated injury count events are classified as either musculoskeletal or non-musculoskeletal in nature. The approach enables the evaluation of the effectiveness of a participatory ergonomics intervention at the population level, in terms of reducing the overall incidence of lost-time injury and a simultaneous decline in the two mean injury rates.

16.
Dark spots in the fleece area are often associated with dark fibres in wool, which limits its competitiveness with other textile fibres. Field data from a sheep experiment in Uruguay revealed an excess number of zeros for dark spots. We compared the performance of four Poisson and zero-inflated Poisson (ZIP) models under four simulation scenarios. All models performed reasonably well under the same scenario for which the data were simulated. The deviance information criterion favoured a Poisson model with residual, while the ZIP model with a residual gave estimates closer to their true values under all simulation scenarios. Both Poisson and ZIP models with an error term at the regression level performed better than their counterparts without such an error. Field data from Corriedale sheep were analysed with Poisson and ZIP models with residuals. Parameter estimates were similar for both models. Although the posterior distribution of the sire variance was skewed due to a small number of rams in the dataset, the median of this variance suggested a scope for genetic selection. The main environmental factor was the age of the sheep at shearing. In summary, age-related processes seem to drive the number of dark spots in this breed of sheep.

17.
18.
Ecological diffusion is a theory that can be used to understand and forecast spatio-temporal processes such as dispersal, invasion, and the spread of disease. Hierarchical Bayesian modelling provides a framework to make statistical inference and probabilistic forecasts, using mechanistic ecological models. To illustrate, we show how hierarchical Bayesian models of ecological diffusion can be implemented for large data sets that are distributed densely across space and time. The hierarchical Bayesian approach is used to understand and forecast the growth and geographic spread in the prevalence of chronic wasting disease in white-tailed deer (Odocoileus virginianus). We compare statistical inference and forecasts from our hierarchical Bayesian model to phenomenological regression-based methods that are commonly used to analyse spatial occurrence data. The mechanistic statistical model based on ecological diffusion led to important ecological insights, obviated a commonly ignored type of collinearity, and was the most accurate method for forecasting.

19.
We prove that the generalized Poisson distribution GP(theta, eta) (eta ≥ 0) is a mixture of Poisson distributions; this is a new property for a distribution which is the topic of the book by Consul (1989). Because we find that the fits to count data of the generalized Poisson and negative binomial distributions are often similar, to understand their differences, we compare the probability mass functions and skewnesses of the generalized Poisson and negative binomial distributions with the first two moments fixed. They have slight differences in many situations, but their zero-inflated distributions, with masses at zero, means and variances fixed, can differ more. These probabilistic comparisons are helpful in selecting a better fitting distribution for modelling count data with long right tails. Through a real example of count data with large zero fraction, we illustrate how the generalized Poisson and negative binomial distributions as well as their zero-inflated distributions can be discriminated.

20.
The Yangtze River estuary (YRE) is an important migration channel and foraging habitat for Coilia nasus. Given the ecological significance of this species and the priority placed on its protection, it is increasingly important to investigate the relationships between the environment and the abundance of Coilia nasus in the YRE and to understand its temporal and spatial distribution. Using fishery data and environmental survey data from 2009 to 2016, three models, namely generalized additive mixed models (GAMM), generalized additive models with zero-inflated Poisson distribution (ZIP-GAM) and two-step GAM, were used to analyze relationships between environmental factors and the distribution of Coilia nasus in the YRE. The results showed that model fitting of GAMM was more consistent with observations and revealed influences of water temperature, salinity, chlorophyll, and pH on distribution. GAMM demonstrated that higher Coilia nasus abundances were located in waters with water temperature values at 15°C and 30°C, and lower Coilia nasus abundances were located in areas with water temperature values at 10°C and 20°C. All models indicated that the effect of salinity on the abundance of Coilia nasus presents a multimodal pattern with three peaks at 5, 15, and 25 ppt, respectively. Additionally, the abundance of Coilia nasus increased with increasing chlorophyll a over its range of 0–4 mg/L. Within the range of 8.0–9.5, higher pH values were more suitable for the aggregation of Coilia nasus. Cross validation was used to evaluate the predictive performance of the models, and GAMM was found to be the best. The predicted abundance of Coilia nasus in the summer and autumn of 2016 was higher overall than that in winter and spring. The predicted zero abundance distribution pattern was consistent with the sampling presence distribution obtained from fishery-independent survey data for the years 2009–2015.
Given the urgent need to protect Coilia nasus in the YRE, the results of this study could be used for Coilia nasus conservation and reserve planning.
