20 similar documents found (search time: 0 ms)
1.
Metagenomics provides a powerful new tool set for investigating evolutionary interactions with the environment. However, an absence of model-based statistical methods means that researchers are often not able to make full use of this complex information. We present a Bayesian method for inferring the phylogenetic relationship among related organisms found within metagenomic samples. Our approach exploits variation in the frequency of taxa among samples to simultaneously infer each lineage haplotype, the phylogenetic tree connecting them, and their frequency within each sample. Applications of the algorithm to simulated data show that our method can recover a substantial fraction of the phylogenetic structure even in the presence of high rates of migration among sample sites. We provide examples of the method applied to data from green sulfur bacteria recovered from an Antarctic lake, plastids from mixed Plasmodium falciparum infections, and virulent Neisseria meningitidis samples.
2.
When a dataset is imbalanced, the prediction of the scarcely-sampled subpopulation can be over-influenced by the population contributing to the majority of the data. The aim of this study was to develop a Bayesian modelling approach with a balancing informative prior so that the influence of imbalance on the overall prediction could be minimised. The new approach was developed to weigh the data in favour of the smaller subset(s). The method was assessed in terms of bias and precision in predicting model parameter estimates of simulated datasets. It was then evaluated in a motivating example: predicting optimal dose levels of tobramycin for various age groups. The bias estimates using the balancing informative prior approach were smaller than those generated using the conventional approach, which did not account for the imbalance in the datasets. The precision estimates were also superior. In the tobramycin example, the resulting predictions agreed well with what has been reported in the literature. The proposed Bayesian balancing informative prior approach has shown real potential to adequately weigh the data in favour of smaller subset(s) of data and to generate robust prediction models.
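The balancing idea can be illustrated with a toy conjugate normal-normal sketch (the function name and all numbers here are hypothetical, not the authors' actual tobramycin model): a prior centred on the minority subset, with a weight chosen to offset the imbalance, pulls a pooled estimate back toward the under-represented group.

```python
def posterior_mean(data, prior_mean, prior_weight):
    # Conjugate normal-normal update with unit observation variance;
    # prior_weight acts like a pseudo-sample size for the prior.
    n = len(data)
    return (prior_weight * prior_mean + sum(data)) / (prior_weight + n)

major = [10.0] * 90   # majority subpopulation
minor = [20.0] * 10   # scarcely-sampled subpopulation

# Unweighted pooled estimate is dominated by the majority group...
pooled = sum(major + minor) / 100

# ...while a prior centred on the minority subset, weighted to offset the
# 9:1 imbalance, pulls the estimate back toward the minority group.
balanced = posterior_mean(major + minor, prior_mean=20.0, prior_weight=80)
```

With these numbers the pooled mean is 11.0 while the balanced posterior mean is 15.0; the paper's prior is informative rather than this crude pseudo-count, but the direction of the effect is the same.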
3.
Inferring the ancestral dynamics of effective population size is a long-standing question in population genetics, which can now be tackled much more accurately thanks to the massive genomic data available in many species. Several promising methods that take advantage of whole-genome sequences have been recently developed in this context. However, they can only be applied to rather small samples, which limits their ability to estimate recent population size history. Besides, they can be very sensitive to sequencing or phasing errors. Here we introduce a new approximate Bayesian computation approach named PopSizeABC that allows estimating the evolution of the effective population size through time, using a large sample of complete genomes. This sample is summarized using the folded allele frequency spectrum and the average zygotic linkage disequilibrium at different bins of physical distance, two classes of statistics that are widely used in population genetics and can be easily computed from unphased and unpolarized SNP data. Our approach provides accurate estimations of past population sizes, from the very first generations before present back to the expected time to the most recent common ancestor of the sample, as shown by simulations under a wide range of demographic scenarios. When applied to samples of 15 or 25 complete genomes in four cattle breeds (Angus, Fleckvieh, Holstein and Jersey), PopSizeABC revealed a series of population declines, related to historical events such as domestication or modern breed creation. We further highlight that our approach is robust to sequencing errors, provided summary statistics are computed from SNPs with common alleles.
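One of the two summary-statistic classes, the folded allele frequency spectrum, is simple to compute from unpolarized SNP data; a minimal sketch (a hypothetical helper, not PopSizeABC's actual code):

```python
from collections import Counter

def folded_afs(allele_counts, n_haplotypes):
    # Fold each SNP's allele count c to the minor-allele count min(c, n - c),
    # so no ancestral/derived polarization is needed, then tally the bins.
    folded = Counter(min(c, n_haplotypes - c) for c in allele_counts)
    return [folded.get(i, 0) for i in range(1, n_haplotypes // 2 + 1)]

# 5 SNPs observed in 4 haplotypes; counts 1, 3, 2, 1, 3 fold to 1, 1, 2, 1, 1
afs = folded_afs([1, 3, 2, 1, 3], n_haplotypes=4)
```

Here `afs` is `[4, 1]`: four SNPs in the singleton class and one at minor-allele frequency 2/4.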
4.
Understanding how an animal utilises its surroundings requires its movements through space to be described accurately. Satellite telemetry is the only means of acquiring movement data for many species; however, the data are prone to varying amounts of spatial error. The recent application of state-space models (SSMs) to the location estimation problem has provided a means to incorporate spatial errors when characterising animal movements. The predominant platform for collecting satellite telemetry data on free-ranging animals, Service Argos, recently provided an alternative Doppler location estimation algorithm that is purported to be more accurate and to generate a greater number of locations than its predecessor. We provide a comprehensive assessment of this new estimation process's performance on data from free-ranging animals relative to concurrently collected Fastloc GPS data. Additionally, we test the efficacy of three readily available SSMs in predicting the movement of two focal animals. Raw Argos location estimates generated by the new algorithm were greatly improved compared to the old system. Approximately twice as many Argos locations were derived as GPS locations on the devices used. Root mean square errors (RMSE) for each optimal SSM were less than 4.25 km, with some producing RMSE of less than 2.50 km. Differences in the biological plausibility of the tracks between the two focal animals used to investigate the utility of SSMs highlight the importance of considering animal behaviour in movement studies. The ability to reprocess Argos data collected since 2008 with the new algorithm should permit questions of animal movement to be revisited at a finer resolution.
5.
The neural patterns recorded during a neuroscientific experiment reflect complex interactions between many brain regions, each comprising millions of neurons. However, the measurements themselves are typically abstracted from that underlying structure. For example, functional magnetic resonance imaging (fMRI) datasets comprise a time series of three-dimensional images, where each voxel in an image (roughly) reflects the activity of the brain structure(s) located at the corresponding point in space at the time the image was collected. fMRI data often exhibit strong spatial correlations, whereby nearby voxels behave similarly over time as the underlying brain structure modulates its activity. Here we develop topographic factor analysis (TFA), a technique that exploits spatial correlations in fMRI data to recover the underlying structure that the images reflect. Specifically, TFA casts each brain image as a weighted sum of spatial functions. The parameters of those spatial functions, which may be learned by applying TFA to an fMRI dataset, reveal the locations and sizes of the brain structures activated while the data were collected, as well as the interactions between those structures.
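The generative idea, each image as a weighted sum of spatial functions, can be sketched with spherical Gaussian sources (illustrative only; TFA infers the weights, centres, and widths from data, and these two sources are made up):

```python
import math

def rbf(center, width, voxel):
    # A spherical Gaussian "source": activity falls off with distance
    # from the source centre.
    d2 = sum((v - c) ** 2 for v, c in zip(voxel, center))
    return math.exp(-d2 / (2 * width ** 2))

def image_value(voxel, sources):
    # Forward model sketch: voxel intensity is a weighted sum of K
    # spatial basis functions. sources = [(weight, center, width), ...]
    return sum(w * rbf(c, s, voxel) for w, c, s in sources)

# Two hypothetical sources; TFA would learn these from an fMRI dataset.
sources = [(1.0, (10, 10, 10), 3.0), (0.5, (20, 10, 10), 2.0)]
value = image_value((10, 10, 10), sources)
```

Fitting TFA amounts to running this forward model in reverse: finding source parameters under which the weighted sums reproduce the observed images.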
7.
We modified the stable isotope mixing model MixSIR to infer primary producer contributions to consumer diets based on their fatty acid (FA) composition. To parameterize the algorithm, we generated a ‘consumer-resource library’ of FA signatures of Daphnia fed different algal diets, using 34 feeding trials representing diverse phytoplankton lineages. This library corresponds to the resource or producer file in classic Bayesian mixing models such as MixSIR or SIAR. Because this library is based on the FA profiles of zooplankton consuming known diets, and not the FA profiles of algae directly, trophic modification of consumer lipids is directly accounted for. To test the model, we simulated hypothetical Daphnia composed of 80% diatoms, 10% green algae, and 10% cryptophytes and compared the FA signatures of these known pseudo-mixtures to outputs generated by the mixing model. The algorithm inferred these simulated consumers were composed of 82% (63-92%) [median (2.5th to 97.5th percentile credible interval)] diatoms, 11% (4-22%) green algae, and 6% (0-25%) cryptophytes. We used the same model with published phytoplankton stable isotope (SI) data for δ13C and δ15N to examine how an SI-based approach resolved a similar scenario. With SI, the algorithm inferred that the simulated consumer assimilated 52% (4-91%) diatoms, 23% (1-78%) green algae, and 18% (1-73%) cyanobacteria. The accuracy and precision of the SI-based estimates were extremely sensitive to both resource and consumer uncertainty, as well as to the trophic fractionation assumption. These results indicate that when using only two tracers with substantial uncertainty for the putative resources, as is often the case in this class of analyses, the underdetermined constraint in consumer-resource SI analyses may be intractable.
The FA-based approach alleviated the underdetermined constraint because many more FA biomarkers were utilized (n > 20), different primary producers (e.g., diatoms, green algae, and cryptophytes) have very characteristic FA compositions, and the FA profiles of many aquatic primary consumers are strongly influenced by their diets.
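The forward model behind such mixing analyses is linear: a consumer's biomarker profile is the proportion-weighted sum of the resource profiles (trophic modification being absorbed into the consumer-resource library). A minimal sketch with made-up three-biomarker profiles:

```python
def mixture_signature(proportions, resource_profiles):
    # Consumer profile = sum over resources of (diet proportion x profile).
    n_markers = len(resource_profiles[0])
    return [sum(p * prof[i] for p, prof in zip(proportions, resource_profiles))
            for i in range(n_markers)]

# Hypothetical 3-biomarker profiles for diatoms, green algae, cryptophytes.
profiles = [[0.6, 0.1, 0.1],
            [0.1, 0.7, 0.1],
            [0.1, 0.1, 0.6]]
signature = mixture_signature([0.8, 0.1, 0.1], profiles)
```

A Bayesian mixing model such as MixSIR runs this in reverse, sampling proportion vectors whose predicted signatures match the observed one; with roughly 20 fatty acid markers instead of 2 isotopes, that inversion is far better determined.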
8.
We investigate in this paper the reverse engineering of gene regulatory networks from time-series microarray data. We apply dynamic Bayesian networks (DBNs) for modeling cell cycle regulation. In developing a network inference algorithm, we focus on soft solutions that can provide the a posteriori probability (APP) of network topology. In particular, we propose a variational Bayesian structural expectation maximization (VBSEM) algorithm that can learn the posterior distribution of the network model parameters and topology jointly. We also show how the obtained APPs of the network topology can be used in a Bayesian data integration strategy to integrate two different microarray data sets. The proposed VBSEM algorithm has been tested on yeast cell cycle data sets. To evaluate the confidence of the inferred networks, we apply a moving block bootstrap method. The inferred network is validated by comparing it to the KEGG pathway map.
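The moving block bootstrap used here to assess confidence resamples contiguous blocks of the series, so that short-range temporal dependence survives resampling (a minimal generic sketch, not the paper's implementation):

```python
import random

def moving_block_bootstrap(series, block_len, seed=0):
    # Resample by concatenating randomly chosen contiguous blocks until
    # the resample reaches the original length, then truncate. Keeping
    # blocks contiguous preserves short-range temporal dependence that
    # an i.i.d. bootstrap would destroy.
    rng = random.Random(seed)
    n = len(series)
    out = []
    while len(out) < n:
        start = rng.randrange(n - block_len + 1)
        out.extend(series[start:start + block_len])
    return out[:n]

resample = moving_block_bootstrap(list(range(100)), block_len=10)
```

Repeating the network inference on many such resampled time series and tallying how often each edge reappears gives the bootstrap confidence for that edge.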
10.
We propose and develop a general approach based on reaction-diffusion equations for modelling species dynamics in a realistic two-dimensional (2D) landscape crossed by linear one-dimensional (1D) corridors, such as roads, hedgerows or rivers. Our approach is based on a hybrid “2D/1D model”, i.e., a system of 2D and 1D reaction-diffusion equations with homogeneous coefficients, in which each equation describes the population dynamics in a given 2D or 1D element of the landscape. Using the example of the range expansion of the tiger mosquito Aedes albopictus in France and its main highways as 1D corridors, we show that the model can be fitted to realistic observation data. We develop a mechanistic-statistical approach, based on the coupling between a model of population dynamics and a probabilistic model of the observation process. This allows us to bridge the gap between the data (3 levels of infestation, at the scale of a French department) and the output of the model (population densities at each point of the landscape), and to estimate the model parameter values using a maximum-likelihood approach. Using classical model comparison criteria, we obtain a better fit and a better predictive power with the 2D/1D model than with a standard homogeneous reaction-diffusion model. This shows the potential importance of taking into account the effect of the corridors (highways in the present case) on species dynamics. With regard to the particular case of A. albopictus, the conclusion that highways played an important role in the species' range expansion in mainland France is consistent with recent findings from the literature.
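The 1D building block of such a hybrid model is a reaction-diffusion equation on a corridor. An explicit finite-difference sketch of Fisher-KPP dynamics, du/dt = D d²u/dx² + r u(1 − u), with made-up parameter values (not those fitted for A. albopictus):

```python
def fisher_kpp_step(u, D, r, dx, dt):
    # One explicit Euler step; endpoints are held fixed (Dirichlet boundaries).
    new = u[:]
    for i in range(1, len(u) - 1):
        lap = (u[i - 1] - 2 * u[i] + u[i + 1]) / dx ** 2
        new[i] = u[i] + dt * (D * lap + r * u[i] * (1 - u[i]))
    return new

u = [0.0] * 50
u[0] = 1.0  # population introduced at one end of the corridor
for _ in range(200):
    u = fisher_kpp_step(u, D=0.1, r=0.5, dx=1.0, dt=0.1)
```

Stability of this explicit scheme requires dt·D/dx² ≤ 1/2; an invasion front then travels along the corridor at speed about 2√(Dr). The paper couples many such 1D segments to a 2D equation and estimates D and r by maximum likelihood through an observation model.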
11.
Throughout history, the population size of modern humans has varied considerably due to changes in environment, culture, and technology. More accurate estimates of population size changes, and when they occurred, should provide a clearer picture of human colonization history and help remove confounding effects from natural selection inference. Demography influences the pattern of genetic variation in a population, and thus genomic data of multiple individuals sampled from one or more present-day populations contain valuable information about the past demographic history. Recently, Li and Durbin developed a coalescent-based hidden Markov model, called the pairwise sequentially Markovian coalescent (PSMC), for a pair of chromosomes (or one diploid individual) to estimate past population sizes. This is an efficient, useful approach, but its accuracy in the very recent past is hampered by the fact that, because of the small sample size, only a few coalescence events occur in that period. Multiple genomes from the same population contain more information about the recent past, but are also more computationally challenging to study jointly in a coalescent framework. Here, we present a new coalescent-based method that can efficiently infer population size changes from multiple genomes, providing access to a new store of information about the recent past. Our work generalizes the recently developed sequentially Markov conditional sampling distribution framework, which provides an accurate approximation of the probability of observing a newly sampled haplotype given a set of previously sampled haplotypes. Simulation results demonstrate that we can accurately reconstruct the true population histories, with a significant improvement over the PSMC in the recent past. We apply our method, called diCal, to the genomes of multiple human individuals of European and African ancestry to obtain a detailed population size change history during recent times.
12.
The availability of large-scale datasets has led to more effort being made to understand characteristics of metabolic reaction networks. However, because the large-scale data are semi-quantitative, and may contain biological variations and/or analytical errors, it remains a challenge to construct a mathematical model with precise parameters using only these data. The present work proposes a simple method, referred to as PENDISC (Parameter Estimation in a Non-Dimensionalized S-system with Constraints), to assist the complex process of parameter estimation in the construction of a mathematical model for a given metabolic reaction system. The PENDISC method was evaluated using two simple mathematical models: a linear metabolic pathway model with inhibition and a branched metabolic pathway model with inhibition and activation. The results indicate that a smaller number of data points and rate constant parameters enhances the agreement between calculated values and time-series data of metabolite concentrations, and leads to faster convergence when the same initial estimates are used for the fitting. This method is also shown to be applicable to noisy time-series data and to unmeasurable metabolite concentrations in a network, and to have the potential to handle metabolome data from a relatively large-scale metabolic reaction system. Furthermore, it was applied to aspartate-derived amino acid biosynthesis in the plant Arabidopsis thaliana. The results confirm that the constructed mathematical model satisfactorily agrees with the time-series datasets of seven metabolite concentrations.
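For context, an S-system represents each metabolite's rate of change as the difference of two power-law terms, dXi/dt = αi Π_j Xj^gij − βi Π_j Xj^hij. A minimal sketch of that rate law with toy parameter values (not those of the aspartate pathway model):

```python
import math

def s_system_rate(x, alpha, g, beta, h):
    # dXi/dt = alpha * prod_j x[j]**g[j] - beta * prod_j x[j]**h[j]
    production = alpha * math.prod(xj ** e for xj, e in zip(x, g))
    degradation = beta * math.prod(xj ** e for xj, e in zip(x, h))
    return production - degradation

# Two metabolites: X0 is produced first-order in X1 and consumed
# first-order in itself (hypothetical exponents and rate constants).
rate = s_system_rate([2.0, 1.0], alpha=1.0, g=[0.0, 1.0], beta=0.25, h=[1.0, 0.0])
```

Parameter estimation in PENDISC amounts to choosing the rate constants (α, β) and kinetic orders (g, h) so that trajectories of such equations track the measured metabolite time series.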
13.
The multilocus conditional sampling distribution (CSD) describes the probability that an additionally sampled DNA sequence is of a certain type, given that a collection of sequences has already been observed. The CSD has a wide range of applications in both computational biology and population genomics analysis, including phasing genotype data into haplotype data, imputing missing data, estimating recombination rates, inferring local ancestry in admixed populations, and importance sampling of coalescent genealogies. Unfortunately, the true CSD under the coalescent with recombination is not known, so approximations, formulated as hidden Markov models, have been proposed in the past. These approximations have led to a number of useful statistical tools, but it is important to recognize that they were not derived from, though were certainly motivated by, principles underlying the coalescent process. The goal of this article is to develop a principled approach to derive improved CSDs directly from the underlying population genetics model. Our approach is based on the diffusion process approximation and the resulting mathematical expressions admit intuitive genealogical interpretations, which we utilize to introduce further approximations and make our method scalable in the number of loci. The general algorithm presented here applies to an arbitrary number of loci and an arbitrary finite-alleles recurrent mutation model. Empirical results are provided to demonstrate that our new CSDs are in general substantially more accurate than previously proposed approximations. The probability of observing a sample of DNA sequences under a given population genetics model—which is referred to as the sampling probability or likelihood—plays an important role in a wide range of problems in genetic variation studies.
When recombination is involved, however, obtaining an analytic formula for the sampling probability has hitherto remained a challenging open problem (see Jenkins and Song 2009, 2010 for recent progress on this problem). As such, much research (Griffiths and Marjoram 1996; Kuhner et al. 2000; Nielsen 2000; Stephens and Donnelly 2000; Fearnhead and Donnelly 2001; De Iorio and Griffiths 2004a, b; Fearnhead and Smith 2005; Griffiths et al. 2008; Wang and Rannala 2008) has focused on developing Monte Carlo methods on the basis of the coalescent with recombination (Griffiths 1981; Kingman 1982a, b; Hudson 1983), a well-established mathematical framework that models the genealogical history of sample chromosomes. These Monte Carlo-based full-likelihood methods mark an important development in population genetics analysis, but a well-known obstacle to their utility is that they tend to be computationally intensive. For a whole-genome variation study, approximations are often unavoidable, and it is therefore important to think of ways to minimize the trade-off between scalability and accuracy. A popular likelihood-based approximation method that has had a significant impact on population genetics analysis is the following approach introduced by Li and Stephens (2003): Given a set Φ of model parameters (e.g., mutation rate, recombination rate, etc.), the joint probability p(h1, …, hn | Φ) of observing a set {h1, …, hn} of haplotypes sampled from a population can be decomposed as a product of conditional sampling distributions (CSDs), denoted by π: p(h1, …, hn | Φ) = π(h1 | Φ) π(h2 | h1, Φ) ⋯ π(hn | h1, …, hn−1, Φ), (1) where π(hk+1 | h1, …, hk, Φ) is the probability of an additionally sampled haplotype being of type hk+1, given a set of already observed haplotypes h1, …, hk.
In the presence of recombination, the true CSD π is unknown, so Li and Stephens proposed using an approximate CSD π̂ in place of π, thus obtaining the following approximation of the joint probability: p(h1, …, hn | Φ) ≈ π̂(h1 | Φ) π̂(h2 | h1, Φ) ⋯ π̂(hn | h1, …, hn−1, Φ). (2) Li and Stephens referred to this approximation as the product of approximate conditionals (PAC) model. In general, the closer π̂ is to the true CSD π, the more accurate the PAC model becomes. Notable applications and extensions of this framework include estimating crossover rates (Li and Stephens 2003; Crawford et al. 2004) and gene conversion parameters (Gay et al. 2007; Yin et al. 2009), phasing genotype data into haplotype data (Stephens and Scheet 2005; Scheet and Stephens 2006), imputing missing data to improve power in association mapping (Stephens and Scheet 2005; Li and Abecasis 2006; Marchini et al. 2007; Howie et al. 2009), inferring local ancestry in admixed populations (Price et al. 2009), inferring human colonization history (Hellenthal et al. 2008), inferring demography (Davison et al. 2009), and so on. Another problem in which the CSD plays a fundamental role is importance sampling of genealogies under the coalescent process (Stephens and Donnelly 2000; Fearnhead and Donnelly 2001; De Iorio and Griffiths 2004a, b; Fearnhead and Smith 2005; Griffiths et al. 2008). In this context, the optimal proposal distribution can be written in terms of the CSD π (Stephens and Donnelly 2000), and as in the PAC model, an approximate CSD π̂ may be used in place of π. The performance of an importance sampling scheme depends critically on the proposal distribution and therefore on the accuracy of the approximation π̂. Often in conjunction with composite-likelihood frameworks (Hudson 2001; Fearnhead and Donnelly 2002), importance sampling has been used in estimating fine-scale recombination rates (McVean et al.
2004; Fearnhead and Smith 2005; Johnson and Slatkin 2009). So far, a significant amount of intuition has gone into choosing the approximate CSDs used in these problems (Marjoram and Tavaré 2006). In the case of completely linked loci, Stephens and Donnelly (2000) suggested constructing an approximation by assuming that the additional haplotype hk+1 is an imperfect copy of one of the first k haplotypes, with copying errors corresponding to mutation. Fearnhead and Donnelly (2001) generalized this construction to include crossover recombination, assuming that the haplotype hk+1 is an imperfect mosaic of the first k haplotypes (i.e., hk+1 is obtained by copying segments from h1, …, hk, where crossover recombination can change the haplotype from which copying is performed). The associated CSD, which we denote by π̂FD, can be interpreted as a hidden Markov model and so admits an efficient dynamic programming solution. Finally, Li and Stephens (2003) proposed a modification to Fearnhead and Donnelly's model that limits the hidden state space, thereby providing a computational simplification; we denote the corresponding approximate CSD by π̂LS. Although these approaches are computationally appealing, it is important to note that they are not derived from, though are certainly motivated by, principles underlying typical population genetics models, in particular the coalescent process (Griffiths 1981; Kingman 1982a, b; Hudson 1983). The main objective of this article is to develop a principled technique to derive an improved CSD directly from the underlying population genetics model. Rather than relying on intuition, we base our work on a mathematical foundation. The theoretical framework we employ is the diffusion process. De Iorio and Griffiths (2004a, b) first introduced the diffusion-generator approximation technique to obtain an approximate CSD in the case of a single locus (i.e., no recombination). Griffiths et al.
(2008) later extended the approach to two loci to include crossover recombination, assuming a parent-independent mutation model at each locus. In this article, we extend the framework to develop a general algorithm that applies to an arbitrary number of loci and an arbitrary finite-alleles recurrent mutation model. Our work can be summarized as follows. Using the diffusion-generator approximation technique, we derive a recursion relation satisfied by an approximate CSD. This recursion can be used to construct a closed system of coupled linear equations, in which the conditional sampling probability of interest appears as one of the unknown variables. The system of equations can be solved using standard numerical analysis techniques. However, the size of the system grows superexponentially with the number of loci and, consequently, so does the running time. To remedy this drawback, we introduce additional approximations to make our approach scalable in the number of loci. Specifically, the recursion admits an intuitive genealogical interpretation, and, on the basis of this interpretation, we propose modifications to the recursion, which then can be easily solved using dynamic programming. The computational complexity of the modified algorithm is polynomial in the number of loci, and, importantly, the resulting CSD has little loss of accuracy compared to that following from the full recursion. The accuracy of approximate CSDs has not been discussed much in the literature, except in the application-specific context for which they are being employed. In this article, we carry out an empirical study to explicitly test the accuracy of various CSDs and demonstrate that our new CSDs are in general substantially more accurate than previously proposed approximations. We also consider the PAC framework and show that our approximations also produce more accurate PAC-likelihood estimates.
We note that for the maximum-likelihood estimation of recombination rates, the actual value of the likelihood may not be so important, as long as it is maximized near the true recombination rate. However, in many other applications—e.g., phasing genotype data into haplotype data, imputing missing data, importance sampling, and so on—the accuracy of the CSD and PAC-likelihood function over a wide range of parameter values may be important. Thus, we believe that the theoretical work presented here will have several practical implications; our method can be applied in a wide range of statistical tools that use CSDs, improving their accuracy. The remainder of this article is organized as follows. To provide intuition for the ensuing mathematics, we first describe a genealogical process that gives rise to our CSD. Using our genealogical interpretation, we consider two additional approximations and relate these to previously proposed CSDs. Then, in the following section, we derive our CSD using the diffusion-generator approach and provide mathematical statements for the additional approximations; some interesting limiting behavior is also described there. This section is self-contained and may be skipped by the reader uninterested in mathematical details. Finally, in the subsequent section, we carry out a simulation study to compare the accuracy of various approximate CSDs and demonstrate that ours are generally the most accurate.
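Computationally, the PAC decomposition is just a chain of conditionals. A toy sketch with a deliberately crude copying CSD (mutation only, no recombination, made-up mutation probability; the CSDs discussed in the article are far richer):

```python
def toy_csd(h, observed, mu=0.05):
    # Crude stand-in for an approximate CSD: the new haplotype is an
    # imperfect copy of a uniformly chosen previous haplotype; each site
    # mismatches its template ("mutates") with probability mu.
    if not observed:
        return 0.5 ** len(h)  # uniform over binary haplotypes
    total = 0.0
    for template in observed:
        p = 1.0
        for a, b in zip(h, template):
            p *= (1 - mu) if a == b else mu
        total += p / len(observed)
    return total

def pac_likelihood(haplotypes, csd=toy_csd):
    # Product of approximate conditionals: multiply the probability of
    # each haplotype given all previously "observed" haplotypes.
    prob = 1.0
    for k in range(len(haplotypes)):
        prob *= csd(haplotypes[k], haplotypes[:k])
    return prob

like = pac_likelihood(["0011", "0011", "0111"])
```

One known quirk, shared by the real PAC model, is that the value depends on the ordering of the haplotypes; in practice the likelihood is often averaged over several random orderings.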
14.
One of the first things one learns in a basic psychology or statistics course is that you cannot prove the null hypothesis that there is no difference between two conditions such as a patient group and a normal control group. This remains true. However, thanks to ongoing progress by a special group of devoted methodologists, even when the result of an inferential test is p > .05, it is now possible to rigorously and quantitatively conclude that (a) the null hypothesis is actually unlikely, and (b) the alternative hypothesis of an actual difference between treatment and control is more probable than the null. Alternatively, it is also possible to conclude quantitatively that the null hypothesis is much more likely than the alternative. Without Bayesian statistics, we couldn’t say anything if a simple inferential analysis like a t-test yielded p > .05. The present, mostly non-quantitative article describes free resources and illustrative procedures for doing Bayesian analysis, with t-test and ANOVA examples.
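For a flavour of how such a conclusion can be quantified without special software, the BIC approximation to the Bayes factor (in the spirit of Wagenmakers, 2007) compares an equal-means model against a separate-means model; a minimal sketch with hypothetical data:

```python
import math
import statistics

def bic_bayes_factor_01(group_a, group_b):
    # BF01 ~= exp((BIC_alt - BIC_null) / 2); values > 1 favour the null
    # (equal means). Normal likelihoods with plug-in variance are assumed.
    pooled = group_a + group_b
    n = len(pooled)

    def rss(groups):
        return sum(sum((x - statistics.fmean(g)) ** 2 for x in g) for g in groups)

    bic_null = n * math.log(rss([pooled]) / n) + 1 * math.log(n)
    bic_alt = n * math.log(rss([group_a, group_b]) / n) + 2 * math.log(n)
    return math.exp((bic_alt - bic_null) / 2)

# Two near-identical hypothetical groups: a t-test would give p > .05,
# but here the null is quantifiably favoured (BF01 > 1).
bf01 = bic_bayes_factor_01([4.9, 5.1, 5.0, 5.2], [5.0, 5.1, 4.9, 5.2])
```

Dedicated free tools such as JASP or the BayesFactor R package use proper default priors (e.g., the JZS prior) rather than this rough BIC shortcut, but the interpretation of the resulting Bayes factor is the same.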
15.
The detection of epistatic interactive effects of multiple genetic variants on the susceptibility of human complex diseases is a great challenge in genome-wide association studies (GWAS). Although methods have been proposed to identify such interactions, the lack of an explicit definition of epistatic effects, together with computational difficulties, makes the development of new methods indispensable. In this paper, we introduce epistatic modules to describe epistatic interactive effects of multiple loci on diseases. On the basis of this notion, we put forward a Bayesian marker partition model to explain observed case-control data, and we develop a Gibbs sampling strategy to facilitate the detection of epistatic modules. Comparisons of the proposed approach with three existing methods on seven simulated disease models demonstrate the superior performance of our approach. When applied to a genome-wide case-control data set for Age-related Macular Degeneration (AMD), the proposed approach successfully identifies two known susceptible loci and suggests that a combination of two other loci—one in the gene SGCD and the other in SCAPER—is associated with the disease. Further functional analysis supports the speculation that the interaction of these two genetic variants may contribute to susceptibility to AMD. When applied to a genome-wide case-control data set for Parkinson's disease, the proposed method identifies seven suspicious loci that may contribute independently to the disease.
16.
Phenotypes, DNA, and measures of ecological differences are widely used in species delimitation. Although rarely defined in such studies, ecological divergence is almost always approximated using multivariate climatic data associated with sets of specimens (i.e., the “climatic niche”); the justification for this approach is that species-specific climatic envelopes act as surrogates for physiological tolerances. Using identical statistical procedures, we evaluated the usefulness and validity of the climate-as-proxy assumption by comparing performance of genetic (nDNA SNPs and mitochondrial DNA), phenotypic, and climatic data for objective species delimitation in the speckled rattlesnake (Crotalus mitchellii) complex. Ordination and clustering patterns were largely congruent among intrinsic (heritable) traits (nDNA, mtDNA, phenotype), and discordance is explained by biological processes (e.g., ontogeny, hybridization). In contrast, climatic data did not produce biologically meaningful clusters that were congruent with any intrinsic dataset, but rather corresponded to regional differences in atmospheric circulation and climate, indicating an absence of inherent taxonomic signal in these data. Surrogating climate for physiological tolerances adds artificial weight to evidence of species boundaries, as these data are irrelevant for that purpose. Based on the evidence from congruent clustering of intrinsic datasets, we recommend that three subspecies of C. mitchellii be recognized as species: C. angelensis, C. mitchellii, and C. pyrrhus.
17.
Based on capture-mark-recapture sampling methods, the problem of estimating unknown population size was considered. The sampling started with the assumption that at the beginning of the experiment all the individuals were unmarked, and the unmarked individuals caught in each sample are marked and returned to the original population before the next sample is drawn. It is also assumed that the population is closed to birth, death, emigration and immigration. Using a general inverse sampling approach, the unknown population size N is estimated by a maximum likelihood estimator (MLE), and a simple form for an approximate MLE is obtained. The probability function for S (the minimum number of samples required to be drawn to obtain L (L ≥ 1) samples, each of which contains at least one marked individual) and the form of E[S] are also obtained. In addition, corrections and improvements of some previous works in this field are given.
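For the two-sample special case of a closed population, the classical estimate is easy to state. A sketch of Chapman's bias-corrected Lincoln-Petersen estimator, with hypothetical survey numbers (the paper's inverse-sampling MLE generalises this single-recapture idea to sequential samples):

```python
def chapman_estimate(n_marked, n_caught, n_recaptured):
    # N_hat = (M + 1)(C + 1) / (R + 1) - 1, a bias-corrected version of
    # the Lincoln-Petersen estimate N_hat = M * C / R.
    return (n_marked + 1) * (n_caught + 1) / (n_recaptured + 1) - 1

# Hypothetical survey: 100 animals marked and released, 80 caught in a
# second sample, 20 of those already marked.
n_hat = chapman_estimate(100, 80, 20)
```

The uncorrected Lincoln-Petersen value here would be 100 × 80 / 20 = 400; the corrected estimate is slightly lower and, unlike the raw ratio, remains finite even when no marked animals are recaptured.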
18.
Time series data provided by single-molecule Förster resonance energy transfer (smFRET) experiments offer the opportunity to infer not only model parameters describing molecular complexes, e.g., rate constants, but also information about the model itself, e.g., the number of conformational states. Resolving whether such states exist or how many of them exist requires a careful approach to the problem of model selection, here meaning discrimination among models with differing numbers of states. The most straightforward approach to model selection generalizes the common idea of maximum likelihood—selecting the most likely parameter values—to maximum evidence: selecting the most likely model. In either case, such an inference presents a tremendous computational challenge, which we here address by exploiting an approximation technique termed variational Bayesian expectation maximization. We demonstrate how this technique can be applied to temporal data such as smFRET time series; show superior statistical consistency relative to the maximum likelihood approach; compare its performance on smFRET data generated from experiments on the ribosome; and illustrate how model selection in such probabilistic or generative modeling can facilitate analysis of closely related temporal data currently prevalent in biophysics. Source code used in this analysis, including a graphical user interface, is available open source via http://vbFRET.sourceforge.net.
19.
Background: Evaluating environmental health risks in communities requires models characterizing geographic and demographic patterns of exposure to multiple stressors. These exposure models can be constructed from multivariable regression analyses using individual-level predictors (microdata), but these microdata are not typically available with sufficient geographic resolution for community risk analyses given privacy concerns. Methods: We developed synthetic geographically-resolved microdata for a low-income community (New Bedford, Massachusetts) facing multiple environmental stressors. We first applied probabilistic reweighting using simulated annealing to data from the 2006–2010 American Community Survey, combining 9,135 microdata samples from the New Bedford area with census tract-level constraints for individual and household characteristics. We then evaluated the synthetic microdata using goodness-of-fit tests and by examining spatial patterns of microdata fields not used as constraints. As a demonstration, we developed a multivariable regression model predicting smoking behavior as a function of individual-level microdata fields using New Bedford-specific data from the 2006–2010 Behavioral Risk Factor Surveillance System, linking this model with the synthetic microdata to predict demographic and geographic smoking patterns in New Bedford. Results: Our simulation produced microdata representing all 94,944 individuals living in New Bedford in 2006–2010. Variables in the synthetic population matched the constraints well at the census tract level (e.g., ancestry, gender, age, education, household income) and reproduced the census-derived spatial patterns of non-constraint microdata. Smoking in New Bedford was significantly associated with numerous demographic variables found in the microdata, with estimated tract-level smoking rates varying from 20% (95% CI: 17%, 22%) to 37% (95% CI: 30%, 45%).
Conclusions: We used simulation methods to create geographically-resolved individual-level microdata that can be used in community-wide exposure and risk assessment studies. This approach provides insights regarding community-scale exposure and vulnerability patterns, valuable in settings where policy can be informed by characterization of multi-stressor exposures and health risks at high resolution.
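The reweighting step can be viewed as combinatorial optimisation: select survey records so that the synthetic tract matches a census constraint, occasionally accepting worse selections to escape local optima. A toy single-constraint sketch (the actual study used many individual and household constraints per tract, and these records and targets are invented):

```python
import math
import random

def anneal_select(records, attribute, target, n_select, steps=5000, seed=0):
    # Pick n_select records (with replacement) so that the selected
    # population's total for `attribute` approaches the census target.
    rng = random.Random(seed)
    chosen = [rng.randrange(len(records)) for _ in range(n_select)]

    def error(selection):
        return abs(sum(records[i][attribute] for i in selection) - target)

    temp = 1.0
    for _ in range(steps):
        candidate = chosen[:]
        candidate[rng.randrange(n_select)] = rng.randrange(len(records))
        delta = error(candidate) - error(chosen)
        # Always accept improvements; accept worsening moves with a
        # probability that shrinks as the temperature cools.
        if delta <= 0 or rng.random() < math.exp(-delta / temp):
            chosen = candidate
        temp = max(temp * 0.999, 1e-3)
    return [records[i] for i in chosen]

# Toy microdata pool: half the records are smokers; target 30 smokers of 100.
pool = [{"smoker": i % 2} for i in range(200)]
synthetic = anneal_select(pool, "smoker", target=30, n_select=100)
```

With multiple tract-level constraints the error function becomes a sum over constraint tables, but the accept/reject annealing loop is unchanged.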