Similar Articles
 20 similar articles found
1.
Advances in GPS tracking technologies have allowed for rapid assessment of important oceanographic regions for seabirds, helping us understand seabird distributions and the characteristics that determine the success of populations. In many cases, quality GPS tracking data may not be available; however, long-term population monitoring data may exist. In this study, a method to infer important oceanographic regions for seabirds is presented, using breeding sooty shearwaters as a case study. The method combines a popular machine learning algorithm (generalized boosted regression modeling), geographic information systems, long-term ecological data and open-access oceanographic datasets. Time series of chick size and harvest index data derived from a long-term dataset of Maori ‘muttonbirder’ diaries were obtained and used as response variables in a gridded spatial model. Areas of the sub-Antarctic water region best captured the variation in the chick size data. Oceanographic features including wind speed and the Charnock parameter (a derived variable representing ocean surface roughness) emerged as top predictor variables in these models. Previously collected GPS data demonstrate that these regions are used as “flyways” by sooty shearwaters during the breeding season. It is therefore likely that wind speeds in these flyways affect the ability of sooty shearwaters to provision their chicks, through changes in flight dynamics. This approach was designed around machine learning methodology but can also be implemented with other statistical algorithms. Furthermore, these methods can be applied to any long-term time series of population data to identify important regions for a species of interest.
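The modeling step described above can be approximated with scikit-learn's gradient boosting implementation. The sketch below is illustrative only: the column names (`wind_speed`, `charnock`, `sst`) and the response `chick_size` are hypothetical stand-ins for the oceanographic predictors and long-term response series named in the abstract.

```python
# Minimal sketch of a generalized boosted regression model relating gridded
# oceanographic predictors to a long-term biological response (hypothetical data).
import numpy as np
import pandas as pd
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
n_cells = 500  # grid cells in the candidate ocean region

# Hypothetical predictors per grid cell: wind speed, Charnock parameter, SST.
X = pd.DataFrame({
    "wind_speed": rng.gamma(4.0, 2.0, n_cells),
    "charnock":   rng.normal(0.018, 0.004, n_cells),
    "sst":        rng.normal(10.0, 2.0, n_cells),
})
# Hypothetical response: a chick-size index loosely tied to wind speed.
chick_size = 0.3 * X["wind_speed"] + rng.normal(0, 1.0, n_cells)

gbm = GradientBoostingRegressor(n_estimators=500, learning_rate=0.01, max_depth=3)
print("CV R^2:", cross_val_score(gbm, X, chick_size, cv=5).mean())

# Relative influence of each predictor, analogous to the variable rankings
# reported in the abstract.
gbm.fit(X, chick_size)
for name, imp in zip(X.columns, gbm.feature_importances_):
    print(f"{name}: {imp:.2f}")
```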

2.
High Throughput Biological Data (HTBD) require detailed analysis methods, and from a life science perspective these analysis results make the most sense when interpreted within the context of biological pathways. Bayesian Networks (BNs) capture both linear and nonlinear interactions and handle stochastic events in a probabilistic framework that accounts for noise, making them viable candidates for HTBD analysis. We have recently proposed an approach, called Bayesian Pathway Analysis (BPA), for analyzing HTBD using BNs, in which known biological pathways are modeled as BNs and the pathways that best explain the given HTBD are identified. BPA uses fold-change information to obtain an input matrix with which each pathway, modeled as a BN, is scored. Scoring is achieved using the Bayesian-Dirichlet Equivalent method, and significance is assessed by randomization via bootstrapping of the columns of the input matrix. In this study, we improve the BPA system by optimizing the steps involved in “Data Preprocessing and Discretization”, “Scoring”, “Significance Assessment”, and “Software and Web Application”. We tested the improved system on synthetic data sets and achieved over 98% accuracy in identifying the active pathways. The overall approach was applied to real cancer microarray data sets in order to investigate the pathways that are commonly active in different cancer types. We compared our findings on the real data sets with a relevant approach called Signaling Pathway Impact Analysis (SPIA).
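The significance-assessment step (randomizing the columns of the input matrix and recomputing the score) can be sketched as a generic permutation test. The pathway scoring function below is a hypothetical placeholder, not the Bayesian-Dirichlet Equivalent metric BPA actually uses.

```python
# Sketch of significance assessment by randomizing the columns of an input
# matrix and recomputing a pathway score (placeholder scoring function).
import numpy as np

def score_pathway(matrix):
    """Hypothetical stand-in for the Bayesian-Dirichlet Equivalent score."""
    return float(np.abs(matrix.mean(axis=0)).sum())

def significance(matrix, n_random=1000, seed=0):
    rng = np.random.default_rng(seed)
    observed = score_pathway(matrix)
    null_scores = []
    for _ in range(n_random):
        shuffled = matrix.copy()
        for j in range(shuffled.shape[1]):
            # Permute each column (sample) independently of the others.
            shuffled[:, j] = rng.permutation(shuffled[:, j])
        null_scores.append(score_pathway(shuffled))
    # Empirical p-value: fraction of randomized scores at least as extreme.
    return (np.sum(np.array(null_scores) >= observed) + 1) / (n_random + 1)

fold_changes = np.random.default_rng(1).normal(size=(20, 8))  # genes x samples
print("empirical p-value:", significance(fold_changes))
```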

3.
We introduce a method for spatiotemporal data fusion and demonstrate its performance on three constructed data sets: one entirely simulated, one with temporal speech signals and simulated spatial images, and another with recorded music time series and astronomical images defining the spatial patterns. Each case study is constructed to present specific challenges to test the method and demonstrate its capabilities. Our algorithm, BICAR (Bidirectional Independent Component Averaged Representation), is based on independent component analysis (ICA) and extracts pairs of temporal and spatial sources from two data matrices with arbitrarily different spatiotemporal resolution. We pair the temporal and spatial sources using a physical transfer function that connects the dynamics of the two. BICAR produces a hierarchy of sources ranked according to reproducibility; we show that sources which are more reproducible are more similar to true (known) sources. BICAR is robust to added noise, even in a “worst case” scenario where all physical sources are equally noisy. BICAR is also relatively robust to misspecification of the transfer function. BICAR holds promise as a useful data-driven assimilation method in neuroscience, earth science, astronomy, and other signal processing domains.

4.
Models for prediction of allogeneic hematopoietic stem cell transplantation (HSCT) related mortality partially account for transplant risk. Improving predictive accuracy requires understanding of prediction-limiting factors, such as the statistical methodology used, the number and quality of features collected, or simply the population size. Using an in-silico approach (i.e., iterative computerized simulations) based on machine learning (ML) algorithms, we set out to analyze these factors. A cohort of 25,923 adult acute leukemia patients from the European Society for Blood and Marrow Transplantation (EBMT) registry was analyzed. The prediction target was non-relapse mortality (NRM) 100 days following HSCT. Thousands of prediction models were developed under varying conditions: increasing sample size, specific subpopulations, and an increasing number of variables, which were selected and ranked by separate feature selection algorithms. Depending on the algorithm, predictive performance plateaued at a population size of 6,611–8,814 patients, reaching a maximal area under the receiver operating characteristic curve (AUC) of 0.67. AUCs of models developed on specific subpopulations ranged from 0.59 to 0.67, for patients in second complete remission and patients receiving reduced-intensity conditioning, respectively. Only 3–5 variables were necessary to achieve near-maximal AUCs. The top 3 ranking variables, shared by all algorithms, were disease stage, donor type, and conditioning regimen. Our findings empirically demonstrate that, with regard to NRM prediction, few variables “carry the weight” and that traditional HSCT data have been “worn out”. “Breaking through” the predictive boundaries will likely require additional types of inputs.
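The central experiment, tracking predictive performance as a function of sample size, can be reproduced in miniature with a learning-curve loop. Everything below is synthetic; the registry variables named in the abstract (disease stage, donor type, conditioning regimen) are represented by generic features.

```python
# Sketch: AUC as a function of training-set size, to look for the plateau
# described in the abstract (synthetic data, generic features).
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=20000, n_features=20, n_informative=4,
                           weights=[0.9, 0.1], random_state=0)
X_pool, X_test, y_pool, y_test = train_test_split(X, y, test_size=0.3,
                                                  stratify=y, random_state=0)

for n in [500, 1000, 2500, 5000, 10000]:
    clf = GradientBoostingClassifier(random_state=0)
    clf.fit(X_pool[:n], y_pool[:n])
    auc = roc_auc_score(y_test, clf.predict_proba(X_test)[:, 1])
    print(f"n={n:>5}  AUC={auc:.3f}")
```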

5.
Considering the two-class classification problem in brain imaging data analysis, we propose a sparse representation-based multivariate pattern analysis (MVPA) algorithm to localize the brain activation patterns corresponding to different stimulus classes/brain states. Feature selection can be modeled as a sparse representation (or sparse regression) problem, a technique that has been successfully applied to voxel selection in fMRI data analysis. However, a single selection pass based on sparse representation or other methods tends to return only a subset of the most informative features rather than all of them. Herein, our proposed algorithm recursively eliminates the informative features selected by a sparse regression method until the decoding accuracy based on the remaining features drops to a threshold close to chance level; in this way, the resulting feature set is expected to include all the features informative for discrimination. According to the signs of the sparse regression weights, these selected features are separated into two sets corresponding to the two stimulus classes/brain states. Next, in order to remove irrelevant/noisy features in the two selected feature sets, we perform a nonparametric permutation test at the individual subject level or the group level. We verified our algorithm with a toy data set and an intrinsic signal optical imaging data set; the results show that our algorithm accurately localized the two class-related patterns. As an application example, we used our algorithm on a functional magnetic resonance imaging (fMRI) data set and obtained two sets of informative voxels in the human brain, corresponding to two semantic categories (i.e., “old people” and “young people”).
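The recursive elimination loop can be sketched as follows: at each round an L1-penalized classifier selects voxels (non-zero weights), those voxels are added to the informative set and removed from the data, and the loop stops when cross-validated accuracy on the remaining voxels falls near chance. The classifier, thresholds, and simulated data are illustrative choices, not the authors' exact pipeline.

```python
# Sketch of recursive feature (voxel) elimination driven by a sparse classifier.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
n_trials, n_voxels = 120, 300
X = rng.normal(size=(n_trials, n_voxels))
y = rng.integers(0, 2, n_trials)
X[:, :10] += 1.5 * y[:, None]          # 10 voxels carry class information

remaining = np.arange(n_voxels)
informative = []
chance = 0.55                          # illustrative "close to chance" threshold

while remaining.size:
    clf = LogisticRegression(penalty="l1", solver="liblinear", C=0.1)
    acc = cross_val_score(clf, X[:, remaining], y, cv=5).mean()
    if acc <= chance:
        break
    clf.fit(X[:, remaining], y)
    selected = remaining[np.flatnonzero(clf.coef_[0])]
    if selected.size == 0:
        break
    # Signs of the sparse weights would split voxels between the two states.
    informative.extend(selected.tolist())
    remaining = np.setdiff1d(remaining, selected)

print(f"{len(informative)} informative voxels identified")
```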

6.

Background

DNA barcode differences within animal species are usually much less than differences among species, making it generally straightforward to match unknowns to a reference library. Here we aim to better understand the evolutionary mechanisms underlying this usual “barcode gap” pattern. We employ avian barcode libraries to test a central prediction of neutral theory, namely, that intraspecific variation equals 2Nµ, where N is population size and µ is mutations per site per generation. Birds are uniquely suited for this task: they have the best-known species limits, are well represented in barcode libraries, and, most critically, are the only large group with documented census population sizes. In addition, we ask if mitochondrial molecular clock measurements conform to the neutral theory prediction that the clock rate equals µ.

Results

Intraspecific COI barcode variation was uniformly low regardless of census population size (n = 142 species in 15 families). Apparent outliers reflected lumping of reproductively isolated populations or hybrid lineages. Re-analysis of a published survey of cytochrome b variation in diverse birds (n = 93 species in 39 families) further confirmed uniformly low intraspecific variation. Hybridization/gene flow among species/populations was the main limitation to DNA barcode identification.

Conclusions/Significance

To our knowledge, this is the first large study of animal mitochondrial diversity using actual census population sizes and the first to test outliers for population structure. Our finding of universally low intraspecific variation contradicts a central prediction of neutral theory and is not readily accounted for by commonly proposed ad hoc modifications. We argue that the weight of evidence (low intraspecific variation and the molecular clock) indicates that neutral evolution plays a minor role in mitochondrial sequence evolution. As an alternative paradigm consistent with the empirical data, we propose that extreme purifying selection, including at synonymous sites, limits variation within species and that continuous adaptive selection drives the molecular clock.
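The quantitative prediction being tested can be stated in one line: under neutrality, expected intraspecific diversity per site is roughly 2Nµ. The numbers below are placeholders, chosen only to show the order-of-magnitude contrast between census-based expectations and the uniformly low variation the authors report.

```python
# Neutral-theory expectation for mitochondrial diversity: pi ~ 2 * N * mu
# (N = census/effective population size, mu = mutations per site per generation).
mu = 1e-8                      # illustrative per-site, per-generation rate
for N in [1e4, 1e6, 1e8]:      # hypothetical census population sizes
    print(f"N={N:.0e}  expected diversity 2*N*mu = {2 * N * mu:.2e}")
```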

7.
Use of socially generated “big data” to access information about collective states of mind in human societies has become a new paradigm in the emerging field of computational social science. A natural application is predicting society's reaction to a new product in terms of popularity and adoption rate. However, bridging the gap between “real-time monitoring” and “early prediction” remains a big challenge. Here we report on an endeavor to build a minimalistic predictive model for the financial success of movies based on collective activity data of online users. We show that the popularity of a movie can be predicted well before its release by measuring and analyzing the activity level of editors and viewers of the movie's entry in Wikipedia, the well-known online encyclopedia.

8.
Genome-wide association study (GWAS) data on a disease are increasingly available from multiple related populations. In this scenario, meta-analyses can improve power to detect homogeneous genetic associations, but if there exist ancestry-specific effects, via interactions on genetic background or with a causal effect that co-varies with genetic background, then these will typically be obscured. To address this issue, we have developed a robust statistical method for detecting susceptibility gene-ancestry interactions in multi-cohort GWAS based on closely-related populations. We use the leading principal components of the empirical genotype matrix to cluster individuals into “ancestry groups” and then look for evidence of heterogeneous genetic associations with disease or another trait across these clusters. Robustness is improved when there are multiple cohorts, as the signal from true gene-ancestry interactions can then be distinguished from gene-collection artefacts by comparing the observed interaction effect sizes in collection groups relative to ancestry groups. When we applied the method to colorectal cancer, we identified a missense polymorphism in the iron-absorption gene CYBRD1 that associated with disease in individuals of English, but not Scottish, ancestry. The association replicated in two additional, independently-collected data sets. Our method can be used to detect associations between genetic variants and disease that have been obscured by population genetic heterogeneity. It can be readily extended to the identification of genetic interactions on other covariates such as measured environmental exposures. We envisage our methodology being of particular interest to researchers with existing GWAS data, as ancestry groups can be easily defined and thus tested for interactions.
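A minimal version of the ancestry-group idea: cluster individuals on the leading principal components of the genotype matrix, then test whether a variant's effect differs across clusters via a SNP-by-group interaction. The sketch uses simulated genotypes and a statsmodels logistic regression; it is not the authors' full multi-cohort procedure.

```python
# Sketch: PCA-defined "ancestry groups" and a SNP-by-group interaction test.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
n, m = 2000, 200
G = rng.binomial(2, 0.3, size=(n, m)).astype(float)           # simulated genotypes
G[n // 2:, :20] += rng.binomial(2, 0.2, size=(n // 2, 20))    # crude structure
G = np.clip(G, 0, 2)

pcs = PCA(n_components=5).fit_transform(G)
groups = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(pcs)

df = pd.DataFrame({
    "snp": G[:, 0],
    "group": groups,
    "disease": rng.binomial(1, 0.3, n),   # random phenotype, for illustration only
})
# A heterogeneous association would show up as a significant snp:group interaction.
fit = smf.logit("disease ~ snp * C(group)", data=df).fit(disp=0)
print(fit.summary().tables[1])
```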

9.
Epigenetic alterations are a hallmark of aging and age‐related diseases. Computational models using DNA methylation data can create “epigenetic clocks”, which are proposed to reflect “biological” aging. It is therefore important to understand the relationship between predictive clock sites and aging biology. To do this, we examined over 450,000 methylation sites from 9,699 samples. We found that ~20% of the measured genomic cytosines can be used to make many different epigenetic clocks whose age prediction performance surpasses that of telomere length. For these predictive sites, the average methylation change over a lifetime was small (~1.5%), and the sites were under‐represented in canonical regions of epigenetic regulation. There was only a weak association between “accelerated” epigenetic aging and disease. We also compared tissue‐specific and pan‐tissue clock performance; this is critical both for applying clocks to new sample sets in basic research and for understanding whether clinically available tissues are feasible surrogates for evaluating “epigenetic aging” in unavailable tissues (e.g., brain). Despite the reproducible and accurate age predictions obtained from DNA methylation data, these findings suggest that clocks as currently designed may have limited utility for understanding the molecular biology of aging and may not be suitable as surrogate endpoints in studies of anti‐aging interventions. Purpose‐built clocks for specific tissues, age ranges, or phenotypes may perform better for their specific purpose. However, if purpose‐built clocks are necessary for meaningful predictions, then the utility of clocks and their application in the field need to be considered in that context.
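Epigenetic clocks of the kind discussed here are typically penalized linear regressions of age on CpG methylation fractions. Below is a minimal, fully synthetic sketch with scikit-learn's ElasticNetCV; real clocks are trained on hundreds of thousands of CpGs across thousands of samples, and the sizes and effect magnitudes here are illustrative.

```python
# Sketch of an "epigenetic clock": elastic-net regression of age on CpG beta values.
import numpy as np
from sklearn.linear_model import ElasticNetCV
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n_samples, n_cpgs = 800, 2000
age = rng.uniform(0, 90, n_samples)
beta = rng.uniform(0.2, 0.8, size=(n_samples, n_cpgs))
# A small subset of sites drifts slowly with age (~1-2% over a lifetime),
# mirroring the small average methylation changes reported in the abstract.
beta[:, :50] += 0.0002 * age[:, None]

X_tr, X_te, y_tr, y_te = train_test_split(beta, age, random_state=0)
clock = ElasticNetCV(l1_ratio=0.5, cv=5).fit(X_tr, y_tr)
print("R^2 on held-out samples:", round(clock.score(X_te, y_te), 2))
print("CpGs with non-zero weight:", int(np.sum(clock.coef_ != 0)))
```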

10.
Accurate and spatially-explicit maps of tropical forest carbon stocks are needed to implement carbon offset mechanisms such as REDD+ (Reducing Emissions from Deforestation and Degradation). The Random Forest machine learning algorithm may aid carbon mapping applications using remotely-sensed data. However, Random Forest has never been compared to traditional and potentially more reliable techniques such as regionally stratified sampling and upscaling, and it has rarely been employed with spatial data. Here, we evaluated the performance of Random Forest in upscaling airborne LiDAR (Light Detection and Ranging)-based carbon estimates compared to the stratification approach over a 16-million hectare focal area of the Western Amazon. We considered two runs of Random Forest, with and without spatial contextual modeling; the spatial-context run included x and y position directly in the model. In each case, we set aside 8 million hectares (i.e., half of the focal area) for validation; this rigorous test of Random Forest went above and beyond the internal validation normally compiled by the algorithm (the so-called “out-of-bag” validation), which proved insufficient for this spatial application. In this heterogeneous region of Northern Peru, the model with spatial context was the best-performing run of Random Forest, explaining 59% of the variation in LiDAR-based carbon estimates within the validation area, compared to 37% for stratification and 43% for Random Forest without spatial context. With this roughly 60% improvement in explained variation, RMSE against validation LiDAR samples improved from 33 to 26 Mg C ha−1 when using Random Forest with spatial context. Our results suggest that spatial context should be considered when using Random Forest, and that doing so may result in substantially improved carbon stock modeling for purposes of climate change mitigation.
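The with/without spatial-context comparison can be sketched by fitting two Random Forest regressors, one of which receives the x and y coordinates as extra predictors, and scoring both on a held-back half of the data. The data below are simulated; the real study used LiDAR-based carbon estimates and satellite covariates.

```python
# Sketch: Random Forest upscaling with and without x/y coordinates as predictors.
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n = 5000
x, y = rng.uniform(0, 100, n), rng.uniform(0, 100, n)
covariates = rng.normal(size=(n, 4))                  # e.g. satellite layers
# Simulated carbon with a smooth spatial trend plus covariate effects.
carbon = (0.5 * x + 0.3 * y
          + covariates @ np.array([5.0, -3.0, 2.0, 1.0])
          + rng.normal(0, 10, n))

X_env = covariates
X_spa = np.column_stack([covariates, x, y])

for name, X in [("no spatial context", X_env), ("with spatial context", X_spa)]:
    X_tr, X_te, y_tr, y_te = train_test_split(X, carbon, test_size=0.5,
                                              random_state=0)
    rf = RandomForestRegressor(n_estimators=200, random_state=0).fit(X_tr, y_tr)
    print(f"{name}: held-out R^2 = {rf.score(X_te, y_te):.2f}")
```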

11.

Background

Accurate inference of genetic discontinuities between populations is an essential component of intraspecific biodiversity and evolution studies, as well as associative genetics. The most widely-used methods to infer population structure are model-based, Bayesian MCMC procedures that minimize Hardy-Weinberg and linkage disequilibrium within subpopulations. These methods are useful, but suffer from large computational requirements and a dependence on modeling assumptions that may not be met in real data sets. Here we describe the development of a new approach, PCO-MC, which couples principal coordinate analysis to a clustering procedure for the inference of population structure from multilocus genotype data.

Methodology/Principal Findings

PCO-MC uses data from all principal coordinate axes simultaneously to calculate a multidimensional “density landscape”, from which the number of subpopulations, and the membership within subpopulations, is determined using a valley-seeking algorithm. Using extensive simulations, we show that this approach outperforms a Bayesian MCMC procedure when many loci (e.g. 100) are sampled, but that the Bayesian procedure is marginally superior with few loci (e.g. 10). When presented with sufficient data, PCO-MC accurately delineated subpopulations with population Fst values as low as 0.03 (G'st > 0.2), whereas the limit of resolution of the Bayesian approach was Fst = 0.05 (G'st > 0.35).

Conclusions/Significance

We draw a distinction between population structure inference for describing biodiversity as opposed to Type I error control in associative genetics. We suggest that discrete assignments, like those produced by PCO-MC, are appropriate for circumscribing units of biodiversity whereas expression of population structure as a continuous variable is more useful for case-control correction in structured association studies.

12.
The use of dense SNPs to predict the genetic value of an individual for a complex trait is often referred to as “genomic selection” in livestock and crops, but is also relevant to human genetics to predict, for example, complex genetic disease risk. The accuracy of prediction depends on the strength of linkage disequilibrium (LD) between SNPs and causal mutations. If sequence data were used instead of dense SNPs, accuracy should increase because causal mutations are present, but demographic history and long-term negative selection also influence accuracy. We therefore evaluated genomic prediction using simulated sequence data in two contrasting populations: one reducing from an ancestrally large effective population size (Ne) to a small one, with the high LD common in domestic livestock, and a second with a large constant-sized Ne and low LD, similar to that in some human or outbred plant populations. There were two scenarios in each population: causal variants were either neutral or under long-term negative selection. For large Ne, sequence data led to a 22% increase in accuracy relative to ∼600K SNP chip data with a Bayesian analysis and a more modest advantage with a BLUP analysis. This advantage increased when causal variants were influenced by negative selection, and accuracy persisted when 10 generations separated reference and validation populations. However, in the reducing Ne population, there was little advantage for sequence data even with negative selection. This study demonstrates the joint influence of demography and selection on accuracy of prediction and improves our understanding of how best to exploit sequence for genomic prediction.

13.
Presence-only data, where information is available concerning species presence but not species absence, are subject to bias due to observers being more likely to visit and record sightings at some locations than others (hereafter “observer bias”). In this paper, we describe and evaluate a model-based approach to accounting for observer bias directly: presence locations are modelled as a function of known observer-bias variables (such as accessibility variables) in addition to environmental variables, and predictions of species occurrence free of such observer bias are then made by conditioning on a common level of bias. We implement this idea using point process models with a LASSO penalty, a new presence-only method related to maximum entropy modelling that implicitly addresses the “pseudo-absence problem” of where to locate pseudo-absences (and how many). The proposed method of bias-correction is evaluated using systematically collected presence/absence data for 62 plant species endemic to the Blue Mountains near Sydney, Australia. It is shown that modelling and controlling for observer bias significantly improves the accuracy of predictions made using presence-only data, and usually improves predictions as compared to pseudo-absence or “inventory” methods of bias correction based on absences from non-target species. Future research will consider the potential for improving the proposed bias-correction approach by estimating the observer bias simultaneously across multiple species.
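The bias-correction idea (fit a model on environmental plus observer-bias covariates, then predict with the bias covariates held at a common level) can be sketched with a LASSO-penalized logistic regression as a stand-in for the penalized point process model in the abstract. Variable names and data are hypothetical.

```python
# Sketch: fit presence records on environment + accessibility covariates,
# then predict occurrence with accessibility fixed at a common value.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n = 4000
env = rng.normal(size=n)                 # hypothetical environmental covariate
dist_road = rng.uniform(0, 20, n)        # hypothetical observer-bias covariate
# Recorded presences depend on suitability AND on how accessible a site is.
logit = 1.2 * env - 0.15 * dist_road - 1.0
presence = rng.binomial(1, 1 / (1 + np.exp(-logit)))

X = np.column_stack([env, dist_road])
model = LogisticRegression(penalty="l1", solver="liblinear", C=0.5).fit(X, presence)

# Bias-corrected prediction: condition on a common accessibility level
# (here the median distance to road) so maps reflect environment only.
X_common = np.column_stack([env, np.full(n, np.median(dist_road))])
corrected = model.predict_proba(X_common)[:, 1]
print("mean corrected occurrence probability:", round(corrected.mean(), 3))
```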

14.
In both prokaryotic and eukaryotic cells, gene expression is regulated across the cell cycle to ensure “just-in-time” assembly of select cellular structures and molecular machines. However, present in all time-series gene expression measurements is variability that arises from both systematic error in the cell synchrony process and variance in the timing of cell division at the level of the single cell. Thus, gene or protein expression data collected from a population of synchronized cells is an inaccurate measure of what occurs in the average single-cell across a cell cycle. Here, we present a general computational method to extract “single-cell”-like information from population-level time-series expression data. This method removes the effects of 1) variance in growth rate and 2) variance in the physiological and developmental state of the cell. Moreover, this method represents an advance in the deconvolution of molecular expression data in its flexibility, minimal assumptions, and the use of a cross-validation analysis to determine the appropriate level of regularization. Applying our deconvolution algorithm to cell cycle gene expression data from the dimorphic bacterium Caulobacter crescentus, we recovered critical features of cell cycle regulation in essential genes, including ctrA and ftsZ, that were obscured in population-based measurements. In doing so, we highlight the problem with using population data alone to decipher cellular regulatory mechanisms and demonstrate how our deconvolution algorithm can be applied to produce a more realistic picture of temporal regulation in a cell.

15.
The ability to evaluate the validity of data is essential to any investigation, and manual “eyes on” assessments of data quality have dominated in the past. Yet, as the size of collected data continues to increase, so does the effort required to assess their quality. This challenge is of particular concern for networks that automate their data collection, and has resulted in the automation of many quality assurance and quality control analyses. Unfortunately, the interpretation of the resulting data quality flags can become quite challenging with large data sets. We have developed a framework to summarize data quality information and facilitate interpretation by the user. Our framework consists of first compiling data quality information and then presenting it through two separate mechanisms: a quality report and a quality summary. The quality report presents the results of specific quality analyses as they relate to individual observations, while the quality summary takes a spatial or temporal aggregate of each quality analysis and provides a summary of the results. Included in the quality summary is a final quality flag, which further condenses data quality information to indicate whether a data product is valid. This framework has the added flexibility to allow “eyes on” information on data quality to be incorporated for many data types. Furthermore, this framework can aid problem tracking and resolution, should sensor or system malfunctions arise.
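A tiny version of the report-plus-summary idea with pandas: per-observation quality flags form the detailed report, which is then aggregated over a day into pass rates and a single final flag. The column names and the 95% pass threshold are illustrative choices, not the framework's actual specification.

```python
# Sketch: condensing per-observation quality flags into a daily quality summary.
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
report = pd.DataFrame({
    "time": pd.date_range("2024-01-01", periods=2880, freq="30min"),
    "range_test": rng.random(2880) > 0.02,        # True = observation passed
    "spike_test": rng.random(2880) > 0.01,
    "persistence_test": rng.random(2880) > 0.03,
})

summary = report.set_index("time").resample("D").mean()   # daily pass rates
summary["final_flag"] = (summary >= 0.95).all(axis=1)     # illustrative threshold
print(summary.round(3).head())
```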

16.
Improvements to particle tracking algorithms are required to effectively analyze the motility of biological molecules in complex or noisy systems. A typical single particle tracking (SPT) algorithm detects particle coordinates for trajectory assembly. However, particle detection filters fail for data sets with low signal-to-noise levels. When tracking molecular motors in complex systems, standard techniques often fail to separate the fluorescent signatures of moving particles from background signal. We developed an approach to analyze the motility of kinesin motor proteins moving along the microtubule cytoskeleton of extracted neurons using the Kullback-Leibler divergence to identify regions where there are significant differences between models of moving particles and background signal. We tested our software on both simulated and experimental data and found a noticeable improvement in SPT capability and a higher identification rate of motors as compared with current methods. This algorithm, called Cega, for “find the object,” produces data amenable to conventional blob detection techniques that can then be used to obtain coordinates for downstream SPT processing. We anticipate that this algorithm will be useful for those interested in tracking moving particles in complex in vitro or in vivo environments.
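The core statistic can be illustrated with a small function that scores how strongly a local pixel-intensity histogram diverges from a background histogram; windows with high Kullback-Leibler divergence are candidates for containing a moving particle. This is a didactic sketch, not the Cega implementation.

```python
# Sketch: Kullback-Leibler divergence between a local intensity histogram and
# a background model, as a score for candidate particle regions.
import numpy as np

def kl_divergence(p, q, eps=1e-12):
    """D_KL(p || q) for discrete distributions (histogram counts)."""
    p = np.asarray(p, dtype=float) + eps
    q = np.asarray(q, dtype=float) + eps
    p, q = p / p.sum(), q / q.sum()
    return float(np.sum(p * np.log(p / q)))

rng = np.random.default_rng(0)
bins = np.linspace(0, 1000, 33)
background = np.histogram(rng.normal(200, 30, 10000), bins=bins)[0]

# A window containing a bright moving particle shifts intensities upward.
window_particle = np.histogram(rng.normal(400, 40, 400), bins=bins)[0]
window_empty = np.histogram(rng.normal(200, 30, 400), bins=bins)[0]

print("KL (particle window vs background):", round(kl_divergence(window_particle, background), 2))
print("KL (empty window vs background):   ", round(kl_divergence(window_empty, background), 2))
```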

17.
Regulatory networks play a central role in cellular behavior and decision making. Learning these regulatory networks is a major task in biology, and devising computational methods and mathematical models for this task is a major endeavor in bioinformatics. Boolean networks have been used extensively for modeling regulatory networks. In this model, the state of each gene can be either ‘on’ or ‘off’, and the next state of a gene is updated, synchronously or asynchronously, according to a Boolean rule applied to the current state of the entire system. Inferring a Boolean network from a set of experimental data entails two main steps: first, the experimental time-series data are discretized into Boolean trajectories, and then, a Boolean network is learned from these Boolean trajectories. In this paper, we consider three methods for data discretization, including a new one we propose, and three methods for learning Boolean networks, and study the performance of all nine possible combinations on four regulatory systems of varying dynamical complexity. We find that employing the right combination of methods for data discretization and network learning results in Boolean networks that capture the dynamics well and provide predictive power. Our findings are in contrast to a recent survey that placed Boolean networks on the low end of the “faithfulness to biological reality” and “ability to model dynamics” spectra. Further, contrary to the common argument in favor of Boolean networks, we find that a relatively large number of time points in the time-series data is required to learn good Boolean networks for certain data sets. Last but not least, while methods have been proposed for inferring Boolean networks, publicly available implementations of them are still missing. Here, we make our implementation of the methods publicly available in open source at http://bioinfo.cs.rice.edu/.
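The two-step pipeline in the abstract, discretizing time series into Boolean trajectories and then working with Boolean update rules, can be illustrated in a few lines. The mean-threshold discretization and the three-gene rule set below are invented for the example; they are not among the methods compared in the paper.

```python
# Sketch: thresholding expression time series into Boolean trajectories and
# running a synchronous Boolean network update (toy three-gene rules).
import numpy as np

def discretize(series):
    """Binarize each gene's trajectory at its own mean (illustrative rule)."""
    return (series > series.mean(axis=0)).astype(int)

def step(state):
    """One synchronous update of a toy regulatory network over genes (A, B, C)."""
    a, b, c = state
    return np.array([
        int(not c),        # A is repressed by C
        int(a),            # B is activated by A
        int(a and b),      # C requires both A and B
    ])

expression = np.random.default_rng(0).random((10, 3))  # 10 time points, 3 genes
trajectory = discretize(expression)
print("Boolean trajectory:\n", trajectory)

state = trajectory[0]
for _ in range(5):
    state = step(state)
print("state after 5 synchronous updates:", state)
```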

18.
Oat (Avena sativa L.) seed is a rich resource of beneficial lipids, soluble fiber, protein, and antioxidants, and is considered a healthful food for humans. Little is known regarding the genetic controllers of variation for these compounds in oat seed. We characterized natural variation in the mature seed metabolome using untargeted metabolomics on 367 diverse lines and leveraged this information to improve prediction for seed quality traits. We used a latent factor approach to define unobserved variables that may drive covariance among metabolites. One hundred latent factors were identified, of which 21% were enriched for compounds associated with lipid metabolism. Through a combination of whole-genome regression and association mapping, we show that latent factors that generate covariance for many metabolites tend to have a complex genetic architecture. Nonetheless, we recovered significant associations for 23% of the latent factors. These associations were used to inform a multi-kernel genomic prediction model, which was used to predict seed lipid and protein traits in two independent studies. Predictions for 8 of the 12 traits were significantly improved compared to genomic best linear unbiased prediction when this prediction model was informed using associations from lipid-enriched factors. This study provides new insights into variation in the oat seed metabolome and provides genomic resources for breeders to improve selection for health-promoting seed quality traits. More broadly, we outline an approach to distill high-dimensional “omics” data to a set of biologically meaningful variables and translate inferences on these data into improved breeding decisions.

19.
Quantitative microscopy is a valuable tool for inferring molecular mechanisms of cellular processes such as clathrin-mediated endocytosis, but, for quantitative microscopy to reach its potential, both data collection and analysis need improvement. We introduce new tools to track and count endocytic patches in fission yeast to increase the quality of the data extracted from quantitative microscopy movies. We present a universal method to achieve “temporal superresolution” by aligning temporal data sets at a resolution finer than the measurement intervals. These methods allowed us to extract new information about endocytic actin patches in wild-type cells from measurements of the fluorescence of fimbrin-mEGFP. We show that the time course of actin assembly and disassembly varies by <600 ms between patches. Actin polymerizes during vesicle formation, but we show that polymerization does not participate in vesicle movement other than to limit the complex diffusive motions of newly formed endocytic vesicles, which move faster as the surrounding actin meshwork decreases in size over time. Our methods also show that the number of patches in fission yeast is proportional to cell length and that the variability in the partitioning of patches between the tips of interphase cells has been underestimated.

20.

Background

Evaluating environmental health risks in communities requires models characterizing geographic and demographic patterns of exposure to multiple stressors. These exposure models can be constructed from multivariable regression analyses using individual-level predictors (microdata), but these microdata are not typically available with sufficient geographic resolution for community risk analyses given privacy concerns.

Methods

We developed synthetic geographically-resolved microdata for a low-income community (New Bedford, Massachusetts) facing multiple environmental stressors. We first applied probabilistic reweighting using simulated annealing to data from the 2006–2010 American Community Survey, combining 9,135 microdata samples from the New Bedford area with census tract-level constraints for individual and household characteristics. We then evaluated the synthetic microdata using goodness-of-fit tests and by examining spatial patterns of microdata fields not used as constraints. As a demonstration, we developed a multivariable regression model predicting smoking behavior as a function of individual-level microdata fields using New Bedford-specific data from the 2006–2010 Behavioral Risk Factor Surveillance System, linking this model with the synthetic microdata to predict demographic and geographic smoking patterns in New Bedford.

Results

Our simulation produced microdata representing all 94,944 individuals living in New Bedford in 2006–2010. Variables in the synthetic population matched the constraints well at the census tract level (e.g., ancestry, gender, age, education, household income) and reproduced the census-derived spatial patterns of non-constraint microdata. Smoking in New Bedford was significantly associated with numerous demographic variables found in the microdata, with estimated tract-level smoking rates varying from 20% (95% CI: 17%, 22%) to 37% (95% CI: 30%, 45%).

Conclusions

We used simulation methods to create geographically-resolved individual-level microdata that can be used in community-wide exposure and risk assessment studies. This approach provides insights regarding community-scale exposure and vulnerability patterns, valuable in settings where policy can be informed by characterization of multi-stressor exposures and health risks at high resolution.
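The reweighting step can be sketched as a simulated annealing search over which survey records populate a tract: propose swapping one sampled record for another, always keep swaps that reduce the distance to the tract-level constraint totals, and occasionally accept worse swaps with a probability that decays as the temperature falls. The constraint table and microdata below are synthetic stand-ins for the ACS samples and census tract constraints described in the Methods.

```python
# Sketch: simulated annealing to select a synthetic tract population whose
# marginal totals match census-style constraints (synthetic example).
import numpy as np

rng = np.random.default_rng(0)
micro = rng.integers(0, 2, size=(500, 3))        # 500 survey records, 3 binary traits
target = np.array([60, 35, 80])                  # tract-level counts to match
pop_size = 120                                   # individuals to synthesize

chosen = rng.integers(0, len(micro), pop_size)   # initial population (record indices)

def error(idx):
    """Total absolute deviation from the constraint totals."""
    return np.abs(micro[idx].sum(axis=0) - target).sum()

temp, cooling = 5.0, 0.999
current = error(chosen)
for _ in range(20000):
    i = rng.integers(pop_size)                   # propose replacing one individual
    proposal = chosen.copy()
    proposal[i] = rng.integers(len(micro))
    new = error(proposal)
    # Accept improvements always; accept worse states with decaying probability.
    if new <= current or rng.random() < np.exp((current - new) / temp):
        chosen, current = proposal, new
    temp *= cooling

print("final absolute error against constraints:", current)
```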
