Similar Articles
 20 similar articles found (search time: 31 ms)
1.
2.
Tremendous progress has been made at the level of sequential computation in phylogenetics. However, little attention has been paid to parallel computation. Parallel computing is particularly suited to phylogenetics because of the many ways large computational problems can be broken into parts that can be analyzed concurrently. In this paper, we investigate the scaling factors and efficiency of random addition and tree refinement strategies using the direct optimization software POY on a small (10 slave processors) and a large (256 slave processors) cluster of networked PCs running Linux. These algorithms were tested on several data sets composed of DNA and morphology, ranging from 40 to 500 taxa. The various algorithms in POY show fundamentally different properties within and between clusters. All algorithms are efficient on the small cluster for the 40-taxon data set. On the large cluster, multibuilding exhibits excellent parallel efficiency, whereas parallel building is inefficient; these results are independent of data set size. Branch swapping in parallel shows excellent speed-up for 16 slave processors on the large cluster, but no appreciable further speed-up with the addition of more slave processors (>16), independent of data set size. Ratcheting in parallel is efficient with the addition of up to 32 processors on the large cluster; this result is also independent of data set size.
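The scaling vocabulary used above (speed-up, parallel efficiency) reduces to two one-line formulas. The sketch below is a generic illustration with hypothetical timings, not output from POY:

```python
# Minimal sketch (not from POY): computing parallel speed-up and efficiency
# from hypothetical wall-clock timings of a tree-search run.

def speedup(serial_time, parallel_time):
    """Speed-up S = T_serial / T_parallel."""
    return serial_time / parallel_time

def efficiency(serial_time, parallel_time, n_procs):
    """Parallel efficiency E = S / p; E near 1.0 means near-linear scaling."""
    return speedup(serial_time, parallel_time) / n_procs

# Hypothetical timings (seconds) for a single search on 1 vs. 16 processors:
t1, t16 = 3200.0, 250.0
print(speedup(t1, t16))          # 12.8
print(efficiency(t1, t16, 16))   # 0.8
```

An efficiency well below 1.0 at large processor counts is exactly the plateau the abstract reports for branch swapping beyond 16 slaves.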

3.
Prediction of group patterns in social mammals based on a coalescent model
This study describes a statistical model which assumes that mammalian group patterns coincide with groups of genetic relatives. Given a fixed sample size, recursive algorithms for the exact computation of the probability distribution of the number of groups are provided. The recursive algorithms are then incorporated into a statistical likelihood framework which can be used to detect and quantify departure from the null model by estimating a clustering parameter. The test is then applied to ecological data from social herbivores and carnivores. Our findings support the hypothesis that genetic relatedness is likely to predict group patterns when large mammals have few or no predators.

4.
Parameter inference and model selection are very important for mathematical modeling in systems biology. Bayesian statistics can be used to conduct both, and the framework known as approximate Bayesian computation is often used for this purpose in systems biology. However, Monte Carlo methods need to be used to compute Bayesian posterior distributions. In addition, the posterior distributions of parameters are sometimes almost uniform or very similar to their prior distributions; in such cases, it is difficult to choose one specific parameter value with high credibility as the representative value of the distribution. To overcome these problems, we introduced population annealing, one of the population Monte Carlo algorithms. Although population annealing is usually used in statistical mechanics, we showed that it can be used to compute Bayesian posterior distributions in the approximate Bayesian computation framework. To deal with the non-identifiability of representative parameter values, we proposed running the simulations with a parameter ensemble sampled from the posterior distribution, named the “posterior parameter ensemble”. We showed that population annealing is an efficient and convenient algorithm for generating the posterior parameter ensemble. We also showed that simulations with the posterior parameter ensemble can not only reproduce the data used for parameter inference but also capture and predict data that were not used for parameter inference. Lastly, we introduced the marginal likelihood in the approximate Bayesian computation framework for Bayesian model selection. We showed that population annealing enables us to compute the marginal likelihood in this framework and to conduct model selection based on the Bayes factor.
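As a point of reference for the framework discussed above, here is a minimal rejection-ABC sketch: keep parameter draws whose simulated data fall within a tolerance of the observed data, and the accepted draws form the "posterior parameter ensemble". This uses a toy coin-bias model and plain rejection sampling rather than the authors' population-annealing scheme; all names and numbers are illustrative:

```python
import random

# Rejection ABC (a simpler relative of the population-annealing scheme in the
# paper): accept theta when simulated data lie within tolerance eps of the
# observed data. The model (a coin with unknown bias) is purely illustrative.

def simulate(theta, n=100, rng=random):
    # number of successes in n Bernoulli(theta) trials
    return sum(rng.random() < theta for _ in range(n))

def abc_rejection(observed, prior_draw, eps, n_samples, rng):
    posterior = []
    while len(posterior) < n_samples:
        theta = prior_draw(rng)
        if abs(simulate(theta, rng=rng) - observed) <= eps:
            posterior.append(theta)
    return posterior  # the "posterior parameter ensemble"

rng = random.Random(0)
ensemble = abc_rejection(observed=70, prior_draw=lambda r: r.random(),
                         eps=5, n_samples=200, rng=rng)
print(round(sum(ensemble) / len(ensemble), 2))  # close to the true bias 0.7
```

When the posterior is flat, no single accepted theta is representative, which is exactly why the abstract advocates propagating the whole ensemble through the simulations.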

5.
We analyse optimal and heuristic place prioritization algorithms for biodiversity conservation area network design that can use probabilistic data on the distribution of surrogates for biodiversity. We show how an Expected Surrogate Set Covering Problem (ESSCP) and a Maximal Expected Surrogate Covering Problem (MESCP) can be linearized for computationally efficient solution. For the ESSCP, we study the performance of two optimization software packages (XPRESS and CPLEX) and five heuristic algorithms based on traditional measures of complementarity and rarity, as well as the Shannon and Simpson indices of α‐diversity, which are used in this context for the first time. On small artificial data sets, the optimal place prioritization algorithms often produced more economical solutions than the heuristic algorithms, though not always ones guaranteed to be optimal. However, with large data sets, the optimal algorithms often required long computation times and produced no better results than the heuristic ones. Thus there is generally little reason to prefer optimal to heuristic algorithms with probabilistic data sets.
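A complementarity-based heuristic of the kind compared above can be sketched as greedy set cover: repeatedly pick the place that adds the most not-yet-covered surrogates. This is a generic illustration, not the paper's ESSCP/MESCP formulations or its five specific heuristics, and the toy places and surrogates are invented:

```python
# Greedy complementarity heuristic for surrogate set covering (a sketch, not
# the paper's exact ESSCP formulation): at each step choose the place whose
# surrogate set adds the most not-yet-covered targets.

def greedy_cover(places, targets):
    """places: dict place -> set of surrogates present; targets: set to cover."""
    chosen, covered = [], set()
    while covered != targets:
        best = max(places, key=lambda p: len(places[p] - covered))
        gain = places[best] - covered
        if not gain:
            break  # remaining targets cannot be covered by any place
        chosen.append(best)
        covered |= gain
    return chosen

places = {"A": {1, 2, 3}, "B": {3, 4}, "C": {4, 5, 6}, "D": {1, 6}}
print(greedy_cover(places, {1, 2, 3, 4, 5, 6}))  # ['A', 'C']
```

Greedy heuristics like this are fast but carry no optimality guarantee, which is the trade-off the abstract quantifies against the XPRESS/CPLEX exact solvers.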

6.
Information transfer, measured by transfer entropy, is a key component of distributed computation. It is therefore important to understand the pattern of information transfer in order to unravel the distributed computational algorithms of a system. Since distributed computation in many natural systems is thought to rely on rhythmic processes, a frequency-resolved measure of information transfer is highly desirable. Here, we present a novel algorithm, and its efficient implementation, to identify separately the frequencies sending and receiving information in a network. Our approach relies on the invertible maximum overlap discrete wavelet transform (MODWT) for the creation of surrogate data in the computation of transfer entropy and entirely avoids filtering of the original signals. The approach thereby avoids well-known problems due to phase shifts or the ineffectiveness of filtering in the information-theoretic setting. We also show that measuring frequency-resolved information transfer is a partial information decomposition problem that cannot be fully resolved to date, and we discuss the implications of this issue. Lastly, we evaluate the performance of our algorithm on simulated data and apply it to human magnetoencephalography (MEG) recordings and to local field potential recordings in the ferret. In human MEG we demonstrate top-down information flow in temporal cortex from very high frequencies (above 100 Hz) to both similarly high frequencies and to frequencies around 20 Hz, i.e., a complex spectral configuration of cortical information transmission that has not been described before. In the ferret we show that the prefrontal cortex sends information at low frequencies (4-8 Hz) to early visual cortex (V1), while V1 receives the information at high frequencies (>125 Hz).
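For readers unfamiliar with the underlying measure, a plug-in estimator of plain (time-domain, discrete) transfer entropy with history length 1 can be sketched as below. The paper's frequency resolution via MODWT surrogates is omitted entirely, and the coupled binary signals are synthetic:

```python
from collections import Counter
from math import log2

# Plug-in estimator of discrete transfer entropy TE(X -> Y) with history
# length 1: TE = sum p(y_{t+1}, y_t, x_t) * log[ p(y_{t+1}|y_t,x_t) / p(y_{t+1}|y_t) ].

def transfer_entropy(x, y):
    triples = Counter(zip(y[1:], y[:-1], x[:-1]))  # (y_{t+1}, y_t, x_t)
    pairs_yx = Counter(zip(y[:-1], x[:-1]))        # (y_t, x_t)
    pairs_yy = Counter(zip(y[1:], y[:-1]))         # (y_{t+1}, y_t)
    singles = Counter(y[:-1])                      # y_t
    n = len(x) - 1
    te = 0.0
    for (y1, y0, x0), c in triples.items():
        # counts rearranged into the conditional-probability ratio above
        te += (c / n) * log2((c * singles[y0]) /
                             (pairs_yy[(y1, y0)] * pairs_yx[(y0, x0)]))
    return te

# y copies x with a one-step lag, so information flows x -> y but not y -> x:
x = [0, 1, 1, 0, 1, 0, 0, 1, 1, 0] * 20
y = [0] + x[:-1]
print(transfer_entropy(x, y) > transfer_entropy(y, x))  # True
```

The frequency-resolved question in the paper asks *which rhythms* carry this asymmetry, which is where the wavelet surrogate machinery (not shown) comes in.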

7.
In discrete tomography, a scanned object is assumed to consist of only a few different materials. This prior knowledge can be effectively exploited by a specialized discrete reconstruction algorithm such as the Discrete Algebraic Reconstruction Technique (DART), which is capable of providing more accurate reconstructions from limited data than conventional reconstruction algorithms. However, like most iterative reconstruction algorithms, DART suffers from long computation times. To increase both the computational efficiency and the reconstruction quality of DART, a multiresolution version of DART (MDART) is proposed, in which the reconstruction starts on a coarse grid with a large pixel (voxel) size. The resulting reconstruction is then resampled on a finer grid and used as an initial point for a subsequent DART reconstruction. This process continues until the target pixel size is reached. Experiments show that MDART can provide a significant speed-up, reduce missing wedge artefacts and improve feature reconstruction in the object compared with DART run for the same time, making its use with large datasets more feasible.
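The coarse-to-fine loop described above can be sketched generically. Here `reconstruct` is a hypothetical stand-in for a DART pass at one resolution, and nearest-neighbour upsampling is just one possible resampling choice:

```python
# Coarse-to-fine resampling loop in the spirit of MDART (a sketch; the real
# method runs DART at each level -- here `reconstruct` is a stand-in stub).

def upsample(img, factor=2):
    """Nearest-neighbour upsampling of a 2-D list-of-lists image."""
    return [[v for v in row for _ in range(factor)]
            for row in img for _ in range(factor)]

def multires_reconstruct(levels, reconstruct):
    img = [[0.0]]                      # 1x1 starting grid
    for level in range(levels):
        img = reconstruct(img)         # refine the estimate at this resolution
        if level < levels - 1:
            img = upsample(img)        # initial point for the next, finer grid
    return img

# Stand-in "reconstruction" step: just brightens the current estimate a little.
result = multires_reconstruct(3, lambda im: [[v + 0.25 for v in r] for r in im])
print(len(result), len(result[0]))  # 4 4
```

The point of the structure is that each fine-grid solve starts from a good coarse initial point instead of from scratch, which is where the reported speed-up comes from.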

8.
The development of high-throughput technology has generated a massive amount of high-dimensional data, much of it of discrete type. Robust and efficient learning algorithms such as LASSO [1] are required for feature selection and overfitting control. However, most feature selection algorithms are applicable only to continuous data. In this paper, we propose a novel method for sparse support vector machines (SVMs) with L_p (p < 1) regularization. Efficient algorithms (LpSVM) are developed for learning the classifier, which is applicable to high-dimensional data sets with both discrete and continuous data types. The regularization parameters are estimated by maximizing the area under the ROC curve (AUC) on the cross-validation data. Experimental results on protein sequence and SNP data attest to the accuracy, sparsity, and efficiency of the proposed algorithm. Biomarkers identified with our methods are compared with those from other methods in the literature. The software package in Matlab is available upon request.
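The AUC criterion used above for tuning the regularization parameters is a standard rank statistic, equal to the probability that a random positive example scores higher than a random negative one (with ties counted as half). A minimal generic implementation, not the authors' code:

```python
# Rank-based AUC (equivalent to a normalized Mann-Whitney U statistic): the
# fraction of positive/negative pairs ranked correctly, ties counting 0.5.

def auc(scores, labels):
    """AUC = P(score of a random positive > score of a random negative)."""
    pos = [s for s, l in zip(scores, labels) if l == 1]
    neg = [s for s, l in zip(scores, labels) if l == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

scores = [0.9, 0.8, 0.7, 0.4, 0.3, 0.1]
labels = [1,   1,   0,   1,   0,   0]
print(auc(scores, labels))  # 8/9 correctly ranked pairs
```

Because AUC depends only on the ranking of scores, it is insensitive to monotone rescaling of the classifier output, which makes it a convenient model-selection criterion.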

9.
Previous work has shown that it is often essential to account for the variation in rates at different sites in phylogenetic models in order to avoid phylogenetic artifacts such as long branch attraction. In most current models, the gamma distribution is used for the rates-across-sites distributions and is implemented as an equal-probability discrete gamma. In this article, we introduce discrete distribution estimates with large numbers of equally spaced rate categories, allowing us to investigate the appropriateness of the gamma model. With large numbers of rate categories, these discrete estimates are flexible enough to approximate the shape of almost any distribution. Likelihood ratio statistical tests and a nonparametric bootstrap confidence-bound estimation procedure based on the discrete estimates are presented that can be used to test the fit of a parametric family. We applied the methodology to several different protein data sets, and found that although the gamma model often provides a good parametric model for this type of data, rate estimates from an equal-probability discrete gamma model with a small number of categories will tend to underestimate the largest rates. In cases when the gamma model assumption is in doubt, rate estimates coming from the discrete rate distribution estimate with a large number of rate categories provide a robust alternative to gamma estimates. An alternative implementation of the gamma distribution is proposed that, for equal numbers of rate categories, is computationally more efficient during optimization than the standard gamma implementation and can provide more accurate estimates of site rates.
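The idea of equally *spaced* rate categories (as opposed to the usual equal-probability discretization) can be illustrated by discretizing a mean-one gamma density with the midpoint rule. This is a sketch of the general idea, not the authors' implementation; the shape value, grid size, and cutoff are arbitrary choices here:

```python
from math import gamma, exp

# Discretize a gamma(shape a, mean 1) rates-across-sites density into K
# equally spaced categories on [0, xmax], with probability mass proportional
# to the density at each category midpoint (midpoint rule, renormalized).

def gamma_pdf(x, a):
    b = 1.0 / a                        # scale chosen so the mean rate is 1
    return x ** (a - 1) * exp(-x / b) / (gamma(a) * b ** a)

def equally_spaced_categories(a, k=50, xmax=10.0):
    width = xmax / k
    rates = [(i + 0.5) * width for i in range(k)]      # category midpoints
    probs = [gamma_pdf(r, a) * width for r in rates]   # midpoint-rule masses
    z = sum(probs)                                     # renormalize
    return rates, [p / z for p in probs]

rates, probs = equally_spaced_categories(a=2.0)
mean_rate = sum(r * p for r, p in zip(rates, probs))
print(round(mean_rate, 2))  # close to 1.0 (small discretization error)
```

With many categories this grid can approximate the shape of almost any rate distribution, which is what makes it usable as a nonparametric check on the gamma assumption.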

10.

Background  

Stochastic effects can be important for the behavior of processes involving small population numbers, so the study of stochastic models has become an important topic in the burgeoning field of computational systems biology. However, analysis techniques for stochastic models have tended to lag behind their deterministic cousins, owing to the heavier computational demands of the statistical approaches used to fit the models to experimental data. There is a continuing need for more effective and efficient algorithms. In this article we focus on the parameter inference problem for stochastic kinetic models of biochemical reactions, given discrete time-course observations of some or all of the molecular species.

11.
The discrete data structure and large sequencing depth of RNA sequencing (RNA-seq) experiments can often generate outlier read counts in one or more RNA samples within a homogeneous group. Thus, how to identify and manage outlier observations in RNA-seq data is an emerging topic of interest. One of the main objectives in these research efforts is to develop statistical methodology that effectively balances the impact of outlier observations and achieves maximal power for statistical testing. To reach that goal, strengthening the accuracy of outlier detection is an important precursor. Current outlier detection algorithms for RNA-seq data are executed within a testing framework and may be sensitive to sparse data and heavy-tailed distributions. Therefore, we propose a univariate algorithm that uses a probabilistic approach to measure the deviation between an observation and the distribution generating the remaining data, and implement it within an iterative leave-one-out design strategy. Analyses of real and simulated RNA-seq data show that the proposed methodology has higher outlier detection rates for both non-normalized and normalized negative binomial distributed data.
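The iterative leave-one-out design can be sketched with a deliberately simple deviation measure (a z-score of each point against the remaining data). The paper's probabilistic measure for negative-binomial counts is different, so treat this purely as an illustration of the design strategy:

```python
from statistics import mean, stdev

# Iterative leave-one-out outlier screen (a generic sketch of the design
# strategy; the paper's deviation measure for negative-binomial counts differs).

def loo_outliers(counts, z_cut=3.0):
    data, outliers = list(counts), []
    while len(data) > 2:
        # score each point against the distribution of the *remaining* data
        scores = []
        for i, v in enumerate(data):
            rest = data[:i] + data[i + 1:]
            s = stdev(rest) or 1e-9     # guard against zero spread
            scores.append(abs(v - mean(rest)) / s)
        worst = max(range(len(data)), key=scores.__getitem__)
        if scores[worst] < z_cut:
            break                       # no remaining point deviates enough
        outliers.append(data.pop(worst))
    return outliers

print(loo_outliers([12, 15, 14, 13, 16, 14, 15, 400]))  # [400]
```

Leaving the candidate point out before computing the reference distribution matters: a single extreme count otherwise inflates the spread estimate and masks itself.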

12.
Aim: Trait‐based risk assessment for invasive species is becoming an important tool for identifying non‐indigenous species that are likely to cause harm. Despite this, concerns remain that the invasion process is too complex for accurate predictions to be made. Our goal was to test risk assessment performance across a range of taxonomic and geographical scales, at different points in the invasion process, with a range of statistical and machine learning algorithms.
Location: Regional to global data sets.
Methods: We selected six data sets differing in size, geography and taxonomic scope. For each data set, we created seven risk assessment tools using a range of statistical and machine learning algorithms. Performance of the tools was compared to determine the effects of data set size and scale and of the algorithm used, and to determine the overall performance of the trait‐based risk assessment approach.
Results: Risk assessment tools with good performance were generated for all data sets. Random forests (RF) and logistic regression (LR) consistently produced tools with high performance. Other algorithms had varied performance. Despite their greater power and flexibility, machine learning algorithms did not systematically outperform statistical algorithms. Neither the geographic scope nor the size of the data set systematically affected risk assessment performance.
Main conclusions: Across six representative data sets, we were able to create risk assessment tools with high performance. Additional data sets could be generated for other taxonomic groups and regions, and these could support efforts to prevent the arrival of new invaders. RF and LR approaches performed well for all data sets and could be used as a standard approach to risk assessment development.

13.
Microarray and BeadChip are two of the most efficient techniques for measuring gene expression and methylation data in bioinformatics. Biclustering deals with the simultaneous clustering of genes and samples. In this article, we propose a computational rule mining framework, StatBicRM (statistical biclustering-based rule mining), to identify special types of rules and potential biomarkers from biological datasets using integrated statistical and binary inclusion-maximal biclustering techniques. First, a novel statistical strategy is used to eliminate insignificant, low-significance, or redundant genes in such a way that the significance level satisfies the data distribution property (i.e., normal or non-normal distribution). The data are then discretized and post-discretized, consecutively. Thereafter, the biclustering technique is applied to identify maximal frequent closed homogeneous itemsets, and the corresponding special types of rules are extracted from the selected itemsets. Our proposed rule mining method performs better than other rule mining algorithms because it generates maximal frequent closed homogeneous itemsets instead of frequent itemsets; it thus saves elapsed time and can work on large datasets. Pathway and Gene Ontology analyses are conducted on the genes of the evolved rules using the DAVID database, and frequency analysis of the genes appearing in the evolved rules is performed to determine potential biomarkers. Furthermore, we also classify the data to determine how accurately the evolved rules describe the remaining (unknown) test data. Subsequently, we compare the average classification accuracy and other related factors with those of other rule-based classifiers, and statistical significance tests are performed to verify the statistical relevance of the comparative results. Each of the other rule mining methods or rule-based classifiers starts from the same post-discretized data matrix. Finally, we also include an integrated analysis of gene expression and methylation to determine the epigenetic effect (i.e., the effect of methylation) on gene expression level.

14.
Gene expression data usually contain a large number of genes but a small number of samples. Feature selection for gene expression data aims at finding a set of genes that best discriminate biological samples of different types. Using machine learning techniques, traditional gene selection based on empirical mutual information suffers from the data sparseness issue caused by the small number of samples. To overcome the sparseness issue, we propose a model-based approach that estimates the entropy of class variables on the model, instead of on the data themselves. Here, we use multivariate normal distributions to fit the data, because multivariate normal distributions have maximum entropy among all real-valued distributions with a specified mean and covariance and are widely used to approximate various distributions. Given that the data follow a multivariate normal distribution, the conditional distribution of class variables given the selected features is also normal, so its entropy can be computed from the log-determinant of its covariance matrix. Because of the large number of genes, computing all possible log-determinants is not efficient, so we propose several algorithms to greatly reduce the computational cost. Experiments on seven gene data sets and comparison with five other approaches show the accuracy of the multivariate Gaussian generative model for feature selection and the efficiency of our algorithms.
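The log-determinant identity mentioned above follows from the closed-form entropy of a multivariate normal, H = (1/2) log((2πe)^d det Σ). A minimal two-dimensional sketch of that formula (not the paper's subset-selection algorithms):

```python
from math import log, pi, e

# Entropy (in nats) of a d-dimensional multivariate normal from the
# log-determinant of its covariance: H = 0.5 * log((2*pi*e)^d * det(Sigma)).
# Sketched here for the 2x2 case, where the determinant is closed-form.

def gaussian_entropy_2d(cov):
    det = cov[0][0] * cov[1][1] - cov[0][1] * cov[1][0]
    return 0.5 * log((2 * pi * e) ** 2 * det)

# Independent unit-variance case: H = d/2 * log(2*pi*e) = 2.8379... nats.
print(round(gaussian_entropy_2d([[1.0, 0.0], [0.0, 1.0]]), 4))  # 2.8379
```

Correlation shrinks the determinant and hence the entropy, which is exactly why selecting features that reduce the conditional covariance of the class variable is informative.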

15.
Along with the development of high-throughput sequencing technologies, both sample size and SNP number are increasing rapidly in genome-wide association studies (GWAS), and the associated computation is more challenging than ever. Here, we present a memory-efficient, visualization-enhanced, and parallel-accelerated R package called “rMVP” to address the need for improved GWAS computation. rMVP can 1) effectively process large GWAS data, 2) rapidly evaluate population structure, 3) efficiently estimate variance components by Efficient Mixed-Model Association eXpedited (EMMAX), Factored Spectrally Transformed Linear Mixed Models (FaST-LMM), and Haseman-Elston (HE) regression algorithms, 4) implement parallel-accelerated association tests of markers using general linear model (GLM), mixed linear model (MLM), and fixed and random model circulating probability unification (FarmCPU) methods, 5) compute quickly thanks to a globally efficient design of the GWAS process, and 6) generate various visualizations of GWAS-related information. Accelerated by a block matrix multiplication strategy and multiple threads, the association test methods embedded in rMVP are significantly faster than PLINK, GEMMA, and FarmCPU_pkg. rMVP is freely available at https://github.com/xiaolei-lab/rMVP.
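The block matrix multiplication strategy credited above for rMVP's speed can be illustrated with a toy pure-Python version; rMVP itself relies on optimized native code, and the block size and loop order here are illustrative choices:

```python
# Blocked (tiled) matrix multiplication: iterate over sub-blocks so each tile
# of A and B is reused while it is still cache-resident. A toy sketch of the
# strategy, not rMVP's implementation.

def block_matmul(A, B, bs=2):
    n, m, p = len(A), len(B), len(B[0])
    C = [[0.0] * p for _ in range(n)]
    for i0 in range(0, n, bs):
        for k0 in range(0, m, bs):
            for j0 in range(0, p, bs):
                # multiply one bs-by-bs tile pair, accumulating into C
                for i in range(i0, min(i0 + bs, n)):
                    for k in range(k0, min(k0 + bs, m)):
                        a = A[i][k]
                        for j in range(j0, min(j0 + bs, p)):
                            C[i][j] += a * B[k][j]
    return C

A = [[1, 2], [3, 4]]
B = [[5, 6], [7, 8]]
print(block_matmul(A, B))  # [[19.0, 22.0], [43.0, 50.0]]
```

The numerical result is identical to naive multiplication; the payoff is memory locality, which dominates run time once the genotype matrices no longer fit in cache.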

16.
In this study, we compared the effects of 2,6-dideoxy-2,6-imino-7-O-(beta-D-glucopyranosyl)-D-glycero-L-guloheptitol (MDL) to those of the glucosidase I inhibitor castanospermine on the purified processing enzymes glucosidase I and glucosidase II. We also compared the effects of these two inhibitors on glycoprotein processing in cell culture, using influenza virus-infected Madin-Darby canine kidney cells as a model system. With the purified processing enzymes, castanospermine was a better inhibitor of glucosidase I than of glucosidase II, whereas MDL was more effective against glucosidase II than glucosidase I. In cell culture at the appropriate dose, MDL also preferentially affected glucosidase II. Thus, at 250 micrograms/ml MDL, the major [3H]glucose-labeled (or [3H]mannose-labeled) glycopeptide from the viral hemagglutinin was susceptible to endoglucosaminidase H, and the oligosaccharide liberated by this treatment was characterized as a Glc2Man7-9GlcNAc on the basis of size, resistance to digestion by glucosidase I (but sensitivity to glucosidase II), methylation analysis, and Smith degradation studies. These data indicate that at appropriate concentrations of MDL (250 micrograms/ml), one can selectively inhibit glucosidase II in Madin-Darby canine kidney cells. However, at higher concentrations of inhibitor (500 micrograms/ml), both enzymes are apparently affected. Since MDL did not greatly inhibit the synthesis of lipid-linked saccharides or the synthesis of protein or RNA, it should be a useful tool for studies on the biosynthesis and role of N-linked oligosaccharides in glycoprotein function.

17.
MDL 72527 was considered a selective inhibitor of FAD-dependent polyamine oxidases. In the present communication, we demonstrate that MDL 72527 inactivates bovine serum amine oxidase (BSAO), a copper-containing TPQ enzyme, in a time-dependent manner at 25 degrees C. In striking contrast, the enzyme remained active after incubation with excess MDL 72527 at 37 degrees C, even after 70 h of incubation. Inactivation of BSAO with MDL 72527 at 25 degrees C did not involve the cofactor, as shown by spectroscopy and by reaction with phenylhydrazine. Docking of MDL 72527 is difficult owing to its size and two lipophilic moieties, and it has been shown that minor changes in the reaction rate of substrates cause major changes in K(m) and k(cat)/K(m). We hypothesise that subtle conformational changes between 25 and 37 degrees C prevent MDL 72527 from binding productively and keep its nucleophilic group from reacting with the double-bond system.

18.
19.
Many biologists believe that data analysis expertise lags behind the capacity for producing high-throughput data. One view within the bioinformatics community is that biological scientists need to develop algorithmic skills to meet the demands of the new technologies. In this article, we argue that the broader concept of inferential literacy, which includes understanding of data characteristics, experimental design and statistical analysis, in addition to computation, more adequately encompasses what is needed for efficient progress in high-throughput biology.

20.
In vivo measurement of local tissue characteristics by modern bioimaging techniques such as positron emission tomography (PET) provides the opportunity to analyze quantitatively the role that tissue heterogeneity may play in understanding biological function. This paper develops a statistical measure of the heterogeneity of a tissue characteristic that is based on the deviation of the distribution of the tissue characteristic from a unimodal elliptically contoured spatial pattern. An efficient algorithm is developed for computation of the measure based on volumetric region-of-interest data. The technique is illustrated by application to data from PET imaging studies of fluorodeoxyglucose utilization in human sarcomas. A set of 74 sarcoma patients (with five-year follow-up survival information) was evaluated for heterogeneity as well as for a number of other potential prognostic indicators of survival. A Cox proportional hazards analysis of these data shows that the degree of heterogeneity of the sarcoma is the major risk factor associated with patient death. Some theory is developed to analyze the asymptotic statistical behavior of the heterogeneity estimator. In the context of data arising from Poisson deconvolution (PET being the prime example), the heterogeneity estimator, which is a non-linear functional of the PET image data, is consistent and converges at a rate that is parametric in the injected dose.
