Similar Articles
20 similar articles found in 31 ms
1.
Aim To develop a systematic and generic framework for biogeographical regionalizations that can assist in reconciling different approaches and advance their application as a research tool. Location The Australian continent is used as a case study. Methods A review of approaches to biogeographical regionalization revealed two basic methodologies: the integrated survey method and the parametric approach. To help reconcile these different approaches, we propose a simple, four‐step, flexible and generic framework. (1) Identification of the thematic foci from the three main themes (composition and evolutionary legacy; ecosystem drivers; ecosystem responses). (2) Proposal of a theory defining the purpose. (3) Application of a numeric agglomerative classification procedure that requires the user to make explicit assumptions about attributes, the number of classification groups, the spatial unit of analysis, and the metric for measuring the similarity of these units based on their attribute values. (4) Acquisition of spatial estimates of the required input attribute data. For this case study, an agglomerative classification strategy was applied using the functions within patn 3.03, a software package facilitating large‐scale, multivariate pattern analysis. The input data to the classifications were continental coverages of 11 environmental variables and three indices of gross primary productivity stored at a grid cell resolution of c. 250 m. The spatial units of analysis were surface hydrological units (SHU), which were derived from a continental digital elevation model based on the contributing areas to stream segments or the area draining into a local sink where there is no organized drainage. The Minkowski series (Euclidean distance) was selected as the association measure to allow weightings to be applied to the variables. Results Two new biogeographical regionalizations of the Australian continent were generated. The first was an environmental domain classification, based on 11 climatic, terrain and soil attributes. This regionalization can be used to address hypotheses about the relationship between environmental distance and evolutionary processes. The classification produced 151 environmental groups. The second was a classification of primary productivity regimes based on estimates of the gross primary productivity of the vegetation cover calculated from moderate resolution imaging spectroradiometer (MODIS) normalized difference vegetation index (NDVI) data and estimates of radiation. This classification produced 50 groups, and can be used to examine hypotheses concerning productivity regimes and animal life‐history strategies. The productivity classification does not capture all the properties related to biological carrying capacity, process rates and differences in the characteristic biodiversity of ecosystems. Some of these ecologically significant properties are captured by the environmental domain classification. Main conclusions Our framework can be applied to all terrestrial regions, and the necessary data for the analyses presented here are now available at global scales. As the spatial predictions generated by the classifications can be tested by comparison with independent data, the approach facilitates exploratory analysis and further hypothesis generation. Integration of the three themes in our framework will contribute to a more comprehensive approach to biogeography.  相似文献   
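A minimal sketch of the numeric agglomerative step (step 3 of the framework) using scikit-learn as a stand-in for PATN; the synthetic attribute table, equal weights, and group count are illustrative assumptions, not the study's actual inputs.

```python
# Illustrative sketch: weighted-Euclidean agglomerative classification of
# spatial units into environmental groups (scikit-learn stand-in for PATN).
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import AgglomerativeClustering

rng = np.random.default_rng(0)

# Hypothetical attribute table: one row per spatial unit (e.g., SHU),
# one column per environmental variable (climate, terrain, soil, ...).
n_units, n_vars = 500, 11
X = rng.normal(size=(n_units, n_vars))

# Standardize, then apply user-chosen weights so the Euclidean distance
# acts as a weighted (Minkowski, p = 2) association measure.
weights = np.ones(n_vars)                      # assumption: equal weights
Xw = StandardScaler().fit_transform(X) * np.sqrt(weights)

# The analyst fixes the number of groups explicitly, as the framework requires.
model = AgglomerativeClustering(n_clusters=50, linkage="ward")
groups = model.fit_predict(Xw)
print("units per group (first 5 groups):", np.bincount(groups)[:5])
```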

2.
3.
Yi G  Shi JQ  Choi T 《Biometrics》2011,67(4):1285-1294
The model based on a Gaussian process (GP) prior and a kernel covariance function can be used to fit nonlinear data with multidimensional covariates. It has served as a flexible nonparametric approach for curve fitting, classification, clustering, and other statistical problems, and has been widely applied to complex nonlinear systems in many areas, particularly machine learning. However, the model becomes challenging to fit for large-scale and high-dimensional data, for example, the meat data discussed in this article, which have 100 highly correlated covariates. For such data, it suffers from large variance in parameter estimation, high predictive errors and, numerically, unstable computation. In this article, a penalized likelihood framework is applied to the GP-based model. Different penalties are investigated, and their suitability for the characteristics of GP models is discussed. The asymptotic properties are also discussed, with the relevant proofs. Several applications to real biomechanical and bioinformatics data sets are reported.  相似文献
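A minimal GP-regression sketch with scikit-learn on a toy problem with many covariates; it shows only the basic GP fit the paper builds on, not the penalized-likelihood extension, and the data are synthetic assumptions.

```python
# Basic Gaussian-process regression with a kernel covariance function
# over multidimensional covariates (toy data, not the meat data set).
import numpy as np
from sklearn.datasets import make_regression
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, WhiteKernel

X, y = make_regression(n_samples=80, n_features=100, noise=5.0, random_state=0)

gp = GaussianProcessRegressor(kernel=RBF(length_scale=10.0) + WhiteKernel(),
                              normalize_y=True, random_state=0).fit(X, y)

mean, std = gp.predict(X[:5], return_std=True)   # predictive mean and uncertainty
print(np.round(mean, 1), np.round(std, 1))
```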

4.
Recent advances in next-generation sequencing technologies have resulted in an exponential increase in the rate at which protein sequence data are being acquired. The k-gram feature representation, commonly used for protein sequence classification, usually results in prohibitively high dimensional input spaces, for large values of k. Applying data mining algorithms to these input spaces may be intractable due to the large number of dimensions. Hence, using dimensionality reduction techniques can be crucial for the performance and the complexity of the learning algorithms. In this paper, we study the applicability of feature hashing to protein sequence classification, where the original high-dimensional space is "reduced" by hashing the features into a low-dimensional space, using a hash function, i.e., by mapping features into hash keys, where multiple features can be mapped (at random) to the same hash key, and "aggregating" their counts. We compare feature hashing with the "bag of k-grams" approach. Our results show that feature hashing is an effective approach to reducing dimensionality on protein sequence classification tasks.  相似文献   
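A small sketch of the hashing idea using scikit-learn's FeatureHasher on character k-grams; the toy sequences, the value of k, and the number of hash buckets are illustrative assumptions.

```python
# Hash k-gram counts of protein sequences into a fixed low-dimensional space.
from collections import Counter
from sklearn.feature_extraction import FeatureHasher

def kgram_counts(seq, k=3):
    """Count overlapping k-grams in a sequence."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))

seqs = ["MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ",
        "MSHHWGYGKHNGPEHWHKDFPIAKGERQSPVDI"]

# 2**10 hash buckets instead of the full 20**3 possible 3-grams;
# k-grams that collide have their counts aggregated.
hasher = FeatureHasher(n_features=2**10, input_type="dict")
X = hasher.transform(kgram_counts(s) for s in seqs)
print(X.shape)   # (2, 1024) sparse matrix, ready for a downstream classifier
```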

5.
Conceptually, protein crystallization can be divided into two phases: search and optimization. Robotic protein crystallization screening can speed up the search phase, and has the potential to increase process quality. Automated image classification helps to increase throughput and consistently generate objective results. Although the classification accuracy can always be improved, our image analysis system can classify images from 1536-well plates with high classification accuracy (85%) and ROC score (0.87), as evaluated on 127 human-classified protein screens containing 5,600 crystal images and 189,472 non-crystal images. Data mining can integrate results from high-throughput screens with information about crystallizing conditions, intrinsic protein properties, and results from crystallization optimization. We apply association mining, a data mining approach that identifies frequently occurring patterns among variables and their values. This approach segregates proteins into groups based on how they react in a broad range of conditions, and clusters cocktails to reflect their potential to achieve crystallization. These results may lead to crystallization screen optimization, and reveal associations between protein properties and crystallization conditions. We also postulate that past experience may lead us to the identification of initial conditions favorable to crystallization for novel proteins.  相似文献
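A toy association-mining sketch using the mlxtend package (assumed available); the one-hot screen outcomes below are made-up placeholders, not the actual crystallization data or the authors' pipeline.

```python
# Toy association mining over crystallization screen outcomes.
import pandas as pd
from mlxtend.frequent_patterns import apriori, association_rules

# Hypothetical one-hot table: each row is one protein/cocktail trial.
df = pd.DataFrame({
    "PEG_3350":      [1, 1, 0, 1, 0, 1],
    "pH_7.5":        [1, 0, 1, 1, 0, 1],
    "high_salt":     [0, 1, 1, 0, 1, 0],
    "crystal_found": [1, 0, 0, 1, 0, 1],
}).astype(bool)

# Frequently co-occurring conditions/outcomes, then rules between them.
itemsets = apriori(df, min_support=0.3, use_colnames=True)
rules = association_rules(itemsets, metric="confidence", min_threshold=0.8)
print(rules[["antecedents", "consequents", "support", "confidence"]])
```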

6.
The existence of uncertainties and variations in data represents a remaining challenge for life cycle assessment (LCA). Moreover, a full analysis may be complex, time‐consuming, and implemented mainly when a product design is already defined. Structured under‐specification, a method developed to streamline LCA, is here proposed to support the residential building design process, by quantifying environmental impact when specific information on the system under analysis cannot be available. By means of structured classifications of materials and building assemblies, it is possible to use surrogate data during the life cycle inventory phase and thus to obtain environmental impact and associated uncertainty. The bill of materials of a building assembly can be specified using minimal detail during the design process. The low‐fidelity characterization of a building assembly and the uncertainty associated with these low levels of fidelity are systematically quantified through structured under‐specification using a structured classification of materials. The analyst is able to use this classification to quantify uncertainty in results at each level of specificity. Concerning building assemblies, an average decrease of uncertainty of 25% is observed at each additional level of specificity within the data structure. This approach was used to compare different exterior wall options during the early design process. Almost 50% of the comparisons can be statistically differentiated at even the lowest level of specificity. This data structure is the foundation of a streamlined approach that can be applied not only when a complete bill of materials is available, but also when fewer details are known.  相似文献   
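A back-of-envelope Monte Carlo sketch of how surrogate data at increasing specification levels narrow impact uncertainty; the impact value and the level-wise spreads (geometric standard deviations) are invented for illustration only.

```python
# Illustrative only: uncertainty of a surrogate material entry shrinking
# as the specification level (L1 -> L3) becomes more detailed.
import numpy as np

rng = np.random.default_rng(1)
mean_impact = 2.0                                   # kg CO2-eq per kg (made up)
gsd_by_level = {"L1": 2.0, "L2": 1.5, "L3": 1.2}    # assumed spreads per level

for level, gsd in gsd_by_level.items():
    draws = rng.lognormal(np.log(mean_impact), np.log(gsd), size=10_000)
    lo, hi = np.percentile(draws, [2.5, 97.5])
    print(f"{level}: 95% interval {lo:.2f}-{hi:.2f} kg CO2-eq/kg")
```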

7.
We propose and validate a multivariate classification algorithm for characterizing changes in human intracranial electroencephalographic data (iEEG) after learning motor sequences. The algorithm is based on a Hidden Markov Model (HMM) that captures spatio-temporal properties of the iEEG at the level of single trials. Continuous intracranial iEEG was acquired during two sessions (one before and one after a night of sleep) in two patients with depth electrodes implanted in several brain areas. They performed a visuomotor sequence (serial reaction time task, SRTT) using the fingers of their non-dominant hand. Our results show that the decoding algorithm correctly classified single iEEG trials from the trained sequence as belonging to either the initial training phase (day 1, before sleep) or a later consolidated phase (day 2, after sleep), whereas it failed to do so for trials belonging to a control condition (pseudo-random sequence). Accurate single-trial classification was achieved by taking advantage of the distributed pattern of neural activity. However, across all the contacts the hippocampus contributed most significantly to the classification accuracy for both patients, and one fronto-striatal contact for one patient. Together, these human intracranial findings demonstrate that a multivariate decoding approach can detect learning-related changes at the level of single-trial iEEG. Because it allows an unbiased identification of brain sites contributing to a behavioral effect (or experimental condition) at the level of single subject, this approach could be usefully applied to assess the neural correlates of other complex cognitive functions in patients implanted with multiple electrodes.  相似文献   
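A compact sketch of the HMM-based single-trial decoding logic using the hmmlearn package (assumed installed): one Gaussian HMM per condition, classification by log-likelihood. The simulated trials and model sizes are placeholders, not the patients' iEEG.

```python
# Train one Gaussian HMM per condition and classify held-out trials by
# which model assigns the higher log-likelihood.
import numpy as np
from hmmlearn.hmm import GaussianHMM

rng = np.random.default_rng(2)

def make_trials(offset, n_trials=20, n_samples=100, n_channels=8):
    """Simulated multichannel trials for one condition (illustrative)."""
    return [rng.normal(offset, 1.0, size=(n_samples, n_channels))
            for _ in range(n_trials)]

def fit_hmm(trials, n_states=4):
    X = np.vstack(trials)
    lengths = [len(t) for t in trials]
    return GaussianHMM(n_components=n_states, covariance_type="diag",
                       n_iter=50, random_state=0).fit(X, lengths)

day1, day2 = make_trials(0.0), make_trials(0.4)
hmm_day1, hmm_day2 = fit_hmm(day1[:15]), fit_hmm(day2[:15])

test = day1[15:] + day2[15:]
labels = ["day1"] * 5 + ["day2"] * 5
pred = ["day1" if hmm_day1.score(t) > hmm_day2.score(t) else "day2"
        for t in test]
print(sum(p == l for p, l in zip(pred, labels)), "of", len(test), "correct")
```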

8.
The recent increase in data accuracy from high resolution accelerometers offers substantial potential for improved understanding and prediction of animal movements. However, current approaches used for analysing these multivariable datasets typically require existing knowledge of the behaviors of the animals to inform the behavioral classification process. These methods are thus not well‐suited for the many cases where limited knowledge of the different behaviors performed exists. Here, we introduce the use of an unsupervised learning algorithm. To illustrate the method's capability we analyse data collected using a combination of GPS and accelerometers on two seabird species: razorbills (Alca torda) and common guillemots (Uria aalge). We applied the unsupervised learning algorithm Expectation Maximization to characterize latent behavioral states both above and below water at both individual and group level. The application of this flexible approach yielded significant new insights into the foraging strategies of the two study species, both above and below the surface of the water. In addition to general behavioral modes such as flying, floating, as well as descending and ascending phases within the water column, this approach allowed an exploration of previously unstudied and important behaviors such as searching and prey chasing/capture events. We propose that this unsupervised learning approach provides an ideal tool for the systematic analysis of such complex multivariable movement data that are increasingly being obtained with accelerometer tags across species. In particular, we recommend its application in cases where we have limited current knowledge of the behaviors performed and existing supervised learning approaches may have limited utility.  相似文献
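A minimal sketch of the unsupervised step using scikit-learn's GaussianMixture, which is fitted by Expectation Maximization; the simulated acceleration/depth features stand in for the real GPS-accelerometer data.

```python
# Cluster simulated movement features into latent behavioral states with EM.
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(3)

# Hypothetical per-second features: [dynamic body acceleration, depth (m)]
flying   = np.column_stack([rng.normal(1.2, 0.2, 300), np.zeros(300)])
floating = np.column_stack([rng.normal(0.1, 0.05, 300), np.zeros(300)])
diving   = np.column_stack([rng.normal(0.6, 0.2, 300), rng.uniform(2, 40, 300)])
X = np.vstack([flying, floating, diving])

gmm = GaussianMixture(n_components=3, covariance_type="full",
                      random_state=0).fit(X)
states = gmm.predict(X)              # latent behavioral state per observation
print("state counts:", np.bincount(states))
```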

9.
Successful learning of a motor skill requires repetitive training. Once the skill is mastered, it can be remembered for a long period of time. The durable memory makes motor skill learning an interesting paradigm for the study of learning and memory mechanisms. To gain better understanding, one scientific approach is to dissect the process into stages and to study these as well as their interactions. This article covers the growing evidence that motor skill learning advances through stages, in which different storage mechanisms predominate. The acquisition phase is characterized by fast (within session) and slow learning (between sessions). For a short period following the initial training sessions, the skill is labile to interference by other skills and by protein synthesis inhibition, indicating that consolidation processes occur during rest periods between training sessions. During training as well as rest periods, activation in different brain regions changes dynamically. Evidence for stages in motor skill learning is provided by experiments using behavioral, electrophysiological, functional imaging, and cellular/molecular methods.  相似文献   

10.
The present study introduces an approach to automatic classification of extracellularly recorded action potentials of neurons. The classification of spike waveform is considered a pattern recognition problem of special segments of signal that correspond to the appearance of spikes. The spikes generated by one neuron should be recognized as members of the same class. The spike waveforms are described by the nonlinear oscillating model as an ordinary differential equation with perturbation, thus characterizing the signal distortions in both amplitude and phase. It is shown that the use of local variables reduces the problem of spike recognition to the separation of a mixture of normal distributions in the transformed feature space. We have developed an unsupervised iteration-learning algorithm that estimates the number of classes and their centers according to the distance between spike trajectories in phase space. This algorithm scans the learning set to evaluate spike trajectories with maximal probability density in their neighborhood. Following the learning, the procedure of minimal distance is used to perform spike recognition. Estimation of trajectories in phase space requires calculation of the first- and second-order derivatives, and integral operators with piecewise polynomial kernels were used. This provided the computational efficiency of the developed approach for real-time application as required by recordings in behaving animals and in human neurosurgical operations. The new method of spike sorting was tested on simulated and real data and performed better than other approaches currently used in neurophysiology.  相似文献   
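An approximate sketch of the idea built from generic pieces (np.gradient for derivatives, BIC-selected Gaussian mixtures for the unknown number of classes) rather than the authors' specific algorithm; the spike waveforms are simulated.

```python
# Sketch: describe spikes by phase-space-like features (waveform plus
# derivatives), estimate the number of units with BIC, then sort spikes.
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(4)
t = np.linspace(0, 1, 32)

def spike(amp, width):
    """Toy spike waveform with additive noise."""
    return amp * np.exp(-((t - 0.3) / width) ** 2) + rng.normal(0, 0.05, t.size)

waves = np.array([spike(1.0, 0.05) for _ in range(150)] +
                 [spike(0.6, 0.10) for _ in range(150)])

# Phase-space-style features: peak amplitude plus extremes of the first
# and second derivatives along each waveform.
d1 = np.gradient(waves, axis=1)
d2 = np.gradient(d1, axis=1)
feats = np.column_stack([waves.max(1), d1.min(1), d1.max(1), d2.min(1)])

# Estimate the number of classes by the BIC, then assign spikes to units.
models = [GaussianMixture(k, random_state=0).fit(feats) for k in range(1, 6)]
best = min(models, key=lambda m: m.bic(feats))
print("estimated classes:", best.n_components)
```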

11.
Process life cycle assessment (PLCA) is widely used to quantify environmental flows associated with the manufacturing of products and other processes. As PLCA always depends on defining a system boundary, its application involves truncation errors. Different methods of estimating truncation errors are proposed in the literature; most of these are based on artificially constructed system complete counterfactuals. In this article, we review the literature on truncation errors and their estimates and systematically explore factors that influence truncation error estimates. We classify estimation approaches, together with underlying factors influencing estimation results according to where in the estimation procedure they occur. By contrasting different PLCA truncation/error modeling frameworks using the same underlying input‐output (I‐O) data set and varying cut‐off criteria, we show that modeling choices can significantly influence estimates for PLCA truncation errors. In addition, we find that differences in I‐O and process inventory databases, such as missing service sector activities, can significantly affect estimates of PLCA truncation errors. Our results expose the challenges related to explicit statements on the magnitude of PLCA truncation errors. They also indicate that increasing the strictness of cut‐off criteria in PLCA has only limited influence on the resulting truncation errors. We conclude that applying an additional I‐O life cycle assessment or a path exchange hybrid life cycle assessment to identify where significant contributions are located in upstream layers could significantly reduce PLCA truncation errors.  相似文献   
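A toy numeric illustration of how a layer-wise cut-off truncates total input-output intensities; the 3-sector technology matrix, emission factors, and layer limit are invented for demonstration.

```python
# Truncation error of a layer-limited (tiered) inventory versus the full
# Leontief solution for a toy 3-sector economy.
import numpy as np

A = np.array([[0.10, 0.05, 0.02],     # made-up technical coefficients
              [0.20, 0.10, 0.05],
              [0.05, 0.15, 0.10]])
f = np.array([0.5, 1.2, 0.3])          # made-up direct emissions per unit output
y = np.array([1.0, 0.0, 0.0])          # final demand for sector 1 only

full = f @ np.linalg.inv(np.eye(3) - A) @ y     # complete system

tiered = np.zeros(3)
contribution = y.copy()
for layer in range(4):                  # truncate after 4 upstream layers
    tiered += contribution
    contribution = A @ contribution
truncated = f @ tiered

print(f"truncation error: {100 * (full - truncated) / full:.2f}%")
```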

12.
Jeremy W. Fox 《Oikos》2010,119(11):1823-1833
The temporal variability of ecological communities may depend on species richness and composition due to a variety of statistical and ecological mechanisms. However, ecologists currently lack a general, unified theoretical framework within which to compare the effects of these mechanisms. Developing such a framework is difficult because community variability depends not just on how species vary, but also how they covary, making it unclear how to isolate the contributions of individual species to community variability. Here I develop such a theoretical framework using the multi‐level Price equation, originally developed in evolutionary biology to partition the effects of group selection and individual selection. I show how the variability of a community can be related to the properties of the individual species comprising it, just as the properties of an evolving group can be related to the properties of the individual organisms comprising it. I show that effects of species loss on community variability can be partitioned into effects of species richness (random loss of species), effects of species composition (non‐random loss of species with respect to their variances and covariances), and effects of context dependence (post‐loss changes in species’ variances and covariances). I illustrate the application of this framework using data from the Biodiversity II experiment, and show that it leads to new conceptual and empirical insights. For instance, effects of species richness on community variability necessarily occur, but often are swamped by other effects, particularly context dependence.  相似文献   

13.
High-dimensional data increase the dimensionality of the feature space and consequently the computational complexity, and they result in lower generalization. Microarray data classification is one such problem. Microarrays contain genetic and biological data that can be used to diagnose diseases, including various types of cancers and tumors. Because their dimensionality is intractable, a dimension reduction step is necessary for these data. The main goal of this paper is to provide a method for dimension reduction and classification of genetic data sets. The proposed approach includes several stages. In the first stage, several feature ranking methods are fused to enhance the robustness and stability of the feature selection process. A wrapper method is combined with the proposed hybrid ranking method to capture the interactions between genes. Afterwards, classification is performed using a support vector machine. Before the data are fed to the SVM classifier, the problem of imbalanced classes in the training phase must be overcome. The experimental results of the proposed approach on five microarray databases show that the robustness metric of the feature selection process lies in the interval [0.70, 0.88], and the classification accuracy lies in the range [91%, 96%].  相似文献
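A condensed sketch of the pipeline's main ingredients (rank fusion, class-weighted SVM) using scikit-learn pieces; it is not the authors' exact hybrid method, and the synthetic data replace the microarray sets.

```python
# Fuse two univariate rankings, keep the top genes, then train a
# class-weighted SVM to handle the imbalanced classes.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import f_classif, mutual_info_classif
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = make_classification(n_samples=120, n_features=2000, n_informative=30,
                           weights=[0.8, 0.2], random_state=0)

# Rank fusion: average the rank positions given by two filter criteria.
rank_f  = np.argsort(np.argsort(-f_classif(X, y)[0]))
rank_mi = np.argsort(np.argsort(-mutual_info_classif(X, y, random_state=0)))
fused = (rank_f + rank_mi) / 2.0
top = np.argsort(fused)[:50]            # keep the 50 best-ranked genes

# For brevity, selection is done outside the CV loop; in practice it
# should be nested inside cross-validation to avoid selection bias.
clf = make_pipeline(StandardScaler(),
                    SVC(kernel="linear", class_weight="balanced"))
print("CV accuracy:", cross_val_score(clf, X[:, top], y, cv=5).mean())
```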

14.
This article presents an approach to estimate missing elements in hybrid life cycle inventories. Its development is motivated by a desire to rationalize inventory compilation while maintaining the quality of the data. The approach builds on a hybrid framework, that is, a combination of process‐ and input–output‐based life cycle assessment (LCA) methodology. The application of Leontief's price model is central in the proposed procedure. Through the application of this approach, an inventory with no cutoff with respect to costs can be obtained. The formal framework is presented and discussed. A numerical example is provided in Supplementary Appendix S1 on the Web.  相似文献   

15.
Machine learning algorithms and applications in environmental microbiology research   Total citations: 1 (self-citations: 0, citations by others: 1)
陈鹤, 陶晔, 毛振镀, 邢鹏 《微生物学报》(Acta Microbiologica Sinica), 2022, 62(12): 4646-4662
Microorganisms are ubiquitous in the environment; they are not only key participants in biogeochemical cycles and environmental evolution but also play important roles in environmental monitoring, ecological management, and conservation. With the development of high-throughput technologies, large volumes of microbial data are being generated. Applying machine learning to model and analyze environmental microbial big data is of great significance for both scientific research and practical applications, in areas such as microbial biomarker identification, pollutant prediction, and environmental quality prediction. Machine learning can be divided into two broad categories: supervised and unsupervised learning. In microbiome research, unsupervised learning efficiently learns the features of input data through clustering, dimensionality reduction, and related methods, and thereby integrates and categorizes microbial data. Supervised learning trains models on microbial data sets with both features and labels, so that labels can be inferred for data that have features but no labels, enabling classification, identification, and prediction for new data. However, complex machine learning algorithms usually emphasize predictive accuracy at the expense of interpretability. A machine learning model can often be regarded as a "black box" that predicts a particular outcome, with little known about how the model arrives at its predictions. To apply machine learning more widely in microbiome research and to improve our ability to extract valuable microbial information, a deeper understanding of machine learning algorithms and better model interpretability are particularly important. This review introduces the machine learning algorithms commonly used in environmental microbiology and the steps for building machine learning models from microbiome data, including feature selection, algorithm selection, model construction, and evaluation. It also reviews the applications of various machine learning models in environmental microbiology, explores the associations between microbiomes and their surrounding environments, discusses approaches to improving model interpretability, and provides a scientific reference for future environmental monitoring and environmental health prediction.  相似文献
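A bare-bones sketch of the model-building steps the review describes (feature selection, training, evaluation, interpretability via feature importance), applied to a synthetic stand-in for a microbiome abundance table.

```python
# Typical supervised workflow for a microbiome feature table:
# select features, fit a random forest, evaluate, inspect importances.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

# Synthetic "samples x taxa" data standing in for relative abundances.
X, y = make_classification(n_samples=200, n_features=500, n_informative=20,
                           random_state=0)

pipe = make_pipeline(SelectKBest(f_classif, k=50),
                     RandomForestClassifier(n_estimators=300, random_state=0))
print("CV accuracy:", cross_val_score(pipe, X, y, cv=5).mean())

# Interpretability: impurity-based importances of the selected features.
pipe.fit(X, y)
importances = pipe.named_steps["randomforestclassifier"].feature_importances_
print("top-5 importance scores:", np.sort(importances)[-5:])
```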

16.
Data quality     
A methodology is presented that enables incorporating expert judgment regarding the variability of input data for environmental life cycle assessment (LCA) modeling. The quality of input data in the life-cycle inventory (LCI) phase is evaluated by LCA practitioners using data quality indicators developed for this application. These indicators are incorporated into the traditional LCA inventory models that produce non-varying point estimate results (i.e., deterministic models) to develop LCA inventory models that produce results in the form of random variables that can be characterized by probability distributions (i.e., stochastic models). The outputs of these probabilistic LCA models are analyzed using classical statistical methods for better decision and policy making information. This methodology is applied to real-world beverage delivery system LCA inventory models. The inventory study results for five beverage delivery system alternatives are compared using statistical methods that account for the variance in the model output values for each alternative. Sensitivity analyses are also performed that indicate model output value variance increases as input data uncertainty increases (i.e., input data quality degrades). Concluding remarks point out the strengths of this approach as an alternative to providing the traditional qualitative assessment of LCA inventory study input data with no efficient means of examining the combined effects on the model results. Data quality assessments can now be captured quantitatively within the LCA inventory model structure. The approach produces inventory study results that are variables reflecting the uncertainty associated with the input data. These results can be analyzed using statistical methods that make efficient quantitative comparisons of inventory study alternatives possible. Recommendations for future research are also provided that include the screening of LCA inventory model inputs for significance and the application of selection and ranking techniques to the model outputs.  相似文献   
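A small Monte Carlo sketch of the stochastic-inventory idea: data-quality judgments are mapped to distribution spreads, and two delivery-system alternatives are then compared statistically. All numbers are fabricated placeholders.

```python
# Compare two alternatives whose inventory entries carry data-quality-driven
# uncertainty, using Monte Carlo propagation and a simple significance test.
import numpy as np
from scipy import stats

rng = np.random.default_rng(5)
N = 20_000

def impact(means, gsds):
    """Sum lognormal inventory entries; the spread (gsd) reflects data quality."""
    draws = [rng.lognormal(np.log(m), np.log(g), N) for m, g in zip(means, gsds)]
    return np.sum(draws, axis=0)

alt_a = impact(means=[1.0, 0.4, 0.2], gsds=[1.1, 1.3, 1.5])   # better data quality
alt_b = impact(means=[0.9, 0.5, 0.3], gsds=[1.4, 1.6, 1.8])   # poorer data quality

t, p = stats.ttest_ind(alt_a, alt_b, equal_var=False)
print(f"mean A={alt_a.mean():.2f}, mean B={alt_b.mean():.2f}, p={p:.3g}")
```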

17.
Process understanding and characterization form the foundation for ensuring a consistent and robust biologics manufacturing process. Using appropriate modeling tools and machine learning approaches, process data can be monitored in real time to avoid manufacturing risks. In this article, we outline an approach to implementing chemometrics and machine learning tools (neural network analysis) to model and predict the behavior of a mixed-mode chromatography step for a biosimilar (Teriparatide) as a case study. The process development data and process knowledge were assimilated into a prior process knowledge assessment using chemometrics tools to derive the parameters critical to performance indicators (i.e., potential quality and process attributes) and to establish the severity ranking for the FMEA analysis. The characterization data of the chromatographic operation are presented along with the determination of the critical, key and non-key process parameters, set points, and operating, process acceptance and characterized ranges. Establishment of the scale-down model was assessed using traditional approaches and novel approaches such as a batch evolution model and neural network analysis. The batch evolution model was further used to demonstrate batch monitoring through direct chromatographic data, thus demonstrating its application for continuous process verification. Assimilation of process knowledge through a structured data acquisition approach, built in from process development to continuous process verification, was demonstrated to result in a data-analytics-driven model that can be coupled with machine learning tools for real-time process monitoring. We recommend applying these approaches together with the FDA guidance on stage-wise process development and validation to reduce manufacturing risks.  相似文献

18.
The classification of microorganisms by high‐dimensional phenotyping methods such as FTIR spectroscopy is often a complicated process due to the complexity of microbial phylogenetic taxonomy. A hierarchical structure developed for such data can often facilitate the classification analysis. The hierarchical tree structure can either be imposed on a given set of phenotypic data by integrating the phylogenetic taxonomic structure or set up by revealing the inherent clusters in the phenotypic data. In this study, we wanted to compare different approaches to hierarchical classification of microorganisms based on high‐dimensional phenotypic data. A set of 19 different species of molds (filamentous fungi) obtained from the mycological strain collection of the Norwegian Veterinary Institute (Oslo, Norway) is used for the study. Hierarchical cluster analysis is performed for setting up the classification trees. Classification algorithms such as artificial neural networks (ANN), partial least‐squares discriminant analysis and random forest (RF) are used and compared. The two methods ANN and RF outperformed all the other approaches even though they did not utilize a predefined hierarchical structure. To our knowledge, the RF approach is used here for the first time to classify microorganisms by FTIR spectroscopy.  相似文献

19.
The development of a biopharmaceutical production process usually occurs sequentially, and tedious optimization of each individual unit operation is very time-consuming. Here, the conditions established as optimal for one step serve as input for the following step. Yet, this strategy does not consider potential interactions between a priori distant process steps and therefore cannot guarantee optimal overall process performance. To overcome these limitations, we established a smart approach to develop and utilize integrated process models using machine learning techniques and genetic algorithms. We evaluated the application of the data-driven models to explore potential efficiency increases and compared them to a conventional development approach for one of our development products. First, we developed a data-driven integrated process model using gradient boosting machines and Gaussian processes as machine learning techniques and a genetic algorithm as the recommendation engine for two downstream unit operations, namely solubilization and refolding. Through projection of the results into our large-scale facility, we predicted a twofold increase in productivity. Second, we extended the model to a three-step model by including the capture chromatography. Here, depending on the baseline process chosen for comparison, we obtained a 50% to 100% increase in productivity. These data show the successful application of machine learning techniques and optimization algorithms for downstream process development. Finally, our results highlight the importance of considering integrated process models for the whole process chain, including all unit operations.  相似文献
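A stripped-down illustration of the data-driven loop: fit a gradient-boosting surrogate on made-up process data for two coupled unit operations, then let a tiny genetic algorithm search the joint parameter space. The objective, parameter bounds, and GA settings are all assumptions, not the authors' model.

```python
# Surrogate model + genetic algorithm over two coupled unit operations.
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.default_rng(6)

def true_yield(x):
    """Hidden ground-truth response used only to simulate historical runs."""
    return 50 - 20 * (x[:, 0] - 8.2) ** 2 - 0.3 * (x[:, 1] - 12.0) ** 2

# Fake historical runs: x = [solubilization pH, refold dilution factor]
X = rng.uniform([7.0, 5.0], [9.0, 20.0], size=(200, 2))
y = true_yield(X) + rng.normal(0, 1.0, 200)

surrogate = GradientBoostingRegressor(random_state=0).fit(X, y)

# Minimal genetic algorithm: selection, blend crossover, Gaussian mutation.
pop = rng.uniform([7.0, 5.0], [9.0, 20.0], size=(40, 2))
for _ in range(30):
    fitness = surrogate.predict(pop)
    parents = pop[np.argsort(fitness)[-20:]]             # keep the best half
    mates = parents[rng.integers(0, 20, size=(20, 2))]
    children = mates.mean(axis=1) + rng.normal(0, 0.1, size=(20, 2))
    children = np.clip(children, [7.0, 5.0], [9.0, 20.0])
    pop = np.vstack([parents, children])

best = pop[np.argmax(surrogate.predict(pop))]
print("recommended settings:", np.round(best, 2))
```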

20.
MOTIVATION: Small non-coding RNA (ncRNA) genes play important regulatory roles in a variety of cellular processes. However, detection of ncRNA genes is a great challenge to both experimental and computational approaches. In this study, we describe a new approach called positive sample only learning (PSoL) to predict ncRNA genes in the Escherichia coli genome. Although PSoL is a machine learning method for classification, it requires no negative training data, which, in general, is hard to define properly and affects the performance of machine learning dramatically. In addition, using the support vector machine (SVM) as the core learning algorithm, PSoL can integrate many different kinds of information to improve the accuracy of prediction. Besides the application of PSoL for predicting ncRNAs, PSoL is applicable to many other bioinformatics problems as well. RESULTS: The PSoL method is assessed by 5-fold cross-validation experiments which show that PSoL can achieve about 80% accuracy in recovery of known ncRNAs. We compared PSoL predictions with five previously published results. The PSoL method has the highest percentage of predictions overlapping with those from other methods.  相似文献   
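A simplified two-step sketch of the positive-only idea (train on positives versus unlabeled, then iteratively refine a reliable-negative set) using scikit-learn's SVC; it approximates the spirit of PSoL on synthetic data rather than reproducing its exact negative-set expansion.

```python
# Positive-only learning sketch: positives + unlabeled -> iterative negatives.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.svm import SVC

X, y_true = make_classification(n_samples=1000, n_features=20,
                                weights=[0.9, 0.1], random_state=0)
pos = np.where(y_true == 1)[0][:50]           # the only labels we "know"
unlabeled = np.setdiff1d(np.arange(len(X)), pos)

# Step 1: treat all unlabeled points as tentative negatives.
neg = unlabeled.copy()
for _ in range(5):
    Xtr = np.vstack([X[pos], X[neg]])
    ytr = np.r_[np.ones(len(pos)), np.zeros(len(neg))]
    clf = SVC(probability=True, random_state=0).fit(Xtr, ytr)
    # Step 2: keep as negatives only the unlabeled points the SVM is most
    # confident are not positive, then retrain on the refined set.
    scores = clf.predict_proba(X[unlabeled])[:, 1]
    neg = unlabeled[scores < np.quantile(scores, 0.5)]

pred = clf.predict(X[unlabeled])
print("predicted positives among unlabeled:", int(pred.sum()))
```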
