Similar Articles
20 similar articles found (search time: 15 ms)
1.
Using a fermentation database for Escherichia coli producing green fluorescent protein (GFP), we have implemented a novel three-step optimization method to identify the process input variables most important in modeling the fermentation, as well as the values of those critical input variables that result in an increase in the desired output. In the first step of this algorithm, we use either decision-tree analysis (DTA) or information theoretic subset selection (ITSS) as a database mining technique to identify which process input variables best classify each of the process outputs (maximum cell concentration, maximum product concentration, and productivity) monitored in the experimental fermentations. The second step is to train an artificial neural network (ANN) model of the process input-output data, using the critical inputs identified in the first step. Finally, a hybrid genetic algorithm (hybrid GA), which combines gradient and stochastic search methods, is used to identify the maximum output modeled by the ANN and the input values that produce that maximum. The results of the database mining techniques are compared, both in terms of the inputs selected and the subsequent ANN performance. For the E. coli process used in this study, we identified 6 of the original 13 inputs that yielded an ANN that best modeled the GFP fluorescence outputs of an independent test set. Applying a hybrid GA to that ANN identified the values of the six inputs giving the modeled maximum fluorescence. When these conditions were tested in laboratory fermentors, an actual maximum fluorescence of 2.16E6 AU was obtained, compared with a previous observed high of 1.51E6 AU. Thus, the input conditions suggested by applying the proposed optimization scheme to the available historical database increased the maximum fluorescence by 55%.
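A minimal Python sketch of the three-step routine on synthetic stand-in data. The authors' DTA/ITSS selectors and hybrid GA are not reproduced here, so decision-tree feature importances and SciPy's differential evolution serve as stand-ins; all variable names and numbers are illustrative assumptions.

```python
# Step 1: mine the database for the most informative inputs (tree importances
# stand in for DTA/ITSS). Step 2: train an ANN on those inputs. Step 3: search
# the input space for the modeled maximum (differential evolution stands in
# for the hybrid GA).
import numpy as np
from sklearn.tree import DecisionTreeRegressor
from sklearn.neural_network import MLPRegressor
from scipy.optimize import differential_evolution

rng = np.random.default_rng(0)
X = rng.uniform(0.0, 1.0, size=(60, 13))      # 60 hypothetical historical runs
y = 2.0 * X[:, 0] + X[:, 3] - X[:, 7] ** 2 + rng.normal(0, 0.05, 60)  # output

tree = DecisionTreeRegressor(random_state=0).fit(X, y)
keep = np.argsort(tree.feature_importances_)[-6:]          # 6 critical inputs

ann = MLPRegressor(hidden_layer_sizes=(10,), max_iter=5000,
                   random_state=0).fit(X[:, keep], y)

result = differential_evolution(lambda v: -ann.predict(v.reshape(1, -1))[0],
                                bounds=[(0.0, 1.0)] * len(keep), seed=0)
print("modeled maximum:", -result.fun, "at inputs:", result.x.round(2))
```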

2.
The concept of "design space" has been proposed in the ICH Q8 guideline and is gaining momentum in its application in the biotech industry. It has been defined as "the multidimensional combination and interaction of input variables (e.g., material attributes) and process parameters that have been demonstrated to provide assurance of quality." This paper presents a stepwise approach for defining the process design space for a biologic product, illustrated by a case study involving P. pastoris fermentation. First, risk analysis via Failure Modes and Effects Analysis (FMEA) is performed to identify parameters for process characterization. Second, small-scale models are created and qualified prior to their use in the experimental studies. Third, studies are designed using Design of Experiments (DOE) so that the data can be used to define the process design space. Fourth, the studies are executed and the results analyzed to decide on the criticality of the parameters and to establish the process design space. For the application under consideration, the fermentation unit operation is shown to be very robust, with a wide design space and no critical operating parameters. The approach presented here is not specific to the illustrated case study; it can be extended to other biotech unit operations and processes that can be scaled down and characterized at small scale.
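Two of the four steps lend themselves to a short sketch: risk ranking via FMEA and design generation via DOE. The parameter names, scores, and ranges below are hypothetical; this is a minimal sketch, not the study's actual characterization plan.

```python
# FMEA: rank parameters by risk priority number (RPN = severity x occurrence
# x detection), then build a two-level full factorial design for the shortlist.
from itertools import product

fmea = {  # hypothetical 1-10 scores per parameter
    "temperature":    {"severity": 8, "occurrence": 4, "detection": 3},
    "pH":             {"severity": 7, "occurrence": 3, "detection": 2},
    "methanol_feed":  {"severity": 9, "occurrence": 5, "detection": 4},
    "induction_time": {"severity": 4, "occurrence": 2, "detection": 2},
}
rpn = {p: s["severity"] * s["occurrence"] * s["detection"]
       for p, s in fmea.items()}
shortlist = [p for p, score in sorted(rpn.items(), key=lambda kv: -kv[1])
             if score >= 40]                # characterize high-risk parameters
print("parameters for characterization:", shortlist)

ranges = {"temperature": (28, 32), "pH": (5.0, 6.0), "methanol_feed": (1.0, 3.0)}
design = [dict(zip(ranges, combo)) for combo in product(*ranges.values())]
for run in design:                          # 2^3 = 8 characterization runs
    print(run)
```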

3.
A clinical decision support system (CDSS) applies decision-support theories and techniques to assist clinicians with diagnostic and treatment decisions during patient care. Based on the characteristics of clinical neurosurgical diseases, this paper proposes a framework architecture built on KD, DB, and MD components, and constructs a clinical disease knowledge base, a patient information database, and a decision model library, providing a theoretical basis and methodological guidance for the subsequent implementation of NCDSS functionality.

4.
We have previously shown the usefulness of historical data for fermentation process optimization. The methodology developed includes identification of important process inputs, training of an artificial neural network (ANN) process model, and ultimately use of the ANN model with a genetic algorithm to find the optimal value of each critical process input. However, this approach ignores the time-dependent nature of the system and therefore does not fully utilize the information within a database. In this work, we propose a method for incorporating time-dependent optimization into our previously developed three-step optimization routine. This is achieved by an additional step that uses a fermentation model (consisting of coupled ordinary differential equations (ODEs)) to interpret important time-course features of the collected data through adjustments in model parameters. Important process variables not explicitly included in the model were then identified for each model parameter using automatic relevance determination (ARD) with Gaussian process (GP) models. The developed GP models were then combined with the fermentation model to form a hybrid neural network model that predicted the time-course cell and protein concentrations of novel fermentation conditions. A hybrid genetic algorithm was then used in conjunction with the hybrid model to suggest optimal time-dependent control strategies. The presented method was implemented on an E. coli fermentation database generated in our laboratory, and optimization of two different criteria (final protein yield and a simplified economic criterion) was attempted. While the overall protein yield was not increased using this methodology, we were successful in increasing the simplified economic criterion by 15% compared to what had been previously observed. The suggested process conditions used 35% less arabinose (the inducer) and 33% less tryptone in the media, and reduced the time required to reach the maximum protein concentration by 10%, while producing approximately the same level of protein as the previous optimum.
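The mechanistic layer can be illustrated with a minimal coupled-ODE model; this is a sketch assuming simple Monod growth with growth-associated protein formation, not the paper's actual model. In the proposed scheme, parameters such as `mu_max` would be refit per historical run and then linked to process inputs by the ARD/GP step.

```python
# Integrate a toy fermentation model: biomass X, substrate S, protein P.
import numpy as np
from scipy.integrate import solve_ivp

def fermentation(t, state, mu_max=0.4, Ks=0.5, Yxs=0.5, k_p=0.05):
    X, S, P = state                      # concentrations in g/L
    mu = mu_max * S / (Ks + S)           # Monod specific growth rate (1/h)
    dX = mu * X                          # biomass growth
    dS = -dX / Yxs                       # substrate consumed for biomass
    dP = k_p * X                         # growth-associated protein formation
    return [dX, dS, dP]

sol = solve_ivp(fermentation, (0.0, 24.0), [0.1, 10.0, 0.0],
                t_eval=np.linspace(0.0, 24.0, 49))
print("final biomass %.2f g/L, final protein %.2f g/L"
      % (sol.y[0, -1], sol.y[2, -1]))
```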

5.
This paper proposes the use of a multilayer perceptron (MLP) for brain dysfunction diagnosis. The performance of the MLP was better than that of discriminant analysis and decision tree classifiers, with an 85% accuracy rate in an experimental test involving 332 subjects. In addition, the neural network employing Bayesian learning was able to identify the most important input variable. These two results demonstrate that the neural network can be effectively used in the diagnosis of children with brain dysfunction.
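A minimal sketch of the classifier setup on synthetic stand-in data (the 332-subject dataset is not public, and scikit-learn's standard MLP is used here rather than the paper's Bayesian variant).

```python
import numpy as np
from sklearn.neural_network import MLPClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

rng = np.random.default_rng(1)
X = rng.normal(size=(332, 8))            # 8 hypothetical clinical features
y = (X[:, 0] + 0.5 * X[:, 2] + rng.normal(0, 0.5, 332) > 0).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=1)
clf = MLPClassifier(hidden_layer_sizes=(16,), max_iter=2000,
                    random_state=1).fit(X_tr, y_tr)
print("test accuracy: %.2f" % accuracy_score(y_te, clf.predict(X_te)))
```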

6.
This work presents a novel multivariate statistical algorithm, Decision Tree-PLS (DT-PLS), to improve the prediction and understanding of dynamic processes using local partial least squares regression (PLSR) models for characteristic process groups defined by decision tree (DT) analysis. The DT-PLS algorithm is successfully applied to two different cell culture data sets, one obtained from bioreactors at 3.5 L lab scale and the other from the 15 ml ambr microbioreactor system. Localization substantially improves the predictive capability of the model compared with the classical PLSR approach implemented in commercially available packages. Additionally, differences in the parameters of the local models suggest that the governing process variables differ between process regimes, reflecting the different states of the cell under different process conditions.
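The core idea admits a compact sketch: let a decision tree define the process groups, then fit one local PLSR model per leaf. This is an illustrative reconstruction on synthetic data, not the authors' exact algorithm.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor
from sklearn.cross_decomposition import PLSRegression

rng = np.random.default_rng(2)
X = rng.normal(size=(300, 10))                       # process measurements
w1, w2 = rng.normal(size=10), rng.normal(size=10)    # two hidden regimes
y = np.where(X[:, 0] > 0, X @ w1, X @ w2)

tree = DecisionTreeRegressor(max_leaf_nodes=4, min_samples_leaf=30,
                             random_state=0).fit(X, y)
leaves = tree.apply(X)                               # leaf id = process group
local = {leaf: PLSRegression(n_components=3).fit(X[leaves == leaf],
                                                 y[leaves == leaf])
         for leaf in np.unique(leaves)}

def predict(x_new):
    leaf = tree.apply(x_new.reshape(1, -1))[0]       # route to local model
    return local[leaf].predict(x_new.reshape(1, -1)).ravel()[0]

print(predict(X[0]))     # comparing local coefficients reveals regime shifts
```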

7.
Database analysis provides the feedback from experience needed to support decision-making processes in risk management. A Colombian toxicological database (TED) developed and maintained by the Center of Safety Information and Chemical Products (CISPROQUIM, by its abbreviation in Spanish) is analyzed here using a demographic clustering technique. Data-quality processing was performed on the raw data (more than 170 variables), reducing the database to 20 meaningful variables. The categorical variables were selected for clustering analysis: gender, age, type of emergency, emergency location, means of poisoning, product use, and physical state of the toxic substance. Clustering analysis showed that three profiles are prevalent in the TED database: young adult suicidal woman, unsupervised child, and man at work. These profiles could not be identified using traditional statistical analyses of the data collected by CISPROQUIM, nor defined a priori from the categorical variables. Identifying vulnerable populations and the causes of toxicological events is critical for developing national prevention programs and policies. The analysis described provides a methodology for the critical analysis of toxicological databases that can be applied to other databases, such as security databases.
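The paper's demographic clustering technique is specific to the mining tool used, so the sketch below substitutes a rough but widely available equivalent: one-hot encoding of the categorical variables followed by k-means. Records and category values are hypothetical.

```python
import pandas as pd
from sklearn.cluster import KMeans

records = pd.DataFrame({   # hypothetical rows in the style of the TED variables
    "gender":   ["F", "M", "F", "M", "F", "M"],
    "age_band": ["young_adult", "child", "young_adult", "adult",
                 "child", "adult"],
    "means":    ["medication", "household", "medication", "industrial",
                 "household", "industrial"],
})
X = pd.get_dummies(records)                          # one-hot encode categories
records["profile"] = KMeans(n_clusters=3, n_init=10,
                            random_state=0).fit_predict(X)
print(records.groupby("profile").agg(lambda s: s.mode()[0]))  # profile summary
```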

8.
We consider studies of cohorts of individuals after a critical event, such as an injury, with the following characteristics. First, the studies are designed to measure "input" variables, which describe the period before the critical event, and to characterize the distribution of the input variables in the cohort. Second, the studies are designed to measure "output" variables, primarily mortality after the critical event, and to characterize the predictive (conditional) distribution of mortality given the input variables in the cohort. Such studies often possess the complication that the input data are missing for those who die shortly after the critical event, because data collection takes place after the event. Standard methods of dealing with the missing inputs, such as imputation or weighting methods based on an assumption of ignorable missingness, are known to be generally invalid when the missingness of inputs is nonignorable, that is, when the distribution of the inputs differs between those who die and those who live. To address this issue, we propose a novel design that obtains and uses information on an additional key variable: a treatment or externally controlled variable that, if set at its "effective" level, could have prevented the death of those who died. We show that the new design can be used to draw valid inferences for the marginal distribution of inputs in the entire cohort, and for the conditional distribution of mortality given the inputs, also in the entire cohort, even under nonignorable missingness. The crucial framework we use is principal stratification based on the potential outcomes, here mortality under both levels of treatment. Using illustrative preliminary injury data, we also show that our approach can reveal results that are more reasonable than those of standard methods, in relatively dramatic ways. Our approach thus suggests that the routine collection of data on variables that could serve as possible treatments in such studies of inputs and mortality should become common.

9.
In this work we propose a model that simultaneously optimizes the process variables and the structure of a multiproduct batch plant for the production of recombinant proteins. The complete model includes process performance models for the unit stages and a posynomial representation for the multiproduct batch plant. Although the constant time and size factor models are the most commonly used to model multiproduct batch processes, process performance models describe these time and size factors as functions of the process variables selected for optimization. These process performance models are expressed as algebraic equations obtained from the analytical integration of simplified mass balances and kinetic expressions that describe each unit operation. They are kept as simple as possible while retaining the influence of the process variables selected to optimize the plant. The resulting mixed-integer nonlinear program simultaneously calculates the plant structure (parallel units in or out of phase, and allocation of intermediate storage tanks), the batch plant decision variables (equipment sizes, batch sizes, and operating times of semicontinuous items), and the process decision variables (e.g., final concentration at selected stages, volumetric ratio of phases in the liquid-liquid extraction). A noteworthy feature of the proposed approach is that the mathematical model for the plant is the same as that used in the constant factor model. The process performance models are handled as extra constraints. A plant consisting of eight stages operating in the single product campaign mode (one fermentation, two microfiltrations, two ultrafiltrations, one homogenization, one liquid-liquid extraction, and one chromatography) for producing four different recombinant proteins by the genetically engineered yeast Saccharomyces cerevisiae was modeled and optimized. Using this example, it is shown that the presence of additional degrees of freedom introduced by the process performance models, with respect to a fixed size and time factor model, represents an important development in improving plant design.

10.
Multicriteria-Spatial Decision Support Systems (MC-SDSS) are increasingly popular tools in decision-making and policy making, thanks to their significant new capabilities in the use of spatial or geospatial information. Many spatial problems are complex and require integrated analysis and models. The present paper illustrates the development of an MC-SDSS approach for studying the ecological connectivity of the Piedmont Region in Italy. The MC-SDSS model considers ecological and environmental spatial indicators, which are combined by integrating the Multicriteria Decision Aiding (MCDA) technique named the Analytic Network Process (ANP) with the Ordered Weighted Average (OWA) approach. The ANP is used to elicit attribute weights, while the OWA operator function generates a wide range of decision alternatives to address the uncertainty associated with interaction between multiple criteria. The usefulness of the approach is illustrated by different OWA scenarios that report the ecological connectivity index on a scale between 0 and 1. The OWA scenarios quantify the level of risk taking (i.e., optimistic, pessimistic, and neutral) and facilitate a better understanding of the patterns that emerge from the decision alternatives involved in the decision-making process. The purpose of the research is to generate a final map representing the ecological connectivity index of each area in the region under analysis, to be used as a decision variable in spatial planning. In particular, by using the resulting index map as a means of analysis, it is possible to identify, for the sake of nature conservation, critical areas needing mitigation measures. In addition, areas with high ecological connectivity values can be identified, and monitoring procedures can therefore be planned. The study concludes by highlighting that the applied methodology is an effective tool for providing decision support in spatial planning and sustainability assessments.
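The OWA step is small enough to show directly: criterion scores are sorted and combined with order weights, so the weight vector alone moves the aggregate between optimistic (max-like) and pessimistic (min-like) scenarios. A minimal sketch with assumed scores; the ANP-derived attribute weighting is omitted.

```python
def owa(scores, order_weights):
    ranked = sorted(scores, reverse=True)        # best criterion value first
    return sum(w * s for w, s in zip(order_weights, ranked))

scores = [0.9, 0.4, 0.6]                         # normalized criterion values
print("optimistic :", owa(scores, [1.0, 0.0, 0.0]))    # acts like max
print("neutral    :", owa(scores, [1/3, 1/3, 1/3]))    # plain average
print("pessimistic:", owa(scores, [0.0, 0.0, 1.0]))    # acts like min
```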

11.
Rifamycin B is an important polyketide antibiotic used in the treatment of tuberculosis and leprosy. We present results on medium optimization for Rifamycin B production by a barbital-insensitive mutant strain of Amycolatopsis mediterranei S699. Machine-learning approaches, namely a genetic algorithm (GA), neighborhood analysis (NA), and the decision tree technique (DT), were explored for optimizing the medium composition. The genetic algorithm was applied as a global search algorithm, while NA was used for a guided local search and to develop medium predictors. The fermentation medium for Rifamycin B consisted of nine components, and varying the concentration of each component yields a large combinatorial search space of distinct medium compositions. Optimization was achieved within five generations via GA as well as NA. These five generations comprised 178 shake-flask experiments, a small fraction of the search space. We detected multiple optima in the form of 11 distinct medium combinations, which provided over 600% improvement in Rifamycin B productivity. The genetic algorithm performed better than NA in optimizing the fermentation medium. The decision tree technique qualitatively revealed media-media interactions in the form of rule sets for medium compositions that give high as well as low productivity.
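A minimal sketch of the GA loop over nine-component media, with a synthetic productivity function standing in for the shake-flask assays that scored each generation in the study; the population size, rates, and fitness surface are assumptions.

```python
import random
random.seed(0)

N_COMPONENTS, POP, GENS = 9, 30, 5

def productivity(medium):              # hypothetical stand-in for assay results
    return -sum((c - 0.6) ** 2 for c in medium)

pop = [[random.uniform(0, 1) for _ in range(N_COMPONENTS)] for _ in range(POP)]
for gen in range(GENS):
    pop.sort(key=productivity, reverse=True)
    parents = pop[:10]                 # keep the best-performing media
    children = []
    while len(children) < POP - len(parents):
        a, b = random.sample(parents, 2)
        cut = random.randrange(1, N_COMPONENTS)
        child = a[:cut] + b[cut:]      # one-point crossover
        i = random.randrange(N_COMPONENTS)
        child[i] = min(1.0, max(0.0, child[i] + random.gauss(0, 0.1)))  # mutate
        children.append(child)
    pop = parents + children
print("best medium:", [round(c, 2) for c in pop[0]])
```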

12.

Background

Signatures are short sequences that are unique and dissimilar to every other sequence in a database, so they can serve as the basis for identifying different species. Although several signature discovery algorithms have been proposed in the past, they require the entire database to be loaded into memory, which restricts the amount of data they can process and makes large databases intractable. Those algorithms are also sequential, so their discovery speed leaves considerable room for improvement.

Results

In this research, we introduce a divide-and-conquer strategy for signature discovery and propose a parallel signature discovery algorithm that runs on a computer cluster. The divide-and-conquer strategy overcomes the inability of existing algorithms to process large databases, and the parallel computing mechanism effectively improves the efficiency of signature discovery. Even when run with only the memory of a regular personal computer, the algorithm can still process large databases, such as the human whole-genome EST database, that existing algorithms could not handle.

Conclusions

The algorithm proposed in this research is not limited by the amount of usable memory and can rapidly find signatures in large databases, making it useful in applications such as next-generation sequencing and other large-database analysis and processing. The implementation of the proposed algorithm is available at http://www.cs.pu.edu.tw/~fang/DDCSDPrograms/DDCSD.htm.
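The decomposition can be illustrated on a single machine: split the database into chunks, count candidate k-mers per chunk in parallel, and merge, keeping k-mers that occur exactly once as signature candidates. This is a loose sketch of the divide-and-conquer idea only; the published algorithm's cluster communication and uniqueness criteria are more involved.

```python
from collections import Counter
from multiprocessing import Pool

K = 8                                              # assumed signature length

def count_kmers(sequences):
    counts = Counter()
    for seq in sequences:
        for i in range(len(seq) - K + 1):
            counts[seq[i:i + K]] += 1
    return counts

if __name__ == "__main__":
    database = ["ACGTACGTGG", "TTGCACGTAA",        # toy stand-in sequences
                "GGCCTAGCTA", "ACGTTTGCAC"]
    chunks = [database[i::4] for i in range(4)]    # divide across 4 workers
    with Pool(4) as pool:
        partials = pool.map(count_kmers, chunks)   # conquer in parallel
    total = sum(partials, Counter())               # merge partial counts
    signatures = [kmer for kmer, n in total.items() if n == 1]
    print(len(signatures), "candidate signatures")
```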

13.
14.
Our lack of knowledge about the biological mechanisms of 50 Hz magnetic fields makes it hard to improve exposure assessment. To provide better information about these exposure measures, we use multidimensional analysis techniques to examine the relations between different exposure metrics for a group of subjects. We used a two-stage principal component analysis (PCA) followed by an ascending hierarchical classification (AHC) to identify a set of measures that capture the characteristics of the total exposure. This analysis indicates which aspects of exposure must be captured to obtain a complete picture of the magnetic field environment. We calculated 44 exposure metrics for 16 exposed EDF employees and 15 control subjects, based on approximately 20,000 recordings of magnetic field measurements per subject, taken every 30 s for 7 days with an EMDEX II dosimeter. These metrics included parameters used routinely or occasionally, as well as some new ones. An initial PCA was used to eliminate the metrics that expressed the least variability and were most highly correlated with one another. A second PCA of the remaining 12 metrics accounted for 82.7% of the variance in its first two components: the first component (62.0%) was characterized by central tendency metrics, and the second (20.7%) by dispersion characteristics. AHC then divided the entire sample of individuals into four groups along the axes that emerged from the PCA. Finally, discriminant analysis tested the discriminant power of the variables in the exposed/control classification as well as in the AHC classification; the first showed that two subjects had been incorrectly classified, while no classification error was observed in the second. This exploratory study underscores the need to improve exposure measures by using at least two dimensions, intensity and dispersion, and indicates the usefulness of constructing a typology of magnetic field exposures.
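A minimal sketch of the PCA-then-AHC sequence on simulated dosimeter series: compute central-tendency and dispersion metrics per subject, project them onto two principal components, and cut a hierarchical clustering into four groups. The metric choices and distributions below are assumptions.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import AgglomerativeClustering

rng = np.random.default_rng(3)
recordings = rng.lognormal(-1.0, 0.8, size=(31, 20000))   # per-subject series

metrics = np.column_stack([                    # per-subject exposure metrics
    recordings.mean(axis=1), np.median(recordings, axis=1),
    recordings.std(axis=1), np.percentile(recordings, 95, axis=1),
])
scores = PCA(n_components=2).fit_transform(
    StandardScaler().fit_transform(metrics))   # intensity + dispersion axes
groups = AgglomerativeClustering(n_clusters=4).fit_predict(scores)
print(np.bincount(groups))                     # subjects per exposure group
```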

15.
Tandem mass spectrometry (MS/MS) combined with database searching is currently the most widely used method for high-throughput peptide and protein identification. Many different algorithms, scoring criteria, and statistical models have been used to identify peptides and proteins in complex biological samples, and many studies, including our own, describe the accuracy of these identifications using, at best, generic terms such as "high confidence." False positive identification rates for these criteria can vary substantially with the organism under study, growth conditions, sequence databases, experimental protocols, and instrumentation; therefore, study-specific methods are needed to estimate the accuracy (false positive rates) of these peptide and protein identifications. We present and evaluate methods for estimating false positive identification rates based on searches of randomized databases (reversed and reshuffled). We examine separate searches of a forward and then a randomized database, and combined searches of a randomized database appended to a forward sequence database. Estimated error rates from randomized database searches are first compared against actual error rates from MS/MS runs of known protein standards, and the methods are then applied to biological samples of the model microorganism Shewanella oneidensis strain MR-1. Based on the results obtained in this study, we recommend combined searches of a reshuffled database appended to a forward sequence database as a means of providing quantitative estimates of the false positive identification rates of peptides and proteins. This will allow researchers to set criteria and thresholds to achieve a desired error rate and will provide the scientific community with direct and quantifiable measures of peptide and protein identification accuracy, as opposed to vague assessments such as "high confidence."
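The target-decoy logic reduces to a few lines: append randomized (here, reversed) sequences to the forward database, search everything together, and estimate the false positive rate above a score threshold from the ratio of decoy to forward hits. The sequences, hits, and scores below are hypothetical stand-ins for real search-engine output.

```python
forward = {"P1": "MKTAYIAKQR", "P2": "GLSDGEWQLV", "P3": "MVLSPADKTN"}
decoys = {"rev_" + name: seq[::-1] for name, seq in forward.items()}
combined = {**forward, **decoys}               # combined target-decoy database

# hypothetical (peptide, matched protein, score) search results
hits = [("MKTAYIA", "P1", 42.0), ("GLSDGEW", "P2", 35.5),
        ("NTKDAPS", "rev_P3", 12.1), ("VLQWEGD", "rev_P2", 9.8)]

threshold = 10.0
accepted = [h for h in hits if h[2] >= threshold]
n_decoy = sum(1 for _, prot, _ in accepted if prot.startswith("rev_"))
n_forward = len(accepted) - n_decoy
print("estimated false positive rate: %.0f%%" % (100.0 * n_decoy / n_forward))
```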

16.
Fermentation database mining by pattern recognition
A large volume of data is routinely collected during the course of typical fermentation and other processes. Such data provide the required basis for process documentation and are occasionally also used for process analysis and improvement. The information density of these data is often low, so automatic condensing, analysis, and interpretation ("database mining") are highly desirable. In this article we present a methodology whereby process variables are processed to create a database of derivative process quantities representative of the global patterns, intermediate trends, and local characteristics of the process. A powerful search algorithm subsequently attempts to extract the specific process variables, and their particular attributes, that uniquely characterize a class of process outcomes such as high- or low-yield fermentations. The basic components of our pattern recognition methodology are described, along with applications to the analysis of two sets of data from industrial fermentations. Results indicate that truly discriminating variables do exist in typical fermentation data and that they can be useful in identifying the causes or symptoms of different process outcomes. The methodology has been implemented in user-friendly software, named db-miner, which facilitates its application for efficient and speedy analysis of fermentation process data. (c) 1997 John Wiley & Sons, Inc. Biotechnol Bioeng 53: 443-452, 1997.

17.
Objectives: The main objective of the present work is to evaluate the feasibility of harmonising the available information from different independent databases in order to build an integrated database for the study of frailty.
Material and methods: This work is based on the European project Integral Approach to the Transition between Frailty and Dependence in older adults: Patterns of occurrence, identification tools and model of care (INTAFRADE), developed by four groups, three in Spain and one in France. Each partner provided its databases related to the study of frailty. As a preliminary step to the creation of an integrated database, the characteristics and variables included in each study were mapped, specifying whether their harmonisation was possible.
Results: A total of 30 different variables corresponding to 8 dimensions were identified: sociodemographic and social characteristics, health status, lifestyle habits, anthropometric measures, other physical measurements, use of health services, and adverse health outcomes. Of these, 28 (93%) variables were harmonisable, although only 20% were present in all databases, and 47% in three of them. Each of the frailty instruments lacked at least 50% of its items. The harmonisation process will allow us to jointly analyse information available on 2,361 people.
Conclusions: The European INTAFRADE study will allow a deeper understanding of the frailty process in older people by harmonising information from heterogeneous databases.
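The harmonisation step amounts to mapping each cohort's local variable names and codings onto a shared dictionary before pooling; below is a minimal sketch with hypothetical column names and codings, not the INTAFRADE variable dictionary.

```python
import pandas as pd

cohort_a = pd.DataFrame({"sexo": ["H", "M"], "edad": [78, 83]})
cohort_b = pd.DataFrame({"sex": ["male", "female"], "age_years": [71, 90]})

def harmonise_a(df):   # map local codings to the shared dictionary
    return pd.DataFrame({"sex": df["sexo"].map({"H": "male", "M": "female"}),
                         "age": df["edad"]})

def harmonise_b(df):
    return pd.DataFrame({"sex": df["sex"], "age": df["age_years"]})

pooled = pd.concat([harmonise_a(cohort_a), harmonise_b(cohort_b)],
                   ignore_index=True)          # integrated analysis dataset
print(pooled)
```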

18.
Gene families compose a large proportion of eukaryotic genomes. The rapidly expanding genomic sequence database provides a good opportunity to study gene family evolution and function. However, most gene family identification programs are restricted to searching protein databases where data are often lagging behind the genomic sequence data. Here, we report a user-friendly web-based pipeline, named TARGeT (Tree Analysis of Related Genes and Transposons), which uses either a DNA or amino acid ‘seed’ query to: (i) automatically identify and retrieve gene family homologs from a genomic database, (ii) characterize gene structure and (iii) perform phylogenetic analysis. Due to its high speed, TARGeT is also able to characterize very large gene families, including transposable elements (TEs). We evaluated TARGeT using well-annotated datasets, including the ascorbate peroxidase gene family of rice, maize and sorghum and several TE families in rice. In all cases, TARGeT rapidly recapitulated the known homologs and predicted new ones. We also demonstrated that TARGeT outperforms similar pipelines and has functionality that is not offered elsewhere.

19.
Reduction in energy sector greenhouse gas (GHG) emissions is a key aim of European Commission plans to expand cultivation of bioenergy crops. Since agriculture makes up 10–12% of anthropogenic GHG emissions, the impacts of land-use change must be considered, which requires detailed understanding of the specific changes to agroecosystems. The GHG balance of perennials may differ significantly from that of the previous ecosystem, and the net change in GHG emissions with land-use change for bioenergy may exceed the avoided fossil fuel emissions, meaning that actual GHG mitigation benefits are variable. Carbon (C) and nitrogen (N) cycling are complex interlinked systems, and a change in land management may affect each differently at different sites, depending on other variables. Change in evapotranspiration with land-use change may also have significant environmental or water resource impacts at some locations. This article derives a multi-criteria decision analysis approach to objectively identify the most appropriate method for assessing the environmental impacts of land-use change for perennial energy crops. Based on a literature review and a conceptual model in support of this approach, the potential impacts of land-use change for perennial energy crops on GHG emissions and evapotranspiration were identified, as well as their likely controlling variables. These findings were used to structure the decision problem and to outline model requirements. A process-based model representing the complete agroecosystem was identified as the best predictive tool where adequate data are available. Nineteen models were assessed against suitability criteria derived from the conceptual model, including explicit representation of the relevant processes at appropriate resolution. FASSET, ECOSSE, ANIMO, DNDC, DayCent, Expert-N, Ecosys, WNMM and CERES-NOE were identified as appropriate models, with factors such as crop, location and data availability dictating the final choice for a given project. A database to inform such decisions is included.

20.
Pharmacovigilance systems aim at early detection of adverse effects of marketed drugs. They maintain large spontaneous reporting databases, for which several automatic signaling methods have been developed. One limitation of those methods is that the decision rules for signal generation are based on arbitrary thresholds. In this article, we propose a new signal-generation procedure in which the decision criterion is formulated as a critical region for the P-values resulting from the reporting odds ratio method as well as from Fisher's exact test; for the latter, we also study the use of mid-P-values. The critical region is defined by the false discovery rate, which can be estimated by adapting P-value mixture-model-based procedures to one-sided tests. The methodology is mainly illustrated with the location-based estimator procedure. It is studied through a large simulation study and applied to the French pharmacovigilance database.
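An FDR-based decision rule can be sketched with the standard Benjamini-Hochberg step-up procedure applied to the one-sided P-values, flagging drug-event pairs below the data-driven cutoff. The paper's location-based estimator refines the FDR estimate, so BH here is only a baseline illustration with hypothetical P-values.

```python
import numpy as np

def bh_signals(p_values, fdr=0.05):
    p = np.asarray(p_values)
    order = np.argsort(p)
    m = len(p)
    below = p[order] <= fdr * np.arange(1, m + 1) / m   # step-up comparison
    if not below.any():
        return np.zeros(m, dtype=bool)                  # no signals generated
    cutoff = np.max(np.where(below)[0])                 # largest passing rank
    signal = np.zeros(m, dtype=bool)
    signal[order[:cutoff + 1]] = True
    return signal

p_values = [0.0002, 0.004, 0.03, 0.20, 0.41, 0.74]      # hypothetical pairs
print(bh_signals(p_values))           # flags the pairs that generate a signal
```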
