Similar literature
1.
2.
Phylogenetic approaches to biological nomenclature are becoming increasingly common. Here I compare the behaviour of two such approaches, the phylogenetic system of definition and the phylogenetic system of reference, when there is a shift in the preference of phylogenetic hypotheses. The comparison is based on a case study from nemertean systematics and is the first to compare two different phylogenetic approaches throughout three stages of change, including two stages of phylogenetic nomenclature. It is concluded that a phylogenetic system of reference in combination with uninomials is superior in conveying phylogenetic information.

3.
In the present contribution we propose two recently developed classification algorithms for the analysis of mass-spectrometric data: the supervised neural gas and the fuzzy-labeled self-organizing map. The algorithms are inherently regularizing, which is recommended for these spectral data because of their high dimensionality and, for specific problems, their sparseness. Both algorithms are prototype-based, so that the principle of characteristic representatives is realized. This leads to an easy interpretation of the generated classification model. Further, the fuzzy-labeled self-organizing map is able to process uncertainty in data, and classification results can be obtained as fuzzy decisions. Moreover, this fuzzy classification, together with the property of topographic mapping, offers the possibility of class similarity detection, which can be used for class visualization. We demonstrate the power of both methods on two examples: the classification of bacteria (listeria types) and of neoplastic and non-neoplastic cell populations in breast cancer tissue sections.

4.
The data-mining challenge presented is composed of two fundamental problems. Problem one is the separation of forty-one subjects into two classes based on the data produced by the mass spectrometry of protein samples from each subject. Problem two is to find the specific differences between the protein expression data of two sets of subjects. In each problem, one group of subjects has a disease, while the other group is nondiseased. Each problem was approached with the intent to introduce a new and potentially useful tool to analyze protein expression from mass spectrometry data. A variety of methodologies, both conventional and nonconventional, were used in the analysis of these problems. The results presented show both overlap and discrepancies. What is important is the breadth of the techniques and the future direction this analysis will create.

5.
The complexity of ecosystems is staggering, with hundreds or thousands of species interacting in a number of ways from competition and predation to facilitation and mutualism. Understanding the networks that form the systems is of growing importance, e.g. to understand how species will respond to climate change, or to predict potential knock-on effects of a biological control agent. In recent years, a variety of summary statistics for characterising the global and local properties of such networks have been derived, which provide a measure for gauging the accuracy of a mathematical model for network formation processes. However, the critical underlying assumption is that the true network is known. This is not a straightforward task to accomplish, and typically requires minute observations and detailed field work. More importantly, knowledge about species interactions is restricted to specific kinds of interactions. For instance, while the interactions between pollinators and their host plants are amenable to direct observation, other types of species interactions, like those mentioned above, are not, and might not even be clearly defined from the outset. To discover information about complex ecological systems efficiently, new tools for inferring the structure of networks from field data are needed. In the present study, we investigate the viability of various statistical and machine learning methods recently applied in molecular systems biology: graphical Gaussian models, L1-regularised regression with least absolute shrinkage and selection operator (LASSO), sparse Bayesian regression and Bayesian networks. We have assessed the performance of these methods on data simulated from food webs of known structure, where we combined a niche model with a stochastic population model in a 2-dimensional lattice. 
We assessed the network reconstruction accuracy in terms of the area under the receiver operating characteristic (ROC) curve, which was typically in the range between 0.75 and 0.9, corresponding to the recovery of about 60% of the true species interactions at a false prediction rate of 5%. We also applied the models to presence/absence data for 39 European warblers, and found that the inferred species interactions showed a weak yet significant correlation with phylogenetic similarity scores, which tended to weakly increase when including bio-climate covariates and allowing for spatial autocorrelation. Our findings demonstrate that relevant patterns in ecological networks can be identified from large-scale spatial data sets with machine learning methods, and that these methods have the potential to contribute novel important tools for gaining deeper insight into the structure and stability of ecosystems.
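
As a rough illustration of the evaluation measure named in this abstract, the sketch below computes the area under the ROC curve for a ranked list of candidate species interactions. It is only a minimal sketch: the function name `roc_auc`, the toy labels and the toy scores are assumptions made for illustration and do not come from the study, which does not describe its implementation.

```python
def roc_auc(true_edges, edge_scores):
    """Area under the ROC curve, computed as the fraction of
    (true edge, non-edge) pairs in which the true edge gets the
    higher score; ties count as half a pair."""
    positives = [s for t, s in zip(true_edges, edge_scores) if t]
    negatives = [s for t, s in zip(true_edges, edge_scores) if not t]
    if not positives or not negatives:
        raise ValueError("need at least one true edge and one non-edge")
    concordant = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                concordant += 1.0
            elif p == n:
                concordant += 0.5
    return concordant / (len(positives) * len(negatives))


# Hypothetical toy data: 1 marks an interaction present in the known
# food web, and each score is a method's confidence in that interaction.
true_edges = [1, 1, 0, 0, 1, 0]
edge_scores = [0.9, 0.8, 0.7, 0.3, 0.6, 0.2]
auc = roc_auc(true_edges, edge_scores)
```

A value of 0.5 corresponds to random ranking and 1.0 to a perfect ranking, which is why the reported range of 0.75 to 0.9 indicates useful but imperfect reconstruction.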

6.
Mass spectrometric based methods for absolute quantification of proteins, such as QconCAT, rely on internal standards of stable-isotope labeled reference peptides, or "Q-peptides," to act as surrogates. Key to the success of this and related methods for absolute protein quantification (such as AQUA) is selection of the Q-peptide. Here we describe a novel method, CONSeQuence (consensus predictor for Q-peptide sequence), based on four different machine learning approaches for Q-peptide selection. CONSeQuence demonstrates improved performance over existing methods for optimal Q-peptide selection in the absence of prior experimental information, as validated using two independent test sets derived from yeast. Furthermore, we examine the physicochemical parameters associated with good peptide surrogates, and demonstrate that in addition to charge and hydrophobicity, peptide secondary structure plays a significant role in determining peptide "detectability" in liquid chromatography-electrospray ionization experiments. We relate peptide properties to protein tertiary structure, demonstrating a counterintuitive preference for buried status for frequently detected peptides. Finally, we demonstrate the improved efficacy of the general approach by applying a predictor trained on yeast data to sets of proteotypic peptides from two additional species taken from an existing peptide identification repository.

7.
8.

Background

Currently, a surgical approach is the best curative treatment for those with hepatocellular carcinoma (HCC). However, this requires HCC detection and removal of the lesion at an early stage. Unfortunately, most cases of HCC are detected at an advanced stage because of the lack of accurate biomarkers that can be used in the surveillance of those at risk. It is believed that biomarkers that could detect HCC early will play an important role in the successful treatment of HCC.

Methods

In this study, we analyzed serum levels of alpha fetoprotein, Golgi protein, fucosylated alpha-1-anti-trypsin, and fucosylated kininogen from 113 patients with cirrhosis and 164 serum samples from patients with cirrhosis plus HCC. We utilized two different methods, namely, stepwise penalized logistic regression (stepPLR) and model-based classification and regression trees (mob), along with the inclusion of clinical and demographic factors such as age and gender, to determine if these improved algorithms could be used to increase the detection of cancer.

Results and discussion

The performance of multiple biomarkers was found to be better than that of individual biomarkers. Using several statistical methods, we were able to detect HCC in the background of cirrhosis with an area under the receiver operating characteristic curve of at least 0.95. stepPLR and mob demonstrated better predictive performance relative to logistic regression (LR), penalized LR and classification and regression trees (CART) used in our prior study based on three-fold cross-validation and leave one out cross-validation. In addition, mob provided unparalleled intuitive interpretation of results and potential cut-points for biomarker levels. The inclusion of age and gender improved the overall performance of both methods among all models considered, while the stratified male-only subset provided the best overall performance among all methods and models considered.

Conclusions

In addition to multiple biomarkers, the incorporation of age and gender into statistical models significantly improved their predictive performance in the detection of HCC.

9.
A pseudo-random generator is an algorithm that generates a sequence of objects determined by a truly random seed; the sequence itself is not truly random. Such generators are widely used in many applications, such as cryptography and simulations. In this article, we examine current popular machine learning algorithms combined with various on-line algorithms on pseudo-randomly generated data, in order to find out which machine learning approach is more suitable for on-line prediction on this kind of data. To further improve the prediction performance, we propose a novel sample-weighted algorithm that takes the generalization errors in each iteration into account. We perform intensive evaluation on real Baccarat data generated by casino machines and on random numbers generated by a popular Java program, which are two typical examples of pseudo-randomly generated data. The experimental results show that support vector machines and k-nearest neighbors perform better than the other methods, both with and without the sample-weighted algorithm, on the evaluation data set.

10.
11.
RNA-seq technologies are quickly revolutionizing genomic studies, and statistical methods for RNA-seq data are under continuous development. Timely review and comparison of the most recently proposed statistical methods will provide a useful guide for choosing among them for data analysis. Particular interest surrounds the ability to detect differential expression (DE) in genes. Here we compare four recently proposed statistical methods, edgeR, DESeq, baySeq, and a method with a two-stage Poisson model (TSPM), through a variety of simulations that were based on different distribution models or real data. We compared the ability of these methods to detect DE genes in terms of the significance ranking of genes and false discovery rate control. All methods compared are implemented in freely available software. We also discuss the availability and functions of the currently available versions of these software packages.
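
To make "false discovery rate control" concrete, the sketch below adjusts a list of per-gene p-values with the Benjamini-Hochberg step-up procedure. This is an assumption for illustration only: the abstract does not say which FDR procedure each compared package applies, and the function name and p-values are hypothetical.

```python
def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values in the input order.

    The k-th smallest p-value is scaled by m / k, and a running
    minimum taken from the largest p-value downwards keeps the
    adjusted values monotone."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Hypothetical p-values for four genes.
raw = [0.01, 0.04, 0.03, 0.005]
adj = benjamini_hochberg(raw)
# Genes whose adjusted p-value is at most 0.05 would be called
# differentially expressed at a nominal 5% FDR.
called = [a <= 0.05 for a in adj]
```

Comparing the nominal FDR with the fraction of false calls among simulated genes of known status is one way such a simulation study can check whether a method actually controls the FDR.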

12.
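Grassland aboveground biomass (AGB) is an important indicator for guiding livestock production and management, and it is the basis for comprehensive analysis of the forage-livestock balance. At present, few studies address the retrieval of grassland AGB in the Qilian Mountains, and the scale differences among multi-source data have not been well resolved. To understand the spatial distribution of grassland AGB in the Qilian Mountains, Sentinel-2 multispectral data, unmanned aerial vehicle (UAV) data, and field-measured grassland AGB data from the 2021 growing season were combined into integrated space-air-ground monitoring. The suitability of four algorithms for retrieving grassland AGB was analyzed: decision tree regression (DTR), random forest regression (RFR), gradient boosting regression tree (GBRT), and eXtreme Gradient Boosting (XGBoost). The best model was then used to retrieve the spatial distribution of grassland AGB in the Qilian Mountains. The results show that the vegetation indices in the study area differ in their characteristics. AGB in the Qilian Mountains increases from northwest to southeast, with a mean AGB of 925.43 kg/hm². All six vegetation indices were significantly and positively correlated with measured AGB and are suitable as indicators for remote-sensing retrieval of grassland AGB in the Qilian Mountains. Compared with the other models, XGBoost had the highest R² (0.78) and accuracy (74.75%) and the lowest root mean square error (RMSE, 99.74 kg/hm²) and mean absolute error (MAE, 71.60 kg/hm²), giving the best retrieval performance. UAV data provide more detailed spatial features and reduce the scale difference between Sentinel-2 data and field sampling data. Building an XGBoost model on the correlation between the six vegetation indices and grassland AGB to retrieve the spatial distribution of grassland AGB in the study area is therefore of practical value. The results provide a reference and data support for guiding the development of grassland animal husbandry in the Qilian Mountains and for maintaining the balance of the grassland ecosystem.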
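
The three error measures used to rank the models (R², RMSE and MAE) can be sketched as follows. This is a minimal illustration under stated assumptions: the function name and the toy values are hypothetical, and the abstract does not define how its accuracy percentage was computed, so that measure is left out.

```python
import math


def regression_metrics(observed, predicted):
    """Return RMSE, MAE and R^2 for paired observed/predicted values."""
    n = len(observed)
    residuals = [o - p for o, p in zip(observed, predicted)]
    ss_res = sum(r * r for r in residuals)           # residual sum of squares
    mean_obs = sum(observed) / n
    ss_tot = sum((o - mean_obs) ** 2 for o in observed)  # total sum of squares
    return {
        "rmse": math.sqrt(ss_res / n),
        "mae": sum(abs(r) for r in residuals) / n,
        "r2": 1.0 - ss_res / ss_tot,
    }


# Hypothetical toy values standing in for measured and modelled AGB.
observed = [1.0, 2.0, 3.0, 4.0]
predicted = [1.0, 2.0, 3.0, 5.0]
metrics = regression_metrics(observed, predicted)
```

RMSE and MAE carry the units of the target (kg/hm² in the abstract), while R² is unitless, so the three measures are usually reported together.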

13.

Background  

Bioactivity profiling using high-throughput in vitro assays can reduce the cost and time required for toxicological screening of environmental chemicals and can also reduce the need for animal testing. Several public efforts are aimed at discovering patterns or classifiers in high-dimensional bioactivity space that predict tissue, organ or whole animal toxicological endpoints. Supervised machine learning is a powerful approach to discover combinatorial relationships in complex in vitro/in vivo datasets. We present a novel model to simulate complex chemical-toxicology data sets and use this model to evaluate the relative performance of different machine learning (ML) methods.

14.
We compared the ability of three machine learning algorithms (linear discriminant analysis, decision tree, and support vector machines) to automate the classification of calls of nine frog and three bird species. In addition, we tested two ways of characterizing each call to train/test the system. Calls were characterized with four standard call variables (minimum and maximum frequencies, call duration and maximum power) or eleven variables that included three standard call variables (minimum and maximum frequencies, call duration) and a coarse representation of call structure (frequency of maximum power in eight segments of the call). A total of 10,061 isolated calls were used to train/test the system. The average true positive rates for the three methods were: 94.95% for support vector machine (0.94% average false positive rate), 89.20% for decision tree (1.25% average false positive rate) and 71.45% for linear discriminant analysis (1.98% average false positive rate). There was no statistical difference in classification accuracy based on 4 or 11 call variables, but this efficient data reduction technique in conjunction with the high classification accuracy of the SVM is a promising combination for automated species identification by sound. By combining automated digital recording systems with our automated classification technique, we can greatly increase the temporal and spatial coverage of biodiversity data collection.

15.
16.
SUMMARY 1. The prediction of species distributions is of primary importance in ecology and conservation biology. Statistical models play an important role in this regard; however, researchers have little guidance when choosing between competing methodologies because few comparative studies have been conducted. 2. We provide a comprehensive comparison of traditional and alternative techniques for predicting species distributions using logistic regression analysis, linear discriminant analysis, classification trees and artificial neural networks to model: (1) the presence/absence of 27 fish species as a function of habitat conditions in 286 temperate lakes located in south-central Ontario, Canada and (2) simulated data sets exhibiting deterministic, linear and non-linear species response curves. 3. Detailed evaluation of model predictive power showed that approaches produced species models that differed in overall correct classification, specificity (i.e. ability to correctly predict species absence) and sensitivity (i.e. ability to correctly predict species presence) and in terms of which of the study lakes they correctly classified. On average, neural networks outperformed the other modelling approaches, although all approaches predicted species presence/absence with moderate to excellent success. 4. Based on simulated non-linear data, classification trees and neural networks greatly outperformed traditional approaches, whereas all approaches exhibited similar correct classification rates when modelling simulated linear data. 5. Detailed evaluation of model explanatory insight showed that the relative importance of the habitat variables in the species models varied among the approaches, where habitat variable importance was similar among approaches for some species and very different for others. 6. In general, differences in predictive power (both correct classification rate and identity of the lakes correctly classified) among the approaches corresponded with differences in habitat variable importance, suggesting that non-linear modelling approaches (i.e. classification trees and neural networks) are better able to capture and model complex, non-linear patterns found in ecological data. The results from the comparisons using simulated data further support this notion. 7. By employing parallel modelling approaches with the same set of data and focusing on comparing multiple metrics of predictive performance, researchers can begin to choose predictive models that not only provide the greatest predictive power, but also best fit the proposed application.
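
The three predictive-power measures defined in point 3 of this summary (overall correct classification, sensitivity and specificity) can be sketched for presence/absence data as follows. The function name and the toy lake records are assumptions for illustration and are not taken from the study.

```python
def classification_rates(observed, predicted):
    """Compute sensitivity, specificity and overall correct
    classification rate from paired presence (1) / absence (0) values."""
    pairs = list(zip(observed, predicted))
    tp = sum(1 for o, p in pairs if o == 1 and p == 1)  # presence predicted correctly
    fn = sum(1 for o, p in pairs if o == 1 and p == 0)  # presence missed
    tn = sum(1 for o, p in pairs if o == 0 and p == 0)  # absence predicted correctly
    fp = sum(1 for o, p in pairs if o == 0 and p == 1)  # false presence
    if tp + fn == 0 or tn + fp == 0:
        raise ValueError("need at least one presence and one absence")
    return {
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "correct": (tp + tn) / len(pairs),
    }


# Hypothetical records for one species across five lakes.
observed = [1, 1, 1, 0, 0]
predicted = [1, 0, 1, 0, 1]
rates = classification_rates(observed, predicted)
```

Reporting all three matters for rare species: a model that always predicts absence can score a high overall correct classification rate while its sensitivity is zero.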

17.

Background  

Mass spectrometry is a key technique in proteomics and can be used to analyze complex samples quickly. One key problem with the mass spectrometric analysis of peptides and proteins, however, is the fact that absolute quantification is severely hampered by the unclear relationship between the observed peak intensity and the peptide concentration in the sample. While there are numerous approaches to circumvent this problem experimentally (e.g. labeling techniques), reliable prediction of the peak intensities from peptide sequences could provide a peptide-specific correction factor. Thus, it would be a valuable tool towards label-free absolute quantification.

18.

Background

The recent pandemic of obesity and the metabolic syndrome (MetS) has led to the realisation that new drug targets are needed to either reduce obesity or the subsequent pathophysiological consequences associated with excess weight gain. Certain nuclear hormone receptors (NRs) play a pivotal role in lipid and carbohydrate metabolism and have been highlighted as potential treatments for obesity. This realisation started a search for NR agonists in order to understand and successfully treat MetS and associated conditions such as insulin resistance, dyslipidaemia, hypertension, hypertriglyceridemia, obesity and cardiovascular disease. The most studied NRs for treating metabolic diseases are the peroxisome proliferator-activated receptors (PPARs), PPAR-α, PPAR-γ, and PPAR-δ. However, prolonged PPAR treatment in animal models has led to adverse side effects including increased risk of a number of cancers, but how these receptors change metabolism long term in terms of pathology, despite many beneficial effects shorter term, is not fully understood. In the current study, changes in male Sprague Dawley rat liver caused by dietary treatment with a PPAR-pan (PPAR-α, -γ, and -δ) agonist were profiled by classical toxicology (clinical chemistry) and high throughput metabolomics and lipidomics approaches using mass spectrometry.

Results

In order to integrate an extensive set of nine different multivariate metabolic and lipidomics datasets with classical toxicological parameters we developed a hypotheses free, data driven machine learning approach. From the data analysis, we examined how the nine datasets were able to model dose and clinical chemistry results, with the different datasets having very different information content.

Conclusions

We found lipidomics (Direct Infusion-Mass Spectrometry) data the most predictive for different dose responses. In addition, associations with the metabolic and lipidomic data with aspartate amino transaminase (AST), a hepatic leakage enzyme to assess organ damage, and albumin, indicative of altered liver synthetic function, were established. Furthermore, by establishing correlations and network connections between eicosanoids, phospholipids and triacylglycerols, we provide evidence that these lipids function as a key link between inflammatory processes and intermediary metabolism.

19.
In this Letter, we present a novel methodology of searching for biologically active compounds, which is based on the combination of docking experiments and analysis of the results by machine learning methods. The study was performed for 5 different protein kinases, and several sets of compounds (active, inactive and assumed inactives) were docked into their targets. The resulting ligand–protein complexes were represented by means of structural interaction fingerprints profiles (SIFts profiles) that constituted an input for ML methods. The developed protocol was found to be superior to the combination of classification algorithms with the standard fingerprint MACCSFP.

20.
The current investigations were carried out in the context of a nutritional case study aiming at assessing the postnatal impact of maternal dietary protein restriction during pregnancy and lactation on rat offspring plasma metabolome and hypothalamic proteome. Although data generated by different "Omics" technologies are usually considered and analyzed separately, their interrelation may offer a valuable opportunity for assessing the emerging "integrated biology" concept. The overall strategy of analysis first investigated data pretreatment and variable selection for each dataset. Then, three multivariate analyses were applied to investigate the links between the abundance of metabolites and the expression of proteins collected on the same samples. Unfold principal component analysis and regularized canonical correlation analysis did not take into account the presence of groups of individuals related to the intervention study. On the contrary, the predictive MultiBlock Partial Least Squares method used this information. Regularized canonical correlation analysis appeared as a relevant approach to investigate the relationships between the two datasets. However, in order to highlight the molecular compounds, proteins and metabolites, associated in interacting or common metabolic pathways for the experimental groups, MultiBlock partial least squares was the most appropriate method in the present nutritional case study.
