Similar Articles (20 results)
1.
Multigene and genomic data sets have become commonplace in the field of phylogenetics, but many existing tools are not designed for such data sets, which often makes the analysis time-consuming and tedious. Here, we present PhyloSuite, a user-friendly, cross-platform, open-source, stand-alone Python graphical user interface (GUI) workflow desktop platform dedicated to streamlining molecular sequence data management and evolutionary phylogenetics studies. It uses a plugin-based system that integrates several phylogenetic and bioinformatic tools, thereby streamlining the entire procedure, from data acquisition to phylogenetic tree annotation (in combination with iTOL). It has the following features: (a) a point-and-click, drag-and-drop graphical user interface; (b) a workplace to manage and organize molecular sequence data and results of analyses; (c) GenBank entry extraction and comparative statistics; and (d) a phylogenetic workflow with batch processing capability, comprising sequence alignment (MAFFT and MACSE), alignment optimization (trimAl, HmmCleaner and Gblocks), data set concatenation, best partitioning scheme and best evolutionary model selection (PartitionFinder and ModelFinder), and phylogenetic inference (MrBayes and IQ-TREE). PhyloSuite is designed for both beginners and experienced researchers, allowing the former to quick-start their way into phylogenetic analysis, and the latter to conduct, store and manage their work in a streamlined way and spend more time investigating scientific questions instead of wasting it on transferring files from one software program to another.
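A minimal sketch of the kind of batch workflow PhyloSuite automates, written as plain subprocess calls rather than PhyloSuite's actual plugin API; it assumes the mafft, trimal and iqtree2 binaries are on the PATH, and the flags and file layout are illustrative only.

```python
import subprocess
from pathlib import Path

def run(cmd, stdout=None):
    print(">", " ".join(cmd))
    subprocess.run(cmd, check=True, stdout=stdout)

for fasta in Path("genes").glob("*.fasta"):
    aln = fasta.with_suffix(".aln")
    trimmed = fasta.with_suffix(".trim")
    # Align each gene (MAFFT writes the alignment to stdout), trim ambiguous
    # columns, then infer a gene tree with ModelFinder and ultrafast bootstrap.
    with open(aln, "w") as out:
        run(["mafft", "--auto", str(fasta)], stdout=out)
    run(["trimal", "-in", str(aln), "-out", str(trimmed), "-automated1"])
    run(["iqtree2", "-s", str(trimmed), "-m", "MFP", "-B", "1000"])
```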

2.
This review discusses data analysis strategies for the discovery of biomarkers in clinical proteomics. Proteomics studies produce large amounts of data, characterized by few samples of which many variables are measured. A wealth of classification methods exists for extracting information from the data. Feature selection plays an important role in reducing the dimensionality of the data prior to classification and in discovering biomarker leads. The question of which classification strategy works best is as yet unanswered. Validation is a crucial step in moving biomarker leads towards clinical use. Here we discuss only statistical validation, recognizing that biological and clinical validation is of utmost importance. First, there is the need for validated model selection to develop a generalized classifier that predicts new samples correctly. A cross-validation loop wrapped around the model development procedure assesses the performance using unseen data. The significance of the model should be tested; we use permutations of the data for comparison with uninformative data. This procedure also tests the correctness of the performance validation. Preferably, a new set of samples is measured to test the classifier and rule out results specific to a machine, analyst, laboratory or the first set of samples. This is not yet standard practice. We present a modular framework that combines feature selection, classification, biomarker discovery and statistical validation; all of these data analysis aspects are discussed in this review. The feature selection, classification and biomarker discovery modules can be incorporated or omitted according to the preference of the researcher. The validation modules, however, should not be optional. In each module, the researcher can select from a wide range of methods, since there is no single unique way that leads to the correct model and proper validation. We discuss many possibilities for feature selection, classification and biomarker discovery. For validation we advise a combination of cross-validation and permutation testing, a validation strategy supported in the literature.
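The nested validation loop and permutation test described above can be sketched with scikit-learn. The synthetic data, feature counts and classifier choice below are placeholders, not the review's specific methods; the key point is that feature selection sits inside the pipeline, so the outer loop only ever scores unseen data.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, cross_val_score, permutation_test_score
from sklearn.pipeline import Pipeline

rng = np.random.default_rng(0)
X = rng.normal(size=(60, 500))          # few samples, many variables
y = rng.integers(0, 2, 60)

# Feature selection is part of the pipeline, so it is refit on each training fold
model = Pipeline([("select", SelectKBest(f_classif, k=20)),
                  ("clf", LogisticRegression(max_iter=1000))])
inner = GridSearchCV(model, {"select__k": [10, 20, 50]}, cv=3)  # model selection loop
outer_acc = cross_val_score(inner, X, y, cv=5)                  # performance validation loop

# Permutation test: compare against label-shuffled (uninformative) data
score, perm_scores, pvalue = permutation_test_score(model, X, y, cv=5, n_permutations=100)
print(outer_acc.mean(), score, pvalue)
```

With random labels the permutation p-value should be unremarkable; a small p-value on real data indicates the classifier captures class-related signal rather than noise.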

3.
Because of high dimensionality, machine learning algorithms typically rely on feature selection techniques in order to perform effective classification in microarray gene expression data sets. However, the large number of features compared to the number of samples makes the task of feature selection computationally hard and prone to errors. This paper interprets feature selection as a task of stochastic optimization, where the goal is to select, among an exponential number of alternative gene subsets, the one expected to return the highest generalization in classification. Blocking is an experimental design strategy which produces similar experimental conditions in which to compare alternative stochastic configurations, in order to be confident that observed differences in accuracy are due to actual differences rather than to fluctuations and noise effects. We propose an original blocking strategy for improving feature selection which aggregates in a paired way the validation outcomes of several learning algorithms to assess a gene subset and compare it to others. This is a novelty with respect to conventional wrappers, which commonly adopt a single learning algorithm to evaluate the relevance of a given set of variables. The rationale of the approach is that, by increasing the number of experimental conditions under which we validate a feature subset, we can lessen the problems related to the scarcity of samples and consequently arrive at a better selection. The paper shows that the blocking strategy significantly improves the performance of a conventional forward selection for a set of 16 publicly available cancer expression data sets. The experiments involve six different classifiers and show that improvements take place independently of the classification algorithm used after the selection step. Two further validations based on available biological annotation support the claim that blocking strategies in feature selection may improve the accuracy and the quality of the solution. The first validation is based on retrieving PubMed abstracts associated with the selected genes and matching them to regular expressions describing the biological phenomenon underlying the expression data sets. The biological validation that follows is based on the use of the Bioconductor package GOstats to perform Gene Ontology statistical analysis.
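A toy sketch of the blocking idea, assuming the paired unit is the (fold, learner) block: every candidate gene subset is scored on the same folds with the same learners, so subsets can be compared pairwise within blocks. The learners, subsets and data below are illustrative, not the paper's exact setup.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

def blocked_score(X, y, subset, learners, cv):
    """Score a gene subset by pairing fold outcomes across several learners."""
    scores = []
    for train, test in cv.split(X, y):       # identical folds (blocks) for every learner
        fold = [clf.fit(X[train][:, subset], y[train])
                   .score(X[test][:, subset], y[test]) for clf in learners]
        scores.append(fold)
    return np.asarray(scores)                # folds x learners matrix of paired outcomes

rng = np.random.default_rng(1)
X, y = rng.normal(size=(40, 200)), rng.integers(0, 2, 40)
cv = StratifiedKFold(5, shuffle=True, random_state=1)
learners = [GaussianNB(), KNeighborsClassifier(3), DecisionTreeClassifier(random_state=1)]

a = blocked_score(X, y, [0, 5, 9], learners, cv)
b = blocked_score(X, y, [1, 2, 3], learners, cv)
# Paired comparison: accuracy differences within each fold-learner block
print((a - b).mean(), (a - b).std())
```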

4.
Our goal in this paper is to show an analytical workflow for selecting protein biomarker candidates from SELDI-MS data. The clinical question at issue is to enable prediction of the complete remission (CR) duration for acute myeloid leukemia (AML) patients. This would facilitate disease prognosis and make individualized therapy possible. SELDI mass spectrometry proteomics analyses were performed on blast cell samples collected from AML patients pre-chemotherapy. Although the biobank included approximately 200 samples, only 58 were available for analysis. The presented workflow includes sample selection, experimental optimization, repeatability estimation, data preprocessing, data fusion, and feature selection. Specific difficulties have been the small number of samples and the skewed distribution of the CR duration among the patients. Further, we had to deal with both noisy SELDI-MS data and a diverse patient cohort. This has been handled by sample selection and several methods for data preprocessing and feature detection in the analysis workflow. Four conceptually different methods for peak detection and alignment were considered, as well as two diverse methods for feature selection. The peak detection and alignment methods included the recently developed annotated regions of significance (ARS) method, the SELDI-MS software Ciphergen Express, which was regarded as the standard method, segment-wise spectral alignment by a genetic algorithm (PAGA) followed by binning, and, finally, binning of raw data. In the feature selection, the "standard" Mann-Whitney test was compared with a hierarchical orthogonal partial least-squares (O-PLS) analysis approach. The combined information from all these analyses gave a collection of 21 protein peaks. These were regarded as the most promising and robust biomarker candidates, since they were picked out as significant features in several of the models. The chosen peaks will now be our first choice for the continuing work on protein identification and biological validation. The identification will be performed by chromatographic purification and MALDI MS/MS. Thus, we have shown that the use of several data handling methods can improve a protein profiling workflow from experimental optimization to a predictive model. The framework of this methodology should be seen as general and could be used with one-dimensional spectral omics data other than SELDI-MS, provided an adequate number of samples is included.
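The univariate screening arm of such a workflow can be sketched as below; the data are synthetic, and a real analysis would also correct for multiple testing and fuse evidence across the several preprocessing routes, as the paper does.

```python
import numpy as np
from scipy.stats import mannwhitneyu

rng = np.random.default_rng(2)
peaks = rng.lognormal(size=(58, 300))           # 58 samples x 300 detected peaks
short_cr = rng.integers(0, 2, 58).astype(bool)  # short vs long CR duration groups

# Nonparametric two-group comparison per peak (robust to skewed intensities)
pvals = np.array([mannwhitneyu(peaks[short_cr, j], peaks[~short_cr, j]).pvalue
                  for j in range(peaks.shape[1])])
candidates = np.argsort(pvals)[:21]             # keep the best-ranked peaks
print(candidates, pvals[candidates])
```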

5.
Protein posttranslational modifications (PTMs) are of increasing interest in biomedical research, yet studies rarely examine more than one PTM. One barrier to multi-PTM studies is the time cost for both sample preparation and data acquisition, which scales linearly with the number of modifications. The most prohibitive requirement is often the need for large amounts of sample, which must be increased proportionally with the number of PTM enrichment steps. Here, a streamlined, quantitative, label-free proteomic workflow, "one-pot" PTM enrichment, is described that enables comprehensive identification and quantification of peptides containing acetylated and succinylated lysine residues from a single sample containing as little as 1 mg of mitochondrial protein. Coupled with label-free, data-independent acquisition (DIA), 2235 acetylated and 2173 succinylated peptides are identified and quantified with the one-pot method, and peak areas are shown to be highly correlated between the one-pot and traditional single-PTM enrichments. The one-pot method makes possible the detection of multiple PTMs occurring on the same peptide, and it is shown that it can be used to gain unique biological insights into PTM crosstalk. Compared to single-PTM enrichments, the one-pot workflow has equivalent reproducibility and enables direct assessment of PTM crosstalk from biological samples in less time and from less tissue.

6.
Identifying reproducible yet relevant protein features in proteomics data is a major challenge. Analysis at the level of protein complexes can resolve this issue, and we have developed a suite of feature-selection methods collectively referred to as Rank-Based Network Analysis (RBNA). RBNAs differ in their individual statistical test setup but are similar in the sense that they deploy rank-defined weights among proteins per sample, a procedure known as gene fuzzy scoring. Currently, no RBNA exists for paired-sample scenarios where both control and test tissues originate from the same source (e.g., the same patient). It is expected that paired tests, when used appropriately, are more powerful than approaches intended for unpaired samples. We report that the class-paired RBNA, PPFSNET, dominates in both simulated and real data scenarios. Moreover, for the first time, we explicitly incorporate batch-effect resistance as an additional evaluation criterion for feature-selection approaches. Batch effects are class-irrelevant variations arising from different handlers or processing times, and they can obfuscate analysis. We demonstrate that PPFSNET and an earlier RBNA, PFSNET, are particularly resistant to batch effects and select only features strongly correlated with class but not batch.
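Gene fuzzy scoring assigns rank-defined weights per sample. A minimal sketch of one common variant follows; the upper/lower rank quantiles are illustrative, not the exact PFSNET/PPFSNET parameters.

```python
import numpy as np

def fuzzy_scores(expr, upper=0.05, lower=0.15):
    """Rank-based gene fuzzy scoring for one sample's expression vector.
    Genes in the top `upper` fraction score 1, genes outside the top `lower`
    fraction score 0, with a linear ramp in between (a common GFS variant)."""
    n = expr.size
    ranks = (-expr).argsort().argsort()   # 0 = highest-expressed gene
    q = ranks / n                         # rank quantile per gene
    return np.clip((lower - q) / (lower - upper), 0.0, 1.0)

sample = np.random.default_rng(3).lognormal(size=1000)
w = fuzzy_scores(sample)
print(w.max(), int((w == 1).sum()), int((w == 0).sum()))
```

Because the weights depend only on within-sample ranks, they are insensitive to sample-wide intensity shifts, which is one intuition for the batch-effect resistance reported above.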

7.
Protein quantification using data-independent acquisition methods such as SWATH-MS most commonly relies on spectral matching to a reference MS/MS assay library. To enable deep proteome coverage and efficient use of existing data, in silico approaches have been described that use archived or publicly available large reference spectral libraries for spectral matching. Since implicit in the use of larger libraries is an increasing likelihood of false discoveries, new workflows are needed to ensure high confidence in protein matching under these conditions. We present a workflow which introduces a range of filters and thresholds aimed at increasing confidence that the resulting proteins are reliably detected and that their quantitation is consistent and reproducible. We demonstrated the workflow using extended libraries with SWATH data from human plasma samples and a yeast-spiked human K562 cell lysate digest.
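A hedged sketch of the kind of post-matching filters such a workflow might apply; the file name, column names and thresholds below are hypothetical placeholders, not the paper's published values.

```python
import pandas as pd

# Hypothetical input: one row per peptide match, with columns "protein",
# "peptide", "qvalue" and per-replicate intensity columns "rep1", "rep2", ...
df = pd.read_csv("swath_peptides.csv")
reps = [c for c in df.columns if c.startswith("rep")]

df = df[df["qvalue"] <= 0.01]                        # confident library matches only
cv = df[reps].std(axis=1) / df[reps].mean(axis=1)
df = df[cv <= 0.20]                                  # reproducible quantitation only
counts = df.groupby("protein")["peptide"].nunique()
kept = counts[counts >= 2].index                     # require >= 2 peptides per protein
df = df[df["protein"].isin(kept)]
print(len(df), "peptides over", df["protein"].nunique(), "proteins")
```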

8.
High‐throughput ‘‐omics’ data can be combined with large‐scale molecular interaction networks, for example, protein–protein interaction networks, to provide a unique framework for the investigation of human molecular biology. Interest in these integrative ‘‐omics’ methods is growing rapidly because of their potential to understand complexity and association with disease; such approaches have a focus on associations between phenotype and “network‐type.” The potential of this research is enticing, yet there remain a series of important considerations. Here, we discuss interaction data selection, data quality, the relative merits of using data from large high‐throughput studies versus a meta‐database of smaller literature‐curated studies, and possible issues of sociological or inspection bias in interaction data. Other work underway, especially international consortia to establish data formats, quality standards and address data redundancy, and the improvements these efforts are making to the field, is also evaluated. We present options for researchers intending to use large‐scale molecular interaction networks as a functional context for protein or gene expression data, including microRNAs, especially in the context of human disease.

9.
Increasingly, animal behavior studies are enhanced through the use of accelerometry. Translating raw accelerometer data into animal behaviors requires the development of classifiers. Here, we present the "rabc" (r for animal behavior classification) package to assist researchers with the interactive development of such animal behavior classifiers in a supervised classification approach. The package uses datasets consisting of accelerometer data with their corresponding animal behaviors (e.g., for triaxial accelerometer data along the x, y and z axes, arranged as "x, y, z, x, y, z,…, behavior"). Using an example dataset collected on white storks (Ciconia ciconia), we illustrate the workflow of this package, including accelerometer data visualization, feature calculation, feature selection, feature visualization, extreme gradient boosting model training, validation and, finally, a demonstration of the behavior classification results.
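A compressed sketch of that workflow in Python (rabc itself is an R package): summary features are computed per acceleration burst and a gradient-boosting classifier is cross-validated. The data, burst length and feature choices here are toy placeholders.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(4)
n_bursts, n_samples = 200, 50
acc = rng.normal(size=(n_bursts, n_samples, 3))    # x, y, z samples per burst
behavior = rng.choice(["rest", "walk", "fly"], n_bursts)

# Simple per-axis summary features (mean, std) plus an overall dynamic
# body acceleration-style magnitude per burst
feats = np.column_stack([acc.mean(axis=1), acc.std(axis=1),
                         np.abs(acc - acc.mean(axis=1, keepdims=True)).sum(axis=(1, 2))])

clf = GradientBoostingClassifier()
print(cross_val_score(clf, feats, behavior, cv=5).mean())
```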

10.
A major challenge in biomedical studies in recent years has been the classification of gene expression profiles into categories, such as cases and controls. This is done by first training a classifier on a labeled training set containing labeled samples from the two populations, and then using that classifier to predict the labels of new samples. Such predictions have recently been shown to improve the diagnosis and treatment selection practices for several diseases. This procedure is complicated, however, by the high dimensionality of the data. While microarrays can measure the levels of thousands of genes per sample, case-control microarray studies usually involve no more than several dozen samples. Standard classifiers do not work well in these situations, where the number of features (gene expression levels measured in these microarrays) far exceeds the number of samples. Selecting only the features that are most relevant for discriminating between the two categories can help construct better classifiers, in terms of both accuracy and efficiency. In this work we developed a novel method for multivariate feature selection based on the Partial Least Squares algorithm. We compared the method's variants with common feature selection techniques across a large number of real case-control datasets, using several classifiers. We demonstrate the advantages of the method and the preferable combinations of classifier and feature selection technique.
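A minimal sketch of PLS-based multivariate feature ranking with scikit-learn; ranking genes by the magnitude of the fitted PLS coefficients is one simple variant, not necessarily the exact scoring used in the paper.

```python
import numpy as np
from sklearn.cross_decomposition import PLSRegression

rng = np.random.default_rng(5)
X = rng.normal(size=(50, 2000))    # 50 samples, 2000 genes
y = rng.integers(0, 2, 50)         # case/control labels coded 0/1

# Fit a low-dimensional PLS model of the class label on expression levels,
# then rank genes by the magnitude of their regression coefficients
pls = PLSRegression(n_components=3).fit(X, y)
ranking = np.argsort(-np.abs(pls.coef_.ravel()))
top_genes = ranking[:50]
print(top_genes[:10])
```

Unlike univariate filters, the PLS coefficients reflect each gene's contribution in the presence of all others, which is the multivariate aspect emphasized above.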

11.
High-throughput sequencing is a powerful tool, but suffers biases and errors that must be accounted for to prevent false biological conclusions. Such errors include batch effects: technical errors only present in subsets of data due to procedural changes within a study. If overlooked and multiple batches of data are combined, spurious biological signals can arise, particularly if batches of data are correlated with biological variables. Batch effects can be minimized through randomization of sample groups across batches. However, in long-term or multiyear studies where data are added incrementally, full randomization is impossible, and batch effects may be a common feature. Here, we present a case study where false signals of selection were detected due to a batch effect in a multiyear study of Alpine ibex (Capra ibex). The batch effect arose because sequencing read length changed over the course of the project and populations were added incrementally to the study, resulting in nonrandom distributions of populations across read lengths. The differences in read length caused small misalignments in a subset of the data, leading to false variant alleles and thus false SNPs. Pronounced allele frequency differences between populations arose at these SNPs because of the correlation between read length and population. This created highly statistically significant, but biologically spurious, signals of selection and false associations between allele frequencies and the environment. We highlight the risk of batch effects and discuss strategies to reduce the impacts of batch effects in multiyear high-throughput sequencing studies.

12.
Photo-identification is a commonly used non-invasive technique that has been profitably employed in biological studies over the years. It rests on the assumption that a single individual can be recognized in multiple photos captured at different times by exploiting its unique, visible physical features, such as marks, notches or any other distinctive trait. Photo-identification is thus performed to infer knowledge about a wild species' spatial and temporal distribution as well as its population dynamics, providing valuable information especially when the species under investigation is ranked as data deficient. Furthermore, the technological improvements of the last decades and the wide availability of devices with powerful computing capabilities are driving research towards the common goal of enriching bio-ecological studies with innovative computer science approaches. In this scenario, computer vision plays a fundamental role, as it can successfully assist researchers in the analysis of large amounts of data. The aim of this paper is to provide a computer vision approach for the photo-identification of Risso's dolphin, exploiting specific visual cues with a feature-based approach relying on the SIFT and SURF feature detectors. The experiments were conducted on image data acquired in the Gulf of Taranto from 2013 to 2017, including a comparative analysis of the performance of SIFT and SURF as well as a comparison with the state-of-the-art software DARWIN; they proved the effectiveness of the proposed approach and suggest that its application would be suitable for large-scale studies. In conclusion, this paper presents an innovative computer vision application for the identification of unknown Risso's dolphin individuals that relies on an automated, feature-based approach. The results suggest that the proposed approach can efficiently assist researchers in the photo-identification of large amounts of data collected in such a challenging domain.
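A minimal OpenCV sketch of the feature-based matching step (SIFT with Lowe's ratio test). The image file names are hypothetical; SURF is shown here as SIFT only, since SURF is patent-encumbered and available only in OpenCV contrib builds.

```python
import cv2

# Compare a query fin photo against one catalog photo (hypothetical file names)
img1 = cv2.imread("fin_query.jpg", cv2.IMREAD_GRAYSCALE)
img2 = cv2.imread("fin_catalog_042.jpg", cv2.IMREAD_GRAYSCALE)

sift = cv2.SIFT_create()
kp1, des1 = sift.detectAndCompute(img1, None)
kp2, des2 = sift.detectAndCompute(img2, None)

matcher = cv2.BFMatcher(cv2.NORM_L2)
matches = matcher.knnMatch(des1, des2, k=2)
good = [m for m, n in matches if m.distance < 0.75 * n.distance]  # Lowe's ratio test
print(f"{len(good)} good matches")  # more matches -> more likely the same individual
```

In a full photo-ID pipeline this pairwise score would be computed against every catalog image and the highest-scoring candidates presented to the researcher for confirmation.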

13.
This article investigates an ensemble-based technique called Bayesian Model Averaging (BMA) to improve the performance of protein amino acid pKa predictions. Structure-based pKa calculations play an important role in the mechanistic interpretation of protein structure and are also used to determine a wide range of protein properties. A diverse set of methods currently exists for pKa prediction, ranging from empirical statistical models to ab initio quantum mechanical approaches. However, each of these methods is based on a set of conceptual assumptions that can affect a model's accuracy and generalizability for pKa prediction in complicated biomolecular systems. We use BMA to combine eleven diverse prediction methods that each estimate pKa values of amino acids in staphylococcal nuclease. These methods are based on work conducted for the pKa Cooperative, and the pKa measurements are based on experimental work conducted by the García-Moreno lab. Our cross-validation study demonstrates that the aggregated estimate obtained from BMA outperforms all individual prediction methods, with improvements ranging from 45 to 73% over other method classes. This study also compares BMA's predictive performance to other ensemble-based techniques and demonstrates that BMA can outperform these approaches, with improvements ranging from 27 to 60%. This work illustrates a new possible mechanism for improving the accuracy of pKa prediction and lays the foundation for future work on aggregate models that balance computational cost with prediction accuracy. Proteins 2014; 82:354-363.
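A toy illustration of ensemble averaging with likelihood-derived weights, in the spirit of BMA; the Gaussian error model, sigma value and data are assumptions for the sketch, not the paper's formulation.

```python
import numpy as np

def bma_weights(preds, truth, sigma=1.0):
    """Weight each predictor by its (Gaussian) marginal likelihood on
    training data, then normalize - a simplified BMA-style weighting."""
    # preds: methods x cases matrix of predicted pKa values
    loglik = -0.5 * ((preds - truth) ** 2 / sigma**2).sum(axis=1)
    w = np.exp(loglik - loglik.max())    # subtract max for numerical stability
    return w / w.sum()

rng = np.random.default_rng(6)
truth = rng.uniform(2, 10, 30)                                  # measured pKa values
preds = truth + rng.normal(0, [[0.3], [0.6], [1.2]], (3, 30))   # three methods, rising error

w = bma_weights(preds, truth)
print(w, np.abs(w @ preds - truth).mean())  # aggregate tracks the best methods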

14.
Data-independent acquisition (DIA) is a high-throughput, unbiased mass spectrometry data acquisition method. It offers highly reproducible quantification and is well suited to low-abundance proteins, which has made it one of the methods of choice for large-cohort proteomic studies in recent years. Because the MS2 spectra produced by DIA are mixed (chimeric) spectra containing fragment-ion information from multiple peptides, protein identification and quantification are more difficult. Current DIA data analysis methods fall into two broad categories: peptide-centric and spectrum-centric. Peptide-centric analysis provides more sensitive identification and more accurate quantification, and it has become the mainstream approach for interpreting DIA data. Its workflow comprises four key steps: spectral library construction, extraction of chromatographic peak groups, feature scoring, and quality control of the results. This review surveys the peptide-centric DIA data analysis workflow, introduces the analysis software built on this workflow and related benchmarking studies, summarizes existing algorithmic improvements, and concludes with an outlook on future directions.
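As an illustration of the feature-scoring step, one simple sub-score compares library fragment intensities with an extracted peak group; real peptide-centric tools combine many such sub-scores and control the FDR. This is a simplified sketch, not any specific tool's scoring function.

```python
import numpy as np

def score_peak_group(library_intensities, extracted_intensities):
    """Toy peptide-centric sub-score: cosine similarity between a library
    spectrum's fragment intensities and the fragment-ion peak areas extracted
    from DIA data at the expected retention time."""
    a = np.asarray(library_intensities, float)
    b = np.asarray(extracted_intensities, float)
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

lib = [100, 80, 55, 30, 20, 10]   # relative intensities of 6 library fragments
good = [95, 85, 50, 28, 25, 8]    # cleanly co-eluting peak group
bad = [10, 90, 5, 60, 2, 80]      # interference-dominated peak group
print(score_peak_group(lib, good), score_peak_group(lib, bad))
```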

15.
Perhaps the most important recent advance in species delimitation has been the development of model‐based approaches to objectively diagnose species diversity from genetic data. Additionally, the growing accessibility of next‐generation sequence data sets provides powerful insights into genome‐wide patterns of divergence during speciation. However, applying complex models to large data sets is time‐consuming and computationally costly, requiring careful consideration of the influence of both individual and population sampling, as well as the number and informativeness of loci on species delimitation conclusions. Here, we investigated how locus number and information content affect species delimitation results for an endangered Mexican salamander species, Ambystoma ordinarium. We compared results for an eight‐locus, 137‐individual data set and an 89‐locus, seven‐individual data set. For both data sets, we used species discovery methods to define delimitation models and species validation methods to rigorously test these hypotheses. We also used integrated demographic model selection tools to choose among delimitation models, while accounting for gene flow. Our results indicate that while cryptic lineages may be delimited with relatively few loci, sampling larger numbers of loci may be required to ensure that enough informative loci are available to accurately identify and validate shallow‐scale divergences. These analyses highlight the importance of striking a balance between dense sampling of loci and individuals, particularly in shallowly diverged lineages. They also suggest the presence of a currently unrecognized, endangered species in the western part of A. ordinarium's range.

16.
17.
Maximum entropy (MaxEnt) modelling, as implemented in the Maxent software, has rapidly become one of the most popular methods for distribution modelling. Originally, MaxEnt was described as a machine-learning method. More recently, it has been explained from principles of Bayesian estimation. MaxEnt offers numerous options (variants of the method) and settings (tuning of parameters) to the user. A widespread practice of accepting the Maxent software's default options and settings has become established, most likely because of ecologists' lack of familiarity with machine-learning and Bayesian statistical concepts and the ease with which the default models are obtained in Maxent. However, these defaults have been shown, in many cases, to be suboptimal, and exploration of alternatives has repeatedly been called for. In this paper, we derive MaxEnt from strict maximum likelihood principles and point out parallels between MaxEnt and standard modelling tools like generalised linear models (GLM). Furthermore, we describe several new options opened by this new derivation of MaxEnt, which may improve MaxEnt practice. The most important of these is the option of selecting variables by subset selection methods instead of the ℓ1-regularisation method, which is currently the Maxent software default. Other new options include: incorporation of new transformations of explanatory variables and user control of the transformation process; improved variable contribution measures and options for variation partitioning; and improved output prediction formats. The new options are exemplified for a data set for the plant species Scorzonera humilis in SE Norway, which was analysed by the standard MaxEnt procedure in a previously published paper. We recommend that thorough comparisons between the proposed alternative options and the default procedures and variants thereof be carried out.
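The contrast between the two variable-selection routes can be sketched with a logistic GLM in scikit-learn. Maxent itself fits a different likelihood, so this is an analogy on synthetic presence/background data, not the paper's procedure.

```python
import numpy as np
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(7)
X = rng.normal(size=(300, 8))                    # environmental covariates
logit = X[:, 0] - 2 * X[:, 1] + 0.5 * X[:, 2]
y = rng.random(300) < 1 / (1 + np.exp(-logit))   # presence/background labels

# Route 1: l1-regularisation drops variables by shrinking coefficients to zero
l1 = LogisticRegression(penalty="l1", solver="liblinear", C=0.5).fit(X, y)
print("l1 keeps:", np.flatnonzero(l1.coef_ != 0))

# Route 2: explicit forward subset selection over an unpenalised GLM
glm = LogisticRegression(penalty=None)  # penalty=None needs scikit-learn >= 1.2
sfs = SequentialFeatureSelector(glm, n_features_to_select=3,
                                direction="forward").fit(X, y)
print("subset keeps:", np.flatnonzero(sfs.get_support()))
```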

18.
Pilot studies are often used to help design ecological studies. Ideally the pilot data are incorporated into the full-scale study data, but if the pilot study's results indicate a need for major changes to experimental design, then pooling pilot and full-scale study data is difficult. The default position is to disregard the preliminary data. But ignoring pilot study data after a more comprehensive study has been completed forgoes statistical power or costs more by sampling additional data equivalent to the pilot study's sample size. With Bayesian methods, pilot study data can be used as an informative prior for a model built from the full-scale study dataset. We demonstrate a Bayesian method for recovering information from otherwise unusable pilot study data with a case study on eucalypt seedling mortality. A pilot study of eucalypt tree seedling mortality was conducted in southeastern Australia in 2005. A larger study with a modified design was conducted the following year. The two datasets differed substantially, so they could not easily be combined. Posterior estimates from pilot dataset model parameters were used to inform a model for the second larger dataset. Model checking indicated that incorporating prior information maintained the predictive capacity of the model with respect to the training data. Importantly, adding prior information improved model accuracy in predicting a validation dataset. Adding prior information increased the precision and the effective sample size for estimating the average mortality rate. We recommend that practitioners move away from the default position of discarding pilot study data when they are incompatible with the form of their full-scale studies. More generally, we recommend that ecologists should use informative priors more frequently to reap the benefits of the additional data.
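For a single mortality rate, the approach reduces to a conjugate Beta-Binomial update in which the pilot posterior becomes the full-scale prior; the counts below are hypothetical, not the eucalypt study's data.

```python
from scipy import stats

# Pilot study: 12 of 80 seedlings died -> posterior Beta(1+12, 1+68) from a flat prior
pilot_dead, pilot_n = 12, 80
a, b = 1 + pilot_dead, 1 + (pilot_n - pilot_dead)

# Full-scale study: 55 of 400 died, analysed with the pilot posterior as prior
full_dead, full_n = 55, 400

flat = stats.beta(1 + full_dead, 1 + full_n - full_dead)  # discarding the pilot
informed = stats.beta(a + full_dead, b + (full_n - full_dead))
print("flat prior:  %.3f +/- %.3f" % (flat.mean(), flat.std()))
print("pilot prior: %.3f +/- %.3f" % (informed.mean(), informed.std()))
# The informative prior tightens the estimate, exactly as if the pilot
# samples had been pooled into the full-scale dataset.
```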

19.
The MSE acquisition method (where MSE denotes alternating low-energy (MS) and elevated-energy (E) scans) commercialized by Waters on its Q-TOF instruments is regarded as a unique data-independent fragmentation approach that improves the accuracy and dynamic range of label-free proteomic quantitation. Due to their special format, MSE acquisition files cannot be independently analyzed with most widely used open-source proteomic software, which is specialized for processing data-dependent acquisition files. In this study, we established a workflow integrating Skyline, a popular and versatile peptide-centric quantitation program, with the statistical tool DiffProt to perform MSE-based proteomic quantitation. Comparison with the vendor software package for analyzing targeted phosphopeptide and global proteomic datasets reveals distinct advantages of Skyline in MSE data mining, including sensitive peak detection, flexible peptide filtering, and a transparent step-by-step workflow. Moreover, we developed a new procedure by which Skyline MS1 filtering was extended to small-molecule quantitation for the first time. This new utility of Skyline was examined in a protein-ligand interaction experiment to identify multiple chemical compounds specifically bound to NDM-1 (New Delhi metallo-β-lactamase 1), an antibiotic-resistance target. Further improvement of the current weaknesses in Skyline MS1 filtering is expected to enhance the reliability of this powerful program in full-scan-based quantitation of both peptides and small molecules.

20.