Similar literature
20 similar documents found (search time: 31 ms)
1.
Boosting for tumor classification with gene expression data (total citations: 7; self-citations: 0; citations by others: 7)
MOTIVATION: Microarray experiments generate large datasets with expression values for thousands of genes but not more than a few dozen samples. Accurate supervised classification of tissue samples in such high-dimensional problems is difficult but often crucial for successful diagnosis and treatment. A promising way to meet this challenge is by using boosting in conjunction with decision trees. RESULTS: We demonstrate that the generic boosting algorithm needs some modification to become an accurate classifier in the context of gene expression data. In particular, we present a feature preselection method, a more robust boosting procedure and a new approach for multi-categorical problems. This allows for a slight to drastic increase in performance and yields competitive results on several publicly available datasets. AVAILABILITY: Software for the modified boosting algorithms as well as for decision trees is available for free in R at http://stat.ethz.ch/~dettling/boosting.html.
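The combination of feature preselection and boosting with small trees described above can be illustrated with a minimal pure-Python sketch. This is not the authors' R software: the abstract names neither the preselection criterion nor the boosting variant, so the rank-based separation score, the plain discrete AdaBoost loop, and all function names (`preselect`, `best_stump`, `adaboost`, `predict`) are assumptions made for illustration only.

```python
import math

def preselect(X, y, k):
    """Keep the k features whose values best separate the two classes.

    Assumed criterion: a Mann-Whitney-style count of (positive, negative)
    sample pairs in which the positive sample has the larger value.
    """
    def separation(j):
        pos = [row[j] for row, label in zip(X, y) if label == 1]
        neg = [row[j] for row, label in zip(X, y) if label == -1]
        u = sum(1.0 if a > b else 0.5 if a == b else 0.0
                for a in pos for b in neg)
        # Scores far from the midpoint indicate strong separation.
        return abs(u - len(pos) * len(neg) / 2)
    return sorted(range(len(X[0])), key=separation, reverse=True)[:k]

def stump_predict(stump, row):
    _, j, threshold, sign = stump
    return sign if row[j] > threshold else -sign

def best_stump(X, y, w, features):
    """Exhaustively search for the one-split tree with lowest weighted error."""
    best = None
    for j in features:
        for threshold in sorted({row[j] for row in X}):
            for sign in (1, -1):
                candidate = (0.0, j, threshold, sign)
                err = sum(wi for row, yi, wi in zip(X, y, w)
                          if stump_predict(candidate, row) != yi)
                if best is None or err < best[0]:
                    best = (err, j, threshold, sign)
    return best

def adaboost(X, y, features, rounds=10):
    """Discrete AdaBoost over decision stumps; labels must be +1 or -1."""
    n = len(X)
    w = [1.0 / n] * n
    ensemble = []
    for _ in range(rounds):
        stump = best_stump(X, y, w, features)
        err = max(stump[0], 1e-10)  # avoid division by zero on perfect splits
        if err >= 0.5:
            break
        alpha = 0.5 * math.log((1 - err) / err)
        ensemble.append((alpha, stump))
        # Increase the weight of misclassified samples, then renormalise.
        w = [wi * math.exp(-alpha * yi * stump_predict(stump, row))
             for row, yi, wi in zip(X, y, w)]
        total = sum(w)
        w = [wi / total for wi in w]
    return ensemble

def predict(ensemble, row):
    score = sum(alpha * stump_predict(stump, row) for alpha, stump in ensemble)
    return 1 if score >= 0 else -1
```

With a toy expression matrix of four samples and three genes, `preselect(X, y, 1)` keeps only the gene that separates the classes, and the boosted stumps are then fitted on that gene alone. Restricting the stump search to preselected genes is what keeps the exhaustive search affordable when there are thousands of genes and only a few dozen samples.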

2.
MOTIVATION: Advances in techniques to sparsely label neurons unlock the potential to reconstruct connectivity from 3D image stacks acquired by light microscopy. We present an application for semi-automated tracing of neurons to quickly annotate noisy datasets and construct complex neuronal topologies, which we call the Simple Neurite Tracer. AVAILABILITY: Simple Neurite Tracer is open source software, licensed under the GNU General Public Licence (GPL) and based on the public domain image processing software ImageJ. The software and further documentation are available via http://fiji.sc/Simple_Neurite_Tracer as part of the package Fiji, and can be used on Windows, Mac OS and Linux. Documentation and introductory screencasts are available at the same URL. CONTACT: longair@ini.phys.ethz.ch.

3.
This paper examines Charles Darwin's idea that language-use and humanity's unique cognitive abilities reinforced each other's evolutionary emergence, an idea Darwin sketched in his early notebooks, set forth in his Descent of man (1871), and qualified in Descent's second (1874) edition. Darwin understood this coevolution process in essentially Lockean terms, based on John Locke's hints about the way language shapes thinking itself. Ironically, the linguist Friedrich Max Müller attacked Darwin's human descent theory by invoking a similar thesis, the German romantic notion of an identity between language and thought. Although Darwin avoided outright contradiction, when he came to defend himself against Müller's attacks, he undercut some of his own argumentation in favor of the coevolution idea. That is, he found it difficult to counter Müller's argument while also making a case for coevolution. Darwin's efforts in this area were further complicated by British and American writers who held a naturalistic view of speech origins yet still taught that language had been invented by fully evolved Homo sapiens, thus denying coevolution.

4.
In the second chapter of The descent of man (1871), Charles Darwin interrupted his discussion of the evolutionary origins of language to describe ten ways in which the formation of languages and of biological species were 'curiously' similar. I argue that these comparisons served mainly as analogies in which linguistic processes stood for aspects of biological evolution. Darwin used these analogies to recapitulate themes from On the origin of species (1859), including common descent, genealogical classification, the struggle for existence, and natural selection, among others. Skeptical of this interpretation, Gregory Radick sees the naturalistic account of language formation in the Descent comparisons as reinforcing Darwin's idea that languages and the races of mankind have both undergone progressive development. (The opposite view was that modern-day primitive peoples had degenerated from an originally civilized condition.) Yet the details of Darwin's language-species comparisons, as well as the polemical context in which they appear, show that they were not aimed at so limited a function. Rather, they addressed issues related to species transmutation in general.

5.
Proteins can be identified using a set of peptide fragment weights produced by a specific digestion to search a protein database in which sequences have been replaced by fragment weights calculated for various cleavage methods. We present a method using multidimensional searches that greatly increases the confidence level for identification, allowing DNA sequence databases to be examined. This method provides a link between 2-dimensional gel electrophoresis protein databases and genome sequencing projects. Moreover, the increased confidence level allows unknown proteins to be matched to expressed sequence tags, potentially eliminating the need to obtain sequence information for cloning. Database searching from a mass profile is offered as a free service by an automatic server at the ETH, Zürich. For information, send an electronic message to the address cbrg@inf.ethz.ch with the line: help mass search, or help all.
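The basic mass-profile search described above (digest every database sequence in silico, compute fragment weights, and count agreements with the measured weights) can be sketched as follows. This is an illustrative sketch only, not the ETH server's method: it assumes a simplified trypsin rule (cleave after K or R unless followed by P), monoisotopic residue masses, and a simple match-count score, and it does not attempt the multidimensional search. The function names and the toy database are hypothetical.

```python
# Monoisotopic residue masses in daltons (standard values).
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.01056  # added once per peptide for the free termini

def tryptic_digest(sequence):
    """Split a protein sequence after K or R, except when followed by P."""
    fragments, start = [], 0
    for i, residue in enumerate(sequence):
        next_residue = sequence[i + 1] if i + 1 < len(sequence) else ""
        if residue in "KR" and next_residue != "P":
            fragments.append(sequence[start:i + 1])
            start = i + 1
    if start < len(sequence):
        fragments.append(sequence[start:])
    return fragments

def peptide_mass(peptide):
    return sum(RESIDUE_MASS[residue] for residue in peptide) + WATER

def count_matches(observed, sequence, tolerance=0.5):
    """Number of observed masses explained by some theoretical fragment."""
    theoretical = [peptide_mass(p) for p in tryptic_digest(sequence)]
    return sum(1 for mass in observed
               if any(abs(mass - t) <= tolerance for t in theoretical))

def rank_database(observed, database, tolerance=0.5):
    """Return (name, score) pairs, best-matching protein first."""
    scores = {name: count_matches(observed, seq, tolerance)
              for name, seq in database.items()}
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)

# Hypothetical two-entry database and a measured mass profile.
database = {"protA": "MKAPRGGK", "protB": "GGGGGGK"}
observed = [peptide_mass("MK"), peptide_mass("APR")]
ranking = rank_database(observed, database)
```

A real search would precompute the fragment weights for the whole database once per cleavage method, as the abstract describes, rather than digesting each sequence on every query.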

6.
SUMMARY: Besides classical clustering methods such as hierarchical clustering, in recent years biclustering has become a popular approach to analyze biological data sets, e.g. gene expression data. The Biclustering Analysis Toolbox (BicAT) is a software platform for clustering-based data analysis that integrates various biclustering and clustering techniques within a common graphical user interface. Furthermore, BicAT provides different facilities for data preparation, inspection and postprocessing such as discretization, filtering of biclusters according to specific criteria or gene pair analysis for constructing gene interconnection graphs. The possibility to use different biclustering algorithms inside a single graphical tool allows the user to compare clustering results and choose the algorithm that best fits a specific biological scenario. The toolbox is described in the context of gene expression analysis, but is also applicable to other types of data, e.g. data from proteomics or synthetic lethal experiments. AVAILABILITY: The BicAT toolbox is freely available at http://www.tik.ee.ethz.ch/sop/bicat and runs on all operating systems. The Java source code of the program and a developer's guide are provided on the website as well. Therefore, users may modify the program and add further algorithms or extensions.

7.
Different plant plastid types contain a distinct protein complement for specialized functions and metabolic activities. plprot was established as a plastid proteome database to provide information about the proteomes of chloroplasts, etioplasts and undifferentiated plastids. The current version of plprot features 2,043 protein entries and consists of two modules. Module one contains a BLAST search option and provides comparative information on the proteomes of different plastid types. The second module contains four searchable databases, three for each individual plastid type and one comprehensive composite database that provides the results of plastid proteome analyses from different laboratories. plprot is accessible at http://www.plprot.ethz.ch.

8.
The 67th Discussion Forum on Life Cycle Assessment (LCA), organised by partners of the European project RELIEF (RELIability of product Environmental Footprints), focused on methods for better understanding the impacts of land use linked to agricultural value chains. The first session of the forum was dedicated to methods that help in retrospective tracking of land use within complex supply chains. Novel approaches were presented for the integration of increasingly available spatially located land use data into LCA. The second session focused on forward-looking projections of land use change and included emerging, predictive methods for the modelling of land change. The third session considered impact assessment methods related to the use of land and their application together with land change modelling approaches. Discussions throughout the day centred on opportunities and challenges arising from integrating spatially located land use information into Life Cycle Assessment. Increasing amounts of spatially located land use data are becoming available and this could potentially increase the robustness and specificity of Life Cycle Assessment. However, the use of such data can be computationally expensive and requires the development of skills (i.e. use of geographical information systems (GIS) and model coding) within the LCA community. Land change modelling and ecosystem service modelling are associated with considerable uncertainty which must be communicated appropriately to stakeholders and decision-makers when interpreting results from an LCA. The new approaches were found to challenge aspects of the traditional LCA approach, particularly the division between the life cycle inventory and impact assessment and the assumption of linearity between scale and impacts when deriving characterisation factors.
The presentations from the DF-67 are available for download (www.lcaforum.ch), and video recordings can be accessed online (http://www.video.ethz.ch/events/lca/2017/autumn/67th.html).

9.
Designed peptides that bind to major histocompatibility protein I (MHC-I) allomorphs bear the promise of representing epitopes that stimulate a desired immune response. A rigorous bioinformatical exploration of sequence patterns hidden in peptides that bind to the mouse MHC-I allomorph H-2Kb is presented. We exemplify and validate these motif findings by systematically dissecting the epitope SIINFEKL and analyzing the resulting fragments for their binding potential to H-2Kb in a thermal denaturation assay. The results demonstrate that only fragments exclusively retaining the carboxy- or amino-terminus of the reference peptide exhibit significant binding potential, with the N-terminal pentapeptide SIINF as shortest ligand. This study demonstrates that sophisticated machine-learning algorithms excel at extracting fine-grained patterns from peptide sequence data and predicting MHC-I binding peptides, thereby considerably extending existing linear prediction models and providing a fresh view on the computer-based molecular design of future synthetic vaccines. The server for prediction is available at http://modlab-cadd.ethz.ch (SLiDER tool, MHC-I version 2012).

10.
In computational evolutionary biology, verification and benchmarking are challenging because the evolutionary history of studied biological entities is usually not known. Computer programs for simulating sequence evolution in silico have been shown to be viable test beds for verifying newly developed methods and comparing different algorithms. However, current simulation packages tend to focus either on gene-level aspects of genome evolution such as character substitutions and insertions and deletions (indels) or on genome-level aspects such as genome rearrangement and speciation events. Here, we introduce Artificial Life Framework (ALF), which aims at simulating the entire range of evolutionary forces that act on genomes: nucleotide, codon, or amino acid substitution (under simple or mixture models), indels, GC-content amelioration, gene duplication, gene loss, gene fusion, gene fission, genome rearrangement, lateral gene transfer (LGT), or speciation. The other distinctive feature of ALF is its user-friendly yet powerful web interface. We illustrate the utility of ALF with two possible applications: 1) we reanalyze data from a study of selection after globin gene duplication and test the statistical significance of the original conclusions and 2) we demonstrate that LGT can dramatically decrease the accuracy of two well-established orthology inference methods. ALF is available as a stand-alone application or via a web interface at http://www.cbrg.ethz.ch/alf.

11.

Background

With next-generation sequencing technologies, experiments that were considered prohibitive only a few years ago are now possible. However, while these technologies have the ability to produce enormous volumes of data, the sequence reads are prone to error. This poses fundamental hurdles when genetic diversity is investigated.

Results

We developed ShoRAH, a computational method for quantifying genetic diversity in a mixed sample and for identifying the individual clones in the population, while accounting for sequencing errors. The software was run on simulated data and on real data obtained in wet lab experiments to assess its reliability.

Conclusions

ShoRAH is implemented in C++, Python, and Perl and has been tested under Linux and Mac OS X. Source code is available under the GNU General Public License at http://www.cbg.ethz.ch/software/shorah.

12.
MOTIVATION: Microarray experiments are expected to contribute significantly to the progress in cancer treatment by enabling a precise and early diagnosis. They create a need for class prediction tools, which can deal with a large number of highly correlated input variables, perform feature selection and provide class probability estimates that serve as a quantification of the predictive uncertainty. A very promising solution is to combine the two ensemble schemes bagging and boosting into a novel algorithm called BagBoosting. RESULTS: When bagging is used as a module in boosting, the resulting classifier consistently improves the predictive performance and the probability estimates of both bagging and boosting on real and simulated gene expression data. This quasi-guaranteed improvement can be obtained by simply making a bigger computing effort. The advantageous predictive potential is also confirmed by comparing BagBoosting to several established class prediction tools for microarray data. AVAILABILITY: Software for the modified boosting algorithms, for benchmark studies and for the simulation of microarray data is available as an R package under GNU public license at http://stat.ethz.ch/~dettling/bagboost.html.

13.
We present AUDENS, a new platform-independent open source tool for automated de novo sequencing of peptides from MS/MS data. We implemented a dynamic programming algorithm and combined it with a flexible preprocessing module which is designed to distinguish between signal and other peaks. By applying a user-defined set of heuristics, AUDENS screens through the spectrum and assigns high relevance values to putative signal peaks. The algorithm constructs a sequence path through the MS/MS spectrum using the peak relevances to score each suggested sequence path, i.e., the corresponding amino acid sequence. At present, we consider AUDENS a prototype that unfolds its biggest potential if used in parallel with other de novo sequencing tools. AUDENS is available as open source and can be downloaded with further documentation at http://www.ti.inf.ethz.ch/pw/software/audens/.

14.
15.
Cell surface proteins are major targets of biomedical research due to their utility as cellular markers and their extracellular accessibility for pharmacological intervention. However, information about the cell surface protein repertoire (the surfaceome) of individual cells is only sparsely available. Here, we applied the Cell Surface Capture (CSC) technology to 41 human and 31 mouse cell types to generate a mass-spectrometry-derived Cell Surface Protein Atlas (CSPA) providing cellular surfaceome snapshots at high resolution. The CSPA is presented in the form of an easy-to-navigate interactive database, a downloadable data matrix and with tools for targeted surfaceome rediscovery (http://wlab.ethz.ch/cspa). The cellular surfaceome snapshots of different cell types, including cancer cells, resulted in a combined dataset of 1492 human and 1296 mouse cell surface glycoproteins, providing experimental evidence for their cell surface expression on different cell types, including 136 G-protein coupled receptors and 75 membrane receptor tyrosine-protein kinases. Integrated analysis of the CSPA reveals that the concerted biological function of individual cell types is mainly guided by quantitative rather than qualitative surfaceome differences. The CSPA will be useful for the evaluation of drug targets, for the improved classification of cell types and for a better understanding of the surfaceome and its concerted biological functions in complex signaling microenvironments.

16.
In his book The descent of man (1871), Charles Darwin paid tribute to a trio of writers (Hensleigh Wedgwood, F. W. Farrar, and August Schleicher) who offered naturalistic explanations of the origin of language. Darwin's concurrence with these figures was limited, however, because each of them denied some aspect of his thesis that the evolution of language had been coeval with and essential to the emergence of humanity's characteristic mental traits. Darwin first sketched out this thesis in his theoretical notebooks of the 1830s and then clarified his position in Descent, where he argued that mind-language coevolution had occurred prior to the rise of distinct racial groups. He thus opposed the view of August Schleicher and Ernst Haeckel, who (along with Alfred Russel Wallace) taught that speech had originated subsequent to the geographical and racial dispersion of humanity's ancestors. As Darwin argued in Descent, this quasi-polygenetic version of coevolution was unable to explain primeval man's initial dominance over rival ape-like populations. Drawing inspiration from British anthropologists, Darwin made the early development of language, hence mental monogenesis, central to his account of human evolution.

17.
The potential effectiveness of statistical haplotype inference has made it an area of active exploration over the last decade. There are several complications of statistical inference, including: the same algorithm can produce different solutions for the same data set, which reflects the internal algorithm variability; different algorithms can give different solutions for the same data set, reflecting the discordance among algorithms; and the algorithms per se are unable to evaluate the reliability of the solutions even if they are unique, this being a general limitation of all inference methods. With the aim of increasing the confidence of statistical inference results, consensus strategy appears to be an effective means to deal with these problems. Several authors have explored this with different emphases. Here we discuss two recent studies examining the internal algorithm variability and among-algorithm discordance, respectively, and evaluate the different outcomes of these analyses, in light of Orzack's (2009) comment. Until other, better methods are developed, a combination of these two approaches should provide a practical way to increase the confidence of statistical haplotyping results.
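One simple way to picture a consensus strategy over repeated runs or over several algorithms is a per-individual majority vote, with the share of agreeing solutions reported as a rough confidence value. The abstract does not specify the consensus rule used in the studies it discusses, so the voting scheme, the function name `consensus_haplotypes`, and the toy data below are assumptions for illustration.

```python
from collections import Counter

def consensus_haplotypes(solutions):
    """Combine several haplotype solutions by per-individual majority vote.

    solutions: list of dicts, each mapping an individual's id to a pair of
    inferred haplotypes (one dict per run or per algorithm).
    Returns a dict mapping each id to (consensus_pair, agreement_fraction).
    """
    consensus = {}
    for individual in solutions[0]:
        # Sort each pair so that (h1, h2) and (h2, h1) count as the same call.
        votes = Counter(tuple(sorted(s[individual])) for s in solutions)
        pair, count = votes.most_common(1)[0]
        consensus[individual] = (pair, count / len(solutions))
    return consensus

# Three hypothetical runs on the same genotype data.
runs = [
    {"ind1": ("AC", "GT"), "ind2": ("AA", "GG")},
    {"ind1": ("GT", "AC"), "ind2": ("AA", "GG")},
    {"ind1": ("AT", "GC"), "ind2": ("AA", "GG")},
]
result = consensus_haplotypes(runs)
```

Here `ind1` receives the pair supported by two of three runs, and the lower agreement fraction flags it as less reliable than `ind2`, on which all runs agree. The same function applies whether the solutions come from reruns of one algorithm or from different algorithms, which mirrors the two sources of variability the abstract distinguishes.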

18.
We have analyzed proteome dynamics during light-induced development of rice (Oryza sativa) chloroplasts from etioplasts using quantitative two-dimensional gel electrophoresis and tandem mass spectrometry protein identification. In the dark, the etioplast allocates the main proportion of total protein mass to carbohydrate and amino acid metabolism and a surprisingly high number of proteins to the regulation and expression of plastid genes. Chaperones, proteins for photosynthetic energy metabolism, and enzymes of the tetrapyrrole pathway were identified among the most abundant etioplast proteins. The detection of 13 N-terminal acetylated peptides allowed us to map the exact localization of the transit peptide cleavage site, demonstrating good agreement with the prediction for most proteins. Based on the quantitative etioplast proteome map, we examined early light-induced changes during chloroplast development. The transition from heterotrophic metabolism to photosynthesis-supported autotrophic metabolism was already detectable 2 h after illumination and affected most essential metabolic modules. Enzymes in carbohydrate metabolism, photosynthesis, and gene expression were up-regulated, whereas enzymes in amino acid and fatty acid metabolism were significantly decreased in relative abundance. Enzymes involved in nucleotide metabolism, tetrapyrrole biosynthesis, and redox regulation remained unchanged. Phosphoprotein-specific staining at different time points during chloroplast development revealed light-induced phosphorylation of a nuclear-encoded plastid RNA-binding protein, consistent with changes in plastid RNA metabolism. Quantitative information about all identified proteins and their regulation by light is available in plprot, the plastid proteome database (http://www.plprot.ethz.ch).

19.
Nested effects models have been used successfully for learning subcellular networks from high-dimensional perturbation effects that result from RNA interference (RNAi) experiments. Here, we further develop the basic nested effects model using high-content single-cell imaging data from RNAi screens of cultured cells infected with human rhinovirus. RNAi screens with single-cell readouts are becoming increasingly common, and they often reveal high cell-to-cell variation. As a consequence of this cellular heterogeneity, knock-downs result in variable effects among cells and lead to weak average phenotypes on the cell population level. To address this confounding factor in network inference, we explicitly model the stimulation status of a signaling pathway in individual cells. We extend the framework of nested effects models to probabilistic combinatorial knock-downs and propose NEMix, a nested effects mixture model that accounts for unobserved pathway activation. We analyzed the identifiability of NEMix and developed a parameter inference scheme based on the Expectation Maximization algorithm. In an extensive simulation study, we show that NEMix improves learning of pathway structures over classical NEMs significantly in the presence of hidden pathway stimulation. We applied our model to single-cell imaging data from RNAi screens monitoring human rhinovirus infection, where limited infection efficiency of the assay results in uncertain pathway stimulation. Using a subset of genes with known interactions, we show that the inferred NEMix network has high accuracy and outperforms the classical nested effects model without hidden pathway activity. NEMix is implemented as part of the R/Bioconductor package ‘nem’ and available at www.cbg.ethz.ch/software/NEMix.

20.
MOTIVATION: Accurate time series for biological processes are difficult to estimate due to problems of synchronization, temporal sampling and rate heterogeneity. Methods are needed that can utilize multi-dimensional data, such as those resulting from DNA microarray experiments, in order to reconstruct time series from unordered or poorly ordered sets of observations. RESULTS: We present a set of algorithms for estimating temporal orderings from unordered sets of sample elements. The techniques we describe are based on modifications of a minimum-spanning tree calculated from a weighted, undirected graph. We demonstrate the efficacy of our approach by applying these techniques to an artificial data set as well as several gene expression data sets derived from DNA microarray experiments. In addition to estimating orderings, the techniques we describe also provide useful heuristics for assessing relevant properties of sample datasets such as noise and sampling intensity, and we show how a data structure called a PQ-tree can be used to represent uncertainty in a reconstructed ordering. AVAILABILITY: Academic implementations of the ordering algorithms are available as source code (in the programming language Python) on our web site, along with documentation on their use. The artificial 'jelly roll' data set upon which the algorithm was tested is also available from this web site. The publicly available gene expression data may be found at http://genome-www.stanford.edu/cellcycle/ and http://caulobacter.stanford.edu/CellCycle/.
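The idea of deriving an ordering from a minimum-spanning tree can be sketched in a few lines. The abstract does not say which modifications of the tree the authors apply, so this sketch makes two assumptions: samples are compared by Euclidean distance, and the ordering is read off the longest path through the tree (its diameter path). It is not the authors' Python implementation, and the function names are hypothetical.

```python
import math

def minimum_spanning_tree(points):
    """Prim's algorithm on the complete graph of Euclidean distances.

    Returns an adjacency dict: node -> list of (neighbor, weight).
    """
    n = len(points)
    adjacency = {i: [] for i in range(n)}
    # For every node outside the tree: (cheapest connection, tree node).
    best = {j: (math.dist(points[0], points[j]), 0) for j in range(1, n)}
    while best:
        j = min(best, key=lambda k: best[k][0])
        weight, parent = best.pop(j)
        adjacency[parent].append((j, weight))
        adjacency[j].append((parent, weight))
        for k in best:
            d = math.dist(points[j], points[k])
            if d < best[k][0]:
                best[k] = (d, j)
    return adjacency

def farthest_path(adjacency, start):
    """Path from start to the node with the greatest tree distance from it."""
    stack = [(start, None, 0.0, [start])]
    best_length, best_path = -1.0, []
    while stack:
        node, parent, length, path = stack.pop()
        if length > best_length:
            best_length, best_path = length, path
        for neighbor, weight in adjacency[node]:
            if neighbor != parent:
                stack.append((neighbor, node, length + weight, path + [neighbor]))
    return best_path

def diameter_ordering(points):
    """Order samples along the longest path of their minimum-spanning tree."""
    adjacency = minimum_spanning_tree(points)
    # Two sweeps find a tree's diameter: farthest node from any start,
    # then the farthest path from that node.
    end = farthest_path(adjacency, 0)[-1]
    return farthest_path(adjacency, end)

# Four shuffled samples that actually lie along a line.
samples = [(2.0, 0.0), (0.0, 0.0), (3.0, 0.0), (1.0, 0.0)]
ordering = diameter_ordering(samples)
```

For these collinear samples the diameter path visits every sample in order of position, up to direction, since a spanning tree carries no information about which end is earliest. With noisy data, samples that fall on side branches of the tree are not placed by this sketch; the abstract indicates that the authors represent such uncertainty with a PQ-tree.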
