Similar articles
20 similar articles found.
1.
Anomaly detection is the process of identifying unexpected items or events in datasets which differ from the norm. In contrast to standard classification tasks, anomaly detection is often applied to unlabeled data, taking only the internal structure of the dataset into account. This challenge is known as unsupervised anomaly detection and arises in many practical applications, for example in network intrusion detection, fraud detection, and in the life science and medical domains. Dozens of algorithms have been proposed in this area, but unfortunately the research community still lacks a comparative universal evaluation as well as common publicly available datasets. These shortcomings are addressed in this study, where 19 different unsupervised anomaly detection algorithms are evaluated on 10 different datasets from multiple application domains. By publishing the source code and the datasets, this paper aims to be a new well-founded basis for unsupervised anomaly detection research. Additionally, this evaluation reveals the strengths and weaknesses of the different approaches for the first time. Besides anomaly detection performance, computational effort, the impact of parameter settings, and global/local anomaly detection behavior are outlined. In conclusion, we give advice on algorithm selection for typical real-world tasks.
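A minimal sketch of the unsupervised setting described above, using two off-the-shelf scikit-learn detectors on synthetic unlabeled data; the data, parameters, and ranking step are illustrative assumptions, not the paper's benchmark protocol.

```python
# Hypothetical illustration: scoring unlabeled data with two unsupervised detectors.
import numpy as np
from sklearn.ensemble import IsolationForest
from sklearn.neighbors import LocalOutlierFactor

rng = np.random.default_rng(0)
# 500 "normal" points plus 10 injected anomalies (unlabeled at detection time).
normal = rng.normal(0.0, 1.0, size=(500, 2))
anomalies = rng.uniform(-6, 6, size=(10, 2))
X = np.vstack([normal, anomalies])

# Global detector: Isolation Forest assigns lower scores to easily isolated points.
iso = IsolationForest(random_state=0).fit(X)
iso_scores = -iso.score_samples(X)          # higher = more anomalous

# Local detector: LOF compares each point's density to that of its neighbours.
lof = LocalOutlierFactor(n_neighbors=20)
lof.fit(X)
lof_scores = -lof.negative_outlier_factor_  # higher = more anomalous

# Rank the ten most anomalous points under each detector.
print("IsolationForest top-10:", np.argsort(iso_scores)[-10:])
print("LOF top-10:", np.argsort(lof_scores)[-10:])
```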

2.
3.
As an emerging field, MS-based proteomics still requires software tools for efficiently storing and accessing experimental data. In this work, we focus on the management of LC–MS data, which are typically made available in standard XML-based portable formats. The structures currently employed to manage these data can be highly inefficient, especially when dealing with high-throughput profile data. LC–MS datasets are usually accessed through 2D range queries. Optimizing this type of operation could dramatically reduce the complexity of data analysis. We propose a novel data structure for LC–MS datasets, called mzRTree, which embodies a scalable index based on the R-tree data structure. mzRTree can be efficiently created from the XML-based data formats and is suitable for handling very large datasets. We experimentally show that, on all range queries, mzRTree outperforms other known structures used for LC–MS data, even on the queries those structures are optimized for. mzRTree is also more space efficient. As a result, mzRTree reduces data analysis computational costs for very large profile datasets.
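The 2D range-query access pattern that mzRTree targets can be sketched with the generic Python rtree package (libspatialindex bindings); the toy peaks and query window below are invented, and this is not mzRTree's own API.

```python
# Illustrative only: indexing (retention time, m/z) points in a generic R-tree
# and answering a 2D range query, the access pattern mzRTree is built for.
from rtree import index

# Toy "peaks": (retention_time, mz, intensity) tuples.
peaks = [(12.3, 400.1, 1e4), (12.4, 802.7, 5e3), (30.9, 400.2, 2e4), (55.0, 1200.5, 8e2)]

idx = index.Index()
for i, (rt, mz, _) in enumerate(peaks):
    # Points are stored as degenerate rectangles (min == max on both axes).
    idx.insert(i, (rt, mz, rt, mz))

# Range query: all peaks with 10 <= rt <= 35 and 390 <= m/z <= 410.
hits = list(idx.intersection((10.0, 390.0, 35.0, 410.0)))
for i in hits:
    print(peaks[i])
```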

4.
5.
MOTIVATION: The IntAct repository is one of the largest and most widely used databases for the curation and storage of molecular interaction data. These datasets need to be analyzed by computational methods. Software packages in the statistical environment R provide powerful tools for conducting such analyses. RESULTS: We introduce Rintact, a Bioconductor package that allows users to transform PSI-MI XML2.5 interaction data files from IntAct into R graph objects. On these, they can use methods from R and Bioconductor for a variety of tasks: determining cohesive subgraphs, computing summary statistics, fitting mathematical models to the data or rendering graphical layouts. Rintact provides a programmatic interface to the IntAct repository and allows the use of the analytic methods provided by R and Bioconductor. AVAILABILITY: Rintact is freely available at http://bioconductor.org
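Rintact itself is an R/Bioconductor package; as a hedged, language-neutral illustration of the workflow it enables (turning curated interaction pairs into a graph object and computing cohesive subgraphs and summary statistics), here is a networkx sketch with invented interactions.

```python
# Illustration only (not Rintact): turning a table of binary molecular
# interactions into a graph object and computing simple summaries.
import networkx as nx

# Hypothetical curated interactions (protein A, protein B).
interactions = [("P53", "MDM2"), ("MDM2", "MDM4"), ("P53", "EP300"), ("BRCA1", "BARD1")]

g = nx.Graph()
g.add_edges_from(interactions)

# Cohesive subgraphs: connected components of the interaction network.
components = [sorted(c) for c in nx.connected_components(g)]
print("components:", components)

# Summary statistics comparable to what one would compute on the R graph object.
print("nodes:", g.number_of_nodes(), "edges:", g.number_of_edges())
print("degrees:", dict(g.degree()))
```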

6.
BACKGROUND: The MapReduce framework enables scalable processing and analysis of large datasets by distributing the computational load over connected computer nodes, referred to as a cluster. In bioinformatics, MapReduce has already been adopted for various scenarios such as mapping next-generation sequencing data to a reference genome, finding SNPs from short read data or matching strings in genotype files. Nevertheless, tasks like installing and maintaining MapReduce on a cluster system, importing data into its distributed file system or executing MapReduce programs require advanced knowledge in computer science and could thus prevent scientists from using currently available and useful software solutions. RESULTS: Here we present Cloudgene, a freely available platform that improves the usability of MapReduce programs in bioinformatics by providing a graphical user interface for execution, for the import and export of data and for the reproducibility of workflows on in-house (private clouds) and rented clusters (public clouds). The aim of Cloudgene is to build a standardized graphical execution environment for currently available and future MapReduce programs, which can all be integrated by using its plug-in interface. Since Cloudgene can be executed on private clusters, sensitive datasets can be kept in house at all times and data transfer times are minimized. CONCLUSIONS: Our results show that MapReduce programs can be integrated into Cloudgene with little effort and without adding any computational overhead to existing programs. This platform gives developers the opportunity to focus on the actual implementation task and provides scientists with a platform that hides the complexity of MapReduce. In addition to MapReduce programs, Cloudgene can also be used to launch predefined systems (e.g. Cloud BioLinux, RStudio) in public clouds. Currently, five different bioinformatics programs using MapReduce and two systems are integrated and have been successfully deployed. Cloudgene is freely available at http://cloudgene.uibk.ac.at.
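For readers unfamiliar with the programming model Cloudgene wraps, the following is a minimal, single-machine Python sketch of the map, shuffle and reduce phases; the toy alignment records are invented and no Hadoop cluster or Cloudgene interface is involved.

```python
# Conceptual sketch of the MapReduce model: map, shuffle, reduce.
# Pure Python stand-in for a cluster run; counts reads per reference chromosome.
from collections import defaultdict
from itertools import chain

records = ["chr1\t1042", "chr1\t5300", "chr2\t77", "chrX\t901"]  # toy "alignments"

def map_phase(record):
    chrom = record.split("\t")[0]
    yield (chrom, 1)                      # emit (key, value) pairs

def shuffle(pairs):
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)         # group values by key
    return groups

def reduce_phase(key, values):
    return key, sum(values)               # aggregate each key's values

pairs = chain.from_iterable(map_phase(r) for r in records)
results = [reduce_phase(k, v) for k, v in shuffle(pairs).items()]
print(results)  # [('chr1', 2), ('chr2', 1), ('chrX', 1)]
```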

7.
Ecologists are increasingly asking large-scale and/or broad-scope questions that require vast datasets. In response, various top-down efforts and incentives have been implemented to encourage data sharing and integration. However, despite general consensus on the critical need for more open ecological data, several roadblocks still discourage compliance and participation in these projects; as a result, ecological data remain largely unavailable. Grassroots initiatives (i.e. efforts initiated and led by cohesive groups of scientists focused on specific goals) have thus far been overlooked as a powerful means to meet these challenges. These bottom-up collaborative data integration projects can play a crucial role in making high quality datasets available because they tackle the heterogeneity of ecological data at a scale where it is still manageable, all the while offering the support and structure to do so. These initiatives foster best practices in data management and provide tangible rewards to researchers who choose to invest time in sound data stewardship. By maintaining proximity between data generators and data users, grassroots initiatives improve data interpretation and ensure high-quality data integration while providing fair acknowledgement to data generators. We encourage researchers to formalize existing collaborations and to engage in local activities that improve the availability and distribution of ecological data. By fostering communication and interaction among scientists, we are convinced that grassroots initiatives can significantly support the development of global-scale data repositories. In doing so, these projects help address important ecological questions and support policy decisions.

8.
Shadforth I, Crowther D, Bessant C. Proteomics 2005, 5(16): 4082-4095
Current proteomics experiments can generate vast quantities of data very quickly, but this has not been matched by data analysis capabilities. Although there have been a number of recent reviews covering various aspects of peptide and protein identification methods using MS, comparisons of which methods are either the most appropriate for, or the most effective at, their proposed tasks are not readily available. As the need for high-throughput, automated peptide and protein identification systems increases, the creators of such pipelines need to be able to choose algorithms that are going to perform well both in terms of accuracy and computational efficiency. This article therefore provides a review of the currently available core algorithms for PMF, database searching using MS/MS, sequence tag searches and de novo sequencing. We also assess the relative performances of a number of these algorithms. As there is limited reporting of such information in the literature, we conclude that there is a need for the adoption of a system of standardised reporting on the performance of new peptide and protein identification algorithms, based upon freely available datasets. We go on to present our initial suggestions for the format and content of these datasets.

9.

Background  

DNA microarrays have become the standard method for large-scale analyses of gene expression and epigenomics. The increasing complexity and inherent noisiness of the generated data make visual data exploration ever more important. Fast deployment of new methods as well as a combination of predefined, easy-to-apply methods with programmatic access to the data are important requirements for any analysis framework. Mayday is an open source platform with emphasis on visual data exploration and analysis. Many built-in methods for clustering, machine learning and classification are provided for dissecting complex datasets. Plugins can easily be written to extend Mayday's functionality in a large number of ways. As a Java program, Mayday is platform-independent and can be used as a Java WebStart application without any installation. Mayday can import data from several file formats, and database connectivity is included for efficient data organization. Numerous interactive visualization tools, including box plots, profile plots, principal component plots and a heatmap, are available; these can be enhanced with metadata and exported as publication-quality vector files.

10.
SIMLR (Single-cell Interpretation via Multi-kernel LeaRning), an open-source tool that implements a novel framework to learn a sample-to-sample similarity measure from expression data observed for heterogeneous samples, is presented here. SIMLR can be effectively used to perform tasks such as dimension reduction, clustering, and visualization of heterogeneous populations of samples. SIMLR was benchmarked against state-of-the-art methods for these three tasks on several public datasets, showing it to be scalable and capable of greatly improving clustering performance, as well as providing valuable insights by making the data more interpretable through better visualization. SIMLR is available on GitHub at https://github.com/BatzoglouLabSU/SIMLR in both R and MATLAB implementations. Furthermore, it is also available as an R package on http://bioconductor.org
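A hedged sketch of the multi-kernel idea (not SIMLR's actual optimization, which also learns kernel weights and imposes structure on the similarity matrix): several Gaussian kernels with different bandwidths are combined into one similarity matrix and clustered. All data and parameters below are invented.

```python
# Illustration of multi-kernel similarity followed by clustering.
import numpy as np
from sklearn.metrics.pairwise import rbf_kernel
from sklearn.cluster import SpectralClustering

rng = np.random.default_rng(1)
# Toy expression matrix: 60 cells x 200 genes, two synthetic populations.
X = np.vstack([rng.normal(0, 1, (30, 200)), rng.normal(2, 1, (30, 200))])

# Multiple kernels with different bandwidths; SIMLR learns their weights,
# here we simply average them as an illustration.
gammas = [0.001, 0.01, 0.1]
S = np.mean([rbf_kernel(X, gamma=g) for g in gammas], axis=0)

labels = SpectralClustering(n_clusters=2, affinity="precomputed",
                            random_state=0).fit_predict(S)
print(labels)
```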

11.
With many sophisticated methods available for estimating migration, ecologists face the difficult decision of choosing one for their specific line of work. Here we test and compare several methods, performing sanity and robustness tests, applying them to large-scale data and discussing the results and interpretation. Five methods were selected and compared for their ability to estimate migration from spatially implicit and semi-explicit simulations based on three large-scale field datasets from South America (Guyana, Suriname, French Guiana and Ecuador). Space was incorporated semi-explicitly by a discrete probability mass function for local recruitment, migration from adjacent plots or migration from a metacommunity. Most methods were able to accurately estimate migration from spatially implicit simulations. For spatially semi-explicit simulations, the estimate was shown to be the additive effect of migration from adjacent plots and from the metacommunity. It was only accurate when migration from the metacommunity outweighed that from adjacent plots; discriminating between the two, however, proved to be impossible. We show that migration should be considered more an approximation of the resemblance between communities and the summed regional species pool. Application of the migration estimates to simulate the field datasets did show reasonably good fits and indicated consistent differences between sets in comparison with earlier studies. We conclude that estimates of migration using these methods are more an approximation of the homogenization among local communities over time than a direct measurement of migration, and hence have a direct relationship with beta diversity. As beta diversity is the result of many (non-)neutral processes, we have to admit that migration as estimated in a spatially explicit world encompasses not only direct migration but is an ecological aggregate of these processes. The parameter m of neutral models then appears more as an emergent property revealed by neutral theory than as an effective mechanistic parameter, and spatially implicit models should be rejected as an approximation of forest dynamics.
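A small sketch of the spatially semi-explicit recruitment scheme described above, assuming an invented probability mass function and invented community compositions; it is meant only to make the local/adjacent/metacommunity decomposition concrete.

```python
# Hedged sketch: each recruit is drawn locally, from an adjacent plot, or from
# the metacommunity according to a discrete probability mass function.
# All numbers are invented.
import numpy as np

rng = np.random.default_rng(2)
p_local, p_adjacent, p_meta = 0.80, 0.15, 0.05   # the discrete PMF

local = np.array([0, 0, 1, 1, 1, 2])             # species ids in the focal plot
adjacent = np.array([2, 2, 3, 3])                # species ids in adjacent plots
meta_freq = np.array([0.4, 0.3, 0.2, 0.1])       # metacommunity relative abundances

def recruit():
    source = rng.choice(["local", "adjacent", "meta"], p=[p_local, p_adjacent, p_meta])
    if source == "local":
        return rng.choice(local)
    if source == "adjacent":
        return rng.choice(adjacent)
    return rng.choice(len(meta_freq), p=meta_freq)

recruits = [recruit() for _ in range(1000)]
print("recruited species frequencies:", np.bincount(recruits) / 1000)
```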

12.
MOTIVATION: Protein families evolve a multiplicity of functions through gene duplication, speciation and other processes. As a number of studies have shown, standard methods of protein function prediction produce systematic errors on these data. Phylogenomic analysis--combining phylogenetic tree construction, integration of experimental data and differentiation of orthologs and paralogs--has been proposed to address these errors and improve the accuracy of functional classification. The explicit integration of structure prediction and analysis in this framework, which we call structural phylogenomics, provides additional insights into protein superfamily evolution. RESULTS: Results of protein functional classification using phylogenomic analysis show fewer expected false positives overall than when pairwise methods of functional classification are employed. We present an overview of the motivations and fundamental principles of phylogenomic analysis, new methods developed for the key tasks, benchmark datasets for these tasks (when available) and suggest procedures to increase accuracy. We also discuss some of the methods used in the Celera Genomics high-throughput phylogenomic classification of the human genome. AVAILABILITY: Software tools from the Berkeley Phylogenomics Group are available at http://phylogenomics.berkeley.edu

13.
Point 1: The ecological models of Alfred J. Lotka and Vito Volterra have had an enormous impact on ecology over the past century. Some of the earliest—and clearest—experimental tests of these models were famously conducted by Georgy Gause in the 1930s. Although well known, the data from these experiments are not widely available and are often difficult to analyze using standard statistical and computational tools. Point 2: Here, we introduce the gauseR package, a collection of tools for fitting Lotka-Volterra models to time series data of one or more species. The package includes several methods for parameter estimation and optimization, and includes 42 datasets from Gause's species interaction experiments and related work. Additionally, we include with this paper a short blog post discussing the historical importance of these data and models, and an R vignette with a walk-through introducing the package methods. The package is available for download at github.com/adamtclark/gauseR. Point 3: To demonstrate the package, we apply it to several classic experimental studies from Gause, as well as two other well-known datasets on multi-trophic dynamics on Isle Royale, and in spatially structured mite populations. In almost all cases, models fit observations closely and fitted parameter values make ecological sense. Point 4: Taken together, we hope that the methods, data, and analyses that we present here provide a simple and user-friendly way to interact with complex ecological data. We are optimistic that these methods will be especially useful to students and educators who are studying ecological dynamics, as well as researchers who would like a fast tool for basic analyses.
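gauseR is an R package; as a hedged Python analogue of the fitting task it addresses, the sketch below simulates a two-species Lotka-Volterra competition and recovers its parameters by least squares. The data, initial values and bounds are invented.

```python
# Illustration of fitting a two-species Lotka-Volterra model to time series data.
import numpy as np
from scipy.integrate import odeint
from scipy.optimize import least_squares

def lv(n, t, r1, r2, a11, a12, a21, a22):
    n1, n2 = n
    return [n1 * (r1 + a11 * n1 + a12 * n2),
            n2 * (r2 + a21 * n1 + a22 * n2)]

t = np.linspace(0, 25, 40)
truth = (0.9, 0.6, -0.01, -0.006, -0.004, -0.008)
rng = np.random.default_rng(3)
obs = odeint(lv, [2.0, 3.0], t, args=truth) + rng.normal(0, 1.0, (40, 2))

def residuals(theta):
    pred = odeint(lv, [2.0, 3.0], t, args=tuple(theta))
    return (pred - obs).ravel()

# Bounds keep growth rates positive and interactions negative during the search.
fit = least_squares(residuals, x0=[1.0, 1.0, -0.02, -0.01, -0.01, -0.02],
                    bounds=([0, 0, -1, -1, -1, -1], [2, 2, 0, 0, 0, 0]))
print("estimated parameters:", np.round(fit.x, 4))
```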

14.
Fregel R, Delgado S. Mitochondrion 2011, 11(2): 366-367
Comparison of available mitochondrial DNA data is sometimes hindered by the data presentation format. HaploSearch is a simple tool for transforming DNA sequences into haplotype data and vice versa, speeding up the manipulation of large datasets. Although designed for mitochondrial DNA, HaploSearch can be used with any kind of DNA data. The HaploSearch program, detailed software instructions and example files are freely available on the web at http://www.haplosite.com/haplosearch/.
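HaploSearch's source is not shown here; the sketch below only illustrates the sequence-to-haplotype transformation it automates, with a toy reference and an invented variant notation.

```python
# Illustration of the transformation HaploSearch automates (not its code):
# a sequence becomes a haplotype (list of differences from a reference) and back.
reference = "GATCACAGGTCTATCACCCT"          # toy reference, not the real rCRS

def seq_to_haplotype(seq, ref=reference):
    """Return substitutions as e.g. '5C', meaning position 5 carries a C."""
    return [f"{i + 1}{b}" for i, (r, b) in enumerate(zip(ref, seq)) if r != b]

def haplotype_to_seq(haplotype, ref=reference):
    """Rebuild the full sequence from the reference plus its substitutions."""
    seq = list(ref)
    for variant in haplotype:
        pos, base = int(variant[:-1]), variant[-1]
        seq[pos - 1] = base
    return "".join(seq)

sample = "GATTACAGGTCTATCACCTT"
hap = seq_to_haplotype(sample)
print(hap)                                   # ['4T', '19T']
print(haplotype_to_seq(hap) == sample)       # True
```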

15.
High-throughput genotyping chips have produced huge datasets for genome-wide association studies (GWAS) that have contributed greatly to discovering susceptibility genes for complex diseases. There are two strategies for performing data analysis for GWAS. One strategy is to use open-source or commercial packages that are designed for GWAS. The other is to take advantage of classic genetic programs with specific functions, such as linkage disequilibrium mapping, haplotype inference and transmission disequilibrium tests. However, most classic programs that are available are not suitable for analyzing chip data directly and require custom-made input, which results in the inconvenience of converting raw genotyping files into various data formats. We developed a powerful, user-friendly, lightweight program named SNPTransformer for GWAS that includes five major modules (Transformer, Operator, Previewer, Coder and Simulator). The toolkit not only transforms genotyping files into ten input formats for use with classic genetics packages, but also carries out useful functions such as relational operations on IDs, previewing data files, recoding data formats and simulating marker files, among other functions. It bridges upstream raw genotyping data with downstream genetic programs, and can act as a handy toolkit for human geneticists, especially non-programmers. SNPTransformer is freely available at http://snptransformer.sourceforge.net.
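As a hedged illustration of the recoding step such a toolkit performs (the layout and coding below are invented, not SNPTransformer's actual formats), a genotype call matrix is recoded into minor-allele counts and transposed into a marker-major table.

```python
# Hypothetical genotype recoding: two-letter calls to additive 0/1/2 coding.
raw = {                        # sample -> {SNP id: genotype call}
    "ind1": {"rs1": "AA", "rs2": "AG"},
    "ind2": {"rs1": "AG", "rs2": "GG"},
    "ind3": {"rs1": "GG", "rs2": "AG"},
}
minor_allele = {"rs1": "G", "rs2": "A"}   # assumed minor alleles

def additive_code(call, minor):
    """Recode a two-letter genotype call as a minor-allele count (0/1/2)."""
    return call.count(minor)

# Transform to a marker-major table, the orientation many classic programs expect.
snps = sorted(minor_allele)
table = {snp: [additive_code(raw[ind][snp], minor_allele[snp]) for ind in sorted(raw)]
         for snp in snps}
print(table)   # {'rs1': [0, 1, 2], 'rs2': [1, 0, 1]}
```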

16.
The effect that climate change and variability will have on waterborne bacteria is a topic of increasing concern for coastal ecosystems, including the Chesapeake Bay. Surface water temperature trends in the Bay indicate a warming pattern of roughly 0.3–0.4°C per decade over the past 30 years. It is unclear what impact future warming will have on pathogens currently found in the Bay, including Vibrio spp. Using historical environmental data, combined with three different statistical models of Vibrio vulnificus probability, we explore the relationship between environmental change and predicted Vibrio vulnificus presence in the upper Chesapeake Bay. We find that the predicted response of V. vulnificus probability to high temperatures in the Bay differs systematically between models of differing structure. As existing publicly available datasets are inadequate to determine which model structure is most appropriate, the impact of climatic change on the probability of V. vulnificus presence in the Chesapeake Bay remains uncertain. This result points to the challenge of characterizing climate sensitivity of ecological systems in which data are sparse and only statistical models of ecological sensitivity exist.
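A hedged sketch of the model-structure issue raised above: two logistic models of presence versus temperature are fit to the same data (simulated here, not Chesapeake Bay observations) and then extrapolated to warmer water, where their predictions can diverge.

```python
# Illustration only: two model structures, same training data, different extrapolation.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(4)
temp = rng.uniform(5, 30, 300)                       # surface temperature, deg C
p_true = 1 / (1 + np.exp(-(temp - 20) / 2.5))        # invented "truth"
present = rng.random(300) < p_true

# Model A: linear in temperature.  Model B: adds a quadratic term.
XA = temp.reshape(-1, 1)
XB = np.column_stack([temp, temp ** 2])
mA = LogisticRegression(max_iter=1000).fit(XA, present)
mB = LogisticRegression(max_iter=1000).fit(XB, present)

# The two structures can agree on observed temperatures yet diverge when
# extrapolated to warmer water, which is the paper's central caution.
hot = np.array([[32.0]])
print("model A at 32 C:", mA.predict_proba(hot)[0, 1])
print("model B at 32 C:", mB.predict_proba(np.column_stack([hot, hot ** 2]))[0, 1])
```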

17.
The structure and function of diverse microbial communities are underpinned by ecological interactions that remain uncharacterized. With the rapid adoption of next-generation sequencing for studying microbiomes, data-driven inference of microbial interactions based on abundance correlations is widely used, but with the drawback that ecological interpretations may not be possible. Leveraging cross-sectional microbiome datasets for unravelling ecological structure in a scalable manner thus remains an open problem. We present an expectation-maximization algorithm (BEEM-Static) that can be applied to cross-sectional datasets to infer interaction networks based on an ecological model (generalized Lotka-Volterra). The method exhibits robustness to violations in model assumptions by using statistical filters to identify and remove corresponding samples. Benchmarking against 10 state-of-the-art correlation-based methods showed that BEEM-Static can infer the presence and directionality of ecological interactions even with relative abundance data (AUC-ROC > 0.85), a task that other methods struggle with (AUC-ROC < 0.63). In addition, BEEM-Static can tolerate a high fraction of samples (up to 40%) being not at steady state or coming from an alternate model. Applying BEEM-Static to a large public dataset of human gut microbiomes (n = 4,617) identified multiple stable equilibria that better reflect ecological enterotypes with distinct carrying capacities and interactions for key species. Conclusion: BEEM-Static provides new opportunities for mining ecologically interpretable interactions and systems insights from the growing corpus of microbiome data.
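A hedged sketch of the steady-state identity that generalized Lotka-Volterra inference from cross-sectional data rests on; this is not the BEEM-Static EM algorithm (which additionally estimates biomass and filters non-equilibrium samples), and all parameters below are invented.

```python
# At a generalized Lotka-Volterra equilibrium, mu_i + sum_j b_ij * x_j = 0 for
# every species i that is present, so each row of the interaction matrix can be
# estimated by regression across samples, up to a per-species scale.
import numpy as np

rng = np.random.default_rng(5)
mu = np.array([1.0, 0.8, 0.6])
B = np.array([[-1.0,  0.2, -0.1],
              [ 0.1, -0.9,  0.0],
              [-0.2,  0.1, -0.8]])

# Simulate equilibria from perturbed growth rates: solve B @ x = -mu_sample.
X = np.array([np.linalg.solve(B, -(mu * rng.uniform(0.8, 1.2, 3))) for _ in range(300)])

# Normalizing b_ii = -1 turns the identity into x_i = a_i + sum_{j != i} c_ij x_j,
# an ordinary least-squares problem per species.
for i in range(3):
    others = [j for j in range(3) if j != i]
    design = np.column_stack([X[:, others], np.ones(len(X))])
    coef, *_ = np.linalg.lstsq(design, X[:, i], rcond=None)
    print(f"species {i}: interaction row (scaled) ~", np.round(coef[:-1], 2),
          " growth (scaled) ~", round(coef[-1], 2))
```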

18.

Background

Sequence alignment data are often ordered by coordinate (id of the reference sequence plus position on the sequence where the fragment was mapped) when stored in BAM files, as this simplifies the extraction of variants between the mapped data and the reference, or of variants within the mapped data. In this order, paired reads are usually separated in the file, which complicates other applications such as duplicate marking or conversion to the FastQ format, which require access to the full information of both reads in a pair.

Results

In this paper we introduce biobambam, a set of tools based on the efficient collation of alignments in BAM files by read name. The employed collation algorithm avoids time- and space-consuming sorting of alignments by read name where this is possible without using more than a specified amount of main memory. Using this algorithm, tasks like duplicate marking in BAM files and conversion of BAM files to the FastQ format can be performed very efficiently with limited resources (a simple pairing sketch follows this abstract). We also make the collation algorithm available in the form of an API for other projects. This API is part of the libmaus package.

Conclusions

In comparison with previous approaches to problems involving the collation of alignments by read name, such as BAM to FastQ conversion or duplicate marking utilities, our approach can often perform an equivalent task more efficiently in terms of the required main memory and run-time. Our BAM to FastQ conversion is faster than all widely known alternatives including Picard and bamUtil. Our duplicate marking is about as fast as its closest competitor, bamUtil, for small data sets and faster than all known alternatives on large and complex data sets.
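A hedged pysam-based sketch of collation by read name for BAM-to-FastQ conversion; it buffers unpaired mates in a dictionary, which is exactly the unbounded-memory behaviour that biobambam's bounded-memory collation algorithm avoids on coordinate-sorted files. The file path is hypothetical and this is not biobambam's code.

```python
# Naive read-name collation for BAM -> paired FastQ (illustration only).
import pysam

COMP = str.maketrans("ACGTN", "TACGN")

def to_fastq(read):
    """Recover the original read orientation and format it as a FastQ record."""
    seq = read.query_sequence
    qual = "".join(chr(q + 33) for q in read.query_qualities)
    if read.is_reverse:                        # stored reverse-complemented in BAM
        seq = seq.translate(COMP)[::-1]
        qual = qual[::-1]
    return f"@{read.query_name}\n{seq}\n+\n{qual}\n"

def bam_to_fastq_pairs(path):
    pending = {}                               # read name -> first mate seen
    with pysam.AlignmentFile(path, "rb") as bam:
        for read in bam:
            if read.is_secondary or read.is_supplementary:
                continue                       # keep only primary alignments
            mate = pending.pop(read.query_name, None)
            if mate is None:
                pending[read.query_name] = read
            else:
                first, second = (mate, read) if mate.is_read1 else (read, mate)
                yield to_fastq(first), to_fastq(second)

# Example usage (the path is hypothetical):
# for r1, r2 in bam_to_fastq_pairs("sample.bam"):
#     ...write r1 to reads_1.fastq and r2 to reads_2.fastq
```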

19.
The Flora North America Program will create a computer data bank of taxonomic information about the vascular plants of North America north of Mexico. The system, which will serve a broad range of users, will provide for the acquisition, editing, and storage of data and the preparation of a variety of products. It will grow and evolve over a long period of time, continually incorporating changes to data and new categories of information as they become available. The system will utilize an existing computer program package, the IBM Generalized Information System (GIS), for creating, maintaining, querying, and generating reports from its files of data. In addition to a file of taxonomic, ecological, and geographic data about each taxon, the system will build and maintain a series of authority or dictionary files to control system content. The first major system product will be a published flora, generated by computer. Many other catalogs and indexes will be prepared and specific questions may be asked of the data bank to meet individual user needs for information.

20.
Human-environmental relationships have long been of interest to a variety of scientists, including ecologists, biologists, anthropologists, and many others [1, 2]. In anthropology, this interest was especially prevalent among cultural ecologists of the 1970s and earlier, who tended to explain culture as the result of techno-environmental constraints [3]. More recently researchers have used historical ecology, an approach that focuses on the long-term dialectical relationship between humans and their environments, as well as long-term prehuman ecological datasets [4-7]. An important contribution of anthropology to historical ecology is that anthropological datasets dealing with ethnohistory, traditional ecological knowledge, and human skeletal analysis, as well as archeological datasets on faunal and floral remains, artifacts, geochemistry, and stratigraphic analysis, provide a deep time perspective (across decades, centuries, and millennia) on the evolution of ecosystems and the place of people in those larger systems. Historical ecological data also have an applied component that can provide important information on the relative abundances of flora and fauna, changes in biogeography, alterations in food webs, landscape evolution, and much more.
