期刊界 All Journals 搜尽天下杂志传播学术成果专业期刊搜索期刊信息化学术搜索

1.

Fast Fitch-Parsimony Algorithms for Large Data Sets

Fredrik Ronquist 《Cladistics : the international journal of the Willi Hennig Society》1998,14(4):387-400

The speed of analytical algorithms becomes increasingly important as systematists accumulate larger data sets. In this paper I discuss several time-saving modifications to published Fitch-parsimony tree search algorithms, including shortcuts that allow rapid evaluation of tree lengths and fast reoptimization of trees after clipping or joining of subtrees, as well as search strategies that allows one to successively increase the exhaustiveness of branch swapping. I also describe how Fitch-parsimony algorithms can be restructured to take full advantage of the computing power of modern microprocessors by horizontal or vertical packing of characters, allowing simultaneous processing of many characters, and by avoidance of conditional branches that disturb instruction flow. These new multicharacter algorithms are particularly useful for large data sets of characters with a small number of states, such as nucleotide characters. As an example, the multicharacter algorithms are estimated to be 3.6–10 times faster than single-character equivalents on a PowerPC 604. The speed gain is even larger on processors using MMX, Altivec or similar technologies allowing single instructions to be performed on multiple data simultaneously. 相似文献

2.

Large rDNA Data Sets Question Molecular Phylogenies

《Mycological Research》2000,104(9):1026

相似文献

3.

Parallel Clustering Algorithm for Large Data Sets with Applications in Bioinformatics

Olman Victor Mao Fenglou Wu Hongwei Xu Ying 《IEEE/ACM transactions on computational biology and bioinformatics / IEEE, ACM》2009,6(2):344-352

Large sets of bioinformatical data provide a challenge in time consumption while solving the cluster identification problem, and that is why a parallel algorithm is so needed for identifying dense clusters in a noisy background. Our algorithm works on a graph representation of the data set to be analyzed. It identifies clusters through the identification of densely intraconnected subgraphs. We have employed a minimum spanning tree (MST) representation of the graph and solve the cluster identification problem using this representation. The computational bottleneck of our algorithm is the construction of an MST of a graph, for which a parallel algorithm is employed. Our high-level strategy for the parallel MST construction algorithm is to first partition the graph, then construct MSTs for the partitioned subgraphs and auxiliary bipartite graphs based on the subgraphs, and finally merge these MSTs to derive an MST of the original graph. The computational results indicate that when running on 150 CPUs, our algorithm can solve a cluster identification problem on a data set with 1,000,000 data points almost 100 times faster than on single CPU, indicating that this program is capable of handling very large data clustering problems in an efficient manner. We have implemented the clustering algorithm as the software CLUMP. 相似文献

4.

fastSTRUCTURE: Variational Inference of Population Structure in Large SNP Data Sets

Anil Raj Matthew Stephens Jonathan K. Pritchard 《Genetics》2014,197(2):573-589

Tools for estimating population structure from genetic data are now used in a wide variety of applications in population genetics. However, inferring population structure in large modern data sets imposes severe computational challenges. Here, we develop efficient algorithms for approximate inference of the model underlying the STRUCTURE program using a variational Bayesian framework. Variational methods pose the problem of computing relevant posterior distributions as an optimization problem, allowing us to build on recent advances in optimization theory to develop fast inference tools. In addition, we propose useful heuristic scores to identify the number of populations represented in a data set and a new hierarchical prior to detect weak population structure in the data. We test the variational algorithms on simulated data and illustrate using genotype data from the CEPH–Human Genome Diversity Panel. The variational algorithms are almost two orders of magnitude faster than STRUCTURE and achieve accuracies comparable to those of ADMIXTURE. Furthermore, our results show that the heuristic scores for choosing model complexity provide a reasonable range of values for the number of populations represented in the data, with minimal bias toward detecting structure when it is very weak. Our algorithm, fastSTRUCTURE, is freely available online at http://pritchardlab.stanford.edu/structure.html. 相似文献

5.

A Scalable Approach for Protein False Discovery Rate Estimation in Large Proteomic Data Sets

Mikhail M. Savitski Mathias Wilhelm Hannes Hahne Bernhard Kuster Marcus Bantscheff 《Molecular & cellular proteomics : MCP》2015,14(9):2394-2404

Calculating the number of confidently identified proteins and estimating false discovery rate (FDR) is a challenge when analyzing very large proteomic data sets such as entire human proteomes. Biological and technical heterogeneity in proteomic experiments further add to the challenge and there are strong differences in opinion regarding the conceptual validity of a protein FDR and no consensus regarding the methodology for protein FDR determination. There are also limitations inherent to the widely used classic target–decoy strategy that particularly show when analyzing very large data sets and that lead to a strong over-representation of decoy identifications. In this study, we investigated the merits of the classic, as well as a novel target–decoy-based protein FDR estimation approach, taking advantage of a heterogeneous data collection comprised of ∼19,000 LC-MS/MS runs deposited in ProteomicsDB (https://www.proteomicsdb.org). The “picked” protein FDR approach treats target and decoy sequences of the same protein as a pair rather than as individual entities and chooses either the target or the decoy sequence depending on which receives the highest score. We investigated the performance of this approach in combination with q-value based peptide scoring to normalize sample-, instrument-, and search engine-specific differences. The “picked” target–decoy strategy performed best when protein scoring was based on the best peptide q-value for each protein yielding a stable number of true positive protein identifications over a wide range of q-value thresholds. We show that this simple and unbiased strategy eliminates a conceptual issue in the commonly used “classic” protein FDR approach that causes overprediction of false-positive protein identification in large data sets. The approach scales from small to very large data sets without losing performance, consistently increases the number of true-positive protein identifications and is readily implemented in proteomics analysis software.Shotgun proteomics is the most popular approach for large-scale identification and quantification of proteins. The rapid evolution of high-end mass spectrometers in recent years (–) has made proteomic studies feasible that identify and quantify as many as 10,000 proteins in a sample (–) and enables many lines of new scientific research including, for example, the analysis of many human proteomes, and proteome-wide protein–drug interaction studies (–). One fundamental step in most proteomic experiments is the identification of proteins in the biological system under investigation. To achieve this, proteins are digested into peptides, analyzed by LC-MS/MS, and tandem mass spectra are used to interrogate protein sequence databases using search engines that match experimental data to data generated in silico (, ). Peptide spectrum matches (PSMs)¹ are commonly assigned by a search engine using either a heuristic or a probabilistic scoring scheme (–). Proteins are then inferred from identified peptides and a protein score or a probability derived as a measure for the confidence in the identification (, ).Estimating the proportion of false matches (false discovery rate; FDR) in an experiment is important to assess and maintain the quality of protein identifications. Owing to its conceptual and practical simplicity, the most widely used strategy to estimate FDR in proteomics is the target–decoy database search strategy (target–decoy strategy; TDS) (). The main assumption underlying this idea is that random matches (false positives) should occur with similar likelihood in the target database and the decoy (reversed, shuffled, or otherwise randomized) version of the same database (, ). The number of matches to the decoy database, therefore, provides an estimate of the number of random matches one should expect to obtain in the target database. The number of target and decoy hits can then be used to calculate either a local or a global FDR for a given data set (–). This general idea can be applied to control the FDR at the level of PSMs, peptides, and proteins, typically by counting the number of target and decoy observations above a specified score.Despite the significant practical impact of the TDS, it has been observed that a peptide FDR that results in an acceptable protein FDR (of say 1%) for a small or medium sized data set, turns into an unacceptably high protein FDR when the data set grows larger (, ). This is because the basic assumption of the classical TDS is compromised when a large proportion of the true positive proteins have already been identified. In small data sets, containing say only a few hundred to a few thousand proteins, random peptide matches will be distributed roughly equally over all decoy and “leftover” target proteins, allowing for a reasonably accurate estimation of false positive target identifications by using the number of decoy identifications. However, in large experiments comprising hundreds to thousands of LC-MS/MS runs, 10,000 or more target proteins may be genuinely and repeatedly identified, leaving an ever smaller number of (target) proteins to be hit by new false positive peptide matches. In contrast, decoy proteins are only hit by the occasional random peptide match but fully count toward the number of false positive protein identifications estimated from the decoy hits. The higher the number of genuinely identified target proteins gets, the larger this imbalance becomes. If this is not corrected for in the decoy space, an overestimation of false positives will occur.This problem has been recognized and e.g. Reiter and colleagues suggested a way for correcting for the overestimation of false positive protein hits termed MAYU (). Following the main assumption that protein identifications containing false positive PSMs are uniformly distributed over the target database, MAYU models the number of false positive protein identifications using a hypergeometric distribution. Its parameters are estimated from the number of protein database entries and the total number of target and decoy protein identifications. The protein FDR is then estimated by dividing the number of expected false positive identifications (expectation value of the hypergeometric distribution) by the total number of target identifications. Although this approach was specifically designed for large data sets (tested on ∼1300 LC-MS/MS runs from digests of C. elegans proteins), it is not clear how far the approach actually scales. Another correction strategy for overestimation of false positive rates, the R factor, was suggested initially for peptides () and more recently for proteins (). A ratio, R, of forward and decoy hits in the low probability range is calculated, where the number of true peptide or protein identifications is expected to be close to zero, and hence, R should approximate one. The number of decoy hits is then multiplied (corrected) by the R factor when performing FDR calculations. The approach is conceptually simpler than the MAYU strategy and easy to implement, but is also based on the assumption that the inflation of the decoy hits intrinsic in the classic target–decoy strategy occurs to the same extent in all probability ranges.In the context of the above, it is interesting to note that there is currently no consensus in the community regarding if and how protein FDRs should be calculated for data of any size. One perhaps extreme view is that, owing to issues and assumptions related to the peptide to protein inference step and ways of constructing decoy protein sequences, protein level FDRs cannot be meaningfully estimated at all (30). This is somewhat unsatisfactory as an estimate of protein level error in proteomic experiments is highly desirable. Others have argued that target–decoy searches are not even needed when accurate p values of individual PSMs are available () whereas others choose to tighten the PSM or peptide FDRs obtained from TDS analysis to whatever threshold necessary to obtain a desired protein FDR (). This is likely too conservative.We have recently proposed an alternative protein FDR approach termed “picked” target–decoy strategy (picked TDS) that indicated improved performance over the classical TDS in a very large proteomic data set () but a systematic investigation of the idea had not been performed at the time. In this study, we further characterized the picked TDS for protein FDR estimation and investigated its scalability compared with that of the classic TDS FDR method in data sets of increasing size up to ∼19,000 LC-MS/MS runs. The results show that the picked TDS is effective in preventing decoy protein over-representation, identifies more true positive hits, and works equally well for small and large proteomic data sets. 相似文献

6.

Accurate and Scalable Construction of Polygenic Scores in Large Biobank Data Sets

《American journal of human genetics》2020,106(5):679-693

相似文献

7.

A Bayesian Outlier Criterion to Detect SNPs under Selection in Large Data Sets

Mathieu Gautier Toby Dylan Hocking Jean-Louis Foulley 《PloS one》2010,5(8)

Background

The recent advent of high-throughput SNP genotyping technologies has opened new avenues of research for population genetics. In particular, a growing interest in the identification of footprints of selection, based on genome scans for adaptive differentiation, has emerged.

Methodology/Principal Findings

The purpose of this study is to develop an efficient model-based approach to perform Bayesian exploratory analyses for adaptive differentiation in very large SNP data sets. The basic idea is to start with a very simple model for neutral loci that is easy to implement under a Bayesian framework and to identify selected loci as outliers via Posterior Predictive P-values (PPP-values). Applications of this strategy are considered using two different statistical models. The first one was initially interpreted in the context of populations evolving respectively under pure genetic drift from a common ancestral population while the second one relies on populations under migration-drift equilibrium. Robustness and power of the two resulting Bayesian model-based approaches to detect SNP under selection are further evaluated through extensive simulations. An application to a cattle data set is also provided.

Conclusions/Significance

The procedure described turns out to be much faster than former Bayesian approaches and also reasonably efficient especially to detect loci under positive selection. 相似文献

8.

CoreCruncher: Fast and Robust Construction of Core Genomes in Large Prokaryotic Data Sets

Connor D Harris Ellis L Torrance Kasie Raymann Louis-Marie Bobay 《Molecular biology and evolution》2021,38(2):727

The core genome represents the set of genes shared by all, or nearly all, strains of a given population or species of prokaryotes. Inferring the core genome is integral to many genomic analyses, however, most methods rely on the comparison of all the pairs of genomes; a step that is becoming increasingly difficult given the massive accumulation of genomic data. Here, we present CoreCruncher; a program that robustly and rapidly constructs core genomes across hundreds or thousands of genomes. CoreCruncher does not compute all pairwise genome comparisons and uses a heuristic based on the distributions of identity scores to classify sequences as orthologs or paralogs/xenologs. Although it is much faster than current methods, our results indicate that our approach is more conservative than other tools and less sensitive to the presence of paralogs and xenologs. CoreCruncher is freely available from: https://github.com/lbobay/CoreCruncher. CoreCruncher is written in Python 3.7 and can also run on Python 2.7 without modification. It requires the python library Numpy and either Usearch or Blast. Certain options require the programs muscle or mafft. 相似文献

9.

Protein Identification False Discovery Rates for Very Large Proteomics Data Sets Generated by Tandem Mass Spectrometry

Lukas Reiter Manfred Claassen Sabine P. Schrimpf Marko Jovanovic Alexander Schmidt Joachim M. Buhmann Michael O. Hengartner Ruedi Aebersold 《Molecular & cellular proteomics : MCP》2009,8(11):2405-2417

相似文献

10.

Analyzing Large Gene Expression and Methylation Data Profiles Using StatBicRM: Statistical Biclustering-Based Rule Mining

Ujjwal Maulik Saurav Mallik Anirban Mukhopadhyay Sanghamitra Bandyopadhyay 《PloS one》2015,10(4)

Microarray and beadchip are two most efficient techniques for measuring gene expression and methylation data in bioinformatics. Biclustering deals with the simultaneous clustering of genes and samples. In this article, we propose a computational rule mining framework, StatBicRM (i.e., statistical biclustering-based rule mining) to identify special type of rules and potential biomarkers using integrated approaches of statistical and binary inclusion-maximal biclustering techniques from the biological datasets. At first, a novel statistical strategy has been utilized to eliminate the insignificant/low-significant/redundant genes in such way that significance level must satisfy the data distribution property (viz., either normal distribution or non-normal distribution). The data is then discretized and post-discretized, consecutively. Thereafter, the biclustering technique is applied to identify maximal frequent closed homogeneous itemsets. Corresponding special type of rules are then extracted from the selected itemsets. Our proposed rule mining method performs better than the other rule mining algorithms as it generates maximal frequent closed homogeneous itemsets instead of frequent itemsets. Thus, it saves elapsed time, and can work on big dataset. Pathway and Gene Ontology analyses are conducted on the genes of the evolved rules using David database. Frequency analysis of the genes appearing in the evolved rules is performed to determine potential biomarkers. Furthermore, we also classify the data to know how much the evolved rules are able to describe accurately the remaining test (unknown) data. Subsequently, we also compare the average classification accuracy, and other related factors with other rule-based classifiers. Statistical significance tests are also performed for verifying the statistical relevance of the comparative results. Here, each of the other rule mining methods or rule-based classifiers is also starting with the same post-discretized data-matrix. Finally, we have also included the integrated analysis of gene expression and methylation for determining epigenetic effect (viz., effect of methylation) on gene expression level. 相似文献

11.

Knowledge and Theme Discovery across Very Large Biological Data Sets Using Distributed Queries: A Prototype Combining Unstructured and Structured Data

Uma S. Mudunuri Mohamad Khouja Stephen Repetski Girish Venkataraman Anney Che Brian T. Luke F. Pascal Girard Robert M. Stephens 《PloS one》2013,8(12)

As the discipline of biomedical science continues to apply new technologies capable of producing unprecedented volumes of noisy and complex biological data, it has become evident that available methods for deriving meaningful information from such data are simply not keeping pace. In order to achieve useful results, researchers require methods that consolidate, store and query combinations of structured and unstructured data sets efficiently and effectively. As we move towards personalized medicine, the need to combine unstructured data, such as medical literature, with large amounts of highly structured and high-throughput data such as human variation or expression data from very large cohorts, is especially urgent. For our study, we investigated a likely biomedical query using the Hadoop framework. We ran queries using native MapReduce tools we developed as well as other open source and proprietary tools. Our results suggest that the available technologies within the Big Data domain can reduce the time and effort needed to utilize and apply distributed queries over large datasets in practical clinical applications in the life sciences domain. The methodologies and technologies discussed in this paper set the stage for a more detailed evaluation that investigates how various data structures and data models are best mapped to the proper computational framework. 相似文献

12.

Mining Metagenomic Data Sets for Ancient DNA: Recommended Protocols for Authentication

Felix M. Key Cosimo Posth Johannes Krause Alexander Herbig Kirsten I. Bos 《Trends in genetics : TIG》2017,33(8):508-520

相似文献

13.

A Unified Approach to Genotype Imputation and Haplotype-Phase Inference for Large Data Sets of Trios and Unrelated Individuals 总被引：6，自引：0，他引：6

Brian L. Browning Sharon R. Browning 《American journal of human genetics》2009,84(2):210-223

We present methods for imputing data for ungenotyped markers and for inferring haplotype phase in large data sets of unrelated individuals and parent-offspring trios. Our methods make use of known haplotype phase when it is available, and our methods are computationally efficient so that the full information in large reference panels with thousands of individuals is utilized. We demonstrate that substantial gains in imputation accuracy accrue with increasingly large reference panel sizes, particularly when imputing low-frequency variants, and that unphased reference panels can provide highly accurate genotype imputation. We place our methodology in a unified framework that enables the simultaneous use of unphased and phased data from trios and unrelated individuals in a single analysis. For unrelated individuals, our imputation methods produce well-calibrated posterior genotype probabilities and highly accurate allele-frequency estimates. For trios, our haplotype-inference method is four orders of magnitude faster than the gold-standard PHASE program and has excellent accuracy. Our methods enable genotype imputation to be performed with unphased trio or unrelated reference panels, thus accounting for haplotype-phase uncertainty in the reference panel. We present a useful measure of imputation accuracy, allelic R², and show that this measure can be estimated accurately from posterior genotype probabilities. Our methods are implemented in version 3.0 of the BEAGLE software package. 相似文献

14.

Panorama Public: A Public Repository for Quantitative Data Sets Processed in Skyline

《Molecular & cellular proteomics : MCP》2018,17(6):1239-1244

相似文献

15.

Parallel Clustering Algorithm for Large-Scale Biological Data Sets

Minchao Wang Wu Zhang Wang Ding Dongbo Dai Huiran Zhang Hao Xie Luonan Chen Yike Guo Jiang Xie 《PloS one》2014,9(4)

Backgrounds

Recent explosion of biological data brings a great challenge for the traditional clustering algorithms. With increasing scale of data sets, much larger memory and longer runtime are required for the cluster identification problems. The affinity propagation algorithm outperforms many other classical clustering algorithms and is widely applied into the biological researches. However, the time and space complexity become a great bottleneck when handling the large-scale data sets. Moreover, the similarity matrix, whose constructing procedure takes long runtime, is required before running the affinity propagation algorithm, since the algorithm clusters data sets based on the similarities between data pairs.

Methods

Two types of parallel architectures are proposed in this paper to accelerate the similarity matrix constructing procedure and the affinity propagation algorithm. The memory-shared architecture is used to construct the similarity matrix, and the distributed system is taken for the affinity propagation algorithm, because of its large memory size and great computing capacity. An appropriate way of data partition and reduction is designed in our method, in order to minimize the global communication cost among processes.

Result

A speedup of 100 is gained with 128 cores. The runtime is reduced from serval hours to a few seconds, which indicates that parallel algorithm is capable of handling large-scale data sets effectively. The parallel affinity propagation also achieves a good performance when clustering large-scale gene data (microarray) and detecting families in large protein superfamilies. 相似文献

16.

mzDB: A File Format Using Multiple Indexing Strategies for the Efficient Analysis of Large LC-MS/MS and SWATH-MS Data Sets

David Bouyssié Marc Dubois Sara Nasso Anne Gonzalez de Peredo Odile Burlet-Schiltz Ruedi Aebersold Bernard Monsarrat 《Molecular & cellular proteomics : MCP》2015,14(3):771-781

The analysis and management of MS data, especially those generated by data independent MS acquisition, exemplified by SWATH-MS, pose significant challenges for proteomics bioinformatics. The large size and vast amount of information inherent to these data sets need to be properly structured to enable an efficient and straightforward extraction of the signals used to identify specific target peptides. Standard XML based formats are not well suited to large MS data files, for example, those generated by SWATH-MS, and compromise high-throughput data processing and storing.We developed mzDB, an efficient file format for large MS data sets. It relies on the SQLite software library and consists of a standardized and portable server-less single-file database. An optimized 3D indexing approach is adopted, where the LC-MS coordinates (retention time and m/z), along with the precursor m/z for SWATH-MS data, are used to query the database for data extraction.In comparison with XML formats, mzDB saves ∼25% of storage space and improves access times by a factor of twofold up to even 2000-fold, depending on the particular data access. Similarly, mzDB shows also slightly to significantly lower access times in comparison with other formats like mz5. Both C++ and Java implementations, converting raw or XML formats to mzDB and providing access methods, will be released under permissive license. mzDB can be easily accessed by the SQLite C library and its drivers for all major languages, and browsed with existing dedicated GUIs. The mzDB described here can boost existing mass spectrometry data analysis pipelines, offering unprecedented performance in terms of efficiency, portability, compactness, and flexibility.The continuous improvement of mass spectrometers (–) and HPLC systems (–) and the rapidly increasing volumes of data they produce pose a real challenge to software developers who constantly have to adapt their tools to deal with different types and increasing sizes of raw files. Indeed, the file size of a single MS analysis evolved from a few MB to several GB in less than 10 years. The introduction of high throughput, high mass accuracy MS analyses in data dependent acquisitions (DDA)¹ and the adoption of Data Independent Acquisition (DIA) approaches, for example, SWATH-MS (), were significant factors in this development. The management of these huge data files is a major issue for laboratories and raw file public repositories, which need to regularly upgrade their storage solutions and capacity.The availability of XML (eXtensible Markup Language) standard formats (, ) enhanced data exchange among laboratories. However, XMLs causes the inflation of raw file size by a factor of two to three times compared with their original size. Vendor files, although lighter, are proprietary formats, often not compatible with operating systems other than Microsoft Windows. They do not generally interface with many open source software tools, and do not offer a viable solution for data exchange. In addition to size inflation, other disadvantages associated with the use of XML for the representation of raw data have been previously described in the literature (–). These include the verbosity of language syntax, the lack of support for multidimensional chromatographic analyses, and the low performance showed during data processing. Although XML standards were originally conceived as a format for enabling data sharing in the community, they are commonly used as the input for MS data analysis. Latest software tools (, ) are usually only compatible with mzML files, limiting de facto the throughput of proteomic analyses.To tackle these issues, some independent laboratories developed open formats relying on binary specifications (, , , ), to optimize both file size and data processing performance. Similar efforts started already more than ten years ago, and, among the others, the NetCDF version 4, first described in 2004, added the support for a new data model called HDF5. Because it is particularly well suited to the representation of complex data, HDF5 was used in several scientific projects to store and efficiently access large volumes of bytes, as for the mz5 format (). Compared with XML based formats, mz5 is much more efficient in terms of file size, memory footprint, and access time. Thus, after replacing the JCAMP text format more than 10 years ago, netCDF is nowadays a suitable alternative to XML based formats. Nonetheless, solutions for storing and indexing large amounts of data in a binary file are not limited to netCDF. For instance, it has been demonstrated that a relational model can represent raw data, as in YAFMS format (), which is based on SQLite, a technology that allows implementing a portable, self-contained, single file database. Similarly to mz5, YAFMS is definitely more efficient in terms of file size and access times than XML.Despite their improvements, a limitation of these new binary formats relies on the lack of a multi-indexing model to represent the bi-dimensional structure of LC-MS data. The inherently 2D indexing of LC-MS data can indeed be very useful when working with LC-MS/MS acquisition files. At the state-of-the-art, three main raw data access strategies can be identified across DDA and DIA approaches:

(1) Sequential reading of whole m/z spectra, for a systematic processing of the entire raw file. Use cases: file format conversion, peak picking, analysis of MS/MS spectra, and MS/MS peak list generation.
(2) Systematic processing of the data contained in specific m/z windows, across the entire chromatographic gradient. Use cases: extraction of XICs on the whole chromatographic gradient and MS features detection.
(3) Random access to a small region of the LC-MS map (a few spectra or an m/z window of consecutive spectra). Use cases: data visualization, targeted extraction of XICs on a small time range, and targeted extraction of a subset of spectra.

The adoption of a certain data access strategy depends upon the particular data analysis algorithms, which can perform signal extraction mainly by unsupervised or supervised approaches. Unsupervised approaches (, –) recognize LC-MS features on the basis of patterns like the theoretical isotope distribution, the shape of the elution peaks, etc. Conversely, supervised approaches (–) implement the peak picking as driven data access, using the a priori knowledge on peptide coordinates (m/z, retention time, and m/z precursor for DIA), which are provided by appropriate extraction lists given by the identification search engine or the transition lists in targeted proteomics (34). Data access overhead can vary significantly, according to the specific algorithm, data size, and length of the extraction list. In the unsupervised approach, feature detection is based first on the analysis of the full set of MS spectra and then on the grouping of the peaks detected in adjacent MS scans; thus, optimized sequential spectra access is required. In the supervised approach, peptide XICs are extracted using their a priori coordinates and therefore sequential spectra access is not a suitable solution; for instance, MS spectra shared by different peptides would be loaded multiple times leading to highly redundant data reloading. Even though sophisticated caching mechanisms can reduce the impact of this issue, they would increase memory consumption. It is thus preferable to perform a targeted access to specific MS spectra by leveraging an index in the time dimension. However, it would still be a sub-optimal solution because of redundant loads of full MS spectra, whereas only a small spectral window centered on the peptide m/z is of interest. Thus the quantification of dozens of thousands of peptides (, ) requires appropriate data access methods to cope with the repetitive and high load of MS data.We therefore deem that an ideal file format should show comparable efficiency regardless of the particular use case. In order to achieve this important flexibility and efficiency on any data access, we developed a new solution featuring multiple indexing strategies: the mzDB format (i.e. m/z database). As the YAFMS format, mzDB is implemented using SQLite, which is commonly adopted in several computational projects and is compatible with most programming languages. In contrast to mz5 and YAFMS formats, where each spectrum is referred by a single index entry, mzDB has an internal data structure allowing a multidimensional data indexing, and thus results in efficient queries along both time and m/z dimensions. This makes mzDB specifically suited to the processing of large-scale LC-MS/MS data. In particular, the multidimensional data-indexing model was extended for SWATH-MS data, where a third index is given by the m/z of the precursor ion, in addition to the RT and m/z of the fragment ions.In order to show its efficiency for all described data access strategies, mzDB was compared with the mzML format, which is the official XML standard, and the latest mz5 binary format, which has already been compared with many existing file formats (). Results show that mzDB outperforms other formats on most comparisons, except in sequential reading benchmarks where mz5 and mzDB are comparable. mzDB access performance, portability, and compactness, as well as its compliance to the PSI controlled vocabulary make it complementary to existing solutions for both the storage and exchange of mass spectrometry data and will eventually address the issues related to data access overhead during their processing. mzDB can therefore enhance existing mass spectrometry data analysis pipelines, offering unprecedented performance and therefore possibilities. 相似文献

17.

Web-Queryable Large-Scale Data Sets for Hypothesis Generation in Plant Biology

Siobhan M. Brady Nicholas J. Provart 《The Plant cell》2009,21(4):1034-1051

相似文献

18.

Sub‐phenotyping Metabolic Disorders Using Body Composition: An Individualized,Nonparametric Approach Utilizing Large Data Sets

Jennifer Linge Brandon Whitcher Magnus Borga Olof Dahlqvist Leinhard 《Obesity (Silver Spring, Md.)》2019,27(7):1190-1199

相似文献

19.

Statistical Approaches for Analyzing Mutational Spectra: Some Recommendations for Categorical Data 总被引：1，自引：0，他引：1

下载免费PDF全文

W. W. Piegorsch A. J. Bailer 《Genetics》1994,136(1):403-416

相似文献

20.

ClusTrack: Feature Extraction and Similarity Measures for Clustering of Genome-Wide Data Sets

Halfdan Rydbeck Geir Kjetil Sandve Egil Ferkingstad Boris Simovski Morten Rye Eivind Hovig 《PloS one》2015,10(4)

Clustering is a popular technique for explorative analysis of data, as it can reveal subgroupings and similarities between data in an unsupervised manner. While clustering is routinely applied to gene expression data, there is a lack of appropriate general methodology for clustering of sequence-level genomic and epigenomic data, e.g. ChIP-based data. We here introduce a general methodology for clustering data sets of coordinates relative to a genome assembly, i.e. genomic tracks. By defining appropriate feature extraction approaches and similarity measures, we allow biologically meaningful clustering to be performed for genomic tracks using standard clustering algorithms. An implementation of the methodology is provided through a tool, ClusTrack, which allows fine-tuned clustering analyses to be specified through a web-based interface. We apply our methods to the clustering of occupancy of the H3K4me1 histone modification in samples from a range of different cell types. The majority of samples form meaningful subclusters, confirming that the definitions of features and similarity capture biological, rather than technical, variation between the genomic tracks. Input data and results are available, and can be reproduced, through a Galaxy Pages document at http://hyperbrowser.uio.no/hb/u/hb-superuser/p/clustrack. The clustering functionality is available as a Galaxy tool, under the menu option "Specialized analyzis of tracks", and the submenu option "Cluster tracks based on genome level similarity", at the Genomic HyperBrowser server: http://hyperbrowser.uio.no/hb/. 相似文献