Similar Articles
20 similar articles found.
1.
Rapid improvements in mass spectrometry sensitivity and mass accuracy, combined with improved liquid chromatography separation technologies, allow the acquisition of high-throughput metabolomics data, providing an excellent opportunity to understand biological processes. While spectral deconvolution software can identify discrete masses and their associated isotopes and adducts, the utility of metabolomic approaches for many statistical analyses, such as identifying differentially abundant ions, depends heavily on data quality and robustness, especially the accuracy of aligning features across multiple biological replicates. We have developed a novel algorithm for feature alignment using density maximization. Instead of a greedy iterative, and hence local, merging strategy, which has been widely used in the literature and in commercial applications, we apply a global merging strategy to improve alignment quality. Using both simulated and real data, we demonstrate that our new algorithm provides high map (e.g. chromatogram) coverage, which is critically important for non-targeted comparative metabolite profiling of highly replicated biological datasets.
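To make the global-versus-local distinction concrete, here is a minimal sketch of density-maximization alignment, assuming pooled features are (rt, m/z, sample) tuples. Function names, tolerances and the quadratic neighbour search are illustrative stand-ins, not the authors' implementation:

```python
def align_by_density(features, rt_tol=0.1, mz_tol=0.01):
    """Global merging sketch: features from all replicates are pooled and,
    at every step, the densest neighbourhood over the WHOLE map becomes a
    cluster, instead of merging runs pairwise in input order.

    features: list of (rt, mz, sample_id) tuples.
    Returns clusters as dicts mapping sample_id -> feature.
    """
    unassigned = set(range(len(features)))

    def neighbours(i):
        rt, mz, _ = features[i]
        return [j for j in unassigned
                if abs(features[j][0] - rt) <= rt_tol
                and abs(features[j][1] - mz) <= mz_tol]

    clusters = []
    while unassigned:
        centre = max(unassigned, key=lambda i: len(neighbours(i)))  # global step
        by_sample = {}
        for j in neighbours(centre):
            s = features[j][2]  # keep one feature per sample: closest in m/z
            if s not in by_sample or (abs(features[j][1] - features[centre][1])
                                      < abs(features[by_sample[s]][1] - features[centre][1])):
                by_sample[s] = j
        clusters.append({s: features[j] for s, j in by_sample.items()})
        unassigned -= set(by_sample.values())
    return clusters
```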

2.
MOTIVATION: Multiple sequence alignment is a fundamental task in bioinformatics. Current tools typically form an initial alignment by merging subalignments, and then polish this alignment by repeated splitting and merging of subalignments to obtain an improved final alignment. In general, this form-and-polish strategy consists of several stages, and a profusion of methods have been tried at every stage. We carefully investigate: (1) how to utilize a new algorithm for aligning alignments that optimally solves the common subproblem of merging subalignments, and (2) what the best choice of method for each stage is to obtain the highest-quality alignment. RESULTS: We study six stages in the form-and-polish strategy for multiple alignment: parameter choice, distance estimation, merge-tree construction, sequence-pair weighting, alignment merging, and polishing. For each stage, we consider novel approaches as well as standard ones. Interestingly, the greatest gains in alignment quality come from (i) estimating distances by a new approach using normalized alignment costs, and (ii) polishing by a new approach using 3-cuts. Experiments with a parameter-value oracle suggest large gains in quality may be possible through an input-dependent choice of alignment parameters, and we present a promising approach for building such an oracle. Combining the best approaches to each stage yields a new tool we call Opal that matches the quality of the top tools on benchmark alignments, without employing alignment consistency or hydrophobic gap penalties. AVAILABILITY: Opal, a multiple alignment tool that implements the best methods in our study, is freely available at http://opal.cs.arizona.edu.
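As a sketch of point (i), normalized alignment costs can be turned into a distance matrix for merge-tree construction. `pairwise_cost` is a hypothetical hook for any pairwise aligner, and dividing by alignment length is the general idea rather than Opal's exact normalization:

```python
def distance_matrix(seqs, pairwise_cost):
    """Distance estimates from normalized alignment costs (illustrative).

    pairwise_cost(a, b) -> (cost, alignment_length) is assumed to come
    from an external pairwise aligner.  Normalizing by alignment length
    makes costs of long and short pairs comparable when the merge tree
    is built by clustering on these distances.
    """
    n = len(seqs)
    d = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            cost, aln_len = pairwise_cost(seqs[i], seqs[j])
            d[i][j] = d[j][i] = cost / max(aln_len, 1)
    return d
```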

3.
Yang ZR, Grant M. PLoS ONE 2012, 7(6): e39158
Small molecules are central to all biological processes, and metabolomics is becoming an increasingly important discovery tool. Robust, accurate and efficient experimental approaches are critical to supporting and validating predictions from post-genomic studies. Accurately predicting metabolic changes and dynamics requires an experimental design with multiple biological replicates and usually multiple treatments. Mass spectra from each run are processed and metabolite features are extracted; because of limited machine resolution and variation between replicates, one metabolite may be recorded with different retention time and mass values in different spectra. A major impediment to effectively utilizing untargeted metabolomics data is therefore ensuring accurate spectral alignment, enabling precise recognition of features (metabolites) across spectra. Existing alignment algorithms use either a global merge strategy, which delivers an accurate alignment but lacks efficiency, or a local merge strategy, which is fast but often inaccurate. Here we document a new algorithm employing a technique known as quicksort. Results on both simulated and real data show that this algorithm dramatically increases alignment speed while also improving alignment accuracy.
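A hedged sketch of the sort-then-sweep idea the abstract alludes to; Python's built-in sort stands in for quicksort (both give the same O(n log n) ordering), and the tolerances are illustrative:

```python
def sweep_align(features, mz_tol=0.01, rt_tol=0.2):
    """Sort features once, then group them in a single linear sweep.

    features: list of (mz, rt, sample_id) pooled from all runs.  Sorting
    replaces all-against-all comparison, so the whole grouping costs
    O(n log n) instead of the quadratic cost of naive merging.
    """
    if not features:
        return []
    feats = sorted(features)  # lexicographic: primarily by m/z
    groups, current = [], [feats[0]]
    for f in feats[1:]:
        # A gap larger than the tolerance closes the current group.
        if f[0] - current[-1][0] <= mz_tol and abs(f[1] - current[-1][1]) <= rt_tol:
            current.append(f)
        else:
            groups.append(current)
            current = [f]
    groups.append(current)
    return groups
```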

4.
An essential element of any strategy for non-targeted metabolomics analysis of complex biological extracts is the capacity to perform comparisons between large numbers of samples. As the most widely used technologies are all based on mass spectrometry (e.g. GC-MS, LC-MS), this entails being able to compare, reliably and (semi-)automatically, large series of chromatographic mass spectra from which compositional differences are to be extracted in a statistically justifiable manner. In this paper we describe a novel approach for extracting relevant information from multiple full-scan metabolic profiles derived from LC-MS analyses. Specifically designed software makes it possible to combine all mass peaks on the basis of retention time and m/z values alone, without prior identification, producing a data matrix that can then be used for multivariate statistical analysis. To demonstrate the capacity of this approach, aqueous methanol extracts from potato tuber tissues of eight contrasting genotypes, harvested at two developmental stages, were used. Our results show that it is possible to reproducibly discover discriminatory mass peaks related both to the genetic origin of the material and to the developmental stage at which it was harvested. In addition, the limitations of the approach are explored through a careful evaluation of alignment quality.

5.
Phosphorylation site assignment for high-throughput tandem mass spectrometry (LC-MS/MS) data is one of the most common and critical tasks in phosphoproteomics, and correctly assigning phosphorylated residues helps us understand their biological significance. Common search algorithms (such as Sequest and Mascot) do not incorporate site assignment, so additional algorithms are needed to assign phosphorylation sites in mass spectrometry data. The main contribution of this study is the design and implementation of a linear-time and linear-space dynamic programming strategy for phosphorylation site assignment, referred to as PhosSA. The algorithm uses the summation of peak intensities associated with theoretical spectra as its objective function. Quality control of the assigned sites is achieved with a post-processing redundancy criterion that reflects the signal-to-noise properties of the fragmented spectra. The algorithm was assessed with experimentally generated data sets of synthetic peptides for which the phosphorylation sites were known. We report that PhosSA achieves a high degree of accuracy and sensitivity on all of these data sets. The implemented algorithm is extremely fast and scales well with an increasing number of spectra (we report up to 0.5 million spectra/hour on a moderate workstation), and it accepts results from both the Sequest and Mascot search engines. An executable is freely available at http://helixweb.nih.gov/ESBL/PhosSA/ for academic research purposes.
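PhosSA's linear-time dynamic program is not reproduced here, but its objective can be sketched: score every candidate placement of the phosphate by the summed intensity of matching theoretical fragment peaks, then take the argmax. The brute-force search, the abridged mass table and the b-ion-only spectrum are illustrative simplifications:

```python
from itertools import combinations

PHOSPHO = 79.96633  # monoisotopic mass added by phosphorylation (Da)
RESIDUE = {'G': 57.02146, 'A': 71.03711, 'S': 87.03203, 'T': 101.04768,
           'Y': 163.06333, 'K': 128.09496, 'R': 156.10111}  # abridged table

def matched_intensity(peptide, sites, spectrum, tol=0.5):
    """Objective: total intensity of observed peaks matching theoretical
    b-ion masses for a given set of phospho sites.  spectrum is a list of
    (mz, intensity); y ions and charge states are omitted for brevity."""
    score, prefix = 0.0, 1.007  # b ions carry roughly one extra proton
    for i, aa in enumerate(peptide):
        prefix += RESIDUE[aa] + (PHOSPHO if i in sites else 0.0)
        score += sum(inten for mz, inten in spectrum if abs(mz - prefix) <= tol)
    return score

def assign_sites(peptide, n_phospho, spectrum):
    """Exhaustive version of the assignment; PhosSA reaches the same
    argmax in linear time and space via dynamic programming."""
    candidates = [i for i, aa in enumerate(peptide) if aa in 'STY']
    return max(combinations(candidates, n_phospho),
               key=lambda sites: matched_intensity(peptide, set(sites), spectrum))
```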

6.
Algorithms that can robustly identify post-translational protein modifications from mass spectrometry data are needed for data mining and for furthering biological interpretation. In this study, we determined that a mass-based alignment algorithm (OpenSea) for de novo sequencing results can identify post-translationally modified peptides in a high-throughput environment. A complex digest of proteins from human cataractous lens, a tissue containing a high abundance of modified proteins, was analyzed by two-dimensional liquid chromatography, and data were collected on both high and low mass accuracy instruments. The data were analyzed by automated de novo sequencing followed by OpenSea mass-based sequence alignment. A total of 80 modifications were detected, 36 of which were previously unreported in the lens. This demonstrates the potential to identify large numbers of known and previously unknown protein modifications in a given tissue using automated data processing algorithms such as OpenSea.
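The essence of mass-based (rather than residue-based) alignment can be sketched by comparing cumulative mass ladders: the position where a persistent offset begins localizes a modification, and the offset is its mass. The equal-length assumption and the names are illustrative; OpenSea itself also handles insertions, deletions and ambiguous de novo calls:

```python
def locate_mass_shift(observed_masses, database_peptide, residue_mass, tol=0.02):
    """observed_masses: per-residue masses from de novo sequencing (Da).
    database_peptide: candidate sequence of the same length.
    residue_mass: dict of standard residue masses.
    Returns (residue_index, modification_mass) at the first point where
    the cumulative mass ladders disagree, or None if they agree."""
    obs = theo = 0.0
    for i, (m, aa) in enumerate(zip(observed_masses, database_peptide)):
        obs += m
        theo += residue_mass[aa]
        if abs(obs - theo) > tol:
            return i, obs - theo  # e.g. +79.97 Da here suggests phosphorylation
    return None
```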

7.

Background

Obtaining an accurate sequence alignment is fundamental for consistently analyzing biological data. Although this problem can be solved efficiently for two sequences, exact inference of the optimal alignment quickly becomes computationally intractable in the multiple sequence alignment case. To cope with the high computational expense, approximate heuristic methods have been proposed that address the problem indirectly by progressively aligning the sequences in pairs according to their relatedness. These methods, however, cannot revise the alignment of an already aligned group of sequences in the light of new data, and thus compromise the quality of the resulting alignment. In this paper we present ReformAlign, a novel meta-alignment approach that can significantly improve the quality of alignments produced by popular aligners. We call ReformAlign a meta-aligner because it requires an initial alignment, which may come from a variety of alignment programs. The main idea behind ReformAlign is straightforward: first, an existing alignment is used to construct a standard profile that summarizes it; then all sequences are individually re-aligned against this profile. From each sequence-profile comparison, the alignment of the sequence against the profile is recorded, and the final alignment is inferred by merging all the individual sub-alignments into a unified set. Employing ReformAlign often yields alignments that are significantly more accurate than the starting ones.
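The two steps just described can be pictured with a toy example: build a column-frequency profile from the initial alignment, then re-align each (ungapped) sequence against it. The DNA alphabet, the frequency-based match score and the ban on insertions relative to the profile are simplifications that keep every re-aligned row in the profile's coordinate system, so the sub-alignments merge by simple stacking:

```python
import numpy as np

ALPHA = "ACGT-"  # toy DNA alphabet; real profiles cover the full residue set

def build_profile(alignment):
    """Column-wise residue/gap frequencies of the initial alignment."""
    cols = len(alignment[0])
    prof = np.zeros((cols, len(ALPHA)))
    for row in alignment:
        for c, ch in enumerate(row):
            prof[c, ALPHA.index(ch)] += 1
    return prof / len(alignment)

def realign_to_profile(seq, prof, gap=-1.0):
    """Re-align one ungapped sequence against the profile columns.
    Match score = frequency of the residue in that column, a
    simplification of real profile scoring.  Insertions relative to the
    profile are not allowed (assumes len(seq) <= number of columns)."""
    n, m = prof.shape[0], len(seq)
    S = np.full((n + 1, m + 1), -np.inf)
    S[:, 0] = np.arange(n + 1) * gap
    for i in range(1, n + 1):
        for j in range(1, min(i, m) + 1):
            diag = S[i - 1, j - 1] + prof[i - 1, ALPHA.index(seq[j - 1])]
            S[i, j] = max(diag, S[i - 1, j] + gap)
    out, i, j = [], n, m
    while i > 0:  # traceback: recover which move produced each cell
        diag = (S[i - 1, j - 1] + prof[i - 1, ALPHA.index(seq[j - 1])]
                if j > 0 else -np.inf)
        if j > 0 and np.isclose(S[i, j], diag):
            out.append(seq[j - 1])
            j -= 1
        else:
            out.append('-')
        i -= 1
    return ''.join(reversed(out))

# Merging: re-align every input sequence and stack the rows, e.g.
# prof = build_profile(["AC-GT", "ACAGT"])
# reformed = [realign_to_profile(s, prof) for s in ("ACGT", "ACAGT")]
```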

Results

We evaluated the effect of ReformAlign on the alignments generated by ten leading alignment methods, using real data of variable size and sequence identity. The experimental results suggest that the proposed meta-aligner often leads to statistically significantly more accurate alignments. Furthermore, we show that ReformAlign yields more substantial improvements when the starting alignment is of relatively low quality or when the input sequences are harder to align.

Conclusions

The proposed profile-based meta-alignment approach seems to be a promising and computationally efficient method that can be combined with practically all popular alignment methods and may lead to significant improvements in the generated alignments.

Electronic supplementary material

The online version of this article (doi:10.1186/1471-2105-15-265) contains supplementary material, which is available to authorized users.

8.
Species identification based on short sequences of DNA markers, that is, DNA barcoding, has emerged as an integral part of modern taxonomy. However, software for the analysis of large and multilocus barcoding data sets is scarce. The Basic Local Alignment Search Tool (BLAST) is currently the fastest tool capable of handling large databases (e.g. >5000 sequences), but its accuracy is a concern, and it has been criticized for relying on local optimization. More accurate software, however, requires sequence alignment or complex calculations that are time-consuming for large data sets, whether during preprocessing or during the search stage. It is therefore imperative to develop a practical program for accurate and scalable species identification in DNA barcoding. In this context, we present VIP Barcoding: user-friendly software with a graphical user interface for rapid DNA barcoding. It adopts a hybrid, two-stage algorithm. First, an alignment-free composition vector (CV) method is used to reduce the search space by screening a reference database. The alignment-based K2P-distance nearest-neighbour method is then employed to analyse the smaller data set generated in the first stage. In comparison with other software, we demonstrate that VIP Barcoding has (i) higher accuracy than Blastn and several alignment-free methods and (ii) higher scalability than alignment-based distance methods and character-based methods. These results suggest that the platform can handle both large-scale and multilocus barcoding data with accuracy and can contribute to DNA barcoding for modern taxonomy. VIP Barcoding is free and available at http://msl.sls.cuhk.edu.hk/vipbarcoding/.
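The hybrid design can be sketched as below. The composition-vector stage is simplified (the published CV method also subtracts a Markov-model background), and stage two assumes the surviving references are already aligned to the query, which the real tool takes care of:

```python
import math
from collections import Counter

def composition_vector(seq, k=5):
    """Alignment-free k-mer frequency vector for the fast screening stage."""
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    total = sum(counts.values())
    return {kmer: c / total for kmer, c in counts.items()}

def cosine_distance(u, v):
    dot = sum(x * v.get(kmer, 0.0) for kmer, x in u.items())
    norm = math.sqrt(sum(x * x for x in u.values())) * \
           math.sqrt(sum(x * x for x in v.values()))
    return 1.0 - dot / norm

def k2p_distance(a, b):
    """Kimura 2-parameter distance between two aligned sequences:
    d = -1/2 * ln((1 - 2P - Q) * sqrt(1 - 2Q)),
    with P and Q the transition and transversion proportions."""
    pairs = [(x, y) for x, y in zip(a, b) if x in 'ACGT' and y in 'ACGT']
    ts = sum(1 for x, y in pairs if {x, y} in ({'A', 'G'}, {'C', 'T'}))
    tv = sum(1 for x, y in pairs if x != y) - ts
    P, Q = ts / len(pairs), tv / len(pairs)
    return -0.5 * math.log((1 - 2 * P - Q) * math.sqrt(1 - 2 * Q))

def identify(query, db, keep=20):
    """Stage 1: CV screen shrinks the reference set; stage 2: K2P nearest
    neighbour decides.  db: list of (species, aligned_sequence)."""
    qv = composition_vector(query)
    survivors = sorted(db, key=lambda r: cosine_distance(
        qv, composition_vector(r[1])))[:keep]
    return min(survivors, key=lambda r: k2p_distance(query, r[1]))
```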

9.
Purpose: For our research on computer-optimised and automated cochlear implant surgery, we pursue a model-based approach to overcome the limitations of currently available clinical imaging modalities. A serial cross-section preparation procedure has been developed and evaluated for accuracy, to serve as the basis of a digital anatomical atlas that makes delicate soft-tissue structures available for pre-operative planning.

Methods: A special grinding tool was developed that allows a defined amount of abrasion to be set, since equidistant slice thickness was considered crucial. In addition, the actual abrasion of each step was accurately measured and used during three-dimensional reconstruction of the serial cross-sectional images obtained via digital photo documentation after each micro-grinding step. A well-known reference object was prepared with this procedure and evaluated in terms of accuracy.

Results: Reconstruction of the whole sample was achieved with an error of less than 0.4%, and the edge lengths in the direction of abrasion were reconstructed with an average error of 0.6 ± 0.3 mm; both results confirm that equidistant abrasion was realised. Using artificial registration fiducials and a custom-made algorithm for image alignment, parallelism and rectangularity were preserved with average errors below 0.4° ± 0.3°.

Conclusion: We present a systematic, practicable and reliable method for the geometrically accurate reconstruction of anatomical structures that is especially suitable for middle and inner ear anatomy, including soft-tissue structures. For the first time, the quality of such a reconstruction process has been quantified and its usability successfully demonstrated.

10.
We examined the accuracy of global Smith-Waterman alignments and of Pareto-optimal alignments as a function of sequence similarity (the percentage of coinciding positions, %id, and the number of removed fragments, NGap). We developed an algorithm that constructs a set of three to six alignments whose best member, on average, exceeds in accuracy the best alignment obtainable with the Smith-Waterman algorithm. For weakly homologous sequences (%id 15, NGap 20), the gain in accuracy is on average about 8%, the average accuracy of global Smith-Waterman alignments themselves being about 38% (accuracy was estimated on model test sets).

11.

Introduction

Liquid chromatography-mass spectrometry (LC-MS) is a commonly used technique in untargeted metabolomics owing to its broad metabolite coverage, high sensitivity and simple sample preparation. However, data generated across multiple batches are affected by measurement errors arising from drift in signal intensity, mass accuracy and retention time between samples, both within and between batches. These measurement errors reduce repeatability and reproducibility and may thus decrease the power to detect biological responses and obscure interpretation.

Objective

Our aim was to develop procedures to address and correct for within- and between-batch variability in processing multiple-batch untargeted LC-MS metabolomics data to increase their quality.

Methods

Algorithms were developed for: (i) alignment and merging of features that are systematically misaligned between batches, by aggregating feature presence/missingness at the batch level and combining similar features orthogonally present between batches; and (ii) within-batch drift correction using a cluster-based approach that allows multiple drift patterns within a batch. Furthermore, a heuristic criterion was developed for the feature-wise choice between reference-based and population-based between-batch normalisation.
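A minimal sketch of point (ii), cluster-based within-batch drift correction, assuming QC injections and a precomputed cluster label per feature (the clustering itself is not shown). The low-order polynomial is an illustrative stand-in for whatever smoother batchCorr actually fits:

```python
import numpy as np

def drift_correct(intensities, injection_order, qc_idx, labels, degree=2):
    """intensities: samples x features matrix for one batch.
    qc_idx: row indices of QC injections.
    labels: one cluster label per feature, grouping features that share
    a drift pattern, so several patterns can coexist within a batch."""
    corrected = intensities.astype(float).copy()
    labels = np.asarray(labels)
    x_all = np.asarray(injection_order, float)
    x_qc = x_all[qc_idx]
    for lab in np.unique(labels):
        feats = np.where(labels == lab)[0]
        # One drift curve per cluster, fitted on the cluster's mean QC profile.
        qc_profile = intensities[np.ix_(qc_idx, feats)].mean(axis=1)
        coef = np.polyfit(x_qc, qc_profile / qc_profile.mean(), degree)
        trend = np.polyval(coef, x_all)        # predicted relative drift
        corrected[:, feats] /= trend[:, None]  # divide the drift out
    return corrected
```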

Results

In authentic data, between-batch alignment resulted in 15% more features being picked and 15% of previously misaligned features being correctly deconvoluted. Within-batch correction decreased the median coefficient of variation of quality-control features from 20.5% to 15.1%. The algorithms are open source and available as an R package ('batchCorr').

Conclusions

The developed procedures provide unbiased measures of improved data quality, with implications for improved data analysis. Although developed for LC-MS based metabolomics, these methods are generic and can be applied to other data suffering from similar limitations.

12.
Ren Shanshan, Ahmed Nauman, Bertels Koen, Al-Ars Zaid. BMC Genomics 2019, 20(2): 103-116
Background

Pairwise sequence alignment is widely used in many biological tools and applications. Existing GPU-accelerated implementations mainly focus on calculating the optimal alignment score and omit identifying the optimal alignment itself. In GATK HaplotypeCaller (HC), the semi-global pairwise sequence alignment with traceback has so far been difficult to accelerate effectively on GPUs.

Results

We first analyze the characteristics of the semi-global alignment with traceback in GATK HC and then propose a new algorithm that retrieves the optimal alignment efficiently on GPUs. For the first stage, we choose an intra-task parallelization model to calculate the position of the optimal alignment score and the backtracking matrix. Moreover, in this stage our GPU implementation also records the lengths of consecutive matches/mismatches, in addition to the lengths of consecutive insertions and deletions recorded by the CPU-based implementation. This makes it efficient to traverse the backtracking matrix and obtain the optimal alignment in the second stage.
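The run-length trick can be illustrated independently of the GPU kernel: storing (operation, run length) pairs lets the traceback jump over whole runs instead of revisiting every cell. A toy encoder for a traceback path (this shows the general idea, not the kernel's actual data layout):

```python
def compress_moves(moves):
    """Run-length encode a traceback path, e.g. list('MMMIMMDD') ->
    [('M', 3), ('I', 1), ('M', 2), ('D', 2)] (read off as 3M1I2M2D)."""
    if not moves:
        return []
    runs, count = [], 1
    for prev, cur in zip(moves, moves[1:]):
        if cur == prev:
            count += 1
        else:
            runs.append((prev, count))
            count = 1
    runs.append((moves[-1], count))
    return runs
```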

Conclusions

Experimental results show that our alignment kernel with traceback is up to 80x and 14.14x faster than its CPU counterpart on synthetic and real datasets, respectively. When integrated into GATK HC (alongside a GPU-accelerated pair-HMMs forward kernel), the overall pipeline is 2.3x faster than the baseline GATK HC implementation, and 1.34x faster than a GATK HC implementation with only the GPU-based pair-HMMs forward algorithm integrated. Although the methods proposed in this paper were designed to improve the performance of GATK HC, they can also be applied to other pairwise alignments and applications.


13.
Xie Minzhu, Lei Xiaowen, Zhong Jianchen, Ouyang Jianxing, Li Guijing. BMC Bioinformatics 2022, 23(8): 1-13
Background

Essential proteins are indispensable to the development and survival of cells. Identifying them is helpful not only for understanding the minimal requirements for cell survival, but also for disease diagnosis, drug design and medical treatment. With the rapid accumulation of protein–protein interaction (PPI) data, computationally identifying essential proteins from protein–protein interaction networks (PINs) has become increasingly popular, and a number of PIN-based approaches for essential protein identification have been developed.

Results

In this paper, we propose a new and effective approach called iMEPP that identifies essential proteins from PINs by fusing multiple types of biological data and applying an influence-maximization mechanism to the PINs. Concretely, we first integrate PPI data, gene expression data and Gene Ontology annotations to construct weighted PINs, alleviating the impact of the high false-positive rate of raw PPI data. We then define influence scores for the nodes of the PINs using both orthology data and PIN topological information. Finally, we develop an influence-discount algorithm that identifies essential proteins on the basis of the influence-maximization mechanism.
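A sketch of the influence-discount step, assuming node influence scores have already been computed from topology and orthology as described. The proportional discount rule below is illustrative, not necessarily iMEPP's exact formula:

```python
def influence_discount(scores, adjacency, k):
    """Greedy selection with influence discounting.

    scores: protein -> precomputed influence score.
    adjacency: protein -> {neighbour: edge_weight} in the weighted PIN.
    After a protein is picked, its neighbours' scores are discounted so
    overlapping influence is not counted twice (influence maximization).
    """
    remaining = dict(scores)
    selected = []
    for _ in range(min(k, len(remaining))):
        best = max(remaining, key=remaining.get)
        selected.append(best)
        del remaining[best]
        for nbr, w in adjacency.get(best, {}).items():
            if nbr in remaining:
                remaining[nbr] -= w * scores[best]  # discount shared influence
    return selected
```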

Conclusions

We applied our method to identifying essential proteins from the Saccharomyces cerevisiae PIN. Experiments show that iMEPP outperforms existing methods, validating its effectiveness and advantage.


14.
15.
16.
One of the challenges of using mass spectrometry for metabolomic analyses of samples consisting of thousands of compounds is peak identification and alignment. This paper addresses the issue of aligning mass spectral data from different samples in order to determine average component m/z peak values. The alignment scheme developed takes the instrument's m/z measurement error into consideration to heuristically align two or more samples, using a technique comparable to automated visual inspection and alignment. Results obtained with mass spectral profiles of replicate human urine samples suggest that this heuristic alignment approach is more efficient than approaches based on hierarchical clustering algorithms. The output consists of an average m/z and intensity value for each spectral component, together with the number of matches from the different samples. A major advantage of this alignment strategy is that it eliminates the boundary problem that occurs when predetermined fixed bins are used to identify and combine peaks for averaging, and its efficient runtime allows large datasets to be processed quickly.
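The boundary problem, and the gap-based grouping that sidesteps it, fit in a few lines (tolerances illustrative):

```python
def fixed_bins(mzs, width=0.01):
    """Fixed binning: 100.00999 and 100.01001 differ by only 2e-5 yet land
    in different bins -- the boundary problem."""
    bins = {}
    for mz in mzs:
        bins.setdefault(int(mz / width), []).append(mz)
    return list(bins.values())

def gap_grouping(mzs, tol=0.005):
    """Boundary-free grouping: sort, then split only where adjacent peaks
    are farther apart than the instrument's m/z error."""
    mzs = sorted(mzs)
    groups, cur = [], [mzs[0]]
    for mz in mzs[1:]:
        if mz - cur[-1] <= tol:
            cur.append(mz)
        else:
            groups.append(cur)
            cur = [mz]
    groups.append(cur)
    return groups

def summarize(group):
    # Output as described: average m/z plus the number of matched peaks.
    return sum(group) / len(group), len(group)
```

For `[100.00999, 100.01001]`, `fixed_bins` splits the pair across two bins while `gap_grouping` keeps it in one group.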

17.
Le Vuong, Quinn Thomas P., Tran Truyen, Venkatesh Svetha. BMC Genomics 2020, 21(4): 1-15
Background

Technological advances in next-generation sequencing (NGS) and chromatographic assays [e.g., liquid chromatography mass spectrometry (LC-MS)] have made it possible to identify thousands of microbe and metabolite species, and to measure their relative abundance. In this paper, we propose a sparse neural encoder-decoder network to predict metabolite abundances from microbe abundances.

Results

Using paired data from a cohort of inflammatory bowel disease (IBD) patients, we show that our neural encoder-decoder model outperforms linear univariate and multivariate methods in terms of accuracy, sparsity, and stability. Importantly, we show that our neural encoder-decoder model is not simply a black box designed to maximize predictive accuracy. Rather, the network’s hidden layer (i.e., the latent space, comprised only of sparsely weighted microbe counts) actually captures key microbe-metabolite relationships that are themselves clinically meaningful. Although this hidden layer is learned without any knowledge of the patient’s diagnosis, we show that the learned latent features are structured in a way that predicts IBD and treatment status with high accuracy.

Conclusions

By imposing a non-negative weights constraint, the network becomes a directed graph where each downstream node is interpretable as the additive combination of the upstream nodes. Here, the middle layer comprises distinct microbe-metabolite axes that relate key microbial biomarkers with metabolite biomarkers. By pre-processing the microbiome and metabolome data using compositional data analysis methods, we ensure that our proposed multi-omics workflow will generalize to any pair of -omics data. To the best of our knowledge, this work is the first application of neural encoder-decoders for the interpretable integration of multi-omics biological data.
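A hedged PyTorch sketch of such a non-negative, sparsity-regularized encoder-decoder; the layer shapes, the L1 weight and the clamp-after-step schedule are illustrative choices, not the authors' published configuration:

```python
import torch
import torch.nn as nn

class NonNegEncoderDecoder(nn.Module):
    """Microbe abundances -> small latent layer -> metabolite abundances.
    With non-negative weights, every downstream node is an additive
    combination of upstream nodes, which is what makes the latent
    microbe-metabolite axes readable."""
    def __init__(self, n_microbes, n_latent, n_metabolites):
        super().__init__()
        self.encoder = nn.Linear(n_microbes, n_latent, bias=False)
        self.decoder = nn.Linear(n_latent, n_metabolites, bias=False)

    def forward(self, x):
        return self.decoder(self.encoder(x))

    def clamp_nonnegative(self):
        # Project weights back onto the non-negative orthant after each step.
        with torch.no_grad():
            self.encoder.weight.clamp_(min=0)
            self.decoder.weight.clamp_(min=0)

def train_step(model, opt, x, y, l1=1e-3):
    opt.zero_grad()
    # L1 on the encoder keeps the latent space sparsely connected.
    loss = nn.functional.mse_loss(model(x), y) \
        + l1 * model.encoder.weight.abs().sum()
    loss.backward()
    opt.step()
    model.clamp_nonnegative()
    return loss.item()
```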


18.

Background  

In a previous paper, we introduced MUSCLE, a new program for creating multiple alignments of protein sequences, giving a brief summary of the algorithm and showing MUSCLE to achieve the highest scores reported to date on four alignment accuracy benchmarks. Here we present a more complete discussion of the algorithm, describing several previously unpublished techniques that improve biological accuracy and/or computational complexity. We introduce a new option, MUSCLE-fast, designed for high-throughput applications. We also describe a new protocol for evaluating objective functions that align two profiles.

19.
Biswas Bipasa, Lai Yinglei. BMC Genomics 2019, 20(2): 35-47
Background

Next-generation sequencing technology allows us to obtain a large number of short DNA sequence (DNA-seq) reads at a genome-wide level, and DNA-seq data have been collected increasingly in recent years. Count-type analysis is a widely used approach for DNA-seq data. However, the associated pre-processing relies on a moving-window method, in which a window size must be defined in order to obtain count-type data, and useful information can be lost in this pre-processing step.

Results

In this study, we propose to analyze DNA-seq data with a distance-type measure: distances are measured in base pairs (bp) between two adjacent alignments of short reads mapped to a reference genome. Our simulation study, based on experimental data, confirms the advantages of the distance-type measure in both detection power and detection accuracy. Furthermore, we propose artificial censoring of the distance data, so that distances larger than a given value are treated as potential outliers; the purpose is to simplify the pre-processing of DNA-seq data. Statistically, we model the distance data with a mixture of right-censored geometric distributions. Additionally, to reduce GC-content bias, we extend the mixture model to a mixture of generalized linear models (GLMs). The model can be estimated with the Newton-Raphson algorithm as well as the Expectation-Maximization (EM) algorithm. We have conducted simulations to evaluate the performance of our approach. Based on a rank-based inverse normal transformation of the distance data, we obtain z-values for follow-up analysis. As an illustration, we present an application to DNA-seq data from a pair of normal and tumor cell lines, with a change-point analysis of the z-values to detect DNA copy number alterations.
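The censored-mixture component can be sketched with plain EM, exploiting the memorylessness of the geometric distribution for censored observations; the GLM extension for GC content is omitted, and the initialization is illustrative:

```python
import numpy as np

def geom_pmf(x, p):   # support 1, 2, 3, ...
    return (1 - p) ** (x - 1) * p

def geom_sf(c, p):    # P(X > c)
    return (1 - p) ** c

def em_censored_geometric(x, censored, n_comp=2, n_iter=200, seed=0):
    """EM for a mixture of right-censored geometric distributions.

    x: distances in bp; censored: boolean array, True where the distance
    was artificially censored at the recorded value."""
    x = np.asarray(x, float)
    cen = np.asarray(censored, bool)
    rng = np.random.default_rng(seed)
    p = rng.uniform(0.01, 0.5, n_comp)   # component success probabilities
    w = np.full(n_comp, 1.0 / n_comp)    # mixing weights
    for _ in range(n_iter):
        # E-step: pmf for observed distances, survival for censored ones.
        lik = w * np.where(cen[:, None],
                           geom_sf(x[:, None], p),
                           geom_pmf(x[:, None], p))
        r = lik / lik.sum(axis=1, keepdims=True)
        # M-step: by memorylessness, a distance censored at c has
        # conditional mean c + 1/p under the current parameters.
        exp_x = np.where(cen[:, None], x[:, None] + 1.0 / p, x[:, None])
        w = r.mean(axis=0)
        p = r.sum(axis=0) / (r * exp_x).sum(axis=0)
    return w, p
```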

Conclusion

Our distance-type measure approach is novel: it requires neither a fixed nor a sliding window procedure for generating count-type data. Its advantages have been demonstrated in our simulation studies, and its practical usefulness has been illustrated by an application to experimental data.


20.
An important and still unsolved problem in gene prediction is designing an algorithm that not only predicts genes but also estimates the quality of individual predictions. Since experimental biologists are interested mainly in the reliability of individual predictions (rather than in the average reliability of an algorithm), we attempted to develop a gene recognition algorithm that guarantees a certain quality of prediction. We demonstrate here that the similarity level with a related protein is a reliable quality estimator for the spliced alignment approach to gene recognition. We also study the average performance of the spliced alignment algorithm for different targets on a complete set of human genomic sequences with known relatives, and demonstrate that the average performance of the method remains high even for very distant targets. Using plant, fungal and prokaryotic target proteins for the recognition of human genes leads to accurate predictions, with correlation coefficients of 95, 93 and 91%, respectively. For target proteins with a similarity score above 60%, not only is the average correlation coefficient very high (97% and up), but the quality of individual predictions is also guaranteed to be at least 82%. This indicates that, at this level of similarity, the worst-case performance of the spliced alignment algorithm is better than the average-case performance of many statistical gene recognition methods.
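For reference, the correlation coefficient quoted above is the standard nucleotide-level measure for gene prediction; a small sketch of its computation from a predicted and an annotated coding mask (names illustrative):

```python
import math

def nucleotide_cc(predicted, actual):
    """Correlation coefficient over positions, each marked coding (True)
    or non-coding (False) in the prediction and in the annotation."""
    tp = sum(p and a for p, a in zip(predicted, actual))
    tn = sum((not p) and (not a) for p, a in zip(predicted, actual))
    fp = sum(p and (not a) for p, a in zip(predicted, actual))
    fn = sum((not p) and a for p, a in zip(predicted, actual))
    denom = math.sqrt((tp + fp) * (tn + fn) * (tp + fn) * (tn + fp))
    return (tp * tn - fp * fn) / denom if denom else 0.0
```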
