Similar articles
 Found 20 similar articles (search time: 359 ms)
1.
Abe Ko, Hirayama Masaaki, Ohno Kinji, Shimamura Teppei 《BMC Genomics》2019,20(2):63-75
Background

One of the major challenges in microbial studies is detecting associations between microbial communities and a specific disease. A distinctive feature of microbiome count data is that intestinal bacterial communities form clusters called “enterotypes”, which are characterized by differences in specific bacterial taxa, making it difficult to analyze these data under health and disease conditions. Traditional probabilistic modeling cannot distinguish between the bacterial differences derived from enterotype and those related to a specific disease.

Results

We propose a new probabilistic model, named ENIGMA (Enterotype-like uNIGram mixture model for Microbial Association analysis), which can be used to address these problems. ENIGMA enables simultaneous estimation of enterotype-like clusters characterized by the abundances of signature bacterial genera and the parameters of environmental effects associated with the disease.

Conclusion

In the simulation study, we evaluated the accuracy of parameter estimation. Furthermore, by analyzing the real-world data, we detected the bacteria related to Parkinson’s disease. ENIGMA is implemented in R and is available from GitHub (https://github.com/abikoushi/enigma).


2.
3.
4.
Li Xin, Wu Yufeng 《BMC Bioinformatics》2023,23(8):1-16
Background

Structural variation (SV), which ranges from 50 bp to \(\sim\)3 Mb in size, is an important type of genetic variation. Deletion is a type of SV in which a part of a chromosome or a sequence of DNA is lost during DNA replication. Three types of signals, including discordant read pairs, read depth and split reads, are commonly used for SV detection from high-throughput sequence data. Many tools have been developed for detecting SVs by using one or more of these signals.

Results

In this paper, we develop a new method called EigenDel for detecting the germline submicroscopic genomic deletions. EigenDel first takes advantage of discordant read-pairs and clipped reads to get initial deletion candidates, and then it clusters similar candidates by using unsupervised learning methods. After that, EigenDel uses a carefully designed approach for calling true deletions from each cluster. We conduct various experiments to evaluate the performance of EigenDel on low coverage sequence data.

Conclusions

Our results show that EigenDel outperforms other major methods in balancing accuracy and sensitivity and in reducing bias. EigenDel can be downloaded from https://github.com/lxwgcool/EigenDel.


5.
Huo Zhiguang, Zhu Li, Ma Tianzhou, Liu Hongcheng, Han Song, Liao Daiqing, Zhao Jinying, Tseng George 《Statistics in Biosciences》2020,12(1):1-22

Disease subtype discovery is an essential step in delivering personalized medicine. Disease subtyping via omics data has become a common approach for this purpose. With the advancement of technology and the lower price for generating omics data, multi-level and multi-cohort omics data are prevalent in the public domain, providing unprecedented opportunities to decrypt disease mechanisms. How to fully utilize multi-level/multi-cohort omics data and incorporate established biological knowledge toward disease subtyping remains a challenging problem. In this paper, we propose a meta-analytic integrative sparse Kmeans (MISKmeans) algorithm for integrating multi-cohort/multi-level omics data and prior biological knowledge. Compared with previous methods, MISKmeans shows better clustering accuracy and feature selection relevancy. An efficient R package, “MIS-Kmeans”, calling C++ is freely available on GitHub (https://github.com/Caleb-Huo/MIS-Kmeans).


6.
Background

Record linkage integrates records across multiple related data sources, identifying duplicates and accounting for possible errors. Real-life applications require efficient algorithms to merge these voluminous data sources and find all records belonging to the same individuals. Our recently devised highly efficient record linkage algorithms provide the best-known solutions to this challenging problem.

Method

We have developed RLT-S, a freely available web tool, which implements our single linkage clustering algorithm for record linkage. This tool requires input data sets and a small set of configuration settings about these files to work efficiently. RLT-S employs exact match clustering, blocking on a specified attribute, and single linkage based hierarchical clustering among these blocks.

Results

RLT-S is an implementation package of our sequential record linkage algorithm. It outperforms previous best-known implementations by a large margin. The tool is at least two times faster for any dataset than the previous best-known tools.

Conclusions

The RLT-S tool implements our record linkage algorithm, which outperforms previous best-known algorithms in this area. The website also contains necessary information such as instructions, submission history, feedback, publications and some other sections to facilitate the usage of the tool.

Availability

RLT-S is integrated into http://www.rlatools.com, which is currently serving this tool only. The tool is freely available and can be used without login. All data files used in this paper have been stored in https://github.com/abdullah009/DataRLATools. For copies of the relevant programs please see https://github.com/abdullah009/RLATools.
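The three stages named in the RLT-S abstract (exact match clustering, blocking on an attribute, single linkage clustering within blocks) can be illustrated with a minimal sketch. This is not the RLT-S implementation: the record layout `(name, birth_year)`, the function names, the edit-distance threshold and the use of a union-find structure are all assumptions made for illustration.

```python
from collections import defaultdict

def edit_distance(a, b):
    """Levenshtein distance via the classic two-row dynamic program."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

def link_records(records, block_key, threshold=1):
    """Cluster (name, birth_year) records; returns sorted lists of record indices.

    Hypothetical helper: the threshold and record layout are illustrative only.
    """
    parent = list(range(len(records)))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    def union(x, y):
        parent[find(x)] = find(y)

    # Stage 1: exact match clustering collapses identical records.
    first_seen = {}
    for i, rec in enumerate(records):
        if rec in first_seen:
            union(i, first_seen[rec])
        else:
            first_seen[rec] = i

    # Stage 2: blocking, so only records sharing a block key are compared.
    blocks = defaultdict(list)
    for i in first_seen.values():
        blocks[block_key(records[i])].append(i)

    # Stage 3: single linkage, merging clusters if any pair of names is close.
    for members in blocks.values():
        for a in range(len(members)):
            for b in range(a + 1, len(members)):
                i, j = members[a], members[b]
                if edit_distance(records[i][0], records[j][0]) <= threshold:
                    union(i, j)

    clusters = defaultdict(list)
    for i in range(len(records)):
        clusters[find(i)].append(i)
    return sorted(clusters.values())

records = [
    ("john smith", 1980),
    ("jon smith", 1980),
    ("john smith", 1980),
    ("mary jones", 1975),
    ("john smyth", 1990),
]
print(link_records(records, block_key=lambda rec: rec[1]))
```

Blocking keeps the quadratic comparison confined to each block, which is why the last record, although textually close to the first, stays in its own cluster here: it falls in a different block.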

7.
ABSTRACT

The Hierarchical Factor Segmentation (HFS) method is a non-parametric statistical method for detecting the phase of a biological rhythm shown in an actogram. The detection accuracy of this method was previously measured on actograms showing only circadian rhythms with a constant signal-to-noise ratio (S/N). In the present study, we generated 84 types of artificial actograms including circadian or circatidal rhythms by using three parameters: α/ρ, S/N and period length τ, and evaluated the effectiveness of our devised adaptation of the HFS method, the cycle-by-cycle adaptation. The results showed that the effectiveness of the cycle-by-cycle adaptation was high even when S/N or τ fluctuated through a whole actogram. These results suggest that the cycle-by-cycle adaptation could be effectively applied to various kinds of rhythmic activity data. The C++ source code of the cycle-by-cycle adaptation is available at https://github.com/KazukiSakura/cHFS.git.

8.
9.
Du Nan, Chen Jiao, Sun Yanni 《BMC Genomics》2019,20(2):49-62
Background

Single-molecule, real-time sequencing (SMRT) developed by Pacific BioSciences produces longer reads than second-generation sequencing technologies such as Illumina. The increased read length enables PacBio sequencing to close gaps in genome assembly, reveal structural variations, and characterize the intra-species variations. It also holds the promise to decipher the community structure in complex microbial communities because long reads help metagenomic assembly. One key step in genome assembly using long reads is to quickly identify reads forming overlaps. Because PacBio data has higher sequencing error rate and lower coverage than popular short read sequencing technologies (such as Illumina), efficient detection of true overlaps requires specially designed algorithms. In particular, there is still a need to improve the sensitivity of detecting small overlaps or overlaps with high error rates in both reads. Addressing this need will enable better assembly for metagenomic data produced by third-generation sequencing technologies.

Results

In this work, we designed and implemented an overlap detection program named GroupK, for third-generation sequencing reads based on grouped k-mer hits. While using k-mer hits for detecting reads’ overlaps has been adopted by several existing programs, our method uses a group of short k-mer hits satisfying statistically derived distance constraints to increase the sensitivity of small overlap detection. Grouped k-mer hit was originally designed for homology search. We are the first to apply group hit for long read overlap detection. The experimental results of applying our pipeline to both simulated and real third-generation sequencing data showed that GroupK enables more sensitive overlap detection, especially for datasets of low sequencing coverage.

Conclusions

GroupK is best used for detecting small overlaps for third-generation sequencing data. It provides a useful supplementary tool to existing ones for more sensitive and accurate overlap detection. The source code is freely available at https://github.com/Strideradu/GroupK.
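The idea of grouping short k-mer hits under distance constraints can be sketched as follows. This is a simplified illustration, not GroupK's algorithm: the abstract does not give the statistically derived constraints, so the fixed `max_gap` and `band` values and both function names are hypothetical.

```python
from collections import defaultdict

def kmer_hits(read_a, read_b, k):
    """All (i, j) pairs with read_a[i:i+k] == read_b[j:j+k]."""
    index = defaultdict(list)
    for j in range(len(read_b) - k + 1):
        index[read_b[j:j + k]].append(j)
    return [
        (i, j)
        for i in range(len(read_a) - k + 1)
        for j in index.get(read_a[i:i + k], [])
    ]

def best_group(hits, max_gap, band):
    """Largest run of hits in one diagonal band with consecutive gaps <= max_gap.

    Assumed constraints: hits belong together when their diagonals (i - j)
    fall in the same band of width `band` and neighbours are close in read_a.
    """
    by_band = defaultdict(list)
    for i, j in hits:
        by_band[(i - j) // band].append((i, j))
    best = []
    for group in by_band.values():
        group.sort()
        run = [group[0]]
        for hit in group[1:]:
            if hit[0] - run[-1][0] <= max_gap:
                run.append(hit)
            else:
                if len(run) > len(best):
                    best = run
                run = [hit]
        if len(run) > len(best):
            best = run
    return best

overlap = "ACGTACGGATCC"
read_a = "TTTT" + overlap   # suffix of read_a ...
read_b = overlap + "GGGG"   # ... overlaps the prefix of read_b
group = best_group(kmer_hits(read_a, read_b, k=4), max_gap=3, band=2)
print(len(group), group[0], group[-1])
```

A single stray hit off the main diagonal is ignored because it lands in a different band, while the nine consistent hits form one group that marks the candidate overlap.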


10.
Recurrent neural networks with memory and attention mechanisms are widely used in natural language processing because they can capture short and long term sequential information for diverse tasks. We propose an integrated deep learning model for microbial DNA sequence data, which exploits convolutional neural networks, recurrent neural networks, and attention mechanisms to predict taxonomic classifications and sample-associated attributes, such as the relationship between the microbiome and host phenotype, on the read/sequence level. In this paper, we develop this novel deep learning approach and evaluate its application to amplicon sequences. We apply our approach to short DNA reads and full sequences of 16S ribosomal RNA (rRNA) marker genes, which identify the heterogeneity of a microbial community sample. We demonstrate that our implementation of a novel attention-based deep network architecture, Read2Pheno, achieves read-level phenotypic prediction. Training Read2Pheno models will encode sequences (reads) into dense, meaningful representations: learned embedded vectors output from the intermediate layer of the network model, which can provide biological insight when visualized. The attention layer of Read2Pheno models can also automatically identify nucleotide regions in reads/sequences which are particularly informative for classification. As such, this novel approach can avoid pre/post-processing and manual interpretation required with conventional approaches to microbiome sequence classification. We further show, as proof-of-concept, that aggregating read-level information can robustly predict microbial community properties, host phenotype, and taxonomic classification, with performance at least comparable to conventional approaches. An implementation of the attention-based deep learning network is available at https://github.com/EESI/sequence_attention (a python package) and https://github.com/EESI/seq2att (a command line tool).

11.
12.
The discovery of higher-order epistatic interactions is an important task in the field of genome-wide association studies, as it allows for the identification of complex interaction patterns between multiple genetic markers. Some existing brute-force approaches explore the whole space of k-interactions in an exhaustive manner, resulting in almost intractable execution times. Computational cost can be reduced drastically by restricting the search space with suitable preprocessing filters which prune unpromising candidates. Other approaches reduce the execution time by employing massively parallel accelerators in order to benefit from the vast computational resources of these architectures. In this paper, we combine a novel preprocessing filter, namely SingleMI, with massively parallel computation on modern GPUs to further accelerate epistasis discovery. Our implementation improves both the runtime and accuracy when compared to a previous GPU counterpart that employs mutual information clustering for prefiltering. SingleMI is open source software and publicly available at: https://github.com/sleeepyjack/singlemi/.
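A mutual-information prefilter of the general kind described above can be sketched in a few lines. The abstract does not specify how SingleMI scores candidates, so this sketch assumes each SNP is ranked by its empirical mutual information with the phenotype; the function names and the `keep` parameter are hypothetical, and the code runs on the CPU rather than a GPU.

```python
import math
from collections import Counter

def mutual_information(xs, ys):
    """Empirical mutual information (in bits) between two discrete sequences."""
    n = len(xs)
    px, py, pxy = Counter(xs), Counter(ys), Counter(zip(xs, ys))
    # p(x,y) * log2(p(x,y) / (p(x) p(y))), written with raw counts.
    return sum(
        (c / n) * math.log2(c * n / (px[x] * py[y]))
        for (x, y), c in pxy.items()
    )

def filter_snps(genotypes, phenotype, keep):
    """Indices of the `keep` SNPs with the highest MI with the phenotype."""
    scores = [mutual_information(snp, phenotype) for snp in genotypes]
    ranked = sorted(range(len(genotypes)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:keep])

phenotype = [0, 0, 1, 1]      # case/control labels
genotypes = [
    [0, 0, 2, 2],             # perfectly informative SNP
    [1, 1, 1, 1],             # constant SNP, no information
    [0, 1, 0, 1],             # independent of the phenotype
]
print(filter_snps(genotypes, phenotype, keep=1))
```

Only the SNPs that survive such a filter would then enter the exhaustive search over k-interactions, which is where the reduction in execution time comes from.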

13.
14.
ChIP-seq is a powerful method for obtaining genome-wide maps of protein-DNA interactions and epigenetic modifications. CHANCE (CHip-seq ANalytics and Confidence Estimation) is a standalone package for ChIP-seq quality control and protocol optimization. Our user-friendly graphical software quickly estimates the strength and quality of immunoprecipitations, identifies biases, compares the user's data with ENCODE's large collection of published datasets, performs multi-sample normalization, checks against quantitative PCR-validated control regions, and produces informative graphical reports. CHANCE is available at https://github.com/songlab/chance.

15.
Metabolomics and proteomics, like other omics domains, usually face a data mining challenge in providing an understandable output to advance biomarker discovery and precision medicine. Often, statistical analysis is one of the most difficult challenges and it is critical in the subsequent biological interpretation of the results. Because of this, combined with the computational programming skills needed for this type of analysis, several bioinformatic tools aimed at simplifying metabolomics and proteomics data analysis have emerged. However, sometimes the analysis is still limited to a few hidebound statistical methods and to data sets with limited flexibility. POMAShiny is a web-based tool that provides a structured, flexible and user-friendly workflow for the visualization, exploration and statistical analysis of metabolomics and proteomics data. This tool integrates several statistical methods, some of them widely used in other types of omics, and it is based on the POMA R/Bioconductor package, which increases the reproducibility and flexibility of analyses outside the web environment. POMAShiny and POMA are both freely available at https://github.com/nutrimetabolomics/POMAShiny and https://github.com/nutrimetabolomics/POMA, respectively.

16.
Background

Liquid Biopsy (LB) in the form of e.g., circulating tumor cells (CTCs) is a promising non-invasive approach to support current therapeutic cancer management. However, the proof of clinical utility of CTCs in informing therapeutic decision-making for e.g., breast cancer in clinical trials and associated translational research projects is facing the issues of low CTC positivity rates and low CTC numbers – even in the metastasized situation. To compensate for this dilemma, clinical CTC trials are designed as large multicenter endeavors with decentralized sample collection, processing and storage of products, making data management highly important to enable high-quality translational CTC research.

Aim

In the DETECT clinical CTC trials we aimed at developing a custom-made, browser-based virtual database to harmonize and organize both decentralized processing and storage of LB specimens and to enable the collection of clinically meaningful LB samples.

Methods

ViBiBa processes data from various sources, harmonizes the data and creates an easily searchable multilayered database.

Results

An open-source virtual bio-banking web-application termed ViBiBa was created, which automatically processes data from multiple non-standardized sources. These data are automatically checked and merged into one centralized databank, providing the opportunity to extract clinically relevant patient cohorts and CTC sample collections.

Summary

ViBiBa, which is a highly flexible tool that allows for decentralized sample storage of liquid biopsy specimens, facilitates a solution which promotes collaboration in a user-friendly, federalist and highly structured way. The source code is available under the MIT license from https://vibiba.com or https://github.com/asperciesl/ViBiBa

17.
G-quadruplex DNA structures have become attractive drug targets, and native mass spectrometry can provide detailed characterization of drug binding stoichiometry and affinity, potentially at high throughput. However, the G-quadruplex DNA polymorphism poses problems for interpreting ligand screening assays. In order to establish standardized MS-based screening assays, we studied 28 sequences with documented NMR structures in (usually ∼100 mM) potassium, and report here their circular dichroism (CD), melting temperature (Tm), NMR spectra and electrospray mass spectra in 1 mM KCl/100 mM trimethylammonium acetate. Based on these results, we make a short-list of sequences that adopt the same structure in the MS assay as reported by NMR, and provide recommendations on using them for MS-based assays. We also built an R-based open-source application to build and consult a database, wherein further sequences can be incorporated in the future. The application handles automatically most of the data processing, and allows generating custom figures and reports. The database is included in the g4dbr package (https://github.com/EricLarG4/g4dbr) and can be explored online (https://ericlarg4.github.io/G4_database.html).

18.
Protein designers use a wide variety of software tools for de novo design, yet their repertoire still lacks a fast and interactive all-atom search engine. To solve this, we have built the Suns program: a real-time, atomic search engine integrated into the PyMOL molecular visualization system. Users build atomic-level structural search queries within PyMOL and receive a stream of search results aligned to their query within a few seconds. This instant feedback cycle enables a new “designability”-inspired approach to protein design where the designer searches for and interactively incorporates native-like fragments from proven protein structures. We demonstrate the use of Suns to interactively build protein motifs, tertiary interactions, and to identify scaffolds compatible with hot-spot residues. The official web site and installer are located at http://www.degradolab.org/suns/ and the source code is hosted at https://github.com/godotgildor/Suns (PyMOL plugin, BSD license), https://github.com/Gabriel439/suns-cmd (command line client, BSD license), and https://github.com/Gabriel439/suns-search (search engine server, GPLv2 license).
This is a PLOS Computational Biology Software Article

19.
When working on an ongoing genome sequencing and assembly project, it is rather inconvenient when gene identifiers change from one build of the assembly to the next. The gene labelling system described here, UniqTag, addresses this common challenge. UniqTag assigns a unique identifier to each gene that is a representative k-mer, a string of length k, selected from the sequence of that gene. Unlike serial numbers, these identifiers are stable between different assemblies and annotations of the same data without requiring that previous annotations be lifted over by sequence alignment. We assign UniqTag identifiers to ten builds of the Ensembl human genome spanning eight years to demonstrate this stability. The implementation of UniqTag in Ruby and an R package are available at https://github.com/sjackman/uniqtag. The R package is also available from CRAN: install.packages("uniqtag"). Supplementary material and code to reproduce it are available at https://github.com/sjackman/uniqtag-paper.
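The idea of a representative k-mer as a stable identifier can be illustrated with a short sketch. The abstract does not state how the representative k-mer is chosen, so the rule used here (the k-mer found in the fewest genes, with ties broken lexicographically) is an assumption, as are the function names and the toy sequences.

```python
from collections import Counter

def kmers(seq, k):
    """Return the set of distinct k-mers in seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

def uniqtags(genes, k):
    """Map each gene name to a representative k-mer.

    Assumed rule: pick the k-mer present in the fewest genes,
    breaking ties by lexicographic order. Genes shorter than k get no tag.
    """
    per_gene = {name: kmers(seq, k) for name, seq in genes.items()}
    counts = Counter(km for kms in per_gene.values() for km in kms)
    return {
        name: min(kms, key=lambda km: (counts[km], km))
        for name, kms in per_gene.items()
        if kms
    }

genes = {"geneA": "ACGTACGGA", "geneB": "ACGTTTGCA"}
tags = uniqtags(genes, k=4)
print(tags)
```

Because the tag is derived from the gene's own sequence rather than from its position in a list, renumbering or reordering genes between builds leaves it unchanged as long as the chosen k-mer remains the rarest one.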

20.