Similar literature
1.
Hua Kui, Zhang Xuegong. BMC Genomics 2019, 20(2): 93-101
Background

Metagenomic sequencing is a powerful technology for studying mixtures of microbes, or microbiomes, on humans and in the environment. One basic task in analyzing metagenomic data is to identify the component genomes in the community. This task is challenging because of the complexity of microbiome composition, the limited availability of known reference genomes, and usually insufficient sequencing coverage.

Results

As an initial step toward understanding the complete composition of a metagenomic sample, we studied the problem of estimating the total length of all distinct component genomes in a metagenomic sample. We showed that this problem can be solved by estimating the total number of distinct k-mers in all the metagenomic sequencing data. We proposed a method for this estimation based on the sequencing coverage distribution of observed k-mers, and introduced a k-mer redundancy index (KRI) to fill in the gap between the count of distinct k-mers and the total genome length. We showed the effectiveness of the proposed method on a set of carefully designed simulation data corresponding to multiple situations of true metagenomic data. Results on real data indicate that the uncaptured genomic information can vary dramatically across metagenomic samples, with the potential to mislead downstream analyses.

Conclusions

We posed the question of what the total genome length of all distinct species in a microbial community is, and introduced a method to answer it.
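As a rough illustration of the quantities this abstract refers to, the Python sketch below counts distinct k-mers in a set of reads and tabulates how many k-mers were observed once, twice, and so on. It is not the authors' estimator and does not compute the KRI; the function names, the toy reads, and the choice of k = 4 are assumptions made purely for illustration, and reverse-complement (canonical) k-mers are ignored.

```python
from collections import Counter


def kmer_counts(reads, k):
    """Count every k-mer (substring of length k) across all reads."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


def coverage_histogram(counts):
    """Map an observed count c to the number of distinct k-mers seen exactly c times."""
    return Counter(counts.values())


# Toy reads (hypothetical data, not from the study).
reads = ["ACGTAC", "CGTACG", "TTTTTT"]
counts = kmer_counts(reads, k=4)
distinct = len(counts)                  # number of distinct observed k-mers
histogram = coverage_histogram(counts)  # coverage distribution of those k-mers
```

The histogram is the kind of "coverage distribution of observed k-mers" an estimator of unseen k-mers would start from; how the unseen fraction is extrapolated from it is the part the abstract does not spell out.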


2.
Yoon Byung-Jun, Qian Xiaoning, Kahveci Tamer, Pal Ranadip. BMC Genomics 2020, 21(9): 1-3
Background

Haplotypes, the ordered lists of single nucleotide variations that distinguish chromosomal sequences from their homologous pairs, may reveal an individual’s susceptibility to hereditary and complex diseases and affect how our bodies respond to therapeutic drugs. Reconstructing haplotypes of an individual from short sequencing reads is an NP-hard problem that becomes even more challenging in the case of polyploids. While increasing lengths of sequencing reads and insert sizes helps improve accuracy of reconstruction, it also exacerbates computational complexity of the haplotype assembly task. This has motivated the pursuit of algorithmic frameworks capable of accurate yet efficient assembly of haplotypes from high-throughput sequencing data.

Results

We propose a novel graphical representation of sequencing reads and pose the haplotype assembly problem as an instance of community detection on a spatial random graph. To this end, we construct a graph where each read is a node with an unknown community label associating the read with the haplotype it samples. Haplotype reconstruction can then be thought of as a two-step procedure: first, one recovers the community labels on the nodes (i.e., the reads), and then uses the estimated labels to assemble the haplotypes. Based on this observation, we propose ComHapDet, a novel assembly algorithm for diploid and polyploid haplotypes that allows both biallelic and multi-allelic variants.

Conclusions

Performance of the proposed algorithm is benchmarked on simulated as well as experimental data obtained by sequencing Chromosome 5 of tetraploid biallelic Solanum tuberosum (potato). The results demonstrate the efficacy of the proposed method and show that it compares favorably with existing techniques.


3.
DNA sample contamination is a frequent problem in DNA sequencing studies and can result in genotyping errors and reduced power for association testing. We recently described methods to identify within-species DNA sample contamination based on sequencing read data, showed that our methods can reliably detect and estimate contamination levels as low as 1%, and suggested strategies to identify and remove contaminated samples from sequencing studies. Here we propose methods to model contamination during genotype calling as an alternative to removal of contaminated samples from further analyses. We compare our contamination-adjusted calls to calls that ignore contamination and to calls based on uncontaminated data. We demonstrate that, for moderate contamination levels (5%–20%), contamination-adjusted calls eliminate 48%–77% of the genotyping errors. For lower levels of contamination, our contamination correction methods produce genotypes nearly as accurate as those based on uncontaminated data. Our contamination correction methods are useful generally, but are particularly helpful for sample contamination levels from 2% to 20%.

4.
Purpose

The aims of this study were to report the first isolation of Erysipelothrix sp. strain 2 (ES2) associated with clinical signs of disease and mortality in turkeys, and to identify the antimicrobial resistance of the isolates.

Methods

We evaluated 118 farms by bacteriological analysis and TaqMan real-time PCR to identify the microorganism in different organs. We then analysed the epidemiological association between positive flocks and mean mortality. We also sequenced the 16S rRNA region and assessed antimicrobial resistance.

Results

We identified 18 flocks (15.25%) as ES2-positive, and found no other species from the same genus. After analysing the organ samples, we found liver to be the organ of choice for isolating ES2. Sequencing of the 16S rRNA region of ES2 showed high homology with E. tonsillarum and E. rhusiopathiae, suggesting that this region is not the best-suited target for identifying the species. We found a positive association between isolation of the bacteria from organs and flock mortality. Positive flocks had a mean mortality rate of 6.87%, significantly greater than the 3.76% in negative flocks. Ill turkeys had gross lesions of generalized septicaemia. The bacterial isolates showed high resistance to fosfomycin and trimethoprim/sulfamethoxazole and sensitivity to norfloxacin, amoxicillin and lincomycin/spectinomycin.

Conclusion

This is the first study to address ES2 as the causative agent of erysipelas in turkeys.


5.
Background

Using epigenetic markers and fragmentomics of cell-free DNA (cfDNA) for cancer detection has proven feasible.

Methods

We further investigated the diagnostic potential of combining two features of cell-free DNA (epigenetic markers and fragmentomic information) for detecting various types of cancer. To do this, we extracted cfDNA fragmentomic features from 191 whole-genome sequencing datasets and studied them in 396 low-pass 5hmC sequencing datasets, which included four common cancer types and control samples.

Results

In our analysis of 5hmC sequencing data from cancer samples, we observed aberrant ultra-long fragments (220–500 bp) that differed from normal samples in both size and coverage profile. These fragments played a significant role in predicting cancer. Leveraging the ability to detect cfDNA hydroxymethylation and fragmentomic markers simultaneously in low-pass 5hmC sequencing data, we developed an integrated model that incorporated 63 features representing both fragmentomic features and hydroxymethylation signatures. This model achieved high sensitivity and specificity for pan-cancer detection (88.52% and 82.35%, respectively).

Conclusion

We showed that fragmentomic information in 5hmC sequencing data is an ideal marker for cancer detection and that it shows high performance in low-pass sequencing data.

6.
Background

A metagenome is a collection of genomes, usually in a micro-environment, and sequencing a metagenomic sample en masse is a powerful means for investigating the community of the constituent microorganisms. One of the challenges is in distinguishing between similar organisms due to rampant multiple possible assignments of sequencing reads, resulting in false positive identifications. We map the problem to a topological data analysis (TDA) framework that extracts information from the geometric structure of data. Here the structure is defined by multi-way relationships between the sequencing reads using a reference database.

Results

Based primarily on the patterns of co-mapping of the reads to multiple organisms in the reference database, we use two models: one a subcomplex of a Barycentric subdivision complex and the other a Čech complex. The Barycentric subcomplex allows a natural mapping of the reads along with their coverage of organisms while the Čech complex takes simply the number of reads into account to map the problem to homology computation. Using simulated genome mixtures we show not just enrichment of signal but also microbe identification with strain-level resolution.

Conclusions

In particular, in the most refractory of cases where alternative algorithms that exploit unique reads (i.e., mapped to unique organisms) fail, we show that the TDA approach continues to show consistent performance. The Čech model that uses less information is equally effective, suggesting that even partial information when augmented with the appropriate structure is quite powerful.


7.
Background: Invasive species can interfere in the structure and functioning of ecosystems. Better understanding of the evolution of such species will be useful when planning their management and eradication.

Aims: We aimed to compare patterns of genetic variability in Impatiens glandulifera in native and introduced regions.

Methods: We used native samples from India and Pakistan, and non-native samples from Canada, Finland and the UK. Genetic analyses included genotyping using 10 microsatellite markers and sequencing of the nuclear ITS region.

Results: Mean allele numbers in native and introduced samples were similar, 8.8 and 8.5, respectively, while expected heterozygosities were higher in native samples (mean 0.738) than in non-native samples (mean 0.477). Hardy–Weinberg equilibrium testing indicated significant heterozygote deficiencies at 70% of the loci. Inbreeding coefficients were high in both native and introduced regions (range 0.201–0.726). STRUCTURE analyses showed that native samples from India and Pakistan possessed similar clustering patterns, while non-native samples from the UK and Canada resembled each other. One of the four Finnish populations had a pattern similar to that of the UK and Canadian populations, while the rest each showed a unique genetic composition. ITS sequencing revealed two polymorphic sites in Pakistani samples that were not found in Indian samples but were present in some samples from Canada, Finland and the UK.

Conclusions: Distinct population genetic patterns indicate that human-mediated dispersal is important in I. glandulifera.
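For readers unfamiliar with the statistic reported above, expected heterozygosity at a locus is commonly computed as one minus the sum of squared allele frequencies. The Python sketch below shows that textbook formula on made-up microsatellite data; it is an assumption that the study used this basic form (many packages apply a small-sample correction), and the function name and allele values are hypothetical.

```python
from collections import Counter


def expected_heterozygosity(alleles):
    """Return 1 - sum(p_i^2) for the allele copies observed at one locus."""
    total = len(alleles)
    if total == 0:
        raise ValueError("no alleles supplied")
    freqs = [n / total for n in Counter(alleles).values()]
    return 1.0 - sum(p * p for p in freqs)


# Hypothetical allele copies (microsatellite lengths in bp) at one locus.
locus = [120, 120, 124, 124, 124, 128, 128, 128]
he = expected_heterozygosity(locus)
```

A population fixed for a single allele gives 0, and the value rises as alleles become more numerous and more evenly distributed, which is why a lower mean in introduced samples is read as reduced genetic diversity.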

8.
ABSTRACT

We have utilized the California GeoTracker database to evaluate field duplicate variability and the significance of sample contamination for groundwater and vapor samples collected from contaminated sites in California. Vapor duplicates are more variable than water duplicates with median percent difference in concentration of 25% compared to 7% for water samples. In addition, large differences in concentration were more common in vapor duplicates. For vapor analyte pairs, 20% of pairs had a percent difference in concentration of >300% while, for groundwater analyte pairs, only 3% had a percent difference of >300%. Contamination of samples during collection or analysis is also more significant for vapor samples. For water samples, sample contamination appears unlikely to result in false positive exceedances of drinking water standards; however, for vapor samples, sample contamination may result in false positive exceedances of indoor air screening values. For vapor samples, the use of reusable canisters and flow controllers is likely an important source of sample contamination.

9.
He Feifei, Li Yang, Tang Yu-Hang, Ma Jian, Zhu Huaiqiu. BMC Genomics 2016, 17(1): 141-151
Background

The identification of inversions of DNA segments shorter than read length (e.g., 100 bp), defined as micro-inversions (MIs), remains challenging for next-generation sequencing reads. MIs are acknowledged to be an important form of genomic variation and may play a role in causing genetic disease. However, current alignment methods are generally insensitive to MIs. Here we develop a novel tool, MID (Micro-Inversion Detector), to identify MIs in human genomes using next-generation sequencing reads.

Results

The algorithm of MID is designed based on a dynamic programming path-finding approach. What makes MID different from other variant detection tools is that MID can handle small MIs and multiple breakpoints within an unmapped read. Moreover, MID improves reliability in low coverage data by integrating multiple samples. Our evaluation demonstrated that MID outperforms Gustaf, which can currently detect inversions from 30 bp to 500 bp.

Conclusions

To our knowledge, MID is the first method that can efficiently and reliably identify MIs from unmapped short next-generation sequencing reads. MID is reliable on low coverage data, which makes it suitable for large-scale projects such as the 1000 Genomes Project (1KGP). MID identified previously unknown MIs from the 1KGP that overlap with genes and regulatory elements in the human genome. We also identified MIs in cancer cell lines from the Cancer Cell Line Encyclopedia (CCLE). Our tool is therefore expected to be useful for studying MIs as a type of genetic variant in the human genome. The source code can be downloaded from: http://cqb.pku.edu.cn/ZhuLab/MID.


10.
Zhu Fangfang, Li Jiang, Liu Juan, Min Wenwen. BMC Genetics 2021, 22(1): 1-10
Background

Next-generation sequencing (NGS) has profoundly changed the approach to genetic/genomic research. Particularly, the clinical utility of NGS in detecting mutations associated with disease risk has contributed to the development of effective therapeutic strategies. Recently, comprehensive analysis of somatic genetic mutations by NGS has also been used as a new approach for controlling the quality of cell substrates for manufacturing biopharmaceuticals. However, the quality evaluation of cell substrates by NGS largely depends on the limit of detection (LOD) for rare somatic mutations. The purpose of this study was to develop a simple method for evaluating the ability of whole-exome sequencing (WES) by NGS to detect mutations with low allele frequency. To estimate the LOD of WES for low-frequency somatic mutations, we repeatedly and independently performed WES of a reference genomic DNA using the same NGS platform and assay design. LOD was defined as the allele frequency with a relative standard deviation (RSD) value of 30% and was estimated by a moving average curve of the relation between RSD and allele frequency.

Results

Allele frequencies of 20 mutations in the reference material that had been pre-validated by droplet digital PCR (ddPCR) were obtained from 5, 15, 30, or 40 G base pair (Gbp) sequencing data per run. There was a significant association between the allele frequencies measured by WES and those pre-validated by ddPCR, whose p-value decreased as the sequencing data size increased. By this method, the LOD of allele frequency in WES with the sequencing data of 15 Gbp or more was estimated to be between 5 and 10%.

Conclusions

To interpret WES data on somatic genetic mutations properly, a cutoff threshold for low allele frequencies is necessary. The in-house LOD estimated by the simple method shown in this study provides a rationale for setting the cutoff.
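The LOD definition in this abstract rests on the relative standard deviation (RSD) of allele frequencies measured across repeated runs, smoothed by a moving average. The Python sketch below illustrates those two ingredients on invented numbers. The mutation names, the frequencies, the window size of 2 and the use of the sample standard deviation are all assumptions; the abstract does not specify these details.

```python
from statistics import mean, stdev


def rsd_percent(values):
    """Relative standard deviation: sample SD divided by the mean, in percent."""
    return 100.0 * stdev(values) / mean(values)


def moving_average(values, window):
    """Simple unweighted moving average over consecutive windows."""
    return [mean(values[i:i + window]) for i in range(len(values) - window + 1)]


# Hypothetical allele frequencies (%) of three mutations across four WES runs.
replicates = {
    "mut_a": [2.0, 4.0, 1.0, 5.0],
    "mut_b": [9.0, 11.0, 10.0, 10.0],
    "mut_c": [48.0, 52.0, 50.0, 50.0],
}
rsd = {name: rsd_percent(afs) for name, afs in replicates.items()}

# Order mutations by mean allele frequency, then smooth RSD along that axis.
ordered = sorted(replicates, key=lambda name: mean(replicates[name]))
smoothed = moving_average([rsd[name] for name in ordered], window=2)

# Mutations whose run-to-run RSD is at or below the 30% criterion.
below_cutoff = [name for name, value in rsd.items() if value <= 30.0]
```

In this toy data the low-frequency mutation varies far more between runs than the others, so the smoothed curve crosses 30% somewhere between the first and second windows; the allele frequency at that crossing is what the abstract calls the LOD.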


11.
Metabarcoding of environmental samples on second‐generation sequencing platforms has rapidly become a valuable tool for ecological studies. A fundamental assumption of this approach is the reliance on being able to track tagged amplicons back to the samples from which they originated. In this study, we address the problem of sequences in metabarcoding sequencing outputs with false combinations of used tags (tag jumps). Unless these sequences can be identified and excluded from downstream analyses, tag jumps creating sequences with false, but already used tag combinations, can cause incorrect assignment of sequences to samples and artificially inflate diversity. In this study, we document and investigate tag jumping in metabarcoding studies on Illumina sequencing platforms by amplifying mixed‐template extracts obtained from bat droppings and leech gut contents with tagged generic arthropod and mammal primers, respectively. We found that an average of 2.6% and 2.1% of sequences had tag combinations that could be explained by tag jumping in the leech and bat diet studies, respectively. We suggest that tag jumping can happen during blunt‐ending of pools of tagged amplicons during library build and as a consequence of chimera formation during bulk amplification of tagged amplicons during library index PCR. We argue that tag jumping and contamination between libraries represent a considerable challenge for Illumina‐based metabarcoding studies, and suggest measures to avoid false assignment of tag jumping‐derived sequences to samples.
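One way to picture the screening step implied above is to compare each read's forward and reverse tag pair against the pairs actually assigned to samples. The Python sketch below does only that; the record layout, tag names and sample names are hypothetical, and it is not the procedure used in the study.

```python
def split_by_tag_combination(reads, used_combinations):
    """Separate reads whose (forward, reverse) tag pair was assigned from those whose pair was not."""
    kept, unexpected = [], []
    for read in reads:
        pair = (read["fwd_tag"], read["rev_tag"])
        (kept if pair in used_combinations else unexpected).append(read)
    return kept, unexpected


# Hypothetical design: only two tag pairs were assigned to samples.
used = {("T1", "T1"): "bat_01", ("T2", "T2"): "bat_02"}
reads = [
    {"id": "r1", "fwd_tag": "T1", "rev_tag": "T1"},
    {"id": "r2", "fwd_tag": "T1", "rev_tag": "T2"},  # unassigned pair: possible tag jump
    {"id": "r3", "fwd_tag": "T2", "rev_tag": "T2"},
]
kept, unexpected = split_by_tag_combination(reads, used)
unexpected_fraction = len(unexpected) / len(reads)
```

A filter of this kind can only flag jumps that produce an unassigned pair. Jumps that land on a pair already in use look legitimate, which is the harder case the abstract warns about.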

12.
Background

Metabolomics provides measurement of numerous metabolites in human samples, which can be a useful tool in clinical research. Blood and urine are regarded as preferred subjects of study because of their minimally invasive collection and simple preprocessing methods. Adhering to standard operating procedures is an essential factor in ensuring excellent sample quality and reliable results.

Aim of review

In this review, we summarize the studies about the impacts of various preprocessing factors on metabolomics studies involving clinical blood and urine samples in order to provide guidance for sample collection and preprocessing.

Key scientific concepts of review

Clinical information is important for sample grouping and data analysis and deserves attention before sample collection. Plasma and serum as well as urine samples are appropriate for metabolomics analysis. Collection tubes, hemolysis, delay at room temperature, and freeze–thaw cycles may affect metabolic profiles of blood samples. Collection time, time between sampling and examination, contamination, normalization strategies, and storage conditions may alter analysis results of urine samples. Taking these collection and preprocessing factors into account, this review provides suggestions for standard sample preprocessing.


13.
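Abstract. Objective: To construct an sgRNA knockout library targeting about 200 genes mutated in AML, laying the groundwork for further exploration of the signaling pathway networks that induce AML. Methods: TCGA performed whole-genome or whole-exome sequencing on 200 AML patients and identified about 2,000 AML-associated gene mutations; from these, about 200 genes mutated two or more times were selected as targets. sgRNA sequences for the corresponding genes were then picked from the Brie library, four sgRNAs per gene, and ligated into a lentiviral vector by Gibson assembly to obtain the sgRNA library. The cleavage activity of the library sgRNAs was then assessed with the pSSA luciferase reporter system, the library was characterized by high-throughput sequencing, and the library was packaged into lentivirus and the viral titer measured. Results: (1) An sgRNA knockout library targeting about 200 AML mutations was constructed. (2) The pSSA luciferase reporter system showed that the library sgRNAs have cleavage activity. (3) The sequences of the 7 single-clone plasmids examined were entirely correct. (4) High-throughput sequencing showed that the abundance and uniformity of the library met requirements. (5) The library was packaged into a lentiviral library with a measured titer of 4.4×10^7, which meets the requirements of subsequent experiments. Conclusion: An sgRNA knockout library targeting about 200 mutated genes was successfully constructed. It can be used for large-scale screening of gene mutations that induce AML, laying the groundwork for exploring the molecular mechanisms of AML onset and progression and for identifying drug targets.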

14.
Li Wei Vivian, Razaee Zahra S., Li Jingyi Jessica. BMC Genomics 2016, 17(1): 109-125
Background

The dynamics of epigenomic marks in their relevant chromatin states regulate distinct gene expression patterns, biological functions and phenotypic variations in biological processes. The availability of high-throughput epigenomic data generated by next-generation sequencing technologies allows a data-driven approach to evaluate the similarities and differences of diverse tissue and cell types in terms of epigenomic features. While ChromImpute has allowed for the imputation of large-scale epigenomic information to yield more robust data to capture meaningful relationships between biological samples, widely used methods such as hierarchical clustering and correlation analysis cannot adequately utilize epigenomic data to accurately reveal the distinction and grouping of different tissue and cell types.

Methods

We utilize a three-step testing procedure (ANOVA, t test and overlap test) to identify tissue/cell-type-associated enhancers and promoters and to calculate a newly defined Epigenomic Overlap Measure (EPOM). EPOM results in a clear correspondence map of biological samples from different tissue and cell types through comparison of epigenomic marks evaluated in their relevant chromatin states.

Results

Correspondence maps by EPOM show strong capability in distinguishing and grouping different tissue and cell types and reveal biologically meaningful similarities between Heart and Muscle, Blood & T-cell and HSC & B-cell, Brain and Neurosphere, etc. The gene ontology enrichment analysis both supports and explains the discoveries made by EPOM and suggests that the associated enhancers and promoters demonstrate distinguishable functions across tissue and cell types. Moreover, the tissue/cell-type-associated enhancers and promoters show enrichment in the disease-related SNPs that are also associated with the corresponding tissue or cell types. This agreement suggests the potential of identifying causal genetic variants relevant to cell-type-specific diseases from our identified associated enhancers and promoters.

Conclusions

The proposed EPOM measure demonstrates superior capability in grouping and finding a clear correspondence map of biological samples from different tissue and cell types. The identified associated enhancers and promoters provide a comprehensive catalog to study distinct biological processes and disease variants in different tissue and cell types. Our results also find that the associated promoters exhibit more cell-type-specific functions than the associated enhancers do, suggesting that the non-associated promoters have more housekeeping functions than the non-associated enhancers.


15.
16.
Background

Short-read resequencing of genomes produces abundant information on the genetic variation of individuals. Because these variants are so numerous, they are rarely exhaustively validated. Furthermore, low levels of undetected variant miscalling will have a systematic and disproportionate impact on the interpretation of individual genome sequence information, especially should these miscalls also be carried through into reference databases of genomic variation.

Results

We find that sequence variation from short-read sequence data is subject to recurrent-yet-intermittent miscalling that occurs in a sequence-intrinsic manner and is very sensitive to sequence read length. The miscalls arise from difficulties aligning short reads to redundant genomic regions, where the rate of sequencing error approaches the sequence diversity between redundant regions. We find the resultant miscalled variants to be sensitive to small sequence variations between genomes, and thereby often intrinsic to an individual, pedigree, strain or human ethnic group. In human exome sequences, we identify 2–300 recurrent false positive variants per individual, almost all of which are present in public databases of human genomic variation. From the exomes of non-reference strains of inbred mice, we identify 3–5000 recurrent false positive variants per mouse, a number that increases with greater distance between an individual mouse strain and the reference C57BL6 mouse genome. We show that recurrently miscalled variants may be reproduced for a given genome from repeated simulation rounds of read resampling, realignment and recalling. As such, it is possible to identify more than two-thirds of false positive variation from only ten rounds of simulation.

Conclusion

Identification and removal of recurrent false positive variants from specific individual variant sets will improve overall data quality. The variant miscalls that arise are highly sequence-intrinsic and are often specific to an individual, pedigree or ethnicity. Further, read length is a strong determinant of whether given false variants will be called for any given genome, which has profound significance for cohort studies that pool datasets collected and sequenced at different points in time.


17.
18.
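Objective: To obtain high-density single nucleotide polymorphism (SNP) genotyping data by whole genome sequencing (WGS), evaluate genotyping accuracy, and establish a method for using WGS data in forensic SNP genealogy inference. Methods: Samples were sequenced by WGS at 30× depth on the BGI MGISEQ-200RS platform. The 645,199 autosomal SNP sites on the Wegene GSA chip were extracted from the sequencing data and, after quality-control filtering, IBS/IBD algorithms were used to predict kinship; the population origin of the samples was also analyzed. Results: The concordance between SNP genotypes extracted from the sequencing data and the Wegene GSA chip genotypes exceeded 99.62%. With the sequencing-derived SNP data, the IBS algorithm could predict 1st- to 4th-degree kinship, with 100% accuracy for the confidence interval of 4th-degree predictions; the IBD algorithm could predict 1st- to 7th-degree kinship, with 100% accuracy in identifying 7th-degree relatives as related. The genealogy inference ability of SNPs obtained from high-depth WGS data did not differ significantly from that of chip-based predictions. Population inference from the WGS data was also consistent with the survey results. Conclusion: WGS can be applied to forensic SNP genealogy inference and can provide leads for case investigation.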
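To make the IBS idea concrete, the Python sketch below computes the mean proportion of alleles that two samples share identical-by-state across SNPs, using the common convention that a genotype is coded as its count of alternate alleles and that two genotypes share 2 - |a - b| alleles. This is a generic illustration rather than the study's pipeline; the function name, genotype coding and data are assumptions, and turning such a score into a kinship degree requires reference distributions not shown here.

```python
def ibs_proportion(genotypes_a, genotypes_b):
    """Mean fraction of alleles shared identical-by-state over SNPs called in both samples."""
    shared = 0
    compared = 0
    for a, b in zip(genotypes_a, genotypes_b):
        if a is None or b is None:  # skip SNPs with a missing call in either sample
            continue
        shared += 2 - abs(a - b)
        compared += 1
    if compared == 0:
        raise ValueError("no SNPs called in both samples")
    return shared / (2 * compared)


# Hypothetical genotypes coded as alternate-allele counts (0, 1, 2); None is a missing call.
sample_1 = [0, 1, 2, 2, None, 1]
sample_2 = [0, 2, 2, 0, 1, 1]
ibs = ibs_proportion(sample_1, sample_2)
```

With hundreds of thousands of SNPs, closer relatives tend to score higher than unrelated pairs, which is the signal a kinship predictor of this type relies on.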

19.
20.

Background

Massively parallel sequencing technology is revolutionizing approaches to genomic and genetic research. Since its advent, the scale and efficiency of Next-Generation Sequencing (NGS) has rapidly improved. In spite of this success, sequencing genomes or genomic regions with extremely biased base composition is still a great challenge to the currently available NGS platforms. The genomes of some important pathogenic organisms like Plasmodium falciparum (high AT content) and Mycobacterium tuberculosis (high GC content) display extremes of base composition. The standard library preparation procedures that employ PCR amplification have been shown to cause uneven read coverage particularly across AT and GC rich regions, leading to problems in genome assembly and variation analyses. Alternative library-preparation approaches that omit PCR amplification require large quantities of starting material and hence are not suitable for small amounts of DNA/RNA such as those from clinical isolates. We have developed and optimized library-preparation procedures suitable for low quantity starting material and tolerant to extremely high AT content sequences.

Results

We have used our optimized conditions in parallel with standard methods to prepare Illumina sequencing libraries from a non-clinical and a clinical isolate (containing ~53% host contamination). By analyzing and comparing the quality of the sequence data generated, we show that our optimized conditions, which involve a PCR additive (TMAC), produce amplified libraries with improved coverage of extremely AT-rich regions and reduced bias toward GC-neutral templates.

Conclusion

We have developed a robust and optimized Next-Generation Sequencing library amplification method suitable for extremely AT-rich genomes. The new amplification conditions significantly reduce bias and retain complexity at both extremes of base composition. This development will greatly benefit the sequencing of clinical samples, which often require amplification because of the low mass of DNA starting material.
