Similar articles
Found 20 similar articles; search took 46 ms.
1.

Background

Insertions and deletions (indels) are the most abundant form of structural variation in all genomes. Indels have been increasingly recognized as an important source of molecular markers due to high-density occurrence, cost-effectiveness, and ease of genotyping. Coupled with developments in bioinformatics, next-generation sequencing (NGS) platforms enable the discovery of millions of indel polymorphisms by comparing the whole genome sequences of individuals within a species.

Results

A total of 1,973,746 unique indels were identified in 345 maize genomes, with an overall density of 958.79 indels/Mbp and an average allele number of 2.76 (range 2 to 107). There were 264,214 indels with polymorphism information content (PIC) values greater than or equal to 0.5, accounting for 13.39% of all indels. Of these highly polymorphic indels, we designed primer pairs for 83,481 and 29,403 indels with major allele differences (i.e. the size difference between the most and second most frequent alleles) greater than or equal to 3 and 8 bp, respectively, based on the differing resolution capabilities of gel electrophoresis. The accuracy of our indel markers was experimentally validated: among 100 indel markers, average accuracy was approximately 90%. We also validated the polymorphism of the indel markers: all 100 highly polymorphic indel markers tested were polymorphic, with an average PIC value of 0.54.
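The two selection criteria in this abstract, PIC and major allele difference, can be sketched in a few lines. This is a minimal illustration, assuming the widely used Botstein form of PIC (the abstract does not say which PIC formula the authors applied); the function names and the allele-count input layout are hypothetical.

```python
from itertools import combinations


def pic(freqs):
    """Polymorphism information content from allele counts or frequencies.

    Assumes the Botstein form: 1 - sum(p_i^2) - sum_{i<j} 2 * p_i^2 * p_j^2.
    """
    total = sum(freqs)
    p = [f / total for f in freqs]  # normalise so the values sum to 1
    homozygosity = sum(x * x for x in p)
    correction = sum(2 * (a * a) * (b * b) for a, b in combinations(p, 2))
    return 1.0 - homozygosity - correction


def major_allele_difference(allele_counts):
    """Size difference (bp) between the most and second most frequent alleles.

    allele_counts is a hypothetical layout: {allele length in bp: count}.
    """
    ranked = sorted(allele_counts.items(), key=lambda kv: kv[1], reverse=True)
    if len(ranked) < 2:
        return 0
    return abs(ranked[0][0] - ranked[1][0])
```

Under this formula a biallelic marker cannot exceed a PIC of 0.375, so a cutoff of 0.5 would only be reachable by markers with three or more alleles.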

Conclusions

The maize genome is rich in indel polymorphisms. Intriguingly, the level of polymorphism in genic regions of the maize genome was higher than that in intergenic regions. The polymorphic indel markers developed from this study may enhance the efficiency of genetic research and marker-assisted breeding in maize.

Electronic supplementary material

The online version of this article (doi:10.1186/s12864-015-1797-5) contains supplementary material, which is available to authorized users.

2.

Background

With the number of available genome sequences increasing rapidly, the magnitude of sequence data required for multiple-genome analyses is a challenging problem. When large-scale rearrangements break the collinearity of gene orders among genomes, genome comparison algorithms must first identify sets of short well-conserved sequences present in each genome, termed anchors. Previously, anchor identification among multiple genomes has been achieved using pairwise alignment tools like BLASTZ through progressive alignment tools like TBA, but the computational requirements for sequence comparisons of multiple genomes quickly become a limiting factor as the number and scale of genomes grow.

Methodology/Principal Findings

Our algorithm, named Murasaki, makes it possible to identify anchors within multiple large sequences on the scale of several hundred megabases in a few minutes using a single CPU. Two advanced features of Murasaki are (1) adaptive hash function generation, which enables efficient use of arbitrary mismatch patterns (spaced seeds) and therefore the comparison of multiple mammalian genomes in a practical amount of computation time, and (2) parallelizable execution that decreases the required wall-clock and CPU times. Murasaki can perform a sensitive anchoring of eight mammalian genomes (human, chimp, rhesus, orangutan, mouse, rat, dog, and cow) in 21 hours CPU time (42 minutes wall time). This is the first single-pass in-core anchoring of multiple mammalian genomes. We evaluated Murasaki by comparing it with the genome alignment programs BLASTZ and TBA. We show that Murasaki can anchor multiple genomes in near linear time, compared to the quadratic time requirements of BLASTZ and TBA, while improving overall accuracy.
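The idea of anchoring with a spaced seed can be illustrated with a toy hash index. This is only a sketch of the general technique, not Murasaki's adaptive hash generation or its actual data structures; the pattern notation ('1' = position must match, '0' = mismatch allowed) and both function names are assumptions made for illustration.

```python
from collections import defaultdict


def spaced_keys(seq, pattern):
    """Yield (key, position) per window; the key keeps only bases under a '1'."""
    keep = [i for i, c in enumerate(pattern) if c == "1"]
    for pos in range(len(seq) - len(pattern) + 1):
        yield "".join(seq[pos + i] for i in keep), pos


def find_anchors(seqs, pattern):
    """Return {key: [positions in seq 0, positions in seq 1, ...]}
    for every key that occurs in all of the input sequences."""
    index = defaultdict(lambda: [[] for _ in seqs])
    for n, seq in enumerate(seqs):
        for key, pos in spaced_keys(seq, pattern):
            index[key][n].append(pos)
    # keep only keys with at least one hit in every sequence
    return {k: v for k, v in index.items() if all(v)}
```

With the pattern "1101", the windows ACGT and ACTT hash to the same key, so a single mismatch at the third position does not prevent the two windows from being reported as a candidate anchor.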

Conclusions/Significance

Murasaki provides an open source platform to take advantage of long patterns, cluster computing, and novel hash algorithms to produce accurate anchors across multiple genomes with computational efficiency significantly greater than existing methods. Murasaki is available under GPL at http://murasaki.sourceforge.net.

3.

Background

The new generation of massively parallel DNA sequencers, combined with the challenge of whole human genome resequencing, results in the need for rapid and accurate alignment of billions of short DNA sequence reads to a large reference genome. Speed is obviously of great importance, but equally important is maintaining alignment accuracy of short reads, in the 25–100 base range, in the presence of errors and true biological variation.

Methodology

We introduce a new algorithm specifically optimized for this task, as well as a freely available implementation, BFAST, which can align data produced by any of the current sequencing platforms, allows for user-customizable levels of speed and accuracy, supports paired-end data, and provides for efficient parallel and multi-threaded computation on a computer cluster. The new method is based on creating flexible, efficient whole genome indexes to rapidly map reads to candidate alignment locations, with arbitrary multiple independent indexes allowed to achieve robustness against read errors and sequence variants. The final local alignment uses a Smith-Waterman method, with gaps to support the detection of small indels.
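The final step described here, gapped local alignment, can be sketched with the textbook Smith-Waterman recurrence. This is a minimal version with a linear gap penalty, not BFAST's implementation; the scoring values are arbitrary assumptions, and only the best score and its end cell are returned (no traceback).

```python
def smith_waterman(a, b, match=2, mismatch=-1, gap=-2):
    """Return (best local alignment score, (row, col) of the cell where it ends)."""
    rows, cols = len(a) + 1, len(b) + 1
    # H[i][j] = best score of a local alignment ending at a[i-1], b[j-1]
    H = [[0] * cols for _ in range(rows)]
    best, best_pos = 0, (0, 0)
    for i in range(1, rows):
        for j in range(1, cols):
            diag = H[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # the floor at 0 lets an alignment restart anywhere (local, not global)
            H[i][j] = max(0, diag, H[i - 1][j] + gap, H[i][j - 1] + gap)
            if H[i][j] > best:
                best, best_pos = H[i][j], (i, j)
    return best, best_pos
```

Because gaps are allowed, a read carrying a one-base insertion relative to the reference can still align across its full length at the cost of a single gap penalty, which is what makes small indels detectable at this stage.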

Conclusions

We compare BFAST to a selection of large-scale alignment tools - BLAT, MAQ, SHRiMP, and SOAP - in terms of both speed and accuracy, using simulated and real-world datasets. We show BFAST can achieve substantially greater sensitivity of alignment in the context of errors and true variants, especially insertions and deletions, and minimize false mappings, while maintaining adequate speed compared to other current methods. We show BFAST can align the amount of data needed to fully resequence a human genome, one billion reads, with high sensitivity and accuracy, on a modest computer cluster in less than 24 hours. BFAST is available at http://bfast.sourceforge.net.

4.

Background

Candida albicans is a ubiquitous opportunistic fungal pathogen that afflicts immunocompromised human hosts. With rare and transient exceptions the yeast is diploid, yet despite its clinical relevance the respective sequences of its two homologous chromosomes have not been completely resolved.

Results

We construct a phased diploid genome assembly by deep sequencing a standard laboratory wild-type strain and a panel of strains homozygous for particular chromosomes. The assembly has 700-fold coverage on average, allowing extensive revision and expansion of the number of known SNPs and indels. This phased genome significantly enhances the sensitivity and specificity of allele-specific expression measurements by enabling pooling and cross-validation of signal across multiple polymorphic sites. Additionally, the diploid assembly reveals pervasive and unexpected patterns in allelic differences between homologous chromosomes. Firstly, we see striking clustering of indels, concentrated primarily in the repeat sequences in promoters. Secondly, both indels and their repeat-sequence substrate are enriched near replication origins. Finally, we reveal an intimate link between repeat sequences and indels, which argues that repeat length is under selective pressure for most eukaryotes. This connection is described by a concise one-parameter model that explains repeat-sequence abundance in C. albicans as a function of the indel rate, and provides a general framework to interpret repeat abundance in species ranging from bacteria to humans.

Conclusions

The phased genome assembly and insights into repeat plasticity will be valuable for better understanding allele-specific phenomena and genome evolution.

5.

Background

Mobile elements are active in the human genome, both in the germline and cancers, where they can mutate driver genes.

Results

While analysing whole genome paired-end sequencing of oesophageal adenocarcinomas to find genomic rearrangements, we identified three ways in which new mobile element insertions appear in the data, resembling translocation or insertion junctions: inserts where unique sequence has been transduced by an L1 (Long interspersed element 1) mobile element; novel inserts that are confidently, but often incorrectly, mapped by alignment software to L1s or polyA tracts in the reference sequence; and a combination of these two ways, where different sequences within one insert are mapped to different loci. We identified nine unique sequences that were transduced by neighbouring L1s, both L1s in the reference genome and L1s not present in the reference. Many of the resulting inserts were small fragments that include little or no recognisable mobile element sequence. We found 6 loci in the reference genome to which sequence reads from inserts were frequently mapped, probably erroneously, by alignment software: these were either L1 sequence or particularly long polyA runs. Inserts identified from such apparent rearrangement junctions averaged 16 inserts/tumour, range 0–153 insertions in 43 tumours. However, many inserts would not be detected by mapping the sequences to the reference genome, because they do not include sufficient mappable sequence. To estimate total somatic inserts we searched for polyA sequences that were not present in the matched normal or other normals from the same tumour batch, and were not associated with known polymorphisms. Samples of these candidate inserts were verified by sequencing across them or manual inspection of surrounding reads: at least 85 % were somatic and resembled L1-mediated events, most including L1Hs sequence. Approximately 100 such inserts were detected per tumour on average (range zero to approximately 700).

Conclusions

Somatic mobile element insertions are abundant in these tumours, with over 75% of cases having a number of novel inserts detected. The inserts create a variety of problems for the interpretation of paired-end sequencing data.

Electronic supplementary material

The online version of this article (doi:10.1186/s12864-015-1685-z) contains supplementary material, which is available to authorized users.

6.

Background

With the advance of next generation sequencing (NGS) technologies, a large number of insertion and deletion (indel) variants have been identified in human populations. Despite much research into variant calling, it has been found that a non-negligible proportion of the identified indel variants might be false positives due to sequencing errors, artifacts caused by ambiguous alignments, and annotation errors.

Results

In this paper, we examine indel redundancy in dbSNP, one of the central databases for indel variants, and develop a standalone computational pipeline, dubbed Vindel, to detect redundant indels. The pipeline first applies indel position information to form candidate redundant groups, then performs indel mutations to the reference genome to generate corresponding indel variant substrings. Finally the indel variant substrings in the same candidate redundant groups are compared in a pairwise fashion to identify redundant indels. We applied our pipeline to check for redundancy in the human indels in dbSNP. Our pipeline identified approximately 8% redundancy in insertion type indels, 12% in deletion type indels, and overall 10% for insertions and deletions combined. These numbers are largely consistent across all human autosomes. We also investigated indel size distribution and adjacent indel distance distribution for a better understanding of the mechanisms generating indel variants.
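The three pipeline steps described above (group by position, apply each indel to the reference, compare the resulting strings pairwise) can be sketched as follows. This is a simplified illustration, not the Vindel code: the indel tuple layout, the `max_distance` grouping rule, and the use of the whole reference rather than a local substring are all assumptions.

```python
def apply_indel(ref, indel):
    """Apply one indel to ref and return the variant sequence.

    indel is a hypothetical tuple: ("ins", pos, inserted_string) inserts
    before 0-based pos; ("del", pos, length) deletes length bases from pos.
    """
    kind, pos, payload = indel
    if kind == "ins":
        return ref[:pos] + payload + ref[pos:]
    if kind == "del":
        return ref[:pos] + ref[pos + payload:]
    raise ValueError(f"unknown indel type: {kind}")


def redundant_pairs(ref, indels, max_distance=10):
    """Return index pairs of indels that yield the same variant sequence."""
    pairs = []
    for i in range(len(indels)):
        for j in range(i + 1, len(indels)):
            a, b = indels[i], indels[j]
            # candidate group: same type and nearby positions
            if a[0] != b[0] or abs(a[1] - b[1]) > max_distance:
                continue
            if apply_indel(ref, a) == apply_indel(ref, b):
                pairs.append((i, j))
    return pairs
```

For the reference GATTTTC, deleting the T at position 2 and deleting the T at position 4 give the same sequence GATTTC, so the two records describe one variant; this is the kind of ambiguity inside a repeat run that produces redundant entries.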

Conclusions

Vindel, a simple yet effective computational pipeline, can be used to check whether a set of indels is redundant with respect to those already in a database of interest such as NCBI’s dbSNP. Of the approximately 5.9 million indels we examined, nearly 0.6 million are redundant, revealing a serious limitation in the current indel annotation. Statistical results show that the pipeline detects indel redundancy consistently across all 22 autosomes. Apart from the standalone Vindel pipeline, the indel redundancy check algorithm is also implemented in the web server http://bioinformatics.cs.vt.edu/zhanglab/indelRedundant.php.

Electronic supplementary material

The online version of this article (doi:10.1186/s12859-014-0359-1) contains supplementary material, which is available to authorized users.

7.

Background

Molecular mechanisms associated with frequent relapse of diffuse large B-cell lymphoma (DLBCL) are poorly defined. It is especially unclear how primary tumor clonal heterogeneity contributes to relapse. Here, we explore unique features of B-cell lymphomas - VDJ recombination and somatic hypermutation - to address this question.

Results

We performed high-throughput sequencing of rearranged VDJ junctions in 14 pairs of matched diagnosis-relapse tumors, among which 7 pairs were further characterized by exome sequencing. We identify two distinctive modes of clonal evolution of DLBCL relapse: an early-divergent mode in which clonally related diagnosis and relapse tumors diverged early and developed in parallel; and a late-divergent mode in which relapse tumors developed directly from diagnosis tumors with minor divergence. By examining mutation patterns in the context of phylogenetic information provided by VDJ junctions, we identified mutations in epigenetic modifiers such as KMT2D as potential early driving events in lymphomagenesis and immune escape alterations as relapse-associated events.

Conclusions

Altogether, our study for the first time provides important evidence that DLBCL relapse may result from multiple, distinct tumor evolutionary mechanisms, providing rationale for therapies for each mechanism. Moreover, this study highlights the urgent need to understand the driving roles of epigenetic modifier mutations in lymphomagenesis, and immune surveillance factor genetic lesions in relapse.

Electronic supplementary material

The online version of this article (doi:10.1186/s13059-014-0432-0) contains supplementary material, which is available to authorized users.

8.

Background

Recent advances in deep digital sequencing have unveiled an unprecedented degree of clonal heterogeneity within a single tumor DNA sample. Resolving such heterogeneity depends on accurate estimation of fractions of alleles that harbor somatic mutations. Unlike substitutions or small indels, structural variants such as deletions, duplications, inversions and translocations involve segments of DNA and are potentially more accurate for allele fraction estimations. However, no systematic method exists that can support such analysis.

Results

In this paper, we present a novel maximum-likelihood method that estimates allele fractions of structural variants integratively from various forms of alignment signals. We develop a tool, BreakDown, to estimate the allele fractions of most structural variants including medium size (from 1 kilobase to 1 megabase) deletions and duplications, and balanced inversions and translocations.

Conclusions

Evaluation based on both simulated and real data indicates that our method systematically enables structural variants for clonal heterogeneity analysis and can greatly enhance the characterization of genomically unstable tumors.

Electronic supplementary material

The online version of this article (doi:10.1186/1471-2105-15-299) contains supplementary material, which is available to authorized users.

9.

Background

The Ion Torrent PGM is a popular benchtop sequencer that shows promise in replacing conventional Sanger sequencing as the gold standard for mutation detection. Despite the PGM’s reported high accuracy in calling single nucleotide variations, it tends to generate many false positive calls in detecting insertions and deletions (indels), which may hinder its utility for clinical genetic testing.

Results

Recently, the proprietary analytical workflow for the Ion Torrent sequencer, Torrent Suite (TS), underwent a series of upgrades. We evaluated three major upgrades of TS by calling indels in the BRCA1 and BRCA2 genes. Our analysis revealed that false negative indels could be generated by TS under both default calling parameters and parameters adjusted for maximum sensitivity. However, indel calling with the same data using the open source variant callers GATK and SAMtools showed that false negatives could be minimised with appropriate bioinformatics analysis. Furthermore, we identified two variant calling measures, Quality-by-Depth (QD) and VARiation of the Width of gaps and inserts (VARW), which substantially reduced false positive indels, including non-homopolymer associated errors, without compromising sensitivity. In our best case scenario, which involved the TMAP aligner and SAMtools, we achieved 100% sensitivity, 99.99% specificity and a 29% False Discovery Rate (FDR) in indel calling from all 23 samples, which is a good performance for mutation screening using PGM.
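The three performance figures quoted above all derive from a confusion matrix of variant calls. The sketch below uses the standard definitions; the function name is hypothetical, and the counts in the usage note are invented for illustration and are not taken from the paper.

```python
def calling_metrics(tp, fp, tn, fn):
    """Return sensitivity, specificity and false discovery rate as fractions."""

    def ratio(num, den):
        # an undefined ratio (empty denominator) is reported as 0.0
        return num / den if den else 0.0

    return {
        "sensitivity": ratio(tp, tp + fn),  # true indels that were called
        "specificity": ratio(tn, tn + fp),  # non-variant sites left uncalled
        "fdr": ratio(fp, tp + fp),          # calls that are false positives
    }
```

With hypothetical counts of 5 true positives, 2 false positives, 19,998 true negatives and 0 false negatives, this gives 100% sensitivity, 99.99% specificity and an FDR of about 29%. When true indels are rare relative to the number of sites screened, a very high specificity can coexist with a substantial FDR.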

Conclusions

New versions of TS, BWA and GATK have shown improvements in indel calling sensitivity and specificity over their older counterparts. However, the variant caller of TS exhibits a lower sensitivity than GATK and SAMtools. Our findings demonstrate that although indel calling from PGM sequences may appear to be noisy at first glance, proper computational indel calling analysis is able to maximize both the sensitivity and specificity at the single base level, paving the way for the use of this technology in future clinical genetic testing.

Electronic supplementary material

The online version of this article (doi:10.1186/1471-2164-15-516) contains supplementary material, which is available to authorized users.

10.

Background

Obtaining an accurate sequence alignment is fundamental for consistently analyzing biological data. Although this problem may be efficiently solved when only two sequences are considered, exact inference of the optimal alignment quickly becomes computationally intractable in the multiple sequence alignment case. To cope with the high computational expense, approximate heuristic methods have been proposed that address the problem indirectly by progressively aligning the sequences in pairs according to their relatedness. These methods, however, cannot change the alignment of an already aligned group of sequences in the light of new data, which compromises the quality of the resulting alignment. In this paper we present ReformAlign, a novel meta-alignment approach that may significantly improve the quality of the alignments produced by popular aligners. We call ReformAlign a meta-aligner as it requires an initial alignment, for which a variety of alignment programs can be used. The main idea behind ReformAlign is quite straightforward: first, an existing alignment is used to construct a standard profile that summarizes the initial alignment, and then all sequences are individually re-aligned against the formed profile. From each sequence-profile comparison, the alignment of each sequence against the profile is recorded, and the final alignment is indirectly inferred by merging all the individual sub-alignments into a unified set. The use of ReformAlign may often result in alignments that are significantly more accurate than the starting alignments.

Results

We evaluated the effect of ReformAlign on the alignments generated by ten leading alignment methods, using real data of variable size and sequence identity. The experimental results suggest that the proposed meta-aligner approach may often lead to alignments that are statistically significantly more accurate. Furthermore, we show that ReformAlign results in more substantial improvement in cases where the starting alignment is of relatively inferior quality or when the input sequences are harder to align.

Conclusions

The proposed profile-based meta-alignment approach seems to be a promising and computationally efficient method that can be combined with practically all popular alignment methods and may lead to significant improvements in the generated alignments.

Electronic supplementary material

The online version of this article (doi:10.1186/1471-2105-15-265) contains supplementary material, which is available to authorized users.

11.
12.
13.
14.

Background

The discovery and mapping of genomic variants is an essential step in most analyses performed using sequencing reads. There are a number of mature software packages and associated pipelines that can identify single nucleotide polymorphisms (SNPs) with a high degree of concordance. However, the same cannot be said for tools that are used to identify the other types of variants. Indels represent the second most frequent class of variants in the human genome, after SNPs. The reliable detection of indels is still a challenging problem, especially for variants that are longer than a few bases.

Results

We have developed a set of algorithms and heuristics collectively called indelMINER to identify indels from whole genome resequencing datasets using paired-end reads. indelMINER uses a split-read approach to identify the precise breakpoints for indels of size less than a user specified threshold, and supplements that with a paired-end approach to identify larger variants that are frequently missed with the split-read approach. We use simulated and real datasets to show that an implementation of the algorithm performs favorably when compared to several existing tools.

Conclusions

indelMINER can be used effectively to identify indels in whole-genome resequencing projects. The output is provided in the VCF format along with additional information about the variant, including information about its presence or absence in another sample. The source code and documentation for indelMINER can be freely downloaded from www.bx.psu.edu/miller_lab/indelMINER.tar.gz.

Electronic supplementary material

The online version of this article (doi:10.1186/s12859-015-0483-6) contains supplementary material, which is available to authorized users.

15.

Background

Protein misfolding is usually deleterious for the cell, either as a consequence of the loss of protein function or the buildup of insoluble and toxic aggregates. The aggregation behavior of a given polypeptide is strongly influenced by the intrinsic properties encoded in its sequence. This has allowed the development of effective computational methods to predict protein aggregation propensity.

Methodology/Principal Findings

Here, we use the AGGRESCAN algorithm to approximate the aggregation profile of an experimental cytosolic Escherichia coli proteome. The analysis indicates that the aggregation propensity of bacterial proteins is associated with their length, conformation, location, function, and abundance. The data are consistent with the predictions of other algorithms on different theoretical proteomes.

Conclusions/Significance

Overall, the study suggests that the avoidance of protein aggregation in functional environments acts as a strong evolutionary constraint on polypeptide sequences in both prokaryotic and eukaryotic organisms.

16.

Background

Domestication has shaped the horse and led to a group of many different types. Some have been under strong human selection while others developed in close relationship with nature. The aim of our study was to perform next generation sequencing of breed and non-breed horses to provide an insight into genetic influences on selective forces.

Results

Whole genome sequencing of five horses of four different populations revealed 10,193,421 single nucleotide polymorphisms (SNPs) and 1,361,948 insertion/deletion polymorphisms (indels). In comparison to horse variant databases and previous reports, we were able to identify 3,394,883 novel SNPs and 868,525 novel indels. We analyzed the distribution of individual variants. In non-breed horses, we found significant enrichment of private mutations in coding regions of genes involved in primary metabolic processes, anatomical structures, morphogenesis and cellular components. In breed horses, by contrast, private mutations were enriched in genes affecting cell communication, lipid metabolic process, neurological system process, muscle contraction, ion transport, and developmental processes of the nervous system and ectoderm.

Conclusions

Our next generation sequencing data constitute an important first step for the characterization of non-breed in comparison to breed horses and provide a large number of novel variants for future analyses. Functional annotations suggest specific variants that could play a role for the characterization of breed or non-breed horses.

Electronic supplementary material

The online version of this article (doi:10.1186/1471-2164-15-562) contains supplementary material, which is available to authorized users.

17.

Background

Phylogenetic methods produce hierarchies of molecular species, inferring knowledge about taxonomy and evolution. However, there is not yet a consensus methodology that provides a crisp partition of taxa, desirable when considering the problem of intra/inter-patient quasispecies classification or infection transmission event identification. We introduce the threshold bootstrap clustering (TBC), a new methodology for partitioning molecular sequences, that does not require a phylogenetic tree estimation.

Methodology/Principal Findings

The TBC is an incremental partition algorithm, inspired by the stochastic Chinese restaurant process, and takes advantage of resampling techniques and models of sequence evolution. TBC uses as input a multiple alignment of molecular sequences and its output is a crisp partition of the taxa into an automatically determined number of clusters. By varying initial conditions, the algorithm can produce different partitions. We describe a procedure that selects a prime partition among a set of candidate ones and calculates a measure of cluster reliability. TBC was successfully tested for the identification of type-1 human immunodeficiency and hepatitis C virus subtypes, and compared with previously established methodologies. It was also evaluated in the problem of HIV-1 intra-patient quasispecies clustering, and for transmission cluster identification, using a set of sequences from patients with known transmission event histories.

Conclusion

TBC has been shown to be effective for the subtyping of HIV and HCV, and for identifying intra-patient quasispecies. To some extent, the algorithm was also able to infer clusters corresponding to events of infection transmission. The computational complexity of TBC is quadratic in the number of taxa, lower than other established methods; in addition, TBC has been enhanced with a measure of cluster reliability. The TBC can be useful to characterise molecular quasispecies in a broad context.

18.

Background

Activation induced deaminase (AID) mediates class switch recombination and somatic hypermutation of immunoglobulin (Ig) genes in germinal centre B cells. In order to regulate its specific activity and as a means to keep off-target mutations low, several mechanisms have evolved, including binding to specific cofactors, phosphorylation and destabilization of nuclear AID protein. Although ubiquitination at lysine residues of AID is recognized as an essential step in initiating degradation of nuclear AID, any functional relevance of lysine modifications has remained elusive.

Methodology/Principal Findings

Here, we report functional implications of lysine modifications of the human AID protein by generating a panel of lysine to arginine mutants of AID and assessing their catalytic class switch activity. We found that only mutation of Lys22 to Arg resulted in a significant reduction of class switching to IgG1 in transfected primary mouse B cells. This decrease in activity was neither reflected in reduced hypermutation of Ig genes in AID-mutant transfected DT40 B cell lines nor recapitulated in bacterial deamination assays, pointing to involvement of post-translational modification of Lys22 for AID activity in B cells.

Conclusions/Significance

Our results imply that lysine modification may represent a novel level of AID regulation and that Lys22 is important for effective AID activity.

19.

Background

Trypanosoma cruzi is the causal agent of Chagas disease. Recently, the genomes of representative strains from two major evolutionary lineages were sequenced, allowing the construction of a detailed genetic diversity map for this important parasite. However, this map is focused on coding regions of the genome, leaving a vast space of regulatory regions uncharacterized in terms of their evolutionary conservation and/or divergence.

Methodology

Using data from the hybrid CL Brener and Sylvio X10 genomes (from the TcVI and TcI Discrete Typing Units, respectively), we identified intergenic regions that share a common evolutionary ancestry, and are present in both CL Brener haplotypes (TcII-like and TcIII-like) and in the TcI genome; as well as intergenic regions that were conserved in only two of the three genomes/haplotypes analyzed. The genetic diversity in these regions was characterized in terms of the accumulation of indels and nucleotide changes.

Principal Findings

Based on this analysis we have identified i) a core of highly conserved intergenic regions, which remained essentially unchanged in independently evolving lineages; ii) intergenic regions that show high diversity in spite of still retaining their corresponding upstream and downstream coding sequences; iii) a number of defined sequence motifs that are shared by a number of unrelated intergenic regions. A fraction of indels explains the diversification of some intergenic regions by the expansion/contraction of microsatellite-like repeats.

20.

Background

Although more than one thousand complete mitochondrial DNA (mtDNA) sequences have been determined in teleostean fishes, only a few gene rearrangements have been observed, and genome-scale rearrangements are even rarer. However, flatfishes (Pleuronectiformes) have been identified as having diverse types of mitochondrial gene rearrangements. It has been reported that tongue soles and the blue flounder mitogenomes exhibit different types of large-scale gene rearrangements.

Results

In the present study, the complete mitochondrial genome of another flatfish, Samariscus latus, was sequenced, and genome-scale rearrangements were observed. The genomic features of this flounder are different from those of any other studied vertebrate, including other flatfish species. The mitogenome of S. latus is characterized by the duplication and translocation of the control region (CR). The genes located between the two CRs are divided into two clusters in which their relative orders are maintained.

Conclusions

We propose a “Double Replications and Random Loss” model to explain the rearrangement events in S. latus mitogenome. This model consists of the following steps. First, the CR was duplicated and translocated. Subsequently, double replications of the mitogenome were successively initiated from the two CRs, leading to the duplication of the genes between the two CRs. Finally, one of each pair of duplicated genes was lost in a random event.
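The last two steps of the model (duplication of the genes between the two CRs, then random loss of one copy of each pair) can be simulated in a toy form. This is a sketch under simplifying assumptions: the gene labels are hypothetical, every gene is assumed unique, and each copy is lost with equal probability, which the abstract does not state.

```python
import random


def duplicate_and_lose(genes, rng):
    """Duplicate a gene block in tandem, then drop one copy of each gene at random.

    genes: ordered list of unique gene labels lying between the two CRs.
    rng:   a random.Random instance, passed in for reproducibility.
    """
    duplicated = list(genes) + list(genes)
    # for each gene, decide whether the copy in the first block survives
    keep_first = {g: rng.random() < 0.5 for g in genes}
    result = []
    for idx, gene in enumerate(duplicated):
        in_first_block = idx < len(genes)
        if keep_first[gene] == in_first_block:
            result.append(gene)
    return result
```

Whatever the random choices, the surviving genes form two consecutive clusters, one from each copy of the block, and each cluster preserves the original relative order. This matches the pattern the abstract reports for the genes between the two CRs.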

Electronic supplementary material

The online version of this article (doi:10.1186/1471-2164-15-352) contains supplementary material, which is available to authorized users.
