Found 20 similar articles (search time: 0 ms)
1.
2.
Dutilh BE Jurgelenaite R Szklarczyk R van Hijum SA Harhangi HR Schmid M de Wild B Françoijs KJ Stunnenberg HG Strous M Jetten MS Op den Camp HJ Huynen MA 《Bioinformatics (Oxford, England)》2011,27(14):1929-1933
MOTIVATION: The intensification of DNA sequencing will increasingly unveil uncharacterized species with potential alternative genetic codes. A total of 0.65% of the DNA sequences currently in GenBank encode their proteins with a variant genetic code, and these exceptions occur in many unrelated taxa. RESULTS: We introduce FACIL (Fast and Accurate genetic Code Inference and Logo), a fast and reliable tool to evaluate nucleic acid sequences for their genetic code that detects alternative codes even in species distantly related to known organisms. To illustrate this, we apply FACIL to a set of mitochondrial genomic contigs of Globobulimina pseudospinescens. This foraminifer does not have any sequenced close relative in the databases, yet we infer its alternative genetic code with high confidence values. Results are intuitively visualized in a Genetic Code Logo. AVAILABILITY AND IMPLEMENTATION: FACIL is available as a web-based service at http://www.cmbi.ru.nl/FACIL/ and as a stand-alone program.
3.
Next generation sequencing (NGS) of PCR amplicons is a standard approach to detect genetic variations in personalized medicine such as cancer diagnostics. Computer programs used in the NGS community often miss insertions and deletions (indels) that constitute a large part of known human mutations. We have developed HeurAA, an open source, heuristic amplicon aligner program. We tested the program on simulated datasets as well as experimental data from multiplex sequencing of 40 amplicons in 12 oncogenes collected on a 454 Genome Sequencer from lung cancer cell lines. We found that HeurAA can accurately detect all indels, and is more than an order of magnitude faster than previous programs. HeurAA can compare reads and reference sequences up to several thousand base pairs in length, and it can evaluate data from complex mixtures containing reads of different gene segments from different samples. HeurAA is written in C and Perl for Linux operating systems. The code and the documentation are available for research applications at http://sourceforge.net/projects/heuraa/
4.
Steven F. Solga Matthew L. Mudalel Lisa A. Spacek Terence H. Risby 《Journal of visualized experiments : JoVE》2014,(88)
This exhaled breath ammonia method uses a fast and highly sensitive spectroscopic technique, quartz-enhanced photoacoustic spectroscopy (QEPAS), with a quantum cascade laser. The monitor is coupled to a sampler that measures mouth pressure and carbon dioxide. The system is temperature controlled and specifically designed to address the reactivity of this compound. The sampler provides immediate feedback to the subject and the technician on the quality of the breath effort. Together with the quick response time of the monitor, this system is capable of accurately measuring exhaled breath ammonia representative of deep lung systemic levels. Because the system is easy to use and produces real-time results, it has enabled experiments to identify factors that influence measurements. For example, mouth rinse and oral pH reproducibly and significantly affect results and therefore must be controlled. Temperature and mode of breathing are other examples. As our understanding of these factors evolves, error is reduced, and clinical studies become more meaningful. This system is very reliable and individual measurements are inexpensive. The sampler is relatively inexpensive and quite portable, but the monitor is neither. This limits options for some clinical studies and provides a rationale for future innovations.
5.
Jae Hoon Sul Towfique Raj Simone de Jong Paul I.W. de Bakker Soumya Raychaudhuri Roel A. Ophoff Barbara E. Stranger Eleazar Eskin Buhm Han 《American journal of human genetics》2015,96(6):857-868
In studies of expression quantitative trait loci (eQTLs), it is of increasing interest to identify eGenes, the genes whose expression levels are associated with variation at a particular genetic variant. Detecting eGenes is important for follow-up analyses and prioritization because genes are the main entities in biological processes. To detect eGenes, one typically focuses on the genetic variant with the minimum p value among all variants in cis with a gene and corrects for multiple testing to obtain a gene-level p value. For performing multiple-testing correction, a permutation test is widely used. Because of growing sample sizes of eQTL studies, however, the permutation test has become a computational bottleneck. In this paper, we propose an efficient approach for correcting for multiple testing and assessing eGene p values by utilizing a multivariate normal distribution. Our approach properly takes into account the linkage-disequilibrium structure among variants, and its time complexity is independent of sample size. By applying our small-sample correction techniques, our method achieves high accuracy in both small and large studies. We have shown that our method consistently produces extremely accurate p values (accuracy > 98%) for three human eQTL datasets with different sample sizes and SNP densities: the Genotype-Tissue Expression pilot dataset, the multi-region brain dataset, and the HapMap 3 dataset.
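The idea of replacing permutations with draws from a multivariate normal null can be illustrated with a small sketch. This is not the authors' implementation: it assumes, purely for illustration, that correlation between variants decays as r^|i-j| (an AR(1) pattern), which lets correlated z-scores be generated one variant at a time. The function names, the Monte Carlo estimator and the default number of draws are all hypothetical choices.

```python
import random
from math import sqrt


def sample_null_max_abs_z(num_variants, ld_r, rng):
    """Draw one null vector of z-scores with corr(z_i, z_j) = ld_r**|i-j|
    (an assumed AR(1) LD pattern) and return its largest absolute value."""
    z = rng.gauss(0.0, 1.0)
    max_abs = abs(z)
    for _ in range(num_variants - 1):
        # Next variant is correlated with the previous one by ld_r.
        z = ld_r * z + sqrt(1.0 - ld_r ** 2) * rng.gauss(0.0, 1.0)
        max_abs = max(max_abs, abs(z))
    return max_abs


def gene_level_p(observed_max_abs_z, num_variants, ld_r, draws=20000, seed=1):
    """Monte Carlo gene-level p value: the fraction of null draws whose
    strongest variant is at least as extreme as the observed one."""
    rng = random.Random(seed)
    exceed = sum(
        sample_null_max_abs_z(num_variants, ld_r, rng) >= observed_max_abs_z
        for _ in range(draws)
    )
    return (exceed + 1) / (draws + 1)  # add-one keeps the estimate above zero


# Ten cis variants, strongest observed |z| = 3.0.
p_independent = gene_level_p(3.0, num_variants=10, ld_r=0.0)
p_strong_ld = gene_level_p(3.0, num_variants=10, ld_r=0.99)
print(p_independent, p_strong_ld)
```

The cost of each draw depends on the number of variants only, not on the number of individuals, which is the property the abstract highlights. Strong LD yields a smaller corrected p value than independence because correlated variants amount to fewer effective tests.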
6.
Gabriel A. Al-Ghalith Emmanuel Montassier Henry N. Ward Dan Knights 《PLoS computational biology》2016,12(1)
The explosion of bioinformatics technologies in the form of next generation sequencing (NGS) has facilitated a massive influx of genomics data in the form of short reads. Short read mapping is therefore a fundamental component of next generation sequencing pipelines, which routinely match these short reads against reference genomes for contig assembly. However, such techniques have seldom been applied to microbial marker gene sequencing studies, which have mostly relied on novel heuristic approaches. We propose NINJA Is Not Just Another OTU-Picking Solution (NINJA-OPS, or NINJA for short), a fast and highly accurate novel method enabling reference-based marker gene matching (picking Operational Taxonomic Units, or OTUs). NINJA takes advantage of the Burrows-Wheeler (BW) alignment using an artificial reference chromosome composed of concatenated reference sequences, the “concatesome,” as the BW input. Other features include automatic support for paired-end reads with arbitrary insert sizes. NINJA is also free and open source and implements several pre-filtering methods that elicit substantial speedup when coupled with existing tools. We applied NINJA to several published microbiome studies, obtaining accuracy similar to or better than previous reference-based OTU-picking methods while achieving an order of magnitude or more speedup and using a fraction of the memory footprint. NINJA is a complete pipeline that takes a FASTA-formatted input file and outputs a QIIME-formatted taxonomy-annotated BIOM file for an entire MiSeq run of human gut microbiome 16S genes in under 10 minutes on a dual-core laptop.
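The "concatesome" idea can be sketched in a few lines: reference sequences are joined into one artificial chromosome, and a hit position on that chromosome is mapped back to the originating reference with a binary search over the start offsets. The sketch below is a simplified illustration, not NINJA's code. The function names and toy sequences are hypothetical, and a plain exact string search stands in for the Burrows-Wheeler aligner.

```python
from bisect import bisect_right


def build_concatesome(references):
    """Join (name, sequence) pairs into one string and record where each starts."""
    names, starts, parts = [], [], []
    offset = 0
    for name, seq in references:
        names.append(name)
        starts.append(offset)
        parts.append(seq)
        offset += len(seq)
    return "".join(parts), starts, names


def locate(position, length, starts, names, total_length):
    """Map a hit on the concatesome back to (reference name, local offset).
    Returns None when the hit spans the boundary between two references."""
    idx = bisect_right(starts, position) - 1
    end_of_ref = starts[idx + 1] if idx + 1 < len(starts) else total_length
    if position + length > end_of_ref:
        return None
    return names[idx], position - starts[idx]


refs = [("otu_A", "ACGTACGT"), ("otu_B", "TTGACC"), ("otu_C", "GGGCAT")]
concatesome, starts, names = build_concatesome(refs)

read = "TGACC"
hit = concatesome.find(read)  # exact search as a stand-in for a BW aligner
print(locate(hit, len(read), starts, names, len(concatesome)))
```

Hits that straddle two references are artefacts of the concatenation, so the sketch rejects them rather than assigning them to either neighbour.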
7.
Background
The analysis of biological networks has become a major challenge due to the recent development of high-throughput techniques that are rapidly producing very large data sets. The exploding volumes of biological data demand extreme computational power and special computing facilities (i.e. super-computers). An inexpensive solution, such as General Purpose computation based on Graphics Processing Units (GPGPU), can be adapted to tackle this challenge, but the limitation of the device internal memory can pose a new problem of scalability. An efficient data and computational parallelism with partitioning is required to provide a fast and scalable solution to this problem.
Results
We propose an efficient parallel formulation of the k-Nearest Neighbour (kNN) search problem, which is a popular method for classifying objects in several fields of research, such as pattern recognition, machine learning and bioinformatics. Although kNN search is simple and straightforward, its performance degrades dramatically for large data sets, since the task is computationally intensive. The proposed approach is not only fast but also scalable to large-scale instances. Based on our approach, we implemented a software tool GPU-FS-kNN (GPU-based Fast and Scalable k-Nearest Neighbour) for CUDA enabled GPUs. The basic approach is simple and adaptable to other available GPU architectures. We observed speed-ups of 50–60 times compared with CPU implementation on a well-known breast microarray study and its associated data sets.
Conclusion
Our GPU-based Fast and Scalable k-Nearest Neighbour search technique (GPU-FS-kNN) provides a significant performance improvement for nearest neighbour computation in large-scale networks. Source code and the software tool are available under the GNU General Public License (GPL) at https://sourceforge.net/p/gpufsknn/.
8.
9.
Flinn B Rothwell C Griffiths R Lägue M DeKoeyer D Sardana R Audy P Goyer C Li XQ Wang-Pruski G Regan S 《Plant molecular biology》2005,59(3):407-433
To help develop an understanding of the genes that govern the developmental characteristics of the potato (Solanum tuberosum), as well as the genes associated with responses to specified pathogens and storage conditions, the Canadian Potato Genome Project (CPGP) carried out 5′ end sequencing of regular, normalized and full-length cDNA libraries of the Shepody potato cultivar, generating over 66,600 expressed sequence tags (ESTs). The libraries sequenced represented tuber developmental stages, pathogen-challenged tubers, as well as leaf, floral developmental stages, suspension-cultured cells and roots. All libraries analysed to date have contributed unique sequences, with the normalized libraries high on the list. In addition, a low molecular weight library has enhanced the 3′ ends of our sequence assemblies. Using the combined assembly dataset, unique tuber developmental, cold storage and pathogen-challenged sequences have been identified. A comparison of the ESTs specific to the pathogen-challenged tuber and foliar libraries revealed minimal overlap between these libraries. Mixed assemblies using over 189,000 potato EST sequences from CPGP and The Institute for Genomic Research (TIGR) have revealed common sequences, as well as CPGP- and TIGR-unique sequences.
Electronic supplementary material is available for this article and is accessible to authorised users.
10.
Oren E. Livne Lide Han Gorka Alkorta-Aranburu William Wentworth-Sheilds Mark Abney Carole Ober Dan L. Nicolae 《PLoS computational biology》2015,11(3)
Founder populations and large pedigrees offer many well-known advantages for genetic mapping studies, including cost-efficient study designs. Here, we describe PRIMAL (PedigRee IMputation ALgorithm), a fast and accurate pedigree-based phasing and imputation algorithm for founder populations. PRIMAL incorporates both existing and original ideas, such as a novel indexing strategy of Identity-By-Descent (IBD) segments based on clique graphs. We were able to impute the genomes of 1,317 South Dakota Hutterites, who had genome-wide genotypes for ~300,000 common single nucleotide variants (SNVs), from 98 whole genome sequences. Using a combination of pedigree-based and LD-based imputation, we were able to assign 87% of genotypes with >99% accuracy over the full range of allele frequencies. Using the IBD cliques we were also able to infer the parental origin of 83% of alleles, and genotypes of deceased recent ancestors for whom no genotype information was available. This imputed data set will enable us to better study the relative contribution of rare and common variants to human phenotypes, as well as the parental origin effect of disease risk alleles in >1,000 individuals at minimal cost.
11.
Joan Segura Manuel Alejandro Marín-López Pamela F. Jones Baldo Oliva Narcis Fernandez-Fuentes 《PloS one》2015,10(3)
The experimental determination of the structure of protein complexes cannot keep pace with the generation of interactomic data, resulting in an ever-expanding gap. As the structural details of protein complexes are central to a full understanding of the function and dynamics of the cell machinery, alternative strategies are needed to circumvent the bottleneck in structure determination. Computational protein docking is a valid and valuable approach to model the structure of protein complexes. In this work, we describe a novel computational strategy to predict the structure of protein complexes based on data-driven docking: VORFFIP-driven dock (V-D2OCK). This new approach makes use of our newly described method to predict functional sites in protein structures, VORFFIP, to define the region to be sampled during docking, and of structural clustering to reduce the number of models to be examined by users. V-D2OCK has been benchmarked using a validated and diverse set of protein complexes and compared to a state-of-the-art docking method. The speed and accuracy compared to contemporary tools justify the potential use of V-D2OCK for high-throughput, genome-wide protein docking. Finally, we have developed a web interface that allows users to browse and visualize V-D2OCK predictions from the convenience of their web browsers.
12.
13.
Evangelos Pafilis Sune P. Frankild Lucia Fanini Sarah Faulwetter Christina Pavloudi Aikaterini Vasileiadou Christos Arvanitidis Lars Juhl Jensen 《PloS one》2013,8(6)
The exponential growth of the biomedical literature is making the need for efficient, accurate text-mining tools increasingly clear. The identification of named biological entities in text is a central and difficult task. We have developed an efficient algorithm and implementation of a dictionary-based approach to named entity recognition, which we here use to identify names of species and other taxa in text. The tool, SPECIES, is more than an order of magnitude faster than, and as accurate as, existing tools. The precision and recall were assessed both on an existing gold-standard corpus and on a new corpus of 800 abstracts, which were manually annotated after the development of the tool. The corpus comprises abstracts from journals selected to represent many taxonomic groups, which gives insights into which types of organism names are hard to detect and which are easy. Finally, we have tagged organism names in the entire Medline database and developed a web resource, ORGANISMS, that makes the results accessible to the broad community of biologists. The SPECIES software is open source and can be downloaded from http://species.jensenlab.org along with dictionary files and the manually annotated gold-standard corpus. The ORGANISMS web resource can be found at http://organisms.jensenlab.org.
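A dictionary-based tagger of the general kind described here can be sketched as a longest-match lookup over token windows. The code below is only a minimal illustration under stated assumptions and is not the SPECIES implementation: the function name, the naive tokenizer, the three-word window and the tiny dictionary are all hypothetical, and the `taxon:` identifiers are example labels.

```python
import re


def tag_names(text, dictionary, max_words=3):
    """Return (start, end, identifier) for the longest dictionary entry
    that begins at each token, scanning left to right."""
    # Naive tokenizer (an assumption): letter runs with an optional trailing
    # period, so that abbreviations such as "S." stay in one token.
    tokens = [(m.start(), m.end()) for m in re.finditer(r"[A-Za-z]+\.?", text)]
    matches = []
    i = 0
    while i < len(tokens):
        # Try the widest window first so longer names win over shorter ones.
        for width in range(min(max_words, len(tokens) - i), 0, -1):
            start, end = tokens[i][0], tokens[i + width - 1][1]
            key = text[start:end].lower()
            if key in dictionary:
                matches.append((start, end, dictionary[key]))
                i += width - 1  # skip the tokens consumed by this match
                break
        i += 1
    return matches


# Toy dictionary mapping lower-cased names to example identifiers.
names = {
    "saccharomyces cerevisiae": "taxon:4932",
    "s. cerevisiae": "taxon:4932",
    "homo sapiens": "taxon:9606",
}
text = "Proteins from S. cerevisiae and Homo sapiens were compared."
for start, end, identifier in tag_names(text, names):
    print(text[start:end], identifier)
```

Because every lookup is a hash-table probe, the running time grows roughly linearly with text length, which is one reason dictionary approaches can be fast. A production tagger would also need to handle punctuation, plural forms and ambiguous names.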
14.
Linear motifs mediate a wide variety of cellular functions, which makes their characterization in protein sequences crucial to understanding cellular systems. However, the short length and degenerate nature of linear motifs make their discovery a difficult problem. Here, we introduce MotifHound, an algorithm particularly suited for the discovery of small and degenerate linear motifs. MotifHound performs an exact and exhaustive enumeration of all motifs present in proteins of interest, including all of their degenerate forms, and scores the overrepresentation of each motif based on its occurrence in proteins of interest relative to a background (e.g., proteome) using the hypergeometric distribution. To assess MotifHound, we benchmarked it together with state-of-the-art algorithms. The benchmark consists of 11,880 sets of proteins from S. cerevisiae; in each set, we artificially spiked in one motif varying in terms of three key parameters: (i) number of occurrences, (ii) length and (iii) the number of degenerate or “wildcard” positions. The benchmark enabled the evaluation of the impact of these three properties on the performance of the different algorithms. The results showed that MotifHound and SLiMFinder were the most accurate in detecting degenerate linear motifs. Interestingly, MotifHound was 15 to 20 times faster at comparable accuracy and performed best in the discovery of highly degenerate motifs. We complemented the benchmark by an analysis of proteins experimentally shown to bind the FUS1 SH3 domain from S. cerevisiae. Using the full-length protein partners as sole information, MotifHound recapitulated most experimentally determined motifs binding to the FUS1 SH3 domain. Moreover, these motifs exhibited properties typical of SH3 binding peptides, e.g., high intrinsic disorder and evolutionary conservation, despite the fact that none of these properties were used as prior information.
MotifHound is available (http://michnick.bcm.umontreal.ca or http://tinyurl.com/motifhound) together with the benchmark that can be used as a reference to assess future developments in motif discovery.
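The hypergeometric scoring step mentioned in this abstract can be written out directly. The sketch below computes the upper-tail probability of seeing at least the observed number of motif-containing proteins in the set of interest, given how common the motif is in the background. The function name and the example counts are hypothetical, and the abstract does not say whether MotifHound applies further corrections, so none are shown.

```python
from math import comb


def hypergeom_enrichment_p(background_size, background_hits, sample_size, sample_hits):
    """P(X >= sample_hits) when sample_size proteins are drawn without
    replacement from a background in which background_hits contain the motif."""
    upper = min(background_hits, sample_size)
    total = comb(background_size, sample_size)
    tail = sum(
        comb(background_hits, i)
        * comb(background_size - background_hits, sample_size - i)
        for i in range(sample_hits, upper + 1)
    )
    return tail / total


# Hypothetical counts: 300 of 6,000 background proteins carry the motif,
# and 8 of the 20 proteins of interest carry it.
print(hypergeom_enrichment_p(6000, 300, 20, 8))
```

With these counts only about one motif-containing protein would be expected by chance, so eight occurrences give a very small p value. Exact integer arithmetic via `math.comb` avoids the rounding problems of computing factorials in floating point.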
15.
16.
17.
A Flexible and Accurate Genotype Imputation Method for the Next Generation of Genome-Wide Association Studies (cited 2 times in total: 0 self-citations, 2 citations by others)
Genotype imputation methods are now being widely used in the analysis of genome-wide association studies. Most imputation analyses to date have used the HapMap as a reference dataset, but new reference panels (such as controls genotyped on multiple SNP chips and densely typed samples from the 1,000 Genomes Project) will soon allow a broader range of SNPs to be imputed with higher accuracy, thereby increasing power. We describe a genotype imputation method (IMPUTE version 2) that is designed to address the challenges presented by these new datasets. The main innovation of our approach is a flexible modelling framework that increases accuracy and combines information across multiple reference panels while remaining computationally feasible. We find that IMPUTE v2 attains higher accuracy than other methods when the HapMap provides the sole reference panel, but that the size of the panel constrains the improvements that can be made. We also find that imputation accuracy can be greatly enhanced by expanding the reference panel to contain thousands of chromosomes and that IMPUTE v2 outperforms other methods in this setting at both rare and common SNPs, with overall error rates that are 15%–20% lower than those of the closest competing method. One particularly challenging aspect of next-generation association studies is to integrate information across multiple reference panels genotyped on different sets of SNPs; we show that our approach to this problem has practical advantages over other suggested solutions.
18.
Motivation
Obtaining large-scale sequence alignments in a fast and flexible way is an important step in the analysis of next generation sequencing data. Applications based on the Smith-Waterman (SW) algorithm are often either not fast enough, limited to dedicated tasks or not sufficiently accurate due to statistical issues. Current SW implementations that run on graphics hardware do not report the alignment details necessary for further analysis.
Results
With the Parallel SW Alignment Software (PaSWAS) it is possible (a) to have easy access to the computational power of NVIDIA-based general purpose graphics processing units (GPGPUs) to perform high-speed sequence alignments, and (b) to retrieve relevant information such as score, number of gaps and mismatches. The software reports multiple hits per alignment. The added value of the new SW implementation is demonstrated with two test cases: (1) tag recovery in next generation sequence data and (2) isotype assignment within an immunoglobulin 454 sequence data set. Both cases show the usability and versatility of the new parallel Smith-Waterman implementation.
19.
Rapid DNA preparation for quick screening is in high demand in diverse research fields. Here, we combined an extraction buffer and heat treatment to generate DNA templates from yeast and filamentous fungal materials for PCR. This method may be widely applicable to diverse fungal species in clinical and basic studies.
20.
《Cell cycle (Georgetown, Tex.)》2013,12(7):817-822
Despite nearly universal conservation through evolution, the precise function of the DinB/pol κ branch of the Y-family of DNA polymerases has remained unclear. Recent results suggest that DinB orthologs from all domains of life proficiently bypass replication-blocking lesions that may be recalcitrant to DNA repair mechanisms. Like other translesion DNA polymerases, DinB and its orthologs have a higher error frequency than the DNA polymerases that replicate the majority of the genome. However, recent results suggest that some Y-family polymerases, including DinB and pol κ, bypass certain types of DNA damage with greater proficiency than an undamaged template. Moreover, they do so relatively accurately. The ability to employ this mechanism to manage DNA damage may be especially important for types of DNA modification that elude repair mechanisms. For these lesions, translesion synthesis may represent a more important line of defense than for other types of DNA damage that are more easily dealt with by other more accurate mechanisms.