Similar articles (20 found)
1.

Background

The Smith-Waterman algorithm, which produces the optimal pairwise alignment between two sequences, is frequently used as a key component of fast heuristic read mapping and variation detection tools for next-generation sequencing data. Although various fast Smith-Waterman implementations have been developed, they are either designed as monolithic protein database searching tools, which do not return detailed alignments, or are embedded into other tools. These issues make reusing these efficient Smith-Waterman implementations impractical.

Results

To facilitate easy integration of the fast Single-Instruction-Multiple-Data Smith-Waterman algorithm into third-party software, we wrote a C/C++ library, which extends Farrar’s Striped Smith-Waterman (SSW) to return alignment information in addition to the optimal Smith-Waterman score. In this library we developed a new method to generate the full optimal alignment results and a suboptimal score in linear space at little cost in efficiency. This improvement makes the fast Single-Instruction-Multiple-Data Smith-Waterman genuinely useful in genomic applications. SSW is available both as a C/C++ software library and as a stand-alone alignment tool at: https://github.com/mengyao/Complete-Striped-Smith-Waterman-Library.

Conclusions

The SSW library has been used in the primary read mapping tool MOSAIK, the split-read mapping program SCISSORS, the MEI detector TANGRAM, and the read-overlap graph generation program RZMBLR. The speeds of the mentioned software are improved significantly by replacing their ordinary Smith-Waterman or banded Smith-Waterman module with the SSW Library.
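To make the "score in linear space" idea concrete, here is a minimal sketch in plain Python. It is not the SSW library's striped SIMD method: it uses a simple linear gap penalty, and the function name and scoring values are assumptions chosen for illustration. It only shows that the best local score and its end coordinates can be found while keeping two rows of the dynamic-programming matrix in memory.

```python
def smith_waterman_score(query, target, match=2, mismatch=-2, gap=-3):
    """Return (best_score, query_end, target_end) for a local alignment.

    Illustrative only: linear gap penalty, hypothetical scoring values,
    and no SIMD striping. Memory use is two rows, i.e. O(len(target)).
    End positions are 1-based.
    """
    prev = [0] * (len(target) + 1)
    best = (0, 0, 0)
    for i, q in enumerate(query, start=1):
        curr = [0] * (len(target) + 1)
        for j, t in enumerate(target, start=1):
            diag = prev[j - 1] + (match if q == t else mismatch)
            # Local alignment: scores never drop below zero.
            score = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            curr[j] = score
            if score > best[0]:
                best = (score, i, j)
        prev = curr  # only the previous row is needed for the next one
    return best
```

Once the end coordinates are known, the start of the alignment can be found by running the same scan on the reversed prefixes, so that a full traceback is only needed inside a small region. That is a common strategy for this kind of problem, and the abstract does not say whether SSW works exactly this way.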

2.
3.

Background

The genome of classical laboratory strains of mice is an artificial mosaic of genomes originating from several mouse subspecies with predominant representation (>90%) of the Mus m. domesticus component. Mice of another subspecies, East European/Asian Mus m. musculus, can interbreed with the classical laboratory strains to generate hybrids with unprecedented phenotypic and genotypic variations. To study these variations in depth we prepared the first genomic large insert BAC library from an inbred strain derived purely from the Mus m. musculus subspecies. The library will be used to seek and characterize genomic sequences controlling specific monogenic and polygenic complex traits, including modifiers of dominant and recessive mutations.

Results

A representative mouse genomic BAC library was derived from a female mouse of the PWD/Ph inbred strain of the Mus m. musculus subspecies. The library consists of 144 768 primary clones, of which 97% contain an insert of 120 kb average size. The library represents an equivalent of 6.7 × the mouse haploid genome, as estimated from the total number of clones carrying genomic DNA inserts and from the average insert size. The clones were arrayed in duplicate onto eight high-density membranes that were screened with seven single-copy gene probes. The individual probes identified four to eleven positive clones, corresponding to 6.9-fold coverage of the mouse genome. Eighty-seven BAC-ends of PWD/Ph clones were sequenced, edited, and aligned with the mouse C57BL/6J (B6) genome. Seventy-three BAC-ends displayed unique hits on the B6 genome and their alignment revealed 0.92 single nucleotide polymorphisms (SNPs) per 100 bp. Insertions and deletions represented 0.3% of the BAC end sequences.

Conclusion

Analysis of the novel genomic library for the PWD/Ph inbred strain demonstrated coverage of almost seven mouse genome equivalents and a capability to recover clones for specific regions of the PWD/Ph genome. The single nucleotide polymorphism rate between the strains PWD/Ph and C57BL/6J was 0.92/100 bp, a value significantly higher than between classical laboratory strains. The library will serve as a resource for dissecting the phenotypic and genotypic variations between mice of the Mus m. musculus subspecies and classical laboratory mouse strains.
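The coverage estimate quoted above can be checked with a short calculation: clones with inserts, multiplied by the mean insert size, divided by the haploid genome size. The sketch below assumes a haploid mouse genome of about 2.5 × 10^9 bp, a figure the abstract does not state; the function name is also hypothetical.

```python
def library_coverage(n_clones, insert_fraction, mean_insert_bp, genome_bp):
    """Genome equivalents represented by a clone library."""
    return n_clones * insert_fraction * mean_insert_bp / genome_bp


# Values from the abstract: 144 768 clones, 97% with inserts, 120 kb mean insert.
# ASSUMPTION: haploid genome size of 2.5e9 bp (not given in the abstract).
coverage = library_coverage(144_768, 0.97, 120_000, 2.5e9)
print(f"{coverage:.1f}x")
```

Under that assumed genome size, the result agrees with the roughly 6.7-fold figure reported from clone counts. A larger assumed genome size would give a proportionally lower estimate.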

4.
5.

Background

The Ahringer C. elegans RNAi feeding library, prepared by cloning genomic DNA fragments, has been widely used in genome-wide analysis of gene function. However, the library has not been thoroughly validated by direct sequencing, and there are potential errors, including: 1) mis-annotation (a clone with a retired gene name should be remapped to the actual target gene); 2) nonspecific PCR amplification; 3) cross-RNAi; 4) mis-operation, such as sample loading errors.

Results

Here we performed a reliability analysis on the Ahringer C. elegans RNAi feeding library, which contains 16,256 bacterial strains, using a bioinformatics approach. Results demonstrated that most (98.3%) of the bacterial strains in the library are reliable. However, we also found that 2,851 (17.54%) bacterial strains need to be re-annotated even though they are reliable. Most of these bacterial strains are clones with retired gene names. In addition, 28 strains are grouped into the unreliable category and 226 strains are marginal because they probably express unrelated double-stranded RNAs (dsRNAs). The accuracy of the prediction was further confirmed by direct sequencing analysis of 496 bacterial strains. Finally, a freely accessible database named CelRNAi (http://biocompute.bmi.ac.cn/CelRNAi/) was developed as a valuable complementary resource for the feeding RNAi library by providing the predicted information on all bacterial strains. Moreover, users can submit direct sequencing results or any other annotations for the bacterial strains, which will be integrated into the CelRNAi database to improve the accuracy of the library. In addition, we provide five candidate primer sets for each of the unreliable and marginal bacterial strains for users to construct an alternative vector for their own RNAi studies.

Conclusions

Because of the potential unreliability of the Ahringer C. elegans RNAi feeding library, we strongly suggest that users examine the reliability information of the bacterial strains in the CelRNAi database before performing RNAi experiments, as well as during post-RNAi experiment analysis.

6.
A C++ class library is available to facilitate the implementation of software for genomics and sequence polymorphism analysis. The library implements methods for data manipulation and the calculation of several statistics commonly used to analyze SNP data. The object-oriented design of the library is intended to be extensible, allowing users to design custom classes for their own needs. In addition, routines are provided to process samples generated by a widely used coalescent simulation. AVAILABILITY: The source code (in C++) is available from http://www.molpopgen.org

7.
8.
Variable (V) domains of immunoglobulins (Ig) and T cell receptors (TCR) are generated from genomic V gene segments (V-genes). At present, such V-genes have been annotated only within the genomes of a few species. We have developed a bioinformatics tool that accelerates the task of identifying functional V-genes from genome datasets. Automated recognition is accomplished by recognizing key V-gene signatures, such as recombination signal sequences, size of the exon region, and position of amino acid motifs within the translated exon. This algorithm also classifies extracted V-genes into either TCR or Ig loci. We describe the implementation of the algorithm and validate its accuracy by comparing V-genes identified from the human and mouse genomes with known V-gene annotations documented and available in public repositories. The advantages and utility of the algorithm are illustrated by using it to identify functional V-genes in the rat genome, where V-gene annotation is still incomplete. This allowed us to perform a comparative human–rodent phylogenetic analysis based on V-genes that supports the hypothesis that distinct evolutionary pressures shape the TCR and Ig V-gene repertoires. Our program, together with a graphical user interface, is available as open-source software, downloadable at http://code.google.com/p/vgenextract/.

9.
Genomic mutation consequence calculator (cited 1 time: 0 self-citations, 1 by others)

10.
Sequence annotation is essential for genomics-based research. Investigators of a specific genomic region who have developed abundant local discoveries such as genes and genetic markers, or have collected annotations from multiple resources, can be overwhelmed by the difficulty in creating local annotation and the complexity of integrating all the annotations. Presenting such integrated data in a form suitable for data mining and high-throughput experimental design is even more daunting. DNannotator, a web application, was designed to perform batch annotation on a sizeable genomic region. It takes annotation source data, such as SNPs, genes, primers, and so on, prepared by the end-user and/or a specified target of genomic DNA, and performs de novo annotation. DNannotator can also robustly migrate existing annotations in GenBank format from one sequence to another. Annotation results are provided in GenBank format and in tab-delimited text, which can be imported and managed in a database or spreadsheet and combined with existing annotation as desired. Graphic viewers, such as Genome Browser or Artemis, can display the annotation results. Reference data (reports on the process) facilitating the user's evaluation of annotation quality are optionally provided. DNannotator can be accessed at http://sky.bsd.uchicago.edu/DNannotator.htm.

11.
gff2ps: visualizing genomic annotations (cited 3 times: 0 self-citations, 3 by others)
gff2ps is a program for visualizing annotations of genomic sequences. The program takes the annotated features on a genomic sequence in GFF format as input, and produces a visual output in PostScript. While it can be used in a very simple way, it also allows for a great degree of customization through a number of options and/or customization files.
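For readers unfamiliar with the input format, GFF is a tab-delimited text format with nine columns per feature. The sketch below reads such records into Python tuples. It is not part of gff2ps; the names `Feature` and `parse_gff` are hypothetical, and the column names follow the common GFF version 2 layout, which is assumed here because the abstract does not specify a version.

```python
from typing import NamedTuple


class Feature(NamedTuple):
    seqname: str
    source: str
    feature: str
    start: int   # 1-based, inclusive
    end: int     # 1-based, inclusive
    score: str   # kept as text; "." means no score
    strand: str  # "+", "-" or "."
    frame: str   # "0", "1", "2" or "."
    group: str   # free-text grouping/attribute column


def parse_gff(text):
    """Parse GFF-style text into a list of Feature records.

    Illustrative sketch: skips blank lines and '#' comment lines and
    treats the ninth column as optional.
    """
    features = []
    for line in text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue
        cols = line.split("\t")
        cols += [""] * (9 - len(cols))  # pad a missing group column
        features.append(
            Feature(cols[0], cols[1], cols[2], int(cols[3]), int(cols[4]),
                    cols[5], cols[6], cols[7], cols[8])
        )
    return features
```

A drawing program would then map each record's `start` and `end` onto a horizontal axis and use `strand` and `group` to decide where and how to draw it.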

12.
GenMiner is an implementation of association rule discovery dedicated to the analysis of genomic data. It allows the analysis of datasets integrating multiple sources of biological data represented as both discrete values, such as gene annotations, and continuous values, such as gene expression measures. GenMiner implements the new NorDi (normal discretization) algorithm for normalizing and discretizing continuous values and takes advantage of the Close algorithm to efficiently generate minimal non-redundant association rules. Experiments show that the execution time and memory usage of GenMiner are significantly smaller than those of the standard Apriori-based approach, as is the number of extracted association rules. AVAILABILITY: The GenMiner software and supplementary materials are available at http://bioinfo.unice.fr/publications/genminer_article/ and http://keia.i3s.unice.fr/?Implementations:GenMiner SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.

13.
Matrix-assisted laser desorption/ionization-mass spectrometry (MALDI-MS) is the pre-eminent technique for mass mapping of glycans. In order to make this technique practical for high-throughput screening, reliable automatic methods of annotating peaks must be devised. We describe an algorithm called Cartoonist that labels peaks in MALDI spectra of permethylated N-glycans with cartoons which represent the most plausible glycans consistent with the peak masses and the types of glycans being analyzed. There are three main parts to Cartoonist. (i) It selects annotations from a library of biosynthetically plausible cartoons. The library we currently use has about 2800 cartoons, but was constructed using only about 300 archetype cartoons entered by hand. (ii) It determines the precision and calibration of the machine used to generate the spectrum. It does this automatically based on the spectrum itself. (iii) It assigns a confidence score to each annotation. In particular, rather than making a binary yes/no decision when annotating a peak, it makes all plausible annotations and associates them with scores indicating the probability that they are correct.

14.

Background

With the rapid accumulation of genomic data, annotating and interpreting these data has become a challenging issue. As a representative approach, gene set enrichment analysis has been widely used to interpret large molecular datasets generated by biological experiments. The result of gene set enrichment analysis relies heavily on the quality and integrity of gene set annotations. Although several methods have been developed to annotate gene sets, there is still a lack of high-quality annotation methods. Here, we propose a novel method to improve annotation accuracy by combining the GO structure and gene expression data.

Results

We propose a novel approach for optimizing gene set annotations to get more accurate annotation results. The proposed method filters inconsistent annotations using GO structure information and probabilistic gene set clusters calculated over a range of cluster sizes on multiple bootstrap-resampled datasets. The proposed method is applied to p53 cell line, colon cancer, and breast cancer gene expression data. The experimental results show that the proposed method can filter out a number of annotations unrelated to the experimental data, increase gene set enrichment power, and decrease the inconsistency of annotations.

Conclusions

A novel gene set annotation optimization approach is proposed to improve the quality of gene annotations. Experimental results indicate that the proposed method effectively improves gene set annotation quality based on the GO structure and gene expression data.

15.
Information theory is a branch of mathematics that overlaps with communications, biology, and medical engineering. Entropy is a measure of the uncertainty in a set of information. In this study, the entropy of each gene and of its exon set was calculated for orders one to four. The Kullback-Leibler divergence was then calculated from the relative entropy of genes and exons. The resulting Kullback-Leibler distances for the gene and exon sets were used as input to seven clustering algorithms: single, complete, average, weighted, centroid, median, and K-means. To aggregate the results of clustering, the AdaBoost algorithm was used. Finally, the results of the AdaBoost algorithm were examined with the GeneMANIA prediction server to assess them from a gene annotation point of view. All calculations were performed using the MATLAB Engineering Software (2015). Examination of the genes' metabolic pathways based on their gene annotations indicated that the proposed clustering method yielded correct, logical, and fast results. At the same time, the method did not have the disadvantages of alignment: it allowed genes to be considered with their actual length and content and did not require large amounts of memory for long sequences. We believe that the proposed method could be used alongside other competitive gene clustering methods to group biologically relevant sets of genes. The proposed method can also be seen as a predictive method for genes with weak genomic annotations.
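The two quantities the abstract relies on can be sketched in a few lines. The example below is in Python rather than the MATLAB used in the study, and it interprets "order k" as the entropy of the k-mer frequency distribution, which is an assumption because the abstract does not define the term. All function names are hypothetical, and the pseudo-probability used for k-mers missing from the second distribution is a crude smoothing choice made only to keep the divergence finite.

```python
import math
from collections import Counter


def kmer_distribution(seq, k):
    """Relative frequencies of overlapping k-mers in seq."""
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    total = sum(counts.values())
    return {kmer: n / total for kmer, n in counts.items()}


def entropy(dist):
    """Shannon entropy in bits."""
    return -sum(p * math.log2(p) for p in dist.values())


def kl_divergence(p, q, pseudo=1e-9):
    """Kullback-Leibler divergence D(p || q) in bits.

    ASSUMPTION: k-mers absent from q get a small pseudo-probability
    instead of making the divergence infinite.
    """
    return sum(pv * math.log2(pv / q.get(kmer, pseudo))
               for kmer, pv in p.items())


# Example: compare a gene with one of its exons at orders one to four.
gene = "ATGGCGTACGTTAGCATGCCGTA"
exon = "GCGTACGTTAGC"
for k in range(1, 5):
    p, q = kmer_distribution(exon, k), kmer_distribution(gene, k)
    print(k, round(entropy(p), 3), round(kl_divergence(p, q), 3))
```

The sequences here are made up for illustration. In a pipeline like the one described, the pairwise divergences would be collected into a distance matrix and passed to the clustering step.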

16.

Background

Cross-species comparisons of gene neighborhoods (also called genomic contexts) in microbes may provide insight into determining functionally related or co-regulated sets of genes, suggest annotations of previously un-annotated genes, and help to identify horizontal gene transfer events across microbial species. Existing tools to investigate genomic contexts, however, lack features for dynamically comparing and exploring genomic regions from multiple species. As DNA sequencing technologies improve and the number of whole sequenced microbial genomes increases, a user-friendly genome context comparison platform designed for use by a broad range of users promises to satisfy a growing need in the biological community.

Results

Here we present JContextExplorer: a tool that organizes genomic contexts into branching diagrams. We implement several alternative context-comparison and tree rendering algorithms, and allow for easy transitioning between different clustering algorithms. To facilitate genomic context analysis, our tool implements GUI features, such as text search filtering, point-and-click interrogation of individual contexts, and genomic visualization via a multi-genome browser. We demonstrate a use case of our tool by attempting to resolve annotation ambiguities between two highly homologous yet functionally distinct genes in a set of 22 alpha and gamma proteobacteria.

Conclusions

JContextExplorer should enable a broad range of users to analyze and explore genomic contexts. The program has been tested on Windows, Mac, and Linux operating systems, and is implemented both as an executable JAR file and as a Java WebStart application. Program executables, source code, and documentation are available at http://www.bme.ucdavis.edu/facciotti/resources_data/software/.

17.
MOTIVATION: To identify and characterize regions of functional interest in genomic sequence requires full, flexible query access to an integrated, up-to-date view of all related information, irrespective of where it is stored (within an organization or across the Internet) and its format (traditional database, flat file, web site, results of runtime analysis). Wide-ranging multi-source queries often return unmanageably large result sets, requiring non-traditional approaches to exclude extraneous data. RESULTS: Target Informatics Net (TINet) is a readily extensible data integration system developed at GlaxoSmithKline (GSK), based on the Object-Protocol Model (OPM) multidatabase middleware system of Gene Logic Inc. Data sources currently integrated include: the Mouse Genome Database (MGD) and Gene Expression Database (GXD), GenBank, SwissProt, PubMed, GeneCards, the results of runtime BLAST and PROSITE searches, and GSK proprietary relational databases. Special-purpose class methods used to filter and augment query results include regular expression pattern-matching over BLAST HSP alignments and retrieving partial sequences derived from primary structure annotations. All data sources and methods are accessible through an SQL-like query language or a GUI, so that when new investigations arise no additional programming beyond query specification is required. The power and flexibility of this approach are illustrated in such integrated queries as: (1) 'find homologs in genomic sequence to all novel genes cloned and reported in the scientific literature within the past three months that are linked to the MeSH term "neoplasms"'; (2) 'using a neuropeptide precursor query sequence, return only HSPs where the target genomic sequences conserve the G[KR][KR] motif at the appropriate points in the HSP alignment'; and (3) 'of the human genomic sequences annotated with exon boundaries in GenBank, return only those with valid putative donor/acceptor sites and start/stop codons'.
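The G[KR][KR] motif in query (2) is an ordinary regular expression: a glycine followed by two residues that are each lysine or arginine. The sketch below shows that one piece in Python, locating the motif in a translated sequence. It is not TINet or OPM code: the function name is hypothetical, and checking that the motif falls "at the appropriate points" of an HSP alignment would need alignment coordinates that are not modelled here.

```python
import re

# Glycine followed by two basic residues (K or R), as written in the abstract.
MOTIF = re.compile(r"G[KR][KR]")


def motif_positions(protein):
    """Return 0-based start positions of the G[KR][KR] motif in a protein string."""
    return [m.start() for m in MOTIF.finditer(protein)]


# Hypothetical peptide used only to illustrate the call.
print(motif_positions("MAGKRLLGRKQ"))
```

A filter like the one described in the abstract could keep an HSP only when the positions found in the target sequence line up with the motif positions in the query.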

18.
19.
20.
MOTIVATION: Despite advances in the gene annotation process, the functions of a large portion of gene products remain insufficiently characterized. In addition, the in silico prediction of novel Gene Ontology (GO) annotations for partially characterized gene functions or processes is highly dependent on reverse genetic or functional genomic approaches. To our knowledge, no prediction method has been demonstrated to be highly accurate for sparsely annotated GO terms (those associated with fewer than 10 genes). RESULTS: We propose a novel approach, information theory-based semantic similarity (ITSS), to automatically predict molecular functions of genes based on existing GO annotations. Using a 10-fold cross-validation, we demonstrate that the ITSS algorithm obtains prediction accuracies (precision 97%, recall 77%) comparable to other machine learning algorithms when compared in similar conditions over densely annotated portions of the GO datasets. This method is able to generate highly accurate predictions in sparsely annotated portions of GO, where previous algorithms have failed. As a result, our technique generates an order of magnitude more functional predictions than previous methods. A 10-fold cross-validation demonstrated a precision of 90% at a recall of 36% for the algorithm over sparsely annotated networks of the recent GO annotations (about 1400 GO terms and 11,000 genes in Homo sapiens). To our knowledge, this article presents the first historical rollback validation for the predicted GO annotations, which may represent more realistic conditions than more widely used cross-validation approaches. By manually assessing a random sample of 100 predictions conducted in a historical rollback evaluation, we estimate that a minimum precision of 51% (95% confidence interval: 43-58%) can be achieved for the human GO Annotation file dated 2003. AVAILABILITY: The program is available on request. The 97,732 positive predictions of novel gene annotations from the 2005 GO Annotation dataset and other supplementary information are available at http://phenos.bsd.uchicago.edu/ITSS/. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
