Similar articles
1.
Rapid, sensitive, and specific virus detection is an important component of clinical diagnostics. Massively parallel sequencing enables new diagnostic opportunities that complement traditional serological and PCR based techniques. While massively parallel sequencing promises the benefits of being more comprehensive and less biased than traditional approaches, it presents new analytical challenges, especially with respect to detection of pathogen sequences in metagenomic contexts. To a first approximation, the initial detection of viruses can be achieved simply through alignment of sequence reads or assembled contigs to a reference database of pathogen genomes with tools such as BLAST. However, recognition of highly divergent viral sequences is problematic, and may be further complicated by the inherently high mutation rates of some viral types, especially RNA viruses. In these cases, increased sensitivity may be achieved by leveraging position-specific information during the alignment process. Here, we constructed HMMER3-compatible profile hidden Markov models (profile HMMs) from all the virally annotated proteins in RefSeq in an automated fashion using a custom-built bioinformatic pipeline. We then tested the ability of these viral profile HMMs (“vFams”) to accurately classify sequences as viral or non-viral. Cross-validation experiments with full-length gene sequences showed that the vFams were able to recall 91% of left-out viral test sequences without erroneously classifying any non-viral sequences into viral protein clusters. Thorough reanalysis of previously published metagenomic datasets with a set of the best-performing vFams showed that they were more sensitive than BLAST for detecting sequences originating from more distant relatives of known viruses. To facilitate the use of the vFams for rapid detection of remote viral homologs in metagenomic data, we provide two sets of vFams, comprising more than 4,000 vFams each, in the HMMER3 format. 
We also provide the software necessary to build custom profile HMMs or update the vFams as more viruses are discovered (http://derisilab.ucsf.edu/software/vFam).

2.
A new functional gene database, FOAM (Functional Ontology Assignments for Metagenomes), was developed to screen environmental metagenomic sequence datasets. FOAM provides a new functional ontology dedicated to classifying gene functions relevant to environmental microorganisms based on Hidden Markov Models (HMMs). Sets of aligned protein sequences (i.e. ‘profiles’) were tailored to a large group of target KEGG Orthologs (KOs) from which HMMs were trained. The alignments were checked and curated to make them specific to the targeted KO. Within this process, sequence profiles were enriched with the most abundant sequences available to maximize the yield of accurate classifier models. An associated functional ontology was built to describe the functional groups and hierarchy. FOAM allows the user to select the target search space before HMM-based comparison steps and to easily organize the results into different functional categories and subcategories. FOAM is publicly available at http://portal.nersc.gov/project/m1317/FOAM/.

3.
There is a growing interest in the non-ribosomal peptide synthetases (NRPSs) and polyketide synthases (PKSs) of microbes, fungi and plants because they can produce bioactive peptides such as antibiotics. The ability to identify the substrate specificity of the enzyme's adenylation (A) and acyl-transferase (AT) domains is essential to rationally deduce or engineer new products. We here report on a Hidden Markov Model (HMM)-based ensemble method to predict the substrate specificity at high quality. We collected a new reference set of experimentally validated sequences. An initial classification based on alignment and Neighbor Joining was performed in line with most of the previously published prediction methods. We then created and tested single substrate specific HMMs and found that their use improved the correct identification significantly for A as well as for AT domains. A major advantage of the use of HMMs is that it abolishes the dependency on multiple sequence alignment and residue selection that is hampering the alignment-based clustering methods. Using our models we obtained a high prediction quality for the substrate specificity of the A domains similar to two recently published tools that make use of HMMs or Support Vector Machines (NRPSsp and NRPS predictor2, respectively). Moreover, replacement of the single substrate specific HMMs by ensembles of models caused a clear increase in prediction quality. We argue that the superiority of the ensemble over the single model is caused by the way substrate specificity evolves for the studied systems. It is likely that this also holds true for other protein domains. The ensemble predictor has been implemented in a simple web-based tool that is available at http://www.cmbi.ru.nl/NRPS-PKS-substrate-predictor/.

4.
Hidden Markov models (HMMs) and their variants are widely used in Bioinformatics applications that analyze and compare biological sequences. Designing a novel application requires the insight of a human expert to define the model's architecture. The implementation of prediction algorithms and algorithms to train the model's parameters, however, can be a time-consuming and error-prone task. We here present HMMConverter, a software package for setting up probabilistic HMMs, pair-HMMs as well as generalized HMMs and pair-HMMs. The user defines the model itself and the algorithms to be used via an XML file which is then directly translated into efficient C++ code. The software package provides linear-memory prediction algorithms, such as the Hirschberg algorithm, banding and the integration of prior probabilities and is the first to present computationally efficient linear-memory algorithms for automatic parameter training. Users of HMMConverter can thus set up complex applications with a minimum of effort and also perform parameter training and data analyses for large data sets.
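As a rough illustration of the kind of prediction algorithm such a package generates, the Python sketch below runs the standard Viterbi algorithm on a toy two-state HMM. It is not HMMConverter's code or its XML format: the state names, the probabilities, and the `viterbi` function are assumptions made for this example, and it uses the plain formulation that stores all back-pointers rather than the linear-memory variants mentioned in the abstract.

```python
import math

def viterbi(obs, states, start_p, trans_p, emit_p):
    """Return the most probable state path for obs under a toy HMM (log space)."""
    # score[s] = best log-probability of any path that ends in state s
    score = {s: math.log(start_p[s]) + math.log(emit_p[s][obs[0]]) for s in states}
    back = []  # back[t][s] = best predecessor of state s at position t + 1
    for symbol in obs[1:]:
        prev = score
        score, pointers = {}, {}
        for s in states:
            best = max(states, key=lambda r: prev[r] + math.log(trans_p[r][s]))
            score[s] = prev[best] + math.log(trans_p[best][s]) + math.log(emit_p[s][symbol])
            pointers[s] = best
        back.append(pointers)
    # Trace back from the best final state
    path = [max(states, key=lambda s: score[s])]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    return path[::-1]

# Hypothetical two-state model: AT-rich versus GC-rich regions
STATES = ("AT", "GC")
START = {"AT": 0.5, "GC": 0.5}
TRANS = {"AT": {"AT": 0.9, "GC": 0.1}, "GC": {"AT": 0.1, "GC": 0.9}}
EMIT = {
    "AT": {"A": 0.4, "T": 0.4, "G": 0.1, "C": 0.1},
    "GC": {"A": 0.1, "T": 0.1, "G": 0.4, "C": 0.4},
}

path = viterbi("AATTGGCC", STATES, START, TRANS, EMIT)
```

With these made-up parameters the decoded path labels the first four bases as the AT-rich state and the last four as the GC-rich state.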

5.
Of the sequence comparison methods, profile-based methods perform with greater selectivity than those that use pairwise comparisons. Of the profile methods, hidden Markov models (HMMs) are apparently the best. The first part of this paper describes calculations that (i) improve the performance of HMMs and (ii) determine a good procedure for creating HMMs for sequences of proteins of known structure. For a family of related proteins, more homologues are detected using multiple models built from diverse single seed sequences than from one model built from a good alignment of those sequences. A new procedure is described for detecting and correcting those errors that arise at the model-building stage of the procedure. These two improvements greatly increase selectivity and coverage. The second part of the paper describes the construction of a library of HMMs, called SUPERFAMILY, that represent essentially all proteins of known structure. The sequences of the domains in proteins of known structure, that have identities less than 95%, are used as seeds to build the models. Using the current data, this gives a library with 4894 models. The third part of the paper describes the use of the SUPERFAMILY model library to annotate the sequences of over 50 genomes. The models match twice as many target sequences as are matched by pairwise sequence comparison methods. For each genome, close to half of the sequences are matched in all or in part and, overall, the matches cover 35% of eukaryotic genomes and 45% of bacterial genomes. On average roughly 15% of genome sequences are labelled as being hypothetical yet homologous to proteins of known structure. The annotations derived from these matches are available from a public web server at: http://stash.mrc-lmb.cam.ac.uk/SUPERFAMILY. This server also enables users to match their own sequences against the SUPERFAMILY model library.

6.
Accurate and rapid characterization of influenza A virus (IAV) hemagglutinin (HA) and neuraminidase (NA) sequences with respect to subtype and clade is at the basis of extended diagnostic services and implicit to molecular epidemiologic studies. ClassyFlu is a new tool and web service for the classification of IAV sequences of the HA and NA gene into subtypes and phylogenetic clades using discriminatively trained profile hidden Markov models (HMMs), one for each subtype or clade. ClassyFlu merely requires as input unaligned, full-length or partial HA or NA DNA sequences. It enables rapid and highly accurate assignment of HA sequences to subtypes H1–H17 but particularly focusses on the finer grained assignment of sequences of highly pathogenic avian influenza viruses of subtype H5N1 according to the cladistics proposed by the H5N1 Evolution Working Group. NA sequences are classified into subtypes N1–N10. ClassyFlu was compared to semiautomatic classification approaches using BLAST and phylogenetics and additionally for H5 sequences to the new “Highly Pathogenic H5N1 Clade Classification Tool” (IRD-CT) proposed by the Influenza Research Database. Our results show that both web tools (ClassyFlu and IRD-CT), although based on different methods, are nearly equivalent in performance and both are more accurate and faster than semiautomatic classification. A retraining of ClassyFlu to altered cladistics as well as an extension of ClassyFlu to other IAV genome segments or fragments thereof is undemanding. This is exemplified by unambiguous assignment to a distinct cluster within subtype H7 of sequences of H7N9 viruses which emerged in China early in 2013 and caused more than 130 human infections. http://bioinf.uni-greifswald.de/ClassyFlu is a free web service. For local execution, the ClassyFlu source code in PERL is freely available.

7.
Biochemical tests are traditionally used for bacterial identification at the species level in clinical microbiology laboratories. While biochemical profiles are generally efficient for the identification of the most important corynebacterial pathogen Corynebacterium diphtheriae, their ability to differentiate between biovars of this bacterium is still controversial. Besides, the unambiguous identification of emerging human pathogenic species of the genus Corynebacterium may be hampered by highly variable biochemical profiles commonly reported for these species, including Corynebacterium striatum, Corynebacterium amycolatum, Corynebacterium minutissimum, and Corynebacterium xerosis. In order to identify the genomic basis contributing to the biochemical variabilities observed in phenotypic identification methods of these bacteria, we combined a comprehensive literature review with a bioinformatics approach based on reconstruction of six specific biochemical reactions/pathways in 33 recently released whole genome sequences. We used data retrieved from curated databases (MetaCyc, PathoSystems Resource Integration Center (PATRIC), The SEED, TransportDB, UniProtKB) associated with homology searches by BLAST and profile Hidden Markov Models (HMMs) to detect enzymes participating in the various pathways and performed ab initio protein structure modeling and molecular docking to confirm specific results. We found a differential distribution among the various strains of genes that code for some important enzymes, such as beta-phosphoglucomutase and fructokinase, and also for individual components of carbohydrate transport systems, including the fructose-specific phosphoenolpyruvate-dependent sugar phosphotransferase (PTS) and the ribose-specific ATP-binding cassette (ABC) transporter. Horizontal gene transfer plays a role in the biochemical variability of the isolates, as some genes needed for sucrose fermentation were seen to be present in genomic islands.
Notably, using profile HMMs, we identified an enzyme with putative alpha-1,6-glycosidase activity only in some specific strains of C. diphtheriae, and this may aid understanding of the differential abilities of the biovars to utilize glycogen and starch.

8.
The PANTHER database was designed for high-throughput analysis of protein sequences. One of the key features is a simplified ontology of protein function, which allows browsing of the database by biological functions. Biologist curators have associated the ontology terms with groups of protein sequences rather than individual sequences. Statistical models (Hidden Markov Models, or HMMs) are built from each of these groups. The advantage of this approach is that new sequences can be automatically classified as they become available. To ensure accurate functional classification, HMMs are constructed not only for families, but also for functionally distinct subfamilies. Multiple sequence alignments and phylogenetic trees, including curator-assigned information, are available for each family. The current version of the PANTHER database includes training sequences from all organisms in the GenBank non-redundant protein database, and the HMMs have been used to classify gene products across the entire genomes of human and Drosophila melanogaster. The ontology terms and protein families and subfamilies, as well as Drosophila gene classifications, can be browsed and searched for free. Due to outstanding contractual obligations, access to human gene classifications and to protein family trees and multiple sequence alignments will temporarily require a nominal registration fee. PANTHER is publicly available on the web at http://panther.celera.com.

9.
Plant seeds usually have high concentrations of proteinase and amylase inhibitors. These inhibitors exhibit a wide range of specificity, stability and oligomeric structure. In this communication, we report analysis of sequences that show statistically significant similarity to the double-headed α-amylase/trypsin inhibitor of ragi (Eleusine coracana). Our aim is to understand their evolutionary and structural features. The 14 sequences of this family that are available in the SWISSPROT database form three evolutionarily distinct branches. The branches relate to enzyme specificities and also probably to the oligomeric state of the proteins and not to the botanical class of the plant from which the enzymes are derived. This suggests that the enzyme specificities of the inhibitors evolved before the divergence of commercially cultivated cereals. The inhibitor sequences have three regions that display periodicity in hydrophobicity. It is likely that this feature reflects extended secondary structure in these segments. One of the most variable regions of the polypeptide corresponds to a loop, which is most probably exposed in the native structure of the inhibitors and is responsible for the inhibitory property.

10.
Motivation: A large number of new DNA sequences with virtually unknown functions are generated as the Human Genome Project progresses. Therefore, it is essential to develop computer algorithms that can predict the functionality of DNA segments according to their primary sequences, including algorithms that can predict promoters. Although several promoter-predicting algorithms are available, they have high false-positive detections and the rate of promoter detection needs to be improved further. Results: In this research, PromFD, a computer program to recognize vertebrate RNA polymerase II promoters, has been developed. Both vertebrate promoters and non-promoter sequences are used in the analysis. The promoters are obtained from the Eukaryotic Promoter Database. Promoters are divided into a training set and a test set. Non-promoter sequences are obtained from the GenBank sequence databank, and are also divided into a training set and a test set. The first step is to search out, among all possible permutations, patterns of strings 5–10 bp long, that are significantly over-represented in the promoter set. The program also searches IMD (Information Matrix Database) matrices that have a significantly higher presence in the promoter set. The results of the searches are stored in the PromFD database, and the program PromFD scores input DNA sequences according to their content of the database entries. PromFD predicts promoters: their locations and the location of potential TATA boxes, if found. The program can detect 71% of promoters in the training set with a false-positive rate of under 1 in every 13 000 bp, and 47% of promoters in the test set with a false-positive rate of under 1 in every 9800 bp. PromFD uses a new approach and its false-positive identification rate is better compared with other available promoter recognition algorithms. The source code for PromFD is in the ‘c++’ language. Availability: PromFD is available for Unix platforms by anonymous ftp to: beagle.colorado.edu, cd pub, get promFD.tar. A Java version of the program is also available for Netscape 2.0, by http://beagle.colorado.edu/chenq. Contact: E-mail: chenq{at}beagle.colorado.edu
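The pattern search described in this abstract (short strings that are over-represented in promoters relative to non-promoters) can be sketched in Python as follows. This is not PromFD's implementation: the function names, the per-sequence presence counting, the pseudocount, and the ratio threshold are all assumptions chosen for illustration, and the sequences are made up. Only one string length (6 bp) is shown; the abstract's 5–10 bp range would mean repeating the search for each length.

```python
from collections import Counter

def kmer_presence(seqs, k):
    """Count, for each k-mer, how many sequences contain it at least once."""
    counts = Counter()
    for seq in seqs:
        counts.update({seq[i:i + k] for i in range(len(seq) - k + 1)})
    return counts

def overrepresented(promoters, background, k, min_ratio=6.0, pseudo=0.5):
    """Return k-mers whose presence frequency in promoters is at least
    min_ratio times their frequency in the background (with a pseudocount)."""
    pos = kmer_presence(promoters, k)
    neg = kmer_presence(background, k)
    hits = {}
    for kmer, n_pos in pos.items():
        f_pos = (n_pos + pseudo) / (len(promoters) + 2 * pseudo)
        f_neg = (neg[kmer] + pseudo) / (len(background) + 2 * pseudo)
        if f_pos / f_neg >= min_ratio:
            hits[kmer] = f_pos / f_neg
    return hits

# Made-up toy data: every "promoter" carries a TATAAA motif
promoters = ["GGCTATAAAGGC", "CCTATAAATCGG", "ATGCTATAAACG"]
background = ["GGCCGGCCGGCC", "ACGTACGTACGT", "TTGGCCAATTGG"]
hits = overrepresented(promoters, background, k=6, min_ratio=6.0)
```

A real analysis would replace the fixed ratio threshold with a statistical test of significance, which is what "significantly over-represented" implies; the threshold here only keeps the sketch short.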

11.

Background

Dementia is an age-related cognitive decline which is indicated by an early degeneration of cortical and sub-cortical structures. Characterizing those morphological changes can help to understand the disease development and contribute to early prediction and prevention of the disease. However, modeling that best captures brain structural variability and is valid for both disease classification and interpretation is extremely challenging. The current study aimed to establish a computational approach for modeling the magnetic resonance imaging (MRI)-based structural complexity of the brain using the framework of hidden Markov models (HMMs) for dementia recognition.

Methods

Regularity dimension and semi-variogram were used to extract structural features of the brains, and a vector quantization (VQ) method was applied to convert the extracted feature vectors to prototype vectors. The output VQ indices were then used to estimate the parameters of the HMMs. To validate accuracy and robustness, experiments were carried out on individuals characterized as non-demented or as having mild Alzheimer's disease. Four HMMs were constructed, one each for the cohorts of non-demented young, non-demented middle-aged, non-demented elderly, and demented elderly subjects. Classification was carried out using a data set including both non-demented and demented individuals with a wide age range.

Results

The proposed HMMs succeeded in recognizing individuals with mild Alzheimer's disease and achieved better classification accuracy than related works using different classifiers. The results show the ability of the proposed modeling to recognize early dementia.

Conclusion

The findings from this research will allow individual classification to support the early diagnosis and prediction of dementia. The brain MRI-based HMMs developed in our proposed research will make this more efficient and robust, and can easily be used by clinicians as a computer-aided tool for validating imaging bio-markers for early prediction of dementia.
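The vector quantization step in the Methods of this abstract, which maps each feature vector to the index of its nearest prototype before the HMM parameters are estimated, can be sketched in Python as follows. The codebook values, the two-dimensional features, and the `quantize` function are hypothetical; the abstract does not say how the prototypes were obtained or which distance was used, so squared Euclidean distance is assumed here.

```python
def quantize(vectors, codebook):
    """Map each feature vector to the index of its nearest codebook entry
    (squared Euclidean distance). The indices form a discrete observation
    sequence of the kind a discrete HMM is trained on."""
    def dist2(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return [min(range(len(codebook)), key=lambda i: dist2(v, codebook[i]))
            for v in vectors]

# Hypothetical prototypes and feature vectors
codebook = [(0.0, 0.0), (1.0, 1.0), (2.0, 0.5)]
features = [(0.1, -0.2), (0.9, 1.2), (1.8, 0.4), (1.1, 0.9)]
symbols = quantize(features, codebook)
```

Here `symbols` is the index sequence `[0, 1, 2, 1]`, one codebook index per feature vector.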

12.
The common bean, Phaseolus vulgaris, contains a family of defense proteins that comprises phytohemagglutinin (PHA), arcelin, and α-amylase inhibitor (AI). Here we report eight new derived amino acid sequences of genes in this family obtained with either the polymerase chain reaction using genomic DNA, or by screening cDNA libraries made with RNA from developing beans. These new sequences are: two AI sequences and arcelin-4 obtained from a wild accession of P. vulgaris that is resistant to the Mexican bean weevil (Zabrotes subfasciatus) and the bean weevil (Acanthoscelides obtectus); an AI sequence from the related species P. acutifolius (tepary bean); a PHA and an arcelin-like sequence from P. acutifolius; an AI-like sequence from P. maculatus; and a PHA sequence from an arcelin-5 type P. vulgaris. A dendrogram of 16 sequences shows that they fall into the three identified groups: phytohemagglutinins, arcelins and AIs. A comparison of these derived amino acid sequences indicates that one of the four amino acid residues that is conserved in all legume lectins and is required for carbohydrate binding is absent from all the arcelins; two of the four conserved residues needed for carbohydrate binding are missing from all the AIs. Proteolytic processing at an Asn-Ser site is required for the activation of AI, and this site is present in all AI-like sequences; this processing site is also found at the same position in certain arcelins, which are not proteolytically processed. The presence of this site is therefore not sufficient for processing to occur.

13.
HMMSPECTR is a tool for finding putative structural homologs for proteins with known primary sequences. HMMSPECTR contains four major components: a data warehouse with the hidden Markov models (HMM) and alignment libraries; a search program which compares the initial protein sequences with the libraries of HMMs; a secondary structure prediction and comparison program; and a dominant protein selection program that prepares the set of 10-15 "best" proteins from the chosen HMMs. The data warehouse contains four libraries of HMMs. The first two libraries were constructed using different HMM preparation options of the HMMER program. The third library contains parts ("partial HMM") of initial alignments. The fourth library contains trained HMMs. We tested our program against all of the protein targets proposed in the CASP4 competition. The data warehouse included libraries of structural alignments and HMMs constructed on the basis of proteins publicly available in the Protein Data Bank before the CASP4 meeting. The newest fully automated versions of HMMSPECTR 1.02 and 1.02ss produced better results than the best result reported at CASP4 either by r.m.s.d. or by length (or both) in 64% (HMMSPECTR 1.02) and 79% (HMMSPECTR 1.02ss) of the cases. The improvement is most notable for the targets with complexity 4 (difficult fold recognition cases).

14.
Yeast artificial chromosome (YAC) cloning systems enable the cloning of DNA stretches of 50 to well over 2000 kb. This makes it possible to study large intact regions of DNA in detail, by restriction mapping the YAC to produce a physical map and by examining the YAC for coding sequences or genes. YACs are important for their ability to clone the complete sequences of large genes or gene complexes that exceed the size limit for cloning in conventional bacterial cloning vectors like plasmids (up to 10 kb), bacteriophage (15 kb), and cosmids (50 kb). A major advantage of cloning in yeast, a eukaryote, is that many sequences that are unstable, underrepresented, or absent when cloned into prokaryotic systems remain stable and intact in YAC clones. It is possible to reintroduce YACs intact into mammalian cells where the introduced mammalian genes are expressed and used to study the functions of genes in the context of flanking sequences. The correct protein processing mechanisms are present in the mammalian cells to ensure that a viable protein product is produced.

15.
SUMMARY: Hidden Markov models (HMMs) are widely used for biological sequence analysis because of their ability to incorporate biological information in their structure. An automatic means of optimizing the structure of HMMs would be highly desirable. However, this raises two important issues: first, the new HMMs should be biologically interpretable, and second, we need to control the complexity of the HMM so that it has good generalization performance on unseen sequences. In this paper, we explore the possibility of using a genetic algorithm (GA) for optimizing the HMM structure. GAs are sufficiently flexible to allow incorporation of other techniques such as Baum-Welch training within their evolutionary cycle. Furthermore, operators that alter the structure of HMMs can be designed to favour interpretable and simple structures. In this paper, a training strategy using GAs is proposed, and it is tested on finding HMM structures for the promoter and coding region of the bacterium Campylobacter jejuni. The proposed GA for hidden Markov models (GA-HMM) allows HMMs with different numbers of states to evolve. To prevent over-fitting, a separate dataset is used for comparing the performance of the HMMs to that used for the Baum-Welch training. The GA-HMM was capable of finding an HMM comparable to a hand-coded HMM designed for the same task, which has been published previously.

16.
Analysis of bisulfite sequencing data usually requires two tasks: to call methylated cytosines (mCs) in a sample, and to detect differentially methylated regions (DMRs) between paired samples. Although numerous tools have been proposed for mC calling, methods for DMR detection have been largely limited. Here, we present Bisulfighter, a new software package for detecting mCs and DMRs from bisulfite sequencing data. Bisulfighter combines the LAST alignment tool for mC calling, and a novel framework for DMR detection based on hidden Markov models (HMMs). Unlike previous attempts that depend on empirical parameters, Bisulfighter can use the expectation-maximization algorithm for HMMs to adjust parameters for each data set. We conduct extensive experiments in which accuracy of mC calling and DMR detection is evaluated on simulated data with various mC contexts, read qualities, sequencing depths and DMR lengths, as well as on real data from a wide range of biological processes. We demonstrate that Bisulfighter consistently achieves better accuracy than other published tools, providing greater sensitivity for mCs with fewer false positives, more precise estimates of mC levels, more exact locations of DMRs and better agreement of DMRs with gene expression and DNase I hypersensitivity. The source code is available at http://epigenome.cbrc.jp/bisulfighter.

17.
18.

Introduction

The hepatitis C virus (HCV) genome encodes two envelope proteins (E1 and E2) responsible for virus entry into the cell. There is a substantial lack of sequences covering the full length of the E1/E2 region for genotype 4. Our study aims at providing new sequences as well as characterizing the genetic divergence of the E1/E2 region of HCV 4a using our new sequences along with all publicly available datasets.

Methods

The genomic segments covering the whole E1/E2 region were isolated from Egyptian HCV patients and sequenced. The resulting 36 sequences were analyzed using sequence analysis techniques to study variability within and among hosts at the same time point. Furthermore, previously published HCV E1/E2 sequence datasets for genotype 4a were retrieved, categorized according to geographical location and date of isolation, and used for further analysis of variability among Egyptian sequences over a period of 15 years; they were also compared with non-Egyptian sequences to identify region-specific variability.

Results

Phylogenetic analysis of the new sequences showed variability within the host and among different individuals at the same time point. Analysis of the 36 sequences along with the Egyptian sequences (254 E1 sequences from 1997 to 2010 and 8 E2 sequences from 2006 to 2010) showed change over time. Analysis of the new HCV sequences with the non-Egyptian sequences (182 E1 sequences and 155 E2 sequences) showed region-specific variability. The molecular clock rate of E1 was estimated to be 5E-3 per site per year for Egyptian and 5.38E-3 for non-Egyptian sequences. The clock rate of E2 was estimated to be 8.48E per site per year for Egyptian and 6.3E-3 for non-Egyptian sequences.

Conclusion

The results of this study support the high rate of evolution of the Egyptian HCV genotype 4a. They also reveal a significant level of genetic variability among sequences from different regions of the world.

19.
The MPI Bioinformatics Toolkit (https://toolkit.tuebingen.mpg.de) is a free, one-stop web service for protein bioinformatic analysis. It currently offers 34 interconnected external and in-house tools, whose functionality covers sequence similarity searching, alignment construction, detection of sequence features, structure prediction, and sequence classification. This breadth has made the Toolkit an important resource for experimental biology and for teaching bioinformatic inquiry. Recently, we replaced the first version of the Toolkit, which was released in 2005 and had served around 2.5 million queries, with an entirely new version, focusing on improved features for the comprehensive analysis of proteins, as well as on promoting teaching. For instance, our popular remote homology detection server, HHpred, now allows pairwise comparison of two sequences or alignments and offers additional profile HMMs for several model organisms and domain databases. Here, we introduce the new version of our Toolkit and its application to the analysis of proteins.

20.
Repeat elements are important components of eukaryotic genomes. One limitation in our understanding of repeat elements is that most analyses rely on reference genomes that are incomplete and often contain missing data in highly repetitive regions that are difficult to assemble. To overcome this problem we develop a new method, REPdenovo, which assembles repeat sequences directly from raw shotgun sequencing data. REPdenovo can construct various types of repeats that are highly repetitive and have low sequence divergence within copies. We show that REPdenovo is substantially better than existing methods both in terms of the number and the completeness of the repeat sequences that it recovers. The key advantage of REPdenovo is that it can reconstruct long repeats from sequence reads. We apply the method to human data and discover a number of potentially new repeat sequences that have been missed by previous repeat annotations. Many of these sequences are incorporated into various parasite genomes, possibly because the filtering process for host DNA involved in the sequencing of the parasite genomes failed to exclude the host-derived repeat sequences. REPdenovo is a powerful new computational tool for annotating genomes and for addressing questions regarding the evolution of repeat families. The software tool, REPdenovo, is available for download at https://github.com/Reedwarbler/REPdenovo.


Copyright © Beijing Qinyun Technology Development Co., Ltd. (北京勤云科技发展有限公司)  京ICP备09084417号