1.
Background
Automated comparison of complete sets of genes encoded in two genomes can provide insight into the genetic basis of differences in biological traits between species. Gene ontology (GO) is used as a common vocabulary to annotate genes for comparison. Current approaches calculate the fold of unweighted or weighted differences between two species at the high-level GO functional categories. However, to ensure the reliability of the differences detected, it is important to evaluate their statistical significance. It is also useful to search for differences at all levels of GO.
2.
3.
MOTIVATION: As more genomes are sequenced, the demand for fast gene classification techniques is increasing. To analyze a newly sequenced genome, first the genes are identified and translated into amino acid sequences which are then classified into structural or functional classes. The best-performing protein classification methods are based on protein homology detection using sequence alignment methods. Alignment methods have recently been enhanced by discriminative methods like support vector machines (SVMs) as well as by position-specific scoring matrices (PSSM) as obtained from PSI-BLAST. However, alignment methods are time consuming if a new sequence must be compared to many known sequences, and the same holds for SVMs. Constructing a PSSM for the new sequence is even more time consuming. The best-performing methods would take about 25 days on present-day computers to classify the sequences of a new genome (20,000 genes) as belonging to just one specific class; however, there are hundreds of classes. Another shortcoming of alignment algorithms is that they do not build a model of the positive class but measure the mutual distance between sequences or profiles. Multiple alignments and hidden Markov models are the only popular classification methods that build a model of the positive class, but they show low classification performance. The advantage of a model is that it can be analyzed for chemical properties common to the class members to obtain new insights into protein function and structure. We propose a fast model-based recurrent neural network for protein homology detection, the 'Long Short-Term Memory' (LSTM). LSTM automatically extracts indicative patterns for the positive class, but in contrast to profile methods it also extracts negative patterns and uses correlations between all detected patterns for classification.
LSTM can automatically extract useful local and global sequence statistics like hydrophobicity, polarity, volume, polarizability and combine them with a pattern. These properties make LSTM complementary to alignment-based approaches as it does not use predefined similarity measures like BLOSUM or PAM matrices. RESULTS: We have applied LSTM to a well-known benchmark for remote protein homology detection, where a protein must be classified as belonging to a SCOP superfamily. LSTM reaches state-of-the-art classification performance but is considerably faster for classification than other approaches with comparable classification performance. LSTM is five orders of magnitude faster than methods which perform slightly better in classification and two orders of magnitude faster than the fastest SVM-based approaches (which, however, have lower classification performance than LSTM). Only PSI-BLAST and HMM-based methods show time complexity comparable to that of LSTM, but they cannot compete with LSTM in classification performance. To test the modeling capabilities of LSTM, we applied LSTM to PROSITE classes and interpreted the extracted patterns. In 8 out of 15 classes, LSTM automatically extracted the PROSITE motif. In the remaining 7 cases alternative motifs are generated which give better classification results on average than the PROSITE motifs. AVAILABILITY: The LSTM algorithm is available from http://www.bioinf.jku.at/software/LSTM_protein/.
4.
MOTIVATION: Local structure segments (LSSs) are small structural units shared by unrelated proteins. They are extensively used in protein structure comparison, and predicted LSSs (PLSSs) are used very successfully in ab initio folding simulations. However, predicted or real LSSs are rarely exploited by protein sequence comparison programs that are based on position-by-position alignments. RESULTS: We developed a SEgment Alignment algorithm (SEA) to compare proteins described as a collection of predicted local structure segments (PLSSs), which is equivalent to an unweighted graph (network). Any specific structure, real or predicted, corresponds to a specific path in this network. SEA then uses a network matching approach to find the two most similar paths in networks representing two proteins. SEA explores the uncertainty and diversity of predicted local structure information to search for a globally optimal solution. It simultaneously solves two related problems: the alignment of two proteins and the local structure prediction for each of them. On a benchmark of protein pairs with low sequence similarity, we show that application of the SEA algorithm improves alignment quality as compared to FFAS profile-profile alignment, and in some cases SEA alignments can match the structural alignments, a feat previously impossible for any sequence-based alignment method.
5.
Microtubule length control is essential for the assembly and function of the mitotic spindle. Kinesin-like motor proteins that directly attenuate microtubule dynamics make key contributions to this control, but the specificity of these motors for different subpopulations of spindle microtubules is not understood. Kif18A (kinesin-8) localizes to the plus ends of the relatively slowly growing kinetochore fibers (K-fibers) and attenuates their dynamics, whereas Kif4A (kinesin-4) localizes to mitotic chromatin and suppresses the growth of highly dynamic, nonkinetochore microtubules. Although Kif18A and Kif4A similarly suppress microtubule growth in vitro, it remains unclear whether microtubule-attenuating motors control the lengths of K-fibers and nonkinetochore microtubules through a common mechanism. To address this question, we engineered chimeric kinesins that contain the Kif4A, Kif18B (kinesin-8), or Kif5B (kinesin-1) motor domain fused to the C-terminal tail of Kif18A. Each of these chimeric kinesins localizes to K-fibers; however, K-fiber length control requires an activity specific to kinesin-8s. Mutational studies of Kif18A indicate that this control depends on both its C-terminus and a unique, positively charged surface loop, called loop2, within the motor domain. These data support a model in which microtubule-attenuating kinesins are molecularly “tuned” to control the dynamics of specific subsets of spindle microtubules.
6.
7.
Mapping using unique sequences (total citations: 5; self-citations: 0; citations by others: 5)
D C Torney. Journal of molecular biology, 1991, 217(2): 259-264
Theoretical predictions are given for the progress expected when mapping DNA by identifying clones containing specific unique sequences. Progress is measured in three ways; however, all results depend on (dimensionless counterparts of) the number of clones and the number of unique sequences used. Furthermore, the effects of clone length dispersion are included in the theoretical predictions. Both the clones in the library and the unique sequences are assumed to be generated randomly, with uniform probability of originating at any base in the region to be mapped. The first measure of progress is the expected length fraction of the region to be mapped covered by at least one clone, when clones containing at least one unique sequence are included in the map. The second measure of progress is the expected length fraction of the region to be mapped in "covered intervals", an interval being the region between adjacent unique sequences. Alternative definitions for clones covering an interval are analyzed. The third measure of progress is the expected number of clone islands generated; an island covers successive intervals. Finally, using these measures of progress, we compare the efficiency of this new mapping strategy with conventional clone mapping strategies.
8.
9.
Pseudomonas putida Idaho is an organic-solvent-tolerant strain which can degrade and adapt to high concentrations of organic solvents. Here, we announce its first draft genome sequence (6,363,067 bp). We annotated 192 coding sequences (CDSs) responsible for aromatic compound metabolism, 40 CDSs encoding phospholipid synthesis, and 212 CDSs related to stress response.
10.
Background
We are interested in the problem of predicting secondary structure for small sets of homologous RNAs, by incorporating limited comparative sequence information into an RNA folding model. The Sankoff algorithm for simultaneous RNA folding and alignment is a basis for approaches to this problem. There are two open problems in applying a Sankoff algorithm: development of a good unified scoring system for alignment and folding and development of practical heuristics for dealing with the computational complexity of the algorithm.
11.
Hertveldt K, Lavigne R, Pleteneva E, Sernova N, Kurochkina L, Korchevskii R, Robben J, Mesyanzhinov V, Krylov VN, Volckaert G. Journal of molecular biology, 2005, 354(3): 536-545
Pseudomonas aeruginosa phage EL is a dsDNA phage related to the giant phiKZ-like Myoviridae. The EL genome sequence comprises 211,215 bp and has 201 predicted open reading frames (ORFs). The EL genome does not share DNA sequence homology with other viruses and micro-organisms sequenced to date. However, one-third of the predicted EL gene products (gps) share similarity (Blast alignments of 17-55% amino acid identity) with phiKZ proteins. Comparative EL and phiKZ genomics reveals that these giant phages are an example of substantially diverged genetic mosaics. Based on the position of similar EL and phiKZ predicted gene products, five genome regions can be delineated in EL, four of which are relatively conserved between EL and phiKZ. Region IV, a 17.7 kb genome region with 28 predicted ORFs, is unique to EL. Fourteen EL ORFs have been assigned a putative function based on protein similarity. Assigned proteins are involved in DNA replication and nucleotide metabolism (NAD+-dependent DNA ligase, ribonuclease HI, helicase, thymidylate kinase), host lysis and particle structure. EL-gp146 is the first chaperonin GroEL sequence identified in a viral genome. Besides a putative transposase, EL harbours predicted mobile endonucleases related to H-N-H and LAGLIDADG homing endonucleases associated with group I intron and intein intervening sequences.
12.
Protein sequence alignment has become an essential task in modern molecular biology research. A number of alignment techniques have been documented in the literature, and their corresponding tools are available as freeware and commercial software. Choosing and using these tools for sequence alignment, through to the complete interpretation of alignment results, is often considered non-trivial by end-users with limited skill in bioinformatics algorithm development. Here, for educational purposes, we compare sequence alignment techniques based on dynamic programming (N-W, S-W) and heuristics (LFASTA, BL2SEQ) on four sets of sequence data. The analysis suggests that heuristics-based methods are faster than dynamic programming methods in alignment speed.
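The two dynamic programming methods named in this abstract, Needleman-Wunsch (N-W, global alignment) and Smith-Waterman (S-W, local alignment), can be sketched as follows. This is a minimal illustration and not the setup used in the study: the function names, the scoring values (match, mismatch, gap) and the linear gap penalty are assumptions chosen for brevity, and a real protein comparison would normally use a substitution matrix instead of a fixed match/mismatch score.

```python
def needleman_wunsch_score(a: str, b: str, match: int = 1,
                           mismatch: int = -1, gap: int = -1) -> int:
    """Global alignment score (hypothetical scoring values, linear gaps)."""
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    # Aligning a prefix against the empty string costs one gap per residue.
    for i in range(1, rows):
        score[i][0] = i * gap
    for j in range(1, cols):
        score[0][j] = j * gap
    for i in range(1, rows):
        for j in range(1, cols):
            diag = score[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            up = score[i - 1][j] + gap      # gap in b
            left = score[i][j - 1] + gap    # gap in a
            score[i][j] = max(diag, up, left)
    return score[-1][-1]


def smith_waterman_score(a: str, b: str, match: int = 2,
                         mismatch: int = -1, gap: int = -1) -> int:
    """Best local alignment score; cells are clamped at zero."""
    best = 0
    prev = [0] * (len(b) + 1)
    for i in range(1, len(a) + 1):
        cur = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            cur[j] = max(0, diag, prev[j] + gap, cur[j - 1] + gap)
            best = max(best, cur[j])
        prev = cur
    return best
```

Both functions fill a table with one cell per pair of sequence positions, so the running time grows with the product of the two sequence lengths. That quadratic cost is the usual reason heuristic tools are faster on long sequences or large databases.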
13.
Because probiotic effects are strain dependent, genomic explanations of these differences will contribute to understanding their mechanisms of action. The genomic sequence of the Bifidobacterium longum probiotic strain NCC2705 was determined, but little is known about the genetic diversity between strains of this species. Suppression subtractive hybridization (SSH) is a powerful method for generating a set of DNA fragments differing between two closely related bacterial strains. The purpose of this study was to identify genetic differences between genomes of B. longum strains NCC2705 and CRC-002 using PCR-based SSH. Strain CRC-002 produces exopolysaccharides whereas NCC2705 is not known for reliable exopolysaccharide production. Thirty-five and 30 different sequences were obtained from the SSH libraries of strains CRC-002 and NCC2705, respectively. Specific CRC-002 genes found were predicted to be involved in the biosynthesis of exopolysaccharides and metabolism of other carbohydrates, and these genes were not present in the genome of strain NCC2705. The identification of an endo-1,4-beta-xylanase gene in the CRC-002 SSH library is an important difference because xylanase genes have previously been proposed as a defining characteristic of the NCC2705 strain. The results demonstrate that the SSH technique was useful to highlight potential genes involved in complex sugar metabolism that differ between the two probiotic strains.
14.
15.
Kolb AF. Cloning and stem cells, 2002, 4(1): 65-80
The targeted modification of the mammalian genome has a variety of applications in research, medicine, and biotechnology. Site-specific recombinases have become significant tools in all of these areas. Conditional gene targeting using site-specific recombinases has enabled the functional analysis of genes that cannot be inactivated in the germline. The site-specific integration of adeno-associated virus, a major gene therapy vehicle, relies on the recombinase activity of the viral rep proteins. Site-specific recombinases also allow the precise integration of open reading frames encoding pharmaceutically relevant proteins into highly active gene loci in cell lines and transgenic animals. These goals have been accomplished by using a variety of genetic strategies but only a few recombinase proteins. However, the vast repertoire of recombinases, which has recently become available as a result of large-scale sequencing projects, may provide a rich source for the development of novel strategies to precisely alter mammalian genomes.
16.
A new protein structure alignment procedure is described. An initial alignment is made by comparing a one-dimensional list of primary, secondary and tertiary structural features (profiles) of two proteins, without explicitly considering the three-dimensional geometry of the structures. The alignment is then iteratively refined in the second step, in which new alignments are found by three-dimensional superposition of the structures based on the current alignment. This new procedure is fast enough to do all-against-all structural comparisons routinely. The procedure sometimes finds an alignment that suggests an evolutionary relationship and which is not normally obtained if only geometry is considered. All pair-wise comparisons were made among 3539 protein structural domains that represent all known protein structures. The resulting 3539 z-scores were used to cluster the proteins. The number of main clusters increased continuously as the z-cutoff was raised, but the number of multiple-member clusters showed a maximum at z-cutoff values of 5.0 and 5.5. When a z-cutoff value of 5.0 was used, the total number of main clusters was 2043, of which only 336 clusters had more than one member.
17.
Mäkinen V. Biomolecular engineering, 2007, 24(3): 337-342
A peak is a pair of real values (x,y), where x is the time at which a peak of height y is registered. In the peak alignment problem, we are given two sequences of peaks, and our task is to align the sequences allowing some basic edit operations on the peaks. We study an instance of the peak alignment problem that arises in the analysis of Mass Spectrometry data in Systems Biology. There the measurement technique guarantees that two peaks (x,y), (x',y') can only be considered the same if x is close enough to x', and y is close enough to y'. We review some methods to do alignment under such restrictions on matches.
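The match restriction described in this abstract can be illustrated with a small dynamic program. The sketch below is an assumption-laden example and not one of the methods reviewed in the paper: it only counts the largest number of peaks that can be matched in order, it treats every unmatched peak as a free insertion or deletion, and the names max_matched_peaks, x_tol and y_tol are hypothetical.

```python
from typing import List, Tuple

Peak = Tuple[float, float]  # (x = time, y = height)


def max_matched_peaks(p: List[Peak], q: List[Peak],
                      x_tol: float, y_tol: float) -> int:
    """Largest number of order-preserving matches between two peak lists.

    Two peaks may be matched only if both coordinates agree within the
    given tolerances (x_tol and y_tol are assumed, illustrative parameters).
    """
    def compatible(a: Peak, b: Peak) -> bool:
        return abs(a[0] - b[0]) <= x_tol and abs(a[1] - b[1]) <= y_tol

    n, m = len(p), len(q)
    # best[i][j] = most matches using the first i peaks of p and first j of q
    best = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            # Skip a peak in p or in q (an insertion or deletion).
            best[i][j] = max(best[i - 1][j], best[i][j - 1])
            # Match the two current peaks if the restriction allows it.
            if compatible(p[i - 1], q[j - 1]):
                best[i][j] = max(best[i][j], best[i - 1][j - 1] + 1)
    return best[n][m]


run_a = [(1.0, 10.0), (2.0, 5.0), (3.5, 8.0)]
run_b = [(1.1, 9.5), (3.4, 8.2)]
matched = max_matched_peaks(run_a, run_b, x_tol=0.2, y_tol=1.0)
```

With the example tolerances, the first and third peaks of run_a each have a compatible partner in run_b, while the peak at x = 2.0 stays unmatched. Tightening x_tol below the 0.1 time offset removes both matches.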
18.
Homology modeling is the most commonly used technique to build a three-dimensional model for a protein sequence. It heavily relies on the quality of the sequence alignment between the protein to model and related proteins with a known three-dimensional structure. Alignment quality can be assessed according to the physico-chemical properties of the three-dimensional models it produces. In this work, we introduce fifteen predictors designed to evaluate the properties of the models obtained for various alignments. They consist of an energy value obtained from different force fields (CHARMM, ProsaII or ANOLEA) computed on residues selected around misaligned regions. These predictors were evaluated on ten challenging test cases. For each target, all possible ungapped alignments are generated and their corresponding models are computed and evaluated. The best predictor, retrieving the structural alignment for 9 out of 10 test cases, is based on the ANOLEA atomistic mean force potential and takes into account residues around misaligned secondary structure elements. The performance of the other predictors is significantly lower. This work shows that substantial improvement in local alignments can be obtained by careful assessment of the local structure of the resulting models.
19.
Vorolign, a fast and flexible structural alignment method for two or more protein structures, is introduced. The method aligns protein structures using double dynamic programming and measures the similarity of two residues based on the evolutionary conservation of their corresponding Voronoi-contacts in the protein structure. This similarity function allows aligning protein structures even in cases where structural flexibilities exist. Multiple structural alignments are generated from a set of pairwise alignments using a consistency-based, progressive multiple alignment strategy. RESULTS: The performance of Vorolign is evaluated for different applications of protein structure comparison, including automatic family detection as well as pairwise and multiple structure alignment. Vorolign accurately detects the correct family, superfamily or fold of a protein with respect to the SCOP classification on a set of difficult target structures. A scan against a database of >4000 proteins takes on average 1 min per target. The performance of Vorolign in calculating pairwise and multiple alignments is found to be comparable with other pairwise and multiple protein structure alignment methods. AVAILABILITY: Vorolign is freely available for academic users as a web server at http://www.bio.ifi.lmu.de/Vorolign
20.
Prescott DM. Nature reviews. Genetics, 2000, 1(3): 191-198
In some ciliates, the DNA sequences of the germline genomes have been profoundly modified during evolution, providing unprecedented examples of germline DNA malleability. Although the significance of the modifications and malleability is unclear, they may reflect the evolution of mechanisms that facilitate evolution. Because of the modifications, these ciliates must perform remarkable feats of cutting, splicing, rearrangement and elimination of DNA sequences to convert the chromosomal DNA in the germline genome (micronuclear genome) into gene-sized DNA molecules in the somatic genome (macronuclear genome). How these manipulations of DNA are guided and carried out is largely unknown. However, the organization and manipulation of ciliate DNA sequences are new phenomena that expand a general appreciation for the flexibility of DNA in evolution and development.