Similar literature
Found 20 similar documents (search time: 15 ms)
1.
An implementation of Profilesearch (a technique to search for relationships between a protein sequence and multiply aligned sequences) for a parallel computer is described. The number-crunching machine, consisting of 21 T800 transputers, is connected to a Macintosh IIcx host computer. The program utilizes a standard Macintosh application as its user interface, resulting in a transparent and user-friendly environment for addressing the parallel computer. The program is independent of the number of available processors and exceeds the speed of a VAXstation 3200 with only one transputer in operation, thus allowing cheap and fast database searches with a PC front end. For a larger number of processors, the speed increase is approximately linear with no obvious symptoms of saturation with the available maximum of 21 transputers. The program and environment are useful to search quickly and easily for similarities between a single sequence or sequence set and individual sequences contained in a large database. The alignment is determined by typical dynamic programming techniques.
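The abstract above mentions "typical dynamic programming techniques" without giving details. As a rough illustration only, the following Python sketch computes a Needleman-Wunsch global alignment score for two plain sequences. The function name and the match, mismatch, and gap values are assumptions chosen for the example; they are not taken from Profilesearch, which aligns a sequence against a profile rather than against a second sequence.

```python
def global_alignment_score(a, b, match=1, mismatch=-1, gap=-2):
    """Needleman-Wunsch global alignment score with a linear gap penalty.

    The scoring values are illustrative assumptions, not Profilesearch's.
    """
    # prev[j] holds the best score for aligning the processed prefix of a with b[:j].
    prev = [j * gap for j in range(len(b) + 1)]
    for i, ca in enumerate(a, start=1):
        curr = [i * gap]  # aligning a[:i] against an empty prefix of b
        for j, cb in enumerate(b, start=1):
            diag = prev[j - 1] + (match if ca == cb else mismatch)
            curr.append(max(diag, prev[j] + gap, curr[j - 1] + gap))
        prev = curr
    return prev[-1]
```

Only two rows of the dynamic programming table are kept, which is enough when the score alone is needed; recovering the alignment itself would require the full table or a traceback structure.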

2.
A software package that allows one to carry out multiple alignment of protein and nucleic acid sequences of almost unlimited length and number of sequences is developed on the C-DAC parallel computer, a transputer-based machine. The farming approach is used for data parallelization. The speed gains are almost linear when the number of transputers is increased from 4 to 64. The software is used to carry out multiple alignment of 100 sequences each of the α-chain and β-chain of hemoglobin and 83 cytochrome c sequences. The signature sequence of cytochrome c was found to be PGTKMXF. The single parameter, multiple alignment score, S, has been used to categorize proteins in different subfamilies and groups.

3.
Most bioinformatics analyses require the assembly of a multiple sequence alignment. It has long been suspected that structural information can help to improve the quality of these alignments, yet the effect of combining sequences and structures has not been evaluated systematically. We developed 3DCoffee, a novel method for combining protein sequences and structures in order to generate high-quality multiple sequence alignments. 3DCoffee is based on TCoffee version 2.00, and uses a mixture of pairwise sequence alignments and pairwise structure comparison methods to generate multiple sequence alignments. We benchmarked 3DCoffee using a subset of HOMSTRAD, the collection of reference structural alignments. We found that combining TCoffee with the threading program Fugue makes it possible to improve the accuracy of our HOMSTRAD dataset by four percentage points when using one structure only per dataset. Using two structures yields an improvement of ten percentage points. The measures carried out on HOM39, a HOMSTRAD subset composed of distantly related sequences, show a linear correlation between multiple sequence alignment accuracy and the ratio of the number of provided structures to the total number of sequences. Our results suggest that in the case of distantly related sequences, a single structure may not be enough for computing an accurate multiple sequence alignment.

4.
The profile method, for detecting distantly related proteins by sequence comparison, has been extended to incorporate secondary structure information from known X-ray structures. The sequence of a known structure is aligned to sequences of other members of a given folding class. From the known structure, the secondary structure (alpha-helix, beta-strand or "other") is assigned to each position of the aligned sequences. As in the standard profile method, a position-dependent scoring table, termed a profile, is calculated from the aligned sequences. However, rather than using the standard Dayhoff mutation table in calculating the profile, we use distinct amino acid mutation tables for residues in alpha-helices, beta-strands or other secondary structures to calculate the profile. In addition, we also distinguish between internal and external residues. With this new secondary structure-based profile method, we created a profile for eight-stranded, antiparallel beta barrels of the insecticyanin folding class. It is based on the sequences of retinol-binding protein, insecticyanin and beta-lactoglobulin. Scanning the sequence database with this profile, it was possible to detect the sequence of avidin. The structure of streptavidin is known, and it appears to be distantly related to the antiparallel beta barrels. Also detected is the sequence of complement component C8, which we therefore predict to be a member of this folding class.
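The idea of a position-dependent scoring table built with a different mutation table for each secondary-structure state can be sketched as follows. This is a minimal illustration under stated assumptions: the function name `build_profile`, the averaging rule, the state labels "H", "E", and "O", and the toy scoring functions are all hypothetical and stand in for the real mutation tables, which the abstract does not reproduce. The internal/external residue distinction is not modelled.

```python
def build_profile(alignment, states, tables, alphabet):
    """Build a position-dependent scoring table from aligned sequences.

    alignment: equal-length aligned sequences, with '-' for gaps.
    states:    one secondary-structure label per alignment column.
    tables:    maps each label to a scoring function f(residue_a, residue_b).
    Each profile row maps a residue to its mean score against the column.
    """
    profile = []
    for column, state in zip(zip(*alignment), states):
        score = tables[state]  # table selected by this column's structure state
        residues = [r for r in column if r != "-"]
        row = {}
        for aa in alphabet:
            if residues:
                row[aa] = sum(score(aa, r) for r in residues) / len(residues)
            else:
                row[aa] = 0.0  # all-gap column carries no information
        profile.append(row)
    return profile


# Toy scoring functions (assumptions): helix columns penalise mismatches,
# strand columns only reward matches.
helix_table = lambda x, y: 1.0 if x == y else -1.0
strand_table = lambda x, y: 2.0 if x == y else 0.0

example = build_profile(
    ["AC", "AG"], "HE", {"H": helix_table, "E": strand_table}, "ACG"
)
```

In this toy case the first column is scored with the helix table and the second with the strand table, so the same residue can receive different scores depending on the structural context of the column.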

5.
Calculation of dot-matrices is a widespread tool in the search for sequence similarities. When sequences are distant, even this approach may fail to point out common regions. If several plots calculated for all members of a sequence set consistently displayed a similarity between them, this would increase its credibility. We present an algorithm to delineate dot-plot agreement. A novel procedure based on matrix multiplication is developed to identify common patterns and reliably aligned regions in a set of distantly related sequences. The algorithm finds motifs independent of input sequence lengths and reduces the dependence on gap penalties. When sequences share greater similarity, the same approach converts to a multiple sequence alignment procedure.
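One plausible reading of "dot-plot agreement by matrix multiplication" is sketched below. It is an assumption about the general idea, not the authors' procedure: multiplying the dot matrix of sequences A and B by the dot matrix of B and C gives, for each pair of positions in A and C, the number of positions in B that match both. The helper names and the identity-based dot rule are hypothetical.

```python
def dot_matrix(a, b):
    """Binary dot matrix: entry [i][j] is 1 when a[i] equals b[j]."""
    return [[1 if x == y else 0 for y in b] for x in a]


def mat_mul(p, q):
    """Plain matrix product of two lists of rows."""
    cols = list(zip(*q))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols] for row in p]


# Entry [i][k] counts positions j in B supporting a match between A[i] and C[k].
support = mat_mul(dot_matrix("AC", "CA"), dot_matrix("CA", "AA"))
```

A match between A and C that is supported by an intermediate sequence gets a nonzero entry, which is the sense in which consistent dot plots reinforce one another.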

6.
Position-specific substitution matrices, known as profiles, derived from multiple sequence alignments are currently used to search sequence databases for distantly related members of protein families. The performance of the database searches is enhanced by using (i) a sequence weighting scheme which assigns higher weights to more distantly related sequences based on branch lengths derived from phylogenetic trees, (ii) exclusion of positions with mainly padding characters at sites of insertions or deletions and (iii) the BLOSUM62 residue comparison matrix. A natural consequence of these modifications is an improvement in the alignment of new sequences to the profiles. However, the accuracy of the alignments can be further increased by employing a similarity residue comparison matrix. These developments are implemented in a program called PROFILEWEIGHT which runs on Unix and Vax computers. The only input required by the program is the multiple sequence alignment. The output from PROFILEWEIGHT is a profile designed to be used by existing searching and alignment programs. Test results from database searches with four different families of proteins show the improved sensitivity of the weighted profiles.

7.
A method for comparison of protein sequences based on their primary and secondary structure is described. Protein sequences are annotated with predicted secondary structures (using a modified Chou and Fasman method). Two-lettered code sequences are generated (Xx, where X is the amino acid and x is its annotated secondary structure). Sequences are compared with a dynamic programming method (STRALIGN) that includes a similarity matrix for both the amino acids and secondary structures. The similarity value for each paired two-lettered code is a linear combination of similarity values for the paired amino acids and their annotated secondary structures. The method has been applied to eight globin proteins (28 pairs) for which the X-ray structure is known. For protein pairs with high primary sequence similarity (greater than 45%), STRALIGN alignment is identical to that obtained by a dynamic programming method using only primary sequence information. However, alignment of protein pairs with lower primary sequence similarity improves significantly with the addition of secondary structure annotation. Alignment of the pair with the least primary sequence similarity of 16% was improved from 0 to 37% 'correct' alignment using this method. In addition, STRALIGN was successfully applied to seven pairs of distantly related cytochrome c proteins, and three pairs of distantly related picornavirus proteins.
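The linear combination of amino acid and secondary-structure similarity can be illustrated with a small sketch. The function name `pair_score`, the single `weight` parameter, and the toy identity similarities are assumptions made for the example; the abstract does not state the coefficients or matrices that STRALIGN uses.

```python
def pair_score(code1, code2, aa_sim, ss_sim, weight=0.5):
    """Similarity of two two-letter codes such as 'Ah' (residue A, helix).

    weight is a hypothetical mixing coefficient: 0 uses only the amino acid
    similarity, 1 uses only the secondary-structure similarity.
    """
    aa1, ss1 = code1
    aa2, ss2 = code2
    return (1 - weight) * aa_sim(aa1, aa2) + weight * ss_sim(ss1, ss2)


# Toy similarity functions (assumptions): 1.0 for identical symbols, else 0.0.
identity = lambda x, y: 1.0 if x == y else 0.0

same = pair_score("Ah", "Ah", identity, identity)
same_residue_other_structure = pair_score("Ah", "Ab", identity, identity)
```

A score of this form can be dropped into any dynamic programming alignment in place of a plain residue substitution score.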

8.
Sequence alignment by cross-correlation. (Total citations: 1; self-citations: 0; citations by others: 1)
Many recent advances in biology and medicine have resulted from DNA sequence alignment algorithms and technology. Traditional approaches for the matching of DNA sequences are based either on global alignment schemes or heuristic schemes that seek to approximate global alignment algorithms while providing higher computational efficiency. This report describes an approach using the mathematical operation of cross-correlation to compare sequences. It can be implemented using the fast Fourier transform for computational efficiency. The algorithm is summarized and sample applications are given. These include gene sequence alignment in long stretches of genomic DNA, finding sequence similarity in distantly related organisms, demonstrating sequence similarity in the presence of massive (approximately 90%) random point mutations, comparing sequences related by internal rearrangements (tandem repeats) within a gene, and investigating fusion proteins. Application to RNA and protein sequence alignment is also discussed. The method is efficient, sensitive, and robust, being able to find sequence similarities where other alignment algorithms may perform poorly.
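To make the cross-correlation idea concrete, the sketch below counts matching bases at every relative shift of two sequences. This count equals the sum of the cross-correlations of the per-base indicator sequences. It is computed here directly, in quadratic time, for clarity; the abstract's fast Fourier transform route would produce the same values more efficiently. The function names are hypothetical.

```python
def match_counts_by_shift(a, b):
    """Return {shift: number of positions i where a[i] == b[i + shift]}."""
    counts = {}
    for shift in range(-(len(a) - 1), len(b)):
        counts[shift] = sum(
            1
            for i, base in enumerate(a)
            if 0 <= i + shift < len(b) and base == b[i + shift]
        )
    return counts


def best_shift(a, b):
    """Shift with the most matching positions (first one found on ties)."""
    counts = match_counts_by_shift(a, b)
    return max(counts, key=counts.get)
```

A sharp peak at one shift suggests an ungapped region of similarity at that offset; point mutations lower the peak without moving it, which is consistent with the robustness the abstract describes.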

9.
Small subunit rRNA sequences have been determined for 10 of the most clinically important pathogenic species of the yeast genus Candida (including Torulopsis [Candida] glabrata and Yarrowia [Candida] lipolytica) and for Hansenula polymorpha. Phylogenetic analyses of these sequences and those of Saccharomyces cerevisiae, Kluyveromyces marxianus var. lactis, and Aspergillus fumigatus indicate that Candida albicans, C. tropicalis, C. parapsilosis, and C. viswanathii form a subgroup within the genus. The remaining significant pathogen, T. glabrata, falls into a second, distinct subgroup and is specifically related to S. cerevisiae and more distantly related to C. kefyr (pseudotropicalis) and K. marxianus var. lactis. The 18S rRNA sequence of Y. lipolytica has evolved rapidly in relation to the other Candida sequences examined and appears to be only distantly related to them. As anticipated, species of several other genera appear to bear specific relationships to members of the genus Candida.

10.
We have developed a new primer design strategy for PCR amplification of distantly related gene sequences based on consensus-degenerate hybrid oligonucleotide primers (CODEHOPs). An interactive program has been written to design CODEHOP PCR primers from conserved blocks of amino acids within multiply-aligned protein sequences. Each CODEHOP consists of a pool of related primers containing all possible nucleotide sequences encoding 3-4 highly conserved amino acids within a 3' degenerate core. A longer 5' non-degenerate clamp region contains the most probable nucleotide predicted for each flanking codon. CODEHOPs are used in PCR amplification to isolate distantly related sequences encoding the conserved amino acid sequence. The primer design software and the CODEHOP PCR strategy have been utilized for the identification and characterization of new gene orthologs and paralogs in different plant, animal and bacterial species. In addition, this approach has been successful in identifying new pathogen species. The CODEHOP designer (http://blocks.fhcrc.org/codehop.html) is linked to BlockMaker and the Multiple Alignment Processor within the Blocks Database World Wide Web (http://blocks.fhcrc.org).
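The "all possible nucleotide sequences encoding 3-4 highly conserved amino acids" can be enumerated with the standard genetic code, as in the sketch below. This illustrates only the size of a fully degenerate core; it is not the CODEHOP program, and the function name `degenerate_core` is hypothetical. The clamp region and any trimming of the core are not modelled.

```python
from itertools import product

# Standard genetic code, codons ordered with bases T, C, A, G ('*' marks stops).
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"

CODONS = {}
for (b1, b2, b3), aa in zip(product(BASES, repeat=3), AMINO):
    CODONS.setdefault(aa, []).append(b1 + b2 + b3)


def degenerate_core(peptide):
    """All DNA sequences (coding strand) that encode the given peptide."""
    return ["".join(codons) for codons in product(*(CODONS[aa] for aa in peptide))]
```

For example, the four residues PGTK, taken from the cytochrome c signature quoted in an earlier abstract, expand to 4 x 4 x 4 x 2 = 128 sequences, which shows why the degenerate part of a primer is kept short.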

11.
From protein sequence space to elementary protein modules (Total citations: 2; self-citations: 0; citations by others: 2)
Frenkel ZM, Trifonov EN. Gene, 2008, 408(1-2): 64-71
The formatted protein sequence space is built from identical size fragments of prokaryotic proteins (112 complete proteomes). Connecting sequence-wise similar fragments (points in the space) results in the formation of numerous networks that sometimes combine different types of proteins which nonetheless share fragments with similar or distantly related sequences. The networks are mapped on individual protein sequences, revealing distinct regions (modules) associated with prominent networks with well-defined functional identities. Presence of multiple sites of sequence conservation (modules) in a given protein sequence suggests that the annotated protein function may be decomposed into "elementary" subfunctions of the respective modules. The modules correspond to previously discovered conserved closed loop structures and their sequence prototypes.

12.
MOTIVATION: Detecting genes in viral genomes is a complex task. Due to the biological necessity of them being constrained in length, RNA viruses in particular tend to code in overlapping reading frames. Since one amino acid is encoded by a triplet of nucleic acids, up to three genes may be coded for simultaneously in one direction. Conventional hidden Markov model (HMM)-based gene-finding algorithms may typically find it difficult to identify multiple coding regions, since in general their topologies do not allow for the presence of overlapping or nested genes. Comparative methods have therefore been restricted to likelihood ratio tests on potential regions as to being double or single coding, using the fact that the constrictions forced upon multiple-coding nucleotides will result in atypical sequence evolution. Exploiting these same constraints, we present an HMM-based gene-finding program, which allows for coding in unidirectional nested and overlapping reading frames, to annotate two homologous aligned viral genomes. Our method does not insist on conserved gene structure between the two sequences, thus making it applicable for the pairwise comparison of more distantly related sequences. RESULTS: We apply our method to 15 pairwise alignments of six different HIV2 genomes. Given sufficient evolutionary distance between the two sequences, we achieve sensitivity of approximately 84-89% and specificity of approximately 97-99.9%. We additionally annotate three pairwise alignments of the more distantly related HIV1 and HIV2, as well as of two different hepatitis viruses, attaining results of approximately 87% sensitivity and approximately 98.5% specificity. We subsequently incorporate prior knowledge by 'knowing' the gene structure of one sequence and annotating the other conditional on it. Boosting accuracy close to perfect, we demonstrate that conservation of gene structure on top of nucleotide sequence is a valuable source of information, especially in distantly related genomes. AVAILABILITY: The Java code is available from the authors.
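The statement that up to three genes may be coded simultaneously in one direction follows from the three possible codon phases. The short sketch below, with a hypothetical helper name and a made-up example sequence, splits one strand into codons for each of the three forward frames. It illustrates the reading-frame arithmetic only and has nothing to do with the HMM itself.

```python
def codons_in_frame(seq, frame):
    """Complete codons of seq read in forward frame 0, 1, or 2."""
    return [seq[i:i + 3] for i in range(frame, len(seq) - 2, 3)]


# Made-up example sequence; each frame partitions it into different triplets.
sequence = "ATGGCCTAA"
frames = {frame: codons_in_frame(sequence, frame) for frame in range(3)}
```

Every interior nucleotide belongs to one codon in each frame, so a single point mutation can alter up to three overlapping coding sequences at once, which is the source of the atypical sequence evolution mentioned in the abstract.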

13.
MELDB: a database for microbial esterases and lipases (Total citations: 1; self-citations: 0; citations by others: 1)
Kang HY, Kim JF, Kim MH, Park SH, Oh TK, Hur CG. FEBS Letters, 2006, 580(11): 2736-2740
MELDB is a comprehensive protein database of microbial esterases and lipases, which are hydrolytic enzymes important in modern industry. Proteins in MELDB are clustered into groups according to their sequence similarities based on a local pairwise alignment algorithm and a graph clustering algorithm (TribeMCL). This differs from traditional approaches that use global pairwise alignment and joining methods. Our procedure was able to reduce the noise caused by dubious alignment in the distantly related or unrelated regions in the sequences. In the database, 883 esterase and lipase sequences derived from microbial sources are deposited and conserved parts of each protein are identified. HMM profiles of each cluster were generated to classify unknown sequences. Contents of the database can be keyword-searched and query sequences can be aligned to sequence profiles and sequences themselves.

14.
MOTIVATION: Identification of short conserved sequence motifs common to a protein family or superfamily can be more useful than overall sequence similarity in suggesting the function of novel gene products. Locating motifs still requires expert knowledge, as automated methods using stringent criteria may not differentiate subtle similarities from statistical noise. RESULTS: We have developed a novel automatic method, based on patterns of conservation of 237 physical-chemical properties of amino acids in aligned protein sequences, to find related motifs in proteins with little or no overall sequence similarity. As an application, our web-server MASIA identified 12 property-based motifs in the apurinic/apyrimidinic endonuclease (APE) family of DNA-repair enzymes of the DNase-I superfamily. Searching with these motifs located distantly related representatives of the DNase-I superfamily, such as Inositol 5'-polyphosphate phosphatases in the ASTRAL40 database, using a Bayesian scoring function. Other proteins containing APE motifs had no overall sequence or structural similarity. However, all were phosphatases and/or had a metal ion binding active site. Thus our automated method can identify discrete elements in distantly related proteins that define local structure and aspects of function. We anticipate that our method will complement existing ones to functionally annotate novel protein sequences from genomic projects. AVAILABILITY: MASIA WEB site: http://www.scsb.utmb.edu/masia/masia.html SUPPLEMENTARY INFORMATION: The dendrogram of 42 APE sequences used to derive motifs is available on http://www.scsb.utmb.edu/comp_biol.html/DNA_repair/publication.html

15.
Retrotransposable elements (REs) and related sequences form a large proportion of conifer genomes. During genome evolution, some RE sequences are degraded or eliminated, but some are evolutionarily stable, and can be identified even in distantly related species. Use of genome sequence information from loblolly pine (Pinus taeda) enables investigation of divergent non-coding RE sequences in other pine and conifer species, including Scots pine (Pinus sylvestris). Non-specific inter-retrotransposon amplified polymorphism technique (IRAP) as well as the amplification polymorphism of 12 RE families were investigated in 80 gymnosperm species. The obtained results were compared with phylogenetic relationships among gymnosperms. Investigation of distantly related gymnosperm species reveals persistent RE sequences, such as IFG and Pineywoods, distributed among a wide range of plant lineages. RE sequence divergence was observed, reflecting periods of inactivity and degradation during speciation of pine lineages, as demonstrated by the delineation of the main pine subgenera. Intraspecific variation of 10 RE copy numbers (CN) between Scots pine individuals ranged from 8.9 to 26.6% of the overall mean estimates. CN analyses were performed in 16 additional gymnosperm species. The analysed pine species contained a similar complement of RE families; however, CN and genome occupation proportions differ. A decrease in RE CN estimates can reflect sequence divergence, associated with independent transposition events. Transposition of some REs can be induced by stress conditions; therefore, even distantly related species inhabiting extreme environments could have similar patterns or distribution of these elements.

16.
MOTIVATION: Searches of biological sequence databases are usually focussed on distinguishing significant from random matches. However, the increasing abundance of related sequences on databases presents a second challenge: to distinguish the evolutionarily most closely related sequences (often orthologues) from more distantly related homologues. This is particularly important when searching a database of partial sequences, where short orthologous sequences from a non-conserved region will score much more poorly than non-orthologous (outgroup) sequences from a conserved region. RESULTS: Such inferences are shown to be improved by conditioning the search results on the scores of an outgroup sequence. The log-odds score for each target sequence identified on the database has the log-odds score of the outgroup sequence subtracted from it. A test group of Caenorhabditis elegans kinase sequences and their identified C. elegans outgroups were searched against a test database of human Expressed Sequence Tag (EST) sequences, where the sets of true target sequences were known in advance. The outgroup conditioned method was shown to identify 58% more true positives ahead of the first false positive, compared to the straightforward search without an outgroup. A test dataset of 151 proteins drawn from the C. elegans genome, where the putative 'outgroup' was assigned automatically, similarly found 50% more true positives using outgroup conditioning. Thus, outgroup conditioning provides a means to improve the results of database searching with little increase in the search computation time.
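The subtraction step described above can be sketched in a few lines. The function name, the dictionary layout, the default of 0.0 for targets the outgroup search did not report, and the example scores are all assumptions made for illustration; the abstract specifies only that the outgroup's log-odds score is subtracted from the query's score for each target.

```python
def outgroup_conditioned_ranking(query_scores, outgroup_scores):
    """Rank targets by query log-odds minus outgroup log-odds, best first.

    Both arguments map target identifiers to log-odds scores. Targets missing
    from outgroup_scores are treated as scoring 0.0 (an assumption).
    """
    adjusted = {
        target: score - outgroup_scores.get(target, 0.0)
        for target, score in query_scores.items()
    }
    return sorted(adjusted, key=adjusted.get, reverse=True)


# Made-up scores: est2 hits a conserved region, so the outgroup also scores it highly.
ranking = outgroup_conditioned_ranking(
    {"est1": 30.0, "est2": 50.0},
    {"est1": 5.0, "est2": 45.0},
)
```

In the made-up example, est2 has the higher raw score, but most of it is shared with the outgroup, so est1 is ranked first after conditioning. This is the situation the abstract describes for short orthologous fragments from non-conserved regions.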

17.
Noncoding DNA in eukaryotes encodes functionally important signals for the regulation of chromosome assembly, DNA replication, and gene expression. The increasing availability of whole-genome sequences of related taxa has led to interest in the evolution of these signals, and the phylogenetic footprints they produce. Cis-regulatory sequences controlling gene expression are often conserved among related species, but are rarely conserved between distantly related taxa. Several experimentally characterized regulatory elements have failed to show sequence similarity even between closely related species.

18.
We report the sequence of a 7800 base pair region of herpes simplex virus type 1 DNA, representing approximately 0.16 to 0.20 map units in the genome. This contains sequences transcribed into a leftward oriented set of five 3' coterminal mRNAs, together with two rightward transcribed flanking genes. One of the leftward genes encodes the virus's alkaline exonuclease, but the other gene products are uncharacterized. The amino acid sequence of one encoded protein suggested that it is a membrane embedded species. The DNA sequence is densely utilised, with two predicted out-of-frame overlaps of coding sequences, and probably six occurrences of promoter elements within coding sequences. Homologues of five of the genes were found for the distantly related Epstein-Barr virus, with a similar overall relative arrangement.

19.

Background  

One of the most powerful methods for the prediction of protein structure from sequence information alone is the iterative construction of profile-type models. Because profiles are built from sequence alignments, the sequences included in the alignment and the method used to align them will be important to the sensitivity of the resulting profile. The inclusion of highly diverse sequences will presumably produce a more powerful profile, but distantly related sequences can be difficult to align accurately using only sequence information. Therefore, it would be expected that the use of protein structure alignments to improve the selection and alignment of diverse sequence homologs might yield improved profiles. However, the actual utility of such an approach has remained unclear.

20.
SUMMARY: NdPASA is a web server specifically designed to optimize sequence alignment between distantly related proteins. The program integrates structure information of the template sequence into a global alignment algorithm by employing neighbor-dependent propensities of amino acids as a unique parameter for alignment. NdPASA optimizes alignment by evaluating the likelihood of a residue pair in the query sequence matching against a corresponding residue pair adopting a particular secondary structure in the template sequence. NdPASA is most effective in aligning homologous proteins sharing a low percentage of sequence identity. The server is designed to aid homologous protein structure modeling. A PSI-BLAST search engine was implemented to help users identify template candidates that are most appropriate for modeling the query sequences.
