首页 | 本学科首页   官方微博 | 高级检索  
相似文献
 共查询到20条相似文献,搜索用时 15 毫秒
1.
M Pontes  X Xu  D Graham  M Riley  R F Doolittle 《Biochemistry》1987,26(6):1611-1617
The messages for two small but abundant apolipoproteins found in lamprey blood plasma were cloned with the aid of oligonucleotide probes based on amino-terminal sequences. In both cases, numerous clones were identified in a lamprey liver cDNA library, consistent with the great abundance of these proteins in lamprey blood. One of the cDNAs (LAL1) has a coding region of 105 amino acids that corresponds to a 21-residue signal peptide, a putative 8-residue propeptide, and the 76-residue mature protein found in blood. The other cDNA (LAL2) codes for a total of 191 residues, the first 23 of which constitute a signal peptide. The two proteins, which occur in the "high-density lipoprotein fraction" of ultracentrifuged plasma, have amino acid compositions similar to those of apolipoproteins found in mammalian blood; computer analysis indicates that the sequences are largely helix-permissive. When the sequences were searched against an amino acid sequence data base, rat apolipoprotein IV was the best matching candidate in both cases. Although a reasonable alignment can be made with that sequence and LAL1, definitive assignment of the two lamprey proteins to typical mammalian classes cannot be made at this point.  相似文献   

2.
The increasing quantity and complexity of sequences and structural data for proteins and nucleic acids create both problems and opportunities for biomedical researchers. Fortunately, a new generation of practical computer tools for data analysis and integrated information retrieval is emerging. Recent developments in fast database searching, multiple sequence alignment, and molecular modeling are discussed and windows-based, mouse-driven software for CD-ROM and network information retrieval are described. Each method is illustrated with a practical example pertinent to lipid research. In particular, the connection among cholesteryl ester transfer protein, bactericidal permeability-increasing protein, and lipopolysaccharide-binding proteins is determined; novel repetitive sequence motifs in mammalian farnesyltransferase subunits and related yeast prenyltransferases are derived; biochemical insights from a three-dimensional model of human apolipoprotein D based on two insect lipocalins are discussed; the relationship between apolipoprotein D and gross cystic disease fluid protein from human breast is reviewed; and prospects for modeling apolipoprotein E-related proteins are described. In addition, information on a number of general and special-purpose sequence, motif, and structural databases is included.  相似文献   

3.
The Shannon information entropy of protein sequences.   总被引:6,自引:1,他引:5       下载免费PDF全文
A comprehensive data base is analyzed to determine the Shannon information content of a protein sequence. This information entropy is estimated by three methods: a k-tuplet analysis, a generalized Zipf analysis, and a "Chou-Fasman gambler." The k-tuplet analysis is a "letter" analysis, based on conditional sequence probabilities. The generalized Zipf analysis demonstrates the statistical linguistic qualities of protein sequences and uses the "word" frequency to determine the Shannon entropy. The Zipf analysis and k-tuplet analysis give Shannon entropies of approximately 2.5 bits/amino acid. This entropy is much smaller than the value of 4.18 bits/amino acid obtained from the nonuniform composition of amino acids in proteins. The "Chou-Fasman" gambler is an algorithm based on the Chou-Fasman rules for protein structure. It uses both sequence and secondary structure information to guess at the number of possible amino acids that could appropriately substitute into a sequence. As in the case for the English language, the gambler algorithm gives significantly lower entropies than the k-tuplet analysis. Using these entropies, the number of most probable protein sequences can be calculated. The number of most probable protein sequences is much less than the number of possible sequences but is still much larger than the number of sequences thought to have existed throughout evolution. Implications of these results for mutagenesis experiments are discussed.  相似文献   

4.
Critical comparison of consensus methods for molecular sequences.   总被引:6,自引:0,他引:6       下载免费PDF全文
Consensus methods are recognized as valuable tools for data analysis, especially when some sort of data aggregation is desired. Although consensus methods for sequences play a vital role in molecular biology, researchers pay little heed to the features and limitations of such methods, and so there are risks that criteria for constructing consensus sequences will be misused or misunderstood. To understand better the issues involved, we conducted a critical comparison of nine consensus methods for sequences, of which eight were used in papers appearing in this journal. We report the results of that comparison, and we make recommendations which we hope will assist researchers when they must select particular consensus methods for particular applications.  相似文献   

5.
We use methods from Data Mining and Knowledge Discovery to design an algorithm for detecting motifs in protein sequences. The algorithm assumes that a motif is constituted by the presence of a "good" combination of residues in appropriate locations of the motif. The algorithm attempts to compile such good combinations into a "pattern dictionary" by processing an aligned training set of protein sequences. The dictionary is subsequently used to detect motifs in new protein sequences. Statistical significance of the detection results are ensured by statistically determining the various parameters of the algorithm. Based on this approach, we have implemented a program called GYM. The Helix-Turn-Helix motif was used as a model system on which to test our program. The program was also extended to detect Homeodomain motifs. The detection results for the two motifs compare favorably with existing programs. In addition, the GYM program provides a lot of useful information about a given protein sequence.  相似文献   

6.
Frequency-domain analysis of biomolecular sequences   总被引:7,自引:0,他引:7  
MOTIVATION: Frequency-domain analysis of biomolecular sequences is hindered by their representation as strings of characters. If numerical values are assigned to each of these characters, then the resulting numerical sequences are readily amenable to digital signal processing. RESULTS: We introduce new computational and visual tools for biomolecular sequences analysis. In particular, we provide an optimization procedure improving upon traditional Fourier analysis performance in distinguishing coding from noncoding regions in DNA sequences. We also show that the phase of a properly defined Fourier transform is a powerful predictor of the reading frame of protein coding regions. Resulting color maps help in visually identifying not only the existence of protein coding areas for both DNA strands, but also the coding direction and the reading frame for each of the exons. Furthermore, we demonstrate that color spectrograms can visually provide, in the form of local 'texture', significant information about biomolecular sequences, thus facilitating understanding of local nature, structure and function.  相似文献   

7.
Two approaches to the understanding of biological sequences are confronted. While the recognition of particular signals in sequences relies on complex physical interactions, the problem is often analysed in terms of the presence or absence of literal motifs (strings) in the sequence. We present here a test-case for evaluating the potential of this approach. We classify DNA sequences as positive or negative depending on whether they contain a single melted domain in the middle of the sequence, which is a global physical property. Two sets of positive "biological" sequences were generated by a computer simulation of evolutionary divergence along the branches of a phylogenetic tree, under the constraint that each intermediate sequence be positive. These two sets and a set of random positive sequences were subjected to pattern analysis. The observed local patterns were used to construct expert systems to discriminate positive from negative sequences. The experts achieved 79% to 90% success on random positive sequences and up to 99% on the biological sets, while making less than 2% errors on negative sequences. Thus, the global constraints imposed on sequences by a physical process may generate local patterns that are sufficient to predict, with a reasonable probability, the behaviour of the sequences. However, rather large sets of biological sequences are required to generate patterns free of illegitimate constraints. Furthermore, depending upon the initial sequence, the sets of sequences generated on a phylogenetic tree may be amenable or refractory to string analysis, while obeying identical physical constraints. Our study clarifies the relationship between experts' errors on positive and negative sequences, and the contributions of legitimate and illegitimate patterns to these errors. The test-case appears suitable both for further investigations of problems in the theory of sequence evolution and for further testing of pattern analysis techniques.  相似文献   

8.
Crossassociation is a computer method of comparing protein sequences. It can help detect amino acid matches, deletions, insertions, and other similarities which would be hard to detect by eye. The method is to slide the sequences past each other one step at a time and to count the number of amino acids that match. At each overlap position, the program prints the percentage match and statistical significance measures of the matching. The null hypothesis for significance is the random arrangement of amino acids in the proportions found in the sequences under study. For most protein pairs, the expected proportion of matches is about 1/14. The method includes computation of three overall similarity measures between sequences which should have use in both evolutionary and taxonomic studies. The use of the method has been tested with actual and hypothetical sequences. Problems of recovering evolutionary relationships by this and related methods are discussed.  相似文献   

9.
Deng M  Yu C  Liang Q  He RL  Yau SS 《PloS one》2011,6(3):e17293

Background

Most existing methods for phylogenetic analysis involve developing an evolutionary model and then using some type of computational algorithm to perform multiple sequence alignment. There are two problems with this approach: (1) different evolutionary models can lead to different results, and (2) the computation time required for multiple alignments makes it impossible to analyse the phylogeny of a whole genome. This motivates us to create a new approach to characterize genetic sequences.

Methodology

To each DNA sequence, we associate a natural vector based on the distributions of nucleotides. This produces a one-to-one correspondence between the DNA sequence and its natural vector. We define the distance between two DNA sequences to be the distance between their associated natural vectors. This creates a genome space with a biological distance which makes global comparison of genomes with same topology possible. We use our proposed method to analyze the genomes of the new influenza A (H1N1) virus, human rhinoviruses (HRV) and mammalian mitochondrial. The result shows that a triple-reassortant swine virus circulating in North America and the Eurasian swine virus belong to the lineage of the influenza A (H1N1) virus. For the HRV and mammalian mitochondrial genomes, the results coincide with biologists'' analyses.

Conclusions

Our approach provides a powerful new tool for analyzing and annotating genomes and their phylogenetic relationships. Whole or partial genomes can be handled more easily and more quickly than using multiple alignment methods. Once a genome space has been constructed, it can be stored in a database. There is no need to reconstruct the genome space for subsequent applications, whereas in multiple alignment methods, realignment is needed to add new sequences. Furthermore, one can make a global comparison of all genomes simultaneously, which no other existing method can achieve.  相似文献   

10.
Reconstructing the evolutionary history of protein sequences will provide a better understanding of divergence mechanisms of protein superfamilies and their functions. Long-term protein evolution often includes dynamic changes such as insertion, deletion, and domain shuffling. Such dynamic changes make reconstructing protein sequence evolution difficult and affect the accuracy of molecular evolutionary methods, such as multiple alignments and phylogenetic methods. Unfortunately, currently available simulation methods are not sufficiently flexible and do not allow biologically realistic dynamic protein sequence evolution. We introduce a new method, indel-Seq-Gen (iSG), that can simulate realistic evolutionary processes of protein sequences with insertions and deletions (indels). Unlike other simulation methods, iSG allows the user to simulate multiple subsequences according to different evolutionary parameters, which is necessary for generating realistic protein families with multiple domains. iSG tracks all evolutionary events including indels and outputs the "true" multiple alignment of the simulated sequences. iSG can also generate a larger sequence space by allowing the use of multiple related root sequences. With all these functions, iSG can be used to test the accuracy of, for example, multiple alignment methods, phylogenetic methods, evolutionary hypotheses, ancestral protein reconstruction methods, and protein family classification methods. We empirically evaluated the performance of iSG against currently available methods by simulating the evolution of the G protein-coupled receptor and lipocalin protein families. We examined their true multiple alignments, reconstruction of the transmembrane regions and beta-strands, and the results of similarity search against a protein database using the simulated sequences. We also presented an example of using iSG for examining how phylogenetic reconstruction is affected by high indel rates.  相似文献   

11.
Membrane proteins play a crucial role in various cellular processes and are essential components of cell membranes. Computational methods have emerged as a powerful tool for studying membrane proteins due to their complex structures and properties that make them difficult to analyze experimentally. Traditional features for protein sequence analysis based on amino acid types, composition, and pair composition have limitations in capturing higher-order sequence patterns. Recently, multiple sequence alignment (MSA) and pre-trained language models (PLMs) have been used to generate features from protein sequences. However, the significant computational resources required for MSA-based features generation can be a major bottleneck for many applications. Several methods and tools have been developed to accelerate the generation of MSAs and reduce their computational cost, including heuristics and approximate algorithms. Additionally, the use of PLMs such as BERT has shown great potential in generating informative embeddings for protein sequence analysis. In this review, we provide an overview of traditional and more recent methods for generating features from protein sequences, with a particular focus on MSAs and PLMs. We highlight the advantages and limitations of these approaches and discuss the methods and tools developed to address the computational challenges associated with features generation. Overall, the advancements in computational methods and tools provide a promising avenue for gaining deeper insights into the function and properties of membrane proteins, which can have significant implications in drug discovery and personalized medicine.  相似文献   

12.
Prediction of gene sequences and their exon-intron structure in large eukaryotic genomic sequences is one of the central problems of mathematical biology. Solving this problem involves, in particular, high-accuracy splice site recognition. Using statistical analysis of a splice site-containing human gene fragment database, some characteristic features were described for nucleotide sequences in the splicing site neighborhood, the frequencies of all nucleotides and dinucleotides were determined, and those with frequencies increased or decreased in comparison to a random sequence were identified. The results can be used in sequence annotation, splicing site prediction, and the recognition of the gene exon-intron structure.  相似文献   

13.
We present a conversational system for the computer analysis of nucleic acid and protein sequences based on the well-known Queen and Korn program (1). The system can be used by persons with only minimal knowledge of computers.  相似文献   

14.
Recognition of coding regions within eukaryotic genomes is one of oldest but yet not solved problems of bioinformatics. New high-accuracy methods of splicing sites recognition are needed to solve this problem. A question of current interest is to identify specific features of nucleotide sequences nearby splicing sites and recognize sites in sequence context. We performed a statistical analysis of human genes fragment database and revealed some characteristics of nucleotide sequences in splicing sites neighborhood. Frequencies of all nucleotides and dinucleotides in splicing sites environment were computed and nucleotides and dinucleotides with extremely high\low occurrences were identified. Statistical information obtained in this work can be used in further development of the methods of splicing sites annotation and exon-intron structure recognition.  相似文献   

15.
16.
In this paper we consider one method of mapping larger units identified from the spatial pattern of sequences of vegetation types. The basic data were presence/absence data for 6450 stands arranged in 90 transects. A second set of data was derived by averaging the species occurrences in non-overlapping groups of 5 stands. A divisive numerical classification was used to determine the primary vegetation units. In all, 5 different sets of primary types were derived, using different species suites, different sample sizes and different numerical methods. We briefly discuss the types identified and their spatial patterns in the area.Each of these types was then used to define a string of type-codes for every transect so that each transect represents a sample from the landscape containing information on the frequency and spatial distribution of the primary vegetation types. The transects may be classified using a Levenshtein dissimilarity measure and agglomerative hierarchical classification, giving 5 analyses of transects, one for each of the primary types discussed above. We then examine these transect classifications to investigate the stability of the vegetation landspace patterns under changes in species used for the primary classification, in size of sample unit and in method of primary classifications. There is a considerable degree of stability in the results. However it seems with this vegetation that the tree species and non-tree species have considerable independence. We also indicate some problems with this approach and some possible extensions.  相似文献   

17.
18.
All popular algorithms of pair-wise alignment of protein primary structures (e.g. Smith-Waterman (SW), FASTA, BLAST, et al.) utilize only amino acid sequences. The SW-algorithm is the most accurate among them, i.e. it produces alignments that are most similar to the alignments obtained by superposition of protein 3D-structures. But even the SW-algorithm is unable to restore the 3D-based alignment if similarity of amino acid sequences (%id) is below 30%. We have proposed a novel alignment method that explicitly takes into account the secondary structure of the compared proteins. We have shown that it creates significantly more accurate alignments compared to SW-algorithm. In particular, for sequences with %id < 30% the average accuracy of the new method is 58% compared to 35% for SW-algorithm (the accuracy of an algorithmic sequence alignment is the part of restored position of a "golden standard" alignment obtained by superposition of corresponding 3D-structures). The accuracy of the proposed method is approximately identical both for experimental, and for theoretically predicted secondary structures. Thus the method can be applied for alignment of protein sequences even if protein 3D-structure is unknown. The program is available at ftp://194.149.64.196/STRUSWER/.  相似文献   

19.
A system for the computer analysis of nucleic acid and protein sequences ("Helix") is described. Format of the DNA sequences is EMBL--compatible and may be easily commented with the help of convenient menus. "Helix" has also following possibilities: an effective alignment of gele reading data and formation of the final sequence; simple making of recombined molecules "in calcular"; calculations of nucleotide and dinucleotide distribution along the sequence; looking for coding frames; calculations percentage of codons and amino acids in coding frames; searching for direct and inverted repeats; sequences alignment; protein secondary structure prediction; restriction mapping; DNA--protein translation. "Helix" also contain programs for RNA-structure prediction, looking for homologies throughover the EMAL bank, choosing optimal sequence for probes and searching promoters. All the programs are written at FORTRAN-77 and automatically translated into FORTRAN-4. "Helix" require only 64 kbite.  相似文献   

20.
Jones M  Ghoorah A  Blaxter M 《PloS one》2011,6(4):e19259

Background

DNA barcoding and other DNA sequence-based techniques for investigating and estimating biodiversity require explicit methods for associating individual sequences with taxa, as it is at the taxon level that biodiversity is assessed. For many projects, the bioinformatic analyses required pose problems for laboratories whose prime expertise is not in bioinformatics. User-friendly tools are required for both clustering sequences into molecular operational taxonomic units (MOTU) and for associating these MOTU with known organismal taxonomies.

Results

Here we present jMOTU, a Java program for the analysis of DNA barcode datasets that uses an explicit, determinate algorithm to define MOTU. We demonstrate its usefulness for both individual specimen-based Sanger sequencing surveys and bulk-environment metagenetic surveys using long-read next-generation sequencing data. jMOTU is driven through a graphical user interface, and can analyse tens of thousands of sequences in a short time on a desktop computer. A companion program, Taxonerator, that adds traditional taxonomic annotation to MOTU, is also presented. Clustering and taxonomic annotation data are stored in a relational database, and are thus amenable to subsequent data mining and web presentation.

Conclusions

jMOTU efficiently and robustly identifies the molecular taxa present in survey datasets, and Taxonerator decorates the MOTU with putative identifications. jMOTU and Taxonerator are freely available from http://www.nematodes.org/.  相似文献   

设为首页 | 免责声明 | 关于勤云 | 加入收藏

Copyright©北京勤云科技发展有限公司  京ICP备09084417号