Similar literature
 Found 20 similar articles (search time: 31 ms)
1.
The Shannon entropy is a common way of measuring conservation of sites in multiple sequence alignments, and has also been extended with the relative Shannon entropy to account for background frequencies. The von Neumann entropy is another extension of the Shannon entropy, adapted from quantum mechanics in order to account for amino acid similarities. However, there is as yet no relative von Neumann entropy defined for sequence analysis. We introduce a new definition of the von Neumann entropy for use in sequence analysis, which we found to perform better than the previous definition. We also introduce the relative von Neumann entropy and a way of parametrizing it in order to obtain the Shannon entropy, the relative Shannon entropy and the von Neumann entropy at special parameter values. We performed an exhaustive search of this parameter space and found better predictions of catalytic sites compared to any of the previously used entropies.
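As an illustration of the two baseline measures the abstract starts from, here is a minimal Python sketch of the Shannon entropy of one alignment column and of its relative entropy against background frequencies. The function names, the gap character "-", and the use of the Kullback-Leibler form for the relative entropy are assumptions made for this example; the abstract does not give formulas, and the von Neumann variants it introduces are not covered here.

```python
import math
from collections import Counter


def column_frequencies(column):
    """Relative frequency of each residue in one alignment column.

    Gap characters ("-") are skipped; this is an assumption of the sketch.
    """
    residues = [r for r in column if r != "-"]
    if not residues:
        return {}
    counts = Counter(residues)
    total = len(residues)
    return {r: c / total for r, c in counts.items()}


def shannon_entropy(column):
    """Shannon entropy in bits; 0 means the column is fully conserved."""
    return -sum(p * math.log2(p) for p in column_frequencies(column).values())


def relative_entropy(column, background):
    """Kullback-Leibler divergence (bits) of the column from the background.

    `background` maps each residue to its background frequency (hypothetical
    input; every residue seen in the column must have a nonzero entry).
    """
    freqs = column_frequencies(column)
    return sum(p * math.log2(p / background[r]) for r, p in freqs.items())
```

For example, `shannon_entropy("AACC")` treats the column as two residues at frequency 0.5 each, while `relative_entropy("AAAA", {"A": 0.25})` scores a fully conserved column more highly when the conserved residue is rare in the background.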

2.
The entropies of protein-coding genes from Escherichia coli were calculated according to Boltzmann's formula. Entropies of the coding regions were compared to the entropies of noncoding or miscoding ones. With nucleotides as code units, the entropies of the coding regions, when compared to the entropies of complete sequences (leader and coding region as well as trailer), were seen to be lower, but with only marginal statistical significance. With triplets of nucleotides as code units, the entropies of correct reading frames were significantly lower than the entropies of frameshifts +1 and -1. With amino acids as code units, the results were the opposite: biologically functional proteins had significantly higher entropies than proteins translated from the frameshifted sequences. We attempt to explain this paradox with the hypothesis that the genetic code may have the ability to lower the information content (increase the entropy) of proteins while translating them from DNA. This ability might be beneficial to bacteria because it would make the functional proteins more probable (having a higher entropy) than nonfunctional proteins translated from frameshifted sequences.

3.
A large protein sequence database with over 31,000 sequences and 10 million residues has been analysed. The pair probabilities have been converted to entropies using Boltzmann's law of statistical thermodynamics. A scoring weight corresponding to the "mixing entropy" of the amino acid pairs has been developed, from which the entropies of the protein sequences have been calculated. The entropy values of natural sequences are lower than those of their random counterparts of the same length and similar amino acid composition. Based on these results, it has been proposed that natural sequences are a special set of polypeptides with the additional qualification of biological functionality, which can be quantified using the entropy concept as worked out in this paper.

4.
Lisewski AM. PLoS ONE, 2008, 3(9): e3110
The transmission of genomic information from coding sequence to protein structure during protein synthesis is subject to stochastic errors. To analyze transmission limits in the presence of spurious errors, Shannon's noisy channel theorem is applied to a communication channel between amino acid sequences and their structures established from a large-scale statistical analysis of protein atomic coordinates. While Shannon's theorem confirms that in close to native conformations information is transmitted with limited error probability, additional random errors in sequence (amino acid substitutions) and in structure (structural defects) trigger a decrease in communication capacity toward a Shannon limit at 0.010 bits per amino acid symbol at which communication breaks down. In several controls, simulated error rates above a critical threshold and models of unfolded structures always produce capacities below this limiting value. Thus an essential biological system can be realistically modeled as a digital communication channel that is (a) sensitive to random errors and (b) restricted by a Shannon error limit. This forms a novel basis for predictions consistent with observed rates of defective ribosomal products during protein synthesis, and with the estimated excess of mutual information in protein contact potentials.

5.
Here I systematically examine the information complexity of all primary sequences of natural proteins deposited in the Swiss-Prot database. The sequence complexity is assessed by determining the frequency of occurrence of each amino acid type in sequence windows of fixed length, calculating the Shannon entropy of each window and then averaging over all windows covering the sequence. The minimum value of the information content obtained from the present-day record imposes a lower limit on the number of letters that a primeval amino acid alphabet must have had.
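The window-based procedure described in this abstract can be sketched in Python as follows. This is only an illustrative reading of the abstract: the overlapping windows with a step of one residue, the default window length of 20, and the function names are assumptions, since the abstract does not state them.

```python
import math
from collections import Counter


def window_entropy(window):
    """Shannon entropy (bits) of the residue frequencies inside one window."""
    counts = Counter(window)
    n = len(window)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


def mean_window_entropy(sequence, window_length=20):
    """Average the window entropy over all windows covering the sequence.

    Windows overlap and advance one residue at a time (an assumption).
    """
    if window_length < 1 or len(sequence) < window_length:
        raise ValueError("sequence must be at least one window long")
    starts = range(len(sequence) - window_length + 1)
    entropies = [window_entropy(sequence[i:i + window_length]) for i in starts]
    return sum(entropies) / len(entropies)
```

A homopolymer such as "AAAA" gives a mean of 0 bits, while a window containing k equally frequent residue types gives log2(k) bits, which is why a minimum observed entropy can be translated into a minimum alphabet size.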

6.
Homology detection and protein structure prediction are central themes in bioinformatics. Establishing relationships between protein sequences, or predicting their structure by sequence comparison methods, is limited when sequence similarity is low. Recent works demonstrate that the use of profiles improves homology detection and protein structure prediction. Profiles can be inferred from protein multiple alignments using different approaches. The "Conservatism-of-Conservatism" is an effective profile analysis method to identify structural features between proteins having the same fold but no detectable sequence similarity. The information obtained from protein multiple alignments varies according to the amino acid classification employed to calculate the profile. In this work, we calculated entropy profiles from PSI-BLAST-derived multiple alignments and used different amino acid classifications summarizing almost 500 different attributes. These entropy profiles were converted into pseudocodes, which were compared using the FASTA program with an ad hoc matrix. We tested the performance of our method in identifying relationships between proteins with similar folds using a nonredundant subset of sequences having less than 40% identity. We then compared our results, using Coverage Versus Error per query curves, to those obtained by methods such as PSI-BLAST, COMPASS and HHSEARCH. Our method, named HIP (Homology Identification with Profiles), showed higher accuracy in detecting relationships between proteins with the same fold. The use of different amino acid classifications reflecting a large number of amino acid attributes improved the recognition of distantly related folds. We propose the use of pseudocodes representing profile information as a fast and powerful tool for homology detection, fold assignment and analysis of the evolutionary information enclosed in protein profiles.

7.
Modeling the inherent flexibility of the protein backbone as part of computational protein design is necessary to capture the behavior of real proteins and is a prerequisite for the accurate exploration of protein sequence space. We present the results of a broad exploration of sequence space, with backbone flexibility, through a novel approach: large-scale protein design to structural ensembles. A distributed computing architecture has allowed us to generate hundreds of thousands of diverse sequences for a set of 253 naturally occurring proteins, allowing exciting insights into the nature of protein sequence space. Designing to a structural ensemble produces a much greater diversity of sequences than previous studies have reported, and homology searches using profiles derived from the designed sequences against the Protein Data Bank show that the relevance and quality of the sequences are not diminished. The designed sequences have greater overall diversity than corresponding natural sequence alignments, and no direct correlations are seen between the diversity of natural sequence alignments and the diversity of the corresponding designed sequences. For structures in the same fold, the sequence entropies of the designed sequences cluster together tightly. This tight clustering of sequence entropies within a fold and the separation of sequence entropy distributions for different folds suggest that the diversity of designed sequences is primarily determined by a structure's overall fold, and that the designability principle postulated from studies of simple models holds in real proteins. This has important implications for experimental protein design and engineering, as well as providing insight into protein evolution.

8.
9.
Mechanisms leading to gene variations are responsible for the diversity of species and are important components of the theory of evolution. One constraint on gene evolution is that of protein foldability; the three-dimensional shapes of proteins must be thermodynamically stable. We explore the impact of this constraint and calculate properties of foldable sequences using 3660 structures from the Protein Data Bank. We seek a selection function that receives sequences as input, and outputs survival probability based on sequence fitness to structure. We compute the number of sequences that match a particular protein structure with energy lower than the native sequence, the density of the number of sequences, the entropy, and the "selection" temperature. The mechanism of structure selection for sequences longer than 200 amino acids is approximately universal. For shorter sequences, it is not. We speculate on concrete evolutionary mechanisms that show this behavior.

10.
A simple model is used to illustrate the relationship between the dynamics measured by NMR relaxation methods and the local residual entropy of proteins. The expected local dynamic behavior of well-packed extended amino acid side chains is described by employing a one-dimensional vibrator that encapsulates both the spatial and temporal character of the motion. This model is then related to entropy and to the generalized order parameter of the popular "model-free" treatment often used in the analysis of NMR relaxation data. Simulations indicate that order parameters observed for the methyl symmetry axes in, for example, human ubiquitin correspond to significant local entropies. These observations have obvious significance for the issue of the physical basis of protein structure, dynamics, and stability.

11.
Protein interactions within a multimolecular complex can result in information and energy transfer between proteins. This can in turn lead to the emergence of novel functions for some proteins of the complex. Various examples of this situation can be found in the scientific literature. This is probably the case for prion protein, chloroplast phosphoribulokinase bound to glyceraldehyde phosphate dehydrogenase, the Ras system, and pancreatic lipase bound to biomembranes, to cite but a few. Any enzyme reaction, or enzyme reaction network, carries Shannon entropy and information. In contrast to genome entropy, the entropy of enzyme reactions and metabolic sequences is sensitive to 'external' signals, such as substrate, effector and proton concentrations. Complex structural organization of the cell is associated with a higher entropy content, and one can calculate the gain of entropy and information due to integration and complexity. One may conclude from this brief analysis that the informational content of a living cell is much larger than that of its genome.

12.
IRBM, 2021, 42(6): 400-406
1) Objective: Pulmonary optical endomicroscopy (POE) is a real-time imaging technology that allows pulmonary alveoli to be examined at a microscopic level. In a POE image sequence acquired in a clinical setting, as much as 25% of the frames can be uninformative (i.e. pure noise and motion artifacts). These uninformative frames must be removed from the sequence before further data analysis. The objective of our work is therefore to develop a method for automatically detecting uninformative images in endomicroscopy sequences. 2) Material and methods: We treat the detection problem as a classification problem. Given the advantages of deep learning methods, we design a classifier based on a CNN (Convolutional Neural Network) with a new loss function based on the Havrda-Charvat entropy, a parametric generalization of the Shannon entropy. We propose this formula to better handle varied data, since it provides a more stable model than the Shannon entropy. 3) Results: Our method was tested on one POE dataset of 3895 distinct images; it gives better results than the Shannon entropy and is less prone to overfitting. We obtain 70% accuracy with the Shannon entropy versus 77 to 79% with Havrda-Charvat. 4) Conclusion: We conclude that, owing to its generalized nature, the Havrda-Charvat entropy is better suited to restricted and/or noisy datasets. It is also more suitable for classification in endomicroscopy datasets.
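To make the phrase "parametric generalization of the Shannon entropy" concrete, the sketch below computes the Havrda-Charvat entropy of a discrete distribution in its commonly cited normalization, H_alpha(p) = (sum_i p_i^alpha - 1) / (2^(1 - alpha) - 1), which tends to the Shannon entropy in bits as alpha approaches 1. This shows the entropy itself only; the abstract does not give the loss function used to train the CNN, so nothing here should be read as that loss, and the function names are chosen for this example.

```python
import math


def shannon_entropy(probabilities):
    """Shannon entropy in bits; zero-probability terms contribute nothing."""
    return -sum(p * math.log2(p) for p in probabilities if p > 0)


def havrda_charvat_entropy(probabilities, alpha):
    """Havrda-Charvat entropy of order alpha (alpha > 0).

    Uses the normalization in which a fair coin has entropy 1 for every
    alpha; alpha == 1 falls back to the Shannon limit.
    """
    if alpha <= 0:
        raise ValueError("alpha must be positive")
    if alpha == 1:
        return shannon_entropy(probabilities)
    power_sum = sum(p ** alpha for p in probabilities if p > 0)
    return (power_sum - 1) / (2 ** (1 - alpha) - 1)
```

Varying alpha changes how strongly high-probability and low-probability outcomes are weighted, which is the extra degree of freedom the abstract refers to.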

13.
Dynamic aspects of R-R intervals have often been analyzed by means of linear and nonlinear measures. The goal of this study was to analyze binary sequences, in which only the dynamic information is retained, by means of two different aspects of regularity. R-R interval sequences derived from 24-h electrocardiogram (ECG) recordings of 118 healthy subjects were converted to symbolic binary sequences that coded the beat-to-beat increase or decrease in the R-R interval. Shannon entropy was used to quantify the occurrence of short binary patterns (length N = 5) in binary sequences derived from 10-min intervals. The regularity of the short binary patterns was analyzed on the basis of approximate entropy (ApEn). ApEn had a linear dependence on mean R-R interval length, with increasing irregularity occurring at longer R-R interval length. Shannon entropy of the same sequences showed that the increase in irregularity is accompanied by a decrease in occurrence of some patterns. Taken together, these data indicate that irregular binary patterns are more probable when the mean R-R interval increases. The use of surrogate data confirmed a nonlinear component in the binary sequence. Analysis of two consecutive 24-h ECG recordings for each subject demonstrated good intraindividual reproducibility of the results. In conclusion, quantification of binary sequences derived from ECG recordings reveals properties that cannot be found using the full information of R-R interval sequences.
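The symbolization and pattern-counting steps described in this abstract can be sketched in Python as shown below. The sketch assumes that an unchanged interval is coded as 0 and that patterns are counted in overlapping positions; the abstract specifies neither, and the approximate entropy (ApEn) part of the analysis is not reproduced.

```python
import math
from collections import Counter


def to_binary(rr_intervals):
    """Code each beat-to-beat change as 1 (increase) or 0 (otherwise).

    Treating an unchanged interval as 0 is an assumption of this sketch.
    """
    return [1 if b > a else 0 for a, b in zip(rr_intervals, rr_intervals[1:])]


def pattern_entropy(bits, pattern_length=5):
    """Shannon entropy (bits) of the distribution of short binary patterns.

    Patterns are read from overlapping positions (an assumption).
    """
    starts = range(len(bits) - pattern_length + 1)
    patterns = [tuple(bits[i:i + pattern_length]) for i in starts]
    counts = Counter(patterns)
    n = len(patterns)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())
```

With a pattern length of 5 there are 32 possible patterns, so the entropy ranges from 0 bits (a single repeated pattern) to 5 bits (all patterns equally frequent).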

14.
15.
Characterizing enzyme sequences and identifying their active sites is a very important task. The current experimental methods are too expensive and labor intensive to handle the rapidly accumulating protein sequence and structure data. Thus accurate, high-throughput in silico methods for identifying catalytic residues and predicting enzyme function are much needed. In this paper, we propose a novel sequence-based catalytic domain prediction method that combines sequence clustering with an information-theoretic approach. The first step is to perform a sequence clustering analysis of enzyme sequences from the same functional category (those with the same EC label). The clustering analysis is used to handle the problem of widely varying sequence similarity levels in enzyme sequences. It constructs a sequence graph in which nodes are enzyme sequences and edges connect pairs of sequences with a certain degree of sequence similarity, and uses graph properties, such as biconnected components and articulation points, to generate sequence segments common to the enzyme sequences. The amino acid subsequences in the shared regions are then aligned, and an information-theoretic approach called the aggregated column related scoring scheme is applied to highlight potential active sites in enzyme sequences. The aggregated information content scoring scheme is shown to highlight active-site residues effectively. The proposed method of combining the clustering and the aggregated information content scoring methods was successful in highlighting known catalytic sites in enzymes of Escherichia coli K12 in terms of the Catalytic Site Atlas database.
Our method is shown to be not only accurate in predicting potential active sites in the enzyme sequences but also computationally efficient, since the clustering approach utilizes two graph properties that can be computed in time linear in the number of edges in the sequence graph, and the computation of mutual information does not require much time. We believe that the proposed method can be useful for identifying active sites of enzyme sequences from many genome projects.

16.
A combination of data derived from peptide sequencing and nucleic acid sequencing of cloned cDNA fragments has been used to define the complete amino acid sequence of a 10,000 M.W., thyroxine-containing polypeptide derived from bovine thyroglobulin. This fragment, TG-F, which was obtained following reduction and alkylation, has been placed at the amino terminus of the parent protein, with hormone located at residue 5 in the primary sequence of the thyroglobulin molecule. The carboxyl terminal sequence of this fragment, -Cys-Gln-Leu-Gln, is found on the N-terminal side of a Lys residue, suggesting that the peptide bond cleaved to produce this 80 residue fragment from the parent (330K) thyroglobulin chain is a Gln-Lys bond. In addition, the amino acid sequence of this 10K fragment contains: no sequence which would be a substrate for glycosylation, and no carbohydrate; several repeated homologous amino acid sequences; and a striking number of beta-bends predicted from Chou-Fasman analyses, particularly near its carboxyl terminus.

17.
Predicting functionally important residues from sequence conservation   Total citations: 2 (self-citations: 1; citations by others: 1)
MOTIVATION: Not all residues in a protein are equally important. Some are essential for the proper structure and function of the protein, whereas others can be readily replaced. Conservation analysis is one of the most widely used methods for predicting these functionally important residues in protein sequences. RESULTS: We introduce an information-theoretic approach for estimating sequence conservation based on Jensen-Shannon divergence. We also develop a general heuristic that considers the estimated conservation of sequentially neighboring sites. In large-scale testing, we demonstrate that our combined approach outperforms previous conservation-based measures in identifying functionally important residues; in particular, it is significantly better than the commonly used Shannon entropy measure. We find that considering conservation at sequential neighbors improves the performance of all methods tested. Our analysis also reveals that many existing methods that attempt to incorporate the relationships between amino acids do not lead to better identification of functionally important sites. Finally, we find that while conservation is highly predictive in identifying catalytic sites and residues near bound ligands, it is much less effective in identifying residues in protein-protein interfaces. AVAILABILITY: Data sets and code for all conservation measures evaluated are available at http://compbio.cs.princeton.edu/conservation/
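For readers unfamiliar with the Jensen-Shannon divergence named in this abstract, here is a minimal Python sketch of the standard equal-weight definition applied to two discrete distributions, for instance a column's residue frequencies and a background distribution. The equal weights and the function names are assumptions of this example; the authors' own code at the URL above may weight the distributions differently or add pseudocounts, and the neighbor-window heuristic is not shown.

```python
import math


def kl_divergence(p, q):
    """Kullback-Leibler divergence in bits; terms with p_i == 0 are skipped."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def jensen_shannon_divergence(p, q):
    """Equal-weight Jensen-Shannon divergence between distributions p and q.

    Both inputs are sequences of probabilities over the same ordered alphabet.
    The result lies between 0 (identical) and 1 bit (disjoint support).
    """
    mixture = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, mixture) + 0.5 * kl_divergence(q, mixture)
```

Unlike the plain relative entropy, this quantity is symmetric and stays finite even when a residue has zero frequency in one of the two distributions.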

18.
Data compression is concerned with how information is organized in data. Efficient storage means removal of redundancy from the data being stored in the DNA molecule. Data compression algorithms remove redundancy and are used to understand biologically important molecules. We present a compression algorithm, "DNABIT Compress", for DNA sequences, based on a novel scheme of assigning binary bits to smaller segments of DNA bases in order to compress both repetitive and non-repetitive DNA sequences. Our proposed algorithm achieves the best compression ratio for DNA sequences for larger genomes. Significantly better compression results show that the "DNABIT Compress" algorithm outperforms the other compression algorithms. While achieving the best compression ratios for DNA sequences (genomes), our new DNABIT Compress algorithm significantly improves on the running time of all previous DNA compression programs. Assigning binary bits (a unique bit code) to fragments of DNA sequence (exact repeats, reverse repeats) is also a concept introduced in this algorithm for the first time in DNA compression. The proposed algorithm achieves a compression ratio as low as 1.58 bits/base, whereas the best existing methods could not achieve a ratio below 1.72 bits/base.

19.
Protein structure information is very useful for the confirmation of protein function. The protein structural class can provide information for protein 3D structure analysis, because the conformation of the protein's overall folding type plays a significant part in molecular biology. In this paper, we focus on the prediction of protein structural class based on a new feature representation. We extract features from the Chou-Fasman parameters, amino acid composition, amino acid hydrophobicity, polarity information and pair-coupled amino acid composition. The prediction results obtained with a support vector machine (SVM) classifier show that our method performs better than some other methods.

20.