Found 20 similar articles (search time: 15 ms)
1.
Rackovsky S Scheraga HA 《Journal of biomolecular structure & dynamics》2011,28(4):593-4; discussion 669-674
2.
Oliver M. Lean 《Biology & philosophy》2014,29(3):395-413
Shannon information is commonly assumed to be the wrong way in which to conceive of information in most biological contexts. Since the theory deals only in correlations between systems, the argument goes, it can apply to any and all causal interactions that affect a biological outcome. Since informational language is generally confined to only certain kinds of biological process, such as gene expression and hormone signalling, Shannon information is thought to be unable to account for this restriction. It is often concluded that a richer, teleosemantic sense of information is needed. I argue against this view, and show that a coherent and sufficiently restrictive theory of biological information can be constructed with Shannon information at its core. This can be done by paying due attention to some crucial distinctions: between information quantity and its fitness value, and between carrying information and having the function of doing so. From this I construct an account of how informational functions arise, and show that the “subject matter” of these functions can easily be seen as the natural information dealt with by Shannon’s theory.
3.
Koslicki D 《Bioinformatics (Oxford, England)》2011,27(8):1061-1067
4.
If DNA were a random string over its alphabet {A, C, G, T}, an optimal code would assign two bits to each nucleotide. DNA may be imagined to be a highly ordered, purposeful molecule, and one might therefore reasonably expect statistical models of its string representation to produce much lower entropy estimates. Surprisingly, this has not been the case for many natural DNA sequences, including portions of the human genome. We introduce a new statistical model (compression algorithm), the strongest reported to date, for naturally occurring DNA sequences. Conventional techniques code a nucleotide using only slightly fewer bits (1.90) than one obtains by relying only on the frequency statistics of individual nucleotides (1.95). Our method in some cases increases this gap by more than fivefold (1.66) and may lead to better performance in microbiological pattern recognition applications. One of our main contributions, and the principal source of these improvements, is the formal inclusion of inexact match information in the model. The existence of matches at various distances forms a panel of experts which are then combined into a single prediction. The structure of this combination is novel and its parameters are learned using Expectation Maximization (EM). Experiments are reported using a wide variety of DNA sequences and compared whenever possible with earlier work. Four reasonable notions for the string distance function used to identify near matches are implemented and experimentally compared. We also report lower entropy estimates for coding regions extracted from a large collection of nonredundant human genes. The conventional estimate is 1.92 bits. Our model produces only slightly better results (1.91 bits) when considering nucleotides, but achieves 1.84-1.87 bits when the prediction problem is divided into two stages: (i) predict the next amino acid based on inexact polypeptide matches, and (ii) predict the particular codon. Our results suggest that matches at the amino acid level play some role, but a small one, in determining the statistical structure of nonredundant coding sequences.
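The two-bit baseline and the single-nucleotide frequency estimate mentioned in the abstract above can be illustrated with a minimal sketch. The function name `order0_entropy` and the toy sequences are assumptions made for illustration; this is the plain Shannon entropy of symbol frequencies, not the authors' compression model.

```python
import math
from collections import Counter


def order0_entropy(seq):
    """Shannon entropy in bits per symbol, using single-symbol frequencies only."""
    counts = Counter(seq)
    total = len(seq)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


# Hypothetical toy sequences, not data from the paper.
uniform = "ACGT" * 25                                # four nucleotides, equally frequent
skewed = "A" * 70 + "C" * 10 + "G" * 10 + "T" * 10   # A dominates

print(order0_entropy(uniform))  # the two-bit baseline of a random string
print(order0_entropy(skewed))   # lower, because the composition is biased
```

A frequency-only estimate like this ignores all context between neighbouring symbols, which is why models that exploit repeats or near matches can, in principle, go below it.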
5.
6.
7.
G. I. Makhatadze P. L. Privalov 《Protein science : a publication of the Protein Society》1996,5(3):507-510
The failure to appreciate that the hydration of polar groups is a major contribution to the entropy of protein unfolding has led to considerable underestimates for the loss of configurational freedom when a protein chain folds.
8.
Yan-Bin Wang Zhu-Hong You Xiao Li Tong-Hai Jiang Li Cheng Zhan-Heng Chen 《BMC systems biology》2018,12(8):129
Background
Self-interacting proteins (SIPs) play a critical role in a range of life functions in most living cells, and research on SIPs is an important part of molecular biology. Although numerous SIP data are available, traditional experimental methods are labor-intensive, time-consuming and costly, and can yield only limited results relative to real-world needs. Hence, it is urgent to develop an efficient computational SIP prediction method to fill the gap. Deep learning technologies have been shown to produce disruptive performance improvements in many areas, but the effectiveness of deep learning methods for SIP prediction has not been verified.
Results
We developed a deep learning model for predicting SIPs by constructing a Stacked Long Short-Term Memory (SLSTM) neural network that contains “dropout”. We extracted features from protein sequences using a novel feature extraction scheme that combined Zernike Moments (ZMs) with Position Specific Weight Matrix (PSWM). The capability of the proposed approach was assessed on S. cerevisiae and Human SIPs datasets. The results indicate that the deep learning approach can effectively resist data skew and achieve good accuracies of 95.69% and 97.88%, respectively. To demonstrate the advantages of deep learning, we compared the results of the SLSTM-based method with those of the well-known Support Vector Machine (SVM) method and several other well-known methods on the same datasets.
Conclusion
The results show that our method is overall superior to the other existing state-of-the-art techniques. As far as we know, this study is the first to apply a deep learning method to predict SIPs, and practical experimental results reveal its potential in SIP identification.
9.
The Homeodomain Resource is a comprehensive collection of sequence, structure and genomic information on the homeodomain protein family. Available through the Resource are both full-length and domain-only sequence data, as well as X-ray and NMR structural data for proteins and protein-DNA complexes. Also available is information on human genetic diseases and disorders in which proteins from the homeodomain family play an important role; genomic information includes relevant gene symbols, cytogenetic map locations, and specific mutation data. Search engines are provided to allow users to easily query the component databases and assemble specialized data sets. The Homeodomain Resource is available through the World Wide Web at http://genome.nhgri.nih.gov/homeodomain
10.
A statistical method for characterizing nucleotide sequences based on maximum entropy techniques is presented. The method uses only codon usage tables, takes into account the length of sequences, and preserves the information contained in each codon by a punctual index. We present the methodological aspects of the analysis and show an application to nucleotide sequences of eukaryotes.
11.
MOTIVATION: We present a new concept that combines data storage and data analysis in genome research, based on an associative network memory. As an illustration, 115 000 conserved regions from over 73 000 published sequences (i.e. from the entire annotated part of the SWISSPROT sequence database) were identified and clustered by a self-organizing network. Similarity and kinship, as well as degree of distance between the conserved protein segments, are visualized as neighborhood relationships on a two-dimensional topographical map. RESULTS: Such a display overcomes the restrictions of linear list processing and allows local and global sequence relationships to be studied visually. Families are memorized as prototype vectors of conserved regions. On a massively parallel machine, clustering and updating of the database take only a few seconds; a rapid analysis of incoming data such as protein sequences or ESTs is carried out on present-day workstations. AVAILABILITY: Access to the database is available at http://www.bioinf.mdc-berlin.de/unter2.html CONTACT: (hanke,lehmann,reich)@mdc-berlin.de; bork@embl-heidelberg.de
12.
Concepts of the uniqueness of the amino acid sequences of proteins were defined in a prior report (Saroff, H. A. and F. A. Kutyna. 1981. “The Uniqueness of Protein Sequences: A Monte Carlo Analysis.” Bull. math. Biol. 43, 619–639), which presented a detailed discussion of i-uniqueness, i.e. the tendency of small peptides to be repeated within an amino acid sequence of a protein. We now report on the quantitative analysis of o-uniqueness, which evaluates the tendency of small peptides to be repeated amongst different proteins, usually of a single species. A detailed analysis of the o-uniqueness of several proteins is presented to illustrate the method and the range of values encountered. Uniqueness data on sequences of human proteins in a data bank of sequences containing about 32,500 amino acids are made available in the form of a microfiche. Analysis of biologically active subsequences such as the angiotensins and the enkephalins suggests a tendency of the subsequences contributing to the property of o-uniqueness to cluster in portions of the parent protein sequence which are biologically active. This property may provide a general method for predicting biologically active areas of proteins. Current data may already be adequate to permit useful predictions, and the rapidly accumulating and interrelated new data on nucleic acid and protein sequences will further enhance the power of o-uniqueness analysis.
13.
Sadovsky MG 《Bulletin of mathematical biology》2003,65(2):309-322
A new method to compare two (or several) symbol sequences is developed. The method is based on the comparison of the frequencies of the small fragments of the compared sequences; it requires neither string editing, nor other transformations of the compared objects. The comparison is executed through a calculation of the specific entropy of a frequency dictionary against the special dictionary called the hybrid one; this latter is the statistical ancestor of the group of sequences under comparison. Some applications of the developed method in the fields of genetics and bioinformatics are discussed.
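The idea of comparing sequences through the frequencies of their small fragments can be sketched as follows. This is only an illustration under stated assumptions: the reference ("hybrid") dictionary is taken here to be the arithmetic mean of the individual frequency dictionaries, which may differ from the construction used in the paper, and all function names and toy sequences are hypothetical.

```python
import math
from collections import Counter


def frequency_dictionary(seq, k):
    """Relative frequencies of all overlapping fragments of length k."""
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    total = sum(counts.values())
    return {word: c / total for word, c in counts.items()}


def mean_dictionary(dicts):
    """ASSUMPTION: reference dictionary = arithmetic mean of the input dictionaries."""
    words = set().union(*dicts)
    return {w: sum(d.get(w, 0.0) for d in dicts) / len(dicts) for w in words}


def relative_entropy(p, q):
    """Sum over fragments w in p of p(w) * log2(p(w) / q(w)), in bits."""
    return sum(pw * math.log2(pw / q[w]) for w, pw in p.items())


# Hypothetical toy sequences, compared without any alignment or string editing.
fa = frequency_dictionary("ACGTACGTAC", 2)
fb = frequency_dictionary("AAAAACCCCC", 2)
reference = mean_dictionary([fa, fb])

print(relative_entropy(fa, reference))
print(relative_entropy(fb, reference))
```

Because the averaged dictionary contains every fragment seen in any input, the logarithm is always defined; a value of zero would mean a sequence's fragment statistics coincide with the reference.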
14.
Giri Narasimhan Changsong Bu Yuan Gao Xuning Wang Ning Xu Kalai Mathee 《Journal of computational biology》2002,9(5):707-720
We use methods from Data Mining and Knowledge Discovery to design an algorithm for detecting motifs in protein sequences. The algorithm assumes that a motif is constituted by the presence of a "good" combination of residues in appropriate locations of the motif. The algorithm attempts to compile such good combinations into a "pattern dictionary" by processing an aligned training set of protein sequences. The dictionary is subsequently used to detect motifs in new protein sequences. Statistical significance of the detection results is ensured by statistically determining the various parameters of the algorithm. Based on this approach, we have implemented a program called GYM. The Helix-Turn-Helix motif was used as a model system on which to test our program. The program was also extended to detect Homeodomain motifs. The detection results for the two motifs compare favorably with existing programs. In addition, the GYM program provides a lot of useful information about a given protein sequence.
15.
16.
17.
18.
Apweiler R 《Briefings in bioinformatics》2001,2(1):9-18
With the rapid growth of sequence databases, there is an increasing need for reliable functional characterisation and annotation of newly predicted proteins. To cope with such large data volumes, faster and more effective means of protein sequence characterisation and annotation are required. One promising approach is automatic large-scale functional characterisation and annotation, which is generated with limited human interaction. However, such an approach is heavily dependent on reliable data sources. The SWISS-PROT protein sequence database plays an essential role here owing to its high level of functional information.
19.
Human lactate dehydrogenase B (LDH-B) cDNA was isolated and sequenced. The LDH-B cDNA insert consists of the protein-coding sequence (999 bp), the 5' (54 bp) and 3' (203 bp) non-coding regions, and the poly(A) tail (50 bp). The predicted sequence of 333 amino acid residues was confirmed by amino acid composition and/or sequence analyses of a total of 185 (56%) residues from tryptic peptides of human LDH-B protein. The nucleotide and amino acid sequences of the human LDH-B coding region show 68% and 75% homologies respectively with those of the human LDH-A. The peptide map and amino acid composition data have been deposited as Supplementary Publication SUP 50139 (7 pages) at the British Library Lending Division, Boston Spa, Wetherby, West Yorkshire LS23 7BQ, U.K., from whom copies are available on prepayment [see Biochem. J. (1987) 241, 5].
20.
Evolution of protein sequences and structures. Total citations: 9 (self-citations: 0; citations by others: 9)
The relationship between sequence similarity and structural similarity has been examined in 36 protein families with five or more diverse members whose structures are known. The structural similarity within a family (as determined with the DALI structure comparison program) is linearly related to sequence similarity (as determined by a Smith-Waterman search of the protein sequences in the structure database). The correlation between structural similarity and sequence similarity is very high; 18 of the 36 families had linear correlation coefficients r ≥ 0.878, and only nine had correlation coefficients r ≤ 0.815. Inclusion of higher-order terms in the structure/sequence relationship improved the fit by less than 7% in 27 of the 36 families. Differences in sequence/structure correlations are distributed evenly among the four protein structural classes, alpha, beta, alpha/beta, and alpha+beta. While most protein families show high correlations between sequence similarity and structural similarity, the amount of structural change per sequence change, i.e. the structural mutation sensitivity, varies almost fourfold. Protein families with high and low structural mutation sensitivity are distributed evenly among protein structure classes. In addition, we did not detect strong correlations between structural mutation sensitivity and either protein family mutation rates or protein size. Our results are more consistent with models of protein structure that encode a protein family's fold throughout the protein sequence, and not just in a few critical residues.
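The linear correlation coefficient r referred to in the abstract above is the standard Pearson coefficient. A minimal sketch of how it is computed is given here; the function name and the paired similarity scores are hypothetical illustrations, not values from the study.

```python
import math


def pearson_r(xs, ys):
    """Pearson linear correlation coefficient between two equal-length samples."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


# Hypothetical paired scores for members of one protein family.
sequence_similarity = [20, 35, 50, 65, 80]
structural_similarity = [2.0, 3.1, 3.9, 5.2, 6.0]

print(pearson_r(sequence_similarity, structural_similarity))
```

A value close to 1 indicates that structural similarity rises almost linearly with sequence similarity within the family.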