Similar articles
 Found 20 similar articles (search time: 15 ms)
1.
The solvent accessibility of each residue is predicted on the basis of the protein sequence. A set of 338 monomeric, non-homologous and high-resolution protein crystal structures is used as a learning set and a jackknife procedure is applied to each entry. The prediction is based on the comparison of the observed and the average values of the solvent-accessible area. It appears that the prediction accuracy is significantly improved by considering the residue types preceding and/or following the residue whose accessibility must be predicted. In contrast, the separate treatment of different secondary structural types does not improve the quality of the prediction. It is furthermore shown that residue accessibility is much better predicted in small proteins than in larger ones. Such a discrepancy must be carefully considered in any algorithm for predicting residue accessibility.

2.
MOTIVATION: The antigen receptors of adaptive immunity (T-cell receptors and immunoglobulins) are encoded by genes assembled stochastically from combinatorial libraries of gene segments. Immunoglobulin genes then experience further diversification through hypermutation. Analysis of the somatic genetics of the immune response depends explicitly on inference of the details of the recombinatorial process giving rise to each of the participating antigen receptor genes. We have developed a dynamic programming algorithm to perform this reconstruction and have implemented it as web-accessible software called SoDA (Somatic Diversification Analysis). RESULTS: We tested SoDA against a set of 120 artificial immunoglobulin sequences generated by simulation of recombination and compared the results with two other widely used programs. SoDA inferred the correct gene segments more frequently than the other two programs. We further tested these programs using 30 human immunoglobulin genes from GenBank and here highlight instances where the recombinations inferred by the three programs differ. SoDA appears generally to find more likely recombinations.

3.
MOTIVATION: We consider the problem of identifying low-complexity regions (LCRs) in a protein sequence. LCRs are regions of biased composition, normally consisting of different kinds of repeats. RESULTS: We define new complexity measures to compute the complexity of a sequence based on a given scoring matrix, such as BLOSUM 62. Our complexity measures also consider the order of amino acids in the sequence and the sequence length. We develop a novel graph-based algorithm called GBA to identify LCRs in a protein sequence. In the graph constructed for the sequence, each vertex corresponds to a pair of similar amino acids. Each edge connects two pairs of amino acids that can be grouped together to form a longer repeat. GBA finds short subsequences as LCR candidates by traversing this graph. It then extends them to find longer subsequences that may contain full repeats with low complexities. Extended subsequences are then post-processed to refine repeats to LCRs. Our experiments on real data show that GBA has significantly higher recall compared to existing algorithms, including 0j.py, CARD, and SEG. AVAILABILITY: The program is available on request.

4.
Aryl-alcohol oxidase (AAO), an FAD-dependent enzyme involved in lignin degradation, has been cloned from Pleurotus eryngii. The AAO protein is composed of 593 amino acids, 27 of which form a signal peptide. It shows 33% sequence identity with glucose oxidase from Aspergillus niger and lower homology with other oxidoreductases. The predicted secondary structures of both enzymes are very similar. AAO is predicted to contain 13 putative alpha-helices and two major beta-sheets, each formed by six beta-strands. The ADP binding site and the signature-2 consensus sequence of the glucose-methanol-choline (GMC) oxidoreductases were also present. Moreover, residues potentially involved in catalysis and substrate binding were identified in the vicinity of the flavin ring. They include two histidines (H502 and H546) and several aromatic residues (Y78, Y92 and F501), as reported in other FAD oxidoreductases.

5.
Hong Y, Kang J, Lee D, van Rossum DB. PLoS ONE 2010, 5(10): e13596
A major computational challenge in the genomic era is annotating structure/function to the vast quantities of sequence information that are now available. This problem is illustrated by the fact that most proteins lack comprehensive annotations, even when experimental evidence exists. We previously theorized that embedded-alignment profiles (simply "alignment profiles" hereafter) provide a quantitative method that is capable of relating the structural and functional properties of proteins, as well as their evolutionary relationships. A key feature of alignment profiles lies in the interoperability of data format (e.g., alignment information, physicochemical information, genomic information, etc.). Indeed, we have demonstrated that the Position Specific Scoring Matrices (PSSMs) are an informative M-dimension that is scored by quantitatively measuring the embedded or unmodified sequence alignments. Moreover, the information obtained from these alignments is informative, and remains so even in the "twilight zone" of sequence similarity (<25% identity). Although our previous embedding strategy was powerful, it suffered from contaminating alignments (embedded AND unmodified) and high computational costs. Herein, we describe the logic and algorithmic process for a heuristic embedding strategy named "Adaptive GDDA-BLAST." Adaptive GDDA-BLAST is, on average, up to 19 times faster than our previous method but has similar sensitivity. Further, data are provided to demonstrate the benefits of embedded-alignment measurements in terms of detecting structural homology in highly divergent protein sequences and isolating secondary structural elements of transmembrane and ankyrin-repeat domains. Together, these advances allow further exploration of the embedded alignment data space within sufficiently large data sets to eventually induce relevant statistical inferences.
We show that sequence embedding could serve as one of the vehicles for measurement of low-identity alignments and for incorporation thereof into high-performance PSSM-based alignment profiles.

6.
The Smith-Waterman (SW) algorithm is a typical technique for local sequence alignment in computational biology. However, the SW algorithm does not consider the local behaviours of the amino acids, which may result in the loss of some useful information. Inspired by the success of the Markov Edit Distance (MED) method, this paper therefore proposes a novel Markov pairwise protein sequence alignment (MPPSA) method that takes local context dependencies into consideration. The numerical results show its superiority over the SW algorithm for pairwise protein sequence comparison.
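
For orientation, the baseline that this abstract compares against can be sketched as follows. This is a minimal illustration of the standard Smith-Waterman recurrence with a linear gap penalty, not the MPPSA method itself; the function name and the scoring values (`match`, `mismatch`, `gap`) are assumptions chosen for the example, and a real protein comparison would use a substitution matrix instead.

```python
def smith_waterman_score(a, b, match=2, mismatch=-1, gap=-2):
    """Return the best local alignment score of strings a and b.

    Assumed scoring (illustrative only): +match for identical symbols,
    mismatch for differing symbols, and a linear penalty `gap` per gap.
    """
    prev = [0] * (len(b) + 1)  # previous row of the DP matrix
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            # Score of aligning a[i-1] with b[j-1]; it does not depend on
            # neighbouring residues, i.e. there is no context dependence.
            sub = match if a[i - 1] == b[j - 1] else mismatch
            curr[j] = max(
                0,                  # start a new local alignment here
                prev[j - 1] + sub,  # extend diagonally
                prev[j] + gap,      # gap in b
                curr[j - 1] + gap,  # gap in a
            )
            best = max(best, curr[j])
        prev = curr
    return best
```

Each cell's substitution score depends only on the two symbols being aligned, which is the context independence that the abstract identifies as a limitation.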

7.
Protein libraries are essential to the field of protein engineering. Increasingly, probabilistic protein design is being used to synthesize combinatorial protein libraries, which allow the protein engineer to explore a vast space of amino acid sequences, while at the same time placing restrictions on the amino acid distributions. To this end, if site-specific amino acid probabilities are input as the target, then the codon nucleotide distributions that match this target distribution can be used to generate a partially randomized gene library. However, it turns out to be a highly nontrivial computational task to find the codon nucleotide distributions that exactly match a given target distribution of amino acids. We first showed that for any given target distribution an exact solution may not exist at all. Formulating the task as a constrained optimization problem, we then developed a genetic algorithm-based approach to find codon nucleotide distributions that match the target amino acid distribution as closely as possible. Compared with the previous gradient descent method on various objective functions, the new method consistently gave more optimized distributions as measured by the relative entropy between the calculated and the target distributions. To simulate the actual lab solutions, new objective functions were designed to allow for two separate sets of codons in seeking a better match to the target amino acid distribution.
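
The forward calculation that such an optimizer has to evaluate can be sketched as follows: given independent nucleotide probabilities at the three codon positions, compute the implied amino acid distribution and its relative entropy to a target. This is a hedged sketch, not the authors' code. It assumes the standard genetic code, independent codon positions, and the direction D(target || calculated) in bits; the abstract does not state which direction was used, and the function names are hypothetical.

```python
import math
from itertools import product

# Standard genetic code, codons enumerated in TCAG order ('*' = stop).
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}

def amino_acid_distribution(pos_probs):
    """pos_probs: three dicts (base -> probability), one per codon position.

    Returns a dict amino acid -> probability, assuming the three positions
    are randomized independently of one another.
    """
    dist = {}
    for codon, aa in CODON_TABLE.items():
        p = 1.0
        for base, probs in zip(codon, pos_probs):
            p *= probs.get(base, 0.0)
        if p > 0:
            dist[aa] = dist.get(aa, 0.0) + p
    return dist

def relative_entropy(target, calculated):
    """D(target || calculated) in bits (direction is an assumption here)."""
    total = 0.0
    for aa, t in target.items():
        if t == 0:
            continue
        c = calculated.get(aa, 0.0)
        if c == 0:
            return math.inf  # the library cannot produce a required residue
        total += t * math.log2(t / c)
    return total
```

For example, mixing A and G equally at the first position with T and G fixed at the second and third yields the codons ATG and GTG, so the library encodes Met and Val with probability 0.5 each. Because only three per-position distributions are free, many target amino acid distributions cannot be matched exactly, which is consistent with the non-existence result stated in the abstract.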

8.
Testing of the additivity-based protein sequence to reactivity algorithm   (Cited by: 1; self-citations: 0; citations by others: 1)
The standard free energies of association (or equilibrium constants) are predicted for 11 multiple variants of the turkey ovomucoid third domain, a member of the Kazal family of protein inhibitors, each interacting with six selected enzymes. The equilibrium constants for 38 of 66 possible interactions are strong enough to measure, and for these, the predicted and measured free energies are compared, thus providing an additional test of the additivity-based sequence to reactivity algorithm. The test appears to be unbiased as the 11 variants were designed a decade ago to study furin inhibition and the specificity of furin differs greatly from the specificities of our six target enzymes. As the contact regions of these inhibitors are highly positive, nonadditivity was expected. Of the 11 variants, one does not satisfy the restriction that either P(2) Thr or P(1)' Glu should be present and all three measurable results on it, as expected, are nonadditive. For the remaining 35 measurements, 22 are additive, 12 are partially additive, and only one is (slightly) nonadditive. These results are comparable to those obtained for a set of 398 equilibrium constants for natural variants of ovomucoid third domains. The expectation that clustering of charges would be nonadditive is modified to the expectation that major nonadditivity will be observed only if the combining sites in both associating proteins involve large charge clusters of the opposite sign. It is also shown here that an analysis of a small variant set can be accomplished with a smaller subset, in this case 13 variants, rather than the whole set of 191 members used for the complete algorithm.

9.
We have parallelized the FASTA algorithm for biological sequence comparison using Linda, a machine-independent parallel programming language. The resulting parallel program runs on a variety of different parallel machines. A straightforward parallelization strategy works well if the amount of computation to be done is relatively large. When the amount of computation is reduced, however, disk I/O becomes a bottleneck which may prevent additional speed-up as the number of processors is increased. The paper describes the parallelization of FASTA, and uses FASTA to illustrate the I/O bottleneck problem that may arise when performing parallel database search with a fast sequence comparison algorithm. The paper also describes several program design strategies that can help with this problem. The paper discusses how this bottleneck is an example of a general problem that may occur when parallelizing, or otherwise speeding up, a time-consuming computation. Received on July 25, 1990; accepted on October 15, 1990

10.
PISCES: a protein sequence culling server   (Cited by: 21; self-citations: 0; citations by others: 21)
PISCES is a public server for culling sets of protein sequences from the Protein Data Bank (PDB) by sequence identity and structural quality criteria. PISCES can provide lists culled from the entire PDB or from lists of PDB entries or chains provided by the user. The sequence identities are obtained from PSI-BLAST alignments with position-specific substitution matrices derived from the non-redundant protein sequence database. PISCES therefore provides better lists than servers that use BLAST, which is unable to identify many relationships below 40% sequence identity and often overestimates sequence identity by aligning only well-conserved fragments. PDB sequences are updated weekly. PISCES can also cull non-PDB sequences provided by the user as a list of GenBank identifiers, a FASTA format file, or BLAST/PSI-BLAST output.
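
To make "culling by sequence identity and structural quality" concrete, here is a minimal greedy sketch. It is not the PISCES algorithm, whose details the abstract does not give. It assumes that pairwise percent identities are already available, that lower resolution (in ångströms) means higher quality, and that a chain is kept only if it stays at or below the cutoff against every chain already kept. The function name, the chain identifiers in the usage note, and the default 25% cutoff are all hypothetical.

```python
def cull(chains, identity, max_identity=25.0):
    """Greedy culling sketch (hypothetical; not the PISCES procedure).

    chains:   list of (chain_id, resolution_in_angstrom) tuples
    identity: dict mapping frozenset({id_a, id_b}) -> percent identity;
              missing pairs are treated as unrelated (0%)
    """
    kept = []
    # Visit the best-resolved structures first so they win any conflict.
    for chain_id, _resolution in sorted(chains, key=lambda c: c[1]):
        if all(
            identity.get(frozenset((chain_id, other)), 0.0) <= max_identity
            for other in kept
        ):
            kept.append(chain_id)
    return kept
```

With three made-up chains `[("1abcA", 2.0), ("2xyzA", 1.5), ("3pqrB", 1.8)]`, where `1abcA` and `2xyzA` share 90% identity and all other pairs are below the cutoff, the sketch keeps `2xyzA` and `3pqrB` and drops the lower-quality near-duplicate `1abcA`.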

11.
Data Grid integrates geographically distributed resources for solving data sensitive scientific applications. Dynamic data replication algorithms are becoming increasingly valuable in solving large-scale, realistic, difficult problems, and selecting a replica under multiple selection criteria (availability, security and time) is one of these problems. The current algorithms offer neither balanced QoS levels nor a mechanism for rating QoS parameters. In this paper, we propose a new replica selection strategy, which is based on response time and security. However, replication should be used wisely because the storage size of each Data Grid site is limited. Thus, the site must keep only the important replicas. We also present a new replica replacement strategy based on the availability of the file, the last time the replica was requested, the number of accesses, and the size of the replica. We evaluate our algorithm using the OptorSim simulator and find that it offers better performance in comparison with other algorithms in terms of mean job execution time, effective network usage, SE usage, replication frequency, and hit ratio.

12.
An algorithm for multiple sequence comparison was implemented in FORTRAN 77 for VAX/VMS in GCG-compatible format. The MULTICOMP program package includes several procedures with which one query sequence can be compared simultaneously to several DNA, RNA or amino acid sequences. The same technique was also introduced for comparing propensities of secondary structural features, which can be predicted on the basis of amino acid sequences. The technique has been applied to a wide range of sequence and structural analyses.

13.
14.

Background  

Genomic sequence data cannot be fully appreciated in isolation. Comparative genomics, the practice of comparing genomic sequences from different species, plays an increasingly important role in understanding the genotypic differences between species that result in phenotypic differences as well as in revealing patterns of evolutionary relationships. One of the major challenges in comparative genomics is producing a high-quality alignment between two or more related genomic sequences. In recent years, a number of tools have been developed for aligning large genomic sequences. Most utilize heuristic strategies to identify a series of strong sequence similarities, which are then used as anchors to align the regions between the anchor points. The resulting alignment is globally correct, but in many cases is suboptimal locally. We describe a new program, GenAlignRefine, which improves the overall quality of global multiple alignments by using a genetic algorithm to improve local regions of alignment. Regions of low quality are identified, realigned using the program T-Coffee, and then refined using a genetic algorithm. Because a better COFFEE (Consistency based Objective Function For alignmEnt Evaluation) score generally reflects greater alignment quality, the algorithm searches for an alignment that yields a better COFFEE score. To offset the intrinsic slowness of the genetic algorithm, GenAlignRefine was implemented as a parallel, cluster-based program.

15.
A comparison of scoring functions for protein sequence profile alignment   (Cited by: 3; self-citations: 0; citations by others: 3)
MOTIVATION: In recent years, several methods have been proposed for aligning two protein sequence profiles, with reported improvements in alignment accuracy and homolog discrimination versus sequence-sequence methods (e.g. BLAST) and profile-sequence methods (e.g. PSI-BLAST). Profile-profile alignment is also the iterated step in progressive multiple sequence alignment algorithms such as CLUSTALW. However, little is known about the relative performance of different profile-profile scoring functions. In this work, we evaluate the alignment accuracy of 23 different profile-profile scoring functions by comparing alignments of 488 pairs of sequences with identity ≤30% against structural alignments. We optimize parameters for all scoring functions on the same training set and use profiles of alignments from both PSI-BLAST and SAM-T99. Structural alignments are constructed from a consensus between the FSSP database and CE structural aligner. We compare the results with sequence-sequence and sequence-profile methods, including BLAST and PSI-BLAST. RESULTS: We find that profile-profile alignment gives an average improvement over our test set of typically 2-3% over profile-sequence alignment and approximately 40% over sequence-sequence alignment. No statistically significant difference is seen in the relative performance of most of the scoring functions tested. Significantly better results are obtained with profiles constructed from SAM-T99 alignments than from PSI-BLAST alignments. AVAILABILITY: Source code, reference alignments and more detailed results are freely available at http://phylogenomics.berkeley.edu/profilealignment/

16.
Dai Q, Liu X, Yao Y, Zhao F. Amino Acids 2012, 42(5): 1867-1877
There are two crucial problems with statistical measures for sequence comparison: overlapping structures and background information of words in biological sequences. Word normalization in the improved composition vector method took these problems into account and achieved better performance in evolutionary analysis. This word normalization is desirable, but not sufficient, because it assumes that the four bases A, C, T, and G occur randomly with equal chance. This paper proposes an improved word normalization that uses a Markov model to estimate the exact k-word distribution from the observed biological sequence and can thus adjust for the background information of the k-word frequencies in biological sequences. The improved word normalization was tested in three experiments and compared with the existing word normalization. The experimental results confirm that the improved word normalization, which uses a Markov model to estimate the exact k-word distribution in biological sequences, is more efficient.
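
The difference between an equal-chance background and a Markov background can be sketched as follows. The abstract does not specify the order of the Markov model or the exact estimator, so this example assumes a first-order chain whose start and transition probabilities are estimated from the observed sequence; the function name is hypothetical and the sequence is assumed to be non-empty.

```python
from collections import Counter

def markov_word_probability(seq, word):
    """Background probability of `word` under a first-order Markov model.

    Assumption (not taken from the paper): P(w) = P(w1) * prod P(w[i+1] | w[i]),
    with all probabilities estimated from `seq`, which must be non-empty.
    """
    base = Counter(seq)                 # single-symbol counts
    pairs = Counter(zip(seq, seq[1:]))  # adjacent-pair counts
    starts = Counter(seq[:-1])          # symbols that have a successor
    p = base[word[0]] / len(seq)
    for x, y in zip(word, word[1:]):
        if starts[x] == 0:
            return 0.0  # x is never followed by anything in seq
        p *= pairs[(x, y)] / starts[x]
    return p
```

For the strictly alternating sequence `ACACACAC`, this model assigns the word `ACA` a background probability of 0.5, whereas the equal-chance assumption gives 0.25³ ≈ 0.016, which illustrates why a sequence-specific background changes the normalization.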

17.
18.
Multiple sequence alignment by a pairwise algorithm   (Cited by: 1; self-citations: 0; citations by others: 1)
An algorithm is described that processes the results of a conventional pairwise sequence alignment program to automatically produce an unambiguous multiple alignment of many sequences. Unlike other, more complex, multiple alignment programs, the method described here is fast enough to be used on almost any multiple sequence alignment problem. Received on September 25, 1986; accepted on January 29, 1987

19.
Keratin proteins synthesized by dorsal or tarsometatarsal embryonic chick epidermis in heterotopic and heterospecific epidermal-dermal recombinants were analyzed by polyacrylamide gel electrophoresis and were compared to those produced by normal nondissociated dorsal and tarsometatarsal embryonic skin, as well as to those produced by control homotopic recombinants. Recombinant skins were grafted on the chick chorioallantoic membrane and grown for 8 or 11 days. Recombinants comprising dorsal feather-forming dermis formed feathers, irrespective of the origin of the epidermis. The electrophoretic band patterns of the keratins extracted from these feathers were of typical feather type. Conversely, recombinants comprising tarsometatarsal scale-forming dermis formed scales, irrespective of the origin of the epidermis. The band patterns of the keratins extracted from the epidermis of these scales were of typical scale type. Heterospecific recombinants comprising chick dorsal feather-forming epidermis and mouse plantar dermis gave rise to six footpads arranged in a typical mouse pattern. In these recombinants, the chick epidermis produced keratins, the band pattern of which was of typical chick scale type. These results demonstrate that the dermis not only induces the formation of cutaneous appendages in conformity with its regional origin, but also triggers in the epidermis the biosynthesis of either of two different keratin types, in accordance with the regional type (feather, scale, or pad) of cutaneous appendages induced. The possible relationship between region-specific morphogenesis and cytodifferentiation is discussed in comparison with results obtained in other kinds of epithelial-mesenchymal interactions.

20.
A new approach to sequence comparison: normalized sequence alignment   (Cited by: 3; self-citations: 0; citations by others: 3)
The Smith-Waterman algorithm for local sequence alignment is one of the most important techniques in computational molecular biology. This ingenious dynamic programming approach was designed to reveal the highly conserved fragments by discarding poorly conserved initial and terminal segments. However, the existing notion of local similarity has a serious flaw: it does not discard poorly conserved intermediate segments. The Smith-Waterman algorithm finds the local alignment with maximal score but it is unable to find the local alignment with the maximum degree of similarity (e.g. maximal percent of matches). Moreover, there is still no efficient algorithm that answers the following natural question: do two sequences share a (sufficiently long) fragment with more than 70% similarity? As a result, the local alignment sometimes produces a mosaic of well-conserved fragments artificially connected by poorly conserved or even unrelated fragments. This may lead to problems in comparison of long genomic sequences and comparative gene prediction, as recently pointed out by Zhang et al. (Bioinformatics, 15, 1012-1019, 1999). In this paper we propose a new sequence comparison algorithm (normalized local alignment) that reports the regions with maximum degree of similarity. The algorithm is based on fractional programming and its running time is O(n^2 log n). In practice, normalized local alignment is only 3-5 times slower than the standard Smith-Waterman algorithm.
