Similar literature
 20 similar documents found (search time: 31 ms)
1.
We use methods from Data Mining and Knowledge Discovery to design an algorithm for detecting motifs in protein sequences. The algorithm assumes that a motif is constituted by the presence of a "good" combination of residues in appropriate locations of the motif. The algorithm attempts to compile such good combinations into a "pattern dictionary" by processing an aligned training set of protein sequences. The dictionary is subsequently used to detect motifs in new protein sequences. Statistical significance of the detection results is ensured by statistically determining the various parameters of the algorithm. Based on this approach, we have implemented a program called GYM. The Helix-Turn-Helix motif was used as a model system on which to test our program. The program was also extended to detect Homeodomain motifs. The detection results for the two motifs compare favorably with existing programs. In addition, the GYM program provides a lot of useful information about a given protein sequence.

2.
A new program called GAMMA (genetic algorithm for multiple molecule alignment) has been developed for the superimposition of several three-dimensional chemical structures. Superimposition of molecules and evaluation of structural similarity is an important task in drug design and pharmaceutical research. This program determines similarities of compounds based on either their structural or their physicochemical properties by defining different matching criteria. These matching criteria are atomic properties such as atomic number or partial atomic charges. The program is based on a combination of a genetic algorithm with a numerical optimization process. A major goal of this hybrid procedure is to address the conformational flexibility of ligand molecules adequately. Thus, only one conformation per structure is necessary and the program can work even when only one conformation of a compound is stored in a database. The genetic algorithm optimizes in a nondeterministic process the size and the geometric fit of the overlay. The geometric fit of the conformations is further improved by changing torsional angles, combining the genetic algorithm and the directed tweak method. The determination of the fitness of a superimposition is based on Pareto optimization. As an application, the superimposition of a set of Cytochrome P450c17 enzyme inhibitors has been performed. Electronic Supplementary Material available.

3.
In this paper we present a branch and bound algorithm for local gapless multiple sequence alignment (motif alignment) and its implementation. The algorithm uses both score-based bounding and a novel bounding technique based on the "consistency" of the alignment. A sequence order independent search tree is used in conjunction with a technique for avoiding redundant calculations inherent in the structure of the tree. This is the first program to exploit the fact that the motif alignment problem is easier for short motifs. Indeed, for a short fixed motif width, the running time of the algorithm is asymptotically linear in the size of the input. We tested the performance of the program on a dataset of 300 E. coli promoter sequences and a dataset of 85 lipocalin protein sequences. For a motif width of 4, the optimal alignment of the entire set of sequences can be found. For the more natural motif width of 6, the program can align 21 sequences of length 100, more than twice the number of sequences which can be aligned by the best previous exact algorithm. The algorithm can relax the constraint of requiring each sequence to be aligned, and align 105 of the 300 promoter sequences with a motif width of 6. For the lipocalin dataset, we introduce a technique for reducing the effective alphabet size with a minimal loss of useful information. With this technique, we show that the program can find meaningful motifs in a reasonable amount of time by optimizing the score over three motif positions.

4.
We have developed a program for the fast and accurate detection of spontaneous synaptic events. The algorithm identifies each event whose slope and amplitude meet the detection criteria. The significant feature of this algorithm is its stepwise and exploratory search for the onset and the peak points. During the first step, the program employing the algorithm makes a rough estimate of the candidate for a synaptic event, and determines a 'temporary' onset data point. The next step is the detection of the true onset data point and a 'temporary' peak data point, which probably lie several points after the temporary onset data point. The third step is a backward search to detect the true peak data point. The final step is to check whether the amplitude of the detected event exceeds the threshold. This stepwise and shuttlewise search allows for the accurate detection of the peak points. Using this program, we succeeded in detecting an increased frequency and amplitude of spontaneous excitatory postsynaptic currents in chick cerebral neurons following the application of 12-O-tetradecanoyl-phorbol-13-acetate (TPA). In addition, we demonstrated that the program employing the algorithm could be used for the detection of extracellular action potentials.

5.
Multiple alignment is an important problem in computational biology. It is well known that it can be solved exactly by a dynamic programming algorithm which in turn can be interpreted as a shortest path computation in a directed acyclic graph. The A* algorithm (or goal-directed unidirectional search) is a technique that speeds up the computation of a shortest path by transforming the edge lengths without losing the optimality of the shortest path. We implemented the A* algorithm in a computer program similar to MSA (Gupta et al., 1995) and FMA (Shibuya and Imai, 1997). We incorporated in this program new bounding strategies for both lower and upper bounds and show that the A* algorithm, together with our improvements, can speed up computations considerably. Additionally, we show that the A* algorithm together with a standard bounding technique is superior to the well-known Carrillo-Lipman bounding since it excludes more nodes from consideration.
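The shortest-path view of alignment described in this abstract can be illustrated with a much smaller problem. The sketch below is not the authors' program: it assumes a pairwise alignment with unit mismatch and gap costs (edit distance), and the function name `astar_edit_distance` and the length-difference heuristic are illustrative choices. It only shows how a lower bound on the remaining cost lets A* skip grid nodes that plain dynamic programming would fill in.

```python
import heapq


def astar_edit_distance(a, b):
    """Return (cost, expanded_nodes) for aligning a and b with unit costs.

    Nodes are grid cells (i, j); a diagonal edge costs 0 for a match and 1
    for a mismatch, and a horizontal or vertical edge (a gap) costs 1.
    """
    n, m = len(a), len(b)

    def lower_bound(i, j):
        # At least this many gaps are needed to reach (n, m) from (i, j).
        return abs((n - i) - (m - j))

    start, goal = (0, 0), (n, m)
    best = {start: 0}
    heap = [(lower_bound(0, 0), 0, start)]
    expanded = 0
    while heap:
        _, g, (i, j) = heapq.heappop(heap)
        if g > best.get((i, j), float("inf")):
            continue  # stale queue entry
        expanded += 1
        if (i, j) == goal:
            return g, expanded
        steps = []
        if i < n and j < m:
            steps.append((i + 1, j + 1, 0 if a[i] == b[j] else 1))
        if i < n:
            steps.append((i + 1, j, 1))
        if j < m:
            steps.append((i, j + 1, 1))
        for ni, nj, cost in steps:
            ng = g + cost
            if ng < best.get((ni, nj), float("inf")):
                best[(ni, nj)] = ng
                heapq.heappush(heap, (ng + lower_bound(ni, nj), ng, (ni, nj)))
    raise ValueError("goal not reachable")


cost, expanded = astar_edit_distance("kitten", "sitting")
print(cost, expanded)
```

Because the bound never overestimates and changes by at most the edge cost along any edge, the first time the goal is popped its cost is optimal. The number of expanded nodes never exceeds the full grid size that dynamic programming would visit.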

6.
ProteoCat is a computer program that has been designed to help researchers in the planning of large-scale proteomic experiments. The central part of this program is the hydrolysis simulation unit, which supports 4 proteases (trypsin, lysine C, endoproteinases Asp-N and GluC). For peptides obtained after virtual hydrolysis or loaded from data files, a number of properties important in mass-spectrometric experiments can be calculated and predicted; the resultant data can be analyzed or filtered (to reduce a set of peptides). The program uses new and improved modifications of our own previously developed methods for pI prediction; pI can also be predicted by means of popular pKa scales proposed by other researchers. The algorithm for prediction of peptide retention time has been implemented similarly to the algorithm used in the SSRCalc program. Using ProteoCat it is possible to estimate the coverage of amino acid sequences of analyzed proteins under defined limits on peptide detection, as well as the possibility of assembly of peptide fragments with user-defined minimal sizes of “sticky” ends. The program has a graphical user interface, is written in Java, and is available at http://www.ibmc.msk.ru/LPCIT/ProteoCat.
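A virtual hydrolysis step of the kind mentioned above can be sketched in a few lines. This is not ProteoCat's code, and the abstract does not state which cleavage rule it uses. The sketch assumes the commonly used trypsin rule (cleave after K or R unless the next residue is P), and the function name `trypsin_digest` and the `missed_cleavages` parameter are hypothetical.

```python
def trypsin_digest(sequence, missed_cleavages=0):
    """Return peptides from an in-silico tryptic digest of a protein sequence.

    Assumed rule: cleave C-terminal to K or R, except when followed by P.
    Peptides spanning up to `missed_cleavages` skipped sites are included.
    """
    if not sequence:
        return []
    # Index of the first residue after each cleavage site.
    sites = [
        i + 1
        for i in range(len(sequence) - 1)
        if sequence[i] in "KR" and sequence[i + 1] != "P"
    ]
    bounds = [0] + sites + [len(sequence)]
    peptides = []
    for start in range(len(bounds) - 1):
        for extra in range(missed_cleavages + 1):
            end = start + 1 + extra
            if end < len(bounds):
                peptides.append(sequence[bounds[start]:bounds[end]])
    return peptides


print(trypsin_digest("AAKRPGGRCC"))
print(trypsin_digest("AAKRPGGRCC", missed_cleavages=1))
```

The resulting peptide list is the kind of set on which properties such as pI or retention time would then be predicted and filtered.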

7.
A greedy algorithm for aligning DNA sequences.   Total citations: 39, self-citations: 0, citations by others: 39
For aligning DNA sequences that differ only by sequencing errors, or by equivalent errors from other sources, a greedy algorithm can be much faster than traditional dynamic programming approaches and yet produce an alignment that is guaranteed to be theoretically optimal. We introduce a new greedy alignment algorithm with particularly good performance and show that it computes the same alignment as does a certain dynamic programming algorithm, while executing over 10 times faster on appropriate data. An implementation of this algorithm is currently used in a program that assembles the UniGene database at the National Center for Biotechnology Information.

8.
A molecular modeling program is presented which has been written for the Microsoft Windows 3.1 and Windows NT operating systems. The program permits interactive molecular manipulation and also provides analytical tools such as energy computations and solvent accessible surfaces. An extremely fast algorithm is used which generates realistic space-filling CPK images in addition to wire frame, ribbons, MIDAS, labels, and points. An important feature of this algorithm is a highly optimized Z-buffer, which is described.

9.
Sequence analysis is the basis of bioinformatics, while sequence alignment is a fundamental task for sequence analysis. The widely used alignment algorithm, Dynamic Programming, though generating optimal alignment, takes too much time due to its high computational complexity, O(N^2). In order to reduce computational complexity without sacrificing too much accuracy, we have developed a new approach to align two homologous sequences. The new approach presented here, adopting our novel algorithm which combines the methods of probabilistic and combinatorial analysis, reduces the computational complexity to as low as O(N). Our program runs at least 15 times faster than traditional pairwise alignment algorithms without much loss of accuracy. We hence named the algorithm Super Pairwise Alignment (SPA). The pairwise alignment execution program based on SPA and the detailed results of the aligned sequences discussed in this article are available upon request.
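The O(N^2) baseline that this abstract compares against is the standard dynamic programming recurrence for global pairwise alignment. The sketch below shows that baseline only, not SPA itself. The scoring values (match +1, mismatch -1, gap -1) and the function name `global_alignment_score` are illustrative assumptions.

```python
def global_alignment_score(a, b, match=1, mismatch=-1, gap=-1):
    """Optimal global alignment score by dynamic programming.

    Every cell of the (len(a)+1) x (len(b)+1) table is computed once, so the
    running time is quadratic for sequences of similar length. Only two rows
    are kept in memory because the score alone is returned.
    """
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * gap]
        for j in range(1, len(b) + 1):
            diagonal = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            curr.append(max(diagonal, prev[j] + gap, curr[j - 1] + gap))
        prev = curr
    return prev[-1]


print(global_alignment_score("GATTACA", "GATTACA"))
print(global_alignment_score("ACGT", "AGT"))
```

The nested loops make the quadratic cost visible: doubling both sequence lengths quadruples the number of cells, which is the cost that linear-time heuristics aim to avoid.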

10.
The functional annotation of new protein sequences represents a major challenge for genomic science. The best way to suggest the function of a protein from its sequence is by finding a related one for which biological information is available. Current alignment algorithms display a list of protein sequence stretches presenting significant similarity to different protein targets, ordered by their respective mathematical scores. However, statistical and biological significance do not always coincide; therefore, rearranging the program output according to characteristics that are more biological than the mathematical scoring would help functional annotation. A new method that predicts the putative function of a protein by integrating the results from the PSI-BLAST program and a fuzzy logic algorithm is described. Several protein sequence characteristics have been checked for their ability to rearrange a PSI-BLAST profile more in accordance with biological function. Four of them (amino acid content, matched segment length, and hydropathic and flexibility profiles) contributed positively, upon being integrated by a fuzzy logic algorithm into a program, BYPASS, to the accurate prediction of the function of a protein from its sequence. Antonio Gómez and Juan Cedano contributed equally to this work.

11.
A novel program has been developed for the interpretation of 15N relaxation rates in terms of macromolecular anisotropic rotational diffusion. The program is based on a highly efficient simulated annealing/minimization algorithm, designed specifically to search the parametric space described by the isotropic, axially symmetric and fully anisotropic rotational diffusion tensor models. The high efficiency of this algorithm allows extensive noise-based Monte Carlo error analysis. Relevant statistical tests are systematically applied to provide confidence limits for the proposed tensorial models. The program is illustrated here using the example of the cytochrome c from Rhodobacter capsulatus, a four-helix bundle heme protein, for which data at three different field strengths were independently analysed and compared.

12.
The development of a program for the identification of a model including up to 15 compartments is presented. The identification of the model parameters with this program package is based upon the improved Gauss-Marquardt algorithm. This program, implemented on a microcomputer (Data General Eclipse, 64 K RAM), calculates and automatically generates a partial-derivative routine. Thus, once the differential equations of the model are written correctly, there is no longer any risk of error.

13.
MOTIVATION: Selecting SNP markers for genome-wide association studies is an important and challenging task. The goal is to minimize the number of markers selected for genotyping in a particular platform and therefore reduce genotyping cost while simultaneously maximizing the information content provided by selected markers. RESULTS: We devised an improved algorithm for tagSNP selection using the pairwise r^2 criterion. We first break down large marker sets into disjoint pieces, where more exhaustive searches can replace the greedy algorithm for tagSNP selection. These exhaustive searches lead to smaller tagSNP sets being generated. In addition, our method evaluates multiple solutions that are equivalent according to the linkage disequilibrium criteria to accommodate additional constraints. Its performance was assessed using HapMap data. AVAILABILITY: A computer program named FESTA has been developed based on this algorithm. The program is freely available and can be downloaded at http://www.sph.umich.edu/csg/qin/FESTA/
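The divide-then-search idea in this abstract can be sketched as follows. This is not FESTA's implementation. It assumes that two SNPs are linked when their pairwise r^2 meets a threshold, that disjoint pieces are the connected components of that link graph, and that a tag set must leave every SNP either selected or linked to a selected SNP. The names `select_tags`, `exhaustive_tags`, `greedy_tags` and the `exhaustive_limit` parameter are hypothetical.

```python
from itertools import combinations


def greedy_tags(component, adj):
    """Repeatedly pick the SNP that covers the most still-uncovered SNPs."""
    uncovered, tags = set(component), []
    while uncovered:
        best = max(sorted(component), key=lambda s: len(({s} | adj[s]) & uncovered))
        tags.append(best)
        uncovered -= {best} | adj[best]
    return tags


def exhaustive_tags(component, adj):
    """Try subsets in order of size and return the first that covers everything."""
    members = sorted(component)
    for size in range(1, len(members) + 1):
        for subset in combinations(members, size):
            covered = set(subset).union(*(adj[s] for s in subset))
            if covered >= component:
                return list(subset)
    return members


def select_tags(snps, r2, threshold=0.8, exhaustive_limit=12):
    """Split SNPs into disjoint linked pieces and pick tags piece by piece."""
    adj = {s: set() for s in snps}
    for (a, b), value in r2.items():
        if value >= threshold:
            adj[a].add(b)
            adj[b].add(a)
    tags, seen = [], set()
    for snp in snps:
        if snp in seen:
            continue
        component, stack = set(), [snp]
        while stack:  # depth-first search for one connected component
            node = stack.pop()
            if node not in component:
                component.add(node)
                stack.extend(adj[node] - component)
        seen |= component
        small = len(component) <= exhaustive_limit
        tags.extend((exhaustive_tags if small else greedy_tags)(component, adj))
    return tags


r2 = {("A", "B"): 0.9, ("B", "C"): 0.85, ("A", "C"): 0.3}
print(select_tags(["A", "B", "C", "D"], r2))
```

Exhaustive search grows exponentially with piece size, which is why it is only applied to small components here, with the greedy rule as the fallback for large ones.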

14.
A calculator program that performs a nonlinear least-squares fit to data conforming to the one-compartment model with zero-order input is described. The program, which is designed for the Hewlett-Packard HP-41 CV calculator, is based on the Gauss-Newton iterative algorithm as modified by Hartley. A subroutine for calculation of initial parameter estimates is incorporated into the program. Plasma concentration data relative to a single oral dose of a sustained-release theophylline formulation are used to demonstrate the practical application of the program.

15.
MULTAN: a program to align multiple DNA sequences.   Total citations: 4, self-citations: 4, citations by others: 0
I describe a computer program which can align a large number of nucleic acid sequences with one another. The program uses a heuristic, iterative algorithm which has been tested extensively, and is found to produce useful alignments of a variety of sequence families. The algorithm is fast enough to be practical for the analysis of large numbers of sequences, and is implemented in a program which contains a variety of other functions to facilitate the analysis of the aligned result.

16.
ABSTRACT: BACKGROUND: Linkage analysis is the first step in the search for a disease gene. Linkage studies have facilitated the identification of several hundred human genes that can harbor mutations leading to a disease phenotype. In this paper, we study a very important case, where the sampled individuals are closely related, but the pedigree is not given. This situation happens very often when the individuals share a common ancestor 6 or more generations ago. To our knowledge, no algorithm can give good results for this case. RESULTS: To solve this problem, we first developed some heuristic algorithms for haplotype inference without any given pedigree. We propose a model using the parsimony principle that can be viewed as an extension of the model first proposed by Dan Gusfield. Our heuristic algorithm uses Clark's inference rule to infer haplotype segments. CONCLUSIONS: We ran our program both on the simulated data and a set of real data from the phase II HapMap database. Experiments show that our program performs well. The recall value is from 90% to 99% in various cases. This implies that the program can report more than 90% of the true mutation regions. The value of precision varies from 29% to 90%. When the precision is 29%, the size of the reported regions is three times that of the true mutation region. This is still very useful for narrowing down the range of the disease gene location. Our program can complete the computation for all the tested cases, where there are about 110,000 SNPs on a chromosome, within 20 seconds.

17.
We present a model-based parallel algorithm for origin and orientation refinement for 3D reconstruction in cryoTEM. The algorithm is based upon the Projection Theorem of the Fourier Transform. Rather than projecting the current 3D model and searching for the best match between an experimental view and the calculated projections, the algorithm computes the Discrete Fourier Transform (DFT) of each projection and searches for the central section ("cut") of the 3D DFT that best matches the DFT of the projection. Factors that affect the efficiency of a parallel program are first reviewed and then the performance and limitations of the proposed algorithm are discussed. The parallel program that implements this algorithm, called PO(2)R, has been used for the refinement of several virus structures, including those of the 500 Angstroms diameter dengue virus (to 9.5 Angstroms resolution), the 850 Angstroms mammalian reovirus (to better than 7 Angstroms), and the 1800 Angstroms Paramecium bursaria chlorella virus (to 15 Angstroms).
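The Projection Theorem that this abstract relies on can be checked numerically in a lower-dimensional analogue. The sketch below is not the PO(2)R code. It assumes a small 2D image instead of a 3D map, projects it along one axis, and compares the 1D DFT of the projection with the matching central line of the 2D DFT. The helper names `dft1`, `dft2` and `project_rows` are illustrative, and the naive DFT is written out only to keep the example dependency-free.

```python
import cmath


def dft1(values):
    """Naive 1D discrete Fourier transform."""
    n = len(values)
    return [
        sum(values[k] * cmath.exp(-2j * cmath.pi * u * k / n) for k in range(n))
        for u in range(n)
    ]


def dft2(image):
    """Naive 2D DFT: transform every row, then every column."""
    n_rows, n_cols = len(image), len(image[0])
    rows = [dft1(row) for row in image]
    cols = [dft1([rows[r][c] for r in range(n_rows)]) for c in range(n_cols)]
    return [[cols[c][r] for c in range(n_cols)] for r in range(n_rows)]


def project_rows(image):
    """Project the image onto the column axis by summing over rows."""
    return [sum(column) for column in zip(*image)]


image = [[1.0, 2.0, 0.5], [4.0, 0.0, 3.0], [2.5, 1.0, 1.0], [0.0, 6.0, 2.0]]
central_line = dft2(image)[0]              # zero-frequency row of the 2D DFT
projection_dft = dft1(project_rows(image))
max_error = max(abs(p - c) for p, c in zip(projection_dft, central_line))
print(max_error < 1e-9)
```

In three dimensions the same identity relates the 2D DFT of a projection to a central plane of the 3D DFT, and the orientation of that plane corresponds to the viewing direction being refined.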

18.
Incomplete lineage sorting can cause incongruence between the phylogenetic history of genes (the gene tree) and that of the species (the species tree), which can complicate the inference of phylogenies. In this article, I present a new coalescent-based algorithm for species tree inference with maximum likelihood. I first describe an improved method for computing the probability of a gene tree topology given a species tree, which is much faster than an existing algorithm by Degnan and Salter (2005). Based on this method, I develop a practical algorithm that takes a set of gene tree topologies and infers species trees with maximum likelihood. This algorithm searches for the best species tree by starting from initial species trees and performing heuristic search to obtain better trees with higher likelihood. This algorithm, called STELLS (which stands for Species Tree InfErence with Likelihood for Lineage Sorting), has been implemented in a program that is downloadable from the author's web page. The simulation results show that the STELLS algorithm is more accurate than an existing maximum likelihood method for many datasets, especially when there is noise in gene trees. I also show that the STELLS algorithm is efficient and can be applied to real biological datasets.

19.
A computer algorithm has been developed which identifies tRNA genes and tRNA-like structures in DNA sequences. The program searches the sequence string for specific base positions that correspond to the invariant and semi-invariant bases found in tRNAs. The tRNA nature of the sequence is confirmed by the presence of complementary base pairing at the tRNA's calculated 5' and 3' ends (which in situ constitutes the amino-acyl stem region). The program achieves greater than 96% accuracy when run against known tRNA sequences in the GenBank database. The program is modular and is readily modified to allow searching either a file or database. The program is written in C and operates on a D.E.C. Vax 750. The utility of the algorithm is demonstrated by the identification of a distinctive tRNA structure in an intron of a published bovine hemoglobin gene.
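The confirmation step described above, complementary pairing between the two ends of a candidate, can be sketched as a simple count. This is not the published program. It assumes a 7-base-pair stem, one unpaired base at the 3' end, DNA letters, and optional G-T wobble pairs. The function name `acceptor_stem_pairs` and all its parameters are hypothetical.

```python
WATSON_CRICK = {("A", "T"), ("T", "A"), ("G", "C"), ("C", "G")}
WOBBLE = {("G", "T"), ("T", "G")}


def acceptor_stem_pairs(candidate, stem_len=7, unpaired_tail=1, allow_wobble=True):
    """Count complementary pairs between the 5' and 3' ends of a candidate.

    Base i from the 5' end is compared with the base i positions before the
    last paired base at the 3' end, i.e. the two ends are read antiparallel.
    """
    seq = candidate.upper()
    if len(seq) < 2 * stem_len + unpaired_tail:
        raise ValueError("candidate is too short for the requested stem")
    allowed = WATSON_CRICK | WOBBLE if allow_wobble else WATSON_CRICK
    last_paired = len(seq) - 1 - unpaired_tail
    return sum((seq[i], seq[last_paired - i]) in allowed for i in range(stem_len))


# Toy candidate: 7 paired bases at each end, an arbitrary middle, 1 unpaired base.
candidate = "GCGGATT" + "N" * 10 + "AATCCGC" + "A"
print(acceptor_stem_pairs(candidate))
```

A scanner could accept a candidate only when the count reaches some minimum, for example 6 of 7, although the threshold used by the original program is not stated in the abstract.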

20.
A pseudo-random generator is an algorithm that generates a sequence of objects which is determined entirely by a truly random seed and is therefore not itself truly random. Such generators have been widely used in many applications, such as cryptography and simulations. In this article, we examine current popular machine learning algorithms with various on-line algorithms for pseudo-random generated data in order to find out which machine learning approach is more suitable for this kind of data for prediction based on on-line algorithms. To further improve the prediction performance, we propose a novel sample weighted algorithm that takes generalization errors in each iteration into account. We perform intensive evaluation on real Baccarat data generated by Casino machines and random numbers generated by a popular Java program, which are two typical examples of pseudo-random generated data. The experimental results show that support vector machine and k-nearest neighbors have better performance than others with and without the sample weighted algorithm in the evaluation data set.
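The defining property in the first sentence, a sequence fixed entirely by its seed, can be seen in a minimal linear congruential generator. This sketch is not the generator studied in the article. The class name `LinearCongruentialGenerator` and the constants are illustrative assumptions chosen only to make the example concrete.

```python
class LinearCongruentialGenerator:
    """Minimal pseudo-random generator: state -> (a * state + c) mod m."""

    def __init__(self, seed, a=1103515245, c=12345, m=2**31):
        # The constants are illustrative; real generators differ in their choices.
        self.a, self.c, self.m = a, c, m
        self.state = seed % m

    def next_int(self):
        """Advance the state and return it as the next pseudo-random integer."""
        self.state = (self.a * self.state + self.c) % self.m
        return self.state

    def sample(self, count):
        """Return the next `count` integers."""
        return [self.next_int() for _ in range(count)]


first = LinearCongruentialGenerator(seed=42).sample(5)
second = LinearCongruentialGenerator(seed=42).sample(5)
other = LinearCongruentialGenerator(seed=43).sample(5)
print(first == second, first == other)
```

Because each output depends only on the previous state, a learner that recovers the update rule can predict the sequence exactly, which is the structure that prediction methods for pseudo-random data try to exploit.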
