Similar literature
 Found 20 similar articles; search took 515 ms
1.
2.
GeneRAGE: a robust algorithm for sequence clustering and domain detection   Total citations: 9, self-citations: 0, citations by others: 9
MOTIVATION: Efficient, accurate and automatic clustering of large protein sequence datasets, such as complete proteomes, into families, according to sequence similarity. Detection and correction of false positive and negative relationships with subsequent detection and resolution of multi-domain proteins. RESULTS: A new algorithm for the automatic clustering of protein sequence datasets has been developed. This algorithm represents all similarity relationships within the dataset in a binary matrix. Removal of false positives is achieved through subsequent symmetrification of the matrix using a Smith-Waterman dynamic programming alignment algorithm. Detection of multi-domain protein families and further false positive relationships within the symmetrical matrix is achieved through iterative processing of matrix elements with successive rounds of Smith-Waterman dynamic programming alignments. Recursive single-linkage clustering of the corrected matrix allows efficient and accurate family representation for each protein in the dataset. Initial clusters containing multi-domain families are split into their constituent clusters using the information obtained by the multi-domain detection step. This algorithm can hence quickly and accurately cluster large protein datasets into families. Problems due to the presence of multi-domain proteins are minimized, allowing more precise clustering information to be obtained automatically. AVAILABILITY: GeneRAGE (version 1.0) executable binaries for most platforms may be obtained from the authors on request. The system is available to academic users free of charge under license.
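The matrix steps in this abstract can be illustrated with a minimal Python sketch. It is not the GeneRAGE implementation: it assumes that one-sided hits in the binary matrix are the candidates for re-checking, it uses a hypothetical `confirm` callback as a stand-in for the Smith-Waterman alignment, and it treats single-linkage clustering as finding connected components. Multi-domain detection is left out.

```python
def symmetrify(hits, confirm):
    """Return a symmetric 0/1 matrix.

    Mutual hits are kept. One-sided hits are kept only if confirm(i, j)
    accepts them (assumed stand-in for a Smith-Waterman re-check).
    """
    n = len(hits)
    sym = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            if hits[i][j] and hits[j][i]:
                keep = True
            elif hits[i][j] or hits[j][i]:
                keep = confirm(i, j)
            else:
                keep = False
            sym[i][j] = sym[j][i] = int(keep)
    return sym


def single_linkage(sym):
    """Group indices into connected components of the symmetric matrix."""
    n = len(sym)
    seen = set()
    clusters = []
    for start in range(n):
        if start in seen:
            continue
        stack, members = [start], []
        seen.add(start)
        while stack:
            node = stack.pop()
            members.append(node)
            for other in range(n):
                if sym[node][other] and other not in seen:
                    seen.add(other)
                    stack.append(other)
        clusters.append(sorted(members))
    return clusters


# Toy hit matrix (made up): 0 and 1 hit each other, 1 -> 2 is one-sided.
hits = [
    [0, 1, 0, 0],
    [1, 0, 1, 0],
    [0, 0, 0, 0],
    [0, 0, 0, 0],
]
# Hypothetical re-check that rejects every one-sided hit.
clusters = single_linkage(symmetrify(hits, confirm=lambda i, j: False))
print(clusters)
```

With the rejecting callback the one-sided relationship is dropped, so proteins 0 and 1 form one family and 2 and 3 remain singletons.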

3.
A phenetic classification based on overall morphological similarity between the species in the family Plectonemertidae (genera Plectonemertes, Campbellonemertes, Potamonemertes, Leptonemertes, Katechonemertes, Argonemertes, Anliponemertes, and Acteonemertes) was undertaken and the result compared with a cladistic and an evolutionary classification. Similarity between species was computed by Gower's general coefficient of similarity and various techniques were used to find patterns in the similarity matrix: single-linkage, average-linkage, and complete-linkage clustering, together with principal coordinate analysis. Although the explicit aim of phenetics is not to estimate the phylogeny, the classification based on overall similarity still portrays phylogeny better than an intuitive assessment of morphological similarity, as judged by the cladistic analysis. The classification does not support the previously proposed hypothesis that the two freshwater genera Campbellonemertes and Potamonemertes have descended from a terrestrial ancestor.
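As a rough illustration of Gower's general coefficient for mixed characters, the following Python sketch scores numeric characters as one minus the range-scaled difference and categorical characters as match or mismatch, skipping missing values. The characters, values, and ranges are hypothetical and are not taken from the study.

```python
def gower_similarity(a, b, ranges):
    """Gower's general coefficient for two records.

    ranges[k] is the (positive) observed range of a numeric character,
    or None for a categorical one. A None value in a record is missing.
    """
    total, compared = 0.0, 0
    for x, y, r in zip(a, b, ranges):
        if x is None or y is None:
            continue  # character cannot be compared for this pair
        if r is None:
            total += 1.0 if x == y else 0.0
        else:
            total += 1.0 - abs(x - y) / r
        compared += 1
    return total / compared if compared else 0.0


# Hypothetical characters: body length (mm), eye count, habitat.
species_a = (10.0, 4, "terrestrial")
species_b = (15.0, 4, "freshwater")
ranges = (20.0, 6, None)
score = gower_similarity(species_a, species_b, ranges)
print(round(score, 3))
```

The pairwise scores from such a function would fill the similarity matrix that the linkage methods and principal coordinate analysis then operate on.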

4.
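A picture of the stress variable equations of Type A behavior individuals   Total citations: 4, self-citations: 0, citations by others: 4
Objective: To establish the stress measurement model equations and the stress variable equations for Type A behavior individuals, examine their similarity, and carry out a control analysis. Methods: Scores on the Type A behavior pattern questionnaire were used to obtain a scoring scale. Variables derived from plasma cortisol levels measured at rest and under stress, together with the stress quantity and baseline quantity of the stress measurement model, were then used to derive the specific equations of change by matrix inversion, and stability, controllability, and observability were analyzed. Results: A set of physiologically meaningful equations was obtained: (1) y=(s-0.839/s 1)X; (2) 7.587d^2y/d^2s-(3.068 1.027y)dy/ds-0.1475y=0, among others. The equations are not simultaneously controllable and observable, and the feedback effect of psychological stress influences stability. Conclusion: This paper obtains a preliminary set of stress variable equations; stress responses show similarity in both their psychological and physiological aspects.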

5.
MOTIVATION: The discovery of patterns shared by several sequences that differ greatly is a basic task in sequence analysis, and still a challenge. Several methods have been developed for detecting patterns. Methods commonly used for motif search include the Gibbs sampler, the Expectation-Maximization (EM) algorithm and some intuitive greedy approaches. One cannot guarantee the optimality of the result produced by the Gibbs sampler in a single run. The deterministic EM methods tend to get trapped by local optima. Solutions found by greedy approaches are rarely sufficiently good. RESULTS: A simple model describing a motif or a portion of local multiple sequence alignment is the weight matrix model, in which a motif is characterized by position-specific probabilities. Two substitution matrices are proposed to relate the sequence similarity with the weight matrix. Combining the substitution matrix and weight matrix, we examine three typical sets of protein sequences with increasing complexity. At a low score threshold for pair similarity, sliding windows are compared with a seed window to find the score sum, which provides a measure of statistical significance for multiple sequence comparison. Such a similarity analysis reveals many aspects of motifs. Blocks determined by similarity can be used to deduce a primary weight matrix or an improved substitution matrix. The algorithm successfully obtains the optimal solution for the test sets by greedy iteration alone.
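The weight matrix model mentioned here can be sketched in Python as a list of position-specific probability tables scored against sliding windows. This is only an illustration of the general model, not the authors' algorithm: the log-odds scoring, the four-letter toy alphabet, the uniform background, and the probabilities are all assumptions.

```python
import math


def log_odds_score(window, pwm, background):
    """Sum of log(p_position(residue) / p_background(residue)) over a window."""
    return sum(
        math.log(pwm[pos][res] / background[res])
        for pos, res in enumerate(window)
    )


def best_window(sequence, pwm, background):
    """Slide a window of the motif width and return (best score, start)."""
    width = len(pwm)
    scored = [
        (log_odds_score(sequence[i:i + width], pwm, background), i)
        for i in range(len(sequence) - width + 1)
    ]
    return max(scored)


# Toy two-position motif favouring "AC" (made-up probabilities).
pwm = [
    {"A": 0.7, "C": 0.1, "G": 0.1, "T": 0.1},
    {"A": 0.1, "C": 0.7, "G": 0.1, "T": 0.1},
]
background = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}
best = best_window("GGACT", pwm, background)
print(best)
```

A greedy iteration in the spirit of the abstract would re-estimate the position-specific probabilities from the best-scoring windows and repeat; that loop is omitted here.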

6.
A suite of some dozen programmes, written in FORTRAN77 to run on VAX computers under the VMS operating system and wrapped in a Digital Command Language (DCL) shell that makes it menu driven, has been in use at the Division of Molecular Biology for about nine months. The package allows the user to obtain both dot matrix and line matrix plots, find and output specific regions of similarity and compute statistics for randomly generated sequences. In all these cases the user may specify either a maximum number of gaps in the match that will be tolerated or a minimum percentage similarity allowable for a match to be registered. The system allows the user to create a batch job for any of these analyses; so, for example, a number of line matrix plots can be specified from a remote alpha-numeric terminal and plotted later at a graphics terminal. In addition, quasi-correlation statistics (Qr) for nucleotide sequences or correlation statistics (r) for amino acid residue sequences may be computed. Help facilities and documentation including examples are provided.

7.
A method for comparison of protein sequences based on their primary and secondary structure is described. Protein sequences are annotated with predicted secondary structures (using a modified Chou and Fasman method). Two-lettered code sequences are generated (Xx, where X is the amino acid and x is its annotated secondary structure). Sequences are compared with a dynamic programming method (STRALIGN) that includes a similarity matrix for both the amino acids and secondary structures. The similarity value for each paired two-lettered code is a linear combination of similarity values for the paired amino acids and their annotated secondary structures. The method has been applied to eight globin proteins (28 pairs) for which the X-ray structure is known. For protein pairs with high primary sequence similarity (greater than 45%), STRALIGN alignment is identical to that obtained by a dynamic programming method using only primary sequence information. However, alignment of protein pairs with lower primary sequence similarity improves significantly with the addition of secondary structure annotation. Alignment of the pair with the least primary sequence similarity of 16% was improved from 0 to 37% 'correct' alignment using this method. In addition, STRALIGN was successfully applied to seven pairs of distantly related cytochrome c proteins, and three pairs of distantly related picornavirus proteins.
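The linear combination of amino acid and secondary structure similarity can be sketched as follows. The weighting scheme, the `weight` parameter, and the toy similarity tables are assumptions for illustration; the abstract does not give the actual coefficients or matrices used by STRALIGN.

```python
def combined_score(code_a, code_b, aa_sim, ss_sim, weight=0.5):
    """Score two two-lettered codes such as "Ah" and "Ge".

    The first letter is the amino acid, the second its annotated
    secondary structure. weight is a hypothetical mixing coefficient.
    """
    aa_a, ss_a = code_a
    aa_b, ss_b = code_b
    aa_part = aa_sim[frozenset((aa_a, aa_b))]
    ss_part = ss_sim[frozenset((ss_a, ss_b))]
    return weight * aa_part + (1 - weight) * ss_part


# Toy symmetric tables keyed by unordered pairs (values are made up).
aa_sim = {frozenset("A"): 4, frozenset("AG"): 0, frozenset("G"): 6}
ss_sim = {frozenset("h"): 2, frozenset("he"): -1, frozenset("e"): 2}

same = combined_score("Ah", "Ah", aa_sim, ss_sim)
different = combined_score("Ah", "Ge", aa_sim, ss_sim)
print(same, different)
```

A function of this shape would replace the plain substitution-matrix lookup inside an otherwise standard dynamic programming alignment.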

8.
Analysis of the extent of genetic variation within genetic resources is important for diversity preservation and also for breeders who exploit it. We investigated the recently introduced molecular marker technique of DNA diversity array technology (DArT), with the objective of characterising diversity in the likely relatively narrow genetic background of Czech malting barley cultivars. A total of 94 obsolete or registered barley cultivars and some hulless barley lines primarily of Czech origin were characterised by DArT analysis. A total of 271 polymorphic marker alleles were revealed across the analysed set of accessions, 37 of which were identified as being overrepresented; the other 234 markers were used for further analysis. The average dissimilarity value within the analysed set of accessions was 0.692. To assess how well DArT is suited for individual barley characteristic evaluation, available agronomical data from three yield field trials were used. Out of 94 barley genotypes used in the field trials that were assessed by DArT, 41 have been grown over time as malting cultivars in the region. Similarity matrices based on Gower’s coefficient for mixed data and the simple matching coefficient were used to compare DArT and agronomical results. We demonstrate that a DArT-based similarity matrix and an agronomical data-based similarity matrix correlated well. To assess the genetic structure of the entire collection, K-means and simple matching coefficient clustering were used. Statistical analysis confirmed the power of the DArT system, which efficiently grouped old genetic resources and modern cultivars in the expected way. Our results show that the level of genetic diversity has not changed substantially over time, but significant shifts in allelic frequency have occurred.
In addition, a DArT-based dendrogram and principal component analysis (PCA) plots clearly demonstrated the impact of breeding practices on the diversity of Czech spring malting barley cultivars over time.

9.
The two methods available for analyzing the global structural identifiability of the parameters of a nonlinear system with a specified input function, the Taylor series approach and the similarity transformation approach, are compared and contrasted through application to three examples. It is shown that, as for linear systems, it is very difficult to predict which of the available methods will result in the least effort for a particular example. The role of modern symbolic manipulation packages in the analysis is assessed. The third example proves intractable using the similarity transformation approach as originally formulated, but the analysis is completed using a reformulation that exploits the polynomial form of the system equations in the example.

10.
A phenetic classification based on overall morphological similarity between the species in the family Plectonemertidae (genera Plectonemertes, Campbellonemertes, Potamonemertes, Leptonemertes, Katechonemertes, Argonemertes, Anliponemertes, and Acteonemertes) was undertaken and the result compared with a cladistic and an evolutionary classification. Similarity between species was computed by Gower's general coefficient of similarity and various techniques were used to find patterns in the similarity matrix: single-linkage, average-linkage, and complete-linkage clustering, together with principal coordinate analysis. Although the explicit aim of phenetics is not to estimate the phylogeny, the classification based on overall similarity still portrays phylogeny better than an intuitive assessment of morphological similarity, as judged by the cladistic analysis. The classification does not support the previously proposed hypothesis that the two freshwater genera Campbellonemertes and Potamonemertes have descended from a terrestrial ancestor.

11.
SUMMARY: The CluSTr database employs a fully automatic single-linkage hierarchical clustering method based on a similarity matrix. In order to compute the matrix, first all-against-all pair-wise comparisons between protein sequences are computed using the Smith-Waterman algorithm. The statistical significance of the similarity scores is then assessed using a Monte Carlo analysis, yielding Z-values, which are used to populate the matrix. This paper describes automated annotation experiments that quantify the predictive power and hence the biological relevance of the CluSTr data. The experiments utilized the UniProt data-mining framework to derive annotation predictions using combinations of InterPro and CluSTr. We show that this combination of data sources greatly increases the precision of predictions made by the data-mining framework, compared with the use of InterPro data alone. We conclude that the CluSTr approach to clustering proteins makes a valuable contribution to traditional protein classifications. AVAILABILITY: http://www.ebi.ac.uk/clustr/.
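The idea of turning a Smith-Waterman score into a Z-value by Monte Carlo shuffling can be sketched as below. This is a simplified illustration, not the CluSTr pipeline: the scoring uses an assumed match/mismatch scheme with linear gap penalties instead of a substitution matrix, only the second sequence is shuffled, and the shuffle count and seed are arbitrary.

```python
import random
import statistics


def smith_waterman(a, b, match=2, mismatch=-1, gap=-2):
    """Best local alignment score with linear gap penalties (assumed scoring)."""
    best = 0
    prev = [0] * (len(b) + 1)
    for ca in a:
        cur = [0]
        for j, cb in enumerate(b, start=1):
            diag = prev[j - 1] + (match if ca == cb else mismatch)
            cur.append(max(0, diag, prev[j] + gap, cur[j - 1] + gap))
            best = max(best, cur[j])
        prev = cur
    return best


def z_value(seq_a, seq_b, score, shuffles=200, seed=0):
    """Z = (observed - mean of shuffled scores) / std of shuffled scores."""
    rng = random.Random(seed)
    observed = score(seq_a, seq_b)
    residues = list(seq_b)
    null_scores = []
    for _ in range(shuffles):
        rng.shuffle(residues)  # destroys order, keeps composition
        null_scores.append(score(seq_a, "".join(residues)))
    spread = statistics.pstdev(null_scores)
    if spread == 0:
        return 0.0
    return (observed - statistics.mean(null_scores)) / spread


# Made-up peptide compared with itself, so the observed score is maximal.
z = z_value("MKTAYIAKQR", "MKTAYIAKQR", smith_waterman)
print(z)
```

Filling a matrix with such Z-values for every sequence pair, then thresholding it, gives the input for single-linkage clustering.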

12.
Velvet bean (Mucuna pruriens) seeds contain the catecholic amino acid L-DOPA (L-3,4-dihydroxyphenylalanine), which is a neurotransmitter precursor and is used for the treatment of Parkinson's disease and mental disorders. The great demand for L-DOPA is largely met by the pharmaceutical industry through extraction of the compound from wild populations of this plant; commercial exploitation of this compound is hampered because of its limited availability. The trichomes present on the pods can cause severe itching, blisters and dermatitis, discouraging cultivation. We screened genetic stocks of velvet bean for the trichome-less trait, along with high seed yield and L-DOPA content. The highest yielding trichome-less elite strain was selected and identified on the basis of a PCR-based DNA fingerprinting method (RAPD), using deca-nucleotide primers. A genetic similarity index matrix was obtained through multivariate analysis using Nei and Li's coefficient. The similarity coefficients were used to generate a tree for cluster analysis using the UPGMA method. Analysis of amplification spectra of 408 bands obtained with 56 primers allowed us to distinguish a trichome-less elite strain of M. pruriens.

13.
14.
A method for seed proteome analysis using MALDI-TOF mass spectrometry is described. The data were used to estimate the degree of genetic diversity among twelve genotypes of pepper (Capsicum). The resulting spectra were converted into a binary matrix consisting of 23 protein data sets, and genetic similarity values were calculated with the FreeTree software and Jaccard's coefficient of similarity. We have also been able to identify the presence of certain proteins in the extracts by checking their masses against on-line databases.
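Jaccard's coefficient on rows of a binary matrix can be computed as in this short Python sketch. The presence/absence profiles and genotype labels are hypothetical; the sketch does not reproduce FreeTree or the study's data.

```python
def jaccard(a, b):
    """Shared presences divided by positions present in at least one profile."""
    both = sum(1 for x, y in zip(a, b) if x and y)
    either = sum(1 for x, y in zip(a, b) if x or y)
    return both / either if either else 0.0


def similarity_matrix(profiles):
    """Pairwise Jaccard values keyed by (name, name)."""
    names = list(profiles)
    return {
        (p, q): jaccard(profiles[p], profiles[q])
        for p in names
        for q in names
    }


# Hypothetical presence (1) / absence (0) of five protein peaks.
profiles = {
    "g1": [1, 1, 0, 1, 0],
    "g2": [1, 0, 0, 1, 1],
}
matrix = similarity_matrix(profiles)
print(matrix[("g1", "g2")])
```

Joint absences are ignored by this coefficient, which is why it is often preferred over simple matching for band or peak data.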

15.
The profile hidden Markov model (PHMM) is widely used to assign protein sequences to their respective families. A major limitation of a PHMM is the assumption that, given the states, the observations (amino acids) are independent. To overcome this limitation, the dependency between amino acids in the multiple sequence alignment (MSA) that a PHMM represents can be appended to the PHMM. Because the sequences in an MSA are biologically related, the one-by-one dependency between two amino acids can be considered. In other words, based on the MSA, the dependency between an amino acid and the corresponding amino acid located above it can be combined with the PHMM. For this purpose, a new emission probability matrix that considers the one-by-one dependencies between amino acids is constructed. The parameters of a PHMM are of two types, transition and emission probabilities, which are usually estimated using an EM algorithm called the Baum-Welch algorithm. We have generalized the Baum-Welch algorithm using a similarity emission matrix constructed by integrating the new emission probability matrix with the common emission probability matrix. Then, the performance of the similarity emission is discussed by applying it to the top twenty protein families in the Pfam database. We show that using the similarity emission in the Baum-Welch algorithm significantly outperforms the common Baum-Welch algorithm in the task of assigning protein sequences to protein families.

16.
Genetic diversity among 13 different cultivars of date palm (Phoenix dactylifera L.) of Saudi Arabia was studied using random amplified polymorphic DNA (RAPD) markers. The screening of 140 RAPD primers allowed selection of 37 primers which revealed polymorphism, and the results were reproducible. All 13 genotypes were distinguishable by their unique banding patterns produced by the 37 selected primers. Cluster analysis by the unweighted pair group method with arithmetic mean (UPGMA) showed two main clusters. Cluster A consisted of five cultivars (Shehel, Om-Kobar, Ajwa, Om-Hammam and Bareem) with 0.59–0.89 Nei and Li's coefficient in the similarity matrix. Cluster B consisted of seven cultivars (Rabeeha, Shishi, Nabtet Saif, Sugai, Sukkary Asfar, Sukkary Hamra and Nabtet Sultan) with a 0.66–0.85 Nei and Li's similarity range. Om-Hammam and Bareem were the two most closely related cultivars among the 13 cultivars, with the highest value in the similarity matrix for Nei and Li's coefficient (0.89). Ajwa was closely related to Om-Hammam and Bareem, with the second highest value in the similarity matrix (0.86). Sukkary Hamra and Nabtet Sultan were also closely related, with the third highest value in the similarity matrix (0.85). The cultivar Barny did not belong to either of the cluster groups. It was 34% genetically similar to the other 12 cultivars. The average similarity among the 13 cultivars was more than 50%. As expected, most of the cultivars have a narrow genetic base. The results of the analysis can be used for the selection of possible parents to generate a mapping population. The variation detected among the closely related genotypes indicates the efficiency of RAPD markers over the morphological and isozyme markers for the identification and construction of genetic linkage maps. Communicated by H.F. Linskens
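Nei and Li's coefficient and UPGMA clustering, as used in this abstract, can be sketched in a few lines of Python. The band profiles and the labels `cv1` to `cv3` are hypothetical and are not the cultivars or data of the study; the UPGMA routine is a naive version that averages the original pairwise similarities between cluster members.

```python
def nei_li(a, b):
    """Nei and Li's coefficient: 2 * shared bands / (bands in a + bands in b)."""
    shared = sum(1 for x, y in zip(a, b) if x and y)
    total = sum(a) + sum(b)
    return 2 * shared / total if total else 0.0


def upgma(names, sim):
    """Repeatedly merge the two clusters with the highest average similarity.

    Returns a list of (merged members, similarity level) in merge order.
    """
    clusters = [(name,) for name in names]
    merges = []

    def avg(c1, c2):
        return sum(sim[x][y] for x in c1 for y in c2) / (len(c1) * len(c2))

    while len(clusters) > 1:
        level, i, j = max(
            (avg(clusters[i], clusters[j]), i, j)
            for i in range(len(clusters))
            for j in range(i + 1, len(clusters))
        )
        merged = clusters[i] + clusters[j]
        merges.append((merged, level))
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return merges


# Hypothetical RAPD band presence (1) / absence (0) for three cultivars.
bands = {
    "cv1": [1, 1, 1, 0, 0, 1],
    "cv2": [1, 1, 1, 0, 1, 1],
    "cv3": [0, 0, 1, 1, 1, 0],
}
sim = {p: {q: nei_li(bands[p], bands[q]) for q in bands} for p in bands}
merges = upgma(list(bands), sim)
print(merges)
```

The merge levels correspond to the heights at which branches join in a dendrogram; a cultivar that joins last at a low level plays the role of an outlier such as the one described in the abstract.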

17.
Genetic diversity analysis was undertaken in 42 geographically distant accessions of bottle gourd (Lagenaria siceraria) from the northeastern (14) and northern (28) regions of India using inter-simple sequence repeat (ISSR) markers. A total of 209 amplified bands were obtained from the 20 ISSR primers used in this study, of which 186 were polymorphic, giving 89.00% band polymorphism. Various parameters, namely the observed number of alleles, effective number of alleles, Nei’s gene diversity/heterozygosity, resolving power, Shannon’s information index and gene flow, were estimated. A Jaccard’s similarity coefficient matrix was generated for pairwise comparisons between individual ISSR profiles, and UPGMA cluster analysis based on this matrix showed clustering into six groups. Jaccard’s coefficient of similarity values ranged from 0.409 to 0.847, with a mean of 0.628, revealing a moderate level of genetic diversity. The Bayesian model-based approach to infer hidden genetic population structures using the multilocus ISSR markers revealed two populations among the 42 genotypes. This is the first report on the assessment of genetic variation using ISSR markers in this medicinal vegetable plant, and this diversity analysis will be helpful in planning future hybrid breeding strategies and devising effective germplasm exploration and conservation strategies.

18.
A special matrix of amino acid antigenic similarity for computer detection of the potential antigenic proximity of unrelated proteins is proposed. The matrix was built using data on the affinities of amino acid residue interactions between subunits in oligomeric proteins. The diagonal elements of the matrix characterize the recognition of amino acid residues, and the non-diagonal ones represent the relative similarity of the specificity of antibody-amino acid residue interactions. The application of the new matrix for comparing proteins allows the hydrophilic, potentially immunologically active regions of sequences to be picked out as similar fragments. When the influenza virus hemagglutinin was compared with 116 human proteins, eight fragments were picked out that could not be determined by means of the routinely used MDM78 matrix. It is proposed that the antigenic similarity matrix be used to define forbidden structures when preparing peptide antiviral vaccines.

19.
This paper investigates the biogeographic characteristics of the insect faunas of seven islands off the west coast of Incheon, Korea, using quantitative analysis. Faunal similarity is examined using the Bray & Curtis similarity. The resulting similarity matrix was examined by cluster analysis using the UPGMA method. The distribution records from the seven investigated islands comprise 1,001 species of insects belonging to 12 orders. Among the seven islands, Seokmodo has the highest number of species (497), while Yeonpyeongdo has the lowest (136). The species composition of insects reported in Ganghwado was 309 species under seven orders. The similarity values between the seven localities investigated range from 24.907 (Gyodongdo to Yeonpyeongdo) to 49.899 (Baengnyeongdo to Ganghwado). That is, the species composition of Baengnyeongdo (47.90%) was similar to that of Ganghwado, while that of Yeonpyeongdo (25.28%) was different from it. The cluster analysis using a similarity index shows that all the islands of these areas can be divided into 3 groups at the level of 30.97%.
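A percentage Bray & Curtis similarity between two species lists can be computed as in the sketch below. The island labels and occurrence vectors are hypothetical; the abstract does not say whether abundances or presence/absence records were used, so presence/absence is assumed here.

```python
def bray_curtis_similarity(a, b):
    """100 * 2 * sum(min(a_k, b_k)) / (sum(a) + sum(b)).

    With 0/1 presence data this equals the Sorensen index in percent.
    """
    shared = sum(min(x, y) for x, y in zip(a, b))
    total = sum(a) + sum(b)
    return 100.0 * 2 * shared / total if total else 0.0


# Hypothetical presence (1) / absence (0) of eight species on two islands.
island_x = [1, 1, 0, 1, 1, 0, 0, 1]
island_y = [1, 0, 1, 1, 0, 0, 1, 0]
value = bray_curtis_similarity(island_x, island_y)
print(round(value, 3))
```

Computing this value for every pair of islands yields the similarity matrix that a UPGMA cluster analysis would then take as input.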

20.
One of the major research directions in bioinformatics is that of predicting the protein superfamily in large databases and classifying a given set of protein domains into superfamilies. The classification reflects the structural, evolutionary and functional relatedness. These relationships are embodied in hierarchical classifications such as the Structural Classification of Proteins (SCOP), which is manually curated. Such classification is essential for the structural and functional analysis of proteins. Yet, a large number of proteins remain unclassified. We have proposed an unsupervised machine-learning FuzzyART neural network algorithm to classify a given set of proteins into SCOP superfamilies. The proposed method learns quickly and uses an atypical non-linear pattern recognition technique. In this approach, we constructed a similarity matrix from the p-values of an all-against-all BLAST search, trained the network with the FuzzyART unsupervised learning algorithm using the rows of the similarity matrix as input vectors, and finally used the trained network to produce a SCOP superfamily level classification. In this experiment, we compared the performance of our method with existing techniques on six different datasets. We have shown that the trained network is able to classify a given similarity matrix of a set of sequences into SCOP superfamilies with high classification accuracy.


Copyright © Beijing Qinyun Technology Development Co., Ltd. (北京勤云科技发展有限公司)  京ICP备09084417号