Similar articles
 20 similar articles found (search time: 15 ms)
1.
Most molecular analyses, including phylogenetic inference, are based on sequence alignments. We present an algorithm that estimates relatedness between biomolecules without requiring sequence alignment, using a protein frequency matrix reduced by singular value decomposition (SVD) in a latent semantic indexing information retrieval system. Two databases were used: one with 832 proteins from 13 mitochondrial gene families and another composed of 1000 sequences from nine types of proteins retrieved from GenBank. First, 208 sequences from the first database and 200 from the second were randomly selected and compared using the edit distance between each pair of sequences and the respective cosines and Euclidean distances from SVD. The correlation between cosine and edit distance was -0.32 (P < 0.01) and that between Euclidean distance and edit distance was +0.70 (P < 0.01). To check the ability of SVD to classify sequences according to their categories, we used a sample of 202 sequences from the 13 gene families as queries (test set), and the other 630 proteins were used to generate the frequency matrix (training set). The classification algorithm applies a voting scheme based on the five sequences most similar to each query. With a 3-peptide frequency matrix, all 202 queries were correctly classified (accuracy = 100%). This algorithm is very attractive because sequence alignments are neither generated nor required. To achieve results similar to those obtained with edit distance analysis, we recommend that Euclidean distance be used as a similarity measure for protein sequences in latent semantic indexing methods.
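
The pipeline described in this abstract (k-peptide frequency matrix, SVD reduction, Euclidean distance, vote among the nearest sequences) can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the function names, the `rank` parameter and the use of NumPy are assumptions, and the abstract does not state the rank the authors retained.

```python
from collections import Counter
import numpy as np

def kmer_counts(seq, k=3):
    """Count overlapping k-peptides in one protein sequence."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))

def frequency_matrix(seqs, k=3):
    """Rows are k-peptides seen in the training set, columns are sequences."""
    counts = [kmer_counts(s, k) for s in seqs]
    vocab = sorted(set().union(*counts))
    index = {w: i for i, w in enumerate(vocab)}
    M = np.zeros((len(vocab), len(seqs)))
    for j, c in enumerate(counts):
        for w, n in c.items():
            M[index[w], j] = n
    return M, index

def classify(query, train_seqs, labels, k=3, rank=2, n_neighbors=5):
    """Hypothetical helper: label a query by voting among its nearest
    training sequences in the SVD-reduced space (Euclidean distance)."""
    M, index = frequency_matrix(train_seqs, k)
    U, s, Vt = np.linalg.svd(M, full_matrices=False)
    U_r = U[:, :rank]
    docs = Vt[:rank].T * s[:rank]      # training sequences in latent space
    q = np.zeros(M.shape[0])
    for w, n in kmer_counts(query, k).items():
        if w in index:                 # k-peptides unseen in training are ignored
            q[index[w]] = n
    q_latent = q @ U_r                 # same projection as the training columns
    dist = np.linalg.norm(docs - q_latent, axis=1)
    nearest = np.argsort(dist)[:n_neighbors]
    votes = Counter(labels[i] for i in nearest)
    return votes.most_common(1)[0][0]
```

Projecting the query with `U_r` places it in the same coordinates as the training columns, so the distances are comparable. Cosine similarity could be substituted for the Euclidean norm, although the abstract reports that Euclidean distance tracked edit distance more closely.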

2.
In molecular biology, the issue of quantifying the similarity between two biological sequences is very important. Past research has shown that word-based search tools are computationally efficient and can find some new functional similarities or dissimilarities invisible to other algorithms like FASTA. Recently, under the independent model of base composition, Wu, Burke, and Davison (1997, Biometrics 53, 1431–1439) characterized a family of word-based dissimilarity measures that define the distance between two sequences by simultaneously comparing the frequencies of all subsequences of n adjacent letters (i.e., n-words) in the two sequences. Specifically, they introduced the use of Mahalanobis distance and standardized Euclidean distance into the study of DNA sequence dissimilarity. They showed that both distances had better sensitivity and selectivity than the commonly used Euclidean distance. The purpose of this article is to extend Mahalanobis and standardized Euclidean distances to Markov chain models of base composition. In addition, a new dissimilarity measure based on the Kullback-Leibler discrepancy between frequencies of all n-words in the two sequences is introduced. Applications to real data demonstrate that Kullback-Leibler discrepancy performs better than Euclidean distance. Moreover, under a Markov chain model of order kQ for base composition, where kQ is the estimated order based on the query sequence, standardized Euclidean distance performs very well. Under such a model, it performs as well as Mahalanobis distance and better than Kullback-Leibler discrepancy and Euclidean distance. Since standardized Euclidean distance is drastically faster to compute than Mahalanobis distance, in a usual workstation/PC computing environment, the use of standardized Euclidean distance under the Markov chain model of order kQ of base composition is generally recommended. However, if the user is very concerned with computational efficiency, then the use of Kullback-Leibler discrepancy, which can be computed as fast as Euclidean distance, is recommended. This can significantly enhance the current technology for comparing large datasets of DNA sequences.
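
To make the word-based measures concrete, the sketch below computes n-word frequencies for two DNA sequences and compares them with the plain Euclidean distance and a Kullback-Leibler discrepancy. It is only an illustration under stated assumptions: the pseudocount (added so the logarithm is always defined), the function names and the one-directional form of the discrepancy are choices made here, and the Markov-chain standardization discussed in the abstract is not implemented.

```python
import math
from collections import Counter
from itertools import product

def word_freqs(seq, n=2, alphabet="ACGT", pseudocount=0.5):
    """Relative frequencies of all n-words; the pseudocount is an assumption."""
    words = ["".join(p) for p in product(alphabet, repeat=n)]
    counts = Counter(seq[i:i + n] for i in range(len(seq) - n + 1))
    total = sum(counts[w] for w in words) + pseudocount * len(words)
    return {w: (counts[w] + pseudocount) / total for w in words}

def euclidean(f, g):
    """Euclidean distance between two n-word frequency vectors."""
    return math.sqrt(sum((f[w] - g[w]) ** 2 for w in f))

def kl_discrepancy(f, g):
    """Kullback-Leibler discrepancy of f relative to g (not symmetric)."""
    return sum(f[w] * math.log(f[w] / g[w]) for w in f)
```

Both measures need only one pass over each sequence plus a sum over the 4^n words, which is consistent with the abstract's remark that the Kullback-Leibler discrepancy can be computed as fast as Euclidean distance.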

3.
The phylogeny of the subgenus Polyommatus (s. str.) was studied by analysing nucleotide sequences of the mitochondrial cytochrome oxidase I (COI) gene in 76 blue butterflies belonging to 36 taxa of subgenus and species rank. Among all the taxa studied, only 12 monophyletic clades corresponding to species-level taxa were identified: P. icarus, P. ciloicus, P. amorata, P. forsteri, P. celina, P. venus, P. ariana, P. stoliczkanus, P. erigone and P. amor.

4.
Ordination is a powerful method for analysing complex data sets but has been largely ignored in sequence analysis. This paper shows how to use principal coordinates analysis to find low-dimensional representations of distance matrices derived from aligned sets of sequences. The method takes a matrix of Euclidean distances between all pairs of sequences and finds a coordinate space where the distances are exactly preserved. The main problem is to find a measure of distance between aligned sequences that is Euclidean. The simplest distance function is the square root of the percentage difference (as measured by identities) between two sequences, where one ignores any positions in the alignment where there is a gap in any sequence. If one does not ignore positions with a gap, the distances cannot be guaranteed to be Euclidean but the deleterious effects are trivial. Two examples of using the method are shown. A set of 226 aligned globins were analysed and the resulting ordination very successfully represents the known patterns of relationship between the sequences. In the other example, a set of 610 aligned 5S rRNA sequences were analysed. Sequence ordinations complement phylogenetic analyses. They should not be viewed as a complete alternative.
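
The two steps in this abstract, a square-root mismatch distance that skips gapped columns followed by principal coordinates analysis, can be sketched as below. The code is an assumption-laden illustration rather than the paper's program: it uses the mismatch proportion instead of the percentage (which only rescales all distances by a constant), and the function names are hypothetical.

```python
import numpy as np

def sqrt_mismatch_distances(aligned, gap="-"):
    """Square root of the mismatch proportion between aligned sequences,
    ignoring every column that has a gap in any sequence."""
    A = np.array([list(s) for s in aligned])
    keep = ~(A == gap).any(axis=0)
    A = A[:, keep]
    diff = (A[:, None, :] != A[None, :, :]).mean(axis=2)
    return np.sqrt(diff)

def pcoa(D, n_components=2):
    """Classical principal coordinates analysis of a distance matrix D."""
    n = D.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n       # centering matrix
    B = -0.5 * J @ (D ** 2) @ J               # double-centered Gram matrix
    vals, vecs = np.linalg.eigh(B)
    order = np.argsort(vals)[::-1][:n_components]
    vals, vecs = vals[order], vecs[:, order]
    # Negative eigenvalues (non-Euclidean distances) are clipped to zero.
    return vecs * np.sqrt(np.clip(vals, 0, None))
```

When the distances are Euclidean, all eigenvalues of the centered matrix are non-negative and keeping every positive one reproduces the input distances exactly; truncating to two or three components gives the low-dimensional picture the abstract describes.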

5.
Homologous amino acid sequences of phospholipases A2 (PLA2) of snakes belonging to the families Elapidae, Viperidae, and Colubridae were examined in order to study the location of conserved and variable regions. The PLA2 sequences were divided into two groups (taxa) according to the phylogenetic tree reconstructed from the pairwise similarity matrix. Results of the intergroup comparison were plotted to facilitate the identification of significant conserved and variable regions. It was shown that the results of the comparison between two phylogenetic groups of snake PLA2 did not depend much on the number of representatives in each group and did not change markedly if one of the groups was represented by a single sequence. Knowledge of the number and location of conserved and variable regions and their dependence on the phylogenetic relations between the compared taxa may be used to predict the structure of a synthetic peptide for obtaining specific antibodies against the PLA2 of one of these taxa. Such prediction is possible if there is a specific region conserved for one taxon but variable for two of them.

6.
A new approach for the analysis of mutation hotspots is described. It is based on the classification of hotspot site sequences. Using this approach, the consensus sequences RGYW and TAA of hotspot sites were revealed in the V gene. The correlation between somatic mutations and these consensus sequences was investigated by the statistical weight method in 323 somatic substitutions in 14 V genes. Assuming the absence of any correlation, the probability of observing such data in the sample would be very low (0.0003). These results support the idea that the emergence of somatic mutations is significantly influenced by neighbouring base sequences. This idea was also supported by the analysis of 296 somatic mutations in flanking sequences of V genes. It is supposed that this influence is an important feature of somatic hypermutagenesis.

7.
Hot environments are among the extreme life-supporting niches that appear to have maintained some degree of pristine quality and are of special biotechnological interest. Knowledge of biodiversity in terrestrial hot springs is still scanty and has not been compared in the light of the specificity of those extreme ecological niches. A study of the diversity of thermophilic bacteria inhabiting a hot spring located in Rupi Basin (RB), South-West Bulgaria, revealed high phylogenetic richness (genotypic diversity of 0.37). A total of 120 clones were examined and grouped into 28 phylogenetic types by their RFLP profile. 16S rRNA gene analysis allowed the identification of nine divisions from the domain Bacteria and one Candidate division. Ten of the retrieved bacterial sequences, representing one third of the sequence types, showed less than 97% similarity to the closest neighbor and were referred to as new sequences. Four of them were distantly related to validly described bacteria (≤90% similarity), suggesting new taxa at least at the genus level. Comparison of biodiversity in the spring from Rupi Basin, Bulgaria with that described from other terrestrial hot springs revealed that Proteobacteria, Hydrogenobacter/Aquifex and Thermus are common bacterial groups in terrestrial hot springs. At the same time, specific bacterial taxa were observed in different springs.

8.
Mojie Duan  Minghai Li  Li Han  Shuanghong Huo 《Proteins》2014,82(10):2585-2596
Dimensionality reduction is widely used in searching for the intrinsic reaction coordinates for protein conformational changes. We find the dimensionality-reduction methods using the pairwise root-mean-square deviation (RMSD) as the local distance metric face a challenge. We use Isomap as an example to illustrate the problem. We believe that there is an implied assumption for the dimensionality-reduction approaches that aim to preserve the geometric relations between the objects: both the original space and the reduced space have the same kind of geometry, such as Euclidean geometry vs. Euclidean geometry or spherical geometry vs. spherical geometry. When the protein free energy landscape is mapped onto a 2D plane or 3D space, the reduced space is Euclidean, thus the original space should also be Euclidean. For a protein with N atoms, its conformation space is a subset of the 3N-dimensional Euclidean space $\mathbb{R}^{3N}$. We formally define the protein conformation space as the quotient space of $\mathbb{R}^{3N}$ by the equivalence relation of rigid motions. Whether the quotient space is Euclidean or not depends on how it is parameterized. When the pairwise RMSD is employed as the local distance metric, implicit representations are used for the protein conformation space, leading to no direct correspondence to a Euclidean set. We have demonstrated that an explicit Euclidean-based representation of protein conformation space and the local distance metric associated to it improve the quality of dimensionality reduction in the tetra-peptide and β-hairpin systems. Proteins 2014; 82:2585–2596. © 2014 Wiley Periodicals, Inc.

9.
Homologous amino acid sequences of phospholipases A2 (PLA2) of snakes belonging to the families Elapidae, Viperidae, and Colubridae were examined in order to study the location of conserved and variable regions. The PLA2 sequences were divided into two groups (taxa) according to the phylogenetic tree reconstructed from the pairwise similarity matrix. Results of the intergroup comparison were plotted to facilitate the identification of significant conserved and variable regions. It was shown that the results of the comparison between two phylogenetic groups of snake PLA2 did not depend much on the number of representatives in each group and did not change markedly if one of the groups was represented by a single sequence. Knowledge of the number and location of conserved and variable regions and their dependence on the phylogenetic relations between the compared taxa may be used to predict the structure of a synthetic peptide for obtaining specific antibodies against the PLA2 of one of these taxa. Such prediction is possible if there is a specific region conserved for one taxon but variable for two of them.

10.
A review of aphidiine wasps (Hymenoptera: Braconidae) parasitizing the Uroleucon species in the West Palaearctic is presented. Eleven species are keyed and illustrated. In addition, a new hymenopteran parasitoid species, Praon nonveilleri n. sp. from Uroleucon inulicola (Hille Ris Lambers) infesting Inula ensifolia L., is described. The new species is diagnosed and illustrated. It belongs to the “dorsale-yomenae” species group and was collected from the Djetinja canyon in Serbia and Montenegro. The aphidiines presented in this work were identified from 97 aphid taxa occurring on 236 plant taxa. Furthermore, 361 original parasitoid – host aphid – host plant associations of the species mentioned in the key are presented. Finally, phylogenetic relationships inside the “dorsale-yomenae” species group and related species were reconstructed using cladistic distance methods.

11.
Kovács GM  Jakucs E 《Mycorrhiza》2006,16(8):567-574
In the present study, white truffle ectomycorrhizae (EM) collected in deciduous forests (Populus, Quercus, and Fagus) from Hungary were characterized by morphological–anatomical and molecular methods. Our investigations suggest that the EM of white truffles (e.g., Tuber rapaeodorum, Tuber puberulum, Tuber rufum) are common and abundant members of the forest communities in the area. The ITS sequences of 14 EM specimens and 46 additional fruitbody sequences from GenBank were clustered into four main groups in phylogenetic analyses. In the ITS-1 region, a characteristic indel pattern was found, which supports the clades. Although our analyses indicate definite genetic distance between the groups of the phylogenetic tree, these clades do not correspond to the traditional taxa identified by fruitbody characteristics. Comparison of the ectomycorrhizae shows that mycorrhizal anatomy is not a good tool for separating the groups either, because the characters (such as the epidermoid or angular mantle structure, cell wall thickness, and the shape and size of cystidia) are too variable and overlap between the clades. The interspecific similarity, observed in both ectomycorrhizal and fruitbody characters, strengthens the sensu lato morpho-species concept of this group. Our study, which combines comprehensive molecular and anatomical approaches to characterize and identify ectomycorrhizae of white truffles from natural samples, stresses the need for a taxonomic revision of this group.

12.
ABSTRACT

A groundwater field is a complex and open system. Groundwater simulations and predictions often deviate from true values, which is attributed to the uncertainty of groundwater modeling. The conceptual model (model structure) is one of the main sources of groundwater modeling uncertainty. In this study, the mean Euclidean distance (MED) between model simulations and observations is proposed to assess the integrated likelihood value of a conceptual model in Bayesian model averaging (BMA). Moreover, the proposed BMA method is compared with the traditional generalized likelihood uncertainty estimation (GLUE) BMA method using a synthetic groundwater model, and the characteristics of the two BMA methods are summarized.

13.
Zheng X  Liu T  Wang J 《Amino acids》2009,37(2):427-433
A complexity-based approach is proposed to predict subcellular location of proteins. Instead of extracting features from protein sequences as done previously, our approach is based on a complexity decomposition of symbol sequences. In the first step, distance between each pair of protein sequences is evaluated by the conditional complexity of one sequence given the other. Subcellular location of a protein is then determined using the k-nearest neighbor algorithm. Using three widely used data sets created by Reinhardt and Hubbard, Park and Kanehisa, and Gardy et al., our approach shows an improvement in prediction accuracy over those based on the amino acid composition and Markov model of protein sequences.
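
The abstract does not specify how the conditional complexity is computed, so the sketch below swaps in a different, well-known stand-in: the normalized compression distance (NCD) with `zlib` as the compressor, followed by a k-nearest-neighbor vote. It illustrates the general idea of an alignment-free, complexity-based distance and is not the authors' measure; the function names are hypothetical.

```python
import zlib
from collections import Counter

def compressed_size(s):
    """Length in bytes of the zlib-compressed string (complexity proxy)."""
    return len(zlib.compress(s.encode("ascii"), 9))

def ncd(x, y):
    """Normalized compression distance, a stand-in for conditional complexity."""
    cx, cy = compressed_size(x), compressed_size(y)
    cxy = compressed_size(x + y)
    return (cxy - min(cx, cy)) / max(cx, cy)

def knn_predict(query, train, k=1):
    """train is a list of (sequence, label) pairs; returns the majority label
    among the k training sequences closest to the query."""
    ranked = sorted(train, key=lambda item: ncd(query, item[0]))
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]
```

General-purpose compressors are a coarse proxy on short protein sequences, so a real application would likely need a sequence-specific complexity estimate such as the decomposition the abstract refers to.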

14.
Abstract

We analysed morphological variation among 17 forewing characters within five populations of the paper wasp, Polistes dominulus, in Iran. The raw planar coordinate data were aligned using geometric and mathematical calculations in Kendall's shape space. After transfer of the data to a linear Euclidean space, i.e., tangent space, a multivariate analysis of 135 forewing images was performed using their geometric morphometric characters (30 in the forewings). We observed a direct correlation between morphological characters and the geographically easiest travel distance along river valleys and mountain ranges.

15.
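A six-dimensional representation of protein sequences is presented. The representation has three different forms, which are used to construct the information entropy of distance matrices; the similarity between sequences is then compared using the Euclidean distance and the angle between the information-entropy vectors.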

16.
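Objective: To address the curse of dimensionality and the overfitting that readily arise in tumor subtype identification, an improved particle swarm BP neural network ensemble algorithm is proposed. Methods: The algorithm uses Euclidean distance and mutual information to filter out redundant genes in a first pass, then applies the Relief algorithm for further processing to obtain a set of candidate feature genes. A BP neural network serves as the base classifier, feature gene extraction is combined with classifier training, and the improved particle swarm performs a global search to optimize the network weights and thresholds. Results: Classification accuracy was highest when the hidden layer had 5 neurons and there were 110 candidate feature genes, with the QPSO/BP algorithm performing global optimization and search. Conclusion: The algorithm not only improves the accuracy of tumor subtype identification but also reduces the complexity of learning.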

17.

Background

Gene flow maintains genetic diversity within a species and is influenced by individual behavior and the geographical features of the species' habitat. Here, we have characterized the geographical distribution of genetic patterns in giant pandas (Ailuropoda melanoleuca) living in four isolated patches of the Xiaoxiangling and Daxiangling Mountains. Three geographic distance definitions were used with the "isolation by distance theory": Euclidean distance (EUD), least-cost path distance (LCD) defined by food resources, and LCD defined by habitat suitability.

Results

A total of 136 genotypes were obtained from 192 fecal samples and one blood sample, corresponding to 53 unique genotypes. Geographical maps plotted at high resolution using smaller neighborhood radius definitions produced large cost distances, because smaller radii include a finer level of detail in considering each pixel. Mantel tests showed that most correlation indices, particularly bamboo resources defined for different sizes of raster cell, were slightly larger than the correlations calculated for the Euclidean distance, with the exception of Patch C. We found that natural barriers might have decreased gene flow between the Xiaoxiangling and Daxiangling regions.

Conclusions

Landscape features were found to partially influence gene flow in the giant panda population. This result is closely linked to the biological character and behavior of giant pandas because, as bamboo feeders, individuals spend most of their lives eating bamboo or moving within the bamboo forest. Landscape-based genetic analysis suggests that gene flow will be enhanced if the connectivity between currently fragmented bamboo forests is increased.
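
The Mantel tests mentioned in the Results compare a genetic distance matrix with a geographic one (Euclidean or least-cost path). A minimal permutation version is sketched below, assuming two symmetric matrices over the same individuals; the function name, the one-sided p-value and the default of 999 permutations are assumptions, since the abstract does not report the test settings.

```python
import numpy as np

def mantel(A, B, n_perm=999, seed=0):
    """Permutation Mantel test between two symmetric distance matrices.
    Returns the Pearson correlation of the upper triangles and a
    one-sided p-value for a positive association."""
    A = np.asarray(A, dtype=float)
    B = np.asarray(B, dtype=float)
    iu = np.triu_indices_from(A, k=1)
    r_obs = np.corrcoef(A[iu], B[iu])[0, 1]
    rng = np.random.default_rng(seed)
    hits = 0
    for _ in range(n_perm):
        p = rng.permutation(A.shape[0])
        # Permute rows and columns together so A stays a valid distance matrix.
        r = np.corrcoef(A[p][:, p][iu], B[iu])[0, 1]
        hits += int(r >= r_obs)
    return r_obs, (hits + 1) / (n_perm + 1)
```

Running the same function once with the Euclidean matrix and once with each least-cost matrix would yield the kind of correlation comparison the abstract summarizes.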

18.
A number of methods for predicting the folding type of a protein based on its amino acid composition have been developed during the past few years. In order to perform an objective and fair comparison of different prediction methods, a Monte Carlo simulation method was proposed to calculate the asymptotic limit of the prediction accuracy [Zhang and Chou (1992), Biophys. J. 63, 1523–1529, referred to as simulation method I]. However, simulation method I was based on an oversimplified assumption, i.e., that there are no correlations between the compositions of different amino acids. By taking into account such correlations, a new method, referred to as simulation method II, has been proposed to recalculate the objective accuracy of prediction for the least Euclidean distance method [Nakashima et al. (1986), J. Biochem. 99, 152–162] and the least Minkowski distance method [Chou (1989), Prediction in Protein Structure and the Principles of Protein Conformation, Plenum Press, New York, pp. 549–586], respectively. The results show that the prediction accuracy of the former is still better than that of the latter, as found by simulation method I; however, after incorporating the correlative effect, the objective prediction accuracies become lower for both methods. The reason for this phenomenon is discussed in detail. The simulation method and the idea developed in this paper can be applied to examine any other statistical prediction method, including the computer-simulated neural network method.

19.
《Bird Study》2012,59(3):366-377
ABSTRACT

Capsule: Our findings regarding Hen Harrier Circus cyaneus territory site selection and breeding success in Ireland offer an opportunity for the development of initiatives and conservation actions aimed at enhancing the suitability of upland areas for breeding Hen Harriers and ensuring the long-term persistence of the species.

Aims: To investigate landscape-scale associations between habitat composition and Hen Harrier territory site selection, and to explore the influence of habitat and climate on breeding success.

Methods: We used multi-model inference from generalized linear models and Euclidean distance analyses to explore the influence of habitat, topographic, anthropogenic and climatic factors on Hen Harrier territory selection and breeding success in Ireland, based on data from national breeding surveys in 2010 and 2015.

Results: Hen Harrier territories were associated with heath/shrub, bog and pre-thicket coniferous forests. Comparisons between territories and randomly generated pseudo-absences (upland and lowland) showed that breeding pairs preferentially select for these habitats. Breeding success was negatively influenced by rainfall early in the breeding season and by climatic instability, and was positively influenced by the presence of heath/shrub and bog.

Conclusions: The results suggest that Hen Harrier breeding success is compromised by the synergistic effects of climate, landscape composition and management. Effective conservation of Hen Harriers in Ireland will therefore rely on landscape-scale initiatives.

20.
The successful prediction of thermophilic proteins is useful for designing stable enzymes that are functional at high temperature. We used the increment of diversity (ID), a novel amino acid composition-based similarity distance, in a 2-class K-nearest neighbor classifier to discriminate thermophilic from mesophilic proteins, yielding the KNN-ID classifier for predicting thermophilic proteins. Instead of extracting features from protein sequences as done previously, our approach is based on a diversity measure of symbol sequences. The similarity distance between each pair of protein sequences is first calculated to quantify how similar one sequence is to the other. The class of the query protein is then determined using the K-nearest neighbor algorithm. Comparisons with multiple recently published methods showed that the KNN-ID classifier proposed in this study outperforms the other methods. The improved predictive performance indicates that it is a simple and effective classifier for discriminating thermophilic and mesophilic proteins. Finally, the influence of protein length and protein identity on prediction accuracy is discussed. The prediction model and dataset used in this article can be freely downloaded from http://wlxy.imu.edu.cn/college/biostation/fuwu/KNN-ID/index.htm.
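
The abstract does not give the ID formula, so the sketch below assumes the commonly used definition: the diversity of a count vector is D(X) = N ln N − Σ n_i ln n_i, and ID(X, Y) = D(X + Y) − D(X) − D(Y). It is applied here to raw amino acid counts and combined with a K-nearest-neighbor vote. The function names and the toy labels are hypothetical, and the published KNN-ID model may differ in its details.

```python
import math
from collections import Counter

def diversity(counts):
    """D(X) = N ln N - sum(n_i ln n_i) for a Counter of symbol counts."""
    n = sum(counts.values())
    if n == 0:
        return 0.0
    return n * math.log(n) - sum(c * math.log(c) for c in counts.values() if c > 0)

def increment_of_diversity(x, y):
    """ID(X, Y) = D(X + Y) - D(X) - D(Y); zero for identical compositions."""
    cx, cy = Counter(x), Counter(y)
    return diversity(cx + cy) - diversity(cx) - diversity(cy)

def knn_id_predict(query, train, k=1):
    """train is a list of (sequence, label) pairs; smaller ID means more similar."""
    ranked = sorted(train, key=lambda item: increment_of_diversity(query, item[0]))
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]
```

Because the measure depends only on composition, two sequences with proportional amino acid counts have an increment of zero regardless of residue order.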
