Found 20 similar articles (search time: 8 ms)
1.
Integrated gene and species phylogenies from unaligned whole genome protein sequences (total citations: 2; self-citations: 0; citations by others: 2)
MOTIVATION: Most molecular phylogenies are based on sequence alignments. Consequently, they fail to account for modes of sequence evolution that involve frequent insertions or deletions. Here we present a method for generating accurate gene and species phylogenies from whole genome sequences that makes use of short character string matches not placed within explicit alignments. In this work, the singular value decomposition of a sparse tetrapeptide frequency matrix is used to represent the proteins of organisms uniquely and precisely as vectors in a high-dimensional space. Vectors of this kind can be used to calculate pairwise distance values based on the angle separating the vectors, and the resulting distance values can be used to generate phylogenetic trees. Protein trees so derived can be examined directly for homologous sequences. Alternatively, the vectors defining each of the proteins within an organism can be summed to provide a vector representation of the organism, which is then used to generate species trees. RESULTS: Using a large mitochondrial genome dataset, we have produced species trees that are largely in agreement with previously published trees based on the analysis of identical datasets using different methods. These trees also agree well with currently accepted phylogenetic theory. In principle, our method could be used to compare much larger bacterial or nuclear genomes in full molecular detail, ultimately allowing accurate gene and species relationships to be derived from a comprehensive comparison of complete genomes. Unlike in alignment-based phylogenetic methods, sequences that evolve by relative insertion or deletion would tend to remain recognizably similar under this approach.
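The vector representation and angle-based distance described in this abstract can be illustrated with a minimal sketch. It is an assumption-laden simplification: it works on raw tetrapeptide counts and omits the singular value decomposition step entirely, and the function names (`kmer_counts`, `angular_distance`, `organism_vector`) are hypothetical, not taken from the authors' software.

```python
import math
from collections import Counter


def kmer_counts(protein, k=4):
    """Count overlapping k-mers (tetrapeptides when k=4) in one protein sequence."""
    return Counter(protein[i:i + k] for i in range(len(protein) - k + 1))


def angular_distance(a, b):
    """Angle in radians between two sparse count vectors stored as Counters."""
    dot = sum(count * b[kmer] for kmer, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("each sequence must contain at least one k-mer")
    # Clamp to [-1, 1] so rounding error cannot push acos out of its domain.
    cosine = max(-1.0, min(1.0, dot / (norm_a * norm_b)))
    return math.acos(cosine)


def organism_vector(proteins, k=4):
    """Sum the per-protein vectors to get one vector for the whole organism."""
    total = Counter()
    for protein in proteins:
        total.update(kmer_counts(protein, k))
    return total
```

A matrix of such pairwise angles could then be passed to any distance-based tree-building method; the abstract does not say which one the authors used.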
2.
There is consensus on the need to include a third dimension when estimating Species Distribution Models (SDMs), which is of special interest for marine species. Tools that support the third dimension are, however, rarely available, so users are obliged to manually combine 2D SDM outputs (i.e., suitability or presence/absence maps) to generate 3D distributions. Herein, the Niche of Occurrence 3D (NOO3D) is presented, a new, simple modelling procedure that provides 3D distributions using both 3D occurrence samples and environmental datasets that consist of one layer per depth value. NOO3D performance was evaluated using five virtual marine species to avoid errors associated with real data sets (three pelagic species, with wide, medium, and narrow distributions, respectively, a mesopelagic species and an abyssal species). These virtual species are distributed across the North Atlantic Ocean and were built at a 0.5° × 0.5° resolution with 49 depth levels (from 0.43 m to 5274.7 m). NOO3D results were also compared to those provided by 3D Alpha Shapes and Maximum Entropy (MaxEnt). The True Positive Rate (TPR), or sensitivity, True Negative Rate (TNR), or specificity, False Positive Rate (FPR), or commission error, and False Negative Rate (FNR), or omission error, were employed to facilitate comparison between methods. MaxEnt performed best for TPR, the True Skill Statistic (TSS) and FNR, and Alpha Shape 3D performed best for FPR and TNR. NOO3D ranked second on every metric considered, which the authors take to indicate that it was the most suitable method overall. The results indicate that NOO3D can be considered a viable alternative for producing three-dimensional species distribution models.
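The evaluation metrics named in this abstract all follow from a presence/absence confusion matrix. The sketch below shows one way to compute them; the function name and input layout are hypothetical, and TSS is assumed here to be the True Skill Statistic (TPR + TNR - 1), since the abstract uses the abbreviation without defining it.

```python
def classification_rates(observed, predicted):
    """Compute TPR, TNR, FPR, FNR and TSS from paired presence/absence values.

    `observed` and `predicted` are equal-length sequences of truthy/falsy
    values, one per grid cell (for example, one per latitude/longitude/depth voxel).
    """
    tp = fp = tn = fn = 0
    for obs, pred in zip(observed, predicted):
        if obs and pred:
            tp += 1
        elif obs and not pred:
            fn += 1  # omission: a real presence that was missed
        elif not obs and pred:
            fp += 1  # commission: a presence predicted where there is none
        else:
            tn += 1
    if tp + fn == 0 or tn + fp == 0:
        raise ValueError("need at least one observed presence and one absence")
    tpr = tp / (tp + fn)
    tnr = tn / (tn + fp)
    return {
        "TPR": tpr,
        "TNR": tnr,
        "FPR": 1 - tnr,
        "FNR": 1 - tpr,
        "TSS": tpr + tnr - 1,  # assumed definition of TSS
    }
```

Because FPR = 1 - TNR and FNR = 1 - TPR, a method that ranks best on TPR necessarily ranks best on FNR, which is consistent with the pairing of results reported above.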
3.
Deep generative models have recently gained popularity for chemical design. Many of these models have historically operated in 2D space; more recently, however, explicit 3D molecular generative models, the topic of this article, have attracted interest. Dozens of models have been published in the last few years that generate molecules directly in 3D, outputting both atom types and coordinates, either in one shot or by adding atoms or fragments step by step. These 3D generative models can also be guided by structural information, such as a binding pocket representation, to generate molecules with docking score ranges similar to those of known actives, although they still show lower computational efficiency and generation throughput than 1D/2D generative models and sometimes produce unrealistic conformations. We advocate a unified benchmark of metrics for evaluating generation and propose perspectives to be addressed in future implementations.
4.
Kawabata T. Nucleic acids research, 2003, 31(13): 3367-3369
The recent accumulation of large amounts of 3D structural data warrants a sensitive and automatic method to compare and classify these structures. We developed a web server for comparing protein 3D structures using the program Matras (http://biunit.aist-nara.ac.jp/matras). An advantage of Matras is its structure similarity score, which is defined as the log-odds of the probabilities, similar to Dayhoff's substitution model of amino acids. This score is designed to detect evolutionarily related (homologous) structural similarities. Our web server has three main services. The first is a pairwise 3D alignment, which simply aligns two structures. A user can specify structures either by entering PDB codes or by uploading PDB format files from the local machine. The second service is a multiple 3D alignment, which compares several protein structures. This program employs the progressive alignment algorithm, in which pairwise 3D alignments are assembled in the proper order. The third service is a 3D library search, which compares one query structure against a large number of library structures. We hope this server provides useful tools for gaining insights into protein 3D structures.
5.
The D2-D3 expansion segments of the 28S ribosomal RNA (rRNA) were sequenced and compared to predict secondary structures for Hoplolaiminae species based on free energy minimization and comparative sequence analysis. The free-energy-based prediction method provides putative stem regions within the primary structure, and the base pairings in these stems were confirmed manually by compensatory base changes among closely and distantly related species. Sequence differences ranged from none (identical sequences) between Hoplolaimus columbus and H. seinhorsti to 20.8% between Scutellonema brachyurum and H. concaudajuvencus. The comparative sequence analysis and energy minimization method yielded 9 stems in the D2 and 6 stems in the D3 that showed complete or partial compensatory base changes. At least 75% of nucleotides in the D2 and 68% of nucleotides in the D3 were involved in the formation of base pairings that maintain secondary structure. GC contents in stems ranged from 61 to 73% for the D2 and from 64 to 71% for the D3 region. These ranges are higher than the GC contents in loops, which ranged from 37 to 48% in the D2 and from 33 to 45% in the D3. In stems, G-C/C-G base pairings were the most common in both the D2 and the D3, and non-canonical base pairs including A•A and U•U, C•U/U•C, and G•A/A•G also occurred. The predicted secondary structure model and the new sequence alignment based on predicted secondary structures for the D2 and D3 expansion segments provide useful information for assigning positional nucleotide homology and reconstructing more reliable phylogenetic trees.
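The stem-versus-loop GC comparison reported here can be sketched as follows. This is only an illustration: it assumes the predicted secondary structure is available in dot-bracket notation (paired positions as `(` or `)`, unpaired positions as `.`), which the abstract does not state, and the function names are hypothetical.

```python
def gc_content(sequence):
    """Return the percentage of G and C nucleotides in an RNA/DNA string."""
    if not sequence:
        raise ValueError("empty sequence")
    gc = sum(1 for base in sequence.upper() if base in "GC")
    return 100.0 * gc / len(sequence)


def gc_by_region(sequence, structure):
    """Split a sequence into stem and loop positions and report GC% for each.

    `structure` is assumed to be a dot-bracket string of the same length.
    Raises ValueError if either region is empty.
    """
    if len(sequence) != len(structure):
        raise ValueError("sequence and structure must have equal length")
    stems = "".join(b for b, s in zip(sequence, structure) if s in "()")
    loops = "".join(b for b, s in zip(sequence, structure) if s == ".")
    return {"stems": gc_content(stems), "loops": gc_content(loops)}
```

For a toy hairpin such as `gc_by_region("GGGAAACCC", "(((...)))")`, every stem position is G or C and every loop position is A, mirroring in exaggerated form the stem/loop contrast described in the abstract.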
6.
Motif3D is a web-based protein structure viewer designed to allow sequence motifs, and in particular those contained in the fingerprints of the PRINTS database, to be visualised on three-dimensional (3D) structures. Additional functionality is provided for the rhodopsin-like G protein-coupled receptors, enabling fingerprint motifs of any of the receptors in this family to be mapped onto the single structure available, that of bovine rhodopsin. Motif3D can be used via the web interface available at: http://www.bioinf.man.ac.uk/dbbrowser/motif3d/motif3d.html.
7.
Neha J. Varghese, Supratim Mukherjee, Natalia Ivanova, Konstantinos T. Konstantinidis, Kostas Mavrommatis, Nikos C. Kyrpides, Amrita Pati. Nucleic acids research, 2015, 43(14): 6761-6771
Increased sequencing of microbial genomes has revealed that prevailing prokaryotic species assignments can be inconsistent with whole genome information for a significant number of species. The long-standing need for a systematic and scalable species assignment technique can be met by the genome-wide Average Nucleotide Identity (gANI) metric, which is widely acknowledged as a robust measure of genomic relatedness. In this work, we demonstrate that the combination of gANI and the alignment fraction (AF) between two genomes accurately reflects their genomic relatedness. We introduce an efficient implementation of AF,gANI and discuss its successful application to 86.5M genome pairs between 13,151 prokaryotic genomes assigned to 3032 species. Subsequently, by comparing the genome clusters obtained from complete linkage clustering of these pairs to existing taxonomy, we observed that nearly 18% of all prokaryotic species suffer from anomalies in species definition. Our results can be used to explore central questions such as whether microorganisms form a continuum of genetic diversity or distinct species represented by distinct genetic signatures. We propose that this precise and objective AF,gANI-based species definition, the MiSI (Microbial Species Identifier) method, be used to address previous inconsistencies in species classification and as the primary guide for new taxonomic species assignment, supplemented by the traditional polyphasic approach as required.
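Complete linkage clustering over pairwise AF,gANI values can be sketched as below. This is a naive illustration rather than the authors' implementation: the function name, the `pair_stats` layout, and the threshold values used in the usage note are all hypothetical, because the abstract gives no cutoffs.

```python
def complete_linkage_clusters(genomes, pair_stats, min_af, min_gani):
    """Greedy complete-linkage clustering of genomes.

    `pair_stats` maps frozenset({g1, g2}) to an (AF, gANI) tuple.
    Two clusters are merged only if EVERY cross-cluster pair meets both
    thresholds, which is what distinguishes complete from single linkage.
    """
    def similarity(x, y):
        af, gani = pair_stats[frozenset((x, y))]
        # A pair that fails the AF threshold is treated as unrelated.
        return gani if af >= min_af else 0.0

    def linkage(a, b):
        # Complete linkage: the weakest pair defines cluster similarity.
        return min(similarity(x, y) for x in a for y in b)

    clusters = [[g] for g in genomes]
    while len(clusters) > 1:
        score, i, j = max(
            (linkage(clusters[i], clusters[j]), i, j)
            for i in range(len(clusters))
            for j in range(i + 1, len(clusters))
        )
        if score < min_gani:
            break  # no remaining pair of clusters is similar enough
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters
```

With three genomes where A-B and B-C pass placeholder thresholds but A-C does not, this yields `[["A", "B"], ["C"]]`, whereas single linkage would chain all three together. The quadratic search per merge is fine for a toy example but would not scale to the 86.5M pairs mentioned in the abstract.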
8.
I-TASSER server for protein 3D structure prediction (total citations: 5; self-citations: 0; citations by others: 5)
Yang Zhang. BMC bioinformatics, 2008, 9(1): 40
Background
Prediction of 3-dimensional protein structures from amino acid sequences represents one of the most important problems in computational structural biology. The community-wide Critical Assessment of Structure Prediction (CASP) experiments have been designed to obtain an objective assessment of the state-of-the-art of the field, where I-TASSER was ranked as the best method in the server section of the recent 7th CASP experiment. Our laboratory has since then received numerous requests about the public availability of the I-TASSER algorithm and the usage of the I-TASSER predictions.
9.
Jingfen Zhang, Qingguo Wang, Bogdan Barz, Zhiquan He, Ioan Kosztin, Yi Shang, Dong Xu. Proteins, 2010, 78(5): 1137-1152
There have been steady improvements in protein structure prediction during the past two decades. However, current methods are still far from consistently predicting structural models accurately with the computing power accessible to common users. To achieve more accurate and efficient structure prediction, we developed a number of novel methods and integrated them into a software package, MUFOLD. First, a systematic protocol was developed to identify useful templates and fragments from the Protein Data Bank for a given target protein. Then, an efficient process was applied for iterative coarse-grain model generation and evaluation at the Cα or backbone level. In this process, we construct models using interresidue spatial restraints derived from alignments by multidimensional scaling, evaluate and select models through clustering and static scoring functions, and iteratively improve the selected models by integrating spatial restraints and previous models. Finally, the full-atom models were evaluated using molecular dynamics simulations based on structural changes under simulated heating. We have continuously improved the performance of MUFOLD by using a benchmark of 200 proteins from the Astral database, where no template with >25% sequence identity to any target protein is included. The average root-mean-square deviation of the best models from the native structures is 4.28 Å, which shows significant and systematic improvement over our previous methods. The computing time of MUFOLD is much shorter than that of many other tools, such as Rosetta. MUFOLD demonstrated some success in CASP8, the 2008 community-wide experiment for protein structure prediction. Proteins 2010. © 2009 Wiley-Liss, Inc.
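The root-mean-square deviation used as the accuracy measure in this abstract can be computed as in the sketch below. It assumes the two coordinate sets are already optimally superimposed and list the same atoms (for example Cα atoms) in the same order; the superposition step itself is not shown, and the function name is hypothetical.

```python
import math


def rmsd(coords_a, coords_b):
    """Root-mean-square deviation between two pre-superimposed coordinate sets.

    Each argument is a sequence of (x, y, z) tuples in ångströms,
    with matching atom order.
    """
    if len(coords_a) != len(coords_b) or not coords_a:
        raise ValueError("coordinate sets must be non-empty and equal in length")
    squared = sum(
        (xa - xb) ** 2 + (ya - yb) ** 2 + (za - zb) ** 2
        for (xa, ya, za), (xb, yb, zb) in zip(coords_a, coords_b)
    )
    return math.sqrt(squared / len(coords_a))
```

For instance, two atoms each displaced by 2 Å give an RMSD of 2 Å. Without prior superposition the value would also include rigid-body translation and rotation, so it would overstate the structural difference.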
10.
11.
MOTIVATION: Our aim is to develop a process that automatically defines a repertory of contiguous 3D protein structure fragments and can be used in homology modeling. We present here improvements to the method we introduced previously: the 'hybrid protein model' (de Brevern and Hazout, Theor. Chem. Acc., 106, 36-47, (2001)). The hybrid protein learns a non-redundant databank encoded in a structural alphabet composed of 16 Protein Blocks (PBs; de Brevern et al., Proteins, 41, 271-287, (2000)). Every local fold is learned by looking for the most similar pattern present in the hybrid protein and modifying it slightly. In the end, each position corresponds to a cluster of similar 3D local folds. RESULTS: In this paper, we describe improvements to our method for building an optimal hybrid protein: (i) 'baby training,' which is defined as the introduction of large structure fragments and the progressive reduction in the size of training fragments; and (ii) the deletion of the redundant parts of the hybrid protein. This repertory of contiguous 3D protein structure fragments should be a useful tool for molecular modeling.
12.
13.
Annotation of any newly determined protein sequence depends on the pairwise sequence identity with known sequences. However, for twilight zone sequences, which have only 15–25% identity, pairwise comparison methods are inadequate and annotation becomes a challenging task. Such sequences can be annotated by using methods that recognize their fold. Bowie et al. described a 3D1D profile method in which the amino acid sequences that fold into a known 3D structure are identified by their compatibility with that known 3D structure. We have improved this method by using predicted secondary structure information and employ it for fold recognition from twilight zone sequences. In our Protein Secondary Structure 3D1D (PSS-3D1D) method, a score (w) for the predicted secondary structure of the query sequence is included when assessing the compatibility of the query sequence with the known fold 3D structures. In the benchmarks, the PSS-3D1D method shows a maximum of 21% improvement in correctly predicting the α + β class of folds from sequences with twilight zone levels of identity, when compared with the 3D1D profile method. Hence, the PSS-3D1D method could offer more clues than the 3D1D method for the annotation of twilight zone sequences. The web-based PSS-3D1D method is freely available in the PredictFold server at .
14.
A coding sequence is defined as a DNA sequence coding the primary structure of a protein (a polypeptide). Such a sequence must satisfy a specific constraint, namely that it codes for a functional protein. As the genetic code is degenerate, there exists, for a given polypeptide, a set of synonymous sequences that would code for the same polypeptide. Translation conditional models are defined on such sets. The aim of this paper is to give a common formalism. Besides the codon bias model, a few other conditional models will be defined. Statistical estimators and comparison methods will be briefly presented. These models can be used for gene classification, or to detect remarkable features in a real sequence. An example will be presented on Escherichia coli genes.
15.
16.
Takeshi Kawabata, Satoshi Fukuchi, Keiichi Homma, Motonori Ota, Jiro Araki, Takehiko Ito, Nobuyuki Ichiyoshi, Ken Nishikawa. Nucleic acids research, 2002, 30(1): 294-298
Large-scale genome projects generate an unprecedented number of protein sequences, most of which are experimentally uncharacterized. Predicting the 3D structures of sequences provides important clues as to their functions. We constructed the Genomes TO Protein structures and functions (GTOP) database, containing protein fold predictions for a huge number of sequences. Predictions are mainly carried out with the homology search program PSI-BLAST, currently the most popular among high-sensitivity profile search methods. GTOP also includes the results of other analyses, e.g. homology and motif search, detection of transmembrane helices and repetitive sequences. We have completed analyzing the sequences of 41 organisms, with the number of proteins exceeding 120 000 in total. GTOP uses a graphical viewer to present the analytical results of each ORF in one page in a ‘color-bar’ format. The assigned 3D structures are presented by Chime plug-in or RasMol. The binding sites of ligands are also included, providing functional information. The GTOP server is available at http://spock.genes.nig.ac.jp/~genome/gtop.html.
17.
Diuron, a chlorine-substituted dimethyl herbicide, is widely used in agriculture. Although the degradation of diuron in water has been studied extensively by experiment, little is known about the detailed degradation mechanism at the molecular level. In this work, the degradation mechanisms for OH-induced reactions of diuron in the water phase are investigated at the MPWB1K/6-311+G(3df,2p)//MPWB1K/6-31+G(d,p) level with polarizable continuum model (PCM) calculations. Three reaction types are identified: H-atom abstraction, addition, and substitution. For H-atom abstraction, the calculations show that abstraction of an H atom from the methyl group has the lowest energy barrier; the potential barrier for ortho-H (H1’) abstraction is higher than that for meta-H abstraction, possibly because part of the potential energy goes into overcoming the side-chain torsion in the H1’ abstraction reaction. For the addition pathways, the ortho site (the C(2) atom) is the most favorable site for initial OH attack; the potential barriers for OH addition to the ortho sites (pathways R7 and R8) and the chloro-substituted para site (R10) are lower than those for other sites, indicating that the ortho and para sites are more readily attacked, consistent with the -NHCO- group acting as an ortho-para directing group.
Figure: Representative pathways including abstraction, addition and substitution for OH and diuron reactions
18.
19.
Although multiple sequence alignments (MSAs) are essential for a wide range of applications from structure modeling to prediction of functional sites, construction of accurate MSAs for distantly related proteins remains a largely unsolved problem. The rapidly increasing database of spatial structures is a valuable source to improve alignment quality. We explore the use of 3D structural information to guide sequence alignments constructed by our MSA program PROMALS. The resulting tool, PROMALS3D, automatically identifies homologs with known 3D structures for the input sequences, derives structural constraints through structure-based alignments and combines them with sequence constraints to construct consistency-based multiple sequence alignments. The output is a consensus alignment that brings together sequence and structural information about input proteins and their homologs. PROMALS3D can also align sequences of multiple input structures, with the output representing a multiple structure-based alignment refined in combination with sequence constraints. The advantage of PROMALS3D is that it gives researchers an easy way to produce high-quality alignments consistent with both sequences and structures of proteins. PROMALS3D outperforms a number of existing methods for constructing multiple sequence or structural alignments using both reference-dependent and reference-independent evaluation methods.