Similar literature
 20 similar articles found (search time: 31 ms)
1.

Background

Development of sensitive sequence search procedures for the detection of distant relationships between proteins at the superfamily/fold level is still a big challenge. The intermediate sequence search approach is the most frequently employed means of identifying remote homologues effectively. In this study, serine proteases of the prolyl oligopeptidase, rhomboid and subtilisin protein families were examined using plant serine proteases from two genomes (A. thaliana and O. sativa) as queries, along with 13 other families of unrelated folds, to identify distant homologues that could not be obtained using PSI-BLAST.

Methodology/Principal Findings

We have proposed to start with multiple queries of classical serine protease members to identify remote homologues in families, using a rigorous approach like Cascade PSI-BLAST. We found that classical sequence-based approaches, like PSI-BLAST, showed very low sequence coverage in identifying plant serine proteases. The algorithm was applied to an enriched sequence database of homologous domains, and we obtained an overall average coverage of 88% at the family level and 77% at the superfamily or fold level, along with a specificity of ∼100% and a Matthews correlation coefficient of 0.91. A similar approach was also implemented on 13 other protein families representing every structural class in the SCOP database. Further investigation with statistical tests, like jackknifing, helped us to better understand the influence of neighbouring protein families.

Conclusions/Significance

Our study suggests that employing multiple queries from a family in Cascade PSI-BLAST searches is useful for predicting distant relationships effectively, even at the superfamily level. We have proposed a generalized strategy to cover all the distant members of a particular family using multiple query sequences. Our findings reveal that prior selection of sequences as queries and the presence of neighbouring families can be important for covering the search space effectively in minimal computational time. This study also provides an understanding of the ‘bridging’ role of related families.
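The coverage, specificity and Matthews correlation coefficient quoted in this abstract are standard confusion-matrix statistics. The following is a minimal sketch of how such figures can be computed, assuming that hits have already been labelled as true or false positives and negatives; the function name and the example counts are hypothetical and are not taken from the study.

```python
import math


def search_metrics(tp, fp, tn, fn):
    """Return (coverage, specificity, MCC) from confusion-matrix counts.

    Coverage is used here in the sense of sensitivity: the fraction of
    known homologues that the search recovers.
    """
    coverage = tp / (tp + fn) if (tp + fn) else 0.0
    specificity = tn / (tn + fp) if (tn + fp) else 0.0
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # By convention, MCC is reported as 0 when any marginal total is zero.
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return coverage, specificity, mcc


# Hypothetical counts for illustration only.
cov, spec, mcc = search_metrics(tp=88, fp=0, tn=900, fn=12)
print(f"coverage={cov:.2f} specificity={spec:.2f} MCC={mcc:.2f}")
```

Unlike coverage or specificity alone, the MCC uses all four counts, so it stays informative when non-homologues greatly outnumber homologues in the database.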

2.
We have developed a comprehensive expressed sequence tag database search method and used it for the identification of new members of the G-protein coupled receptor superfamily. Our approach proved to be especially useful for the detection of expressed sequence tag sequences that do not encode conserved parts of a protein, making it an ideal tool for the identification of members of divergent protein families or of protein parts without conserved domain structures in the expressed sequence tag database. At least 14 of the expressed sequence tags found with this strategy are promising candidates for new putative G-protein coupled receptors. Here, we describe the sequence and expression analysis of five new members of this receptor superfamily, namely GPR84, GPR86, GPR87, GPR90 and GPR91. We also studied the genomic structure and chromosomal localization of the respective genes using in silico methods. A cluster of six closely related G-protein coupled receptors was found on human chromosome 3q24-3q25. It consists of four orphan receptors (GPR86, GPR87, GPR91, and H963), the purinergic receptor P2Y1, and the uridine 5'-diphosphoglucose receptor KIAA0001. It seems likely that these receptors evolved from a common ancestor and therefore might have related ligands. In conclusion, we describe a data mining procedure that proved to be useful for the identification and first characterization of new genes and is readily applicable to other gene families.

3.
A large proportion of protein-protein interactions is mediated by families of peptide-binding domains. Comprehensive characterization of each of these domains is critical for understanding the mechanisms and networks of protein interaction at the domain level. However, existing methods are all based on large-scale screenings for each domain, which are too inefficient to handle the hundreds of members in major domain families. We developed a systematic strategy for efficient characterization of the binding properties of peptide-binding domains, based on high-throughput validation screening of a specialized candidate ligand library using a yeast two-hybrid mating array. Its outstanding feature is that the overall efficiency is dramatically improved compared with that of traditional screening, and it increases further as the system cycles. The PDZ domain family was first used to test the strategy. Five PDZ domains were rapidly characterized. Broader binding properties were identified compared with other methods, including novel recognition specificities that provided the basis for a major revision of the conventional PDZ classification. Several novel interactions were discovered, serving as significant clues for further functional investigation. This strategy can be easily extended to a variety of peptide-binding domains as a powerful tool for comprehensive analysis of domain binding properties at the proteomic scale.

4.
Li T, Du P, Xu N. PLoS ONE 2010, 5(11): e15411
Phosphorylation is an important type of protein post-translational modification. Identification of possible phosphorylation sites of a protein is important for understanding its functions. Unbiased screening for phosphorylation sites by in vitro or in vivo experiments is time consuming and expensive; in silico prediction can provide functional candidates and help narrow down the experimental efforts. Most of the existing prediction algorithms take only the polypeptide sequence around the phosphorylation sites into consideration. However, protein phosphorylation is a very complex biological process in vivo. The polypeptide sequences around the potential sites are not sufficient to determine the phosphorylation status of those residues. In the current work, we integrated various data sources such as protein functional domains, protein subcellular location and protein-protein interactions, along with the polypeptide sequences to predict protein phosphorylation sites. The heterogeneous information significantly boosted the prediction accuracy for some kinase families. To demonstrate potential application of our method, we scanned a set of human proteins and predicted putative phosphorylation sites for Cyclin-dependent kinases, Casein kinase 2, Glycogen synthase kinase 3, Mitogen-activated protein kinases, protein kinase A, and protein kinase C families (available at http://cmbi.bjmu.edu.cn/huphospho). The predicted phosphorylation sites can serve as candidates for further experimental validation. Our strategy may also be applicable for the in silico identification of other post-translational modification substrates.

5.
Over the past two decades, many ingenious efforts have been made in protein remote homology detection. Because homologous proteins often diversify extensively in sequence, it is challenging to demonstrate such relatedness through entirely sequence-driven searches. Here, we describe a computational method for the generation of 'protein-like' sequences that serves to bridge gaps in protein sequence space. Sequence profile information, as embodied in a position-specific scoring matrix of multiply aligned sequences of bona fide family members, serves as the starting point in this algorithm. The observed amino acid propensity and the selection of a random number dictate the selection of a residue for each position in the sequence. In a systematic manner, and by applying a 'roulette-wheel' selection approach at each position, we generate parent family-like sequences and thus facilitate an enlargement of sequence space around the family. When generated for a large number of families, we demonstrate that they expand the utility of natural intermediately related sequences in linking distant proteins. In 91% of the assessed examples, inclusion of designed sequences improved fold coverage by 5-10% over searches made in their absence. Furthermore, with several examples from proteins adopting folds such as TIM, globin, lipocalin and others, we demonstrate that the success of including designed sequences in a database positively sensitized methods such as PSI-BLAST and Cascade PSI-BLAST and is a promising opportunity for enormously improved remote homology recognition using sequence information alone.
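The 'roulette-wheel' selection described in this abstract can be illustrated with a short sketch. It assumes that each profile position is available as a table of observed residue propensities (a simplification: real PSSMs usually store log-odds scores), and the toy profile, function names and seed below are hypothetical rather than the authors' implementation.

```python
import random


def roulette_pick(propensities, rng):
    """Pick one residue with probability proportional to its propensity.

    Assumes a non-empty mapping of residue -> non-negative weight.
    """
    total = sum(propensities.values())
    spin = rng.random() * total  # where the wheel stops
    running = 0.0
    for residue, weight in propensities.items():
        running += weight
        if spin < running:
            return residue
    return residue  # guard against floating-point round-off on the last slot


def generate_sequence(profile, rng):
    """Build one 'protein-like' sequence, one roulette spin per position."""
    return "".join(roulette_pick(column, rng) for column in profile)


# Toy three-position profile; the values are illustrative only.
profile = [
    {"A": 0.7, "G": 0.2, "S": 0.1},
    {"L": 0.5, "I": 0.3, "V": 0.2},
    {"K": 0.6, "R": 0.4},
]
rng = random.Random(0)
designed = [generate_sequence(profile, rng) for _ in range(5)]
print(designed)
```

Because each position is sampled independently, frequently observed residues dominate the generated sequences while rarer ones still appear occasionally, which is what lets the artificial sequences spread out around the family in sequence space.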

6.
Searches using position-specific scoring matrices (PSSMs) have been commonly used in remote homology detection procedures such as PSI-BLAST and RPS-BLAST. A PSSM is typically generated using one of the sequences of a family as the reference sequence. In the case of PSI-BLAST searches, the reference sequence is the same as the query. Recently we have shown that searches against a database of multiple family-profiles, with each member of the family used as a reference sequence, are more effective than searches against the classical database of single family-profiles. Despite a relatively better overall performance when compared with common sequence-profile matching procedures, searches against the multiple family-profiles database result in a few false positives and false negatives. Here we show that profile length and the divergence of the sequences used in the construction of a PSSM have a major influence on the performance of the multiple-profile-based search approach. We also find that a simple parameter, defined as the number of PSSMs of a family that are hit by a query divided by the total number of PSSMs in the family, can effectively distinguish true positives from false positives in the multiple-profile search approach.
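The discriminating parameter defined in this abstract is a simple ratio. A minimal sketch follows, assuming that the PSSMs of a family and the PSSMs hit by a query are available as collections of identifiers; the identifiers and the function name are hypothetical, and no cutoff value is implied because the abstract does not give one.

```python
def family_hit_fraction(hit_pssms, family_pssms):
    """Fraction of a family's PSSMs that a query hits.

    hit_pssms: identifiers of PSSMs hit by the query (may include other families).
    family_pssms: identifiers of all PSSMs built for the family of interest.
    """
    family = set(family_pssms)
    if not family:
        raise ValueError("family has no PSSMs")
    return len(family & set(hit_pssms)) / len(family)


# Hypothetical identifiers: four PSSMs for one family, three of them hit.
family = ["fam1_ref1", "fam1_ref2", "fam1_ref3", "fam1_ref4"]
hits = ["fam1_ref1", "fam1_ref3", "fam1_ref4", "fam9_ref2"]
print(family_hit_fraction(hits, family))
```

A query that hits most of a family's PSSMs scores close to 1, whereas a spurious match to a single profile scores low.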

7.
MOTIVATION: Many studies have shown that database searches using position-specific score matrices (PSSMs) or profiles as queries are more effective at identifying distant protein relationships than are searches that use simple sequences as queries. One popular program for constructing a PSSM and comparing it with a database of sequences is Position-Specific Iterated BLAST (PSI-BLAST). RESULTS: This paper describes a new software package, IMPALA, designed for the complementary procedure of comparing a single query sequence with a database of PSI-BLAST-generated PSSMs. We illustrate the use of IMPALA to search a database of PSSMs for protein folds, and one for protein domains involved in signal transduction. IMPALA's sensitivity to distant biological relationships is very similar to that of PSI-BLAST. However, IMPALA employs a more refined analysis of statistical significance and, unlike PSI-BLAST, guarantees the output of the optimal local alignment by using the rigorous Smith-Waterman algorithm. Also, it is considerably faster when run with a large database of PSSMs than is BLAST or PSI-BLAST when run against the complete non-redundant protein database.

8.
In this study, we show that it is possible to increase the performance over PSI-BLAST by using evolutionary information for both query and target sequences. This information can be used in three different ways: by sequence linking, by profile-profile alignments, and by combining sequence-profile and profile-sequence searches. If only PSI-BLAST is used, 16% of superfamily-related protein domains can be detected at 90% specificity, but if a sequence-profile and a profile-sequence search are combined, this is increased to 20%; profile-profile searches detect 19%, whereas a linking procedure identifies 22% of these proteins. All three methods show equal performance, but the best combination of speed and accuracy seems to be obtained by the combined searches, because this method shows good performance even at high specificity and has the lowest computational cost. In addition, we show that the E-values reported by all these methods, including PSI-BLAST, underestimate the true rate of false positives. This behavior is seen even if a very strict E-value cutoff and a limited number of iterations are used. However, the difference is more pronounced with a looser E-value cutoff and more iterations.

9.
EPS8 codes for a protein essential in Ras to Rac signaling leading to actin remodeling. Three genes highly homologous to EPS8 were discovered, thereby defining a novel gene family. Here, we report the genomic structure of EPS8 and the EPS8-related genes in human and mouse. We performed BLASTN searches against the Celera Human Genome and Mouse Fragments Database. The mouse fragments were manually assembled, and the organization of both human and mouse genes was reconstructed. The gene structures in Celera annotations of the human and mouse genomes were compared to outline correspondences and divergences. We also compared the EPS8 family gene structures predicted by Celera with those predicted by NCBI. Moreover, we performed a virtual analysis of the expression of the EPS8 gene family members by using the SAGEmap Database in NCBI. Finally, we analyzed the domain organization of the gene products and their evolutionary conservation to define novel putative domains, thereby helping to predict novel modes of action for the members of this gene family. The data obtained will be instrumental in directing further experimental functional characterization of these genes.

10.
Profile-based sequence search procedures are commonly employed to detect remote relationships between proteins. We provide an assessment of a Cascade PSI-BLAST protocol that rigorously employs intermediate sequences in detecting remote relationships between proteins. In this approach we detect using PSI-BLAST, which involves multiple rounds of iteration, an initial set of homologues for a protein in a 'first generation' search by querying a database. We propagate a 'second generation' search in the database, involving multiple runs of PSI-BLAST using each of the homologues identified in the previous generation as queries to recognize homologues not detected earlier. This non-directed search process can be viewed as an iteration of iterations that is continued to detect further homologues until no new hits are detectable. We present an assessment of the coverage of this 'cascaded' intermediate sequence search on diverse folds and find that searches for up to three generations detect most known homologues of a query. Our assessments show that this approach appears to perform better than the traditional use of PSI-BLAST by detecting 15% more relationships within a family and 35% more relationships within a superfamily. We show that such searches can be performed on generalized sequence databases and non-trivial relationships between proteins can be detected effectively. Such a propagation of searches maximizes the chances of detecting distant homologies by effectively scanning protein "fold space".
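The 'iteration of iterations' described in this abstract amounts to a generation-by-generation expansion over homologues. The sketch below shows only that control flow, under stated assumptions: `search` is a hypothetical callable standing in for one complete PSI-BLAST run (it is not a real BLAST API), and the toy hit table replaces a sequence database.

```python
def cascade_search(query, search, max_generations=3):
    """Cascaded intermediate-sequence search.

    search(seq_id) is assumed to return the identifiers of homologues found
    by one full PSI-BLAST run started from seq_id (hypothetical wrapper).
    Returns a dict mapping each new homologue to the generation that found it.
    """
    seen = {query}
    generation_of = {}
    frontier = [query]
    for generation in range(1, max_generations + 1):
        next_frontier = []
        for seq_id in frontier:
            for hit in search(seq_id):
                if hit not in seen:
                    seen.add(hit)
                    generation_of[hit] = generation
                    next_frontier.append(hit)
        if not next_frontier:  # no new hits: the cascade has converged
            break
        frontier = next_frontier  # new hits become the next generation's queries
    return generation_of


# Toy stand-in for a database: which sequences each query would retrieve.
toy_hits = {"Q": ["A", "B"], "A": ["Q", "C"], "B": [], "C": ["D"], "D": []}
result = cascade_search("Q", lambda s: toy_hits.get(s, []))
print(result)
```

In the toy data, `C` and `D` are reachable only through the intermediate sequences `A` and `C`, which is the situation the cascade is designed to exploit. The default of three generations follows the abstract's observation that up to three generations detect most known homologues.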

11.
Russell MW, Raeker MO, Korytkowski KA, Sonneman KJ. Gene 2002, 282(1-2): 237-246
Members of the Dbl family of guanine nucleotide exchange factors (GEFs) have important roles in the organization of actin-based cytoskeletal structures of a wide variety of cell types. Through the activation of members of the Rho family of GTP signaling molecules, these exchange factors elicit cytoskeletal alterations that allow cellular remodeling. As important regulators of RhoGTPase activity, members of this family are candidates for mediating the RhoGTPase activation and cytoskeletal changes that occur during cardiac development and during the myocardial response to hypertrophic stimuli. In this study, we characterize a novel human gene that is expressed in skeletal and cardiac muscle and has putative functional domains similar to those found in members of both the Dbl family of GEFs and the titin family of myosin light chain kinases (MLCK). The cDNA sequence of this gene, which has been designated Obscurin-myosin light chain kinase (Obscurin-MLCK), is predicted to encode at least 68 immunoglobulin domains, two fibronectin domains, one calcium/calmodulin binding domain, a RhoGTP exchange factor domain, and two serine-threonine kinase domains. The combination of the putative Rho GEF and two kinase domains has not been noted in any other members of the titin or Dbl families. Alternative splicing allows the generation of a number of unique Obscurin-MLCK isoforms that contain various combinations of the functional domains. One group of isoforms is comparable to Unc-89, a Caenorhabditis elegans sarcomere-associated protein, in that they contain a putative RhoGEF domain and multiple immunoglobulin repeats. Other isoforms more closely resemble MLCK, containing one or both of the putative carboxy-terminal serine-threonine kinase domains. The modular nature of the Obscurin-MLCK isoforms indicates that it may have an array of functions important to cardiac and skeletal muscle physiology.

12.
Abstract

Profile-based sequence search procedures are commonly employed to detect remote relationships between proteins. We provide an assessment of a Cascade PSI-BLAST protocol that rigorously employs intermediate sequences in detecting remote relationships between proteins. In this approach we detect using PSI-BLAST, which involves multiple rounds of iteration, an initial set of homologues for a protein in a ‘first generation’ search by querying a database. We propagate a ‘second generation’ search in the database, involving multiple runs of PSI-BLAST using each of the homologues identified in the previous generation as queries to recognize homologues not detected earlier. This non-directed search process can be viewed as an iteration of iterations that is continued to detect further homologues until no new hits are detectable. We present an assessment of the coverage of this ‘cascaded’ intermediate sequence search on diverse folds and find that searches for up to three generations detect most known homologues of a query. Our assessments show that this approach appears to perform better than the traditional use of PSI-BLAST by detecting 15% more relationships within a family and 35% more relationships within a superfamily. We show that such searches can be performed on generalized sequence databases and non-trivial relationships between proteins can be detected effectively. Such a propagation of searches maximizes the chances of detecting distant homologies by effectively scanning protein “fold space”.

13.
A technique has been developed to search a proteome database for new members of a functional class of membrane protein. It takes advantage of the highly conserved secondary structure of functionally related membrane proteins. Such proteins typically have the same number of transmembrane domains located at similar relative positions in their polypeptide sequence. This gives rise to a characteristic pattern of peaks in their hydropathy profiles. To conduct a search, each member of a polypeptide database is converted to a hydropathy profile, peaks are automatically detected, and the pattern of peaks is compared with a template. A template was designed for the acetylcholine (ACh) and glycine receptors of the cys-loop receptor superfamily. The key feature was a closely spaced triplet of hydropathy peaks bracketed by deep valleys. When applied to the human proteome, the search procedure retrieved 153 profiles with a receptor-like triplet of peaks. The approach was highly selective, with 70% of the retrieved profiles annotated as known or putative receptors. These included ACh, glycine, gamma-aminobutyric acid and serotonin receptors, which are all related by sequence. However, ionotropic glutamate receptors, which have almost no sequence homology with ACh receptors, were also retrieved. Thus, the strategy can find members of a functional class that cannot be identified by sequence alignment. To demonstrate that the strategy can easily be extended to other membrane protein families, a template was developed for the neurotransmitter/Na+ symporter family, and similar results were obtained. This approach should prove a useful adjunct to sequence-based retrieval tools when searching for novel membrane proteins.
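The first two steps of this procedure, converting a sequence to a hydropathy profile and detecting its peaks, can be sketched as follows. The abstract does not name a hydropathy scale, window size or peak criterion, so the Kyte-Doolittle scale, the 19-residue window and the 1.6 threshold used here are assumptions chosen for illustration; the template-matching step is not shown.

```python
# Kyte-Doolittle hydropathy values (assumed scale; the abstract names none).
KD = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def hydropathy_profile(seq, window=19):
    """Sliding-window mean hydropathy; entry j is centred on residue j + window // 2."""
    scores = [KD[aa] for aa in seq]
    half = window // 2
    return [
        sum(scores[i - half:i + half + 1]) / window
        for i in range(half, len(seq) - half)
    ]


def find_peaks(profile, threshold=1.6):
    """Indices of local maxima at or above the threshold (assumed cutoff)."""
    return [
        i for i in range(1, len(profile) - 1)
        if profile[i] >= threshold
        and profile[i] > profile[i - 1]
        and profile[i] >= profile[i + 1]
    ]


# Synthetic sequence: two hydrophobic 19-residue stretches in a polar background.
seq = "D" * 20 + "L" * 19 + "D" * 20 + "L" * 19 + "D" * 20
peaks = find_peaks(hydropathy_profile(seq))
print(peaks)
```

A template comparison of the kind described would then examine the number of peaks and the spacing between them, for example looking for a closely spaced triplet flanked by deep valleys.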

14.
Sequence databases are rapidly growing, thereby increasing the coverage of protein sequence space, but this coverage is uneven because most sequencing efforts have concentrated on a small number of organisms. The resulting granularity of sequence space creates many problems for profile-based sequence comparison programs. In this paper, we suggest several strategies that address these problems, and at the same time speed up the searches for homologous proteins and improve the ability of profile methods to recognize distant homologies. One of our strategies combines database clustering, which removes highly redundant sequences, and a two-step PSI-BLAST (PDB-BLAST), which separates the sequence space used for profile composition from the space used for homology searching. The combination of these strategies improves distant homology recognition by more than 100%, while using only 10% of the CPU time of the standard PSI-BLAST search. Another method, intermediate profile searches, allows for the exploration of additional search directions that are normally dominated by large protein sub-families within very diverse families. All methods are evaluated with a large fold-recognition benchmark.

15.
The ProDom database is a comprehensive set of protein domain families automatically generated from the SWISS-PROT and TrEMBL sequence databases. An associated database, ProDom-CG, has been derived as a restriction of ProDom to completely sequenced genomes. The ProDom construction method is based on iterative PSI-BLAST searches and multiple alignments are generated for each domain family. The ProDom web server provides the user with a set of tools to visualise multiple alignments, phylogenetic trees and domain architectures of proteins, as well as a BLAST-based server to analyse new sequences for homologous domains. The comprehensive nature of ProDom makes it particularly useful to help sustain the growth of InterPro.

16.
Consensus-Degenerate Hybrid Oligonucleotide Primer (CODEHOP) PCR primers derived from amino acid sequence motifs that are highly conserved between members of a protein family have proven to be highly effective in the identification and characterization of distantly related family members. Here, the use of the CODEHOP strategy to identify novel viruses and obtain sequence information for phylogenetic characterization, gene structure determination and genome analysis is reviewed. While this review describes techniques for the identification of members of the herpesvirus family of DNA viruses, the same methodology and approach are applicable to other virus families.

17.
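cDNA cloning and expression analysis of Ercc6l, a new member of the SNF2 family   Cited by: 3 (self-citations: 0, citations by others: 3)
SNF2 family proteins play important roles in genome replication, repair and expression. We report the cDNA cloning, characterization and expression analysis of Ercc6l (excision repair cross-complementing rodent repair deficiency, complementation group 6-like), a new member of the SNF2 family. Through expressed sequence tag (EST) searching and assembly, we obtained a new gene, Ercc6l (GenBank Acc. No. AY172688), with a full-length cDNA of 4002 bp, and then successfully cloned the gene from mouse embryonic heart by RT-PCR. In the mouse genome, Ercc6l consists of two exons and one intron and maps to the X chromosome; its longest open reading frame (ORF) encodes a putative protein of 1,240 amino acids. This putative protein contains the eight conserved motifs of SNF2 proteins (the SNF2 domain). Multiple alignment with members of the SNF2 subfamilies tentatively identified Ercc6l as a member of the ERCC6 subfamily. When the Ercc6l coding region was cloned into pEGFP-C3 and transfected into HeLa, 3T3 and B16 cells, the fusion protein localized mainly to the cytoplasm. A BLAST search retrieved 69 mouse ESTs homologous to Ercc6l, mainly from embryonic and tumour tissues. RT-PCR on multiple tissues at different developmental stages of the mouse showed that Ercc6l is strongly expressed during the embryonic period and markedly down-regulated after birth. These results suggest that Ercc6l may play an important role in embryonic development and tumorigenesis.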

18.
Alignments grow, secondary structure prediction improves.   Cited by: 12 (self-citations: 0, citations by others: 12)
Using information from sequence alignments significantly improves protein secondary structure prediction. Typically, more divergent profiles yield better predictions. Recently, various groups have shown that accuracy can be improved significantly by using PSI-BLAST profiles to develop new prediction methods. Here, we focused on the influences of various alignment strategies on two 8-year-old PHD methods. The following results stood out. (i) PHD using pairwise alignments predicts about 72% of all residues correctly in one of the three states: helix, strand, and other. Using larger databases and PSI-BLAST raised accuracy to 75%. (ii) More than 60% of the improvement originated from the growth of current sequence databases; about 20% resulted from detailed changes in the alignment procedure (substitution matrix, thresholds, and gap penalties). Another 20% of the improvement resulted from carefully using iterated PSI-BLAST searches. (iii) It is of interest that we failed to improve prediction accuracy further when attempting to refine the alignment by dynamic programming (MaxHom and ClustalW). (iv) Improvement through family growth appears to saturate at some point. However, most families have not reached this saturation. Hence, we anticipate that prediction accuracy will continue to rise with database growth.

19.
We describe a method to assign a protein structure to a functional family using family-specific fingerprints. Fingerprints represent amino acid packing patterns that occur in most members of a family but are rare in the background, a nonredundant subset of PDB; their information is additional to sequence alignments, sequence patterns, structural superposition, and active-site templates. Fingerprints were derived for 120 families in SCOP using Frequent Subgraph Mining. For a new structure, all occurrences of these family-specific fingerprints may be found by a fast algorithm for subgraph isomorphism; the structure can then be assigned to a family with a confidence value derived from the number of fingerprints found and their distribution in background proteins. In validation experiments, we infer the function of new members added to SCOP families and we discriminate between structurally similar, but functionally divergent TIM barrel families. We then apply our method to predict function for several structural genomics proteins, including orphan structures. Some predictions have been corroborated by other computational methods and some validated by subsequent functional characterization.

20.
The bile/arsenite/riboflavin transporter (BART) superfamily   Cited by: 1 (self-citations: 0, citations by others: 1)
Secondary transmembrane transport carriers fall into families and superfamilies allowing prediction of structure and function. Here we describe hundreds of sequenced homologues that belong to six families within a novel superfamily, the bile/arsenite/riboflavin transporter (BART) superfamily, of transport systems and putative signalling proteins. Functional data are available for members of three of these families: the bile acid:Na(+) symporter (BASS) family, which transports bile salts and other organic anions; the arsenical resistance-3 (Acr3) family, which transports inorganic anions such as arsenite and antimonite; and the riboflavin transporter (RFT) family. The first two of these families, as well as one more family with no functionally characterized members, exhibit a probable 10 transmembrane spanner (TMS) topology that arose from a tandemly duplicated 5 TMS unit. Members of the RFT family have a 5 TMS topology, and are homologous to each of the repeat units in the 10 TMS proteins. The other two families [sensor histidine kinase (SHK) and kinase/phosphatase/synthetase/hydrolase (KPSH)] have a single 5 TMS unit preceded by an N-terminal TMS and followed by a hydrophilic sensor histidine kinase domain (the SHK family) or catalytic domains resembling sensor kinase, phosphatase, cyclic di-GMP synthetase and cyclic di-GMP hydrolase catalytic domains, as well as various noncatalytic domains (the KPSH family). Because functional data are not available for members of the SHK and KPSH families, it is not known if the transporter domains retain transport activity or have evolved exclusive functions in molecular reception and signal transmission. This report presents characteristics of a unique protein superfamily and provides guides for future studies concerning structural, functional and mechanistic properties of its constituent members.
