Similar literature
 Found 20 similar articles (search time: 140 ms)
1.
We present a web service that automatically assigns sequences to homologous gene families from a set of databases. After the gene family most similar to the query sequence has been identified, the sequence is added to the family alignment and the phylogenetic tree of the family is rebuilt. Thus, the phylogenetic position of the query sequence within its gene family can be easily identified. AVAILABILITY: http://pbil.univ-lyon1.fr/software/HoSeqI/.

2.
3.
Identification of novel kinases based on sequence conservation within the kinase catalytic domain has so far relied on two major approaches: low-stringency hybridization of cDNA libraries and PCR using degenerate primers. Both approaches can be technically difficult and time-consuming. We have developed a procedure that can significantly reduce the time and effort involved in searching for novel kinases and increase the sensitivity of the analysis. This procedure exploits computer analysis of the vast resource of human cDNA sequences represented in the expressed sequence tag (EST) database. Seventeen novel human cDNA clones showing significant homology to serine/threonine kinases, including STE-20, CDK- and YAK-related family kinases, were identified by searching the EST database. Further sequence analysis of these novel kinases, obtained either directly from EST clones or from PCR-RACE products, confirmed their identity as protein kinases. Given the rapid growth of the EST database and the advent of powerful computer analysis software, this approach provides a fast, sensitive, and economical way to identify novel kinases as well as other genes from the EST database.

4.
Novel sequences are DNA sequences present in an individual's genome but absent in the human reference assembly. They are predicted to be biologically important, both individual and population specific, and consistent with the known human migration paths. Recent work has shown that an average person harbors 2–5 Mb of such sequences and estimated that the human pan-genome contains as much as 19–40 Mb of novel sequences. To identify them in a de novo genome assembly, some existing sequence aligners have been used, but no computational method has been specifically proposed for this task. In this work, we developed NSIT (Novel Sequence Identification Tool), a software tool that can accurately and efficiently identify novel sequences in an individual's de novo whole genome assembly. We identified and characterized 1.1 Mb, 1.2 Mb, and 1.0 Mb of novel sequences in the NA18507 (African), YH (Asian), and NA12878 (European) de novo genome assemblies, respectively. Our results show very high concordance with the previous work using the respective reference assembly. In addition, our results using the latest human reference assembly suggest that the amount of novel sequence per individual may not be as high as previously reported. We additionally developed a graphical viewer for comparing novel sequence contents. The viewer also helped in identifying sequence contamination; we found 130 kb of Epstein-Barr virus sequence in the previously published NA18507 novel sequences as well as 287 kb of zebrafish repeats in the NA12878 de novo assembly. NSIT requires 2 GB of RAM and 1.5–2 hours on a commodity desktop. The program is applicable to input assemblies with varying contig/scaffold sizes, ranging from 100 bp to as much as 50 Mb. It works on both 32-bit and 64-bit systems and outperforms, by large margins, other fast sequence aligners previously applied to this task. To our knowledge, NSIT is the first software designed specifically for novel sequence identification in a de novo human genome assembly.

5.
6.
The rice (Oryza sativa) genome contains 1,429 protein kinases, the vast majority of which have unknown functions. We created a phylogenomic database (http://rkd.ucdavis.edu) to facilitate functional analysis of this large gene family. Sequence and genomic data, including gene expression data and protein-protein interaction maps, can be displayed for each selected kinase in the context of a phylogenetic tree allowing for comparative analysis both within and between large kinase subfamilies. Interaction maps are easily accessed through links and displayed using Cytoscape, an open source software platform. Chromosomal distribution of all rice kinases can also be explored via an interactive interface.

7.
8.
MOTIVATION: Identification of short conserved sequence motifs common to a protein family or superfamily can be more useful than overall sequence similarity in suggesting the function of novel gene products. Locating motifs still requires expert knowledge, as automated methods using stringent criteria may not differentiate subtle similarities from statistical noise. RESULTS: We have developed a novel automatic method, based on patterns of conservation of 237 physical-chemical properties of amino acids in aligned protein sequences, to find related motifs in proteins with little or no overall sequence similarity. As an application, our web server MASIA identified 12 property-based motifs in the apurinic/apyrimidinic endonuclease (APE) family of DNA-repair enzymes of the DNase-I superfamily. Searching the ASTRAL40 database with these motifs, using a Bayesian scoring function, located distantly related representatives of the DNase-I superfamily, such as inositol 5'-polyphosphate phosphatases. Other proteins containing APE motifs had no overall sequence or structural similarity. However, all were phosphatases and/or had a metal ion binding active site. Thus our automated method can identify discrete elements in distantly related proteins that define local structure and aspects of function. We anticipate that our method will complement existing ones to functionally annotate novel protein sequences from genomic projects. AVAILABILITY: MASIA web site: http://www.scsb.utmb.edu/masia/masia.html SUPPLEMENTARY INFORMATION: The dendrogram of 42 APE sequences used to derive motifs is available at http://www.scsb.utmb.edu/comp_biol.html/DNA_repair/publication.html

9.
During 1998 the primary focus of the Genome Sequence DataBase (GSDB; http://www.ncgr.org/gsdb) located at the National Center for Genome Resources (NCGR) has been to improve data quality, improve data collections, and provide new methods and tools to access and analyze data. Data quality has been improved by extensive curation of certain data fields necessary for maintaining data collections and for using certain tools. Data quality has also been increased by improvements to the suite of programs that import data from the International Nucleotide Sequence Database Collaboration (IC). The Sequence Tag Alignment and Consensus Knowledgebase (STACK), a database of human expressed gene sequences developed by the South African National Bioinformatics Institute (SANBI), became available within the last year, allowing public access to this valuable resource of expressed sequences. Data access was improved by the addition of the Sequence Viewer, a platform-independent graphical viewer for GSDB sequence data. This tool has also been integrated with other searching and data retrieval tools. A BLAST homology search service was also made available, allowing researchers to search all of the data, including the unique data, that are available from GSDB. These improvements are designed to make GSDB more accessible to users, extend the rich searching capability already present in GSDB, and facilitate the transition to an integrated system containing many different types of biological data.

10.
The vertebrate olfactory receptor (OR) subgenome harbors the largest known gene family, which has been expanded by the need to provide recognition capacity for millions of potential odorants. We implemented an automated procedure to identify all OR coding regions from published sequences. This led us to the identification of 831 OR coding regions (including pseudogenes) from 24 vertebrate species. The resulting dataset was subjected to neighbor-joining phylogenetic analysis and classified into 32 distinct families, 14 of which include only genes from tetrapodan species (Class II ORs). We also report here the first identification of OR sequences from a marsupial (koala) and a monotreme (platypus). Analysis of these OR sequences suggests that the ancestral mammal had a small OR repertoire, which expanded independently in all three mammalian subclasses. Classification of 'fish-like' (Class I) ORs indicates that some of these ancient ORs were maintained and even expanded in mammals. A nomenclature system for the OR gene superfamily is proposed, based on a divergence evolutionary model. The nomenclature consists of the root symbol 'OR', followed by a family numeral, subfamily letter(s), and a numeral representing the individual gene within the subfamily. For example, OR3A1 is an OR gene of family 3, subfamily A, and OR7E12P is an OR pseudogene of family 7, subfamily E. The symbol is to be preceded by a species indicator. We have assigned the proposed nomenclature symbols for all 330 human OR genes in the database. A WWW tool for automated name assignment is provided.
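The symbol layout described in this abstract (root 'OR', family numeral, subfamily letter(s), gene numeral) can be illustrated with a small parser. This is a minimal sketch, not the authors' naming tool: the names `ORSymbol` and `parse_or_symbol` are hypothetical, the trailing `P` as a pseudogene marker is inferred only from the OR7E12P example, and the species indicator that precedes the symbol is not handled because its format is not given.

```python
import re
from typing import NamedTuple, Optional


class ORSymbol(NamedTuple):
    """Parsed fields of an OR gene symbol (hypothetical helper type)."""
    family: int
    subfamily: str
    member: int
    pseudogene: bool


# Root "OR", family numeral, subfamily letter(s), gene numeral.
# The optional trailing "P" is an assumption based on the OR7E12P example.
_OR_PATTERN = re.compile(r"OR(\d+)([A-Z]+)(\d+)(P?)")


def parse_or_symbol(symbol: str) -> Optional[ORSymbol]:
    """Return the parsed symbol, or None if it does not fit the layout."""
    match = _OR_PATTERN.fullmatch(symbol)
    if match is None:
        return None
    family, subfamily, member, suffix = match.groups()
    return ORSymbol(int(family), subfamily, int(member), suffix == "P")


if __name__ == "__main__":
    for name in ("OR3A1", "OR7E12P"):
        print(name, parse_or_symbol(name))
```

Applied to the two examples in the abstract, the parser yields family 3, subfamily A for OR3A1 and family 7, subfamily E, flagged as a pseudogene, for OR7E12P.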

11.
Prostate specific antigen (PSA), a widely used clinical biomarker in prostate cancer diagnostics, exists in multiple molecular forms. However, not all of these forms may be recognized in a given sample by the standard immunoassays. Therefore, we have investigated PSA isoforms, separated by size, using mass spectrometric analyses. The objective of these developments was to identify and specify the various forms of PSA. To optimize successful identification of different PSA forms, we have developed a bioinformatic strategy consisting of high-resolution MALDI-MS PMF and sequencing MS/MS data searches. To improve sequence-based identification, the recently introduced Proteios software environment was employed, allowing the combination of multiple database search engines in an automated manner. We could unambiguously identify PSA in clinical samples by all detectable tryptic peptides, which were found to be common to several isoforms.

12.
MOTIVATION: By identifying an unknown gene or protein as a member of a known family, we can infer a wealth of previously compiled information pertinent to that family and its members. RESULTS: This paper introduces a method that classifies sequences using familial definitions from the PRINTS database, allowing progress to be made with the identification of distant evolutionary relationships. The approach makes use of the contextual information inherent in a multiple-motif method, and has the power to identify hitherto unidentified relationships in mass genome data. We exemplify our method by a comparison of database searches with uncharacterized sequences from the Caenorhabditis elegans and Saccharomyces cerevisiae genome projects. This analysis tool combines a simple, user-friendly interface with the capacity to provide an 'intelligent', biologically relevant result.

13.
For the identification of novel proteins using MS/MS, de novo sequencing software computes one or several possible amino acid sequences (called sequence tags) for each MS/MS spectrum. Those tags are then matched against the sequences in a protein database, allowing for amino acid mutations. If the de novo sequencing gives correct tags, homologs of the proteins can be identified by this approach, and software such as MS-BLAST is available for the matching. However, de novo sequencing very often gives only partially correct tags. The most common error is that a segment of amino acids is replaced by another segment with approximately the same mass. We developed a new efficient algorithm to match sequence tags with errors to database sequences for the purpose of protein and peptide identification. A software package, SPIDER, was developed and made available on the Internet for free public use. This paper describes the algorithms and features of the SPIDER software.
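The error type named in this abstract, one segment replaced by another of approximately the same mass, can be made concrete with a short check. This is only an illustration of that error type, not the SPIDER algorithm: the function name `segments_have_equal_mass` and the 0.01 Da tolerance are assumptions, and the table holds standard monoisotopic residue masses.

```python
# Standard monoisotopic residue masses in daltons.
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}


def segment_mass(segment: str) -> float:
    """Sum the residue masses of a peptide segment."""
    return sum(RESIDUE_MASS[residue] for residue in segment)


def segments_have_equal_mass(a: str, b: str, tolerance: float = 0.01) -> bool:
    """True if two segments differ in mass by less than the tolerance.

    The tolerance is a hypothetical value chosen for this sketch.
    """
    return abs(segment_mass(a) - segment_mass(b)) < tolerance


if __name__ == "__main__":
    # "GG" and "N" are nearly isobaric, so a de novo tag can confuse them.
    print(segments_have_equal_mass("GG", "N"))
    print(segments_have_equal_mass("GA", "Q"))
    print(segments_have_equal_mass("K", "Q"))
```

Pairs such as GG/N and GA/Q fall within the tolerance, which is why a tag matcher has to treat them as possible substitutions rather than mismatches; K and Q differ by about 0.036 Da and are separated at this tolerance.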

14.
ARROGANT (ARRay OrGANizing Tool) is a software tool developed to facilitate the identification, annotation and comparison of large collections of genes or clones. The objective is to enable users to compile gene/clone collections from different databases, allowing them to design experiments and analyze the collections as well as associated experimental data efficiently. ARROGANT can relate different sequence identifiers to their common reference sequence using the UniGene database, allowing for the comparison of data from two different microarray experiments. ARROGANT has been successfully used to analyze microarray expression data for colon cancer, to compile genes potentially related to cardiac diseases for subsequent resequencing (to identify single nucleotide polymorphisms, SNPs), to design a new comprehensive human cDNA microarray for cancer, to combine and compare expression data generated by different microarrays and to provide annotation for genes on custom and Affymetrix chips.

15.
In this minireview I briefly describe the new methods suggested for cloning sequences that are identical by descent, homo- or hemizygously deleted, amplified or polymorphic, and compare them with the most efficient techniques developed earlier. The new methods include cloning of identical sequences (CIS), cloning of polymorphic sequences (COP), and cloning of deleted sequences (CODE). Although these methods are based on the same combination of biochemical techniques, their aims are different. These methods are fully complementary, and they may be combined to analyze a given object. If one aims to clone a disease gene responsible for a familial cancer syndrome, these methods may be applied as follows. CIS can be used to identify the sequences identical by descent by comparing the DNA obtained from affected or unaffected family members. COP can be used to find sequences that differ between affected and unaffected members, and CODE would be useful to compare tumor and normal (control) samples to isolate deleted sequences (putative candidate tumor suppressor genes) and amplified sequences (putative oncogenes). The COP and CODE procedures can be applied to analyze CpG islands, thus allowing direct candidate gene identification.

16.
In 1997 the primary focus of the Genome Sequence DataBase (GSDB; www.ncgr.org/gsdb) located at the National Center for Genome Resources was to improve data quality and accessibility. Efforts to increase the quality of data within the database included two major projects: one to identify and remove all vector contamination from sequences in the database and one to create premier sequence sets (including both alignments and discontiguous sequences). Data accessibility was improved during the course of the last year in several ways. First, a graphical database sequence viewer was made available to researchers. Second, an update process was implemented for the web-based query tool, Maestro. Third, a web-based tool, Excerpt, was developed to retrieve selected regions of any sequence in the database. Lastly, a GSDB flatfile that contains annotation unique to GSDB (e.g., sequence analysis and alignment data) was developed. Additionally, the GSDB web site provides a tool for the detection of matrix attachment regions (MARs), which can be used to identify regions of high coding potential. The ultimate goal of this work is to make GSDB a more useful resource for genomic comparison studies and gene level studies by improving data quality and by providing data access capabilities that are consistent with the needs of both types of studies.

17.
Reverse complementary DNA sequences (sequences that are inadvertently given backwards with all purines and pyrimidines transposed) can affect sequence analysis detrimentally unless taken into account. We present an open-source, high-throughput software tool, v-revcomp (http://www.cmde.science.ubc.ca/mohn/software.html), to detect and reorient reverse complementary entries of the small-subunit rRNA (16S) gene from sequencing datasets, particularly from environmental sources. The software supports sequence lengths ranging from full length down to the short reads that are characteristic of next-generation sequencing technologies. We evaluated the reliability of v-revcomp by screening all 406 781 16S sequences deposited in release 102 of the curated SILVA database and demonstrated that the tool has a detection accuracy of virtually 100%. We subsequently used v-revcomp to analyse 1 171 646 16S sequences deposited in the International Nucleotide Sequence Databases and found that about 1% of these user-submitted sequences were reverse complementary. In addition, a nontrivial proportion of the entries were otherwise anomalous, including reverse complementary chimeras, sequences associated with wrong taxa, nonribosomal genes, sequences of poor quality or otherwise erroneous sequences without a reasonable match to any other entry in the database. Thus, v-revcomp is highly efficient in detecting and reorienting reverse complementary 16S sequences of almost any length and can be used to detect various sequence anomalies.
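As a rough illustration of what detecting and reorienting a reverse complementary entry involves, the sketch below computes a reverse complement and picks the orientation that shares more k-mers with a reference sequence. The k-mer comparison is a simple stand-in chosen for this example; the abstract does not describe how v-revcomp itself makes the decision, and the function names and the default `k=4` are assumptions.

```python
# Translation table for complementing unambiguous DNA bases.
_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq: str) -> str:
    """Complement every base, then reverse the sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def kmers(seq: str, k: int) -> set:
    """Return the set of all substrings of length k."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def orient(query: str, reference: str, k: int = 4) -> str:
    """Return the query in the orientation that shares more k-mers
    with the reference (toy criterion, not v-revcomp's method)."""
    ref_kmers = kmers(reference.upper(), k)
    forward = query.upper()
    backward = reverse_complement(forward)
    forward_hits = len(kmers(forward, k) & ref_kmers)
    backward_hits = len(kmers(backward, k) & ref_kmers)
    return backward if backward_hits > forward_hits else forward


if __name__ == "__main__":
    reference = "ATGCGTACGTTAGC"
    flipped = reverse_complement(reference)
    print(orient(flipped, reference) == reference)
```

A real screen of 16S data would need a more robust reference model than a single sequence, since short reads and divergent taxa share few exact k-mers.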

18.
The plant hormone ethylene is involved in several developmental and physiological processes in plants, including senescence, fruit ripening and organ abscission, as well as in biotic and abiotic stress responses. Initiation of these processes involves complex regulation of both ethylene biosynthesis and the ability of cells to perceive the hormone and respond in an appropriate manner, a process which is regulated both spatially and temporally. Ethylene is a gaseous hormone, and sensitivity to it is a key factor limiting the response in target cells. We searched the Coffee Expressed Sequence Tag (CAFEST) database for expressed sequence tags related to known elements of the ethylene signaling pathway. Sequences showing reliable similarity were clustered, annotated and analyzed for conserved domains. Multiple alignments comprising the sequences that we found and sequences of ethylene signaling elements from other species were made, and their phylogeny was assessed by phylogenetic trees constructed with the MEGA4 software. The expression profile was assessed by in silico Northern blot analysis performed using the Cluster and TreeView programs. The CAFEST database was found to have a large number of sequences related to previously described ethylene signaling pathway elements, allowing identification of putative members from almost every step of this pathway. The phylogenetic trees demonstrated high similarity between the sequences found in the CAFEST database and those from other species, and the electronic Northern blot analysis detected their expression in various tissues, developmental stages and stress conditions.

19.
Sequence alignments are fundamental to a wide range of applications, including database searching, functional residue identification and structure prediction techniques. These applications predict or propagate structural/functional/evolutionary information based on a presumed homology between the aligned sequences. If the initial hypothesis of homology is wrong, no subsequent application, however sophisticated, can be expected to yield accurate results. Here we present a novel method, LEON, to predict homology between proteins based on a multiple alignment of complete sequences (MACS). In MACS, weak signals from distantly related proteins can be considered in the overall context of the family. Intermediate sequences and the combination of individual weak matches are used to increase the significance of low-scoring regions. Residue composition is also taken into account by incorporation of several existing methods for the detection of compositionally biased sequence segments. The accuracy and reliability of the predictions are demonstrated in large-scale comparisons with structural and sequence family databases, where the specificity was shown to be >99% and the sensitivity was estimated to be ~76%. LEON can thus be used to reliably identify the complex relationships between large multidomain proteins and should be useful for automatic high-throughput genome annotations, 2D/3D structure predictions, protein–protein interaction predictions etc.

20.