Similar literature
 20 similar articles found (search time: 46 ms)
1.
MOTIVATION: Tandem repeats are associated with disease genes, play an important role in evolution and are important in genomic organization and function. Although much research has been done on short perfect patterns of repeats, there has been less focus on imperfect repeats. Thus, there is an acute need for a tandem repeats database that provides reliable and up-to-date information on both perfect and imperfect tandem repeats in the human genome and relates these to disease genes. RESULTS: This paper presents a web-accessible relational tandem repeats database that relates tandem repeats to gene locations and disease genes of the human genome. In contrast to other available databases, this database identifies both perfect and imperfect repeats of 1-2000 bp unit lengths. The utility of this database has been illustrated by analysing these repeats for their distribution and frequencies across chromosomes and genomic locations and between protein-coding and non-coding regions. The applicability of this database to identify diseases associated with previously uncharacterized tandem repeats is demonstrated.

2.
Jorda J, Baudrand T, Kajava AV. Proteomics 2012, 12(9):1333-1336
Rapidly increasing genomic data present new challenges for scientists: making sense of millions of amino acid sequences requires a systematic approach and information about their 3D structure, function, and evolution. Over the last decade, numerous studies demonstrated the fundamental importance of protein tandem repeats and their involvement in human diseases. Bioinformatics analysis of these regions requires special computer programs and databases, since the conventional approaches predominantly developed for globular domains have limited success. To perform a global comparative analysis of protein tandem repeats, we developed the Protein Tandem Repeat DataBase (PRDB). PRDB is a curated database that includes the protein tandem repeats found in sequence databanks by the T-REKS program. The database is available at http://bioinfo.montp.cnrs.fr/?r=repeatDB

3.
A novel hybrid methodology for the automated identification of peptides via de novo integer linear optimization, local database search, and tandem mass spectrometry is presented in this article. A modified version of the de novo identification algorithm PILOT is utilized to construct accurate de novo peptide sequences. A modified version of the local database search tool FASTA is used to query these de novo predictions against the nonredundant protein database to resolve any low-confidence amino acids in the candidate sequences. The computational burden associated with performing several alignments is alleviated with the use of distributive computing. Extensive computational studies are presented for this new hybrid methodology, as well as comparisons with MASCOT for a set of 38 quadrupole time-of-flight (QTOF) and 380 OrbiTrap tandem mass spectra. The results for our proposed hybrid method for the OrbiTrap spectra are also compared with a modified version of PepNovo, which was trained for use on high-precision tandem mass spectra, and the tag-based method InsPecT. The de novo sequences of PILOT and PepNovo are also searched against the nonredundant protein database using CIDentify to compare with the alignments achieved by our modifications of FASTA. The comparative studies demonstrate the excellent peptide identification accuracy gained from combining the strengths of our de novo method, which is based on integer linear optimization, and database-driven search methods.

4.
MITOP (http://www.mips.biochem.mpg.de/proj/medgen/mitop/) is a comprehensive database for genetic and functional information on both nuclear- and mitochondrial-encoded proteins and their genes. The five species files (Saccharomyces cerevisiae, Mus musculus, Caenorhabditis elegans, Neurospora crassa and Homo sapiens) include annotated data derived from a variety of online resources and the literature. A wide spectrum of search facilities is given in the overlapping sections 'Gene catalogues', 'Protein catalogues', 'Homologies', 'Pathways and metabolism' and 'Human disease catalogue' including extensive references and hyperlinks to other databases. Central features are the results of various homology searches, which should facilitate the investigations into interspecies relationships. Precomputed FASTA searches using all the MITOP yeast protein entries and a list of the best human EST hits with graphical cluster alignments related to the yeast reference sequence are presented. The orthologue tables with cross-listings to all the protein entries for each species in MITOP have been expanded by adding the genomes of Rickettsia prowazekii and Escherichia coli. To find new mitochondrial proteins the complete yeast genome has been analyzed using the MITOPROT program which identifies mitochondrial targeting sequences. The 'Human disease catalogue' contains tables with a total of 110 human diseases related to mitochondrial protein abnormalities, sorted by clinical criteria and age of onset. MITOP should contribute to the systematic genetic characterization of the mitochondrial proteome in relation to human disease.

5.
BACKGROUND: Repbase is a reference database of eukaryotic repetitive DNA, which includes prototypic sequences of repeats and basic information described in annotations. Updating and maintenance of the database requires specialized tools, which we have created and made available for use with Repbase, and which may be useful as a template for other curated databases.

6.
The MITOP database (http://websvr.mips.biochem.mpg.de/proj/medgen/mitop/) consolidates information on both nuclear- and mitochondrial-encoded genes and their proteins. The five species files (Saccharomyces cerevisiae, Mus musculus, Caenorhabditis elegans, Neurospora crassa and Homo sapiens) include annotated data derived from a variety of online resources and the literature. A wide spectrum of search facilities is given in the interrelated sections 'Gene catalogues', 'Protein catalogues', 'Homologies', 'Pathways and metabolism', and 'Human disease catalogue' including extensive references and hyperlinks for each entry. Precomputed FASTA searches using all the MITOP yeast protein entries and a list of the best EST hits with graphical cluster alignments related to the yeast reference sequence are presented. The MITOP orthologue tables with cross-listing to all the protein entries for each species in the database facilitate investigations into interspecies homology. A program (MITOPROT) is available to identify mitochondrial targeting sequences and graphical depictions of several important mitochondrial processes are included. The 'Human disease catalogue' lists a total of 101 disorders related to mitochondrial protein abnormalities, sorted by clinical criteria and age of onset.

7.
MRD is a database system for accessing microsatellite repeat information for genomes such as archaea, eubacteria, and other eukaryotic genomes whose sequence information is available in the public domain. MRD stores information about simple tandemly repeated k-mer sequences where k = 1 to 6, i.e. monomer to hexamer. The web interface allows users to search for a repeat of interest and to learn about the association of the repeat with genes and genomic regions in the specific organism. The data contain the abundance and distribution of microsatellites in the coding and non-coding regions of the genome. The exact location of repeats with respect to genomic regions of interest (such as UTR, exon, intron or intergenic regions), whichever is applicable to the organism, is highlighted. MRD is available on the World Wide Web at and/or . The database is designed as an open-ended system to accommodate the microsatellite repeat information of other genomes whose complete sequences will become available in the future through the public domain.
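The kind of repeat described above (a k-mer unit, k = 1 to 6, repeated in tandem) can be illustrated with a short sketch. This is not MRD's actual method, which the abstract does not describe; the function name `find_microsatellites` and the `min_repeats` threshold are assumptions made for illustration only.

```python
import re


def find_microsatellites(seq, min_repeats=3, max_k=6):
    """Return sorted (start, unit, copies) for perfect tandem repeats, unit length 1..max_k.

    Illustrative sketch only: min_repeats is an assumed threshold, not a value
    taken from the abstract.
    """
    seq = seq.upper()
    hits = []
    for k in range(1, max_k + 1):
        # A k-mer unit followed by at least (min_repeats - 1) further copies of itself.
        pattern = re.compile(r"([ACGT]{%d})\1{%d,}" % (k, min_repeats - 1))
        for m in pattern.finditer(seq):
            unit = m.group(1)
            # Skip units that are themselves built from a shorter motif (e.g. "AA" for k = 2),
            # so a mononucleotide run is not reported again as a dinucleotide repeat.
            if any(unit == unit[:d] * (k // d) for d in range(1, k) if k % d == 0):
                continue
            hits.append((m.start(), unit, len(m.group(0)) // k))
    return sorted(hits)


if __name__ == "__main__":
    for start, unit, copies in find_microsatellites("TTACACACACGGGGGTT"):
        print(start, unit, copies)
```

A regular expression with a back-reference keeps the sketch short; a production tool scanning whole genomes would more likely use a linear-time scan and would also record the genomic feature (UTR, exon, intron, intergenic) overlapping each hit.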

8.
There are many computer programs that can match tandem mass spectra of peptides to database-derived sequences; however, situations can arise where mass spectral data cannot be correlated with any database sequence. In such cases, sequences can be automatically deduced de novo, without recourse to sequence databases, and the resulting peptide sequences can be used to perform homologous nonexact searches of sequence databases. This article describes details on how to implement both a de novo sequencing program called “Lutefisk,” and a version of FASTA that has been modified to account for sequence ambiguities inherent in tandem mass spectrometry data.

9.
The positional candidate gene approach accelerates the discovery of genes involved in disease. However, the properties of such disease genes are very diverse, and the sample of known disease genes is too small to warrant success with a machine-learning approach. A user-defined scoring system may thus help to determine the priority of candidate genes. Spinocerebellar ataxia (SCA) is a good model to test this approach because most SCA subtypes are caused by an expansion of short tandem repeats (STRs). The SCA db is a candidate gene database for SCA, which collected 3185 genes for 17 types of SCA. Those SCA subtypes that have known disease genes can be used as positive controls to optimize the parameters. Users may browse the candidate genes of a given SCA subtype by using the default parameters. The known disease genes were found to be among the top three candidates using the default parameters. Alternatively, users may score the candidate genes by changing the weights or the scores on the basis of their own working hypothesis. AVAILABILITY: This database is available at http://ymbc.ym.edu.tw/sca/

10.
SUMMARY: The Arthropodan Mitochondrial Genomes Accessible database (AMiGA) is a relational database developed to help in managing access to the increasing amount of data arising from developments in arthropodan mitochondrial genomics (136 mitochondrial genomes as of September 2005). The strengths of AMiGA include (1) a more accessible and up-to-date database containing a more comprehensive set of mitochondrial genomes for this phylum, (2) the provision of flexible search options for retrieving detailed information such as bibliographical data, genomic graphics, FASTA sequences and taxonomical status, (3) the possibility of enhanced comparative analyses by multiple alignment of single or concatenated sets of genes, (4) more accurate and updated information resulting from a specific curation process called AMiGA Notes and (5) the possibility of including unpublished sequences in a password-restricted area for comparative analysis with the other sequences stored in the database. AVAILABILITY: http://amiga.cbmeg.unicamp.br CONTACT: lessinger@amiga.cbmeg.unicamp.br SUPPLEMENTARY INFORMATION: Detailed information, including an illustrated tutorial, is available from the above URL.

11.
HOWDY: an integrated database system for human genome research
HOWDY is an integrated database system for accessing and analyzing human genomic information (http://www-alis.tokyo.jst.go.jp/HOWDY/). HOWDY stores information about relationships between genetic objects and the data extracted from a number of databases. HOWDY consists of an Internet accessible user interface that allows thorough searching of the human genomic databases using the gene symbols and their aliases. It also permits flexible editing of the sequence data. The database can be searched using simple words and the search can be restricted to a specific cytogenetic location. Linear maps displaying markers and genes on contig sequences are available, from which an object can be chosen. Any search starting point identifies all the information matching the query. HOWDY provides a convenient search environment of human genomic data for scientists unsure which database is most appropriate for their search.

12.
Issac B, Raghava GP. BioTechniques 2002, 33(3):548-50, 552, 554-6
Similarity searches are a powerful method for solving important biological problems such as database scanning, evolutionary studies, gene prediction, and protein structure prediction. FASTA is a widely used sequence comparison tool for rapid database scanning. Here we describe the GWFASTA server that was developed to assist the FASTA user in similarity searches against partially and/or completely sequenced genomes. GWFASTA consists of more than 60 microbial genomes, eight eukaryote genomes, and proteomes of annotated genomes. In fact, it provides the maximum number of databases for similarity searching from a single platform. GWFASTA allows the submission of more than one sequence as a single query for a FASTA search. It also provides integrated post-processing of FASTA output, including compositional analysis of proteins, multiple sequence alignment, and phylogenetic analysis. Furthermore, it summarizes the search results organism-wise for prokaryotes and chromosome-wise for eukaryotes. Thus, the integration of different tools for sequence analyses makes GWFASTA a powerful tool for biologists.

13.
Exact Tandem Repeats Analyzer 1.0 (E-TRA) combines sequence motif searches with keywords such as ‘organs’, ‘tissues’, ‘cell lines’ and ‘development stages’ for finding simple exact tandem repeats as well as non-simple repeats. E-TRA has several advanced repeat search parameters/options compared to other repeat finder programs, as it not only accepts GenBank, FASTA and expressed sequence tag (EST) sequence files, but also analyzes multiple files with multiple sequences. The minimum and maximum tandem repeat motif lengths that E-TRA finds vary from one to one thousand. Advanced user-defined parameters/options let researchers use different minimum motif repeat search criteria for varying motif lengths simultaneously. One of the most interesting features of genomes is the presence of relatively short tandem repeats (TRs). These repeated DNA sequences are found in both prokaryotes and eukaryotes, distributed almost at random throughout the genome. Some of the tandem repeats play important roles in the regulation of gene expression whereas others do not have any known biological function as yet. Nevertheless, they have proven to be very beneficial in DNA profiling and genetic linkage analysis studies. To demonstrate the use of E-TRA, we used 5,465,605 human EST sequences derived from 18,814,550 GenBank EST sequences. Our results indicated that 12.44% (679,800) of the human EST sequences contained simple and non-simple repeat string patterns varying from one to 126 nucleotides in length. The results also revealed that human organs, tissues, cell lines and different developmental stages differed in number of repeats as well as repeat composition, indicating that the distribution of expressed tandem repeats among tissues or organs is not random, thus differing from the un-transcribed repeats found in genomes.

14.
Zhikong scallop (Chlamys farreri Jones et Preston, 1904) is one of the most commercially important bivalves in China, but research on its genome is underdeveloped. In this study, we constructed the first Zhikong scallop fosmid library, and analyzed the fosmid end sequences to provide a preliminary assessment of the genome. The library consists of 133,851 clones with an average insert size of about 40 kb, amounting to 4.3 genome equivalents. Fosmid stability assays indicate that Zhikong scallop DNA was stable during propagation in the fosmid system. Library screening with two genes and seven microsatellite markers yielded between two and eight positive clones, and none of those tested was absent from the library. End-sequencing of 480 individual clones generated 828 sequences after trimming, with an average sequence length of 624 bp. BLASTN searches of the nr and EST databases of GenBank and BLASTX searches of the nr database resulted in 213 (25.72%) and 44 (5.31%) significant hits (E < e−5), respectively. Repetitive sequences analysis resulted in 375 repeats, accounting for 15.84% of total length, which were composed of interspersed repetitive sequences, tandem repeats, and low-complexity sequences. The fosmid library, in conjunction with the fosmid end sequences, will serve as a useful resource for physical mapping and positional cloning, and provide a better understanding of the Zhikong scallop genome.
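The library figures quoted above can be cross-checked with a few lines of arithmetic. The sketch below derives the genome size implied by 133,851 clones, a 40 kb average insert and 4.3 genome equivalents, and recomputes the two hit rates from the 828 end sequences. The estimate of the chance that a locus is represented assumes clones are sampled uniformly at random from the genome; that assumption is ours, not a claim made in the abstract.

```python
clones = 133_851      # clones in the library (from the abstract)
insert_bp = 40_000    # average insert size, "about 40 kb" (from the abstract)
coverage = 4.3        # reported genome equivalents (from the abstract)

# Genome size implied by the reported coverage: clones * insert / coverage.
implied_genome_bp = clones * insert_bp / coverage

# Assumption: uniform random sampling of clones. The chance that one clone misses
# a given locus is (1 - insert/genome), so all clones miss it with that value ** clones.
p_locus_present = 1 - (1 - insert_bp / implied_genome_bp) ** clones

# Hit rates for the 828 trimmed end sequences.
blastn_rate = 213 / 828
blastx_rate = 44 / 828

if __name__ == "__main__":
    print(f"implied genome size: {implied_genome_bp / 1e9:.2f} Gb")
    print(f"P(locus in library): {p_locus_present:.3f}")
    print(f"BLASTN hits: {blastn_rate:.2%}, BLASTX hits: {blastx_rate:.2%}")
```

Under the uniform-sampling assumption, the high per-locus probability is consistent with the screening result that none of the nine tested markers was absent from the library.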

15.
MOTIVATION: Complex genomes contain numerous repeated sequences, and genomic duplication is believed to be a main evolutionary mechanism to obtain new functions. Several tools are available for de novo repeat sequence identification, and many approaches exist for clustering homologous protein sequences. We present an efficient new approach to identify and cluster homologous DNA sequences with high accuracy at the level of whole genomes, excluding low-complexity repeats, tandem repeats and annotated interspersed repeats. We also determine the boundaries of each group member so that it closely represents a biological unit, e.g. a complete gene, or a partial gene coding a protein domain. RESULTS: We developed a program called HomologMiner to identify homologous groups applicable to genome sequences that have been properly marked for low-complexity repeats and annotated interspersed repeats. We applied it to the whole genomes of human (hg17), macaque (rheMac2) and mouse (mm8). Groups obtained include gene families (e.g. olfactory receptor gene family, zinc finger families), unannotated interspersed repeats and additional homologous groups that resulted from recent segmental duplications. Our program incorporates several new methods: a new abstract definition of consistent duplicate units, a new criterion to remove moderately frequent tandem repeats, and new algorithmic techniques. We also provide preliminary analysis of the output on the three genomes mentioned above, and show several applications including identifying boundaries of tandem gene clusters and novel interspersed repeat families. AVAILABILITY: All programs and datasets are downloadable from www.bx.psu.edu/miller_lab.

16.
Genomic imprinting is an epigenetic mechanism that results in monoallelic expression of genes depending on parent-of-origin of the allele. Although the conservation of genomic imprinting among mammalian species has been widely reported for many genes, there is accumulating evidence that some genes escape this conservation. Most known imprinted genes have been identified in the mouse and human, with few imprinted genes reported in cattle. Comparative analysis of genomic imprinting across mammalian species would provide a powerful tool for elucidating the mechanisms regulating the unique expression of imprinted genes. In this study we analyzed the imprinting of 22 genes in human, mouse, and cattle and found that in only 11 was imprinting conserved across the three species. In addition, we analyzed the occurrence of the sequence elements CpG islands, G + C content, tandem repeats, and retrotransposable elements in imprinted and in nonimprinted (control) cattle genes. We found that imprinted genes have a higher G + C content and more CpG islands and tandem repeats. Short interspersed nuclear elements (SINEs) were notably fewer in number in imprinted cattle genes compared to control genes, which is in agreement with previous reports for human and mouse imprinted regions. Long interspersed nuclear elements (LINEs) and long terminal repeats (LTRs) were found to be significantly underrepresented in imprinted genes compared to control genes, contrary to reports on human and mouse. Of considerable significance was the finding of highly conserved tandem repeats in nine of the genes imprinted in all three species. Electronic supplementary material: The online version of this article (doi: ) contains supplementary material, which is available to authorized users.

17.
In the past decade there has been an increase in the number of completely sequenced genomes due to the race of multibillion-dollar genome-sequencing projects. The enormous volume of biological sequence data flooding into the sequence databases necessitates the development of efficient tools for comparative genome sequence analysis. The information deduced by such analysis has various applications, viz. structural and functional annotation of novel genes and proteins, finding gene order in the genome, gene fusion studies, constructing metabolic pathways, etc. Such study also proves invaluable for pharmaceutical industries, for example in in silico drug target identification and new drug discovery. There are various sequence analysis tools available for mining such useful information, of which the FASTA and Smith-Waterman algorithms are widely used. However, analyzing large datasets of genome sequences using the above codes seems to be impractical on uniprocessor machines. Hence there is a need for improving the performance of the above popular sequence analysis tools on parallel cluster computers. Performance of the Smith-Waterman (SSEARCH) and FASTA programs was studied on PARAM 10000, a parallel cluster of workstations designed and developed in-house. The FASTA and SSEARCH programs, which are available from the University of Virginia, were ported to PARAM and optimized. In this era of high performance computing, where the paradigm is shifting from conventional supercomputers to cost-effective general-purpose clusters of workstations and PCs, this study is highly relevant. Good performance of sequence analysis tools on a cluster of workstations was demonstrated, which is important for accelerating identification of novel genes and drug targets by screening large databases.

18.
The GoSh database is a collection of 58 990 Capra hircus and Ovis aries expressed sequence tags. A Perl pipeline was prepared to process sequences, and data were collected in a MySQL database. A PHP-based web interface allows browsing and querying the database. Putative single nucleotide polymorphism (SNP) detection and a search for repeats were performed, and links to external related resources were provided. Sequences were annotated against three different databases and an algorithm was implemented to create statistics of the distribution of retrieved homologous ontologies in the Gene Ontology categories. The GoSh database is a repository of data and links related to goat and sheep expressed genes. AVAILABILITY: The GoSh database is available at http://www.itb.cnr.it/gosh/

19.
UniProt archive     
UniProt Archive (UniParc) is the most comprehensive, non-redundant protein sequence database available. Its protein sequences are retrieved from predominant, publicly accessible resources. All new and updated protein sequences are collected and loaded daily into UniParc for full coverage. To avoid redundancy, each unique sequence is stored only once with a stable protein identifier, which can be used later in UniParc to identify the same protein in all source databases. When proteins are loaded into the database, database cross-references are created to link them to the origins of the sequences. As a result, performing a sequence search against UniParc is equivalent to performing the same search against all databases cross-referenced by UniParc. UniParc contains only protein sequences and database cross-references; all other information must be retrieved from the source databases.
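The storage idea described above (each unique sequence stored once under a stable identifier, with cross-references back to every source database) can be sketched as a small in-memory structure. This is an illustration of the principle only, not UniParc's schema or API; the class name `SequenceArchive`, the `ARC…` identifier format, and the database names in the usage example are all hypothetical.

```python
class SequenceArchive:
    """Toy non-redundant store: one record per unique sequence, many cross-references."""

    def __init__(self):
        self._id_by_sequence = {}  # sequence -> stable identifier
        self._xrefs = {}           # identifier -> set of (source_db, accession)

    def load(self, sequence, source_db, accession):
        """Register a sequence from a source database and return its archive identifier."""
        seq = sequence.upper()
        archive_id = self._id_by_sequence.get(seq)
        if archive_id is None:
            # Hypothetical identifier scheme, chosen only for this sketch.
            archive_id = f"ARC{len(self._id_by_sequence) + 1:06d}"
            self._id_by_sequence[seq] = archive_id
            self._xrefs[archive_id] = set()
        # The same sequence arriving from another source only adds a cross-reference.
        self._xrefs[archive_id].add((source_db, accession))
        return archive_id

    def search(self, sequence):
        """Exact-match lookup: returns (identifier, cross-references) or (None, empty set)."""
        archive_id = self._id_by_sequence.get(sequence.upper())
        if archive_id is None:
            return None, set()
        return archive_id, set(self._xrefs[archive_id])


if __name__ == "__main__":
    archive = SequenceArchive()
    archive.load("MKTAYIAK", "dbA", "A0001")
    archive.load("MKTAYIAK", "dbB", "B7734")  # same sequence, second source
    print(archive.search("MKTAYIAK"))
```

Because one lookup returns every cross-reference for a sequence, a single search against the archive stands in for a search against each source database, which is the equivalence the abstract points out.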

20.
We are interested in detecting homologous genomic DNA sequences with the goal of locating approximate inverted, interspersed, and tandem repeats. Standard search techniques start by detecting small matching parts, called seeds, between a query sequence and database sequences. Contiguous seed models have existed for many years. Recently, spaced seeds were shown to be more sensitive than contiguous seeds without increasing the random hit rate. To determine the superiority of one seed model over another, a model of homologous sequence alignment must be chosen. Previous studies evaluating spaced and contiguous seeds have assumed that matches and mismatches occur within these alignments, but not insertions and deletions (indels). This is perhaps appropriate when searching for protein coding sequences (<5% of the human genome), but is inappropriate when looking for repeats in the majority of genomic sequence where indels are common. In this paper, we assume a model of homologous sequence alignment which includes indels and we describe a new seed model, called indel seeds, which explicitly allows indels. We present a waiting time formula for computing the sensitivity of an indel seed and show that indel seeds significantly outperform contiguous and spaced seeds when homologies include indels. We discuss the practical aspect of using indel seeds and finally we present results from a search for inverted repeats in the dog genome using both indel and spaced seeds.
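The difference between contiguous and spaced seeds mentioned above can be shown with a minimal sketch. A seed is written here as a string of `1` (position must match) and `0` (position is ignored); this notation, the function name `seed_hits`, and the example sequences are assumptions for illustration. The sketch covers only contiguous and spaced seeds and does not implement the indel seeds or the waiting time formula introduced in the paper.

```python
def seed_hits(query, target, seed):
    """Return (i, j) pairs where query[i:] and target[j:] agree at every '1' position of seed."""
    span = len(seed)
    care = [p for p, c in enumerate(seed) if c == "1"]

    # Index every target window by the characters at the seed's "care" positions.
    index = {}
    for j in range(len(target) - span + 1):
        key = "".join(target[j + p] for p in care)
        index.setdefault(key, []).append(j)

    # Look up each query window in the index.
    hits = []
    for i in range(len(query) - span + 1):
        key = "".join(query[i + p] for p in care)
        for j in index.get(key, []):
            hits.append((i, j))
    return hits


if __name__ == "__main__":
    query = "ACGTACG"
    target = "ACGAACG"  # differs from query by one substitution at position 3
    print(seed_hits(query, target, "1111"))   # contiguous seed, weight 4
    print(seed_hits(query, target, "11011"))  # spaced seed, weight 4
```

Both seeds require four matching positions, so their random hit rates are comparable, but here only the spaced seed produces a hit: its ignored middle position lines up with the substitution. Neither model tolerates an insertion or deletion inside the seed window, which is the gap the indel seeds described above are meant to fill.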
