Similar literature
 20 similar documents found
1.
It is now common practice to retrieve, by key words, highly specialized selections of sequences from general-purpose databases such as EMBL, GenBank, etc. The sequences included in a selection are often interconnected, which means that there are duplications, embeddings, intersections, homology, common structural elements. Knowledge of these interconnections is necessary for further processing of the sequences. We propose a rapid (single scan) method for identification of such interconnections by means of complexity analysis that generalizes the Lempel-Ziv approach. Analysis of a selection of 5'-flanking regions of vertebrate growth hormone genes from EMBL is presented as an example.
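
As a rough illustration of the idea behind complexity analysis, the sketch below computes the classic Lempel-Ziv (1976) phrase decomposition of a string. It is not the generalized method the abstract proposes, which is not described here; the function name `lz76_phrases` is hypothetical. Each phrase is the shortest substring that cannot be copied from the text preceding its last character, so a duplicated or embedded sequence tends to be absorbed into one long phrase.

```python
def lz76_phrases(s):
    """Return the Lempel-Ziv (1976) exhaustive-history phrases of s.

    Plain LZ76 only, as an illustrative assumption; the abstract's
    generalization of the approach is not reproduced here.
    """
    phrases = []
    i, n = 0, len(s)
    while i < n:
        length = 1
        # Grow the candidate while it can still be copied from an
        # occurrence that starts before position i (overlap allowed).
        while i + length <= n and s.find(s[i:i + length], 0, i + length - 1) != -1:
            length += 1
        # The slice is clipped automatically if the end of s was reached.
        phrases.append(s[i:i + length])
        i += length
    return phrases


# Textbook example: six phrases, 0 | 001 | 10 | 100 | 1000 | 101
example = lz76_phrases("0001101001000101")
```

When a sequence is concatenated with an exact copy of itself, this parsing adds at most one phrase for the whole second copy, which is the kind of signal a single-scan search for duplications can exploit.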

2.
Large-scale search for genes on which positive selection may operate   Total citations: 36 (self-citations: 19; cited by others: 17)
We conducted a systematic search for candidate genes on which positive selection may operate, on the premise that for such genes the number of nonsynonymous substitutions is expected to be larger than that of synonymous substitutions when the nucleotide sequences of the genes under investigation are compared with each other. By obtaining 3,595 groups of homologous sequences from the DDBJ, EMBL, and GenBank DNA sequence databases, we found that 17 gene groups can be candidates for genes on which positive selection may operate. Thus, such genes are found to occupy only about 0.5% of the vast number of gene groups so far available. Interestingly enough, 9 out of the 17 gene groups were the surface antigens of parasites or viruses.

3.
Wang L  Rodriguez-Tomé P  Redaschi N  McNeil P  Robinson A  Lijnzaad P 《Genome biology》2000,1(5):research0010.1-research001010

Background  

The EMBL Nucleotide Sequence Database is a comprehensive database of DNA and RNA sequences and related information traditionally made available in flat-file format. Queries through tools such as SRS (Sequence Retrieval System) also return data in flat-file format. Flat files have a number of shortcomings, however, and the resources therefore currently lack a flexible environment to meet individual researchers' needs. The Object Management Group's common object request broker architecture (CORBA) is an industry standard that provides platform-independent programming interfaces and models for portable distributed object-oriented computing applications. Its independence from programming languages, computing platforms and network protocols makes it attractive for developing new applications for querying and distributing biological data.

4.
In the context of the international project aimed at sequencing the whole genome of Bacillus subtilis we have developed a non-redundant, fully annotated database of sequences from this organism. Starting from the B. subtilis sequences available in the EMBL, GenBank and DDBJ collections we have removed all encountered duplications and then added extra annotations to the sequences (e.g. accession numbers for the genes, locations on the genetic map, codon usage, etc.). We have also added cross-references to the EMBL, MEDLINE, SWISS-PROT and ENZYME data banks. The present system results from merging of the NRSub and SubtiList databases and the sequence contigs used in the two systems are identical. NRSub is distributed as a flatfile in EMBL format (which is supported by most sequence analysis software packages) and as an ACNUC database, while SubtiList is distributed as a relational database under 4th Dimension. It is possible to access the data through two dedicated World Wide Web servers located in France and Japan.

5.
EXProt (database for EXPerimentally verified Protein functions) is a new non-redundant database containing protein sequences for which the function has been experimentally verified. It is a selection of 3976 entries from the Prokaryotes section of the EMBL Nucleotide Sequence Database, Release 66, and 375 entries from the Pseudomonas Community Annotation Project (PseudoCAP). The entries in EXProt all have a unique ID number and provide information about the organism, protein sequence, functional annotation, link to entry in original database, and if known, gene name and link to references in PubMed/Medline. The EXProt web page (http://www.cmbi.nl/EXProt) provides further details of the database and a link to a BLAST search (blastp & blastx) of the database. The EXProt entries are indexed in SRS (http://www.cmbi.nl/srs/) and can be searched by means of keywords. Authors can be reached by email (exprot@cmbi.kun.nl).

6.
The use of databanks in genetic research assumes reliability of the information they contain. Currently, error-detection in the manually or electronically entered data contained in the nucleotide sequence databanks at EMBL, Heidelberg and GenBank at Los Alamos is limited. We have used a subset of sequences from these databanks to train neural networks to recognize pre-mRNA splicing signals in human genes. During the training on 33 human genes from the EMBL databank seven genes appeared to disturb the learning process. Subsequent investigation revealed discrepancies from the original published papers for three genes. In four genes, we found wrongly assigned splicing frames of introns. We believe this to be a reflection of the fact that splicing frames cannot always be unambiguously assigned on the basis of experimental data. Thus incorrect assignments arise both from mere typographical misprints and from erroneous interpretation of experiments. Training on 241 human sequences from GenBank revealed nine new errors. We propose that such errors could be detected by computer algorithms designed to check the consistency of data prior to their incorporation in databanks.

7.
The discovery of group I introns in small subunit nuclear rDNA (nsrDNA) is becoming more common as the effort to generate phylogenies based upon nsrDNA sequences grows. In this paper we describe the discovery of the first two group I introns in the nsrDNA from the genus Acanthamoeba. The introns are in different locations in the genes, and have no significant primary sequence similarity to each other. They are identified as group I introns by the conserved P, Q, R and S sequences (1), and the ability to fit the sequences to a consensus secondary structure model for the group I introns (1, 2). Both introns are absent from the mature srRNA. A BLAST search (3) of nucleic acid sequences present in GenBank and EMBL revealed that the A. griffini intron was most similar to the nsrDNA group I intron of the green alga Dunaliella parva. A similar search found that the A. lenticulata intron was not similar to any of the other reported group I introns.

8.
Predictive motifs derived from cytosine methyltransferases.   Total citations: 36 (self-citations: 51; cited by others: 36)
Thirteen bacterial DNA methyltransferases that catalyze the formation of 5-methylcytosine within specific DNA sequences possess related structures. Similar building blocks (motifs), containing invariant positions, can be found in the same order in all thirteen sequences. Five of these blocks are highly conserved while a further five contain weaker similarities. One block, which has the most invariant residues, contains the proline-cysteine dipeptide of the proposed catalytic site. A region in the second half of each sequence is unusually variable both in length and sequence composition. Those methyltransferases that exhibit significant homology in this region share common specificity in DNA recognition. The five highly conserved motifs can be used to discriminate the known 5-methylcytosine forming methyltransferases from all other methyltransferases of known sequence, and from all other identified proteins in the PIR, GenBank and EMBL databases. These five motifs occur in a mammalian methyltransferase responsible for the formation of 5-methylcytosine within CG dinucleotides. By searching the unidentified open reading frames present in the GenBank and EMBL databases, two potential 5-methylcytosine forming methyltransferases have been found.

9.
This paper proposes multiport parallel and multidirectional intraconnected associative memories of outer product type with reduced interconnections. Some new reduced order memory architectures such as k-directional and k-port parallel memories are suggested. These architectures are also very suitable for implementation of spatio-temporal sequences and multiassociative memories. It is shown that in the proposed memory architectures, a substantial reduction in interconnections is achieved if the actual length of original N-bit long vectors is subdivided into k sublengths. Using these sublengths, submemory matrices, T_s or W_s, are computed, which are then intraconnected to form k-port parallel or k-directional memories. The subdivisions of N-bit long vectors into k sublengths save of interconnections. It is shown, by means of an example, that more than 80% reduction in interconnections is achieved. Minimum limit in bits on k as well as maximum limit on subdivisions in k is determined. The topologies of reduced interconnectivity developed in this paper are symmetric in structure and can be used to scale up to larger systems. The underlying principle of construction, storage and retrieval processes of such associative memories has been analyzed. The effect of complexity of different levels of reduced interconnectivity on the quality of retrieval, signal to noise ratio, and storage capacity has been investigated. The model possesses analogies to biological neural structures and digital parallel port memories commonly used in parallel and multiprocessing systems.

10.
It has been previously shown that protein sequences containing a quasi-repetitive assortment of amino acids are common in genomes and databases such as Swiss-Prot but are under-represented in the structure-based Protein Data Bank (PDB). Structural genomics groups have been using the absence of these “low-complexity” sequences for several years as a way to select proteins that have a good chance of successful structure determination. In this study, we examine the data deposited in the PDB as well as the available data from structural genomics groups in TargetDB and PepcDB to reveal interesting trends that could be taken into consideration when using low-complexity sequences as part of the target selection process.

11.
We continued our effort to make a comprehensive database (LISTA) for the yeast Saccharomyces cerevisiae. In this database each sequence has been attributed a single genetic name. In the case of duplicated sequences a simple method has been applied to distinguish between sequences of one and the same gene from non-allelic sequences of duplicated genes. If necessary, synonyms are given in the case of allelic duplicated sequences. Thus sequences can be found either by the name or by synonyms given in LISTA. Each entry contains the genetic name, the mnemonic from the EMBL data bank, the codon bias, the reference of the publication of the sequence, the chromosomal location as far as known, and the Swissprot and EMBL accession numbers. To obtain more information on the included sequences, each entry has been screened against non-redundant nucleotide and protein data bank collections resulting in LISTA-HON and LISTA-HOP. The LISTA data base can be linked to the associated data sets or to nucleotide and protein banks by the Sequence Retrieval System (SRS).

12.
Because retrotransposons are the major component of plant genomes, analysis of the target site selection of retrotransposons is important for understanding the structure and evolution of plant genomes. Here, we examined the target site specificity of the rice retrotransposon Tos17, which can be activated by tissue culture. We have produced 47,196 Tos17-induced insertion mutants of rice. This mutant population carries approximately 500,000 insertions. We analyzed >42,000 flanking sequences of newly transposed Tos17 copies from 4316 mutant lines. More than 20,000 unique loci were assigned on the rice genomic sequence. Analysis of these sequences showed that insertion events are three times more frequent in genic regions than in intergenic regions. Consistent with this result, Tos17 was shown to prefer gene-dense regions over centromeric heterochromatin regions. Analysis of insertion target sequences revealed a palindromic consensus sequence, ANGTT-TSD-AACNT, flanking the 5-bp target site duplication. Although insertion targets are distributed throughout the chromosomes, they tend to cluster, and 76% of the clusters are located in genic regions. The mechanisms of target site selection by Tos17, the utility of the mutant lines, and the knockout gene database are discussed. --The nucleotide sequence data were uploaded to the DDBJ, EMBL, and GenBank nucleotide sequence databases under accession numbers AG020727 to AG025611 and AG205093 to AG215049.

13.
The cDNA sequences of chicken and hagfish prothrombin have been determined. The sequences predict that prothrombin from both species is synthesized as a prepro-protein consisting of a putative Gla domain, two kringle domains, and a two-chain protease domain. Chicken and hagfish prothrombin share 51.6% amino acid sequence identity (313/627 residues). Both chicken and hagfish prothrombin are structurally very similar to human, bovine, rat, and mouse prothrombin and all six species share 41% amino acid sequence identity. Amino acid sequence alignments of human, bovine, rat, mouse, chicken, and hagfish prothrombin suggest that the thrombin B-chain and the propeptide-Gla domain are the regions most constrained for the common function(s) of vertebrate prothrombins. The nucleotide sequences reported in this paper have been submitted to the EMBL/GenBank database under the following accession numbers: M81391 for Gallus gallus, M81393 for Eptatretus stouti. Correspondence to: R.T.A. MacGillivray

14.
In this study, I explain the observation that a rather limited number of residues (about 10) establishes the immunoglobulin fold for the sequences of about 100 residues. Immunoglobulin fold proteins (IgF) comprise SCOP protein superfamilies with rather different functions and with less than 10% sequence identity; their alignment can be accomplished only taking into account the 3D structure. Therefore, I believe that discovering the additional common features of the sequences is necessary to explain the existence of a common fold for these SCOP superfamilies. We propose a method for analysis of pair-wise interconnections between residues of the multiple sequence alignment which helps us to reveal the set of mutually correlated positions, inherent to almost every superfamily of this protein fold. Hence, the set of constant positions (comprising the hydrophobic common core) and the set of variable but mutually correlated ones can serve as a basis of having the common 3D structure for rather distinct protein sequences.

15.
The amount of nucleotide sequence data is increasing exponentially. We therefore made an effort to make a comprehensive database (LISTA) for the yeast Saccharomyces cerevisiae. Each sequence has been attributed a single genetic name and in the case of allelic duplicated sequences, synonyms are given, if necessary. For the nomenclature we have introduced a standard principle for naming gene sequences based on priority rules. We have also applied a simple method to distinguish duplicated sequences of one and the same gene from non-allelic sequences of duplicated genes. By using these principles we have sorted out a lot of confusion in the literature and databanks. Along with the genetic name, the mnemonic from the EMBL databank, the codon bias, reference of the publication of the sequence and the EMBL accession numbers are included in each entry.

16.
The nucleotide sequence data reported in this paper have been submitted to the EMBL nucleotide sequence database and have been assigned the accession number X96986. The name DPB1*6601 was officially assigned by the WHO Nomenclature Committee in May 1996. This follows the agreed policy that, subject to the conditions stated in the most recent Nomenclature Report (Bodmer et al. 1995), names will be assigned to new sequences as they are identified. Lists of such sequences will be published in the following WHO Nomenclature Report.

17.
Genes of the major histocompatibility complex (MHC) are exceptionally polymorphic due to the combined effects of natural and sexual selection. Most research in wild populations has focused on the second exon of a single class II locus (DRB), but complete gene sequences can provide an illuminating backdrop for studies of intragenic selection, recombination, and organization. To this end, we characterized class II loci in the banner-tailed kangaroo rat (Dipodomys spectabilis). Seven DRB-like sequences (provisionally named MhcDisp-DRB*01 through *07) were isolated from spleen cDNA and most likely comprise ≥5 loci; this multiformity is quite unlike the situation in muroid rodents such as Mus, Rattus, and Peromyscus. In silico translation revealed the presence of important structural residues for glycosylation sites, salt bonds, and CD4+ T-cell recognition. Amino-acid distances varied widely among the seven sequences (2–34%). Nuclear DNA sequences from the Disp-DRB*07 locus (∼10 kb) revealed a conventional exon/intron structure as well as a number of microsatellites and short interspersed nuclear elements (B4, Alu, and IDL-Geo subfamilies). Rates of nucleotide substitution at Disp-DRB*07 are similar in both exons and introns (π = 0.015 and 0.012, respectively), which suggests relaxed selection and may indicate that this locus is an expressed pseudogene. Finally, we performed BLASTn searches against Dipodomys ordii genomic sequences (unassembled reads) and found 90–97% nucleotide similarity between the two kangaroo rat species. Collectively, these data suggest that class II diversity in heteromyid rodents is based on polylocism and departs from the muroid architecture. Nucleotide sequence data reported are available in the DDBJ/EMBL/GenBank databases under the accession numbers EU817477–EU817485.

18.
Twenty-six sequences of a short interspersed repetitive element (SINE) with a size of approximately 150 base pairs (bp) were isolated from the genomic DNA of Vicugna vicugna (vicuna). An RNA polymerase III split promoter sequence was observed in most of them, and many had direct repeats flanking the SINEs as well as a poly(A)-like structure. The SINE sequences were designated as "vic-1" sequences. Comparison of the vic-1 consensus sequence with sequences registered in the DNA database (DDBJ/EMBL/GenBank) revealed that the vic-1 sequence had a 79% homology with the mouse ala-tRNA gene. In addition, the tRNA-related region of the consensus sequence was folded into a cloverleaf structure as with mouse ala-tRNA. These findings strongly indicated that vic-1 was a retroposon derived from the ala-tRNA gene. The vic-1 sequences were used as a probe for dot-blot hybridization to examine the distribution of their homologous sequences in the genomes of various animal species spanning 14 orders; homologous sequences were found only in the Camelidae family. In order to examine the phylogenetic relationship among vicuna, llama, and camel, vic-1 insertion analysis and homology analysis of vic-1 sequences were performed at each locus. The analyses indicated that vic-1 sequences were generated in a common ancestor of the animal species, and that camels first branched off from the clade Camelidae, followed by vicunas and llamas. Received: 5 July 2000 / Accepted: 14 November 2000

19.
In this paper we describe a modification to the lambda vector EMBL3 which greatly expedites the construction of restriction maps of cloned DNA sequences. In the modified vector, EMBL3cos, all the phage coding sequences are placed to the right of the cloning sites so that the left cohesive end is separated by only 200 bp, rather than 20 kb (as in conventional lambda vectors), from the inserted DNA fragment. We show that reliable restriction maps can be rapidly constructed from partial digests of clones made in this vector by labelling the left cohesive end with a complementary 32P-labelled oligonucleotide. In addition, we quantify the restriction of clones containing human DNA by the McrA and McrB systems of E. coli and show that the use of Mcr- plating strains can increase the yield of recombinant phage up to tenfold, to give cloning efficiencies of ≥10^7 pfu/µg of human DNA.

20.
Summary: The sequences of the genes coding for a hydroxyproline-rich glycoprotein from two varieties of maize (Zea mays, Ac1503 and W22), a teosinte (Zea diploperennis) and sorghum (Sorghum vulgare) have been obtained and compared. Distinct patterns of variability have been observed along their sequences. The 500 bp region immediately upstream of the TATA box is highly conserved in the Zea species and contains stretches of sequences also found in the sorghum gene. Further upstream, significant rearrangements are observed, even between the two maize varieties. These observations allow definition of a 5′ region, which is common to the four genes and is probably essential for their expression. The 3′ end shows variability, mostly due to small duplications and single nucleotide substitutions. There is an intron present in this region showing a high degree of sequence conservation among the four genes analyzed. The coding region is the most divergent, but variability arises from duplications of fragments coding for similar protein blocks and from single nucleotide substitutions. These results indicate that a number of distinct mechanisms (probably point mutation, transposon insertion and excision, homologous recombination and unequal crossing-over) are active in the production of sequence variability in maize and related species. They are revealed in different parts of the gene, probably as the result of the different types of functional constraints acting on them, and of the specific nature of the sequence in each region. The sequences reported in this paper have been deposited in the EMBL/GenBank Database (Bolt, Beranek, and Newman Laboratories, Cambridge, Mass., and EMBL, Heidelberg), accession nos. M36635 (maize Ac1503), X63134 (maize W22), X64173 (teosinte) and X56010 (sorghum)
