首页 | 本学科首页   官方微博 | 高级检索  
相似文献
 共查询到20条相似文献,搜索用时 31 毫秒
1.
WindowMasker: window-based masker for sequenced genomes   总被引:3,自引:0,他引:3  
MOTIVATION: Matches to repetitive sequences are usually undesirable in the output of DNA database searches. Repetitive sequences need not be matched to a query, if they can be masked in the database. RepeatMasker/Maskeraid (RM), currently the most widely used software for DNA sequence masking, is slow and requires a library of repetitive template sequences, such as a manually curated RepBase library, that may not exist for newly sequenced genomes. RESULTS: We have developed a software tool called WindowMasker (WM) that identifies and masks highly repetitive DNA sequences in a genome, using only the sequence of the genome itself. WM is orders of magnitude faster than RM because WM uses a few linear-time scans of the genome sequence, rather than local alignment methods that compare each library sequence with each piece of the genome. We validate WM by comparing BLAST outputs from large sets of queries applied to two versions of the same genome, one masked by WM, and the other masked by RM. Even for genomes such as the human genome, where a good RepBase library is available, searching the database as masked with WM yields more matches that are apparently non-repetitive and fewer matches to repetitive sequences. We show that these results hold for transcribed regions as well. WM also performs well on genomes for which much of the sequence was in draft form at the time of the analysis. AVAILABILITY: WM is included in the NCBI C++ toolkit. The source code for the entire toolkit is available at ftp://ftp.ncbi.nih.gov/toolbox/ncbi_tools++/CURRENT/. Once the toolkit source is unpacked, the instructions for building WindowMasker application in the UNIX environment can be found in file src/app/winmasker/README.build. SUPPLEMENTARY INFORMATION: Supplementary data are available at ftp://ftp.ncbi.nlm.nih.gov/pub/agarwala/windowmasker/windowmasker_suppl.pdf  相似文献   

2.
Clustal W and Clustal X version 2.0   总被引:70,自引:0,他引:70  
SUMMARY: The Clustal W and Clustal X multiple sequence alignment programs have been completely rewritten in C++. This will facilitate the further development of the alignment algorithms in the future and has allowed proper porting of the programs to the latest versions of Linux, Macintosh and Windows operating systems. AVAILABILITY: The programs can be run on-line from the EBI web server: http://www.ebi.ac.uk/tools/clustalw2. The source code and executables for Windows, Linux and Macintosh computers are available from the EBI ftp site ftp://ftp.ebi.ac.uk/pub/software/clustalw2/  相似文献   

3.
BLAST (Basic Local Alignment Search Tool) searches against DNA and protein sequence databases have become an indispensable tool for biomedical research. The proliferation of the genome sequencing projects is steadily increasing the fraction of genome-derived sequences in the public databases and their importance as a public resource. We report here the availability of Genomic BLAST, a novel graphical tool for simplifying BLAST searches against complete and unfinished genome sequences. This tool allows the user to compare the query sequence against a virtual database of DNA and/or protein sequences from a selected group of organisms with finished or unfinished genomes. The organisms for such a database can be selected using either a graphic taxonomy-based tree or an alphabetical list of organism-specific sequences. The first option is designed to help explore the evolutionary relationships among organisms within a certain taxonomy group when performing BLAST searches. The use of an alphabetical list allows the user to perform a more elaborate set of selections, assembling any given number of organism-specific databases from unfinished or complete genomes. This tool, available at the NCBI web site http://www.ncbi.nlm.nih.gov/cgi-bin/Entrez/genom_table_cgi, currently provides access to over 170 bacterial and archaeal genomes and over 40 eukaryotic genomes.  相似文献   

4.
GenBank          下载免费PDF全文
GenBank (R) is a comprehensive sequence database that contains publicly available DNA sequences for more than 119 000 different organisms, obtained primarily through the submission of sequence data from individual laboratories and batch submissions from large-scale sequencing projects. Most submissions are made using the BankIt (web) or Sequin programs and accession numbers are assigned by GenBank staff upon receipt. Daily data exchange with the EMBL Data Library in the UK and the DNA Data Bank of Japan helps ensure worldwide coverage. GenBank is accessible through NCBI's retrieval system, Entrez, which integrates data from the major DNA and protein sequence databases along with taxonomy, genome, mapping, protein structure and domain information, and the biomedical journal literature via PubMed. BLAST provides sequence similarity searches of GenBank and other sequence databases. Complete bimonthly releases and daily updates of the GenBank database are available by FTP. To access GenBank and its related retrieval and analysis services, go to the NCBI home page at: http://www.ncbi.nlm.nih.gov.  相似文献   

5.
Sequence search algorithm assessment and testing toolkit (SAT)   总被引:2,自引:0,他引:2  
MOTIVATION: The Sequence Search Algorithm Assessment and Testing Toolkit (SAT) aims to be a complete package for the comparison of different protein homology search algorithms. The structural classification of proteins can provide us with a clear criterion for judgment in homology detection. There have been several assessments based on structural sequences with classifications but a good deal of similar work is now being repeated with locally developed procedures and programs. The SAT will provide developers with a complete package which will save time and produce more comparable performance assessments for search algorithms. The package is complete in the sense that it provides a non-redundant large sequence resource database, a well-characterized query database of proteins domains, all the parsers and some previous results from PSI-BLAST and a hidden markov model algorithm. RESULTS: An analysis on two different data sets was carried out using the SAT package. It compared the performance of a full protein sequence database (RSDB100) with a non-redundant representative sequence database derived from it (RSDB50). The performance measurement indicated that the full database is sub-optimal for a homology search. This result justifies the use of much smaller and faster RSDB50 than RSDB100 for the SAT. AVAILABILITY: A web site is up. The whole packa ge is accessible via www and ftp. ftp://ftp.ebi.ac.uk/pub/contrib/jong/SAT http://cyrah.ebi.ac.uk:1111/Proj/Bio/SAT http://www.mrc-lmb.cam.ac.uk/genomes/SAT In the package, some previous assessment results produced by the package can also be found for reference. CONTACT: jong@ebi.ac.uk  相似文献   

6.
MOTIVATION: For large-scale structural assignment to sequences, as in computational structural genomics, a fast yet sensitive sequence search procedure is essential. A new approach using intermediate sequences was tested as a shortcut to iterative multiple sequence search methods such as PSI-BLAST. RESULTS: A library containing potential intermediate sequences for proteins of known structure (PDB-ISL) was constructed. The sequences in the library were collected from a large sequence database using the sequences of the domains of proteins of known structure as the query sequences and the program PSI-BLAST. Sequences of proteins of unknown structure can be matched to distantly related proteins of known structure by using pairwise sequence comparison methods to find homologues in PDB-ISL. Searches of PDB-ISL were calibrated, and the number of correct matches found at a given error rate was the same as that found by PSI-BLAST. The advantage of this library is that it uses pairwise sequence comparison methods, such as FASTA or BLAST2, and can, therefore, be searched easily and, in many cases, much more quickly than an iterative multiple sequence comparison method. The procedure is roughly 20 times faster than PSI-BLAST for small genomes and several hundred times for large genomes. AVAILABILITY: Sequences can be submitted to the PDB-ISL servers at http://stash.mrc-lmb.cam.ac.uk/PDB_ISL/ or http://cyrah.ebi.ac.uk:1111/Serv/PDB_ISL/ and can be downloaded from ftp://ftp.ebi.ac.uk/pub/contrib/jong/PDB_+ ++ISL/ CONTACT: sat@mrc-lmb.cam.ac.uk and jong@ebi.ac.uk  相似文献   

7.
BeoBLAST is an integrated software package that handles user requests and distributes BLAST and PSI-BLAST searches to nodes of a Beowulf cluster, thus providing a simple way to implement a scalable BLAST system on top of relatively inexpensive computer clusters. Additionally, BeoBLAST offers a number of novel search features through its web interface, including the ability to perform simultaneous searches of multiple databases with multiple queries, and the ability to start a search using the PSSM generated from a previous PSI-BLAST search on a different database. The underlying system can also handle automated querying for high throughput work. AVAILABILITY: Source code is available under the GNU public license at http://bioinformatics.fccc.edu/  相似文献   

8.
The submission of multiple sequence alignment data to EMBL has grown 30-fold in the past 10 years, creating a problem of archiving them. The EBI has developed a new public database of multiple sequence alignments called EMBL-Align. It has a dedicated web-based submission tool, Webin-Align. Together they represent a comprehensive data management solution for alignment data. Webin-Align accepts all the common alignment formats and can display data in CLUSTALW format as well as a new standard EMBL-Align flat file format. The alignments are stored in the EMBL-Align database and can be queried from the EBI SRS (Sequence Retrieval System) server. AVAILABILITY: Webin-Align: http://www.ebi.ac.uk/embl/Submission/align_top.html, EMBL-Align: ftp://ftp.ebi.ac.uk/pub/databases/embl/align, http://srs.ebi.ac.uk/  相似文献   

9.
10.
ViroBLAST is a stand-alone BLAST web interface for nucleotide and amino acid sequence similarity searches. It extends the utility of BLAST to query against multiple sequence databases and user sequence datasets, and provides a friendly output to easily parse and navigate BLAST results. ViroBLAST is readily useful for all research areas that require BLAST functions and is available online and as a downloadable archive for independent installation. Availability: http://indra.mullins.microbiol.washington.edu/blast/viroblast.php.  相似文献   

11.
SBASE 4.0 is the fourth release of SBASE, a collection of annotated protein domain sequences that represent various structural, functional, ligand binding and topogenic segments of proteins. SBASE was designed to facilitate the detection of functional homologies and can be searched with standard database search tools, such as FASTA and BLAST3. The present release contains 61 137 entries provided with standardized names and cross-referenced to all major protein, nucleic acid and sequence pattern collections. The entries are clustered into 13 155 groups in order to facilitate detection of distant similarities. SBASE 4.0 is freely available by anonymous ftp file transfer from ftp.icgeb.trieste.it. Individual records can be retrieved with the gopher server at icgeb.trieste.it and with a World Wide Web server at http://www.icgeb.trieste.it. Automated searching of SBASE with BLAST can be carried out with the electronic mail server sbase@icgeb.trieste.it, which now also provides a graphic representation of the homologies. A related mail server, domain@hubi.abc.hu, assigns SBASE domain homologies on the basis of SWISS-PROT searches.  相似文献   

12.
Post-processing of BLAST results using databases of clustered sequences   总被引:1,自引:0,他引:1  
Motivation: When evaluating the results of a sequence similaritysearch, there are many situations where it can be useful todetermine whether sequences appearing in the results share somedistinguishing characteristic. Such dependencies between databaseentries are often not readily identifiable, but can yield importantnew insights into the biological function of a gene or protein. Results: We have developed a program called CBLAST that sortsthe results of a BLAST sequence similarity search accordingto sequence membership in user-defined ‘clusters’of sequences. To demonstrate the utility of this application,we have constructed two cluster databases. The first describesclusters of nucleotide sequences representing the same gene,as documented in the UNIGENE database, and the second describesclusters of protein sequences which are members of the proteinfamilies documented in the PROSITE database. Cluster databasesand the CBLAST post-processor provide an efficient mechanismfor identifying and exploring relationships and dependenciesbetween new sequences and database entries. Availability: The software described in this article is availablefree of charge from the EBI software archive at < ftp: //ftp.ebi. ac. uk/pub/software/unix >. Contact: E-mail: rainer _fuchs@glaxowellcome.com  相似文献   

13.
The ProtoNet site provides an automatic hierarchical clustering of the SWISS-PROT protein database. The clustering is based on an all-against-all BLAST similarity search. The similarities' E-score is used to perform a continuous bottom-up clustering process by applying alternative rules for merging clusters. The outcome of this clustering process is a classification of the input proteins into a hierarchy of clusters of varying degrees of granularity. ProtoNet (version 1.3) is accessible in the form of an interactive web site at http://www.protonet.cs.huji.ac.il. ProtoNet provides navigation tools for monitoring the clustering process with a vertical and horizontal view. Each cluster at any level of the hierarchy is assigned with a statistical index, indicating the level of purity based on biological keywords such as those provided by SWISS-PROT and InterPro. ProtoNet can be used for function prediction, for defining superfamilies and subfamilies and for large-scale protein annotation purposes.  相似文献   

14.
SUMMARY: The program varsplic.pl uses information present in the SWISS-PROT and TrEMBL databases to create new records for alternatively spliced isoforms. These new records can be used in similarity searches. AVAILABILITY: The program is available at ftp://ftp.ebi.ac.uk/pub/software/swissprot/, together with regularly updated output files. CONTACT: pkersey@ebi.ac.uk  相似文献   

15.
MOTIVATION: Biological sequence databases are highly redundant for two main reasons: 1. various databanks keep redundant sequences with many identical and nearly identical sequences 2. natural sequences often have high sequence identities due to gene duplication. We wanted to know how many sequences can be removed before the databases start losing homology information. Can a database of sequences with mutual sequence identity of 50% or less provide us with the same amount of biological information as the original full database? RESULTS: Comparisons of nine representative sequence databases (RSDB) derived from full protein databanks showed that the information content of sequence databases is not linearly proportional to its size. An RSDB reduced to mutual sequence identity of around 50% (RSDB50) was equivalent to the original full database in terms of the effectiveness of homology searching. It was a third of the full database size which resulted in a six times faster iterative profile searching. The RSDBs are produced at different granularity for efficient homology searching. AVAILABILITY: All the RSDB files generated and the full analysis results are available through internet: ftp://ftp.ebi.ac. uk/pub/contrib/jong/RSDB/http://cyrah.e bi.ac.uk:1111/Proj/Bio/RSDB  相似文献   

16.
InterProScan is a tool that scans given protein sequences against the protein signatures of the InterPro member databases, currently--PROSITE, PRINTS, Pfam, ProDom and SMART. The number of signature databases and their associated scanning tools as well as the further refinement procedures make the problem complex. InterProScan is designed to be a scalable and extensible system with a robust internal architecture. AVAILABILITY: The Perl-based InterProScan implementation is available from the EBI ftp server (ftp://ftp.ebi.ac.uk/pub/software/unix/iprscan/) and the SRS-basedInterProScan is available upon request. We provide the public web interface (http://www.ebi.ac.uk/interpro/scan.html) as well as email submission server (interproscan@ebi.ac.uk).  相似文献   

17.
18.
GenBank   总被引:51,自引:4,他引:47       下载免费PDF全文
The GenBank((R))sequence database incorporates publicly available DNA sequences of >55 000 different organisms, primarily through direct submission of sequence data from individual laboratories and large-scale sequencing projects. Most submissions are made using the BankIt (Web) or Sequin programs and accession numbers are assigned by GenBank staff upon receipt. Data exchange with the EMBL Data Library and the DNA Data Bank of Japan helps ensure comprehensive worldwide coverage. GenBank data is accessible through NCBI's integrated retrieval system, Entrez, which integrates data from the major DNA and protein sequence databases along with taxonomy, genome, mapping and protein structure information, plus the biomedical literature via PubMed. Sequence similarity searching is provided by the BLAST family of programs. Complete bimonthly releases and daily updates of the GenBank database are available by FTP. NCBI also offers a wide range of WWW retrieval and analysis services based on GenBank data. The GenBank database and related resources are freely accessible via the NCBI home page at http://www.ncbi.nlm.nih.gov  相似文献   

19.
SUMMARY: We recently developed algorithmic tools for the identification of functionally important regions in proteins of known three dimensional structure by estimating the degree of conservation of the amino-acid sites among their close sequence homologues. Projecting the conservation grades onto the molecular surface of these proteins reveals patches of highly conserved (or occasionally highly variable) residues that are often of important biological function. We present a new web server, ConSurf, which automates these algorithmic tools. ConSurf may be used for high-throughput characterization of functional regions in proteins. AVAILABILITY: The ConSurf web server is available at:http://consurf.tau.ac.il. SUPPLEMENTARY INFORMATION: A set of examples is available at http://consurf.tau.ac.il under 'GALLERY'.  相似文献   

20.
设为首页 | 免责声明 | 关于勤云 | 加入收藏

Copyright©北京勤云科技发展有限公司  京ICP备09084417号