Similar literature
 20 similar articles found (search time: 31 ms)
1.
MedPost: a part-of-speech tagger for bioMedical text   (total citations: 1; self-citations: 0; citations by others: 1)
SUMMARY: We present a part-of-speech tagger that achieves over 97% accuracy on MEDLINE citations. AVAILABILITY: Software, documentation and a corpus of 5700 manually tagged sentences are available at ftp://ftp.ncbi.nlm.nih.gov/pub/lsmith/MedPost/medpost.tar.gz

2.
MOTIVATION: A tool that simultaneously aligns multiple protein sequences, automatically utilizes information about protein domains, and has a good compromise between speed and accuracy will have practical advantages over current tools. RESULTS: We describe COBALT, a constraint based alignment tool that implements a general framework for multiple alignment of protein sequences. COBALT finds a collection of pairwise constraints derived from database searches, sequence similarity and user input, combines these pairwise constraints, and then incorporates them into a progressive multiple alignment. We show that using constraints derived from the conserved domain database (CDD) and PROSITE protein-motif database improves COBALT's alignment quality. We also show that COBALT has reasonable runtime performance and alignment accuracy comparable to or exceeding that of other tools for a broad range of problems. AVAILABILITY: COBALT is included in the NCBI C++ toolkit. A Linux executable for COBALT, and the CDD and PROSITE data used, are available at: ftp://ftp.ncbi.nlm.nih.gov/pub/agarwala/cobalt

3.
The majority of human telomere length studies have focused on the overall length of telomeres within a cell. In fact, very few studies have examined telomere length for individual chromosome arms. The objective of this study was to examine the relationship between chromosome arm size and the relative length of the associated telomere. Quantitative Fluorescence In Situ Hybridization (Q-FISH) was used to measure the relative telomere length of each chromosome arm in metaphases from cultured lymphocytes of 17 individuals. A statistically significant positive correlation (r = 0.6) was found between telomere length and the size of the associated chromosome arm, which was estimated based on megabase pair measurements from http://www.ncbi.nlm.nih.gov/projects/mapview/.

4.
BLAST 2 Sequences, a new tool for comparing protein and nucleotide sequences   (total citations: 49; self-citations: 0; citations by others: 49)
'BLAST 2 Sequences', a new BLAST-based tool for aligning two protein or nucleotide sequences, is described. While the standard BLAST program is widely used to search for homologous sequences in nucleotide and protein databases, one often needs to compare only two sequences that are already known to be homologous, coming from related species or, e.g., different isolates of the same virus. In such cases searching the entire database would be unnecessarily time-consuming. 'BLAST 2 Sequences' utilizes the BLAST algorithm for pairwise DNA-DNA or protein-protein sequence comparison. A World Wide Web version of the program can be used interactively at the NCBI WWW site (http://www.ncbi.nlm.nih.gov/gorf/bl2.html). The resulting alignments are presented in both graphical and text form. The variants of the program for PC (Windows), Mac and several UNIX-based platforms can be downloaded from the NCBI FTP site (ftp://ncbi.nlm.nih.gov).

5.
dbSNP: a database of single nucleotide polymorphisms   (total citations: 12; self-citations: 0; citations by others: 12)
In response to a need for a general catalog of genome variation to address the large-scale sampling designs required by association studies, gene mapping and evolutionary biology, the National Center for Biotechnology Information (NCBI) has established the dbSNP database. Submissions to dbSNP will be integrated with other sources of information at NCBI such as GenBank, PubMed, LocusLink and the Human Genome Project data. The complete contents of dbSNP are available to the public at website: http://www.ncbi.nlm.nih.gov/SNP. Submitted SNPs can also be downloaded via anonymous FTP at ftp://ncbi.nlm.nih.gov/snp/

6.
Studies into the genetic origins of tumor cell chemoactivity pose significant challenges to bioinformatic mining efforts. Connections between measures of gene expression and chemoactivity have the potential to identify clinical biomarkers of compound response, cellular pathways important to efficacy and potential toxicities; all vital to anticancer drug development. An investigation has been conducted that jointly explores tumor-cell constitutive NCI60 gene expression profiles and small-molecule NCI60 growth inhibition chemoactivity profiles, viewed from novel applications of self-organizing maps (SOMs) and pathway-centric analyses of gene expressions, to identify subsets of over- and under-expressed pathway genes that discriminate chemo-sensitive and chemo-insensitive tumor cell types. Linear Discriminant Analysis (LDA) is used to quantify the accuracy of discriminating genes to predict tumor cell chemoactivity. LDA results find 15% higher prediction accuracies, using ∼30% fewer genes, for pathway-derived discriminating genes when compared to genes derived using conventional gene expression-chemoactivity correlations. The proposed pathway-centric data mining procedure was used to derive discriminating genes for ten well-known compounds. Discriminating genes were further evaluated using gene set enrichment analysis (GSEA) to reveal a cellular genetic landscape, comprised of small numbers of key over- and under-expressed on- and off-target pathway genes, as important for a compound’s tumor cell chemoactivity. Literature-based validations are provided as support for chemo-important pathways derived from this procedure. Qualitatively similar results are found when using gene expression measurements derived from different microarray platforms. The data used in this analysis are available at http://pubchem.ncbi.nlm.nih.gov/ and http://www.ncbi.nlm.nih.gov/projects/geo (GPL96, GSE32474).

7.
A tool for aligning very similar DNA sequences   (total citations: 4; self-citations: 0; citations by others: 4)
Results: We have produced a computer program, named sim3, that solves the following computational problem. Two DNA sequences are given, where the shorter sequence is very similar to some contiguous region of the longer sequence. Sim3 determines such a similar region of the longer sequence, and then computes an optimal set of single-nucleotide changes (i.e. insertions, deletions or substitutions) that will convert the shorter sequence to that region. Thus, the alignment scoring scheme is designed to model sequencing errors, rather than evolutionary processes. The program can align a 100 kb sequence to a 1 megabase sequence in a few seconds on a workstation, provided that there are very few differences between the shorter sequence and some region in the longer sequence. The program has been used to assemble sequence data for the Genomes Division at the National Center for Biotechnology Information. Availability: A version of sim3 for UNIX machines can be obtained by anonymous ftp from ncbi.nlm.nih.gov, in the pub/sim3 directory. Contact: For portable versions for Macs and PCs, contact zjing@sunset.nlm.nih.gov.
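
The computational problem stated in this abstract (locate the region of the longer sequence closest to the shorter one, counting single-nucleotide insertions, deletions and substitutions) can be illustrated with a textbook semi-global dynamic program. This is only a minimal sketch under unit edit costs; it is not sim3's actual algorithm, which the abstract does not describe, and the function name `best_region` is hypothetical.

```python
def best_region(short: str, long: str) -> tuple[int, int]:
    """Return (edit_distance, end_index) for the region of `long` closest to `short`.

    Semi-global dynamic programming: the matched region may start and end
    anywhere in `long` at no cost; each insertion, deletion or substitution
    costs 1 (an assumed unit-cost model of sequencing errors).
    """
    # prev[i] = fewest edits converting short[:i] into a region of `long`
    # that ends at the previous column.
    prev = list(range(len(short) + 1))
    best = (prev[-1], 0)
    for j, c in enumerate(long, start=1):
        cur = [0]  # an empty prefix of `short` matches anywhere for free
        for i, s in enumerate(short, start=1):
            cur.append(min(
                prev[i - 1] + (s != c),  # match or substitution
                prev[i] + 1,             # extra base in `long`
                cur[i - 1] + 1,          # extra base in `short`
            ))
        if cur[-1] < best[0]:
            best = (cur[-1], j)  # keep the first column reaching the minimum
        prev = cur
    return best


print(best_region("ACGT", "TTTACGTTTT"))
```

This sketch takes time proportional to the product of the two lengths, so it would not reach the speeds reported in the abstract, which depend on the sequences differing in very few positions.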

8.
A structure-based method for protein sequence alignment   (total citations: 1; self-citations: 0; citations by others: 1)
MOTIVATION: With the continuing rapid growth of protein sequence data, protein sequence comparison methods have become the most widely used tools of bioinformatics. Among these methods are those that use position-specific scoring matrices (PSSMs) to describe protein families. PSSMs can capture information about conserved patterns within families, which can be used to increase the sensitivity of searches for related sequences. Certain types of structural information, however, are not generally captured by PSSM search methods. Here we introduce a program, Structure-based ALignment TOol (SALTO), that aligns protein query sequences to PSSMs using rules for placing and scoring gaps that are consistent with the conserved regions of domain alignments from NCBI's Conserved Domain Database. RESULTS: In most cases, the alignment scores obtained using the local alignment version follow an extreme value distribution. SALTO's performance in finding related sequences and producing accurate alignments is similar to or better than that of IMPALA; one advantage of SALTO is that it imposes an explicit gapping model on each protein family. AVAILABILITY: A stand-alone version of the program that can generate global or local alignments is available by ftp distribution (ftp://ftp.ncbi.nih.gov/pub/SALTO/), and has been incorporated into the Cn3D structure/alignment viewer. CONTACT: bryant@ncbi.nlm.nih.gov.

9.
dbSNP: the NCBI database of genetic variation   (total citations: 1; self-citations: 0; citations by others: 1)
In response to a need for a general catalog of genome variation to address the large-scale sampling designs required by association studies, gene mapping and evolutionary biology, the National Center for Biotechnology Information (NCBI) has established the dbSNP database [S.T.Sherry, M.Ward and K. Sirotkin (1999) Genome Res., 9, 677-679]. Submissions to dbSNP will be integrated with other sources of information at NCBI such as GenBank, PubMed, LocusLink and the Human Genome Project data. The complete contents of dbSNP are available to the public at website: http://www.ncbi.nlm.nih.gov/SNP. The complete contents of dbSNP can also be downloaded in multiple formats via anonymous FTP at ftp://ncbi.nlm.nih.gov/snp/.

10.
Motivation: The key to MS-based proteomics is peptide sequencing. The major challenge in peptide sequencing, whether library search or de novo, is to better infer statistical significance and better attain noise reduction. Since the noise in a spectrum depends on experimental conditions, the instrument used and many other factors, it cannot be predicted even if the peptide sequence is known. The characteristics of the noise can only be uncovered once a spectrum is given. We wish to overcome such issues. Results: We designed RAId to identify peptides from their associated tandem mass spectrometry data. RAId performs a novel de novo sequencing followed by a search in a peptide library that we created. Through de novo sequencing, we establish the spectrum-specific background score statistics for the library search. When the database search fails to return significant hits, the top-ranking de novo sequences become potential candidates for new peptides that are not yet in the database. The use of spectrum-specific background statistics seems to enable RAId to perform well even when the spectral quality is marginal. Other important features of RAId include its potential in de novo sequencing alone and the ease of incorporating post-translational modifications. Availability: Programs implementing the methods described are available from the authors on request. Contact: yyu@ncbi.nlm.nih.gov Supplementary information: ftp://ftp.ncbi.nih.gov/pub/yyu/Proteomics/MSMS/RAId/MSMS_bioinfo_supp.pdf

11.
MOTIVATION: The blastp and tblastn modules of BLAST are widely used methods for searching protein queries against protein and nucleotide databases, respectively. One heuristic used in BLAST is to consider only database sequences that contain a high-scoring match of length at most 5 to the query. We implemented the capability to use words of length 6 or 7. We demonstrate an improved trade-off between running time and retrieval accuracy, controlled by the score threshold used for short word matches. For example, the running time can be reduced by 20-30% while achieving ROC (receiver operator characteristic) scores similar to those obtained with current default parameters. AVAILABILITY: The option to use long words is in the NCBI C and C++ toolkit code for BLAST, starting with version 2.2.16 of blastall. A Linux executable used to produce the results herein is available at: ftp://ftp.ncbi.nlm.nih.gov/pub/agarwala/protein_longwords
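
The word-match heuristic mentioned in this abstract can be illustrated with a deliberately simplified sketch that seeds on exact shared words only. BLAST's protein modules actually accept any word pair whose substitution-matrix score exceeds a threshold, which this sketch omits; the function name `word_hits` and the toy sequences are hypothetical.

```python
from collections import defaultdict


def word_hits(query: str, subject: str, w: int = 3) -> list[tuple[int, int]]:
    """Return (query_pos, subject_pos) pairs where a length-w word matches exactly.

    Simplification: real protein BLAST also accepts similar, non-identical
    words that score above a threshold; only exact matches are found here.
    """
    # Index every length-w word of the query by its start position.
    index = defaultdict(list)
    for i in range(len(query) - w + 1):
        index[query[i:i + w]].append(i)
    # Scan the subject once and look each of its words up in the index.
    hits = []
    for j in range(len(subject) - w + 1):
        for i in index.get(subject[j:j + w], ()):
            hits.append((i, j))
    return hits


print(word_hits("MKVLAA", "GGKVLAG", w=3))
print(word_hits("MKVLAA", "GGKVLAG", w=5))
```

In this toy case the shorter word length yields two seeds and the longer one yields none, which hints at why word length trades running time against sensitivity.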

12.
WindowMasker: window-based masker for sequenced genomes   (total citations: 3; self-citations: 0; citations by others: 3)
MOTIVATION: Matches to repetitive sequences are usually undesirable in the output of DNA database searches. Repetitive sequences need not be matched to a query, if they can be masked in the database. RepeatMasker/Maskeraid (RM), currently the most widely used software for DNA sequence masking, is slow and requires a library of repetitive template sequences, such as a manually curated RepBase library, that may not exist for newly sequenced genomes. RESULTS: We have developed a software tool called WindowMasker (WM) that identifies and masks highly repetitive DNA sequences in a genome, using only the sequence of the genome itself. WM is orders of magnitude faster than RM because WM uses a few linear-time scans of the genome sequence, rather than local alignment methods that compare each library sequence with each piece of the genome. We validate WM by comparing BLAST outputs from large sets of queries applied to two versions of the same genome, one masked by WM, and the other masked by RM. Even for genomes such as the human genome, where a good RepBase library is available, searching the database as masked with WM yields more matches that are apparently non-repetitive and fewer matches to repetitive sequences. We show that these results hold for transcribed regions as well. WM also performs well on genomes for which much of the sequence was in draft form at the time of the analysis. AVAILABILITY: WM is included in the NCBI C++ toolkit. The source code for the entire toolkit is available at ftp://ftp.ncbi.nih.gov/toolbox/ncbi_tools++/CURRENT/. Once the toolkit source is unpacked, the instructions for building the WindowMasker application in the UNIX environment can be found in file src/app/winmasker/README.build. SUPPLEMENTARY INFORMATION: Supplementary data are available at ftp://ftp.ncbi.nlm.nih.gov/pub/agarwala/windowmasker/windowmasker_suppl.pdf
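
The idea of masking repeats using only the genome's own sequence can be illustrated by a two-pass k-mer counting sketch: one linear scan counts k-mers, a second soft-masks (lowercases) every base covered by a frequent k-mer. This is not WindowMasker's actual scoring scheme, which the abstract does not specify; the function name `mask_repeats` and the values of `k` and `threshold` are assumptions chosen for a toy example.

```python
from collections import Counter


def mask_repeats(seq: str, k: int = 4, threshold: int = 3) -> str:
    """Soft-mask bases covered by any k-mer occurring at least `threshold` times.

    Illustrative only: `k` and `threshold` are arbitrary toy values, and the
    sequence is assumed to be upper case on input.
    """
    # Pass 1: count every k-mer in one linear scan.
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    # Pass 2: flag every position covered by a frequent k-mer.
    masked = [False] * len(seq)
    for i in range(len(seq) - k + 1):
        if counts[seq[i:i + k]] >= threshold:
            for p in range(i, i + k):
                masked[p] = True
    return "".join(c.lower() if m else c for c, m in zip(seq, masked))


print(mask_repeats("ACACACACACGTTGCA"))
```

Both passes are linear in the sequence length for fixed `k`, and no repeat library is consulted.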

13.
14.
15.
We expand the functionally uncharacterized DOMON domain superfamily to identify several novel families, including the first prokaryotic representatives. Using several computational tools we show that it is involved in ligand binding, either as heme- or sugar-binding domains. We present evidence that the DOMON domain along with the DM13 domain comprises a novel electron-transfer system potentially involved in oxidative modification of animal cell-surface proteins. Other novel versions might function as sugar sensors of histidine kinases of bacterial two component systems. Supplementary information: Supplementary data are available at Bioinformatics online and also at ftp://ftp.ncbi.nih.gov/pub/aravind/domon/.

16.
Accurate multiple sequence alignments of proteins are very important to several areas of computational biology and provide an understanding of phylogenetic history of domain families, their identification and classification. This article presents a new algorithm, REFINER, that refines a multiple sequence alignment by iterative realignment of its individual sequences with the predetermined conserved core (block) model of a protein family. Realignment of each sequence can correct misalignments between a given sequence and the rest of the profile and at the same time preserves the family's overall block model. Large-scale benchmarking studies showed a noticeable improvement of alignment after refinement. This can be inferred from the increased alignment score and enhanced sensitivity for database searching using the sequence profiles derived from refined alignments compared with the original alignments. A standalone version of the program is available by ftp distribution (ftp://ftp.ncbi.nih.gov/pub/REFINER) and will be incorporated into the next release of the Cn3D structure/alignment viewer.

17.
18.
The effect of light and calcium depletion on in vivo protein phosphorylation was tested using dark-grown roots of Merit corn. Light caused rapid and specific promotion of phosphorylation of three polypeptides. Pretreatment of roots with ethylene glycol bis N,N,N′, N′ tetraacetic acid and A23187 prevented light-induced changes in protein phosphorylation. We postulate that these changes in protein phosphorylation are involved in the light-induced gravity response.

19.
The Conserved Domain Database (CDD) is now indexed as a separate database within the Entrez system and linked to other Entrez databases such as MEDLINE(R). This allows users to search for domain types by name, for example, or to view the domain architecture of any protein in Entrez's sequence database. CDD can be accessed on the World Wide Web at http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=cdd. Users may also employ the CD-Search service to identify conserved domains in new sequences, at http://www.ncbi.nlm.nih.gov/Structure/cdd/wrpsb.cgi. CD-Search results, and pre-computed links from Entrez's protein database, are calculated using the RPS-BLAST algorithm and Position Specific Score Matrices (PSSMs) derived from CDD alignments. CD-Searches are also run by default for protein-protein queries submitted to BLAST(R) at http://www.ncbi.nlm.nih.gov/BLAST. CDD mirrors the publicly available domain alignment collections SMART and PFAM, and now also contains alignment models curated at NCBI. Structure information is used to identify the core substructure likely to be present in all family members, and to produce sequence alignments consistent with structure conservation. This alignment model allows NCBI curators to annotate 'columns' corresponding to functional sites conserved among family members.

20.
Microarray-based enrichment of selected genomic loci is a powerful method for genome complexity reduction for next-generation sequencing. Since the vast majority of exons in vertebrate genomes are smaller than 150 nt, we explored the use of short fragment libraries (85–110 bp) to achieve higher enrichment specificity by reducing carryover and adverse effects of flanking intronic sequences. High enrichment specificity (60–75%) was obtained with relatively even base coverage. Up to 98% of the target sequence was covered more than 20× at an average coverage depth of about 200×. To verify the accuracy of SNP/mutation detection, we evaluated 384 known non-reference SNPs in the targeted regions. At ∼200× average sequence coverage, we were able to survey 96.4% of 1.69 Mb of genomic sequence with only 4.2% false negative calls, mostly due to low coverage. Using the same settings, a total of 1197 novel candidate variants were detected. Verification experiments revealed only eight false positive calls, indicating an overall false positive rate of less than 1 per ∼200 000 bp. Taken together, short fragment libraries provide highly efficient and flexible enrichment of exonic targets and yield relatively even base coverage, which facilitates accurate SNP and mutation detection. Raw sequencing data, alignment files and called SNPs have been submitted to the GEO database http://www.ncbi.nlm.nih.gov/geo/ with accession number GSE18542.
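
The false positive rate quoted in this abstract can be checked with a few lines of arithmetic. The sketch below assumes the rate was computed over the surveyed bases (96.4% of 1.69 Mb); the abstract does not state the denominator, so this is an assumption, and the function name is hypothetical.

```python
def bp_per_false_positive(target_bp: float, surveyed_fraction: float,
                          false_positives: int) -> float:
    """Return the number of surveyed base pairs per false positive call."""
    surveyed_bp = target_bp * surveyed_fraction  # assumed denominator
    return surveyed_bp / false_positives


# Figures taken from the abstract: 1.69 Mb target, 96.4% surveyed, 8 false positives.
rate = bp_per_false_positive(1_690_000, 0.964, 8)
print(f"about 1 false positive per {rate:,.0f} bp")
```

Under this assumption the result is a little over 200 000 bp per false positive, consistent with the stated rate of less than 1 per ∼200 000 bp; using the full 1.69 Mb as the denominator gives a slightly larger figure and the same conclusion.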
