Similar articles
Found 20 similar articles; search took 15 ms
1.
MOTIVATION: The development of chemoinformatics has been hampered by the lack of large, publicly available, comprehensive repositories of molecules, in particular of small molecules. Small molecules play a fundamental role in organic chemistry and biology. They can be used as combinatorial building blocks for chemical synthesis, as molecular probes in chemical genomics and systems biology, and for the screening and discovery of new drugs and other useful compounds. RESULTS: We describe ChemDB, a public database of small molecules available on the Web. ChemDB is built using the digital catalogs of over a hundred vendors and other public sources and is annotated with information derived from these sources as well as from computational methods, such as predicted solubility and three-dimensional structure. It supports multiple molecular formats and is periodically updated, automatically whenever possible. The current version of the database contains approximately 4.1 million commercially available compounds and 8.2 million counting isomers. The database includes a user-friendly graphical interface, chemical reaction capabilities, and unique search capabilities. AVAILABILITY: Database and datasets are available on http://cdb.ics.uci.edu.

2.
MOTIVATION: Accurate multiple sequence alignments are essential in protein structure modeling, functional prediction and efficient planning of experiments. Although the alignment problem has attracted considerable attention, preparation of high-quality alignments for distantly related sequences remains a difficult task. RESULTS: We developed PROMALS, a multiple alignment method that shows promising results for protein homologs with sequence identity below 10%, aligning close to half of the amino acid residues correctly on average. This is about three times more accurate than traditional pairwise sequence alignment methods. The PROMALS algorithm derives its strength from several sources: (i) sequence database searches to retrieve additional homologs; (ii) accurate secondary structure prediction; (iii) a hidden Markov model that uses a novel combined scoring of amino acids and secondary structures; (iv) probabilistic consistency-based scoring applied to progressive alignment of profiles. Compared to the best alignment methods that do not use secondary structure prediction and database searches (e.g. MUMMALS, ProbCons and MAFFT), PROMALS is up to 30% more accurate, with improvement being most prominent for highly divergent homologs. Compared to SPEM and HHalign, which also employ database searches and secondary structure prediction, PROMALS shows an accuracy improvement of several percent. AVAILABILITY: The PROMALS web server is available at: http://prodata.swmed.edu/promals/. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.

3.
WindowMasker: window-based masker for sequenced genomes   Total citations: 3 (self-citations: 0, citations by others: 3)
MOTIVATION: Matches to repetitive sequences are usually undesirable in the output of DNA database searches. Repetitive sequences need not be matched to a query, if they can be masked in the database. RepeatMasker/Maskeraid (RM), currently the most widely used software for DNA sequence masking, is slow and requires a library of repetitive template sequences, such as a manually curated RepBase library, that may not exist for newly sequenced genomes. RESULTS: We have developed a software tool called WindowMasker (WM) that identifies and masks highly repetitive DNA sequences in a genome, using only the sequence of the genome itself. WM is orders of magnitude faster than RM because WM uses a few linear-time scans of the genome sequence, rather than local alignment methods that compare each library sequence with each piece of the genome. We validate WM by comparing BLAST outputs from large sets of queries applied to two versions of the same genome, one masked by WM, and the other masked by RM. Even for genomes such as the human genome, where a good RepBase library is available, searching the database as masked with WM yields more matches that are apparently non-repetitive and fewer matches to repetitive sequences. We show that these results hold for transcribed regions as well. WM also performs well on genomes for which much of the sequence was in draft form at the time of the analysis. AVAILABILITY: WM is included in the NCBI C++ toolkit. The source code for the entire toolkit is available at ftp://ftp.ncbi.nih.gov/toolbox/ncbi_tools++/CURRENT/. Once the toolkit source is unpacked, the instructions for building the WindowMasker application in the UNIX environment can be found in file src/app/winmasker/README.build. SUPPLEMENTARY INFORMATION: Supplementary data are available at ftp://ftp.ncbi.nlm.nih.gov/pub/agarwala/windowmasker/windowmasker_suppl.pdf
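
The idea of masking repeats with linear-time scans of the genome alone, without a repeat library, can be illustrated with a small sketch. The abstract does not say how WindowMasker scores repetitiveness, so the k-mer counting rule, the function name `mask_repeats`, and the parameters `k` and `threshold` below are assumptions made purely for illustration; this is not the WindowMasker algorithm itself.

```python
from collections import Counter


def mask_repeats(genome, k=4, threshold=3):
    """Soft-mask (lowercase) every base covered by a frequent k-mer.

    Illustrative stand-in only: the counting rule, k and threshold are
    assumptions, not the scoring used by WindowMasker.
    """
    last_start = len(genome) - k + 1
    # Scan 1: count how often each k-mer occurs in the genome.
    counts = Counter(genome[i:i + k] for i in range(last_start))
    # Scan 2: flag positions covered by a k-mer seen at least `threshold` times.
    masked = [False] * len(genome)
    for i in range(last_start):
        if counts[genome[i:i + k]] >= threshold:
            for j in range(i, i + k):
                masked[j] = True
    # Lowercase the flagged bases and leave the rest untouched.
    return "".join(c.lower() if m else c for c, m in zip(genome, masked))


print(mask_repeats("ACGTACGTACGTTTGCA"))
```

For a fixed `k`, both scans touch each position a constant number of times, which is what keeps the cost linear in the genome length.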

4.
5.
6.
A pharmacophore feature is defined by a set of chemical structure patterns that contain the active site of a drug-like molecule. A pharmacophore can be used to assist in building hypotheses about desirable chemical properties in a drug molecule, and hence to refine and modify drug candidates. We predicted the pharmacophoric features of 150 medicinal compounds from plants with anti-cancer, anti-carcinogenic, anti-diabetic, anti-microbial, and anti-oxidant activity. Estimation of pharmacophoric features is necessary to ensure optimal supramolecular interaction with a biological target and to trigger or block its biological response. We subsequently made these data openly accessible through a database. AVAILABILITY: The database is available for free at http://www.hccbif.info/index.htm.

7.
Automated assembly of protein blocks for database searching.   Total citations: 52 (self-citations: 7, citations by others: 45)
A system is described for finding and assembling the most highly conserved regions of related proteins for database searching. First, an automated version of Smith's algorithm for finding motifs is used for sensitive detection of multiple local alignments. Next, the local alignments are converted to blocks and the best set of non-overlapping blocks is determined. When the automated system was applied successively to all 437 groups of related proteins in the PROSITE catalog, 1764 blocks resulted; these could be used for very sensitive searches of sequence databases. Each block was calibrated by searching the SWISS-PROT database to obtain a measure of the chance distribution of matches, and the calibrated blocks were concatenated into a database that could itself be searched. Examples are provided in which distant relationships are detected either using a set of blocks to search a sequence database or using sequences to search the database of blocks. The practical use of the blocks database is demonstrated by detecting previously unknown relationships between oxidoreductases and by evaluating a proposed relationship between HIV Vif protein and thiol proteases.

8.
Imidazole glycerol phosphate dehydratase (IGPD) has become an attractive target for herbicide discovery since it is present in plants and not in mammals. Currently no knowledge is available on the 3-D structure of the IGPD active site. Therefore, we used a pharmacophore model based on known inhibitors and 3-D database searches to identify new active compounds. In vitro testing of compounds from the database searches led to the identification of a class of pyrrole aldehydes as novel inhibitors of IGPD.

9.
MOTIVATION: Algorithmic and modeling advances in the area of protein-protein interaction (PPI) network analysis could contribute to the understanding of biological processes. Local structure of networks can be measured by the frequency distribution of graphlets, small connected non-isomorphic induced subgraphs. This measure of local structure has been used to show that high-confidence PPI networks have local structure of geometric random graphs. Finding graphlets exhaustively in a large network is computationally intensive. More complete PPI networks, as well as PPI networks of higher organisms, will thus require efficient heuristic approaches. RESULTS: We propose two efficient and scalable heuristics for finding graphlets in high-confidence PPI networks. We show that both PPI networks and their model geometric random networks have defined boundaries that are sparser than the 'inner parts' of the networks. In addition, these networks exhibit 'uniformity' of local structure inside the networks. Our first heuristic exploits these two structural properties of PPI and geometric random networks to find good estimates of graphlet frequency distributions in these networks up to 690 times faster than the exhaustive searches. Our second heuristic is a variant of a more standard sampling technique and it produces accurate approximate results up to 377 times faster than the exhaustive searches. We indicate how the combination of these approaches may result in an even better heuristic. AVAILABILITY: Supplementary information is available at http://www.cs.toronto.edu/~natasha/BIOINF-2005-0946/Supplementary.pdf. Software implementing the algorithms is available at http://www.cs.toronto.edu/~natasha/BIOINF-2005-0946/estimate_grap-hlets.html. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
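
To make the notion of a graphlet frequency distribution concrete, here is a minimal sketch of the exhaustive baseline, restricted to the two connected 3-node graphlets (the path and the triangle). It is not either of the heuristics described above; the function name `count_3node_graphlets` and the edge-list input format are assumptions for illustration.

```python
from itertools import combinations


def count_3node_graphlets(edges):
    """Exhaustively count connected induced 3-node subgraphs.

    Returns the number of induced paths (2 edges) and triangles (3 edges).
    """
    # Build an undirected adjacency map, ignoring self-loops.
    adj = {}
    for u, v in edges:
        if u == v:
            continue
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)

    counts = {"path": 0, "triangle": 0}
    # Inspect every node triple: O(n^3), which is why exhaustive
    # enumeration becomes expensive on large networks.
    for a, b, c in combinations(list(adj), 3):
        n_edges = (b in adj[a]) + (c in adj[a]) + (c in adj[b])
        if n_edges == 3:
            counts["triangle"] += 1
        elif n_edges == 2:
            counts["path"] += 1
    return counts


print(count_3node_graphlets([(1, 2), (2, 3), (1, 3), (3, 4)]))
```

The cubic cost of checking every triple, and the steeper cost for larger graphlets, is the motivation for sampling-based estimates.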

10.
11.
Here we describe a software tool for synthesizing molecular genetic data into models of genetic networks. Our software program Ingeneue, written in Java, lets the user quickly turn a map of a genetic network into a dynamical model consisting of a set of ordinary differential equations. We developed Ingeneue as part of an ongoing effort to explore the design and evolvability of genetic networks. Ingeneue has three principal advantages over other available mathematical software: it automates instantiation of the same network model in each cell in a 2-D sheet of cells; it constructs model equations from pre-made building blocks corresponding to common biochemical processes; and it automates searches through parameter space, sensitivity analyses, and other common tasks. Here we discuss the structure of the software and some of the issues we have dealt with. We conclude with some examples of results we have achieved with Ingeneue for the Drosophila segment polarity network.

12.
YAKUSA is a program designed for rapid scanning of a structural database with a query protein structure. It searches for the longest common substructures called SHSPs (structural high-scoring pairs) existing between a query structure and every structure in the structural database. It makes use of protein backbone internal coordinates (alpha angles) in order to describe protein structures as sequences of symbols. The structural similarities are established in 5 steps, the first 3 being analogous to those used in BLAST: (1) building up a deterministic finite automaton describing all patterns identical or similar to those in the query structure; (2) searching for all these patterns in every structure in the database; (3) extending the patterns to longer matching substructures (i.e., SHSPs); (4) selecting compatible SHSPs for each query-database structure pair; and (5) ranking the query-database structure pairs using 3 scores based on SHSP similarity, on SHSP probabilities, and on spatial compatibility of SHSPs. Structural fragment probabilities are estimated according to a mixture transition distribution model, which is an approximation of a high-order Markov chain model. With regard to sensitivity and selectivity of the structural matches, YAKUSA compares well to the best related programs, although it is far faster: a typical database scan takes about 40 s CPU time on a desktop personal computer. It has also been implemented on a Web server for real-time searches.

13.
On the basis of the homodimeric X-ray structure of dihydrolipoamide dehydrogenase from Azotobacter vinelandii we demonstrate by protein modeling techniques that two dimeric units of this enzyme can associate to a tetrameric structure with intense contacts between the building blocks. Complementary structures of the respective other unit in the tetramer contribute to the active sites. The coenzyme FAD becomes shielded from the environment, thus its binding is stabilized. By energy minimization techniques binding energies and RMS-values were computed and the contact areas between the building blocks were determined to quantify the interaction. In the cell, tetramerization of dihydrolipoamide dehydrogenase will be realized upon its incorporation as an enzyme component into the pyruvate dehydrogenase multienzyme complex and will have consequences for the structure and subunit stoichiometry of the complex. In particular, the multiplicity of the three enzyme components, i.e. pyruvate dehydrogenase, dihydrolipoamide acetyltransferase and dihydrolipoamide dehydrogenase in the enzyme complex must be 24:24:24 instead of 24:24:12 assumed so far. Electronic Supplementary Material available.

14.
Patterns of linkage disequilibrium in the MHC region on human chromosome 6p   Total citations: 5 (self-citations: 0, citations by others: 5)
Single nucleotide polymorphisms (SNPs) in the human genome are thought to be organised into blocks of high internal linkage disequilibrium (LD), separated by intermittent recombination hotspots. Since understanding haplotype structure is critical for an accurate assessment of inter-individual genetic differences, we investigated up to 968 SNPs from a 10-Mb region on chromosome 6p21, including the human major histocompatibility complex (MHC), in five different population samples (45–550 individuals). Regions of well-defined block structure were found to coexist alongside large areas lacking any clear structure; occasional long-range LD was observed in all five samples. The four white populations analysed were remarkably similar in terms of the extent and spatial distribution of local LD. In US African Americans, the distribution of LD was similar to that in the white populations but the observed haplotype diversity was higher. The existence of large regions without any clear block structure renders the systematic and thorough construction of SNP haplotype maps a crucial prerequisite for disease-association studies. Electronic Supplementary Material: Supplementary material is available in the online version of this article. Electronic database information: URLs for the data in this article are as follows:

15.
MOTIVATION: Profile searches of sequence databases are a sensitive way to detect sequence relationships. Sophisticated profile-profile comparison algorithms that have been recently introduced increase search sensitivity even further. RESULTS: In this article, a simpler approach than profile-profile comparison is presented whose performance is comparable to that of state-of-the-art tools such as COMPASS, HHsearch and PRC. This approach is called SCOOP (Simple Comparison Of Outputs Program), and is shown to find known relationships between families in the Pfam database as well as detect novel distant relationships between families. Several novel discoveries are presented including the discovery that a domain of unknown function (DUF283) found in Dicer proteins is related to double-stranded RNA-binding domains. AVAILABILITY: SCOOP is freely available under a GNU GPL license from http://www.sanger.ac.uk/Users/agb/SCOOP/. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.

16.
17.
MOTIVATION: Searches for near-exact sequence matches are performed frequently in large-scale sequencing projects and in comparative genomics. The time and cost of performing these large-scale sequence-similarity searches is prohibitive using even the fastest of the extant algorithms. Faster algorithms are desired. RESULTS: We have developed an algorithm, called SST (Sequence Search Tree), that searches a database of DNA sequences for near-exact matches, in time proportional to the logarithm of the database size n. In SST, we partition each sequence into fragments of fixed length called 'windows' using multiple offsets. Each window is mapped into a vector of dimension 4^k which contains the frequency of occurrence of its component k-tuples, with k a parameter typically in the range 4-6. Then we create a tree-structured index of the windows in vector space, with tree-structured vector quantization (TSVQ). We identify the nearest neighbors of a query sequence by partitioning the query into windows and searching the tree-structured index for nearest-neighbor windows in the database. When the tree is balanced this yields an O(log n) complexity for the search. This complexity was observed in our computations. SST is most effective for applications in which the target sequences show a high degree of similarity to the query sequence, such as assembling shotgun sequences or matching ESTs to genomic sequence. The algorithm is also an effective filtration method. Specifically, it can be used as a preprocessing step for other search methods to reduce the complexity of searching one large database against another. For the problem of identifying overlapping fragments in the assembly of 120 000 fragments from a 1.5 megabase genomic sequence, SST is 15 times faster than BLAST when we consider both building and searching the tree. For searching alone (i.e. after building the tree index), SST is 27 times faster than BLAST. AVAILABILITY: Request from the authors.
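
The window and k-tuple mapping described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names `windows` and `ktuple_vector`, the window width and the offsets are hypothetical, raw counts stand in for frequencies, and the TSVQ tree index is omitted entirely.

```python
from itertools import product


def windows(seq, width, offsets):
    """Yield (start, fragment) for fixed-length windows at each offset."""
    for off in offsets:
        for start in range(off, len(seq) - width + 1, width):
            yield start, seq[start:start + width]


def ktuple_vector(window, k):
    """Map a window to a vector of dimension 4**k of k-tuple counts."""
    # Fixed ordering of all k-tuples over the DNA alphabet.
    index = {"".join(p): i for i, p in enumerate(product("ACGT", repeat=k))}
    vec = [0] * (4 ** k)
    for i in range(len(window) - k + 1):
        j = index.get(window[i:i + k])
        if j is not None:  # skip k-tuples containing non-ACGT symbols
            vec[j] += 1
    return vec


for start, frag in windows("ACGTACGTAC", width=4, offsets=(0, 2)):
    print(start, frag, sum(ktuple_vector(frag, k=2)))
```

Windows with similar k-tuple composition map to nearby vectors, which is what allows a tree-structured index over the vectors to answer nearest-neighbor queries without aligning sequences directly.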

18.
Identification of proteins from the mass spectra of peptide fragments generated by proteolytic cleavage using database searching has become one of the most powerful techniques in proteome science, capable of rapid and efficient protein identification. Using computer simulation, we have studied how the application of chemical derivatisation techniques may improve the efficiency of protein identification from mass spectrometric data. These approaches enhance ion yield and lead to the promotion of specific ions and fragments, yielding additional database search information. The impact of three alternative techniques has been assessed by searching representative proteome databases for both single proteins and simple protein mixtures. For example, by reliably promoting fragmentation of singly-charged peptide ions at aspartic acid residues after homoarginine derivatisation, 82% of yeast proteins can be unambiguously identified from a single typical peptide-mass datum, with a measured mass accuracy of 50 ppm, by using the associated secondary ion data. The extra search information also provides a means to confidently identify proteins in protein mixtures where only limited data are available. Furthermore, the inclusion of limited sequence information for the peptides can match and even exceed the search efficiency available via high-accuracy searches of around 5 ppm, suggesting that this is a potentially useful approach for simple protein mixtures routinely obtained from two-dimensional gels.

19.
Lack of genomic sequence data and the relatively high cost of tandem mass spectrometry have hampered proteomic investigations into helminths, such as resolving the mechanism underpinning globally reported anthelmintic resistance. Whilst detailed mechanisms of resistance remain unknown for the majority of drug-parasite interactions, gene mutations and changes in gene and protein expression are proposed key aspects of resistance. Comparative proteomic analysis of drug-resistant and -susceptible nematodes may reveal protein profiles reflecting drug-related phenotypes. Using the gastro-intestinal nematode Haemonchus contortus as a case study, we report the application of freely available expressed sequence tag (EST) datasets to support proteomic studies in unsequenced nematodes. EST datasets were translated to theoretical protein sequences to generate a searchable database. In conjunction with matrix-assisted laser desorption ionisation time-of-flight mass spectrometry (MALDI-TOF-MS), Peptide Mass Fingerprint (PMF) searching of databases enabled a cost-effective protein identification strategy. The effectiveness of this approach was verified in comparison with MS/MS de novo sequencing with searching of the same EST protein database and subsequent searches of the NCBInr protein database using the Basic Local Alignment Search Tool (BLAST) to provide protein annotation. Of 100 proteins from 2-DE gel spots, 62 were identified by MALDI-TOF-MS and PMF searching of the EST database. Twenty randomly selected spots were analysed by electrospray MS/MS and MASCOT Ion Searches of the same database. The resulting sequences were subjected to BLAST searches of the NCBI protein database to provide annotation of the proteins and confirm concordance in protein identity from both approaches. Further confirmation of protein identifications from the MS/MS data was obtained by de novo sequencing of peptides, followed by FASTS algorithm searches of the EST putative protein database.
This study demonstrates the cost-effective use of available EST databases and inexpensive, accessible MALDI-TOF MS in conjunction with PMF for reliable protein identification in unsequenced organisms.

20.
This paper describes a publicly available knowledge base of the chemical compounds involved in intermediary metabolism. We consider the motivations for constructing a knowledge base of metabolic compounds, the methodology by which it was constructed, and the information that it currently contains. Currently the knowledge base describes 981 compounds, listing for each: synonyms for its name, a systematic name, CAS registry number, chemical formula, molecular weight, chemical structure and two-dimensional display coordinates for the structure. The Compound Knowledge Base (CompoundKB) illustrates several methodological principles that should guide the development of biological knowledge bases. I argue that biological datasets should be made available in multiple representations to increase their accessibility to end users, and I present multiple representations of the CompoundKB (knowledge base, relational data base and ASN.1 representations). I also analyze the general characteristics of these representations to provide an understanding of their relative advantages and disadvantages. Another principle is that the error rate of biological data bases should be estimated and documented; this analysis is performed for the CompoundKB.
