Similar literature
 Found 20 similar articles; search time 982 ms
1.
The development of efficient DNA sequencing methods has led to the determination of the DNA sequence of entire genomes from (to date) 55 prokaryotes, 5 eukaryotic organisms and 10 eukaryotic chromosomes. Thus, an enormous amount of DNA sequence data is available and even more will be forthcoming in the near future. Analysis of this overwhelming amount of data requires bioinformatic tools in order to identify genes that encode functional proteins or RNA. This is an important task, considering that even in the well-studied Escherichia coli more than 30% of the identified open reading frames are hypothetical genes. Future challenges of genome sequence analysis will include the understanding of gene regulation and metabolic pathway reconstruction including DNA chip technology, which holds tremendous potential for biomedicine and the biotechnological production of valuable compounds. The overwhelming volume of information often confuses scientists. This review intends to provide a guide to choosing the most efficient way to analyze a new sequence or to collect information on a gene or protein of interest by applying current publicly available databases and Web services. Recently developed tools that allow functional assignment of genes, mainly based on sequence similarity of the deduced amino acid sequence, using the currently available and increasing biological databases will be discussed.

2.
Insights from human/mouse genome comparisons   Cited by: 4 (self-citations: 0, citations by others: 4)
Large-scale public genomic sequencing efforts have provided a wealth of vertebrate sequence data poised to provide insights into mammalian biology. These include deep genomic sequence coverage of human, mouse, rat, zebrafish, and two pufferfish (Fugu rubripes and Tetraodon nigroviridis) (Aparicio et al. 2002; Lander et al. 2001; Venter et al. 2001; Waterston et al. 2002). In addition, a high priority has been placed on determining the genomic sequence of chimpanzee, dog, cow, frog, and chicken (Boguski 2002). While only recently available, whole genome sequence data have provided the unique opportunity to globally compare complete genome contents. Furthermore, the shared evolutionary ancestry of vertebrate species has allowed the development of comparative genomic approaches to identify ancient conserved sequences with functionality. Accordingly, this review focuses on the initial comparison of available mammalian genomes and describes various insights derived from such analysis.

3.
We present a software package, Genquire, that allows visualization, querying, hand editing, and de novo markup of complete or partially annotated genomes. The system is written in Perl/Tk and uses, where possible, existing BioPerl data models and methods for representation and manipulation of the sequence and annotation objects. An adaptor API is provided to allow Genquire to display a wide range of databases and flat files, and a plugins API provides an interface to other sequence analysis software. AVAILABILITY: Genquire v3.03 is open-source software. The code is available for download and/or contribution at http://www.bioinformatics.org/Genquire

4.

Background  

The Medium-chain Dehydrogenases/Reductases (MDR) form a protein superfamily whose size and complexity defeat traditional means of subclassification; it currently has over 15000 members in the databases, the pairwise sequence identity is typically around 25%, there are members from all kingdoms of life, the chain-lengths vary as does the oligomericity, and the members take part in a multitude of biological processes. There are profile hidden Markov models (HMMs) available for detecting MDR superfamily members, but none for determining which MDR family each protein belongs to. The current torrential influx of new sequence data enables elucidation of more and more protein families, and at an increasingly fine granularity. However, gathering good quality training data usually requires manual attention by experts and has therefore been the rate-limiting step for expanding the number of available models.

5.
Z Sun, W Tian. PloS one 2012, 7(8):e42887
The third generation of sequencing technologies produces sequence reads of 1000 bp or more that may contain high polymorphism information. However, most currently available sequence analysis tools are developed specifically for analyzing short sequence reads. While the traditional Smith-Waterman (SW) algorithm can be used to map long sequence reads, its naive implementation is computationally infeasible. We have developed a new Sequence mapping and Analyzing Program (SAP) that implements a modified version of SW to speed up the alignment process. In benchmarks with simulated and real exon sequencing data and real E. coli genome sequence data generated by the third-generation sequencing technologies, SAP outperforms currently available tools for mapping short and long sequence reads in both speed and proportion of captured reads. In addition, it achieves high accuracy in detecting SNPs and InDels in the simulated data. SAP is available at https://github.com/davidsun/SAP.
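
For orientation, the traditional Smith-Waterman recurrence mentioned in this abstract can be sketched as follows. This is the textbook quadratic-time version with a linear gap penalty, not SAP's modified algorithm; the function name and the scoring values (`match`, `mismatch`, `gap`) are illustrative assumptions and do not come from the paper.

```python
def smith_waterman_score(a: str, b: str, match: int = 2,
                         mismatch: int = -1, gap: int = -2) -> int:
    """Best local alignment score of a and b (hypothetical scoring values)."""
    prev = [0] * (len(b) + 1)  # previous row of the dynamic-programming matrix
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            # Score of aligning a[i-1] with b[j-1] on top of the diagonal cell.
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # Local alignment: a cell never drops below zero.
            curr[j] = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best
```

The two nested loops visit every cell of a len(a) × len(b) matrix, which is why the naive form becomes impractical when long reads are mapped against a whole genome.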

6.
The currently available yeast mitochondrial DNA (mtDNA) sequence is incomplete, contains many errors and is derived from several polymorphic strains. Here, we report that the mtDNA sequence of the strain used for nuclear genome sequencing assembles into a circular map of 85 779 bp which includes 10 kb of new sequence. We give a list of seven small hypothetical open reading frames (ORFs). Hot spots of point mutations are found in exons near the insertion sites of optional mobile group I intron-related sequences. Our data suggest that shuffling of mobile elements plays an important role in the remodelling of the yeast mitochondrial genome.

7.
8.
Promoter trapping involved screening uncharacterized fragments of C. elegans genomic DNA for C. elegans promoter activity. By sequencing the ends of these DNA fragments and locating their genomic origin using the available genome sequence data, promoter trapping has now been shown to identify real promoters of real genes, exactly as anticipated. Developmental expression patterns have thereby been linked to gene sequence, allowing further inferences on gene function to be drawn. Some expression patterns generated by promoter trapping include subcellular details. Localization to the surface of particular cells or even particular aspects of the cell surface was found to be consistent with the genes, now associated with these patterns, encoding membrane-spanning proteins. Data on gene expression patterns are easier to generate and characterize than mutant phenotypes and may provide the best means of interpreting the large quantity of sequence data currently being generated in genome projects. Received: 12 June 1998 / Accepted: 21 August 1998

9.
GenBank.   Cited by: 8 (self-citations: 3, citations by others: 5)
The GenBank sequence database continues to expand its data coverage, quality control, annotation content and retrieval services for the scientific community. Besides handling direct submissions of sequence data from authors, GenBank also incorporates DNA sequences from all available public sources; an integrated retrieval system, known as Entrez, also makes available data from the major protein sequence and structural databases, and from U.S. and European patents. MEDLINE abstracts from published articles describing the sequences are also included as an additional source of biological annotation for sequence entries. GenBank supports distribution of the data via FTP, CD-ROM, and E-mail servers. Network server-client programs provide access to an integrated database for literature retrieval and sequence similarity searching.

10.
With the advent of DNA sequencing technologies, more and more reference genome sequences are available for many organisms. Analyzing sequence variation and understanding its biological importance are becoming a major research aim. However, how to store and process the huge amount of eukaryotic genome data, such as those of the human, mouse and rice, has become a challenge to biologists. Currently available bioinformatics tools used to compress genome sequence data have some limitations, such as the requirement of the reference single nucleotide polymorphisms (SNPs) map and information on deletions and insertions. Here, we present a novel compression tool for storing and analyzing Genome ReSequencing data, named GRS. GRS is able to process the genome sequence data without the use of the reference SNPs and other sequence variation information and automatically rebuild the individual genome sequence data using the reference genome sequence. When its performance was tested on the first Korean personal genome sequence data set, GRS was able to achieve ~159-fold compression, reducing the size of the data from 2986.8 to 18.8 MB. While being tested against the sequencing data from rice and Arabidopsis thaliana, GRS compressed the 361.0 MB rice genome data to 4.4 MB, and the A. thaliana genome data from 115.1 MB to 6.5 KB. This de novo compression tool is available at http://gmdd.shgmo.org/Computational-Biology/GRS.
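
The general idea of storing an individual genome as differences from a reference, and the fold-compression arithmetic quoted in this abstract, can be sketched roughly as below. This is not the GRS algorithm: the sketch assumes equal-length sequences and handles substitutions only, and all function names are hypothetical.

```python
from typing import List, Tuple


def diff_against_reference(reference: str, individual: str) -> List[Tuple[int, str]]:
    """Record (position, base) wherever the individual differs from the reference."""
    if len(reference) != len(individual):
        # Assumption of this sketch: no insertions or deletions.
        raise ValueError("this sketch only handles equal-length sequences")
    return [(i, b) for i, (r, b) in enumerate(zip(reference, individual)) if r != b]


def rebuild(reference: str, diffs: List[Tuple[int, str]]) -> str:
    """Reconstruct the individual sequence from the reference plus stored differences."""
    seq = list(reference)
    for pos, base in diffs:
        seq[pos] = base
    return "".join(seq)


def compression_fold(original_size: float, compressed_size: float) -> float:
    """Fold compression, with both sizes given in the same unit."""
    return original_size / compressed_size
```

With the sizes quoted for the Korean genome data set, `compression_fold(2986.8, 18.8)` gives about 158.9, consistent with the reported ~159-fold figure.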

11.
The KiSS1/GPR54 system in fish   Cited by: 1 (self-citations: 0, citations by others: 1)
Elizur A. Peptides 2009, 30(1):164-170

12.
Four major protein sequence data collections (NBRF-PIR, PSD-Kyoto, PGtrans, and NEWAT) have been merged into a single nonredundant data bank called PseqIP. The data bank entries were automatically matched by a heuristic computer program relying on the fast computation of the number of tetrapeptides shared by two sequences. PseqIP 1.0 includes 6,068 different protein sequences for a total of 1,357,067 residues, representing most of the available sequence information to date. During the course of this work, we found about 600 occurrences of a protein sequence recorded with a one-amino-acid variation in at least two different data banks. A flat file (ASCII computer-readable format) version of PseqIP 1.0, well-suited for exhaustive homology searches and statistical sequence analysis, is available from our laboratory.
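
The quantity this matching heuristic relies on, the number of tetrapeptides shared by two sequences, might be computed along the following lines. The abstract does not say whether repeated tetrapeptides are counted more than once, so this sketch assumes distinct tetrapeptides; the function names are hypothetical and this is not the PseqIP program itself.

```python
from typing import Set


def kmers(seq: str, k: int = 4) -> Set[str]:
    """All distinct overlapping k-residue words in seq (k=4 gives tetrapeptides)."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def shared_tetrapeptides(a: str, b: str) -> int:
    """Number of distinct tetrapeptides present in both sequences (an assumption)."""
    return len(kmers(a) & kmers(b))
```

Two entries that differ by a single residue still share most of their tetrapeptides, so a high shared count relative to sequence length flags likely duplicates or one-amino-acid variants.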

13.
Sfixem is a sequence feature series (SFS) visualization tool implemented in Java. It is designed to visualize data from sequence analysis programs, allowing the user to view multiple sets of computationally generated analysis to assist the analysis process. SFS is used as the data exchange format. AVAILABILITY: Sfixem is available for direct usage or download for local usage at http://sfixem.cgb.ki.se. A protein sequence analysis workbench using Sfixem is available at http://sfinx.cgb.ki.se.

14.
Intraspecific genetic variation of Echinococcus multilocularis, the etiologic agent of human alveolar echinococcosis, has been evaluated among 76 geographic isolates from Europe, Asia and North America by using sequence data of mitochondrial and nuclear DNA. Relatively low genetic variation was found only in the mitochondrial DNA sequence consisting of 3 protein-coding genes. Pairwise divergence among the resultant 18 haplotypes ranged from 0.03 to 1.91%. Phylogenetic trees and parsimony network of these haplotypes depicted a geographic division into European, Asian and North American clades, but 1 haplotype from Inner Mongolia was unrelated to other haplotypes. The coexistence of the Asian and North American haplotypes could be seen, particularly on the St. Lawrence Island in the Bering Sea. These data suggest an evolutionary scenario in which distinct parasite populations derived from glacial refugia have been maintained by indigenous host mammals. The nuclear DNA sequence for the immunodominant B cell epitope region of ezrin/radixin/moesin-like protein (elp) was extremely conservative, indicating that the elp antigen is available for immunodiagnosis in any endemic areas.

15.
Phylogeny estimation is extremely crucial in the study of molecular evolution. The increase in the amount of available genomic data facilitates phylogeny estimation from multilocus sequence data. Although maximum likelihood and Bayesian methods are available for phylogeny reconstruction using multilocus sequence data, these methods require heavy computation, and their application is limited to the analysis of a moderate number of genes and taxa. Distance matrix methods present suitable alternatives for analyzing huge amounts of sequence data. However, the manner in which distance methods can be applied to multilocus sequence data remains unknown. Here, we suggest new procedures to estimate molecular phylogeny using multilocus sequence data and evaluate its significance in the framework of the distance method. We found that concatenation of the multilocus sequence data may result in incorrect phylogeny estimation with an extremely high bootstrap probability (BP), which is due to incorrect estimation of the distances and intentional ignorance of the intergene variations. Therefore, we suggest that the distance matrices for multilocus sequence data be estimated separately and these matrices be subsequently combined to reconstruct phylogeny instead of phylogeny reconstruction using concatenated sequence data. To calculate the BPs of the reconstructed phylogeny, we suggest that 2-stage bootstrap procedures be adopted; in this, genes are resampled followed by resampling of the sequence columns within the resampled genes. By resampling the genes during calculation of BPs, intergene variations are properly considered. Via simulation studies and empirical data analysis, we demonstrate that our 2-stage bootstrap procedures are more suitable than the conventional bootstrap procedure that is adopted after sequence concatenation.
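
The resampling step of the 2-stage bootstrap described in this abstract (genes first, then alignment columns within each resampled gene) can be sketched as follows. The data layout (each gene as a list of aligned strings, one per taxon) and the function names are assumptions made for illustration; distance estimation and tree reconstruction are left out.

```python
import random
from typing import List

Alignment = List[str]  # one aligned sequence per taxon, all of equal length


def resample_columns(alignment: Alignment, rng: random.Random) -> Alignment:
    """Stage 2: draw alignment columns with replacement within a single gene."""
    n_cols = len(alignment[0])
    cols = [rng.randrange(n_cols) for _ in range(n_cols)]
    return ["".join(seq[c] for c in cols) for seq in alignment]


def two_stage_bootstrap(genes: List[Alignment], rng: random.Random) -> List[Alignment]:
    """Stage 1: draw genes with replacement; stage 2: resample columns in each."""
    sampled = [rng.choice(genes) for _ in range(len(genes))]
    return [resample_columns(gene, rng) for gene in sampled]
```

Each call to `two_stage_bootstrap` yields one pseudo-replicate; repeating it many times and rebuilding the tree from each replicate would give the bootstrap probabilities.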

16.

Background  

For many types of analyses, data about gene structure and locations of non-coding regions of genes are required. Although a vast amount of genomic sequence data is available, precise annotation of genes is lagging behind. Finding the corresponding gene of a given protein sequence by means of conventional tools is error-prone, and cannot be completed without manual inspection, which is time-consuming and requires considerable experience.

17.
SUMMARY: OTUbase is an R package designed to facilitate the analysis of operational taxonomic unit (OTU) data and sequence classification (taxonomic) data. Currently there are programs that will cluster sequence data into OTUs and/or classify sequence data into known taxonomies. However, there is a need for software that can take the summarized output of these programs and organize it into easily accessed and manipulated formats. OTUbase provides this structure and organization within R, to allow researchers to easily manipulate the data with the rich library of R packages currently available for additional analysis. AVAILABILITY: OTUbase is an R package available through Bioconductor. It can be found at http://www.bioconductor.org/packages/release/bioc/html/OTUbase.html.

18.
We report the development of a publicly accessible, curated nucleotide sequence database of hypocrealean entomopathogenic fungi. The goal is to provide a platform for users to easily access sequence data from taxonomic reference strains. The database can be used to accurately identify unknown entomopathogenic fungi based on sequence data for a variety of phylogenetically informative loci. The database provides full multi-locus sequence alignment capabilities. The initial release contains data compiled for 525 strains covering the phylogenetic diversity of three important entomopathogenic families: Clavicipitaceae, Cordycipitaceae, and Ophiocordycipitaceae. Furthermore, Entomopathogen ID can be expanded to other fungal clades of insect pathogens, as sequence data becomes available. The database will allow isolate characterisation and evolutionary analyses. We contend that this freely available, web-accessible database will facilitate the broader community to accurately identify fungal entomopathogens, which will allow users to communicate research results more effectively.

19.
Illumina's Genome Analyzer generates ultra-short sequence reads, typically 36 nucleotides in length, and is primarily intended for resequencing. We tested the potential of this technology for de novo sequence assembly on the 6 Mbp genome of Pseudomonas syringae pv. syringae B728a with several freely available assembly software packages. Using an unpaired data set, velvet assembled >96% of the genome into contigs with an N50 length of 8289 nucleotides and an error rate of 0.33%. edena generated smaller contigs (N50 was 4192 nucleotides) and comparable error rates. ssake and vcake yielded shorter contigs with very high error rates. Assembly of paired-end sequence data carrying 400 bp inserts produced longer contigs (N50 up to 15 628 nucleotides), but with increased error rates (0.5%). Contig length and error rate were very sensitive to the choice of parameter values. Noncoding RNA genes were poorly resolved in de novo assemblies, while >90% of the protein-coding genes were assembled with 100% accuracy over their full length. This study demonstrates that, in practice, de novo assembly of 36-nucleotide reads can generate reasonably accurate assemblies from about 40× deep sequence data sets. These draft assemblies are useful for exploring an organism's proteomic potential, at a very low cost.
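
The N50 statistic used in this abstract to compare assemblers is conventionally the length of the contig at which the running total, taken from the longest contig downward, first reaches half of the total assembled length. A minimal sketch under that common definition (the function name is hypothetical, and the study may have computed N50 relative to a different total):

```python
from typing import Iterable


def n50(contig_lengths: Iterable[int]) -> int:
    """Contig length at which the cumulative sum reaches half the total length."""
    lengths = sorted(contig_lengths, reverse=True)
    total = sum(lengths)
    running = 0
    for length in lengths:
        running += length
        if running * 2 >= total:
            return length
    raise ValueError("no contigs given")
```

For contigs of lengths 2, 3, 4, 5 and 6 the total is 20; the two longest contigs (6 + 5 = 11) already cover half of it, so the N50 is 5.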

20.
Systematic investigation of cellular processes by mass spectrometric detection of peptides obtained from protein digestion or directly from immuno-purification can be a powerful tool when used appropriately. The true sequence of these peptides is defined by the interpretation of spectral data using a variety of available algorithms. However, peptide match algorithm scoring is typically based on some, but not all, of the mechanisms of peptide fragmentation. Although algorithm rules for soft ionization techniques generally fit very well to tryptic peptides, manual validation of spectra is often required for endogenous peptides such as MHC class I molecules where traditional trypsin digest techniques are not used. This study summarizes data mining and manual validation of hundreds of peptide sequences from MHC class I molecules in publicly available data files. We herein describe several important features to improve and quantify manual validation for these endogenous peptides after automated algorithm searching. Important fragmentation patterns are discussed for the studied MHC class I peptides. These findings lead to practical rules that are helpful when performing manual validation. Furthermore, these observations may be useful to improve current peptide search algorithms or development of novel software tools.
