Similar literature
 Found 20 similar articles (search time: 15 ms)
1.
Gene and protein sequence analyses, central components of studies in modern biology, are readily amenable to string matching and pattern recognition algorithms. The growing need to analyse whole genome sequences more efficiently and thoroughly has led to the emergence of new computational methods. Suffix trees and suffix arrays are data structures that are well known in many other areas and are highly suited to sequence analysis as well. Here we report an improvement to the design of suffix array construction. The enhancement in versatility and scalability enabled by this approach is demonstrated through real-life examples. The scalability of the algorithm to whole genomes renders it suitable for addressing many biologically interesting problems. One example is the evolutionary insight gained by analysing unigrams, bi-grams and higher n-grams, indicating that the genetic code has a direct influence on the overall composition of the genome. Further, different proteomes have been analysed for their coverage of the possible peptide space; the results indicate that as much as a quarter of the total space at the tetra-peptide level is left unsampled in prokaryotic organisms, although almost all tri-peptides can be seen in one protein or another in a proteome. In addition, distinct patterns begin to emerge for the counts of particular tetra- and higher peptides, indicative of a ‘meaning’ for tetra- and higher n-grams. The toolkit has also been used to demonstrate the usefulness of efficiently identifying repeats in whole proteomes. As an example, 16 members of one COG, coded by the genome of Mycobacterium tuberculosis H37Rv, have been found to contain a repeating sequence of 300 amino acids.
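
The kind of analysis described in this abstract can be illustrated with a small sketch. It is not the authors' toolkit, and the abstract does not specify their construction algorithm. The sketch below builds a suffix array by naive sorting and uses it, together with a longest-common-prefix (LCP) array, to find the longest repeated substring and to count distinct n-grams. All function names and the toy sequences are assumptions made for illustration.

```python
def suffix_array(text):
    """Naive suffix array: suffix start positions in lexicographic order."""
    return sorted(range(len(text)), key=lambda i: text[i:])


def lcp_array(text, sa):
    """lcp[k] = length of the common prefix of suffixes sa[k-1] and sa[k]."""
    lcp = [0] * len(sa)
    for k in range(1, len(sa)):
        a, b = sa[k - 1], sa[k]
        n = 0
        while a + n < len(text) and b + n < len(text) and text[a + n] == text[b + n]:
            n += 1
        lcp[k] = n
    return lcp


def longest_repeat(text):
    """Longest substring occurring at least twice (occurrences may overlap)."""
    sa = suffix_array(text)
    if not sa:
        return ""
    lcp = lcp_array(text, sa)
    k = max(range(len(sa)), key=lambda i: lcp[i])
    return text[sa[k]:sa[k] + lcp[k]]


def distinct_ngrams(text, n):
    """Number of distinct n-grams in text.

    Suffixes sharing the same n-gram prefix are adjacent in the suffix
    array, so a new n-gram starts wherever the LCP with the previous
    suffix is shorter than n.
    """
    sa = suffix_array(text)
    lcp = lcp_array(text, sa)
    return sum(
        1 for k, start in enumerate(sa)
        if len(text) - start >= n and lcp[k] < n
    )
```

Dividing `distinct_ngrams(proteome, 4)` by 20**4 would give the sampled fraction of tetra-peptide space for a concatenated proteome string. The naive sort is only practical for short inputs; whole-genome work needs a dedicated construction algorithm.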

2.

Background  

It has been suggested previously that genome and proteome sequences show characteristics typical of natural-language texts, such as "signature-style" word usage indicative of authors or topics, and that algorithms originally developed for natural language processing may therefore be applied to genome sequences to draw biologically relevant conclusions. Following this approach of 'biological language modeling', statistical n-gram analysis has been applied to the comparative analysis of whole proteome sequences of 44 organisms. It has been shown that a few particular amino acid n-grams are abundant in one organism but occur very rarely in other organisms, thereby serving as genome signatures. At that time the proteomes of only 44 organisms were available, which limited the generalization of this hypothesis. Today nearly 1,000 genome sequences and corresponding translated sequences are available, making it feasible to test the existence of biological language models over the evolutionary tree.
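
The idea of an n-gram "genome signature" can be sketched as follows. This is a minimal illustration, not the method used in the study: the ratio threshold `min_ratio`, the pseudocount, and all function names are assumptions, and the abstract does not say how abundance and rarity were defined.

```python
from collections import Counter


def ngram_counts(proteins, n):
    """Count overlapping n-grams across a list of protein sequences."""
    counts = Counter()
    for seq in proteins:
        for i in range(len(seq) - n + 1):
            counts[seq[i:i + n]] += 1
    return counts


def relative_freq(counts):
    """Convert raw counts to relative frequencies."""
    total = sum(counts.values())
    return {g: c / total for g, c in counts.items()} if total else {}


def signature_ngrams(target, others, n, min_ratio=5.0, pseudo=1e-6):
    """n-grams at least min_ratio times more frequent in target than in
    any other proteome (hypothetical criterion, for illustration only)."""
    tf = relative_freq(ngram_counts(target, n))
    ofs = [relative_freq(ngram_counts(o, n)) for o in others]
    result = []
    for gram, freq in tf.items():
        highest = max((of.get(gram, 0.0) for of in ofs), default=0.0)
        if freq >= min_ratio * (highest + pseudo):
            result.append(gram)
    return sorted(result)
```

Counting per protein rather than over one concatenated string avoids creating n-grams that span protein boundaries.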

3.
Algorithms for exact string matching have substantial application in computational biology. Time-efficient data structures that support a variety of exact string matching queries, such as the suffix tree and the suffix array, have been applied to such problems. As sequence databases grow, more space-efficient approaches to exact matching are becoming increasingly important. One such data structure, the compressed suffix array (CSA), based on the Burrows-Wheeler transform, has been shown to require memory nearly equal to that of the original database while supporting common kinds of queries time-efficiently. However, building a CSA from a sequence in efficient space and time is challenging. In 2002, the first space-efficient CSA construction algorithm was presented. That implementation used (1 + 2 log2 |Σ|)(1 + ε) bits per character (where Σ is the alphabet and ε is a small fraction). The construction algorithm ran in as much as twice that space, in O(|Σ| n log(n)) time. We have created an implementation which can also achieve these asymptotic bounds, but for small alphabets, and only uses 1/2 (1 + |Σ|)(1 + ε) bits per character, a factor of 2 less space for nucleotide alphabets. We present time and space results for the CSA construction and querying of our implementation on publicly available genome data which demonstrate the practicality of this approach.
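
Two points in this abstract lend themselves to a short sketch. The first is the space arithmetic: for a nucleotide alphabet (|Σ| = 4) the two bounds give 5(1 + ε) and 2.5(1 + ε) bits per character, which is the stated factor of 2. The second is the Burrows-Wheeler transform itself. The code below is a plain, uncompressed illustration of BWT backward search, the counting technique that compressed indexes build on; it is not the CSA implementation described here, and all function names are assumptions.

```python
import math


def bits_per_char_2002(sigma, eps=0.0):
    """Space bound quoted for the 2002 algorithm: (1 + 2 log2 |Σ|)(1 + ε)."""
    return (1 + 2 * math.log2(sigma)) * (1 + eps)


def bits_per_char_new(sigma, eps=0.0):
    """Space bound quoted for the newer implementation: 1/2 (1 + |Σ|)(1 + ε)."""
    return 0.5 * (1 + sigma) * (1 + eps)


def bwt(text, sentinel="$"):
    """Return (BWT string, suffix array) of text + sentinel by naive sorting.

    Assumes the sentinel sorts before every character of text.
    """
    s = text + sentinel
    sa = sorted(range(len(s)), key=lambda i: s[i:])
    return "".join(s[i - 1] for i in sa), sa


def count_occurrences(bwt_str, pattern):
    """Backward search: count exact occurrences of pattern using only the BWT."""
    # first[c] = number of characters in the text strictly smaller than c
    first, total = {}, 0
    for c in sorted(set(bwt_str)):
        first[c] = total
        total += bwt_str.count(c)
    lo, hi = 0, len(bwt_str)  # half-open interval of suffix array rows
    for c in reversed(pattern):
        if c not in first:
            return 0
        lo = first[c] + bwt_str[:lo].count(c)
        hi = first[c] + bwt_str[:hi].count(c)
        if lo >= hi:
            return 0
    return hi - lo
```

The `bwt_str[:k].count(c)` rank queries are linear-time here for clarity; a practical index replaces them with sampled or succinct rank structures.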

4.
We describe a novel method for efficient reconstruction of phylogenetic trees, based on sequences of whole genomes or proteomes, whose lengths may vary greatly. The core of our method is a new measure of pairwise distances between sequences. This measure is based on computing the average lengths of maximum common substrings, which is intrinsically related to information theoretic tools (Kullback-Leibler relative entropy). We present an algorithm for efficiently computing these distances. In principle, the distance between two sequences of length l can be calculated in O(l) time. We implemented the algorithm using suffix arrays; our implementation is fast enough to enable the construction of the proteome phylogenomic tree for hundreds of species and the genome phylogenomic forest for almost two thousand viruses. An initial analysis of the results exhibits a remarkable agreement with "acceptable phylogenetic and taxonomic truth." To assess our approach, our results were compared to the traditional (single-gene or protein-based) maximum likelihood method. The obtained trees were compared to implementations of a number of alternative approaches, including two that were previously published in the literature, and to the published results of a third approach. Comparing their outcome and running time to ours, using "traditional" trees and a standard tree comparison method, our algorithm improved upon the "competition" by a substantial margin. The simplicity and speed of our method allow for a whole genome analysis with the greatest scope attempted so far. We describe here five different applications of the method, which not only show the validity of the method, but also suggest a number of novel phylogenetic insights.
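
The core quantity, the average length of the longest common substrings, can be sketched directly. The version below is a brute-force illustration, not the O(l) suffix-array algorithm the abstract refers to. The abstract does not give the normalization that turns this average into a distance, so the log-length scaling and the self-comparison correction in `acs_distance` are assumptions chosen so that identical sequences get distance 0. All names are hypothetical.

```python
import math


def average_common_substring(a, b):
    """Mean, over positions i of a, of the longest prefix of a[i:] found in b."""
    total = 0
    for i in range(len(a)):
        n = 0
        while i + n < len(a) and a[i:i + n + 1] in b:
            n += 1
        total += n
    return total / len(a)


def acs_distance(a, b):
    """Symmetric distance from average common substring lengths.

    Assumed normalization: log|y| / L(x, y) minus the same term for x
    against itself, averaged over both directions. Inputs must be non-empty.
    """
    def directed(x, y):
        l_xy = average_common_substring(x, y)
        if l_xy == 0:
            return math.inf  # no shared characters at all
        return math.log(len(y)) / l_xy - math.log(len(x)) / average_common_substring(x, x)

    return (directed(a, b) + directed(b, a)) / 2
```

Longer shared substrings raise the average and lower the distance, which is the intuition behind using it on sequences of very different lengths.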

5.
The explosion in genomic sequence available in public databases has resulted in an unprecedented opportunity for computational whole genome analyses. A number of promising comparative approaches have been developed for gene finding, regulatory element discovery and other purposes, and it is clear that these tools will play a fundamental role in analysing the enormous amount of new data that is currently being generated. The synthesis of computationally intensive comparative approaches with the requirement for whole genome analysis represents both an unprecedented challenge and an opportunity for computational scientists. We focus on a few of these challenges, using by way of example the problems of alignment, gene finding and regulatory element discovery, and discuss the issues that have arisen in attempts to solve these problems in the context of whole genome analysis pipelines.

6.
The genome sequencing project has generated and will continue to generate enormous amounts of sequence data. Since the first complete genome sequence of the bacterium Haemophilus influenzae was published in 1995, the complete genome sequences of 2 eukaryotic and about 22 prokaryotic organisms have been determined. Given these ever-increasing amounts of sequence information, new strategies are necessary to efficiently pursue the next phase of the genome project: the elucidation of gene expression patterns and gene product function on a whole genome scale. In order to assign functional information to the genome sequence, DNA chip technology was developed to efficiently identify the differential expression patterns of independent biological samples. The DNA chip provides a new tool for genome expression analysis that may revolutionize many aspects of human life, including new drug discovery and human disease diagnostics.

7.
Although Arabidopsis is well established as the premier model species in plant biology, rice (Oryza sativa) is moving up fast as the second-best model organism. In addition to the availability of large sets of genetic, molecular, and genomic resources, two features make rice attractive as a model species: it represents the taxonomically distinct monocots and is a crop species. Plant structural genomics was pioneered on a genome scale in Arabidopsis, and the lessons learned from these efforts were not lost on rice. Indeed, the sequencing and annotation of the rice genome have been greatly accelerated by method improvements made in Arabidopsis. For example, the value of full-length cDNA clones and deep expressed sequence tag resources, obtained in Arabidopsis primarily after release of the complete genome, has been recognized by the rice genomics community. For rice, >250,000 expressed sequence tags and 28,000 full-length cDNA sequences are available prior to the completion of the genome sequence. With respect to tools for Arabidopsis functional genomics, deep sequence-tagged lines, inexpensive spotted oligonucleotide arrays, and a near-complete whole genome Affymetrix array are publicly available. The development of similar functional genomics resources for rice is in progress and, for the most part, has been more streamlined thanks to lessons learned from Arabidopsis. Genomic resource development has been essential to set the stage for hypothesis-driven research, and Arabidopsis continues to provide paradigms for testing in rice to assess function across taxonomic divisions and in a crop species.

8.
9.
Alpha-satellite DNA of primates: old and new families   Total citations: 10 (self-citations: 0, citations by others: 10)
In this report we review alpha-satellite DNA (AS) sequence data to support the following proposed scenario of AS evolution. Centromeric regions of lower primate chromosomes have solely "old" AS based on type A monomeric units. Type A AS is efficiently homogenized throughout the whole genome and is nearly identical in all chromosomes. In the ancestors of great apes, a divergent variant of the type A monomer acquired the ability to bind CENP-B protein and expanded in the old arrays, mixing irregularly with type A. As a result, a new class of monomers, called type B, was formed. The "new" AS families were established by amplification of divergent segments of irregular A-B arrays and spread to many chromosomes before the human-chimpanzee-gorilla split. The new arrays contain regularly alternating monomers of types A and B. New AS is homogenized within an array with little or no homogenization between chromosomes. Most human chromosomes contain only one new array and one or a few old arrays. However, as a rule only new arrays are efficiently homogenized. Apparently, in evolution, after the establishment of the new arrays homogenization in the old arrays stopped. Notably, kinetochore structures marking functional centromeres are also usually formed on the new arrays. We propose that homogenization of AS may be limited to arrays participating in centromeric function.

10.
Babnigg G, Giometti CS. Proteomics, 2003, 3(5): 584-600
The analysis of proteomes, i.e., the proteins expressed by biological organisms under a given set of conditions at a given time, requires separating complex protein mixtures into discrete protein components, measuring their relative abundances, and identifying the individual protein components. Many types of data are generated during the course of proteome analysis, including graphic images of the protein profiles, flat files containing numeric data, spreadsheets for assimilating numeric data, and relational database tables for integrating data from multiple experiments. As part of a project to describe the proteomes of microbes of interest to the U.S. Department of Energy, a World-Wide Web-based interface has been developed for the display of protein profiles generated by two-dimensional gel electrophoresis. The web interface is capable of obtaining protein identifications on the fly, interrogating the quantitative data in the context of available genome sequence information, and relating the proteome data to existing metabolic pathway databases. Analysis of protein expression profiles is expedited, providing the capability to efficiently determine the gene locations for proteins modulated in abundance in response to different growth conditions and to locate the positions of the proteins within specific metabolic pathways. The proteome of the archaeon Methanococcus jannaschii, a microbe for which the complete genome sequence is available, is used to demonstrate the capabilities of this evolving web interface (http://proteomeweb.anl.gov).

11.
Despite the complete determination of the genome sequence of several higher eukaryotes, their proteomes remain relatively poorly defined. Information about proteins identified by different experimental and computational methods is stored in different databases, meaning that no single resource offers full coverage of known and predicted proteins. IPI (the International Protein Index) has been developed to address these issues and offers complete nonredundant data sets representing the human, mouse and rat proteomes, built from the Swiss-Prot, TrEMBL, Ensembl and RefSeq databases.

12.
Sequence capture methods for targeted next generation sequencing promise to massively reduce cost of genomics projects compared to untargeted sequencing. However, evaluated capture methods specifically dedicated to biologically relevant genomic regions are rare. Whole exome capture has been shown to be a powerful tool to discover the genetic origin of disease and provides a reduction in target size and thus calculative sequencing capacity of > 90-fold compared to untargeted whole genome sequencing. For further cost reduction, a valuable complementing approach is the analysis of smaller, relevant gene subsets but involving large cohorts of samples. However, effective adjustment of target sizes and sample numbers is hampered by the limited scalability of enrichment systems. We report a highly scalable and automated method to capture a 480 Kb exome subset of 115 cancer-related genes using microfluidic DNA arrays. The arrays are adaptable from 125 Kb to 1 Mb target size and/or one to eight samples without barcoding strategies, representing a further 26 – 270-fold reduction of calculative sequencing capacity compared to whole exome sequencing. Illumina GAII analysis of a HapMap genome enriched for this exome subset revealed a completeness of > 96%. Uniformity was such that > 68% of exons had at least half the median depth of coverage. An analysis of reference SNPs revealed a sensitivity of up to 93% and a specificity of 98.2% or higher.

13.
The genome sequence of Bacillus subtilis was published in 1997 and since then many other bacterial genomes have been sequenced, among them Bacillus licheniformis in 2004. B. subtilis and B. licheniformis are closely related and feature similar saprophytic lifestyles in the soil. Both species can secrete numerous proteins into the surrounding medium enabling them to use high-molecular-weight substances, which are abundant in soils, as nutrient sources. The availability of complete genome sequences allows for the prediction of the proteins containing signals for secretion into the extracellular milieu and also of the proteins which form the secretion machinery needed for protein translocation through the cytoplasmic membrane. To confirm the predicted subcellular localization of proteins, proteomics is the best choice. The extracellular proteomes of B. subtilis and B. licheniformis have been analyzed under different growth conditions allowing comparisons of the extracellular proteomes and conclusions regarding similarities and differences of the protein secretion mechanisms between the two species.

14.
15.
The sequencing of the genomes of a variety of species and the growing databases containing expressed sequence tags (ESTs) and complementary DNAs (cDNAs) facilitate the design of highly specific oligomers for use as genomic markers, PCR primers, or DNA oligo microarrays. The first step in evaluating the specificity of short oligomers of about 20 units in length is to determine the frequencies at which the oligomers occur. However, for oligomers longer than about 50 units this is not informative, as they usually have a frequency of only 1. A more suitable procedure is to consider the mismatch tolerance of an oligomer, that is, the minimum number of mismatches that allows a given oligomer to match a substring other than the target sequence anywhere in the genome or the EST database. However, calculating the exact value of mismatch tolerance is computationally costly and impractical. Therefore, we studied the problem of checking whether an oligomer meets the constraint that its mismatch tolerance is no less than a given threshold. Here, we present an efficient dynamic programming algorithm that utilizes suffix and height arrays. We demonstrated the effectiveness of this algorithm by efficiently computing a dense list of numerous oligo-markers applicable to the human genome. Experimental results show that the algorithm runs faster than the well-known Abrahamson's algorithm by orders of magnitude and is able to enumerate 65% to 76% of qualified oligomers.
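
The definition of mismatch tolerance given above can be made concrete with a brute-force sketch. It compares the oligomer against every other window of the sequence by Hamming distance. This is deliberately not the dynamic programming algorithm over suffix and height arrays that the abstract describes, which is not specified here; function names and the toy sequence are assumptions.

```python
def hamming_capped(a, b, cap):
    """Hamming distance of equal-length strings, stopping once it reaches cap."""
    d = 0
    for x, y in zip(a, b):
        if x != y:
            d += 1
            if d >= cap:
                break
    return d


def mismatch_tolerance(genome, start, length):
    """Minimum number of mismatches needed for genome[start:start+length]
    to match some other window of the genome (returns length if there is
    no other window)."""
    oligo = genome[start:start + length]
    best = length  # upper bound: every position mismatches
    for pos in range(len(genome) - length + 1):
        if pos == start:
            continue
        best = min(best, hamming_capped(oligo, genome[pos:pos + length], best))
    return best


def meets_threshold(genome, start, length, threshold):
    """True if the mismatch tolerance is no less than threshold.

    Exits at the first window that matches with fewer mismatches, which is
    the cheaper decision problem the abstract contrasts with computing the
    exact tolerance.
    """
    oligo = genome[start:start + length]
    for pos in range(len(genome) - length + 1):
        if pos != start and hamming_capped(oligo, genome[pos:pos + length], threshold) < threshold:
            return False
    return True
```

The scan costs time proportional to genome length times oligomer length per candidate, which is why an index-based method is needed at genome scale.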

16.
Sequence scanning chicken cosmids: a methodology for genome screening   Total citations: 2 (self-citations: 0, citations by others: 2)
The chicken genome is relatively poorly studied at the molecular level. The karyotype 2n=78 is divided into three main chromosomal sub-groups: the macrochromosomes (six pairs), the intermediate microchromosomes (four pairs) and the microchromosomes (29 pairs). Whilst the microchromosome group comprises only 25% of the DNA, increasing evidence indicates that this is disproportionate to their gene content. This paper demonstrates the utility of cosmid sequence scanning as a potential method for analysing the chicken genome, providing an economical method for the production of a molecular map. The GC content, gene density and repeat distribution are analysed relative to chromosomal origin. Results indicate that gene density is higher on the microchromosomes. During the scanning process an example of conserved linkage between chicken and human (12q34.2) has been demonstrated.

17.
Amplified ribosomal spacer sequence: structure and evolutionary origin   Total citations: 2 (self-citations: 0, citations by others: 2)
A novel class of repeated sequences consisting of tandem arrays of ribosomal spacer sequence has been discovered in a mouse genome. Comparison to normal ribosomal DNA reveals that one repeat unit consists of two separate parts of spacer sequence. This amplified spacer sequence has a pseudogene-like structure but is distinct from the previously reported pseudogenes and orphons in regions lacking coding sequences. So far the amplified spacer sequence has been found only in the BALB/c mouse genome but not in ten other laboratory strains and several wild-type mouse stocks. Surprisingly, a part of the amplified spacer sequence unit had a higher homology to the corresponding part of the ribosomal DNA sequence of Mus musculus molossinus, a Japanese wild-type mouse, than to the corresponding part of the rDNA of the BALB/c mouse. These findings suggest that the amplified spacer sequence of the BALB/c mouse might have partly originated in M. m. molossinus or in a related subspecies.

18.
The accelerating growth of public microbial genomic data imposes a substantial burden on the research community that uses such resources. Building databases of non-redundant reference sequences from massive microbial genomic data based on clustering analysis is essential. However, existing clustering algorithms perform poorly on long genomic sequences. In this article, we present Gclust, a parallel program for clustering complete or draft genomic sequences, where clustering is accelerated with a novel parallelization strategy and a fast sequence comparison algorithm using sparse suffix arrays (SSAs). Moreover, genome identity measures between two sequences are calculated based on their maximal exact matches (MEMs). In this paper, we demonstrate the high speed and clustering quality of Gclust by examining four genome sequence datasets. Gclust is freely available for non-commercial use at https://github.com/niu-lab/gclust. We also introduce a web server for clustering user-uploaded genomes at http://niulab.scgrid.cn/gclust.
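
Maximal exact matches (MEMs) can be illustrated with a brute-force sketch. It does not reproduce Gclust's sparse-suffix-array search, and `covered_fraction` is only a hypothetical stand-in for an identity measure, since the abstract does not state the formula Gclust uses. All names are assumptions.

```python
def maximal_exact_matches(a, b, min_len=1):
    """Return (i, j, length) triples with a[i:i+length] == b[j:j+length]
    that cannot be extended to the left or to the right."""
    mems = []
    for i in range(len(a)):
        for j in range(len(b)):
            if a[i] != b[j]:
                continue
            # Not left-maximal: the same match is reported from (i-1, j-1).
            if i > 0 and j > 0 and a[i - 1] == b[j - 1]:
                continue
            n = 0
            while i + n < len(a) and j + n < len(b) and a[i + n] == b[j + n]:
                n += 1  # extend to the right as far as possible
            if n >= min_len:
                mems.append((i, j, n))
    return mems


def covered_fraction(a, mems):
    """Fraction of positions in a covered by at least one MEM
    (illustrative measure only, not Gclust's identity definition)."""
    covered = [False] * len(a)
    for i, _, n in mems:
        for k in range(i, i + n):
            covered[k] = True
    return sum(covered) / len(a) if a else 0.0
```

The quadratic scan is fine for toy strings; for genomes, an index such as a suffix array is what makes MEM finding tractable.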

19.
The availability of entire genome sequences is expected to revolutionize the way in which biology and medicine are conducted for years to come. However, achieving this promise still requires significant effort in the areas of gene annotation, cloning and expression of thousands of known and heretofore unknown protein-encoding genes. Traditional technologies of manipulating genes are too cumbersome and inefficient when one is dealing with more than a few genes at a time. Entire libraries composed of all protein-encoding open reading frames (ORFs) cloned in highly flexible vectors will be needed to take full advantage of the information found in any genome sequence. The creation of such ORFeome resources using novel technologies for cloning and expressing entire proteomes constitutes an effective gateway from whole genome sequencing efforts to downstream 'omics' applications.

20.
MOTIVATION: Since the newly developed Grid platform has been considered a powerful tool for sharing resources in the Internet environment, it is of interest to demonstrate an efficient methodology for processing massive biological data in Grid environments at low cost. This paper presents an efficient and economical method based on a Grid platform to predict the secondary structures of all proteins in a given organism, which normally requires a long computation time through sequential execution, by processing a large amount of protein sequence data simultaneously. From the prediction results, a genome-scale protein fold space can be pursued. RESULTS: Using the improved Grid platform, the secondary structure prediction on a genomic scale and the protein topology derived from the new scoring scheme for four different model proteomes were presented. This protein fold space was compared with structures from the Protein Data Bank database, and it showed a similarly aligned distribution. Therefore, the fold space approach based on this new scoring scheme could be a guideline for predicting a folding family in a given organism.
