
1.

Efficient large-scale protein sequence comparison and gene matching to identify orthologs and co-orthologs

Mahmood K Webb GI Song J Whisstock JC Konagurthu AS 《Nucleic acids research》2012,40(6):e44

Broadly, computational ortholog assignment is a three-step process: (i) identify all putative homologs between the genomes, (ii) identify gene anchors and (iii) link anchors to identify best gene matches given their order and context. In this article, we engineer two methods to improve two important aspects of this pipeline [specifically steps (ii) and (iii)]. First, computing sequence similarity data [step (i)] is a computationally intensive task for large sequence sets, creating a bottleneck in the ortholog assignment pipeline. We have designed a fast and highly scalable sort-join method (afree) based on k-mer counts to rapidly compare all pairs of sequences in a large protein sequence set to identify putative homologs. Second, the availability of complex genomes containing large gene families with a prevalence of complex evolutionary events, such as duplications, has made the task of assigning orthologs and co-orthologs difficult. Here, we have developed an iterative graph matching strategy where at each iteration the best gene assignments are identified, resulting in a set of orthologs and co-orthologs. We find that the afree algorithm is faster than existing methods and maintains high accuracy in identifying similar genes. The iterative graph matching strategy also showed high accuracy in identifying complex gene relationships. Standalone afree is available from http://vbc.med.monash.edu.au/~kmahmood/afree. EGM2, the complete ortholog assignment pipeline (including afree and the iterative graph matching method), is available from http://vbc.med.monash.edu.au/~kmahmood/EGM2.
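
As a rough illustration of the sort-join idea mentioned in this abstract, the sketch below counts shared k-mers for every pair of sequences by sorting (k-mer, sequence id) records and joining runs of identical k-mers. It is not the afree implementation: the function names, the default k and the use of distinct k-mers per sequence are assumptions made for the example, and the abstract does not specify how afree scores pairs.

```python
from collections import defaultdict
from itertools import combinations


def kmers(seq, k):
    """Return the set of distinct k-mers in seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def shared_kmer_counts(seqs, k=3):
    """Count shared distinct k-mers for every pair of sequences.

    seqs: dict mapping a sequence id to a protein string (hypothetical input format).
    Returns a dict mapping (id_a, id_b), with id_a < id_b, to the shared k-mer count.
    """
    # Emit one (k-mer, id) record per distinct k-mer, then sort so that
    # identical k-mers from different sequences become adjacent.
    records = sorted((km, sid) for sid, s in seqs.items() for km in kmers(s, k))
    counts = defaultdict(int)
    i = 0
    while i < len(records):
        # Find the run of records that share the same k-mer.
        j = i
        while j < len(records) and records[j][0] == records[i][0]:
            j += 1
        ids = [sid for _, sid in records[i:j]]
        # Join step: every pair of ids in the run shares this k-mer.
        for a, b in combinations(ids, 2):
            counts[(a, b)] += 1
        i = j
    return dict(counts)
```

Pairs whose count exceeds some threshold could then be passed on as putative homologs; the threshold is left open here because the abstract does not give one.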

2.

Efficient similarity search in protein structure databases by k-clique hashing (Cited by: 1; self-citations: 0; citations by others: 1)

Weskamp N Kuhn D Hüllermeier E Klebe G 《Bioinformatics (Oxford, England)》2004,20(10):1522-1526

MOTIVATION: Graph-based clique-detection techniques are widely used for the recognition of common substructures in proteins. They permit the detection of resemblances that are independent of sequence or fold homologies and are also able to handle conformational flexibility. Their high computational complexity is often a limiting factor and prevents a detailed and fine-grained modeling of the protein structure. RESULTS: We present an efficient two-step method that significantly speeds up the detection of common substructures, especially when used to screen larger databases. It combines the advantages from both clique-detection and geometric hashing. The method is applied to an established approach for the comparison of protein binding-pockets, and some empirical results are presented. AVAILABILITY: Upon request from the authors.

3.

Protein sequence classification using feature hashing

Caragea C Silvescu A Mitra P 《Proteome science》2012,10(Z1):S14

Recent advances in next-generation sequencing technologies have resulted in an exponential increase in the rate at which protein sequence data are being acquired. The k-gram feature representation, commonly used for protein sequence classification, usually results in prohibitively high dimensional input spaces, for large values of k. Applying data mining algorithms to these input spaces may be intractable due to the large number of dimensions. Hence, using dimensionality reduction techniques can be crucial for the performance and the complexity of the learning algorithms. In this paper, we study the applicability of feature hashing to protein sequence classification, where the original high-dimensional space is "reduced" by hashing the features into a low-dimensional space, using a hash function, i.e., by mapping features into hash keys, where multiple features can be mapped (at random) to the same hash key, and "aggregating" their counts. We compare feature hashing with the "bag of k-grams" approach. Our results show that feature hashing is an effective approach to reducing dimensionality on protein sequence classification tasks.
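
The feature hashing described in this abstract can be sketched as follows: each k-gram is hashed to one of a fixed number of buckets and the counts of all k-grams that land in the same bucket are summed. The function name, the bucket count and the choice of CRC32 as the hash function are assumptions for illustration only; the abstract does not say which hash function the authors used.

```python
import zlib


def hashed_kgram_counts(seq, k=3, n_buckets=16):
    """Map a sequence to a fixed-length vector of hashed k-gram counts.

    Different k-grams may collide in the same bucket; their counts are
    simply added together, which is the "aggregating" step.
    """
    vec = [0] * n_buckets
    for i in range(len(seq) - k + 1):
        kgram = seq[i:i + k]
        # CRC32 is used here only because it is deterministic across runs.
        bucket = zlib.crc32(kgram.encode("ascii")) % n_buckets
        vec[bucket] += 1
    return vec
```

The vector length depends only on `n_buckets`, not on k or the alphabet size, so the same classifier input dimension can be kept while k grows.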

4.

Efficient computation of spaced seed hashing with block indexing

Girotto Samuele Comin Matteo Pizzi Cinzia 《BMC bioinformatics》2018,19(15):441-38

Background

Spaced seeds, i.e. patterns in which some fixed positions are allowed to be wild-cards, play a crucial role in several bioinformatics applications involving substring counting and indexing, often providing better sensitivity than k-mer-based approaches. K-mer-based approaches are usually fast, being based on efficient hashing and indexing that exploits the large overlap between consecutive k-mers. Spaced-seed hashing is not as straightforward, and it is usually computed from scratch for each position in the input sequence. Recently, the FSH (Fast Spaced seed Hashing) approach was proposed to improve the time required for computation of the spaced seed hashing of DNA sequences, with a speed-up of about 1.5 with respect to standard hashing computation.

Results

In this work we propose a novel algorithm, Fast Indexing for Spaced seed Hashing (FISH), based on the indexing of small blocks that can be combined to obtain the hashing of spaced seeds of any length. The method exploits the fast computation of the hashing of runs of consecutive 1s in the spaced seed, each of which corresponds to a k-mer whose length equals that of the run.

Conclusions

We ran several experiments, on NGS data from simulated and synthetic metagenomic experiments, to assess the time required for the computation of the hashing for each position in each read with respect to several spaced seeds. In our experiments, FISH can compute the hashing values of spaced seeds with a speedup, with respect to the traditional approach, of between 1.9x and 6.03x, depending on the structure of the spaced seeds.
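
For orientation, the "from scratch" baseline that this abstract compares against can be sketched as below: at every read position, the bases under the 1 positions of the seed are packed into an integer, two bits per base. This is the naive computation, not FSH or FISH; the 2-bit encoding, the seed written as a string of 0s and 1s, and the function name are assumptions for the example.

```python
# Assumed 2-bit encoding of DNA bases.
ENC = {"A": 0, "C": 1, "G": 2, "T": 3}


def spaced_seed_hashes(seq, seed):
    """Hash every window of seq under a spaced seed, recomputing each window.

    seed is a string such as "1101": '1' marks a position that contributes
    to the hash, '0' marks a wild-card position that is skipped.
    """
    care = [j for j, c in enumerate(seed) if c == "1"]
    hashes = []
    for i in range(len(seq) - len(seed) + 1):
        h = 0
        for j in care:
            # Append the 2-bit code of the base under this '1' position.
            h = (h << 2) | ENC[seq[i + j]]
        hashes.append(h)
    return hashes
```

Each window costs one step per 1 in the seed; the block-indexing idea in the abstract aims to avoid repeating that work for runs of consecutive 1s.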


5.

Efficient combination of multiple word models for improved sequence comparison

Huang X Ye L Chou HH Yang IH Chao KM 《Bioinformatics (Oxford, England)》2004,20(16):2529-2533

MOTIVATION: Studies of efficient and sensitive sequence comparison methods are driven by a need to find homologous regions of weak similarity between large genomes. RESULTS: We describe an improved method for finding similar regions between two sets of DNA sequences. The new method generalizes existing methods by locating word matches between sequences under two or more word models and extending word matches into high-scoring segment pairs (HSPs). The method is implemented as a computer program named DDS2. Experimental results show that DDS2 can find more HSPs by using several word models than by using one word model. AVAILABILITY: The DDS2 program is freely available for academic use in binary code form at http://bioinformatics.iastate.edu/aat/align/align.html and in source code form from the corresponding author.

6.

Efficient large-scale purification of restriction fragments by solute-displacement ion-exchange HPLC.


J H Waterborg A J Robertson 《Nucleic acids research》1993,21(12):2913-2915

Extreme overloading of HPLC columns with sample can create a condition of binding site saturation causing competition and displacement among solutes during column elution. This has been termed solute-displacement chromatography (SD-HPLC). We present an example of this phenomenon for the preparative fractionation and purification of restriction fragments of almost identical size (1337 and 1388 bp) which cannot be resolved by agarose gel electrophoresis. Standard analytical ion-exchange HPLC chromatography failed to separate these fragments from each other and from an unexpectedly early eluting pUC-derived vector fragment of 2.7 kbp. We demonstrate that by intentional overloading of the small (4.6 x 35 mm) non-porous TSK-DEAE HPLC column, hundreds of micrograms of DNA restriction fragments could be resolved and purified in a single HPLC run of less than 30 minutes.

7.

Determining a unique defining DNA sequence for yeast species using hashing techniques

Wesselink JJ De La Iglesia B James SA Dicks JL Roberts IN Rayward-Smith VJ 《Bioinformatics (Oxford, England)》2002,18(7):1004-1010

MOTIVATION: Yeasts are often still identified with physiological growth tests, which are both time consuming and unsuitable for detection of a mixture of organisms. Hence, there is a need for molecular methods to identify yeast species. RESULTS: A hashing technique has been developed to search for unique DNA sequences in 702 26S rRNA genes. A unique DNA sequence has been found for almost every yeast species described to date. The locations of the unique defining sequences are in accordance with the variability map of large subunit ribosomal RNA and provide detail of the evolution of the D1/D2 region. This approach will be applicable to the rapid identification of unique sequences in other DNA sequence sets. AVAILABILITY: Freely available upon request from the authors. Supplementary information: Results are available at http://www.sys.uea.ac.uk/~jjw/project/paper
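
One simple way to use hashing for this kind of search, given here only as a hedged sketch and not as the authors' technique, is to store every fixed-length substring in a hash table keyed by the substring and keep those that occur in exactly one input sequence. The function name, the fixed length k and the input format are all assumptions for the example.

```python
from collections import defaultdict


def unique_kmers(seqs, k):
    """Find k-mers that occur in exactly one of the named sequences.

    seqs: dict mapping a name (e.g. a species label) to a DNA string.
    Returns a dict mapping each name to the set of k-mers found only in it.
    """
    # Hash table from k-mer to the set of sequences that contain it.
    owners = defaultdict(set)
    for name, s in seqs.items():
        for i in range(len(s) - k + 1):
            owners[s[i:i + k]].add(name)
    unique = {name: set() for name in seqs}
    for km, names in owners.items():
        if len(names) == 1:
            unique[next(iter(names))].add(km)
    return unique
```

A sequence whose set comes back empty has no defining substring of that length, so a longer k would have to be tried.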

8.

Efficient sequence alignment algorithms (Cited by: 3; self-citations: 0; citations by others: 3)

M S Waterman 《Journal of theoretical biology》1984,108(3):333-337

Sequence alignments are becoming more important with the increase of nucleic acid data. Fitch and Smith have recently given an example where multiple insertion/deletions (rather than a series of adjacent single insertion/deletions) are necessary to achieve the correct alignment. Multiple insertion/deletions are known to increase computation time from O(n^2) to O(n^3) although Gotoh has presented an O(n^2) algorithm in the case the multiple insertion/deletion weighting function is linear. It is argued in this paper that it could be desirable to use concave weighting functions. For that case, an algorithm is derived that is conjectured to be O(n^2).
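
To make the linear-weight case concrete, here is a minimal sketch of a Gotoh-style global alignment score in which a gap of length L costs gap_open + gap_extend * L. Three tables track whether an alignment ends in an aligned pair or in a gap in either sequence, which keeps the work at O(n^2) table cells. The scoring values and the function name are assumptions for the example; the concave-weight algorithm proposed in the paper is not shown.

```python
def gotoh_score(a, b, match=1, mismatch=-1, gap_open=2, gap_extend=1):
    """Best global alignment score with a linear (affine) gap weight."""
    neg = float("-inf")
    n, m = len(a), len(b)
    # M: ends with a[i-1] aligned to b[j-1]
    # X: ends with a[i-1] aligned to a gap
    # Y: ends with b[j-1] aligned to a gap
    M = [[neg] * (m + 1) for _ in range(n + 1)]
    X = [[neg] * (m + 1) for _ in range(n + 1)]
    Y = [[neg] * (m + 1) for _ in range(n + 1)]
    M[0][0] = 0
    for i in range(1, n + 1):
        X[i][0] = -(gap_open + gap_extend * i)
    for j in range(1, m + 1):
        Y[0][j] = -(gap_open + gap_extend * j)
    start = gap_open + gap_extend  # cost of the first position of a new gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            M[i][j] = max(M[i - 1][j - 1], X[i - 1][j - 1], Y[i - 1][j - 1]) + s
            # Either open a new gap or extend the one already open.
            X[i][j] = max(M[i - 1][j] - start, Y[i - 1][j] - start,
                          X[i - 1][j] - gap_extend)
            Y[i][j] = max(M[i][j - 1] - start, X[i][j - 1] - start,
                          Y[i][j - 1] - gap_extend)
    return max(M[n][m], X[n][m], Y[n][m])
```

With these assumed weights, one gap of length 2 costs 4 while two separate gaps of length 1 cost 6, which is the kind of preference for multiple insertion/deletions that the abstract discusses.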

9.

Gene identification through large-scale EST sequence processing

Lindlöf A 《Applied bioinformatics》2003,2(3):123-129


10.

Limits of homology detection by pairwise sequence comparison

Spang R Vingron M 《Bioinformatics (Oxford, England)》2001,17(4):338-342

MOTIVATION: Noise in database searches resulting from random sequence similarities increases as the databases expand rapidly. The noise problems are not a technical shortcoming of the database search programs, but a logical consequence of the idea of homology searches. The effect can be observed in simulation experiments. RESULTS: We have investigated noise levels in pairwise-alignment-based database searches. The noise levels of 38 releases of the SwissProt database display perfect logarithmic growth with the total length of the databases. Clustering of real biological sequences reduces noise levels, but the effect is marginal.

11.

Cloned sequence repertoires for small- and large-scale biology

Hilson P 《Trends in plant science》2006,11(3):133-141

How to assign function to the tens of thousands of genes discovered in the chromosomes of a few model species? How to complement the classical genetic approaches that are not always ideally suited to decode complex mechanisms? The solutions to these pressing questions are not simple and rely on the development of novel resources and technologies. Here I critically review what clone collections are available and how they can be exploited for the systematic analysis of gene functions in plants.

12.

Synergy between sequence and size in large-scale genomics

Gregory TR 《Nature reviews. Genetics》2005,6(9):699-708

Until recently the study of individual DNA sequences and of total DNA content (the C-value) sat at opposite ends of the spectrum in genome biology. For gene sequencers, the vast stretches of non-coding DNA found in eukaryotic genomes were largely considered to be an annoyance, whereas genome-size researchers attributed little relevance to specific nucleotide sequences. However, the dawn of comprehensive genome sequencing has allowed a new synergy between these fields, with sequence data providing novel insights into genome-size evolution, and with genome-size data being of both practical and theoretical significance for large-scale sequence analysis. In combination, these formerly disconnected disciplines are poised to deliver a greatly improved understanding of genome structure and evolution.

13.

Sequence comparison by sequence harmony identifies subtype-specific functional sites (Cited by: 1; self-citations: 0; citations by others: 1)

Pirovano W Feenstra KA Heringa J 《Nucleic acids research》2006,34(22):6540-6548


14.

Patterns in spontaneous mutation revealed by human-baboon sequence comparison (Cited by: 7; self-citations: 0; citations by others: 7)

Silva JC Kondrashov AS 《Trends in genetics : TIG》2002,18(11):544-547

We have analyzed the alignment of a long homologous region of the human and baboon genomes (approximately 1.5 Mb). We show that the frequency of gaps between aligned segments decreases slowly with gap length, indicating that several successive nucleotides are often deleted or inserted in one event. By contrast, runs of consecutive mismatches decrease rapidly in frequency with increasing length, following an exponential distribution, indicating that nucleotides are mostly substituted one at a time. Nucleotide substitutions are clumped at the scales of <10 and 1000-10,000 nucleotides, but show almost no aggregation at the scales of <10-100 and over approximately 50,000 nucleotides. Apparently, two rather different factors make the substitution rate not exactly uniform along the DNA sequence. Comparison of regions of very similar genomes that are approximately selectively neutral makes it possible to study spontaneous mutation at a new level of resolution.

15.

Finding weak similarities between proteins by sequence profile comparison (Cited by: 3; self-citations: 1; citations by others: 3)


Panchenko AR 《Nucleic acids research》2003,31(2):683-689

To improve the recognition of weak similarities between proteins, a method of aligning two sequence profiles is proposed. It is shown that exploring the sequence space in the vicinity of the sequence with unknown properties significantly improves the performance of sequence alignment methods. Consistent with previous observations, the recognition sensitivity and alignment accuracy obtained by a profile–profile alignment method can be as much as 30% higher compared to the sequence–profile alignment method. It is demonstrated that the choice of score function and the diversity of the test profile are very important factors for achieving the maximum performance of the method, whereas the optimum range of these parameters depends on the level of similarity to be recognized.

16.

A comparison of two large-scale seed priming techniques

D. GRAY H. R. ROWSE R. L. K. DREW 《The Annals of applied biology》1990,116(3):611-616

Leek seed lots of high (91%) and low (82%) viability were primed in aerated polyethylene glycol (PEG) solutions in Bubble-columns and by a non-osmotic priming technique in both 1987 and 1988 and the seeds were then sown in the field. Both methods of priming established a similar seed moisture content during treatment and, in the laboratory, produced seeds with more rapid and uniform germination than untreated seeds, with a greater advantage for Drum compared with Bubble-column priming in PEG. In the field, both priming techniques gave seedling emergence responses similar to those from priming in PEG by the laboratory-scale technique on filter paper. Both large-scale priming methods gave earlier and more uniform emergence than untreated seeds and gave similar or slightly higher levels of seedling emergence, except on one sowing occasion when seeds were stored before sowing followed by sowing into a drying seedbed.

17.

Blastocystis hominis: phylogenetic affinities determined by rRNA sequence comparison (Cited by: 5; self-citations: 0; citations by others: 5)

A M Johnson A Thanou P F Boreham P R Baverstock 《Experimental parasitology》1989,68(3):283-288

In 1912 Blastocystis hominis was identified as a new species and classified as a yeast (Brumpt 1912). In the early 1920s several groups confirmed its classification as a yeast, specifically a member of the genus Schizosaccharomyces (discussed by Zierdt et al. 1967). Apart from an occasional case report, the classification of B. hominis and its role as a harmless intestinal yeast was not questioned for another 50 years. Then, Zierdt (1967) suggested that it should be classified in the phylum Protozoa, subphylum Sporozoa, and that it should be considered as a potential pathogen. The likely role of B. hominis as a human pathogen has recently become more firmly established (Garcia et al. 1984; Sheehan et al. 1986) and its classification has been changed. Although the classification of B. hominis as a protozoon was assumed widely, classification as a sporozoon was not accepted, and the most recent definitive classification of the Protozoa did not even list B. hominis (Lee et al. 1985). Then, based essentially on a review of the known characteristics of the organism, it was recently reclassified into the subphylum Sarcodina (Zierdt 1988). Clearly, the phylogeny of this emerging human pathogen needs definitive analysis (Mehlhorn 1988).

18.

Phylogenetic relationships of Cryptosporidium determined by ribosomal RNA sequence comparison (Cited by: 4; self-citations: 0; citations by others: 4)

A M Johnson R Fielke R Lumb P R Baverstock 《International journal for parasitology》1990,20(2):141-147


19.

A multiple sequence comparison method

A. K. C. Wong S. C. Chan D. K. Y. Chiu 《Bulletin of mathematical biology》1993,55(2):465-486

This article presents a new method for the comparison of multiple macromolecular sequences. It is based on a hierarchical sequence synthesis procedure that does not require any a priori knowledge of the molecular structure of the sequences or the phylogenetic relations among the sequences. It differs from the existing methods as it has the capability of: (i) generating a statistical-structural model of the sequences through a synthesis process that detects homologous groups of the sequences, and (ii) aligning the sequences while the taxonomic tree of the sequences is being constructed in one single phase. It produces superior results when compared with some existing methods.

20.

Expression profiling of salinity-alkali stress responses by large-scale expressed sequence tag analysis in Tamarix hispid

Gao C Wang Y Liu G Yang C Jiang J Li H 《Plant molecular biology》2008,66(3):245-258
