Found 20 similar articles (search time: 15 ms)
1.
Broadly, computational ortholog assignment is a three-step process: (i) identify all putative homologs between the genomes, (ii) identify gene anchors and (iii) link anchors to identify best gene matches given their order and context. In this article, we engineer two methods to improve two important aspects of this pipeline [specifically steps (ii) and (iii)]. First, computing sequence similarity data [step (i)] is a computationally intensive task for large sequence sets, creating a bottleneck in the ortholog assignment pipeline. We have designed a fast and highly scalable sort-join method (afree) based on k-mer counts to rapidly compare all pairs of sequences in a large protein sequence set to identify putative homologs. Second, the availability of complex genomes containing large gene families, with a prevalence of complex evolutionary events such as duplications, has made the task of assigning orthologs and co-orthologs difficult. Here, we have developed an iterative graph matching strategy in which the best gene assignments are identified at each iteration, resulting in a set of orthologs and co-orthologs. We find that the afree algorithm is faster than existing methods and maintains high accuracy in identifying similar genes. The iterative graph matching strategy also showed high accuracy in identifying complex gene relationships. Standalone afree is available from http://vbc.med.monash.edu.au/~kmahmood/afree. EGM2, the complete ortholog assignment pipeline (including afree and the iterative graph matching method), is available from http://vbc.med.monash.edu.au/~kmahmood/EGM2.
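The sort-join idea mentioned in this abstract can be illustrated with a minimal sketch. This is not the afree implementation: the function name `shared_kmer_counts`, the choice of k, and the use of distinct k-mers per sequence are all assumptions made for illustration. The sketch emits one (k-mer, sequence id) record per distinct k-mer, sorts the records so that equal k-mers become adjacent, and then joins adjacent records to count how many k-mers each pair of sequences shares.

```python
from collections import defaultdict
from itertools import combinations, groupby


def shared_kmer_counts(seqs, k=3):
    """Count distinct k-mers shared by each pair of sequences (sort-join sketch)."""
    # One (kmer, seq_id) record per distinct k-mer in each sequence.
    records = sorted(
        (kmer, sid)
        for sid, seq in seqs.items()
        for kmer in {seq[i:i + k] for i in range(len(seq) - k + 1)}
    )
    counts = defaultdict(int)
    # After sorting, records with the same k-mer are adjacent; join them.
    for _, group in groupby(records, key=lambda r: r[0]):
        ids = [sid for _, sid in group]  # already sorted by seq_id
        for a, b in combinations(ids, 2):
            counts[(a, b)] += 1
    return dict(counts)


# Hypothetical toy input: p1 and p2 share the 3-mers KVL and VLA.
example = {"p1": "MKVLA", "p2": "KVLAT", "p3": "GGGGG"}
pair_counts = shared_kmer_counts(example, k=3)
```

Pairs with a high shared count would then be candidates for a more careful comparison; how afree scores or thresholds such counts is not stated in the abstract.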
2.
MOTIVATION: Graph-based clique-detection techniques are widely used for the recognition of common substructures in proteins. They permit the detection of resemblances that are independent of sequence or fold homologies and are also able to handle conformational flexibility. Their high computational complexity is often a limiting factor and prevents a detailed and fine-grained modeling of the protein structure. RESULTS: We present an efficient two-step method that significantly speeds up the detection of common substructures, especially when used to screen larger databases. It combines the advantages from both clique-detection and geometric hashing. The method is applied to an established approach for the comparison of protein binding-pockets, and some empirical results are presented. AVAILABILITY: Upon request from the authors.
3.
Recent advances in next-generation sequencing technologies have resulted in an exponential increase in the rate at which protein sequence data are being acquired. The k-gram feature representation, commonly used for protein sequence classification, usually results in prohibitively high dimensional input spaces, for large values of k. Applying data mining algorithms to these input spaces may be intractable due to the large number of dimensions. Hence, using dimensionality reduction techniques can be crucial for the performance and the complexity of the learning algorithms. In this paper, we study the applicability of feature hashing to protein sequence classification, where the original high-dimensional space is "reduced" by hashing the features into a low-dimensional space, using a hash function, i.e., by mapping features into hash keys, where multiple features can be mapped (at random) to the same hash key, and "aggregating" their counts. We compare feature hashing with the "bag of k-grams" approach. Our results show that feature hashing is an effective approach to reducing dimensionality on protein sequence classification tasks.
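As a rough illustration of the feature hashing described in this abstract, the sketch below maps every k-gram of a protein sequence to one of a fixed number of buckets and aggregates the counts of k-grams that collide. The function name `hashed_kgram_counts`, the bucket count, and the use of CRC32 as the hash function are assumptions for illustration; the abstract does not say which hash function the authors used.

```python
import zlib


def hashed_kgram_counts(seq, k=3, n_buckets=16):
    """Return a fixed-length count vector of hashed k-grams (feature hashing sketch)."""
    vec = [0] * n_buckets
    for i in range(len(seq) - k + 1):
        kgram = seq[i:i + k]
        # CRC32 is a stand-in hash; colliding k-grams share a bucket.
        bucket = zlib.crc32(kgram.encode("ascii")) % n_buckets
        vec[bucket] += 1
    return vec


# Hypothetical toy sequence with four overlapping 3-grams.
features = hashed_kgram_counts("MKVLAT", k=3, n_buckets=16)
```

The vector length depends only on `n_buckets`, not on k, which is what keeps the input space small for large k.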
4.
Background
Spaced seeds, i.e. patterns in which some fixed positions are allowed to be wild-cards, play a crucial role in several bioinformatics applications involving substring counting and indexing, often providing better sensitivity than k-mer-based approaches. K-mer-based approaches are usually fast, being based on efficient hashing and indexing that exploits the large overlap between consecutive k-mers. Spaced-seed hashing is not as straightforward, and it is usually computed from scratch for each position in the input sequence. Recently, the FSH (Fast Spaced seed Hashing) approach was proposed to improve the time required for computation of the spaced seed hashing of DNA sequences with a speed-up of about 1.5 with respect to standard hashing computation.
Results
In this work we propose a novel algorithm, Fast Indexing for Spaced seed Hashing (FISH), based on the indexing of small blocks that can be combined to obtain the hashing of spaced seeds of any length. The method exploits the fast computation of the hashing of runs of consecutive 1s in the spaced seeds, which essentially correspond to k-mers of the length of the run.
Conclusions
We run several experiments, on NGS data from simulated and synthetic metagenomic experiments, to assess the time required for the computation of the hashing for each position in each read with respect to several spaced seeds. In our experiments, FISH can compute the hashing values of spaced seeds with a speedup, with respect to the traditional approach, of between 1.9x and 6.03x, depending on the structure of the spaced seeds.
5.
MOTIVATION: Studies of efficient and sensitive sequence comparison methods are driven by a need to find homologous regions of weak similarity between large genomes. RESULTS: We describe an improved method for finding similar regions between two sets of DNA sequences. The new method generalizes existing methods by locating word matches between sequences under two or more word models and extending word matches into high-scoring segment pairs (HSPs). The method is implemented as a computer program named DDS2. Experimental results show that DDS2 can find more HSPs by using several word models than by using one word model. AVAILABILITY: The DDS2 program is freely available for academic use in binary code form at http://bioinformatics.iastate.edu/aat/align/align.html and in source code form from the corresponding author.
6.
Efficient large-scale purification of restriction fragments by solute-displacement ion-exchange HPLC.
Extreme overloading of HPLC columns with sample can create a condition of binding site saturation causing competition and displacement among solutes during column elution. This has been termed solute-displacement chromatography (SD-HPLC). We present an example of this phenomenon for the preparative fractionation and purification of restriction fragments of almost identical size (1337 and 1388 bp) which cannot be resolved by agarose gel electrophoresis. Standard analytical ion-exchange HPLC chromatography failed to separate these fragments from each other and from an unexpectedly early eluting pUC-derived vector fragment of 2.7 kbp. We demonstrate that by intentional overloading of the small (4.6 x 35 mm) non-porous TSK-DEAE HPLC column, hundreds of micrograms of DNA restriction fragments could be resolved and purified in a single HPLC run of less than 30 minutes.
7.
Wesselink JJ De La Iglesia B James SA Dicks JL Roberts IN Rayward-Smith VJ 《Bioinformatics (Oxford, England)》2002,18(7):1004-1010
MOTIVATION: Yeasts are often still identified with physiological growth tests, which are both time consuming and unsuitable for detection of a mixture of organisms. Hence, there is a need for molecular methods to identify yeast species. RESULTS: A hashing technique has been developed to search for unique DNA sequences in 702 26S rRNA genes. A unique DNA sequence has been found for almost every yeast species described to date. The locations of the unique defining sequences are in accordance with the variability map of large subunit ribosomal RNA and provide detail of the evolution of the D1/D2 region. This approach will be applicable to the rapid identification of unique sequences in other DNA sequence sets. AVAILABILITY: Freely available upon request from the authors. Supplementary information: Results are available at http://www.sys.uea.ac.uk/~jjw/project/paper
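The abstract does not describe the hashing technique in detail. One simple way such a search could work, sketched here purely as an assumption, is to store every fixed-length subsequence in a hash table keyed by the subsequence, record which input sequences contain it, and keep those found in exactly one sequence. The function name `unique_kmers` and the fixed length k are hypothetical.

```python
from collections import defaultdict


def unique_kmers(seqs, k):
    """Return, per sequence, the k-mers that occur in no other sequence."""
    owners = defaultdict(set)  # hash table: k-mer -> ids of sequences containing it
    for sid, seq in seqs.items():
        for i in range(len(seq) - k + 1):
            owners[seq[i:i + k]].add(sid)
    unique = {sid: set() for sid in seqs}
    for kmer, ids in owners.items():
        if len(ids) == 1:
            # Exactly one owner: this k-mer identifies that sequence.
            unique[next(iter(ids))].add(kmer)
    return unique


# Hypothetical toy input: CGTA is shared, ACGT and GTAC are unique.
signatures = unique_kmers({"a": "ACGTA", "b": "CGTAC"}, k=4)
```

A sequence whose set comes back empty has no identifying subsequence of that length, so a longer k or a different region would have to be tried.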
8.
Efficient sequence alignment algorithms (total citations: 3; self-citations: 0; citations by others: 3)
M S Waterman 《Journal of theoretical biology》1984,108(3):333-337
Sequence alignments are becoming more important with the increase of nucleic acid data. Fitch and Smith have recently given an example where multiple insertion/deletions (rather than a series of adjacent single insertion/deletions) are necessary to achieve the correct alignment. Multiple insertion/deletions are known to increase computation time from O(n^2) to O(n^3), although Gotoh has presented an O(n^2) algorithm for the case where the multiple insertion/deletion weighting function is linear. It is argued in this paper that it could be desirable to use concave weighting functions. For that case, an algorithm is derived that is conjectured to be O(n^2).
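The quadratic-time case attributed to Gotoh above, where a gap of length k costs a linear function of k, is commonly handled with a three-matrix dynamic programming recurrence. The sketch below is a textbook-style version of that recurrence and is not taken from the paper; the function name `affine_gap_distance` and the cost values (mismatch 1, gap opening 2, gap extension 1) are assumptions chosen for illustration. It returns only the minimum alignment cost, not the alignment itself.

```python
def affine_gap_distance(a, b, mismatch=1, gap_open=2, gap_extend=1):
    """Minimum global alignment cost; a gap of length k costs gap_open + gap_extend * k."""
    inf = float("inf")
    n, m = len(a), len(b)
    # M: a[i-1] aligned with b[j-1]; X: a[i-1] aligned with a gap; Y: b[j-1] aligned with a gap.
    M = [[inf] * (m + 1) for _ in range(n + 1)]
    X = [[inf] * (m + 1) for _ in range(n + 1)]
    Y = [[inf] * (m + 1) for _ in range(n + 1)]
    M[0][0] = 0
    for i in range(1, n + 1):
        X[i][0] = gap_open + gap_extend * i
    for j in range(1, m + 1):
        Y[0][j] = gap_open + gap_extend * j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = 0 if a[i - 1] == b[j - 1] else mismatch
            M[i][j] = sub + min(M[i - 1][j - 1], X[i - 1][j - 1], Y[i - 1][j - 1])
            # Either open a new gap (pay gap_open once) or extend the current one.
            X[i][j] = min(M[i - 1][j] + gap_open + gap_extend,
                          Y[i - 1][j] + gap_open + gap_extend,
                          X[i - 1][j] + gap_extend)
            Y[i][j] = min(M[i][j - 1] + gap_open + gap_extend,
                          X[i][j - 1] + gap_open + gap_extend,
                          Y[i][j - 1] + gap_extend)
    return min(M[n][m], X[n][m], Y[n][m])


# With these costs, deleting "CG" as one gap (2 + 2*1 = 4) beats two separate gaps (3 + 3 = 6).
cost = affine_gap_distance("ACGT", "AT")
```

Each cell is filled in constant time from its neighbours, so the running time is proportional to the product of the sequence lengths. This only works because extending a gap always costs the same amount; a general concave weighting function does not allow that shortcut.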
9.
10.
《American journal of human genetics》2021,108(10):1880-1890
11.
MOTIVATION: Noise in database searches resulting from random sequence similarities increases as the databases expand rapidly. The noise problems are not a technical shortcoming of the database search programs, but a logical consequence of the idea of homology searches. The effect can be observed in simulation experiments. RESULTS: We have investigated noise levels in pairwise-alignment-based database searches. The noise levels of 38 releases of the SwissProt database display perfect logarithmic growth with the total length of the databases. Clustering of real biological sequences reduces noise levels, but the effect is marginal.
12.
Hilson P 《Trends in plant science》2006,11(3):133-141
How to assign function to the tens of thousands of genes discovered in the chromosomes of a few model species? How to complement the classical genetic approaches that are not always ideally suited to decode complex mechanisms? The solutions to these pressing questions are not simple and rely on the development of novel resources and technologies. Here I critically review what clone collections are available and how they can be exploited for the systematic analysis of gene functions in plants.
13.
Gregory TR 《Nature reviews. Genetics》2005,6(9):699-708
Until recently the study of individual DNA sequences and of total DNA content (the C-value) sat at opposite ends of the spectrum in genome biology. For gene sequencers, the vast stretches of non-coding DNA found in eukaryotic genomes were largely considered to be an annoyance, whereas genome-size researchers attributed little relevance to specific nucleotide sequences. However, the dawn of comprehensive genome sequencing has allowed a new synergy between these fields, with sequence data providing novel insights into genome-size evolution, and with genome-size data being of both practical and theoretical significance for large-scale sequence analysis. In combination, these formerly disconnected disciplines are poised to deliver a greatly improved understanding of genome structure and evolution.
14.
We have analyzed the alignment of a long homologous region of the human and baboon genomes (approximately 1.5 Mb). We show that the frequency of gaps between aligned segments decreases slowly with gap length, indicating that several successive nucleotides are often deleted or inserted in one event. By contrast, runs of consecutive mismatches decrease rapidly in frequency with increasing length, following an exponential distribution, indicating that nucleotides are mostly substituted one at a time. Nucleotide substitutions are clumped at the scales of <10 and 1000-10,000 nucleotides, but show almost no aggregation at the scales of <10-100 and over approximately 50,000 nucleotides. Apparently, two rather different factors make the substitution rate not exactly uniform along the DNA sequence. Comparison of regions of very similar genomes that are approximately selectively neutral makes it possible to study spontaneous mutation at a new level of resolution.
15.
16.
Panchenko AR 《Nucleic acids research》2003,31(2):683-689
To improve the recognition of weak similarities between proteins, a method of aligning two sequence profiles is proposed. It is shown that exploring the sequence space in the vicinity of the sequence with unknown properties significantly improves the performance of sequence alignment methods. Consistent with previous observations, the recognition sensitivity and alignment accuracy obtained by a profile–profile alignment method can be as much as 30% higher than those of the sequence–profile alignment method. It is demonstrated that the choice of score function and the diversity of the test profile are very important factors for achieving the maximum performance of the method, whereas the optimum range of these parameters depends on the level of similarity to be recognized.
17.
Blastocystis hominis: phylogenetic affinities determined by rRNA sequence comparison (total citations: 5; self-citations: 0; citations by others: 5)
In 1912 Blastocystis hominis was identified as a new species and classified as a yeast (Brumpt 1912). In the early 1920s several groups confirmed its classification as a yeast, specifically a member of the genus Schizosaccharomyces (discussed by Zierdt et al. 1967). Apart from an occasional case report, the classification of B. hominis and its role as a harmless intestinal yeast was not questioned for another 50 years. Then, Zierdt (1967) suggested that it should be classified in the phylum Protozoa, subphylum Sporozoa, and that it should be considered as a potential pathogen. The likely role of B. hominis as a human pathogen has recently become more firmly established (Garcia et al. 1984; Sheehan et al. 1986) and its classification has been changed. Although the classification of B. hominis as a protozoon was assumed widely, classification as a sporozoon was not accepted, and the most recent definitive classification of the Protozoa did not even list B. hominis (Lee et al. 1985). Then, based essentially on a review of the known characteristics of the organism, it was recently reclassified into the subphylum Sarcodina (Zierdt 1988). Clearly, the phylogeny of this emerging human pathogen needs definitive analysis (Mehlhorn 1988).
18.
19.
Leek seed lots of high (91%) and low (82%) viability were primed in aerated polyethylene glycol (PEG) solutions in Bubble-columns and by a non-osmotic priming technique in both 1987 and 1988 and the seeds were then sown in the field. Both methods of priming established a similar seed moisture content during treatment and, in the laboratory, produced seeds with more rapid and uniform germination than untreated seeds, with a greater advantage for Drum compared with Bubble-column priming in PEG. In the field, both priming techniques gave seedling emergence responses similar to those from priming in PEG by the laboratory-scale technique on filter paper. Both large-scale priming methods gave earlier and more uniform emergence than untreated seeds and gave similar or slightly higher levels of seedling emergence, except on one sowing occasion when seeds were stored before sowing followed by sowing into a drying seedbed.
20.
General methods of sequence comparison (total citations: 9; self-citations: 0; citations by others: 9)
Michael S. Waterman 《Bulletin of mathematical biology》1984,46(4):473-500
Mathematical methods for comparison of nucleic acid sequences are reviewed. There are two major methods of sequence comparison: dynamic programming and a method referred to here as the regions method. The problem types discussed are comparison of two sequences, location of long matching segments, efficient database searches and comparison of several sequences. This work was supported by a grant from the System Development Foundation.