首页 | 本学科首页   官方微博 | 高级检索  
相似文献
 共查询到20条相似文献,搜索用时 31 毫秒
1.
Sequence alignment by cross-correlation.   总被引:1,自引:0,他引:1  
Many recent advances in biology and medicine have resulted from DNA sequence alignment algorithms and technology. Traditional approaches for the matching of DNA sequences are based either on global alignment schemes or heuristic schemes that seek to approximate global alignment algorithms while providing higher computational efficiency. This report describes an approach using the mathematical operation of cross-correlation to compare sequences. It can be implemented using the fast fourier transform for computational efficiency. The algorithm is summarized and sample applications are given. These include gene sequence alignment in long stretches of genomic DNA, finding sequence similarity in distantly related organisms, demonstrating sequence similarity in the presence of massive (approximately 90%) random point mutations, comparing sequences related by internal rearrangements (tandem repeats) within a gene, and investigating fusion proteins. Application to RNA and protein sequence alignment is also discussed. The method is efficient, sensitive, and robust, being able to find sequence similarities where other alignment algorithms may perform poorly.  相似文献   

2.
Deciphering gene expression regulatory networks   总被引:11,自引:0,他引:11  
  相似文献   

3.
Effectiveness of computational methods in haplotype prediction   总被引:11,自引:0,他引:11  
Haplotype analysis has been used for narrowing down the location of disease-susceptibility genes and for investigating many population processes. Computational algorithms have been developed to estimate haplotype frequencies and to predict haplotype phases from genotype data for unrelated individuals. However, the accuracy of such computational methods needs to be evaluated before their applications can be advocated. We have experimentally determined the haplotypes at two loci, the N-acetyltransferase 2 gene ( NAT2, 850 bp, n=81) and a 140-kb region on chromosome X ( n=77), each consisting of five single nucleotide polymorphisms (SNPs). We empirically evaluated and compared the accuracy of the subtraction method, the expectation-maximization (EM) method, and the PHASE method in haplotype frequency estimation and in haplotype phase prediction. Where there was near complete linkage disequilibrium (LD) between SNPs (the NAT2 gene), all three methods provided effective and accurate estimates for haplotype frequencies and individual haplotype phases. For a genomic region in which marked LD was not maintained (the chromosome X locus), the computational methods were adequate in estimating overall haplotype frequencies. However, none of the methods was accurate in predicting individual haplotype phases. The EM and the PHASE methods provided better estimates for overall haplotype frequencies than the subtraction method for both genomic regions.  相似文献   

4.
An increasing number of genes have been experimentally confirmed in recent years as causative genes to various human diseases. The newly available knowledge can be exploited by machine learning methods to discover additional unknown genes that are likely to be associated with diseases. In particular, positive unlabeled learning (PU learning) methods, which require only a positive training set P (confirmed disease genes) and an unlabeled set U (the unknown candidate genes) instead of a negative training set N, have been shown to be effective in uncovering new disease genes in the current scenario. Using only a single source of data for prediction can be susceptible to bias due to incompleteness and noise in the genomic data and a single machine learning predictor prone to bias caused by inherent limitations of individual methods. In this paper, we propose an effective PU learning framework that integrates multiple biological data sources and an ensemble of powerful machine learning classifiers for disease gene identification. Our proposed method integrates data from multiple biological sources for training PU learning classifiers. A novel ensemble-based PU learning method EPU is then used to integrate multiple PU learning classifiers to achieve accurate and robust disease gene predictions. Our evaluation experiments across six disease groups showed that EPU achieved significantly better results compared with various state-of-the-art prediction methods as well as ensemble learning classifiers. Through integrating multiple biological data sources for training and the outputs of an ensemble of PU learning classifiers for prediction, we are able to minimize the potential bias and errors in individual data sources and machine learning algorithms to achieve more accurate and robust disease gene predictions. In the future, our EPU method provides an effective framework to integrate the additional biological and computational resources for better disease gene predictions.  相似文献   

5.
6.
The explosion in genomic sequence available in public databases has resulted in an unprecedented opportunity for computational whole genome analyses. A number of promising comparative-based approaches have been developed for gene finding, regulatory element discovery and other purposes, and it is clear that these tools will play a fundamental role in analysing the enormous amount of new data that is currently being generated. The synthesis of computationally intensive comparative computational approaches with the requirement for whole genome analysis represents both an unprecedented challenge and opportunity for computational scientists. We focus on a few of these challenges, using by way of example the problems of alignment, gene finding and regulatory element discovery, and discuss the issues that have arisen in attempts to solve these problems in the context of whole genome analysis pipelines.  相似文献   

7.
The problems associated with gene identification and the prediction of gene structure in DNA sequences have been the focus of increased attention over the past few years with the recent acquisition by large-scale sequencing projects of an immense amount of genome data. A variety of prediction programs have been developed in order to address these problems. This paper presents a review of the computational approaches and gene-finders used commonly for gene prediction in eukaryotic genomes. Two approaches, in general, have been adopted for this purpose: similarity-based and ab initio techniques. The information gleaned from these methods is then combined via a variety of algorithms, including Dynamic Programming (DP) or the Hidden Markov Model (HMM), and then used for gene prediction from the genomic sequences.  相似文献   

8.

Background

Early and accurate identification of adverse drug reactions (ADRs) is critically important for drug development and clinical safety. Computer-aided prediction of ADRs has attracted increasing attention in recent years, and many computational models have been proposed. However, because of the lack of systematic analysis and comparison of the different computational models, there remain limitations in designing more effective algorithms and selecting more useful features. There is therefore an urgent need to review and analyze previous computation models to obtain general conclusions that can provide useful guidance to construct more effective computational models to predict ADRs.

Principal Findings

In the current study, the main work is to compare and analyze the performance of existing computational methods to predict ADRs, by implementing and evaluating additional algorithms that have been earlier used for predicting drug targets. Our results indicated that topological and intrinsic features were complementary to an extent and the Jaccard coefficient had an important and general effect on the prediction of drug-ADR associations. By comparing the structure of each algorithm, final formulas of these algorithms were all converted to linear model in form, based on this finding we propose a new algorithm called the general weighted profile method and it yielded the best overall performance among the algorithms investigated in this paper.

Conclusion

Several meaningful conclusions and useful findings regarding the prediction of ADRs are provided for selecting optimal features and algorithms.  相似文献   

9.

Background  

gene identification in genomic DNA sequences by computational methods has become an important task in bioinformatics and computational gene prediction tools are now essential components of every genome sequencing project. Prediction of splice sites is a key step of all gene structural prediction algorithms.  相似文献   

10.
MOTIVATION: Many bioinformatic approaches exist for finding novel genes within genomic sequence data. Traditionally, homology search-based methods are often the first approach employed in determining whether a novel gene exists that is similar to a known gene. Unfortunately, distantly related genes or motifs often are difficult to find using single query-based homology search algorithms against large sequence datasets such as the human genome. Therefore, the motivation behind this work was to develop an approach to enhance the sensitivity of traditional single query-based homology algorithms against genomic data without losing search selectivity. RESULTS: We demonstrate that by searching against a genome fragmented into all possible reading frames, the sensitivity of homology-based searches is enhanced without degrading its selectivity. Using the ETS-domain, bromodomain and acetyl-CoA acetyltransferase gene as queries, we were able to demonstrate that direct protein-protein searches using BLAST2P or FASTA3 against a human genome segmented among all possible reading frames and translated was substantially more sensitive than traditional protein-DNA searches against a raw genomic sequence using an application such as TBLAST2N. Receiver operating characteristic analysis was employed to demonstrate that the algorithms remained selective, while comparisons of the algorithms showed that the protein-protein searches were more sensitive in identifying hits. Therefore, through the overprediction of reading frames by this method and the increased sensitivity of protein-protein based homology search algorithms, a genome can be deeply mined, potentially finding hits overlooked by protein-DNA searches against raw genomic data.  相似文献   

11.
12.
13.
Considerable attention has been focused on predicting the secondary structure for aligned RNA sequences since it is useful not only for improving the limiting accuracy of conventional secondary structure prediction but also for finding non-coding RNAs in genomic sequences. Although there exist many algorithms of predicting secondary structure for aligned RNA sequences, further improvement of the accuracy is still awaited. In this article, toward improving the accuracy, a theoretical classification of state-of-the-art algorithms of predicting secondary structure for aligned RNA sequences is presented. The classification is based on the viewpoint of maximum expected accuracy (MEA), which has been successfully applied in various problems in bioinformatics. The classification reveals several disadvantages of the current algorithms but we propose an improvement of a previously introduced algorithm (CentroidAlifold). Finally, computational experiments strongly support the theoretical classification and indicate that the improved CentroidAlifold substantially outperforms other algorithms.  相似文献   

14.
orthologs指起源于不同物种的最近的共同祖先的一些基因。orthologous的基因,具有相近甚至相同的功能,由相似的途径调控,在不同的物种中扮演相似甚至相同的角色,因此在基因组序列的注释中,是最可靠的选择。orthologs的生物信息预测方法主要有两类:系统发生方法和序列比对方法。这两类方法都是基于序列的相似性,但又各有特点。系统发生方法通过重建系统发生树来预测orthologs,因此在概念上比较精确,但难于自动化,运算量也很大。序列比对方法在概念上比较粗糙,但简单实用,运算量相对较小,因此得到了较广泛的应用。  相似文献   

15.
Comparative genomics is a powerful means to gain insight into the evolutionary processes that shape the genomes of related species. As the number of sequenced genomes increases, the development of software to perform accurate cross-species analyses becomes indispensable. However, many implementations that have the ability to compare multiple genomes exhibit unfavorable computational and memory requirements, limiting the number of genomes that can be analyzed in one run. Here, we present a software package to unveil genomic homology based on the identification of conservation of gene content and gene order (collinearity), i-ADHoRe 3.0, and its application to eukaryotic genomes. The use of efficient algorithms and support for parallel computing enable the analysis of large-scale data sets. Unlike other tools, i-ADHoRe can process the Ensembl data set, containing 49 species, in 1?h. Furthermore, the profile search is more sensitive to detect degenerate genomic homology than chaining pairwise collinearity information based on transitive homology. From ultra-conserved collinear regions between mammals and birds, by integrating coexpression information and protein-protein interactions, we identified more than 400 regions in the human genome showing significant functional coherence. The different algorithmical improvements ensure that i-ADHoRe 3.0 will remain a powerful tool to study genome evolution.  相似文献   

16.
This research analyzes some aspects of the relationship between gene expression, gene function, and gene annotation. Many recent studies are implicitly based on the assumption that gene products that are biologically and functionally related would maintain this similarity both in their expression profiles as well as in their gene ontology (GO) annotation. We analyze how accurate this assumption proves to be using real publicly available data. We also aim to validate a measure of semantic similarity for GO annotation. We use the Pearson correlation coefficient and its absolute value as a measure of similarity between expression profiles of gene products. We explore a number of semantic similarity measures (Resnik, Jiang, and Lin) and compute the similarity between gene products annotated using the GO. Finally, we compute correlation coefficients to compare gene expression similarity against GO semantic similarity. Our results suggest that the Resnik similarity measure outperforms the others and seems better suited for use in gene ontology. We also deduce that there seems to be correlation between semantic similarity in the GO annotation and gene expression for the three GO ontologies. We show that this correlation is negligible up to a certain semantic similarity value; then, for higher similarity values, the relationship trend becomes almost linear. These results can be used to augment the knowledge provided by clustering algorithms and in the development of bioinformatic tools for finding and characterizing gene products.  相似文献   

17.
MOTIVATION: Gene finding remains an open problem well after the sequencing of the human genome. The low gene sensitivity of current methods is a problem for divergent protein families, because fairly accurate exon assemblies are required before sensitive fold recognition algorithms can be applied. This paper presents a new genomic threading algorithm which integrates the gene finding and fold recognition steps into a single process. The method is applicable to evolutionarily divergent protein families that have retained some trace of their common ancestry, number and phase of introns, sizes of exons and placement of structural elements on specific exons. Such conserved structural signals may be visible despite dramatic evolution of protein sequence. RESULTS: The method is evaluated on the family of helical cytokines by cross-validation sensitivity analysis. The method has also been applied to all intergenic regions of the human genome, and an expression and cloning approach has been coupled with the predictions of the method. Two genes discovered by this method are discussed. SUPPLEMENTARY INFORMATION: All data used and the results obtained in the cross-validation analysis are available at http://www.soi.city.ac.uk/~conklin/papers/GT/  相似文献   

18.
MicroRNAs (miRNAs) are one class of tiny, endogenous RNAs that can regulate messenger RNA (mRNA) expression by targeting homologous sequences in mRNAs. Their aberrant expressions have been observed in many cancers and several miRNAs have been convincingly shown to play important roles in carcinogenesis. Since the discovery of this small regulator, computational methods have been indispensable tools in miRNA gene finding and functional studies. In this review we first briefly outline the biological findings of miRNA genes, such as genomic feature, biogenesis, gene structure, and functional mechanism. We then discuss in detail the three main aspects of miRNA computational studies: miRNA gene finding, miRNA target prediction, and regulation of miRNA genes. Finally, we provide perspectives on some emerging issues, including combinatorial regulation by miRNAs and functional binding sites beyond the 3′-untranslated region (3′UTR) of target mRNAs. Available online resources for miRNA computational studies are also provided.  相似文献   

19.
Yi G  Jung J 《Bioinformation》2011,7(5):251-256
Identifying genomic regions that descended from a common ancestor helps us study the gene function and genome evolution. In distantly related genomes, clusters of homologous gene pairs are evidently used in function prediction, operon detection, etc. Currently, there are many kinds of computational methods that have been proposed defining gene clusters to identify gene families and operons. However, most of those algorithms are only available on a data set of small size. We developed an efficient gene clustering algorithm that can be applied on hundreds of genomes at the same time. This approach allows for large-scale study of evolutionary relationships of gene clusters and study of operon formation and destruction. An analysis of proposed algorithms shows that more biological insight can be obtained by analyzing gene clusters across hundreds of genomes, which can help us understand operon occurrences, gene orientations and gene rearrangements.  相似文献   

20.
A challenging task in computational biology is the reconstruction of genomic sequences of extinct ancestors, given the phylogenetic tree and the sequences at the leafs. This task is best solved by calculating the most likely estimate of the ancestral sequences, along with the most likely edge lengths. We deal with this problem and also the variant in which the phylogenetic tree in addition to the ancestral sequences need to be estimated. The latter problem is known to be NP-hard, while the computational complexity of the former is unknown. Currently, all algorithms for solving these problems are heuristics without performance guarantees. The biological importance of these problems calls for developing better algorithms with guarantees of finding either optimal or approximate solutions.We develop approximation, fix parameter tractable (FPT), and fast heuristic algorithms for two variants of the problem; when the phylogenetic tree is known and when it is unknown. The approximation algorithm guarantees a solution with a log-likelihood ratio of 2 relative to the optimal solution. The FPT has a running time which is polynomial in the length of the sequences and exponential in the number of taxa. This makes it useful for calculating the optimal solution for small trees. Moreover, we combine the approximation algorithm and the FPT into an algorithm with arbitrary good approximation guarantee (PTAS). We tested our algorithms on both synthetic and biological data. In particular, we used the FPT for computing the most likely ancestral mitochondrial genomes of hominidae (the great apes), thereby answering an interesting biological question. Moreover, we show how the approximation algorithms find good solutions for reconstructing the ancestral genomes for a set of lentiviruses (relatives of HIV). Supplementary material of this work is available at www.nada.kth.se/~isaac/publications/aml/aml.html.  相似文献   

设为首页 | 免责声明 | 关于勤云 | 加入收藏

Copyright©北京勤云科技发展有限公司  京ICP备09084417号