首页 | 本学科首页   官方微博 | 高级检索  
相似文献
 共查询到20条相似文献,搜索用时 15 毫秒
1.
2.
MOTIVATION: Homology search for RNAs can use secondary structure information to increase power by modeling base pairs, as in covariance models, but the resulting computational costs are high. Typical acceleration strategies rely on at least one filtering stage using sequence-only search. RESULTS: Here we present the multi-segment CYK (MSCYK) filter, which implements a heuristic of ungapped structural alignment for RNA homology search. Compared to gapped alignment, this approximation has lower computation time requirements (O(N?) reduced to O(N3), and space requirements (O(N3) reduced to O(N2). A vector-parallel implementation of this method gives up to 100-fold speed-up; vector-parallel implementations of standard gapped alignment at two levels of precision give 3- and 6-fold speed-ups. These approaches are combined to create a filtering pipeline that scores RNA secondary structure at all stages, with results that are synergistic with existing methods.  相似文献   

3.
The Goulden-Jackson cluster method is a powerful method to calculate the probability of occurrences of a pattern or set of patterns in a sequence. If the patterns contain wildcard characters, however, the size of the connector matrix grows exponentially with the number of wildcards. Here we show that average correlation c(z) is a good predicator of hitting probability q (n), and the generalized correlation function ?(z) can be used to approximate c(z) efficiently.We apply the method to the problem of optimal multiple spaced seed selection for homology search. We reexamine the concept of optimal sensitivity of spaced seeds and show that it is better to select optimal seeds based on some average properties, such as c(1), which is the expectation of the first hitting length. Higher order approximations can also be constructed easily. Tests on arbitrary large genomic data with multiple seeds show that the optimal multiple seeds selected by the methods are indeed more sensitive. The methods provide a theoretical background on which various empirical observations can be unified and further heuristic search methods can be developed.  相似文献   

4.
Optimized design and assessment of whole genome tiling arrays   总被引:1,自引:0,他引:1  
MOTIVATION: Recent advances in microarray technologies have made it feasible to interrogate whole genomes with tiling arrays and this technique is rapidly becoming one of the most important high-throughput functional genomics assays. For large mammalian genomes, analyzing oligonucleotide tiling array data is complicated by the presence of non-unique sequences on the array, which increases the overall noise in the data and may lead to false positive results due to cross-hybridization. The ability to create custom microarrays using maskless array synthesis has led us to consider ways to optimize array design characteristics for improving data quality and analysis. We have identified a number of design parameters to be optimized including uniqueness of the probe sequences within the whole genome, melting temperature and self-hybridization potential. RESULTS: We introduce the uniqueness score, U, a novel quality measure for oligonucleotide probes and present a method to quickly compute it. We show that U is equivalent to the number of shortest unique substrings in the probe and describe an efficient greedy algorithm to design mammalian whole genome tiling arrays using probes that maximize U. Using the mouse genome, we demonstrate how several optimizations influence the tiling array design characteristics. With a sensible set of parameters, our designs cover 78% of the mouse genome including many regions previously considered 'untilable' due to the presence of repetitive sequence. Finally, we compare our whole genome tiling array designs with commercially available designs. AVAILABILITY: Source code is available under an open source license from http://www.ebi.ac.uk/~graef/arraydesign/.  相似文献   

5.
We present a fast mapping-based algorithm to compute the mappability of each region of a reference genome up to a specified number of mismatches. Knowing the mappability of a genome is crucial for the interpretation of massively parallel sequencing experiments. We investigate the properties of the mappability of eukaryotic DNA/RNA both as a whole and at the level of the gene family, providing for various organisms tracks which allow the mappability information to be visually explored. In addition, we show that mappability varies greatly between species and gene classes. Finally, we suggest several practical applications where mappability can be used to refine the analysis of high-throughput sequencing data (SNP calling, gene expression quantification and paired-end experiments). This work highlights mappability as an important concept which deserves to be taken into full account, in particular when massively parallel sequencing technologies are employed. The GEM mappability program belongs to the GEM (GEnome Multitool) suite of programs, which can be freely downloaded for any use from its website (http://gemlibrary.sourceforge.net).  相似文献   

6.
7.
8.
Savir Y  Tlusty T 《Molecular cell》2010,40(3):388-396
Homologous recombination facilitates the exchange of genetic material between homologous DNA molecules. This crucial process requires detecting a specific homologous DNA sequence within a huge variety of heterologous sequences. The detection is mediated by RecA in E. coli, or members of its superfamily in other organisms. Here, we examine how well the RecA-DNA interaction is adjusted to its task. By formulating the DNA recognition process as a signal detection problem, we find the optimal value of binding energy that maximizes the ability to detect homologous sequences. We show that the experimentally observed binding energy is nearly optimal. This implies that the RecA-induced deformation and the binding energetics are fine-tuned to ensure optimal sequence detection. Our analysis suggests a possible role for DNA extension by RecA, in which deformation enhances detection. The present signal detection approach provides a general recipe for testing the optimality of other molecular recognition systems.  相似文献   

9.
The increasing availability of high-quality reference genomic sequences has created a demand for ways to survey the sequence differences present in individual genomes. Here we describe a DNA sequencing method based on hybridization of a universal panel of tiling probes. Millions of shotgun fragments are amplified in situ and subjected to sequential hybridization with short fluorescent probes. Long fragments of 200 bp facilitate unique placement even in large genomes. The sequencing chemistry is simple, enzyme-free and consumes only dilute solutions of the probes, resulting in reduced sequencing cost and substantially increased speed. A prototype instrument based on commonly available equipment was used to resequence the Bacteriophage lambda and Escherichia coli genomes to better than 99.93% accuracy with a raw throughput of 320 Mbp/day, albeit with a significant number of small gaps attributed to losses in sample preparation.  相似文献   

10.
As much of the focus of genetics and molecular biology has shifted toward the systems level, it has become increasingly important to accurately extract biologically relevant signal from thousands of related measurements. The common property among these high-dimensional biological studies is that the measured features have a rich and largely unknown underlying structure. One example of much recent interest is identifying differentially expressed genes in comparative microarray experiments. We propose a new approach aimed at optimally performing many hypothesis tests in a high-dimensional study. This approach estimates the optimal discovery procedure (ODP), which has recently been introduced and theoretically shown to optimally perform multiple significance tests. Whereas existing procedures essentially use data from only one feature at a time, the ODP approach uses the relevant information from the entire data set when testing each feature. In particular, we propose a generally applicable estimate of the ODP for identifying differentially expressed genes in microarray experiments. This microarray method consistently shows favorable performance over five highly used existing methods. For example, in testing for differential expression between two breast cancer tumor types, the ODP provides increases from 72% to 185% in the number of genes called significant at a false discovery rate of 3%. Our proposed microarray method is freely available to academic users in the open-source, point-and-click EDGE software package.  相似文献   

11.
MOTIVATION: It is widely recognized that homology search and ortholog clustering are very useful for analyzing biological sequences. However, recent growth of sequence database size makes homolog detection difficult, and rapid and accurate methods are required. RESULTS: We present a novel method for fast and accurate homology detection, assuming that the Smith-Waterman (SW) scores between all similar sequence pairs in a target database are computed and stored. In this method, SW alignment is computed only if the upper bound, which is derived from our novel inequality, is higher than the given threshold. In contrast to other methods such as FASTA and BLAST, this method is guaranteed to find all sequences whose scores against the query are higher than the specified threshold. Results of computational experiments suggest that the method is dozens of times faster than SSEARCH if genome sequence data of closely related species are available.  相似文献   

12.
In this paper, we propose a hybrid clustering method that combines the strengths of bottom-up hierarchical clustering with that of top-down clustering. The first method is good at identifying small clusters but not large ones; the strengths are reversed for the second method. The hybrid method is built on the new idea of a mutual cluster: a group of points closer to each other than to any other points. Theoretical connections between mutual clusters and bottom-up clustering methods are established, aiding in their interpretation and providing an algorithm for identification of mutual clusters. We illustrate the technique on simulated and real microarray datasets.  相似文献   

13.
A new method for homology search of DNA sequences is suggested. This method may be used to find extensive and not strong homologies with point mutations and deletions. The running program time for comparing sequences is less then the dynamic program algorithms at least at two orders of magnitude. It makes possible to use the method for homology searching throughover the nucleotide bank by personal computers.  相似文献   

14.
Microarray experiments can generate enormous amounts of data, but large datasets are usually inherently complex, and the relevant information they contain can be difficult to extract. For the practicing biologist, we provide an overview of what we believe to be the most important issues that need to be addressed when dealing with microarray data. In a microarray experiment we are simply trying to identify which genes are the most "interesting" in terms of our experimental question, and these will usually be those that are either overexpressed or underexpressed (upregulated or downregulated) under the experimental conditions. Analysis of the data to find these genes involves first preprocessing of the raw data for quality control, including filtering of the data (e.g., detection of outlying values) followed by standardization of the data (i.e., making the data uniformly comparable throughout the dataset). This is followed by the formal quantitative analysis of the data, which will involve either statistical hypothesis testing or multivariate pattern recognition. Statistical hypothesis testing is the usual approach to "class comparison," where several experimental groups are being directly compared. The best approach to this problem is to use analysis of variance, although issues related to multiple hypothesis testing and probability estimation still need to be evaluated. Pattern recognition can involve "class prediction," for which a range of supervised multivariate techniques are available, or "class discovery," for which an even broader range of unsupervised multivariate techniques have been developed. Each technique has its own limitations, which need to be kept in mind when making a choice from among them. To put these ideas in context, we provide a detailed examination of two specific examples of the analysis of microarray data, both from parasitology, covering many of the most important points raised.  相似文献   

15.
16.
17.
18.
The use of multiple hypothesis testing procedures has been receiving a lot of attention recently by statisticians in DNA microarray analysis. The traditional FWER controlling procedures are not very useful in this situation since the experiments are exploratory by nature and researchers are more interested in controlling the rate of false positives rather than controlling the probability of making a single erroneous decision. This has led to increased use of FDR (False Discovery Rate) controlling procedures. Genovese and Wasserman proposed a single-step FDR procedure that is an asymptotic approximation to the original Benjamini and Hochberg stepwise procedure. In this paper, we modify the Genovese-Wasserman procedure to force the FDR control closer to the level alpha in the independence setting. Assuming that the data comes from a mixture of two normals, we also propose to make this procedure adaptive by first estimating the parameters using the EM algorithm and then using these estimated parameters into the above modification of the Genovese-Wasserman procedure. We compare this procedure with the original Benjamini-Hochberg and the SAM thresholding procedures. The FDR control and other properties of this adaptive procedure are verified numerically.  相似文献   

19.
MOTIVATION: Statistical tests for the detection of differentially expressed genes lead to a large collection of p-values one for each gene comparison. Without any further adjustment, these p-values may lead to a large number of false positives, simply because the number of genes to be tested is huge, which might mean wastage of laboratory resources. To account for multiple hypotheses, these p-values are typically adjusted using a single step method or a step-down method in order to achieve an overall control of the error rate (the so-called familywise error rate). In many applications, this may lead to an overly conservative strategy leading to too few genes being flagged. RESULTS: In this paper we introduce a novel empirical Bayes screening (EBS) technique to inspect a large number of p-values in an effort to detect additional positive cases. In effect, each case borrows strength from an overall picture of the alternative hypotheses computed from all the p-values, while the entire procedure is calibrated by a step-down method so that the familywise error rate at the complete null hypothesis is still controlled. It is shown that the EBS has substantially higher sensitivity than the standard step-down approach for multiple comparison at the cost of a modest increase in the false discovery rate (FDR). The EBS procedure also compares favorably when compared with existing FDR control procedures for multiple testing. The EBS procedure is particularly useful in situations where it is important to identify all possible potentially positive cases which can be subjected to further confirmatory testing in order to eliminate the false positives. We illustrated this screening procedure using a data set on human colorectal cancer where we show that the EBS method detected additional genes related to colon cancer that were missed by other methods.This novel empirical Bayes procedure is advantageous over our earlier proposed empirical Bayes adjustments due to the following reasons: (i) it offers an automatic screening of the p-values the user may obtain from a univariate (i.e., gene by gene) analysis package making it extremely easy to use for a non-statistician, (ii) since it applies to the p-values, the tests do not have to be t-tests; in particular they could be F-tests which might arise in certain ANOVA formulations with expression data or even nonparametric tests, (iii) the empirical Bayes adjustment uses nonparametric function estimation techniques to estimate the marginal density of the transformed p-values rather than using a parametric model for the prior distribution and is therefore robust against model mis-specification. AVAILABILITY: R code for EBS is available from the authors upon request. SUPPLEMENTARY INFORMATION: http://www.stat.uga.edu/~datta/EBS/supp.htm  相似文献   

20.
MOTIVATION: An important goal of microarray studies is to discover genes that are associated with clinical outcomes, such as disease status and patient survival. While a typical experiment surveys gene expressions on a global scale, there may be only a small number of genes that have significant influence on a clinical outcome. Moreover, expression data have cluster structures and the genes within a cluster have correlated expressions and coordinated functions, but the effects of individual genes in the same cluster may be different. Accordingly, we seek to build statistical models with the following properties. First, the model is sparse in the sense that only a subset of the parameter vector is non-zero. Second, the cluster structures of gene expressions are properly accounted for. RESULTS: For gene expression data without pathway information, we divide genes into clusters using commonly used methods, such as K-means or hierarchical approaches. The optimal number of clusters is determined using the Gap statistic. We propose a clustering threshold gradient descent regularization (CTGDR) method, for simultaneous cluster selection and within cluster gene selection. We apply this method to binary classification and censored survival analysis. Compared to the standard TGDR and other regularization methods, the CTGDR takes into account the cluster structure and carries out feature selection at both the cluster level and within-cluster gene level. We demonstrate the CTGDR on two studies of cancer classification and two studies correlating survival of lymphoma patients with microarray expressions. AVAILABILITY: R code is available upon request. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.  相似文献   

设为首页 | 免责声明 | 关于勤云 | 加入收藏

Copyright©北京勤云科技发展有限公司  京ICP备09084417号