首页 | 本学科首页   官方微博 | 高级检索  
相似文献
 共查询到20条相似文献,搜索用时 15 毫秒
1.
Motivation: The nucleotide sequencing process produces not onlythe sequence of nucleotides, but also associated quality values.Quality values provide valuable information, but are primarilyused only for trimming sequences and generally ignored in subsequentanalyses. Results: This article describes how the scoring schemes of standardalignment algorithms can be modified to take into account qualityvalues to produce improved alignments and statistically moreaccurate scores. A prototype implementation is also provided,and used to post-process a set of BLAST results. Quality-adjustedalignment is a natural extension of standard alignment methods,and can be implemented with only a small constant factor performancepenalty. The method can also be applied to related methods includingheuristic search algorithms like BLAST and FASTA. Availability: Software is available at http://malde.org/~ketil/qaa. Contact: ketil.malde{at}imr.no Supplementary information: Supplementary data are availableat Bioinformatics online. Associate Editor: Limsoon Wong  相似文献   

2.
Motivation: The success of genome sequencing has resulted inmany protein sequences without functional annotation. We presentConFunc, an automated Gene Ontology (GO)-based protein functionprediction approach, which uses conserved residues to generatesequence profiles to infer function. ConFunc split sets of sequencesidentified by PSI-BLAST into sub-alignments according to theirGO annotations. Conserved residues are identified for each GOterm sub-alignment for which a position specific scoring matrixis generated. This combination of steps produces a set of feature(GO annotation) derived profiles from which protein functionis predicted. Results: We assess the ability of ConFunc, BLAST and PSI-BLASTto predict protein function in the twilight zone of sequencesimilarity. ConFunc significantly outperforms BLAST & PSI-BLASTobtaining levels of recall and precision that are not obtainedby either method and maximum precision 24% greater than BLAST.Further for a large test set of sequences with homologues oflow sequence identity, at high levels of presicision, ConFuncobtains recall six times greater than BLAST. These results demonstratethe potential for ConFunc to form part of an automated genomicsannotation pipeline. Availability: http://www.sbg.bio.ic.ac.uk/confunc Contact: m.sternberg{at}imperial.ac.uk Supplementary information: Supplementary data are availableat Bioinformatics online. Associate Editor: Dmitrij Frishman  相似文献   

3.
Motivation: A plethora of alignment tools have been createdthat are designed to best fit different types of alignment conditions.While some of these are made for aligning Illumina SequenceAnalyzer reads, none of these are fully utilizing its probability(prb) output. In this article, we will introduce a new alignmentapproach (Slider) that reduces the alignment problem space byutilizing each read base's probabilities given in the prb files. Results: Compared with other aligners, Slider has higher alignmentaccuracy and efficiency. In addition, given that Slider matchesbases with probabilities other than the most probable, it significantlyreduces the percentage of base mismatches. The result is thatits SNP predictions are more accurate than other SNP predictionapproaches used today that start from the most probable sequence,including those using base quality. Contact: nmalhis{at}bcgsc.ca Supplementary information and availability: http://www.bcgsc.ca/platform/bioinfo/software/slider Associate Editor: Dmitrij Frishman  相似文献   

4.
Exon discovery by genomic sequence alignment   总被引:5,自引:0,他引:5  
MOTIVATION: During evolution, functional regions in genomic sequences tend to be more highly conserved than randomly mutating 'junk DNA' so local sequence similarity often indicates biological functionality. This fact can be used to identify functional elements in large eukaryotic DNA sequences by cross-species sequence comparison. In recent years, several gene-prediction methods have been proposed that work by comparing anonymous genomic sequences, for example from human and mouse. The main advantage of these methods is that they are based on simple and generally applicable measures of (local) sequence similarity; unlike standard gene-finding approaches they do not depend on species-specific training data or on the presence of cognate genes in data bases. As all comparative sequence-analysis methods, the new comparative gene-finding approaches critically rely on the quality of the underlying sequence alignments. RESULTS: Herein, we describe a new implementation of the sequence-alignment program DIALIGN that has been developed for alignment of large genomic sequences. We compare our method to the alignment programs PipMaker, WABA and BLAST and we show that local similarities identified by these programs are highly correlated to protein-coding regions. In our test runs, PipMaker was the most sensitive method while DIALIGN was most specific. AVAILABILITY: The program is downloadable from the DIALIGN home page at http://bibiserv.techfak.uni-bielefeld.de/dialign/.  相似文献   

5.
MOTIVATION: Multiple alignment of highly divergent sequences is a challenging problem for which available programs tend to show poor performance. Generally, this is due to a scoring function that does not describe biological reality accurately enough or a heuristic that cannot explore solution space efficiently enough. In this respect, we present a new program, Align-m, that uses a non-progressive local approach to guide a global alignment. RESULTS: Two large test sets were used that represent the entire SCOP classification and cover sequence similarities between 0 and 50% identity. Performance was compared with the publicly available algorithms ClustalW, T-Coffee and DiAlign. In general, Align-m has comparable or slightly higher accuracy in terms of correctly aligned residues, especially for distantly related sequences. Importantly, it aligns much fewer residues incorrectly, with average differences of over 15% compared with some of the other algorithms. AVAILABILITY: Align-m and the test sets are available at http://bioinformatics.vub.ac.be  相似文献   

6.
7.
DbClustal addresses the important problem of the automatic multiple alignment of the top scoring full-length sequences detected by a database homology search. By combining the advantages of both local and global alignment algorithms into a single system, DbClustal is able to provide accurate global alignments of highly divergent, complex sequence sets. Local alignment information is incorporated into a ClustalW global alignment in the form of a list of anchor points between pairs of sequences. The method is demonstrated using anchors supplied by the Blast post-processing program, Ballast. The rapidity and reliability of DbClustal have been demonstrated using the recently annotated Pyrococcus abyssi proteome where the number of alignments with totally misaligned sequences was reduced from 20% to <2%. A web site has been implemented proposing BlastP database searches with automatic alignment of the top hits by DbClustal.  相似文献   

8.

Background

Genomic sequence alignment is a powerful method for genome analysis and annotation, as alignments are routinely used to identify functional sites such as genes or regulatory elements. With a growing number of partially or completely sequenced genomes, multiple alignment is playing an increasingly important role in these studies. In recent years, various tools for pair-wise and multiple genomic alignment have been proposed. Some of them are extremely fast, but often efficiency is achieved at the expense of sensitivity. One way of combining speed and sensitivity is to use an anchored-alignment approach. In a first step, a fast search program identifies a chain of strong local sequence similarities. In a second step, regions between these anchor points are aligned using a slower but more accurate method.

Results

Herein, we present CHAOS, a novel algorithm for rapid identification of chains of local pair-wise sequence similarities. Local alignments calculated by CHAOS are used as anchor points to improve the running time of DIALIGN, a slow but sensitive multiple-alignment tool. We show that this way, the running time of DIALIGN can be reduced by more than 95% for BAC-sized and longer sequences, without affecting the quality of the resulting alignments. We apply our approach to a set of five genomic sequences around the stem-cell-leukemia (SCL) gene and demonstrate that exons and small regulatory elements can be identified by our multiple-alignment procedure.

Conclusion

We conclude that the novel CHAOS local alignment tool is an effective way to significantly speed up global alignment tools such as DIALIGN without reducing the alignment quality. We likewise demonstrate that the DIALIGN/CHAOS combination is able to accurately align short regulatory sequences in distant orthologues.
  相似文献   

9.
Motivation: High-throughput experimental and computational methodsare generating a wealth of protein–protein interactiondata for a variety of organisms. However, data produced by currentstate-of-the-art methods include many false positives, whichcan hinder the analyses needed to derive biological insights.One way to address this problem is to assign confidence scoresthat reflect the reliability and biological significance ofeach interaction. Most previously described scoring methodsuse a set of likely true positives to train a model to scoreall interactions in a dataset. A single positive training set,however, may be biased and not representative of true interactionspace. Results: We demonstrate a method to score protein interactionsby utilizing multiple independent sets of training positivesto reduce the potential bias inherent in using a single trainingset. We used a set of benchmark yeast protein interactions toshow that our approach outperforms other scoring methods. Ourapproach can also score interactions across data types, whichmakes it more widely applicable than many previously proposedmethods. We applied the method to protein interaction data fromboth Drosophila melanogaster and Homo sapiens. Independent evaluationsshow that the resulting confidence scores accurately reflectthe biological significance of the interactions. Contact: rfinley{at}wayne.edu Supplementary information: Supplementary data are availableat Bioinformatics Online. Associate Editor: Burkhard Rost  相似文献   

10.
Motivation: Understanding the complexity in gene–phenotyperelationship is vital for revealing the genetic basis of commondiseases. Recent studies on the basis of human interactome andphenome not only uncovers prevalent phenotypic overlap and geneticoverlap between diseases, but also reveals a modular organizationof the genetic landscape of human diseases, providing new opportunitiesto reduce the complexity in dissecting the gene–phenotypeassociation. Results: We provide systematic and quantitative evidence thatphenotypic overlap implies genetic overlap. With these results,we perform the first heterogeneous alignment of human interactomeand phenome via a network alignment technique and identify 39disease families with corresponding causative gene networks.Finally, we propose AlignPI, an alignment-based framework topredict disease genes, and identify plausible candidates for70 diseases. Our method scales well to the whole genome, asdemonstrated by prioritizing 6154 genes across 37 chromosomeregions for Crohn's disease (CD). Results are consistent witha recent meta-analysis of genome-wide association studies forCD. Availability: Bi-modules and disease gene predictions are freelyavailable at the URL http://bioinfo.au.tsinghua.edu.cn/alignpi/ Contact: ruijiang{at}tsinghua.edu.cn Supplementary information: Supplementary data are availableat Bioinformatics online. Associate Editor: Trey Ideker  相似文献   

11.
Motivation: Mass spectrometry data are subjected to considerablenoise. Good noise models are required for proper detection andquantification of peptides. We have characterized noise in bothquadrupole time-of-flight (Q-TOF) and ion trap data, and haveconstructed models for the noise. Results: We find that the noise in Q-TOF data from Applied BiosystemsQSTAR fits well to a combination of multinomial and Poissonmodel with detector dead-time correction. In comparison, iontrap noise from Agilent MSD-Trap-SL is larger than the Q-TOFnoise and is proportional to Poisson noise. We then demonstratethat the noise model can be used to improve deisotoping forpeptide detection, by estimating appropriate cutoffs of thegoodness of fit parameter at prescribed error rates. The noisemodels also have implications in noise reduction, retentiontime alignment and significance testing for biomarker discovery. Contact: pdu{at}us.ibm.com Supplementary information: Supplementary data are availableat Bioinfomatics Online. Associate Editor: Olga Troyanskaya  相似文献   

12.

Background  

DIALIGN-T is a reimplementation of the multiple-alignment program DIALIGN. Due to several algorithmic improvements, it produces significantly better alignments on locally and globally related sequence sets than previous versions of DIALIGN. However, like the original implementation of the program, DIALIGN-T uses a a straight-forward greedy approach to assemble multiple alignments from local pairwise sequence similarities. Such greedy approaches may be vulnerable to spurious random similarities and can therefore lead to suboptimal results. In this paper, we present DIALIGN-TX, a substantial improvement of DIALIGN-T that combines our previous greedy algorithm with a progressive alignment approach.  相似文献   

13.
14.
15.
AGenDA: homology-based gene prediction   总被引:2,自引:0,他引:2  
We present a www server for homology-based gene prediction. The user enters a pair of evolutionary related genomic sequences, for example from human and mouse. Our software system uses CHAOS and DIALIGN to calculate an alignment of the input sequences and then searches for conserved splicing signals and start/stop codons around regions of local sequence similarity. This way, candidate exons are identified that are used, in turn, to calculate optimal gene models. The server returns the constructed gene model by email, together with a graphical representation of the underlying genomic alignment.  相似文献   

16.
Motivation: In searching for differentially expressed (DE) genesin microarray data, we often observe a fraction of the genesto have unequal variability between groups. This is not an issuein large samples, where a valid test exists that uses individualvariances separately. The problem arises in the small-samplesetting, where the approximately valid Welch test lacks sensitivity,while the more sensitive moderated t-test assumes equal variance. Methods: We introduce a moderated Welch test (MWT) that allowsunequal variance between groups. It is based on (i) weightingof pooled and unpooled standard errors and (ii) improved estimationof the gene-level variance that exploits the information fromacross the genes. Results: When a non-trivial proportion of genes has unequalvariability, false discovery rate (FDR) estimates based on thestandard t and moderated t-tests are often too optimistic, whilethe standard Welch test has low sensitivity. The MWT is shownto (i) perform better than the standard t, the standard Welchand the moderated t-tests when the variances are unequal betweengroups and (ii) perform similarly to the moderated t, and betterthan the standard t and Welch tests when the group variancesare equal. These results mean that MWT is more reliable thanother existing tests over wider range of data conditions. Availability: R package to perform MWT is available at http://www.meb.ki.se/~yudpaw Contact: yudi.pawitan{at}ki.se Supplementary information: Supplementary data are availableat Bioinformatics online. Associate Editor: Martin Bishop  相似文献   

17.
Sequence comparison methods based on position-specific score matrices (PSSMs) have proven a useful tool for recognition of the divergent members of a protein family and for annotation of functional sites. Here we investigate one of the factors that affects overall performance of PSSMs in a PSI-BLAST search, the algorithm used to construct the seed alignment upon which the PSSM is based. We compare PSSMs based on alignments constructed by global sequence similarity (ClustalW and ClustalW-pairwise), local sequence similarity (BLAST), and local structure similarity (VAST). To assess performance with respect to identification of conserved functional or structural sites, we examine the accuracy of the three-dimensional molecular models predicted by PSSM-sequence alignments. Using the known structures of those sequences as the standard of truth, we find that model accuracy varies with the algorithm used for seed alignment construction in the pattern local-structure (VAST) > local-sequence (BLAST) > global-sequence (ClustalW). Using structural similarity of query and database proteins as the standard of truth, we find that PSSM recognition sensitivity depends primarily on the diversity of the sequences included in the alignment, with an optimum around 30-50% average pairwise identity. We discuss these observations, and suggest a strategy for constructing seed alignments that optimize PSSM-sequence alignment accuracy and recognition sensitivity.  相似文献   

18.
Motivation: The quest for high-throughput proteomics has revealeda number of challenges in recent years. Whilst substantial improvementsin automated protein separation with liquid chromatography andmass spectrometry (LC/MS), aka ‘shotgun’ proteomics,have been achieved, large-scale open initiatives such as theHuman Proteome Organization (HUPO) Brain Proteome Project haveshown that maximal proteome coverage is only possible when LC/MSis complemented by 2D gel electrophoresis (2-DE) studies. Moreover,both separation methods require automated alignment and differentialanalysis to relieve the bioinformatics bottleneck and so makehigh-throughput protein biomarker discovery a reality. The purposeof this article is to describe a fully automatic image alignmentframework for the integration of 2-DE into a high-throughputdifferential expression proteomics pipeline. Results: The proposed method is based on robust automated imagenormalization (RAIN) to circumvent the drawbacks of traditionalapproaches. These use symbolic representation at the very earlystages of the analysis, which introduces persistent errors dueto inaccuracies in modelling and alignment. In RAIN, a third-ordervolume-invariant B-spline model is incorporated into a multi-resolutionschema to correct for geometric and expression inhomogeneityat multiple scales. The normalized images can then be compareddirectly in the image domain for quantitative differential analysis.Through evaluation against an existing state-of-the-art methodon real and synthetically warped 2D gels, the proposed analysisframework demonstrates substantial improvements in matchingaccuracy and differential sensitivity. High-throughput analysisis established through an accelerated GPGPU (general purposecomputation on graphics cards) implementation. Availability: Supplementary material, software and images usedin the validation are available at http://www.proteomegrid.org/rain/ Contact: g.z.yang{at}imperial.ac.uk Supplementary information: Supplementary data are availableat Bioinformatics online. Associate Editor: David Rocke  相似文献   

19.
SUMMARY: Multiple sequence alignment is the NP-hard problem of aligning three or more DNA or amino acid sequences in an optimal way so as to match as many characters as possible from the set of sequences. The popular sequence alignment program ClustalW uses the classical method of approximating a sequence alignment, by first computing a distance matrix and then constructing a guide tree to show the evolutionary relationship of the sequences. We show that parallelizing the ClustalW algorithm can result in significant speedup. We used a cluster of workstations using C and message passing interface for our implementation. Experimental results show that speedup of over 5.5 on six processors is obtainable for most inputs. AVAILABILITY: The software is available upon request from the second author.  相似文献   

20.
A multivariate test of association   总被引:1,自引:0,他引:1  
Summary: Although genetic association studies often test multiple,related phenotypes, few formal multivariate tests of associationare available. We describe a test of association that can beefficiently applied to large population-based designs. Availability: A C++ implementation can be obtained from theauthors. Contact: manuel.ferreira{at}qimr.edu.au Supplementary information: Supplementary figures are availableat Bioinformatics online. Associate Editor: Alex Bateman  相似文献   

设为首页 | 免责声明 | 关于勤云 | 加入收藏

Copyright©北京勤云科技发展有限公司  京ICP备09084417号