Similar articles
 20 similar articles found
1.
Multiple sequence alignment (MSA) is a crucial first step in the analysis of genomic and proteomic data. Commonly occurring sequence features, such as deletions and insertions, are known to affect the accuracy of MSA programs, but the extent to which alignment accuracy is affected by the positions of insertions and deletions has not been examined independently of other sources of sequence variation. We assessed the performance of 6 popular MSA programs (ClustalW, DIALIGN-T, MAFFT, MUSCLE, PROBCONS, and T-COFFEE) and one experimental program, PRANK, on amino acid sequences that differed only by short regions of deleted residues. The analysis showed that the absence of residues often led to an incorrect placement of gaps in the alignments, even though the sequences were otherwise identical. In data sets containing sequences with partially overlapping deletions, most MSA programs preferentially aligned the gaps vertically at the expense of incorrectly aligning residues in the flanking regions. Of the programs assessed, only DIALIGN-T was able to place overlapping gaps correctly relative to one another, but this was usually context dependent and was observed only in some of the data sets. In data sets containing sequences with non-overlapping deletions, both DIALIGN-T and MAFFT (G-INS-I) were able to align gaps with near-perfect accuracy, but only MAFFT produced the correct alignment consistently. The same was true for data sets that comprised isoforms of alternatively spliced gene products: both DIALIGN-T and MAFFT produced highly accurate alignments, with MAFFT being the more consistent of the 2 programs. Other programs, notably T-COFFEE and ClustalW, were less accurate. For all data sets, alignments produced by different MSA programs differed markedly, indicating that reliance on a single MSA program may give misleading results. 
It is therefore advisable to use more than one MSA program when dealing with sequences that may contain deletions or insertions, particularly for high-throughput and pipeline applications where manual refinement of each alignment is not practicable.
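
One way to put a number on how much two programs' alignments of the same sequences differ is to compare the residue pairs each one places in the same column. The sketch below is illustrative only and is not the scoring used in the study above; the function names `aligned_pairs` and `pair_agreement` are hypothetical, and it assumes each alignment is a list of equal-length strings with `-` as the gap character and the sequences in the same order.

```python
def aligned_pairs(alignment):
    """Return the set of residue pairs that share a column.

    Each residue is identified as (sequence index, residue index), so the
    result does not depend on where gaps were placed elsewhere in the row.
    """
    counters = [0] * len(alignment)  # next residue index for each sequence
    pairs = set()
    for col in range(len(alignment[0])):
        present = []
        for i, row in enumerate(alignment):
            if row[col] != "-":
                present.append((i, counters[i]))
                counters[i] += 1
        # Every pair of residues in this column counts as aligned.
        for a in range(len(present)):
            for b in range(a + 1, len(present)):
                pairs.add((present[a], present[b]))
    return pairs


def pair_agreement(reference, test):
    """Fraction of the reference alignment's residue pairs recovered by test."""
    ref = aligned_pairs(reference)
    if not ref:
        return 1.0
    return len(ref & aligned_pairs(test)) / len(ref)


# Two alignments of the same pair of sequences; the second misplaces the gap.
reference = ["ACDEFG", "AC--FG"]
shifted = ["ACDEFG", "ACF--G"]
print(pair_agreement(reference, shifted))  # 0.75
```

A value well below 1 for alignments produced by two programs is a signal that gap placement deserves a closer look before downstream analysis.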

2.
The SEQALIGN programs described in this report aid in the assembly of up to 100 individual overlapping DNA sequences generated by M-13 subcloning and sequencing methods. The program produces a printout of the aligned sequences presented in register. Use of the program will be facilitated because 1) it is written with the Microsoft BASIC interpreter, 2) sequence data may be entered and edited using WORDSTAR or similar word processing programs, and 3) hardware requirements for execution of the program on CP/M or MS-DOS (IBM-PC compatible) systems are minimal.

3.
Kumaran D, Maguire EA. Neuron, 2006, 49(4): 617-629
Sequence disambiguation, the process by which overlapping sequences are kept separate, has been proposed to underlie a wide range of memory capacities supported by the hippocampus, including episodic memory and spatial navigation. We used functional magnetic resonance imaging (fMRI) to explore the dynamic pattern of hippocampal activation during the encoding of sequences of faces. Activation in the right posterior hippocampus during the encoding of overlapping, but not nonoverlapping, sequences was found to correlate robustly with a subject-specific behavioral index of sequence learning. Moreover, our data indicate that hippocampal activation in response to elements common to both sequences in the overlapping sequence pair may be particularly important for accurate sequence encoding and retrieval. Together, these findings support the conclusion that the human hippocampus is involved in the earliest stage of sequence disambiguation, when memory representations are in the process of being created, and provide empirical support for contemporary computational models of hippocampal function.

4.
Microcomputer programs for DNA sequence analysis.   Cited by: 21 (self-citations: 5; by others: 16)
Computer programs are described which allow (a) analysis of DNA sequences to be performed on a laboratory microcomputer or (b) transfer of DNA sequences between a laboratory microcomputer and another computer system, such as a DNA library. The sequence analysis programs are interactive, do not require prior experience with computers and in many other respects resemble programs which have been written for larger computer systems (1-7). The user enters sequence data into a text file, accesses this file with the programs, and is then able to (a) search for restriction enzyme sites or other specified sequences, (b) translate in one or more reading frames in one or both directions in order to find open reading frames, or (c) determine codon usage in the sequence in one or more given reading frames. The results are given in table format and a restriction map is generated. The modem program permits collection of large amounts of data from a sequence library into a permanent file on the microcomputer disc system, or transfer of laboratory data in the reverse direction to a remote computer system.
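
The three analyses listed in this abstract (site search, reading both strands, and codon usage per frame) are simple to illustrate. The following is a minimal modern sketch, not a reconstruction of the original microcomputer programs; the function names are hypothetical, and the input is assumed to be a plain DNA string over A, C, G and T.

```python
from collections import Counter


def find_sites(seq, site):
    """Return 0-based start positions of every occurrence of site in seq."""
    seq, site = seq.upper(), site.upper()
    positions = []
    start = seq.find(site)
    while start != -1:
        positions.append(start)
        start = seq.find(site, start + 1)  # step by 1 so overlapping hits are kept
    return positions


def reverse_complement(seq):
    """Return the opposite strand, read 5' to 3'."""
    return seq.upper().translate(str.maketrans("ACGT", "TGCA"))[::-1]


def codon_usage(seq, frame=0):
    """Count complete codons in the given reading frame (0, 1 or 2)."""
    seq = seq.upper()
    return Counter(seq[i:i + 3] for i in range(frame, len(seq) - 2, 3))


dna = "GAATTCAAGAATTC"
print(find_sites(dna, "GAATTC"))    # EcoRI recognition sequence: [0, 8]
print(codon_usage("ATGAAATGA", 0))  # ATG, AAA and TGA once each
```

To analyse the other direction, pass `reverse_complement(dna)` to the same functions; the three frames on each strand then give the six reading frames the abstract refers to.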

5.
6.
Protein sequences predicted from metagenomic datasets are annotated by identifying their homologs via sequence comparisons with reference or curated proteins. However, a majority of metagenomic protein sequences are partial-length, arising as a result of identifying genes on sequencing reads or on assembled nucleotide contigs, which themselves are often very fragmented. The fragmented nature of metagenomic protein predictions adversely impacts homology detection and, therefore, the quality of the overall annotation of the dataset. Here we present a novel algorithm called GRASP that accurately identifies the homologs of a given reference protein sequence from a database consisting of partial-length metagenomic proteins. Our homology detection strategy is guided by the reference sequence, and involves the simultaneous search and assembly of overlapping database sequences. GRASP was compared to three commonly used protein sequence search programs (BLASTP, PSI-BLAST and FASTM). Our evaluations using several simulated and real datasets show that GRASP has a significantly higher sensitivity than these programs while maintaining a very high specificity. GRASP can be a very useful program for detecting and quantifying taxonomic and protein family abundances in metagenomic datasets. GRASP is implemented in GNU C++, and is freely available at http://sourceforge.net/projects/grasp-release.

7.

Background  

High quality sequence alignments of RNA and DNA sequences are an important prerequisite for the comparative analysis of genomic sequence data. Nucleic acid sequences, however, exhibit a much larger sequence heterogeneity compared to their encoded protein sequences due to the redundancy of the genetic code. It is desirable, therefore, to make use of the amino acid sequence when aligning coding nucleic acid sequences. In many cases, however, only a part of the sequence of interest is translated. On the other hand, overlapping reading frames may encode multiple alternative proteins, possibly with intermittent non-coding parts. Examples are, in particular, RNA virus genomes.

8.
GenBank.   Cited by: 5 (self-citations: 2; by others: 3)
The GenBank sequence database continues to expand its data coverage, quality control, annotation content and retrieval services. GenBank comprises DNA sequences submitted directly by authors as well as sequences from the other major public databases. An integrated retrieval system, known as Entrez, contains data from GenBank and from the major protein sequence and structural databases, as well as related MEDLINE abstracts. Users may access GenBank over the Internet through the World Wide Web and through special client-server programs for text and sequence similarity searching. FTP, CD-ROM and e-mail servers are alternate means of access.

9.
GenBank.   Cited by: 8 (self-citations: 3; by others: 5)
The GenBank sequence database continues to expand its data coverage, quality control, annotation content and retrieval services for the scientific community. Besides handling direct submissions of sequence data from authors, GenBank also incorporates DNA sequences from all available public sources; an integrated retrieval system, known as Entrez, also makes available data from the major protein sequence and structural databases, and from U.S. and European patents. MEDLINE abstracts from published articles describing the sequences are also included as an additional source of biological annotation for sequence entries. GenBank supports distribution of the data via FTP, CD-ROM, and e-mail servers. Network server-client programs provide access to an integrated database for literature retrieval and sequence similarity searching.

10.
11.
We investigated protein sequence/structure correlation by constructing a space of protein sequences, based on methods developed previously for constructing a space of protein structures. The space is constructed by using a representation of the amino acids as vectors of 10 property factors that encode almost all of their physical properties. Each sequence is represented by a distribution of overlapping sequence fragments. A distance between any two sequences can be calculated. By attaching a weight to each factor, intersequence distances can be varied. We optimize the correlation between corresponding distances in the sequence and structure spaces. The optimal correlation between the sequence and structure spaces is significantly better than that which results from correlating randomly generated sequences, having the overall composition of the data base, with the structure space. However, sets of randomly generated sequences, each of which approximates the composition of the real sequence it replaces, produce correlations with the structure space that are as good as that observed for the actual protein sequences. A connection is proposed with previous studies of the protein folding code. It is shown that the most important property factors for the correlation of the sequence and structure spaces are related to helix/bend preference, side chain bulk, and beta-structure preference.

12.
GenBank.   Cited by: 2 (self-citations: 0; by others: 2)
The GenBank® sequence database incorporates DNA sequences from all available public sources, primarily through the direct submission of sequence data from individual laboratories and from large-scale sequencing projects. Most submitters use the BankIt (Web) or Sequin programs to format and send sequence data. Data exchange with the EMBL Data Library and the DNA Data Bank of Japan helps ensure comprehensive worldwide coverage. GenBank data is accessible through NCBI's integrated retrieval system, Entrez, which integrates data from the major DNA and protein sequence databases along with taxonomy, genome and protein structure information. MEDLINE® abstracts from published articles describing the sequences are included as an additional source of biological annotation through the PubMed search system. Sequence similarity searching is offered through the BLAST series of database search programs. In addition to FTP, e-mail, and server/client versions of Entrez and BLAST, NCBI offers a wide range of World Wide Web retrieval and analysis services based on GenBank data. The GenBank database and related resources are freely accessible via the URL: http://www.ncbi.nlm.nih.gov

13.
Two data structures designated Fragment and Construct are described. The Fragment data structure defines a continuous nucleic acid sequence from a unique genetic origin. The Construct defines a continuous sequence composed of sequences from multiple genetic origins. These data structures are manipulated by a set of software tools to simulate the construction of mosaic recombinant DNA molecules. They are also used as an interface between sequence data banks and analytical programs.

14.
GenBank.   Cited by: 3 (self-citations: 1; by others: 2)
The GenBank® sequence database (http://www.ncbi.nlm.nih.gov/) incorporates DNA sequences from all available public sources, primarily through the direct submission of sequence data from individual laboratories and from large-scale sequencing projects. Most submitters use the BankIt (WWW) or Sequin programs to send their sequence data. Data exchange with the EMBL Data Library and the DNA Data Bank of Japan helps ensure comprehensive worldwide coverage. GenBank data is accessible through NCBI's integrated retrieval system, Entrez, which integrates data from the major DNA and protein sequence databases along with taxonomy, genome and protein structure information. MEDLINE® abstracts from published articles describing the sequences are also included as an additional source of biological annotation. Sequence similarity searching is offered through the BLAST series of database search programs. In addition to FTP, e-mail and server/client versions of Entrez and BLAST, NCBI offers a wide range of World Wide Web retrieval and analysis services of interest to biologists.

15.
Purifying and directional selection in overlapping prokaryotic genes   Cited by: 4 (self-citations: 0; by others: 4)
In overlapping genes, the same DNA sequence codes for two proteins using different reading frames. Analysis of overlapping genes can help in understanding the mode of evolution of a coding region from noncoding DNA. We identified 71 pairs of convergent genes, with overlapping 3' ends longer than 15 nucleotides, that are conserved in at least two prokaryotic genomes. Among the overlap regions, we observed a statistically significant bias towards the 123:132 phase (i.e. the second codon base in one gene facing the degenerate third position in the second gene). This phase ensures the least mutual constraint on nonconservative amino acid replacements in both overlapping coding sequences. The excess of this phase is compatible with directional (positive) selection acting on the overlapping coding regions. This could be a general evolutionary mode for genes emerging from noncoding sequences, in which the protein sequence has not been subject to selection.
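
The phase notation in this abstract can be made concrete with a little modular arithmetic. The sketch below is an assumption-laden illustration, not the authors' code: it reads "123:132" as "codon positions 1, 2, 3 of the forward-strand gene face positions 1, 3, 2 of the reverse-strand gene", which matches the abstract's parenthetical, and the function name `overlap_phase` and its coordinate convention are hypothetical.

```python
def overlap_phase(start_fwd, start_rev):
    """Return the phase label for two convergent (opposite-strand) genes.

    start_fwd: forward-strand coordinate of the first base of any codon
               of the forward-strand gene.
    start_rev: forward-strand coordinate of the first base of any codon
               of the reverse-strand gene (that gene is read toward
               decreasing coordinates).
    """
    facing = []
    for offset in range(3):
        x = start_fwd + offset  # forward gene codon position offset + 1
        # Codon position of the reverse-strand gene at the same coordinate.
        facing.append((start_rev - x) % 3 + 1)
    return "123:" + "".join(str(p) for p in facing)


print(overlap_phase(0, 300))  # 123:132
print(overlap_phase(0, 301))  # 123:213
print(overlap_phase(0, 302))  # 123:321
```

Only three phases are possible for a convergent pair, so a bias toward one of them, as reported above, can be tested against a uniform expectation of one third each.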

16.
Exon discovery by genomic sequence alignment   Cited by: 5 (self-citations: 0; by others: 5)
MOTIVATION: During evolution, functional regions in genomic sequences tend to be more highly conserved than randomly mutating 'junk DNA', so local sequence similarity often indicates biological functionality. This fact can be used to identify functional elements in large eukaryotic DNA sequences by cross-species sequence comparison. In recent years, several gene-prediction methods have been proposed that work by comparing anonymous genomic sequences, for example from human and mouse. The main advantage of these methods is that they are based on simple and generally applicable measures of (local) sequence similarity; unlike standard gene-finding approaches they do not depend on species-specific training data or on the presence of cognate genes in databases. Like all comparative sequence-analysis methods, the new comparative gene-finding approaches critically rely on the quality of the underlying sequence alignments. RESULTS: Herein, we describe a new implementation of the sequence-alignment program DIALIGN that has been developed for alignment of large genomic sequences. We compare our method to the alignment programs PipMaker, WABA and BLAST and we show that local similarities identified by these programs are highly correlated to protein-coding regions. In our test runs, PipMaker was the most sensitive method while DIALIGN was most specific. AVAILABILITY: The program is downloadable from the DIALIGN home page at http://bibiserv.techfak.uni-bielefeld.de/dialign/.

17.
18.
A set of programs was developed for searching nucleic acid and protein sequence databases for sequences similar to a given sequence. The programs, written in FORTRAN 77, were optimized for vector processing on a Hitachi S810-20 supercomputer. A search of a 500-residue protein sequence against the entire PIR database Ver. 1.0 (1) (0.5M residues) is carried out in a CPU time of 45 sec. About 4 min is required for an exhaustive search of a 1500-base nucleotide sequence against all mammalian sequences (1.2M bases) in GenBank Ver. 29.0. The CPU time is reduced to about a quarter with a faster version.

19.
The presence of heterozygous indels in a DNA sequence usually results in the sequence being discarded. If the sequence trace is of high enough quality, however, it will contain enough information to reconstruct the two constituent sequences with very little ambiguity. Solutions already exist using comparisons with a known reference sequence, but this is often unavailable for nonmodel organisms or novel DNA regions. I present a program which determines the sizes and positions of heterozygous indels in a DNA sequence and reconstructs the two constituent haploid sequences. No external data such as a reference sequence or other prior knowledge are required. Simulation suggests an accuracy of >99% from a single read, with errors being eliminable by the inclusion of a second sequencing read, such as one using a reverse primer. Diploid sequences can be fully reconstructed across any number of heterozygous indels, with two overlapping sequencing reads almost always sufficient to infer the entire DNA sequence. This eliminates the need for costly and laborious cloning, and allows data to be used which would otherwise be discarded. With no more laboratory work than is needed to produce two normal sequencing reads, two aligned haploid sequences can be produced quickly and accurately and with extensive phasing information.

20.
A computer program package for the storage, modification, and comparison of restriction maps is described. The programs are intended to detect overlaps between relatively short (about 10-40 kb; abbreviations ref.2) maps and to merge the overlapping fragments into large restriction maps. They run on a 16-bit microcomputer with limited memory and addressing capability. Because restriction maps are less reliable than DNA sequence data, a particular storage method was used. The source code of the programs is freely available (+).
