Similar literature
1.
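Objective: To find a simple, economical method for resolving repeat sequences in genome sequence assembly. Methods: The plasmid NDM-BTR, whose assembly was hampered by repeat sequences, was selected. Primers were designed at both ends of the contigs associated with the repeats, real-time quantitative PCR was performed, and the positional relationships between contigs were inferred from the threshold cycle numbers. Results: The positional relationships between the plasmid contigs were determined successfully, yielding a finished map of the plasmid genome. Conclusion: Real-time quantitative PCR can be used to resolve repeat sequences in genome assembly and is simpler, faster, and cheaper than the traditional approach of constructing large-insert libraries.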

2.
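The MISA (MicroSatellite) software was used to analyze microsatellite loci in the assembled transcriptome sequences of the saxifrage 山地虎耳草, providing candidate sequences for later SSR marker development and genetic diversity assessment. Among the 63,763 assembled Unigene sequences, 4,622 SSRs were found, a frequency of 7.25%, comprising 110 repeat motif types, with one SSR locus per 10.00 kb on average. SSRs in the transcriptome were dominated by trinucleotide repeats (55.50%), followed by dinucleotide repeats (30.23%). The dominant motifs among dinucleotide and trinucleotide repeats were AG/TC and AAG/TTC, respectively. Dinucleotide motifs showed the greatest number of repeat-count types and the widest range, and hence higher polymorphism, followed by trinucleotide motifs, whereas tetra-, penta-, and hexanucleotide repeat types were rare. Most SSRs had 5 to 9 repeats, and the number of SSRs decreased as the repeat count increased. Motif length was concentrated in the 12 to 30 bp range, and polymorphism was moderate or higher throughout.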
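As a rough illustration of what perfect-microsatellite detection involves, the sketch below scans a sequence for tandemly repeated motifs of 2 to 6 bases with a regular expression. It is not MISA and does not reproduce its behaviour; the function name `find_ssrs` and the minimum repeat counts in `MIN_REPEATS` are hypothetical choices made for this example, not the settings used in the study.

```python
import re

# Hypothetical thresholds (motif length -> minimum number of repeats).
# These are illustrative assumptions, not the parameters used in the study.
MIN_REPEATS = {2: 6, 3: 5, 4: 5, 5: 5, 6: 5}


def find_ssrs(seq, min_repeats=MIN_REPEATS):
    """Return (start, motif, repeat_count) for each perfect SSR found in seq."""
    seq = seq.upper()
    hits = []
    for unit_len, min_rep in sorted(min_repeats.items()):
        # ([ACGT]{k}) captures one motif; \1{m,} demands at least m more copies.
        pattern = re.compile(r"([ACGT]{%d})\1{%d,}" % (unit_len, min_rep - 1))
        for match in pattern.finditer(seq):
            motif = match.group(1)
            # Skip motifs that are repeats of a shorter unit (e.g. "AA", "ATAT").
            if any(motif == motif[:d] * (unit_len // d)
                   for d in range(1, unit_len) if unit_len % d == 0):
                continue
            hits.append((match.start(), motif, len(match.group(0)) // unit_len))
    return hits


example = "TTT" + "AG" * 7 + "CCC" + "AAG" * 5 + "T"
print(find_ssrs(example))
```

With these assumed thresholds, the example sequence contains one dinucleotide SSR (AG repeated 7 times) and one trinucleotide SSR (AAG repeated 5 times). Real tools additionally handle compound and imperfect repeats, which this sketch ignores.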

3.
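Xiao P, Li RH. 《遗传》 (Hereditas (Beijing)), 2011, 33(6): 654-660
Second-generation sequencing and whole-genome diversity comparison are active topics in modern biology and information science, and the analysis of transposable elements has become an important part of comparative genome analysis. At present, the types, numbers, and composition of transposable elements are generally mined and analyzed from fully assembled whole-genome sequences. Post-processing and assembly of the massive short reads that precede this step remain a blind spot in genome research, and repeat sequences, which consist mainly of transposable elements, inevitably suffer assembly errors or loss during assembly, introducing uncertainty into systematic analysis of transposable elements. This article aims to establish an analysis pipeline and applies it to the transposable element composition (mainly insertion sequences, IS) of a randomly simulated Roche 454 raw read dataset constructed from the whole genome of Microcystis aeruginosa NIES-843. The most reliable scheme was to divide the candidate sequences obtained from nucleic acid probe scanning into three groups and set a separate amino acid detection threshold for each group. With this scheme, cyanobacterial transposable elements accounted for 10.38% of the M. aeruginosa NIES-843 genome and belonged to 14 IS families and 66 IS subfamilies. Compared with the results of two different pipelines based on the fully assembled genome, there were no significant differences in abundance or in family/subfamily composition, and at the sequence level a high proportion of similar sequences overlapped. This confirms that the pipeline is reliable and practical for analyzing transposable elements from raw high-throughput sequencing data.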

4.
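Mitochondrial genome studies are now widespread, and correct assembly and annotation underlie all downstream research. This article describes methods for assembling and annotating mitochondrial genomes, focusing on the Staden Package, and also introduces other commonly used assembly software: ContigExpress, DNAMAN, DNASTAR, BioEdit, and Sequencher. It then covers the annotation of protein-coding genes, rRNAs, tRNAs, and the A+T-rich region using various tools (including DOGMA, MOSAS, MITOS, GOBASE, OGRe, MitoZoa, tRNAscan-SE, ARWEN, BLAST, and MiTFi). Finally, it describes analysis of mitochondrial genome composition with MEGA5, sequence submission with Sequin, and tools for drawing mitochondrial genome maps (CG view, MTviz, and OGDRAW).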

5.
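Biological sequence assembly and its algorithms   Cited by: 1 (self-citations: 0, citations by others: 1)
Biological sequence assembly is an important step in shotgun sequencing. This article introduces biological sequence assembly and some of the basic problems involved in its study, outlines the two main classes of assembly algorithms, analyzes their respective characteristics, and compares them.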

6.
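Species differences and biological functions of tandem repeat sequences   Cited by: 13 (self-citations: 0, citations by others: 13)
Gao Huan, Kong Jie. 《动物学研究》 (Zoological Research), 2005, 26(5): 555-564
Tandem repeats are repeat sequences composed of a core unit of about 1 to 200 bases repeated many times head to tail. They are widespread in the genomes of eukaryotes and some prokaryotes, and they show specificity in species and base composition. At the whole-genome level, the dominant repeat types differ among species. Even within a single repeat type, the different repeat copy classes (such as AT and AC) vary greatly in their genomic occurrence. These repeat types and copy classes also show species and base-composition differences among the chromosomes of a single species and between the coding and non-coding regions of genes. These differences reflect the complexity of the origin and evolution of repeat sequences, which may involve multiple mechanisms and factors and is closely related to biological function. In addition, repeat analysis software and statistical criteria still have open issues concerning algorithms, repeat length, and repeat perfection that need further study. The evolutionary relationships among tandem repeats themselves, their evolutionary status at the whole-genome level, their biological functions in the genome, and the construction and application of repeat sequence databases will be the main topics of future research.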

7.
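Comparison of four commonly used assembly programs for high-throughput sequencing   Cited by: 1 (self-citations: 0, citations by others: 1)
The advent of next-generation sequencing platforms has driven research on assembly algorithms and software for whole-genome shotgun sequencing data. Since 2005, many sequence assemblers for high-throughput sequencing have been developed and continually improved. This article uses the widely adopted assemblers Velvet, ABySS, SOAPdenovo, and CLC Genomics Workbench to assemble the high-throughput sequencing data of phage IME08, which was isolated in the authors' laboratory. It describes the installation, use, and parameter optimization of each program, compares their assembly results, and reports optimized parameters for each assembler as a reference for other researchers.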

8.
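To extend the use of molecular markers in the analysis and identification of oat germplasm, 25,376 EST (expressed sequence tag) sequences from public databases were used to develop and apply oat EST-SSR functional markers. Assembly and redundancy removal reduced the 25,376 ESTs to 11,618 sequences, from which 556 EST sequences containing SSRs with different repeat motifs, higher repeat counts, and greater length were selected for primer design. Fifty oat EST-SSR primer pairs were developed, and screening yielded 40 effective pairs. Four of these pairs were used for PCR amplification and product sequencing in 5 oat accessions, and the results showed that the band polymorphism was caused by SSR differences. Genetic diversity analysis of 15 hexaploid oat accessions with the 40 EST-SSR primer pairs amplified 89 alleles in total, an average of 2.23 alleles per primer pair. UPGMA cluster analysis showed that the 15 hexaploid accessions formed 3 clusters at a Dice coefficient of 0.93, largely by species and, within a species, by geographic origin. Genome ploidy identification of 31 oat accessions of unclear genetic background with the 40 primer pairs suggested that these accessions may include new tetraploid and diploid oat resources. The EST-SSR functional markers developed in this study should be useful for oat genetic diversity analysis, genetic map construction, and interspecific genome identification within the genus Avena.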

9.
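Comparative analysis of simple sequence repeats between Arabidopsis and rice   Cited by: 3 (self-citations: 0, citations by others: 3)
A set of programs written in Perl and C was used to identify and analyze simple sequence repeats (SSRs), and the distribution of SSRs in Arabidopsis (Arabidopsis thaliana L.) and the relationship between SSRs and genes were analyzed at the whole-genome level. A total of 5,652 SSRs (≥20 bp) were found, about one per 20.6 kb. SSR density was essentially uniform across the Arabidopsis chromosomes. Of the 27,480 Arabidopsis coding sequences, only 677 contained SSRs (725 in total), and most of the trinucleotide SSRs among them corresponded to small hydrophilic amino acids. In genes that are highly conserved between Arabidopsis and chromosome 4 of rice (Oryza sativa L.), the SSRs were not conserved. Comparison of SSRs between the two species led to two inferences. First, SSR density is higher in rice than in Arabidopsis both genome-wide and within genes, which may be one reason the rice genome is larger: 0.21% of the rice genome derives from SSRs, versus only 0.13% in Arabidopsis. Second, not only do the genomes of different species differ in their SSR preferences, but so do their genes. Some nested satellite sequences were found in both rice and Arabidopsis.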

10.
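Ammopiptanthus mongolicus (蒙古沙冬青) is a Tertiary glacial relict species with very strong stress tolerance, which makes it good material for studying the molecular mechanisms of stress tolerance in plants. High-throughput sequencing and bioinformatics tools were used to estimate its genome size and heterozygosity and to make a preliminary identification of SSR molecular markers. The genome is about 812 Mb with a heterozygosity rate of 0.506%, which is relatively high. The genome was assembled and analyzed for SSR markers, and 183,102 SSRs were identified. The repeat types differed greatly in abundance: AT/TA was the most abundant dinucleotide repeat, and tetranucleotide repeats were the rarest, accounting for 1.48% of the total.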

11.
The analysis of repeats in DNA sequences is an important subject in bioinformatics. In this paper, we propose a novel projection-assemble algorithm to find unknown interspersed repeats in DNA sequences. The algorithm uses random projection to obtain a candidate fragment set, then exhaustively searches each pair of fragments in that set for potential linkage and assembles the linked fragments together. The complexity of our projection-assemble algorithm is nearly linear in the length of the genome sequence, and its memory usage is limited by the hardware. We tested our algorithm with both simulated data and real biological data, and the results show that it is efficient. By means of this algorithm, we found an unlabeled repeat region that occurs five times in the Escherichia coli genome, with a length of more than 5,000 bp and a mismatch probability of less than 4%.
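The abstract does not spell out its projection step, so the following is only a generic sketch of the random projection idea, not the authors' algorithm: every window of a fixed length is hashed by the letters at a few randomly chosen positions, and windows that fall into the same bucket become candidate repeat fragments. The function name `projection_buckets` and the parameters `window`, `k`, and `seed` are hypothetical.

```python
import random
from collections import defaultdict


def projection_buckets(seq, window, k, seed=0):
    """Group window start positions by the letters at k random offsets.

    Windows that agree at the sampled offsets share a bucket, so copies of a
    repeat that differ only at unsampled positions are still grouped together.
    """
    rng = random.Random(seed)
    offsets = sorted(rng.sample(range(window), k))  # the random projection
    buckets = defaultdict(list)
    for start in range(len(seq) - window + 1):
        key = "".join(seq[start + off] for off in offsets)
        buckets[key].append(start)
    # Keep only buckets holding at least two windows (candidate repeats).
    return {key: starts for key, starts in buckets.items() if len(starts) > 1}


seq = "ACGTTGCA" + "GGG" + "ACGTTGCA"
for key, starts in projection_buckets(seq, window=8, k=4, seed=1).items():
    print(key, starts)
```

Exact copies always land in the same bucket, while approximate copies do so only when the sampled offsets miss their mismatches, which is why such schemes usually repeat the projection with several random offset sets before a verification step.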

12.
We have shown, in a previous paper, that tandem repeating sequences, especially triplet repeats, play a very important role in gene evolution. This result led to the following hypothesis: most genomic sequences evolved through repeated tandem repeat expansions with subsequent accumulation of changes. To estimate how much of the observed sequence has a repeat origin, we describe the adaptation of a text segmentation algorithm, based on dynamic programming, to the mapping of ancient expansion events. The algorithm maximizes the segmentation cost, calculated as the similarity of the obtained fragments to the putative repeat sequence. In the first application of the algorithm to the segmentation of genomic sequences, a significant difference between the natural sequences and the corresponding shuffled sequences is detected. The natural fragments are longer and more similar to the putative repeat sequences. As our analysis shows, coding sequences allow for repeats only when the size of the repeated words is divisible by three. In contrast, in non-coding sequences, all repeated word sizes are present. It was estimated that, in the Escherichia coli K12 genome, about 35.5% of the sequence can be detectably traced to original simple repeat ancestors. The results shed light on genomic sequence organization and strongly support the hypothesis that triplet expansions play a crucial role in gene origin and evolution.

13.
Structural genomic projects envision almost routine protein structure determinations, which are currently imaginable only for small proteins with molecular weights below 25,000 Da. For larger proteins, structural insight can be obtained by breaking them into small segments of amino acid sequences that can fold into native structures, even when isolated from the rest of the protein. Such segments are autonomously folding units (AFU) and have sizes suitable for fast structural analyses. Here, we propose to expand an intuitive procedure often employed for identifying biologically important domains to an automatic method for detecting putative folded protein fragments. The procedure is based on the recognition that large proteins can be regarded as a combination of independent domains conserved among diverse organisms. We thus have developed a program that reorganizes the output of BLAST searches and detects regions with a large number of similar sequences. To automate the detection process, it is reduced to a simple geometrical problem of recognizing rectangular-shaped elevations in a graph that plots the number of similar sequences at each residue of a query sequence. We used our program to quantitatively corroborate the premise that segments with conserved sequences correspond to domains that fold into native structures. We applied our program to a test data set composed of 99 amino acid sequences containing 150 segments with structures listed in the Protein Data Bank, and thus known to fold into native structures. Overall, the fragments identified by our program have an almost 50% probability of forming a native structure, and comparable results are observed with sequences containing domain linkers classified in SCOP. Furthermore, we verified that our program identifies AFU in libraries from various organisms, and we found a significant number of AFU candidates for structural analysis, covering an estimated 5 to 20% of the genomic databases. Altogether, these results argue that methods based on sequence similarity can be useful for dissecting large proteins into small autonomously folding domains, and such methods may provide an efficient support to structural genomics projects.

14.
A census of protein repeats.   Cited by: 20 (self-citations: 0, citations by others: 20)
In this study, we analyzed all known protein sequences for repeating amino acid segments. Although duplicated sequence segments occur in 14% of all proteins, eukaryotic proteins are three times more likely to have internal repeats than prokaryotic proteins. After clustering the repetitive sequence segments into families, we find repeats from eukaryotic proteins have little similarity with prokaryotic repeats, suggesting most repeats arose after the prokaryotic and eukaryotic lineages diverged. Consequently, protein classes with the highest incidence of repetitive sequences perform functions unique to eukaryotes. The frequency distribution of the repeating units shows only weak length dependence, implicating recombination rather than duplex melting or DNA hairpin formation as the limiting mechanism underlying repeat formation. The mechanism favors additional repeats once an initial duplication has been incorporated. Finally, we show that repetitive sequences containing small and relatively water-soluble residues are favored. We propose that error-prone repeat expansion allows repetitive proteins to evolve more quickly than non-repeat-containing proteins.

15.
Let A denote an alphabet consisting of n types of letters. Given a sequence S of length L with $v_i$ letters of type i on A, to describe the compositional properties and combinatorial structure of S, we propose a new complexity function of S, called the reciprocal complexity of S, as $C(S) = \prod_{i=1}^{n} \left( \frac{L}{n v_i} \right)^{v_i}$. Based on this complexity measure, an efficient algorithm is developed for classifying and analyzing simple segments of protein and nucleotide sequence databases associated with scoring schemes. The running time of the algorithm is nearly proportional to the sequence length. The program DSR corresponding to the algorithm was written in C++, associated with two parameters (window length and cutoff value) and a scoring matrix. Some examples regarding protein sequences illustrate how the method can be used to find regions. The first application of DSR is the masking of simple sequences for searching databases. Queries masked by DSR returned a manageable set of hits below the E-value cutoff score, which contained all true positive homologues. The second application is to study simple regions detected by the DSR program corresponding to known structural features of proteins. An extensive computational analysis has been made of protein sequences with known, physicochemically defined nonglobular segments. For the SWISS-PROT amino acid sequence database (Release 40.2 of 02-Nov-2001), we determine that the best parameters and the best BLOSUM matrix are, respectively, for automatic segmentation of amino acid sequences into nonglobular and globular regions by the DSR program: Window length k = 35, cutoff value b = 0.46, and the BLOSUM 62.5 matrix. The average "agreement accuracy (sensitivity)" of DSR segmentation for the SWISS-PROT database is 97.3%.
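The reciprocal complexity defined in the abstract above can be evaluated directly. The sketch below is a minimal Python version, not the DSR program: it ignores the scoring matrix, window, and cutoff, and it assumes that a letter absent from the sequence (count 0) contributes a factor of 1. The function name `reciprocal_complexity` is hypothetical.

```python
import math
from collections import Counter


def reciprocal_complexity(seq, alphabet):
    """Compute C(S) = prod_i (L / (n * v_i)) ** v_i via logarithms.

    L is the sequence length, n the alphabet size, and v_i the number of
    occurrences of letter i in seq.
    """
    n = len(alphabet)
    length = len(seq)
    counts = Counter(seq)
    log_c = 0.0
    for letter in alphabet:
        v = counts.get(letter, 0)
        if v > 0:  # assumption: letters with v_i = 0 contribute a factor of 1
            log_c += v * math.log(length / (n * v))
    return math.exp(log_c)


print(reciprocal_complexity("ACGT", "ACGT"))  # uniform composition
print(reciprocal_complexity("AAAA", "ACGT"))  # single-letter run
```

Under this definition a sequence with perfectly uniform composition scores 1, and the value shrinks as the composition becomes more biased, so low values flag compositionally simple segments.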

16.
LTR retrotransposons constitute one of the most abundant classes of repetitive elements in eukaryotic genomes. In this paper, we present a new algorithm for detection of full-length LTR retrotransposons in genomic sequences. The algorithm identifies regions in a genomic sequence that show structural characteristics of LTR retrotransposons. Three key components distinguish our algorithm from that of current software: (i) a novel method that preprocesses the entire genomic sequence in linear time and produces high-quality pairs of LTR candidates in run-time that is constant per pair, (ii) a thorough alignment-based evaluation of candidate pairs to ensure high-quality prediction, and (iii) a robust parameter set encompassing both structural constraints and quality controls, providing users with a high degree of flexibility. We implemented our algorithm in a software program called LTR_par, which can be run on both serial and parallel computers. Validation of our software against the yeast genome indicates superior results in both quality and performance when compared to existing software. Additional validations are presented on rice BACs and the chimpanzee genome.

17.
We study the problem of approximate non-tandem repeat extraction. Given a long subject string S of length N over a finite alphabet $\Sigma$ and a threshold D, we would like to find all short substrings of S of length P that repeat with at most D differences, i.e., insertions, deletions, and mismatches. We give a careful theoretical characterization of the set of seeds (i.e., some maximal exact repeats) required by the algorithm, and prove a sublinear bound on their expected numbers. Using this result, we present a sub-quadratic algorithm for finding all short (i.e., of length $O(\log N)$) approximate repeats. The running time of our algorithm is $O(D N^{3\,\mathrm{pow}(\epsilon) - 1} \log N)$, where $\epsilon = D/P$ and $\mathrm{pow}(\epsilon)$ is an increasing, concave function that is 0 when $\epsilon = 0$ and about 0.9 for DNA and protein sequences.

18.
A significant portion (20%) of the Physarum genome can be isolated as a HpaII-resistant, methylated fraction. Cloned DNA probes containing highly repeated sequences derived from this fraction were used to define the pattern of structural organisation of homologous repeats in Physarum genomic DNA. It is shown that the probes detect an abundant, methylated family of sequences with an estimated genomic repetition frequency greater than 2100, derived from a large repeated element whose length exceeds 5.8 kb. Sequences comprising the long repetitive element dominate the HpaII-resistant compartment and account for between 4 and 20% of the Physarum genome. Detailed restriction/hybridisation analysis of cloned DNA segments derived from this compartment shows that HpaII/MspI restriction sites within some copies of the long repeated sequence are probably deleted by mutation. Additionally, segments of the repeat are often found in different organisational patterns that represent scrambled versions of its basic structure, and which are presumed to have arisen as a result of recombinational rearrangement in situ in the Physarum genome. Preliminary experiments indicate that the sequences are transcribed and that the structural properties of the repeat bear some resemblance to those of transposable genetic elements defined in other eukaryotic species.

19.
In the study of genome rearrangement, block-interchanges have been proposed recently as a new kind of global rearrangement event that affects a genome by swapping two nonintersecting segments of any length. The so-called block-interchange distance problem, which is equivalent to the sorting-by-block-interchange problem, is to find a minimum series of block-interchanges for transforming one chromosome into another. In this paper, we study this problem for circular chromosomes and propose an $O(\delta n)$ time algorithm for solving it by making use of permutation groups in algebra, where n is the length of the circular chromosome and $\delta$ is the minimum number of block-interchanges required for the transformation, which can be calculated in $O(n)$ time in advance. Moreover, we obtain analogous results by extending our algorithm to linear chromosomes. Finally, we have implemented our algorithm and applied it to the circular genomic sequences of three human vibrio pathogens to predict their evolutionary relationships. Our experimental results coincide with previous ones obtained by others using a different comparative genomics approach, which implies that block-interchange events seem to play a significant role in the evolution of vibrio species.

20.