
A Statistical Approach Designed for Finding Mathematically Defined Repeats in Shotgun Data and Determining the Length Distribution of Clone-Inserts
Authors: Lan Zhong, Kunlin Zhang, Xiangang Huang, Peixiang Ni, Yujun Han, Kai Wang, Jun Wang, and Songgang Li
Affiliations: 1 College of Life Science, Peking University, Beijing 100871, China; 2 Beijing Genomics Institute/Center of Genomics and Bioinformatics, Chinese Academy of Sciences, Beijing 101300, China

Received: 2003 Jan 13

Zhong L, Zhang K, Huang X, Ni P, Han Y, Wang K, Wang J, Li S. A Statistical Approach Designed for Finding Mathematically Defined Repeats in Shotgun Data and Determining the Length Distribution of Clone-Inserts [J]. Genomics Proteomics & Bioinformatics, 2003, 1(1): 43-51.
Authors: Zhong Lan, Zhang Kunlin, Huang Xiangang, Ni Peixiang, Han Yujun, Wang Kai, Wang Jun, Li Songgang
Institution: College of Life Sciences, Peking University, Beijing 100871, China.
Abstract: The large number of repeats, especially high-copy repeats, in the genomes of higher animals and plants makes whole genome assembly (WGA) difficult. To address this problem, we sought to identify repeats and mask them prior to assembly, even at the genome survey stage. Repeats of different copy numbers have different probabilities of appearing in shotgun data; based on this principle, we constructed a statistical model and derived criteria for mathematically defined repeats (MDRs) at different shotgun coverages. Using these criteria, we developed the software MDRmasker to identify and mask MDRs in shotgun data. With repeats masked prior to assembly, assembly was faster and had a lower error probability. In addition, because clone-insert size affects the accuracy of repeat assembly and scaffold construction, we used our model to design the length distribution of clone-inserts. In our simulated human and rice genomes, the length distributions of repeats differ, so the optimal length distributions of clone-inserts are not the same. Thus, with an optimal length distribution of clone-inserts, a given genome can be assembled better at lower coverage.
Keywords: mathematically defined repeat (MDR); clone-inserts; assembly
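
The principle stated in the abstract, that repeats of different copy number appear in shotgun data with different probabilities, can be illustrated with a small sketch. The abstract does not specify the statistical model used by MDRmasker, so the code below is not the authors' method. It assumes, purely for illustration, that the number of times a sequence word is sampled follows a Poisson distribution with mean equal to copy number times coverage. The function names `poisson_sf`, `repeat_threshold`, and `detection_power`, and the significance level `alpha`, are hypothetical.

```python
import math


def poisson_sf(t, lam):
    """Return P(X >= t) for X ~ Poisson(lam), computed from the exact pmf."""
    cdf = sum(math.exp(-lam) * lam**i / math.factorial(i) for i in range(t))
    return max(0.0, 1.0 - cdf)


def repeat_threshold(coverage, alpha=0.01):
    """Smallest observed count t that a single-copy sequence would reach
    with probability below alpha at the given shotgun coverage.

    Assumption (not from the paper): counts of a single-copy word are
    Poisson distributed with mean equal to the coverage.
    """
    t = 1
    while poisson_sf(t, coverage) >= alpha:
        t += 1
    return t


def detection_power(copy_number, coverage, threshold):
    """Probability that a sequence with the given copy number is observed
    at least `threshold` times, under the same Poisson assumption."""
    return poisson_sf(threshold, copy_number * coverage)


if __name__ == "__main__":
    # Show how the cutoff and the chance of flagging a repeat vary with coverage.
    for coverage in (0.5, 1.0, 2.0, 5.0):
        t = repeat_threshold(coverage)
        powers = [round(detection_power(n, coverage, t), 3) for n in (2, 10, 100)]
        print(f"coverage={coverage}: threshold={t}, power for n=2,10,100: {powers}")
```

Under this simplified assumption, the cutoff rises with coverage, and high-copy repeats are flagged with high probability even at low coverage, which is consistent with the abstract's point that repeats can be identified as early as the genome survey stage.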
This article is indexed in CNKI, Wanfang Data, PubMed, and other databases.