Despite considerable excitement over the potential functional significance of copy-number variants (CNVs), we still lack knowledge of the fine-scale architecture of the large majority of CNV regions in the human genome. In this study, we used a high-resolution array-based comparative genomic hybridization (aCGH) platform that targeted known CNV regions of the human genome at approximately 1 kb resolution to interrogate the genomic DNAs of 30 individuals from four HapMap populations. Our results revealed that 1020 of 1153 CNV loci (88%) were actually smaller in size than what is recorded in the Database of Genomic Variants based on previously published studies. A reduction in size of more than 50% was observed for 876 CNV regions (76%). We conclude that the total genomic content of currently known common human CNVs is likely smaller than previously thought. In addition, approximately 8% of the CNV regions observed in multiple individuals exhibited genomic architectural complexity in the form of smaller CNVs within larger ones and CNVs with interindividual variation in breakpoints. Future association studies that aim to capture the potential influences of CNVs on disease phenotypes will need to consider how to best ascertain this previously uncharacterized complexity.

Segmental duplications and copy-number variation in the human genome   Cited by: 33 (self-citations: 0; citations by others: 33)
The human genome contains numerous blocks of highly homologous duplicated sequence. This higher-order architecture provides a substrate for recombination and recurrent chromosomal rearrangement associated with genomic disease. However, an assessment of the role of segmental duplications in normal variation has not yet been made. On the basis of the duplication architecture of the human genome, we defined a set of 130 potential rearrangement hotspots and constructed a targeted bacterial artificial chromosome (BAC) microarray (with 2,194 BACs) to assess copy-number variation in these regions by array comparative genomic hybridization. Using our segmental duplication BAC microarray, we screened a panel of 47 normal individuals, who represented populations from four continents, and we identified 119 regions of copy-number polymorphism (CNP), 73 of which were previously unreported. We observed an equal frequency of duplications and deletions, as well as a 4-fold enrichment of CNPs within hotspot regions, compared with control BACs (P < .000001), which suggests that segmental duplications are a major catalyst of large-scale variation in the human genome. Importantly, segmental duplications themselves were also significantly enriched >4-fold within regions of CNP. Almost without exception, CNPs were not confined to a single population, suggesting that these either are recurrent events, having occurred independently in multiple founders, or were present in early human populations. Our study demonstrates that segmental duplications define hotspots of chromosomal rearrangement, likely acting as mediators of normal variation as well as genomic disease, and it suggests that the consideration of genomic architecture can significantly improve the ascertainment of large-scale rearrangements. 
Our specialized segmental duplication BAC microarray and associated database of structural polymorphisms will provide an important resource for the future characterization of human genomic disorders.

The human UGT2B17 gene varies in copy number from zero to two per individual and also differs in mean number between populations from Africa, Europe, and East Asia. We show that such a high degree of geographical variation is unusual and investigate its evolutionary history. This required first reinterpreting the reference sequence in this region of the genome, which is misassembled from the two different alleles separated by an artifactual gap. A corrected assembly identifies the polymorphism as a 117 kb deletion arising by nonallelic homologous recombination between ~4.9 kb segmental duplications and allows the deletion breakpoint to be identified. We resequenced ~12 kb of DNA spanning the breakpoint in 91 humans from three HapMap and one extended HapMap populations and one chimpanzee. Diversity was unusually high and the time to the most recent common ancestor was estimated at ~2.4 or ~3.0 million years by two different methods, with evidence of balancing selection in Europe. In contrast, diversity was low in East Asia where a single haplotype predominated, suggesting positive selection for the deletion in this part of the world.

ABSTRACT: BACKGROUND: Eimeria is a genus of parasites in the same phylum (Apicomplexa) as human parasites such as Toxoplasma, Cryptosporidium and the malaria parasite Plasmodium. As an apicomplexan whose life-cycle involves a single host, Eimeria is a convenient model for understanding this group of organisms. Although the genomes of the Apicomplexa are diverse, that of Eimeria is unique in being composed of large alternating blocks of sequence with very different characteristics - an arrangement seen in no other organism. This arrangement has impeded efforts to fully sequence the genome of Eimeria, which remains the last of the major apicomplexans to be fully analyzed. In order to increase the value of the genome sequence data and aid in the effort to gain a better understanding of the Eimeria tenella genome, we constructed a whole genome map for the parasite. RESULTS: A total of 1245 contigs representing 70.0% of the whole genome assembly sequences (Wellcome Trust Sanger Institute) were selected and subjected to marker selection. Subsequently, 2482 HAPPY markers were developed and typed. Of these, 795 were considered as usable markers, and utilized in the construction of a HAPPY map. Markers developed from chromosomally-assigned genes were then integrated into the HAPPY map and this aided the assignment of a number of linkage groups to their respective chromosomes. BAC-end sequences and contigs from whole genome sequencing were also integrated to improve and validate the HAPPY map. This resulted in an integrated HAPPY map consisting of 60 linkage groups that covers approximately half of the estimated 60 Mb genome. Further analysis suggests that the segmental organization first seen in Chromosome 1 is present throughout the genome, with repeat-poor (P) regions alternating with repeat-rich (R) regions. Evidence of copy-number variation between strains was also uncovered. 
CONCLUSIONS: This paper describes the application of a whole genome mapping method to improve the assembly of the genome of E. tenella from shotgun data, and to help reveal its overall structure. A preliminary assessment of copy-number variation (extra or missing copies of genomic segments) between strains of E. tenella was also carried out. The emerging picture is of a very unusual genome architecture displaying inter-strain copy-number variation. We suggest that these features may be related to the known ability of this parasite to rapidly develop drug resistance.



Tandem repeat variation in protein-coding regions will alter protein length and may introduce frameshifts. Tandem repeat variants are associated with variation in pathogenicity in bacteria and with human disease. We characterized tandem repeat polymorphism in human proteins, using the UniGene database, and tested whether these were associated with host defense roles.


Protein-coding tandem repeat copy-number polymorphisms were detected in 249 tandem repeats found in 218 UniGene clusters; observed length differences ranged from 2 to 144 nucleotides, with unit copy lengths ranging from 2 to 57. This corresponded to 1.59% (218/13,749) of proteins investigated carrying detectable polymorphisms in the copy-number of protein-coding tandem repeats. We found no evidence that tandem repeat copy-number polymorphism was significantly elevated in defense-response proteins (p = 0.882). An association with the Gene Ontology term 'protein-binding' remained significant after covariate adjustment and correction for multiple testing. Combining this analysis with previous experimental evaluations of tandem repeat polymorphism, we estimate the approximate mean frequency of tandem repeat polymorphisms in human proteins to be 6%. Because 13.9% of the polymorphisms were not a multiple of three nucleotides, up to 1% of proteins may contain frameshifting tandem repeat polymorphisms.


Around 1 in 20 human proteins are likely to contain tandem repeat copy-number polymorphisms within coding regions. Such polymorphisms are not more frequent among defense-response proteins; their prevalence among protein-binding proteins may reflect lower selective constraints on their structural modification. The impact of frameshifting and longer copy-number variants on protein function and disease merits further investigation.
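
The percentages quoted in this abstract can be re-derived from the counts it reports. The short Python sketch below repeats that arithmetic. The variable names are illustrative and not taken from the study, and the last step assumes that the 13.9% share of non-triplet length differences seen in the UniGene screen carries over unchanged to the estimated 6% overall frequency.

```python
# Re-derive the percentages quoted in the abstract from its reported counts.
# Variable names are illustrative; only the numbers come from the text.

polymorphic_clusters = 218       # UniGene clusters with a repeat polymorphism
proteins_investigated = 13_749   # proteins screened in the study

# Fraction of investigated proteins with a detectable polymorphism.
detected_fraction = polymorphic_clusters / proteins_investigated

estimated_frequency = 0.06       # combined mean frequency quoted in the abstract
non_triplet_share = 0.139        # polymorphisms not a multiple of three nucleotides

# Assumption: the non-triplet share applies equally to the 6% estimate.
frameshift_fraction = estimated_frequency * non_triplet_share

print(f"detected in UniGene screen: {detected_fraction:.2%}")
print(f"potentially frameshifting:  {frameshift_fraction:.2%}")
```

Under these assumptions the detected fraction comes out at about 1.59% and the potentially frameshifting fraction at about 0.83%, which is consistent with the abstract's "up to 1%".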



Extensive copy-number variation of the human olfactory receptor gene family   Cited by: 3 (self-citations: 0; citations by others: 3)
As much as a quarter of the human genome has been reported to vary in copy number between individuals, including regions containing about half of the members of the olfactory receptor (OR) gene family. We have undertaken a detailed study of copy-number variation of ORs to elucidate the selective and mechanistic forces acting on this gene family and the true impact of copy-number variation on human OR repertoires. We argue that the properties of copy-number variants (CNVs) and other sets of large genomic regions violate the assumptions of statistical methods that are commonly used in the assessment of gene enrichment. Using more appropriate methods, we provide evidence that OR enrichment in CNVs is not due to positive selection but is because of OR preponderance in segmentally duplicated regions, which are known to be frequently copy-number variable, and because purifying selection against CNVs is lower in OR-containing regions than in regions containing essential genes. We also combine multiplex ligation-dependent probe amplification (MLPA) and PCR to assay the copy numbers of 37 candidate CNV ORs in a panel of ~50 human individuals. We confirm copy-number variation of 18 ORs but find no variation in this human-diversity panel for 16 other ORs, highlighting the caveat that reported intervals often overrepresent true CNVs. The copy-number variation we describe is likely to underpin significant variation in olfactory abilities among human individuals. Finally, we show that both homology-based and homology-independent processes have played a recent role in remodeling the OR family.

Genomic DNA copy-number alterations (CNAs) are associated with complex diseases, including cancer: CNAs are indeed related to tumoral grade, metastasis, and patient survival. CNAs discovered from array-based comparative genomic hybridization (aCGH) data have been instrumental in identifying disease-related genes and potential therapeutic targets. To be immediately useful in both clinical and basic research scenarios, aCGH data analysis requires accurate methods that do not impose unrealistic biological assumptions and that provide direct answers to the key question, "What is the probability that this gene/region has CNAs?" Current approaches fail, however, to meet these requirements. Here, we introduce reversible jump aCGH (RJaCGH), a new method for identifying CNAs from aCGH; we use a nonhomogeneous hidden Markov model fitted via reversible jump Markov chain Monte Carlo; and we incorporate model uncertainty through Bayesian model averaging. RJaCGH provides an estimate of the probability that a gene/region has CNAs while incorporating interprobe distance and the capability to analyze data on a chromosome or genome-wide basis. RJaCGH outperforms alternative methods, and the performance difference is even larger with noisy data and highly variable interprobe distance, both commonly found features in aCGH data. Furthermore, our probabilistic method allows us to identify minimal common regions of CNAs among samples and can be extended to incorporate expression data. In summary, we provide a rigorous statistical framework for locating genes and chromosomal regions with CNAs with potential applications to cancer and other complex human diseases.

Olfactory receptors (ORs), which are involved in odorant recognition, form the largest mammalian protein superfamily. The genomic content of OR genes is considerably reduced in humans, as reflected by the relatively small repertoire size and the high fraction (approximately 55%) of human pseudogenes. Since several recent low-resolution surveys suggested that OR genomic loci are frequently affected by copy-number variants (CNVs), we hypothesized that CNVs may play an important role in the evolution of the human olfactory repertoire. We used high-resolution oligonucleotide tiling microarrays to detect CNVs across 851 OR gene and pseudogene loci. Examining genomic DNA from 25 individuals with ancestry from three populations, we identified 93 OR gene loci and 151 pseudogene loci affected by CNVs, generating a mosaic of OR dosages across persons. Our data suggest that approximately 50% of the CNVs involve more than one OR, with the largest CNV spanning 11 loci. In contrast to earlier reports, we observe that CNVs are more frequent among OR pseudogenes than among intact genes, presumably due to both selective constraints and CNV formation biases. Furthermore, our results show an enrichment of CNVs among ORs with a close human paralog or lacking a one-to-one ortholog in chimpanzee. Interestingly, among the latter we observed an enrichment in CNV losses over gains, a finding potentially related to the known diminution of the human OR repertoire. Quantitative PCR experiments performed for 122 sampled ORs agreed well with the microarray results and uncovered 23 additional CNVs. Importantly, these experiments allowed us to uncover nine common deletion alleles that affect 15 OR genes and five pseudogenes. Comparison to the chimpanzee reference genome revealed that all of the deletion alleles are human derived, therefore indicating a profound effect of human-specific deletions on the individual OR gene content. 
Furthermore, these deletion alleles may be used in future genetic association studies of olfactory inter-individual differences.

To explore the genetic contribution to autistic spectrum disorders (ASDs), we have studied genomic copy-number variation in a large cohort of families with a single affected child and at least one unaffected sibling. We confirm a major contribution from de novo deletions and duplications but also find evidence of a role for inherited "ultrarare" duplications. Our results show that, relative to males, females have greater resistance to autism from genetic causes, which raises the question of the fate of female carriers. By analysis of the proportion and number of recurrent loci, we set a lower bound for distinct target loci at several hundred. We find many new candidate regions, adding substantially to the list of potential gene targets, and confirm several loci previously observed. The functions of the genes in the regions of de novo variation point to a great diversity of genetic causes but also suggest functional convergence.

Genomic sequences obtained through high-throughput sequencing are not uniformly distributed across the genome. For example, sequencing data of total genomic DNA show significant, yet unexpected, enrichments on promoters and exons. This systematic bias is a particular problem for techniques such as chromatin immunoprecipitation, where the signal for a target factor is plotted across genomic features. We have focused on data obtained from Illumina's Genome Analyser platform, where at least three factors contribute to sequence bias: GC content, mappability of sequencing reads, and regional biases that might be generated by local structure. We show that relying on input control as a normalizer is not generally appropriate due to sample-to-sample variation in bias. To correct sequence bias, we present BEADS (bias elimination algorithm for deep sequencing), a simple three-step normalization scheme that successfully unmasks real binding patterns in ChIP-seq data. We suggest that this procedure be done routinely prior to data interpretation and downstream analyses.

Recent studies have extensively examined the large-scale genetic variants in the human genome known as copy-number variations (CNVs), and the universality of CNVs in normal individuals, along with their functional importance, has been increasingly recognized. However, the absence of a method to accurately infer alleles or haplotypes within a CNV region from high-throughput experimental data hampers the finer analyses of CNV properties and applications to disease-association studies. Here we developed an algorithm to infer complex haplotypes within a CNV region by using data obtained from high-throughput experimental platforms. We applied this algorithm to experimental data and estimated the population frequencies of haplotypes that can yield information on both sequences and numbers of DNA copies. These results suggested that the analysis of such complex haplotypes is essential for accurately detecting genetic differences within a CNV region between population groups.

Improving detection of foraminifera by cathodoluminescence   Cited by: 1 (self-citations: 0; citations by others: 1)
Cathodoluminescence (CL) studies of Lower–Middle Oxfordian marls and limestones, as well as clasts from the uppermost Turonian–?Early Coniacian conglomerates of the Cracow Upland (southern Poland), reveal that the CL view of foraminifers from some lithologies differs from that in transmitted light. In particular, the CL technique revealed abundant tests of the planktonic species Globuligerina oxfordiana in the Middle Oxfordian glauconitic marls, which under transmitted light are either poorly visible or remain completely undetected. Bright red–orange luminescence characterizes originally hyaline aragonitic tests of G. oxfordiana, but also several calcitic benthic species, in spite of their different taxonomic position and original test structure and mineralogy. In sponge microbial boundstones, foraminifers generally do not show CL emission, or show only weak luminescence. Similarly, Late Cretaceous foraminifera, represented mostly by planktonic taxa, were detected or their view was clearly improved under CL only in some clasts from the uppermost Turonian–?Early Coniacian conglomerates filling karstic cavities. In other clasts, foraminifera are clearly visible only under normal transmitted light; the luminescence signature is therefore highly spatially variable. These results indicate a strong influence of lithology and diagenesis and rather minor effects of shell structure on luminescence of microfossils. The CL technique can be a useful tool in the detection and documentation of abundance patterns of foraminifers that are poorly preserved under transmitted light.

Ectomycorrhizal (ECM) fungi associated with plants constitute one of the most successful symbiotic interactions in forest ecosystems. ECM support trophic exchanges with host plants and are important factors for the survival and stress resilience of trees. However, ECM clades often harbour morpho-species and cryptic lineages, with weak morphological differentiation. How this relates to intraspecific genome variability and ecological functioning is poorly known. Here, we analysed 16 European isolates of the ascomycete Cenococcum geophilum, an extremely ubiquitous forest symbiotic fungus with no known sexual or asexual spore-forming structures but with a massively enlarged genome. We carried out whole-genome sequencing to identify single-nucleotide polymorphisms. We found no geographic structure at the European scale but divergent lineages within sampling sites. Evidence for recombination was restricted to specific cryptic lineages. Lineage differentiation was supported by extensive copy-number variation. Finally, we confirmed heterothallism with a single MAT1 idiomorph per genome. Synteny analyses of the MAT1 locus revealed substantial rearrangements and a pseudogene of the opposite MAT1 idiomorph. Our study provides the first evidence for substantial genome-wide structural variation, lineage-specific recombination and low continent-wide genetic differentiation in C. geophilum. Our study provides a foundation for targeted analyses of intra-specific functional variation in this major symbiosis.

Structural genetic variation, including copy-number variation (CNV), constitutes a substantial fraction of total genetic variability and the importance of structural genetic variants in modulating human disease is increasingly being recognized. Early successes in identifying disease-associated CNVs via a candidate gene approach mandate that future disease association studies need to include structural genetic variation. Such analyses should not rely on previously developed methodologies that were designed to evaluate single nucleotide polymorphisms (SNPs). Instead, development of novel technical, statistical, and epidemiologic methods will be necessary to optimally capture this newly-appreciated form of genetic variation in a meaningful manner.

The genomic sequence of the type strain of the opportunistic human pathogen Candida glabrata (CBS138, ATCC 2001) has been available since 2004. This allows the genomic structure of other strains to be analyzed by comparative genomic hybridization. We present here the molecular analysis of a collection of 183 C. glabrata strains isolated from patients hospitalized in France and around the world. We show that the mechanisms of microevolution within this asexual species include rare reciprocal chromosomal translocations and recombination within tandem arrays of repeated genes, and that these account for the frequent size heterogeneity between chromosomes across strains. Gene tandems often encode cell wall proteins, suggesting a possible role in adaptation to the environment.

Scientists estimate seed abundances to calculate seasonal carrying capacities and assess wetland management actions for waterfowl and other wildlife using soil core samples. We evaluated recovery of known quantities of moist-soil seeds from whole and subsampled experimental core samples containing 12 seed taxa representing small, medium, and large size classes. We recovered 86.3% (SE = 1.8) of all seeds added to experimental cores; 8.3% (SE = 1.2) of seeds were destroyed during the sieving process and 5.4% (SE = 1.2) were not recovered by observers. Recovery rates varied by seed size, but not seed quantity or disproportionate ratios of seed-size classes. Overall seed recovery rates were similar between subsampled (x̄ = 81.2%, SE = 3.6) and whole–processed core samples (x̄ = 86.3%, SE = 1.8). We used recovery rates to generate size-specific, taxon-specific, and constant correction factors and applied each to actual core sample data. Size-specific correction factors increased seed mass estimates in the Mississippi Alluvial Valley (x̄ = 10.1%, SE = 0.32), upper Midwest (x̄ = 21.2%, SE = 0.61), and both regions combined (x̄ = 15.7%, SE = 0.51) differently, as seed composition in core samples varied regionally. We suggest scientists consider using size-specific correction factors to account for seed recovery bias in core samples because these factors may be applied to a variety of taxa and produced mass estimates similar to those from taxon-specific correction factors. However, if data from core samples are unavailable at the resolution of seed size classes, we suggest increasing seed mass estimates by 16% to account for seed recovery bias. © 2011 The Wildlife Society.
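
The suggested 16% adjustment is consistent with dividing an observed seed mass by the overall recovery rate of 86.3%. A minimal Python sketch of that arithmetic follows. The function names are hypothetical, and treating the constant correction factor as the reciprocal of the recovery rate is an assumption about how the figure was derived, not something the abstract states.

```python
# Sketch of a constant correction factor for seed recovery bias.
# Assumption: the correction factor is the reciprocal of the recovery rate.
# Function names are hypothetical; only the rates come from the abstract.

def correction_factor(recovery_rate: float) -> float:
    """Multiplier that scales an observed estimate up to the true amount."""
    if not 0.0 < recovery_rate <= 1.0:
        raise ValueError("recovery_rate must be in the interval (0, 1]")
    return 1.0 / recovery_rate


def corrected_mass(observed_mass: float, recovery_rate: float) -> float:
    """Observed seed mass adjusted for seeds lost during processing."""
    return observed_mass * correction_factor(recovery_rate)


recovered = 0.863       # seeds recovered from experimental cores
destroyed = 0.083       # seeds destroyed during sieving
missed = 0.054          # seeds not recovered by observers

# The three reported fates account for all seeds added to the cores.
total_accounted = recovered + destroyed + missed

factor = correction_factor(recovered)
percent_increase = (factor - 1.0) * 100.0

print(f"correction factor: {factor:.3f}")
print(f"increase in mass estimate: {percent_increase:.1f}%")
```

With a recovery rate of 0.863 the factor is roughly 1.159, an increase of just under 16%, so an observed mass of 100 units would be adjusted to about 115.9 under this assumption.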

