
51.

Zinc finger gene clusters and tandem gene duplication.

Mengxiang Tang, Michael Waterman, Shibu Yooseph 《Journal of computational biology》2002,9(2):429-446

Zinc finger genes in mammalian genomes frequently occur in clusters, with cluster members appearing in a tandem array on the chromosome. It has been suggested that in situ gene duplication events are primarily responsible for the evolution of such clusters. The problem of inferring the series of duplication events responsible for producing clustered families is different from the standard phylogeny problem. In this paper, we study this inference problem using a graph called a duplication model, which captures the series of duplication events while taking into account the observed order of the genes on the chromosome. We provide algorithms to reconstruct a duplication model for a given data set. We use our method to hypothesize the series of duplication events that may have produced the ZNF45 family that appears on human chromosome 19.
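As a rough illustration of the kind of event a duplication model records, the sketch below simulates tandem duplications in the forward direction: a contiguous block of genes is copied and the copy is placed immediately after the original block. This is not the reconstruction algorithm from the paper, which works backward from the observed gene order; the function names and the labelling scheme (`A.1`, `A.2`, ...) are hypothetical and chosen only for this example.

```python
def tandem_duplicate(genes, start, length, event_id):
    """Copy the block genes[start:start+length] and insert the copy right after it.

    Copies are labelled with the event number so that lineage stays visible.
    (Hypothetical labelling scheme, used only for illustration.)
    """
    if length < 1 or start < 0 or start + length > len(genes):
        raise ValueError("block must lie inside the gene array")
    block = genes[start:start + length]
    copy = [f"{g}.{event_id}" for g in block]
    return genes[:start + length] + copy + genes[start + length:]


def apply_events(genes, events):
    """Apply a series of (start, length) tandem duplication events in order."""
    for event_id, (start, length) in enumerate(events, start=1):
        genes = tandem_duplicate(genes, start, length, event_id)
    return genes


# One ancestral gene, a single-gene duplication, then a two-gene block duplication.
family = apply_events(["A"], [(0, 1), (0, 2)])
print(family)
```

The inference problem described in the abstract runs in the opposite direction: given only the final ordered array and sequence similarities, recover a plausible series of such events.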

52.

Microbiome Analysis of Stool Samples from African Americans with Colon Polyps

Hassan Brim, Shibu Yooseph, Erwin G. Zoetendal, Edward Lee, Manolito Torralbo, Adeyinka O. Laiyemo, Babak Shokrani, Karen Nelson, Hassan Ashktorab 《PloS one》2013,8(12)

Background

Colonic polyps are common tumors occurring in ~50% of Western populations with ~10% risk of malignant progression. Dietary agents have been considered the primary environmental exposure to promote colorectal cancer (CRC) development. However, the colonic mucosa is permanently in contact with the microbiota and its metabolic products including toxins that also have the potential to trigger oncogenic transformation.

Aim

To analyze fecal DNA for microbiota composition and functional potential in African Americans with pre-neoplastic lesions.

Materials & Methods

We analyzed the bacterial composition of stool samples from 6 healthy individuals and 6 patients with colon polyps using a 16S ribosomal RNA-based phylogenetic microarray, the Human Intestinal Tract Chip (HITChip), and 16S rRNA gene barcoded 454 pyrosequencing. The functional potential was determined by sequence-based metagenomics using 454 pyrosequencing.

Results

Fecal microbiota profiling of samples from healthy individuals and polyp patients using phylogenetic microarray analysis (HITChip) and barcoded 454 pyrosequencing generated similar results. A distinction between the two sets of samples was obtained only when the analysis was performed at the sub-genus level. Most of the species driving the separation were from the Bacteroides group. The metagenomic analysis did not reveal major differences in bacterial gene prevalence or abundance between the two groups, even when the analysis and comparisons were restricted to available Bacteroides genomes.

Conclusion

This study reveals that, at the pre-neoplastic stage, there is a trend toward microbiota differences between healthy individuals and colon polyp patients at the sub-genus level. These differences were not reflected at the genome/function level. Bacteria and associated functions within the Bacteroides group need to be further analyzed and dissected in a large sample to pinpoint potential actors in early colon oncogenic transformation.

53.

The Sorcerer II Global Ocean Sampling Expedition: Expanding the Universe of Protein Families

Shibu Yooseph et al. 《PLoS biology》2007,5(3):e16

Metagenomics projects based on shotgun sequencing of populations of micro-organisms yield insight into protein families. We used sequence similarity clustering to explore proteins with a comprehensive dataset consisting of sequences from available databases together with 6.12 million proteins predicted from an assembly of 7.7 million Global Ocean Sampling (GOS) sequences. The GOS dataset covers nearly all known prokaryotic protein families. A total of 3,995 medium- and large-sized clusters consisting of only GOS sequences are identified, out of which 1,700 have no detectable homology to known families. The GOS-only clusters contain a higher than expected proportion of sequences of viral origin, thus reflecting a poor sampling of viral diversity until now. Protein domain distributions in the GOS dataset and current protein databases show distinct biases. Several protein domains that were previously categorized as kingdom specific are shown to have GOS examples in other kingdoms. About 6,000 sequences (ORFans) from the literature that heretofore lacked similarity to known proteins have matches in the GOS data. The GOS dataset is also used to improve remote homology detection. Overall, besides nearly doubling the number of current proteins, the predicted GOS proteins also add a great deal of diversity to known protein families and shed light on their evolution. These observations are illustrated using several protein families, including phosphatases, proteases, ultraviolet-irradiation DNA damage repair enzymes, glutamine synthetase, and RuBisCO. The diversity added by GOS data has implications for choosing targets for experimental structure characterization as part of structural genomics efforts. Our analysis indicates that new families are being discovered at a rate that is linear or almost linear with the addition of new sequences, implying that we are still far from discovering all protein families in nature.
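The closing claim about a near-linear discovery rate can be pictured with an accumulation curve: the number of distinct families seen after each added sequence. The sketch below is a minimal illustration that assumes each sequence has already been assigned a cluster (family) label; the labels and function names are hypothetical, and this is not the clustering pipeline used in the study.

```python
def accumulation_curve(cluster_ids):
    """Number of distinct clusters observed after each sequence is added."""
    seen = set()
    curve = []
    for cid in cluster_ids:
        seen.add(cid)
        curve.append(len(seen))
    return curve


def discovery_rate(curve, window):
    """Average number of new clusters per added sequence over the last `window` sequences."""
    if window < 1 or window >= len(curve):
        raise ValueError("window must be between 1 and len(curve) - 1")
    return (curve[-1] - curve[-1 - window]) / window


# Toy labels: the third sequence falls into an already-seen family.
curve = accumulation_curve(["f1", "f2", "f1", "f3"])
print(curve, discovery_rate(curve, window=2))
```

A curve that keeps rising with a roughly constant discovery rate corresponds to the linear regime the abstract describes, whereas a rate that falls toward zero would indicate saturation.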

55.

GRASP: Guided Reference-based Assembly of Short Peptides

Cuncong Zhong, Youngik Yang, Shibu Yooseph 《Nucleic acids research》2015,43(3):e18

Protein sequences predicted from metagenomic datasets are annotated by identifying their homologs via sequence comparisons with reference or curated proteins. However, a majority of metagenomic protein sequences are partial-length, arising as a result of identifying genes on sequencing reads or on assembled nucleotide contigs, which themselves are often very fragmented. The fragmented nature of metagenomic protein predictions adversely impacts homology detection and, therefore, the quality of the overall annotation of the dataset. Here we present a novel algorithm called GRASP that accurately identifies the homologs of a given reference protein sequence from a database consisting of partial-length metagenomic proteins. Our homology detection strategy is guided by the reference sequence, and involves the simultaneous search and assembly of overlapping database sequences. GRASP was compared to three commonly used protein sequence search programs (BLASTP, PSI-BLAST and FASTM). Our evaluations using several simulated and real datasets show that GRASP has a significantly higher sensitivity than these programs while maintaining a very high specificity. GRASP can be a very useful program for detecting and quantifying taxonomic and protein family abundances in metagenomic datasets. GRASP is implemented in GNU C++, and is freely available at http://sourceforge.net/projects/grasp-release.

56.

Stalking the fourth domain in metagenomic data: searching for, discovering, and interpreting novel, deep branches in marker gene phylogenetic trees

Wu D, Wu M, Halpern A, Rusch DB, Yooseph S, Frazier M, Venter JC, Eisen JA 《PloS one》2011,6(3):e18011

Background

Most of our knowledge about the ancient evolutionary history of organisms has been derived from data associated with specific known organisms (i.e., organisms that we can study directly such as plants, metazoans, and culturable microbes). Recently, however, a new source of data for such studies has arrived: DNA sequence data generated directly from environmental samples. Such metagenomic data has enormous potential in a variety of areas including, as we argue here, in studies of very early events in the evolution of gene families and of species.

Methodology/Principal Findings

We designed and implemented new methods for analyzing metagenomic data and used them to search the Global Ocean Sampling (GOS) Expedition data set for novel lineages in three gene families commonly used in phylogenetic studies of known and unknown organisms: small subunit rRNA and the recA and rpoB superfamilies. Though the methods available could not accurately identify very deeply branched ss-rRNAs (largely due to difficulties in making robust sequence alignments for novel rRNA fragments), our analysis revealed the existence of multiple novel branches in the recA and rpoB gene families. Analysis of available sequence data likely from the same genomes as these novel recA and rpoB homologs was then used to further characterize the possible organismal source of the novel sequences.

Conclusions/Significance

Of the novel recA and rpoB homologs identified in the metagenomic data, some likely come from uncharacterized viruses while others may represent ancient paralogs not yet seen in any cultured organism. A third possibility is that some come from novel cellular lineages that are only distantly related to any organisms for which sequence data is currently available. If there exist any major, but so-far-undiscovered, deeply branching lineages in the tree of life, we suggest that methods such as those described herein currently offer the best way to search for them.

57.

Going deeper: metagenome of a hadopelagic microbial community (total citations: 1; self-citations: 0; citations by others: 1)

Eloe EA, Fadrosh DW, Novotny M, Zeigler Allen L, Kim M, Lombardo MJ, Yee-Greenbaum J, Yooseph S, Allen EE, Lasken R, Williamson SJ, Bartlett DH 《PloS one》2011,6(5):e20388

58.

Genomic insights to SAR86, an abundant and uncultivated marine bacterial lineage

Chris L Dupont, Douglas B Rusch, Shibu Yooseph, Mary-Jane Lombardo, R Alexander Richter, Ruben Valas, Mark Novotny, Joyclyn Yee-Greenbaum, Jeremy D Selengut, Dan H Haft, Aaron L Halpern, Roger S Lasken, Kenneth Nealson, Robert Friedman, J Craig Venter 《The ISME journal》2012,6(6):1186-1199

Bacteria in the 16S rRNA clade SAR86 are among the most abundant uncultivated constituents of microbial assemblages in the surface ocean for which little genomic information is currently available. Bioinformatic techniques were used to assemble two nearly complete genomes from marine metagenomes and single-cell sequencing provided two more partial genomes. Recruitment of metagenomic data shows that these SAR86 genomes substantially increase our knowledge of non-photosynthetic bacteria in the surface ocean. Phylogenomic analyses establish SAR86 as a basal and divergent lineage of γ-proteobacteria, and the individual genomes display a temperature-dependent distribution. Modestly sized at 1.25–1.7 Mbp, the SAR86 genomes lack several pathways for amino-acid and vitamin synthesis as well as sulfate reduction, trends commonly observed in other abundant marine microbes. SAR86 appears to be an aerobic chemoheterotroph with the potential for proteorhodopsin-based ATP generation, though the apparent lack of a retinal biosynthesis pathway may require it to scavenge exogenously-derived pigments to utilize proteorhodopsin. The genomes contain an expanded capacity for the degradation of lipids and carbohydrates acquired using a wealth of tonB-dependent outer membrane receptors. Like the abundant planktonic marine bacterial clade SAR11, SAR86 exhibits metabolic streamlining, but also a distinct carbon compound specialization, possibly avoiding competition.

59.

Comparative genome analysis of 19 Ureaplasma urealyticum and Ureaplasma parvum strains

Paralanov V, Lu J, Duffy LB, Crabb DM, Shrivastava S, Methé BA, Inman J, Yooseph S, Xiao L, Cassell GH, Waites KB, Glass JI 《BMC microbiology》2012,12(1):88

Background

Ureaplasma urealyticum (UUR) and Ureaplasma parvum (UPA) are sexually transmitted bacteria among humans implicated in a variety of disease states including, but not limited to, nongonococcal urethritis, infertility, adverse pregnancy outcomes, chorioamnionitis, and bronchopulmonary dysplasia in neonates. There are 10 distinct serotypes of UUR and 4 of UPA. Efforts to determine whether differences in pathogenic potential exist at the ureaplasma serovar level have been hampered by limitations of antibody-based typing methods, multiple cross-reactions and poor discriminating capacity in clinical samples containing two or more serovars.

Results

We determined the genome sequences of the American Type Culture Collection (ATCC) type strains of all UUR and UPA serovars as well as four clinical isolates of UUR for which we were not able to determine serovar designation. UPA serovars had 0.75–0.78 Mbp genomes and UUR serovars were 0.84–0.95 Mbp. The original classification of ureaplasma isolates into distinct serovars was largely based on differences in the major ureaplasma surface antigen called the multiple banded antigen (MBA) and reactions of human and animal sera to the organisms. Whole genome analysis of the 14 serovars and the 4 clinical isolates showed the mba gene was part of a large superfamily, which is a phase variable gene system, and that some serovars have identical sets of mba genes. Most of the differences among serovars are hypothetical genes, and in general the two species and 14 serovars are extremely similar at the genome level.

Conclusions

Comparative genome analysis suggests UUR is more capable of acquiring genes horizontally, which may contribute to its greater virulence for some conditions. The overwhelming evidence of extensive horizontal gene transfer among these organisms from our previous studies, combined with our comparative analysis, indicates that ureaplasmas exist as quasispecies rather than as stable serovars in their native environment. Therefore, differential pathogenicity and clinical outcome of a ureaplasmal infection are most likely not determined at the serovar level, but rather may be due to the presence or absence of potential pathogenicity factors in an individual ureaplasma clinical isolate and/or patient-to-patient differences in terms of autoimmunity and microbiome.

60.

A note on efficient computation of haplotypes via perfect phylogeny.

Vineet Bafna, Dan Gusfield, Sridhar Hannenhalli, Shibu Yooseph 《Journal of computational biology》2004,11(5):858-866

The problem of inferring haplotype phase from a population of genotypes has received a lot of attention recently. This is partly due to the observation that there are many regions on human genomic DNA where genetic recombination is rare (Helmuth, 2001; Daly et al., 2001; Stephens et al., 2001; Friss et al., 2001). A Haplotype Map project has been announced by NIH to identify and characterize populations in terms of these haplotypes. Recently, Gusfield introduced the perfect phylogeny haplotyping (PPH) problem, as an algorithmic implication of the no-recombination in long blocks observation, together with the standard population-genetic assumption of infinite sites. Gusfield's solution based on matroid theory was followed by direct Θ(nm²) solutions that use simpler techniques (Bafna et al., 2003; Eskin et al., 2003), and also bound the number of solutions to the PPH problem. In this short note, we address two questions that were left open. First, can the algorithms of Bafna et al. (2003) and Eskin et al. (2003) be sped up to O(nm + m²) time, which would imply an O(nm) time bound for the PPH problem? Second, if there are multiple solutions, can we find one that is most parsimonious in terms of the number of distinct haplotypes? We give reductions that suggest that the answer to both questions is "no." For the first problem, we show that computing the output of the first step (in either method) is equivalent to Boolean matrix multiplication. Therefore, the best bound we can presently achieve is O(nm^(ω−1)), where ω ≤ 2.52 is the exponent of matrix multiplication. Thus, any linear time solution to the PPH problem likely requires a different approach. For the second problem of computing a PPH solution that minimizes the number of distinct haplotypes, we show that the problem is NP-hard using a reduction from Vertex Cover (Garey and Johnson, 1979).
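To make the perfect phylogeny condition concrete, the sketch below checks a candidate set of binary haplotypes with the classical four-gamete test: a binary matrix admits an unrooted perfect phylogeny exactly when no pair of sites shows all four combinations 00, 01, 10 and 11. This is a verification helper, not any of the PPH algorithms cited in the abstract. The genotype encoding (0 and 1 for homozygous sites, 2 for heterozygous sites) is the usual convention in the PPH literature and is an assumption here, as are the function names.

```python
from itertools import combinations


def has_perfect_phylogeny(haplotypes):
    """Four-gamete test on a list of equal-length 0/1 haplotypes.

    Returns True if no pair of sites exhibits all four gametes,
    i.e. the haplotypes fit an unrooted perfect phylogeny.
    """
    if not haplotypes:
        return True
    m = len(haplotypes[0])
    for i, j in combinations(range(m), 2):
        gametes = {(h[i], h[j]) for h in haplotypes}
        if len(gametes) == 4:
            return False
    return True


def explains(h1, h2, genotype):
    """Check that haplotypes h1 and h2 phase the genotype.

    Assumed encoding: 0 or 1 = homozygous for that allele, 2 = heterozygous.
    """
    for a, b, g in zip(h1, h2, genotype):
        if g == 2:
            if a == b:
                return False
        elif not (a == b == g):
            return False
    return True


print(has_perfect_phylogeny([(0, 0), (0, 1), (1, 0)]))          # three gametes only
print(has_perfect_phylogeny([(0, 0), (0, 1), (1, 0), (1, 1)]))  # all four gametes
print(explains((0, 1), (1, 1), (2, 1)))
```

For n haplotypes over m sites this naive pairwise check takes O(nm²) time. It only validates a proposed phasing; finding one, or finding the phasing with the fewest distinct haplotypes, is the harder problem the note addresses.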