Similar articles
1.

Background

Comparative genomics, or the study of the relationships of genome structure and function across different species, offers a powerful tool for studying evolution, annotating genomes, and understanding the causes of various genetic disorders. However, aligning multiple sequences of DNA, an essential intermediate step for most types of analyses, is a difficult computational task. In parallel, citizen science, an approach that takes advantage of the fact that the human brain is exquisitely tuned to solving specific types of problems, is becoming increasingly popular. There, instances of hard computational problems are dispatched to a crowd of non-expert human game players and solutions are sent back to a central server.

Methodology/Principal Findings

We introduce Phylo, a human-based computing framework applying “crowd sourcing” techniques to solve the Multiple Sequence Alignment (MSA) problem. The key idea of Phylo is to convert the MSA problem into a casual game that can be played by ordinary web users with minimal prior knowledge of the biological context. We applied this strategy to improve the alignment of the promoters of disease-related genes from up to 44 vertebrate species. Since the launch in November 2010, we have received more than 350,000 solutions submitted by more than 12,000 registered users. Our results show that the submitted solutions contributed to improving the accuracy of up to 70% of the alignment blocks considered.

Conclusions/Significance

We demonstrate that, combined with classical algorithms, crowd computing techniques can be successfully used to help improve the accuracy of MSA. More importantly, we show that an NP-hard computational problem can be embedded in a casual game that can be easily played by people without significant scientific training. This suggests that citizen science approaches can be used to exploit the billions of “human-brain peta-flops” of computation that are spent every day playing games. Phylo is available at: http://phylo.cs.mcgill.ca.
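To make concrete what "improving" an alignment block can mean computationally, here is a minimal sketch of a sum-of-pairs alignment score, a common MSA objective. The function name and the scoring values (match +1, mismatch -1, gap -1) are assumptions chosen for illustration; the abstract does not state how Phylo scores solutions.

```python
from itertools import combinations


def sum_of_pairs_score(rows, match=1, mismatch=-1, gap=-1):
    """Score an alignment column by column over all pairs of rows.

    The scoring values are illustrative assumptions, not Phylo's scheme.
    """
    if any(len(r) != len(rows[0]) for r in rows):
        raise ValueError("all aligned rows must have the same length")
    total = 0
    for column in zip(*rows):
        for a, b in combinations(column, 2):
            if a == "-" and b == "-":
                continue  # a gap paired with a gap is ignored
            if a == "-" or b == "-":
                total += gap
            elif a == b:
                total += match
            else:
                total += mismatch
    return total


# Two alignments of the same three sequences (ACGT, AGTT, ACGT).
before = ["ACGT-", "A-GTT", "ACG-T"]
after = ["ACGT-", "A-GTT", "ACGT-"]
print(sum_of_pairs_score(before), sum_of_pairs_score(after))
```

Under this assumed scheme, moving a single gap in the third row raises the score, which is the kind of local rearrangement a player could make by hand.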

2.
Intron phylogeny: a new hypothesis   Total citations: 35, self-citations: 0, citations by others: 35
The three major classes of intron are clearly of unequal antiquity. Structured (often self-splicing and sometimes mobile) introns are the most ancient, probably dating (at least for group I) from the ancestral (eubacterial) cell 3500 million years ago, and were originally restricted to tRNA. Protein-spliced introns (usually in tRNA) probably evolved from them by a radical change in splicing mechanism in the common ancestor of eukaryotes and archaebacteria, perhaps only about 1700 million years ago. Spliceosomal introns probably evolved from group-II-like self-splicing introns after the origin of the nucleus between 1700 and 1000 million years ago, and were probably mostly inserted into previously unsplit protein-coding genes after the origin of mitochondria 1000 million years ago.

3.

Background  

The recently introduced pathway-based approach promises to improve the efficiency of analyzing genome-wide association scan (GWAS) data to identify disease variants by jointly considering variants of the genes that belong to the same biological pathway. However, the currently available pathway-based approaches for analyzing GWAS data have limited power and efficiency.

4.

Background  

Selenocysteine and pyrrolysine are the 21st and 22nd amino acids, which are genetically encoded by stop codons. Since a number of microbial genomes have been completely sequenced to date, it is tempting to ask whether a 23rd amino acid remains undiscovered in these genomes. Recently, a computational study addressed this question and reported that no tRNA gene for an unknown amino acid was found in the available genome sequences. However, the performance of the tRNA prediction program on an unknown tRNA family, which may have atypical sequence and structure, is unclear, which renders that result inconclusive. A protein-level study will provide independent insight into the novel amino acid.

5.
We present a new method using nucleic acid secondary structure to assess phylogenetic relationships among species. In this method, which we term "molecular morphometrics," the measurable structural parameters of the molecules (geometrical features, bond energies, base composition, etc.) are used as specific characters to construct a phylogenetic tree. This method relies both on traditional morphological comparison and on molecular sequence comparison. Applied to the phylogenetic analysis of Cirripedia, molecular morphometrics supports the most recent morphological analyses arguing for the monophyly of Cirripedia sensu stricto (Thoracica + Rhizocephala + Acrothoracica). As a proof, a classical multiple alignment was also performed, either using or not using the structural information to realign the sequence segments considered in the molecular morphometrics analysis. These methods yielded the same tree topology as the direct use of structural characters as a phylogenetic signal. By taking into account the secondary structure of nucleic acids, the new method allows investigators to use the regions in which multiple alignments are barely reliable because of a large number of insertions and deletions. It thus appears to be complementary to classical primary sequence analysis in phylogenetic studies.

6.
Haplotyping as perfect phylogeny: a direct approach.   Total citations: 4, self-citations: 0, citations by others: 4
A full haplotype map of the human genome will prove extremely valuable as it will be used in large-scale screens of populations to associate specific haplotypes with specific complex genetic-influenced diseases. A haplotype map project has been announced by NIH. The biological key to that project is the surprising fact that some human genomic DNA can be partitioned into long blocks where genetic recombination has been rare, leading to strikingly fewer distinct haplotypes in the population than previously expected (Helmuth, 2001; Daly et al., 2001; Stephens et al., 2001; Friss et al., 2001). In this paper we explore the algorithmic implications of the no-recombination in long blocks observation, for the problem of inferring haplotypes in populations. This assumption, together with the standard population-genetic assumption of infinite sites, motivates a model of haplotype evolution where the haplotypes in a population are assumed to evolve along a coalescent, which as a rooted tree is a perfect phylogeny. We consider the following algorithmic problem, called the perfect phylogeny haplotyping problem (PPH), which was introduced by Gusfield (2002) - given n genotypes of length m each, does there exist a set of at most 2n haplotypes such that each genotype is generated by a pair of haplotypes from this set, and such that this set can be derived on a perfect phylogeny? The approach taken by Gusfield (2002) to solve this problem reduces it to established, deep results and algorithms from matroid and graph theory. Although that reduction is quite simple and the resulting algorithm nearly optimal in speed, taken as a whole that approach is quite involved, and in particular, challenging to program. Moreover, anyone wishing to fully establish, by reading existing literature, the correctness of the entire algorithm would need to read several deep and difficult papers in graph and matroid theory. 
However, as stated by Gusfield (2002), many simplifications are possible and the list of "future work" in Gusfield (2002) began with the task of developing a simpler, more direct, yet still efficient algorithm. This paper accomplishes that goal, for both the rooted and unrooted PPH problems. It establishes a simple, easy-to-program, O(nm^2)-time algorithm that determines whether there is a PPH solution for input genotypes and produces a linear-space data structure to represent all of the solutions. The approach allows complete, self-contained proofs. In addition to algorithmic simplicity, the approach here makes the representation of all solutions more intuitive than in Gusfield (2002), and solves another goal from that paper, namely, to prove a nontrivial upper bound on the number of PPH solutions, showing that this number is vastly smaller than the number of haplotype solutions (each solution being a set of n pairs of haplotypes that can generate the genotypes) when the perfect phylogeny requirement is not imposed.
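The PPH decision problem stated in this abstract can be illustrated with a brute-force sketch for tiny inputs. This is not the paper's efficient algorithm: it enumerates every way of resolving each genotype into a haplotype pair and checks the resulting haplotype set with the standard four-gamete test, which characterises binary matrices that admit an unrooted perfect phylogeny. The genotype encoding (0 and 1 for homozygous sites, 2 for heterozygous sites) and all function names are assumptions of this sketch.

```python
from itertools import combinations, product


def passes_four_gamete_test(haplotypes):
    """True if no pair of sites shows all four gametes 00, 01, 10, 11."""
    if not haplotypes:
        return True
    m = len(haplotypes[0])
    for i, j in combinations(range(m), 2):
        if len({(h[i], h[j]) for h in haplotypes}) == 4:
            return False
    return True


def resolutions(genotype):
    """Yield every unordered haplotype pair that generates one genotype.

    Assumed encoding: 0 and 1 are homozygous sites, 2 is heterozygous.
    """
    het = [k for k, g in enumerate(genotype) if g == 2]
    base = [0 if g == 2 else g for g in genotype]
    if not het:
        yield tuple(base), tuple(base)
        return
    # Fix the first heterozygous site so each pair is listed only once.
    for bits in product((0, 1), repeat=len(het) - 1):
        h1, h2 = list(base), list(base)
        for k, b in zip(het, (0,) + bits):
            h1[k], h2[k] = b, 1 - b
        yield tuple(h1), tuple(h2)


def has_pph_solution(genotypes):
    """Brute-force check for an unrooted PPH solution (exponential time)."""
    for choice in product(*(list(resolutions(g)) for g in genotypes)):
        haplotypes = {h for pair in choice for h in pair}
        if passes_four_gamete_test(list(haplotypes)):
            return True
    return False


print(has_pph_solution([(2, 2), (0, 1), (1, 0)]))
print(has_pph_solution([(0, 0), (0, 1), (1, 0), (1, 1)]))
```

The first input has a solution because the doubly heterozygous genotype can be resolved as 01/10; the second has none because its four homozygous genotypes already exhibit all four gametes. The search grows exponentially with the number of heterozygous sites, which is why a polynomial-time method matters.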

7.
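Integrated gene identification method: a new strategy for genome-wide gene expression research   Total citations: 2, self-citations: 0, citations by others: 2
The human genome contains an enormous number of nucleotides, and technical strategies for gene identification (recognition) are a crucial foundation for gene cloning research. For genome-wide gene expression analysis, techniques such as mRNA differential display, representational difference analysis, suppression subtractive hybridization, serial analysis of gene expression, and cDNA microarrays have been established in succession. The integrated gene identification method is a new genome-wide analysis strategy recently developed by weighing the strengths and weaknesses of these techniques. It makes full use of biological gene information databases for gene identification (recognition) and improves the efficiency of identifying rare-copy genes. This article briefly introduces its ...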

8.
MOTIVATION: Fast and reliable phylogeny estimation is rapidly gaining importance as more and more genomic sequence information is becoming available, and the study of the evolution of genes and genomes accelerates our understanding in biology and medicine alike. Branch attraction phenomena due to unequal amounts of evolutionary change in different parts of the phylogeny are one major problem for current methods, placing the species that evolved fast in one part of the phylogenetic tree, and the species that evolved slowly in the other. RESULTS: We describe a way to avoid the artifactual attraction of species that evolved slowly, by detecting shared old character states using a calibrated comparison with an outgroup. The corresponding focus on shared novel character states yields a fast and transparent phylogeny estimation algorithm, by application of the divide-and-conquer principle, and heuristic search: shared novelties give evidence of the exclusive common heritage (monophyly) of a subset of the species. They indicate conflict in a split of all species considered, if the split tears them apart. Only the split at the root of the phylogenetic tree cannot have such conflict. Therefore, we can work top-down, from the root to the leaves, by heuristically searching for a minimum-conflict split, and tackling the resulting two subsets in the same way. The algorithm, called "minimum conflict phylogeny estimation" (MCOPE), has been validated successfully using both natural and artificial data. In particular, we reanalyze published trees, yielding more plausible phylogenies, and we analyze small "undisputed" trees on the basis of alignments considering structural homology. AVAILABILITY: MCOPE is available via http://bibiserv.techfak.uni-bielefeld.de/mcope/. CONTACT: fuellen@alum.mit.edu

9.
Gentianales consist of Apocynaceae, Gelsemiaceae, Gentianaceae, Loganiaceae, and Rubiaceae, of which the majority are woody plants in tropical and subtropical areas. Despite extensive efforts in reconstructing the phylogeny of Gentianales based on molecular data, some interfamily and intrafamily relationships remain uncertain. We reconstructed the genus-level phylogeny of Gentianales based on the supermatrix of eight plastid markers (rbcL, matK, atpB, ndhF, rpl16, rps16, the trnL-trnF region, and atpB-rbcL spacer) and one mitochondrial gene (matR) using maximum likelihood. The major clades and their relationships retrieved in the present study concur with those of previous studies. All of the five families of Gentianales are monophyletic with strong support. We resolved Rubiaceae as sister to the remaining families in Gentianales and showed support for the sister relationship between Loganiaceae and Apocynaceae. Our results provide new insights into relationships among intrafamilial clades. For example, within Rubiaceae we found that Craterispermeae were sister to Morindeae + (Palicoureeae + Psychotrieae) and that Theligoneae were sister to Putorieae. Within Gentianaceae, our phylogeny revealed that Gentianeae were sister to Helieae and Potalieae, and subtribe Lisianthiinae were sister to Potaliinae and Faroinae. Within Loganiaceae, we found Neuburgia as sister to Spigelieae. Within Apocynaceae, our results supported Amsonieae as sister to Melodineae, and Hunterieae as sister to a clade comprising Plumerieae + (Carisseae + APSA). We also confirmed the monophyly of Periplocoideae and the relationships among Baisseeae + (Secamonoideae + Asclepiadoideae).

10.
11.
A molecular phylogeny of the Scleractinia is reconstructed from approximately 700 nucleotides of the 5' end of the 28S rDNA obtained from 40 species. A comparison of molecular phylogenetic trees with biomineralization patterns of coral septa suggests that at least five clades are corroborated by both types of data. Agaricidae and Dendrophylliidae are found to be monophyletic, which is supported by microstructural data. Conversely, Faviidae and Caryophylliidae are found to be paraphyletic: Cladocora should be excluded from the faviids, whereas Eusmilia should be excluded from the caryophylliids. This conclusion is also supported by the positions, sizes and shapes of centres of calcification. The traditional Guyniidae are diphyletic, corroborating Stolarski's hypothesis 'A'. Some results from our most parsimonious trees are not strongly statistically supported but are corroborated by other molecular studies and microstructural observations. For example, in the scleractinian phylogenetic tree, there are several lines of evidence (including those from our data) to distinguish a Faviidae–Mussidae lineage and a Dendrophylliidae–Agaricidae–Poritidae–Siderastreidae lineage. From a methodological standpoint, our results suggest that co-ordinated studies creating links between biomineralization patterns and molecular phylogeny may provide an efficient working approach for a re-examination of scleractinian classification. This goal is important because patterns of septal microstructures are involved in the evolutionary scheme proposed by Wells, which presently remains the basic framework in coral studies. Validating from molecular phylogenies a given microstructural character state as a potential synapomorphy for a clade is the only way to include fossils in the coral classification, an approach that should allow the unity of coral classification to be maintained up to the origin of the phylum in Triassic times.

12.
Saruhashi S, Hamada K, Horiike T, Shinozawa T. Gene, 2007, 392(1-2): 157-163
The construction of accurate prokaryotic phylogeny is important not only in the field of evolutionary biology, but also in microbiology and pathology. However, in constructing a phylogenetic tree to trace prokaryotic evolution, the phylogenetic relationship is often changed by the choice of species. To estimate the lineage of prokaryotes accurately, a new method, named the "random extraction method", was developed. In this method, 16S rRNA sequence data were randomly extracted 1000 times from each group of closely related taxa, such as seven phyla of Eubacteria and one domain of Archaea, and phylogenetic trees were constructed from the data to clarify the relationships among those groups. Next, the tree topologies were counted, and the most frequently supported topology was taken as the most plausible phylogenetic tree. To evaluate the reliability of each node, we developed the "Branching rate" (BR) and calculated it for every tree. Computational simulation analysis was also carried out to confirm these methods. On the assumption that the root of life is between Archaea and Eubacteria, the obtained phylogenetic relationships of phyla are as follows. First, Archaea (Euryarchaeota, Crenarchaeota and Korarchaeota) diverged; Thermotogales, Cyanobacteria and Chlamydiales then diverged in this order; and then Firmicutes (Actinobacteria and Bacillus/Clostridium group cluster) and Proteobacteria (alpha and beta/gamma cluster) diverged. In addition, the BR showed that the position of the node of Firmicutes Actinobacteria and Firmicutes Bacillus/Clostridium changed from extraction to extraction. This suggests that the differences among the phylogenetic trees of prokaryotes were caused by the influence of these phyla.

13.
The genetic dissection of complex inherited diseases is a major challenge. Despite limited success in finding genes, substantial data based on genome-wide scan strategies are now available for a variety of diseases and related phenotypes. This can perhaps best be appreciated in the field of lipid and lipoprotein levels, where the amount of information generated is becoming overwhelming. We have created a database containing the results from whole-genome scans of lipid-related phenotypes undertaken to date. The usefulness of this database is demonstrated by performing a new autosomal genomic scan on apolipoprotein B (apoB), LDL-apoB, and apoA-I levels, measured in 679 subjects of 243 nuclear families. Linkage was tested using both allele-sharing and variance-component methods. Only two loci provided support for linkage with both methods: an LDL-apoB locus on 18q21.32 and an apoA-I locus on 3p25.2. Adding those findings to the database highlighted the fact that the former is reported as a lipid-related locus for the first time, whereas the latter has been observed before. However, concerns arise when displaying all data on the same map, because a large portion of the genome is now covered with loci supported by at least suggestive evidence of linkage.

14.
Coronavirus phylogeny based on a geometric approach   Total citations: 5, self-citations: 0, citations by others: 5

15.
Technological developments provide new insights into prokaryotic evolution and diversity and provoke a continuous need to update taxonomy and revise classification schemes. Our present species concept and definition are being challenged by the growing amount of whole genomic information, which should allow improvements in the natural species definition. The continuous quest for an objective and stable method for sorting strains into coherent homogeneous groups is inherent to prokaryotic systematics and nomenclature. Morphological, biochemical, physiological, phenotypic and chemotaxonomic criteria have been complemented by molecular data, and pragmatic, purpose-built species definitions are being replaced by more natural ones based on evolutionary insights. It is imperative to give due consideration to both fundamental and applied aspects of future species concepts and definitions. The present paper discusses current practice in prokaryotic taxonomy, how this system developed, and how it may evolve in the future.

16.
Subtilisin-like serine proteases (subtilases) are a very diverse family of serine proteases with low sequence homology, often limited to regions surrounding the three catalytic residues. Starting with different Hidden Markov Models (HMMs), based on sequence alignments around the catalytic residues of the S8 family (subtilisins) and S53 family (sedolisins), we iteratively searched all ORFs in the complete genomes of 313 eubacteria and archaea. In 164 genomes we identified a total of 567 ORFs with one or more of the conserved regions with a catalytic residue. The large majority of these contained all three regions around the "classical" catalytic residues of the S8 family (Asp-His-Ser), while 63 proteins were identified as S53 (sedolisin) family members (Glu-Asp-Ser). More than 30 proteins were found to belong to two novel subsets with other evolutionary variations in catalytic residues, and new HMMs were generated to search for them. In one subset the catalytic Asp is replaced by an equivalent Glu (i.e. Glu-His-Ser family). The other subset resembles sedolisins, but the conserved catalytic Asp is not located on the same helix as the nucleophile Glu, but rather on a beta-sheet strand in a topologically similar position, as suggested by homology modeling. The Prokaryotic Subtilase Database (www.cmbi.ru.nl/subtilases) provides access to all information on the identified subtilases, the conserved sequence regions, the proposed family subdivision, and the appropriate HMMs to search for them. Over 100 proteins were predicted to be subtilases for the first time by our improved searching methods, thereby improving genome annotation.

17.
18.
Multimarker Transmission/Disequilibrium Tests (TDTs) are association tests that are very robust to population admixture and structure and may be used to identify susceptibility loci in genome-wide association studies. Multimarker TDTs using several markers may increase power by capturing high-degree associations. However, there is also a risk of spurious associations and power reduction due to the increase in degrees of freedom. In this study we show that associations found by tests built on simple null hypotheses are highly reproducible in a second independent data set regardless of the number of markers. As a test exhibiting this feature to its maximum, we introduce the multimarker 2-Groups TDT (mTDT(2G)), a test which, under the hypothesis of no linkage, asymptotically follows a χ² distribution with 1 degree of freedom regardless of the number of markers. The statistic requires the division of parental haplotypes into two groups: disease susceptibility and disease protective haplotype groups. We assessed the test behavior by performing an extensive simulation study as well as a real-data study using several data sets of two complex diseases. We show that the mTDT(2G) test is highly efficient and achieves the highest power among all the tests used, even when the null hypothesis is tested in a second independent data set. Therefore, mTDT(2G) turns out to be a very promising multimarker TDT for genome-wide searches for disease susceptibility loci, and it may be used as a preprocessing step in the construction of more accurate genetic models to predict individual susceptibility to complex diseases.
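For orientation, the classical single-marker TDT uses a McNemar-type statistic that is asymptotically χ² with 1 degree of freedom under the null hypothesis. The sketch below computes that classical form for two transmission counts. Reading the two counts as transmissions of "susceptibility-group" versus "protective-group" haplotypes is an assumption about how a two-group test could be tallied; the abstract does not give the mTDT(2G) formula, and the function and parameter names here are hypothetical.

```python
import math


def tdt_chi_square(transmitted_a, transmitted_b):
    """Classical McNemar-type TDT statistic and its 1-df chi-square p-value.

    transmitted_a / transmitted_b: counts of informative parents that
    transmitted a group-A versus a group-B haplotype to an affected child
    (the two-group reading is an assumption of this sketch).
    """
    n = transmitted_a + transmitted_b
    if n == 0:
        raise ValueError("no informative transmissions")
    statistic = (transmitted_a - transmitted_b) ** 2 / n
    # Survival function of a chi-square variable with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(statistic / 2.0))
    return statistic, p_value


stat, p = tdt_chi_square(60, 40)
print(stat, round(p, 4))
```

Because the statistic always compares just two counts, its degrees of freedom do not grow with the number of markers, which is the property the abstract emphasises.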

19.

Background  

Incorrectly annotated sequence data are becoming more commonplace as databases increasingly rely on automated techniques for annotation. Hence, there is an urgent need for computational methods for checking consistency of such annotations against independent sources of evidence and detecting potential annotation errors. We show how a machine learning approach designed to automatically predict a protein's Gene Ontology (GO) functional class can be employed to identify potential gene annotation errors.

20.

Background  

A recent publication described a supervised classification method for microarray data: Between Group Analysis (BGA). This method, which is based on performing multivariate ordination of groups, proved to be very efficient for both classification of samples into pre-defined groups and disease class prediction of new unknown samples. Classification and prediction with BGA are classically performed using the whole set of genes, and no variable selection is required. We hypothesize that an optimized selection of highly discriminating genes might improve the prediction power of BGA.
