Similar articles
Found 20 similar articles (search time: 15 ms)
1.
The major aim of tertiary structure prediction is to obtain protein models with the highest possible accuracy. Fold recognition, homology modeling, and de novo prediction methods typically use predicted secondary structures as input, and all of these methods may significantly benefit from more accurate secondary structure predictions. Although there are many different secondary structure prediction methods available in the literature, their cross-validated prediction accuracy is generally <80%. In order to increase the prediction accuracy, we developed a novel hybrid algorithm called Consensus Data Mining (CDM) that combines our two previous successful methods: (1) Fragment Database Mining (FDM), which exploits the Protein Data Bank structures, and (2) GOR V, which is based on information theory, Bayesian statistics, and multiple sequence alignments (MSA). In CDM, the target sequence is dissected into smaller fragments that are compared with fragments obtained from related sequences in the PDB. For fragments with a sequence identity above a certain sequence identity threshold, the FDM method is applied for the prediction. The remainder of the fragments are predicted by GOR V. The results of the CDM are provided as a function of the upper sequence identities of aligned fragments and the sequence identity threshold. We observe that the value 50% is the optimum sequence identity threshold, and that the accuracy of the CDM method measured by Q(3) ranges from 67.5% to 93.2%, depending on the availability of known structural fragments with sufficiently high sequence identity. As the Protein Data Bank grows, it is anticipated that this consensus method will improve because it will rely more upon the structural fragments.
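The selection rule and the accuracy measure described in this abstract can be sketched as follows. This is only an illustration under stated assumptions: the function names `choose_method` and `q3` are hypothetical, "above the threshold" is read as a strict inequality, and Q(3) is taken in its usual sense as the percentage of residues whose three-state label (helix H, strand E, coil C) is predicted correctly. It does not reproduce the FDM or GOR V predictors themselves.

```python
def choose_method(fragment_identity: float, threshold: float = 50.0) -> str:
    """Pick the predictor for one fragment (hypothetical helper).

    Fragments whose best sequence identity to a PDB fragment exceeds the
    threshold go to FDM; all remaining fragments go to GOR V.
    """
    return "FDM" if fragment_identity > threshold else "GOR V"


def q3(predicted: str, observed: str) -> float:
    """Percentage of positions with a correctly predicted H/E/C label."""
    if len(predicted) != len(observed) or not observed:
        raise ValueError("sequences must be non-empty and of equal length")
    correct = sum(p == o for p, o in zip(predicted, observed))
    return 100.0 * correct / len(observed)


if __name__ == "__main__":
    print(choose_method(62.0))           # fragment with 62% identity
    print(choose_method(30.0))           # fragment with 30% identity
    print(q3("HHHEECCC", "HHHEEECC"))    # one of eight positions differs
```

The default threshold of 50% mirrors the optimum value reported in the abstract; any other value can be passed explicitly.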

2.

Background

HTLV-1 infection is endemic among people of Melanesian descent in Papua New Guinea, the Solomon Islands and Vanuatu. Molecular studies reveal that these Melanesian strains belong to the highly divergent HTLV-1c subtype. In Australia, HTLV-1 is also endemic among the Indigenous people of central Australia; however, the molecular epidemiology of HTLV-1 infection in this population remains poorly documented.

Findings

Studying a series of 23 HTLV-1 strains from Indigenous residents of central Australia, we analyzed coding (gag, pol, env, tax) and non-coding (LTR) genomic proviral regions. Four complete HTLV-1 proviral sequences were also characterized. Phylogenetic analyses implemented with both Neighbor-Joining and Maximum Likelihood methods revealed that all proviral strains belong to the HTLV-1c subtype with a high genetic diversity, which varied with the geographic origin of the infected individuals. Two distinct Australian clades were found, the first including strains derived from most patients whose origins are in the North, and the second comprising a majority of those from the South of central Australia. Divergence time estimation suggests that the speciation of these two Australian clades probably occurred 9,120 years ago (38,000–4,500).

Conclusions

The HTLV-1c subtype is endemic to central Australia where the Indigenous population is infected with diverse subtype c variants. At least two Australian clades exist, which cluster according to the geographic origin of the human hosts. These molecular variants are probably of very ancient origin. Further studies could provide new insights into the evolution and modes of dissemination of these retrovirus variants and the associated ancient migration events through which early human settlement of Australia and Melanesia was achieved.

3.
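Objective: To use gene chip (microarray) data to explore the molecular pathogenesis of cervical cancer, mine EST fragments of tumor-related genes, and search for markers of malignancy, with the aim of finding new and effective means of tumor prevention and treatment. Methods: The GSM99077 microarray data were obtained from the GEO (Gene Expression Omnibus) database and used to screen for EST fragments of cervical cancer-related genes; homologous sequences matching these ESTs were then identified with the online BLAST tool at NCBI, and the biological functions of these homologous sequences were analyzed to establish their relevance to tumors. Results: A total of 127 ESTs were differentially expressed between cervical cancer tissue and normal cervical tissue, of which 106 were up-regulated and 11 down-regulated; the transcription products of the sequences homologous to these ESTs are involved in transcription, translation, cell proliferation and division, and cell signal transduction. Conclusion: Gene chips can capture intrinsic biological information effectively and at high throughput; re-mining of microarray data shows that the development of cervical cancer involves the joint action of multiple genes.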

4.
5.
Fixed Character States and the Optimization of Molecular Sequence Data   Cited by: 5 (self-citations: 1; by others: 5)
A method is proposed to optimize molecular sequence data that does not employ multiple sequence alignment. This method treats entire homologous contiguous stretches of sequence data as individual characters. This sequence is treated as the homologous unit employed in phylogeny reconstruction. The sets of specific sequences exhibited by the terminal taxa constitute the character states. The number of states is then less than or equal to the number of unique sequences (or homologous fragments) exhibited by the data. A matrix of transformation costs is created to relate the states to one another. The cells of this matrix are defined as the minimum transformation cost between each pair of states based on insertion–deletion and base substitution costs. The diagnosis of a topology then follows existing dynamic programming techniques, with the number of states greatly expanded. Since the possible sequences reconstructed at nodes are limited to those exhibited by the terminals, cladograms constructed in this way may be longer than those of other methods in that they require a greater number of weighted evolutionary events. Example data, the effects of missing data, restricted ancestors, and putative long-branch attraction are discussed.
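A matrix of pairwise transformation costs of the kind described in this abstract can be sketched with a standard weighted edit-distance recurrence. The cost values (`indel=2`, `subst=1`), the function names and the example sequences are assumptions chosen for illustration; the abstract does not state which costs or implementation the authors used.

```python
from typing import List, Sequence


def transformation_cost(a: str, b: str, indel: int = 2, subst: int = 1) -> int:
    """Minimum cost of turning sequence a into b (weighted edit distance)."""
    # prev[j] holds the cost of aligning the first i-1 bases of a with b[:j].
    prev = [j * indel for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * indel]
        for j in range(1, len(b) + 1):
            diagonal = prev[j - 1] + (0 if a[i - 1] == b[j - 1] else subst)
            curr.append(min(diagonal, prev[j] + indel, curr[j - 1] + indel))
        prev = curr
    return prev[-1]


def cost_matrix(states: Sequence[str], indel: int = 2, subst: int = 1) -> List[List[int]]:
    """Symmetric matrix of transformation costs between all pairs of states."""
    return [[transformation_cost(s, t, indel, subst) for t in states] for s in states]


if __name__ == "__main__":
    # Character states are the unique sequences observed in the terminal taxa.
    terminals = ["ACGT", "ACT", "AGGT", "ACGT"]
    states = list(dict.fromkeys(terminals))  # de-duplicate, keep order
    for row in cost_matrix(states):
        print(row)
```

Because the states are restricted to observed sequences, the matrix has at most as many rows as there are unique terminal sequences, which is what keeps the subsequent dynamic programming step tractable.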

6.
Cryptosporidium spp. represent a major public health problem worldwide and infect the gastrointestinal tract of both immunocompetent and immunocompromised persons. The prevalence of these parasites varies by geographic region, and no data are currently available in Lebanon. To promote an understanding of the epidemiology of cryptosporidiosis in this country, the main aim of this study was to determine the prevalence of Cryptosporidium in symptomatic hospitalized patients, and to analyze the genetic diversity of the corresponding isolates. Fecal specimens were collected in four hospitals in North Lebanon from 163 patients (77 males and 86 females, ranging in age from 1 to 88 years, with a mean age of 22 years) presenting gastrointestinal disorders during the period July to December 2013. The overall prevalence of Cryptosporidium spp. infection obtained by modified Ziehl-Neelsen staining and/or nested PCR was 11%, and children <5 years old showed a higher rate of Cryptosporidium spp. The PCR products of the 15 positive samples were successfully sequenced. Among them, 10 isolates (66.7%) were identified as C. hominis, while the remaining 5 (33.3%) were identified as C. parvum. After analysis of the gp60 locus, C. hominis IdA19, a rare subtype, was found to be predominant. Two C. parvum subtypes were found: IIaA15G1R1 and IIaA15G2R1. The molecular characterization of Cryptosporidium isolates is an important step in improving our understanding of the epidemiology and transmission of the infection.

7.
The Detection of Linkage Disequilibrium in Molecular Sequence Data   Cited by: 15 (self-citations: 4; by others: 11)
R. C. Lewontin. Genetics 1995, 140(1): 377-388
Studies of genetic variation in natural populations at the sequence level usually show that most polymorphic sites are very asymmetrical in allele frequencies, with the rarer allele at a site near fixation. When the rarer allele at a site is present only a few times in the sample, say below five representatives, it becomes very difficult to detect linkage disequilibrium between sites from tests of association. This is a consequence of the numerical properties of even the most powerful test of association, Fisher's exact test. Sites with fewer than five representatives in the sample should be excluded from association tests, but this generally leaves few site pairs eligible for testing. A test for overall linkage disequilibrium, based on the sign of the observed linkage disequilibria, is derived which can use all the data. It is shown that more power can be achieved by increasing the length of sequence determined than by increasing the number of genomes sampled for the same total work.
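One way to see the numerical limitation mentioned in this abstract is to compute the most extreme one-tailed p-values that Fisher's exact test can return for a pair of sites with given rare-allele counts. The sketch below is an illustration, not the paper's own analysis; the function names and the example sample (20 genomes, each rare allele seen twice) are assumptions.

```python
from math import comb
from typing import Tuple


def hypergeom_pmf(k: int, n: int, a: int, b: int) -> float:
    """Probability that exactly k genomes carry both rare alleles.

    n is the number of sampled genomes; a and b are the counts of the
    rarer allele at the two sites (the fixed margins of the 2x2 table).
    """
    return comb(a, k) * comb(n - a, b - k) / comb(n, b)


def extreme_one_tailed_p(n: int, a: int, b: int) -> Tuple[float, float]:
    """Smallest attainable one-tailed p-values for the given margins.

    Returns (p_repulsion, p_coupling): the probability of the table with
    the fewest possible co-occurrences and of the table with the most.
    For the most extreme table, the tail probability equals its point
    probability.
    """
    lowest = max(0, a + b - n)
    highest = min(a, b)
    return hypergeom_pmf(lowest, n, a, b), hypergeom_pmf(highest, n, a, b)


if __name__ == "__main__":
    p_repulsion, p_coupling = extreme_one_tailed_p(n=20, a=2, b=2)
    print(f"repulsion: {p_repulsion:.3f}, coupling: {p_coupling:.4f}")
```

With these margins the rare alleles never co-occurring is the expected outcome, so a negative association cannot reach significance however strong it is; only complete co-occurrence yields a small p-value.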

8.
Background: In the field of bioinformatics, interchangeable data formats based on XML are widely used. XML-type data is also at the core of most web services. With the increasing amount of data stored in XML comes the need for storing and accessing the data. In this paper we analyse the suitability of different database systems for storing and querying large datasets in general and Medline in particular. Results: All reviewed database systems perform well when tested with small to medium sized datasets; however, when the full Medline dataset is queried, a large variation in query times is observed. Conclusions: There is not one system that is vastly superior to the others in this comparison and, depending on the database size and the query requirements, different systems are most suitable. The best all-round solution is the Oracle 11g database system using the new binary storage option. Alias-i's Lingpipe is a more lightweight, customizable and sufficiently fast solution. It does, however, require more initial configuration steps. For data with a changing XML structure, Sedna and BaseX as native XML database systems or MySQL with an XML-type column are suitable.

9.
10.
The GenBank database contains essentially all of the nucleotide sequence data generated for published molecular systematic studies, but for the majority of taxa these data remain sparse. GenBank has value for phylogenetic methods that leverage data mining and rapidly improving computational methods, but the limits imposed by the sparse structure of the data are not well understood. Here we present a tree representing 13,093 land plant genera (an estimated 80% of extant plant diversity) to illustrate the potential of public sequence data for broad phylogenetic inference in plants, and we explore the limits to inference imposed by the structure of these data using theoretical foundations from phylogenetic data decisiveness. We find that despite very high levels of missing data (over 96%), the present data retain the potential to inform over 86.3% of all possible phylogenetic relationships. Most of these relationships, however, are informed by small amounts of data: approximately half are informed by fewer than four loci, and more than 99% are informed by fewer than fifteen. We also apply an information theoretic measure of branch support to assess the strength of phylogenetic signal in the data, revealing many poorly supported branches concentrated near the tips of the tree, where data are sparse and the limiting effects of this sparseness are stronger. We argue that limits to phylogenetic inference and signal imposed by low data coverage may pose significant challenges for comprehensive phylogenetic inference at the species level. Computational requirements provide additional limits for large reconstructions, but these may be overcome by methodological advances, whereas insufficient data coverage can only be remedied by additional sampling effort. We conclude that public databases have exceptional value for modern systematics and evolutionary biology, and that a continued emphasis on expanding taxonomic and genomic coverage will play a critical role in developing these resources to their full potential.

11.
Abstract

Analysis, storage, and transfer of molecular dynamics trajectories are becoming the bottleneck of computer simulations. In this paper we discuss different approaches for data mining and data processing of huge trajectory files generated from molecular dynamics simulations of nucleic acids.

12.
13.
14.
Inherited haemoglobinopathies are the most common monogenic diseases, with millions of carriers and patients worldwide. At present, we know several hundred disease-causing mutations on the globin gene clusters, in addition to numerous clinically important trans-acting disease modifiers encoded elsewhere and a multitude of polymorphisms with relevance for advanced diagnostic approaches. Moreover, new disease-linked variations are discovered every year that are not included in traditional and often functionally limited locus-specific databases. This paper presents IthaGenes, a new interactive database of haemoglobin variations, which stores information about genes and variations affecting haemoglobin disorders. In addition, IthaGenes organises phenotype, relevant publications and external links, while embedding the NCBI Sequence Viewer for graphical representation of each variation. Finally, IthaGenes is integrated with the companion tool IthaMaps for the display of corresponding epidemiological data on distribution maps. IthaGenes is incorporated in the ITHANET community portal and is free and publicly available at http://www.ithanet.eu/db/ithagenes.

15.
16.
Walnut (Juglans regia), an economically important woody plant, is widely cultivated in temperate regions for its timber and nutritious fruits. Despite abundant germplasm studies, systematic molecular evaluations of walnut are sparsely reported, mainly because few molecular markers are available. Expressed sequence tags (ESTs) provide a valuable resource for developing simple sequence repeat (SSR) markers. In this study, a total of 5,025 walnut ESTs (covering 16.41 Mb) were retrieved from the National Center for Biotechnology Information database. The SSR motifs were then analyzed with the SSRHunter software. In total, 398 SSRs were obtained with an average frequency of 1/4.08 kb. Dinucleotide (di-) repeat motifs accounted for 69.85% of all SSRs, followed by trinucleotide (tri-) motifs with a frequency of 27.64%, while a low frequency (2.51%) of tetranucleotide (tetra-) to hexanucleotide (hexa-) motifs was observed. TC and GCA motifs were prevalent among di- and tri- loci, respectively. Subsequently, a total of 123 primer pairs were designed from the non-redundant SSR-containing unigenes with the selection threshold of SSR length set to 10 bp or more. To examine the efficiency of candidate markers, seven DNA pools were collected from geographically different accessions. Results demonstrated that 41 SSR primer sets could generate highly polymorphic amplification products (33.3%), and these polymorphic loci were mainly located in the 3′-untranslated region. Annotation analysis revealed that only two of these 41 loci were located inside open reading frames of characterized proteins (E ≤ 1E−30).
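The SSR screening step described in this abstract can be approximated with a regular-expression scan for perfect tandem repeats. This is a minimal sketch, not a reimplementation of SSRHunter: the function name `find_ssrs`, the restriction to perfect repeats of 2 to 6 bp motifs, and the example sequence are assumptions, and only the 10 bp minimum length is taken from the text.

```python
import re
from typing import List, Tuple


def is_primitive(motif: str) -> bool:
    """True if the motif is not itself a repeat of a shorter unit (e.g. not 'AA' or 'TCTC')."""
    return motif not in (motif + motif)[1:-1]


def find_ssrs(seq: str, min_len: int = 10) -> List[Tuple[int, str, int]]:
    """Return (start, motif, repeat_count) for perfect di- to hexanucleotide repeats.

    Only repeats whose total length is at least min_len bp are reported.
    Repeats found for different motif lengths are not merged if they overlap.
    """
    seq = seq.upper()
    hits = []
    for k in range(2, 7):
        pattern = re.compile(r"([ACGT]{%d})\1+" % k)
        for match in pattern.finditer(seq):
            motif = match.group(1)
            length = match.end() - match.start()
            if is_primitive(motif) and length >= min_len:
                hits.append((match.start(), motif, length // k))
    return sorted(hits)


if __name__ == "__main__":
    est = "GGTCTCTCTCTCTCAATTGCAGCAGCAGCATT"
    for start, motif, count in find_ssrs(est):
        print(start, motif, count)
```

Filtering out non-primitive motifs keeps mononucleotide runs and higher-order copies of a dinucleotide repeat from being counted as separate SSR classes.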

17.
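Molecular systematics of Tamarix and Myricaria in China   Cited by: 5 (self-citations: 3; by others: 2)
The internal transcribed spacer (ITS) sequences of ribosomal DNA and the 3′-end sequence of the 5.8S rRNA gene were determined for 21 species in 3 genera of Tamaricaceae from China. The ITS-1 region ranged from 254 bp to 269 bp in length, and the ITS-2 region from 225 bp to 253 bp. With Reaumuria songarica as the functional outgroup, analysis with the PAUP software yielded a single most parsimonious tree, with a length of 466 steps, a consistency index (CI) of 0.8584 and a retention index (RI) of 0.8622. The phylogenetic analysis indicates that Myricaria elegans should not be separated from the genus Myricaria. In addition, the analysis provides molecular evidence for the delimitation of Tamarix androssowii, T. karelinii and T. austromongolica, whose taxonomic treatment is currently disputed.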

18.
Although ecosystem-based management can lead to sustainable resource use, its successful implementation depends on stakeholders’ acceptance. A framework to integrate scientific knowledge about the ecosystems with stakeholders’ preferences is therefore needed. We propose here a ‘Public Sentiment Index,’ or PSI, as an integration framework that combines an ecosystem model (Ecopath with Ecosim; EwE) with a public choice model (the damage schedule). Using Chesapeake Bay as a case study, we demonstrate the development of the PSI, based on judgments of Bay stakeholders, including ‘watermen’ (commercial fishers), seafood wholesalers and retailers, recreational fishers, representatives from non-governmental organizations, scientists and managers on a range of Bay ecosystems. The high PSI for Chesapeake Bay suggests a consensus amongst Bay stakeholders who, understanding the need for restoring the Bay ecosystem, may accept difficult policy choices and support their implementation.

19.
20.
Persons who inject illicit substances have an important role in HIV-1 blood and sexual transmission and, together with persons who are heavy users of non-injecting drugs, may have less than optimal adherence to anti-retroviral treatment and eventually could transmit resistant HIV variants. Unfortunately, molecular biology data on such key populations remain fragmentary in most low and middle-income countries. The aim of the present study was to assess HIV infection rates, evaluate HIV-1 genetic diversity and drug resistance, and identify HIV transmission clusters in heavy drug users (DUs). For this purpose, DUs were recruited in the context of a Respondent-Driven Sampling (RDS) study in different Brazilian cities during 2009. Overall, 2,812 individuals were tested for HIV, and 168 (6%) of them were positive, of which 19 (11.3%) were classified as recent seroconverters, corresponding to an estimated incidence rate of 1.58%/year (95% CI 0.92–2.43%). Neighbor joining phylogenetic trees from env and pol regions and bootscan analyses were employed to subtype the virus from 132 HIV-1-infected individuals. HIV-1 subtype B was prevalent in most of the cities under analysis, followed by BF recombinants (9%-35%). HIV-1 subtype C was the most prevalent in Curitiba (46%) and Itajaí (86%) and was also detected in Brasília (9%) and Campo Grande (20%). Pure HIV-1F infections were detected in Rio de Janeiro (9%), Recife (6%), Salvador (6%) and Brasília (9%). Clusters of HIV transmission were assessed by Maximum likelihood analyses and were cross-compared with the RDS network structure. Drug resistance mutations were verified in 12.2% of DUs. Our findings reinforce the importance of permanent HIV-1 surveillance in distinct Brazilian cities due to viral resistance and increasing subtype heterogeneity all over Brazil, with relevant implications in terms of treatment monitoring, prophylaxis and vaccine development.
