Similar articles
1.
Oligomers of length k, or k-mers, are convenient and widely used features for modeling the properties and functions of DNA and protein sequences. However, k-mers suffer from the inherent limitation that if the parameter k is increased to resolve longer features, the probability of observing any specific k-mer becomes very small, and k-mer counts approach a binary variable, with most k-mers absent and a few present once. Thus, any statistical learning approach using k-mers as features becomes susceptible to noisy training set k-mer frequencies once k becomes large. To address this problem, we introduce alternative feature sets using gapped k-mers, a new classifier, gkm-SVM, and a general method for robust estimation of k-mer frequencies. To make the method applicable to large-scale genome-wide applications, we develop an efficient tree data structure for computing the kernel matrix. We show that compared to our original kmer-SVM and alternative approaches, our gkm-SVM predicts functional genomic regulatory elements and tissue-specific enhancers with significantly improved accuracy, increasing the precision by up to a factor of two. We then show that gkm-SVM consistently outperforms kmer-SVM on human ENCODE ChIP-seq datasets, and further demonstrate the general utility of our method using a Naïve-Bayes classifier. Although developed for regulatory sequence analysis, these methods can be applied to any sequence classification problem.
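
The contrast between contiguous and gapped k-mer features can be illustrated with a short Python sketch. This is a naive enumeration for illustration only, not the tree-based kernel computation of gkm-SVM; the function names and the parameters `l` (window length) and `k` (number of informative positions) are assumptions made for this example.

```python
from collections import Counter
from itertools import combinations


def kmer_counts(seq, k):
    """Count every contiguous k-mer in seq."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def gapped_kmer_counts(seq, l, k):
    """Count gapped k-mers (hypothetical helper, naive enumeration).

    Each window of length l contributes one feature per choice of k
    informative positions; the other l - k positions become '-'.
    """
    counts = Counter()
    for i in range(len(seq) - l + 1):
        window = seq[i:i + l]
        for keep in combinations(range(l), k):
            kept = set(keep)
            pattern = "".join(
                c if j in kept else "-" for j, c in enumerate(window)
            )
            counts[pattern] += 1
    return counts
```

Because each gapped feature ignores some positions, many distinct windows map to the same feature, so counts are less sparse than for contiguous k-mers of the same total length.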

2.
Based on the well-known k-mer model, we propose a k-mer natural vector model for representing a genetic sequence based on the numbers and distributions of k-mers in the sequence. We show that there exists a one-to-one correspondence between a genetic sequence and its associated k-mer natural vector. The k-mer natural vector method can be easily and quickly used to perform phylogenetic analysis of genetic sequences without requiring evolutionary models or human intervention. Whole or partial genomes can be handled more effectively with our proposed method. It is applied to the phylogenetic analysis of genetic sequences, and the results obtained demonstrate that the k-mer natural vector method is a powerful tool for analysing and annotating genetic sequences and for determining evolutionary relationships, in terms of both accuracy and efficiency.
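
One plausible reading of a vector built from "the numbers and distributions of k-mers" is sketched below in Python: for each possible k-mer it records the count, the mean position, and a normalized second central moment of the positions. The abstract does not give the formulas, so the choice of moments, the normalization, and the function name are assumptions rather than the authors' definition.

```python
from collections import defaultdict
from itertools import product


def kmer_natural_vector(seq, k, alphabet="ACGT"):
    """Return [count, mean position, second moment] for every k-mer.

    Hypothetical sketch: positions are 1-based start indices, and the
    second moment is divided by (count * number of k-mer positions).
    """
    total = len(seq) - k + 1
    positions = defaultdict(list)
    for i in range(total):
        positions[seq[i:i + k]].append(i + 1)

    vector = []
    for kmer in map("".join, product(alphabet, repeat=k)):
        pos = positions.get(kmer, [])
        n = len(pos)
        if n == 0:
            # Absent k-mers contribute zeros for all three components.
            vector.extend([0, 0.0, 0.0])
            continue
        mu = sum(pos) / n
        d2 = sum((p - mu) ** 2 for p in pos) / (n * total)
        vector.extend([n, mu, d2])
    return vector
```

Two sequences can then be compared by any vector distance (for example Euclidean) between their vectors, which is what makes alignment-free tree building possible.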

3.
RNA oligomers of length 40–60 nt can self-assemble into covalent versions of the Azoarcus group I intron ribozyme. This process requires a series of recombination reactions in which the internal guide sequence of a nascent catalytic complex makes specific interactions with a complement triplet, CAU, in the oligomers. However, if the CAU were mutated, promiscuous self-assembly may be possible, lessening the dependence on a particular set of oligomer sequences. Here, we assayed whether oligomers containing mutations in the CAU triplet could still self-construct Azoarcus ribozymes. The mutations CAC, CAG, CUU and GAU all inhibited self-assembly to some degree, but did not block it completely in 100 mM MgCl2. Oligomers containing the CAC mutation retained the most self-assembly activity, while those containing GAU retained the least, indicating that mutations more 5′ in this triplet are the most deleterious. Self-assembly systems containing additional mutant locations were progressively less functional. Analyses of properly self-assembled ribozymes revealed that, of two recombination mechanisms possible for self-assembly, termed ‘tF2’ and ‘R2F2’, the simpler one-step ‘tF2’ mechanism is utilized when mutations exist. These data suggest that self-assembling systems are more facile than previously believed, and have relevance to the origin of complex ribozymes during the RNA World.

4.

Objectives

To evaluate sources of error in the Magnetic Resonance Imaging (MRI) measurement of percent fibroglandular tissue (%FGT) using two-point Dixon sequences for fat-water separation.

Methods

Ten female volunteers (median age: 31 yrs, range: 23–50 yrs) gave informed consent following Research Ethics Committee approval. Each volunteer was scanned twice following repositioning to enable an estimation of measurement repeatability from high-resolution gradient-echo (GRE) proton-density (PD)-weighted Dixon sequences. Differences in measures of %FGT attributable to resolution, T1 weighting and sequence type were assessed by comparison of this Dixon sequence with low-resolution GRE PD-weighted Dixon data, and against gradient-echo (GRE) or spin-echo (SE) based T1-weighted Dixon datasets, respectively.

Results

%FGT measurement from high-resolution PD-weighted Dixon sequences had a coefficient of repeatability of ±4.3%. There was no significant difference in %FGT between high-resolution and low-resolution PD-weighted data. Values of %FGT from GRE and SE T1-weighted data were strongly correlated with those derived from PD-weighted data (r = 0.995 and 0.96, respectively). However, both sequences exhibited higher mean %FGT, by 2.9% (p < 0.0001) and 12.6% (p < 0.0001), respectively, in comparison with PD-weighted data; the increase in %FGT from the SE T1-weighted sequence was significantly larger at lower breast densities.

Conclusion

Although measurement of %FGT at low resolution is feasible, T1 weighting and sequence type affect the accuracy of Dixon-based %FGT measurements; Dixon MRI protocols for %FGT measurement should be carefully considered, particularly for longitudinal or multi-centre studies.

5.
Genomics, 2019, 111(6): 1298-1305
Based on the k-mer model for protein sequences, a novel k-mer natural vector method is proposed to characterize the features of k-mers in a protein sequence, in which the numbers and distributions of k-mers are considered. It is proved that the relationship between a protein sequence and its k-mer natural vector is one-to-one. Phylogenetic analysis of protein sequences can therefore be easily performed without requiring evolutionary models or human intervention. In addition, there is no criterion for choosing a suitable k, and k has a great influence on the results obtained as well as on computational complexity. In this paper, a compound k-mer natural vector is utilized to quantify each protein sequence. The results obtained from phylogenetic analysis on three protein datasets demonstrate that our new method can precisely describe the evolutionary relationships of proteins and greatly improve computational efficiency.

6.
Endonuclease assays of the H-N-H proteins encoded by two group I introns in the Chlamydomonas moewusii chloroplast psbA gene revealed that the CmpsbA·1 intron specifies a site-specific DNA endonuclease, designated I-CmoeI. Like most previously reported intron-encoded endonucleases, I-CmoeI generates a double-strand break near the insertion site of its encoding intron, leaving 3′ extensions of 4 nt. This enzyme was purified from Escherichia coli as a fusion protein with a His tag at its N-terminus. The recombinant protein (rI-CmoeI) requires a divalent alkaline earth cation for DNA cleavage (Mg²⁺ > Ca²⁺ > Sr²⁺ > Ba²⁺). It also requires a metal cofactor for DNA binding, a property shared with H-N-H colicins but not with the homing endonucleases characterized to date. rI-CmoeI binds its recognition sequence as a monomer, as revealed by gel retardation assays. Km and kcat values of 100 ± 40 pM and 0.26 ± 0.04 min⁻¹, respectively, were determined. Replacement of the first histidine of the H-N-H motif by an alanine residue abolishes both rI-CmoeI activity and binding to its substrate. We propose that this conserved histidine residue plays a role in binding the metal cofactor and that such binding induces a structural modification of the enzyme which is required for DNA recognition.

7.
Phylogenetic tree reconstruction requires construction of a multiple sequence alignment (MSA) from sequences. Computationally, it is difficult to achieve an optimal MSA for many sequences. Moreover, even if an optimal MSA is obtained, it may not be the true MSA that reflects the evolutionary history of the underlying sequences. Therefore, errors can be introduced during MSA construction, which in turn affect the subsequent phylogenetic tree construction. In order to circumvent this issue, we extend the application of the k-tuple distance to phylogenetic tree reconstruction. The k-tuple distance between two sequences is the sum of the differences in frequency, over all possible tuples of length k, between the sequences and can be estimated without MSAs. It has traditionally been used to build a fast ‘guide tree’ to assist the construction of MSAs. Using 1470 simulated sets of sequences generated under different evolutionary scenarios, together with neighbor-joining and BioNJ trees, we compared the performance of the k-tuple distance with four commonly used distance estimators: Jukes–Cantor, Kimura, F84 and Tamura–Nei. These four distance estimators fall into the category of model-based distance estimators, as each of them takes account of a specific substitution model in order to compute the distance between a pair of already aligned sequences. Results show that trees constructed from the k-tuple distance are more accurate than those from other distances most of the time; when the divergence between underlying sequences is high, tree accuracy with the k-tuple distance can be twice that of other estimators, or higher. Furthermore, as the k-tuple distance avoids the need for constructing an MSA, it can save a tremendous amount of time for phylogenetic tree reconstruction when the data include a large number of sequences.
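
The definition of the k-tuple distance given above can be sketched in a few lines of Python. The abstract does not say whether raw counts or relative frequencies are compared, nor whether differences are absolute or squared; this sketch assumes absolute differences of relative frequencies, and the function names are hypothetical.

```python
from collections import Counter


def ktuple_frequencies(seq, k):
    """Relative frequency of each k-tuple that occurs in seq."""
    total = len(seq) - k + 1
    if total <= 0:
        return {}
    counts = Counter(seq[i:i + k] for i in range(total))
    return {t: c / total for t, c in counts.items()}


def ktuple_distance(a, b, k):
    """Sum of absolute frequency differences over all k-tuples.

    Tuples absent from both sequences contribute zero, so iterating
    over the union of observed tuples covers all possible tuples.
    """
    fa = ktuple_frequencies(a, k)
    fb = ktuple_frequencies(b, k)
    return sum(abs(fa.get(t, 0.0) - fb.get(t, 0.0)) for t in set(fa) | set(fb))
```

A pairwise matrix of such distances needs no alignment and could be passed to any distance-based tree method, such as neighbor-joining.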

8.
Chloroplast genome sequences have been used to understand evolutionary events and to efficiently infer phylogenetic relationships. Callitropsis funebris (Cupressaceae) is an endemic species in China. Its phylogenetic position is controversial due to morphological characters similar to those of Cupressus, Callitropsis, and Chamaecyparis. This study used next-generation sequencing technology to sequence the complete chloroplast genome of Ca. funebris and then constructed the phylogenetic relationship between Ca. funebris and its related species based on a variety of data sets and methods. Simple sequence repeats (SSRs) and adaptive evolution analysis were also conducted. Our results showed that the monophyletic branch consisting of Ca. funebris and Cupressus tonkinensis is a sister to Cupressus, while Callitropsis is not monophyletic; Ca. nootkatensis and Ca. vietnamensis are nested in turn at the base of the monophyletic group Hesperocyparis. The statistical results of SSRs supported the closest relationship between Ca. funebris and Cupressus. In adaptive evolution analysis under the phylogenetic background of Cupressales, the Branch model detected three genes and the Site model detected 10 genes under positive selection, and the Branch-Site model revealed that rpoA has experienced positive selection in the Ca. funebris branch. Molecular analysis from the chloroplast genome strongly supported that Ca. funebris is at the base of Cupressus. Of note, SSR features were found to shed some light on phylogenetic relationships. In short, this chloroplast genomic study has provided new insights into the phylogeny of Ca. funebris and revealed multiple chloroplast genes possibly undergoing adaptive evolution.

9.
Hammerhead ribozymes are self-cleaving RNA molecules capable of regulating gene expression in living cells. Their cleavage performance is strongly influenced by intra-molecular loop–loop interactions, a feature not readily accessible through modern prediction algorithms. Ribozyme engineering and efficient implementation of ribozyme-based genetic switches require detailed knowledge of individual self-cleavage performances. By rational design, we devised fluorescent aptamer-ribozyme RNA architectures that allow for the real-time measurement of ribozyme self-cleavage activity in vitro. The engineered nucleic acid molecules implement a split Spinach aptamer sequence that is made accessible for strand displacement upon ribozyme self-cleavage, thereby complementing the fluorescent Spinach aptamer. This fully RNA-based ribozyme performance assay correlates ribozyme cleavage activity with Spinach fluorescence to provide a rapid and straightforward technology for the validation of loop–loop interactions in hammerhead ribozymes.

10.
Located on chromosome 6p21, classical human leukocyte antigen (HLA) genes are highly polymorphic. HLA alleles associate with a variety of phenotypes, such as narcolepsy, autoimmunity, and immunologic response to infectious disease. Moreover, high-resolution genotyping of these loci is critical to achieving long-term survival of allogeneic transplants. Development of methods to obtain high-resolution analysis of HLA genotypes will lead to improved understanding of how select alleles contribute to human health and disease risk. Genomic DNAs were obtained from a cohort of n = 383 subjects recruited as part of an ulcerative colitis study and analyzed for HLA-DRB1. HLA genotypes were determined using sequence-specific oligonucleotide probes and by next-generation sequencing using the Roche/454 GSFLX instrument. The Clustering and Alignment of Polymorphic Sequences (CAPSeq) software application was developed to analyze next-generation sequencing data. The application generates HLA sequence-specific 6-digit genotype information from next-generation sequencing data using MUMmer to align sequences and the R package diffusionMap to classify sequences into their respective allelic groups. The incorporation of bootstrap aggregating (bagging) to aid in sorting sequences into allele classes resulted in improved genotyping accuracy. With 60 bagging iterations, the genotyping results obtained using CAPSeq showed high concordance with the 4-digit genotypes characterized by sequence-specific oligonucleotide probes, matching at 759 out of 766 (99.1%) alleles.

11.

Background

With the completion of genome sequencing projects for more than 30 plant species, large volumes of genome sequences have been produced and stored in online databases. Advancements in sequencing technologies have reduced the cost and time of whole genome sequencing enabling more and more plants to be subjected to genome sequencing. Despite this, genome sequence qualities of multiple plants have not been evaluated.

Methodology/Principal Finding

Integrity and accuracy were calculated to evaluate the genome sequence quality of 32 plants. The integrity of a genome sequence is represented by the ratio of chromosome size to genome size (or of scaffold size to genome size), which ranged from 55.31% to nearly 100%. The accuracy of a genome sequence is represented by the ratio of matched ESTs to selected ESTs: 52.93% to 98.28% and 89.02% to 98.85% of the randomly selected clean ESTs could be mapped to chromosome and scaffold sequences, respectively. According to the integrity, accuracy and other analyses of each plant species, thirteen plant species were divided into four levels. Arabidopsis thaliana, Oryza sativa and Zea mays had the highest quality, followed by Brachypodium distachyon, Populus trichocarpa, Vitis vinifera and Glycine max; Sorghum bicolor, Solanum lycopersicum and Fragaria vesca; and Lotus japonicus, Medicago truncatula and Malus × domestica, in that order. Assembling the scaffold sequences into chromosome sequences should be the primary task for the remaining nineteen species. Low GC content and repeat DNA influence genome sequence assembly.

Conclusion

The quality of plant genome sequences was found to be lower than envisaged and thus the rapid development of genome sequencing projects as well as research on bioinformatics tools and the algorithms of genome sequence assembly should provide increased processing and correction of genome sequences that have already been published.

12.
13.
An imidazole-containing polyamide trimer, f-ImImIm, where f is a formamido group, was recently found using NMR methods to recognize T·G mismatched base pairs. In order to characterize in detail the T·G recognition affinity and specificity of imidazole-containing polyamides, f-ImIm, f-ImImIm and f-PyImIm were synthesized. The kinetics and thermodynamics for the polyamides binding to Watson–Crick and mismatched (containing one or two T·G, A·G or G·G mismatched base pairs) hairpin oligonucleotides were determined by surface plasmon resonance and circular dichroism (CD) methods. f-ImImIm binds significantly more strongly to the T·G mismatch-containing oligonucleotides than to the sequences with other mismatched or with Watson–Crick base pairs. Compared with the Watson–Crick CCGG sequence, f-ImImIm associates more slowly with DNAs containing T·G mismatches in place of one or two C·G base pairs and, more importantly, the dissociation rate from the T·G oligonucleotides is very slow (small kd). These results clearly demonstrate the binding selectivity and enhanced affinity of side-by-side imidazole/imidazole pairings for T·G mismatches and show that the increases in affinity and specificity arise from much lower kd values with the T·G mismatched duplexes. CD titration studies of f-ImImIm complexes with T·G mismatched sequences produce strong induced bands at ~330 nm with clear isodichroic points, in support of a single minor groove complex. CD DNA bands suggest that the complexes remain in the B conformation.

14.
We have investigated the relative merits of two commonly used methods for target site selection for ribozymes: secondary structure prediction (MFold program) and in vitro accessibility assays. A total of eight methylated ribozymes with DNA arms were synthesized and analyzed in a transient co-transfection assay in HeLa cells. Residual expression levels ranging from 23 to 72% were obtained with anti-PSKH1 ribozymes compared to cells transfected with an irrelevant control ribozyme. Ribozyme efficacy depended on both ribozyme concentration and the steady state expression levels of the target mRNA. Allylated ribozymes against a subset of the target sites generally displayed poorer efficacy than their methylated counterparts. This effect appeared to be influenced by in vivo accessibility of the target site. Ribozymes designed on the basis of either selection method displayed a wide range of efficacies with no significant differences in the average activities of the two groups of ribozymes. While in vitro accessibility assays had limited predictive power, there was a significant correlation between certain features of the predicted secondary structure of the target sequence and the efficacy of the corresponding ribozyme. Specifically, ribozyme efficacy appeared to be positively correlated with the presence of short stem regions and helices of low stability within their target sequences. There were no correlations with predicted free energy or loop length.

15.
Group I intron ribozymes can repair mutated mRNAs by replacing the 3′-terminal portion of the mRNA with their own 3′-exon. This trans-splicing reaction has the potential to treat genetic disorders and to selectively kill cancer cells or virus-infected cells. However, these ribozymes have not yet been used in therapy, partially due to a low in vivo trans-splicing efficiency. Previous strategies to improve the trans-splicing efficiencies focused on designing and testing individual ribozyme constructs. Here we describe a method that selects the most efficient ribozymes from millions of ribozyme variants. This method uses an in vivo rescue assay where the mRNA of an inactivated antibiotic resistance gene is repaired by trans-splicing group I intron ribozymes. Bacterial cells that express efficient trans-splicing ribozymes are able to grow on medium containing the antibiotic chloramphenicol. We randomized a 5′-terminal sequence of the Tetrahymena thermophila group I intron and screened a library with 9 × 10⁶ ribozyme variants for the best trans-splicing activity. The resulting ribozymes showed increased trans-splicing efficiency and help the design of efficient trans-splicing ribozymes for different sequence contexts. This in vivo selection method can now be used to optimize any sequence in trans-splicing ribozymes.

16.
In recent years, unprecedented DNA sequencing capacity provided by next generation sequencing (NGS) has revolutionized genomic research. Combining the Illumina sequencing platform and a scFv library designed to confine diversity to both CDR3s, >1.9 × 10⁷ sequences have been generated. This approach allowed for in-depth analysis of the library’s diversity, and provided sequence information on virtually all scFv during selection for binding to two targets as well as a global view of these enrichment processes. Using the most frequent heavy chain CDR3 sequences, primers were designed to rescue scFv from the third selection round. Identification based on sequence frequency retrieved the most potent scFv and valuable candidates that were missed using classical in vitro screening. Thus, by combining NGS with display technologies, laborious and time-consuming upfront screening can be bypassed or complemented, and valuable insights into the selection process can be obtained to improve library design and understanding of antibody repertoires.

17.
Nanopore sequencing and phylodynamic modeling have been used to reconstruct the transmission dynamics of viral epidemics, but their application to bacterial pathogens has remained challenging. Cost-effective bacterial genome sequencing and variant calling on nanopore platforms would greatly enhance surveillance and outbreak response in communities without access to sequencing infrastructure. Here, we adapt random forest models for single nucleotide polymorphism (SNP) polishing developed by Sanderson and colleagues (2020. High precision Neisseria gonorrhoeae variant and antimicrobial resistance calling from metagenomic nanopore sequencing. Genome Res. 30(9):1354–1363) to estimate divergence and effective reproduction numbers (Re) of two methicillin-resistant Staphylococcus aureus (MRSA) outbreaks from remote communities in Far North Queensland and Papua New Guinea (PNG; n = 159). Successive barcoded panels of S. aureus isolates (2 × 12 per MinION) sequenced at low coverage (>5× to 10×) provided sufficient data to accurately infer genotypes with high recall when compared with Illumina references. Random forest models achieved high resolution on ST93 outbreak sequence types (>90% accuracy and precision) and enabled phylodynamic inference of epidemiological parameters using birth–death skyline models. Our method reproduced phylogenetic topology, origin of the outbreaks, and indications of epidemic growth (Re > 1). Nextflow pipelines implement SNP polisher training, evaluation, and outbreak alignments, enabling reconstruction of within-lineage transmission dynamics for infection control of bacterial disease outbreaks on portable nanopore platforms. Our study shows that nanopore technology can be used for bacterial outbreak reconstruction at competitive costs, providing opportunities for infection control in hospitals and communities without access to sequencing infrastructure, such as in remote northern Australia and PNG.

18.
19.

Background

DNA sequence comparison is a well-studied problem, in which two DNA sequences are compared using a weighted edit distance. Recent DNA sequencing technologies however observe an encoded form of the sequence, rather than each DNA base individually. The encoded DNA sequence may contain technical errors, and therefore encoded sequencing errors must be incorporated when comparing an encoded DNA sequence to a reference DNA sequence.

Results

Although two-base encoding is currently used in practice, many other encoding schemes are possible, whereby two or more bases are encoded at a time. A generalized k-base encoding scheme is presented, whereby feasible higher order encodings are better able to differentiate errors in the encoded sequence from true DNA sequence variants. A generalized version of the previous two-base encoding DNA sequence comparison algorithm is used to compare a k-base encoded sequence to a DNA reference sequence. Finally, simulations are performed to evaluate the power, the false positive and false negative SNP discovery rates, and the performance time of k-base encoding compared to previous methods as well as to the standard DNA sequence comparison algorithm.

Conclusions

The novel generalized k-base encoding scheme and resulting local alignment algorithm permit the development of higher fidelity ligation-based next generation sequencing technology. This bioinformatic solution affords greater robustness to errors, as well as lower false SNP discovery rates, only at the cost of computational time.
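
To make the idea of an encoded read concrete, the following Python sketch shows a two-base encoding in which each pair of adjacent bases maps to one of four codes. The abstract does not spell out its encoding, so the XOR-based code used here (A=0, C=1, G=2, T=3) and the function names are assumptions for illustration; the generalized k-base scheme itself is not reproduced.

```python
BASE_TO_BITS = {"A": 0, "C": 1, "G": 2, "T": 3}
BITS_TO_BASE = "ACGT"


def encode_two_base(seq):
    """Encode each adjacent base pair as the XOR of its 2-bit codes."""
    codes = [BASE_TO_BITS[b] for b in seq]
    return [x ^ y for x, y in zip(codes, codes[1:])]


def decode_two_base(first_base, colors):
    """Recover the bases from the first base and the encoded values."""
    current = BASE_TO_BITS[first_base]
    out = [first_base]
    for c in colors:
        current ^= c  # each code selects the next base given the current one
        out.append(BITS_TO_BASE[current])
    return "".join(out)
```

Under this assumed code, a true single-base variant changes two adjacent encoded values (ACGT gives [1, 3, 1], ATGT gives [3, 1, 1]), whereas a single error in the encoded read changes every downstream decoded base. That asymmetry is the kind of signal that lets a comparison algorithm separate encoding errors from real variants.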

20.