Similar literature
 Found 20 similar articles (search time: 140 ms)
1.
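Gene expression mainly comprises transcription, splicing, and translation; it is a highly regulated process involving many regulatory elements. Modeling, identifying, and analyzing these elements is important for understanding gene expression. This study proposes a short-sequence model based on moving sequence patterns and uses it to model and analyze transcriptional promoters and splicing regulatory elements. The promoter is the core regulatory element of gene transcription, and splicing regulatory elements take part in regulating splice-site recognition. Classification experiments show that the model can effectively identify both transcriptional promoter sequences and splicing regulatory element sequences. The model was further used to analyze genomic variants that biological experiments have confirmed to affect splicing; the results show that it can effectively predict the splicing effects of genomic variants, which further validates the model.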

2.
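Alternative splicing is the regulatory process by which a multi-exon gene produces multiple transcripts. With advances in high-throughput sequencing, especially RNA-seq, spliced sequences and splice sites can be predicted by mining massive sequencing data. Alternative splicing has broadened knowledge of gene structure and protein isoforms. However, existing short-read alignment software is affected by random alignments and produces many false-positive splice sites, which interferes with downstream data analysis. This study finds that structural features of the sequences surrounding alternative splice sites can be extracted by a deep learning model, and uses a deep convolutional neural network to identify splice sites. The model offers a high recognition rate, fast computation, strong generalization, and high robustness.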
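The abstract above does not say how sequences are fed to the convolutional network. A common choice, assumed here purely for illustration, is one-hot encoding of a fixed window around each candidate site. The helper name `one_hot` and the example window are hypothetical.

```python
# Minimal sketch (assumption): one-hot encode a DNA window as input
# for a convolutional model. Not taken from the paper itself.
BASES = "ACGT"

def one_hot(seq):
    """Return one 4-element row per base; unknown bases (e.g. N) become all zeros."""
    rows = []
    for base in seq.upper():
        rows.append([1 if base == b else 0 for b in BASES])
    return rows

window = "AGGTAAGN"          # hypothetical window around a candidate donor site
encoded = one_hot(window)    # 8 rows x 4 columns
```

Each row has at most a single 1, so a convolution kernel spanning several rows can respond to short motifs such as the GT dinucleotide.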

3.
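To improve the accuracy of splice-site recognition in untranslated regions, a method combining statistical probability with a support vector machine is proposed. The method has two stages. The first stage applies statistical methods to describe untranslated region (UTR) sequences, expressing features such as the correlation between bases, position specificity, and conservation as probabilities; these probability parameters form the input vector for the second stage. The second stage applies a support vector machine (SVM) with a polynomial kernel to recognize splice sites. Tests on a human 5′ UTR splice-site dataset show that the method recognizes splice sites in untranslated regions well.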

4.
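A modeling approach based on Bayesian networks is used to predict splice sites in eukaryotic DNA sequences. Separate models were built for donor and acceptor sites, and the model topology and the choice of upstream and downstream nodes were optimized according to the biological characteristics of the two kinds of site. After the model parameters were obtained with the maximum-likelihood learning algorithm for Bayesian networks, splice sites in the test data were predicted using 10-fold cross-validation. The average prediction accuracy was 92.5% for acceptor sites, 94.0% for pseudo-acceptor sites, 92.3% for donor sites, and 93.5% for pseudo-donor sites. Overall performance was better than that of prediction methods based on independent and conditional probability matrices and on hidden Markov models. This indicates that modeling splice sites with Bayesian networks is an effective way to predict them.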

5.
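Splice-site recognition based on support vector machines (SVM)   Total citations: 14, self-citations: 1, citations by others: 13
Splice-site recognition is an important step in gene identification and has long attracted researchers' attention. Because sequences near splice sites are conserved, several methods based on statistical properties have been applied to splice-site recognition, but their performance still needs improvement. Support vector machines, a newer kind of learning machine grounded in statistical learning theory, have developed rapidly in recent years and have been applied to many pattern-recognition problems. This paper applies them to splice-site recognition. Among sequence samples that satisfy the GT-AG rule, false splice sites far outnumber true ones; to address this, an SVM-based balanced minimum-selection method (平衡取小法) is proposed to improve recognition. Experiments show that support vector machines extract the statistical features of the conserved sequences near splice sites more effectively, generalize better to the test set, and are simpler to use. The result offers a new method for splice-site recognition and a new lead for recognizing structures and sites in the study of biological macromolecules.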
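To see why false sites dominate under the GT-AG rule, it helps to enumerate every GT and AG dinucleotide in a sequence: each one is a candidate, yet only a few are real splice sites. The sketch below is only an illustration of that imbalance, not the paper's method; the function name `candidate_sites` and the example sequence are hypothetical.

```python
# Minimal sketch (assumption): list every position that satisfies the GT-AG rule.
def candidate_sites(seq):
    """Return 0-based start positions of all GT (donor) and AG (acceptor) dinucleotides."""
    seq = seq.upper()
    donors = [i for i in range(len(seq) - 1) if seq[i:i + 2] == "GT"]
    acceptors = [i for i in range(len(seq) - 1) if seq[i:i + 2] == "AG"]
    return donors, acceptors

donors, acceptors = candidate_sites("CAGGTAAGTCTAGGT")
print(donors)     # [3, 7, 13]
print(acceptors)  # [1, 6, 11]
```

Even this 15-base toy sequence yields three candidate donors and three candidate acceptors, so a classifier trained on raw candidates sees far more negatives than positives.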

6.
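Recognition of human splice sites by support vector machines with a low-dimensional input space   Total citations: 1, self-citations: 0, citations by others: 1
The recognition of eukaryotic splice sites, as [...] (the source text is truncated here) [...] a vector built from a matrix is used to represent each sequence, and a support vector machine searches for the optimal hyperplane in a six-dimensional vector space, thereby separating true splice sites from false ones. The results show that this algorithm predicts human splice sites well. Compared with some other algorithms, it has the advantages of few parameters and high accuracy.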

7.
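This paper proposes a deep learning model based on convolutional and recurrent neural networks that identifies circular RNA splice sites in the human genome by analyzing genomic sequence data. First, from the preprocessed nucleotide sequences, two network depths, eight convolution kernel sizes, and three long short-term memory (LSTM) parameter settings were designed, giving 16 models in 8 groups. Second, mean pooling and max pooling were tested for the pooling layer, and GC content was added to improve the model's predictive power. Finally, predictions were made for experimentally validated circular RNAs from human seminal plasma. The results show that the model with a 32×4 convolution kernel, a depth of 1, and an LSTM parameter of 32 had the highest recognition rate: 0.9824 on the training set, an accuracy of 0.95 on the test set, and a correct recognition rate of 83% on the experimentally validated data. The model performs well in recognizing human circular RNA splice sites.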

8.
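Objective: To computationally identify new non-canonical splice sites in Drosophila in order to explore unknown splicing mechanisms. Methods: Gene structures were reconstructed from alignments of Drosophila melanogaster expressed sequence tags (ESTs) against the genome sequence, non-canonical splice sites were identified from them, and the sequences upstream and downstream of these sites were analyzed with the WebLogo software in search of splicing-related specific elements. Results: A total of 265 non-canonical splice sites were obtained, located in 195 protein-coding genes. Conclusion: Bioinformatics methods revealed more than a hundred non-canonical splice sites in Drosophila, laying a foundation for studying non-canonical splicing mechanisms.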

9.
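Prediction of alternative and constitutive splice sites in the human genome   Total citations: 2, self-citations: 0, citations by others: 2
Donor and acceptor splice sites of alternative and constitutive introns in the human genome were predicted with a quadratic discriminant method combined with a diversity measure (IDQD). The features used were the conserved nucleotide sequence at the splice site, the base composition and correlations of neighboring positions, the distance between a pair of alternative splice sites, and the GC and TC content of the 30 bases preceding the acceptor splice site. For alternative donor and acceptor splice sites, with the threshold ξ set to -2, the overall prediction accuracies were 87.9% and 89.9%, respectively. For constitutive donor and acceptor splice sites, with ξ set to -1, the overall prediction accuracies were 92.8% and 94.3%, respectively.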
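One of the features named above, the GC and TC content of the 30 bases before the acceptor site, is simple to compute. The sketch below assumes that `acceptor_pos` is the 0-based index of the first base of the acceptor signal and that "content" means the fraction of bases in the window. Both conventions, like the function name `upstream_content`, are assumptions rather than details from the paper.

```python
# Minimal sketch (assumption): base-composition features upstream of an acceptor site.
def upstream_content(seq, acceptor_pos, width=30):
    """Return (GC fraction, TC fraction) of the `width` bases just before acceptor_pos."""
    start = max(0, acceptor_pos - width)
    window = seq[start:acceptor_pos].upper()
    if not window:
        return 0.0, 0.0
    gc = sum(base in "GC" for base in window) / len(window)
    tc = sum(base in "TC" for base in window) / len(window)
    return gc, tc

# Toy example with a 10-base window so the arithmetic is easy to check by hand.
gc, tc = upstream_content("TTTTCCCCGGAG", acceptor_pos=10, width=10)
```

Here the window is `TTTTCCCCGG`, so the GC fraction is 6/10 and the TC fraction is 8/10; a high TC fraction reflects the pyrimidine-rich tract that typically precedes acceptor sites.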

10.
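High-accuracy splice-site recognition based on machine learning is key to annotating eukaryotic genomes. This paper uses the chi-square test to determine the sequence window length, builds a chi-square statistic difference table to extract positional features, and characterizes sequences by combining these with dinucleotide frequencies. Because positive and negative splice-site samples are highly imbalanced, ten support vector machine classifiers are built, each on a balanced set of positive and negative samples, and their decisions are combined by weighted voting, which effectively handles the imbalanced classification problem. Independent tests on the HS3D dataset show prediction accuracies of 93.39% for donor sites and 90.46% for acceptor sites, clearly higher than the reference methods. Positional features based on the chi-square statistic difference table characterize DNA sequences effectively and show promise for recognizing signal sites in molecular sequences.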
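Two ideas from this abstract can be sketched without any machine-learning library: counting dinucleotides as sequence features, and combining several classifiers trained on balanced subsets by weighted voting. The abstract does not say how the ten balanced sets are drawn or how the weights are chosen, so the disjoint slicing and the caller-supplied weights below are assumptions, and all function names are hypothetical.

```python
from itertools import product

# The 16 dinucleotides in a fixed order: AA, AC, AG, AT, CA, ...
DINUCLEOTIDES = ["".join(p) for p in product("ACGT", repeat=2)]

def dinucleotide_counts(seq):
    """Count overlapping dinucleotides; pairs containing other symbols are ignored."""
    seq = seq.upper()
    counts = dict.fromkeys(DINUCLEOTIDES, 0)
    for i in range(len(seq) - 1):
        pair = seq[i:i + 2]
        if pair in counts:
            counts[pair] += 1
    return [counts[d] for d in DINUCLEOTIDES]

def balanced_subsets(positives, negatives, n_subsets=10):
    """Assumption: pair all positives with n_subsets disjoint slices of the negatives."""
    size = len(negatives) // n_subsets
    return [(positives, negatives[k * size:(k + 1) * size]) for k in range(n_subsets)]

def weighted_vote(predictions, weights):
    """Combine +1/-1 predictions, one per classifier, into a weighted majority label."""
    score = sum(w * p for p, w in zip(predictions, weights))
    return 1 if score >= 0 else -1
```

In use, each subset would train one classifier on feature vectors such as `dinucleotide_counts(window)`, and `weighted_vote` would merge the ten outputs, for example with weights proportional to each classifier's validation accuracy.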

11.
In this paper, a Bayesian method for inference is developed for the zero-modified Poisson (ZMP) regression model. This model is very flexible for analyzing count data without requiring any information about inflation or deflation of zeros in the sample. A general class of prior densities based on an information matrix is considered for the model parameters. A sensitivity study to detect influential cases that can change the results is performed based on the Kullback–Leibler divergence. Simulation studies are presented in order to illustrate the performance of the developed methodology. Two real datasets on leptospirosis notification in Bahia State (Brazil) are analyzed using the proposed methodology for the ZMP model.

12.
Many statistical methods have been developed to screen for differentially expressed genes associated with specific phenotypes in microarray data. However, it remains a major challenge to synthesize the observed expression patterns with abundant biological knowledge for a more complete understanding of the biological functions among genes. Various methods, including clustering analysis on genes, neural networks, Bayesian networks, and pathway analysis, have been developed toward this goal. In most of these procedures, the activation and inhibition relationships among genes have hardly been utilized in the modeling steps. We propose two novel Bayesian models to integrate the microarray data with the putative pathway structures obtained from the KEGG database and the directional gene–gene interactions in the medical literature. We define the symmetric Kullback–Leibler divergence of a pathway, and use it to identify the pathway(s) most supported by the microarray data. A Markov chain Monte Carlo sampling algorithm is given for posterior computation in the hierarchical model. The proposed method is shown to select the most supported pathway in an illustrative example. Finally, we apply the methodology to a real microarray data set to understand the gene expression profile of osteoblast lineage at defined stages of differentiation. We observe that our method correctly identifies the pathways that are reported to play essential roles in modulating bone mass.
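The abstract does not give the paper's pathway-level definition, but the general symmetric (symmetrized) Kullback–Leibler divergence between two discrete distributions is KL(p‖q) + KL(q‖p). The sketch below implements only that generic quantity, assuming both distributions are strictly positive wherever the other is; the function names are hypothetical.

```python
import math

def kl(p, q):
    """Kullback-Leibler divergence KL(p || q) for discrete distributions (natural log)."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def symmetric_kl(p, q):
    """Symmetrized divergence: KL(p || q) + KL(q || p)."""
    return kl(p, q) + kl(q, p)

p = [0.5, 0.5]
q = [0.9, 0.1]
d = symmetric_kl(p, q)
```

Unlike the plain KL divergence, this quantity does not depend on the order of its arguments, which makes it convenient for ranking candidate structures by how far they sit from the data.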

13.
Often in biomedical studies, the routine use of linear mixed-effects models (based on Gaussian assumptions) can be questionable when the longitudinal responses are skewed in nature. Skew-normal/elliptical models are widely used in those situations. Often, those skewed responses might also be subjected to some upper and lower quantification limits (QLs; viz., longitudinal viral-load measures in HIV studies), beyond which they are not measurable. In this paper, we develop a Bayesian analysis of censored linear mixed models replacing the Gaussian assumptions with skew-normal/independent (SNI) distributions. The SNI is an attractive class of asymmetric heavy-tailed distributions that includes the skew-normal, skew-t, skew-slash, and skew-contaminated normal distributions as special cases. The proposed model provides flexibility in capturing the effects of skewness and heavy tail for responses that are either left- or right-censored. For our analysis, we adopt a Bayesian framework and develop a Markov chain Monte Carlo algorithm to carry out the posterior analyses. The marginal likelihood is tractable, and utilized to compute not only some Bayesian model selection measures but also case-deletion influence diagnostics based on the Kullback–Leibler divergence. The newly developed procedures are illustrated with a simulation study as well as an HIV case study involving analysis of longitudinal viral loads.

14.
Beta diversity is among the most employed theoretical concepts in ecology and biodiversity conservation. To date, a self-contained definition of it, with no reference to alpha and gamma diversity, has never been proposed. Using Kullback-Leibler divergence, we present the explicit formula of Shannon's β entropy, a bias correction for its estimator and a confidence interval. We also provide the mathematical framework to decompose Shannon diversity into several hierarchical nested levels. From botanical inventories of tropical forest plots in French Guiana, we estimate Shannon diversity at the plot, forest and regional level. We believe this is a complete and useful toolbox for ecologists interested in partitioning biodiversity.
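A standard way to write Shannon β entropy with the Kullback–Leibler divergence is as the weighted average divergence of each community from the pooled community, which equals γ entropy minus α entropy. The sketch below checks that identity numerically on made-up data. It assumes this standard form matches the abstract's "explicit formula" and omits the bias correction and confidence interval; all names and numbers are hypothetical.

```python
import math

def shannon(p):
    """Shannon entropy of a discrete distribution (natural log)."""
    return -sum(x * math.log(x) for x in p if x > 0)

def kl(p, q):
    """Kullback-Leibler divergence KL(p || q)."""
    return sum(x * math.log(x / y) for x, y in zip(p, q) if x > 0)

def pooled_distribution(communities, weights):
    """Weighted mixture of the community species distributions."""
    n_species = len(communities[0])
    return [sum(w * c[s] for c, w in zip(communities, weights)) for s in range(n_species)]

def beta_entropy(communities, weights):
    """Weighted mean KL divergence of each community from the pooled distribution."""
    pooled = pooled_distribution(communities, weights)
    return sum(w * kl(c, pooled) for c, w in zip(communities, weights))

# Two toy communities over three species, equally weighted (illustrative numbers only).
communities = [[0.7, 0.2, 0.1], [0.1, 0.3, 0.6]]
weights = [0.5, 0.5]

h_alpha = sum(w * shannon(c) for c, w in zip(communities, weights))
h_gamma = shannon(pooled_distribution(communities, weights))
h_beta = beta_entropy(communities, weights)
```

Because `h_beta` is computed directly from the communities and their mixture, it can be defined without first computing α or γ entropy, while still satisfying `h_gamma == h_alpha + h_beta` up to rounding.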

15.
Science can be seen as a sequential process where each new study augments evidence to the existing knowledge. To have the best prospects to make an impact in this process, a new study should be designed optimally taking into account the previous studies and other prior information. We propose a formal approach for the covariate prioritization, that is the decision about the covariates to be measured in a new study. The decision criteria can be based on conditional power, change of the p-value, change in lower confidence limit, Kullback–Leibler divergence, Bayes factors, Bayesian false discovery rate or difference between prior and posterior expectation. The criteria can be also used for decisions on the sample size. As an illustration, we consider covariate prioritization based on genome-wide association studies for C-reactive protein levels and make suggestions on the genes to be studied further.

16.
Region-of-interest (ROI) and interior reconstructions for computed tomography (CT) have drawn much attention and can be of practical value for potential applications in reducing radiation dose and hardware cost. The conventional wisdom is that the exact reconstruction of an interior ROI is very difficult to obtain using only data associated with lines through the ROI. In this study, we propose and investigate optimization-based methods for ROI and interior reconstructions based on total variation (TV) and data derivative. Objective functions are built from the image TV term plus the data finite difference term. Different data terms in the forms of L1-norm, L2-norm, and Kullback–Leibler divergence are incorporated and investigated in the optimizations. Efficient algorithms are developed using the proximal alternating direction method of multipliers (ADMM) for each program. All sub-problems of ADMM are solved by using closed-form solutions with high efficiency. The customized optimizations and algorithms based on the TV and derivative-based data terms can serve as a powerful tool for interior reconstructions. Simulations and real-data experiments indicate that the proposed methods can be of practical value for CT imaging applications.

17.
In evolutionary biology, genetic sequences carry with them a trace of the underlying tree that describes their evolution from a common ancestral sequence. The question of how many sequence sites are required to recover this evolutionary relationship accurately depends on the model of sequence evolution, the substitution rate, divergence times and the method used to infer phylogenetic history. A particularly challenging problem for phylogenetic methods arises when a rapid divergence event occurred in the distant past. We analyse an idealised form of this problem in which the terminal edges of a symmetric four-taxon tree are some factor (λ) times the length of the interior edge. We determine an order λ² lower bound on the growth rate for the sequence length required to resolve the tree (independent of any particular branch length). We also show that this rate of sequence length growth can be achieved by existing methods (including the simple ‘maximum parsimony’ method), and compare these order λ² bounds with an order λ growth rate for a model that describes low-homoplasy evolution. In the final section, we provide a generic bound on the sequence length requirement for a more general class of Markov processes.

18.
19.
Phylogenetic inference: how much evolutionary history is knowable?   Total citations: 5, self-citations: 2, citations by others: 3
In order to reconstruct phylogenetic trees from extremely dissimilar sequences it is necessary to estimate accurately the extent of sequence divergence. In this paper a new method of sequence analysis, Markov triple analysis, is developed for determining the relative frequencies of nucleotide substitutions within the three branches of a three-taxon dendrogram. Assuming that nucleotide sites are independently and identically distributed and assuming a Markov model for nucleotide (or protein) evolution, it is shown that the unique Markov matrices can be reconstructed given only the joint probability distribution relating three taxa. (In the much simpler case involving only two taxa and two character states, Markov matrices can also be reconstructed, provided symmetry assumptions are placed on the elements of the matrices.) The method is illustrated using sequence data from the combined first and second codon positions derived from complete human, mouse, and cow mitochondrial sequences.

20.
In common with other multigene families, sequence diversity in the hemoglobin genes of cladoceran crustaceans has been heavily impacted by gene conversion events. Because of their structural complexity (six exons, five introns), these genes provide a good opportunity to study the influence of intron length and position on the conversion process. This study surveys the patterns of divergence in variants of one hemoglobin gene (H1) from two closely related species of Daphnia using a PCR-based approach. Although its effects were most pronounced at their 5' ends, intron and exon regions of these genes showed similar exposure to gene conversion, excepting intron 2. This intron, which was the only one with a marked length difference among variants, showed substantial sequence divergence, suggesting that gene conversion was disrupted. These results, together with those on hemoglobin gene families in other organisms, indicate that sequence tracts showing gene conversion are often distributed in a mosaic fashion. The reactivation of gene conversion downstream of a block protected from its effects suggests that there are multiple initiation points, and the distribution of conversion tracts suggests that exon/intron splice sites are important in this regard.
