Similar literature
 20 similar articles found
1.
2.
3.
4.
In linguistic studies, the academic level of the vocabulary in a text can be described in terms of statistical physics by using a “temperature” concept related to the text's word-frequency distribution. We propose a “comparative thermo-linguistic” technique to analyze the vocabulary of a text to determine its academic level and its target readership in any given language. We apply this technique to a large number of books by several authors and examine how the vocabulary of a text changes when it is translated from one language to another. Unlike the uniform results produced using the Zipf law, using our “word energy” distribution technique we find variations in the power-law behavior. We also examine some common features that span across languages and identify some intriguing questions concerning how to determine when a text is suitable for its intended readership.

5.

Background

Zipf's law states that the relationship between the frequency of a word in a text and its rank (the most frequent word has rank 1, the 2nd most frequent word has rank 2, …) is approximately linear when plotted on a double logarithmic scale. It has been argued that the law is not a relevant or useful property of language because simple random texts - constructed by concatenating random characters including blanks behaving as word delimiters - exhibit a Zipf's law-like word rank distribution.

Methodology/Principal Findings

In this article, we examine the flaws of such putative good fits of random texts. We demonstrate - by means of three different statistical tests - that ranks derived from random texts and ranks derived from real texts are statistically inconsistent with the parameters employed to argue for such a good fit, even when the parameters are inferred from the target real text. Our findings are valid for both the simplest random texts composed of equally likely characters as well as more elaborate and realistic versions where character probabilities are borrowed from a real text.

Conclusions/Significance

The good fit of random texts to real Zipf's law-like rank distributions has not yet been established. Therefore, we suggest that Zipf's law might in fact be a fundamental law in natural languages.
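The rank-frequency relationship described in the Background of this entry can be illustrated with a short Python sketch. It is not the authors' method or any of their three statistical tests; it only shows, under the assumption that words are whitespace-separated tokens, how one might rank word frequencies and estimate the slope of the log-log plot by ordinary least squares. The function names `rank_frequency` and `zipf_slope` are hypothetical.

```python
import math
from collections import Counter


def rank_frequency(text):
    """Return word frequencies sorted from most to least frequent (rank 1 first)."""
    counts = Counter(text.split())
    return sorted(counts.values(), reverse=True)


def zipf_slope(frequencies):
    """Least-squares slope of log(frequency) against log(rank)."""
    xs = [math.log(rank) for rank in range(1, len(frequencies) + 1)]
    ys = [math.log(freq) for freq in frequencies]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / sxx


# Toy text whose word counts are exactly 12 / rank for ranks 1 to 4,
# so the fitted slope should come out very close to -1.
toy_text = " ".join(["the"] * 12 + ["of"] * 6 + ["and"] * 4 + ["to"] * 3)
print(zipf_slope(rank_frequency(toy_text)))
```

A slope near -1 on such a plot is the classic Zipfian signature. As the abstract argues, a visually good straight-line fit alone does not establish statistical consistency between random and real texts.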

6.
Understanding the genetic regulatory network, comprising genes, RNA, proteins, and the network connections and dynamical control rules among them, is a major task of contemporary systems biology. I focus here on the use of the ensemble approach to find one or more well-defined ensembles of model networks whose statistical features match those of real cells and organisms. Such ensembles should help explain and predict features of real cells and organisms. More precisely, an ensemble of model networks is defined by constraints on the "wiring diagram" of regulatory interactions, and the "rules" governing the dynamical behavior of regulated components of the network. The ensemble consists of all networks consistent with those constraints. Here I discuss ensembles of random Boolean networks, scale free Boolean networks, "medusa" Boolean networks, continuous variable networks, and others. For each ensemble, M statistical features, such as the size distribution of avalanches in gene activity changes unleashed by transiently altering the activity of a single gene, the distribution in distances between gene activities on different cell types, and others, are measured. This creates an M-dimensional space, where each ensemble corresponds to a cluster of points or distributions. Using current and future experimental techniques, such as gene arrays, these M properties are to be measured for real cells and organisms, again yielding a cluster of points or distributions in the M-dimensional space. The procedure then finds ensembles close to those of real cells and organisms, and hill climbs to attempt to match the observed M features. One thus obtains one or more ensembles that should predict and explain many features of the regulatory networks in cells and organisms.
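One of the statistical features named in this abstract, the avalanche unleashed by altering a single gene, can be sketched for the simplest ensemble it mentions, a random Boolean network. The code below is a minimal illustration under stated assumptions, not the author's procedure: every gene has exactly `k_inputs` randomly chosen inputs, updates are synchronous, and the avalanche size is taken to be the number of genes whose activity ever differs from the unperturbed run. All function names are hypothetical.

```python
import random


def make_network(n_genes, k_inputs, rng):
    """Draw random wiring and random Boolean rules for each gene."""
    inputs = [rng.sample(range(n_genes), k_inputs) for _ in range(n_genes)]
    # One truth table per gene: 2**k output bits, indexed by the input pattern.
    rules = [[rng.randint(0, 1) for _ in range(2 ** k_inputs)]
             for _ in range(n_genes)]
    return inputs, rules


def step(state, inputs, rules):
    """Synchronously update every gene from the current state of its inputs."""
    new_state = []
    for gene_inputs, rule in zip(inputs, rules):
        index = 0
        for source in gene_inputs:
            index = (index << 1) | state[source]
        new_state.append(rule[index])
    return new_state


def avalanche_size(state, inputs, rules, flipped_gene, n_steps):
    """Count genes whose activity ever differs after flipping one gene."""
    reference = list(state)
    perturbed = list(state)
    perturbed[flipped_gene] ^= 1
    damaged = {flipped_gene}
    for _ in range(n_steps):
        reference = step(reference, inputs, rules)
        perturbed = step(perturbed, inputs, rules)
        damaged.update(
            i for i, (a, b) in enumerate(zip(reference, perturbed)) if a != b
        )
    return len(damaged)


rng = random.Random(0)
inputs, rules = make_network(n_genes=20, k_inputs=2, rng=rng)
state = [rng.randint(0, 1) for _ in range(20)]
sizes = [avalanche_size(state, inputs, rules, g, n_steps=10) for g in range(20)]
print(sorted(sizes))
```

Repeating this over many networks drawn from the same ensemble would give an empirical avalanche-size distribution, one coordinate of the M-dimensional feature space the abstract describes.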

7.
Carlson SM, Najmi A, Cohen HJ. Proteomics 2007, 7(7):1037-1046
Correlated variables have been shown to confound statistical analyses in microarray experiments. The same effect applies to an even greater degree in proteomics, especially with the use of MS for parallel measurements. Biological effects such as PTM, fragmentation, and multimer formation can produce strongly correlated variables. The problem is compounded in some types of MS by technical effects such as incomplete chromatographic separation, binding to multiple surfaces, or multiple ionizations. Existing methods for dimension reduction, notably principal components analysis and related techniques, are not always satisfactory because they produce data that often lack clear biological interpretation. We propose a preprocessing algorithm that clusters highly correlated features, using the Bayes information criterion to select an optimal number of clusters. Statistical analysis of clusters, instead of individual features, benefits from lower noise, and reduces the difficulties associated with strongly correlated data. This preprocessing increases the statistical power of analyses using false discovery rate on simulated data. Strong correlations are often present in real data, and we find that clustering improves biomarker discovery in clinical SELDI-TOF-MS datasets of plasma from patients with Kawasaki disease, and bone-marrow cell extracts from patients with acute myeloid or acute lymphoblastic leukemia.

8.
9.
When investigators undertake searches of DNA databases, they normally discard large numbers of alignments that demonstrate very weak resemblances to each other, retaining only those that show statistically significant levels of resemblance. We show here that a great deal of information can be extracted from these weak alignments by examining them en masse. This is done by building three-dimensional similarity landscapes from the alignments, landscapes that reveal whether an unusual number of individually nonsignificant alignments tend to match up to a particular region of the query sequence being searched. The power of the search is increased by the use of libraries consisting entirely of introns or of exons. We show that (1) similarity landscapes with a variety of features can be generated from both intron and exon libraries, using introns or exons as query sequences; (2) the landscape features are real and not a statistical artifact; (3) well-known protein motifs used as query sequences can generate various landscape features; and (4) there is some evidence for resemblances between short regions of sequence carried by introns and exons. One possible interpretation of these results is that both introns and exons may have been built up during their evolution from short regions of sequence that as a result are now widely distributed throughout eukaryotic genomes. Such an interpretation would imply that these short regions have common ancestry. Alternatively, the wide sharing of short pieces of DNA may reflect regions with particular structural properties that have arisen through convergent evolution. The similarity-landscape approach can be used to detect such widespread structural motifs and sequence motifs in the genome that might be missed by less-global searches. 
It can also be used in conjunction with algorithms developed for detecting significant multiple alignments by isolating promising subsets of the databases that can be examined in more detail.

Correspondence to: C. Wills

10.
The typical output of many computational methods to identify binding sites is a long list of motifs containing some real motifs (those most likely to correspond to the actual binding sites) along with a large number of random variations of these. We present a statistical method to separate real motifs from their artifacts. This produces a short list of high quality motifs that is sufficient to explain the over-representation of all motifs in the given sequences. Using synthetic data sets, we show that the output of our method is very accurate. On various sets of upstream sequences in S. cerevisiae, our program identifies several known binding sites, as well as a number of significant novel motifs.

11.
12.
Mathematical models of the generation of genetic texts appeared simultaneously with the first DNA sequencing. They are used to establish functional and evolutionary relations between genetic texts, to predict the number and distribution of specific sites in a sequence and to identify "meaningful" words. The present paper deals with two problems: 1) The significance of deviations from the mean statistical characteristics in a genetic text. Anyone who has addressed the statistical analysis of sequenced DNA is familiar with the question: what deviations from the expected frequencies of occurrence of particular words testify to the "biological" significance of those words? We propose a formula for the variance of the number of occurrences of a word in the text, with allowance for word overlaps, making it possible to assess the significance of the deviations from the expected statistical characteristics. 2) A new method for predicting the frequencies of occurrence of particular words in a genetic text using the statistical characteristics of "spaced" L-grams. The method can be used for predicting the number of restriction sites in human DNA and in planning experiments on the physical mapping and sequencing of the human genome.
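The abstract does not reproduce the paper's variance formula, so the sketch below should not be read as the authors' result. It illustrates the general idea of an overlap-aware variance under one explicit assumption: letters are independent and identically distributed (a Bernoulli text model). Under that assumption, occurrences at start positions closer than the word length are correlated only when the word can overlap itself at that shift, and the covariance terms can be summed exactly. The function names and the example letter probabilities are hypothetical.

```python
def word_probability(word, letter_probs):
    """Probability of `word` at a fixed position under independent letters."""
    p = 1.0
    for letter in word:
        p *= letter_probs[letter]
    return p


def count_mean_and_variance(word, letter_probs, text_length):
    """Mean and variance of the number of (possibly overlapping) occurrences."""
    m = len(word)
    positions = text_length - m + 1  # number of possible start positions
    if positions <= 0:
        return 0.0, 0.0
    p = word_probability(word, letter_probs)
    mean = positions * p
    variance = positions * p * (1.0 - p)
    for shift in range(1, m):
        pairs = positions - shift  # start-position pairs at this distance
        if pairs <= 0:
            break
        # Two copies can coexist at this shift only if the word overlaps itself.
        if word[shift:] == word[:m - shift]:
            joint = p * word_probability(word[m - shift:], letter_probs)
        else:
            joint = 0.0
        variance += 2.0 * pairs * (joint - p * p)
    return mean, variance


# Assumed uniform base composition, for illustration only.
uniform = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}
for w in ("AAAA", "ACGT"):
    print(w, count_mean_and_variance(w, uniform, text_length=1000))
```

Both words have the same expected count, but the self-overlapping word `AAAA` has the larger variance, which is why a deviation of a given size is less significant for it than for `ACGT`.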

13.
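Objective: To analyze the current status and published-article data of the Chinese Journal of Applied Physiology (《中国应用生理学杂志》), as a reference for authors, readers, and editorial staff. Methods: Papers published in the journal from 2009 to 2013 were retrieved from CNKI, and bibliometric methods were used to quantitatively analyze article type, funding support, institution type, and subject classification. Results: From 2009 to 2013 the journal published 742 papers, 27 of them in English; the proportion of funded papers was high, averaging 82.6%. Conclusion: The journal has a stable supply of manuscripts and a broad, well-balanced distribution of subjects, and it plays an important role in the development of applied physiology in China.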

14.
A widespread problem in biological research is assessing whether a model adequately describes some real-world data. But even if a model captures the large-scale statistical properties of the data, should we be satisfied with it? We developed a method, inspired by Alan Turing, to assess the effectiveness of model fitting. We first built a self-propelled particle model whose properties (order and cohesion) statistically matched those of real fish schools. We then asked members of the public to play an online game (a modified Turing test) in which they attempted to distinguish between the movements of real fish schools and those generated by the model. Even though the statistical properties of the real data and the model were consistent with each other, the public could still distinguish between the two, highlighting the need for model refinement. Our results demonstrate that we can use ‘citizen science’ to cross-validate and improve model fitting not only in the field of collective behaviour, but also across a broad range of biological systems.

15.
16.
17.
IRBM 2014, 35(1):3-10
In this paper we present a brief survey of geometric variational approaches, and more precisely of statistical region-based active contours for medical image segmentation. In these approaches, image features are considered as random variables whose distribution may be either parametric, belonging to the exponential family, or non-parametric, estimated with a kernel density method. Statistical region-based terms are listed and reviewed, showing that these terms can depict a wide spectrum of segmentation problems. A shape prior can also be incorporated into the previous statistical terms. A discussion of some optimization schemes available to solve the variational problem is also provided. Examples on real medical images are given to illustrate some of the given criteria.

18.
Characterizing the microenvironment surrounding protein sites. (Total citations: 4; self-citations: 0; citations by others: 4)
Sites are microenvironments within a biomolecular structure, distinguished by their structural or functional role. A site can be defined by a three-dimensional location and a local neighborhood around this location in which the structure or function exists. We have developed a computer system to facilitate structural analysis (both qualitative and quantitative) of biomolecular sites. Our system automatically examines the spatial distributions of biophysical and biochemical properties, and reports those regions within a site where the distribution of these properties differs significantly from control non-sites. The properties range from simple atom-based characteristics such as charge to polypeptide-based characteristics such as type of secondary structure. Our analysis of sites uses non-sites as controls, providing a baseline for the quantitative assessment of the significance of the features that are uncovered. In this paper, we use radial distributions of properties to study three well-known sites (the binding sites for calcium, the milieu of disulfide bridges, and the serine protease active site). We demonstrate that the system automatically finds many of the previously described features of these sites and augments these features with some new details. In some cases, we cannot confirm the statistical significance of previously reported features. Our results demonstrate that analysis of protein structure is sensitive to assumptions about background distributions, and that these distributions should be considered explicitly during structural analyses.

19.

Background

Several tools are available to identify miRNAs from deep-sequencing data; however, only a few of them, such as miRDeep, can identify novel miRNAs and are also available as standalone applications. Given the differences between plant and animal miRNAs, particularly in the distribution of hairpin length and the nature of complementarity with the duplex partner (or miRNA star), the underlying statistical features of miRDeep and of other tools that use similar features are likely to be affected.

Results

The potential effects on features such as minimum free energy, stability of secondary structures, and excision length were examined, and the parameters of those displaying sizable changes were estimated for plant-specific miRNAs. We found that most of these features acquired a new set of values or distributions for plant-specific miRNAs. While the length of conserved positions (nucleus) in mature miRNAs was relatively longer in plants, the difference in the distribution of minimum free energy between real and background hairpins was marginal. However, the choice of source (species) of background sequences was found to affect both the minimum free energy and miRNA hairpin stability. The new parameters were tested on an Illumina dataset from maize seedlings, and the results were compared with those obtained using default parameters. The newly parameterized model was found to have much improved specificity and sensitivity over its default counterpart.

Conclusions

In summary, the present study reports the behavior of a few general and tool-specific statistical features for improving the prediction accuracy of plant miRNAs from deep-sequencing data.

20.