Similar Documents
 20 similar documents found.
1.
Obtaining satisfactory results with neural networks depends on the availability of large data samples. The use of small training sets generally reduces performance. Most classical Quantitative Structure-Activity Relationship (QSAR) studies for a specific enzyme system have been performed on small data sets. We focus on the neuro-fuzzy prediction of biological activities of HIV-1 protease inhibitory compounds when inferring from small training sets. We propose two computational intelligence prediction techniques which are suitable for small training sets, at the expense of some computational overhead. Both techniques are based on the FAMR model. The FAMR is a Fuzzy ARTMAP (FAM) incremental learning system used for classification and probability estimation. During the learning phase, each sample pair is assigned a relevance factor proportional to the importance of that pair. The two proposed algorithms in this paper are: 1) The GA-FAMR algorithm, which is new, consists of two stages: a) During the first stage, we use a genetic algorithm (GA) to optimize the relevances assigned to the training data. This improves the generalization capability of the FAMR. b) In the second stage, we use the optimized relevances to train the FAMR. 2) The Ordered FAMR is derived from a known algorithm. Instead of optimizing relevances, it optimizes the order of data presentation using the algorithm of Dagher et al. In our experiments, we compare these two algorithms with an algorithm not based on the FAM, the FS-GA-FNN introduced in [4], [5]. We conclude that when inferring from small training sets, both techniques are efficient, in terms of generalization capability and execution time. The computational overhead introduced is compensated by better accuracy. Finally, the proposed techniques are used to predict the biological activities of newly designed potential HIV-1 protease inhibitors.  相似文献   
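To make stage (a) of GA-FAMR concrete, the sketch below shows a generic genetic algorithm searching over per-sample relevance factors in [0, 1]. The FAMR itself is not implemented here, and `fitness` is a hypothetical user-supplied callback (for example, cross-validated accuracy of a FAMR trained with the candidate relevances); the GA operators shown are a plain illustrative choice, not the authors' configuration.

```python
import numpy as np

def ga_optimize_relevances(n_samples, fitness, pop_size=30, generations=50,
                           mutation_sd=0.1, seed=0):
    """Generic GA sketch for optimising per-sample relevance factors.

    `fitness(relevances)` is a hypothetical callback that trains a FAMR with
    the given relevances and returns a validation score; neither the FAMR nor
    that callback is implemented here.
    """
    rng = np.random.default_rng(seed)
    pop = rng.uniform(0.0, 1.0, size=(pop_size, n_samples))
    for _ in range(generations):
        scores = np.array([fitness(ind) for ind in pop])
        order = np.argsort(scores)[::-1]
        parents = pop[order[: pop_size // 2]]                 # truncation selection
        children = []
        while len(children) < pop_size - len(parents):
            a, b = parents[rng.integers(len(parents), size=2)]
            cut = rng.integers(1, n_samples)                  # one-point crossover
            child = np.concatenate([a[:cut], b[cut:]])
            child += rng.normal(0.0, mutation_sd, n_samples)  # Gaussian mutation
            children.append(np.clip(child, 0.0, 1.0))
        pop = np.vstack([parents, children])
    scores = np.array([fitness(ind) for ind in pop])
    return pop[int(np.argmax(scores))]
```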

2.
Aims: To develop appropriate statistical approaches to plan and evaluate proficiency tests for the enumeration of Escherichia coli, addressing, in particular, a possible but frequently unavoidable lack of test sample homogeneity. Methods and Results: Each of 50 laboratories analysed two samples of a stabilized suspension of E. coli in duplicate, using various media, inoculation methods, and incubation times and conditions. In parallel, the E. coli suspension was tested by the organiser for homogeneity and stability. Escherichia coli counts followed a log-normal distribution. After eliminating, by Youden analysis, two data sets that were considered outliers and eight data sets for underperformance of the laboratories (substantial lack of repeatability), the standard deviation of the mean was about 0·06 log10 units. There was no evidence of bimodality of the data. Lack of homogeneity of distribution of bacteria had a strong effect on measurement uncertainty, in addition to laboratory bias and method repeatability. The homogeneity decreases during storage of the individual test vials; this effect could be modelled by the known kinetics of inactivation of micro-organisms. The results were confirmed by Monte Carlo simulations. Conclusions: By a tailored analysis of proficiency testing data, it is possible to distinguish the effects of lack of homogeneity, laboratory bias and method repeatability on the measurement uncertainty. Significance and Impact of the Study: A statistical tool is provided to solve problems related to lack of stability of microbiological test material and to separate the effects of sample inhomogeneity from the performance of the individual laboratory.

3.
Dendrochronologia, 2014, 32(4): 343-356
A number of processing options associated with the use of a “regional curve” to standardise tree-ring measurements and generate a chronology representing changing tree growth over time are discussed. It is shown that failing to use pith offset estimates can generate a small but systematic chronology error. Where chronologies contain long-timescale signal variance, tree indices created by division of the raw measurements by RCS curve values produce chronologies with a skewed distribution. A simple empirical method of converting tree-indices to have a normal distribution is proposed. The Expressed Population Signal, which is widely used to estimate the statistical confidence of chronologies created using curve-fitting methods of standardisation, is not suitable for use with RCS generated chronologies. An alternative implementation, which takes account of the uncertainty associated with long-timescale as well as short-timescale chronology variance, is proposed. The need to assess the homogeneity of differently-sourced sets of measurement data and their suitability for amalgamation into a single data set for RCS standardisation is discussed. The possible use of multiple growth-rate based RCS curves is considered where a potential gain in chronology confidence must be balanced against the potential loss of long-timescale variance. An approach to the use of the “signal-free” method for generating artificial measurement series with the ‘noise’ characteristics of real data series but with a known chronology signal applied for testing standardisation performance is also described.  相似文献   
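As a concrete illustration of the basic RCS operation described above (each raw measurement divided by the regional-curve value for its cambial age, then averaged by calendar year), here is a minimal sketch. It uses a simple age-aligned mean rather than a smoothed curve, ignores pith offsets, and is an assumption-laden simplification rather than the authors' implementation.

```python
import numpy as np

def rcs_chronology(ring_widths, first_years):
    """Minimal regional-curve standardisation sketch.

    ring_widths : list of 1-D arrays, one per tree, ordered from the innermost
                  ring outward (i.e. by cambial age).
    first_years : calendar year of the first ring of each series.
    """
    max_age = max(len(w) for w in ring_widths)

    # Regional curve: mean ring width at each cambial age across all trees.
    # (A real implementation would smooth this curve and use pith offsets.)
    sums = np.zeros(max_age)
    counts = np.zeros(max_age)
    for w in ring_widths:
        sums[:len(w)] += w
        counts[:len(w)] += 1
    rcs_curve = sums / counts

    # Tree indices: raw measurement divided by the RCS value for its age.
    # Chronology: mean index in each calendar year.
    year_sum, year_n = {}, {}
    for w, y0 in zip(ring_widths, first_years):
        indices = w / rcs_curve[:len(w)]
        for age, val in enumerate(indices):
            yr = y0 + age
            year_sum[yr] = year_sum.get(yr, 0.0) + val
            year_n[yr] = year_n.get(yr, 0) + 1
    years = sorted(year_sum)
    return years, np.array([year_sum[y] / year_n[y] for y in years])
```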

4.
We analyse optimal and heuristic place prioritization algorithms for biodiversity conservation area network design which can use probabilistic data on the distribution of surrogates for biodiversity. We show how an Expected Surrogate Set Covering Problem (ESSCP) and a Maximal Expected Surrogate Covering Problem (MESCP) can be linearized for computationally efficient solution. For the ESSCP, we study the performance of two optimization software packages (XPRESS and CPLEX) and five heuristic algorithms based on traditional measures of complementarity and rarity as well as the Shannon and Simpson indices of α‐diversity which are being used in this context for the first time. On small artificial data sets the optimal place prioritization algorithms often produced more economical solutions than the heuristic algorithms, though not always ones guaranteed to be optimal. However, with large data sets, the optimal algorithms often required long computation times and produced no better results than heuristic ones. Thus there is generally little reason to prefer optimal to heuristic algorithms with probabilistic data sets.  相似文献   
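For orientation, the sketch below shows a greedy, complementarity-style heuristic for probabilistic surrogate coverage: at each step it adds the site that most increases the total expected coverage. The data layout and the 0.95 coverage target are illustrative assumptions; the exact ESSCP/MESCP formulations solved with XPRESS and CPLEX, and the Shannon/Simpson-based heuristics, are not reproduced here.

```python
import numpy as np

def greedy_expected_cover(p, target=0.95):
    """Greedy heuristic sketch for probabilistic surrogate coverage.

    p : (n_sites, n_surrogates) array; p[i, j] is the probability that
        surrogate j occurs at site i (illustrative data layout).
    target : expected coverage required for every surrogate.
    """
    n_sites, n_surr = p.shape
    miss = np.ones(n_surr)            # probability each surrogate is still missed
    chosen = []
    remaining = set(range(n_sites))
    while np.any(1.0 - miss < target) and remaining:
        # Complementarity: gain in total expected coverage for each candidate site.
        gains = {i: np.sum(miss * p[i]) for i in remaining}
        best = max(gains, key=gains.get)
        if gains[best] <= 0:
            break
        chosen.append(best)
        miss *= (1.0 - p[best])
        remaining.remove(best)
    return chosen, 1.0 - miss
```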

5.
Inference of haplotypes is important in genetic epidemiology studies. However, all large genotype data sets have errors due to the use of inexpensive genotyping machines that are fallible and shortcomings in genotype scoring software, which can have an enormous impact on haplotype inference. In this article, we propose two novel strategies to reduce the impact of genotyping errors on haplotype inference. The first method makes use of double sampling. For each individual, the “GenoSpectrum”, which consists of all possible genotypes and their corresponding likelihoods, is computed. The second method is a genotype clustering algorithm based on multi-genotyping data, which also assigns a “GenoSpectrum” to each individual. We then describe two hybrid EM algorithms (called DS-EM and MG-EM) that perform haplotype inference based on the “GenoSpectrum” of each individual obtained by double sampling and multi-genotyping data. Both simulated data sets and a quasi real-data set demonstrate that our proposed methods perform well in different situations and outperform the conventional EM algorithm and the HMM algorithm proposed by Sun, Greenwood, and Neal (2007, Genetic Epidemiology 31, 937-948) when the genotype data sets have errors.

6.
Spatial correlation modeling comprises both spatial autocorrelation and spatial cross-correlation processes. The spatial autocorrelation theory has been well-developed. It is necessary to advance the method of spatial cross-correlation analysis to supplement the autocorrelation analysis. This paper presents a set of models and analytical procedures for spatial cross-correlation analysis. By analogy with Moran’s index newly expressed in a spatial quadratic form, a theoretical framework is derived for geographical cross-correlation modeling. First, two sets of spatial cross-correlation coefficients are defined, including a global spatial cross-correlation coefficient and local spatial cross-correlation coefficients. Second, a pair of scatterplots of spatial cross-correlation is proposed, and the plots can be used to visually reveal the causality behind spatial systems. Based on the global cross-correlation coefficient, Pearson’s correlation coefficient can be decomposed into two parts: direct correlation (partial correlation) and indirect correlation (spatial cross-correlation). As an example, the methodology is applied to the relationships between China’s urbanization and economic development to illustrate how to model spatial cross-correlation phenomena. This study is an introduction to developing the theory of spatial cross-correlation, and future geographical spatial analysis might benefit from these models and indexes.  相似文献   
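To illustrate the quadratic-form construction described above, here is a minimal sketch of a global cross-correlation coefficient and one possible local decomposition. The specific normalisation chosen (spatial weights scaled to sum to one) is an assumption and may differ from the definitions used in the paper.

```python
import numpy as np

def spatial_cross_correlation(x, y, W):
    """Moran-style global and local spatial cross-correlation sketch.

    x, y : two variables observed at the same n locations.
    W    : n x n spatial weights matrix (e.g. contiguity), zero diagonal.
    The quadratic form z_x' W z_y mirrors Moran's I = z' W z.
    """
    x = np.asarray(x, float)
    y = np.asarray(y, float)
    zx = (x - x.mean()) / x.std()
    zy = (y - y.mean()) / y.std()
    Wn = W / W.sum()                  # normalise weights to sum to one
    global_rc = zx @ Wn @ zy          # global cross-correlation coefficient
    local_rc = zx * (Wn @ zy)         # local coefficients; they sum to global_rc
    return global_rc, local_rc
```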

7.
Image registration has been used to support pixel-level data analysis on pedobarographic image data sets. Some registration methods have focused on robustness and sacrificed speed, but a recent approach based on external contours offered both high computational processing speed and high accuracy. However, since contours can be influenced by local perturbations, we sought more global methods. Thus, we propose two new registration methods based on the Fourier transform (cross-correlation and phase correlation), which offer high computational speed. We found that both proposed methods revealed high accuracy for the similarity measures considered, using control geometric transformations. Additionally, both methods revealed high computational processing speed which, combined with their accuracy and robustness, allows their implementation in near-real-time applications. Furthermore, we found that the current methods were robust to moderate levels of noise and, consequently, do not require a noise-removal procedure as the contours method does.
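The phase-correlation idea is easy to demonstrate for pure translations: the normalised cross-power spectrum of two images has an inverse transform with a sharp peak at their relative shift. The sketch below is a minimal version of the general technique, not the authors' pedobarographic pipeline.

```python
import numpy as np

def phase_correlation_shift(a, b):
    """Estimate the integer (dy, dx) translation aligning image b to image a."""
    A = np.fft.fft2(a)
    B = np.fft.fft2(b)
    R = A * np.conj(B)
    R /= np.abs(R) + 1e-12          # normalised cross-power spectrum
    corr = np.fft.ifft2(R).real     # sharp peak at the relative shift
    dy, dx = np.unravel_index(np.argmax(corr), corr.shape)
    # Map peaks in the upper half of the array to negative shifts.
    if dy > a.shape[0] // 2:
        dy -= a.shape[0]
    if dx > a.shape[1] // 2:
        dx -= a.shape[1]
    return dy, dx

# Usage: a circular shift of (5, -3) should be recovered.
img = np.random.rand(64, 64)
moved = np.roll(img, (5, -3), axis=(0, 1))
print(phase_correlation_shift(moved, img))  # prints (5, -3)
```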

8.
This study makes use of three sources of data, morphology and two chloroplast DNA sequences, ndhF and rbcL, to resolve relationships in Gesneriaceae. Cladograms from each of the three data sets separately are not topologically congruent. Statistical indices suggest that each data set is congruent with the ndhF data although rbcL and morphology are themselves incongruent. Consensus methods provide no resolution of taxonomic relationships when trees from the different data sets are combined. Combining data sets generally results in cladograms that are more fully resolved than each of the data sets analyzed separately and support for the clades increases based on higher decay index and bootstrap values. These results indicate that there is a phylogenetic signal common to each of the data sets; however, the noise (errors due to homoplasy, mis-scoring, etc.) unique to each data source masks this signal. In combining the data, the evidence for the common evolutionary history in each data set overcomes the noise and is apparent in the resulting trees.

9.
A negative staining technique is presented based on the use of 40-60 nm quartz membrane supported by a silicon grid. The quartz membrane is fabricated by thermal growth of silicon dioxide on a silicon substrate followed by an anisotropic silicon etching step giving rectangular holes in the silicon substrate. The hydrophilic membrane is shown to be ideally suited for negative staining due to its spreading characteristics, homogeneity, heat resistance and mechanical stability. Micrographs of phage lambda are presented showing the detailed structure of the tail. A simple method of calculating the number of adsorbed particles based on diffusion limited association is also presented.  相似文献   
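The closing sentence refers to a diffusion-limited association estimate. A common textbook version of that calculation, adsorbed surface density equal to 2c·sqrt(Dt/π) for a perfectly absorbing surface, is sketched below with purely illustrative numbers; the paper's own expression may include additional factors.

```python
import math

def adsorbed_particles(conc_per_m3, D_m2_s, time_s, area_m2):
    """Particles adsorbed on a perfectly absorbing surface by pure diffusion.

    Standard 1-D diffusion-limited result: surface density = 2*c*sqrt(D*t/pi).
    """
    surface_density = 2.0 * conc_per_m3 * math.sqrt(D_m2_s * time_s / math.pi)
    return surface_density * area_m2

# Illustrative values only: 1e12 phage/ml, D ~ 4e-12 m^2/s, 60 s, 10 x 10 um field.
print(adsorbed_particles(1e18, 4e-12, 60.0, (10e-6) ** 2))
```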

10.
Generic relationships within Episcieae were assessed using ITS and ndhF sequences. Previous analyses of this tribe have focussed only on ndhF data and have excluded two genera, Rhoogeton and Oerstedina, which are included in this analysis. Data were analyzed using both parsimony and maximum-likelihood methods. Results from partition homogeneity tests imply that the two data sets are significantly incongruent, but when Rhoogeton is removed from the analysis, the data sets are not significantly different. The combined data sets reveal greater strength of relationships within the tribe with the exception of the position of Rhoogeton. Poorly or unresolved relationships based exclusively on ndhF data are more fully resolved with ITS data. These resolved clades include the monophyly of the genera Columnea and Paradrymonia and the sister-group relationship of Nematanthus and Codonanthe. A closer affinity between Neomortonia nummularia and N. rosea than has previously been seen is apparent from these data, although these two species are not monophyletic in any tree. Lastly, Capanea appears to be a member of Gloxinieae, although C. grandiflora remains within Episcieae. Evolution of fruit type, epiphytic habit, and presence of tubers is re-examined with the new data presented here.  相似文献   

11.
Traditional resampling-based tests for homogeneity in covariance matrices across multiple groups resample residuals, that is, data centered by group means. These residuals do not share the same second moments when the null hypothesis is false, which makes them difficult to use in the setting of multiple testing. An alternative approach is to resample standardized residuals, data centered by group sample means and standardized by group sample covariance matrices. This approach, however, has been observed to inflate type I error when sample size is small or data are generated from heavy-tailed distributions. We propose to improve this approach by using robust estimation for the first and second moments. We discuss two statistics: the Bartlett statistic and a statistic based on eigen-decomposition of sample covariance matrices. Both statistics can be expressed in terms of standardized errors under the null hypothesis. These methods are extended to test homogeneity in correlation matrices. Using simulation studies, we demonstrate that the robust resampling approach provides comparable or superior performance, relative to traditional approaches, for single testing and reasonable performance for multiple testing. The proposed methods are applied to data collected in an HIV vaccine trial to investigate possible determinants, including vaccine status, vaccine-induced immune response level and viral genotype, of unusual correlation pattern between HIV viral load and CD4 count in newly infected patients.  相似文献   
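For reference, the classical Bartlett-type statistic mentioned above can be computed directly from the group sample covariance matrices, as in the sketch below. Only the plain statistic is shown; the robust moment estimation and the resampling of standardized residuals proposed in the paper are not reproduced.

```python
import numpy as np

def bartlett_box_m(groups):
    """Bartlett/Box-type statistic for homogeneity of covariance matrices.

    groups : list of (n_i, p) data arrays.
    Returns M = (N - k) ln|S_pooled| - sum_i (n_i - 1) ln|S_i|.
    """
    k = len(groups)
    ns = [g.shape[0] for g in groups]
    covs = [np.cov(g, rowvar=False) for g in groups]     # unbiased S_i
    N = sum(ns)
    pooled = sum((n - 1) * S for n, S in zip(ns, covs)) / (N - k)
    M = (N - k) * np.linalg.slogdet(pooled)[1]
    for n, S in zip(ns, covs):
        M -= (n - 1) * np.linalg.slogdet(S)[1]
    return M

rng = np.random.default_rng(0)
g1 = rng.normal(size=(40, 3))
g2 = rng.normal(size=(50, 3)) * 1.5      # inflated covariance in group 2
print(bartlett_box_m([g1, g2]))
```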

12.
One application of gene expression arrays is to derive molecular profiles, i.e., sets of genes, which discriminate well between two classes of samples, for example between tumour types. Users are confronted with a multitude of classification methods of varying complexity that can be applied to this task. To help decide which method to use in a given situation, we compare important characteristics of a range of classification methods, including simple univariate filtering, penalised likelihood methods and the random forest. Classification accuracy is an important characteristic, but the biological interpretability of molecular profiles is also important. This implies both parsimony and stability, in the sense that profiles should not vary much when there are slight changes in the training data. We perform a random resampling study to compare these characteristics between the methods and across a range of profile sizes. We measure stability by adopting the Jaccard index to assess the similarity of resampled molecular profiles. We carry out a case study on five well-established cancer microarray data sets, for two of which we have the benefit of being able to validate the results in an independent data set. The study shows that those methods which produce parsimonious profiles generally result in better prediction accuracy than methods which don't include variable selection. For very small profile sizes, the sparse penalised likelihood methods tend to result in more stable profiles than univariate filtering while maintaining similar predictive performance.  相似文献   
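The stability measure is straightforward to compute: take the gene profiles selected on each resampled data set and average their pairwise Jaccard similarities. The gene names in the sketch below are arbitrary placeholders.

```python
import numpy as np

def jaccard(a, b):
    """Jaccard similarity between two gene profiles (sets of gene names)."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if (a or b) else 1.0

def profile_stability(profiles):
    """Mean pairwise Jaccard index over profiles from resampled training data."""
    sims = [jaccard(profiles[i], profiles[j])
            for i in range(len(profiles)) for j in range(i + 1, len(profiles))]
    return float(np.mean(sims))

# Three hypothetical resampled profiles of size 4.
print(profile_stability([{"TP53", "BRCA1", "MYC", "EGFR"},
                         {"TP53", "BRCA1", "MYC", "KRAS"},
                         {"TP53", "BRCA1", "EGFR", "KRAS"}]))
```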

13.
The use of non-invasive genetic sampling to estimate population size in elusive or rare species is increasing. The data generated from this sampling differ from traditional mark-recapture data in that individuals may be captured multiple times within a session or there may only be a single sampling event. To accommodate this type of data, we develop a method, named capwire, based on a simple urn model containing individuals of two capture probabilities. The method is evaluated using simulations of an urn and of a more biologically realistic system where individuals occupy space, and display heterogeneous movement and DNA deposition patterns. We also analyse a small number of real data sets. The results indicate that when the data contain capture heterogeneity the method provides estimates with small bias and good coverage, along with high accuracy and precision. Performance is not as consistent when capture rates are homogeneous and when dealing with populations substantially larger than 100. For the few real data sets where N is approximately known, capwire's estimates are very good. We compare capwire's performance to commonly used rarefaction methods and to two heterogeneity estimators in program capture: Mh-Chao and Mh-jackknife. No method works best in all situations. While less precise, the Chao estimator is very robust. We also examine how large samples should be to achieve a given level of accuracy using capwire. We conclude that capwire provides an improved way to estimate N for some DNA-based data sets.  相似文献   
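As an illustration of the two-capture-probability urn model underlying capwire, the sketch below simulates capture counts for a population in which a subset of individuals is several times easier to sample. It only generates data; the likelihood maximisation over N that capwire performs is not reproduced, and all parameter values are arbitrary.

```python
import numpy as np

def simulate_two_rate_urn(N=100, frac_easy=0.3, ratio=3.0, draws=150, seed=1):
    """Simulate non-invasive sampling from a two-capture-rate urn.

    A fraction `frac_easy` of the N individuals is `ratio` times more likely
    to be drawn on each capture.  Returns the capture count per individual.
    """
    rng = np.random.default_rng(seed)
    weights = np.ones(N)
    weights[: int(frac_easy * N)] = ratio
    p = weights / weights.sum()
    drawn = rng.choice(N, size=draws, replace=True, p=p)
    return np.bincount(drawn, minlength=N)

counts = simulate_two_rate_urn()
observed = counts[counts > 0]
print(len(observed), "distinct individuals detected;",
      "capture-count histogram:", np.bincount(observed))
```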

14.
Molecular diffusion and transport are fundamental processes in physical, chemical, biochemical, and biological systems. However, current approaches to measure molecular transport in cells and tissues based on perturbation methods such as fluorescence recovery after photobleaching are invasive, fluctuation correlation methods are local, and single-particle tracking requires the observation of isolated particles for relatively long periods of time. We propose to detect molecular transport by measuring the time cross-correlation of fluctuations at a pair of locations in the sample. When the points are farther apart than two times the size of the point spread function, the maximum of the correlation is proportional to the average time a molecule takes to move from a specific location to another. We demonstrate the method by simulations, using beads in solution, and by measuring the diffusion of molecules in cellular membranes. The spatial pair cross-correlation method detects barriers to diffusion and heterogeneity of diffusion because the time of the correlation maximum is delayed in the presence of diffusion barriers. This noninvasive, sensitive technique follows the same molecule over a large area, thereby producing a map of molecular flow. It does not require isolated molecules, and thus many molecules can be labeled at the same time and within the point spread function.  相似文献   
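A minimal version of the pair correlation computation is sketched below: the fluctuation time traces recorded at two pixels are cross-correlated, and the lag of the maximum approximates the mean transit time between the two locations. The normalisation and the handling of scanning artefacts used in the paper are not reproduced.

```python
import numpy as np

def pair_cross_correlation(f1, f2, max_lag):
    """Time cross-correlation of intensity fluctuations at two locations.

    G(tau) = <dF1(t) * dF2(t + tau)> / (<F1><F2>) for equal-length traces
    f1 and f2; the lag of the maximum reflects the average transit time.
    """
    f1 = np.asarray(f1, float)
    f2 = np.asarray(f2, float)
    d1, d2 = f1 - f1.mean(), f2 - f2.mean()
    lags = np.arange(max_lag + 1)
    G = np.array([np.mean(d1[: len(d1) - t] * d2[t:]) for t in lags])
    return lags, G / (f1.mean() * f2.mean())
```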

15.
Phylogenetic estimation has largely come to rely on explicitly model-based methods. This approach requires that a model be chosen and that that choice be justified. To date, justification has largely been accomplished through use of likelihood-ratio tests (LRTs) to assess the relative fit of a nested series of reversible models. While this approach certainly represents an important advance over arbitrary model selection, the best fit of a series of models may not always provide the most reliable phylogenetic estimates for finite real data sets, where all available models are surely incorrect. Here, we develop a novel approach to model selection, which is based on the Bayesian information criterion, but incorporates relative branch-length error as a performance measure in a decision theory (DT) framework. This DT method includes a penalty for overfitting, is applicable prior to running extensive analyses, and simultaneously compares all models being considered and thus does not rely on a series of pairwise comparisons of models to traverse model space. We evaluate this method by examining four real data sets and by using those data sets to define simulation conditions. In the real data sets, the DT method selects the same or simpler models than conventional LRTs. In order to lend generality to the simulations, codon-based models (with parameters estimated from the real data sets) were used to generate simulated data sets, which are therefore more complex than any of the models we evaluate. On average, the DT method selects models that are simpler than those chosen by conventional LRTs. Nevertheless, these simpler models provide estimates of branch lengths that are more accurate both in terms of relative error and absolute error than those derived using the more complex (yet still wrong) models chosen by conventional LRTs. This method is available in a program called DT-ModSel.  相似文献   
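The decision-theoretic idea can be summarised in a few lines: approximate posterior model probabilities with BIC weights and pick the model that minimises the expected branch-length loss. In the sketch below, the loss matrix is a hypothetical input rather than being computed from fitted branch lengths as the DT-ModSel program does.

```python
import numpy as np

def dt_model_choice(log_likelihoods, n_params, n_sites, branch_length_loss):
    """Decision-theoretic model selection sketch in the spirit of DT-ModSel.

    BIC_i = -2 lnL_i + k_i ln(n).  `branch_length_loss[i, j]` is the penalty
    for using model i when model j is true (supplied by the user here).
    """
    bic = -2.0 * np.asarray(log_likelihoods) + np.asarray(n_params) * np.log(n_sites)
    w = np.exp(-0.5 * (bic - bic.min()))
    w /= w.sum()                               # approximate model probabilities
    risk = branch_length_loss @ w              # expected loss of choosing each model
    return int(np.argmin(risk)), bic, risk
```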

16.
A negative staining technique is presented based on the use of 40-60 nm quartz membrane supported by a silicon grid. The quartz membrane is fabricated by thermal growth of silicon dioxide on a silicon substrate followed by an anisotropic silicon etching step giving rectangular holes in the silicon substrate. The hydrophilic membrane is shown to be ideally suited for negative staining due to its spreading characteristics, homogeneity, heat resistance and mechanical stability. Micrographs of phage λ are presented showing the detailed structure of the tail. A simple method of calculating the number of adsorbed particles based on diffusion limited association is also presented.

17.
This paper analyzes the power divergence estimators when homogeneity/heterogeneity hypotheses among standardized mortality ratios (SMRs) are taken into account. A Monte Carlo study shows that when the standard mortality rate is not external, that is it is estimated from the sample data, these estimators have a good performance even for small sample sets and in particular the minimum chi‐square estimators have a better behavior compared to the classical maximum likelihood estimators. In order to make decisions under homogeneity/heterogeneity hypotheses of SMRs we propose some test‐statistics which consider the minimum power divergence estimators. Through a numerical example focused on SMRs of melanoma mortality ratios in different regions of the US, a homogeneity/heterogeneity study is illustrated.  相似文献   
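For orientation, the sketch below computes SMRs and the Pearson chi-square test of their homogeneity, which is the lambda = 1 member of the power-divergence family that the paper generalises; the minimum power-divergence estimators themselves are not reproduced.

```python
import numpy as np
from scipy.stats import chi2

def smr_homogeneity_chi2(observed, expected):
    """Pearson chi-square test of homogeneity of SMRs across regions.

    SMR_i = O_i / E_i; under homogeneity all regions share the pooled ratio
    theta = sum(O) / sum(E).
    """
    O = np.asarray(observed, float)
    E = np.asarray(expected, float)
    theta = O.sum() / E.sum()
    stat = np.sum((O - E * theta) ** 2 / (E * theta))
    pval = chi2.sf(stat, df=len(O) - 1)
    return O / E, theta, stat, pval

# Illustrative counts only.
print(smr_homogeneity_chi2([30, 42, 18], [25.0, 40.0, 24.0]))
```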

18.
The resolution along the optical axis (z) is much less than the in-plane resolution in any current optical microscope, conventional or otherwise. We have used mutually tilted, through-focal section views of the same object to provide a solution to this problem. A tilting specimen stage was constructed for an optical microscope, which with the use of a coverslip-free water immersion lens, allowed the collection of data sets from intact Drosophila melanogaster embryos at viewing directions up to 90 degrees apart. We have devised an image processing scheme to determine the relative tilt, translation, and sampling parameters of the different data sets. This involves the use of a modified phase cross-correlation function, which produces a very sharp maximum. Finally the data sets are merged using figure-of-merit and local area scaling techniques borrowed from x-ray protein crystallography. We demonstrate the application of this technique to data sets from a metaphase plate in an embryo of Drosophila melanogaster. As expected, the merged reconstruction combined the highest resolution available in the individual data sets. As estimated from the Fourier transform, the final resolution is 0.25 microns in x and y and 0.4 microns in z. In the final reconstruction all ten chromosome arms can be easily delineated; this was not possible in the individual data sets. Within many of the arms the two individual chromatids can be seen. In some cases the chromatids are wrapped around each other helically, in others they lie alongside each other in a parallel arrangement.  相似文献   

19.
The inference of phylogenetic hypotheses from landmark data has been questioned during the last two decades. Besides theoretical concerns, one of the limitations pointed out for the use of landmark data in phylogenetics is its (supposed) lack of information relevant to the inference of phylogenetic relationships. However, empirical analyses are scarce; there exists no previous study that systematically evaluates the phylogenetic performance of landmark data in a series of data sets. In the present study, we analysed 41 published data sets in order to assess the correspondence between the phylogenetic trees derived from landmark data and those obtained with alternative and independent sources of evidence, and determined the main factors that might affect this inference. The data sets presented a variable number of terminals (5–200) and configurations (1–14), belonging to different taxonomic groups. The results showed that for most of the data sets analysed, the trees derived from landmark data presented a low correspondence with the reference phylogenies. The results were similar irrespective of the phylogenetic method considered. Complementary analyses strongly suggested that the limited amount of evidence included in each data set (one or a few landmark configurations) is the main cause for that low correspondence: the phylogenetic analysis of eight data sets that presented three or more configurations clearly showed that the inclusion of several landmark configurations improves the results. In addition, the analyses indicated that the inclusion of landmark data from different configurations is more important than the inclusion of more landmarks from the same configuration. Based on the results presented here, we consider that the poor results previously obtained in phylogenetic analyses based on landmark data were not caused by methodological limitations, but rather due to the limited amount of evidence included in the data sets.  相似文献   

20.
The alignment of single-particle images fails at low signal-to-noise ratios and small particle sizes, because noise produces false peaks in the cross-correlation function used for alignment. A maximum-likelihood approach to the two-dimensional alignment problem is described which allows the underlying structure to be estimated from large data sets of very noisy images. Instead of finding the optimum alignment for each image, the algorithm forms a weighted sum over all possible in-plane rotations and translations of the image. The weighting factors, which are the probabilities of the image transformations, are computed as the exponential of a cross-correlation function. Simulated data sets were constructed and processed by the algorithm. The results demonstrate a greatly reduced sensitivity to the choice of a starting reference, and the ability to recover structures from large data sets having very low signal-to-noise ratios.  相似文献   
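The weighting scheme can be illustrated for translations only: every possible shift of an image is weighted by the exponential of its cross-correlation with the reference, and the probability-weighted, back-shifted images are averaged into an updated reference. Rotations, the noise-variance update and the convergence loop of the published algorithm are omitted, so this is only a simplified sketch.

```python
import numpy as np

def ml_translational_update(images, reference, sigma):
    """One maximum-likelihood refinement step over in-plane translations.

    For each image, every shift s is weighted by exp(CC(s) / sigma^2), and the
    weighted, back-shifted images are averaged into a new reference.
    """
    new_ref = np.zeros_like(reference, dtype=float)
    R = np.conj(np.fft.fft2(reference))
    for img in images:
        cc = np.fft.ifft2(np.fft.fft2(img) * R).real      # CC for every shift
        logw = cc / sigma ** 2
        w = np.exp(logw - logw.max())
        w /= w.sum()                                       # shift probabilities
        for (sy, sx), wk in np.ndenumerate(w):
            if wk > 1e-6:                                  # skip negligible terms
                new_ref += wk * np.roll(img, (-sy, -sx), axis=(0, 1))
    return new_ref / len(images)
```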

