Similar articles
 20 similar articles found (search time: 15 ms)
1.
2.
3.
MOTIVATION: Protein-protein interactions have proved to be a valuable starting point for understanding the inner workings of the cell. Computational methodologies have been built which both predict interactions and use interaction datasets in order to predict other protein features. Such methods require gold standard positive (GSP) and negative (GSN) interaction sets. Here we examine and demonstrate the usefulness of homologous interactions in predicting good quality positive and negative interaction datasets. RESULTS: We generate GSP interaction sets as subsets from experimental data using only interaction and sequence information. We can therefore produce sets for several species (many of which at present have no identified GSPs). Comprehensive error rate testing demonstrates the power of the method. We also show how the use of our datasets significantly improves the predictive power of algorithms for interaction prediction and function prediction. Furthermore, we generate GSN interaction sets for yeast and examine the use of homology along with other protein properties such as localization, expression and function. Using a novel method to assess the accuracy of a negative interaction set, we find that the best single selector for negative interactions is a lack of co-function. However, an integrated method using all the characteristics shows significant improvement over any current method for identifying GSN interactions. The nature of homologous interactions is also examined and we demonstrate that interologs are found more commonly within species than across species. CONCLUSION: GSP sets built using our homologous verification method are demonstrably better than standard sets in terms of predictive ability. We can build such GSP sets for several species. When generating GSNs we show a combination of protein features and lack of homologous interactions gives the highest quality interaction sets. 
AVAILABILITY: GSP and GSN datasets for all the studied species can be downloaded from http://www.stats.ox.ac.uk/~deane/HPIV.
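The idea of verifying an interaction through homologous interactions can be illustrated with a minimal sketch. The function name, the input layout, and the rule "keep an interaction if at least one pair of homologs is also reported to interact" are assumptions made for illustration; the abstract does not specify the authors' actual procedure or thresholds.

```python
from itertools import product


def homology_supported(interactions, homologs):
    """Return the interactions supported by at least one homologous interaction.

    interactions: set of frozenset({a, b}) protein pairs (no self-interactions)
    homologs: dict mapping a protein to a set of its homologs (hypothetical input)
    """
    supported = set()
    for pair in interactions:
        a, b = sorted(pair)
        # Try every combination of a homolog of a with a homolog of b.
        for ha, hb in product(homologs.get(a, ()), homologs.get(b, ())):
            candidate = frozenset((ha, hb))
            if candidate != pair and candidate in interactions:
                supported.add(pair)
                break
    return supported
```

In this sketch, an interaction with no homologous counterpart in the dataset is simply left out of the candidate positive set; it is not treated as a negative.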

4.
When modelling habitat suitability and species distribution, ecologists must deal with issues related to the spatial resolution of species occurrence and environmental data. Because the spatial resolution of species and environmental datasets ranges from centimeters to hundreds of kilometers, choosing the optimal combination of resolutions is important for achieving the highest possible prediction accuracy. We evaluated how the spatial resolution of land cover/waterbody datasets (meters to 1 km) affects waterbird habitat suitability models based on atlas data (grid cell of 12 × 11 km). We hypothesized that the area, perimeter and number of waterbodies computed from high resolution datasets would explain distributions of waterbirds better because coarse resolution datasets omit small waterbodies affecting species occurrence. Specifically, we investigated which spatial resolution of waterbodies best explains the distribution of seven waterbirds nesting on ponds/lakes with areas ranging from 0.1 ha to hundreds of hectares. Our results show that the area and perimeter of waterbodies derived from high resolution datasets (raster data with 30 m resolution, vector data corresponding to map scale 1:10 000) explain the distribution of the waterbirds better than those calculated using less accurate datasets, despite the coarse grain of the species data. Taking into account the spatial extent (global vs regional) of the datasets, we found the Global Inland Waterbody Dataset to be the most suitable for modelling the distribution of waterbirds. In general, we recommend using land cover data of a resolution sufficient to capture the smallest patches of habitat suitable for a given species’ presence, for both fine and coarse grain habitat suitability and distribution modelling.

5.
6.
The identification of genome-wide cis-regulatory modules (CRMs) and characterization of their associated epigenetic features are fundamental steps toward the understanding of gene regulatory networks. Although integrative analysis of available genome-wide information can provide new biological insights, the lack of novel methodologies has become a major bottleneck. Here, we present a comprehensive analysis tool called combinatorial CRM decoder (CCD), which utilizes publicly available information to identify and characterize genome-wide CRMs in a species of interest. CCD first defines a set of epigenetic features that is significantly associated with a set of known CRMs as a code called ‘trace code’, and subsequently uses the trace code to pinpoint putative CRMs throughout the genome. Using 61 genome-wide data sets obtained from 17 independent mouse studies, CCD successfully catalogued ∼12 600 CRMs (five distinct classes), including polycomb repressive complex 2 target sites as well as imprinting control regions. Interestingly, we discovered that ∼4% of the identified CRMs belong to at least two different classes (termed ‘multi-functional CRMs’), suggesting their functional importance for regulating spatiotemporal gene expression. These examples show that CCD can be applied to potentially any genome-wide dataset and can therefore help to uncover genome-wide CRMs in various species.

7.
8.
9.
10.
11.
Analysis results for agricultural monitoring and crop production based on remote sensing observations differ when the observations are obtained at different spatial scales from multiple remote sensors, even if they are acquired in the same time period and processed by the same algorithms, models or methods. These differences can be described quantitatively from three main aspects: multiple remote sensing observations, crop parameter estimation models, and spatial scale effects of surface parameters. We propose a new method to analyse and correct the differences between multi-source and multi-scale remote sensing surface reflectance datasets, aiming to provide a reference for further agricultural applications that use remotely sensed observations from different sources. The method is built on the physical and mathematical properties of multi-source and multi-scale reflectance datasets. Statistical theory is used to extract the statistical characteristics of multiple surface reflectance datasets and to analyse quantitatively the spatial variation of these characteristics at multiple spatial scales. Then, taking the surface reflectance at the small spatial scale as the baseline, Gaussian distribution theory is used to correct the multiple surface reflectance datasets, based on the physical characteristics and mathematical distribution properties obtained above and on their spatial variations. The method was verified with two sets of multi-satellite images acquired over two experimental fields, located in Inner Mongolia and Beijing, China, with different degrees of homogeneity of the underlying surface.
Experimental results indicate that differences between surface reflectance datasets at multiple spatial scales can be effectively corrected over non-homogeneous underlying surfaces, which provides a data basis for further multi-source and multi-scale crop growth monitoring and yield prediction, and for the corresponding consistency analysis and evaluation.
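One simple way to picture a Gaussian-based correction toward a fine-scale baseline is moment matching: rescale the coarse-scale values so that their mean and standard deviation equal those of the baseline. This is only a sketch under that assumption; the abstract does not give the authors' correction formula, and the function name and inputs below are hypothetical.

```python
from statistics import fmean, pstdev


def match_to_baseline(coarse, baseline):
    """Rescale coarse-scale reflectance to the baseline's mean and standard deviation.

    Assumes both samples are roughly Gaussian, so two moments describe them.
    """
    mu_c, sigma_c = fmean(coarse), pstdev(coarse)
    mu_b, sigma_b = fmean(baseline), pstdev(baseline)
    if sigma_c == 0:
        raise ValueError("coarse data has zero variance; cannot rescale")
    # Standardise against the coarse distribution, then map onto the baseline one.
    return [(x - mu_c) / sigma_c * sigma_b + mu_b for x in coarse]
```

After this transformation the corrected values share the baseline's first two moments, while the relative ordering of pixels within the coarse dataset is preserved.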

12.
MOTIVATION: Several kernel-based methods have been recently introduced for the classification of small molecules. Most available kernels on molecules are based on 2D representations obtained from chemical structures, but far less work has focused so far on the definition of effective kernels that can also exploit 3D information. RESULTS: We introduce new ideas for building kernels on small molecules that can effectively use and combine 2D and 3D information. We tested these kernels in conjunction with support vector machines for binary classification on the 60 NCI cancer screening datasets as well as on the NCI HIV data set. Our results show that 3D information leveraged by these kernels can consistently improve prediction accuracy in all datasets. AVAILABILITY: An implementation of the small molecule classifier is available from http://www.dsi.unifi.it/neural/src/3DDK.

13.
1. Eutrophication is a serious threat in many parts of the world, and identifying the environmental factors that determine the spatial distribution of eutrophicated waterbodies, as well as developing management tools, is a challenge. 2. In this study, data from the Ile‐de‐France region were analysed to determine whether catchment scale environmental variables could predict concentrations of chlorophyll a (used as a proxy for eutrophication status) in artificial lakes and reservoirs. 3. Generalised additive models (GAM) and random forest models (RF) displayed greater predictive power than generalised linear models, indicating the importance of non‐monotonic relationships. Using RF modelling, very high predictive accuracy was achieved for both continuous and binomial (eutrophic or not) response variables (continuous: R² = 0.715; binomial: kappa = 0.764, 89% of waterbodies were accurately predicted). The better predictive power and robustness of RF versus GAM was attributed to the former's ability to better handle complex interactions between predictors and to account for threshold effects. 4. Our results confirmed the close link between the water quality of lakes and reservoirs and the characteristics of their catchments. Moreover, we also showed that (i) simple (e.g. linear and/or monotonic) relationships between catchment land use and water quality were only found for sub‐regional datasets, and (ii) land use needs to be considered in association with complementary environmental variables (hydromorphological variables) to best assess its impact on water quality.
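The kappa statistic quoted for the binomial response measures agreement beyond chance, kappa = (p_o - p_e) / (1 - p_e), where p_o is the observed accuracy and p_e the accuracy expected by chance. The sketch below computes it from a 2 × 2 confusion matrix; the counts in the usage note are invented for illustration and are not the study's data.

```python
def cohens_kappa(tp, fn, fp, tn):
    """Cohen's kappa for a binary classifier from confusion-matrix counts."""
    n = tp + fn + fp + tn
    p_observed = (tp + tn) / n
    # Chance agreement: product of actual and predicted marginals, per class.
    p_yes = ((tp + fn) / n) * ((tp + fp) / n)
    p_no = ((fp + tn) / n) * ((fn + tn) / n)
    p_expected = p_yes + p_no
    return (p_observed - p_expected) / (1 - p_expected)
```

For example, with 20 true positives, 5 false negatives, 10 false positives and 15 true negatives, observed accuracy is 0.7, chance agreement is 0.5, and kappa is 0.4.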

14.
Accurate modeling of geographic distributions of species is crucial to various applications in ecology and conservation. The best performing techniques often require some parameter tuning, which may be prohibitively time‐consuming to do separately for each species, or unreliable for small or biased datasets. Additionally, even with the abundance of good quality data, users interested in the application of species models need not have the statistical knowledge required for detailed tuning. In such cases, it is desirable to use “default settings”, tuned and validated on diverse datasets. Maxent is a recently introduced modeling technique, achieving high predictive accuracy and enjoying several additional attractive properties. The performance of Maxent is influenced by a moderate number of parameters. The first contribution of this paper is the empirical tuning of these parameters. Since many datasets lack information about species absence, we present a tuning method that uses presence‐only data. We evaluate our method on independently collected high‐quality presence‐absence data. In addition to tuning, we introduce several concepts that improve the predictive accuracy and running time of Maxent. We introduce “hinge features” that model more complex relationships in the training data; we describe a new logistic output format that gives an estimate of probability of presence; finally we explore “background sampling” strategies that cope with sample selection bias and decrease model‐building time. 
Our evaluation, based on a diverse dataset of 226 species from 6 regions, shows: 1) default settings tuned on presence‐only data achieve performance which is almost as good as if they had been tuned on the evaluation data itself; 2) hinge features substantially improve model performance; 3) logistic output improves model calibration, so that large differences in output values correspond better to large differences in suitability; 4) “target‐group” background sampling can give much better predictive performance than random background sampling; 5) random background sampling results in a dramatic decrease in running time, with no decrease in model performance.
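A hinge feature can be pictured as a piecewise-linear transform of one environmental variable: flat at zero up to a knot, then rising linearly to 1 at the variable's maximum. The sketch below shows this "forward hinge" form as one common formulation; the function name and the choice of knot are assumptions for illustration, and the abstract itself does not spell out the definition.

```python
def forward_hinge(values, knot):
    """Forward hinge feature: 0 up to the knot, then linear up to 1 at the maximum."""
    hi = max(values)
    if hi <= knot:
        raise ValueError("knot must lie below the maximum of the variable")
    return [0.0 if v <= knot else (v - knot) / (hi - knot) for v in values]
```

Applied to the values 0, 5, 10, 15, 20 with a knot at 10, the feature is 0 for the first three values, 0.5 at 15 and 1.0 at 20. A model fitted on several such features with different knots can approximate a piecewise-linear response to the variable.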

15.
16.
MOTIVATION: Experimental limitations in high-throughput protein-protein interaction detection methods have resulted in low quality interaction datasets that contain sizable fractions of false positives and false negatives. Small-scale, focused experiments are then needed to complement the high-throughput methods to extract true protein interactions. However, the naturally vast interactomes would require much more scalable approaches. RESULTS: We describe a novel method called IRAP* as a computational complement for repurification of the highly erroneous experimentally derived protein interactomes. Our method involves an iterative process of removing interactions that are confidently identified as false positives and adding interactions detected as false negatives into the interactomes. Identification of both false positives and false negatives is performed in IRAP* using interaction confidence measures based on network topological metrics. Potential false positives are identified amongst the detected interactions as those with very low computed confidence values, while potential false negatives are discovered as the undetected interactions with high computed confidence values. Our results from applying IRAP* to large-scale interaction datasets generated by the popular yeast-two-hybrid assays for yeast, fruit fly and worm showed that the computationally repurified interaction datasets contained potentially lower fractions of false positive and false negative errors based on functional homogeneity. AVAILABILITY: The confidence indices for PPIs in yeast, fruit fly and worm as computed by our method can be found at our website http://www.comp.nus.edu.sg/~chenjin/fpfn.
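The iterative remove-and-add loop described above can be sketched as follows. The confidence measure used here (Jaccard overlap of the two proteins' neighbourhoods) and the thresholds `low` and `high` are stand-ins chosen for illustration; they are not the topological metrics or cut-offs used by IRAP*, which the abstract does not specify.

```python
from itertools import combinations


def jaccard_confidence(adj, u, v):
    """Share of neighbours that u and v have in common (excluding each other)."""
    nu, nv = adj[u] - {v}, adj[v] - {u}
    union = nu | nv
    return len(nu & nv) / len(union) if union else 0.0


def repurify(edges, low=0.1, high=0.6, max_rounds=10):
    """Iteratively drop low-confidence edges and add high-confidence non-edges.

    edges: iterable of (u, v) pairs with u != v (self-interactions not handled).
    """
    edges = {frozenset(e) for e in edges}
    nodes = sorted({n for e in edges for n in e})
    for _ in range(max_rounds):
        adj = {n: set() for n in nodes}
        for e in edges:
            u, v = tuple(e)
            adj[u].add(v)
            adj[v].add(u)
        # Putative false positives: detected interactions with very low confidence.
        remove = {e for e in edges if jaccard_confidence(adj, *tuple(e)) < low}
        # Putative false negatives: undetected pairs with high confidence.
        add = {
            frozenset((u, v))
            for u, v in combinations(nodes, 2)
            if frozenset((u, v)) not in edges
            and jaccard_confidence(adj, u, v) > high
        }
        if not remove and not add:
            break
        edges = (edges - remove) | add
    return edges
```

In a small network where A, B, C and D form a clique except for a missing A-B edge, and D has one extra partner X that shares no neighbours with it, this sketch removes D-X and adds A-B in the first round, then stops because nothing changes in the second.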

17.
Ant-Miner is an ant-based algorithm for the discovery of classification rules. This paper proposes five extensions to Ant-Miner: (1) we utilize multiple types of pheromone, one for each permitted rule class, i.e. an ant first selects the rule class and then deposits the corresponding type of pheromone; (2) we use a quality contrast intensifier to magnify the reward of high-quality rules and to penalize low-quality rules in terms of pheromone update; (3) we allow the use of a logical negation operator in the antecedents of constructed rules; (4) we incorporate stubborn ants, an ACO variation in which an ant is allowed to take into consideration its own personal past history; (5) we use an ant colony behavior in which each ant is allowed to have its own values of the α and β parameters (in a sense, to have its own personality). Empirical results on 23 datasets show improvements in the algorithm's performance in terms of predictive accuracy and simplicity of the generated rule set.

18.
19.
MOTIVATION: An important challenge in the use of large-scale gene expression data for biological classification occurs when the expression dataset being analyzed involves multiple classes. Key issues that need to be addressed under such circumstances are the efficient selection of good predictive gene groups from datasets that are inherently 'noisy', and the development of new methodologies that can enhance the successful classification of these complex datasets. METHODS: We have applied genetic algorithms (GAs) to the problem of multi-class prediction. A GA-based gene selection scheme is described that automatically determines the members of a predictive gene group, as well as the optimal group size, that maximizes classification success using a maximum likelihood (MLHD) classification method. RESULTS: The GA/MLHD-based approach achieves higher classification accuracies than other published predictive methods on the same multi-class test dataset. It also permits substantial feature reduction in classifier genesets without compromising predictive accuracy. We propose that GA-based algorithms may represent a powerful new tool in the analysis and exploration of complex multi-class gene expression data. AVAILABILITY: Supplementary information, data sets and source codes are available at http://www.omniarray.com/bioinformatics/GA.

20.
Landscape genetic analyses allow detection of fine‐scale spatial genetic structure (SGS) and quantification of effects of landscape features on gene flow and connectivity. Typically, analyses require generation of resistance surfaces. These surfaces characteristically take the form of a grid with cells that are coded to represent the degree to which landscape or environmental features promote or inhibit animal movement. How accurately resistance surfaces predict association between the landscape and movement is determined in large part by (a) the landscape features used, (b) the resistance values assigned to features, and (c) how accurately resistance surfaces represent landscape permeability. Our objective was to evaluate the performance of resistance surfaces generated using two publicly available land cover datasets that varied in how accurately they represent the actual landscape. We genotyped 365 individuals from a large black bear population (Ursus americanus) in the Northern Lower Peninsula (NLP) of Michigan, USA at 12 microsatellite loci, and evaluated the relationship between gene flow and landscape features using two different land cover datasets. We investigated the relative importance of land cover classification and accuracy on landscape resistance model performance. We detected local spatial genetic structure in Michigan's NLP black bears and found roads and land cover were significantly correlated with genetic distance. We observed similarities in model performance when different land cover datasets were used despite 21% dissimilarity in classification between the two land cover datasets. However, we did find the performance of land cover models to predict genetic distance was dependent on the way the land cover was defined. Models in which land cover was finely defined (i.e., eight land cover classes) outperformed models where land cover was defined more coarsely (i.e., habitat/non‐habitat or forest/non‐forest).
Our results show that landscape genetic researchers should carefully consider how land cover classification changes inference in landscape genetic studies.
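A resistance surface of the kind described above can be sketched as a grid in which each land cover class is mapped to a movement cost, with the least-cost distance between two cells serving as a landscape-based predictor of genetic distance. The class names, resistance values, 4-neighbour movement rule and step cost (mean of the two cells) below are assumptions for illustration only, not the study's parameterisation.

```python
import heapq


def build_resistance_surface(land_cover, resistance_by_class):
    """Translate a grid of land cover classes into a grid of resistance values."""
    return [[resistance_by_class[c] for c in row] for row in land_cover]


def least_cost_distance(surface, start, goal):
    """Dijkstra least-cost distance between two (row, col) cells, 4-neighbour moves."""
    rows, cols = len(surface), len(surface[0])
    best = {start: 0.0}
    queue = [(0.0, start)]
    while queue:
        cost, (r, c) = heapq.heappop(queue)
        if (r, c) == goal:
            return cost
        if cost > best[(r, c)]:
            continue  # stale queue entry
        for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            nr, nc = r + dr, c + dc
            if 0 <= nr < rows and 0 <= nc < cols:
                # Cost of a step is the mean resistance of the two cells.
                new_cost = cost + (surface[r][c] + surface[nr][nc]) / 2
                if new_cost < best.get((nr, nc), float("inf")):
                    best[(nr, nc)] = new_cost
                    heapq.heappush(queue, (new_cost, (nr, nc)))
    return float("inf")
```

With forest coded as 1 and road as 10 on a 2 × 3 grid whose top-middle cell is road, the least-cost route between the two top corners detours through the forested bottom row (cost 4) instead of crossing the road (cost 11). Recoding the same grid with coarser classes changes these costs, which is one way to see why classification detail can affect model performance.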
