Similar articles
Found 20 similar articles (search time: 15 ms)
1.
During routine screens of the NCBI databases using human repetitive elements, we discovered an unlikely level of nucleotide identity across a broad range of phyla. To ascertain whether databases containing DNA sequences, genome assemblies and trace archive reads were contaminated with human sequences, we performed an in-depth search for sequences of human origin in non-human species. Using a primate-specific SINE, AluY, we screened 2,749 non-primate public databases from NCBI, Ensembl, JGI, and UCSC and found 492 to be contaminated with human sequence. These represent species ranging from bacteria (B. cereus) to plants (Z. mays) to fish (D. rerio), with examples found from most phyla. The identification of such extensive contamination of human sequence across databases and sequence types warrants caution among the sequencing community in future sequencing efforts, such as human re-sequencing. We discuss issues this may raise and present data that give insight into how this may be occurring.

2.
Science is a social process with far-reaching impact on our modern society. In recent years, we have for the first time been able to study science itself scientifically. This is enabled by massive amounts of data on scientific publications that is increasingly becoming available. The data is contained in several databases such as Web of Science or PubMed, maintained by various public and private entities. Unfortunately, these databases are not always consistent, which considerably hinders this study. Relying on the powerful framework of complex networks, we conduct a systematic analysis of the consistency among six major scientific databases. We found that identifying a single "best" database is far from easy. Nevertheless, our results indicate appreciable differences in mutual consistency of different databases, which we interpret as recipes for future bibliometric studies.

3.
BACKGROUND: Biological databases contain large amounts of data concerning the functions and associations of genes and proteins. Integration of data from several such databases into a single repository can aid the discovery of previously unknown connections spanning multiple types of relationships and databases. RESULTS: Biomine is a system that integrates cross-references from several biological databases into a graph model with multiple types of edges, such as protein interactions, gene-disease associations and gene ontology annotations. Edges are weighted based on their type, reliability, and informativeness. We present Biomine and evaluate its performance in link prediction, where the goal is to predict pairs of nodes that will be connected in the future, based on current data. In particular, we formulate protein interaction prediction and disease gene prioritization tasks as instances of link prediction. The predictions are based on a proximity measure computed on the integrated graph. We consider and experiment with several such measures, and perform a parameter optimization procedure where different edge types are weighted to optimize link prediction accuracy. We also propose a novel method for disease-gene prioritization, defined as finding a subset of candidate genes that cluster together in the graph. We experimentally evaluate Biomine by predicting future annotations in the source databases and prioritizing lists of putative disease genes. CONCLUSIONS: The experimental results show that Biomine has strong potential for predicting links when a set of selected candidate links is available. The predictions obtained using the entire Biomine dataset are shown to clearly outperform ones obtained using any single source of data alone, when different types of links are suitably weighted. In the gene prioritization task, an established reference set of disease-associated genes is useful, but the results show that under favorable conditions, Biomine can also perform well when no such information is available. The Biomine system is a proof of concept. Its current version contains 1.1 million entities and 8.1 million relations between them, with focus on human genetics. Some of its functionalities are available in a public query interface at http://biomine.cs.helsinki.fi, allowing searching for and visualizing connections between given biological entities.
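The abstract above does not say which proximity measures were used, so the following is only an illustrative sketch of one plausible measure on a weighted graph: the highest product of edge weights along any path between two nodes. The function name `best_path_proximity`, the toy graph, and its weights are all hypothetical and are not taken from Biomine.

```python
import heapq
import math


def best_path_proximity(graph, source, target):
    """Return the largest product of edge weights over any path source -> target.

    graph maps each node to a list of (neighbor, weight) pairs, with weights
    in (0, 1]. Returns 0.0 if the target cannot be reached.
    """
    # Maximizing a product of weights is the same as minimizing the sum of
    # -log(weight), so Dijkstra's algorithm can be applied directly.
    dist = {source: 0.0}
    heap = [(0.0, source)]
    while heap:
        d, node = heapq.heappop(heap)
        if node == target:
            return math.exp(-d)
        if d > dist.get(node, math.inf):
            continue  # stale heap entry
        for neighbor, weight in graph.get(node, []):
            nd = d - math.log(weight)
            if nd < dist.get(neighbor, math.inf):
                dist[neighbor] = nd
                heapq.heappush(heap, (nd, neighbor))
    return 0.0


# Hypothetical toy graph: node names and weights are made up for illustration.
toy_graph = {
    "geneA": [("proteinX", 0.9), ("diseaseD", 0.2)],
    "proteinX": [("geneA", 0.9), ("diseaseD", 0.5)],
    "diseaseD": [("geneA", 0.2), ("proteinX", 0.5)],
}
score = best_path_proximity(toy_graph, "geneA", "diseaseD")
```

In this toy graph the indirect route through `proteinX` (0.9 × 0.5) scores higher than the direct edge (0.2), which is the kind of evidence a link-prediction ranking could use.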

4.
MOTIVATION: The living cell is a complex machine that depends on the proper functioning of its numerous parts, including proteins. Understanding protein functions and how they modify and regulate each other is the next great challenge for life-sciences researchers. The collective knowledge about protein functions and pathways is scattered throughout numerous publications in scientific journals. Bringing the relevant information together becomes a bottleneck in a research and discovery process. The volume of such information grows exponentially, which renders manual curation impractical. As a viable alternative, automated literature processing tools could be employed to extract and organize biological data into a knowledge base, making it amenable to computational analysis and data mining. RESULTS: We present MedScan, a completely automated natural language processing-based information extraction system. We have used MedScan to extract 2976 interactions between human proteins from MEDLINE abstracts dated after 1988. The precision of the extracted information was found to be 91%. Comparison with the existing protein interaction databases BIND and DIP revealed that 96% of extracted information is novel. The recall rate of MedScan was found to be 21%. Additional experiments with MedScan suggest that MEDLINE is a unique source of diverse protein function information, which can be extracted in a completely automated way with a reasonably high precision. Further directions of the MedScan technology improvement are discussed. AVAILABILITY: MedScan is available for commercial licensing from Ariadne Genomics, Inc.
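To put the reported precision (91%) and recall (21%) in context, the sketch below combines them into an F1 score, the harmonic mean of the two. The F1 value is not reported in the abstract; it is derived here only as an illustration, and the estimate of correct extractions assumes the reported precision holds across all 2976 extracted interactions.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (0.0 when both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Figures quoted in the abstract.
precision = 0.91
recall = 0.21
extracted = 2976

# Derived values (not stated in the abstract).
f1 = f1_score(precision, recall)
# Rough count of correct extractions, assuming precision applies uniformly.
estimated_correct = round(extracted * precision)
```

The low F1 relative to the precision reflects the trade-off described in the abstract: what the system extracts is mostly correct, but it finds only about a fifth of what is there.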

5.
Many animal health, welfare and food safety databases include data on clinical and test-based disease diagnoses. However, the circumstances and constraints for establishing the diagnoses vary considerably among databases. Therefore, results based on different databases are difficult to compare, and compilation of data in order to perform meta-analysis is almost impossible. Nevertheless, diagnostic information collected either routinely or in research projects is valuable in cross comparisons between databases, but there is a need for improved transparency and documentation of the data and the performance characteristics of tests used to establish diagnoses. The objective of this paper is to outline the circumstances and constraints for recording of disease diagnoses in different types of databases, and to discuss these in the context of disease diagnoses when using them for additional purposes, including research. Finally, some limitations and recommendations for use of data and for recording of diagnostic information in the future are given. It is concluded that many research questions have such a specific objective that investigators need to collect their own data. However, there are also examples where a minimal amount of extra information or continued validation could sufficiently improve secondary data for use in other purposes. Regardless, researchers should always carefully evaluate the opportunities and constraints when they decide to use secondary data. If the data in the existing databases are not sufficiently valid, researchers may have to collect their own data, but improved recording of diagnostic data may improve the usefulness of secondary diagnostic data in the future.

6.
Limited resources make it difficult to effectively document, monitor, and control invasive species across large areas, resulting in large gaps in our knowledge of current and future invasion patterns. We surveyed 128 citizen science program coordinators and interviewed 15 of them to evaluate their potential role in filling these gaps. Many programs collect data on invasive species and are willing to contribute these data to public databases. Although resources for education and monitoring are readily available, groups generally lack tools to manage and analyze data. Potential users of these data also retain concerns over data quality. We discuss how to address these concerns about citizen scientist data and programs while preserving the advantages they afford. A unified yet flexible national citizen science program aimed at tracking invasive species location, abundance, and control efforts could be designed using centralized data sharing and management tools. Such a system could meet the needs of multiple stakeholders while allowing efficiencies of scale, greater standardization of methods, and improved data quality testing and sharing. Finally, we present a prototype for such a system (see ).

7.
Biological invasions often transcend political boundaries, but the capacity of countries to prevent invasions varies. How this variation in biosecurity affects the invasion risks posed to the countries involved is unclear. We aimed to improve the understanding of how the biosecurity of a country influences that of its neighbours. We developed six scenarios that describe biological invasions in regions with contiguous countries. Using data from alien species databases, socio‐economic and biodiversity data and species distribution models, we determined where 86 of 100 of the world's worst invasive species are likely to invade and have a negative impact in the future. Information on the capacity of countries to prevent invasions was used to determine whether such invasions could be avoided. For the selected species, we predicted 2,523 discrete invasions, most of which would have significant negative impacts and are unlikely to be prevented. Of these invasions, approximately a third were predicted to spread from the country in which the species first establishes to neighbouring countries where they would cause significant negative impacts. Most of these invasions are unlikely to be prevented as the country of first establishment has a low capacity to prevent invasions or has little incentive to do so as there will be no impact in that country. Regional biosecurity is therefore essential to prevent future harmful biological invasions. In consequence, we propose that the need for increased regional co‐operation to combat biological invasions be incorporated in global biodiversity targets.

8.
This article considers how we should frame the ethical issues raised by current proposals for large-scale genebanks with on-going links to medical and lifestyle data, such as the Wellcome Trust and Medical Research Council's 'UK Biobank'. As recent scandals such as Alder Hey have emphasised, there are complex issues concerning the informed consent of donors that need to be carefully considered. However, we believe that a preoccupation with informed consent obscures important questions about the purposes to which such collections are put, not least that they may be only haphazardly used for research (especially that of commercial interest), an end that would not fairly reflect the original altruistic motivation of donors, and the trust they must invest. We therefore argue that custodians of such databases take on a weighty pro-active duty, to encourage public debate about the ends of such collections and to sponsor research that reflects publicly agreed priorities and provides public benefits.

9.
This article considers how we should frame the ethical issues raised by current proposals for large‐scale genebanks with on‐going links to medical and lifestyle data, such as the Wellcome Trust and Medical Research Council's ‘UK Biobank’. As recent scandals such as Alder Hey have emphasised, there are complex issues concerning the informed consent of donors that need to be carefully considered. However, we believe that a preoccupation with informed consent obscures important questions about the purposes to which such collections are put, not least that they may be only haphazardly used for research (especially that of commercial interest), an end that would not fairly reflect the original altruistic motivation of donors, and the trust they must invest. We therefore argue that custodians of such databases take on a weighty pro‐active duty, to encourage public debate about the ends of such collections and to sponsor research that reflects publicly agreed priorities and provides public benefits.

10.
Generative models have shown breakthroughs in a wide spectrum of domains due to recent advancements in machine learning algorithms and increased computational power. Despite these impressive achievements, the ability of generative models to create realistic synthetic data is still under-exploited in genetics and absent from population genetics. Yet a known limitation in the field is the reduced access to many genetic databases due to concerns about violations of individual privacy, although they would provide a rich resource for data mining and integration towards advancing genetic studies. In this study, we demonstrated that deep generative adversarial networks (GANs) and restricted Boltzmann machines (RBMs) can be trained to learn the complex distributions of real genomic datasets and generate novel high-quality artificial genomes (AGs) with little to no privacy loss. We show that our generated AGs replicate characteristics of the source dataset such as allele frequencies, linkage disequilibrium, pairwise haplotype distances and population structure. Moreover, they can also inherit complex features such as signals of selection. To illustrate the promising outcomes of our method, we showed that imputation quality for low frequency alleles can be improved by data augmentation to reference panels with AGs and that the RBM latent space provides a relevant encoding of the data, hence allowing further exploration of the reference dataset and features for solving supervised tasks. Generative models and AGs have the potential to become valuable assets in genetic studies by providing a rich yet compact representation of existing genomes and high-quality, easy-access and anonymous alternatives for private databases.

11.
Copyright and licensing of scientific data, internationally, are complex and present legal barriers to data sharing, integration and reuse, and therefore restrict the most efficient transfer and discovery of scientific knowledge. Much data are included within scientific journal articles, their published tables, additional files (supplementary material) and reference lists. However, these data are usually published under licenses which are not appropriate for data. Creative Commons CC0 is an appropriate and increasingly accepted method for dedicating data to the public domain, to enable data reuse with the minimum of restrictions. BioMed Central is committed to working towards implementation of open data-compliant licensing in its publications. Here we detail a protocol for implementing a combined Creative Commons Attribution license (for copyrightable material) and Creative Commons CC0 waiver (for data) agreement for content published in peer-reviewed open access journals. We explain the differences between legal requirements for attribution in copyright, and cultural requirements in scholarship for giving individuals credit for their work through citation. We argue that publishing data in scientific journals under CC0 will have numerous benefits for individuals and society, and yet will have minimal implications for authors and minimal impact on current publishing and research workflows. We provide practical examples and definitions of data types, such as XML and tabular data, and specific secondary use cases for published data, including text mining, reproducible research, and open bibliography. We believe this proposed change to the current copyright and licensing structure in science publishing will help clarify what users (people and machines) of the published literature can do, legally, with journal articles and make research using the published literature more efficient. We further believe this model could be adopted across multiple publishers, and invite comment on this article from all stakeholders in scientific research.

12.
Species occurrence records from a variety of sources are increasingly aggregated into heterogeneous databases and made available to ecologists for immediate analytical use. However, these data are typically biased, i.e. they are not a probability sample of the target population of interest, meaning that the information they provide may not be an accurate reflection of reality. It is therefore crucial that species occurrence data are properly scrutinised before they are used for research. In this article, we introduce occAssess, an R package that enables straightforward screening of species occurrence data for potential biases. The package contains a number of discrete functions, each of which returns a measure of the potential for bias in one or more of the taxonomic, temporal, spatial, and environmental dimensions. Users can opt to provide a set of time periods into which the data will be split; in this case separate outputs will be provided for each period, making the package particularly useful for assessing the suitability of a dataset for estimating temporal trends in species' distributions. The outputs are provided visually (as ggplot2 objects) and do not include a formal recommendation as to whether data are of sufficient quality for any given inferential use. Instead, they should be used as ancillary information and viewed in the context of the question that is being asked, and the methods that are being used to answer it. We demonstrate the utility of occAssess by applying it to data on two key pollinator taxa in South America: leaf‐nosed bats (Phyllostomidae) and hoverflies (Syrphidae). In this worked example, we briefly assess the degree to which various aspects of data coverage appear to have changed over time. We then discuss additional applications of the package, highlight its limitations, and point to future development opportunities.

13.
Most methods for the interpretation of gene expression profiling experiments rely on the categorization of genes, as provided by the Gene Ontology (GO) and pathway databases. Due to the manual curation process, such databases are never up-to-date and tend to be limited in focus and coverage. Automated literature mining tools provide an attractive, alternative approach. We review how they can be employed for the interpretation of gene expression profiling experiments. We illustrate that their comprehensive scope aids the interpretation of data from domains poorly covered by GO or alternative databases, and allows for the linking of gene expression with diseases, drugs, tissues and other types of concepts. A framework for proper statistical evaluation of the associations between gene expression values and literature concepts was lacking and is now implemented in a weighted extension of the global test. The weights are the literature association scores and reflect the importance of a gene for the concept of interest. In a direct comparison with classical GO-based gene sets, we show that use of literature-based associations results in the identification of much more specific GO categories. We demonstrate the possibilities for linking of gene expression data to patient survival in breast cancer and the action and metabolism of drugs. Coupling with online literature mining tools ensures transparency and allows further study of the identified associations. Literature mining tools are therefore powerful additions to the toolbox for the interpretation of high-throughput genomics data.

14.
Compiling disparate datasets into publicly available composite databases helps natural resource communities explore ecological trends and effectively manage across spatiotemporal scales. Though some studies have reported on the database construction phase, fewer have evaluated the data acquisition and distribution process. To facilitate future data sharing collaborations, Louisiana State University surveyed data providers and requestors to understand the characteristics of effective data requests and sharing. Data providers were largely U.S. natural resource agency personnel, and they reported that unclear data requests, privacy issues, and rigid timelines and formats were the greatest barriers toward providing data, but that they were motivated by improving science and collaboration. Data requestors identified challenges such as evolving needs, standardization issues, and insufficient resources (time and funding) as barriers to compiling data for these types of efforts. In a time of big data, open access, and collaboration, significant scientific advances can be made with effective requests and inclusion of data sets into larger and more powerful databases.

15.

Distribution data sharing in global databases (e.g. GBIF) has enabled knowledge synthesis in several areas of biodiversity research. However, their Wallacean shortfalls still reduce our capacity to understand distribution patterns. Including exclusive records from other databases, such as national ones (e.g. SpeciesLink), could mitigate these shortfalls, but this has not yet been evaluated. Therefore, we assessed whether (i) inventory completeness, (ii) taxonomic contribution and (iii) spatial biases could be improved by integrating global and national biodiversity databases. Using Amazonian epiphytes as a model, we compared the available taxonomic information spatially between the GBIF and SpeciesLink databases using a species contribution index. We obtained the inventory completeness of each source using species accumulation curves and assessed their spatial biases by constructing spatial autoregressive models. We found that both databases have a high proportion of exclusive records (GBIF: 36.7%; SpeciesLink: 21.7%) and species (17.8%). Amazonia had low epiphyte inventory completeness, but it improved when we analyzed both databases together. Individually, the records of both databases were biased towards sites with higher altitude, population and herbarium density. Together, river density appeared as a new predictor, probably due to the higher species contribution of SpeciesLink along rivers. Our findings provide strong evidence that using both global and national databases increases overall biodiversity knowledge and reduces inventory gaps, but spatial biases may persist. Therefore, we highlight the importance of aggregating more than one database to understand biodiversity patterns, to address conservation decisions and direct shortfalls more efficiently in future studies.
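As a rough illustration of what "exclusive records" could mean when two databases are combined, the sketch below computes the share of records found in only one of two sources. The abstract does not state how its percentages were calculated, so the denominator used here (the combined set of unique records) is an assumption, and the function name and toy records are hypothetical.

```python
def exclusive_shares(records_a, records_b):
    """Return the fraction of the combined unique records found only in A, and only in B.

    Each record must be hashable, e.g. a (species, latitude, longitude) tuple.
    Assumption: shares are taken relative to the union of both sources.
    """
    a, b = set(records_a), set(records_b)
    union = a | b
    if not union:
        return 0.0, 0.0
    return len(a - b) / len(union), len(b - a) / len(union)


# Made-up records for illustration only; not data from GBIF or SpeciesLink.
global_db = [
    ("sp1", -3.1, -60.0),
    ("sp2", -2.5, -54.7),
    ("sp3", -1.4, -48.5),
    ("sp4", -9.9, -67.8),
]
national_db = [
    ("sp3", -1.4, -48.5),
    ("sp4", -9.9, -67.8),
    ("sp5", -0.1, -51.1),
]
only_global, only_national = exclusive_shares(global_db, national_db)
```

Matching on exact tuples is a simplification; real occurrence records would usually need coordinate rounding and name standardization before they can be compared.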


16.
Increasing global energy demands have led to the ongoing intensification of hydrocarbon extraction from marine areas. Hydrocarbon extractive activities pose threats to native marine biodiversity, such as noise, light, and chemical pollution, physical changes to the sea floor, invasive species, and greenhouse gas emissions. Here, we assessed at a global scale the spatial overlap between offshore hydrocarbon activities and marine biodiversity (>25,000 species, nine major ecosystems, and marine protected areas), and quantified the changes over time. We discovered that two‐thirds of global offshore hydrocarbon activities occur in areas within the top 10% for species richness, range rarity, and proportional range rarity values globally. Thus, while hydrocarbon activities are undertaken in less than one percent of the ocean's area, they overlap with approximately 85% of all assessed species. Of conservation concern, 4% of species with the largest proportion of their range overlapping hydrocarbon activities are range restricted, potentially increasing their vulnerability to localized threats such as oil spills. While hydrocarbon activities have extended to greater depths since the mid‐1990s, we found that the largest overlap is with coastal ecosystems, particularly estuaries, saltmarshes and mangroves. Furthermore, in most countries where offshore hydrocarbon exploration licensing blocks have been delineated, they do not overlap with marine protected areas (MPAs). Although this is positive in principle, many countries have far more licensing block areas than protected areas, and in some instances, MPA coverage is minimal. These findings suggest the need for marine spatial prioritization to help limit future spatial overlap between marine conservation priorities and hydrocarbon activities. Such prioritization can be informed by the spatial and quantitative baseline information provided here. In increasingly shared seascapes, prioritizing management actions that set both conservation and development targets could help minimize further declines of biodiversity and environmental changes at a global scale.

17.
Human gene patents continue to stir social controversy, including the possibility that they might adversely affect public access to useful technologies. It has been suggested that a compulsory licensing policy might be used to alleviate the adverse effect of patents in this context. We suggest, however, that it is unclear whether existing international policies and licensing practices will permit compulsory licensing to be used in a way that would address common concerns. Indeed, given the minor role that genetic technologies have in most health care systems, it would be difficult to justify compulsory licensing. At a minimum, policy makers need to be more realistic about the potential effects of international trade agreements on the development of biotechnology policies.

18.
We consider how the landscape of biological databases may evolve in the future, and what research is needed to realize this evolution. We suggest today's dispersal of diverse resources will only increase as the number and size of those resources grow, driving the need for semantic interoperability even more strongly. Because the complexity of the questions biologists want answered automatically continues to escalate rapidly, we will need to draw upon high-performance computing resources such as the GRID to process complex queries. Finally, we still need data, and our ways of acquiring and curating data must improve by orders of magnitude.

19.
Problem: The increasing availability of large vegetation databases holds great potential in ecological research and biodiversity informatics. However, inconsistent application of plant names compromises the usefulness of these databases. This problem has been acknowledged in recent years, and solutions have been proposed, such as the concept of “potential taxa” or “taxon views”. Unfortunately, awareness of the problem remains low among vegetation scientists. Methods: We demonstrate how misleading interpretations caused by inconsistent use of plant names might occur through the course of vegetation analysis, from relevés upward through databases, and then to the final analyses. We discuss how these problems might be minimized. Results: We highlight the importance of taxonomic reference lists for standardizing plant names and outline standards they should fulfill to be useful for vegetation databases. Additionally, we present the R package vegdata, which is designed to solve name‐related problems that arise when analysing vegetation databases. Conclusions: We conclude that by giving more consideration to the appropriate application of plant names, vegetation scientists might enhance the reliability of analyses obtained from large vegetation databases.

20.
Over the past few years, large amounts of data linking gene-expression (GE) patterns and other genetic data with the development of the mouse kidney have been published, and the next task will be to integrate these data with the molecular networks responsible for the emergence of the kidney phenotype. This paper discusses how a start to this task can be made by using the kidney database and its associated search tools, and shows how the data generated by such an approach can be used as a guide to future experimentation. Many of the events taking place as the kidney develops do, of course, also take place in other tissues and organisms and it will soon be possible to incorporate relevant information from these systems into analyses of kidney data as well as the new information from microarray technology. The key to success here will be the ability to access over the internet data from the textual and graphical databases for the mouse and other organisms now being established. In order to do this, informatic tools will be needed that will allow a user working with one database to query another. This paper also considers both the types of tools that will be necessary and the databases on which they will operate.

