Digital Approaches for the Synthesis of Poorly Accessible Biodiversity Information
The digitalization and integration of biodiversity information can generate substantial added value for existing data and yield novel scientific insights of relevance to bioeconomy, biotechnology, human health, and environmental protection. So far, this potential has rarely been exploited because data sources are heterogeneous and fragmented, and because the data are sparsely documented, follow variable standards, and have limited interoperability. For bacteria, research data are particularly diverse and broadly distributed; these organisms therefore serve as the model group for the current project. The DiASPora project will establish an approach for synthesizing information on bacterial species by applying state-of-the-art data science methodology and genomics, and by developing user-centric workflows. Phenotypic data will be extracted from the microbiological literature by large-scale text mining, using artificial intelligence (AI) techniques that are trained through the feedback of microbiologist curators. The recovered data will be hosted by the existing BacDive database and transformed into a machine-readable and machine-processable format using the Resource Description Framework (RDF). Subsequently, the transformed data will be used to establish a knowledge graph that provides innovative search options for discovering hidden data relationships. In parallel, phenotypic predictions will be derived from (meta)genomic data by applying metabolic models and comparing the results with the physiological and habitat data obtained by data mining; this step will also be supported by an AI approach. The project is committed to integral community engagement and efficient dissemination of results. DiASPora builds upon the complementary expertise of three participating institutions, covering the fields of microbial databases and diversity research, bacterial genomics, text mining, artificial intelligence, and semantic technologies.
Goals of the project
The DiASPora project
May 2020 – April 2023
WP1 Mobilization and curation of phenotypic trait data
The aim of this WP is to generate a comprehensive set of phenotypic trait data for all described prokaryotic species. Data extraction from the literature will therefore be streamlined by establishing automated text mining, employing an artificial intelligence system based on machine learning approaches such as random forests and deep neural networks. Data will be hosted by BacDive, which has already gathered and published bacterial phenotypic metadata for 80,584 strains covering approx. 90% of the described species. Curator feedback on the extracted data will then be used to retrain the artificial intelligence iteratively (similar to Prodi.gy), resulting in an improved text mining workflow. It is expected that the majority of the data fields can eventually be extracted with little or no human intervention, reducing the workload of curators.
WP2 Transformation into a machine readable and FAIR data repository
The extended BacDive content will be standardized and transformed into a machine-readable and machine-processable format following the FAIR (findable, accessible, interoperable, reusable) and Linked Data (LD) principles. Existing ontologies will be reused wherever possible to ensure that BacDive data can be linked with other, already semantically enriched data and thereby integrated into the existing landscape of semantic services. Our combined approach will ultimately provide the scientific community with a facility for easy lookup and for systematic, focused download of species-associated data in two different, complementary ways. After transformation into a fully machine-readable and FAIR-compliant research data repository, the search options will be further developed by establishing the BacDive Knowledge Graph.
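As a minimal illustration of what a machine-readable, Linked Data style representation could look like, the following Python sketch serializes a few trait values of one strain as RDF triples in N-Triples syntax. The namespace `https://example.org/bacdive/`, the function names, and the trait keys are assumptions made for this example only; the project text does not specify the URIs or ontologies that BacDive actually uses, and a real transformation would reuse terms from existing ontologies.

```python
# Hypothetical namespace for illustration; not the project's actual URI scheme.
BASE = "https://example.org/bacdive/"


def escape_literal(value: str) -> str:
    """Escape characters that are not allowed verbatim in an N-Triples literal."""
    return (
        value.replace("\\", "\\\\")
        .replace('"', '\\"')
        .replace("\n", "\\n")
        .replace("\r", "\\r")
    )


def strain_to_ntriples(strain_id: int, traits: dict[str, str]) -> list[str]:
    """Return one N-Triples line per trait of a strain.

    Trait keys are assumed to be simple, URI-safe identifiers
    (letters, digits, underscores).
    """
    subject = f"<{BASE}strain/{strain_id}>"
    lines = []
    for trait, value in sorted(traits.items()):
        predicate = f"<{BASE}trait/{trait}>"
        lines.append(f'{subject} {predicate} "{escape_literal(value)}" .')
    return lines


if __name__ == "__main__":
    # Made-up example record, not real BacDive data.
    for line in strain_to_ntriples(1, {"motility": "yes", "gram_stain": "negative"}):
        print(line)
```

Each printed line is a subject, predicate, object statement that generic RDF tooling can load, which is the property that allows such data to be linked with other semantically enriched resources.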
WP3 Extending the database
Nearly 250,000 bacterial genome sequences have become available, including those of phenotypically characterized species. These genome sequences allow predictions of phenotypic traits such as the utilization of carbon and nitrogen sources, biosynthetic capabilities, motility, sporulation, or secretion. So far, this information has not been systematically mobilized and integrated with phenotypic information for the same bacterial species. In a comprehensive approach, potential phenotypic characteristics will be extracted from available genome data. The results will be fed into a novel, genome-based knowledge base that will be established within BacDive. To evaluate the predictive power and plausibility of the genome-based approach, the derived phenotypic predictions will be tested against the actual phenotypic data retrieved by literature mining. Genome-based data that reach or exceed a predefined confidence score will then be used to complement or correct phenotypic entries in the BacDive database. As an innovative approach, we will link the specific phenotypic properties and environmental preferences of so-far-uncharacterized bacteria (obtained from ecological data in WP1) with the occurrence patterns of particular non-annotated genes across their genomes. This will generate hypotheses on the specific functions of non-annotated genes.
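The evaluation and filtering steps described above can be sketched in Python as follows. This is an illustrative sketch under stated assumptions, not the project's method: the `Prediction` record, the binary trait values, the confidence scale from 0 to 1, and the threshold used in the example are all hypothetical, since the text only states that a predefined confidence score will be applied.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Prediction:
    """Hypothetical genome-derived trait prediction for one strain."""
    strain: str
    trait: str
    value: bool
    confidence: float  # assumed to lie in [0, 1]


def agreement(
    predictions: list[Prediction],
    literature: dict[tuple[str, str], bool],
) -> Optional[float]:
    """Fraction of predictions that match the literature-derived value.

    Only predictions with a literature value for the same (strain, trait)
    pair are compared; returns None if there is nothing to compare.
    """
    compared = [p for p in predictions if (p.strain, p.trait) in literature]
    if not compared:
        return None
    hits = sum(p.value == literature[(p.strain, p.trait)] for p in compared)
    return hits / len(compared)


def accepted(predictions: list[Prediction], threshold: float) -> list[Prediction]:
    """Keep predictions whose confidence reaches or exceeds the threshold."""
    return [p for p in predictions if p.confidence >= threshold]


if __name__ == "__main__":
    # Made-up example values, not real data.
    preds = [
        Prediction("strain_a", "motility", True, 0.9),
        Prediction("strain_a", "sporulation", False, 0.4),
        Prediction("strain_b", "motility", False, 0.8),
    ]
    lit = {("strain_a", "motility"): True, ("strain_b", "motility"): True}
    print(agreement(preds, lit))
    print(accepted(preds, threshold=0.8))
```

Keeping the comparison against literature data separate from the confidence filter mirrors the two steps in the text: first assess how well predictions agree with observed phenotypes, then decide which predictions are reliable enough to complement or correct database entries.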
The extended phenotypic dataset (consisting of text-mined, genome-derived, and semantically transformed data) will be exploited to infer optimal growth conditions (e.g., pH, temperature, salinity, carbon sources, growth factor requirements, trace elements) and the corresponding culture media composition for selected target species that have not yet been cultured. The media and growth conditions inferred in silico by machine learning approaches will be quality-controlled and improved in subsequent wet lab experiments.
WP5 Community engagement and dissemination
We will identify additional requirements for the continued, agile development and future operation of the phenotypic data services of BacDive. Newly developed tools will be published under OSI-compliant Open Source licenses, adhering to the DFG recommendations on research software and to best practices in the bioinformatics community. All data resources, analysis results, and tools produced in the DiASPora project will be made accessible by extending and reprogramming the existing BacDive webpage.