Digital Approaches for the Synthesis of Poorly Accessible Biodiversity Information



The digitalization and integration of biodiversity information can generate substantial added value for existing data and yield novel scientific insights of relevance to bioeconomy, biotechnology, human health, and environmental protection. So far this potential has been exploited only rarely, due to the heterogeneity and fragmentation of data sources and to the sparse documentation, variable standards, and limited interoperability of the data. For bacteria, research data are particularly diverse and broadly distributed; these organisms will therefore serve as the model group for the current project. The project DiASPora will establish an approach for synthesizing information on bacterial species by applying state-of-the-art data science methodology and genomics and by developing user-centric workflows. Phenotypic data will be extracted from the microbiological literature by large-scale text mining, applying artificial intelligence (AI) techniques that will be trained through the feedback of microbiologist curators. The recovered data will be hosted by the existing BacDive database and transformed into a machine-readable and machine-processable format using the Resource Description Framework (RDF). The transformed data will then be used to establish a knowledge graph that offers innovative search options for discovering hidden data relationships. In parallel, phenotypic predictions will be derived from (meta)genomic data by applying metabolic models and comparing the results with the physiological and habitat data obtained by data mining; this step will also be supported by an AI approach. The project is committed to comprehensive community engagement and efficient dissemination of results. DiASPora builds upon the complementary expertise of three participating institutions, covering the fields of microbial databases and diversity research, bacterial genomics, text mining, artificial intelligence, and semantic technologies.


The Bacterial Diversity Metadatabase (BacDive) is a major resource for this project.

Explore BacDive

Goals of the project

Improve the integration, accessibility and manageability of information on bacterial biodiversity.
Develop new bioinformatic tools that enable multidimensional analyses of these widely divergent molecular, phenotypic and ecological data.
Make bacterial properties predictable using genome annotations.

The DiASPora project


Work program

May 2020 – April 2023

WP1 Mobilization and curation of phenotypic trait data

The aim of this WP is to generate a comprehensive set of phenotypic trait data for all described prokaryotic species. Data extraction from the literature will therefore be streamlined by establishing automated text mining based on machine learning approaches, including random forests and deep neural networks. The data will be hosted by BacDive, which has already gathered and published bacterial phenotypic metadata for 80,584 strains covering approx. 90% of the described species. The information from curation will then be used to retrain the artificial intelligence iteratively (similar to Prodi.gy), resulting in an improved text mining workflow. It is expected that the majority of the data fields can eventually be extracted with little or no human intervention, reducing the workload of curators.
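The curator feedback loop described above can be illustrated with a small triage step. This is only a sketch under stated assumptions, not the project's actual workflow: it assumes the text mining model attaches a confidence score between 0 and 1 to each extracted value, and the names `Extraction`, `triage`, `auto_accept_at` and `review_budget` are hypothetical. Confident extractions are accepted directly, and the most uncertain ones are sent to curators, whose corrections could then serve as new training data.

```python
from typing import NamedTuple


class Extraction(NamedTuple):
    """One text-mined value (hypothetical record layout)."""
    strain_id: str
    field: str
    value: str
    confidence: float  # assumed model score in [0, 1]


def triage(extractions, auto_accept_at=0.9, review_budget=2):
    """Split extractions into auto-accepted ones and a curator review queue.

    Extractions at or above `auto_accept_at` are accepted without review.
    The rest are ranked by how close their score is to 0.5 (most uncertain
    first), and only the first `review_budget` items are queued for curators.
    """
    accepted = [e for e in extractions if e.confidence >= auto_accept_at]
    uncertain = sorted(
        (e for e in extractions if e.confidence < auto_accept_at),
        key=lambda e: abs(e.confidence - 0.5),
    )
    return accepted, uncertain[:review_budget]


if __name__ == "__main__":
    batch = [
        Extraction("s1", "gram_stain", "negative", 0.95),
        Extraction("s1", "motility", "yes", 0.55),
        Extraction("s2", "oxygen_tolerance", "aerobe", 0.20),
        Extraction("s2", "cell_shape", "rod", 0.48),
    ]
    accepted, queue = triage(batch)
    print("accepted:", [e.field for e in accepted])
    print("for review:", [e.field for e in queue])
```

Ranking by distance from 0.5 is a simple form of uncertainty sampling; other selection strategies would fit the same loop.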

WP2 Transformation into a machine readable and FAIR data repository

The extended BacDive content will be standardized and transformed into a machine-readable and machine-processable format following the FAIR (findable, accessible, interoperable, reusable) and Linked Data (LD) principles. Existing ontologies will be reused wherever possible to ensure that BacDive data can be linked with other, already semantically enriched data and thereby integrated into the existing landscape of semantic services. Our combined approach will ultimately provide the scientific community with a facility for easy lookup and for systematic, focused download of species-associated data in two different, complementary ways. After transformation into a fully machine-readable and FAIR-compliant research data repository, the search options will be further developed by establishing the BacDive Knowledge Graph.
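To make the idea of an RDF transformation concrete, the following minimal sketch turns one strain record into N-Triples lines using only the Python standard library. Everything specific in it is an assumption for illustration: the namespace `https://example.org/bacdive/`, the trait names, and the function `strain_to_ntriples` are hypothetical and do not reflect the actual BacDive URI scheme or the ontologies the project reuses.

```python
from urllib.parse import quote

# Hypothetical namespace for illustration; not the real BacDive URI scheme.
BASE = "https://example.org/bacdive/"
XSD = "http://www.w3.org/2001/XMLSchema#"


def escape_literal(value: str) -> str:
    """Escape backslashes, quotes and line breaks for an N-Triples literal."""
    return (
        value.replace("\\", "\\\\")
        .replace('"', '\\"')
        .replace("\n", "\\n")
        .replace("\r", "\\r")
    )


def strain_to_ntriples(strain_id: str, traits: dict) -> list:
    """Return one N-Triples line per trait, sorted by trait name.

    Booleans, integers and (finite) floats get XSD datatypes; any other
    value is written as a plain string literal.
    """
    subject = f"<{BASE}strain/{quote(strain_id, safe='')}>"
    lines = []
    for name, value in sorted(traits.items()):
        predicate = f"<{BASE}trait/{quote(name, safe='')}>"
        if isinstance(value, bool):  # check bool first: bool is a subclass of int
            obj = f'"{str(value).lower()}"^^<{XSD}boolean>'
        elif isinstance(value, int):
            obj = f'"{value}"^^<{XSD}integer>'
        elif isinstance(value, float):
            obj = f'"{value}"^^<{XSD}double>'
        else:
            obj = f'"{escape_literal(str(value))}"'
        lines.append(f"{subject} {predicate} {obj} .")
    return lines


if __name__ == "__main__":
    record = {"gram_stain": "negative", "motile": True, "optimum_temperature_c": 37.0}
    for line in strain_to_ntriples("123", record):
        print(line)
```

In practice, a dedicated RDF library and established ontology terms would replace the hand-built URIs; the sketch only shows how tabular trait data maps onto subject, predicate and object.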

WP3 Extending the database

Nearly 250,000 bacterial genome sequences have become available, including those of phenotypically characterized species. Bacterial genome sequences allow predictions of phenotypic traits such as the utilization of carbon and nitrogen sources, biosynthetic capabilities, motility, sporulation, or secretion. So far this information has not been systematically mobilized and integrated with phenotypic information for the same bacterial species. In a comprehensive approach, potential phenotypic characteristics will be extracted from available genome data. The results will be fed into a novel, genome-based knowledge base that will be established within BacDive. To evaluate the predictive power and plausibility of the genome-based approach, the derived phenotypic predictions will be tested against the actual phenotypic data retrieved by literature mining. Genome-based data that reach or exceed a predefined confidence score will then be used to complement or correct phenotypic entries in the BacDive database. As an innovative approach, we will link the specific phenotypic properties and environmental preferences of so-far-uncharacterized bacteria (obtained from ecological data in WP1) with the occurrence patterns of particular non-annotated genes across their genomes. This will generate hypotheses on the specific functions of non-annotated genes.
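The evaluation and confidence filtering described above can be sketched as follows. This is an illustration under assumptions, not the project's implementation: traits are simplified to yes/no values, each prediction is assumed to carry a confidence score, and the names `Prediction`, `evaluate` and `threshold` are hypothetical. The function reports how often predictions agree with curated literature data and lists the predictions that reach the threshold and would either fill a missing entry or flag a conflicting one.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Prediction:
    """A genome-derived trait prediction (hypothetical record layout)."""
    strain_id: str
    trait: str
    value: bool
    confidence: float  # assumed score in [0, 1]


def evaluate(predictions, curated, threshold):
    """Compare predictions with curated data and apply a confidence cut-off.

    `curated` maps (strain_id, trait) to the literature-derived value.
    Returns (agreement, to_complement, to_correct):
      agreement     - fraction of comparable predictions matching curated
                      data, or None if nothing can be compared
      to_complement - predictions >= threshold with no curated entry
      to_correct    - predictions >= threshold contradicting a curated entry
    """
    compared = agreeing = 0
    to_complement, to_correct = [], []
    for p in predictions:
        key = (p.strain_id, p.trait)
        if key in curated:
            compared += 1
            if curated[key] == p.value:
                agreeing += 1
            elif p.confidence >= threshold:
                to_correct.append(p)
        elif p.confidence >= threshold:
            to_complement.append(p)
    agreement = agreeing / compared if compared else None
    return agreement, to_complement, to_correct


if __name__ == "__main__":
    curated = {
        ("s1", "motility"): True,
        ("s1", "sporulation"): False,
        ("s2", "motility"): False,
    }
    predictions = [
        Prediction("s1", "motility", True, 0.95),
        Prediction("s1", "sporulation", True, 0.90),
        Prediction("s2", "motility", False, 0.50),
        Prediction("s2", "sporulation", True, 0.90),
        Prediction("s3", "motility", True, 0.30),
    ]
    agreement, add, fix = evaluate(predictions, curated, threshold=0.9)
    print(agreement, [p.strain_id for p in add], [p.strain_id for p in fix])
```

Whether a conflicting high-confidence prediction should overwrite a curated value or only be flagged for review is a policy decision the sketch leaves open.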

WP4 Predicting suitable cultivation conditions for uncharacterized bacteria

The extended phenotypic dataset (consisting of text-mined, genome-derived, semantically transformed data) will be exploited to infer optimal growth conditions (e.g., pH, temperature, salinity, carbon sources, growth factor requirements, trace elements) and the corresponding culture media composition for selected target species that have not yet been cultured. The media and growth conditions inferred in silico by machine learning approaches will be quality-controlled and improved in subsequent wet lab experiments.

WP5 Community engagement and dissemination

We will identify additional requirements for the continued, agile development and operation of the phenotypic data services of BacDive in the future. Newly developed tools will be published under OSI-compliant Open Source licenses, adhering to the DFG recommendations about research software and to best practices in the bioinformatics community. All data resources, analysis results, and tools produced in the DiASPora project will be made accessible by extending and reprogramming the existing BacDive webpage.

Members of DiASPora

Prof. Dr. Jörg Overmann


is Director of the Leibniz-Institute DSMZ-German Collection of Microorganisms and Cell Cultures, full Professor of Microbiology at the TU Braunschweig, and leads the DSMZ Department of Microbial Ecology and Diversity research (DSMZ/MED). His research focuses on molecular microbial diversity, bacterial physiology, and bacterial interactions.

Prof. Dr. Sören Auer


is a professor of Data Science and Digital Libraries at Leibniz University of Hannover and director of the German National Library of Science and Technology (TIB). He has made substantial contributions to semantic web technologies, knowledge engineering, software engineering, usability, as well as databases and information systems.

Prof. Dr. Dietrich Rebholz-Schuhmann


is professor of Biomedical Data Semantics and Analytics at the University of Cologne and the scientific director of ZB MED, Information Center for Life Sciences. He is a medical doctor and a computer scientist, and holds a PhD in immunology and a Doctor of Science (Habilitation) in computer science.

Dr. Angelina Kraft


is a data scientist and head of the Lab Research Data Services at TIB. The team develops quality standards and infrastructure services for the publication of research data and scientific software according to the FAIR principles.

Prof. Dr. Konrad Förstner


holds a joint professorship for Information Literacy at the TH Köln – University of Applied Sciences and at ZB MED, where he also leads the unit "Data Science and Services" and is responsible for the discovery service LIVIVO. His research activities include high-throughput sequence data analysis and systems biology in microbiology, text mining and knowledge management of biological literature, as well as further applications of data science methods in the life sciences.

Dr. Lorenz Reimer


leads the database development team at the DSMZ and is responsible for the BacDive database. His main research interests are mobilizing research data hidden in culture collections and primary literature, and reducing the imbalance between readily available sequence data and hard-to-access phenotypic data in microbial research.

Dr. Boyke Bunk


is the head of the bioinformatics department at DSMZ.

Joaquim Sarda


is a software developer at the DSMZ and is the main web and database developer for the BacDive database. He has extensive experience developing web-based applications across multiple industries.

Dr. Julia Koblitz


is a data scientist at DSMZ with expertise in genome annotation, metabolic modeling, enzymology, and data visualization. She developed tools for analyzing and visualizing metabolism-associated data, for instance MetaboMAPS and MMTB.

Dr. Arindam Halder


is a bioinformatician at ZB MED. His research activities have focussed on applying text mining and natural language processing methods for biological knowledge discovery. He also has experience in molecular biology techniques using stem cells for cardiomyocyte repair. He has previously worked in the area of personalised medicine for oncology using systems biology and mathematical modeling.

Gautam Shahi


is a research assistant in the Lab Research Data Services at the German National Library of Science and Technology (TIB). He has expertise in the semantic web, ontology engineering, knowledge graphs, and software development.

The DiASPora project is funded by the SAW Programme of the Leibniz Association, Funding No. K280/2019.

Get in touch

Contact us!