Data Integration Using A Graph Database

Posted on May 24th, 2013 by Stephen Young.

Data Integration is an increasingly-common problem for large enterprise and small business alike. How do we tie our accounting, ERP, CRM, HR and other systems together so that the important information that our staff and other stakeholders need is available in one place? And it's not just a corporate problem - integrating biological and other research databases, patient data, public records and shared Intelligence data for example are even bigger challenges.

Currently there are three main strategies for providing unified access to disparate data stores...

1) Application Integration
The most common solution is to build applications that answer specific pre-defined questions by hard-wiring those questions into middleware - software that sits beween data silos. To reduce the connection-complexity (six systems, for example, requires 15 different silo-to-silo connections) some kind of common "bus" is often used to route the queries and the data.

Application Integration is an expensive, time-consuming process - and increasingly many larger enterprises are finding that it's a strategy that just can't keep up with business change.

2) Data Warehouses
The complexities of Application Integration often mean that it's a strategy used only for transactions and other core modifications to data. Providing visibility requires another step - extract, transform and load (ETL) the data into a Data Warehouse. Powerful hardware and tools can then report many of the consolidated views, insights and other gems that business needs.

3) Virtual Databases
Data Warehouse data is historical, and for many businesses, access to anything less than current data is not good enough. A recent solution to this "real-time" problem has been to use a "virtual" or "federated" database. The Federated Database leaves silo data as it is, and where it is. Instead, it provides a unified single view by offering consensus or "mediated" data structures that can be referenced in place of those used by the various application silos. Query the virtual database and it will fetch the data it needs from those silos.

A Virtual Database strategy has its problems however...

  • Any but the most trivial of queries need to be hand-coded into the "wrappers" that sit between the virtual database and it's silos..
  • Mapping multiple data silos to the required "mediated schema" can be a difficult (even impossible) task, as anyone who has tried to map between only two schemas during a data migration will tell you.

There are any number of disciplines, stategies and tools tasked with reducing the impact of these and related problems - Schema matching, Semantic Integration, Master Data Management and Ontology Mapping to name a few. But there is a promising new approach made possible by using a Graph Database.

A unifying database built using a Graph DBMS has no need of a mediated schema or any other consensus structures. The anything-can-connect-to-anything nature of graph-based storage means that every table and column name from every schema in every data silo can be retained in a graph representation of enterprise data.

What's more, because a Graph Database maintains relationships as data (rather than SQL), there is no need to write wrapper queries.

Mapping RDBMS data into a graph requires some changes to the way we think about data. The W3C work on R2RML, an RDB to RDF Mapping Language, is a useful starting point. R2RML is a "language for expressing customized mappings from relational databases to RDF (i.e. Semantic Web) datasets. Such mappings provide the ability to view existing relational data in the RDF data model, expressed in a structure and target vocabulary of the mapping author's choice." As of late last year, R2RML has become an official W3C Recommendation.

Our RapidGrapher product provies a much simpler way to map and manage RDBMS data as a graph.

In a later blog entry, I'll work through an example.