Databases are good at inserting, updating, querying, and deleting data and representing the data’s current state. Developers rely on data consistency so APIs can perform the correct transactions and applications can retrieve accurate records. Other consumers of data include data scientists developing machine learning models and citizen data scientists creating data visualizations.

Ask a SQL or NoSQL database what the data looked like two days ago, and you might have to rely on database snapshots or proprietary features to get that view. Snapshots and backups may be good enough for developers or data scientists who need to compare older data sets, but they are not adequate tools for tracking how the data changed.

There are many good reasons to know more about how people and systems modify data. It’s important to have the capabilities to answer questions such as:

  • Who or what business process changed the data?
  • What tool or technology made the change?
  • How was the data changed? Was it changed by an algorithm, a data flow, an API call, or someone entering data into a form?
  • What were the changes to records, documents, nodes, fields, or attributes?
  • When was the change made, and if done by a person, where were they geographically?
  • Why was the change made? What was the context?
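
One way to make these questions concrete is to picture each change stored as its own record. The Python sketch below is only an illustration under stated assumptions: the `ChangeEvent` class, its field names, the `state_as_of` helper, and the sample customer data are all hypothetical and do not come from any particular lineage tool or data catalog. It shows how a log of change events can carry the who, what, how, when, where, and why of a change, and how replaying that log can rebuild what a record looked like at an earlier point in time.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Any, Optional


@dataclass(frozen=True)
class ChangeEvent:
    """One recorded change to a single field (all names are hypothetical)."""
    record_id: str
    field_name: str                  # what changed
    old_value: Any
    new_value: Any
    changed_by: str                  # who: a person or business process
    tool: str                        # what tool or technology made the change
    method: str                      # how: "api_call", "form_entry", "algorithm", ...
    changed_at: datetime             # when
    location: Optional[str] = None   # where, if a person made the change
    reason: Optional[str] = None     # why: free-text context


def state_as_of(events, record_id, as_of):
    """Rebuild a record's field values at a point in time by replaying events."""
    state = {}
    for ev in sorted(events, key=lambda e: e.changed_at):
        if ev.record_id == record_id and ev.changed_at <= as_of:
            state[ev.field_name] = ev.new_value
    return state


# Illustrative sample data, not taken from a real system.
events = [
    ChangeEvent("cust-1", "email", None, "a@example.com",
                changed_by="signup-service", tool="REST API", method="api_call",
                changed_at=datetime(2021, 3, 1, tzinfo=timezone.utc),
                reason="account created"),
    ChangeEvent("cust-1", "email", "a@example.com", "b@example.com",
                changed_by="jsmith", tool="admin console", method="form_entry",
                changed_at=datetime(2021, 3, 3, tzinfo=timezone.utc),
                location="Chicago", reason="customer requested update"),
]

earlier = datetime(2021, 3, 2, tzinfo=timezone.utc)
print(state_as_of(events, "cust-1", earlier))
```

Unlike a snapshot, a log like this keeps every intermediate change along with its context, so the same records can answer both what the data looked like at a given time and how it got there.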

Data lineage explained

Data lineage consists of methodologies and tools that expose data’s life cycle and help answer questions about who changed the data and when, where, why, and how it changed. It’s a discipline within metadata management and is often a featured capability of data catalogs, which let data consumers understand the context of the data they use for decision-making and other business purposes.

One way to explain data lineage is as the GPS of data, providing “turn-by-turn directions and a visual overview of the completely mapped route.” Others view data lineage as a core datagovops practice, in which data lineage, testing, and sandboxes are data governance’s technical practices and automation opportunities.

Capturing and understanding data lineage is important for several reasons:

Copyright © 2021 IDG Communications, Inc.