CompilerWorks' Lineage Solution

Data processing in enterprises is complex, and continues to grow in complexity.

Data engineers understand individual data processing pipelines, but can they (or the organization) understand and optimize data processing across multiple pipelines or enterprise wide? How do data scientists quickly find and build analyses around critical datasets? What is the most efficient way to add new data processing or analytics to existing processes?

If you need to manage data scientist or engineer productivity, timeliness of data delivery, data discovery, information correctness, or critical path reliability, CompilerWorks’ Lineage Solution will solve these fundamental data processing challenges.

CompilerWorks’ Lineage Solution analyzes code; it never touches data. It builds a lineage fabric (a unified model) at the database column level, spanning multiple data repositories across the entire enterprise. The fabric is precise, providing detailed insight into how every data element interacts with every other data element.
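CompilerWorks does not publish its internal representation, but as a rough mental model the fabric behaves like a directed graph over fully qualified columns, where each edge records the expression that derives the target from the source. The sketch below is purely illustrative; names such as LineageFabric are invented for the example:

```python
# Illustrative sketch only: CompilerWorks' internal model is not public.
# A column-level lineage fabric can be pictured as a directed graph whose
# nodes are fully qualified columns and whose edges record the expression
# that derives the target from the source.
from collections import defaultdict
from dataclasses import dataclass

@dataclass(frozen=True)
class Column:
    repository: str   # e.g. "teradata_prod", "bigquery_dw"
    table: str
    name: str

class LineageFabric:
    def __init__(self):
        self.downstream = defaultdict(set)  # Column -> set of derived Columns
        self.upstream = defaultdict(set)    # Column -> set of source Columns
        self.expression = {}                # (src, dst) -> deriving expression

    def add_edge(self, src: Column, dst: Column, expr: str):
        self.downstream[src].add(dst)
        self.upstream[dst].add(src)
        self.expression[(src, dst)] = expr

# Example: a reporting-mart column derived from a raw orders table,
# crossing a repository boundary.
fabric = LineageFabric()
raw = Column("teradata_prod", "orders", "amount")
mart = Column("bigquery_dw", "daily_revenue", "total")
fabric.add_edge(raw, mart, "SUM(orders.amount)")
```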

The lineage fabric reveals insights into data infrastructure that increase data processing, data engineering and data science productivity.

Timeliness of Delivery
Meeting and beating SLAs

Data engineering departments are inevitably judged on their ability to meet (or beat) SLAs, yet they are held accountable without control over the load users place on the data infrastructure. CompilerWorks’ Lineage Solution reveals the issues that impact SLAs and highlights the control points needed to meet (and beat) them.

To optimize SLAs, there are several complementary approaches (the first is sketched in code after this list):

  • discover and optimize the critical paths to high-value datasets;
  • ensure source datasets are always up to date;
  • remove redundant and unnecessary processing from the critical path;
  • discover duplication of effort and opportunities to share costs.
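As a toy illustration of the first point, the sketch below computes the chain of upstream steps that gates a target dataset's delivery time. The pipeline names and runtimes are invented for the example; in practice the dependency graph is derived automatically from code:

```python
# Sketch under assumed inputs: a dependency graph of pipeline steps and a
# runtime per step. The SLA for a target dataset is bounded by the
# latest-finishing chain of upstream work; optimizing steps off that chain
# cannot improve delivery time.
import functools

deps = {  # hypothetical pipeline: step -> upstream steps it waits on
    "daily_revenue": ["clean_orders", "fx_rates"],
    "clean_orders": ["land_orders"],
    "fx_rates": [],
    "land_orders": [],
}
runtime_min = {"daily_revenue": 15, "clean_orders": 40,
               "fx_rates": 5, "land_orders": 30}

@functools.lru_cache(maxsize=None)
def finish_time(step: str) -> int:
    # Earliest finish = own runtime + latest-finishing upstream dependency.
    return runtime_min[step] + max((finish_time(d) for d in deps[step]), default=0)

def critical_path(step: str) -> list:
    path = [step]
    while deps[step]:
        step = max(deps[step], key=finish_time)  # the dependency that gates us
        path.append(step)
    return path[::-1]

print(finish_time("daily_revenue"))   # 85 minutes end to end
print(critical_path("daily_revenue")) # ['land_orders', 'clean_orders', 'daily_revenue']
```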

The lineage fabric directly addresses these issues. It offers performance optimizations beyond the capabilities of the database. The database does an excellent job of making individual queries execute efficiently; the lineage fabric can be used to optimize a series of queries (or even pipelines) to increase processing speed and reduce data weight.

Why automate?
Automation keeps the lineage fabric, and the insights it delivers, up to date as the data infrastructure evolves.
Information Correctness
Measuring and building trust in datasets

An enterprise’s data warehouse is the foundation of its planning and decision-making, and data errors can be very costly. CompilerWorks’ Lineage Solution identifies all dataset and process dependencies. Imagine being able to immediately notify every user who depends on a particular dataset when an error occurs while landing a particular table or column.

CompilerWorks’ Lineage Solution exposes all upstream and downstream dependencies. Once a data issue is corrected, users can automatically propagate that information to all downstream pipelines, datasets, and business users that depend on it.
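As a minimal sketch of how such a notification could work, assuming a column-level graph keyed by "table.column" strings and a hypothetical owner registry (the real product derives the graph itself automatically):

```python
# Minimal sketch: a breadth-first walk over downstream edges collects every
# column transitively derived from the one that landed with an error.
from collections import deque

downstream = {  # hypothetical column-level edges
    "orders.amount": ["clean_orders.amount"],
    "clean_orders.amount": ["daily_revenue.total", "finance_export.amount"],
    "daily_revenue.total": [],
    "finance_export.amount": [],
}
owners = {"daily_revenue.total": "fp&a-team",
          "finance_export.amount": "finance-ops"}  # invented registry

def affected(column: str) -> set:
    """Every column transitively derived from `column`."""
    seen, queue = set(), deque([column])
    while queue:
        for nxt in downstream.get(queue.popleft(), []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen

# An error landing orders.amount: notify every dependent owner at once.
for col in affected("orders.amount"):
    if col in owners:
        print(f"notify {owners[col]}: {col} may be stale")
```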

Absolute Semantic Accuracy
Column-level accuracy down to the individual opcode, expression, and join: a task that would be impossible to perform manually.
Critical Path Reliability
Monitoring the reliability of data delivery infrastructure

If a pipeline on a critical path fails, or worse, is executed ad hoc or unreliably, every dataset downstream of the failure (or potential failure) may be affected. CompilerWorks computes all effects at column granularity so that issues can be reported to customers in a timely manner. CompilerWorks’ Lineage Solution identifies (the sketch after this list illustrates the upstream audit):

  • whether all processes on the path to a critical dataset are reliable;
  • whether all pipelines are under proper organizational management;
  • whether a dataset may be untrustworthy because a preceding process failed.
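A simplified illustration of such an upstream audit, assuming each process carries metadata about scheduler management and its last run; the field names here are hypothetical, not CompilerWorks' schema:

```python
# Sketch: walk every process upstream of a critical dataset and report
# reliability risks (unscheduled/ad hoc execution, recent failures).
upstream = {
    "daily_revenue": ["clean_orders", "fx_rates"],
    "clean_orders": ["land_orders"],
    "fx_rates": [],
    "land_orders": [],
}
process = {  # invented metadata fields for the example
    "daily_revenue": {"scheduled": True,  "last_run_ok": True},
    "clean_orders":  {"scheduled": True,  "last_run_ok": False},  # failed overnight
    "fx_rates":      {"scheduled": False, "last_run_ok": True},   # run ad hoc!
    "land_orders":   {"scheduled": True,  "last_run_ok": True},
}

def audit(dataset: str, seen=None) -> list:
    """Report reliability risks on every upstream path to `dataset`."""
    seen = set() if seen is None else seen
    issues = []
    for parent in upstream[dataset]:
        if parent in seen:
            continue
        seen.add(parent)
        meta = process[parent]
        if not meta["scheduled"]:
            issues.append(f"{parent}: run ad hoc, not under scheduler control")
        if not meta["last_run_ok"]:
            issues.append(f"{parent}: last run failed; {dataset} may be stale")
        issues += audit(parent, seen)
    return issues

print(audit("daily_revenue"))
```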
Error Tolerance
CompilerWorks is tolerant of noisy, incomplete, and incorrect data, and will make intelligent deductions.
GDPR, PII, and Infosec Compliance

Mark an individual column as "secure data," and the lineage fabric will propagate that tag throughout the entire data infrastructure. The propagation takes into account whether the data is an exact copy, a partial copy, or whether only a portion of the data 'leaks' (e.g. the MAX function leaks a single data point), and whether the secure data influences a final result. It crosses the boundaries between multiple data repositories and also accounts for data exports.
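The edge kinds and leak classifications below are assumptions made for the example, not CompilerWorks' published taxonomy; they show how a "secure data" tag could be pushed downstream with a severity attached to each hop:

```python
# Illustrative sketch: an exact copy carries the full tag; an aggregate like
# MAX() leaks a single value; a boolean derived from the column merely
# influences the result. Edge kinds here are invented for the example.
from collections import deque

EDGES = [  # (source_col, target_col, kind)
    ("users.email", "staging.email", "copy"),            # exact copy
    ("staging.email", "report.latest_email", "aggregate"),  # e.g. MAX(email)
    ("users.email", "metrics.has_contact", "influence"),  # CASE WHEN email IS NULL
]
SEVERITY = {"copy": "full", "aggregate": "leaks one value",
            "influence": "influences result"}

def propagate(tag_root: str) -> dict:
    """Push a 'secure data' tag through every downstream edge."""
    tagged, queue = {}, deque([tag_root])
    while queue:
        src = queue.popleft()
        for s, dst, kind in EDGES:
            if s == src and dst not in tagged:
                tagged[dst] = SEVERITY[kind]
                queue.append(dst)
    return tagged

print(propagate("users.email"))
# {'staging.email': 'full', 'metrics.has_contact': 'influences result',
#  'report.latest_email': 'leaks one value'}
```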

The result is that a robust audit against GDPR requirements is instantly available. Any individual instance of secure data, and its downstream impact, can be cleansed from the organization as required to satisfy legal and operational criteria.

Enterprise Wide
Builds column-level lineage across multiple data systems.
Data and Product Discovery
Deriving increased value from existing data inventory

Where does an analyst or data scientist start when constructing a new analysis or pipeline? If data engineers documented their code and table metadata, a simple search would suffice. Unfortunately, maintaining correct documentation is time-consuming and expensive, and most data engineering organizations fall short of this ideal.

CompilerWorks’ Lineage Solution tracks data from its source through all downstream processes, and across multiple data repositories. The graphs and reports it produces direct data scientists to source data and pipeline processes that already exist. Data analysts and engineers can quickly find source data, identify existing pipelines, and build on existing processing rather than rebuilding datasets from scratch.

CompilerWorks’ Lineage Solution enables analyst teams to create new business value, and provides an immediate, continuously-updated inventory of data assets and consumption.

Warehouse Cost Control
Coping with organic growth and eliminating resource waste

CompilerWorks’ Lineage Solution tracks the usage and consumption of every dataset and propagates this information through the fabric to generate "unused" and "eventually-unused" annotations at column level (sketched below), keeping the data warehouse free of useless data and processing.
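A toy version of the annotation logic, assuming consumption logs that record which columns are read by live consumers such as dashboards or exports (all names are invented):

```python
# Sketch: a column with no readers is "unused"; a column whose only readers
# feed unused columns is "eventually-unused"; anything reaching a live
# consumer stays "used".
from functools import lru_cache

readers = {  # column -> downstream columns derived from it
    "orders.amount": ["mart.revenue", "mart.legacy_total"],
    "mart.revenue": [],              # read directly by a live dashboard
    "mart.legacy_total": ["export.old_report"],
    "export.old_report": [],         # nobody consumes this any more
}
directly_used = {"mart.revenue"}     # from query/consumption logs

@lru_cache(maxsize=None)
def status(col: str) -> str:
    if col in directly_used:
        return "used"
    if not readers[col]:
        return "unused"
    if any(status(d) == "used" for d in readers[col]):
        return "used"
    return "eventually-unused"

for col in readers:
    print(col, "->", status(col))
# orders.amount -> used, mart.revenue -> used,
# mart.legacy_total -> eventually-unused, export.old_report -> unused
```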

The resulting reduction in data weight lowers flexible cloud data warehouse costs, or avoids the need for additional technology capex and opex. It also helps avoid development freezes when warehouse capacity is reached, and postpones migration to more expensive resources.