CompilerWorks Lineage Solution

Data processing in enterprises is already complex, and it continues to grow in complexity. Data scientists can understand individual data processing pipelines, but does the organization understand data processing at the global level? What is the most efficient way to add new data processing to existing processes?

CompilerWorks’ Lineage Solution analyzes the dependencies and data flows between the elements of an enterprise’s processing pipelines and automatically constructs a unified model of data processing throughout the enterprise, across multiple data repositories, at the database column level.

CompilerWorks’ Lineage Solution provides the information to manage deliverables such as timeliness of delivery, information correctness, and critical path reliability. It does this by compiling the actual code executed in the data infrastructure into a Data Lineage Fabric. The lineage fabric building process does not need to touch the actual data. The fabric itself is precise, providing detailed insight into how every data column interacts with every other data column in the enterprise.

The lineage fabric reveals insights into data infrastructure that enable deterministic processes to identify, trace, and resolve problems. The result is that data discovery is easier, building new analyses is more efficient, and the cost of data processing can be proactively controlled.
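To make the idea concrete, the sketch below (Python, purely illustrative and not CompilerWorks’ internal representation) models a column-level lineage fabric as a directed graph: nodes are fully qualified columns, and each edge records the expression through which one column was derived from another.

    from collections import defaultdict

    # Hypothetical, minimal model of a column-level lineage fabric:
    # each edge says "target column is derived from source column via expr".
    class LineageFabric:
        def __init__(self):
            self.downstream = defaultdict(list)  # source -> [(target, expr)]
            self.upstream = defaultdict(list)    # target -> [(source, expr)]

        def add_edge(self, source, target, expr):
            self.downstream[source].append((target, expr))
            self.upstream[target].append((source, expr))

    fabric = LineageFabric()
    # An edge as it might be extracted by compiling a query such as:
    #   INSERT INTO mart.revenue SELECT SUM(amount) AS total FROM raw.orders
    fabric.add_edge("raw.orders.amount", "mart.revenue.total", "SUM(amount)")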

Timeliness of Delivery
Meeting and beating SLAs

Data engineering departments are inevitably judged on their ability to meet (or beat) SLAs: they are held accountable for SLAs yet lack control of the load users place on the data infrastructure. CompilerWorks’ Lineage Solution reveals the issues that impact SLAs and highlights the control points for meeting (and beating) them.

There are several approaches to optimizing SLA performance:

  • discover and optimize the critical paths to high-value datasets;
  • ensure source datasets are always up to date;
  • remove redundant and unnecessary processing from the critical path;
  • discover duplication of effort and opportunities to share costs.

The lineage fabric directly addresses these issues. It offers performance optimizations beyond the capabilities of the database. The database does an excellent job of making individual queries execute efficiently; the lineage fabric can be used to optimize a series of queries (or even pipelines) to increase processing speed and reduce data weight.
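As an illustration of this kind of cross-query optimization (a sketch, not CompilerWorks’ algorithm), the critical path to a high-value dataset can be computed from pipeline dependencies and observed runtimes; the slowest chain is where optimization pays off first. All step names and timings below are hypothetical.

    from functools import lru_cache

    # Hypothetical pipeline DAG: step -> upstream steps it depends on,
    # plus observed runtimes in minutes.
    deps = {
        "load_orders": [],
        "clean_orders": ["load_orders"],
        "load_rates": [],
        "revenue_mart": ["clean_orders", "load_rates"],
    }
    runtime = {"load_orders": 15, "clean_orders": 30,
               "load_rates": 5, "revenue_mart": 20}

    @lru_cache(maxsize=None)
    def critical_path(step):
        """Return (total_minutes, path) of the slowest chain ending at step."""
        if not deps[step]:
            return runtime[step], [step]
        best_time, best_path = max(critical_path(d) for d in deps[step])
        return best_time + runtime[step], best_path + [step]

    # (65, ['load_orders', 'clean_orders', 'revenue_mart'])
    print(critical_path("revenue_mart"))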

Why automate?
Keeps the lineage fabric and the insights it delivers up to date as the data infrastructure evolves.
Information Correctness
Measuring and building trust in datasets

An enterprise’s data warehouse is the foundation of its planning and decision-making. Data errors can be very costly. CompilerWorks’ Lineage Solution identifies all dataset and process dependencies. Imagine being able to immediately notify all users dependent on a particular dataset when there is an error landing a particular table or column.

CompilerWorks’ Lineage Solution exposes all upstream and downstream dependencies. Once a data issue is corrected, users can automatically propagate that information to ALL downstream pipelines, datasets, and business users that depend on it.
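Once downstream edges are exposed, "who must be notified?" reduces to a reachability query over the fabric. A minimal sketch, using a hypothetical downstream adjacency map:

    from collections import deque

    # Hypothetical downstream adjacency: column -> columns derived from it.
    downstream = {
        "raw.orders.amount": ["staging.orders.amount"],
        "staging.orders.amount": ["mart.revenue.total", "mart.refunds.total"],
    }

    def affected_columns(bad_column):
        """Every column transitively derived from bad_column (BFS)."""
        seen, queue = set(), deque([bad_column])
        while queue:
            col = queue.popleft()
            for nxt in downstream.get(col, []):
                if nxt not in seen:
                    seen.add(nxt)
                    queue.append(nxt)
        return seen

    # Columns (and hence the users reading them) to notify after a bad load:
    print(affected_columns("raw.orders.amount"))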

Absolute Semantic Accuracy
Column-level accuracy down to the individual opcode, expression, and join - a task that is impossible to perform manually.
Critical Path Reliability
Monitoring the reliability of data delivery infrastructure

If a pipeline in a critical path fails or, worse, is executed ad hoc or unreliably, all datasets downstream of the failure or potential failure may be affected. CompilerWorks computes all effects at column granularity so that issues can be reported to customers in a timely manner. CompilerWorks’ Lineage Solution identifies:

  • whether all processes on the path to a critical dataset are reliable;
  • whether all pipelines are under proper organizational management;
  • whether a dataset may be untrustworthy because a preceding process failed.
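These checks can be expressed as simple walks over the fabric. A minimal sketch of the first one, with hypothetical producer and scheduling metadata: walk upstream from a critical dataset and flag any producing process that is unscheduled or unowned.

    # Hypothetical metadata: dataset -> (producing process, its inputs),
    # and per-process scheduling/ownership facts.
    producers = {
        "mart.revenue": ("revenue_job", ["staging.orders"]),
        "staging.orders": ("orders_job", ["raw.orders"]),
        "raw.orders": ("adhoc_load", []),
    }
    process_meta = {
        "revenue_job": {"scheduled": True, "owner": "data-eng"},
        "orders_job": {"scheduled": True, "owner": "data-eng"},
        "adhoc_load": {"scheduled": False, "owner": None},  # red flag
    }

    def audit(dataset, issues=None):
        """Collect unreliable processes on the upstream path of dataset."""
        issues = [] if issues is None else issues
        process, inputs = producers[dataset]
        meta = process_meta[process]
        if not meta["scheduled"] or meta["owner"] is None:
            issues.append((dataset, process))
        for upstream in inputs:
            audit(upstream, issues)
        return issues

    print(audit("mart.revenue"))  # [('raw.orders', 'adhoc_load')]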
Error Tolerance
CompilerWorks is tolerant of noisy, incomplete, and incorrect data, and will make intelligent deductions.
GDPR, PII, and Infosec Compliance

Mark an individual column as "secure data," and the lineage fabric will propagate that tag throughout the entire data infrastructure. It takes into account whether the data is an exact copy or a partial copy, whether only a portion of the data 'leaks' (e.g. the MAX function leaks one data point), and whether the secure data influences a final data point. The propagation crosses the boundaries between data repositories and also accounts for data exports.
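A minimal sketch of such tag propagation, with hypothetical edge labels for the leak classes ('copy' for an exact copy, 'partial' for a subset, 'aggregate' for a single leaked data point); this illustrates the idea rather than CompilerWorks’ implementation:

    # Hypothetical edges: column -> [(derived column, leak class)].
    edges = {
        "hr.people.ssn": [("staging.people.ssn", "copy"),
                          ("mart.people.ssn_last4", "partial"),
                          ("mart.stats.max_ssn", "aggregate")],
        "staging.people.ssn": [("export.csv.ssn", "copy")],  # a data export
    }

    def propagate_secure(column):
        """Tag every column that receives any of column's data."""
        tagged = {}                      # column -> leak class
        stack = [(column, "source")]
        while stack:
            col, leak = stack.pop()
            if col in tagged:            # simplified: keep first class seen
                continue
            tagged[col] = leak
            for target, edge_leak in edges.get(col, []):
                stack.append((target, edge_leak))
        return tagged

    print(propagate_secure("hr.people.ssn"))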

The result is that a robust audit of GDPR requirements is instantly available. An individual instance of secure data, and everything it influences, can be cleansed from the organization as required to satisfy legal and operational criteria.

Enterprise Wide
Builds column-level lineage across multiple data systems.
Data and Product Discovery
Deriving increased value from existing data inventory

Where does an analyst or data scientist start when constructing a new analysis or pipeline? If data engineers document their code and table metadata, then a simple search will suffice. Unfortunately, maintaining correct documentation is time-consuming and expensive; most data engineering organizations fall short of this ideal.

CompilerWorks’ Lineage Solution tracks data from its source through all downstream processes (and across multiple data repositories). The graphs and reports it produces direct data scientists to source data and pipeline processes that already exist. Data analysts and engineers can quickly find source data, identify existing data processing pipelines, and leverage existing processing rather than rebuilding datasets organically.
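As an illustration, the same fabric that answers "who is downstream?" answers the discovery question in reverse: given a column an analyst wants to build on, walk upstream until raw sources are reached. The adjacency data below is hypothetical.

    # Hypothetical upstream adjacency: column -> columns it is derived from.
    upstream = {
        "mart.revenue.total": ["staging.orders.amount"],
        "staging.orders.amount": ["raw.orders.amount"],
    }

    def raw_sources(column):
        """Return the set of raw source columns feeding column."""
        parents = upstream.get(column, [])
        if not parents:                  # no producers: a raw source
            return {column}
        return set().union(*(raw_sources(p) for p in parents))

    print(raw_sources("mart.revenue.total"))  # {'raw.orders.amount'}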

CompilerWorks’ Lineage Solution enables analyst teams to create new business value, and provides an immediate, continuously updated inventory of data assets and consumption.

Warehouse Cost Control
Coping with organic growth and eliminating resource waste

CompilerWorks’ Lineage Solution tracks the usage and consumption of all data sets and propagates this information to generate "unused" and "eventually-unused" annotations at column level, keeping the data warehouse free of useless data and processing.
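One way to sketch the "eventually-unused" computation (an illustration, not CompilerWorks’ implementation): a column is live only if its data transitively reaches a column that consumers actually read, e.g. as observed in query logs; every other column can be flagged for removal.

    # Hypothetical downstream adjacency and observed consumption.
    downstream = {
        "raw.clicks.ua": ["staging.clicks.ua"],
        "staging.clicks.ua": ["mart.legacy.ua_report"],
        "raw.clicks.url": ["mart.traffic.top_urls"],
    }
    directly_consumed = {"mart.traffic.top_urls"}  # e.g. from query logs

    def is_live(column, seen=frozenset()):
        """True if column's data ever reaches a consumed column."""
        if column in directly_consumed:
            return True
        return any(is_live(c, seen | {column})
                   for c in downstream.get(column, []) if c not in seen)

    all_columns = set(downstream) | {c for v in downstream.values() for c in v}
    # ['mart.legacy.ua_report', 'raw.clicks.ua', 'staging.clicks.ua']
    print(sorted(c for c in all_columns if not is_live(c)))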

The reduction in data weight reduces either flexible cloud data warehouse costs or the need for additional technology capex and opex. It also helps avoid development freezes when warehouse capacity is reached, and postpones the need for migration to more expensive resources.