CompilerWorks' Revolutionary Technology
It sounds simple!

CompilerWorks' core technology performs two easy-to-state steps:

  1. Compile SQL code (and associated procedural language) into a computer algebra representation (mathematical model). CompilerWorks does not touch actual data.
  2. Emit the algebraic representation into either:
    • A lineage fabric: graph of relationships between all elements in the model.
    • SQL code (and a procedural language when required).

That these steps are simple to state masks the complexity of what they achieve.


Core Technology

Revolutionary application of compilers.

CompilerWorks’ core technology comprises a suite of bespoke compilers, a common algebraic representation, and emitters for both a lineage fabric and specific SQL dialects.

SQL code is the fundamental input to the process. The compilers convert the SQL code into an algebraic representation (AR) which preserves a great deal of metadata: for each individual instruction, the AR captures origin, location, pipeline, responsible users and additional user-specified metadata. This metadata is available to the emitters, and so flows through to both the lineage fabric and the emitted SQL code.

There are two classes of emitters: one produces the lineage fabric - the foundation of the lineage solution - and the other is a set of emitters, one for each supported SQL dialect.

[Figure: core technology schematic (technology/coretech.svg)]
CompilerWorks Builds Real Compilers
Parsing isn't enough. Semantics are required.

There are many technology implementations that can parse and emit simple SQL. Only CompilerWorks compiles it with full semantic fidelity.

The bottom line: CompilerWorks is aware of, and cares about, semantics, not just syntax.


Consider the query select 1 / 2; (that is, divide 1 by 2).

The result of this simple query depends on the SQL dialect:
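
Typical out-of-the-box results, as commonly documented (exact behavior depends on product version and session settings):

    select 1 / 2;
    -- PostgreSQL, SQL Server, Teradata: 0    (integer division truncates)
    -- Oracle, MySQL, BigQuery:          0.5  (division yields a decimal or float)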

This case demonstrates the hidden complexity of changing SQL dialects. It is often true that identically named functions do not have identical behavior. For example, does regex_match match the needle-regex against the entire haystack, or just find it within the haystack? That is to say, is it FIND or MATCHES semantics? Does array[index] return null or throw an exception if index is out of bounds? Do negative indices index from the end, return null, or throw exceptions?
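
As a concrete sketch of these differences, as commonly documented (details vary by product version):

    -- FIND vs MATCHES semantics:
    select regexp_contains('foobar', 'oba');   -- BigQuery: true; the pattern may match anywhere
    select 'foobar' rlike 'oba';               -- Snowflake: false; the pattern must match the whole string

    -- Out-of-bounds array indexing:
    select (array[1,2,3])[10];                 -- PostgreSQL: returns NULL
    select [1,2,3][offset(10)];                -- BigQuery: raises an error (SAFE_OFFSET returns NULL instead)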


But I've heard so much about parsing!

Parsing is easy to talk about - it's largely a solved problem. Most importantly, it's a tiny part of the job, and not sufficient to solve the problem of translating code. Parsing and compiling with complete accuracy is not easy - it is a tough engineering problem which CompilerWorks has solved.

Each CompilerWorks compiler and emitter is hand-written, and supports the full definition of the language: syntax, keywords, operators, functions, types, explicit, implicit and assignment casts, and all the other special cases which each vendor has grown into its dialect and product over time. CompilerWorks even captures illegal or erroneous behavior discovered in customers' code. CompilerWorks has learnt NOT to trust what is written in the vendors' manuals.
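
Implicit casts alone illustrate the point; a small example of commonly observed behavior (not a definitive matrix):

    select 1 + '1';
    -- PostgreSQL, MySQL: 2     (the string literal is implicitly coerced to a number)
    -- BigQuery:          error (no implicit cast from STRING to INT64)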

The result: CompilerWorks parses, typechecks, performs operator and function selection, and compiles all supported languages into the core computer algebra representation.

Note: ANSI SQL is largely a useless standard. It gives false confidence that SQL copied from one platform to another will behave as expected simply because it runs. Running on the new platform is NOT validation of correct processing/analysis; the risk of data-dependent errors is very high. Semantic correctness is an absolute requirement if users want correct behavior.

Accuracy and Correctness: Lineage Version
More accurate than any alternative.

What does accurate column level analysis mean? Consider:

  
    create table t1 (c int primary key);
    create table t2 (c int primary key);
    insert into t1 values (1), (2);
    insert into t2 values (2);
    create table out as select t1.c as t1_c, t2.c as t2_c from t1 left outer join t2 using (c);
    select * from out;
     t1_c | t2_c
    ------+------
        1 |
        2 |    2
  

So did t2.c affect out.t1_c?

The join was an outer join, so it cannot delete a row from t1; and it was a join on a unique key, so it cannot increase the cardinality of any row of t1. So the answer is no, it did not.

It is even more complex than that: consider create table out as select substr(text, start, length) from src;. If src.text contains personally identifiable information (PII), out.substr contains PII. But if src.start contains PII, out.substr (under most rules) does not contain PII, because src.start is an index, not a data value.

Naturally, this is configurable for particular customer-defined categories, environments and specific analyses. An ultra-secure environment might consider that a filter effect from a WHERE clause leaks some information (consider joining on generate_sequence), despite no data actually being copied from the source column; a simpler or less restrictively defined environment might only consider EXACT DATA-TO or PARTIAL DATA-TO effects to transfer protected information. Whatever the enterprise's requirements, they are very easy to configure in CompilerWorks, and the underlying algebra and graph engines will compute the result correctly, every time.

Accuracy and Correctness: Transpiler Version
How accurate can CompilerWorks be? How accurate is required?

Here's a simple example that demonstrates what accuracy means in a transpilation:

Consider adding two 8-bit integers, v0 + v1. Integer addition varies across platforms: some dialects wrap at 8 bits, while others promote to a wider type. The question arises: do we emulate a dialect with true 8-bit integer addition - that is to say, represent (add_i8 v0 v1) as (bitand_i32 255 (add_i32 (cast_i8i32 v0) (cast_i8i32 v1))), which will emit as something like ((v0 + v1) & 255) - or do we ignore the 8-bit-ness of the addition and emit v0 + v1?
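
Sketched in SQL (v0, v1 and t are illustrative names; bitwise-AND syntax varies by dialect):

    -- Option 1: faithful emulation of the 8-bit wrap-around on a target
    -- that only has wider integer types:
    select (v0 + v1) & 255 as sum_i8 from t;

    -- Option 2: simplified emission, ignoring the 8-bit-ness:
    select v0 + v1 as sum_i8 from t;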

CompilerWorks enables both options: customers inevitably say, “We want it to be accurate enough that our decision support system gives the right answer.” The default mode in CompilerWorks is itself built from a rather interesting set of choices: under what circumstances should CompilerWorks preserve ultimate accuracy, and under what circumstances should CompilerWorks simplify? The computer-algebra engine has the ability to compute this, along with many other decisions designed to give customers the code they need: clean, maintainable, understandable code which gives the right answer for the intended use case.

Computer Algebra with Serious Power
What Questions Need to be Answered?

The computer algebra engine at the core of the CompilerWorks product suite is incredibly powerful and drives both the transpiler and the analytics products.

The computer algebra engine is central to the ability to answer complex questions about the enterprise's data environment, perform analyses, or emulate features unsupported in a given target dialect. It is powerful enough to apply business-logic rules and structural transformations, and users can even write and install custom plugins for special-purpose tasks such as generating sampling queries.
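
For instance, a sampling plugin might rewrite a query along these lines (a sketch with illustrative table and column names; TABLESAMPLE syntax varies by dialect):

    -- Original query:
    select customer_id, sum(amount) from orders group by customer_id;

    -- Automatically rewritten to run against roughly 1% of the data:
    select customer_id, sum(amount)
    from orders tablesample system (1 percent)
    group by customer_id;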

This fine-grained power enables CompilerWorks to answer almost limitless questions about the enterprise’s data infrastructure. Users can identify why two departments' sales figures differ, by comparing the entire path from source to result as composed functions, or find opportunities to reduce workload by sharing identical computations. A query can be posed to the engine, and the core technology will expose the results (either through an API or in a rich user interface).

Language Cross-Compatibility
Migrate to a platform which doesn't support PL.

CompilerWorks' transpiler has no pair-wise dependencies; support for a particular language enables transpilation from that language to any other supported language.

CompilerWorks' computer algebra represents the superset, not the common subset, of all possible semantics. When CompilerWorks emits to a target dialect, the completeness of the result is constrained only by the expressiveness of the target language. If there is a way to express the required semantics within the target dialect, the algebra engine will do it.

A CompilerWorks target environment is not a single language; it is a stack of languages. For instance, a Teradata source environment may be shell scripts, which invoke BTEQ, which invokes Teradata SQL and SPL. A suitable target environment might be a Python control script which executes BigQuery, with JavaScript for custom UDFs. The computer algebra engine takes no account of which language in the source-language stack generated a particular algebraic term; expressing the desired program in the target language stack is a separate, standalone challenge. A dialect which does not support WITH RECURSIVE may use a Python or JavaScript loop outside the SQL to express the required semantics, while the transpiler may push BTEQ control structures down into JavaScript.
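
As a sketch (with illustrative table and column names), consider a recursive query in the source dialect:

    -- Source dialect: find all transitive reports of manager 42.
    with recursive reports(employee_id) as (
      select employee_id from employees where manager_id = 42
      union all
      select e.employee_id
      from employees e join reports r on e.manager_id = r.employee_id
    )
    select employee_id from reports;

    -- On a target without WITH RECURSIVE, the seed and step queries can be
    -- emitted separately, and the control script (e.g. a Python loop)
    -- re-runs the step against a working table until no new rows appear.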

The computer algebra engine automatically configures itself with plugins based on the language stacks given, and will perform these transformations with no user intervention. All users have to state is the objective - “Move this code from X+Y+Z to A+B+C” - and the transpiler will move it, or explain what it has achieved, what could not be done, and why.

CompilerWorks can even transpile code whose functionality is unavailable in the target SQL dialect.
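
One common case is a target that lacks a MEDIAN aggregate (a sketch with illustrative names; the exact form of PERCENTILE_CONT varies by dialect):

    -- Source dialect (e.g. Oracle, Teradata, Redshift):
    select median(price) from listings;

    -- Emitted for a target without MEDIAN:
    select percentile_cont(0.5) within group (order by price) from listings;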

How does CompilerWorks handle EXEC IMMEDIATE from PL/SQL? Contact Us if you really want to know.

Usability, Performance, and Deliverability
CompilerWorks products are fast. What does that mean?

CompilerWorks' core technology reduces the time from decision to result. Customers pose a challenge or a question, or make a decision: CompilerWorks will deliver the analysis, the answers, the cost reduction, the proof of compliance, or the new code fully up and running on the target platform FAST.

CompilerWorks' core technology changes the game: CompilerWorks understands the entire codebase, and will guide users through the process of analysis or migration, requiring a minimum of human or institutional knowledge and effort.

CompilerWorks will run on the enterprise's code and log files with no minimum requirement for completeness or correctness. The generated report gives line, column, chapter and verse on what the compiler found and how best to act upon it. The product guides users through completing inputs, validating data and translations, and testing analyses, so that migrating an entire codebase between platforms can be handled (almost) mechanically, requiring no specialist skills or prior understanding of the codebase.

All CompilerWorks products are error tolerant, and will ingest and make sense of incomplete, outdated, erroneous and broken code; if a human can make sense of the code, the compilers will make sense of it, and will record the resulting inferences, omissions, deductions and corrections in the automatically generated transpile report. The majority of the manual labor is completed by the compilers; a human is directed to intervene only when necessary.

CompilerWorks' technical performance is excellent: it starts with a custom LR(k) parser front-end, continues through a low-allocation middle-end, and ends with a high-performance, custom computer algebra core. CompilerWorks transpilers will fit on a laptop or in a container with only 4 GB of RAM, but scale efficiently to saturate all the CPUs of a 24-core server.

Why hunt for five or ten complex statements for a Proof-of-Concept when CompilerWorks will translate an entire codebase? CompilerWorks will automatically identify the most complex or incompatible statements, and translate them as accurately as possible, all in under a minute! CompilerWorks changes the migration process: it dramatically reduces the risk and time (and increases the predictability) of migration projects.

CompilerWorks' Supporting Infrastructure
It takes more than core technology alone.

The complexity of data processing in the enterprise means that an enterprise-ready solution requires more than revolutionary core technology.

Supporting infrastructure required for a robust enterprise solution built around CompilerWorks' core technology includes:

  • Handling SQL from multiple sources, including SQL generated by other systems.
  • Standardized capture of data repository metadata.
  • Flexible configuration of the core technology.
  • All styles of user interface to the core technology.

Infrastructure Schematic

Turning technology into solutions.

SQL code does not live in isolation in the enterprise. It is inevitably wrapped in another layer, whether that is simple scripting, business intelligence tools, or ETL/data integration tools.

CompilerWorks' infrastructure is incredibly flexible and is configurable to support:

  • ingesting SQL from whatever ecosystem it is stored within;
  • optionally, applying transformations within the algebraic engine;
  • optionally, producing wrapped SQL for use in the source system;
  • exposing a full complement of user interfaces to both the lineage and transpiler capabilities.
[Figure: infrastructure schematic (technology/infrastructure.svg)]
Ecosystem Integration

Ingest is the process of extracting SQL from the encapsulating code or file format, be it Python, Java, shell, XML, JSON, YAML, query log files, or vendor-specific tool formats. In some cases this requires customer-specific integration work, as there is no robust standard for wrapping SQL for execution in large enterprises. CompilerWorks has a set of standard engines, parsers, and heuristics which handle most ingests without issue.

For example, consider custom SQL embedded in business intelligence (BI) tools: CompilerWorks will ingest custom SQL from a wide range of BI tool file formats, transpile it, and re-insert the converted code into the source BI tool's file format.

If the SQL is embedded in Python scripts that use macros to build the executable SQL, then in most cases the transpiler can be configured to extract, transpile, and replace the SQL in place, in the Python code.

User Interface
How can a user extract the value in the core technology?

CompilerWorks exposes the power and flexibility of its core technology in every conceivable manner. The underlying philosophy is to enable data engineers and data analysts to create value in whatever way suits them.

GUI

Used by analysts, and for “quick hit” data engineering tasks, e.g.:

  • Data discovery and exploration activities.
  • Identification of end-users.
  • Activity audits (and source code) for specific tables.
  • Transpilation of a limited number of SQL statements.
Command Line Interface (CLI)

CompilerWorks has worked with many data engineering departments where the preferred mode of interaction is through a CLI.

Beyond this preference, it is clear that some tasks, like the bulk conversion of an entire SQL codebase, are particularly amenable to a CLI.

Application Programming Interface (API)

Example use cases:

  • Integrating the data fabric with in‑house data warehouse management processes.
  • Automatically re-writing queries to sample 1% of data (increasing performance and reducing operating cost).
  • Restructuring a data warehouse schema for domain driven design.

CompilerWorks' customers continue to create new use cases for the API.