Revolutionary application of compilers.
CompilerWorks’ core technology comprises a suite of bespoke
compilers, a common algebraic representation, and emitters for
both a lineage fabric and specific SQL dialects.
SQL code is the fundamental input in the process. The compilers
convert the SQL code to an algebraic representation (AR) which
preserves a great deal of metadata; for each individual instruction,
the AR captures origin, location, pipeline, responsible users
and additional user‑specified metadata. This metadata is available
to the emitters, both for the lineage fabric and for the emitted SQL code.
There are two classes of emitters: one produces the lineage fabric,
the foundation of the lineage solution; the other is a set of
emitters, one for each supported SQL dialect.
CompilerWorks Builds Real Compilers
Parsing isn't enough. Semantics are required.
There are many technology implementations that can parse and emit simple SQL.
Only CompilerWorks can:
- Typecheck it, and identify whether a particular division operator
means integer division, floating-point division, or division of an
interval by a numeric.
- Compile this information into a full computer algebra, and solve
for the most efficient (and human-friendly) way to
express the same instruction in a target dialect that has different
operators, functions and semantics.
The bottom line: CompilerWorks is aware of, and cares about, semantics, not just syntax.
Consider the query
1 / 2;, that is, divide one by two.
The result of this simple query depends on the SQL dialect:
- In dialects that use integer arithmetic, the answer is 0.
- In other dialects, the operands are promoted and the result is 0.5, of type double.
- At least one dialect provides a special integer-division operator, distinct from ordinary division.
- Another dialect carries additional information with the type,
including implicit formatting and presentation information.
This case demonstrates the hidden complexity of changing SQL dialects.
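For instance, the divergence is easy to reproduce with SQLite, whose `/` performs integer division on integer operands, against Python's own arithmetic, which promotes to floating point. A minimal sketch:

```python
# The same expression "1 / 2" evaluated under two sets of semantics.
# SQLite, like several SQL dialects, performs integer division when
# both operands are integers; Python 3 performs true division.
import sqlite3

conn = sqlite3.connect(":memory:")
sql_result = conn.execute("SELECT 1 / 2").fetchone()[0]
py_result = 1 / 2

print(sql_result)  # 0   (integer division)
print(py_result)   # 0.5 (true division)
```

A transpiler that merely copies the text `1 / 2` between these two worlds silently changes the answer.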
It is often true that identically named functions do not have identical behavior.
For example, does a
regex_match function match the needle-regex against
the entire haystack, or merely find it somewhere within the haystack? That is
to say, does it have find semantics or
MATCHES semantics? Does indexing into an array
return null or throw an exception if the index is out of bounds? Do
negative indices index from the end, return null, or throw exceptions?
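These distinctions are easy to demonstrate outside SQL. Python's re module offers both anchored and find-style matching, and its list indexing illustrates yet another set of out-of-bounds conventions:

```python
import re

needle, haystack = "bc", "abcd"

# "MATCHES" semantics: the pattern must cover the whole string;
# "find" semantics locate the pattern anywhere within it.
print(bool(re.fullmatch(needle, haystack)))  # False - whole-string match fails
print(bool(re.search(needle, haystack)))     # True  - found within the string

# Out-of-bounds index: Python raises; many SQL dialects return NULL instead.
xs = [10, 20, 30]
try:
    xs[5]
except IndexError:
    print("exception")

# Negative indices: Python counts from the end; some dialects reject them.
print(xs[-1])  # 30
```

Two engines can accept the identical call and disagree on every one of these points, which is exactly why name-level translation is not enough.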
“But I've heard so much about parsing!”
Parsing is easy to talk about - it's largely a solved problem.
Most importantly, it's a tiny part of the job, and not
sufficient to solve the problem of translating code. Parsing and
compiling with complete accuracy is not easy - it is a tough
engineering problem which CompilerWorks has solved.
Each CompilerWorks compiler and emitter is hand-written,
and supports the full definition of the language, including syntax,
keywords, operators, functions, types, explicit, implicit and
assignment casts, and all the other special cases which the vendor
has grown into their dialect and product over time. CompilerWorks even captures
illegal or erroneous behavior we have discovered in our customers' code.
CompilerWorks has learnt NOT to trust what is written in the vendors' manuals.
The result: CompilerWorks parses, typechecks, performs operator and
function selection, and compiles all languages into the core computer
algebra. The SQL standard is of little help here; it gives
false confidence that SQL copied from one platform to another will run and
behave as expected. Running on the new platform is NOT validation
of correct processing/analysis: the risk of data-dependent errors is very high.
Semantic correctness is an absolute requirement if users want correct behavior.
Accuracy and Correctness: Lineage Version
More accurate than any alternative.
What does accurate column-level analysis mean? Consider:
create table t1 (c int primary key);
create table t2 (c int primary key);
insert into t1 values (1), (2);
insert into t2 values (2);
create table out as select t1.c as t1_c, t2.c as t2_c from t1 left outer join t2 using (c);
select * from out;
t1_c | t2_c
   1 |
   2 |    2
Did the join against t2 affect the values in column t1_c?
The join was an outer join, so it cannot delete a row from
t1, and it was a join on a unique key, so it cannot
increase the cardinality of a row of t1.
So the answer is no, it did not.
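The example above can be reproduced verbatim in SQLite, confirming that every row of t1 survives the outer join unchanged:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE t1 (c INT PRIMARY KEY);
    CREATE TABLE t2 (c INT PRIMARY KEY);
    INSERT INTO t1 VALUES (1), (2);
    INSERT INTO t2 VALUES (2);
    CREATE TABLE out AS
        SELECT t1.c AS t1_c, t2.c AS t2_c
        FROM t1 LEFT OUTER JOIN t2 USING (c);
""")
rows = conn.execute("SELECT t1_c, t2_c FROM out ORDER BY t1_c").fetchall()
print(rows)  # [(1, None), (2, 2)]
```

Both rows of t1 appear in out, and the unique key prevents duplication, so t2 contributed no data flow into t1_c.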
It is even more complex than that: consider
create table out as select substr(text, start, length) from src;.
If src.text contains personally identifiable information (PII), then
out.substr contains PII. But if
src.start contains PII,
out.substr (under most rules) does not contain PII, because
src.start is an index, not a data value.
Naturally, this is configurable for particular customer-defined categories,
environments and specific analyses: an ultra-secure environment might
consider that a filter-effect from a
WHERE clause or join condition leaks
some information, despite no data being actually copied from the source
column; whereas a simpler or less restrictively defined environment might
only consider
EXACT DATA-TO or
PARTIAL DATA-TO effects
to transfer protected information. Whatever the enterprise's requirements, it is
very easy to configure them in CompilerWorks, and the underlying algebra
and graph engines will compute the result correctly, every time.
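As a toy illustration only (this is not CompilerWorks' configuration format; the edge kinds, column names, and policy names below are invented for the sketch), a policy-driven propagation rule over classified lineage edges might look like:

```python
# Invented lineage edges, each classified by the kind of effect the
# source column has on the destination column.
EDGES = [
    ("src.text",  "out.substr", "EXACT_DATA_TO"),  # value copied through substr
    ("src.start", "out.substr", "POSITION_TO"),    # used only as an index
    ("t2.c",      "out.t1_c",   "FILTER_TO"),      # used only in a join predicate
]

def propagates_pii(kind: str, policy: str) -> bool:
    # A strict policy treats any influence (even a filter) as a leak;
    # a lenient policy tracks only edges that carry actual data values.
    if policy == "strict":
        return True
    return kind in ("EXACT_DATA_TO", "PARTIAL_DATA_TO")

for src, dst, kind in EDGES:
    print(f"{src} -> {dst}: {propagates_pii(kind, 'lenient')}")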
Accuracy and Correctness: Transpiler Version
How accurate can CompilerWorks be? How much accuracy is required?
Here's a simple example that demonstrates what accuracy means in a transpilation:
12345 + 23456. In some dialects, since 12345 and
23456 are small constants, they are typed as 16-bit integers,
but addition of 16-bit integers produces a 32-bit result, so this compiles to
(add_i32 (cast_i16i32 (const_i16 12345)) (cast_i16i32
(const_i16 23456))). Addition of 16-bit integers promotes to 32 bits
before, rather than after, the addition, so the sum does not overflow.
In other dialects, 12345 is a 32-bit integer
constant, so let's demote it to 16 bits and consider
12345::smallint + 23456::smallint. Addition of 16-bit values
remains in 16-bit mode, and in this case causes an
overflow, so the result is
ERROR: smallint out of range.
In yet other dialects, 12345 is a 64-bit integer
constant, and the platform does not check for overflow, so addition will
not throw an exception but will perform 2's-complement wrap-around.
The obvious implication is that integer addition varies across platforms. The
question arises: do we emulate a dialect with true 8-bit addition of
v0 + v1, that is to say, compile
(add_i8 v0 v1) as
(bitand_i32 255 (add_i32 (cast_i8i32 v0) (cast_i8i32 v1))), which will
emit as something like
((v0 + v1) & 255), or do we ignore
the 8-bit-ness of the addition and simply emit
v0 + v1?
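The wrapping variant of that rewrite can be sketched directly; this Python function behaves like the (bitand_i32 255 ...) form above for non-negative 8-bit values:

```python
# Emulating true 8-bit (wrapping) addition on a platform whose
# native + is wider and does not wrap at 8 bits.
def add_i8_emulated(v0: int, v1: int) -> int:
    # Equivalent to (bitand_i32 255 (add_i32 (cast_i8i32 v0) (cast_i8i32 v1))):
    # add in a wide type, then mask back down to 8 bits (wrap modulo 256).
    return (v0 + v1) & 255

print(add_i8_emulated(200, 100))  # 44: 300 wraps modulo 256
print(add_i8_emulated(1, 2))      # 3: unchanged when there is no overflow
```

The masked form is exactly right but uglier; emitting plain v0 + v1 is cleaner but silently changes behavior on overflow, which is the trade-off the text describes.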
CompilerWorks enables both options: customers inevitably say,
“We want it to be accurate enough that our decision support
system gives the right answer.” The default mode in CompilerWorks is
itself built from a rather interesting set of choices: under what
circumstances should CompilerWorks preserve ultimate accuracy, and under what
circumstances should CompilerWorks simplify? The computer-algebra engine has the
ability to compute this, along with many other decisions designed to
give customers the code they need: clean, maintainable, understandable code
which gives the right answer for the intended use case.
Computer Algebra with Serious Power
What Questions Need to be Answered?
The computer algebra engine at the core of the CompilerWorks product
suite is incredibly powerful and drives both the transpiler and the
lineage fabric. It is central to the ability to answer
complex questions about the enterprise's data environment, perform analyses,
or emulate features unsupported in a given target dialect. It is
powerful enough to apply business logic rules and
structural transformations; users can even write and install custom
plugins for special-purpose tasks such as generating sampling queries.
This fine-grained power enables CompilerWorks to answer almost limitless
questions about the enterprise’s data infrastructure. Users can identify
why two departments' sales figures differ by comparing the entire path
from source to result as composed functions, or find opportunities to
reduce workload by sharing identical computations. A query can be posed
to the engine, and the core technology will expose the results, either
through an API or in a rich user interface.
Consider a migration to a platform which doesn't support a procedural language (PL).
CompilerWorks' transpiler has no pair-wise dependencies; support for a
particular language enables transpilation from that language to any
other supported language.
CompilerWorks' computer algebra represents the superset, not the
common subset, of all possible semantics. When CompilerWorks
emits to a target dialect, the completeness of the result is
constrained only by the expressiveness of the target language.
If there is a way to express the required semantics within the
target dialect then our algebra engine will do it.
A CompilerWorks target environment is not a single language; it is a
stack of languages. For instance, a source environment may be shell
scripts, which invoke a command-line database client, which in turn
executes SQL and SPL. A suitable target environment might be a
control script which executes SQL, calling out to another language for
custom UDFs. The computer algebra engine makes no account of which
language in the source-language stack generated a particular algebraic
term; expressing the desired program in the target language stack is a
separate, standalone challenge. A dialect which does not support
WITH RECURSIVE may use a loop outside the SQL to express the required
semantics, while the transpiler may push procedural control structures
down into the SQL itself.
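The WITH RECURSIVE case can be illustrated with SQLite, which supports recursive CTEs, alongside the equivalent loop a transpiler might emit outside the SQL for a dialect that lacks them (the edge data here is invented for the sketch):

```python
# The same reachability fixpoint computed two ways: inside SQL with
# WITH RECURSIVE, and with a plain loop outside the SQL where each
# pass is an ordinary non-recursive query.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE edges (src INT, dst INT);
    INSERT INTO edges VALUES (1, 2), (2, 3), (3, 4);
""")

# Variant 1: recursive CTE (requires WITH RECURSIVE support).
cte = conn.execute("""
    WITH RECURSIVE reach(n) AS (
        SELECT 1
        UNION
        SELECT e.dst FROM edges e JOIN reach r ON e.src = r.n
    )
    SELECT n FROM reach ORDER BY n
""").fetchall()

# Variant 2: the loop lives in the control script; iterate to a fixpoint.
reach = {1}
while True:
    placeholders = ",".join("?" * len(reach))
    step = {d for (d,) in conn.execute(
        f"SELECT dst FROM edges WHERE src IN ({placeholders})", tuple(reach))}
    if step <= reach:
        break
    reach |= step

print([n for (n,) in cte])  # [1, 2, 3, 4]
print(sorted(reach))        # [1, 2, 3, 4]
```

Both variants compute the same set; which one a transpiler emits depends on what the target language stack can express.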
The computer algebra engine automatically configures itself with
plugins based on the language-stacks given, and will perform these
transformations with no user intervention. All users have to state is
the objective: “Move this code from X+Y+Z to A+B+C,”
and the transpiler will move it, or explain what it has achieved,
what could not be done, and why.
CompilerWorks can even transpile functionality that is unavailable in the target SQL dialect.
How does CompilerWorks handle
EXEC IMMEDIATE from PL/SQL?
Contact us if you really want to know.
Usability, Performance, and Deliverability
CompilerWorks products are fast. What does that mean?
CompilerWorks' core technology reduces the time from decision to result.
Customers pose a challenge, ask a question, or make a decision; CompilerWorks
will deliver the analysis, the answers, the cost reduction, the proof of
compliance, or the new code fully up and running on the target platform.
CompilerWorks' core technology changes the game:
CompilerWorks understands the entire code base, and will guide users
through the process of analysis or migration, requiring a minimum of
human or institutional knowledge and effort.
CompilerWorks will run on the enterprise's code and log files with no minimum
requirement for completeness or correctness. The generated output
report will contain line, column, chapter and verse on what the compiler found
and how best to act upon it. The product will guide users through
ensuring complete inputs, validating data, translating code, and testing
analyses, so that the process of migrating an entire code base between
platforms can be handled (almost) mechanically, requiring no specialist
skills or understanding of the code base.
All CompilerWorks products are error tolerant, and will ingest and
make sense of incomplete, outdated, erroneous and broken code; if
a human can make sense of the code, then our compilers will make sense of it,
and will infer the consequences, omissions, deductions, intuitions and corrections
to be included in the automatically generated transpile report. The majority of the
manual labor is completed by our compilers; a human is directed where to
intervene only when necessary.
CompilerWorks' technical performance is excellent: it starts with a
custom LR(k) parser front-end, continues through a low-allocation
middle-end, and ends with a high-performance, custom computer algebra
core. CompilerWorks transpilers will fit on a laptop or in a container
with only 4 GB of RAM, yet scale efficiently to saturate all CPUs on a
24-core server.
- A laptop will do a multi-thousand script code migration in under a minute.
- A server-grade system can analyze and process code at over 250,000 statements a second,
delivering the ability to maintain an updated lineage model for an
entire internet-scale organization.
Why hunt for five or ten complex statements for a Proof-of-Concept when
CompilerWorks will translate an entire code base? CompilerWorks will
automatically identify the most complex or incompatible statements, and
translate them as accurately as possible, all in under a minute! CompilerWorks
changes the migration process: it dramatically reduces the risk and time (and increases the
predictability) of migration projects.
Turning technology into solutions.
SQL code does not live in isolation in the enterprise. It is inevitably
wrapped in another language, whether that be simply scripting,
business intelligence tools or ETL/data integration tools.
CompilerWorks' infrastructure is incredibly flexible and is
configurable to support:
- ingesting SQL from whatever ecosystem it is stored within;
- applying transformations within the algebraic engine;
- producing wrapped SQL for use with the target system;
- exposing a full complement of user interfaces to both the lineage and transpiler capabilities.
Ingest is the process of extracting SQL from the encapsulating code or file
format, be it python, Java, shell, XML, JSON, YAML, query log files, or
vendor-specific tool formats. In some cases this requires customer-specific
integration work, as there is no robust standard for wrapping SQL for
execution in large enterprises. CompilerWorks has a set of standard engines,
parsers, and heuristics which handle most ingests without issue.
For example, consider custom SQL embedded in business intelligence (BI) tools.
CompilerWorks will ingest custom SQL from
a wide range of BI tool file formats, transpile it, and
re-insert the converted code into the source BI tool's file format.
If the SQL is embedded in scripts that use macros
to build the executable SQL, then in most cases the transpiler can
be configured to extract, transpile, and replace the SQL
in place, within the original scripts.
How can a user extract the value of the core technology?
CompilerWorks exposes the power and flexibility of its core technology in
every conceivable manner. The underlying philosophy is to enable data
engineering and data analysts to create value in whatever way suits them.
User Interface (UI)
Used by analysts and for “quick hit” data engineering tasks, e.g.:
- Data discovery and exploration activities.
- Identification of end-users.
- Activity audits (and source code) for specific tables.
- Transpile of a limited number of SQL statements.
Command Line Interface (CLI)
CompilerWorks has worked with many data engineering departments
where the preferred mode of interaction is through a CLI.
Beyond this preference, it is clear that some tasks, like
bulk conversion of an entire SQL code base, are well suited to a CLI.
Application Programming Interface (API)
Example use cases:
- Integrating the data fabric with in‑house data warehouse management processes.
- Automatically re-writing queries to sample 1% of data (increasing performance and reducing operating cost).
- Restructuring a data warehouse schema for domain driven design.
CompilerWorks' customers continue to create new use cases for the API.
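One of the API use cases above, automatic 1% sampling, can be sketched at the query-rewriting level. The function name and the rewrite below are illustrative, not the CompilerWorks API:

```python
# Wrapping a table scan so it reads roughly 1% of rows, demonstrated
# against SQLite. A production rewriter would use a platform-native
# TABLESAMPLE clause where one is available.
import sqlite3

def sample_one_percent(table: str) -> str:
    # abs(random()) % 100 < 1 keeps approximately 1% of rows.
    return f"SELECT * FROM {table} WHERE abs(random()) % 100 < 1"

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (amount INT)")
conn.executemany("INSERT INTO sales VALUES (?)",
                 [(i,) for i in range(100000)])
rows = conn.execute(sample_one_percent("sales")).fetchall()
print(f"kept {len(rows)} of 100000 rows")  # roughly 1000
```

A string-level rewrite like this is fragile; doing it through an algebraic representation lets the sampling predicate be placed correctly even inside joins, subqueries, and CTEs.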