Applying Transformations and Optimizations to Your Code
Automatically rewrite your code
CompilerWorks core technology applies transformations and optimizations to your code. The core technology is extensible: access to the transformation engine is documented so engineers can create their own transformations.
A few examples of standard transformations/optimizations embedded in the core:
- Business Process Rules
- Add, rename, replace or change names or types of columns, adds conditions, introduces partitioning or segmentation.
- Normalize/de-normalize
- Relational normalization or de normalization.
- GROUP BY simplifier
- Commutes joins and group-by clauses, aiming for a reduction in the size of the group-by and a reduction in resource usage.
- Cast Coalesce Swap Transformer
- Transposes cast and coalesce operators to avoid casting constants, which interacts badly with some memory managers.
Transformation Example
Automatic query rewriting for sampled data
This page introduces the concept of how automatic code rewriting works. It presents an actual customer example – automatically rewriting queries to work with sampled data rather than an entire, internet scale dataset.
Situation
The internet scale customer has hundreds of analysts working on terabyte scale datasets. The cost for electricity alone when processing the data is millions of dollars a year. To support an agile, high speed development process for new analyses management and the data infrastructure team posed the challenge:
Can Compilerworks automatically rewrite queries to work with sampled data to increase analyst productivity and reduce cost?
The answer is yes. A CompilerWorks service was configured for analysts to submit “full scale” queries and the service automatically rewrites them before execution to work with sampled data.
This page provides tangible and simple examples of what this service does the the queries to work with the sampled data, rather than the complete dataset:
- DDL rewriting (data description language)
- DML rewriting (data manipulation language)
- Customer Query (anonymized)
In all examples the underlying data was sampled (1% of the data) and written into a new set of sampled database tables for the rewritten queries to execute on. A simple renaming convention is used in the examples presented where <TABLE_NAME> is replaced with <TABLE_NAME>_with_sampling. Both the source and target languages are Hive in this example.
DDL Examples
Trivial in this case
Rewriting DDL is mostly trivial. The structure of tables is unchanged – only the names are changed. Simple functions like COUNT need to be scaled. (Keep reading – it does get more complicated!)
Explanation | Source SQL | Transformed SQL (for sampled data) |
---|---|---|
Simple DDL can be transpiled with a simple text substitution. |
CREATE TABLE IF NOT EXISTS
c
BIGINT
,
|
CREATE TABLE IF NOT EXISTS
by_city_with_sampling
|
The transpiler adheres to the configured coding conventions: in this case fully qualified object names are generated. |
INSERT OVERWRITE TABLE
|
INSERT OVERWRITE TABLE
by_city_with_sampling
|
DML Examples
Taking into account the sampling
Dealing with sampling gets more complicated when dealing with DML. What do you do with an OUTER or an INNER query? How do you handle aliasing? Simple AST and text based approaches will fail. CompilerWorks gets it right.
Explanation | Source SQL | Transformed SQL (for sampled data) |
---|---|---|
The Transpiler knows to scale the OUTER query, but not the INNER one. |
WITH
|
WITH
|
The Transpiler knows to scale the INNER query, but not the OUTER one this time. |
SELECT
|
SELECT
|
What if a sampled query is aliased? The transpiler knows what to do. |
SELECT
|
SELECT
|
The the transpiler handles joins correctly. Scaling depends on the type of join. The tables in the join need to have the same sampling scheme and sampling function but need not have the same sampling rate. |
WITH
|
WITH
|
Customer Query
Closer to the real world
Below is an anonymized query that a customer used to automatically rewrite to work with sampled data. The query not only incorporates the ideas presented above but it also optimized a JOIN.
Source SQL | Transformed SQL (for sampled data) |
---|---|
INSERT OVERWRITE TABLE
dedup_summary
|
INSERT OVERWRITE TABLE
|