Applying Transformations and Optimizations to Your Code
Automatically rewrite your code

CompilerWorks core technology applies transformations and optimizations to your code. The core technology is extensible: access to the transformation engine is documented so engineers can create their own transformations.

A few examples of standard transformations/optimizations embedded in the core:

  • Business Process Rules
    • Add, rename, or replace columns; change column names or types; add conditions; introduce partitioning or segmentation.
  • Normalize/de-normalize
    • Relational normalization or de-normalization.
  • GROUP BY simplifier
    • Commutes joins and group-by clauses, aiming for a reduction in the size of the group-by and a reduction in resource usage.
  • Cast Coalesce Swap Transformer
    • Transposes cast and coalesce operators to avoid casting constants, which interacts badly with some memory managers.
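As an illustration of the last item, here is a toy sketch of a cast/coalesce transposition. The tuple-based mini-AST and the function are invented for this example (this is not CompilerWorks' engine): a COALESCE over same-typed CASTs is rewritten into one CAST over a COALESCE, so the constant argument is no longer cast on its own.

```python
# Toy cast/coalesce transposition on a tiny invented expression AST.
# Nodes: ("cast", expr, type), ("coalesce", *args), ("col", name), ("const", value).

def transpose_cast_coalesce(node):
    """Rewrite COALESCE(CAST(a AS T), CAST(b AS T), ...) into CAST(COALESCE(a, b, ...) AS T)."""
    if not isinstance(node, tuple):
        return node
    # Rewrite children first (bottom-up).
    node = tuple(transpose_cast_coalesce(child) for child in node)
    if node[0] == "coalesce":
        args = node[1:]
        if all(isinstance(a, tuple) and a[0] == "cast" for a in args):
            target_types = {a[2] for a in args}
            if len(target_types) == 1:  # every CAST targets the same type
                inner = tuple(a[1] for a in args)
                return ("cast", ("coalesce",) + inner, target_types.pop())
    return node

expr = ("coalesce", ("cast", ("col", "x"), "BIGINT"), ("cast", ("const", 0), "BIGINT"))
print(transpose_cast_coalesce(expr))
# -> ('cast', ('coalesce', ('col', 'x'), ('const', 0)), 'BIGINT')
```

A real engine must also verify the transposition is type-safe for the dialect at hand; the sketch only checks that all branches cast to the same type.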
Transformation Example
Automatic query rewriting for sampled data

This page introduces how automatic code rewriting works. It presents an actual customer example: automatically rewriting queries to work with sampled data rather than an entire, internet-scale dataset.

Situation

This internet-scale customer has hundreds of analysts working on terabyte-scale datasets; electricity alone for processing the data costs millions of dollars a year. To support an agile, high-speed development process for new analyses, management and the data infrastructure team posed the challenge:

Can CompilerWorks automatically rewrite queries to work with sampled data, to increase analyst productivity and reduce cost?

The answer is yes. A CompilerWorks service was configured for analysts to submit "full scale" queries; the service automatically rewrites each query before execution to work with sampled data.

This page provides simple, tangible examples of what this service does to queries so that they run against the sampled data rather than the complete dataset:

In all examples the underlying data was sampled (1% of the data) and written into a new set of sampled database tables for the rewritten queries to execute on. A simple renaming convention is used in the examples presented: <TABLE_NAME> is replaced with <TABLE_NAME>_with_sampling. Both the source and target languages are Hive in this example.

DDL Examples
Trivial in this case

Rewriting DDL is mostly trivial. The structure of tables is unchanged - only the names are changed. Simple functions like COUNT need to be scaled. (Keep reading - it does get more complicated!)
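The x100 scaling mentioned above can be sanity-checked on a toy dataset. In the sketch below, sqlite3 stands in for Hive and a deterministic keep-every-100th-row filter stands in for the customer's sampler (both are assumptions for illustration); on this balanced data the scaled sample counts reproduce the full-table counts exactly.

```python
import sqlite3

# sqlite3 stands in for Hive; a deterministic 1-in-100 filter stands in for
# the real sampler (illustration only).
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE all_users (userid INTEGER, country TEXT, locale TEXT)")
con.executemany(
    "INSERT INTO all_users VALUES (?, ?, ?)",
    [(i, "US" if (i // 100) % 2 else "GB", "en") for i in range(10_000)],
)
# Build the sampled table: keep every 100th row (a 1% sample).
con.execute(
    "CREATE TABLE all_users_with_sampling AS "
    "SELECT * FROM all_users WHERE userid % 100 = 0"
)

full = con.execute(
    "SELECT country, count(*) FROM all_users GROUP BY country"
).fetchall()
# Rewritten aggregate: COUNT(*) scaled by the inverse sampling rate.
scaled = con.execute(
    "SELECT country, CAST(count(*) * 100.0 AS INTEGER) "
    "FROM all_users_with_sampling GROUP BY 1"
).fetchall()
print(dict(full))
print(dict(scaled))
```

On real data the sampler is randomized and the scaled counts are estimates, not exact values.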

Simple DDL can be transpiled with a simple text substitution.

Source SQL:

    CREATE TABLE IF NOT EXISTS
    by_city
    (
        c BIGINT,
        country STRING,
        locale STRING
    )
    PARTITIONED BY (
        dts STRING
    )

Transformed SQL (for sampled data):

    CREATE TABLE IF NOT EXISTS
    by_city_with_sampling
    (
        c BIGINT,
        country STRING,
        locale STRING
    )
    PARTITIONED BY (
        dts STRING
    )

The transpiler adheres to the configured coding conventions: in this case fully qualified object names are generated.

Source SQL:

    INSERT OVERWRITE TABLE
    by_city
    PARTITION (dts = '<DATEID>')
    SELECT
        count(*) c,
        country,
        locale
    FROM all_users
    WHERE dts = '<DATEID>'
    GROUP BY country, locale

Transformed SQL (for sampled data):

    INSERT OVERWRITE TABLE
    by_city_with_sampling
    PARTITION (dts = '<DATEID>')
    SELECT
        CAST(count(*) * 100.0000 as BIGINT) c,
        all_users_with_sampling.country,
        all_users_with_sampling.locale
    FROM all_users_with_sampling
    WHERE all_users_with_sampling.dts = '<DATEID>'
    GROUP BY 2, 3
DML Examples
Taking into account the sampling

Sampling gets more complicated with DML. What do you do with an OUTER or an INNER query? How do you handle aliasing? Simple AST- and text-based approaches will fail. CompilerWorks gets it right.
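The OUTER-versus-INNER decision in the first example below can be checked on a toy dataset, under the assumption (for illustration) that the sample is drawn by user id, so a user's rows are either all kept or all dropped. Per-user counts computed inside q1 are then exact on the sample, while the outer count of users is low by the sampling rate and needs scaling. sqlite3 again stands in for Hive:

```python
import sqlite3

# Assumption for illustration: the 1% sample is drawn by user id, so each
# user's rows are all kept or all dropped; sqlite3 stands in for Hive.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE all_users (userid INTEGER, country TEXT)")
rows = [
    (u, "US" if (u // 100) % 2 else "GB")
    for u in range(10_000)
    for _ in range(3)  # three rows (events) per user
]
con.executemany("INSERT INTO all_users VALUES (?, ?)", rows)
con.execute(
    "CREATE TABLE all_users_with_sampling AS "
    "SELECT * FROM all_users WHERE userid % 100 = 0"  # 1% of users, whole users
)

full = con.execute(
    "WITH q1 AS (SELECT userid, country, count(*) kount "
    "FROM all_users GROUP BY userid, country) "
    "SELECT country, count(*) FROM q1 GROUP BY country"
).fetchall()
# Rewritten query: the INNER count stays as-is; the OUTER count is scaled.
scaled = con.execute(
    "WITH q1 AS (SELECT userid, country, count(*) kount "
    "FROM all_users_with_sampling GROUP BY 1, 2) "
    "SELECT q1.country, CAST(count(*) * 100.0 AS INTEGER) FROM q1 GROUP BY 1"
).fetchall()
print(dict(full), dict(scaled))
```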

The transpiler knows to scale the OUTER query, but not the INNER one.

Source SQL:

    WITH
    q1 AS (
        SELECT
            userid,
            country,
            count(*) as kount
        FROM all_users
        WHERE dts = '<DATEID>'
        GROUP BY userid, country
    )
    SELECT
        q1.country,
        count(*)
    FROM q1
    GROUP BY q1.country

Transformed SQL (for sampled data):

    WITH
    q1 AS (
        SELECT
            all_users_with_sampling.userid,
            all_users_with_sampling.country,
            count(*) kount
        FROM all_users_with_sampling
        WHERE all_users_with_sampling.dts = '<DATEID>'
        GROUP BY 1, 2
    )
    SELECT
        q1.country,
        CAST(count(*) * 100.0000 as BIGINT)
    FROM q1
    GROUP BY 1
The transpiler knows to scale the INNER query, but not the OUTER one this time.

Source SQL:

    SELECT
        country,
        SUM(c) as d
    FROM (
        SELECT
            city,
            country,
            count(*) AS c
        FROM all_users
        WHERE dts = '<DATEID>'
        GROUP BY city, country
        HAVING count(*) > 50
    ) by_city
    WHERE dts = '<DATEID>'
    GROUP BY country

Transformed SQL (for sampled data):

    SELECT
        by_city.country,
        sum(by_city.c) d
    FROM (
        SELECT
            all_users_with_sampling.city,
            all_users_with_sampling.country,
            CAST(count(*) * 100.0000 as BIGINT) c
        FROM all_users_with_sampling
        WHERE all_users_with_sampling.dts = '<DATEID>'
        GROUP BY all_users_with_sampling.city, 2
        HAVING CAST(count(*) * 100.0000 as BIGINT) > 50
    ) by_city
    WHERE (by_city.dts = '<DATEID>')
    GROUP BY 1
What if a sampled query is aliased? The transpiler knows what to do.

Source SQL:

    SELECT
        count(*),
        q2.aliased_aliased_id
    FROM (
        SELECT
            q1.country,
            q1.aliased_id aliased_aliased_id
        FROM (
            SELECT
                country,
                userid aliased_id
            FROM all_users
            WHERE dts = '<DATEID>'
        ) q1
    ) q2
    GROUP BY q2.aliased_aliased_id

Transformed SQL (for sampled data):

    SELECT
        count(*),
        q2.aliased_aliased_id
    FROM (
        SELECT
            q1.country,
            q1.aliased_id aliased_aliased_id
        FROM (
            SELECT
                all_users_with_sampling.country,
                all_users_with_sampling.userid aliased_id
            FROM all_users_with_sampling
            WHERE all_users_with_sampling.dts = '<DATEID>'
        ) q1
    ) q2
    GROUP BY 2
The transpiler handles joins correctly. Scaling depends on the type of join. The tables in the join need to have the same sampling scheme and sampling function, but need not have the same sampling rate.

Source SQL:

    WITH
    a AS (
        SELECT
            count(*) kount_a,
            country
        FROM all_users
        WHERE dts = '<DATEID>'
        GROUP BY country
    ),
    b AS (
        SELECT
            userid,
            country,
            count(*) kount_b
        FROM all_users_copy
        WHERE dts = '<DATEID>'
        GROUP BY 1, 2
    )
    SELECT
        sum(a.kount_a),
        sum(b.kount_b),
        country
    FROM a
    JOIN b ON (a.country = b.country)
    GROUP BY country

Transformed SQL (for sampled data):

    WITH
    a AS (
        SELECT
            CAST(count(*) * 100.0000 as BIGINT) kount_a,
            all_users_with_sampling.country
        FROM all_users_with_sampling
        WHERE all_users_with_sampling.dts = '<DATEID>'
        GROUP BY 2
    ),
    b AS (
        SELECT
            all_users_copy_with_sampling.userid,
            all_users_copy_with_sampling.country,
            count(*) kount_b
        FROM all_users_copy_with_sampling
        WHERE all_users_copy_with_sampling.dts = '<DATEID>'
        GROUP BY 1, 2
    )
    SELECT
        CAST(sum(a.kount_a) * 100.0000 as BIGINT),
        CAST(sum(b.kount_b) * 100.0000 as BIGINT),
        a.country
    FROM a
    JOIN b ON a.country = b.country
    GROUP BY a.country
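The "same sampling scheme and sampling function" requirement can be seen with a small sketch. The two modular samplers below are hypothetical stand-ins: when both tables are sampled with the same function of the join key, every sampled key survives on both sides; with different functions the matches no longer line up.

```python
users = list(range(10_000))

# Hypothetical deterministic samplers (stand-ins for a real sampling function).
same_fn = lambda u: u % 100 == 0               # 1% sampler shared by both tables
other_fn = lambda u: (u * 37 + 11) % 100 == 0  # a different 1% sampler

# Both tables sampled with the same function of the join key:
a_same = {u for u in users if same_fn(u)}
b_same = {u for u in users if same_fn(u)}
# Tables sampled with different functions:
a_mix = {u for u in users if same_fn(u)}
b_mix = {u for u in users if other_fn(u)}

print(len(a_same & b_same))  # every sampled key joins on both sides
print(len(a_mix & b_mix))    # the sampled keys no longer intersect
```

Both samplers keep exactly 1% of the keys, yet their intersection is empty here, which is why a shared sampling function matters more than a shared rate.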
Customer Query
Closer to the real world

Below is an anonymized customer query that was automatically rewritten to work with sampled data. It not only incorporates the ideas presented above; the rewrite also optimized a JOIN.

Source SQL:

    INSERT OVERWRITE TABLE dedup_summary
    PARTITION (ds='<DATEID>')
    SELECT
        a.cnt2/b.cnt1 as miss_rate,
        a.sum2/b.sum1 as miss_usd_rate,
        a.cnt2,
        a.sum2 as miss_usd
    FROM (
        SELECT
            count(*) as cnt2,
            sum(dd_cent) as sum2,
            ds
        FROM dedup
        WHERE ds = '<DATEID>'
        GROUP BY ds
    ) a
    INNER JOIN (
        SELECT
            count(*) as cnt1,
            sum(bgt_ant) as sum1,
            ds
        FROM imps_ant
        WHERE ds = '<DATEID>'
            AND illegals_bitmap = '00000000000000000000000000000000'
        GROUP BY ds
    ) b
    ON a.ds = b.ds

Transformed SQL (for sampled data):

    INSERT OVERWRITE TABLE dedup_summary_with_sampling
    PARTITION (ds = '<DATEID>')
    SELECT
        a.cnt2 / b.cnt1 miss_rate,
        a.sum2 / b.sum1 miss_usd_rate,
        a.cnt2,
        a.sum2 miss_usd
    FROM (
        SELECT
            CAST(count(*) * 100.0000 as BIGINT) cnt2,
            sum(dedup_with_sampling.dd_cent) * 100.0000 sum2,
            dedup_with_sampling.ds
        FROM dedup_with_sampling
        WHERE dedup_with_sampling.ds = '<DATEID>'
        GROUP BY 3
    ) a
    JOIN (
        SELECT
            CAST(count(*) * 100.0000 as BIGINT) cnt1,
            sum(imps_ant_with_sampling.bgt_ant) * 100.0000 sum1,
            imps_ant_with_sampling.ds
        FROM imps_ant_with_sampling
        WHERE imps_ant_with_sampling.ds = '<DATEID>'
            AND imps_ant_with_sampling.illegals_bitmap = '00000000000000000000000000000000'
        GROUP BY 3
    ) b
    ON a.ds = b.ds