Applying Transformations and Optimizations to Your Code

Automatically rewrite your code

CompilerWorks core technology applies transformations and optimizations to your code. The core technology is extensible: access to the transformation engine is documented so engineers can create their own transformations.

A few examples of standard transformations/optimizations embedded in the core:

  • Business Process Rules
    • Adds, renames, or replaces columns; changes column names or types; adds conditions; introduces partitioning or segmentation.
  • Normalize/de-normalize
    • Relational normalization or de-normalization.
  • GROUP BY simplifier
    • Commutes joins and group-by clauses, aiming for a reduction in the size of the group-by and a reduction in resource usage.
  • Cast Coalesce Swap Transformer
    • Transposes cast and coalesce operators to avoid casting constants, which interacts badly with some memory managers.

Transformation Example

Automatic query rewriting for sampled data

This page introduces how automatic code rewriting works. It presents an actual customer example – automatically rewriting queries to work with sampled data rather than an entire, internet-scale dataset.

Situation

The internet-scale customer has hundreds of analysts working on terabyte-scale datasets; the electricity cost alone for processing the data runs to millions of dollars a year. To support an agile, high-speed development process for new analyses, management and the data infrastructure team posed the challenge:

Can CompilerWorks automatically rewrite queries to work with sampled data to increase analyst productivity and reduce cost?

The answer is yes. A CompilerWorks service was configured to let analysts submit “full scale” queries; the service automatically rewrites them to work with sampled data before execution.

This page provides simple, tangible examples of what this service does to the queries so that they work with the sampled data rather than the complete dataset:

In all examples the underlying data was sampled (1% of the data) and written into a new set of sampled database tables for the rewritten queries to execute on. A simple renaming convention is used in the examples presented: <TABLE_NAME> is replaced with <TABLE_NAME>_with_sampling. Both the source and target languages are Hive in these examples.
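The factor of 100.0000 that appears throughout the rewritten queries below is simply the reciprocal of the 1% sampling rate. A minimal sketch of that arithmetic (the function names are illustrative only, not part of the CompilerWorks API):

```python
def scale_factor(sampling_rate: float) -> float:
    """Reciprocal of the sampling rate: the factor by which counts
    taken over the sampled table must be multiplied to estimate
    counts over the full dataset."""
    if not 0.0 < sampling_rate <= 1.0:
        raise ValueError("sampling rate must be in (0, 1]")
    return 1.0 / sampling_rate

def estimate_full_count(sampled_count: int, sampling_rate: float) -> int:
    """Scale a COUNT(*) over sampled data back to a full-dataset
    estimate, as the rewritten queries do with
    CAST(count(*) * 100.0000 AS BIGINT)."""
    return int(sampled_count * scale_factor(sampling_rate))

print(scale_factor(0.01))              # 100.0 -- the 100.0000 in the examples
print(estimate_full_count(437, 0.01))  # 43700
```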


DDL Examples

Trivial in this case

Rewriting DDL is mostly trivial. The structure of tables is unchanged – only the names are changed. Simple functions like COUNT need to be scaled. (Keep reading – it does get more complicated!)

Explanation: Simple DDL can be transpiled with a simple text substitution.

Source SQL:

CREATE TABLE IF NOT EXISTS by_city
(
  c  BIGINT,
  country  STRING,
  locale  STRING
)
PARTITIONED BY (
  dts  STRING
)

Transformed SQL (for sampled data):

CREATE TABLE IF NOT EXISTS by_city_with_sampling
(
  c  BIGINT,
  country  STRING,
  locale  STRING
)
PARTITIONED BY (
  dts  STRING
)
Explanation: The transpiler adheres to the configured coding conventions: in this case, fully qualified object names are generated.

Source SQL:

INSERT OVERWRITE TABLE by_city
PARTITION (dts = '<DATEID>')
SELECT
  count(*) c,
  country,
  locale
FROM all_users
WHERE dts = '<DATEID>'
GROUP BY country, locale

Transformed SQL (for sampled data):

INSERT OVERWRITE TABLE by_city_with_sampling
PARTITION (dts = '<DATEID>')
SELECT
  CAST(count(*) * 100.0000 AS BIGINT) c,
  all_users_with_sampling.country,
  all_users_with_sampling.locale
FROM all_users_with_sampling
WHERE all_users_with_sampling.dts = '<DATEID>'
GROUP BY 2, 3
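The renaming convention used above – <TABLE_NAME> to <TABLE_NAME>_with_sampling – can be sketched as a plain identifier substitution. The toy version below (a regex over raw text, with an invented table list) is only an illustration; the real transpiler works on a parsed representation of the query rather than on text:

```python
import re

# Hypothetical list of tables that have sampled counterparts.
SAMPLED_TABLES = {"by_city", "all_users"}

def rename_sampled_tables(sql: str) -> str:
    """Append the _with_sampling suffix to each known table name.

    Word boundaries prevent rewriting identifiers that merely
    contain a table name: all_users_copy stays untouched.
    """
    pattern = r"\b(" + "|".join(map(re.escape, sorted(SAMPLED_TABLES))) + r")\b"
    return re.sub(pattern, lambda m: m.group(1) + "_with_sampling", sql)

print(rename_sampled_tables("INSERT OVERWRITE TABLE by_city SELECT * FROM all_users"))
# INSERT OVERWRITE TABLE by_city_with_sampling SELECT * FROM all_users_with_sampling
```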

DML Examples

Taking into account the sampling

Sampling gets more complicated when dealing with DML. What do you do with an OUTER or an INNER query? How do you handle aliasing? Simple AST- and text-based approaches will fail. CompilerWorks gets it right.
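Why only one level of a nested query gets scaled can be seen with a toy simulation of the first example below. If users are sampled at 1%, the inner query's per-user counts are already correct for the users that were kept; it is the outer count of users per country that must be multiplied by 100. A deterministic sketch (all data invented for illustration):

```python
from collections import Counter

# 300 users spread evenly across 3 countries (toy data).
users = [(uid, ["US", "UK", "DE"][uid % 3]) for uid in range(300)]

# A deterministic 1% "sample": keep every 100th user.
sampled = [(uid, country) for uid, country in users if uid % 100 == 0]

# Outer-query logic: count users per country over the sample,
# then scale by the reciprocal of the sampling rate (x100).
estimated = {c: n * 100 for c, n in Counter(c for _, c in sampled).items()}
actual = dict(Counter(c for _, c in users))

print(estimated)  # {'US': 100, 'UK': 100, 'DE': 100}
print(actual)     # {'US': 100, 'UK': 100, 'DE': 100}
```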

Explanation: The transpiler knows to scale the OUTER query, but not the INNER one.

Source SQL:

WITH
q1 AS (
  SELECT
    userid,
    country,
    count(*) as kount
  FROM all_users
  WHERE dts = '<DATEID>'
  GROUP BY userid, country
)
SELECT
  q1.country,
  count(*)
FROM q1
GROUP BY q1.country

Transformed SQL (for sampled data):

WITH
q1 AS (
  SELECT
    all_users_with_sampling.userid,
    all_users_with_sampling.country,
    count(*) kount
  FROM all_users_with_sampling
  WHERE all_users_with_sampling.dts = '<DATEID>'
  GROUP BY 1, 2
)
SELECT
  q1.country,
  CAST(count(*) * 100.0000 AS BIGINT)
FROM q1
GROUP BY 1
Explanation: The transpiler knows to scale the INNER query, but not the OUTER one this time.

Source SQL:

SELECT
  country,
  SUM(c) as d
FROM (
  SELECT
    city,
    country,
    count(*) AS c
  FROM all_users
  WHERE dts = '<DATEID>'
  GROUP BY city, country
  HAVING count(*) > 50
) by_city
WHERE dts = '<DATEID>'
GROUP BY country

Transformed SQL (for sampled data):

SELECT
  by_city.country,
  sum(by_city.c) d
FROM (
  SELECT
    all_users_with_sampling.city,
    all_users_with_sampling.country,
    CAST(count(*) * 100.0000 AS BIGINT) c
  FROM all_users_with_sampling
  WHERE all_users_with_sampling.dts = '<DATEID>'
  GROUP BY all_users_with_sampling.city, 2
  HAVING CAST(count(*) * 100.0000 AS BIGINT) > 50
) by_city
WHERE (by_city.dts = '<DATEID>')
GROUP BY 1
Explanation: What if a sampled query is aliased? The transpiler knows what to do.

Source SQL:

SELECT
  count(*),
  q2.aliased_aliased_id
FROM (
  SELECT
    q1.country,
    q1.aliased_id aliased_aliased_id
  FROM (
    SELECT
      country,
      userid aliased_id
    FROM all_users
    WHERE dts = '<DATEID>'
  ) q1
) q2
GROUP BY q2.aliased_aliased_id

Transformed SQL (for sampled data):

SELECT
  count(*),
  q2.aliased_aliased_id
FROM (
  SELECT
    q1.country,
    q1.aliased_id aliased_aliased_id
  FROM (
    SELECT
      all_users_with_sampling.country,
      all_users_with_sampling.userid aliased_id
    FROM all_users_with_sampling
    WHERE all_users_with_sampling.dts = '<DATEID>'
  ) q1
) q2
GROUP BY 2
Explanation: The transpiler handles joins correctly. Scaling depends on the type of join. The tables in the join must have the same sampling scheme and sampling function, but need not have the same sampling rate.

Source SQL:

WITH
a AS (
  SELECT
    count(*) kount_a,
    country
  FROM all_users
  WHERE dts = '<DATEID>'
  GROUP BY country
),
b AS (
  SELECT
    userid,
    country,
    count(*) kount_b
  FROM all_users_copy
  WHERE dts = '<DATEID>'
  GROUP BY 1, 2
)
SELECT
  sum(a.kount_a),
  sum(b.kount_b),
  a.country
FROM a
JOIN b ON (a.country = b.country)
GROUP BY country

Transformed SQL (for sampled data):

WITH
a AS (
  SELECT
    CAST(count(*) * 100.0000 AS BIGINT) kount_a,
    all_users_with_sampling.country
  FROM all_users_with_sampling
  WHERE all_users_with_sampling.dts = '<DATEID>'
  GROUP BY 2
),
b AS (
  SELECT
    all_users_copy_with_sampling.userid,
    all_users_copy_with_sampling.country,
    count(*) kount_b
  FROM all_users_copy_with_sampling
  WHERE all_users_copy_with_sampling.dts = '<DATEID>'
  GROUP BY 1, 2
)
SELECT
  CAST(sum(a.kount_a) * 100.0000 AS BIGINT),
  CAST(sum(b.kount_b) * 100.0000 AS BIGINT),
  a.country
FROM a
JOIN b ON a.country = b.country
GROUP BY a.country
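As a concrete note on the join case above: because the two joined tables may be sampled at different rates, each side's aggregate is scaled by the reciprocal of its own table's rate. A hypothetical sketch of that bookkeeping (table names from the example, rates and counts invented):

```python
# Per-table sampling rates (invented for illustration).
SAMPLING_RATES = {
    "all_users_with_sampling": 0.01,       # 1% sample -> x100
    "all_users_copy_with_sampling": 0.02,  # 2% sample -> x50
}

def scale_count(sampled_count: int, table: str) -> int:
    """Scale a count taken over a sampled table back to a
    full-dataset estimate; round() guards against float noise."""
    return round(sampled_count / SAMPLING_RATES[table])

kount_a = scale_count(7, "all_users_with_sampling")
kount_b = scale_count(13, "all_users_copy_with_sampling")
print(kount_a, kount_b)  # 700 650
```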

Customer Query

Closer to the real world

Below is an anonymized customer query that was automatically rewritten to work with sampled data. The query not only incorporates the ideas presented above; the rewrite also optimized a JOIN.

Source SQL:

INSERT OVERWRITE TABLE dedup_summary
PARTITION (ds = '<DATEID>')
SELECT
  a.cnt2 / b.cnt1 as miss_rate,
  a.sum2 / b.sum1 as miss_usd_rate,
  a.cnt2,
  a.sum2 as miss_usd
FROM
(
  SELECT
    count(*) as cnt2,
    sum(dd_cent) as sum2,
    ds
  FROM dedup
  WHERE ds = '<DATEID>'
  GROUP BY ds
) a
INNER JOIN
(
  SELECT
    count(*) as cnt1,
    sum(bgt_ant) as sum1,
    ds
  FROM imps_ant
  WHERE ds = '<DATEID>'
    AND illegals_bitmap = '00000000000000000000000000000000'
  GROUP BY ds
) b
ON a.ds = b.ds

Transformed SQL (for sampled data):

INSERT OVERWRITE TABLE dedup_summary_with_sampling
PARTITION (ds = '<DATEID>')
SELECT
  a.cnt2 / b.cnt1 miss_rate,
  a.sum2 / b.sum1 miss_usd_rate,
  a.cnt2,
  a.sum2 miss_usd
FROM
(
  SELECT
    CAST(count(*) * 100.0000 AS BIGINT) cnt2,
    sum(dedup_with_sampling.dd_cent) * 100.0000 sum2,
    dedup_with_sampling.ds
  FROM dedup_with_sampling
  WHERE dedup_with_sampling.ds = '<DATEID>'
  GROUP BY 3
) a
JOIN
(
  SELECT
    CAST(count(*) * 100.0000 AS BIGINT) cnt1,
    sum(imps_ant_with_sampling.bgt_ant) * 100.0000 sum1,
    imps_ant_with_sampling.ds
  FROM imps_ant_with_sampling
  WHERE imps_ant_with_sampling.ds = '<DATEID>'
    AND imps_ant_with_sampling.illegals_bitmap = '00000000000000000000000000000000'
  GROUP BY 3
) b
ON a.ds = b.ds
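One property of the query above worth noting: miss_rate and miss_usd_rate are ratios of two aggregates, and when both numerator and denominator are scaled by the same factor the factor cancels, so the ratios are unaffected by the rewrite. A quick check with invented numbers:

```python
cnt2, cnt1 = 42, 1000   # counts over the sampled tables (invented)
scale = 100.0           # reciprocal of the 1% sampling rate

miss_rate_sampled = cnt2 / cnt1
miss_rate_scaled = (cnt2 * scale) / (cnt1 * scale)

# The common scale factor cancels out of the ratio.
print(miss_rate_sampled == miss_rate_scaled)  # True
```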