CompilerWorks is now part of Google Cloud

CompilerWorks is joining Google Cloud

20 October 2021 | 9:00 PM 

We are delighted to share today that CompilerWorks is joining Google Cloud!

Our mission has always been to tackle difficult engineering problems. Since 2015, we’ve been working on making cloud migration easier and faster by developing products that automatically analyze, convert, and optimize legacy code to run on today’s cloud platforms. Many of our customers were migrating to Google Cloud and, naturally, we began to follow their journeys closely.

Google Cloud shares the same customer-centric mindset, with an even larger global growth agenda. We will be combining forces to make enterprise data cloud migrations faster, more cost-efficient, and less risky, while expanding our ability to reach customers all over the world.

We are excited to help customers modernize with the cloud, and we look forward to bringing our expertise to Google Cloud’s suite of migration offerings and accelerating our customers’ journeys to the cloud, together.

We look forward to seeing how Google Cloud leverages our technology to speed customer migrations to the cloud.

Sincerely,

Gerald & Shevek, on behalf of the CompilerWorks team

Data Engineering Podcast

Data Engineering Podcast with Tobias Macey

Overview: A major concern that comes up when selecting a vendor or technology for storing and managing your data is vendor lock-in. What happens if the vendor fails? What if the technology can’t do what I need it to? CompilerWorks set out to reduce the pain and complexity of migrating between platforms, and in the process added an advanced lineage tracking capability. In this episode Shevek, CTO of CompilerWorks, takes us on an interesting journey through the many technical and social complexities that are involved in evolving your data platform, and the system that they have built to make it a manageable task.

Transcript:

Tobias Macey: Today I’m interviewing Shevek about CompilerWorks and his work on writing compilers to automate data lineage tracking from your SQL code. Shevek, can you start by introducing yourself?
Shevek: Hi, I’m Shevek. I’m the technical founder of CompilerWorks.
 
I started writing compilers by accident 20 years ago. You’ve given me the introduction; the challenge was just to figure out why.

I think it’s because they’re just one of the hard problems. An awful lot of things that are put out there as languages aren’t really compilers; they’re just a syntax stuck on top of an executor. I think almost all toy languages are like that: whenever anybody says “I’ve invented a language,” they haven’t actually invented a language. There are no semantics to the language itself; they just stuck a syntax on top of an executor.
And as I’ve gone on with writing compilers, I’ve found that the real challenges are where there’s an actual semantic transformation between the language being expressed and the language of the target system, and that’s when it really starts to get interesting.

I’m not really interested in parsers. Parsers are what you get when you speak to people who learned about compilers in university: they have been taught to write parsers, and I think one of the reasons for this is that it’s really fun and entertaining to write a 12-week course about parsers. You can teach it very slowly and cover LL and LR and Earley’s algorithm and railroad diagrams, and all of these other algorithms that nobody in their right mind would use, because they’re only theoretically interesting.

Please teach people about languages and compilers and semantics, because until you’re talking about mapping between semantic domains, you’re not really doing the job.
Tobias Macey: Given your accidental introduction to compilers and subsequent fascination with them, how did you end up in the area of data management?
Shevek: This is where I’m going to be brutally honest: that’s where the money is.

What you’ve got out here in the world of data management is a vast world of enterprise languages, each of which has a single vendor. And by the nature of a single vendor, they get to charge what they please and tell you what you can do with it.
And so there’s a traditional joke about exactly what filling you would like in your sandwich: you can have this marvelous enterprise capability, but you have to take all of these restrictions with it. And it started with taking those restrictions off the enterprise language and just asking: what else could we do if we had a truly generic, open compiler for each of these proprietary languages?

And actually, that philosophy started a little bit earlier, because there were a lot of languages out there. There’s no point writing a commercial compiler for C these days unless you’re doing something particular, say with security or with FPGAs, because fundamentally C is free, and that doesn’t necessarily make it easy.

So if you go back into my GitHub, you’ll see that one of my earliest projects was writing C pre-processor implementations natively in a whole set of languages. Back in the 90s in particular, anytime you wanted a pre-processor (this was before we had the whole modern world of assorted pre-processors and templating languages), people just used either CPP or M4. But if you were working in, say, Java and somebody defined something using the C preprocessor, you didn’t have a ready implementation.

So that philosophy of writing things that were compatible with other things and just opening up the world has been an underlying philosophy of what I’ve been building for a lot longer than data processing languages. It just turns out that data processing languages are a commercially viable place to do it.
Tobias Macey: And that is as good a reason as any to be in the business, and so that brings us to what you’re doing at CompilerWorks. I imagine you’ve given us the prelude to the story behind it, but I’m wondering if you can share a bit more about the business that you’ve built there and some of the motivation and story behind how you ended up building this company to address this problem of lock-in by data processing systems.
 
 
Shevek: The reason you build a new database is because you have a new capability, and then we get this marvelous phrase, ANSI SQL, which is a myth about as real as Bigfoot: lots of people claim to have seen it, but nobody actually has.

And so now we come back to this question of translating between semantic domains.
I have two SQL vendors. I have code on one and I want to run it on the other.
The code is syntactically similar, because ANSI wrote this great big expensive document that dropped some hints about how you might want your language to look, if you were to consider making it look like something like this. But the languages don’t fundamentally do the same thing.

Take a simple case: what does division mean? What’s 1 / 10? If your database server happens to operate in integers, the answer is 0. If your database server happens to operate in numerics, the answer is 0.1, with one decimal place of precision. If your database server happens to operate in floating point, then you get something very close to 0.1, but not exactly 0.1, because that’s not an exact binary number. So even with that very trivial case, I’ve motivated something beyond ANSI in terms of translating between database servers.
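For a concrete flavor of that, here is roughly how a few common engines treat the same expression (a sketch; exact result types and precision vary by product, version, and settings):

    -- The same ANSI-looking expression, three different answers:
    SELECT 1 / 10;
    -- PostgreSQL:          integer division          -> 0
    -- Oracle (FROM dual):  NUMBER (decimal) division -> 0.1
    -- BigQuery:            FLOAT64 division          -> 0.1 (a binary approximation of one tenth)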

CompilerWorks has 2 products.
One is we translate code from one data processing platform to another and we do it correctly, and that’s correctly with a capital C.
And the other thing that we do is we compile the code, do static analysis, and tell you what it is that you need to know about this code. A significant part of that is questions like: if I change this, what will the effect be on my organization as a whole?

So you might imagine that you’re writing code that produces a particular table on a particular server, and you do or don’t make a particular error, or you do or don’t make a particular change. Somewhere ten levels removed from you, in a different sub-organization of an organization that employs tens of thousands of people, somebody is affected by this change.

Who are they?
How are they affected?
Do they need to be told?
Did they make a critical business decision based on that?
And what do we need to do in order to keep this organization running?
And should you really be considering making this change before you’ve made it?
Or, if you already made the change, what do you now need to do to fix up all the impact?

So that’s the static analysis (the lineage product). 
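For a concrete sense of what that means at column level, here is a minimal sketch of the kind of dependency chain involved (the tables and views are invented for illustration):

    -- A column two hops upstream of a dashboard:
    CREATE VIEW finance.daily_revenue AS
    SELECT order_date, SUM(amount_usd) AS revenue
    FROM   sales.orders
    GROUP  BY order_date;

    CREATE VIEW exec.kpi_dashboard AS
    SELECT order_date, revenue / 1000000.0 AS revenue_mm
    FROM   finance.daily_revenue;

    -- Changing the meaning, type, or population of sales.orders.amount_usd silently changes
    -- exec.kpi_dashboard.revenue_mm two levels away; column-level lineage is what tells you
    -- who is affected before (or after) you make the change.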
Tobias Macey: Definitely a lot of interesting things to dig into there. Lineage in particular is an area that has been gaining a lot of focus and interest lately from a number of different parties, with attempts at addressing it in different ways, so I definitely like the compiler-driven static analysis motivation for it. Before we get too much further down that road: you’ve already given a bit of an overview of some of the differences between parsing and compiling, but for the purposes of laying the groundwork for the rest of the conversation, can you give your definition of what a compiler is and how that’s relevant to this overall space of language translation and data lineage?
Shevek: I usually describe a compiler as something that turns something from one language into another. In the case of lineage, what we’re doing is turning the underlying source code into an algebraic model, and then that computer algebra model is the system of which we can ask questions about what happened, what the consequences are, where the lineage is.

It’s interesting to think about the company, particularly on the lineage side: are we a compilers company or are we a computer algebra company?

I suspect really we are a computer algebra company, because that’s where the hard stuff is. Why lineage analysis is getting popular is an interesting question, because if someone writes some piece of code that says: you will generate a piece of SQL that takes this table, processes it like that, and generates that table, well, I’ve got lineage from my source table to my target table.

But now we fall into a hole, which is this: if the language that I invented in order to generate this SQL is just as expressive as SQL, then that language is going to be flamingly complicated, with all of these semi-joins and so on. Typically the reason people write these languages is to make them not as expressive as SQL, because they want to make them more accessible to developers, or they want to have clicky-droppy boxes, or something like that.

Now, assuming that you followed that road, which is basically universal, you have the problem that your language is insufficiently expressive to do the thing the analyst wants to do. So what happens is you have a text box or a field where you type in a fragment of underlying SQL, and now what you’ve got is not a language but a macro preprocessor, which doesn’t actually know what’s going on in that fragment of underlying SQL that the developer typed in. So all of these tools start out saying: you’re going to build your thing in our nice GUI workflow, we’re going to show you this nice GUI workflow, and that will give you the lineage. But you don’t really have lineage, because you don’t have enough expressiveness to really do the job; therefore the developers had to type some custom code into a box, and you don’t understand that custom code. Therefore you don’t have lineage.

And if what’s happening is that you are subject to something like CCPA or GDPR, where you are going to jail if you get this wrong, then you don’t have a lineage tool.
You actually need to look at the code that the machine really executed and analyze that code accurately at a column level; then you have a lineage model. Then you’ve got a chance of not going to jail. Anything less we do not define as lineage.
Tobias Macey: Yeah. And another motivation for trying to reconstruct lineage is that the so-called “best practice” for data processing these days is to spread your logic across five different systems: you run this tool to get your data out of the source system into this other location, this other tool to process your data in that location, this other tool to build an analysis on top of that preprocessed data, and then this other tool to actually show it to somebody so they can make a decision based on it, and then they input other data back into another system that we pull back out again. Trying to reconstruct that flow of operations by using an additional set of language processing to post into an API that stores the data in a database, or by trying to analyze the query logs from your data warehouse, gives you a very limited view of the entirety of the data lifecycle, and you’re trying to piece all of this back together.
Shevek: Well, the query and audit logs typically give you a good start, because they’re at least partly written by security people, and the security people say: you must tell us everything that’s going on. The fragmentation of the data infrastructure, which is the other thing that you alluded to, is very real, and I think this leads to a situation where, in a typical installation, we’re processing multiple languages and stringing them together; sometimes that stringing together is standardized, and sometimes it’s bespoke.

But the challenge in putting together a lineage is to be able to identify a column up front, in the ODS or in, say, the web tier where a user has interacted with something, then list all of the back-end dashboards that that column affected, and then describe the effect of that user interaction on each of those dashboards in a human-readable way.
That’s the challenge. 
Tobias Macey: Absolutely. And so to that point, you’ve mentioned a little bit about some of these enterprise processing languages and the tool-specific semantics of how they manage that processing, and you’ve discussed some of the wonderful joys of the SQL ecosystem and trying to translate across those. I’m wondering if you can give an overview of the specific language implementations and areas of focus that you’re building on top of for CompilerWorks: whether you’re focused primarily on the SQL layer and being able to generate these transformations and this lineage for the databases, or whether you’re also venturing out into things like Spark jobs or arbitrary Python or Java code and things like that.
Shevek: Yes, so you run into a number of issues as you walk around the data infrastructure.

The SQL languages are, for the most part, statically analyzable. There are a couple of holes in them, and there are one or two that have type systems that lead one to lose hair. From there one can fairly easily go out to the BI dashboards, particularly the richer products.

So we’re talking about some of the flow based languages, and at this point as a compilers company, one ends up writing a new piece of technology because basically all of the SQL languages are tied into the relational algebra or some variant, or some set of extensions thereof.

It has always amused me that so many of the papers on relational optimization start with a phrase something like “without loss of generality, we shall assume that the only boolean connective is AND,” which basically means that you’re not allowed to use OR, and you’re not allowed to use NOT.

Well, guess what: if you do make that assumption, the world becomes really easy and really simple. They’re denying the existence of outer joins, and it’s really easy to write an academic research paper on optimization if you only deal with AND. But I disagree with the “without loss of generality”: you’ve lost the whole of the real world there.

Anyway, I digress, sorry. So yes, there’s a bunch of data flow languages, and BI dashboarding systems that actually work effectively with data flow and data process management, so here we’re talking about the Informaticas, the Tableaus, the DataStages, the things in that range. So we grew up to work with that ilk, and then we have a secondary core that speaks to the same computer algebra engine and deals with these dataflow-style languages.

Spark is sort of a mixture, because on the front end Spark SQL looks like SQL. The question then (and I’m going to speak generically, not specifically about Spark SQL) is: what’s the strength of the join optimizer before you compile down to a dataflow language? And are you really a dataflow language? The seminal paper, I think, for people wanting to understand why dataflow languages have benefits is probably the Google Flume paper, particularly the statistics about the reduction in MapReduce jobs from doing delayed evaluation. But once you get out of that and into the ETL languages, you also run into things like SAS.

And so now you end up with questions like: how do you port, let’s say, Informatica to Spark? I picked those two because they are both dataflow languages. But Informatica has this fundamental property that computation is sequential, which is to say that if you set the value of a read-write port, that value remains assigned and remains visible to the next data record. So you can actually generate a datum by saying: if the record number is 1, set the value to X; if the record number is not 1, just read X. In an MPP system you would get X in one record and null in every other record, but in both SAS and Informatica you get the same value of X everywhere, and this is the sort of hard semantic difference that makes it very, very difficult to map between languages.
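A sketch of that difference in SQL terms, with a hypothetical table: a naive row-by-row translation of “set on record 1, read it everywhere else” produces the MPP behavior described above, while preserving the intent needs something like a window function.

    -- Naive per-row translation: only record 1 gets a value; every other row sees NULL.
    SELECT CASE WHEN record_number = 1 THEN expensive_expr END AS x
    FROM   source_rows;

    -- Intent-preserving translation: every row sees the value computed for record 1.
    SELECT FIRST_VALUE(expensive_expr) OVER (ORDER BY record_number) AS x
    FROM   source_rows;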

This is where we break out of the traditional job of compiling: we actually have to engineer our way up into user intent. If you’re compiling C or Java down to x86, the user said this, therefore do this, and if you do anything else, it’s your fault. But if you’re compiling some of these languages, it’s more like: the user said this; we’ve had a look around; we think they really meant this part of what they said, and every other part of what they said was irrelevant or a consequence of the implementation; therefore we’re going to generate high-performance code for the target that preserves the thing they meant and discards the rest. And that’s hard!
Tobias Macey: [laughs] Yes, exactly, the technology is easy, it’s the people that are hard – as with everything that has to do with computers.
Shevek: I don’t envy you editing that part out, ’cause I went very long-winded.
Tobias Macey: No, there’s nothing to edit there.

Continuing on the point of the semantics being the hard part of translating these data processing languages: you mentioned earlier that at the core you think you’re more of a computer algebra company than a compiler company. I’m wondering if you can discuss a bit of the abstract modeling and mathematical representations that you use as the intermediate layer for translating between and among these different languages and generating the lineage analysis that is one of the value-adds of what you’re doing there.
Shevek: I won’t, but I will offer some interesting corollaries. I hope you will forgive me for not answering the question as you directly asked it, which is a very interesting question.

There’s an old party trick where you take a floating point value, you go around a loop a million times, and you add one to this floating point value each time. The question now is: what’s the value of that floating point value? And the answer is, it’s not a million, it’s about 65,000, because eventually the exponent ticks over, and at the point where you’re not seeing the last integer digit anymore because the exponent has ticked over, adding one to a floating point value has no effect. Processors are weird.

There’s another party trick where you allocate an array of memory, you fill it with random numbers, and then you add all the numbers into an accumulator. Then you try doing that iterating backwards, and then you try doing that iterating in a random order, and you see what the performance difference is. It turns out that processor hardware and memory pre-fetch and so on are an absolutely delicious thing as long as you’re reading memory forwards. It sort of manages if you’re reading memory backwards. And it falls flat on its face, throws its hands up in the air, and screams if you read memory in a random order, to the effect of about a 200-to-1 performance penalty.

So now let’s think about a tree data structure. On paper, a tree data structure has logarithmic complexity. Brilliant. Academically we ignore the constants, but to a processor with memory pre-fetch, a tree data structure looks like random-order access. What that means is that the constant is an order of magnitude larger than anybody thinks it is, which is why on paper a heap and a tree have the same performance, but in practice a heap is so much faster, because you start to fit into cache lines.

Now, computer algebra systems look awfully like random-order access to memory, and I think this is one of the most interesting problems in any sort of computer algebra. You’ll even find it in SAT solvers, where people optimizing C change the order of the fields in the structs in the internals of the SAT solver so that the hottest ones are all at the top end of the word, because that’s the bit of the word that will fit into cache. We actually get multiple orders of magnitude by having a solver within the computer algebra engine, which itself works out what order to do things in, so that we don’t appear to be accessing the algebra structure in random order.

The aim is to do as much as you can on a piece of memory while it’s in cache, and then drop it out of cache. And then, of course, the other entertaining question is: wait, you do all of this in Java? Isn’t Java some kind of language where you’re a million miles away from the processor? Actually, I happen to think Java and the JVM are a beautiful, beautiful setup, because you get to be a million miles away from the processor when you want to be, but when you actually want to get down low, you’ve got control of everything down to memory barriers, and at that point you’re pretty much able to write assembler; that’s the joy of the language. They say ninety-something percent of your code doesn’t need optimizing, and they’re right. So the question is: can you ignore 90% of the job and do the 1% that matters? I think the JVM is one of the greatest feats of modern engineering for allowing that.
Tobias Macey: Just for a point of reference, for people who are listening and following along, I’ll clarify that when you say tree data structure, you’re speaking of trees spelled TRIE, not TREE.
Shevek: Either will do: a B-tree, a tree with an I, an RB tree, you know, anything where you’re effectively allocating nodes into main memory and then making those nodes point to each other, and particularly where your allocator is, you know, some sort of slab allocator that’s mixing your tree nodes up with other things. If you just allocate a tree once, then maybe yes, your root node is allocated at the start of RAM and everything else is allocated sequentially. But the moment you start rotating and mutating a tree, a tree walk looks like random-order memory access again.
Tobias Macey: Digging more into the technical architecture of what you’re building at CompilerWorks, can you give a bit of an overview of the workflow and the lifecycle of a piece of code?

I guess the data is irrelevant here, since you’re working at the level of the code. But what is the life cycle of a piece of code as it enters the CompilerWorks system, your processing thereof, and then the representation that you generate on the other side for the end users of the system?
Shevek: Yes, and I’m going to answer about three quarters of that, of course. So let’s deal with the thing that we deal with at the start. Everybody knows about lexing and parsing. Lexing and parsing are not necessarily as immediate as everybody thinks they are. So for instance, we’re taught that Foo is an identifier and 5 is an integer, and that Foo5 is an identifier because it’s something that starts with a letter.

And then you ask the question: well, what is 5Foo? And the answer is that it’s an illegal identifier, because it’s an identifier that starts with a digit. But if you go into almost any SQL dialect and you type select 5Foo, what you will get is the value 5 aliased to the name Foo, because we as humans implicitly assume that there needed to be a space between the 5 and the Foo. But if you follow the textbook instruction for how to write a lexer and parser, you actually get the bug that I just described: if you do it the way they taught you in school, you get that bug.
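An illustration of the lexing point, in a dialect that accepts it (behavior varies by engine and version; newer PostgreSQL releases, for example, reject trailing junk after a numeric literal):

    select 5Foo;       -- many dialects lex this as the literal 5 followed by the identifier Foo,
                       -- i.e. the same as:
    select 5 AS Foo;   -- a textbook maximal-munch lexer instead rejects 5Foo as an illegal identifier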

So now it gets a little bit interesting, because the first part of writing a compiler for an enterprise language is working out what its structure is. What are we even being given here? So let’s take a language that exports itself as XML; there are a number of them out there. Now you’ve got a load of XML, and this XML has words in it like “such-and-such ID equals Foo”. Well, what does that Foo refer to? It refers to some other Foo. XML can really only represent a tree structure, and all of these languages are dataflow structures, therefore they must be representing graphs, therefore there must be linkages within this XML. So the first stage is looking at a load of samples and working out what the semantics of the language are.

Now here we have an advantage, because most of these languages were written as relatively thin skins on top of their executors, and so if you know the capabilities of the executor (and I kid you not, actually reading historical papers about how memory allocators worked and things like that will give you a lot of insight into things like the extent of variables), you learn when values get reset or when they get de-allocated just by knowing what technology was available to the authors of the language at the time they wrote it.

Having worked out what all of the linkages are, you now have a symbol table, and having done the parse, you do the compile, which is symbol table, type check, operation selection: very classic compiler work. Now what you have is a compiled binary in the semantics of the source language, and those are not necessarily atomic semantics. So what you need to do is break those semantics down, where some of them may be quite large, into meaningful atoms. You end up with something like “32-bit integer addition with exception on overflow”, and you might even get an annotation about what the exception on overflow is. And now, if you’re doing lineage analysis, you have a whole set of computer algebra rules that will tell you what you need to know about this thing. Am I doing this? Am I doing the same thing twice? When you’re looking for matching regions of algebra, without going too much into details, you could do something like a Jaro-Winkler distance over a computer algebra data structure, or something like that.

It’s fundamentally hard, but things like that are available for answering: why are my marketing department and my sales department both computing the sales figures but coming up with different numbers? Because now they’re disagreeing over who gets how much money, and that’s a problem that needs to be solved.

Emitting code is actually a whole new set of challenges, so this is for the case of migration from platform to platform, because if one just takes the semantics and emits them to the target platform, you get code that has a number of issues.
First, you get non-optimality. You get the fact that it’s not using the correct idioms of the target platform. You also get the fact that it’s ugly: we’ve all seen computer-generated code, and nobody wants to maintain it. A significant part of generating code for a target platform is working out what code to generate that is idiomatic for the target platform, idiomatic for the particular development team that gave you the input code, and human-readable and human-maintainable.

So to give you a trivial case, there was an old joke: “I stole the artificial intelligence source code from the government laboratory for artificial intelligence, and I’m going to prove it by dumping the last five lines.” This joke came out when LISP and Scheme were the popular languages, and the punchline of the joke was five lines of close brackets. If you just do machine-generated code, which everybody has done at one point, you either fail at 1 + 2 * 3 or you generate five lines of close brackets.
Yes, and that’s the sort of problem that is non-obvious to the emitter. I’ve also alluded to this when I said idiomatic for the developers who wrote the original source: that means there are things you have to preserve about the original source which are not necessarily semantics of the language, but which are in fact idioms of the development team in question.
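A tiny sketch of the “1 + 2 * 3” problem and its over-corrected cousin, using a hypothetical table t with columns a, b, c:

    -- Intended semantics: (a + b) * c
    SELECT a + b * c         FROM t;   -- wrong: dropping the parentheses changed the meaning
    SELECT ((a) + (b)) * (c) FROM t;   -- correct but unreadable: parenthesized everything
    SELECT (a + b) * c       FROM t;   -- what a human would have written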
Tobias Macey: It’s definitely an interesting aspect of the problem, because as you point out, there are certain implicit meanings, or meanings that are a side effect of the structure of the code, that have nothing to do with the semantics of the code or its computational intent, but that do help with the cognitive and organizational complexity management for the team that is writing and maintaining the code, and that they might want to maintain in the output, because of things like splitting logic on team boundaries, for instance.
Shevek: Yes, and generating code that a machine will accept is vastly easier than generating code that a customer will accept.

I mean, the general market approach to doing machine translation is: you write a parser, you jump up and down on the parse tree, and then you emit the parse tree and say this is the target language, waving the ANSI SQL flag as loudly as you can and replacing some function names while you’re at it. But the moment you typecheck, you’ve done things like inserting casts, and now there are two things here, both of which you can’t do. One is to generate code that contains all of those casts, because your human maintainer will say nope.

And the other is to assume that the target language does the same implicit type conversions, or even fundamentally has the same types, as the source language. And the answer to that is nope: you cannot divide 1 by 10 in any financial institution unless you know exactly what you are doing.
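A sketch of the two failure modes with a hypothetical orders table: the first form is what naively emitting every inserted cast looks like; the second is only safe once the compiler has proved that the target’s types and implicit conversions make it equivalent.

    -- What emitting every cast the typechecker inserted looks like:
    SELECT CAST(CAST(o.amount AS NUMERIC(18,4)) / CAST(10 AS NUMERIC(18,4)) AS NUMERIC(18,4)) AS tenth
    FROM   orders o;

    -- What the maintainer expects to read, and what may be emitted only once it is proved equivalent
    -- under the target's implicit conversions and division semantics:
    SELECT o.amount / 10 AS tenth
    FROM   orders o;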
Tobias Macey: Absolutely. And so to that point, it’s interesting to dig into some of the verification and validation process for the intermediate representation of the language, the onboarding approach to bringing new target platforms or new source platforms under the umbrella of CompilerWorks, and the overall effort that’s involved in actually doing the research to understand the capabilities and semantics of those systems.
Shevek: Yes, and then you start to ask: well, what kind of company are we?

Are we a computer algebra company or are we a set of research historians?
What do I know about unheard-of platform X? Who wrote it?
When did they write it?
Where did they write it and  … ?

An awful lot of that gets folded into the initial development of a language.
We are utterly test-driven. You basically have to be. And so, starting out with a new language, it really is just about passing test cases and building customer acceptability. There are other parts of this question which, I apologize, I’m not going to answer; I’m trying to fish out things that I can say, because one of the things that we developed over the years is the ability to implement a compiler for a language in a shockingly short space of time.

Once upon a time, we actually signed a contract to do a language in a space of time which was, you know, bordering on professional irresponsibility. And of course we did it, and we hit it. The thing that we didn’t publish was that we actually did it in less time than that, because one of the things we know how to do is understand languages and put together an implementation of a language in a very short space of time.

But a lot of this comes from having a core where, for instance, types only behave in certain ways. And if you can express all of the ways in which types behave and types interrelate, then you can describe a language in terms of for instance its type system.
And that to us is a tool that we have available.

It’s sort of interesting when you’re mapping from a language where types have inheritance and polymorphism to a language where types maybe have inheritance and polymorphism but have different relationships between themselves. At that point, something which was a polymorphic conversion in one language is an explicit type conversion in another language. Understanding of types is very, very important.
Tobias Macey: Absolutely, especially when dealing with data.
Shevek: Yes, the date-times are the killers, because even if you know that you’ve got a date-time, there’s one hour in every year that doesn’t exist and one hour in every year that exists twice. And then people do things like: OK, adding one to a date or time is simple, all you have to know is whether that particular language interprets it as days or milliseconds. But then you get into all sorts of craziness, like: if I take a time and I convert its time zone, did I get the same instant, or did I get what we call a time zone attach? Is 1:00 PM BST, in Pacific, 5:00 AM or whatever it is, or is it 1:00 PM Pacific? Different database servers do different things when given this operator, and that again blows ANSI SQL out of the water, and what you actually end up doing is just figuring out how the database server does it internally and then modeling that.
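In PostgreSQL terms, for instance, the two meanings look like this (other engines behave differently, which is exactly the problem):

    -- "Time zone attach": interpret a bare wall-clock time as being in London, producing an instant.
    SELECT TIMESTAMP '2021-07-01 13:00' AT TIME ZONE 'Europe/London';
    -- "Same instant": take an absolute instant and render it as Los Angeles wall-clock time.
    SELECT TIMESTAMPTZ '2021-07-01 13:00+01' AT TIME ZONE 'America/Los_Angeles';   -- 2021-07-01 05:00:00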

And then you’re back into history. You’re back into reading source code. The Postgres source code is one of the most marvelous resources on the planet, because it tells you how a lot of the database servers out there work. And then you want to know when they forked, what they did, why they did it, and who did it.
And so on. 
Tobias Macey: And I’ll agree that Postgres is definitely a marvelous resource, and it’s fascinating the number of systems that have either been built directly on top of it or been inspired by it, even if not taking the code verbatim.
Shevek: Yes. I mean, you’ve got this challenge when you say “we’re Postgres-compatible”: you sort of adopt the Postgres mutator, and people don’t typically want to do anything with the Postgres mutator. Compared to almost every commercial dialect, the Postgres mutator has a fundamental weakness in its handling of time zones, which nobody has ever seen fit to correct, and I suspect core Postgres can’t: you don’t really have a timestamp with time zone, and the time zone handling in Postgres is basically to be avoided if you want to get the right answer. And yet Oracle, Teradata, BigQuery, everybody else does it right.

So yeah, Postgres is a wonderful resource, but I do wonder at people basing things on it given that weakness.
Tobias Macey: Most people who start basing their systems on top of Postgres haven’t done enough of the homework to recognize that as a failing before they’re already halfway through implementation.
Shevek: I think the majority of people who have a good idea and want to get a demo of that good idea out as fast as possible really don’t think about the consequences of the decisions they make in the first two or three days, and I think there is a phenomenal bias among developers starting out to imagine that because something gives you a very fast day one, it will give you a very fast day three. And they think: OK, we’ll get 6 to 12 months down the line, and then we’ll rewrite it.

And I think that for an experienced developer, the crossover point with technologies is around day three, not month three, and this is a big big mistake. We made some very interesting technological decisions about things that we were and were not going to do with this company right at the start of the company, and they paid off.
And some of those decisions were that we were going to do a lot more hard work than was necessarily obvious. And we’ve watched people come up behind us and say we’re going to make different technological decisions, and suffer the consequences of those things and sort of run into a wall.

But the number of times that I’ve been told, for instance, “we want to develop the back end in Node because that way we get to use the same models on the front end and the back end.” It’s like: whoop-de-do, you’ve got no type checker (OK, TypeScript, I’m looking at you). Whoop-de-do, let’s see. An experienced developer is going to sit down with a decent web framework, with a DI framework and everything else, and I’ll have you know you’re going to be overtaken by the end of day three at best. And there are companies out there that know this.
Tobias Macey: Yes, as you were talking about technologies that have been thrown together to get a fast solution, the first thing that came to mind was JavaScript, so I appreciate that you called it out explicitly.
Shevek: I like JavaScript as a language, but I also have this rule about writing shell scripts, which is: the moment you find yourself using anything like arrays, you’re in the wrong language. I have this sort of set of criteria that tell you that you’re in the wrong language.

There are things I very much like about the JavaScript ecosystem, and things that I would definitely go to it for. However, it does make me kind of sad to see it slowly reinventing, rediscovering, or hitting many of the problems that other languages have already hit.

Another example: some years ago there was a great big fuss about the ability of an attacker to generate hash collisions, which led to putting perturbation into hash tables. Somebody pointed out that if I generated the correct set of spoofed SYN packets and sent them to a remote Linux kernel, then because it used a predictable hash, and it was a chained hash, we could convince the kernel to put all of those SYN packets into the same chain, and we’d have denied service to the kernel, because it was spending all of its time walking this linear chain rather than benefiting from the hash table. I watched the same bug get discovered in Perl, which taught it to use perturbation of hashes. Then I waited something like three or five years for somebody to point out that the same bug existed in, I forget whether it was Python or PHP. And then you get into this world where developers say: hang on a minute, my hash iteration order changed, you’re not allowed to do that. And you say: yes, you are, it says so on the tin.

And so there’s this whole pattern of watching developers rediscover solutions to problems that other languages have already solved. Once you discovered it in Perl, which I think might have been the first one (and PHP might have been the second, but I’d have to check), go around all of the other languages, look, and make sure; don’t wait five years. The same thing is true for the JavaScript ecosystem: they’ve waited 15 years to reinvent certain things.
Tobias Macey: Yes, developers have remarkably short memories and attention spans, at least in certain respects. So, bringing us back to what you’re building at CompilerWorks and its use as a static analysis and lineage generation platform, I’m wondering if you can talk through the overall process of integrating CompilerWorks into a customer’s infrastructure and workflow, and some of the user interactions, processes, and systems that people will use CompilerWorks for and build on top of the CompilerWorks framework.
Shevek: So what you’ll find is that most of the data processing platforms out there have some sort of log or some sort of standard presentation of their metadata, and at CompilerWorks we aim to make everything as easy as possible, by which I mean we take that standard presentation of the metadata.

If you’re working in BigQuery, we take the BigQuery logs.
If you’re working in Redshift, we take the audit logs.
If you’re working in Teradata, we take the various things that Teradata throws at us.

Having basically given the CompilerWorks dumper permission to access these logs, it makes a dump and pulls them into the product, and the rest of it is automated, because the fundamental principle that we operate on is: if it’s possible for the underlying platform to understand the code, it’s possible for us to understand the code. We have all of the temporal information.

We have all the metadata.
We have all the semantic information.
From then on it’s all gravy.

We pull the logs, we put them up into the user interface, we make the data available as APIs, and from then on you can just explore the lineage, much as you’ve seen in our video presentations.
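For example, on BigQuery the “pull the logs” step can be as simple as reading the jobs metadata view (a sketch, not the product’s actual extraction query; the region qualifier and required permissions vary by project):

    SELECT creation_time, user_email, query
    FROM   `region-us`.INFORMATION_SCHEMA.JOBS_BY_PROJECT
    WHERE  job_type = 'QUERY'
      AND  creation_time > TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 30 DAY);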
Tobias Macey: In terms of the migration process, you’ve discussed a lot of this already, so we can probably skip through this question a little bit. But what is the overall process of actually doing the migration from platform A to platform B, and especially of doing the validation that the answer you get on the other side of the transformation matches, at least closely enough, the answer you were getting before you made the migration? And maybe a little bit about some of the reasons that people actually perform those migrations in the first place.
Shevek: So lineage is totally easy. You can usually get up and running with the CompilerWorks lineage in a few minutes – as long as it takes you to pull the logs. You pull the logs, you run it.

Migration in practice tends to be a little bit hairier, because the customer’s presentation of their code is not standard. A significant percentage of customers preprocess their code or something like that. I mean, this is actually where some of the enterprise languages are nicer: with the more capable enterprise languages, while the compilers we have to write for them are much tougher, the customers tend to present their code in a more standardized form, because the language itself is more capable. When you get a relatively incapable language, the customer tends to mess with it procedurally, generate it, do all sorts of things; it’s almost as if they’re treating the underlying language just as an executor. So the first question you have to ask is: what’s your presentation of your code? How did you mess with it?

Once you’ve got hold of the presentation of the code, what you do with CompilerWorks is specify what the input language stack is. This is actually quite nice, because in CompilerWorks you can take a language that generates another language, or contains another language, or preprocesses another language, and say: this is a language stack, you’re going to absorb this, you’re going to transpile, and you’re going to emit to a target language stack that has some of the same preprocessing or management capabilities as your source language stack. And this is yet another hint that writing a purely academic Oracle-to-Postgres compiler isn’t enough, because the Oracle code exists within the context of something else, and may be incomplete, and so on and so on; and again, if you don’t do that, you fail human acceptability. So the start of a migration process is: get the code, work out how it’s specified, tell CompilerWorks how this customer currently specifies their code.

Tell CompilerWorks how the customer wants their code specified, and then run it for the migration. That process, I have walked into a meeting room and done cold in an hour. That’s given that the customer typically doesn’t know the answer; usually they don’t know the answer for the target platform. They’ve been sold something by a vendor, they think it’s a marvelous idea, and you say: how do you want to use this target platform? They say: we don’t know. Then we make a recommendation, and we work with their advisors to make that recommendation work and get it right. One of the things that you get out of this is that we have a lot of versatility with respect to doing the migration job, not just converting code.

Testing is an interesting one. Customers vary in what they will accept. As I said with the 1 / 10 example, we are very, very precise in how we convert. We have customers who absolutely lean on us for that, and they say: I want this accurate down to the last dollar. If you’re dealing with financials, sometimes they care down to the last dollar; I’m slightly avoiding naming names here. If you’re dealing with some of the other markets that we deal with, they’re happy with anything that’s within 5%. And now there’s another thing that gets slightly interesting, which is that if you’re dealing with financials, you’ll always use decimal types for data. I have seen people in certain markets use floating point types for data, and the consequence of that is that if you do a sum of floats, you could get any answer at all; it’s not as if you will probably get an answer that’s within 5% of the result. People don’t understand floating point arithmetic. You could get anything. And the difficult cases are the ones where the customer has done something like that, and the target platform does something in a deterministic but different order to the source platform’s deterministic order.

Now you get customers who write code where the result of the code wasn’t well defined, but the source platform happened to execute it sufficiently deterministically that they think that’s the right answer. Now you have to sit down with the customer and say: dear customer, we love you; however, you did not, in the source language, say what you think you said. Can we now please work with you? There’s a marvelous piece of education there: with a good customer, you can really help them improve their infrastructure as a whole. And that’s also how we describe the static analysis side of the lineage product: tell me the things I need to know. Am I, in my infrastructure, doing something that is odd?

One of the funniest cases I ever saw was somebody who had taken code from Oracle that said “A ! = B”. Now, in Oracle this means not-equals, because you’ve got a not and you’ve got an equals, and that’s a not-equal. And now we’re going back to: what does the lexer do? In C, exclamation-mark-equals is a single token; in Oracle, the exclamation mark and the equals are separate tokens, and it’s the parser that puts them together into a not-equal.

Postgres was written by C developers, therefore exclamation-mark-equals has to be a single token. So what does “A ! = B” mean? It means A-factorial equals B. It executes, it does not return an error, and it doesn’t give you remotely the same answer. So it is a legitimate static analysis to say: did we use the factorial operator? Because we almost definitely didn’t mean to.
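A sketch of that anecdote in PostgreSQL versions before 14 (which still shipped the postfix ! factorial operator): the same characters, one space apart, mean two different things.

    SELECT 2 != 2;    -- not-equals: false
    SELECT 2 ! = 2;   -- (2!) = 2, i.e. factorial(2) = 2: true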
Tobias MaceyYes, that is a hilarious bug. 
Shevek: What is equally puzzling is the number of these things that we discover in source code, and we say: how long has this been in here?

And the answer is this has been in here for years. It’s generating a production data set.
It’s breaking the production data set and nobody noticed, and so you start to ask questions like, under what circumstances do you as a customer notice an error in the production data set?

The most common answer we get is: because data is missing. But if data is present, the customer tends to assume it’s correct. I used to teach undergraduate Java, and you’d get into a lab and say to a student: you’re going to simulate a cannonball. You’re going to fire it into the air at 30 meters a second, gravity we’ll assume is 9.81, and you’re going to model the position of this cannonball at one-second intervals and tell me when it hits the ground. Well, I can do basic calculus, so I can say it’s going to hit the ground in six and a bit seconds, fine. So they’d write their code, they’d run their code, and they’d very proudly present me their answer: the cannonball hits the ground in 25 seconds. And I would say to them: are you sure? The tone of voice is critical here.

And it took them a couple of months to work out that I would ask “are you sure?” in exactly that same tone of voice regardless of whether or not they had the right answer, because their duty to the code was the same; it didn’t matter whether I knew they had the right answer. I was not going to be the oracle; they were going to make sure.
Tobias Macey: It’s definitely remarkable the amount of that sort of cavalier attitude that exists in the space of working with data and dealing with analysis: just assuming that because the computer says it, it’s correct, and not being critical of the processes that gave you that answer in the first place.
Shevek: And you spoke briefly about testing. The naive answer to testing is: if the target platform gives the same answer as the source platform, great, you’re golden. That is in fact the easy case. There are a lot of cases where the target platform gives a different answer to the source platform, and there’s an awful lot of reasons why that might arise, many of which have nothing to do with the translation, which was in fact accurate and preserved the semantics expressed by the source code.
Tobias Macey: It almost makes me think that people should just use CompilerWorks to trial-migrate their code to a different system, to see if it gives them a different answer and points them toward finding that they had some horrible mistake for the past 10 years.
Shevek: Well, that’s exactly why we run lineage.

You run lineage over your code and it will tell you whether you had a horrible mistake, and you don’t need a target platform for that.
Tobias Macey: And I imagine, too, that by virtue of being able to take a source language and generate a different destination language, that will also help people with doing trial evaluations of multiple different systems in cases where they’re trying to make a decision and see how it actually plays out: letting my engineers play with it, letting my financial people play with it, and seeing what the answers look like. I’m wondering what the frequency of that type of engagement is in your experience.
Shevek: Almost universal, because one of the things you have to bear in mind when you’re doing a semantic mapping is that the required semantics might not exist on the target platform. So now you’ve got a group of developers on the source platform, where you’ll find some master developer, and he will find you some heavy piece of code and say: this is the heaviest thing on the source platform.

Can you convert it to the target platform? In the old world, somebody would sit down, convert that piece of code, and say yes. But what he’s given you isn’t the heaviest piece of code for the target platform; he’s given you the heaviest piece of code for the source platform. So an engagement for us looks like: here’s all the code for the source platform.

Can you qualify the entire codebase against the target platform?
And the answer is: yes, if you hold on a minute or two, we can actually give you that answer. And then we can say: in this file over here is this operation, which is really simple on the source platform because the source platform happens to have that operator, but the target platform doesn’t and has no way to emulate it.
Tobias Macey: Definitely an interesting aspect and side effect of the varying semantics of programming languages and processing systems.
Shevek: Yes, and one of the fundamental assumptions of the compilers world is that the target platform can do the thing. This is a very interesting corner of the compilers world, because that’s not true: the target platform cannot necessarily do the thing. In the world where your language is just a grammar, a syntax stuck on top of the target executor, of course you can do the thing, because you just glue the keyword to every instruction in the target.

So yes, this is that rare case where there isn’t a workaround. It’s not just a case where the instruction set isn’t dense. I mean, even compiling C to Intel, the Intel instruction set isn’t dense: you can’t do every basic arithmetic operation on every combination of word widths, so sometimes you have to cast up, do your arithmetic operation, and then cast back down again. In databases there are conversions between platforms where there are things that simply can’t be done, and so the ability to run CompilerWorks over a code base and say whether this could even be done is golden for a customer.
Tobias Macey: It’s the side effect of SQL not being Turing complete.
Shevek: And not fundamentally having assignment.

You can sort of use sub-selects to do a little bit of functional programming, but there’s the lack of assignment, and then you end up in weird corner cases. If emulating a particular piece of semantics requires you to reference a value more than once, and the target platform doesn’t have assignment or an assignment-like operator, then you have to re-evaluate a sub-tree, where that sub-tree might, for instance, contain a sub-select with an arbitrarily complex join. Expensive is the word I was looking for; expensive is such a marvelous word in industry.
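A sketch, with hypothetical tables, of what re-evaluating the sub-tree means in practice when there is no assignment:

    -- Referencing the computed value twice forces the sub-select to appear (and potentially run) twice...
    SELECT o.id,
           (SELECT AVG(p.amount) FROM payments p WHERE p.order_id = o.id)       AS avg_amount,
           (SELECT AVG(p.amount) FROM payments p WHERE p.order_id = o.id) * 1.2 AS threshold
    FROM   orders o;

    -- ...unless the emitter restructures the query so the value is computed once (PostgreSQL LATERAL shown):
    SELECT o.id, a.avg_amount, a.avg_amount * 1.2 AS threshold
    FROM   orders o
    JOIN   LATERAL (SELECT AVG(p.amount) AS avg_amount FROM payments p WHERE p.order_id = o.id) a ON true;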
Tobias Macey: Or if the sub-select happens to be evaluated at two different points in time, where the query does not have snapshot isolation and somebody inserted a record in the midst of the query being executed.
Shevek: Yes, so you’ve got repeatable read, and then you’ve got things like stable functions. Say the thing that you had to duplicate, for instance, read the clock. Most database servers are smart about this: they will actually publish multiple clock functions. One of those clock functions reads the time at the beginning of the query and compiles that time into the query as a constant, so that when you do something with respect to now, you are always treating the same now, even if your query takes a minute to run. But they will often also have another clock function, which means the actual millisecond instant at which the mutator hit that opcode. Now you get customers who confuse the two, and sometimes it matters and sometimes it doesn’t, and if you’re running on a parallel database server or whatever, you start to get different answers.
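PostgreSQL’s flavor of that distinction, for instance (the function names differ across engines):

    SELECT now()             AS tx_start,   -- frozen at the start of the transaction
           clock_timestamp() AS right_now   -- the actual instant this row is evaluated
    FROM   generate_series(1, 3);
    -- tx_start is identical on every row; right_now can differ from row to row.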

So yes, it’s not just about data. I think what I’m doing here is broadening one’s view of what isolation and sub-tree duplication and so on really do to you.
Tobias Macey: So I’m sure that we could probably continue this conversation ad infinitum, but both of us do have things to do, so I’ll start to draw us to a close. To that end, I’m wondering if you can share some of the most interesting or innovative or unexpected ways that you’ve seen the CompilerWorks platform used.
Shevek: I think the ones that we love most are the ones in the lineage product where, because we show consequences at a distance, somebody is looking at maintaining a column, and we say it affects such-and-such a business report.

You probably should think before you do that. And the user says: no, it doesn’t, it can’t possibly. And then they click the button in CompilerWorks that says “explain yourself,” and we say: this is how it does it.

And they have that moment. I think the best mail that I get, and we get it quite often, is not just where we gave the user a revelation; it’s where we gave the user a revelation that fundamentally disagreed with what they believed about their infrastructure and really opened their eyes to it.

Those are the ones that I most enjoy. The ones where people run it because it’s accurate and they say OK, if I do this, I’m not going to go to jail, that’s great, but the ones where it contradicts them and substantiates itself are the ones that I love. 
Tobias Macey: I could definitely see that being a gratifying experience. In terms of your experience of building the CompilerWorks system, working with the code, and working with the customers, what are some of the most interesting or unexpected or challenging lessons that you’ve learned in the process?
Shevek: There’s an interesting definition of technical debt.

Where we define technical debt as not a thing that’s done wrong, but a thing that’s done wrong that causes you to have to do other things wrong. It’s really only a debt that matters if you have to pay interest on it in that sense. So knowing when to incur technical debt and how much interest you’re paying on it. Compilers I’ve compared to playing snooker. I can go up to a snooker table and I can roll the ball into a pocket.
If I’m lucky, I can hit that ball with another ball and get it to go into a pocket.
I’m not skilled enough to have a stick and hit the first ball with a stick so it hits the second ball, so it goes into a pocket, and that’s a skill of three levels of indirection, which I don’t have.

With compilers you have to be forever thinking about everything you're doing at three levels of indirection, because you're writing the compiler, and a lot of customers don't even send you the code. The customer says my data was 42 and it should have been 44, and I'm not even necessarily going to tell you what the code was, and now you have to fix it in the compiler. So now you're playing snooker blindfolded. That's tough.
And one of the things that comes out of this is that the majority of the code that ever runs through your compiler is never going to show up in a support issue; you're never going to see it. What this means is that you mustn't cheat. Do it right and prove it right. 
Tobias MaceyAnd for people who are interested in performing some of these platform migrations or they want to be able to compute and analyze the lineage of their data infrastructure, what are the cases where CompilerWorks is the wrong choice? 
ShevekThere are cases we come across where the target platform has so little resemblance to the source, by which I mean the desired target code has so little resemblance to the source, that it's not really a migration. It's version two of your product.

We come across people who try this, and for that CompilerWorks is the wrong choice, and we would tend to advise those people against it. People use a platform migration as an opportunity to do a version two of the product, and this isn't always as good an idea as it seems. You might want to consider separating the platform migration from the version-two product, because any developer who's been around the block a couple of times knows that the double set of unknowns is going to bite you. So we speak to people and we say: do the migration apples for apples, and then do the maintenance on the target platform. The other thing about not doing a relatively clean migration is that you no longer have a test suite; you can't compare target platform behavior with source platform behavior, because you explicitly specified the target platform behavior to be different.

So we tend to advise people to do one thing at a time, but if you wanted to do them both together, we would start to lose relevance. 
Tobias MaceyAs you continue to iterate on the platform, work with customers, and build out capabilities for CompilerWorks, what are some of the things that you have planned for the near to medium term, or any projects that you are particularly excited to work on? 
ShevekLots more languages and shortening the time to the “ah ha” moment.

The latest version of CompilerWorks that we've shipped has a completely redesigned user interface for lineage, and we've done a lot of work there to put exactly the right things on screen so that you can look at the screen and the answer to your question is there.

That's hard work. That's visual design work. It's got nothing whatsoever to do with compilers. And now you've got a compiler team saying we have to do all of this user psychology and so on to do the visual design. Then it goes back into the compiler team. Say a user has an error somewhere in their data infrastructure and they want to know how to fix it. Really what they want is the list of tasks they have to perform, in order, to fix that. We can produce that, but we can also make it visible why, and justify it.

But the moment you've gone out to the front end and decided that that's what your user story is, you have to go back into the back end and make sure that the back end is generating all of the necessary metadata to feed into the static analysis so that that visualization can be generated. So it's a very tight loop between user story, visualization, front end, and pretty hard-core compiler engineering. 
Tobias MaceyAbsolutely. And at the risk that this is probably a subject for another podcast episode entirely, what are some of the applications of compilers that you see potential for in the data ecosystem specifically, that you might decide you want to tackle someday? 
ShevekThere are applications of compilers that I particularly enjoy.

One of my favorites, which is on GitHub, is the QEMU Java API. What it does is take the QEMU source code, which itself has a sort of JSON-ish preprocessor, and run another compiler that compiles that JSON-ish code into a Java API which allows you to remote control a QEMU virtual machine. Now, one could have sat down, tracked QEMU, and written this thing longhand, saying I'm going to write a remote control interface to QEMU so that I can add disks and remove disks and so on on the fly. But it made far more sense to do it as a compiler problem, because it now tracks the maintenance of QEMU. They add a new capability? Well, guess what: you rebuild your Java API by running this magic compiler over QEMU and you've got a new QEMU remote control interface.

And the reason that I particularly loved that piece of code as an application of compilers is that now I can write a JUnit test case that runs in Gradle, in JUnit, in Jenkins, in all of my absolutely standard test infrastructure, which, using pure Java, fires up three racks of computers, connects them together with a network topology, installs a storage engine on them, writes a load of data to the storage engine, causes three hard drives to fail, and proves that the storage engine continues operating in the presence of two failed hard drives – all in JUnit.

Now, normally when people start talking about doing that sort of infrastructure testing, they have to invent a whole world and a whole framework for doing it. And yet one 200-ish-line compiler run over the QEMU source code gave you the capability to suddenly write a simple, readable test in the standard testing framework that allows you to do hardware-based testing of situations that don't even arise in the normal testing world.
That’s where I start to love compilers as a solution to things, and that’s why I think
I will always have a thing for compilers, whether it’s data processing or not. 
Tobias MaceyYes, it's definitely amazing.

The number of ways that compilers are and can be used, and the amount of time that people spend overlooking compilers as a solution to their problem, to their detriment and at the extreme cost of the time and effort put into over-engineering a solution that could have been solved with a compiler. 
ShevekPeople think of it as something you could do by hand: I could sit down and write JNI bindings for libOpenGL. But how many function calls are there in OpenGL? If I actually want to call OpenGL from Java, I probably need to generate a Java binding against the C header file for OpenGL. That's several thousand function calls, and that's a job for a compiler – and it happens to be a job for a C preprocessor as well. I think I know which one they used. 
Tobias MaceyAlright, well are there any other aspects of the work that you’re doing at CompilerWorks or the overall space of data infrastructure and data platform migrations that we didn’t discuss yet that you’d like to cover before we close out the show? 
ShevekI think we should have a long talk about compilers in the abstract sometime, because we'll get into very rich, probably very opinionated, and probably very detailed territory.

One thing I will say is don't be afraid to learn. One of the things that I think makes me a little bit odd in this world is that I actually didn't study all of the standard reference works.

We study a lot of history, but most of the people who slapped these things together didn't study the standard reference works. So by all means take the course. I had some excellent professors whom I loved who put us through the standard compilers course, but I will say that the standard compilers course, and even some of the advanced compilers courses that I've watched because universities have been publishing them online, don't really touch on this. They don't really want to touch on type checking; they just about do basic things like flow control. Get out there and learn and be self-taught and dig into it – and don't be afraid to do that.

And I have a shelf of books from the day I went to one of the publishers and said give me every book you have on compilers. It's one of my intentions to read them one day, but I haven't yet. So my closing thought would be: even if it's not compilers, whatever it is, go for it. 
Tobias MaceyWell, for anybody who wants to follow along with you and get in touch, I'll have you add your preferred contact information to the show notes. And as the final question, I'd like to get your perspective on what you see as being the biggest gap in the tooling or technology that's available for data management today. 
ShevekI think as we move away from some of the enterprise languages and into particularly the dataflow systems that we've got these days, we are moving into a world where the languages become harder to analyze and maintain. We're accessing underlying platform semantics through APIs, not through languages.
And I think that that is going to have a cost. I predict doom. 
Tobias MaceyDefinitely something to consider. 
ShevekIt might be strawberry-flavored doom. I don't know. 
Tobias MaceyAlright, well it has truly been a joy speaking with you today, so thank you for taking the time and thank you for all of the time and effort you’re putting into the work that you’re doing at CompilerWorks. It’s definitely a very interesting business and an interesting approach to a problem that many people are interested in solving. So thank you for all of the time and effort on that, and I hope you enjoy the rest of your day. 

Source: https://www.dataengineeringpodcast.com

Lineage Demo

CompilerWorks Lineage Live Demo


The demo environment you're about to log in to lets you explore the lineage of a sample data warehouse with tables, columns, SQL statements, ETL pipelines, and users.

Check out the Quick Start Guide following the registration form below for an overview. And don’t forget to save your password. You can come back and log in any time to continue exploring.

Quick Start Guide

After signing in, you are presented with the instance panel showing the demo instance status.

Click Open to start the demo.

Launching the demo displays the Lineage entry screen with a search bar and three sample search options.

On your first visit to the demo, select from the pre-defined table searches to go to the Data Flow panel.

Diagraming Table Lineage

The Data Flow diagram shows all upstream and downstream tables, with the table you selected centered and highlighted in purple. In addition to tables, this display shows the names of pipelines that affect this and other tables.

Scroll to zoom in and out of any diagram. Click any element to refocus the diagram.

Clicking on the Columns icon reveals the columns in each table.

Tracing Column Lineage

Clicking a column name lets you trace its lineage.

Click on an edge (a line with an arrow) to get a semantic summary of the action connecting two elements.

Displaying SQL Statements

Click on Code in the upper right corner of any diagram panel to display the SQL statements associated with elements in the diagram. Use the timeline scrub bar to trace previous executions of the SQL code.

Use the Back button or table name and pipeline name links to recenter the display on the Data Flow diagram.

Integration and the Lineage API

The Live Demo is a fully functioning copy of the Lineage user interface and the Lineage GraphQL API. The API lets you integrate Lineage with other data warehouse management tools and in-house apps. The demo includes the GraphiQL console: an IDE for writing and validating GraphQL queries run directly against the Lineage graph. The console supports syntax highlighting and validation, typeahead suggestions, tab completion, and interactive documentation.

The Lineage graph is built automatically using SQL logs and ETL pipelines from your production environment and automatically updates as your production environment changes.

At no point does Lineage touch your data.

This Quick Start Guide gives you a taste of the capabilities of Lineage. Explore the demo at your own pace to discover more.

CompilerWorks Lineage gives you the visibility to make informed decisions about authenticity, governance, and data migration. No other data lineage tool offers the breadth, visualization, automation, and integration of CompilerWorks Lineage.


“Without CompilerWorks software, we would not have been able to migrate our critical risk models in the targeted timeframe. Lineage reduced our EDW TCO by $12 million per year.”

Marcel Kramer, Director of Data Engineering | ABN AMRO Bank N.V.



Interested in a demonstration of Lineage in your environment?

PayPal’s Moment of Truth


Big Tech redeemed itself during the COVID-19 pandemic. E-commerce companies stepped up and became a de facto part of our critical infrastructure, handling the huge increase in online business without missing a beat.

The new normal was a boon for companies like Amazon, Instacart, and Uber Eats, but at PayPal it was reason for panic.

PayPal data analytics and data science teams are responsible for compliance, risk processing, and fraud protection, among other things. As a financial services company, these workloads are business-critical.

Breached SLAs and Delayed Decision Making

Record-breaking daily payment activity had impacted data warehouse ETL processing, causing SLAs to be breached and delaying analytics-dependent business decisions.

PayPal’s on-prem data warehouse infrastructure could not keep up. Data engineers decided the most scalable path forward involved migrating the Teradata data warehouse to the cloud.

Several PayPal workloads had already moved to Google Cloud Platform. After a short evaluation, data engineers opted to move the warehouse to Google BigQuery.

The first step in the migration project was to scope the workload.

CompilerWorks and BigQuery

Using the CompilerWorks Lineage solution, PayPal's team processed Jupyter Notebooks, Tableau dashboards, and UC4 logs to create a lineage graph showing all tables, schemas, scheduled jobs, notebooks, and dashboards.

Data warehouse users validated the Lineage output to confirm active workloads. Redundant and duplicate processes were then deprecated. This significantly reduced the migration workload.

Eliminating Tedious and Error-Prone Manual Processing

Data engineers then used CompilerWorks Transpiler to recreate Teradata DDLs, DMLs, and SQL code in BigQuery.

Transpiler automation was critical to eliminating error-prone manual intervention and easing the PayPal data warehouse users’ transition to BigQuery. In all, Transpiler converted over ten thousand SQL queries in users’ jobs, Tableau dashboards, and Jupyter Notebooks.

During testing and validation, Transpiler continued to poll the on-prem Teradata infrastructure for changes and synchronize these with BigQuery.

The data engineering team now has 15 petabytes stored in Google BigQuery and an additional 80 petabytes in GCP. With CompilerWorks' help, PayPal data warehouse users have transitioned well to BigQuery and are enjoying the improvements in query performance and load times.

To find out more about PayPal’s transition to BigQuery, read Romit Mehta’s superb write-up on Medium.

Syngenta’s Search for a Single Source of Truth


Syngenta is a global AgTech company dedicated to helping millions of farmers around the world safely and sustainably grow high-quality food, feed, fiber, and fuel. The company’s 26,000 employees in 100 countries use world-class science to transform how crops are grown and protected. In 2020, Syngenta had $14.3B in global sales and devoted 10% of revenue to R&D.

Syngenta achieved its position as a world leader in AgTech through innovation and a succession of successful mergers and acquisitions. The downside of this growth path was a fractured IT environment.

For Syngenta's data scientists and visualization engineers, the siloed IT environment presented real problems. Like an iceberg, only 20% of the company's data from trials, R&D, and field observations was visible. But ensuring the provenance of this visible tip required navigating the 80% of arcane, siloed data and processes that were obscured from view.

Syngenta's senior data architect found a reproducible and scalable way to integrate the silos of data with the help of CompilerWorks. Creating a set of data marts on Amazon Redshift, Syngenta assimilated data from 60 different sources, integrating the entire technology, process, and product lifecycle of the Syngenta Group. CompilerWorks Lineage provided transparency, enabling data scientists to see where their data originated. Data lineage was presented in a simple, understandable way, giving confidence in the new data source.

The landscape of assimilated data, processed through CompilerWorks, let Syngenta’s data scientists focus on the business meaning of the data in a trusted, agnostic way, providing data observability without ever having to dig into the source.

To read more about how CompilerWorks Lineage helped Syngenta automate data management and gave data scientists confidence in their data, read the expanded Syngenta case study on InformationWeek.

Cloud Data Warehouse Accelerating Data Engineering and Cloud Transformation at ABN AMRO Bank


ABN AMRO Bank N.V. is the third-largest bank in the Netherlands. Headquartered in Amsterdam, it provides financial services to more than a quarter of the Dutch population. The bank employs 19,000 people and has over $465.8 billion in assets. 

In 2019, ABN AMRO began an IT digital transformation and modernization project to grow the number of teams working in the cloud. The project included migrating a 90TB on-premises, appliance-based Teradata enterprise data warehouse (EDW) to Microsoft Azure.

The data warehouse migration from Teradata EDW to Azure Data Factory’s (ADF) platform as a service (PaaS) architecture promised to optimize costs. It was also a hedge against a looming 2021 end-of-support deadline for the Teradata appliance. 

Regulatory Constraints Delay Migration Deadline

The bank identified 62 end-user groups using the Teradata EDW and asked them to re-engineer their workloads on Azure as part of the data platform modernization project. Unfortunately, many of the workloads were over ten years old and involved critical business logic with little, if any, documentation. 

The prospect of rewriting complex code from scratch was daunting. A significant number of the workloads were risk models used by the bank to satisfy data governance regulations such as Basel III/IV, ECB, DCB, AFM, and GDPR. 

Any changes to the highly regulated business logic would require regulators to re-validate the models: a time-consuming process that would delay the cloud transformation project. 

CompilerWorks Slashes Time to Re-Engineer Business Logic

ABN AMRO turned to CompilerWorks for help. Using a combination of the CompilerWorks Lineage and Transpiler solutions, the bank's DevOps teams extracted the business logic used by risk models and recreated the workloads on Azure's cloud data warehouse. 

An on-screen comparison of the Teradata and Azure datasets demonstrated to regulators that both platforms were using the same business logic, eliminating the need to re-validate the models.

Using CompilerWorks Lineage and Transpiler solutions enabled the bank to condense four years’ development work into one year. Through an automated migration, critical workloads were successfully moved to Azure on schedule, reducing EDW’s total cost of ownership (TCO) by more than 10 million euros ($12 million) per year.

In addition to the financial benefits, CompilerWorks brought clarity to ABN AMRO's re-engineering efforts. IT teams and end-user groups acquired a much deeper understanding of the business logic used in their risk models. Retiring the Teradata EDW on schedule enabled the bank to adopt a modern cloud-based data architecture years earlier than would otherwise have been possible.

Decentralized Data Models and Improved Compliance

For the future, ABN AMRO anticipates providing end-user groups from across the enterprise with access to data through a self-service data marketplace. The marketplace will include information such as data quality, ownership, and lineage over time, letting data consumers prove internal and external compliance at any point.

The bank also has plans to use CompilerWorks Transpiler to support future cloud database migrations. Transpiler will let users compare translated code in different SQL dialects and perform risk assessments before migrating to a new platform.

ABN AMRO Bank key outcomes:

  • Migrate workflows from Informatica PowerCenter to Azure Data Factory pipeline
  • Slash migration time and enable user groups to meet target completion dates 
  • Enable nearly $12 million per year in IT CAPEX/OPEX savings 
  • Provide upstream and downstream automated lineage transparency 
  • Achieve one-to-one transpilation from Teradata SQL to Azure SQL 
  • Simplify data architecture by identifying unused data and resources 
  • Bring clarity to business logic, streamlining regulatory compliance through automation
  • Save time, reduce cost, and hedge risk for large-scale data pipeline migrations

Find out more about how ABN AMRO Bank uses CompilerWorks Lineage and Transpiler solutions to accelerate cloud transformation, reduce costs, and maintain regulatory compliance. Read the full customer success story here.

How Lyft’s Amundsen App is Scaling Data Discovery with CompilerWorks Lineage


Applications that provide a search service for the things we need are the way of the future. One of the most popular application categories today is ride-sharing apps like Lyft and Uber.

Founded in 2012, Lyft has quickly become one of the largest transportation networks in the United States and Canada as the world shifts away from car ownership towards transportation-as-a-service.

Their mission? To improve people’s lives with the world’s best transportation. Lyft is making good on that mission with a transportation network that includes ridesharing, bikes, scooters, car rentals, and transit all available from a single phone app.  

Challenges in Scaling the Lyft App

With so much growth, making wise use of the data flowing into the application requires technology that can support it. Lyft relies on a cloud infrastructure built on Amazon Web Services (AWS), including Amazon Elastic Compute Cloud (EC2) and Amazon Simple Storage Service (S3).

Initially, Lyft's front-end service depended on Amazon's Redshift data warehouse and Kinesis message bus as data stores, but the company encountered issues scaling the application to keep up with the volume of frequent users due to the tight coupling of compute and storage. To resolve this, they elected to migrate from Redshift to Apache Hive on the AWS cloud.

With the constant influx of new datasets from various sources, including SQL tables, Presto, Hive, Postgres, as well as dashboards in BI tools like Mode, Superset, and Tableau, Lyft had little insight into their data lineage and the impact of changes in their data flow and access.

To maintain their upward mobility they knew they needed fast, flexible access to data to power their application and services, visualize information flow, identify and monitor errors, and conduct impact analyses of changes to their data.

Lyft Creates Amundsen Tool to Improve Data Access

To provide faster access to the targeted data users need, Lyft, along with co-creator Mark Grover, developed a backend data discovery tool named Amundsen (after the Norwegian explorer Roald Amundsen).

Amundsen is an open-source data discovery and metadata engine that enables data science engineers and software engineers to gather necessary data from numerous pipelines into a central place and improve their productivity by up to 20%.

Amundsen data builder enables users to:

  • Search for data assets with a simple search-engine using a PageRank-inspired search algorithm that recommends results.
  • Use a metadata service to view curated data, including user information such as statistics and when a table was last updated.
  • Learn from others by seeing what data your co-workers use most, common queries, and table-based dashboards.

Data sources can include:

  • Data stores like Hive, Presto, MySQL.
  • BI/reporting tools like Tableau, Looker, and Apache Superset.
  • Events and schemas stored in schema registries.
  • Streams like Apache Kafka and AWS Kinesis.
  • Processing information from ETL jobs and machine learning workflows.

Unfortunately, one drawback to Amundsen is that the data it pulls is represented in a static table format with little insight into where it came from and how it’s being used—like a glossary with no definitions.

Users must then try to fill in the gaps themselves through manual mapping of data lineage which can prove time-consuming and rife with error.

Improving the Data Model with CompilerWorks Lineage

To give users greater ability to trace the lineage of data from its various sources in Amundsen, Lyft employed CompilerWorks Lineage to better understand what data is being used, by whom, for what, and how it was processed.

Since it was deployed in 2018,  CompilerWorks Lineage has become an integral part of the success of Lyft’s data scientists, engineers, and business users.

CompilerWorks Lineage use cases include:

  • Data Exploration
  • Data Quality
  • Pipeline Migration
  • Cost Control
  • Usage Tracking and Reporting
  • Onboarding New Data Analysts, Data Engineers, and Scientists.

CompilerWorks Lineage and Lyft Amundsen combined enable users to: 

  • Deliver data lineage transparency and literacy
  • Enable cost-effective, confident data migrations
  • Reduce risk posed by corrupt or inaccurate data resources
  • Optimize compute resource utilization, saving millions
  • Improve workflow productivity at every level
  • Ensure data accuracy

To learn more about how Lyft is using CompilerWorks Lineage to increase data transparency, accuracy, cost efficiency, and productivity, read the full customer success story here.

MLB’s fan data team hits it out of the park with Teradata to BigQuery modernization


Read a comprehensive “blow by blow” description of Major League Baseball’s platform modernization project to migrate their EDW from Teradata to Google Cloud’s BigQuery.

A major step in the process is migrating ETL scripts from Teradata SQL to BigQuery SQL. Quoting Rob Goretsky, VP of Data Engineering at MLB: "The SQL transpiler from CompilerWorks was helpful, as it dealt with the rote translation of SQL from one dialect to another."

Read the full article by Rob here.

Migrating Teradata to BigQuery – Out with the old, in with the new


You are ready to migrate your data from Teradata to BigQuery as quickly and efficiently as possible. 

You’re looking for a solution that doesn’t require you to manually migrate code, risking human error that can slow down your migration. 

In this guide, we discuss:

  • Why you shouldn’t migrate manually
  • How CompilerWorks offers simple solutions to your migration needs

Keep reading to learn more about data migration from Teradata to BigQuery and what your business can do to speed up the process.

Table of Contents

  • Teradata To BigQuery Migration – The Manual Way 
  • Potential Problems With Manual Migration
  • What is CompilerWorks?
  • CompilerWorks Platform Migration Benefits
  • CompilerWorks’ Best Migration Practices
  • Simplify Your Teradata to BigQuery Platform Migration With CompilerWorks

Teradata To BigQuery Migration – The Manual Way 

83% of all data migrations fail to meet the organization's expectations or fail completely. This is usually because, when an organization or business starts a code migration, it is not aware of the fundamentals that make up a code migration. 

Manual code migration is the most common migration method, but it is also the most time-consuming and the most error-prone.

Converting Teradata code involves reading the code, understanding what it is doing, and manually converting it.  

Migrations can be complex and can become multi-year projects. 

If you are going to migrate from Teradata to BigQuery manually, there are a number of steps to take. Google refers to this as its migration framework, which involves:

  • Preparation
  • Planning
  • Migration
  • Verification and Validation

Preparation

Prepare for your migration: conduct an analysis and ask questions like: 

  • What are your use cases for BigQuery? 
  • What databases are being migrated? What can be migrated with little effort?
  • Which users and applications have access to these databases?
  • How is the data being used?

Planning

Start planning your migration by:

  • Assess the current state
  • Create a backlog
  • Prioritize use cases 
  • Define your measures of success
  • Define "done"
  • Design a proof of concept (POC) 
  • Estimate the time and cost of the migration

Migration

It's important to keep in mind that BigQuery and Teradata have different data types, so conversions may be needed. 
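As an illustrative sketch only (the table and columns are invented, and the right target types always depend on precision and usage), here is a Teradata definition next to one plausible BigQuery equivalent:

  -- Teradata (source)
  CREATE TABLE sales.orders (
      order_id  INTEGER,
      amount    DECIMAL(18,2),
      order_ts  TIMESTAMP(0),
      customer  VARCHAR(64)
  );

  -- One plausible BigQuery equivalent (target)
  CREATE TABLE sales.orders (
      order_id  INT64,
      amount    NUMERIC,    -- DECIMAL(18,2) fits within NUMERIC precision
      order_ts  TIMESTAMP,
      customer  STRING      -- BigQuery strings are not length-constrained by default
  );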

Manually converting code is a tedious and difficult process that leaves a lot of room for human errors.

Then, you'll perform either an offload migration or a full migration.

Verification and Validation

After converting the data, you have to test all the code to make sure everything is working properly. Teradata migrations involve testing millions of lines of code in order to ensure that everything is running correctly.

Potential Problems With Manual Migration

Manual migration isn’t an easy task — a number of problems can arise. 

Teradata is one of the most complex systems on the market. In a manual migration, substantial amounts of code need to be added, read, and understood in order to work around syntax that has no direct equivalent on the target platform.

During this process, human error is inevitable. Code errors can delay your migration project for weeks or even months.

Instead, CompilerWorks eliminates human error by relying on smart technology to provide the same accurate results every time.

What is CompilerWorks?

CompilerWorks has developed a powerful solution that accelerates migration to the cloud. This solution covers: 

  • Structuring of the migration project
  • Automatic and accurate SQL code migration
  • Automated testing and verification

This technological solution involves two core applications: 

  1. The Transpiler Solution: This aids in the migration of SQL code between platforms.
  2. The Lineage Solution: This provides detailed insights concerning how data is used across an enterprise, including by whom, for what, and at what cost.


CompilerWorks’ Core Technology

CompilerWorks' core technology ingests source code and converts it into an Algebraic Representation, which mathematically represents what the ingested code does.

Traditional compilers only work when given the complete code and full description of the execution environment. However, it’s impossible to meet these requirements in the realm of data processing code. 

In order to overcome this obstacle, CompilerWorks' software makes the same intelligent inferences that a human would and then reports these deductions to the user. 

Additionally, CompilerWorks' compilers can emit code in a high-level language (the Transpiler solution) and into the lineage fabric (the Lineage solution), which represents all actions of an entire code base.

CompilerWorks’ Supporting Infrastructure

In the real world, code rarely exists as simple .sql files. 

Database code is typically wrapped in scripts, BI reports, and ETL tools. CompilerWorks provides the tools to extract the SQL code from these various wrappers, then transpiles and re-wraps it so that it is ready for execution and testing immediately. 

The Transpiler solution embeds hundreds of transformers, including platform-specific optimizations.

The lineage fabric takes advantage of the wealth of information captured by delivering global static analysis of data processing activities and providing GUI, CLI, GraphQL, and API interfaces. The seamless integration of CompilerWorks' core technology and supporting infrastructure makes up the Transpiler solution, which delivers fast, accurate, and predictable migration between data processing platforms.

CompilerWorks’ Platform Migration Benefits

Manual code migration is one giant mess waiting to happen. Human error is almost inescapable. 

With CompilerWorks, the software scopes out the entire project at the beginning of the migration process by automatically creating a comprehensive data lineage of the source systems. This makes it possible to automatically identify gaps in the source code and avoid project delays that can stretch to months.

This automated process using the CompilerWorks Transpiler has three key benefits: 

  1. Accuracy
  2. Predictability
  3. Speed

Accuracy 

With manual migration, a series of rules are followed to rewrite a query. To ensure the query will run on the target platform, an execution test is performed. This traditional approach is prone to error. 

To be crystal clear: manually rewriting code can always lead to errors that go undetected by basic testing strategies. 

Instead of this approach, the Transpiler is designed to produce the same answer on both the source and target systems. 

Unlike human-driven conversions that can provide unpredictable results, the Transpiler provides accuracy by giving you the same correct answer, every time.

Predictability

With the CompilerWorks’ Transpiler, you can expect a predictable end-to-end solution for managing and executing platform migration projects.

In a code migration project, code must be: 

  • Located
  • Extracted
  • Converted (applying code transformations)
  • Tested and validated

By processing the execution logs from the source system, the Transpiler systematically and immediately identifies: 

  • Code that is missing from the source provided for the migration project
  • Functionalities on the source system that need to be replicated on the target system
  • Any gaps in functionality in the target system that will need human intervention to migrate

The result? 

  • No more surprises in the migration project. 
  • No re-scoping because a new functionality/code is found.
  • No delays caused by missing functionality in the target system that was discovered half-way through the migration project. 

Beyond the predictability created by transpiling all of the code in the planning stage of the migration project, the lineage model provides a roadmap for structuring the migration project.

CompilerWorks offers the ability to strategically plan where you want to start your migration project and then provides guidance to order the migration in the most efficient and expeditious way possible. 

Speed

The Transpiler delivers performant and accurate code at lightning speed. 

CompilerWorks can reduce the time spent on a migration project by 50% or more.

This is because the compiler has an understanding of all the nuances of the code being converted and the capabilities of the platform that it is generating code for. This information is used to generate performant code for the target platform.
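For a single, simplified illustration (not a description of CompilerWorks' actual output, and using a hypothetical order_date column), consider a date-arithmetic idiom that has to be rewritten between dialects:

  -- Teradata: shift a date forward by three months
  SELECT ADD_MONTHS(order_date, 3) FROM sales.orders;

  -- BigQuery: the same intent, expressed with DATE_ADD
  SELECT DATE_ADD(order_date, INTERVAL 3 MONTH) FROM sales.orders;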

As a testament to the Transpiler’s speed, CompilerWorks’ largest customer compiles 10TB of SQL on a single machine, on a daily basis. 


CompilerWorks’ Best Migration Practices

CompilerWorks’ Transpiler solution offers four key migration best practices: 

  • Structured migration
  • Iterative process
  • Integrated testing
  • Security review

Structured Migration

The CompilerWorks lineage fabric guides the entire migration project. 

Instead of manually reviewing the code to try to understand discrepancies between queries, relations, and attributes, CompilerWorks automates the process and provides a rich user interface to plan the migration project. 

If you are working on a “lift, improve, and shift” migration, the lineage model will immediately show you where you can wipe out unused processing and data, while also directing you to modifications in the data processing landscape that make the most logical sense.

 If you are working on a “redesign, re-architect, and consolidate” migration, the lineage model will provide the information (from across multiple source systems) to drive the entire migration project, which is made possible by the Transpiler itself.

An ideal approach to “lift and shift” migration involves these eight steps: 

  1. Select a key management report that you wish to migrate.
  2. Discover all immediate upstream requirements by reviewing the lineage.
  3. Transpile the upstream table DDL for the target system.
  4. Execute the translated DDL on the target system.
  5. Copy the required data.
  6. Execute the transpiled DML.
  7. Execute the provided verification queries.
  8. Use the lineage model to guide the next level of migration (loop back to step 1).

Iterative Process

To deliver a complete migration solution, CompilerWorks leverages the core capabilities of the transpiler.

This solution enables the testing of multiple migration strategies and selects the best approach for the particular migration project involved.

The iterative process works in five steps: 

  1. Assemble all inputs.
  2. Configure the transpiler as desired.
  3. Execute the transpiler.
  4. Inspect the outputs.
     • If missing inputs are discovered, loop back to step 1.
     • If the transpiler configuration needs tuning, loop back to step 2.
  5. Copy the required data.

This fast cycle in the iterative process enables experimentation so you can compare/test the code in order to best meet your requirements.

Integrated Testing

In integrated testing, the transpiler generates a comprehensive suite of test queries to validate DML and DDL migration. 

Integrated testing works in four steps: 

  1. Create the table on the target system.
  2. Compare SourceReadDQ to TargetReadDQ.
  3. Execute the pipeline on the source and target systems.
  4. Compare SourceWriteDQ to TargetWriteDQ.

To facilitate automation of test query execution on both the source and target systems, the test queries are compiled in a machine-readable file. Correct migration is confirmed by the verified execution of the test query suite. 
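The generated test queries themselves are not shown here, but as a rough, hand-written sketch of the idea, a read-side check might compute the same aggregate fingerprint on the source and target and compare the results (the table and columns are the invented examples used earlier):

  -- Run the same query on Teradata and BigQuery; matching results
  -- (row counts, distinct keys, sums, date bounds) suggest the
  -- migrated table and pipeline behave the same way.
  SELECT COUNT(*)                 AS row_count,
         COUNT(DISTINCT order_id) AS distinct_orders,
         SUM(amount)              AS total_amount,
         MIN(order_ts)            AS first_order,
         MAX(order_ts)            AS last_order
  FROM sales.orders;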

Security Review

With CompilerWorks, security reviews are a breeze. All of CompilerWorks' software is designed with security as a top priority:

  • CompilerWorks never touches data. It only processes code.
  • CompilerWorks is a standalone package that can run on an air-gapped machine.
  • CompilerWorks generates clean logs: values are obfuscated.
  • CompilerWorks has frequent updates.

CompilerWorks leaves zero footprint.

Simplify Your Teradata to BigQuery Platform Migration With CompilerWorks

The CompilerWorks Transpiler solution is the logical choice for simplifying and ensuring the success of your platform migration. Turn your large, high-risk, slow, manual migration from Teradata to BigQuery into a predictable, fast, accurate, and painless automated process with CompilerWorks.

CCPA vs GDPR: A Comparison Guide

You’ve been searching for ways to remain in compliance with the GDPR and the CCPA.

Understanding the difference between CCPA and GDPR can be complex. There is so much information to sift through and it’s overwhelming. 

How do you know if your business is compliant with one, both, or neither?

For businesses operating in California, it’s important to understand both and what they mean for your business. A simple “notice and choice” option for consumers is not enough to give consumers rights over their information.  

There are ways to check whether your company needs to comply with GDPR, CCPA, or both, and what this looks like. 

In this guide, we are going to explain how to tell which regulation applies to your business, how to learn if your business is compliant, and how we can help you achieve it.

Table of Contents

  • What are GDPR and CCPA?
  • GDPR
  • CCPA
  • Key Differences Between GDPR and CCPA
  • What CCPA and GDPR Compliance Guidelines Mean For Your Business
  • How CompilerWorks Can Help 
  • Enabling GDPR and CCPA Compliance With CompilerWorks


What are GDPR and CCPA?

The GDPR and CCPA are essential data privacy laws that affect businesses around the world. They both protect consumers’ privacy. 

This is great news for consumers. 

The bad news is, compliance with the General Data Protection Regulation (GDPR) does not guarantee compliance with the California Consumer Privacy Act (CCPA).

We’re going to briefly explain each and how this may affect you.

GDPR:

On May 25th, 2018, the European Union passed one of the toughest privacy and security laws in the world: The General Data Protection Regulation (GDPR). This law applies to anyone that targets or collects data related to people in the EU.

Let’s say that your enterprise tracks EU visitors to your website. You see that their IP address falls in EU territory.

You might want to know:

  • Their browsing activity
  • The kind of computer they’re using
  • Other accounts they’ve logged into
  • Information from tracking cookies etc. 

Companies that do this are now within the legal scope of the GDPR. Many major U.S.-based companies are affected.

Here’s how it’s supposed to work.

Rights transparency is central to the GDPR. 

The GDPR requires companies to inform consumers about types of data being collected about them, and why. Consumers had to agree to many updated terms of service by that deadline, May 25th. 

If they didn’t, they could no longer use that site. 

If a business doesn't comply, the penalties can be steep: up to 4% of the company's global annual revenue or 20 million euros, whichever is higher. 

There are some exceptions to the rule.

If you're collecting email addresses and contact information to organize a birthday party, the GDPR will not apply to you. It applies solely to professional or commercial activity. 

There are some limits to these exceptions. 

CCPA:  

In 2016, former California Attorney General, Kamala Harris, released a report detailing a data breach that affected about 49 million California residents.

This shined a spotlight on the need for greater security on the web. And with an economy bigger than the UK's, California needed its own solution.   

The 2020s have become the decade where the U.S. really gets serious about data security. The California Consumer Privacy Act (CCPA) came into effect on January 1st, 2020. Enforcement began July 1st, 2020.

The CCPA gives consumers more control over the personal information that businesses collect about them.

In preparation, you might have begun to get your house in order long ago. So who does it apply to?

It applies to any company that:

  • Operates within California, and
  • Makes at least $25 million in revenue, or
  • Has the sale of personal information as its primary business

Here's a simple list of CCPA consumer rights. Consumers have the right to:

  • Information about how their personal data is processed
  • Opt out of the sale of personal information 
  • Delete personal information
  • Non-discrimination for exercising these consumer rights
  • A private right of action for certain data breaches
  • For minors: the right to opt in to the sale of their personal information

What happens if you violate the CCPA? 

California Attorney General Xavier Becerra told Reuters in 2019: “If they are not (operating properly) … I will descend on them and make an example of them, to show that if you don’t do it the right way, this is what is going to happen to you.”

The hammer will come down.

Key Differences Between GDPR and CCPA:

The GDPR and CCPA often use different definitions, scopes, and exceptions to their regulations. For example, the CCPA defines “personal data” more broadly and includes data about devices. The GDPR focuses on specific individuals and is less process-oriented than the CCPA. 

The CCPA requires a different scope of privacy disclosures than the GDPR. 

According to the GDPR, “personal data” is defined broadly to mean “any information relating to an identified or identifiable person”. This includes things like:

  • Cookies
  • IP addresses
  • Device IDs etc.

Under the CCPA, "personal data" is expanded to include data associated with a household.

Adhering to the GDPR may not make your company compliant with the CCPA. Keep reading to look at some of the primary differences between the CCPA and GDPR.

Data Collection Practices

The CCPA and the GDPR are constantly changing and adapting to new technologies. As a result, the specific measures businesses need to take are unfortunately vague.

As you collect personal data, the GDPR requires consumer rights disclosures that cover such things as:

  • Transparency
  • Purpose limitation
  • Collecting only the minimal amount of data necessary 
  • Collecting accurate data
  • Encrypting or pseudonymizing data where possible

The CCPA has more specific consumer rights regarding the collection and sale of their data:

  • Consumers must be informed of what categories of information are being collected. This can include IP addresses, internet activity, geolocation data, education information, and more. 
  • Consumers must be notified about why this information is being collected. How will this information be used?
  • The right to request the deletion of this information must be disclosed, as well as the limitations to these rights. 
  • Are there any additional categories of data that are being collected? Any additional purposes this data can be used for? The consumer must also be notified of this. 

Additional disclosures need to be made if the information is being sold or disclosed for business purposes.  

Enforcement and Nondiscrimination Practices 

In order to compare GDPR and CCPA, it’s important to look at how infractions are assessed.

The GDPR looks at global revenue. Fines reach up to 2% to 4% of it, depending on the nature of the infraction. This can mean huge numbers, especially for some well-known companies in Silicon Valley. 

The CCPA looks at how many consumers are affected. For civil penalties, the California Attorney General may require $2,500 per violation. Intent matters. This can be up to $7,500 if the violation is intentional. 

You can take a deep breath. There is a 30-day cure period for violations once notice is given.   

The CCPA and GDPR provide consumers with a "right to non-discrimination". Under both, a business must not use collected information to discriminate against a consumer. 

What CCPA and GDPR Compliance Guidelines Mean For Your Business

In order to protect your business, it’s beneficial to compare CCPA vs GDPR compliance and find out what consumer rights apply.  

What does this mean for businesses in California?  

Non-compliance comes with some steep costs. Happily, there are no specific encryption strengths or technologies you are required to use in order to be compliant. 

GDPR Compliance 

For data collection compliance, the GDPR has 6 criteria that must be met:

  1. The data must be collected lawfully, fairly, and in a transparent manner.
  2. It must be collected for a legitimate reason and with limited purposes.
  3. It must be adequate, limited to what is necessary and relevant.
  4. Data must be accurate and kept up to date where necessary.
  5. The data must be kept in an identifiable form no longer than necessary.
  6. Data must be processed securely. 

There are checklists upon checklists to remain compliant with every section of the GDPR. Combing through the data acquired from consumers, manually looking for the purpose of each line of code, is a tedious process.

CCPA Compliance 

Here are some key suggestions in order to align with CCPA regulations:

  • You’ll need to know where your data is. You can use the cloud environment or data warehouse to manage this.
  • Encrypt or redact your data. 
  • If you’re selling personal information, be sure to track, and respond to, opt-in and opt-out requests.    
  • Offer two ways for a consumer to opt-out of the sale of their data.

Another key way to remain in compliance is to have a robust data inventory. You need to know why you have that data and who should have access to it. This requires data mapping: the process of creating data element mappings between two distinct data models.

Data mapping can help you with processes like:

  • Data migration— the process of moving data from one application to another.
  • Data integration— the process of combining data from different sources into a single, unified view.
  • Data transformation— the process of converting data from one data structure to another.
  • Data warehousing— the process of constructing and utilizing a data warehouse.

How CompilerWorks Can Help  

An ideal compliance solution must empower a data protection officer to: 

  • Have the ability to identify PII wherever it is in the organization’s data infrastructure
  • Highlight wherever PII is used for analysis
  • Have the ability to enable the destruction of PII for any selected individual across the entire organization

These requirements should not be restricted to individual departments or to certain data processing repositories; they span cross-functional areas across the entire organization. 

To solve challenges imposed by compliance, a DPO must be enabled to: 

  • Track processing and data movement across organizational and technological boundaries
  • Audit data processing and access
  • Complete comprehensive analyses of data flow

CompilerWorks offers the ideal solution to these compliance challenges, delivering compliance through the lineage fabric and the CompilerWorks Lineage solution built around it. 

Enabling GDPR and CCPA Compliance With CompilerWorks

The lineage fabric developed by CompilerWorks is generated with a standard process regardless of the application area. 

PII can be identified anywhere in an organization's data storage and processing infrastructure to deliver compliance. This allows the DPO to identify PII directly, and allows others across the organization to tag PII. 

The lineage model automatically tracks the preservation of PII across the data infrastructure. The DPO can then track PII enterprise-wide, consistent with specific enterprise policies.

By integrating the identification of PII with the lineage model, automated analyses can be enabled, such as: 

  • Tracking PII data movement at the column and row level: 
    • Data copying
    • Aggregation
    • PII "leakage"
  • Auditing of data access, which allows for specific, time-stamped identification of which users and systems view each piece of PII.
  • Destruction of PII from the data source throughout the entire data infrastructure.

This allows the DPO not only to demonstrate compliance to management and the authorities, but also to control data access and usage across the entire organization and its processing infrastructure. 

With the CompilerWorks lineage model, compliance is simplified.