Data Engineering Podcast with Tobias Macey

Overview: A major concern that comes up when selecting a vendor or technology for storing and managing your data is vendor lock-in. What happens if the vendor fails? What if the technology can’t do what I need it to? CompilerWorks set out to reduce the pain and complexity of migrating between platforms, and in the process added an advanced lineage tracking capability. In this episode Shevek, CTO of CompilerWorks, takes us on an interesting journey through the many technical and social complexities that are involved in evolving your data platform, and the system that they have built to make it a manageable task.

Transcript:

Tobias Macey: Today I’m interviewing Shevek about CompilerWorks and his work on writing compilers to automate data lineage tracking from your SQL code. Shevek, can you start by introducing yourself?
Shevek: Hi, I’m Shevek. I’m the technical founder of CompilerWorks.

I started writing compilers by accident 20 years ago. You’ve given me the introduction; the challenge was just to figure out why.

I think it’s because they’re just one of the hard problems, and an awful lot of things that are put out there as languages aren’t really languages; they’re just a syntax stuck on top of an executor. I think almost all toy languages are like that: whenever anybody says “I’ve invented a language,” they haven’t actually invented a language. There are no semantics to the language itself; they just stuck a syntax on top of an executor.
And as I’ve gone on with writing compilers, I’ve found that the real challenges are where there’s an actual semantic transformation between the language being expressed and the language of the target system, and that’s when it really starts to get interesting.

I’m not really interested in parsers. When you speak to people who learned about compilers at university, they have been taught to write parsers, and I think one of the reasons for this is that it’s really fun and entertaining to write a 12-week course about parsers: you can teach it very slowly, and you can do LL and LR and Earley’s algorithm and railroads, and all of these other algorithms that nobody in their right mind would use.

They’re theoretically interesting, but please teach people about languages and compilers and semantics, because until you’re talking about mapping between semantic domains, you’re not really doing the job.
Tobias Macey: Given your accidental introduction to compilers and subsequent fascination with them, how did you end up in the area of data management?
Shevek: This is where I’m going to be brutally honest: that’s where the money is.

What you’ve got out here in the world of data management is a vast world of enterprise languages, each of which has a single vendor. And by the nature of a single vendor, they get to charge what they please and tell you what you can do with it.
And so there’s a traditional joke about exactly what filling you would like in your sandwich. You can have this marvelous enterprise capability, but you have to have all of these restrictions with it. And it started with taking those restrictions off the enterprise language and just saying what else could we do if we had a truly generic open compiler for each of these proprietary languages.

And actually, that philosophy started a little bit earlier, because there were a lot of languages out there. There’s no point writing a commercial compiler for C these days unless you’re doing something particular, say with security or FPGAs, because fundamentally C is free, and that doesn’t necessarily make it easy.

So if you go back into my GitHub, you’ll see that one of my earliest projects was writing C preprocessor implementations natively in a whole set of languages, because back in the 90s particularly, anytime you wanted a preprocessor (this was before we had the whole modern world of assorted preprocessors and templating languages), people just used either CPP or M4. But if you were working in, say, Java and somebody defined something using the C preprocessor, you didn’t have a ready implementation.

So that philosophy of writing things that were compatible with other things and just opening up the world has been an underlying philosophy of what I’ve been building for a lot longer than data processing languages. It just turns out that data processing languages are a commercially viable place to do it.
Tobias Macey: And that is as good a reason as I need to be in the business, and so that brings us to what you’re doing at CompilerWorks. I imagine you’ve given us the prelude to the story behind it, but I’m wondering if you can share a bit more about the business that you’ve built there and some of the motivation and story behind how you ended up building this company to address this problem of lock-in by data processing systems.
 
 
Shevek: The reason you build a new database is because you have a new capability, and then we get this marvelous phrase, ANSI SQL, which is a myth about as good as Bigfoot: lots of people claim to have seen it, but nobody actually has.

And so now we come back to this question of translating between semantic domains.
I have two SQL vendors. I have code on one and I want to run it on the other.
The code is syntactically similar because ANSI wrote this great big expensive document that dropped some hints about how you might want your language to look, if you were to consider making it look like something, but the languages don’t fundamentally do the same thing.

Take a simple case: what does division mean? What’s 1 / 10? If your database server happens to operate in integers, the answer is 0. If your database server happens to operate in numerics, the answer is 0.1 with one decimal place of precision.
If your database server happens to operate in floating point, then you get something very close to 0.1, but not exactly the same, because that’s not an exact binary number. So even with that very trivial case, I’ve motivated something beyond ANSI in terms of translating between database servers.
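To make the distinction concrete outside of any particular SQL dialect, here is a minimal Java sketch of the three behaviors described above; the point is that a translator has to know which of these the source and target platforms actually mean by “division.”

```java
import java.math.BigDecimal;
import java.math.RoundingMode;

public class DivisionSemantics {
    public static void main(String[] args) {
        // Integer division: the fractional part is discarded.
        System.out.println(1 / 10);                          // 0

        // Fixed-point (numeric/decimal) division at one decimal place of scale.
        System.out.println(new BigDecimal("1")
                .divide(new BigDecimal("10"), 1, RoundingMode.HALF_UP)); // 0.1

        // Binary floating point: the nearest representable double to 0.1,
        // which is close to, but not exactly, 0.1.
        System.out.println(new BigDecimal(1.0 / 10.0));      // 0.1000000000000000055511151231257827...
    }
}
```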

CompilerWorks has two products.
One is that we translate code from one data processing platform to another, and we do it correctly, and that’s correctly with a capital C.
The other thing that we do is we compile the code and do static analysis, and we tell you what it is that you need to know about this code. A significant part of that is things like: if I change this, what will the effect be on my organization as a whole?

So you might imagine that you’re writing code that produces a particular table on a particular server, and you do or don’t make a particular error, or you do or don’t make a particular change. Somewhere, ten levels removed from you, in a different sub-organization of an organization that employs tens of thousands of people, somebody is affected by this change.

Who are they?
How are they affected?
Do they need to be told?
Did they make a critical business decision based on that?
And what do we need to do in order to keep this organization running?
And should you really be considering making this change before you’ve made it?
Or, if you already made the change, what do you now need to do to fix up all the impact?

So that’s the static analysis (the lineage product). 
Tobias Macey: Definitely a lot of interesting things to dig into there, and lineage in particular is an area that has been gaining a lot of focus and interest lately from a number of different parties, with attempts at addressing it in different ways, so I definitely like the compiler-driven static analysis motivation for it. Before we get too much further down that road: you’ve already given a bit of an overview of some of the differences between parsing and compiling, but for the purposes of laying the groundwork for the rest of the conversation, can you give your definition of what a compiler is and how that’s relevant to this overall space of language translation and data lineage?
Shevek: I usually describe a compiler as something that turns something from one language into another. In the case of lineage, what we’re doing is turning the underlying source code into an algebraic model, and then that computer algebra model is the system of which we can ask questions regarding what happened, what the consequences are, where the lineage is.

It’s interesting to think about the compiler, particularly the lineage side. Are we a compilers company or are we a computer algebra company?

I suspect really we are a computer algebra company, because that’s where the hard stuff is. Why lineage analysis is getting popular is an interesting question, because if someone writes some piece of code that says: you will generate a piece of SQL that takes this table and processes it like that and generates that table, well, I’ve got lineage from my source table to my target table.

But now we fall into a hole, which is this: if the language that I invented in order to generate this SQL is just as expressive as SQL, then that language is going to be flamingly complicated, and it’s going to have all of these semi-joins and such. Typically, the reason people write these languages is to make them not as expressive as SQL, because they want to make them more accessible to developers, or they want to have clicky-droppy boxes or something like that.

Now, assuming that you followed that road, which is basically universal, you have the problem that your language is insufficiently expressive to do the thing the analyst wants to do. And so what happens is you have this text box or this field or something where you type in a fragment of underlying SQL, and now what you’ve got is not a language but a macro preprocessor, which doesn’t actually know what’s going on in that fragment of underlying SQL that the developer typed in. And so all of these tools that start out saying, yes, you’re going to build your thing in our nice GUI workflow, we’re going to show you this nice GUI workflow, and that will give you the lineage: you don’t really have lineage, because you don’t have enough expressiveness to really do the job. Therefore the developers had to type some custom code into a box, and you don’t understand that custom code. Therefore you don’t have lineage.

And if what’s happening is that you are subject to something like CCPA or GDPR, where you are going to jail if you get this wrong, then you don’t have a lineage tool.
You actually need to look at the code that the machine really executed and analyze that code accurately at a column level and then you have a lineage model. Then you’ve got a chance of not going to jail, but anything less we do not define as lineage. 
Tobias Macey: Yeah, and then another motivation for trying to reconstruct lineage is that the so-called best practice, quote unquote, for data processing these days is to spread your logic across these five different systems, where you need to run this tool to get your data out of this source system into this other location, this other tool to process your data in that location, this other tool to build an analysis on top of that preprocessed data, and then this other tool to actually show it to somebody to make a decision based on it, and then they input other data back into this other system and we pull it back out again. So trying to reconstruct that flow of operations, using an additional set of language processing to, you know, post into an API that stores the data in a database, or trying to analyze the query logs from your data warehouse, has a very limited scope of view of the entirety of the data lifecycle, and you’re trying to piece all this back together.
Shevek: Well, the query and audit logs typically give you a good start, because they’re at least partly written by security people, and the security people say you must tell us everything that’s going on. The fragmentation of the data infrastructure, which is the other thing that you alluded to, is very real, and I think this leads to a situation where in a typical installation we’re processing multiple languages and stringing them together, and sometimes that stringing together is standardized, and sometimes it’s bespoke.

But the challenge in putting together a lineage is to be able to identify a column in the ODS, up front in, say, the web tier where a user has interacted with something, and then list all of the back-end dashboards that that column affected, and then describe the effect of that user interaction on each of those dashboards in a human-readable way.
That’s the challenge. 
Tobias Macey: Absolutely, and so to that point, you’ve mentioned a little bit about some of these enterprise processing languages and the tool-specific semantics of how they manage that processing, and you’ve discussed some of the wonderful joys of the SQL ecosystem and trying to translate across those. I’m wondering if you can give an overview of the specific language implementations and areas of focus that you’re building on top of for CompilerWorks: whether you’re focused primarily on the SQL layer and being able to generate these transformations and this lineage for the databases, or if you’re also venturing out into things like Spark jobs or arbitrary Python or Java code and things like that.
Shevek: Yes, so you run into a number of issues as you walk around the data infrastructure.

The SQL languages are for the most part, statically analyzable. There are a couple of holes in them and there are one or two that have type systems that lead one to lose hair. From there one can fairly easily go out to the BI dashboards, particularly the richer products.

So we’re talking about some of the flow based languages, and at this point as a compilers company, one ends up writing a new piece of technology because basically all of the SQL languages are tied into the relational algebra or some variant, or some set of extensions thereof.

It’s always amused me that so many of the papers on relational optimization start with a phrase something like “without loss of generality, we shall assume that the only boolean connective is AND,” which basically means that you’re not allowed to use OR, and you’re not allowed to use NOT.

Well, guess what: if you do make that assumption, the world becomes really easy and really simple, and they’re denying the existence of outer joins, and it’s really easy to write an academic research paper on optimization if you only deal with AND. But I disagree with the “without loss of generality.” You’ve lost the whole of the real world there.

Anyway, I digress, sorry. So yes, there’s a bunch of dataflow languages, and BI dashboarding systems that actually work effectively with dataflow and data process management, so here we’re talking about the Informaticas, the Tableaus, the things in that range, the DataStages. So we grew the product up to work with that ilk, and then we have a secondary core that speaks to the same computer algebra engine and deals with these dataflow-style languages.

Spark is a sort of a mixture, because on the front end Spark SQL looks like SQL. The question then is (I’m going to speak generically, not specifically about Spark SQL): what’s the strength of the join optimizer before you compile down to a dataflow language? And are you really a dataflow language? The seminal paper, I think, for people wanting to understand why dataflow languages have benefits is probably the Google Flume paper, particularly the statistics about the reduction in MapReduce jobs from doing delayed evaluation. But once you get out of that and you get into the ETL languages, you also run into things like SAS.

And so now you end up with questions like: how do you port, let’s say, Informatica to Spark? And I picked those two because they are both dataflow languages. But Informatica has this fundamental property that computation is sequential, which is to say that if you set the value of a read/write port, that value remains assigned and remains visible to the next data record, and so you can actually generate a datum by saying: if the record number is 1, set the value to X; if the record number is not 1, just read X. In an MPP system you would get X in one record and null in every other record, but in both SAS and Informatica you get the same value of X everywhere, and this is the sort of hard semantic difference that makes it very, very difficult to map between languages.
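As a purely illustrative sketch of that gap (the port, data, and record numbers here are hypothetical), the following Java contrasts a value that persists across records with independent per-record evaluation:

```java
import java.util.Arrays;

public class CarriedStateDemo {
    public static void main(String[] args) {
        int[] recordNumbers = {1, 2, 3, 4};

        // Sequential semantics: the "port" keeps its value once set,
        // so every subsequent record observes X.
        String port = null;
        for (int recno : recordNumbers) {
            if (recno == 1) port = "X";                  // set on the first record only
            System.out.println(recno + " -> " + port);   // X, X, X, X
        }

        // Per-record (MPP-style) semantics: each record is evaluated
        // independently, so only record 1 sees X and the rest see null.
        Arrays.stream(recordNumbers).parallel().forEach(recno -> {
            String value = (recno == 1) ? "X" : null;    // no carried state
            System.out.println(recno + " -> " + value);  // X for record 1, null otherwise (order may vary)
        });
    }
}
```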

This is where we break out of the traditional job of compiling: we actually have to up-engineer into user intent. If you’re compiling C or Java and you’re compiling it down to x86, the user said this, therefore do this, and if you do anything else, it’s your fault. But if you’re compiling some of these languages, it’s like: the user said this, we’ve had a look around, we think they really meant this part of what they said, and every other part of what they said was irrelevant or a consequence of the implementation, and therefore we’re going to generate high-performance code for the target that preserves the thing they meant and discards the rest. And that’s hard!
Tobias Macey: [laughs] Yes, exactly, the technology is easy, it’s the people that are hard – as with everything that has to do with computers.
Shevek: I don’t envy you editing that part out, ’cause I went very long-winded.
Tobias Macey: No, there’s nothing to edit there.

Continuing on the point of the semantics being the hard part of translating these data processing languages: you mentioned earlier that at the core you think you’re more of a computer algebra company than a compiler company. I’m wondering if you can discuss a bit of the abstract modeling and mathematical representations that you use as the intermediate layer for translating between and among these different languages and generating the lineage analysis that is one of the value adds of what you’re doing there.
Shevek: I won’t, but I will say some interesting corollaries, and I hope you will forgive me for not answering the question as you directly asked it, which is a very interesting question.

There’s an old party trick where you take a floating point value and you go around a loop a million times and you add one to this floating point value, and the question now is: what’s the value of that floating point value? And the answer is, it’s not a million, it’s about 65,000, because eventually you reach the point where the exponent ticks over, and once you’re not seeing the last integer digit anymore, because your exponent ticked over, adding one to a floating point value has no effect. Processors are weird.
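A minimal Java sketch of the underlying effect: the exact cutoff depends on the width of the mantissa of the floating point type in use (2^24 for an IEEE 754 single-precision float), but once the value passes it, further increments are simply lost.

```java
public class FloatSaturation {
    public static void main(String[] args) {
        float x = 0f;
        for (int i = 0; i < 20_000_000; i++) {
            x += 1f;   // beyond 2^24, x + 1 rounds back to x
        }
        System.out.println(x);   // prints 1.6777216E7 (2^24), not 2.0E7
    }
}
```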

There’s another party trick where you write an array of memory, you fill it with random numbers and then you add all the numbers into an accumulator. But then you try doing that, iterating backwards, and then you try doing that, iterating in a random order and you see what the performance difference is. It turns out that processor hardware and memory pre-fetch and so on is an absolutely delicious thing, as long as you’re reading memory forwards. It sort of manages if you’re reading memory backwards. And it falls flat on its face, throws its hands up in the air and screams if you read memory in a random order to the effect of about a 200 to one performance penalty.
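A rough, illustrative microbenchmark of that access-order effect: summing the same array forwards, backwards, and in a random order. The timings are indicative only (there is no JIT warm-up control here); the point is the relative difference, not the absolute numbers.

```java
import java.util.Random;
import java.util.function.LongSupplier;

public class AccessOrder {
    static final int N = 1 << 24;  // ~16M ints, far larger than any cache

    public static void main(String[] args) {
        int[] data = new int[N];
        int[] order = new int[N];
        Random rng = new Random(42);
        for (int i = 0; i < N; i++) {
            data[i] = rng.nextInt();
            order[i] = i;
        }
        // Fisher-Yates shuffle of the index array to get a random visit order.
        for (int i = N - 1; i > 0; i--) {
            int j = rng.nextInt(i + 1);
            int tmp = order[i]; order[i] = order[j]; order[j] = tmp;
        }

        time("forward ", () -> { long s = 0; for (int i = 0; i < N; i++) s += data[i]; return s; });
        time("backward", () -> { long s = 0; for (int i = N - 1; i >= 0; i--) s += data[i]; return s; });
        time("random  ", () -> { long s = 0; for (int i = 0; i < N; i++) s += data[order[i]]; return s; });
    }

    static void time(String label, LongSupplier work) {
        long start = System.nanoTime();
        long sum = work.getAsLong();
        System.out.printf("%s sum=%d took %d ms%n", label, sum, (System.nanoTime() - start) / 1_000_000);
    }
}
```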

So now let’s think about a tree data structure. On paper, a tree data structure has logarithmic complexity, brilliant, and academically we ignore the constants. But what a tree data structure looks like to a processor with memory prefetch is random order access. And so what that means is that the constant is an order of magnitude larger than anybody thinks it is, which is why on paper a heap and a tree have the same performance, but in practice a heap is so much faster, because you start to fit into cache lines.

Now, computer algebra systems look awfully like random order access to memory. And I think this is one of the most interesting problems in any sort of computer algebra; you’ll even find it in SAT solvers, where people optimizing in C are changing the order of the fields in the structs in the internals of the SAT solver so that the hotter fields are all up at the top end, because that’s the bit that will fit into cache.
We actually get multiple orders of magnitude by having a solver within the computer algebra engine, which itself works out what order to do things in, so that we don’t appear to be accessing the algebra structure in random order.

The more you can do on a piece of memory while it’s in cache, before you drop it out of cache, the better. And then, of course, the other entertaining question is: wait, you do all of this in Java? Isn’t Java some kind of language where you’re a million miles away from the processor? Actually, I happen to think Java and the JVM are a beautiful, beautiful setup, because you get to be a million miles away from the processor when you want to be. But when you actually want to get down low, you’ve got control of everything down to memory barriers, and at that point you’re pretty much able to write assembler, and it’s the joy of a language. They say 90-something percent of your code doesn’t need optimizing, and they’re right. So the question is, can you ignore 90% of the job and do the 1% of the job? I think the JVM is one of the greatest feats of modern engineering for allowing that.
Tobias Macey: Just for a point of reference, for people who are listening and following along, I’ll clarify that when you say tree data structure, you’re speaking of trees spelled TRIE, not TREE.
Shevek: Either will do: a B-tree, a tree with an I, you know, an RB tree, anything where you’re effectively allocating nodes into main memory and then making those nodes point to each other, and particularly where your allocator is, you know, some sort of slab allocator that’s mixing your tree nodes up with other things. You know, if you just allocate a tree, then maybe yes, your root node is allocated at the start of RAM and everything else is allocated sequentially. But the moment you start rotating and mutating a tree, then a tree walk looks like random order memory access again.
Tobias Macey: Digging more into the technical architecture of what you’re building at CompilerWorks: can you give a bit of an overview of the workflow and the lifecycle of a piece of code?

I guess the data is irrelevant here, since you’re working at the level of the code. But the life cycle of a piece of code as it enters the CompilerWorks system, your processing thereof, and then the representation that you generate on the other side for the end users of the system.
Shevek: Yes, and I’m going to answer about three quarters of that, of course, so let’s deal with the thing that we deal with at the start. Everybody knows about lexing and parsing. Lexing and parsing are not necessarily as immediate as everybody thinks they are. So, for instance, we’re taught that Foo is an identifier and 5 is an integer, and that Foo5 is an identifier because it’s something that starts with a letter.

And then you ask the question: well, what is 5Foo? And the answer is that it’s an illegal identifier, because it’s an identifier that starts with a digit. But if you go into any SQL dialect and you type select 5Foo, what you will get is the value 5 aliased to the name Foo, because we as humans implicitly assume that there needed to be a space between the 5 and the Foo. But if you follow the textbook instruction of how to write a lexer and parser, you actually get the bug that I just described. If you do it the way they taught you in school, you get that bug.

So now it gets a little bit interesting, because the first part of writing a compiler for an enterprise language is working out what its structure is. What are we even being given here? So let’s take a language that exports itself as XML; there are a number of them out there. So now you’ve got a load of XML, and this XML has words in it like “such-and-such ID equals Foo.” Well, what does that Foo refer to? It refers to some other Foo. XML can really only represent a tree structure, and all of these languages are dataflow structures, therefore they must be representing graphs, therefore there must be linkages within this XML. So the first stage is looking at a load of samples and working out what the semantics of the language are.

Now here we have an advantage, because most of these languages were written as relatively thin skins on top of their executors, so it helps to know the capabilities of the executor. And I kid you not, actually reading historical papers about how memory allocators worked and things like that will give you a lot of insight into things like the extent of variables: you learn when values get reset or when they get de-allocated just by knowing what technology was available to the authors of the language at the time they wrote the language.

So, having worked out what all of the linkages are, you now have a symbol table, and having done the parse, you do the compile, which is symbol table, type check, operation selection: very classic compiler work. And now what you have is a compiled binary in the semantics of the source language, which are not necessarily atomic semantics. So now what you need to do is break those semantics down, where some of those semantics may be quite large, into effectively atoms, meaningful atoms. So we will end up with something like 32-bit integer addition with exception on overflow, and you might even get an annotation about what the exception on overflow is. And now you’ve got the set of challenges where, if you’re doing lineage analysis, you have a whole set of computer algebra rules that will tell you what you need to know about this thing. Am I doing this? Am I doing the same thing twice? When you’re looking for, you know, matching regions of algebra, without going too much into details, you could do something like a Jaro-Winkler distance on a computer algebra data structure or something like that.
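A purely hypothetical Java sketch of what such a “meaningful atom” might look like; this is not CompilerWorks’ actual intermediate representation, just an illustration of the idea that each operation carries its semantics explicitly (width, overflow behavior), and that lineage questions become walks over that structure rather than over source text.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical IR sketch; types, names, and the example expression are invented.
public class AtomSketch {
    enum OverflowBehavior { WRAP, SATURATE, RAISE_EXCEPTION, RETURN_NULL }

    interface Expr {}
    static class Column implements Expr {
        final String table, name;
        Column(String table, String name) { this.table = table; this.name = name; }
        public String toString() { return table + "." + name; }
    }
    static class Literal implements Expr {
        final long value;
        Literal(long value) { this.value = value; }
    }
    // One atomic operation with nothing left implicit: width and overflow behavior are explicit.
    static class IntAdd implements Expr {
        final int bitWidth;
        final OverflowBehavior onOverflow;
        final Expr left, right;
        IntAdd(int bitWidth, OverflowBehavior onOverflow, Expr left, Expr right) {
            this.bitWidth = bitWidth; this.onOverflow = onOverflow; this.left = left; this.right = right;
        }
    }

    // A lineage-style question: which columns feed this expression?
    static List<Column> columnsOf(Expr e) {
        List<Column> out = new ArrayList<>();
        if (e instanceof Column) out.add((Column) e);
        if (e instanceof IntAdd) {
            out.addAll(columnsOf(((IntAdd) e).left));
            out.addAll(columnsOf(((IntAdd) e).right));
        }
        return out;
    }

    public static void main(String[] args) {
        // "quantity + 1" as it might look after compilation from a dialect whose
        // integers are 32-bit and which raises an exception on overflow.
        Expr atom = new IntAdd(32, OverflowBehavior.RAISE_EXCEPTION,
                new Column("orders", "quantity"), new Literal(1));
        System.out.println(columnsOf(atom));  // [orders.quantity]
    }
}
```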

It’s fundamentally hard, but things like that are available for saying why are my marketing department and sales department computing the sales figures but coming up with different numbers? Because now they’re disagreeing over who gets how much money, and that’s a problem that needs to be solved here.

Emitting code is actually a whole new set of challenges, so this is for the case of migration from platform to platform, because if one just takes the semantics and emits them to the target platform, you get code that has a number of issues.
First you get non optimality. You get the fact that it’s not using the correct idioms of the target platform. You also get the fact that it’s ugly. We’ve all seen computer generated code and nobody wants to maintain it, and a significant part of generating code for a target platform is working out what code to generate, which is idiomatic for the target platform, idiomatic for the particular development team that gave you the input code and human readable and human maintainable.

So to give you a trivial case, there was an old joke: “I stole the artificial intelligence source code from the government laboratory for artificial intelligence, and I’m going to prove it by dumping the last five lines.” This joke came out when LISP and Scheme were the popular languages, and the punchline of the joke was five lines of close brackets. If you just do machine-generated code, which everybody has done at one point, you either fail at 1 + 2 * 3 or you generate five lines of close brackets.
Yes, and that’s the sort of problem that is non-obvious to the emitter. I’ve also alluded to this when I said idiomatic to the developers who wrote the original source: that means that there are things you have to preserve about the original source which are not necessarily semantics of the language, but which are in fact idioms of the development team in question.
Tobias Macey: It’s definitely an interesting aspect of the problem, because as you point out, there are certain implicit meanings, or meanings that are a side effect of the structure of the code, that have nothing to do with the semantics of the code or its computational intent, but that do help with the cognitive and organizational complexity management for the team that is writing and maintaining the code, and that they might want preserved in the output, because of things like splitting logic on team boundaries, for instance.
Shevek: Yes, and generating code that a machine will accept is vastly easier than generating code that a customer will accept.

I mean, the general market approach to doing machine translation is: you write a parser, you jump up and down on the parse tree, and then you emit the parse tree, and you say this is the target language, and you wave the ANSI SQL flag as loudly as you can. And you replace some function names while you’re at it. But the moment you typecheck, you’ve now done things like inserting casts, and there are two things here, both of which you can’t do. One is to generate code that contains all of those casts, because your human maintainer will say nope.

And, the other is to assume that the target language does the same implicit type conversions, or even fundamentally has the same types as the source language.
And the answer to that is Nope. You cannot divide 1 by 10 in any financial institution unless you know exactly what you are doing. 
Tobias Macey: Absolutely, and so to that point, it’s interesting to dig into some of the verification and validation process of the intermediate representation of the language and the onboarding approach to bringing new target platforms or new source platforms
under the umbrella of CompilerWorks and just the overall effort that’s involved in actually doing the research to understand the capabilities and semantics of those systems. 
Shevek: Yes, and then you start to speak to, well, what kind of company are we?

Are we a computer algebra company or are we a set of research historians?
What do I know about unheard of platform X or who wrote it?
When did they write it?
Where did they write it and  … ?

An awful lot of that gets folded into the initial development of a language.
We are utterly test driven. You basically have to be. And so, starting out with a new language it really is just about passing test cases and building customer acceptability.
There are other parts of this question which, I apologize, I’m not going to answer; I’m trying to fish out things that I can say. One of the things that we developed over the years is the ability to implement a compiler for a language in a shockingly short space of time.

Once upon a time, we actually signed a contract to do a language in a space of time, which was, you know, bordering on professional irresponsibility. And of course we did it and we hit it. And the thing that we didn’t publish was we actually did it in less time than that because one of the things that we know how to do is to understand languages and put together an implementation of language in a very short space of time.

But a lot of this comes from having a core where, for instance, types only behave in certain ways. And if you can express all of the ways in which types behave and types interrelate, then you can describe a language in terms of for instance its type system.
And that to us is a tool that we have available.

It’s sort of interesting when you’re mapping between languages where types have inheritance and polymorphism to a language where types maybe have inheritance and polymorphism but have different relationships between themselves. So at that point, something which was a polymorphism conversion in one language is an explicit type conversion in another language. Understanding of types is very, very important. 
Tobias Macey: Absolutely, especially when dealing with data.
Shevek: Yes, the date-times are the killers, because even if you know that you’ve got a date-time, there’s one hour in every year that doesn’t exist and one hour in every year that exists twice. And then people do things like: OK, so adding one to a date or time is simple, all you have to know is whether that particular language interprets that as days or milliseconds. But then you get into all sorts of craziness, like: if I take a time and I convert its time zone, did I get the same instant, or did I get what we call a time zone attach? So is 1:00 PM BST, in Pacific, like 9:00 PM or whatever it is, 7:00 PM PST, or is it 1:00 PM PST? Different database servers do different things when given this operator, and that again blows ANSI SQL out of the water, and then what you actually end up doing is just figuring out how the database server does it internally and then modeling that.
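A minimal Java sketch of the two behaviors being described, outside of any particular SQL dialect: converting a timestamp to another zone while preserving the instant, versus re-attaching a different zone to the same wall-clock time. Which of these a given dialect’s conversion operator means is exactly the kind of thing that has to be modeled per platform.

```java
import java.time.ZoneId;
import java.time.ZonedDateTime;

public class TimeZoneAttach {
    public static void main(String[] args) {
        ZonedDateTime london = ZonedDateTime.of(2021, 7, 1, 13, 0, 0, 0,
                ZoneId.of("Europe/London"));            // 1:00 PM BST

        // Same instant, different wall clock: 5:00 AM that day in Los Angeles.
        System.out.println(london.withZoneSameInstant(ZoneId.of("America/Los_Angeles")));

        // "Time zone attach": same wall clock, different instant: 1:00 PM in Los Angeles.
        System.out.println(london.withZoneSameLocal(ZoneId.of("America/Los_Angeles")));
    }
}
```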

And then you’re back into history. You’re back into reading source code. You’re back into the Postgres source code, which is one of the most marvelous resources on the planet, because it tells you how a lot of the database servers out there work. And then you want to know when they forked or what they did and why they did it and who did it.
And so on. 
Tobias Macey: And I’ll agree that Postgres is definitely a marvelous resource, and it’s fascinating
the number of systems that have either been built directly on top of it, or you know inspired by it, even if not taking the code verbatim. 
Shevek: Yes. I mean, you’ve got this challenge when you say we’re Postgres-compatible and you sort of adopt the Postgres mutator, and people don’t typically want to do anything with the Postgres mutator. Compared to almost every commercial dialect, the Postgres mutator has a fundamental weakness in its handling of time zones, which nobody has ever seen fit to correct, and I suspect core Postgres can’t: you don’t really have a timestamp with time zone, and the time zone handling in Postgres is basically to be avoided if you want to get the right answer. And yet Oracle, Teradata, BigQuery, everybody else does it right.

So yeah, Postgres is a wonderful resource, but I do wonder at people basing things on it with that weakness.
Tobias Macey: Most people who start basing their systems on top of Postgres haven’t done enough of the homework to recognize that as a failing before they’re already halfway through implementation.
Shevek: I think the majority of people who have a good idea and want to get a demo of their good idea out as fast as possible really don’t think about the consequences of their decisions in the first two or three days, and I think there is a phenomenal bias among developers starting out to imagine that because something gives you a very fast day one, it will give you a very fast day three. And they think, OK, we’ll get 6 to 12 months down the line, and then we’ll rewrite it.

And I think that for an experienced developer, the crossover point with technologies is around day three, not month three, and this is a big big mistake. We made some very interesting technological decisions about things that we were and were not going to do with this company right at the start of the company, and they paid off.
And some of those decisions were that we were going to do a lot more hard work than was necessarily obvious. And we’ve watched people come up behind us and say we’re going to make different technological decisions, and suffer the consequences of those things and sort of run into a wall.

But the number of times that I’ve been told, for instance, “we want to develop the back end in Node because that way we get to use the same models on the front end and the back end.” It’s like, whoop-de-do, you’ve got no type checker (OK, TypeScript, I’m looking at you). Whoop-de-do, let’s see. Your experienced developer is going to sit down with a decent web framework, with a DI framework and everything else, and I’ll have you know you’re going to be overtaken by the end of day three at best. And there are companies out there that know this.
Tobias Macey: Yes, as you were talking about technologies that have been thrown together to get a fast solution, the first thing that came to mind was JavaScript, so I appreciate that you called it out explicitly.
Shevek: I like JavaScript as a language, but I also have this rule about writing shell scripts, which is: the moment you find yourself using anything like arrays, you’re in the wrong language. I have this sort of set of criteria that tell you that you’re in the wrong language.

There are things I very much like about the JavaScript ecosystem and things that I would definitely go to it for. However, it does make me kind of sad to see it slowly reinventing, or rediscovering, or hitting many of the problems that other languages have hit.

Another example: some years ago there was a great big fuss about the ability of an attacker to generate hash collisions, which led to putting perturbation into hash tables. Somebody pointed out that if I generated the correct set of SYN packets in TCP and sent spoofed SYN packets to a remote Linux kernel, because it used a predictable hash, and it was a list hash, we could convince the kernel to put all of those SYN packets into the same chain in the list hash, and you’ve denied service to the kernel, because it was spending all of its time walking this linear chain rather than benefiting from the hash table. And I watched the same bug get discovered in Perl, which taught it to use perturbation of hashes. And then I waited something like three or five years for somebody to point out that the same bug existed in, I forget whether it was Python or PHP. And then you get into this world where developers say, hang on a minute, my hash iteration order changed, you’re not allowed to do that. And then you say, yes, you are. It says so on the tin.

And so there’s this whole pattern of watching developers rediscover solutions to problems that other languages have already solved. It’s like: once you discovered it in Perl, which I think might have been the first one, and then PHP might be the second, but I’d have to check; go around all of the other languages and look, and make sure. Don’t wait five years. And the same thing is true for the JavaScript ecosystem: they’ve waited 15 years to reinvent certain things.
Tobias Macey: Yes, developers have remarkably short memories and attention spans, at least in certain respects. And so, bringing us back to what you’re building at CompilerWorks and its usage as a static analysis and lineage generation platform, I’m wondering if you can talk through some of the overall process of integrating CompilerWorks into a customer’s infrastructure and workflow, and some of the user interactions and processes and systems that people will use CompilerWorks for and build on top of the CompilerWorks framework.
Shevek: So what you’ll find is that most of the data processing platforms out there have some sort of log or some sort of standard presentation of their metadata, and we at CompilerWorks aim to make everything as easy as possible, by which I mean we take that standard presentation of the metadata.

If you’re working with BigQuery, we take the BigQuery logs.
If you’re working with Redshift, we take the audit logs.
If you’re working with Teradata, we take the various things that Teradata throws at us.

Having basically given the CompilerWorks dumper permission to access these logs, it makes a dump and pulls them into the product, and the rest of it is automated, because the fundamental thing that we operate on is: if it’s possible for the underlying platform to understand that code, it’s possible for us to understand the code. We have all of the temporal information.

We have all the metadata.
We have all semantic information.
From then on it’s all gravy.

We pull the logs, we put them up into the user interface, we make the data available as APIs, and from then on you can just explore the lineage, much as you’ve seen in our video presentations.
Tobias Macey: In terms of the migration process, you’ve discussed a lot of this already, so we can probably skip through this question a little bit, but what is the overall process of actually doing the migration from platform A to platform B, and especially doing the validation that the answer you get on the other side of the transformation matches, at least closely enough, the answer you were getting before you made the migration? And then maybe a little bit about some of the reasons that people actually perform those migrations in the first place.
Shevek: So lineage is totally easy. You can usually get up and running with the CompilerWorks lineage in a few minutes – as long as it takes you to pull the logs. You pull the logs, you run it.

Migration tends in practice to be a little bit hairier, because the customer’s presentation of their code is not standard. A significant percentage of customers preprocess their code or something. I mean, this is actually where some of the enterprise languages are nicer: with the more capable enterprise languages, while the compilers we have to write for them are much tougher, the customers tend to present their code in a more standardized form, because the language itself is more capable. When you get a relatively incapable language, the customer tends to mess with it procedurally, generate it, do all sorts of things. It’s almost like they’re treating the underlying language just as an executor. So the first question you have to ask is: what’s your presentation of your code? How did you mess with it?

Once you’ve got hold of the presentation of the code, what you do with CompilerWorks is specify what the input language stack is, and this is actually quite nice, because in CompilerWorks you can take a language that generates another language, or contains another language, or preprocesses another language, and say: this is a language stack; you’re going to absorb this, you’re going to transpile, and you’re going to emit to a target language stack that has some of the same preprocessing or management capabilities as your source language stack. This is yet another hint that writing a purely academic Oracle-to-Postgres compiler isn’t enough, because the Oracle code exists within the context of something else, and may be incomplete, and so on and so on. And again, if you don’t do that, you fail human acceptability. So the start of a migration process is: get the code, work out how it’s specified, tell CompilerWorks how this customer currently specifies their code.

Tell CompilerWorks how the customer wants their code specified, and then run it for the migration. And that process, I have walked into a meeting room and done it cold in an hour. This can be done, you know, given that the customer typically doesn’t know the answer; usually they don’t know the answer for the target platform. They’ve been sold something by a vendor, they think it’s a marvelous idea, and you say, how do you want to use this target platform? And they say, we don’t know. And then we make a recommendation, and we work with their advisors to make that recommendation work and get it right. One of the things that you get out of this is that we have a lot of versatility with respect to doing the migration job, not just converting code.

Testing is an interesting one. Customers vary in what they will accept. As I said with the 1 / 10 example, we are very, very precise in how we convert. We have customers who absolutely lean on us for that, and they say: I want this accurate down to the last dollar. If you’re dealing with financials, sometimes they care down to the last dollar.
I’m avoiding slightly naming names here. If you’re dealing with some of the markets that we deal with, they’re happy with anything that’s within 5%. And now there’s another thing that gets slightly interesting, which is: if you’re dealing with financials, you’ll always use decimal types for data. I have seen people in certain markets use floating point types for data, and the consequence of that is that if you do a sum of floats you could get any answer at all; it’s not like you will probably get an answer that’s within 5% of the result. People don’t understand floating point arithmetic. You could get anything. And the difficult cases are the ones where the customer’s done something like that, and the target platform does something in a deterministic but different order from the source platform’s deterministic order.
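A small Java illustration of that point: summing the same floating point values in a different order gives a different answer, so a target platform that aggregates in a different (but still deterministic) order will not reproduce the source platform’s result bit for bit.

```java
public class FloatSumOrder {
    public static void main(String[] args) {
        double a = 1e20, b = 1.0, c = -1e20;

        // Same three values, two evaluation orders, two different answers.
        System.out.println((a + b) + c);   // 0.0  (the 1.0 is lost when added to 1e20)
        System.out.println((a + c) + b);   // 1.0
    }
}
```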

Now you get customers who write code where the result of the code wasn’t well defined, but the source platform happened to execute it sufficiently deterministically that they think that’s the right answer. And now you have to sit down with the customer and say: dear customer, we love you; however, you did not, in the source language, say what you think you said. Can we now please work with you? And there’s a marvelous piece of education there: with a good customer, you can really help them to improve their infrastructure as a whole. And that’s also where we describe the static analysis side of the lineage product as “tell me the things I need to know.” Am I, in my infrastructure, doing something that is odd?

One of the funniest cases I ever saw was somebody who had taken code from Oracle that said “A ! = B”. Now, in Oracle this means not-equals, ’cause you’ve got a not and you’ve got an equals, and that’s a not-equal, because now we’re going back to: what does the lexer do? In C, exclamation mark equals is a single token; in Oracle, exclamation mark and then equals are separate tokens, and it’s the parser that puts them together into a not-equal.

Postgres was written by C developers, therefore exclamation mark equals has to be a token. So what does “A ! = B” mean? It means A factorial equals B. It executes, it does not return an error, and it doesn’t give you remotely the same answer, so it is a legitimate static analysis to say: did we use the factorial operator? Because we almost definitely didn’t mean to.
Tobias Macey: Yes, that is a hilarious bug.
Shevek: What is equally puzzling is the number of these things that we discover and find in source code, and we say: how long has this been in here?

And the answer is this has been in here for years. It’s generating a production data set.
It’s breaking the production data set and nobody noticed, and so you start to ask questions like, under what circumstances do you as a customer notice an error in the production data set?

The most common answer we get is because data is missing; if data is present, the customer tends to assume it’s correct. I used to teach undergraduate Java, and you’d get into a lab and say to a student: you’re going to simulate a cannonball. You’re gonna fire it into the air at 30 meters a second, gravity we’ll assume is 9.81, and you’re going to model the position of this cannonball at one-second intervals and tell me when it hits the ground. OK, well, I can do basic calculus, and so I can say, OK, it’s going to hit the ground in six and a bit seconds, fine. So they’d write their code and they’d run their code, and they’d very proudly present me their answer: the cannonball hits the ground in 25 seconds. And I would say to them, “Are you sure?” The tone of voice is critical here.

And it took them a couple of months to work out that I would ask “are you sure” in exactly that same tone of voice regardless of whether or not they had the right answer, because their duty to the code was the same; it didn’t matter whether I knew they had the right answer. I was not going to be the oracle – they were going to make sure.
Tobias Macey: It’s definitely remarkable the amount of that sort of cavalier attitude that exists in the space of working with data and dealing with analysis, just assuming that because the computer says it, it’s correct, and not being critical of the processes that gave you that answer in the first place.
Shevek: And you spoke briefly about testing. The naive answer to testing is: if the target platform gives the same answer as the source platform, great, you’re golden, and that is in fact the easy case. There are a lot of cases where the target platform gives a different answer from the source platform, and there’s an awful lot of reasons why that might arise, many of which are nothing to do with the translation, which was in fact accurate and preserved the semantics expressed by the source code.
Tobias Macey: It almost makes me think that people should just use CompilerWorks to trial-migrate their code to a different system to see if it gives them a different answer and points them in the direction of finding that they had some horrible mistake for the past 10 years.
Shevek: Well, that’s exactly why we run lineage.

You run lineage over your code and it will tell you whether you had a horrible mistake, and you don’t need a target platform for that.
Tobias Macey: And I imagine too that, because of the virtue of being able to take a source language and then generate a different destination language, that will also help people with doing trial evaluations of multiple different systems in the case where they’re trying to make a decision, and see how it actually plays out in, you know, letting my engineers play with it, letting my financial people play with it, and see what the answers look like. I’m wondering what the frequency of that type of engagement is in your experience?
Shevek: Almost universal, because one of the things you have to bear in mind when you’re doing a semantic mapping is that the required semantics might not exist on the target platform. So now you’ve got a group of developers on the source platform, where you’ll find some master developer, and he will find you some heavy piece of code and he will say: this is the heaviest thing on the source platform.

Can you convert it to the target platform? And in the old world, somebody would sit down and convert that piece of code and say yes. But what he’s given you isn’t the heaviest piece of code for the target platform; he’s given you the heaviest piece of code for the source platform. So an engagement for us looks like: here’s all the code for the source platform.

Can you qualify the entire codebase against the target platform?
And the answer is yes, if you hold on a minute or two, we can actually give you that answer and then we can say in this file over here is this operation which is really simple on the source platform because the source platform happens to have that operator, but the target platform doesn’t and has no way to emulate it. 
Tobias Macey: Definitely an interesting aspect and side effect of the varying semantics of programming languages and processing systems.
Shevek: Yes, and one of the fundamental assumptions of the compilers world is that the target platform can do the thing. This is a very interesting compilers world, because here that’s not true: the target platform cannot necessarily do the thing. And in the world where your language is just a grammar, a syntax stuck on top of the target executor, of course you can do the thing, because you just glue the keyword to every instruction in the target.

So yes, this is that rare case where there isn’t a workaround. It’s not just a case where the instruction set isn’t dense. I mean, even compiling C to Intel, the Intel instruction set isn’t dense: you can’t do every basic arithmetic operation on every combination of widths of words, so sometimes you have to cast up, do your arithmetic operation, and then cast back down again. In databases there are conversions between platforms that have things that can’t be done, and so the ability to run CompilerWorks over a code base and say whether this could even be done, based on some simple operation, is golden for a customer.
Tobias Macey: It’s the side effect of SQL not being Turing complete.
Shevek: And not fundamentally having assignment.

You can sort of use sub-selects to do a little bit of functional programming, but there’s the lack of assignment, and then you end up in weird corner cases: if emulating a particular piece of semantics requires you to reference a value more than once, and you don’t have assignment or an assignment-like operator on the target platform, then you have to re-evaluate a sub-tree, where that sub-tree might, for instance, contain a sub-select with an arbitrarily complex join. Expensive was the word I was looking for. Expensive is such a marvelous word in industry.
Tobias Macey: Or if the sub-select happens to happen at two different points in time, where the query does not have snapshot isolation and somebody inserted a record in the midst of the query being executed.
Shevek: Yes, so you’ve got repeatable read, and then you’ve got things like stable functions. Say the thing that you had to duplicate, for instance, reads the clock. Most database servers are smart about this: they will actually publish multiple clock functions. One of those clock functions reads the time at the beginning of the query and compiles that time into the query as a constant, so that when you do something with respect to “now,” you are always treating the same “now,” even if your query takes a minute to run. But they will often also have another clock function, which means the actual millisecond instant that the mutator hit that opcode. And now you get customers who confuse the two, and sometimes it matters and sometimes it doesn’t, and if you’re running on a parallel database server or whatever, you start to get different answers.
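A minimal Java sketch of the two clock semantics being described, outside of any particular dialect: a “statement” clock captured once and reused for every row, versus a clock read again at each evaluation. (Postgres, for example, distinguishes now() and statement_timestamp(), which are fixed during a statement, from clock_timestamp(), which is not; names and exact behavior vary by dialect.)

```java
import java.time.Instant;
import java.util.List;

public class ClockSemantics {
    public static void main(String[] args) throws InterruptedException {
        List<String> rows = List.of("a", "b", "c");

        // Statement-start semantics: one constant 'now' for the whole query.
        Instant statementNow = Instant.now();
        for (String row : rows) {
            Thread.sleep(50);  // pretend each row takes time to process
            System.out.println(row + " @ " + statementNow);   // identical for every row
        }

        // Per-evaluation semantics: the clock is read each time the expression runs.
        for (String row : rows) {
            Thread.sleep(50);
            System.out.println(row + " @ " + Instant.now());  // different for every row
        }
    }
}
```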

So yes, it’s not just about data. I think what I’m doing here is broadening one’s view of what isolation and sub-tree duplication and so on really do to you.
Tobias Macey: So I’m sure that we could probably continue this conversation ad infinitum, but both of us do have things to do, so I’ll start to draw us to a close. To that end, I’m wondering if you can just share some of the most interesting or innovative or unexpected ways that you’ve seen the CompilerWorks platform used.
Shevek: I think the ones that we love most are the ones in the lineage product where, because we show consequences at a distance, somebody is looking at maintaining a column, and we say it affects such-and-such a business report.

You probably should think before you do that. And the user says: no, it doesn’t, it can’t possibly. And then they click the button on CompilerWorks that says “explain yourself,” and we say, “This is how it does it.”

And they have that moment. And I think the best mail that I get, and we get them quite often, is not just where we gave the user a revelation; it’s where we gave the user a revelation that fundamentally disagreed with what they believed about their infrastructure and really opened their eyes to it.

Those are the ones that I most enjoy. The ones where people run it because it’s accurate and they say OK, if I do this, I’m not going to go to jail, that’s great, but the ones where it contradicts them and substantiates itself are the ones that I love. 
Tobias Macey: I could definitely see that being a gratifying experience. In terms of your experience of building the CompilerWorks system and working with the code and working with the customers, what are some of the most interesting or unexpected or challenging lessons that you’ve learned in the process?
Shevek: There’s an interesting definition of technical debt.

We define technical debt as not a thing that’s done wrong, but a thing that’s done wrong that causes you to have to do other things wrong. It’s really only a debt that matters if you have to pay interest on it, in that sense. So it’s about knowing when to incur technical debt and how much interest you’re paying on it. Compilers I’ve compared to playing snooker. I can go up to a snooker table and I can roll the ball into a pocket.
If I’m lucky, I can hit that ball with another ball and get it to go into a pocket.
I’m not skilled enough to have a stick and hit the first ball with a stick so it hits the second ball, so it goes into a pocket, and that’s a skill of three levels of indirection, which I don’t have.

With compilers you have to be forever thinking about everything you’re doing at three levels of indirection, because you’re writing the compiler, and a lot of customers don’t even send you the code. So the customer says: my data was 42 and it should have been 44, and I’m not even necessarily going to tell you what the code was, and now you have to fix it in the compiler. So now you’re playing snooker blindfolded. That’s tough.
And one of the things that comes out of this is that the majority of the code that ever runs through your compiler is never going to show up in a support issue, and you’re never going to see it. What this means is that you mustn’t cheat. Do it right and prove it right.
Tobias MaceyAnd for people who are interested in performing some of these platform migrations or they want to be able to compute and analyze the lineage of their data infrastructure, what are the cases where CompilerWorks is the wrong choice? 
ShevekThere are cases we come across where the target platform has so little resemblance to the source, by which I mean the desired target code has so little resemblance to the source code, that it's not really a migration. It's a version two of your product.

We come across people who try this, and for that CompilerWorks is the wrong choice, and we would tend to advise those people against it. People use the platform migration as an opportunity to do a product version two, and this isn't always as good an idea as it seems. You might want to consider separating the platform migration from the version two of the product, because any developer who's been around the block a couple of times knows that the double set of unknowns is going to bite you. So we speak to people and say: do the migration apples for apples, and then do the maintenance on the target platform. The other thing about not doing a relatively clean migration is that you no longer have a test suite: you can't compare target platform behavior with source platform behavior, because you explicitly specified the target platform behavior to be different.

So we tend to advise people to do one thing at a time, but if you wanted to do them both together, we would start to lose relevance. 
Tobias MaceyAs you continue to iterate on the platform, work with customers, and build out capabilities for CompilerWorks, what are some of the things that you have planned for the near to medium term, or any projects that you are particularly excited to work on? 
ShevekLots more languages, and shortening the time to the "aha" moment.

The latest version of CompilerWorks that we've shipped has a completely redesigned user interface for the lineage product, and we've done a lot of work there to put exactly the right things on screen, so that you can look at the screen and the answer to your question is there.

That's hard work. That's visual design work. It's got nothing whatsoever to do with compilers. And now you've got a compiler team saying, we have to do all of this user psychology to do the visual design. Then it goes back into the components team. For example, the user has an error somewhere in their data infrastructure and wants to know how to fix it. Really what they want is the list of tasks they have to perform, in order, to fix that; we can produce that, but we can also make visible why, and justify it.

But the moment you've gone out to the front end and decided that that's your user story and you need to do that, you have to go back into the back end and make sure that it is generating all of the necessary metadata to feed into the static analysis so that that visualization can be generated. So it's a very tight loop between user story, visualization, front end, and pretty hard-core compiler engineering. 
Tobias MaceyAbsolutely, and at the risk that this is probably a subject for entirely another podcast episode: what are some of the applications of compilers that you see potential for in the data ecosystem specifically, that you might decide you want to tackle someday? 
ShevekThere are applications of compilers that I particularly enjoy.

One of my favorites, which is on GitHub, was the QEMU Java API. What it does is take the QEMU source code, which itself has some sort of JSON-ish preprocessor, and a compiler compiles that JSON-ish code into a Java API which allows you to remote-control a QEMU virtual machine. Now, one could have sat down, tracked QEMU, and written this thing longhand, saying: I'm going to write a remote control interface to QEMU such that I can add disks and remove disks and so on, on the fly. But it made far more sense to do it as a compiler problem, because it now tracks the maintenance of QEMU. They add a new capability? Well, guess what: you rebuild your Java API by running this magic compiler over QEMU, and you've got a new QEMU remote control interface.

And the reason that I particularly loved that piece of code as an application of compilers was that now I can write a JUnit test case that runs in Gradle, in JUnit, in Jenkins, in all of my absolutely standard test infrastructure, which, using pure Java, fires up three racks of computers, connects them together with a network topology, installs a storage engine on them, writes a load of data to the storage engine, causes three hard drives to fail, and proves that the storage engine continues operating in the presence of two failed hard drives, all in JUnit.

Now, normally when people start talking about doing that sort of infrastructure testing, they have to invent a whole world and a whole framework for doing it. And yet one 200-ish-line compiler, run over the QEMU source code, gave the capability to suddenly write a simple, readable test in the standard testing framework that allows you to do hardware-based testing of situations that don't even arise in the normal testing world.
That’s where I start to love compilers as a solution to things, and that’s why I think
I will always have a thing for compilers, whether it’s data processing or not. 
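To make the shape of that trick concrete, here is a small, self-contained sketch of a schema-to-API generator in the same spirit. The JSON-ish schema text, the command names, and the QemuControl interface name are all invented for illustration; this is not QEMU's real schema or the actual project on GitHub, just a toy showing how a compiler can turn declarations into an API.

```java
import java.util.Arrays;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class ToyApiGenerator {

    // Invented, QAPI-flavoured example input: one command declaration per line.
    private static final String SCHEMA =
              "{ 'command': 'device_add', 'args': ['driver', 'id'] }\n"
            + "{ 'command': 'device_del', 'args': ['id'] }\n"
            + "{ 'command': 'query-block', 'args': [] }\n";

    public static void main(String[] args) {
        Pattern commandPattern = Pattern.compile("'command':\\s*'([^']+)'");
        Pattern argsPattern = Pattern.compile("'args':\\s*\\[([^\\]]*)\\]");

        StringBuilder out = new StringBuilder("public interface QemuControl {\n");
        for (String line : SCHEMA.split("\n")) {
            Matcher command = commandPattern.matcher(line);
            Matcher argList = argsPattern.matcher(line);
            if (!command.find() || !argList.find()) {
                continue; // ignore anything the toy parser does not understand
            }
            // Turn a schema name like 'query-block' into a Java identifier like queryBlock.
            String methodName = toCamelCase(command.group(1));
            String params = argList.group(1).isBlank()
                    ? ""
                    : String.join(", ", Arrays.stream(argList.group(1).split(","))
                            .map(arg -> "String " + arg.trim().replace("'", ""))
                            .toArray(String[]::new));
            out.append("    void ").append(methodName).append("(").append(params).append(");\n");
        }
        out.append("}\n");

        // A real generator would write this source to disk and compile it as part of the
        // build; here we simply print the generated interface.
        System.out.print(out);
    }

    private static String toCamelCase(String name) {
        String[] parts = name.split("[-_]");
        StringBuilder sb = new StringBuilder(parts[0]);
        for (int i = 1; i < parts.length; i++) {
            sb.append(Character.toUpperCase(parts[i].charAt(0))).append(parts[i].substring(1));
        }
        return sb.toString();
    }
}
```

The point is the one made above: when the upstream schema grows a new capability, you rerun the generator instead of editing the API by hand.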
Tobias MaceyYes, it’s definitely amazing.

The number of ways that compilers are and can be used, and the amount of time that people spend overlooking compilers as a solution to their problem, to their detriment and at the extreme cost of the time and effort put into over-engineering a solution that could have been solved with a compiler. 
ShevekPeople think about it as, you could do this by hand: I could sit down and write JNI bindings for the OpenGL library. But how many function calls are there in OpenGL? If I actually want to call OpenGL from Java, I probably need to generate a Java binding against the C header file for OpenGL. That's several thousand function calls, and that's a job for a compiler, and it happens to be a job for a C preprocessor as well. I think I know which one they used. 
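For a sense of what the output of such a binding generator looks like, here is a tiny illustrative fragment: a Java class of native method declarations mirroring a handful of real OpenGL entry points. The class name and the native library name are placeholders, not any existing project's API; the point is that a generator emits one declaration like these for every function in the header.

```java
// Illustrative only: the shape of a generated OpenGL binding. "toygl" is a hypothetical
// native library containing the JNI glue; the gl* signatures mirror the C declarations.
public final class GL {

    static {
        System.loadLibrary("toygl"); // placeholder native library name
    }

    private GL() {
        // Static holder for native methods; never instantiated.
    }

    public static native void glClear(int mask);

    public static native void glViewport(int x, int y, int width, int height);

    public static native void glDrawArrays(int mode, int first, int count);

    // ...and several thousand more, which is exactly why you generate them.
}
```

A caller would then write something like GL.glViewport(0, 0, 800, 600), assuming a matching native library actually exists; writing and maintaining those thousands of declarations by hand is the work the compiler removes.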
Tobias MaceyAlright, well are there any other aspects of the work that you’re doing at CompilerWorks or the overall space of data infrastructure and data platform migrations that we didn’t discuss yet that you’d like to cover before we close out the show? 
ShevekI think we should have a long talk about compilers in the abstract sometime, because we'll get into very rich, probably very opinionated, and probably very detailed territory.

One thing I will say: don't be afraid to learn. One of the things that I think makes me a little bit odd in this world is that I actually didn't study all of the standard reference works.

If you study a little history, most of the people who slapped these things together didn't study the standard reference works either. So by all means take the course. I had some excellent professors, whom I loved, who put us through the standard compilers course. But I will say that the standard compilers course, and even some of the advanced compilers courses that I've watched because universities have been publishing them online, don't really touch on this. They don't really want to touch on type checking; they just about do basic things like flow control. Get out there, learn, be self-taught, and dig into it, and don't be afraid to do that.

And I have a shelf of books where I went to one of the publishers and said, give me every book you have on compilers. It's one of my intentions to read them one day, but I haven't yet. So my closing thought would be: whether it's compilers or whatever else it is, go for it. 
Tobias MaceyWell, for anybody who wants to follow along with you and get in touch, I'll have you add your preferred contact information to the show notes. And as the final question, I'd like to get your perspective on what you see as being the biggest gap in the tooling or technology that's available for data management today. 
ShevekI think as we move away from some of the enterprise languages and into, particularly, the dataflow systems that we've got these days, we are moving into a world where the languages become harder to analyze and maintain. We're accessing underlying platform semantics through APIs, not through languages.
And I think that that is going to have a cost. I predict doom. 
Tobias MaceyDefinitely something to consider. 
ShevekIt might be strawberry flavored doom.
I don’t know what’s not to do. 
Tobias MaceyAlright, well it has truly been a joy speaking with you today, so thank you for taking the time and thank you for all of the time and effort you’re putting into the work that you’re doing at CompilerWorks. It’s definitely a very interesting business and an interesting approach to a problem that many people are interested in solving. So thank you for all of the time and effort on that, and I hope you enjoy the rest of your day. 

Source: https://www.dataengineeringpodcast.com