Apache Airflow Meetup: Elegant Data Pipelining with Apache Airflow

SPEAKER: Next, we have
Bolke talking to us today. Bolke is CTO of ING Wholesale Banking Advanced Analytics. Before joining
ING in 2008, Bolke worked at the 2004 Summer and
2006 Winter Olympic Games, managing the technology,
communication, and data requirements for
all news and media feeds. That’s two large
event locations. Bolke has run his own
startup commercializing multi-touch technology. In his spare time, Bolke
is a guest lecturer at the University of Amsterdam,
a fun father to Mattia and Timo, and can be found
surfing, obstacle running, or taking in a museum when
the opportunity arises. BOLKE DE BRUIN: Absolutely. SPEAKER: Welcome. BOLKE DE BRUIN:
Thank you very much. [APPLAUSE] Which one is mine? SPEAKER: Here. This one. BOLKE DE BRUIN: Yeah. That’s my name, indeed. Hello. So I’ve got a couple
of very boring slides. So I try to enliven you a
little bit with my presence here, see if that works out. Let’s see if that works. Today, I will talk about
elegant data pipelining with Apache Airflow. I'm going to be borrowing some slides that Max did in the past, and also some from Ian and Fokko, because I've partially done this talk at [INAUDIBLE], Amsterdam, in early spring, I think it was. But it's continuing a little
bit of the thoughts behind that. That's interesting. I think– oh, OK. I can work with that. No problem. Well, I have been an Airflow committer for a couple of years now. Somehow, I got introduced to Airflow by a friend of Fokko, actually. And he said, yo, Airflow is great. Do something with it. I said, OK, let's do something with it. And I found out that a lot of the security parts were not in place. And as I work for a bank, that had to be in place to do the standard things. It's not perfect yet,
as you've just found out from the previous talk– isolation, DAGs, these kinds of things. But it got to a workable state for us, where we can manage the things around that. In the meantime, the department I work in actually grew to 55 people, and it is going to grow to 320. So I'm a little more distant from the real work, I guess. But I still have a love relationship with Airflow, and so I do keep an eye on it. Today, I will talk
about elegance. And what does elegant mean? Well, in terms of
Airflow, that basically means that you will have to
have reproducible pipelines– so things that are
deterministic and idempotent. This is actually coming back
quite often in any of the talks that you see the
committers doing. And it is also one of the main blocking items for people deploying Airflow when they start out, because the deterministic and idempotent parts don't really enter people's first thoughts. They just start building a pipeline. And then they find out, going into the future or backfilling things, that it gets difficult when you don't have idempotency. But that also introduces
the next point right away. And that’s about creating
future proof pipelines. And that’s for
backfilling, but actually, also, versioning, which gets
important for the main subject of this talk, which
is also– or let’s say your underlying thing
there is about, well, where I talk about lineage. It also means that you need
to be robust against changes. So you’ll want to have
easy changes, like adding or removing or changing tasks. And if the tasks are not
idempotent, for example, that makes it hard to
do these kinds of things. And then, finally, what I'll focus on a little more in this talk is clarity. That is transparency about where
data resides, what it means, and where it flows. For this, we borrow from
functional programming. Maybe you've seen the items I just put on the screen a little bit earlier: in functional programming, you model your operations as mathematical functions. And this is maybe
where we started, from this machine doing
the bits and bytes, and then went to
assembly, and then had procedural languages
like COBOL– still being used in my company. And then, finally, we
move to object oriented. And then we say we’ve
evolved, and went to functional programming. So this means– oh,
there we go again– that transformations should
happen as pure functions. So as with a mathematical function: if you put something in, you always want to know exactly what you're getting out. So it's defined
input and output. This is also easier
to reason about and to unit test
for obvious reasons, because it’s deterministic. It also means that there
are no side effects. So for those who are not entirely familiar with this: in the first example, you have a counter declared in global scope, and you add to that counter with an impure add one. The side effect is that state outside the function is being changed. So if I passed a variable in, which this function doesn't even take right now, that state could change underway. The next, good example is a pure add one: it just adds one to the parameter that you pass in. It's very simple. It gets more complex, obviously, when you do more complex functions, but I think it shows the point.
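The impure versus pure example from the slide can be sketched in Python like this (a minimal illustration written for this transcript, not the original slide code):

```python
# Impure: mutates a counter in global scope, so the result depends
# on hidden state outside the function.
counter = 0

def impure_add_one():
    global counter
    counter += 1  # side effect: changes state outside the function
    return counter

# Pure: the output depends only on the input, with no side effects.
def pure_add_one(x):
    return x + 1
```

Calling `pure_add_one(1)` always returns the same value, while `impure_add_one()` returns a different value each time it is called, which is exactly what makes it hard to reason about and to unit test.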
this means that you want to be idempotent
in your actions. So you want to set
a user to active with a deterministic setting: set the user's active flag, in this case, to false. If you just toggle whether the user is active, you might not know what the outcome actually is, what you're getting into your data. That is quite important.
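A minimal sketch of that difference (illustrative code, not from the slides): stating the desired end state is idempotent, while toggling depends on the state you start from:

```python
# Idempotent: states the desired end state; running it twice
# leaves the data in the same state as running it once.
def set_user_active(user, active):
    user["active"] = active
    return user

# Not idempotent: the outcome depends on the current state,
# so a rerun flips the value back again.
def toggle_user_active(user):
    user["active"] = not user["active"]
    return user
```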
So imagine that you pull in data from CoinDesk. To do this properly, you'll have an execution date with it, as a start and end execution date. So this means that you
can go back in time– yesterday, for example,
with that execution date– and pull in the
data for that day. If you do that in
the bad example, you just get the current price. Because if you would
do that yesterday, you would get the
price for today. So that doesn’t really
create the idempotency. Maybe if you want to– imagine you need
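The idea can be sketched like this, with a hypothetical CoinDesk-style endpoint (the URL and parameters are made up for illustration):

```python
from datetime import date

# Hypothetical endpoint; the real CoinDesk API may differ.
BASE = "https://api.example.com/bpi/historical/close.json"

def rates_url(execution_date):
    # Good: the same execution date always produces the same request,
    # so yesterday's run can be reproduced or backfilled.
    d = execution_date.isoformat()
    return f"{BASE}?start={d}&end={d}"

def current_price_url():
    # Bad: "current" depends on when the task happens to run.
    return "https://api.example.com/bpi/currentprice.json"
```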
Or imagine that you need to do a mortgage calculation in banking. You want to be able to reproduce that mortgage calculation, because your customer might ask you for an explanation of how you got to that answer. So if you do that on
your current models or your current data sets, that might give you a different answer than the one you gave originally. So this is important in
multiple circumstances. This brings you to the point that the data needs to be immutable as well. So you never append to a partition; you overwrite it. This lets you fully recreate the data for the time frame it was created in. It also makes it easier to do things in parallel: you can just schedule the first day, the second day, the third day, the fourth day, or do it by week. The example that's there overwrites the partition of the cryptocurrency table for a given execution date, making sure that you write it by date.
Again, these are the final partitions. You'll see this throughout your data: you have raw sales, then fact sales, then aggregated sales, but they all retain the same execution date. So you can always go back into those partitions, find the data again, and reproduce it if needed. So if you have a task that needs to do a different kind of transformation, you can reproduce it for that part. Now we get to the somewhat more
exciting stuff, or, let's say, early work that is to a certain extent functional in 1.10-plus. This is where it gets kind of bleeding edge, and we're still working things out. I'm actually inviting you to provide some input here on what the direction should be, because this is partly my own thinking, partly what we see happening in the company, and the requirements we see, especially the European GDPR requirements. There are also discussions on the mailing list about being able to do data flows instead of just sets of tasks. But integrating that with the task-based thing that Airflow basically is, and making it a nice hybrid, maybe another Frankenstein, is something we
still need to work on. So, for example, think about data as inlets and outlets. Basically, you do a transformation on a data set, which is your inlet, and you create an outlet, in this case, another table, maybe. You can then create templates for your operators, like the one on the screen that says: I'm just picking this up from the table name specified in the inlets. Then you create that partition as just shown before, and you can do that transformation fully templated.
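In Airflow 1.10 the experimental lineage support looks roughly like this, adapted from the 1.10 documentation; the module path and argument names are experimental and may change between versions, and the file paths here are illustrative:

```python
from airflow.lineage.datasets import File
from airflow.operators.bash_operator import BashOperator

# The inlet and outlet paths are templated with the execution date,
# so each run reads and (re)creates exactly one partition.
f_in = File("/data/coindesk/{{ ds }}/rates.json")
f_out = File("/data/crypto/{{ ds }}/prices")

transform = BashOperator(
    task_id="transform_rates",
    bash_command="transform.sh {{ ds }}",  # illustrative script name
    inlets={"datasets": [f_in]},
    outlets={"datasets": [f_out]},
    dag=dag,  # assumes a DAG object defined elsewhere in the file
)
```

Declaring inlets and outlets this way is what lets Airflow hand the lineage information to the next task downstream, as described below.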
We had a bit of a discussion with Max earlier. He was a bit surprised that you set this at the instantiation of the actual operator instead of it being hard-coded into the DAG. Doing it at the
instantiation of it makes it a bit more flexible,
because for this, we're actually using XCom to communicate to the next tasks downstream what the settings are, what the inlets are. If we adjusted the operators to actually understand what their inlets are, then instead of the HTTP operator calling out to one source and one destination, you could say: I have a multitude of inlets, and I'll get a multitude of outlets because of that. And it can be done in one operator, certainly. That makes things more scalable. And you can get better
modularization of the flows that you're creating. You can then change the code over time: previous DAG runs can be recreated with new code, and data can be repaired by rerunning the new code, with a clearing task for doing backfills. This gets down to the idempotency again. And reproducibility is critical in this case. This is important from a legal standpoint, as I just explained,
for the mortgages. And for banking, at
least in our case, we need to be able to
explain to our customers how we get to the
certain decision. And I think it gets even more important in other areas as well, because you have discussions about how things work and where you made a mistake, because human beings make mistakes, even on the data science side. And it also helps you keep sane, in this case. So what kind of things can you do for that? Clarity by lineage. People often think that lineage is regulatory-driven only, so from the [INAUDIBLE] side. But it's actually also
interesting from a data engineering side and
data science side, because it answers the following
questions for a developer– what is the latest version of
the data that I actually need? Maybe I want to have an
earlier version of that data. Does it exist? Can I go back in time? Maybe with versioning
of the data that you– or that you do,
because we might want to be able to version modes, but
we then, also, need to version the data. So as we’ve changed, maybe,
the DAG to re-run a new– or add a task to
it, we still might want to be able
to go back in time and get the old
data set from now. The question is also, obviously,
where did I get the data from? All the transformations that we
nowadays do in our companies, we might want to
backtrack on that and see, where did
it actually start? And what’s important
is, where do we store this kind of information? Because Airflow itself
is not, maybe, the right place to do these
kind of things, because these graphs are
getting really, really big. And Airflow might do
that on the DAG level and show you what
the data flow is. But going back in time
in that way in the UIs may be not the right
thing, forever, to do. And maybe you want to integrate
with the other systems there as well, and
not just Airflow. Because your data
scientists might do things outside of Airflow, obviously. So there’s rudimentary
support in– I’ve got five minutes? Like that? Oh, that’s quick. I have to be faster. So there’s rudimentary
support in Airflow for supplying this kind of
information to, in this case, Apache Atlas. I don’t know if you spot the
problem with this picture, because this is the data flow. I don’t know if anybody
recognized the issue with this picture. And Fokko, you can’t tell,
because I told you already. AUDIENCE: [INAUDIBLE] BOLKE DE BRUIN: What’s that? AUDIENCE: Is it race? BOLKE DE BRUIN: Race? AUDIENCE: [INAUDIBLE]. BOLKE DE BRUIN: No. No, it’s not a race. No, no. Anyone else? AUDIENCE: I don’t think you can
see it from behind the beam. It’s a bit unclear. BOLKE DE BRUIN: Do I– can
I zoom in a little bit. AUDIENCE: [INAUDIBLE]. BOLKE DE BRUIN: Oh my god. I’ll go down again. I’ll explain it, then. No worries. It’s not an idempotent DAG. Because it starts at the
source, and I’ve run this DAG, basically, twice, with a
different execution date. But the source and the
destination stay the same. Basically, I would
expect a straight line– or maybe two, in this case– because the initial source
needs to be showing up twice. And the destination needs
to show up twice, as well. And this immediately shows where you can maybe gain maturity, eventually, in your own DAG pipelines by having this kind of visualization. And then you will
start understanding why it’s getting important. So let’s dive in a
little bit too quickly, then, because I have
four minutes left. Imagine that you’re
using a machine learning model that needs conversion
rates for your currencies, just– I’ve shown you before, quickly. This is used for advice
to your customers. And for your business,
it is important that you’re able to
explain to your customer how you got to a
certain decision. Basically, the reasoning I
gave to you just a minute ago. So our DAG consists
of three tasks– download currency
data from the web, run the machine learning
model in Apache Spark, and drop the data in
Apache Druid for OLAP. It's a simple DAG definition; I won't go through it. I expect you to understand it, or just grab some examples, if you please. Then, what's new? You can say: my inlet is a file from CoinDesk, getting the right data from there, templated with the execution date. And the outlet is: I want to put it somewhere in S3 buckets, at another place, with the execution date and the version associated with it. And the number of versions that I want to keep is five. So my operator basically stays as it is; I use these inlets and outlets. The difference from earlier is that, instead of using source and destination, I use the inlets and the outlets.
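The "keep at most five versions" idea could be sketched like this (purely illustrative; the actual versioning mechanics in this lineage work were still being designed at the time of the talk):

```python
# A toy version store: per execution date, keep at most max_versions
# copies, dropping the oldest ones when the limit is exceeded.
def write_versioned(store, execution_date, data, max_versions=5):
    versions = store.setdefault(execution_date, [])
    versions.append(data)
    del versions[:-max_versions]  # retain only the newest max_versions

s3_stand_in = {}  # stands in for the S3 bucket
for run in range(7):  # rerun the same execution date seven times
    write_versioned(s3_stand_in, "2018-05-01", f"prediction-v{run}")
```

After seven reruns of the same execution date, only the five newest versions remain, which matches the retention behavior described above.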
Then, finally, I get an outlet that creates or contains the prediction. The Spark submit operator receives that and determines the inlets automatically, because it received those through XCom: the outlet from the previous task specifies the inlets for the next downstream operator. And from the outlet
in this case, I have the Spark
submit operator. I tell the Spark submit
operator what it needs to do. You could imagine that, with a bit more or better integration with Spark eventually, even that is not required, because Spark will deliver this information somewhere. It's hard to get at the moment. I have one minute. Sheesh. I'll be faster. I won't talk faster. I'll just take a little
bit more time, maybe. And then, finally, we drop
it in Druid itself. I hope that's kind of clear. The example is there. And then the result that you get is this. Now it becomes a straight line, because if we run this one again, I will get a new date associated with it. Now, let's imagine that you can tag the data. So imagine the S3 bucket where you put that first set of data in is tagged: this contains personal information. Then we could trigger a DAG
and set the inlets, basically, on the triggered DAG, which is more generic. And there's a bit of an error in the slide. But it understands that it can obtain the tag from the metadata. Then you can have a generic DAG for any kind of PII data: if it's in a table, it just specifies, at the column level, I have PII data, and I can remove it. So [INAUDIBLE] because it doesn't work yet. But imagine that, for every inlet you get, you have a cleaned table set, and through the clean columns that you still want to keep in that data set, you can run through that. And then you can put this as a module inside any kind of pipeline, and it will still work as long as the metadata is there. Got very close. So I'm concluding this
one with, obviously: build data pipelines that are idempotent, deterministic, have no side effects,
use immutable sources and destinations, and don’t
do update, upsert, append, or delete. Thanks very much. Hope you enjoyed it. A bit of bleeding edge. [APPLAUSE] Can I take one, two, three
questions– something like that? Anyone has a question? AUDIENCE: [INAUDIBLE]
versions, in your appraisal, is it max five versions it's supposed to be? BOLKE DE BRUIN: Well, if it
would rerun at that moment, you would create a new version. So let’s say I’m getting that
CoinDesk data from today. If I rerun it again, I get a new
version of today, in this case. AUDIENCE: If you want
infinity, let’s say. BOLKE DE BRUIN: You
could do infinity, yeah. So I wouldn’t expect you
to do infinity– but that gets a bit difficult. But
yeah, you could do that. So depending your
business needs, you should be able to specify
how many of these things you need, because
as I explained, you can add a task
to the DAG that would generate a new version
for that execution date. Obviously, you'd want to be able to go back in time for other reasons as well, in this case. Anyone? Yeah, Sid. AUDIENCE: [INAUDIBLE] BOLKE DE BRUIN: Do you want
me to repeat the question? For the recording. Also, OK. [LAUGHTER] I need to factor in
that one as well. So the question was: why versioning? Or rather, why max versions? OK. And the explanation is: this is a business requirement, how far back in time on the particular version you want to go. Anyone else? Yeah. AUDIENCE: In the example
you showed up there for the backfill, you still
have the execution date as the macro. So have you thought
about using start date– not start date of the
pipeline, but start execution date and end
execution date as new macros? Because what we found in our use case is that, in some cases, we'll run a pipeline every day, for example, for ETLs, and have it look back a few days, like three days, just in case data points are missing. But [INAUDIBLE] the backfill, really, you don't have to look back three days. So they'll write a query like: between DS and DS minus a look-back window, like three days. So we feel that having a macro to replace the execution date with a start date and an end date turns out to be more efficient. A user can write more backfill-compatible pipelines. BOLKE DE BRUIN: To be
honest, that question got a bit complex. AUDIENCE: I wonder if that makes sense? Am I making myself clear? BOLKE DE BRUIN: Kind of. And I don't know if I'm going
to be able to fully answer it, because basically, these are early thoughts which we're trying out. So basically, we're open to these kinds of things. Get back to us on the mailing list, for example, with these kinds of ideas while we're implementing it, and we can just do it. All the templating that is
currently working in Airflow is also applicable to
the inlets and outlets, so it’s not separated
in that way. So you should be able to
specify everything you want if it’s already available. AUDIENCE: Thank you. BOLKE DE BRUIN: Final question? No final questions? Well then, thank you very much. And then on to the next speaker. Thanks. [APPLAUSE]
