Data Engineering: where are we and where are we going? w/ Joe Reis and Matt Housley of Ternary Data

This is a podcast episode titled "Data Engineering: where are we and where are we going? w/ Joe Reis and Matt Housley of Ternary Data". The summary for this episode is:

What is the state of data engineering today and where is it going (or should it be going)? Who better to talk about data engineering than the authors of the recent O’Reilly book “Fundamentals of Data Engineering”, Joe Reis and Matt Housley from Ternary Data.

Join Tim, Juan, Joe and Matt to discuss the state of data engineering.

Conversation highlights:
  • [00:44] Introduction to Joe Reis and Matt Housley
  • [03:24] Warm up: what have you engineered or built that has gone terribly wrong?
  • [08:28] Data engineering tools and magpie syndrome
  • [13:15] How often and where do you see data teams stumbling to get the value out of the technology they have?
  • [17:08] Thoughts on diving deep into technologies where there is a skills gap
  • [19:39] The curse of familiarity in the context of tools
  • [23:55] Learning outside the context of an enterprise, and continuously educating yourself on data
  • [27:23] Data modeling and where to start learning
  • [30:48] Data modeling = business concepts and definitions to relations among data
  • [32:05] The human aspect and technical side of data modeling
  • [36:19] Business literacy for data engineers
  • [38:39] The lifecycle of data
  • [43:02] Analytics engineering, the knowledge scientists
  • [46:09] Roles within data engineering and bifurcation
  • [48:26] The future of data engineering and tabular data
  • [52:12] Auto Machine Learning and Tabular Data
  • [55:00] Lightning round
  • [01:01:01] Tim and Juan's takeaways
  • [01:06:18] Three questions about data, life, resources and the show's next guest
Introduction to Joe Reis and Matt Housley
02:35 MIN
Warm up: what have you engineered or built that has gone terribly wrong?
04:58 MIN
Data engineering tools and magpie syndrome
04:44 MIN
How often and where do you see data teams stumbling to get the value out of the technology they have?
03:52 MIN
Thoughts on diving deep into technologies where there is a skills gap
02:29 MIN
The curse of familiarity in the context of tools
04:13 MIN
Learning outside the context of an enterprise, and continuously educating yourself on data
03:23 MIN
Data modeling and where to start learning
03:21 MIN
Data modeling = business concepts and definitions to relations among data
01:15 MIN
The human aspect and technical side of data modeling
04:13 MIN
Business literacy
01:54 MIN
The lifecycle of data
04:18 MIN
Analytics engineering, the knowledge scientists
02:15 MIN
Roles within data engineering and bifurcation
02:16 MIN
The future of data engineering and tabular data
03:45 MIN
Auto Machine Learning and Tabular Data
02:43 MIN
Lightning round
05:53 MIN
Tim and Juan's takeaways
05:14 MIN
Three questions about data, life, resources and the show's next guest
04:43 MIN

Speaker 1: This is Catalog & Cocktails, presented by data.world.

Tim Gasper: Hello everyone, and welcome to Catalog & Cocktails, presented by data.world, the data catalog for leveraging agile data governance to give power to people and data. We're coming to you live from Austin, Texas. It's an honest, no BS, non-salesy conversation about enterprise data management with tasty beverage in hand. I'm Tim Gasper, longtime data nerd and product guy at data.world, joined by Juan.

Juan Sequeda: Hey, Tim. I'm Juan Sequeda, principal scientist at data.world, and it is Wednesday, middle of the week, towards the end of the day. It is that time to go take a break, and let's go chat about data. Today, it's such a cool episode because we have two people who are, I think, the brilliant brains right now in the industry on data engineering. This is Joe Reis and Matt Housley from Ternary Data, and also the authors of this new, recent book, Fundamentals of Data Engineering from O'Reilly, which is a must-read book. Which I have not read yet, I will acknowledge, but I just got it, it arrived yesterday. I'm just so blown away, I was just telling you guys, about how you guys have structured this book. I'm so excited to dive into it, and today we're going to chat a lot about this whole topic of data engineering. How are you guys doing?

Joe Reis: Good, how are you?

Matt Housley: Doing great.

Juan Sequeda: Well, it is great, you guys, to have you. We were on your podcast recently, and you guys are just killing it on the conversations that everybody needs to be having.

Joe Reis: Awesome. Thank you.

Juan Sequeda: So tell and toast, what are you drinking and what are you toasting? What are we toasting?

Matt Housley: I'm drinking water so I'm doing okay. I'm actually not doing as well as I would like, some illness sort of flattened me last night. I'm here drinking water rather than something more interesting.

Joe Reis: That's actually vodka, don't lie.

Matt Housley: That's right, that's right.

Joe Reis: Oh man, I'm so sick. I got to drink this whole pint of vodka. And my fake drink is green tea, so I usually drink green tea in the afternoons. Sometimes I'll down a bunch of espressos though, but today it's caffeine, no alcohol, because I've got to go drive my kid around after this, so I don't want to be that dad. So.

Juan Sequeda: How about you Tim?

Tim Gasper: I am drinking some Neptunia Hendrick's. It's a very light, special blend of Hendrick's gin with a little bit of lemon lime soda and a little bit of simple syrup. So just a light, soda type drink. So not bad.

Juan Sequeda: Well that's a good idea. I'm at the office and we're packed and everybody's here, so I didn't have time to go mix myself a drink. I'm enjoying a delicious IPA. This Deep Ellum one is actually-

Joe Reis: Mm, nice.

Tim Gasper: That's a good one.

Matt Housley: That's good.

Juan Sequeda: This is a really nice one. And anything you want to go toast for right now?

Tim Gasper: Toast to Matt's recovery. I hope that you recover soon and you're back to normal so it'd be good to hang out again. I also hope I didn't catch what you have. So toast of-

Joe Reis: My future recovery. So there's preempting that.

Juan Sequeda: Let's toast to health. Cheers.

Joe Reis: To health. Yes, of course.

Matt Housley: Cheers.

Juan Sequeda: All right, we got our warm up question. So what have you engineered or built that has gone terribly wrong?

Matt Housley: I have an answer here. So I was working on this project, and people know I'm an orchestration advocate. I've worked with Airflow and some with other projects as well. And so, on this particular project I resisted orchestration for a long time. No, no, no, let's keep it simple. And eventually it got complicated enough that we needed to do orchestration. So I set it up, and then pretty soon the team wanted to do all this stuff to it that was just kind of abusive, like run jobs every five minutes and stuff, and so we started adding that. And then pretty soon I was leaving this company and going to a different job, and so I tried to hand it off and it just didn't get handed off. No one picked up the slack, and then years later it's still running and not being upgraded and various other things. And I'm like, "How do you guys still have this running?" So I guess the point maybe behind this rambling is that, as we develop things and as we deploy technologies, there's always this organizational problem of ownership in addition to just building something out. Yeah.

Juan Sequeda: Who owns things is always a question.

Joe Reis: Yeah. Oh that's a really big question for sure.

Matt Housley: It's a huge question.

Joe Reis: Yeah.

Juan Sequeda: How about you Joe?

Joe Reis: I can't really think of anything that I've deployed where it broke per se or that I'm absolutely ashamed of. I don't know if that's me conveniently forgetting everything that I've written that has broken. But yeah, it's a weird one. I was thinking about that question before and I was like, "I don't know." I think if I engineered something wrong, it was probably my calendar, which I'm trying to get better at. But as far as any technology I've deployed... Proud to say I haven't, which means I probably wasn't trying hard enough or something like that, I guess. I don't know.

Juan Sequeda: You've been too much in your comfort zone then, probably. I don't know.

Joe Reis: Yeah, or I had a lot of people keep me in check too. So there's that. If you work on a good team then the chance of you... I would say that I had the chance to introduce a lot of things that probably would've been very disruptive, yes. But it's all counterfactual. So, yeah.

Juan Sequeda: You, Tim?

Tim Gasper: This is an interesting question. I'll give a really quick work example and then I'll give a quick personal example. So work example is, I was working for a startup. I started a company 12 years ago called Cork Share and it was actually a lot like Pinterest, around the same time actually that Pinterest was getting started. And the day before our demo day for presenting it to investors, we decided to add a major new piece of real-time functionality to the platform. And the demo went down in flames. That's an example of a thing that I wish I could have back. My early days of product management, realizing, wait, "Maybe you shouldn't be developing and launching brand new features the day before the big demo." So, and then-

Joe Reis: It's gutsy.

Tim Gasper: Yeah, that was gutsy. It was a bad move, and now I know. Personal thing, I was trying to put a big TV on the wall the other day and I bought a mount to put it on the wall and that was a disaster. I don't know. Me and TV mounts, that's-

Joe Reis: Wait, do you still have the TV?

Tim Gasper: I do. I haven't destroyed it.

Joe Reis: Okay, that helps. Okay.

Juan Sequeda: It didn't fall or anything.

Joe Reis: I put one up too. We have this giant TV in the next room over here and yeah, mounting that thing on the wall is, yeah, you're not sure if you're on a kamikaze mission or something, because when you let go of the TV you're like, "I hope this holds."

Tim Gasper: And you need more than two friends, you need like five.

Matt Housley: To position it exactly, right. You need more control. Now was this right after the cocktail hour or anything like that? Were there any extenuating circumstances after-

Tim Gasper: It was after our show actually, so yeah.

Juan Sequeda: I don't know, whatever. I think I'm being boring. I do not have a good answer for this one. I mean yes, we probably built systems that should not exist anymore. I think that's some stuff. There's some code that I've written and I'm like, "I don't think that code should be running but it's still there, but nobody's ever complained."

Joe Reis: Yeah but that's just it. And Matt's example I think actually really illustrates the reality of a lot of things where I think as long as it's working is it worth redoing just for the sake of it? Matt and I always say, I can't remember who said it, but legacy is a condescending term for things that make money and I think as long as it's working and doing its job and contributing something then, I guess maybe it did its thing. Could it be clearer? Sure.

Matt Housley: My riposte might be what we emphasize in the book, which is security above everything else. And so that's the one thing that's-

Joe Reis: Don't tell them that part. Yeah.

Matt Housley: So keep stuff updated so it's secure.

Juan Sequeda: Right.

Joe Reis: In a public S3 bucket and here it is by the way. So...

Juan Sequeda: Yeah. All right let's dive in on... All right, honest no BS, why are data engineers so fixated with tools today? Let me go start with this.

Matt Housley: Joe likes to refer to this as shiny object syndrome or I've heard it called magpie syndrome as well.

Joe Reis: Magpie syndrome?

Matt Housley: Yeah, yeah. Meaning magpies chase shiny stuff. That's what they do.

Joe Reis: Oh, I didn't know that.

Matt Housley: Bits of glass and silver and metal. They'll gather pieces of metal in their nests. Anything that's shiny, they'll collect, supposedly. I mean maybe this is some Disney channel mythology or something like this, but yeah.

Tim Gasper: That's interesting.

Matt Housley: So I think engineers, why do we get into engineering in the first place, right? Because it's cool. It's like, "Oh look at this cool technology and scalable clusters and everything." And so I think there's a real tendency to chase after cool stuff just because it's cool. And then the second thing is this idea of resume-driven development, that's another major problem.

Joe Reis: Yeah, I was thinking, because technology and tooling is the easy part. I think I've counted six conversations I've had today with people about something I was thinking a lot about this weekend while I was writing an article on data modeling, and the question behind the article is why is data modeling seeing a resurgence all of a sudden? Why is it a hot thing? So I had a LinkedIn post this last week just asking, "Okay, so what would you add to data modeling?" And it's a constructive conversation. But then I got thinking, "Okay, so why are we having this conversation?" It's not like this is a new conversation, it's not like this is a new idea. I mean Codd and Chad in the relational world, I mean that's super old. It's older than all of us. And then even dimensional modeling, I mean that's nothing new either. But why is it we're rediscovering these ideas and these practices? Why are we having this conversation? Then it got me thinking. So along the way tooling's gotten really easy. So technology is a lot easier than it used to be. If you rewind 15 years and we're trying to do data stuff, I'd wager a guess that it was technologically more constrained and very much more expensive, especially if you had a data warehouse, say, right? You couldn't just go spin that up in the Cloud. That didn't really exist back then. That wasn't an option. You would have to do some arm twisting and sell a bit of your soul to make this happen. And so now it's different. With a credit card I can go spin up whatever tools I want. In a lot of ways I feel the technology is the easy part now. And what's happening is, I feel like, because technology is easy, but also because what I notice is there's not as much emphasis on the people and the process of the people-process-technology continuum, what you're seeing is an overweighting on the technology. And so you're able to do a lot more stuff with the technology. It's easier to fixate on it. But you're still repeating the same dumb processes over and over again, because you haven't built that foundation. You don't know how to see with data, for example. You don't have data literacy perhaps, maybe you don't have good team dynamics, maybe you don't understand data modeling and all these other processes now. And so this leads you up to technology. So if you don't have that foundation, you can certainly do a lot of things with technology, but I think it's going to be a very substandard experience, which is what I think we're all seeing. In our talk you were hitting on knowledge as being something that's incredibly important. And to me that's a people and a process thing, not a technological question. That's just learning how to see better as a person and how to view the world better. But that's not a tooling question, but it's easy to gravitate towards tools, and every vendor in the world wants you to try their tool that will solve everything. So I think that's why that happens a lot, in addition to what Matt said.

Tim Gasper: Yeah, that comment you made, Matt, about resume-driven development, that resonated with me a little bit. There's a lot of, well if I have Dagster and dbt and Snowflake and a lot of these things on my resume that a lot of times... And this is where I wonder if the fault is a little bit on us, as for those who have to hire data engineers and are building data too-

Matt Housley: Yeah.

Tim Gasper: That we're also looking for these hot button tech too. So we're part of the problem here. There definitely does seem a lot of like, "Well what's hot? What's new?" And maybe we fixate a little bit on the technology skills and experience, more than the fundamental stuff. How good are you at interacting and collaborating with other people? We may think that we value that a lot in our resume, but are we valuing it enough?

Joe Reis: That's a very good point.

Matt Housley: Yeah.

Joe Reis: So I mean we all work in data obviously, and I think we all work with a lot of companies and talk to a lot of people. So I mean riddle me this, in companies where the tech stack on paper looks very impressive and significant and could potentially do a lot of things, how often is it that you see these data teams still stumbling or maybe not getting the value that they should out of the technology that they have?

Juan Sequeda: So you're saying that the tech stack is established, you bring the people in, the data teams in, and then they're struggling with the current tech stack. How much is that the actual scenario? Is that the question?

Joe Reis: It's one of the scenarios, yeah. And I guess "adding value with data," that's the other component of it. Yeah.

Juan Sequeda: I go back to looking at the entire landscape. You look at all the data landscapes, all the different tools, and we say this all the time. Every single feature is now becoming a category. It's just ridiculous, the amount of tools you have to go through. But if we look at the principles, they really haven't changed. And you have it here in your book too, right? You have to ingest data, you have to store data. I mean at the end of the day, the principles are, you move data, you got to store the data, you got to compute and query the data, then you got to go use the data to answer some questions. And yes, there are more details to that stuff, but essentially this stuff is the same. So if you have the tools to go move data, if you have the tools to store and compute data, if you have the tools to go analyze data, I think you have the basis there to be able to go start providing value. And then I think that's when we start getting into like, "Oh, but there's that shiny object that makes this little thing more efficient." And I think a lot of these tools too are about, "Oh we drive more efficiency." It's like, okay, that drives efficiency for your data team, but are we generating more money, saving more money for the company? Are you making that argument? I mean, "Oh we're more efficient so this, but we'll probably do more things," but is that really happening or not? Because we just get so bogged down in implementing all these technologies. So my answer I think I'm answering... Coming up with the answer to your question, which is, if you have these bases of these principles, you should be able to go do a lot and you should be able to provide value. Now the question is, can you provide more of that value, and can that value be reached faster? And I think that's when we say, "Throw more technology, that's going to happen." And I'm like, "Ugh, not always."

Matt Housley: And I think we really saw this in the Hadoop era. Everyone wanted to jump on the Hadoop bandwagon because it's like, "Oh we're going to have big data. In fact we have big data, it's lurking on our servers, we need to collect it and analyze it." And then in practice, some of these projects generated a lot of value. Many of them did not because there was a focus on the technology and not actually on the outcome that they were trying to achieve with this technology. And I think absolutely, absolutely the same thing can happen now with Cloud. Snowflake, BigQuery, even Redshift are amazing technologies, but if you don't have a strategy for using them, if you don't have a purpose, if you don't have some organization of your data, then you're not going to get very far in terms of delivering value.

Tim Gasper: We had a big sort of skill rush when the Hadoop movement happened and it was like you couldn't have enough people learning MapReduce and Pig and Hive and all these different things. And we were talking a lot about the skill gap and so a lot of people were focused on like, "Oh I got to skill up real fast on these new areas." And I assume that was helpful as that was a hype topic, but now that we're kind of moving on to like, "Okay, this is just one more tool in a toolkit," I wonder where that leaves us and I wonder if data engineers felt like... Do a lot of data engineers feel like they valued diving deep into those technologies or did that end up being a little bit of a fool's errand?

Joe Reis: I think there's a couple threads to this. If you take a step back, the skills gap as it's been discussed mostly relates to technology. So I have a skill gap in Cloud technologies, for example. Go get certified, now you're competent in using the Cloud. That kind of thing. That's a remedy. But it doesn't include or focus on, again, the people or process skills. What I see as fundamentally missing is knowing how to assess business problems, that's a people skill. Knowing how to talk to stakeholders and map out what's the path towards the goal that we're trying to achieve with this data project, for example. That's something I think that's extremely underrated, but when I see most data projects fail, it's because of that specific thing. Asking the wrong questions. That has nothing to do with whether or not you're good at Hadoop or you're good at Velox, the new query engine that Facebook just open-sourced the other day, or whatever. I'm talking about recent to old stuff and everything in between. And it's like, why is it that... I've been in data for a long time now and it's like, why are we still asking the same questions that we've been asking for, well, as long as I've been in it, and now I talk to people like Bill Inmon, as long as he's been in it. We're fundamentally on this hamster wheel of asking the same questions. And I think-

Tim Gasper: New vendors, new tools, new tech, same questions.

Joe Reis: Same questions. And I was thinking a lot about this over the weekend, because I just have a very exciting life. But the questions that kept coming up as I was writing and thinking were, well, I'm asking the wrong questions about modeling actually, I'm not assessing the reason why we've been recycling the same stuff over and over. And again, it does come back to, I think fundamentally, we're not upskilling people in the right areas. We're expecting them to have tech skills, but that is one component, and I would say maybe 16% of the way there, in that you need the people skills, you need to understand how to assess problems, you need process and to put those in place. Technology is there to enable it after, and I know this is something a management consultant would tell you, but the success of management consultants I would say is pretty mixed too. So why are we here?

Juan Sequeda: Okay, so what are the things that we should be telling the data engineers today that you should focus on these types of skills? You said one, how to assess problems. I got mine and I rant them all the time. I don't know, I'm going to shut up now and let you guys speak.

Joe Reis: Matt?

Matt Housley: I mean we dedicate a whole section of the book to actually choosing technology, because with the flexibility of the Cloud, in most data engineering jobs you're going to run into, and in architecture jobs in the current market, there's going to be some degree of upgrading and changing technologies and such. And I don't think we spend enough time talking about actually assessing technologies based on business problems and then making choices in that kind of framework, rather than just picking really cool stuff. All my friends are using Snowflake, therefore I should use Snowflake. That's not a good reason to use Snowflake. There are lots of great reasons to use it, but that's not a good one. And that's how we do things right now. I don't know, Joe, you go.

Joe Reis: You call it the curse of familiarity, right, Matt?

Matt Housley: Yes.

Joe Reis: I mean it's a classic hammer and nail situation. I mean people are just comfortable using what they have always used, and so therefore they'll use it. I mean you see this all the time. Say that you get a new tech lead or a new CTO in a company, what's the first thing that they do? We were talking about this on the show with Milan last week and he brought up a good scenario. It's like that person blames the other person for all the problems and it's like, "What they did was stupid. We're not going to keep that around." So I know this tech stack I used at my last job, and as he correctly points out too, it's like, "Well I'm probably leaving my job for a reason, maybe I wasn't that successful at it, so I'm getting into this new job now where I can become victorious and so forth." So I bring in the tech stack I'm familiar with, and this happens all the time. I would wager probably dozens of times each day or more. And so that gets you into this interesting vendor lock-in before you've even gotten locked in with a contract.

Tim Gasper: This reminds me of, we got football season coming up. It reminds me of when you get a new coach and the coach brings their playbook and they're like, "Hey, this is my play. We do West Coast offense," right?

Joe Reis: Oh that's funny. Yeah, exactly right though, I mean that's just kind of how it works I suppose. But you see this in other industries too. You're going to bring your favorite whatever tool into your company because that's what you're comfortable with. And then everyone has to do that, and hopefully this company hasn't seen five versions of you in the last year, because it's going to get very jarring very fast. But this happens all the time. So I'd say that, more than a lot of other reasons, this is why tech stacks exist in companies. It's not because it was the best thing to do at the time, it's because it was somewhat of a premeditated lock-in position that just happened. As you say, West Coast offense, coach comes in, let's do it.

Tim Gasper: Right. I mean, how do data engineers handle that? So we're talking about skills data engineers need to be focused on, and it sounds like there's a pretty dynamic element of being able to work in same patterns, different technologies, same patterns, different Clouds. I don't know.

Matt Housley: Another one I would add, and Joe and I, there's a whole section on this in the book, we actually talk about it quite a bit in many sections, is this notion of enterprisey data engineering. And especially in the early 2000s and with Hadoop and Spark and everything, there was this notion, and this goes along with Adam Alsop's comment about data lakes being treated as this magical beast, where no one wanted to do the old school stuff. It's like everything is new, it's big data, don't worry about modeling, don't worry about schema, don't worry about cataloging. And enterprisey data engineering is the stuff that goes back to the 80s that maybe no one wants to do, but it's actually super, super valuable. Thinking about your model, thinking about the purpose of the data, cataloging and tracking the data, data quality, which increasingly we now call data observability. This whole notion of monitoring the quality and quantity of your data in semi-real time so you can see when things are going wrong. And of course you guys fit right into that, because data catalogs are right in the middle of enterprisey data engineering.

Juan Sequeda: So this is a really interesting notion you raise, right? Because I think people learn to become, or they take courses, they get certified now to become data engineers and data scientists, and take a master's course in nine months or a year, whatever. But you're not in the enterprise, and you can't expect to understand what the enterprise is going to look like in the nine months or a year of the course that you're taking. So they learn these things without understanding the context of what happens within an organization. A couple months ago, I was invited to go give a course on agile data governance to a master's in AI course. And all these folks from industry, they're learning all these great AI things, but they had no idea. It's like, "Oh, I'm going to go back to my office." And, "Oh, give me the data and I'm going to go do this cool AI stuff." But that's not how it's going to work. You're not just going to go get data, you need to go understand what governance is and where you find that data. So there's this big disconnect between what you want to go do and what the cool pretty pictures and stories are telling you, and when you get into the reality it's like, "Oh, the world is very different." And then, at the enterprise level it's like, "Oh, we're not cataloging tables that have names that look very nice and it's just 10 or 15 tables. No, we're cataloging thousands of tables with tens of thousands of columns with words that make no sense to anybody." That's the enterprise world. And I think this is the balance that not everybody is aware of, and people dive in thinking, "We're going to go hit the ground running." I'm like, "Wait, there's a reality that you have to go encounter there." And I think that's the difference.

Tim Gasper: Yeah, there's not really a class in college that prepares you for that. It would be called 50 years of baggage and you're thrown into the middle of it.

Matt Housley: Yes. Yeah, it's funny too because I think we do a poor job of training data scientists on fairly standard tools. Even SQL, I find most data scientists are taught to do everything in Pandas and not to use much more powerful tools that are just available within their companies. And yes, there's absolutely a use case for Pandas, but most of them can barely stumble around in SQL. That's really unfortunate. And so there is this big disconnect between enterprise data that we've been doing since the 80s and people wanting to completely reinvent everything, I feel like.

Joe Reis: Well yeah, I remember there was a time when SQL was a four-letter word, because obviously Python and Scala and all these other languages were how you were supposed to do data, and SQL was old school. I remember when the data warehouse was left for dead; it was thrown in a dumpster and lit on fire. Then somehow it crawls out and it's back bigger than ever. But there was definitely a time with data scientists, it was like the late 2000s, early 2010s, when all the stuff we're talking about now and building tools around, I mean this stuff, it wasn't cool.

Tim Gasper: Yeah, it actually had a problem, right? They call it NewSQL because they were worried; they were like, "No, we're not like old SQL. We're NewSQL," right?

Joe Reis: Well yeah and it's like it, yeah... Data warehousing was like the business mullet for a while. It just wasn't cool.

Juan Sequeda: So one of the things that has already come up a lot, thinking about skills that data engineers should be focused on: let's talk about data modeling. Because as you've mentioned, you've been thinking about it, we've all been thinking about it. I've been talking about it a lot. So have you. You wrote this post on LinkedIn a couple days ago, right? You got a hundred comments on it. Why is it resurging? Should all data engineers be learning data modeling, or only a certain type who are interested, and how do you actually start? How are you seeing this?

Joe Reis: I would say that they at least need to be aware of its existence. Let's start with that. I was actually talking with, he's a famous data modeler, I'll leave it at that, over the weekend about this. And he was like, "Oh, I thought everyone knew data modeling." I was like, "You'd be shocked." I would reckon, probably, Matt and I, our estimation from our conversations with engineers and data engineers and software engineers and data scientists and so forth, the number of people who know data modeling, I would guesstimate, is maybe 20% of those we talk to, who are acutely aware of practices around it, how to do it and so forth. People may have heard of it but it's... Because he was actually commenting on chapter eight of our book. He's like, "Your chapter's very confusing. I don't understand your sequence, why you put stuff the way you did." And I was like, "I think it comes from our experience of talking to people and understanding that we need to meet them where they are." We're introducing a lot of the concepts the way we did because I think there's a general lack of recognition of data modeling. He's like, "Why did you include Inmon in there?" And I was like, "Well, for one, Inmon gets conflated with Kimball in a lot of cases. They're different ways of viewing the world." Inmon is very much a data warehousing paradigm. But unless you knew third normal form, for example, you wouldn't be able to do an Inmon warehouse. And then unless you knew what a data mart is, for example, you won't know that natural progression from that to a Kimball data warehouse, which is basically a data mart. And so that's why we included it in there. But I think the lack of recognition is the first place to start: making people aware that, yes, you can model data. Because for the longest time it was like wide tables, or reports were simply made from ad hoc queries. So I would start there.
Matt, what do you think? Anything to add about that?

Matt Housley: I think another major problem is that while there's a lot to learn from the classic data modeling approaches like Kimball and Inmon and third normal form, for the most part you shouldn't be using those data modeling techniques as-is with modern tools. So in other words, there is a technology element here. Third normal form doesn't play very nice with columnar databases, which is where technology has gone. And that's not to say there's analytical technology.

Joe Reis: Analytical technology.

Matt Housley: Analytical, yeah, absolutely analytical technologies. But that's not to say that you just denormalize everything. I used to give some Google training for Google BigQuery, which is columnar, and they would just say, "Oh, well, just denormalize your data." And I'm like, "What does that even mean? I have no idea what it means to just denormalize everything." And so I think part of what we need to do is synthesize the classic techniques and then update them to say, "Yes, you're going to have joins in here, but don't split everything into joins. Figure out when you use other types, when you keep things in one table, when you split things out." We just need a more intelligent, modernized conversation about some of these ideas, I feel like.

Joe Reis: But there's a separation though I think Matt, where it's, on one hand it is... If you look at the essence of what data modeling is trying to achieve, it's relating business concepts and definitions to relations among data, right?

Matt Housley: Exactly. Yes.

Joe Reis: It's essentially what it is. But when you look back at the classics, when Bill Inmon came up with the data warehouse, it was because people were querying OLTP systems and that was wreaking havoc on them. That's why he separated out data integration into a warehouse upon which you could do your analytics. And same with Kimball. It's taking data from an OLTP source or different sources and integrating them into, in this case, a dimensional model. But what's happened over the last few decades is, I find, very fascinating, because we're trying to shoehorn these relational modeling or dimensional modeling techniques into technologies where I guess the same problems don't really apply as much. And so the thing that Matt and I have been talking about is, well, what's next in data modeling?

Matt Housley: Exactly.

Joe Reis: The essence of data modeling I don't think goes away where you're trying to again achieve definitional coherence amongst your various data assets. But how you achieve this, I think, is open to I think a renewed way of thinking.

Matt Housley: For sure.

Joe Reis: When you have streaming data sets or graph databases or any number of new databases, it's like, do we have to religiously adhere to the old ways of doing things?

Juan Sequeda: Well, I think when it comes to data modeling, there are two aspects, right? There is the people, the human aspect, and then there is the technical side. So data modeling is as much of a science as it is an art.

Matt Housley: Oh yeah.

Juan Sequeda: And I think you define a data model and you ask people, "Does this look good? Do you like this?" Whatever. I mean, how do you even measure or come up with some sort of metrics around, is this a good data model? You can come up with so many different types of ways and you can ask different people. You can have qualitative approaches to go define that. From a quantitative approach, you can say, "Well, the query runs better, faster, whatever." "But writing that query is so complicated. So then the query that is written is much more complicated than the model is. But if I change the model, then the query may be fewer lines of code, but it may execute differently because it's a row store or a column store, whatever." But then you have to find this balance. But I want this data to be consumed by users and, "Hey, it's okay if they need to go wait 20 seconds instead of two seconds, but they get more value for writing a shorter, cleaner query." I mean, these are the types of things that we need to be thinking about, but how do we do that? We need to understand so much context around that. And for data engineers who just go look at this without talking to the rest of the human beings, what are they going to go focus on? On the technologies. "Oh, it's a column store. It needs to run this fast and I need to optimize which one runs as fast as possible. So here it is, it's done. Please, data consumer, go and look at this data." And they're like, "What the fuck does that mean? I have no idea where to go do it." "Well, here's a bunch of queries that you can go use." And then you start depending on those, and then you go back, because it's this balance. And my rant basically is to say, it's not an easy problem and it's not something that we will find technology to solve. It really is, I think people need to really understand that it's an art and a science and this will... And we just throw it aside. And that's, we can't...

Joe Reis: Yeah, I mean it's hard work. And I mean, if you read back in Kimball's book, The Data Warehouse Toolkit, the step of dimensional... So the four-step process of dimensional modeling. The first one is model the business process. And as he points out too, the uncomfortable truth about this is you have to, gasp, go talk to people. It's much easier to sit behind your keyboard and draw out diagrams all day about what a model should look like. But simply taking the time to go talk to people, that's something, again, back to people, process, and technology. These are the kinds of skills I wish were taught more. And in our book, we hit on this too. At the end of each section there's a "who are you working with" section, where we talk about upstream and downstream, who are you working with? And this is there for a reason, because time and time again, we've rarely seen data projects fail because of technical reasons. I don't think I've ever seen one actually fail because of a technical reason. That's solvable. Where we've seen data projects fail countless times is when the data team isn't aligned with the people they're supposed to serve. Simple as that.

Matt Housley: Yeah.

Tim Gasper: Yeah.

Matt Housley: Yeah. Completely agree. And to your point, Juan, about the thing about modern data modeling, what I often tell people is, instead of just trying to normalize everything, which is the old school approach, what you should do is look for data that's going to be recycled across many different queries and many different sub organizations within your company. That's the stuff that should be pulled out. So for example, if I have customer data, you wouldn't embed that inside another table because you're going to use it all over the company to understand your customers, so that should be a separate table. And that is data modeling reflecting human processes. And that's, yes, you need scientific principles, exactly like you said, but then the craft part, the art of understanding how your company works. That's one of the hardest things to actually achieve. It's not easy, especially for engineers.
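
As a rough illustration of Matt's customer example, here is a minimal sketch in Python using the standard-library `sqlite3` module. The table names, columns, and sample rows are hypothetical and were not discussed in the episode; the point is only that one shared `customers` table can serve two different teams' queries instead of being embedded in each team's table.

```python
import sqlite3

# In-memory database; every table and column name below is a made-up example.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Customer data is reused across the company, so it lives in its own table.
    CREATE TABLE customers (
        customer_id INTEGER PRIMARY KEY,
        name        TEXT,
        region      TEXT
    );
    -- Each team's table references customers instead of embedding a copy.
    CREATE TABLE orders (
        order_id    INTEGER PRIMARY KEY,
        customer_id INTEGER REFERENCES customers(customer_id),
        amount      REAL
    );
    CREATE TABLE support_tickets (
        ticket_id   INTEGER PRIMARY KEY,
        customer_id INTEGER REFERENCES customers(customer_id),
        severity    TEXT
    );
""")
conn.executemany("INSERT INTO customers VALUES (?, ?, ?)",
                 [(1, "Acme", "West"), (2, "Globex", "East")])
conn.executemany("INSERT INTO orders VALUES (?, ?, ?)",
                 [(10, 1, 250.0), (11, 1, 100.0), (12, 2, 75.0)])
conn.executemany("INSERT INTO support_tickets VALUES (?, ?, ?)",
                 [(100, 2, "high"), (101, 2, "low")])

# A finance-style query and a support-style query both join to the same table.
revenue_by_region = conn.execute("""
    SELECT c.region, SUM(o.amount)
    FROM orders AS o JOIN customers AS c USING (customer_id)
    GROUP BY c.region ORDER BY c.region
""").fetchall()

tickets_by_region = conn.execute("""
    SELECT c.region, COUNT(*)
    FROM support_tickets AS t JOIN customers AS c USING (customer_id)
    GROUP BY c.region ORDER BY c.region
""").fetchall()

print(revenue_by_region)
print(tickets_by_region)
```

Whether a given entity deserves its own table is still the judgment call Matt describes; on a columnar warehouse, data that is only ever read alongside one parent record might reasonably stay nested in that parent's table.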

Joe Reis: One of our friends, John, he said the first thing he does when he starts at a new company is he asks, "How do we make money?"

Juan Sequeda: Oh god, you hit the point that I'm bringing up over and over again, which is what I'm calling business literacy. Every data professional, IT, tech person needs to understand how this company makes money, understand the business process, understand the people around it, and then you understand where the data systems come in, and then you realize, "Oh, that's why you want that data, because I get all this context, and I should be talking to the person on this side or that, right?" Follow the money, understand how that goes. This is what we need, the business literacy. All data, tech, IT people need to understand business literacy, not just data literacy.

Joe Reis: But it's underrated, right? I mean again, if you go back to how... And we both, Matt and I, teach, and I think you teach as well. And the thing is, how often is this taught? Say you're in an MSIS program or computer science or anything like that. These are notoriously shoegazey type degrees, right? Communication, I don't know that that's ever taught. Maybe you take a class in pair programming if you're in computer science, but in MSIS I think it's very much where you do your capstone project at the end, where you might have to go get requirements, but that's... Businesses will tolerate you for that just to maybe get something out of you. But I've seen enough of these where... Where I see it falling short is the students really, they end up frustrating the capstone sponsors because they don't know the questions to ask or they can't ask them in the right way. And it's a hard skill. I can't blame the students, but asking good questions is very difficult. I think that should be taught more.

Tim Gasper: Yeah, that's another key skill that we should add to our list here. So just a brief call out to our sponsor. This episode is brought to you by data.world, the data catalog for your data mesh, a whole new paradigm for data empowerment. To learn more, go to data.world. And before we go to our lightning round and start to bring it home with takeaways and things like that, one other topic that's super interesting is around life cycle, the life cycle of data. And just curious, from you Matt, from you Joe, how do you think about the life cycle of data? And how do you think about the role of data engineers, both past and future, around the life cycle of data?

Joe Reis: Matt?

Matt Housley: That is a good question. So I think early on, data engineers back in the early days of relational databases understood this fairly well. You bring in data, you transform it, you do something with it. And I think the technology obsession of the 2000s led us to lose sight of that. And so really that was part of our goal, to bring people back to those fundamentals of what data engineering is all about. And so that is the central theme of the book. And once you have that framework, then you can build out the details of how you're actually going to execute that. But you have to start with where you want to end up. You have to start thinking about what you're actually trying to achieve before you throw a bunch of technology at a problem. What are your thoughts, Joe?

Joe Reis: Well, and I think it is interesting, when we wrote the book, we thought about earlier books like Designing Data-Intensive Applications, which I think was a great data engineering book and still continues to be, and hopefully the next edition is fantastic. And Martin, incidentally, Martin Kleppmann, the author of that book, was one of our tech reviewers. And so we felt like... And I think we're hitting on some good topics with our book, but why I bring his book up is interesting, because at that time, I think 2016, 2017, data engineering was in a much different place. It was still very technical. You still had to understand the underlying guts of the systems you were using. Whereas fast forward to today, I mean, do you really need to understand, if you're running Snowflake, do you need to understand how clocks work? You know, you might, but I would wager that... Or know how Paxos works or some distributed algorithm. I think it's good for you to know this stuff, but it's not necessary for you to do your job and run Snowflake, which is what a lot of data engineers run these days, or equivalent platforms, Databricks and so forth. And so what's happened is technology has become increasingly abstracted. It's SaaS-based. If you're an engineer that happens to work at one of these companies, I would hope that you understand the stuff in Designing Data-Intensive Applications. But if you're not, I would say that it comes down to life cycle management now. And what Matt earlier alluded to with enterprisey data engineering, now that you have the Maslow's Hierarchy of Needs, you're not focusing as much on system survival, so to speak; you can move up towards more self-actualization of being a data engineer.
And what this means is, I think this is in part why you're seeing a lot of interest in enterprisey techniques such as governance and cataloging and quality, stuff that even a few years ago people would probably throw you out of the room for talking about. It's like, "What are you talking about? This isn't data engineering, this is stuff that enterprises do. What are you talking about?" But now you have the opportunity to do this, and what's happening is it's forcing, I think, a recognition of things that have worked in the past and things that will hopefully work in the present and the future. That's why you're hearing about data modeling now. I think it's to cover up for a lot of maybe the past mistakes that we've let languish as well. A good data model goes a long way towards, I think, eliminating the need for quite a few tools around the modern data stack ecosystem that now are sort of monkey patching the lack of rigor that... And so I think life cycle management is a really big thing, and that's increasingly where we see data engineers focusing. Where that leads to in the future I think is open to debate. I have my own supposition in the last chapter that data engineering might go away as a title.

Tim Gasper: Yeah, that's interesting.

Juan Sequeda: Let's go up on that one. Data engineering, you have data scientists, I mean, now we're talking about analytics engineering, we're talking about data product managers. I like to call these also the knowledge engineers or the knowledge scientists. Where is this evolving?

Joe Reis: There's a couple ways it's evolving, I would say, because of abstraction. So you saw this with data science. Every title you see in data science was actually a title that existed before, but it's lumped into this thing called data science. Same thing happened with data engineering, frankly. It's like, this used to be ETL developer, BI engineer, software engineer, systems engineer, distributed systems engineer and so forth. This is now data engineering; it's kind of a catchall term. But I think we have a chance to move back towards more specialized titles. At the same time, what's happening is you're seeing convergence of software and application development with data. The last person I talked to right before a podcast, he was a software engineer and he works with data now, and he's like, "Yeah, I mean, a lot of the practices that I'm seeing, the two are melding together to the point where they may be indistinguishable from each other several years from now." And then you also look at machine learning. That itself is becoming more interwoven with both software and data. So I think what you're seeing is the practices of a lot of the big tech companies trickling down to the masses. So those practices and the capabilities will be democratized to a large extent. And so it may morph to something that we call the live data stack, which is simply real time meets data engineering meets ML in a fast feedback loop. You don't have these artificial distinctions anymore; it's just a living, breathing flow of data. The life cycle's still there, surprisingly enough. That's what's interesting about it. Things get simpler.

Tim Gasper: A lot of these things actually consolidate together. The thing that remains at the end is still the governance, the modeling, the working with the business and understanding the business questions. Those things get left over and don't really disappear.

Juan Sequeda: Here's a heads up for next week's episode: we're going to have another author of an O'Reilly book, which is Ole... I forget his last name. Ole Oleckson? Ole Olesen, who's the author of the upcoming catalog book.

Matt Housley: Oh, right. Okay.

Juan Sequeda: I've been chatting with him a lot. And one of the things he's brought up on life cycles is [inaudible]. And this comes from the information library sciences. It's like plan, obtain-

Joe Reis: Interesting.

Juan Sequeda: And share, maintain, apply and dispose.

Joe Reis: Nice.

Juan Sequeda: A phenomenal episode next week because he is coming in from the library and information sciences point of view.

Joe Reis: That's cool.

Juan Sequeda: We've been talking about, when I talked to him, he's like, "Wow, yeah, that's true." And he's like, "Well, this is just our bread and butter in library sciences and information sciences." And then you see the entire data engineering community and industry going on. They've completely ignored this, and ignored in the sense that they were not aware of this stuff. But it's really hard to understand everything that's out there, and that's why we need to have these conversations that get people involved. And also why diversity is so important. Bring in people from different areas. Diversity-

Joe Reis: I think of it like MMA, right? So you have a library science person and it's like, "I think that's cool." Matt knows I think in terms of fighting or South Park or something kind of weird that way. But I think MMA is a really good example, where historically it's been sort of bifurcated into like, "Oh, you're a karate guy or you're a wrestler or whatever." Then the UFC comes along and, well, you find out it's none of the above. You've got to be good at everything. But things evolve and I think that's super cool. I think data mesh is the one thing we didn't talk about, because you're a data mesh company, but I actually think that that holds a lot of promise. And it's interesting, I've been reading a lot of books, actually the old historical ones on lean, like the original texts, Lean Thinking and so forth. What I find interesting is that if you harken back to DevOps, microservices, Agile, a lot of them find inspiration in lean, really. And what that means is continuous flow, right? Batch is actually a bad practice in lean; that's an anti-pattern. So you don't do batch, and it's very low tech. At the end of the day, lean is about learning to see things. It's not technology. In fact, it's probably the lowest-tech practice there is. But it yields a lot of value, obviously. And what I see is that whole mentality is trickling down into software with DevOps, and now data. And data mesh I think is a continuation of that, where it's... In lean, you want things to be centered around customers at the end of the day, not departments. So you eliminate silos, just like you would in data mesh. But I think something like the live data stack actually facilitates that, because it's continuous flow of information among different domains, perhaps. That's what you want. Batch and queue is a cardinal sin, which is the centralization of such, which is why data mesh came about.
So it's an exciting world and who knows what will happen. I'm pretty stoked. Matt, what do you think about the future of data engineering?

Matt Housley: Yeah, yeah. I mean, so what I'll say about traditional data engineering, like ETL developers and DBAs and everything, is that we've really had relational data going back at least to the 16th century. Think about double-entry bookkeeping, maybe further back than that. People are keeping track of transactions. We're always going to have that kind of structured data that we have to deal with. And we're always going to have batch use cases where it's like, "Yeah, I need to know what happened with my business in the last 30 days." Venetian merchants needed to know how much merchandise came in on a ship and how much they got paid for it. But the evolution is along the lines of what Joe was saying, toward real time. And then just more and more unstructured data. I mean, look at what's happening with TikTok, YouTube, Spotify, podcasts, stuff like we're doing right here. We're just generating a flood of unstructured data and we're still actually not very good at processing it. We're very excited about these advances in machine learning, but there's a lot more to come in the next decade. And so that's what I'm excited about. I mean, okay, I'll say, I wrote a newsletter kind of rant about this last week. I'm both excited and kind of terrified. I'm like, "Okay, but as our machine learning gets better, what are social media companies going to do to us next? Or what are we going to do to ourselves?" That's what I'm wondering.

Juan Sequeda: But this is an interesting point.

Matt Housley: Yeah.

Juan Sequeda: I'm not sure I agree with you because, yes, we're generating a bunch more data, more and more unstructured data, I mean pictures and sound and all that stuff. But what we're seeing from the machine learning side is that we're not focusing on all the structured tabular data. And I think if we look at all the machine learning data sets and what people do, it's, here's a bunch of pictures, a bunch of... But here's a bunch of databases, structured relational databases. Go do something with that. Not just a table, right? Here is a table. Here's a database that has thousands of tables; go do something with that. "Oh, I turn it into a denormalized table." Well, for me to go turn that into a denormalized table, I've got to understand what that means. And then we go back into the issues that we're talking about today. So I've worked so much in my career on schema matching, and this is a problem that we've been trying to go automate academically. We don't even talk about this today. It's embedded somehow in transformations. But with transformations you take this entire blob of knowledge and you put it into some SQL query now, basically, which is a set of rules. But what you really want to go say is that I should be able to go think, "Hey, this thing X is actually mapped to Y, or some combination. What is the most granular thing?" But we don't think about it that way because it's a hard problem. And second, we want to go automate that, and it's so hard to automate because we don't have clean labeled data to go do that. And I think that's the breakthrough that we're missing. And we won't be able to get there yet, because we lack a lot of metadata. We have a lot of data, so we can do record linkage and stuff, but we don't have a lot of metadata. But having said all this, I think we're starting to now catalog and generate more metadata and keep track of that, which we didn't do before.
So that's why I'm very happily getting more excited that, "Oh, we can actually have some breakthroughs on this," but this is a big problem that we're not looking into, and I think that's where we need to have more breakthroughs.
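
One way to picture the contrast Juan draws, between mapping knowledge buried in a SQL transformation and the same knowledge recorded as explicit "X maps to Y" metadata, is the following minimal Python sketch. It is an illustration under stated assumptions, not a method described in the episode: the source tables, column names, and concept names are all hypothetical, and the mappings are written by hand rather than discovered by any schema matching algorithm.

```python
# Hypothetical mappings: (source table, source column) -> business concept.
# In a SQL transformation this knowledge would be implicit in SELECT aliases;
# here it is explicit metadata that can be queried on its own.
MAPPINGS = {
    ("crm_accounts", "acct_id"): "customer_id",
    ("crm_accounts", "acct_nm"): "customer_name",
    ("billing_cust", "cust_no"): "customer_id",
    ("billing_cust", "cust_name"): "customer_name",
}


def to_concepts(table, row):
    """Rename a row's columns to business concepts, dropping unmapped columns."""
    return {
        MAPPINGS[(table, column)]: value
        for column, value in row.items()
        if (table, column) in MAPPINGS
    }


def sources_for(concept):
    """Answer a metadata question: which source columns feed this concept?"""
    return sorted(key for key, target in MAPPINGS.items() if target == concept)


billing_row = {"cust_no": 7, "cust_name": "Acme", "updated": "2022-10-01"}
print(to_concepts("billing_cust", billing_row))
print(sources_for("customer_name"))
```

Because the mappings are data rather than query text, they could in principle be cataloged and labeled, which is the kind of metadata Juan says is still scarce.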

Matt Housley: Yeah, I tend to agree with that. I guess what I would say is that I don't think the structured data side is going to get any smaller. I think what we'll see instead is just an expansion of data across the board. And to your point, all the problems around tabular data are still really, really hard. I mean, this is something you worked on at a previous company, right Joe?

Joe Reis: Oh several.

Matt Housley: And it was really, really hard.

Joe Reis: Companies. Trying to do AutoML on tabular data. So AutoML on images and video I think is relatively simpler than on tabular data. Because at the end of the day, the big difference is tabular data. Oh, okay. All data's human generated. But at the end of the day, tabular is, I would say, tricky because it's human generated; it's got some sort of a purpose behind it that's different from free-flowing text, images, audio, et cetera. And that's where I think the major difference is, and algorithms are notoriously fickle at parsing this stuff. I mean, XGBoost is still, I think, the best-rated algorithm, and that came out a long time ago. But deep learning I think is... It's just hard to pick out the subtleties in tabular data. It's insanely hard from a machine learning perspective. If you doubt me, go try it, especially on unseen data sets; it's very hard to make rhyme or reason out of it. It's full of mistakes. Tabular data sucks. I would say if you want to have a really fun time, play with tabular data. And so I guess the question I have is, how much data really does need to be tabular at the end of the day? Or are we doing that because it's convenient?

Juan Sequeda: That's good. I mean, I think in our world, in our organizations, we live a lot with tabular data. I think that's always front and center. But I think, yeah, this is a good question. And at the end of the day, I always ask myself, we go back and everything ends up being tabular. I'm organizing this academic workshop next year where we're getting together to talk about why we always go off and think about all these data models. The graph has existed for so long, and there are so many versions of graphs and JSON and XML trees and stuff, but we always end up back in tables. And it's fascinating how we always do that. We go off and build all these things, but we end up back in tables. So what is it with us as humans, why tables and humans? What is the fixation about this? I don't know, but-

Joe Reis: I mean pixels are actually, it's a table.

Juan Sequeda: So this meeting I'm organizing, it's called The Why of Knowledge and Data Models.

Joe Reis: Oh, cool. Well let us know when that happens. It sounds interesting. Pop in.

Juan Sequeda: This is something that we can keep talking about for hours, as you guys are seeing.

Joe Reis: Yeah, hopefully we might.

Juan Sequeda: The time. We look forward to actually doing this live one day.

Joe Reis: Yeah.

Juan Sequeda: Talking person to person, but we've got to hit our lightning round now. So let's move to the lightning round, which is presented by data.world, the data catalog for your successful cloud migration. I'll kick it off. So you mentioned Inmon, Kimball, they don't apply perfectly to the modern analytics landscape. Is there a new modeling paradigm that will emerge?

Joe Reis: TBD.

Matt Housley: I'll say there will be, so yes.

Joe Reis: Yeah.

Matt Housley: I think in the near future, right.

Juan Sequeda: Matt says yes.

Joe Reis: Matt says yes.

Matt Housley: I think it's coming.

Joe Reis: Yeah, I mean I'm working on something right now, so we'll talk a bit more about it when it's ready.

Tim Gasper: Yeah, I like that. You'll have to share more as that keeps evolving. I liked your comment about how streaming and some of this other stuff fits in. So I think it's interesting to think about the big picture.

Joe Reis: Oh yeah, this is a clue. Yes.

Tim Gasper: Yeah. Interesting, interesting. So second question, you both mentioned about data engineering, as the tooling gets easier, as the technology gets more advanced, the data engineering maybe is actually going to disappear potentially as a title or as a role. Curious about, we're seeing this rise of analytics engineering and curious to see, do you see this moving to analytics BI, the business questions a little bit more that analytics engineering flavor. Do you see that as being a likely successor or a likely shift here? Maybe Matt, if you want to start?

Matt Housley: What I'll say is that I see a fragmentation of the data engineering role happening. And so maybe that's where the title is going to go away. I think analytics engineers are likely to take over a lot of work done by data engineers right now, especially making sure data is flowing appropriately into the business, into various teams. And then other parts of data engineering will probably either move under ML engineering or get some new title that's something like ML-oriented data engineering, a bit more specific. And then you're still always going to have these engineers that work on the guts of systems at Snowflake and Google and such. And so maybe we'll find a new title for them, because their job is really quite different from what most data engineers are doing. Like Joe was saying, most data engineers have evolved out of doing that. And yet if you have these products, then someone has to work on them behind the scenes. You need hyper-specialists who are working on these systems, and there's got to be some good title for that.

Joe Reis: I mean, I don't see analytics engineering as being anything new. It's been around for decades.

Juan Sequeda: What comes around goes around. Those who don't read our history are doomed to repeat it.

Joe Reis: Oh, you just come up with different names for stuff, right? Data engineering's the same way. But yeah.

Tim Gasper: Marketing is fun.

Juan Sequeda: This is why I say we got to be critical.

Matt Housley: Yeah, yeah. For sure.

Juan Sequeda: Next question. Is the best way for a data engineer to learn data modeling, is it a hands on experience or can reading your book or a book do justice?

Joe Reis: I don't think reading our book's going to teach you data modeling. It'll expose you to the concepts. But as we point out, yeah, there's a lot of books out there that are like 500, 600 pages long. So do the hard work. I would say read the books but also do it, practice it, right?

Matt Housley: Yeah, I'm going to agree with Joe on this one and say you have to go read the classics and then synthesize them into something that you own through a combination of thinking and doing. And hopefully that story will improve over time, where there's more of a guided journey so you don't have to go off on your own quite so much.

Joe Reis: I mean, it's like reading a book on dating at some point. You can read books on dating or you can go on dates and so...

Juan Sequeda: Oh my god, this is perfect. Right? Learning data modeling is like learning how to go date. You can go read the theory, but you have to go and practice it.

Joe Reis: Yeah. And please don't read The Game or some stupid book like that.

Matt Housley: You heard it here first, go read The Game, apparently.

Joe Reis: I know guys who have read that book. I just sit there just cringing. I'm like, Oh man, it's pretty bad.

Tim Gasper: All the secrets. Still single after 12 years. All right, last lightning round question here. So the 2010s kind of saw the rise, especially towards the end of the decade of the data scientist being this sexy, awesome critical job. Maybe data engineering longer term is going to disappear or something like that. But in the shorter term, is that really actually that sexy critical job of the early 2020s? Is it data engineering?

Matt Housley: I would say it has been for the last two to three years. And I think the open question is around economic transformations that we might or might not be going through right now. I think we've seen a huge talent shortage since maybe at least 2017, 2018. Maybe going back further if you include big data engineers. And the question is, I don't know if we go through a recession or something, does that change the conversation? Maybe it does.

Joe Reis: Well, I mean, in the conversations I'm having with people, cost management comes up a lot. FinOps. I would say any data professional, engineering or otherwise, that understands cost management, you're going to stay employed hopefully, unless your company implodes. Which could also happen.

Tim Gasper: That's a good skill to add to our list here.

Juan Sequeda: Yeah, I just had a....

Joe Reis: And cost management is a huge one. I would say cost engineering, that's the next wave of startups, I think, in the data space. Cost management for cloud tools is so opaque.

Matt Housley: Yeah.

Juan Sequeda: Yep. Definitely.

Matt Housley: The problem is that there's no proper training for this and if you came from the previous generation of data engineers, then you were taught performance management. So it's like how do I optimize more Oracle systems or optimize queries, not cost optimization. It's a different problem.

Juan Sequeda: And that's when you have to think about people and money and more things and that thing changes. This is a...

Tim Gasper: XTRY, right? Yeah,

Juan Sequeda: Excellent point. All right, T, T, T, Tim, take us away. Your takeaways. Go first.

Tim Gasper: All right. So Matt, you kind of kicked us off with mentioning shiny object syndrome or magpie syndrome when we talked about where data engineering or data engineers might become very, either distracted or very invested from a technology angle around tools. And I think y'all kind of brought up why did we get into engineering in the first place? It was being able to do cool stuff with cool technology, and so therefore we're technologists at heart. We're interested in this discipline of applying technology and as technology evolves, that's exciting and we want to jump in on the new stuff. And also y'all mentioned resume-driven development and how, as this new tech comes out, we want to take advantage of it, putting it on your resume, whether it's for your own benefit or because employers are looking at that kind of stuff, that it becomes a focus. And as we adopt new technologies, as companies, as enterprises, we want to, "Oh, we want to implement Airflow or something like that." Okay, well, "Let's hire somebody who knows Airflow." And sometimes it becomes easy to kind of go in that direction or, you're wearing a Coursera shirt, we can take some courses and we can pick up some new skills. So it is nice how easy it is to do that now. And there's a little bit of this lack of emphasis on people and process, and tech is becoming more and more the easy part. And that leaves open-ended some of the hard stuff, which is more the people and the process stuff. We talked a little bit about history, looking back to especially the big data Hadoop phase of things. And that was a great example of where technology was a big part of the conversation. And as we've moved past that phase, now we can look back with open eyes, 20/20, at what it really was, and it was valuable, but there was a bubble that happened. And I think now those who know their history want to see that not happen again in the future. 
And we talked about what skills data engineers are really focused on and can get a lot of value from. And some of the ones that we wrote down were assessing questions, like really being able to look at questions and answer them and figure out how to answer them. Assessing technologies based on business problems. So not just technology for technology's sake, but the applicability of technology given the kind of problem that you're trying to solve. And this mention of enterprise data engineering. So a lot of this activity around modeling, around cataloging, governance, schema, there was a mention about Maslow's hierarchy, maybe some of the basic blocking and tackling now is being made a lot easier. We're addressing a lot of that stuff and now we're being able to handle these and focus on some of these things that are a little higher up the hierarchy. So I think that's a good thing. And yeah. Juan, what about you? Takeaways?

Juan Sequeda: Well really, yeah, let me continue on the skills one, right? We talked a lot about data modeling. This is something top of mind for all of us right now. I like how you said you're guesstimating that, of the data engineers around, only 20% of them know what data modeling is. And I would agree with you on this. We really need to update the classic techniques to the modern world of analytics right now. We were talking about how it's a science and an art and there's the stuff that we need to go figure out, given the state of today. We need to learn how to go talk to people. Communication is key. Where is this happening? Where are you actually teaching this? If you get a computer science degree... I did computer science for a long time, no communications, but is this happening in MIS or stuff? I think this is a key thing about communication. And one that just came up was cost management. Maybe we focused before on performance management; cost management is next. We talked about what's next for data modeling and hey, Joe just put a hint there. Is it something about streaming graphs or whatever? So what are the new paradigms on data modeling? Something you said that I fully agree with, data projects don't fail for technical reasons. It's because the data teams are not aligned with the people you need to go serve. So they fail for the people and the process, not for the technology. The whole life cycle of data. In the early days it was well understood, we had to go bring data, we transformed and we go use it. But I think we've been given so much technology right now, and it has really distracted us, and we need to go focus on kind of self-actualization of data engineers. And now we have cataloging and governance and modeling, so how does this fit in the life cycle of data? We have all these roles; will there be a consolidation of these roles in the future? It looks like, I think we agree that there will be. 
And then finally we talked about tabular and unstructured data, and it's so hard to pick up the subtleties of tabular data. Personally, I think there's a big challenge and future opportunity there, and in how much data actually needs to be tabular. One random thing that you said earlier on, I love this, "Legacy is a condescending way to refer to something that makes a lot of money." I love that. I'm going to close with that. Matt, Joe, how did we do? Anything we missed?

Joe Reis: That's good.

Matt Housley: This is a great chat. Thank you for having us on.

Joe Reis: Yeah, it was fun.

Juan Sequeda: All right, throw it back to you guys. Three questions. What's your advice about data, about life? Who should we invite next? And third, what resources do you follow? People, blogs, newsletters, books, obviously go get your book, but what else?

Joe Reis: Matt, do you want to give the advice part?

Matt Housley: Yeah, I'll actually give, I'll just give advice for aspiring data engineers, and it goes back to an internal conversation that we had at Ternary Data this morning. We were talking about lifelong learning and how you really have to be a self-learner and a lifelong learner to succeed in data engineering. And so going back to the conversation that we were just having about people and process, I feel like if you want to be a successful data engineer, you start by learning the people and process stuff, which hopefully you can learn from our book. Now, this is what I'll tell you. Our book will not teach you data engineering, right? That's truly bizarre for a book that's about the fundamentals. Rather, it's meant to give you a foundation so you can start that lifelong learning journey and get into the profession. So learn about people and process, learn the big picture, and then embark on the journey of actually learning the technology and learning the practices to be successful. If that's what you'd like to do.

Joe Reis: As far as who you invite on, I'm going to recommend Bill Inmon. He's working on some really cool stuff with text right now and he's a very good friend of mine. I'm always inspired by him. I think I can only hope when I'm his age I'll be contributing a fraction of what he is right now on a daily basis to the data world. I really feel like he's still at the top of his game, which is really cool.

Juan Sequeda: I would be truly honored to meet him and yes, look forward to connecting with you and have him on the show for sure.

Joe Reis: Yep. Thank you.

Juan Sequeda: Yeah. And finally, what resources do you all follow?

Joe Reis: Let me see, lots of stand up comedians.

Matt Housley: So Benn Stancil in the data space, I think Benn is awesome and he is very focused on the fundamentals. I'm going to give you two other names that you're probably familiar with that are not technically in data. One is Kelsey Hightower. I think Kelsey Hightower mostly worries about containers and other technologies, but he's super, super pragmatic and so I think he has a lot of insights that impact data as well. One of my all-time favorite Kelsey Hightower talks is a talk he gave about AWS Lambda when he worked at Google, which is an Amazon competitor of course. I always feel like I learn something about data from his talks, even if he's not focused on data. One more name, on the FinOps side, is Corey Quinn. You guys know who Corey Quinn is, probably. So totally focused on cloud cost management, very entertaining. One of my all-time favorite YouTube videos is his happy birthday to Larry Ellison video. But be warned, it's not safe for work. You know which one I'm talking about.

Joe Reis: Yeah, I know which one you're talking about. People I'd recommend following, there's a lot. I think in the LinkedIn filter bubble we're all in, I actually, you know, you guys are two I'd recommend and I think it's awesome. Ethan Aaron, I like a lot of the stuff he's coming up with these days. There's a lot of people I think of. So yeah, I would say follow all of us and then you'll be exposed, for better or for worse, to some great data people.

Juan Sequeda: All right, then finally, go get your book.

Joe Reis: Get the book, yeah.

Juan Sequeda: I've just been opening it up at random places and I'm like, "Oh wow." I'm very impressed, really excited about it.

Joe Reis: It's a gold mine. I mean, I hate to be shameless about our book, but I mean a lot of people have read it at this point. I think it's universally gotten really good recommendations. I think the only fault of it is, somebody wrote on Reddit, and it's Reddit, take it for what it's worth, but it's like, "Oh, I already knew all this stuff in the book, so I didn't really get anything out of it." And I was like, that's a great admission of how awesome you are, but it's not a knock against the book. It didn't make you a worse person as a result.

Juan Sequeda: I just opened this up right now to this page, 196, on storage. You have a magnetic disk.

Joe Reis: That's right.

Juan Sequeda: My dad finished his PhD, went off to IBM in the 1970s and worked on this a lot. His PhD was all applied, so this is so cool. You guys even go into hard drives and stuff, so that's awesome. All right, well next week as we said, we're going to have Ole Olesen, he's an author of the upcoming O'Reilly Data Catalog Book, and I will be live with him. I will be. I'm in Europe next week and I'm going to be with him. That's going to be fun because we're probably going to be live at 11:00 PM while we're drinking some wine, and Tim will be at 4:00 PM over here, so it'll be a fun conversation.

Tim Gasper: Yes, that'll be fun.

Juan Sequeda: With that, thanks as always to our sponsor, data.world. We get to do this because data.world supports us, our enterprise data catalog. Thank you, data.world, and thank you, Joe. Thank you, Matt. We this-...

Joe Reis: Of course, anytime.

Juan Sequeda: And also go follow your podcast and everything. We love it.

Joe Reis: Thank you.

Matt Housley: Yep.

Joe Reis: Yeah, Monday Morning Data Chat. Cool. All right, thanks guys.

Speaker 1: This is Catalog & Cocktails, a special thanks to data.world for supporting the show. Carling [inaudible] producing. John Loins and Brian Jacob for the show music. And thank you to the entire Catalog & Cocktails family. Don't forget to subscribe, rate, and review wherever you listen to your podcasts.



Today's Host


Tim Gasper

|VP of Product,

Juan Sequeda

|Principal Scientist & Head of AI Lab,

Today's Guests


Matt Housley

|Co-Founder and CTO at Ternary Data

Joe Reis

|Co-Founder and CEO at Ternary Data