Catalog & Cocktails: Bonus Episode with John Kutay

This is a podcast episode titled, Catalog & Cocktails: Bonus Episode with John Kutay. The summary for this episode is: Lucid Streaming; how to take full advantage of data streaming. (Full topic list in the description below.)
Introduction & Toasts
03:12 MIN
The most bizarre lucid dream you've ever had
01:29 MIN
The ins and outs of data streaming
01:14 MIN
Real-time streaming in the modern data stack, minutes to milliseconds
02:00 MIN
Examples of when real-time versus a streaming approach might be preferred
03:29 MIN
When is streaming or real-time not a good choice?
01:06 MIN
Balancing costs, value, and transactional time
02:17 MIN
Transactional to analytical, and instant gratification
02:54 MIN
The technologies behind streaming and their use cases
04:28 MIN
Change data capture and the history of the practice
01:05 MIN
The relationship between streaming data and transformations, DBT and ETL streaming
04:11 MIN
The use cases of streaming ETL
02:33 MIN
Calculating lineage, SQL, and speed
04:06 MIN
Lightning round
03:44 MIN
Takeaway Time
04:41 MIN
Three questions
04:38 MIN

Speaker 1: This is Catalog and Cocktails, presented by data.world.

Tim Gasper: Hello and welcome. It's time for Catalog and Cocktails: an honest, no-BS, non-salesy conversation about enterprise data and analytics, presented by data.world, the data catalog for leveraging agile data governance to give power to people and data. I'm Tim Gasper, longtime data nerd and product guy at data.world, joined by Juan.

Juan Sequeda: Hey, I'm Juan Sequeda, Principal Scientist here at data.world, and today this is a different episode. We are coming to you from our office at data.world, but we are not live. This is a bonus episode that we're doing, and we're dropping it brand new. We've been so thrilled and honored about how everybody has been listening to the podcast. We have so many people who are listening and reaching out, and we have so many guests who want to be on it, and it sucks to say, "No," so we're not going to say, "No."

Tim Gasper: Yeah, we want to bring more amazing content to you and amazing people, talking about really key and important things in data.

Juan Sequeda: And with that, I am super excited to introduce John Kutay. Did I pronounce your last name right?

John Kutay: Close enough.

Juan Sequeda: How are you doing, John?

John Kutay: Hey Juan, hey Tim. Thanks so much for having me today. Just for everyone, my name's John Kutay... product at Striim, and I also host my own podcast, not competitive with yours, but it's called What's New in Data. How's it going today?

Juan Sequeda: I don't think podcasts compete. It's just more and more knowledge that we have out there, that we need out there, and this is great that we have the opportunities to be able to go spend time and talk to great people. At the end of the day, we're just so lucky that we get to talk to folks like you and get to learn and just have all these great conversations, but all right, this is just a normal episode. What are we drinking? What are we toasting for today? John, how about you?

John Kutay: Hey, for me, my drug of choice is caffeine. I got my homemade latte. I'm doing a lot of espresso brewing, so just enjoying this while we go into this episode.

Tim Gasper: That's awesome. I am also a caffeine aficionado, and I am not doing my own espresso and making lattes and things like that, but I want to, so we're going to have to connect on that. You're going to have to tell me what you're using for that.

John Kutay: Oh yeah.

Juan Sequeda: What are you up to, Tim?

Tim Gasper: I'm drinking this Dominga Mimosa Sour by New Belgium. It is a sour beer and it's got a strong citrus hit. Very light, very interesting. I don't usually go for sour beers, but this is very cool.

Juan Sequeda: I could not do that. I think it's an acquired taste. I'm actually taking it really easy. I'll be very honest. Yesterday, we had a lot... It was an interesting night, so I'm just hydrating a lot right now.

Tim Gasper: An interesting night of lots of data activity and some drinking, too.

Juan Sequeda: There was, there was, there was.

Tim Gasper: Some good things.

Juan Sequeda: I'm toasting. This is our first bonus episode. We're doing this. So thanks, John, for being our inaugural guest for this. Thank you so much and toast to you here, for being a guest.

Tim Gasper: Yeah, cheers.

John Kutay: Yeah, cheers. Thanks for having me. Couldn't be a better topic to do a bonus new episode on.

Juan Sequeda: Yeah, absolutely. We have our funny question. Today, in one sentence, describe the most bizarre lucid dream you've ever had.

John Kutay: The most bizarre dream I've had and it seemed pretty lucid. It was at the height of the COVID pandemic and we were all working from home on Zooms all day, and we had just gotten a puppy, and I had this weird dream about having to schedule a Zoom with my dog to feed him and take him on a walk. It didn't make any sense when I woke up, but I think something about the Zoom world and having a dog, I was like, oh yeah, I have to schedule a Zoom with my dog. Yeah.

Juan Sequeda: That's pretty funny. How about you, Tim?

Tim Gasper: My weird lucid dream is, and this actually happened a couple of weeks ago. I was taking a test and everybody was turning in their tests, but I was just getting started, so it's some kind of a stress dream, but I remember it very vividly, and then I look up and the proctor of the test is my boss, Jon Loyens. He's the one who's making me take this test. You know what? That doesn't make any sense. Why am I having a stress dream? Jon is awesome to work with. That was my really weird lucid dream.

Juan Sequeda: Well, you know what? I don't remember dreams. My wife and I have this discussion all the time because she remembers all her dreams. She describes them and I think I remember dreaming, but then I wake up and they're gone, so actually I have nothing to share because I don't remember my dreams.

Tim Gasper: No interesting ones.

Juan Sequeda: All right. Well let's start.

John Kutay: You're living the dream.

Juan Sequeda: I'm living the dream. Yes, yes.

John Kutay: Hey, I like that.

Juan Sequeda: Maybe some lucid things happened yesterday. I don't remember. All right, John. All right, honest, no BS. We talk about data streaming, stream processing, realtime analytics, operational analytics. What is the difference between all this stuff?

John Kutay: Yeah, so you touched on a lot of cool points there, a lot of cool subjects. Just to start with data streaming. Data streaming is, very simply, the pattern of collecting data as it's new, only capturing what's changed from source systems, and processing it in real time, sequentially, as an event driven architecture. Now, that's applied in a few ways, one of which is operational analytics: situations where, let's say, I'm a major airline and I have real time maintenance data on every single plane that lands. That's an event driven system. You want to take action on maintenance data as it's entered into the system, and there's a real time SLA around it. I can get into how data streaming meets real SLAs and all these business objectives, but I would say the core thing is just capturing that data in an event driven fashion.

Juan Sequeda: We were having this discussion yesterday on the podcast, and sensor data came up, and we started talking about real time. The discussion there was, real time is subjective, because for somebody real time is, yeah, every second, but real time can be 10, 15 minutes or whatever. How are you seeing that with the folks that you work with, your colleagues, your customers and such? How is real time defined? What do people expect by real time?

John Kutay: I actually see two flavors of this, and I would say the distinction is: the one that's most common, that we see in, let's say, the modern data stack or the cloud analytics world, is near real time analytics on fresh data. That means, hey, I want to take my warehouse syncs, let's say I'm running Snowflake or I'm running BigQuery, doing analytics there. I have all these business intelligence users who are running reports... Near real time analytics on fresh data just means I'm bringing those sync frequencies down to, let's say, five, 10, 15 minutes. That's the low hanging fruit for most companies, and for a lot of companies, it gets the job done. That at least gives all the end users and analysts confidence that, hey, when I pull a report, I'm actually looking at something that's pretty much fresh. We're talking about within the last five, 10 minutes, and there's a way to monitor this: your actual data freshness SLAs. That's one area. The other is true real time data streaming, which I also see, where you have seconds or even milliseconds of data delivery frequency. I mostly see that with message bus systems: hey, I stream into Kafka and then that's going to go into some Spark job or Spark streaming, and machine learning that's going to update some model and kick off some workloads or send some alerts and notifications. That's where I see more true real time, in the millisecond context.
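
To make the two flavors concrete, here is a minimal sketch of the "true real time" path John describes: a consumer reacting to events the moment they land on a message bus, with no polling interval. It assumes the kafka-python client; the topic name, event fields, and alert rule are hypothetical illustrations, not anything from the episode.

```python
# A minimal sketch of the "true real time" path: consume change events
# from a Kafka topic and react per event, with no polling interval.
# Topic name, event fields, and the alert rule are hypothetical.
import json

from kafka import KafkaConsumer  # pip install kafka-python

consumer = KafkaConsumer(
    "plane-maintenance-events",              # hypothetical topic
    bootstrap_servers="localhost:9092",
    value_deserializer=lambda b: json.loads(b.decode("utf-8")),
    auto_offset_reset="latest",              # only new events matter here
)

for message in consumer:                     # blocks, yielding event by event
    event = message.value
    if event.get("engine_temp_c", 0) > 900:  # hypothetical maintenance rule
        print(f"ALERT: plane {event.get('tail_number')} needs inspection")
```

The shape of the loop is the point: instead of waking up every N minutes to pull a batch, the handler fires once per event, which is what makes millisecond-level SLAs possible.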

Tim Gasper: Interesting, interesting. You're saying that a lot of companies may be doing more of this near real time stuff. Then there are use cases that are more streaming oriented, and you mentioned a few things like alerts. Can you go into a little bit more detail on when you would do more of a near real time kind of approach, and when you would do more of a streaming kind of approach?

Juan Sequeda: Yeah, and I want to drill down into the use cases, because we hear this all the time: oh yes, we want to have real time. But really, is it crucial or is it nice to have? Are we leaving money on the table because we're not real time, or would it just be really cool? Do I really need to know exactly how many people are on this site or whatever, or just how many people paid for something? What are those use cases? Let's get truly honest, no BS on this stuff, because this is where I honestly feel that people are pushing real time a lot, but it's like, eh, do you really need to go through that? I don't know.

Tim Gasper: Right. When do I need to take what approach? And hopefully, we can break...

Juan Sequeda: All right, let's go demystify some stuff here.

John Kutay: Let's do it. I'll give you a real time use case. One of the customers we work with is a major retailer. A lot of their infrastructure is on-premise inventory management systems: say, tracking all their shipments, tracking the stock of all their goods. Then they have a whole consumer application suite that faces their customers, and the real materialization of this is, you have some mobile app where you're shopping for clothes, or you're on Amazon or whatever, some mobile app where you're actually seeing what you can shop for, what the inventory is. Those companies actually need a real time source of truth for inventory, and inventory infrastructure from a data standpoint can be complex. It can be very disparate. It's going across different warehouses, different data centers, different clouds, and they want to sync that all in real time to the backend for their consumer facing applications, because you don't want to give customers a bad experience where they think that they're buying something that's in stock, going to be delivered in one or two days, but the truth is, that's actually stale data. They're out of stock. Next thing you know, they say, "Oh, we're actually going to ship that to you in four to six weeks. We were wrong when we said it was in stock." That's a real example of how real time data can influence customer experiences and really does impact your bottom line and your net promoter scores and all that type of stuff. We see that a lot. We do have some case studies on the Striim website (striim.com, spelled S-T-R-I-I-M), where some major retailers and other operational companies have deployed real time to build better customer experiences. That's one classic problem.

Tim Gasper: That's your example where if it was in five, 10, 15, 30 minutes, that wouldn't be in time because by the time that person has made that purchase, now that mobile app essentially lied to them and maybe some companies can just handle that with back order, but a lot of companies want to actually allocate the correct inventory towards the particular shopping cart.

Juan Sequeda: What about-

John Kutay: Exactly.

Juan Sequeda: What are the situations where people would think you need streaming or real time, and it's actually not the right thing to go do?

John Kutay: Mm, like a common misconception when you're like, oh-

Juan Sequeda: The anti-patterning.

John Kutay: Do I need streaming? It's like, no, actually you could do that in batch or you could do that in fast batch or whatever.

John Kutay: I always recommend to customers: if the data's going into a data warehouse, think about cost, because Striim does work with a lot of customers who are "streaming data," but into a warehouse, and we work with them to make sure that their sync frequencies and merges and all that stuff are actually sane. If they're doing merges every single second, they're going to be tripling or quadrupling their warehousing costs, and it doesn't really make sense when their end users are just people pulling up a report. In that case, you can optimize for cost and say, "Okay, just do the sync frequencies on a 15 to 30 minute interval," if the only thing that's going to happen is you have some users who are going to pull up a report and look, in Power BI or something like that.
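
A quick back-of-the-envelope way to see the cost knob being described here: count how many warehouse merge jobs each sync interval implies per day. The numbers are pure arithmetic, not real pricing.

```python
# Back-of-the-envelope arithmetic for the cost knob described above:
# how many warehouse merge jobs per day each sync interval implies.
# Intervals and the idea of "cost per merge" are illustrative only.
SECONDS_PER_DAY = 24 * 60 * 60

for interval_s in (1, 60, 15 * 60, 30 * 60):
    merges_per_day = SECONDS_PER_DAY // interval_s
    print(f"sync every {interval_s:>5}s -> {merges_per_day:>6,} merges/day")

# sync every     1s -> 86,400 merges/day
# sync every    60s ->  1,440 merges/day
# sync every   900s ->     96 merges/day
# sync every  1800s ->     48 merges/day
```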

Juan Sequeda: Yeah. This is an interesting point: balancing the cost. I think we all get excited about, oh, let's get this now, and oh, the technology's there to go do that, but hey, let's remember that there are costs we've got to go pay. At the end of the day, what is the value out of it? I think this goes back to what we talk about all the time: what is the business problem that this technology is helping with? Do we really need to be able to go do some sort of analytics over this in real time? Going back to your example with the retailer, I see that more as real time transactional. Is there a real time analytics use case, when it comes to doing some sort of analytical processing in real time? This is something I've personally been a bit confused about: the necessity, or the use cases behind this. What are you seeing?

John Kutay: That's such a great question. Basically you're saying, "Hey, that's a transactional use case. We're just replicating actions and making sure those are in sync," and you're totally right, but that's the problem that the 2000s-era data integration vendors solved, like GoldenGate software, which is owned by Oracle now. That's the founding team for Striim. Now, having analytical models in between the raw data and what customers are experiencing is becoming the de facto way of delivering data, where, hey, it's not just copying a transaction from an inventory system and then putting it in another database and showing that to customers. You want to essentially personalize experiences. You have data on the customer across multiple SaaS applications, and you want to actually correlate that data. Let's say they might have a record in Salesforce. They might have a record in your marketing system, your VIP loyalty system, and you want to build that real time single source of truth, which does involve correlation, which does involve model building. It could be as simple as running some rules-based queries, but making that a real time source of truth does actually involve operational analytics.

Tim Gasper: Interesting. So that's an example of a use case: these personalized experiences that are based on maybe rules-based queries or something like that. That might be something where you need that event information. It's a little bit of semantics. Is that really transactional in nature, or in general, is that a very real time analytical sort of use case?

John Kutay: Correct.

Juan Sequeda: What are they-

Tim Gasper: And in that context, you usually have some analytical models being trained in the process, or maybe it's just a warehouse dimension that you want to correlate from multiple systems.

John Kutay: Right, okay. Once you're doing that sort of model building in the warehouse, I think you cross the plane from purely transactional into analytical.

Tim Gasper: Okay.

Juan Sequeda: Mankind, in a way, has changed so much to want things now. Amazon Prime, you get delivery in two hours now. Our expectation of getting everything now is leading us to think more about being real time, but I don't know. I still think that... We were having this discussion on a previous podcast here: effectively, everything could be just real time, it's just a difference of how you feel real time. Batch is very slow. It's not near real time, but it's some sort of a spectrum, so at the end of the day, it's not like it's either batch or it's real time. It's like a knob, and you decide how to turn it. You put it faster or you put it slower. If it's slower, possibly it's cheaper. If it's faster, you get more of this great data sooner, but you're probably going to pay more for that. Is that a good way of seeing it, that this is just one spectrum? It's just a knob that you're turning?

John Kutay: It can be a knob, but once you try to get super real time, then you actually have to think about putting new architecture in, and that's what's scary about it. You would think that... I'm in the batch world right now. Right now, my sync frequency could be one hour into my warehouse, and then I'll update my models every two hours, and now I'm thinking, okay, let's bring that down to 30 minutes. Let's bring that down to 15 minutes. Let's bring that down to 10 minutes. Boom, boom. Everything will blow up, because it's not designed to deal with data at that type of scale and speed. That's when companies start thinking about actual real time infrastructure, and there's been a lot of good writing on this topic. Actually, Amy Chen from DBT has a great writeup, and it's very much in the honest, no-BS spirit, very in line with this pod, where she says, "In most cases you don't need it, but here's where you really do need it, and don't even try to do this with batch because it won't work," and it is an operational analytics use case that she's referring to. It's a good blog post. I'll share a link with y'all later that you can repost.

Tim Gasper: Yeah, that'd be awesome. You said, Amy Chen?

John Kutay: Yeah, Amy Chen from DBT.

Tim Gasper: Awesome.

Juan Sequeda: But from a user perspective, I just want it to be a knob. I get that from a technical point of view there are differences, architectural points of view, but I don't care. Make it transparent for me, because right now, this is kind of my personal annoyance with the entire modern data stack and everything around it. It's all over the place and there are so many tools, when these things should be... I don't want to have two tools for this stuff. This should just be one tool which has a knob, and I decide, and behind the scenes, the vendor figures out the best architecture for that. Is that too crazy? I don't think so.

John Kutay: No, I'm with you.

Juan Sequeda: Or am I crazy?

John Kutay: You're crazy, but in a good way, and that's how we push boundaries here, and I'm a hundred percent with you that for users it should be... And when I talk about users, maybe the data engineering end user who just wants to go from batch to real time, it should just be: show me the knob. Turn it down from 60-minute syncs to one-second syncs, and it should just work transparently. I'm a hundred percent with that. I think that's where we need to go if we're really going to deliver real time to companies at scale. And when I say scale, I don't mean just the Ubers and the Googles of the world who can hire 50,000 PhDs to build this stuff out. I mean every company should be able to implement the value of real time data for their customers.

Tim Gasper: Got it. I think that makes sense. This is interesting. We're talking about which use cases fit in which modes, some of the misconceptions around streaming versus fast batch, and near real time versus streaming. Maybe let's talk a little bit more about the technology as well. I know in the spirit of streaming, you hear a lot about message buses and queues. You hear about things like in-memory versus writing things to disk. What is the technology that people should really be thinking about, broadly speaking, when we're thinking about streaming or real time, and what are some of the trade-offs here?

John Kutay: I always think, unless you're starting with the source of the data, it's not even worth going after real time. If you're only thinking about the message bus, you're already a little too far down the stack before you're solving real world problems. I start with the source, and when you're looking at the source, you need to build out mechanisms for efficient change data capture. Change data capture is the process of capturing what's changed from an operational system in an event driven format, rather than pulling it as batches on a 15, 30 minute interval. The reason you want change data capture, especially with databases, is that operational databases have heavy... in terms of the applications you're running on top of them. You don't want to put more strain on them by running queries on the database just for analytical purposes to pull its changes. You want to basically go against the change logs. All databases write to what's called a write ahead log, which is basically a journal of all the transactions, and you can parse those transaction logs... just the data that's changed. Then you can feed that into a streaming system like Kafka, or I think Redpanda is also becoming a popular alternative to Kafka, getting that data in an event driven format. That's where I'd really start, and for change data capture, there are a lot of options for databases. Really being able to do it from the logs is what's critical. Now, SaaS applications, Salesforce for instance, has a CDC API now, where you can just pull the changes versus doing batch style API calls against it. I think we're going to see more SaaS applications support native CDC, but until we get there, what you're pretty much just going to have to do is make those poll-based API requests, look at the timestamp, and try to get what's changed based on that. So that's where you start. A lot of tools will extract that for you, Striim, the product I work on, being one example, so you don't have to worry about doing that yourself. The next piece is the message bus, and there are a couple of ways you can go about it. There are full in-memory streaming systems. Well, first we can say: why in-memory versus disk-based? Memory is obviously faster. Writing to disk, there's more IO. You have to do lookups against the drive. Whereas when everything's in memory, for instance with Striim's implementation, we just inaudible as a sequential queue that's in memory and can run processing directly against... and that's what gives you really fast performance, millions of queries per second on the data that's flying through. Finally, the way you egress that data, the way you write it out into external systems, like a data warehouse, or let's say a Slack alert writer or whatever: you have to buffer that data into a format that can actually be handled by the data warehouse, because warehouses and databases and alerting systems, whatever it is, they all ingest data in batches, and you have to format streaming data for that. Does that make sense?
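
As a concrete illustration of the log-based CDC John outlines, here is a minimal sketch that reads changes from Postgres's write-ahead log via logical decoding, using psycopg2. It assumes a logical replication slot already exists (for example, one created with the wal2json output plugin); the connection string and slot name are hypothetical.

```python
# A sketch of log-based change data capture: read only what changed from
# Postgres's write-ahead log via logical decoding, instead of re-querying
# tables. Assumes a replication slot created with the wal2json plugin;
# the DSN and slot name are hypothetical.
import psycopg2
import psycopg2.extras

conn = psycopg2.connect(
    "dbname=shop user=cdc_reader",  # hypothetical connection string
    connection_factory=psycopg2.extras.LogicalReplicationConnection,
)
cur = conn.cursor()
cur.start_replication(slot_name="inventory_slot", decode=True)

def on_change(msg):
    # msg.payload carries only the changed rows (inserts/updates/deletes),
    # so there is no extra query load on the source database.
    print(msg.payload)
    msg.cursor.send_feedback(flush_lsn=msg.data_start)  # acknowledge progress

cur.consume_stream(on_change)  # blocks, invoking on_change per WAL event
```

This is the "go against the change logs" idea in miniature: the source does no analytical work, and the consumer sees an ordered stream of changes it can forward to Kafka or wherever else.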

Juan Sequeda: Completely. I am so appreciative of what you just said, because I think you very clearly described the implementations behind this, which I just learned a lot with what you just said, but this is very, very important. A lot of the folks that I talk to, I think streaming is something that they want to discover more, so this is a really great discussion.

Tim Gasper: Yeah, they want to understand the pieces a little bit. I think change data capture is an interesting topic, because that's not something that I think... Traditionally, you go back 10, 20, 30 years, people talked a lot about change data capture. I feel like knowledge of it has waned a little bit. You don't hear modern data stack people talking a lot about change data capture, but I feel like some of these traditional concepts are coming back with a vengeance now: oh, wait a second, remember that stuff we used to talk about? Yeah, it's actually still really valuable. And it's interesting to hear that even SaaS companies like Salesforce are starting to introduce that, because that was actually going to be one of my follow up questions: in a world where a lot of things are coming from SaaS, it's actually a little more difficult to handle change data capture, because databases write these logs all the time, but with SaaS systems, you may not have access to that information, so that's super interesting.

Juan Sequeda: Digging in, another thing I'd appreciate your clarification on here: when it comes to streaming, and we look at things like the transformations and ETLs and DBTs, broad question, what is the relationship between the streaming data and all these transformations that are occurring? Is this streaming, or using DBT transform models? Is ETL becoming streaming, too? Is that part of the knob? What is that relationship?

John Kutay: Yeah, that's a great question. My goal is making streaming as simple as using your traditional batch warehousing systems like Snowflake, and to the end user, it should be just as simple, like you were saying. Right now, it's too complicated, everything is too fragmented, everything is such a low level abstraction, where you have to learn how to do the CDC yourself, how to buffer the events in memory yourself, all that stuff. It shouldn't be that hard. Now, back to your question on the relationship between DBT, ETL, ELT, and streaming: I still see a world where you're going to do a lot of ELT, Extract Load Transform, in a streaming format, in an event driven format, and the type of transformation you're doing is so low level that the users shouldn't even know about it. We're literally just processing the data so that it's ingestible by a warehouse like Snowflake, and then your DBT teams will still do your model building on top of the warehouse, because for those types of use cases, it's simplest to just have all the data in the warehouse, build your models from there, and have all the history of the data, between the pure raw data that I just loaded and then my analytical models, whether it's a slowly changing dimension or a fact table or whatever. The second category is, if I actually need analytical model data in real time, I can't afford the latency of indexing in the warehouse, doing a merge, waiting for a DBT model to run on the warehouse. We're talking about going from seconds of latency to minutes, and it could be up to 30 to 40 minutes, and it's not going to be cost optimized. That is where streaming can turn into ETL, where you're doing transformations and joins on the data in flight, to build that analytical model in a streaming, incremental fashion where everything's an insert, and I'm just feeding it into the warehouse, basically in real time, ready for analytics, and you can fire that off as alerts to operational systems like ServiceNow or Slack to say, "Hey, this is actually critical data, and we need someone to act on it now."
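
A toy sketch of the in-flight transformation John contrasts with warehouse-side modeling: enrich each event against a small in-memory lookup as it streams through, so every record lands analytics-ready as an insert. The event shape and dimension table are invented for illustration.

```python
# A toy in-flight transform: enrich each event against a small in-memory
# dimension as it streams through, so every record lands analytics-ready
# as an insert. Event shape and the customers lookup are invented.
customers = {
    "c1": {"segment": "vip"},
    "c2": {"segment": "standard"},
}

def transform(events):
    for event in events:                  # sequential, one event at a time
        enriched = dict(event)
        enriched["segment"] = customers.get(
            event["customer_id"], {}
        ).get("segment")
        yield enriched                    # ready to append downstream

orders = [{"order_id": 1, "customer_id": "c1", "amount": 42.0}]
for row in transform(orders):
    print(row)  # includes the joined-in "segment" field
```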

Juan Sequeda: If I look at ETL versus ELT: in ELT, the EL could be batch or streaming, and then the T would be something that happens afterwards, after the real time or the batch. But if I actually do ETL, then that means the transformation needs to be done while in flight, in the stream. Is that a good way of seeing that distinction?

John Kutay: Exactly. Exactly. It could either just be a streaming EL, just fast data loads, or it could be a streaming ETL, where I'm modeling the data the second that it's captured and loading it into the downstream analytical systems. On the no BS side, I still see much more of the streaming ELT, where we're just getting the data to the warehouse as fast as possible in an event driven format, and most of the modeling is done after it's loaded. I'm still seeing mostly ELT, but there's a lot of momentum, a lot of questions, a lot of new use cases coming out around streaming ETL now. I'm personally keeping a pulse on that. I've seen some adoption, but definitely a lot of energy and interest in it.

Juan Sequeda: Say more. Say more about this. What are the use cases where you would want to go do streaming ETL? Honest, no BS: streaming ELT is really, as you just said, faster loading into the warehouse, and then you're doing your inaudible. But then, if I'm doing the transforms and that happens to be a model that needs to be built, it still takes whatever time, two minutes to 20 minutes, whatever, to go build that. Effectively, you're just building a view. If it's an unmaterialized view, then you'll get the data as it comes in. If it's a materialized view, you've got to wait for everything to materialize. At the end, I don't care how fast I'm streaming things in if I'm materializing after the fact. That's not real time anymore. It's just whenever I materialize.

John Kutay: Exactly.

Juan Sequeda: But then-

John Kutay: Exactly.

Juan Sequeda: What are the use cases where you're streaming ETL? Because it seems to me that you're, I don't know. That's a question to you.

John Kutay: Exactly. Yeah. That's a great point. I've worked with one customer where we got their ELT down from 30 minutes to an hour, down to five minutes, with materialized views and modeling happening in the warehouse, and that was a big win for them, but that's about as far as... go in terms of doing this in house, just because it's inherently... Not that there's anything wrong with that. There are a lot of advantages to that. But the use cases for streaming ETL are coming about with, I hate saying this word, operationalizing the warehouse, where we're talking about reverse ETL, data activation type things, where I'm taking data from my analytical system. It doesn't have to be a warehouse. It could be any analytical system, and feeding that back into systems of engagement, as McKinsey calls it. Feeding it back into CIS, feeding it into your email marketing campaigns. You want to act on user data, and that's where streaming ETL is becoming very interesting again, because we're basically taking analytical data modeled from what customers are doing, all these different sources, and now we're actioning it by feeding it back into some SaaS application or some user facing application that's using that data to give them a personalized offer or fraud detection or whatever it is. There are lots of operational use cases for it.
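
A minimal sketch of the reverse ETL pattern John mentions: push a freshly modeled result back into a system of engagement. The endpoint URL and payload fields are hypothetical, and the requests library stands in for whatever delivery mechanism a real pipeline would use.

```python
# A minimal reverse ETL sketch: push a freshly modeled result back into
# a system of engagement. The endpoint and payload fields are hypothetical.
import requests  # pip install requests

def activate(rows, webhook_url):
    for row in rows:
        # e.g., trigger a personalized offer in a marketing tool
        response = requests.post(
            webhook_url,
            json={"customer_id": row["customer_id"],
                  "offer": row["recommended_offer"]},
            timeout=5,
        )
        response.raise_for_status()

rows = [{"customer_id": "c1", "recommended_offer": "free_shipping"}]
activate(rows, "https://example.com/hooks/offers")  # hypothetical endpoint
```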

Juan Sequeda: Just quickly, is this all still SQL?

John Kutay: Yes. Yeah, in Striim it's all SQL.

Tim Gasper: Interesting. Oh man, this is an interesting conversation. I think we've got one major question left for you before we move to our lightning round, but just before we get to that, I wanted to say that this episode is brought to you by data.world, the data catalog for your data mesh, a whole new paradigm for data empowerment. To learn more, go to data.world. John, I wanted to ask you: in some of our conversations before we chatted here today, we talked a lot about lineage, how there's some interesting interplay between what lineage means in a streaming context, how those things come together, and whether there's something new to think about, like the E and the T and the L in streaming, in terms of how that connects to lineage. Curious about your thoughts there.

John Kutay: Yeah, lineage is super critical across a lot of industries, especially financial industries. Eric Broda, his last name spelled B-R-O-D-A, has a lot of great writing on this. He was a longtime architect in financial services, and now he's writing a lot about data mesh, and I had him on a podcast episode recently. He was talking about how, for regulatory and compliance purposes, tracking the lineage of all the data that they're analyzing is required. They need to be able to tell you: how did you collect this data? What were all the intermediate processing steps it went through? What were all the external systems it ran through? And being able to see the actual life cycle of every single record that a bank is storing. That's the enterprisey, highly regulated industry use case. Now, coming back to a very simple use case that everyone can relate to: you're pulling up some report and you're going to see some metric in a dashboard, and someone will say, "How did we come up with that metric? It looks cool. I don't know if it's right. Where do we get this data from?" That's where lineage comes back into play, because you basically want data engineering teams to be able to have visibility into how a certain field was generated, and not just from the model in the warehouse that came up with that number, but what fed the warehouse? What was the source application? Was it a database? Was it a SaaS application? When was it written? What were the transforms that fed it? What was the ETL tool that loaded it? Being able to do all that sort of reverse engineering of the data that you see in your reports is a super critical use case for lineage.

Tim Gasper: Is there a different way that you need to calculate lineage when it's in the streaming context? Because for example, when you're looking at databases, a lot of times people either think, okay, well either that system is generating lineage or more often you're looking at the SQL code and you're trying to parse the code and things like that. Is it often similar in the case of streaming? You're looking at what those either SQL or other sort of transformations are and what those steps are, and it's just faster? It's just more that streaming paradigm or is it actually different in some fundamental way?

John Kutay: Yeah. Because in a streaming system everything is event driven, we can be super granular about the lineage of each event and say: what table did it come from in the source database? What object in Salesforce did it come from, via a change data capture record? What was the offset in the write ahead log for this specific transaction? We can get that granular, but one of the things that's valuable is being able to see: what was the ETL workload that loaded this data? What was the streaming query that processed it and transformed it? Is the logic there accurate? Is it going to produce correct data? When it was loaded into the warehouse, was it just an insert or was it a merge, some sort of inaudible? Being able to have all that information at an event level is really valuable for teams to go back and triage and give their business stakeholders confidence that the data... and that's one of the features that I spent a lot of time on at Striim, our streaming lineage feature that basically gives you all that metadata when you write it into the warehouse. We don't show you the lineage. We make that metadata available for data.world to pick it up.
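
Here is one way to picture the event-level lineage John describes: stamp every record with where it came from and what touched it before it lands in the warehouse. The field names are illustrative assumptions, not Striim's actual metadata schema.

```python
# A sketch of event-level lineage: stamp each record with where it came
# from and what touched it. Field names are illustrative assumptions,
# not any vendor's actual metadata schema.
import time
import uuid

def with_lineage(record, source_table, wal_offset, pipeline, query_id):
    return {
        "payload": record,
        "lineage": {
            "event_id": str(uuid.uuid4()),
            "source_table": source_table,  # e.g. "public.inventory"
            "wal_offset": wal_offset,      # position in the write-ahead log
            "pipeline": pipeline,          # which ETL workload loaded it
            "transform_query": query_id,   # which streaming query ran
            "captured_at": time.time(),
        },
    }

event = with_lineage({"sku": "A1", "qty": 3}, "public.inventory",
                     "0/16B2D98", "retail_cdc_v2", "q_enrich_inventory")
print(event["lineage"]["source_table"])  # triage starts from metadata like this
```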

Juan Sequeda: This is a really interesting point, because you can now get into such a granular level of detail, but you've got to be careful about it, too, because, well, do we need all of this stuff? Do we need all this level of detail?

Tim Gasper: Well, especially in that regulatory use case. I can see Eric Broda, that financial background, that's an obvious use case, but then I guess if you're trying to troubleshoot why these 10 events are malformed in some way, then it becomes very operational in that sense, right?

John Kutay: Exactly. The more metadata, the better, just so teams can go back and triage the lineage and say, "Yeah, this is correct," or they can say, "Oh no, this query logic is actually wrong. I can go troubleshoot and fix it," and now they know the data's correct. I think it's definitely something that will arm data engineering teams with more ammunition in terms of giving their BI users confidence in the data.

Juan Sequeda: All right. Well, streaming is something that I personally think is going to just naturally come in without us realizing. I'm going back to the whole knob thing. I think one day we'll realize, as you said yourself, anybody should be able to go do streaming. I think that's the exciting thing: it's this quiet thing that has always been around, because we've always had it and we've always moved things, and the world is getting faster. We're expecting things now, so it's going to come, and it's starting to come right now. I really appreciate that we've had these discussions, so let's move to our lightning round, presented by data.world, the enterprise data catalog for your data and knowledge. I'm going to kick it off first. Is streaming a better term than real time analytics to describe event driven processing?

John Kutay: Yes. Oh, you want me? Should I elaborate or is it lightning?

Juan Sequeda: No-

Tim Gasper: If you have something you want to add.

Juan Sequeda: Or maybe it's just yes, and that's it.

John Kutay: Yes.

Tim Gasper: Perfect. That's the best kind of lightning round answer. That's the best kind. All right, next question. Are there industries which really have no real time streaming use case?

John Kutay: No.

Tim Gasper: All right.

Juan Sequeda: Okay, so all industries will have some sort of real time use case.

Tim Gasper: Maybe varying degrees of applicability, but there's something for everyone.

Juan Sequeda: All right.

John Kutay: Yeah, it's so horizontal in terms of the technology that I think that all industries have some use case for it.

Tim Gasper: Okay. That makes sense.

Juan Sequeda: All right, next question. Will stream processing expand in popularity and functionality, where it will actually handle batch use cases?

John Kutay: Yes. Yes, you'll see hybrid streaming and batch workloads in one pane of glass.

Juan Sequeda: Is that coming more from the streaming tools doing batch, or are the batch tools going to do streaming? Who's getting to that "battle" first?

John Kutay: Well, I actually see it going both ways: Snowflake coming out with streaming ingest, BigQuery having their own streaming ingest layer, and then you're going to see streaming tools work with those APIs very well, so that it's very transparent to the user as to whether this is a streaming workload or a batch workload.

Tim Gasper: Interesting. I feel like this is a trend that's resurfacing again. My big thing in streaming was Apache Storm, and with that, you could do streaming or you could actually build more of a batched inaudible, so it's interesting that we're coming to that again. All right. Last lightning round question. There are some new streaming data warehouses coming out, like Materialize and Rockset, and you even just mentioned that things like Snowflake and others are starting to add some streaming capabilities to their feature sets. Do they compete with stream processing?

John Kutay: Yeah. If this is lightning round, I'll say, yes.

Juan Sequeda: Expand.

John Kutay: There's nuances. Yeah, okay. There's nuances-

Juan Sequeda: ... We need some context around this way.

John Kutay: Well, yeah. I would say there are really two types of streaming: the data streaming with movement, where it's really focused on... processing it, sending it to some other system, and then there are other things that are closer to databases, where I'm just going to ingest the streaming data, index it, and you're going to query me to get the real time analytics, rather than me sending it to a Snowflake or a BigQuery. The product I work on is more on the former, where we're focusing on the streaming movement and data integration, but we're going to see the data streaming databases get into it as well.

Tim Gasper: Interesting. Yeah, a lot of overlap, a lot of maybe confusion, but that's why we're here, to help cut through the BS and help us figure out what's going on, so this is good.

John Kutay: The key is not to reinvent the wheel. Almost everyone has a warehouse today. Work with that first, try to get it to as near real time as possible, see if that fits your needs, and then look into investing in streaming databases or infrastructure.

Juan Sequeda: Well, takeaway time. Tim. TTT, Tim, take us away with your takeaways. Go first.

Tim Gasper: Yeah, sure. This has been awesome, John. I feel like I've learned a lot about streaming: real time versus batch, versus real time stream processing, change data capture. There's been a lot that we've covered today, which has been awesome. Just to start things off, you mentioned that of all those definitions, you homed in on streaming. What does it mean to do streaming data or streaming analytics? And you mentioned that it's the power of collecting data as it's new and processing it in a sequential manner, capturing it in an event driven way and working with it in an event driven way. You gave an example of doing maintenance related to airplanes, where you may want the ability to do real time monitoring and actions on that kind of data, as an example of something that real time streaming might make sense for. We talked a little bit about near real time versus true real time. What is real time? Is it five minutes? Is it 10 minutes? Is it five seconds? Is it milliseconds? And the answer sounds like: it depends. It depends on your use case, depends on what you're trying to do. When we're talking about streaming, it does sound like we're leaning a little bit more towards those use cases where you can, when you're being event driven, handle things in a pretty fast way, get into those seconds, get into those milliseconds if needed. And then, it just depends on the use case, and you can monitor this. You mentioned data freshness SLAs a few times as being very applicable to this. When you talked about streaming real time, you said it's usually message bus related: Spark, machine learning, alerts and notifications. These are the kinds of things that will often key off of streaming, and you gave that example of a major retailer where a mobile app might be kept up to date with real time inventory information, so that when somebody buys something, they're not going to accidentally buy something that's not available or not in stock. And then, when is streaming not the right thing? You said, if you're using a data warehouse and you're servicing a BI report and there isn't really a use case for having that information within seconds, that's an example where streaming might be unnecessary from a cost standpoint or from a complexity standpoint, and there are really other use cases that are more appropriate for that streaming paradigm. I thought those were some really interesting things there. Juan, what about you?

Juan Sequeda: Well, the main thing for me is this whole knob. I think we agree on the vision that the knob should go from batch to real time streaming. Today it's different. It's hard, because there are very real architectural differences around that, but the vision is that, for the user, it should just be a knob, and I think we're going to get there sooner rather than later. We talked about, underneath the hood of streaming, message buses: is that the right starting point? It's the source, in particular the change data capture. Looking at using write ahead logs and change logs and the journal of transactions, and we're seeing things like Salesforce, you said, have a CDC API, but until that's common, with SaaS applications you have to do everything in a polled approach. We talked a lot about ETL and ELT on streaming. If it's ELT, the EL is going to be the streaming, and that's just really fast loading, and then you're doing the transformation later in your data warehouse. But when you do ETL streaming, then you're literally doing that transformation on the data on the fly, so that's when you really have those real time use cases where you need to do that type of modeling. Which is interesting when you bring in lineage: with streaming, you can really get into much more granular, event level lineage, and at the end of the day, more metadata is better, so we get that more granular level of detail about what's going on. It can help us answer a lot of questions. How did we do? Anything we missed?

John Kutay: Those are amazing takeaways. It really sounded like you were able to ingest a lot from my talk in a streaming fashion.

Tim Gasper: It was very real time.

John Kutay: The one thing I'll say is if you're doing change data capture from a source system, you should always be streaming the data. You shouldn't be doing batches because that's going to cause inaccuracies in your data, even if your end system is a warehouse.

Juan Sequeda: Interesting.

John Kutay: So that's the one little caveat-

Tim Gasper: Okay, good advice.

John Kutay: One little caveat. If you're doing change data capture, you should always be streaming, even if you're loading into a warehouse ELT style.

Tim Gasper: That makes sense, and then you can read it in order, et cetera.

Juan Sequeda: All right, John, back to you. Three questions. What's your advice about data, about life? Second, who should we invite next? And third, what are the resources that you follow?

John Kutay: My advice about both data and life, a long time ago, I was a musician and I went to music school, San Francisco Conservatory of Music, and basically what I learned there is, five to 10 minutes a day, practicing your craft is better than marathon sessions every week or two. When it comes to data, it's the same thing. Just spend five, 10 minutes a day just dedicated to learning and getting whatever it is you're trying to get better at. Can you remind me the other questions?

Tim Gasper: That's great advice.

John Kutay: I lost them.

Juan Sequeda: Who should we invite next?

John Kutay: Oh man. I have a lot of people. I know you've had Sarah Krasnik.

Juan Sequeda: Yep.

John Kutay: Always worth inviting back. I would recommend her. Ethan Aaron, are you following him on LinkedIn? He has some very-

Juan Sequeda: I am. I am.

John Kutay: ...very cool takes. I know you had Chad Sanderson in the past. Those are all the people I follow, but you're already on top of that. Aaron's LinkedIn has a lot of great insights and a very contrarian view on topics, which I love to hear. In the data industry, we sometimes get a little trapped in inaudible people like Ethan to keep us on our toes and really question some of the assumptions out there.

Juan Sequeda: All right. Well, Ethan, if you're listening, we'll definitely invite you over. This is great. I've been following a lot of this stuff, and I think they were chatting with Joe and Matt yesterday, something about data contracts and stuff. We're going to have Joe Reis and Matthew on our podcast, on our live show, next week.

Tim Gasper: Next week.

Juan Sequeda: Which is September 8th, I think. Finally, what resources do you follow? You mentioned some people who you follow on LinkedIn. Blogs, books, podcasts, share with everybody.

John Kutay: I definitely recommend my podcast, What's New in Data. We really just focus on the latest trends, and we have great guests. Joe Reis is coming on as well. I just picked up his book on data engineering. This weekend, Labor Day weekend, I have it on my calendar to go through it. I'm really excited to read it. I also read Zhamak Dehghani's Data Mesh book. I know there are so many opinions on data mesh flying around that seem a little uneducated, and I really wish everyone just read her book, because it really does clarify a lot of stuff, so that's a great resource as well. And this podcast, Catalog and Cocktails. I love tuning into it. It's a great one; pretty much everyone who's a thought leader right now has been on this pod, so I'm really excited to keep listening.

Tim Gasper: Absolutely, and plug your podcast, too. Tell us about yours.

Juan Sequeda: Yeah.

John Kutay: Yeah, What's New in Data, we talk about what's changed, what's new in the data industry. We go live Wednesdays at noon Pacific. inaudible on September 7th is going to be Seattle Data Guy. My most recent guest was Sarah Krasnik, and we've been relatively light in terms of just trends, high level, just keeping people up with the latest, rather than diving too deep into a specific subject. I know there are a lot of great pods where people go deep on certain subjects. On What's New in Data, we're trying to keep it like: this is a new term that's coming up, what is that? And just keeping people in sync with things. You can follow it at striim.com/podcast, Striim spelled S-T-R-I-I-M. You can subscribe to it there.

Juan Sequeda: If you want to have a really good podcast day, on Wednesdays you tune into your podcast, which is at 12:00 PM Pacific, 2:00 PM Central, and at 4:00 PM Central we do Catalog and Cocktails live, so that can be a fun afternoon.

Tim Gasper: Amazing.

John Kutay: Podcast Wednesday.

Juan Sequeda: There you go.

Tim Gasper: That's awesome.

John Kutay: Man, if you went through those podcasts, you can learn a lot on Wednesdays.

Juan Sequeda: Learning Wednesdays, I love this. I think this is coming out this week; this is the first bonus episode we're going to be pushing out. September 22nd is the data.world summit. Just sign up, go to data.world. Right now, we're listing the schedule of all the amazing speakers we have. It's going to be a full day packed with so many different events. That week, we're going to be live from Big Data London, so it's going to be a fun week, and you'll be there, too, John, so we're probably going to cook up some really cool Catalog and Cocktails events, and I'm looking forward to seeing you again. I saw you at Gartner, I saw you at Snowflake, and next, Big Data London, so thank you so much. Thank you. Thank you. Thank you. As always, thanks to data.world, which lets us do this podcast, the data catalog for a successful cloud migration. Cheers, John.

Tim Gasper: Cheers, John.

John Kutay: Hey, cheers, Juan and Tim. Thanks so much for having me. Cheers to data.world as well.

Speaker 1: This is Catalog and Cocktails. A special thanks to data.world for supporting the show, Karli Burghoff for producing, Jon Loyens and Brian Jacob for the show music, and thank you to the entire Catalog and Cocktails fan base. Don't forget to subscribe, rate, and review wherever you listen to your podcasts.

DESCRIPTION

Lucid Streaming; how to take full advantage of data streaming


  • Data streaming, Stream Processing, Real-time analytics, operational analytics — what is this? What’s the difference?
  • Most important use cases for data streaming
  • There are lots of misconceptions, especially for the MDS crowd (not as much enterprise), about fast batch vs streaming
  • Memory-first processing (in-memory) vs disk-based batch jobs
  • Change data capture (and only capture of change)
  • Data warehouses are now trying to support streaming more (like Snowflake)
  • This will be a big deal to make it so that more streaming can happen
  • Streaming warehouses (Rockset, Materialize) vs data streaming
  • Lineage - transformed data - can I trust this data I'm looking at
  • How does data streaming and lineage come together? What’s unique about lineage in a streaming context?
  • If time: what does it mean to do streaming data products in a data mesh context?


Today's Hosts

Tim Gasper

|VP of Product, data.world
Juan Sequeda

|Principal Scientist & Head of AI Lab, data.world

Today's Guests

John Kutay

|Product at Striim, Host of "What's New in Data"