Where are the semantics in the data dictionary? w/ Dan Bennett

This is a podcast episode titled Where are the semantics in the data dictionary? w/ Dan Bennett. The summary for this episode is:

Machines and people. Why can't we just speak the same language? The truth is we can, and doing so could make life demonstrably better for data scientists. Yet here we are, living in a world of rows and columns that few people outside of the data owner understand.

Join this week's episode of Catalog & Cocktails as hosts Juan Sequeda and Tim Gasper, with special guest Dan Bennett, tackle semantics and how to get everyone -- machines and people -- on the same page.

Key Takeaways:
[00:01 - 02:47] Intro & Cheers
[02:49 - 04:57] If you were the picture for a word in the dictionary, which word would it be?
[04:58 - 08:35] The Greatest Sin of Tabular Data
[08:40 - 11:02] Examples of semantics missing inside of tabular data and their utility
[11:03 - 12:39] Adding context and profiling data
[12:43 - 14:55] How are constraints and semantics being defined, and what is a scalable approach?
[15:02 - 16:24] Data producers and enriching data
[16:30 - 17:58] Enrichment that travels with the data
[18:00 - 20:14] What are the tools we use, the data dictionary, and standardizing
[20:16 - 22:34] Metadata and the bridge to the semantic world
[22:36 - 24:57] Innovation and Dan's thoughts on relational model table relationships
[24:57 - 28:09] Solving the same problems over and over again
[28:12 - 30:19] Network effect, the marketplace of ideas and social spheres
[30:21 - 33:56] Diving into the network effect and the semantic world
[33:58 - 36:48] Why redefine if an option exists that can be used, and thoughts on simple ideas being the best solutions
[36:52 - 40:11] Figuring out supply and demand curves for S&P Global
[40:12 - 44:34] The business value of data and data literacy in accurate findings
[44:44 - 46:19] Advice to data leaders and vendors
[46:37 - 50:14] Lightning Round
[50:29 - 56:48] Takeaways
[56:52 - 59:44] Three final questions

Tim Gasper: Hello everyone. Welcome. It's time for Catalog & Cocktails, presented by data.world, the data catalog for leveraging agile data governance to give power to people and data. We're coming to you live from Austin, Texas. It's an honest, no BS, non-salesy conversation about enterprise data management with tasty beverage in hand. Hi, data.world. I'm Tim Gasper, long-time data nerd and product guy at data.world. And this is Juan.

Juan Sequeda: Hey Tim, I'm Juan Sequeda, principal scientist at data.world. And as always, it's Wednesday, middle of the week, towards the end of the day, and it's our time to take a break to go chat about data, have some cocktails and have a lot of fun. Today, our very special guest is a friend of mine, who I go back with from the Semantic Web community. This is Dan Bennett, he's a chief data officer at S&P Global Commodity Insights. Previously, Dan, going back, I remember you were at Thomson Reuters and then Refinitiv, where you were doing all the PermID linked data graph, go to permid.org. You can check all the stuff that Dan was working on a long time ago. It's been the foundations. Dan is a guy who's gotten the whole semantics and knowledge and linked data and all that stuff from the beginning, and it's awesome to have this conversation with you. Dan, how are you doing?

Dan Bennett: I'm doing great, thanks. Hi, Tim. It's fantastic to be on.

Juan Sequeda: Awesome.

Tim Gasper: Fantastic to have you on.

Juan Sequeda: What are we drinking? What are we toasting for then?

Dan Bennett: Well, I'm in the office because I'm the sort of person who worries about hotel wifi. I'm visiting New York right now, not based here. I've got espresso right now and I apologize, it's not really a cocktail unless you put some vodka in it, but that's that. But what I'm toasting is the fact we seem to have had a free and fair election. That's a good thing for this country.

Tim Gasper: Word.

Juan Sequeda: That's a great one. How about you, Tim?

Tim Gasper: I am drinking here some Texas-Style Bock from Community Beer Company. It's a local beer here. Pretty good. I'll cheers to free and fair elections as well. Thank you everyone who voted.

Juan Sequeda: Cheers to the elections. I made a cocktail, having a spicy tamarind pineapple Paloma. Smirnoff in Mexico has a fantastic infused vodka, which is a spicy tamarind vodka. Highly recommend. I've only found it in Mexico. Then there was some pineapple I had in my kitchen and then I have Squirt, which is what you use for Palomas. Instead of tequila, I used that. But anyways, this is nice and spicy, really refreshing. Cheers to fair elections.

Tim Gasper: Cheers.

Dan Bennett: Cheers.

Juan Sequeda: All right, so we got our funny question today. If you were the picture for a word in the dictionary, which word would it be?

Dan Bennett: I hate questions like this. You really made me sweat on this one, Juan. I landed on bemused because as you can tell by my accent, I'm not originally from around here. I moved to the US about 17 years ago and I'm still, despite having married an American and having all these American in-laws, there's still so much about this country that bemuses me. And so, there's that level of bemusement and then I'm a parent to, well, one teenager and one 20-year-old. Man, does that bemuse you. Then just working in data and technology, I feel like, and this is the segue onto what we were talking about, we know so many of the answers. We know how to do this stuff and it just bemuses me that as an industry and as practitioners, we are not able to get this stuff done like we should be able to. I'll take bemused please, Juan.

Juan Sequeda: Wow, that was a very well thought-out answer knowing that we told you this question five minutes ago.

Tim Gasper: That's a very well-

Juan Sequeda: How about you, Tim?

Tim Gasper: You know what? I'm going to go with just a funny response on this one and I'm going to say, I'll be next to the word y'all because I grew up in Cleveland, Ohio. And so, I had my hard consonants, my nasal talk, and a lot of, "Hey guys," and I have successfully learned to say y'all. If Tim can do it, anybody can do it, y'all.

Juan Sequeda: If I would do one, this is, I'm trying to get something very different. I guess you would have this as a little phrase, salsa dancer, salsa dancing. That would be me, because that is a passion that I have that I grew up in Cali, Colombia, which is the salsa capital of the world. That was something I've done and I continue to do it. What is a picture of salsa dancing? That would be me right there. Salsa dancing.

Tim Gasper: Nice.

Juan Sequeda: All right, well let's kick this off. All right, honest, no BS. Dan, a couple of months ago, you wrote this great blog post called The Greatest Sin of Tabular Data. Honest, no BS, what is the greatest sin of tabular data?

Dan Bennett: Well, this stemmed from some conversations we were having internally and just thinking about why does it take us so long to get some of our data engineering done and what's the challenge there? As we were talking about it internally and I was thinking about it, I realized that so much of this is because in tabular data or in a data dictionary of a database, all we can do is describe things in terms of their primitive types, so whether it's a float, or an integer, or a date. As humans, we're pretty good at understanding that and interpreting something beyond that. But typically, even if there's a comment in the data dictionary or there's a PDF attached to it that describes the schema or describes the CSV file you're going to get, it's all just human readable and not machine readable. It just seems bizarre to me that we haven't solved that problem, that we can't say in a machine readable way, "Hey, this is a float, it's a floating point number, but it's also a unit of measurement. It's barrels of oil per day," or whatever that type is. It's like a stronger type that sits above those primitive types. If we can do that, my Scooby sense is that we can start to then write code that will do some of those simple conversions for us. That to me seems like a really, really wonderful opportunity to cut down on that. There's that old joke about 80% of data science is data engineering, and it's like all the jokes, it's funny because it's true. We're not going to cut the 80% down to zero with something like this, but maybe we can chip away at it and maybe we can make it easier by writing meta code that will solve for some of this. To me, that's the greatest sin. It was fun, I put it up internally and as we got talking about it, I'm like, this is a good one to run externally as well. It had some feedback internally and then I put it on LinkedIn and you started to interact with it, Juan, and had some good feedback on it. It is a good one and there's stuff out there we can use to solve this, we just don't today.
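
A minimal sketch of the "stronger type above the primitive type" idea Dan describes, assuming a data dictionary entry could carry one extra machine-readable field. The `ColumnDescriptor` class, the `directly_comparable` helper, and the example.org URIs are hypothetical names invented for illustration; they are not from the blog post or any product.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ColumnDescriptor:
    """One data dictionary entry: a primitive type plus an optional stronger type."""
    name: str
    primitive_type: str                  # what data dictionaries already record
    comment: str = ""                    # human readable only
    semantic_type: Optional[str] = None  # hypothetical URI naming the stronger type


def directly_comparable(a: ColumnDescriptor, b: ColumnDescriptor) -> bool:
    """True only when both columns declare the same machine-readable stronger type."""
    if a.semantic_type is None or b.semantic_type is None:
        return False  # only primitives are known, so a human has to interpret
    return a.semantic_type == b.semantic_type


# Hypothetical URIs, for illustration only.
BARRELS_PER_DAY = "https://example.org/units/barrels-per-day"
LITERS_PER_SECOND = "https://example.org/units/liters-per-second"

output_a = ColumnDescriptor("output", "float", "refinery output", BARRELS_PER_DAY)
output_b = ColumnDescriptor("production", "float", "daily production", BARRELS_PER_DAY)
flow = ColumnDescriptor("flow", "float", "pipeline flow", LITERS_PER_SECOND)
untyped = ColumnDescriptor("value", "float", "see attached PDF")

print(directly_comparable(output_a, output_b))  # same stronger type
print(directly_comparable(output_a, flow))      # both floats, different meaning
print(directly_comparable(output_a, untyped))   # comment only, nothing for a machine
```

All four columns are floats, so the primitive type alone cannot tell them apart; only the extra field lets code decide without a person reading the comment.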

Juan Sequeda: In a nutshell, the greatest sin, putting it in my words, is that tabular data is just that, data for SQL, which is like here's a number and then you can do, okay, it's a number, it's a positive number. You can do some means or just basic statistics around that stuff, but you don't know what that actually means. What is the meaning behind it? You probably have to go ask somebody about it and then you go get that, and then that meaning gets lost. It's not in a way that's tangible for computers, machines, to be able to go interpret that, and that's the problem. It's like this is getting lost. We're seeing now a lot of these conversations around data contracts. I mean, for me, it's like we're all going back in circles. I mean yes, we need, call it, contracts, but this is just constraints. This is just a well-defined schema, knowing not just, oh, this is a column that has an integer. Let's get more specific about that. Give me some concrete examples of what are the semantics that you're missing inside that tabular data and what would you be doing with that? Honest, no BS, real examples.

Dan Bennett: Real examples. Look, the business I'm in is all about commodities information and commodity pricing. The information we share with our customers and we use internally is often those kinds of production data sets. It's this refinery produced this much gasoline this week, those kinds of data sets. But when you're talking about that, you have to talk about that in a volume of measurement. You have to say, is it billions of barrels? Is it thousands of barrels? Each one of those has a definition. And so, the simplest thing is to be able to say, "This is the definition of that type." But once you do that with the semantic world, as you know, you can start to hang other things off it. You can start to say, "Yeah, this is billion barrels of oil per day." That's how we're measuring it. We know the number's not going to be a negative number now. We can start to describe some of the constraints around it and describe what we expect that number to look like. We can start to describe confidence intervals and describe what the standard deviation might be and you can just hang all these additional things off. Data quality can then become something where you're not inferring it all or trying to describe all of that after the fact. It's just a fact of saying, "This is my type and this is what I'm hanging on there." To me, you start with those stronger types, those machine readable types that give you that. Then in the semantic world, we can link that. We can say, "Well, look, if it's barrels per day, here's a conversion factor that will give you that in liters per second." It's just simple math. If you've got that, now I can write software so that where I find another data set that's in liters per second, it will do that join for me and do that comparison. Whether that's in a BI tool or whether that's in a data engineering pipeline, doesn't matter. The point is you've enriched that column of data with a more sane, productive use of what that data is.
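
The conversion and the non-negative constraint Dan mentions can be sketched as follows. This is an illustration under stated assumptions: the `FlowUnit` class, the helper names, and the example.org URIs are hypothetical, and the barrel is taken to be the 42-US-gallon oil barrel (158.987294928 liters). A published methodology may define its conversion factors differently.

```python
from dataclasses import dataclass

LITERS_PER_BARREL = 158.987294928  # one 42-US-gallon oil barrel, in liters
SECONDS_PER_DAY = 86_400


@dataclass(frozen=True)
class FlowUnit:
    uri: str                  # hypothetical identifier for the stronger type
    liters_per_second: float  # how many L/s one of this unit represents
    minimum: float = 0.0      # production volumes should not be negative


BARRELS_PER_DAY = FlowUnit(
    "https://example.org/units/barrels-per-day",
    LITERS_PER_BARREL / SECONDS_PER_DAY,
)
THOUSAND_BARRELS_PER_DAY = FlowUnit(
    "https://example.org/units/thousand-barrels-per-day",
    1000 * LITERS_PER_BARREL / SECONDS_PER_DAY,
)
LITERS_PER_SECOND = FlowUnit("https://example.org/units/liters-per-second", 1.0)


def violations(values, unit):
    """Return the values that break the constraint hung off the unit."""
    return [v for v in values if v < unit.minimum]


def convert(values, source, target):
    """Convert via the shared base (liters per second): just simple math."""
    factor = source.liters_per_second / target.liters_per_second
    return [v * factor for v in values]


weekly_output = [100_000.0, 95_500.0, -12.0]  # barrels per day; -12.0 is a bad reading
print(violations(weekly_output, BARRELS_PER_DAY))
print(convert([100_000.0], BARRELS_PER_DAY, LITERS_PER_SECOND))
```

Because each unit carries its own factor to a common base, code that meets a column in liters per second can line it up with one in barrels per day without anyone hand-writing that pair's conversion.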

Tim Gasper: You can make it much more productive if you add that context to it. But what do you say to people who think that either this needs to be or can be just fully automated like, "Oh, I'm just going to profile it or something like that and that'll give me my context." Are they right? Are they wrong?

Dan Bennett: It's an interesting one, because you can do a lot with that profiling of data. You can do your distributions, you can figure out your standard deviations, your mins and your maxes, the cardinality. There's definitely a fingerprint for a column of data that you get out of that. I would argue that that's not wasted work by any means. What I would say to that is that's a clue to how you can then give the human tooling that says, "Hey, we think this is temperature measured in Celsius because this looks like other times we've seen that. We've trained on that and we've seen prior data with this profile, yes or no?" And you could do the ranked list of, "We think it's this, this, this, this." To me, yes, you can profile all you like, but I don't believe that you could ever do this all just through data profiling and have sufficient confidence that you matched everything correctly. To me, the data profiling is an aid to the end-user to make it easier for them to add that labeling.
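
A rough sketch of profiling used as an aid rather than an answer: fingerprint a column, then hand the person a ranked list of guesses to confirm. The reference profiles, the example.org URIs, and the distance score are all assumptions made up for this illustration; a real tool would learn profiles from previously labeled data.

```python
import statistics

# Hypothetical reference fingerprints, as if learned from columns labeled earlier.
KNOWN_PROFILES = {
    "https://example.org/types/temperature-celsius": {"min": -40.0, "max": 50.0, "mean": 15.0},
    "https://example.org/types/barrels-per-day": {"min": 0.0, "max": 500_000.0, "mean": 120_000.0},
    "https://example.org/types/percentage": {"min": 0.0, "max": 100.0, "mean": 50.0},
}


def fingerprint(values):
    """A tiny column profile: min, max and mean."""
    return {"min": min(values), "max": max(values), "mean": statistics.fmean(values)}


def rank_candidates(values):
    """Order the known types from most to least similar to this column's profile."""
    fp = fingerprint(values)
    scored = []
    for uri, ref in KNOWN_PROFILES.items():
        scale = max(abs(ref["max"] - ref["min"]), 1.0)  # normalize by the reference range
        distance = sum(abs(fp[key] - ref[key]) for key in fp) / scale
        scored.append((distance, uri))
    return [uri for _, uri in sorted(scored)]


readings = [12.5, 18.0, 21.3, -3.0, 30.1]
for position, uri in enumerate(rank_candidates(readings), start=1):
    print(position, uri)  # a suggestion list for a person to confirm, not a decision
```

The output is only a shortlist. As Dan says, the labeling itself still needs someone to say yes or no.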

Juan Sequeda: These sophisticated, call it... I mean, I'm going to use the word interchangeably of semantics, well-defined schemas, constraints, contracts. For me, I'm just going to use all these words interchangeably for now. Who is going to be defining these constraints, these contracts, semantics? I mean, how are they being defined and what is a scalable approach to create and manage them?

Dan Bennett: Well, so the immediate commercial answer to that is where you have companies like us or any company that has an API, you have an incentive to make it easier for the consumer of that data to understand and get faster time to market with the consumption of that data. I would say if there were standards around here that the market was pushing for, for any data we are sharing with our customers, we'd immediately start to mark this up and put that labeling in there, because we do it today, we publish Swagger and it has all this commentary in it that says, "Here's what our API looks like." From the Swagger, yes, you can create bindings, so you can simplify some of that, but we're not giving you that extra lift. But if we could, if there was a way that as an industry we agreed we could do that, we straight away would do that. Then internally, to me, it all starts to be that snowball effect that if you've got two data sets and one's got this and you can see that, "Hey, pull this into my Tableau or my Power BI, it's much quicker because there's this intelligence built into my BI tool that understands how to do this." Then that starts to create an internal demand in the same way that you have internal demand for a data catalog today. To me, it's a sort of obvious consequence: the easier you make it for anyone to consume the data, the more you create that demand for the data to have that markup, if we all agree on how to do it. That's the biggest challenge here.

Tim Gasper: We need to agree on how to do it so that way, expectations are clear and the work is clear. But it sounds like you're referring to a feedback loop where if it's being consumed, that creates demand, that creates more benefit to, I'm going to call them the data producers to do a better job of enriching and documenting their data. Is this sort of a flywheel that you're talking about here?

Dan Bennett: Yeah, that's my, can I achieve this before I retire aspiration? Look, my company is in a really nice position on this because we have a little bit of a bully pulpit and we also are in a position where we're selling data and there's clear incentive for us to make it as easy as possible for our customers to consume that data. But I'd say internally within any organization, this is just an extension of that data governance. It's about making your internal data easier for everyone to consume and we have the data citizens, how is it that we're going to make it easy for them to consume this stuff? Yeah, there is a flywheel nature to this, no doubt about it, Tim, because we see that with every technology. What was the flywheel that drove HTML 30 years ago? It was because there's enough content and it was solving a problem that wasn't really solved before and it compounds and compounds and compounds and then all of a sudden, network effects take over.

Juan Sequeda: I love what you just said about... I mean, this is an issue about governance. Whenever I hear about governance, I'm a big fan. I love Laura Madsen's work, her book I recommend all the time, Disrupting Data Governance. The reason why I like it so much is because she has this great table that says, "How do you spend the work on governance?" We usually come from the world of it's all about protection, but it's now about being the ambassador of data so people can go use it. That's how you disrupt data governance. This is what you're saying, part of that disruption is making sure that the data's... We want the data to be used, but used even faster and successfully, and that means putting in the semantics. Putting that effort into semantics is part of that process of data governance, that offensive, proactive, let's go do something with the data.

Dan Bennett: Well, right. Here's the really cool bit: if you do some version of what I'm thinking about and we're talking about here, it travels with the data. If I create a query that joins two tables that have these semantic tags on, then the query result will have those tags as well, because the types don't change unless we do some kind of operation on the column, and then in that case, you would have to revert back to the primitive. The views and all of those other tools that we have for how we connect and join data, they can all take advantage of this and it can flow through as part of that technical lineage if we can make it machine readable.
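
One way to picture tags traveling with the data, as a toy sketch: a join carries each column's tag into the result, while a computed column starts with no tag and falls back to its primitive type. The table layout (`{"tags": ..., "rows": ...}`), the function names, and the example.org URIs are assumptions for illustration, not how any particular engine behaves.

```python
BARRELS_PER_DAY = "https://example.org/units/barrels-per-day"  # hypothetical URI
REFINERY_ID = "https://example.org/types/refinery-id"          # hypothetical URI


def join(left, right, key):
    """Inner join; the result keeps the semantic tags of both inputs."""
    index = {}
    for row in right["rows"]:
        index.setdefault(row[key], []).append(row)
    rows = [{**l, **r} for l in left["rows"] for r in index.get(l[key], [])]
    return {"tags": {**left["tags"], **right["tags"]}, "rows": rows}


def add_derived_column(table, name, fn):
    """A computed column gets no tag: it reverts to a plain primitive."""
    rows = [{**row, name: fn(row)} for row in table["rows"]]
    return {"tags": dict(table["tags"]), "rows": rows}


output = {
    "tags": {"refinery_id": REFINERY_ID, "output": BARRELS_PER_DAY},
    "rows": [{"refinery_id": 1, "output": 80_000.0}],
}
capacity = {
    "tags": {"refinery_id": REFINERY_ID, "capacity": BARRELS_PER_DAY},
    "rows": [{"refinery_id": 1, "capacity": 100_000.0}],
}

joined = join(output, capacity, "refinery_id")
derived = add_derived_column(joined, "utilization", lambda r: r["output"] / r["capacity"])

print(joined["tags"])                    # output and capacity keep their tags
print("utilization" in derived["tags"])  # the ratio has no stronger type yet
```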

Juan Sequeda: Let's get a little bit into the technical side here, because you said something, it's like if we agree how to do this, so how should we be doing this? What are the tools? I mean, we both, you and me, Dan, and also Tim here about the technology behind it, for us it's always these semantic technologies, semantic web technologies, the RDF and OWL, this stuff is out there. What are those other tools? Is it that? Is it something else? What should the vendors, other database vendors and BI vendors, and I mean, all types of vendors that are related to data, what is your message to them?

Dan Bennett: This is the bit I love. If you go back 50 years, one of the winning things that the relational database guys did, IBM and then Oracle, all of that late '70s, early '80s, was they really codified the data dictionary. If you think about how many times the data dictionary gets called within a database technology, everything relies on it. Because when you have a data dictionary, what you have is this description of your tables and your structure in a machine readable manner. We've solved machine readable to some extent already there and you have all of this tooling downstream of that that relies on it. Any catalog product will go and read the data dictionary of the database because there's a whole bunch of information in there that you can expose. Any BI tool goes and does the same thing. That data dictionary is this meta layer that, as an industry, we've done a really good job of standardizing on. It's interesting when you think about the big data guys as that came out with Hadoop and everything like that, they all copied it because it was something we all understood and if you made that available over an ODBC or a JDBC connection, it would all just work and the tooling on the other end would just work. This model of tables and columns and relationships between them is something that tons and tons of tooling works with.
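
To make "the data dictionary is already machine readable" concrete, here is a small sketch using SQLite from Python's standard library. SQLite exposes its dictionary through `PRAGMA table_info` rather than an information schema, and the `refinery_output` table and `describe` helper are invented for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE refinery_output (refinery_id INTEGER NOT NULL, week TEXT, output REAL)"
)


def describe(conn, table):
    """Read column metadata the way a catalog or BI tool would."""
    # PRAGMA table_info rows are (cid, name, type, notnull, dflt_value, pk).
    # The table name is interpolated, so only pass trusted names here.
    rows = conn.execute(f"PRAGMA table_info({table})").fetchall()
    return [{"name": r[1], "type": r[2], "nullable": not r[3]} for r in rows]


columns = describe(conn, "refinery_output")
for column in columns:
    print(column)
```

Everything a tool learns here is a name and a primitive type. Nothing says that `output` is barrels per day, which is the gap the conversation is about.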

Juan Sequeda: Just to confirm here, it's basically the definition of what a JDBC returns, the information schema, this is something that's well defined, every single database tool or engine provides this already, like you don't even think about it, right?

Dan Bennett: Right. It's all built in and that's one of the greatest gifts we've managed to give ourselves, whether by accident or by design, is things like JDBC and ODBC because they allowed that indirection layer but allowed that metadata, that data definition to cross that boundary, and that's incredibly powerful for a whole ecosystem of tooling that we would be much worse off if it didn't exist. My sense of that is, give me one more... When I describe a column, in most of the data dictionaries, there's going to be a type on it and there's going to be a human readable comment. It's give me another field in there which is a URI. This is the bridge to that semantic world. Juan, you and I talked about this many times, we sometimes do ourselves a disservice with the semantic world because we immediately jump off into talking about graphs and all of that stuff. I think people can get lost in that. What I'm saying is take that field, give me an ability for that field for me to put the URI of a more complex type in there. That can be a more complex type that I define and put out there, or it can be a Schema.org type, or any of the other masses of types that already exist out there. The really cool thing about this is there's already a demonstration of this with CSVW. You can go to csvw.org and it tells you exactly how to associate those stronger types, those more elevated types to a CSV file and we are using a little lump of JSON to do it. We've already shown that this can work, but to me, until it's something that's in those data dictionaries and it's something that can flow through that ODBC connection, that JDBC connection so that all the tooling can say, "Oh, is that there? I'll take advantage of that." That's where it becomes really powerful.
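
A sketch of what that "little lump of JSON" might look like for a CSV of refinery output, built as a Python dictionary so it can be printed or saved. The `@context`, `url`, `tableSchema`, `columns`, `datatype` and `propertyUrl` keys come from the W3C CSV on the Web metadata vocabulary; the file name, column names, and example.org URI are invented, and using `propertyUrl` to point at a unit is one modeling choice among several, not something the episode specifies.

```python
import json

# Would conventionally be saved next to the data as refinery_output.csv-metadata.json.
metadata = {
    "@context": "http://www.w3.org/ns/csvw",
    "url": "refinery_output.csv",
    "tableSchema": {
        "columns": [
            {"name": "refinery", "titles": "Refinery", "datatype": "string"},
            {"name": "week", "titles": "Week ending", "datatype": "date"},
            {
                "name": "output",
                "titles": "Output",
                # Primitive type plus a constraint: never negative.
                "datatype": {"base": "decimal", "minimum": 0},
                # Hypothetical URI standing in for the stronger type.
                "propertyUrl": "https://example.org/units/barrels-per-day",
            },
        ]
    },
}


def semantic_links(meta):
    """Map each column name to the URI attached to it, if any."""
    return {
        column["name"]: column.get("propertyUrl")
        for column in meta["tableSchema"]["columns"]
    }


print(json.dumps(metadata, indent=2))
print(semantic_links(metadata))
```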

Tim Gasper: Dan, where does technology play a role in this? Is it serving the need and we just need to embrace it more? Is the tech falling short?

Dan Bennett: Well, so I think my hot take on that is going to be that the predominant model that we have today, which is a relational model, tables, relationships, that data dictionary thing we just talked about, is an industry standard. We all are using it. But where's the innovation in that? The innovation in that has all been in how we store, how we do query execution, how we do performance. We had the whole columnar thing happen. We've got the DuckDBs of the world. That's all great. But where I would argue that side of the technology is letting us down is they're not solving these questions of interpretation. They're not trying to solve that. Then you've got a semantic view of the world, which is all about solving that. But because there's a real tendency for that to stare at its own navel quite a lot and lose... It's never really become that much of a mainstream product and mainstream technology. Part of that I suspect is because that semantic world and that relational world or that relational ecosystem, if you want to call it that, haven't really connected and found the common ground. That's why I get so excited about this and that's why I joke about retiring on this one, because I feel like CSVW and what we're talking about here is where those two Venn diagrams can slightly overlap and we can get some advantage of both.

Juan Sequeda: I mean, I'm with you on this, Dan. I mean, we've come from the same pedigree, you and me, on this, but then I'm pausing because... Here's the thing, this doesn't seem to be a technical problem. I mean, I think the solution is there. Now, I ask myself, why aren't the vendors or actually the consumers, who are now talking about all these data contracts, the semantic layer, dbt is bringing the semantic layer on these things. You go off on LinkedIn and you're hearing everybody talk about data contracts and data quality. We have these pains. There is the incentive and now everybody's going off and basically reinventing things and doing things again and again. Why is that?

Dan Bennett: Why is it we're not using CORBA? Why is it we're not using WSDL and SOAP? As an industry, we love to solve the same problems over and over again. I've never met a developer who didn't like to solve a meta problem. It's always far more fun to abstract the business problem you've been given and say, "Well, I'm going to write something that's a bit cleverer than that and create another layer of abstraction that solves that problem." That's a far more intellectually satisfying task. It reminds me of the quote that there's no problem in computer science that can't be solved with another layer of abstraction, except the problem of too many layers of abstraction. I think there's part of this which is people come up with new ways to solve old problems and sometimes those things move along a little bit. But you look at GraphQL, you look at OData, well, great, you've built another query language but we already had one of them. I think some of this is just that human nature that we have to try and solve our problems again. I think some of this is essentially an idea that it seems maybe too hard or too difficult to actually do that semantic thing because people don't understand it. I don't know. I struggle with this one. I really do, Juan.

Juan Sequeda: Because I play devil's advocate and this is like we could just implement these types of semantics in some sort of a stored procedure or in triggers or in things like that.

Dan Bennett: Yeah, you can.

Juan Sequeda: I mean, now you have Python. People can embed Python inside of Snowflake and stuff like that. Isn't that the answer?

Dan Bennett: No it's not.

Juan Sequeda: Couldn't that be an answer?

Dan Bennett: I mean, I don't think it is because I don't think you've made it the same first class citizen that you have with the data dictionary. That's the greatest gift of the relational model, that data dictionary stuff. You can solve some of this with things like Great Expectations if you want, but you're not solving it in a way that gives you network effect. I think that's probably the key. Network effect is the important point here.

Tim Gasper: Can you talk a little bit more about what you mean by network effect? Because I see Juan and Dan glowing with that statement and I want our listeners to understand a little bit more about why you're glowing about that.

Dan Bennett: Network effect is fundamentally why Twitter is the number one platform and it maybe won't be later on. It's this idea that in any kind of competition of ideas or social networks or of other ways that we as humans interrelate, there's this marketplace of them and as some start to rise to the top, more and more people are talking about them and using them, whether it's a software package or a technology. If you are using that tool and I'm using that tool, our lives are easier, so then you and I start using it. It's this crazy non-linear scaling thing that happens and the social network guys all see it-

Tim Gasper: It's the web.

Dan Bennett: It's Tim Berners-Lee. Network effect was what made HTML and HTTP work, because there was stuff before it. There was Gopher and Archie and all those things that old people like me remember, but they didn't have enough usage and so they didn't reach that critical mass. We've seen this in technology time and time again, Tim, that there are some technologies that achieve that lift-off velocity of that network effect. Then there's some that die off. I'm old enough to remember SGML which was what came before XML and it was cool, but it was really, really tough to use. XML came along, it was a simpler version of it and all of a sudden, that was the predominant way of marking up text and then HTML came along. That network effect is that idea that if enough people are using it, everyone uses it, I guess.

Tim Gasper: This is interesting. This is an aspect that I think people don't talk about enough or think about enough and is maybe one of the undervalued aspects of the semantic world, but perhaps why it will succeed so greatly long-term. What I wrote down in my notes here is the more you add to it, one plus one is greater than two. Whereas with traditional approaches, one plus one equals two or, more often, one plus one is less than two. There's actually debt involved with accumulation.

Dan Bennett: Yeah, and that debt will drag you down over time if you're not careful. Again, I'm looking at this as the computer scientist, and to me, the reason I push so hard on this being a data dictionary thing and the reason why that seems so valuable is it's very rare in our domain to find standards that survive the test of decades. ASCII is one of them. Unicode will be one of them. There's some basic standards for how we describe characters. But SQL is one of those. That SQL model, how many people have now written SQL interpreters and query optimizers, but they all rely on that same language. That suggests to me that SQL and the data dictionary model is a really, really good abstraction. It's an abstraction that's adding a ton of value, therefore the desire to move away from it just isn't there. I'm just saying, "Hey, can I just add a little bit onto that, that's orthogonal to what you have already? Doesn't try to substitute for anything you have already," because to me, then you're riding on the coattails of someone else's network effect, I guess.

Juan Sequeda: To add to this network effect, part of it is that we're reinventing the knowledge. We don't need to reinvent that knowledge. At some point, we want to have that agreement and I think that's the network effect, saying, "Hey, we're talking about the same thing. Why keep doing this separately? Let's go combine it." I think that's what we see the web has done. I mean, Google Search and PageRank is that. People start pointing to things and that's the popular interest that people have. I think the goal here is also to just reduce the amount of redundancy that happens organically within that network. At the end of the day, it's like, "Hey, it's great. I don't have to go do that because somebody already did that. Thank you." Or, "If I need to extend it, then I only have to do that and not all the other work that people have done."

Dan Bennett: Yeah, well, here's a great sort of pseudo-semantic example. How many times do people just link to Wikipedia when they want to describe a concept, because it's like we all know Wikipedia, it's so big. It's kind of a de facto standard. If I want to say, "What's acid?" Well, here's the link to Wikipedia. And so, that link to Wikipedia, and you know where I'm going with this, Juan, it essentially becomes an identifier for that idea. A great example I've seen used in business is GeoNames. Wonderful website. Because if I say London to you, your interpretation of London and my interpretation of London, Tim, you've got one in Texas down there somewhere around you, but to me, my London's different. Whereas if we all agree on an identifier for London, then we don't have to go, "Oh wait, Tim was talking about a different London."

Juan Sequeda: I think this again goes back to why identifiers are so important, and I mean, the solution seems pretty straightforward: hey, take that data dictionary we already have, add a column where the semantics of that column is that there's more information, more knowledge in that link. Go follow that link and that's it. I'm not saying anything else. Then you follow that link and that link should go present to you some self-describing metadata saying a machine can interpret this. I think that's where we need to go. It sounds like a really simple solution. Is this it? You're calling on all the database vendors, all the tools: add another column to the data dictionary, call it "see also," where the type is a URI, and is that it?

Dan Bennett: Aren't some of the best ideas the ones that are really simple? I mean, that's basically it. I've talked to some of the vendors, like I said before, one of the benefits of this role is we get to have relationships with some of these guys. I do talk to them and they're like, "Huh, that is an interesting idea." Then of course, the second thing they say is, "We'd love it if our customers were asking us about this." It's like, well of course, I get it. That's how prioritization works. When you reached out to me about talking about this, that was one of my goals: go tell your vendors that you want this supported and to read Bennett's stupid blog about it and see if there's a there there. So, yeah, I think it can be that simple. I think what happens if you do that is it does... The immediate question then is, yeah but wait, what's that link going to go to? Again, if I think about the model where I'm sharing data with my customers, what that will force me to do is to put that documentation out there in a web-accessible form, machine readable and human readable, so that those types are out there and available. And so, our definition of a barrel is available and it's out there and we talk about what a barrel is. Interesting thing about us is, as a pricing agency, we have to be public about what our methodology is for pricing anyway. Part of being public about and explaining that methodology is to say what your conversion factors are. We have that data out there on the S&P Global website right now, but let's put it out there in a way a machine can read much more easily. It forces a hand and then you'll see people going, "Well, I don't want to redefine that type. Is there one I can already use? Oh yeah, I'll use GeoNames to describe places."
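The "one extra column" idea discussed here can be sketched in a few lines of Python. This is only an illustration, not anything taken from the episode or from Dan's blog post: the `ColumnEntry` class, the `see_also` field name, and the `example.org` URIs are all hypothetical placeholders. It shows a data dictionary that keeps its primitive type, optionally adds a link to a shared definition, and uses those links to spot columns in two datasets that mean the same thing even though their names differ.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ColumnEntry:
    """One row of a data dictionary, extended with an optional semantic link."""
    table: str
    column: str
    primitive_type: str              # what data dictionaries already record
    see_also: Optional[str] = None   # hypothetical extra field: URI of a shared definition


def described_type(entry):
    """Return the linked (complex) type if known, otherwise the primitive type."""
    return entry.see_also or entry.primitive_type


def shared_meaning(left, right):
    """Pair up columns from two dictionaries that point at the same URI."""
    by_uri = {}
    for entry in left:
        if entry.see_also:
            by_uri.setdefault(entry.see_also, []).append(entry)
    return [
        (a, b)
        for b in right
        if b.see_also in by_uri
        for a in by_uri[b.see_also]
    ]


# Placeholder URIs; a real deployment would point at published definitions.
CITY = "https://example.org/types/city"
BARREL = "https://example.org/types/barrel-of-oil"

ours = [
    ColumnEntry("production", "loc", "varchar", CITY),
    ColumnEntry("production", "vol", "float", BARREL),
    ColumnEntry("production", "notes", "varchar"),
]
theirs = [
    ColumnEntry("shipments", "destination_city", "varchar", CITY),
    ColumnEntry("shipments", "weight", "float"),
]

matches = shared_meaning(ours, theirs)
for a, b in matches:
    print(f"{a.table}.{a.column} <-> {b.table}.{b.column} via {a.see_also}")
```

The existing primitive type is left untouched and the link is optional, which mirrors the "orthogonal addition" framing in the conversation: columns without a link simply fall back to what the dictionary already says.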

Tim Gasper: This is an interesting example of tying it to a specific business scenario where you've applied this and it's created real value for you guys at S&P Global, but also more-

Juan Sequeda: For y'all, Tim. For y'all.

Tim Gasper: Yeah, for y'all and for y'all, for everybody. I got Texas right here on my poster, y'all. Can you give some other examples of how this creates business value or how money came out of it where this machine readable context was the difference maker?

Dan Bennett: Well, so to be clear, Tim, this is something we want to do and we are figuring out how we do it if we don't have the vendors helping us with it. The answer to that is an ugly answer, because it means you build some layer on the top, and do I really want to be in that business? But as we've talked about it internally, the value just internally is all about reducing our data scientists' time to merge data sets and build their models, and reducing the time for our modelers and analysts, so the folks who are figuring out the supply and demand curves for the next five years of diesel refining or whatever it is. It sits within our business, where we have these domain experts building these models and trying to answer those kinds of questions. It's about helping them and giving them quicker time to answer. Then it's about us having a better set of tools for our data quality. As data is passing through, if it's got these tags on it, there's a whole set of associated standards, a thing called SHACL and stuff like that, where you can then start to put real constraints around your data. Again, you can solve the meta problem of, let's just describe this in metadata, because the data quality tooling today really primarily rests on that profiling approach that we were talking about at the beginning. It's great and it's valuable, but it's even better if you actually know the type and you know some hard rules on that type. This is temperature measured in Celsius and it's never going to be below this minus number, I forget, minus 273 or whatever it is. It physically cannot be below that number. And so, if you see a number below that, it's an immediate data error. It's not just an outlier.
You can go after those data quality things, and then if I don't have a situation where I can share this directly with my customers because there's not a way for them to consume it, I can at least take this data and I can generate a whole bunch of user-facing documentation. In theory, you could generate your Swagger from this, all of those kinds of things, rather than hand curating that. Because every time you hand curate those, that's an opportunity to fail and it's an opportunity for it to drift from the actual underlying definition. We see that a lot in documentation just by nature.
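Dan's temperature example, a hard rule that comes from the type versus a statistical outlier that comes from profiling, can be made concrete with a small Python sketch. Absolute zero is -273.15 degrees Celsius; everything else here is an assumption for illustration, including the function name and the "typical" range standing in for whatever a profiling tool would learn. In SHACL a bound like this would normally be declared with a constraint such as `sh:minInclusive`; plain Python is used here only to keep the example self-contained.

```python
ABSOLUTE_ZERO_C = -273.15  # physical lower bound for any temperature in Celsius


def classify_celsius(values, typical_low=-60.0, typical_high=60.0):
    """Split readings into hard errors, outliers, and ordinary values.

    typical_low and typical_high are illustrative profiling bounds
    (assumptions), not rules that follow from the type itself.
    """
    report = {"error": [], "outlier": [], "ok": []}
    for v in values:
        if v < ABSOLUTE_ZERO_C:
            report["error"].append(v)      # violates the type's hard rule
        elif v < typical_low or v > typical_high:
            report["outlier"].append(v)    # unusual, but physically possible
        else:
            report["ok"].append(v)
    return report


readings = [21.5, -40.0, 95.0, -300.0]
report = classify_celsius(readings)
print(report)
```

A profiler alone could only say that -300.0 and 95.0 both look unusual. Knowing the column is a Celsius temperature is what lets the check say that one of them is impossible.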

Juan Sequeda: I'm starting to see this more with customers and prospects I go talk to: they want to have automatic generation of IT artifacts, saying, "Oh, I want to have a Swagger file, I want to have a SQL DDL, I want to have a PROV-O schema or whatever." These are things that are being created, and they're probably not always being created automatically. If there's some human involvement, then it's error prone right there. By the way, these are just syntaxes, different ways of representing the same meaning in different syntaxes. We just need to make sure that we're eliminating any type of errors introduced by humans. I think that's one of the advantages that I'm seeing of having the semantics: you're automatically generating many more IT artifacts. There are two aspects. One is from the technical side and one from the business side. I think on the technical side, I'm seeing, again, hearing all these conversations about data contracts and stuff, I think this needs to be really pushed down ideally to the moment that the data's being produced. It's something that the original producers of the data are the ones who need to be demanding too, saying, "Hey, if I'm creating this application, I want to make sure that the application is being kept correctly." What does correct mean? That's where the semantics is, because right now, that is getting lost, and then when we are consuming it, they have those issues, and then those things are happening somewhere else in the middle of the trajectory of where this data ends up. I think there needs to be more ownership also on the technical side of where the data's being produced about this. But then on the business side, I think this is where there are opportunities to be had that we're not there yet on as an industry, because I think there's still this big disconnect from what the business really thinks about the value of data.
The business provides value of data, because yes, we generate insights and all these things, blah, blah, blah, blah. But if we get into more of this technical stuff, where the tech side is already so concerned with their technical things, how do we drive that directly to making money and saving money? If I have this particular constraint on this column, that this number has to be between this and that, what is the implication for the bottom line, the revenue of the company, if that number is not in that range? I'm not saying that there's always going to be one, but I think this is the opportunity for the data teams to say, "Hey, if I start having that business literacy, understanding the context of the business, I can make that argument, saying we need to have well-defined data here because if the numbers ever come out wrong because of human error, whatever, we're trying to avoid these things, we're going to avoid all these risks, we're not going to leave money on the table, and so forth." I think that's the opportunity that the data teams need to have: to understand the business and be able to go translate this into business value, making money, like, "We need to make more money here," and that's how we're going to go explain this to other people. I'll stop ranting.
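The point about generating IT artifacts from one definition, rather than hand-writing each syntax, can be illustrated with a short Python sketch. The `TABLE` structure, its field names, and the `example.org` links are hypothetical; the two renderers just show the same meaning being emitted as a SQL DDL statement and as a JSON Schema style description of the kind that could feed API documentation.

```python
import json

# Hypothetical single source of truth for one table. Field names and the
# example.org URIs are illustrative assumptions.
TABLE = {
    "name": "daily_production",
    "columns": [
        {"name": "site", "sql": "VARCHAR(64)", "json": "string",
         "see_also": "https://example.org/types/site", "required": True},
        {"name": "volume_bbl", "sql": "DOUBLE PRECISION", "json": "number",
         "see_also": "https://example.org/types/barrel-of-oil", "required": True},
        {"name": "comment", "sql": "TEXT", "json": "string",
         "see_also": None, "required": False},
    ],
}


def to_sql_ddl(table):
    """Render a CREATE TABLE statement from the shared definition."""
    lines = []
    for col in table["columns"]:
        null = " NOT NULL" if col["required"] else ""
        lines.append(f"  {col['name']} {col['sql']}{null}")
    return f"CREATE TABLE {table['name']} (\n" + ",\n".join(lines) + "\n);"


def to_json_schema(table):
    """Render a JSON Schema style object description from the same definition."""
    props = {}
    for col in table["columns"]:
        prop = {"type": col["json"]}
        if col["see_also"]:
            prop["description"] = f"See {col['see_also']}"
        props[col["name"]] = prop
    return {
        "title": table["name"],
        "type": "object",
        "properties": props,
        "required": [c["name"] for c in table["columns"] if c["required"]],
    }


ddl = to_sql_ddl(TABLE)
schema = to_json_schema(TABLE)
print(ddl)
print(json.dumps(schema, indent=2))
```

Because both outputs are derived from the same structure, a change to a column only has to be made once, which is the drift problem described in the conversation.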

Dan Bennett: No, I think you're right. A worked hypothetical is inventory levels, whether it's warehouse inventory levels or it's your produced fuel, refined fuel sitting in a port. That inventory level is driving a whole bunch of financial decisions all along that supply chain of whatever that inventory is. Do I need to order more? Do I need to order less? The more accuracy and the less error you have in that inventory level, especially if you're doing this at scale where there's a lot of just algorithmic decision making based on this, the better, because those can be meaningful numbers that come out the end in terms of the consequences if that inventory level comes through completely wrong. We are seeing it a little bit in the chip industry right now, where there's a massive glut of certain chips because back in COVID, you couldn't get them, and then it turned out everyone was actually hoarding, and this is the monetary impact that these things have. Now, imagine you get some errors in that data. They've got to be in there. I don't believe all that stuff is 100% accurate.

Tim Gasper: Dan, before we go to our lightning round, for our listeners, for the data practitioners, the data leaders and even the vendors, what's your advice for them? What do we do next? What's the action?

Dan Bennett: The action is, if you buy into this, if you think that there might be value here, especially if you're a big company that has pull with these vendors, next time you're meeting with the account manager, throw that blog post at them and say, "Hey, this is a cool thing. When are you guys going to do this?" When you're talking to us or your information providers, say, "Hey, when are you guys going to do this?" Because ultimately, the way these things change is someone sees some dollars in adding this feature and someone sees some first mover advantage. Think about serverless analytics. Snowflake came along and really drove that market because there was real dollars to be saved in that. Now, everyone's got a serverless answer because keeping the cluster up and running all the time is expensive. This is the same thing. Once it gets to a certain critical mass of people saying, "You know this would really help, can we get this done?" someone will see dollar figures and they'll prioritize it. If it's as simple as I think it is, it's not even a big ask to add it into the data dictionary, it really isn't.

Juan Sequeda: This is a great takeaway. You have a very specific action for everybody here. Hopefully, this is for all sides, for the vendors, for the buyers, for the consumers, for the producers of data, everything. I think this is a very, very specific takeaway. Let's go to our lightning round, which is presented by data.world, the data catalog for your successful cloud migration. I'm going to go first. We talked a lot about data dictionaries. Is the data dictionary going to go away so we have this new thing, or is it just going to be an expansion in the same direction we have today, not a big change, just a slight thing, or does it need to be revamped?

Dan Bennett: Nope, just an ever so slight expansion. It already tells you the type. All I'm asking for is for it to tell me the complex type as well if it knows what that is, otherwise just tell me the primitive type.

Tim Gasper: Nice. Embracing data dictionary, taking it to the next level.

Dan Bennett: Yeah, exactly.

Tim Gasper: All right. Second question. If we truly solve semantics and context around data, can we get to a point you think where things like data integration are automatic because that context is just so prolific or is that a pipe dream?

Dan Bennett: I think you can get 80% of the way there. I'm a huge believer in the 80-20 rule and data integration's so freaking hard today and we spend so much time and money on it. That's what I'm trying to solve for here is let's solve the 80% grunt work of that.

Tim Gasper: We solve 80%, we make a lot of people happier.

Dan Bennett: Right.

Juan Sequeda: We still need the consultants and stuff to go to that 20%, right?

Dan Bennett: Well, absolutely. There's a business there.

Juan Sequeda: You don't take all the business away. All right, so third question. Do all the semantic standards that we need already exist? We just need to embrace them, learn them, popularize them, or is there stuff that's actually not out there yet that still needs to be defined?

Dan Bennett: Interesting question. I think, yes, all the semantic standards exist for the base level of implementation here. I think if this actually got traction, you'd probably see kind of son of SHACL or daughter of SHACL that was more around the profile of the data and allowing you to describe the profile of the data, which SHACL doesn't, at least it's been a little while, but I don't think it really... SHACL's more declarative and deterministic on the way it describes the data right now.

Tim Gasper: All right. Well, fourth question, last question. Are you a fan of the buzz around semantic layer driven especially by dbt and those folks, or are you disappointed in it, concerned by it? What's your adjective?

Dan Bennett: That "semantic" is doing a lot of work in that marketing speak. I'm not a fan of any use of the word semantic that doesn't include machine readability, because to me, how can we be so far down this AI road and not have really addressed that? It blows my mind. So, no, unless we mean by that machine readability, not a fan.

Juan Sequeda: That is a very important takeaway right there. I'm marking that, it's minute 49-something, what you just said.

Dan Bennett: That's a very-

Juan Sequeda: That's an honest, no-bullshit commentary right there on the semantic layer. All right. I mean, Dan, I told you we can keep talking about this topic for hours and hours, but it's time to go to our takeaways. Tim, take us away with your takeaways.

Tim Gasper: Awesome. Takeaway time. We started off with what's this blog post you wrote about needing to really expose The Greatest Sin of Tabular Data. You really pointed to the question: why does it take so long for us to do our data engineering? It all points back to the tabular database being very limited in the context that you can describe. You can describe these primitive types, float, integer, string, and then maybe you're going to create a PDF or something like that which has documentation or imagery and things like that. But that's human readable, not machine readable. If it is machine readable, it's barely so and lacking context. Really, it's not just a float, it's a unit of measurement. It's barrels of oil per day in this context. It sits above the data and if we can connect this and if we can write this... We'll be able to do things, write the code that actually provides the conversion for us. It's an opportunity, you said, to cut down on this whole "80% of data science is the plumbing and the janitorial work" and things like that. We can really make a huge dent in this. What are the semantics that you're missing? Right now, often the data's being used in oil and gas, these are the production data sets, but you have to talk about these things in terms of the units of measurement, in terms of the specific definitions. The way that we're doing it right now is too simplistic. You have to be able to do more and it has to be more declarative, more binary, where it can never be negative, here's the confidence interval, it should always be this type. You talked about how automation is not some sort of a silver bullet here. It helps, it provides additional context, you can get that fingerprint, it's not wasted work for sure, it's part of the overall equation, but there's more that needs to be done that involves humans and there has to be a process where humans are involved. And so, we talked a little bit about scale. Well, how do you scale that human involvement?
You said that companies like S&P have incentives to really make the data usable as fast as possible, and those incentives are important. And so, if you think about things like Swagger and saying, "Hey, you got to create Swagger documentation," you got to think about similar things in the world of semantics and the data dictionary, and really think of semantics as an extension of data governance. Then we started to get into how you agree on how to do that. Back in the day, the data dictionary was this concept that came out, but it wasn't fully executed. Especially when things like big data came out, it also made this a challenge. It came with a data dictionary, but it was not something people could use very easily. There's just a whole bunch of opportunity here for improvement and it can be simple. You pointed to a very simple opportunity, which is what if you just have the ability to have an extra field that points as a reference to something else, and then now you have your identifier. All of a sudden, the game changes. And so, I think that's an interesting opportunity here and then I'll toss the baton over to you, Juan, to continue.
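The takeaway that a column is "not just a float, it's a unit of measurement" and that code can then do the conversion for us can be sketched as follows. This is a hedged illustration only: the unit identifiers under `example.org`, the `TypedColumn` class, and the `convert` helper are hypothetical names. The one hard fact used is that an oil barrel is 42 US gallons, which is 0.158987294928 cubic metres.

```python
from dataclasses import dataclass

# Hypothetical unit identifiers; example.org stands in for wherever an
# organization would publish its definitions.
BBL_PER_DAY = "https://example.org/units/barrel-per-day"
M3_PER_DAY = "https://example.org/units/cubic-metre-per-day"

# One oil barrel is 42 US gallons, i.e. 0.158987294928 cubic metres.
M3_PER_BARREL = 0.158987294928

# Conversion factors keyed by (from_unit, to_unit).
FACTORS = {
    (BBL_PER_DAY, M3_PER_DAY): M3_PER_BARREL,
    (M3_PER_DAY, BBL_PER_DAY): 1.0 / M3_PER_BARREL,
}


@dataclass(frozen=True)
class TypedColumn:
    """A column of floats that carries its unit identifier with it."""
    name: str
    unit: str
    values: tuple


def convert(column, target_unit):
    """Convert a typed column to another unit, or fail loudly if no rule exists."""
    if column.unit == target_unit:
        return column
    try:
        factor = FACTORS[(column.unit, target_unit)]
    except KeyError:
        raise ValueError(f"no conversion from {column.unit} to {target_unit}")
    return TypedColumn(column.name, target_unit,
                       tuple(v * factor for v in column.values))


output = TypedColumn("refinery_output", BBL_PER_DAY, (1000.0, 2500.0))
metric = convert(output, M3_PER_DAY)
print(metric.values)
```

With only a bare float column, the person merging two datasets has to find out the unit from documentation. With the unit attached, a mismatch is either converted automatically or rejected with an explicit error.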

Juan Sequeda: We're talking about how technology may fall short here. I mean, in this data world, we have the relational model as a main standard, but innovation in the relational database world has been mainly about storage and compute, vectorization, but they are not solving this hard problem of interpretation of semantics. There seems to be this gap between the semantic world and this relational world. They haven't found a common ground. You did highlight CSVW, take a look at csvw.org, the CSV on the Web standard that the W3C has, and it's a way of showcasing how this overlap is starting. But as an industry, we love to solve the same problems over and over again. We did CORBA. I got introduced to CORBA in 2003, 2004, I don't know. That was just... WSDL. But creating another layer of abstraction, that's always more of an intellectual challenge. We want to go work on those things, and I guess that's why we reinvent the wheel a lot. But what we really need is to make semantics first-class citizens, that's the important thing, and make that part of the data dictionary, because, yeah, we can go solve this with any other tech, but we need to make them first-class citizens because by doing that we can have that network effect. I think that's a very clear key takeaway here: we want to be able to have that network effect just like the web has it. I think Tim brought a very insightful observation here, it's that one plus one is greater than two, because traditionally, we think about one plus one is two, or often one plus one is less than two because we have that debt in there. We've always thought about the standards that will stand the test of time: ASCII, Unicode, SQL. We really need to be able to build on the shoulders of those giants.
I think that's why tapping into this SQL, extending that a little bit, having that column as a "see also," and pointing it to that URI, is a very small lift and you're already making it a first-class citizen right there. Seems like a very simple solution, but how do we get this? Your call to arms here is, "Hey, if you buy into this, especially if you're a big company that has pull with all these vendors, throw your blog post at them." Don't forget the blog post is called The Greatest Sin of Tabular Data. You should Google that. Throw that blog post at them and ask them what they're going to go do with this. If you work at a company that consumes data from some data company, throw it at them too. I think at the end of the day, people will start seeing the money and the first mover advantage around that. By having these semantics, I mean it's about reducing time, it's about reducing risk. We can start automatically generating more of these IT artifacts and then start tying it more directly to making money and saving money. Dan, how did we do?

Dan Bennett: Yeah, you nailed it, guys. It's pretty good coverage.

Juan Sequeda: Well, I mean, it's all your content here.

Tim Gasper: All you.

Juan Sequeda: Let me throw it back to you. Three final questions, Dan. One, what's your advice? Second, who should we invite next? And third, what resources do you follow?

Dan Bennett: My advice is when you're doing a podcast, move around more so the occupancy sensor doesn't turn off the light in the office here. But more importantly, I've been in this career for almost 30 years now and what I've learned is you've got to enjoy what you do and do what you enjoy. Life's too short not to, and I know this might sound completely hackneyed, but man, I just love this stuff. And so, it's easy to talk about and it's easy to just geek out on this and keep going after it and it's just fun. Hopefully, that comes across. Who should we invite next, or you guys should invite next? I want to recommend my good friend Giuseppe Saltini, he's based in the UK. Well, I was trying to remember if you've ever met him. I think you might have, but he was one of my big semantic guys at Thomson Reuters and he's one of those ones, he's the rare breed who can talk semantics and talk the business. In our semantic community, sometimes we lose the plot a little bit when we talk to the business. He does that real good. Then the final question you had was what resources do I follow? I'm pretty boring on this stuff. I was thinking about it when you sent me that question through. It's Daring Fireball. I just love the way John Gruber writes. If you ever want to figure out how Apple thinks, you just read what he writes and it's pretty much there. I'm a huge, huge blockchain skeptic, so I'm loving Web3 is Going Just Great. That blog turned up earlier this year or late last year, and of course, like FTX happening today. It's good to get confirmation bias from that blog. Ars Technica is wonderful. Their coverage of so many things is really, really good and in depth. Then one that if you're in the UK, you probably know, which is good for basic tabloid-level IT coverage, is The Register. I love those guys and their writing is just the right level of humor.

Juan Sequeda: This was great content for us to go follow. Dan, thank you so much for that. I do encourage everybody, go Google The Greatest Sin of Tabular Data and you'll be very happy and surprised to see what shows up there.

Dan Bennett: Some SEO going on there.

Juan Sequeda: All right. Well, next week, we're going to have Theresa Kushner from NTT DATA. We're going to be talking about are the data teams actually keeping up with the AI teams? That's our topic for next week. With that, Dan, thank you so much as always. Thanks data.world who lets us keep doing this every single Wednesday at Catalog & Cocktails. Dan, Tim, thank you so much. Cheers.

Tim Gasper: Cheers, Dan.

Dan Bennett: Cheers.

Today's Host

Tim Gasper | VP of Product, data.world

Juan Sequeda | Principal Scientist, data.world

Today's Guests

Dan Bennett | Chief Data Officer at S&P Global Commodity Insights