KG + LLM = Happily Ever After? w/ Tony Seale, Knowledge Graph Engineer


This is a podcast episode titled, KG + LLM = Happily Ever After? w/ Tony Seale, Knowledge Graph Engineer. The summary for this episode is: Knowledge graphs and large language models. A match made in heaven. This pair will drive centralization and eventually work with AI to further consolidate and connect your data. So why aren’t people prepared for it and what’s going to happen when AI steps in? Join us on this week’s episode of Catalog & Cocktails with hosts Tim, Juan, and special guest Tony Seale, Knowledge Graph Engineer.

Transcript

Tim Gasper: Hello everyone. Welcome once again to Catalog & Cocktails, presented by data.world. It's your honest, no BS, non-salesy conversation about enterprise data management with tasty beverages in hand. My name is Tim Gasper, longtime data nerd, product guy and customer guy at data.world, joined by Juan.

Juan Sequeda: Hey everybody. I'm Juan Sequeda, the principal scientist at data.world, and as always, it's a pleasure to take a break and chat about data and knowledge, and AI today is our topic. We're going to be talking about knowledge graphs and large language models, a hot topic, but the person I want to introduce is somebody, if you have not been following him on LinkedIn, you've truly been missing out. I'm super, super excited to have Tony Seale, who's a knowledge graph architect. Tony is one of those people who's just been around the knowledge graph, data integration world for so long, has gone around in circles, and everything he says, I'm like, oh man, we need more Tonys in the world who just say these things. And he's been really, really thinking about all of these aspects of not just knowledge graphs, but together combining them with large language models. We've been having so many fantastic conversations on the side, I'm like, all right, we just need to record these conversations right now. Tony, I'm super excited. How are you doing?

Tony Seale: I'm good. Thank you so much for having me here on your show. I've watched it and I've enjoyed it. I love the people that you're getting together and the interesting conversations you're having. It's fantastic.

Juan Sequeda: Well, now it's your turn. Let's just kick this off. We've got so much to talk about, and we can talk hours for these things. All right, Tony, honest, no BS, what's the deal with all of this GPT large language models and knowledge graphs?

Tony Seale: Yeah. So as you say, I have been thinking about this for a while. Before large language models really hit on the scene, it's been quite obvious that a big change, a big revolution has kind of been sitting there waiting to happen, slowly building up as far as the AI thing is concerned, and now it's just beginning to bleed into the public consciousness. But really we're only at the very beginning of this exponential curve as it begins to take off. There have been several of these kind of revolutions, the major one being the agricultural revolution before, and then the industrial revolution, and we're just at the beginning, I think, of this technological revolution that's really going to kick in. It started off, but it's still got a long way to go. And I guess what motivates me is, I think it would be a real shame if as this revolution occurs, we lose diversity from the ecosystem of our various different companies that we've got. So yes, the big players, they understand AI, they understand data, they're very data centric, and they have a lot of smarts about it, what about all of the other companies who are doing really great jobs, have got a really valuable thing to add into the market, but now they're looking at this AI and it's like, well, what's going to happen to me in this context? So that's where I think the knowledge graphs come in, because they give an opportunity for every organization to take the asset that they have, which is all the data that they've been curating over a long period of time, and then consolidate and connect that and organize it so that the semantics are clear around it. And then once you've got that, then you too have got a large interconnected dataset that you can run AI over. And these are the things that I've been thinking about for a while, and large language models are now giving an opportunity, I think, to start taking some of these concepts onto the next stage. And I guess we'll get into some of that. 
We can talk about the details of that.

Tim Gasper: Interesting. Tony, go in a little bit more. I know we want to talk a lot more about knowledge graphs and that, and how LLMs and these things can come together, but let's go back to this comment around diversity. Can you talk a little bit more about what you mean about diversity, and the differences that you're painting between maybe the larger companies, the big tech companies that are going to really push the AI to its limits, versus other maybe more diverse companies, organizations that are going to struggle?

Tony Seale: Sure. So if you're a big internet search engine or you're one of these big large language model AI research companies, then the pattern for you is that you're creating a general set of information. So there, you have everything everybody's put on the web. So the web has been developed there. It's this great mechanism through which information can be shared and linked, and your goal then simply is to go out onto this huge repository of data... You want to learn about cats, there's millions of instances of cats that you can go down and you can pull your models. It's this generalized information, which is great, but if you are in a specific vertical, you're a train provider, you're a hospital, you're a commercial retailer, you're a bank, going out to the internet for this kind of generalized information is not what you're about. You're about a local specialized set of information. You have your own private information. So how are you going to be able to do all of this AI stuff that requires these vast datasets? But if you think about it, it's like a beggar sitting on top of a gold mine, because actually we do have a lot of data, it's just that it's fragmented and separated at the moment. It's in all of these separate different siloed databases. If you can just connect that information together, if you can get clear about what the semantics are and create a clear semantic layer over the top of it and organize it, then suddenly you can join in the game too.

Tim Gasper: Right. No, this is interesting. The visual that comes to my mind is, there's this whole saying around, oh, there's T-shaped people and there's I-shaped people and things like that. A lot of this generative AI that's been trained on the internet and things like that, it's got a very strong top bar, it's very broad in what it can do. These different organizations, they have very deep vertical knowledge in various areas, and an interesting thing that we need to see now is the marrying of these two things together to create that T shape, right?

Tony Seale: I think that's exactly it, and an understanding of what information is private and what information is public, because obviously there's going to be a strong drive to make as much information public as you possibly can in a format that the large language models are going to be able to understand. Because if you're offering a service or you've got a set of products, the more information that you can put out there onto the web about those services and products, the better the large language model is going to know you and understand you, so the more likely it is, when people start asking it questions, to funnel them towards what you have to offer. So that's going to be really important.

Juan Sequeda: Hold on, you just said something really interesting here, which I don't know how many people are going to agree with this, and people who call BS on it, which is you just said there's a strong drive to make information public.

Tony Seale: I think there will be. Yeah. Yeah-

Juan Sequeda: Hold on. Let's expand on this one, because yes, government agencies and stuff, that data should be public already, that continues to happen, but companies, their private data, what incentives and what's going to drive them to make their information public?

Tony Seale: Not your private data. You want the exact opposite for your private data, but for the stuff that you want people to see, like the surface area of your products. So the inaudible is already there. Your company's got to have a website these days. So all of that information that you would've put in a brochure before, you now put that out on the website so that people can go onto the web and find your information. And the large language models will be digesting that, but increasingly I predict there'll also be a drive to put any data about those inaudible-

Juan Sequeda: So this is interesting. So are you making the argument that these large language models will become the next version of search for the web?

Tony Seale: I think so, yeah, yeah. Me and my wife actually, we couldn't work out where we were going to go on holiday this year. We'd been on holiday to Sweden the year before, and we'd had an amazing time, and how do we top that? And in the end, after not being able to work it out, I said, "Well, I'm doing loads of stuff with ChatGPT at the moment, let's ask ChatGPT, we want such and such a holiday." And it found us a very good holiday. We ended up booking in on it right down to the details of it, so exactly like search.

Tim Gasper: Interesting. This reminds me a little bit of how ChatGPT recently came out with a plugins feature, which is like, hey, let's let the AI tap into our data feeds for Kayak, for travel, whatever it might be. There's a bunch of plugins out now. Do you see that as the first step towards, hey, let's put our data out on the web?

Tony Seale: Yeah, I think so. Yeah, exactly. But then also from the other side, something that I'm personally interested in is this silent movement where, to do search engine optimization, more and more people have been putting JSON-LD into their website, as well as putting the text there to say, right, okay, this is our product, or this is our service, this is our location. At the same time, people will put alongside that this little nugget of data, this little island of data which will be in JSON-LD, and that JSON-LD will then link back to schema.org, and on schema.org they'll be like, here's the semantics of what a product is or a business is or a doctor's surgery is. So most people don't know this, but over 40% of websites now contain these islands of data within them. So across the web now, we've got a whole bunch of data as well. And I can't see inside GPT, but I know what I would do if I was those guys, and from the experiments that I've done accessing it, I think they're using that data. And why not? It makes perfect sense.
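
To make the "island of data" idea concrete, here is a minimal Python sketch of pulling a JSON-LD block out of a web page, roughly the way a crawler might. The page, the product and the brand name are invented for illustration; only the `application/ld+json` script type and the schema.org context are real conventions.

```python
import json
from html.parser import HTMLParser

# A made-up page: free text for humans plus a JSON-LD island for machines.
PAGE = """<html><head>
<title>Example Kettle</title>
<script type="application/ld+json">
{"@context": "https://schema.org", "@type": "Product",
 "name": "Example Kettle", "brand": {"@type": "Brand", "name": "ExampleCo"}}
</script>
</head><body><p>Our kettle boils water quickly.</p></body></html>"""


class JsonLdExtractor(HTMLParser):
    """Collects every <script type="application/ld+json"> block on a page."""

    def __init__(self):
        super().__init__()
        self._in_jsonld = False
        self._buffer = []
        self.islands = []

    def handle_starttag(self, tag, attrs):
        if tag == "script" and dict(attrs).get("type") == "application/ld+json":
            self._in_jsonld = True
            self._buffer = []

    def handle_data(self, data):
        if self._in_jsonld:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._in_jsonld:
            # Parse the accumulated script body as JSON.
            self.islands.append(json.loads("".join(self._buffer)))
            self._in_jsonld = False


def extract_jsonld(html):
    parser = JsonLdExtractor()
    parser.feed(html)
    return parser.islands


islands = extract_jsonld(PAGE)
print(islands[0]["@type"], "-", islands[0]["name"])
```

The unstructured paragraph and the structured island sit on the same page, which is the pairing the conversation keeps returning to.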

Juan Sequeda: Let me articulate what I'm getting out of this conversation up to now, which is I'm having a little bit of an aha moment. So if we go back to Tim Berners-Lee's early vision of the web, going into linked data and the semantic web, it has always been about putting your raw data on the web. I think that's where the whole semantic web and linked data movement came from. That's you, me for 15 years in this stuff. It's the linked open data cloud, all that stuff. And there has been an initial academic curiosity driver for that. I think 10 years ago, we saw a big push because of Google and Yahoo and Bing and all these search engines who created schema.org. So for the folks who haven't heard about it, just go to literally schema.org. This is a schema and ontology that has been created over the last 10 years in a very decentralized community way where basically you're telling people, if you want to go describe this thing on the web, here's your schema, here's how you create the markup, or here's how you add the structured data on your website such that when crawlers from Google, Yahoo, Bing and so forth go off and bring in the documents, the text, they actually can extract the structured knowledge out of it, the triples, the graph, and they start building their graph. And that's why Google has created their knowledge graph. Now that's 10 years ago. A lot of this stuff was built also using the stuff that comes from scraping the web and all that data that shows up on the web through schema.org. Now the point that you're making is that 10 years later, now we're here, we're like, there's another big motivation. I don't know how many people are seeing this, but you're hypothesizing or you're predicting here that this will be another big motivation to say, wait, if the new version of crawlers of the web is going to be GPT, ChatGPT and stuff like that, in addition to giving them just the raw text, let's give them the raw facts, the triples basically. 
And it already knows... I'm using the word knows here, but we got to be careful about this, because we don't really know what it actually knows, but it has so much of this text of triples, which essentially are observations, let's call this. It actually has the potential to learn much more about it, because it has all this structured information in there. So you have this combination of structured and unstructured. So at some point, the second or third wave, I don't know, of incentives to start putting structured data on the web is just going to grow because of GPT.

Tony Seale: Yeah, that's my working hypothesis. Yeah, exactly. And it kind of leads us then nicely into starting to look at one of those division points between structured and unstructured texts. So the beautiful thing about JSON-LD is, it's sitting there on the same page. So basically you've got your unstructured data, which is the text that you've written about your recipe or product or whatever, and sitting next to it you've then got your structured data, which is linking back to a well-known schema. So I think that there's an interesting pattern there where basically we're seeing the unstructured text being grounded in ground-truth facts. Now, people put bad JSON-LD out there, so facts in inverted commas, but one of the things that I talked about recently is that unstructured text is not as unstructured as maybe we thought. GPT, essentially, and the other large language models have clearly discovered a strong semantic structure within all of that text, which is what they've been able to learn off. And with graphs and that kind of JSON-LD structure, we've got a data structure that's not your rigid box-shaped relational database, so maybe the division between structured and unstructured data is not quite as stark as we once thought. And this ability to start to bring the two together, I think, is at the heart then of the interface that we're looking to make with large language models and knowledge graphs. It's that same pattern. It's the text on one side, the large language model learning about the semantic structure within that connected, and grounding it into truths within a graph.

Tim Gasper: I like this pairing that you're talking about here as a pattern, unstructured text grounded in... And yes, you put air quotes around it, truth/facts. Let's dive into this more in this next section here. What does it look like for knowledge graphs and semantic approaches and LLMs to work together? What does that look like?

Tony Seale: Yeah. So there's different ways that they can interface with each other. So I put out something on my LinkedIn feed which shows one way of doing that, which is that basically because I hypothesized GPT has been learning about JSON-LD and schema.org so well, because it keeps bumping into it on the web, what you're actually able to do as one way of interfacing with it, is you're able to say large language model, I have a question for you. I'm going to give a question for you in natural language like, what companies are likely to do well in the UK after a cold snap? And because GPT understands the semantics of that so well, it's able to come up with what it thinks that those companies would be. But you say, respond to me in JSON-LD using schema.org that you know about, so then GPT knows enough about that to basically do whatever it's kind of thinking is, oh, I think Marks & Spencer will do well, and I think British Gas will do well, and maybe some of these other ones, and then it will take British Gas as an entity that it's recognized, it will take what it knows about British Gas, and it will turn that into a bit of a graph in the JSON-LD format, which you can then get back, including its relationships to other things. So basically I put a question into a large language model in natural language, it used its smarts to come up with an answer, but it's responded to me in a graph representing what it knows about it. And two things. So one, you can get it to make that graph conform to schema.org, so we've got clear semantics about the schema, the inaudible side of things, but then also what you can do is, you can use some kind of Rosetta Stone to effectively ground the ID. For the ID of this thing, I would like you to use Wikidata. So that's an open knowledge graph that's out on the web. So I'm going to ground that British Gas entity into the knowledge graph's unique identifier, the URI, that that knowledge graph has got for it. 
And then if I've got my own company data now sitting back there, what I can do... And I've got my catalog of companies. If I've connected my IDs that I'm holding locally into Wikidata's URIs, I can basically get a mirrored graph fragment back from my internal data that I'm holding there. And because graphs are just made for integration and connectivity, those two things then just marry together. And then I can kind of say, now on my local data, go a couple of hops out, maybe look for sales that I've done with these companies, and go and pull that information out. So it's now got this working memory graph that I can go and do a bunch of inference off of. So that's one quite exciting channel. The other thing is, it knows how to do these various graph query languages. So you can do it the other way and say, go execute this query against my store. And again, you put natural language in, that's going to get translated into a query language like SPARQL or Cypher, or pick your favorite, and then it will return that information from the graph. And so I could see a situation developing where this then becomes maybe some more interactive conversation.
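
The grounding-and-merge step Tony describes can be sketched in a few lines of Python. Everything here is hypothetical: the JSON-LD fragment stands in for a model response (no model is called), the Wikidata URI is a placeholder rather than a real identifier, and the `urn:crm:` identifiers and predicate names are invented for an imaginary internal catalog.

```python
# Stand-in for what a model might return when asked to answer in JSON-LD
# with schema.org types and Wikidata identifiers. Not real model output.
LLM_RESPONSE = {
    "@context": "https://schema.org",
    "@id": "http://www.wikidata.org/entity/Q_PLACEHOLDER_1",  # placeholder URI
    "@type": "Organization",
    "name": "Example Energy Ltd",
}

# A tiny internal graph as (subject, predicate, object) triples. The local
# company record has already been linked to the same public identifier.
INTERNAL_TRIPLES = {
    ("urn:crm:company/42", "sameAs", "http://www.wikidata.org/entity/Q_PLACEHOLDER_1"),
    ("urn:crm:sale/1001", "buyer", "urn:crm:company/42"),
    ("urn:crm:sale/1001", "amountGBP", "2500"),
}


def jsonld_to_triples(node):
    """Flatten a single, un-nested JSON-LD node into triples."""
    subject = node["@id"]
    triples = set()
    for key, value in node.items():
        if key in ("@context", "@id"):
            continue
        predicate = "type" if key == "@type" else key
        triples.add((subject, predicate, str(value)))
    return triples


def sales_for(entity_uri, triples):
    """Hop from a public URI to local records, then out to their sales."""
    local_ids = {s for s, p, o in triples if p == "sameAs" and o == entity_uri}
    return sorted(s for s, p, o in triples if p == "buyer" and o in local_ids)


# Because both sides are triples sharing one identifier, merging is a set union.
working_graph = jsonld_to_triples(LLM_RESPONSE) | INTERNAL_TRIPLES
sales = sales_for(LLM_RESPONSE["@id"], working_graph)
print(sales)
```

A real setup would use an RDF library and a triple store rather than Python sets, but the shape is the same: a shared URI is what lets the two graph fragments marry together.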

Juan Sequeda: Yeah. So going through this in my head... And I love this example. Basically you're saying you asked a question about X, and tell me more or answer me the question, but give me the answer in structured data, in particular something like JSON-LD or RDF Turtle syntax, and also with a particular schema, you already know that there is an agreement upon that GPT should know what's that agreement, because inaudible to the web. Now as a skeptic, playing devil's advocate, I can say, just give me back a table with that information. And technically it's almost the same, but I would argue the value of actually getting the raw triples, the raw graph back is that you could literally just add that back to your internal graph, literally just almost copy and paste it over there, and then you can interoperate with the current semantics in this case if you're actually using inaudible or language ontology. So it's this interesting circle that the GPT has already knowledge, that you want to extract that knowledge, being able to go use that knowledge and embed it inside of your organization or your private world, whatever. So I think that's that first kind of approach that you're thinking about.

Tony Seale: Yeah, that's right. And I think the point to highlight there kind of goes back to what we were talking about at the start. It's very easy to get focused in on the flashy AI bit at the top, but that is merely the very, very top of the iceberg. All under the water is basically data and data management. That's data pipelines, getting the information in there. Of course the algorithms are incredible pieces of kit, but in terms of workload, in terms of hours, man hours, it's actually all about the data that's going on beneath there. So that then turns back to the company. So what is the company going to do in order to be able to get this kind of integrated information structure, which is then when it comes back to the graph, because the absolute no-brainer open goal solution that I see, have been banging on about for ages, it's just mirror what's going on outside. So if out on the web you've managed to have all the 40% of pages, which are integrating all the information from different companies according to a common schema based on schema.org in a completely decentralized manner across different servers in different parts of the world with different teams and entirely different companies interoperating, why not just mirror that same pattern internally? So basically you can even go and grab schema.org. It's open source. So you can go and grab the code for schema.org with their baseline ontologies and their websites, you can stand up a version of schema.org within your company, and then you can begin changing the ontology to reflect the specific semantics of your own particular organization. And then within the organization itself, you just say the same thing, which is, okay, I would like all of you 50,000 different applications to begin collaborating slowly over providing your data according to, publishing it in JSON-LD according to, where possible, this common ontology that we're developing out. So then you basically have this mirror. 
You've got the web out there, and the web out there has got text, and it's got JSON-LD, and it's got schema.org, and it's got a centralized Wikidata knowledge graph, and then it's got large language models learning off of it. Mirror that internally, so now you've got your own kind of schema.org. You've got all the applications which are publishing up their information that's going into it. Some of that information is public information, so it can just be pushed as is straight out there onto the web, and some of it's private information, but it's in the same format, and you could do your own Common Crawl. Maybe a bit expensive for doing the whole time, but your data is all now available over HTTP, and it's all in plain text, so you could train your own private large language model, just like the ones that are being done outside, and then that's going to smooth this interface, because you are going to need this interface then. I see it like the boundary of a cell wall, and certain stuff is held private within it, and then there's going to be communication outside of it. You start to get a bit sci-fi at this point, but with an eye on... What do I think? I think it takes a long time, Juan, as you know, to work on these ontologies, to understand them, to have all the meetings, to do all the collaboration, to kick off the processes, to get the kind of data maturity. The time to start on this was five years ago, but failing to have done that, as many people as possible should be starting on this right now.

Tim Gasper: Yeah. The innovation is out in front now, and it's like... I think about the mobile revolution where everybody was bringing their phone to work and started using their phone for work email and everything like that, and of course we should have had our mobile governance policies in place before all that happened, but oops, I guess there's never a better time than now to fix the sins of the past. So this is exciting. And you've talked about existing knowledge graphs and things like schema.org that can be really helpful that's out on the web today, probably underutilized by the world, and hopefully this is something that really helps that to grow, and you also talked about getting knowledge graphs out of GPT. What about the other way around, an organization that has a knowledge graph that they have built or are building, and they want to put the KG into GPT or LLMs? What do you think about that and the patterns that might form around that?

Tony Seale: Yeah. Well again, you could have your own private model, which is you could take GPT on Azure, you can take the model that they've trained, you can stand up an Azure instance, and then you can do fine-tuning on it. And there's a few papers on basically you take your knowledge graph, you break it into the triples as text, and then you just train, fine-tune the large language models on the kind of facts that are coming out of the knowledge graph. So basically you take their base model, and then you say, I'm going to fine-tune you now so that you understand and are able to predict what's going on within my knowledge graph. So that's one way. And then as I say, as I briefly mentioned before, the other is that you can use GPT to then query your knowledge graph that you've got. So write me a SPARQL query that's going to do X, Y, Z, or a Cypher query or whatever, and then you can execute those queries against your knowledge base and return the results back out of it.
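
"Breaking the graph into triples as text" might look something like the Python sketch below. The triples, the camel-case predicate naming and the one-JSON-object-per-line output are all assumptions made for illustration; the papers Tony alludes to may verbalize triples quite differently, and each fine-tuning service defines its own input format.

```python
import json
import re

# Invented example triples; predicates use camelCase as many ontologies do.
TRIPLES = [
    ("ExampleCo", "headquarteredIn", "Leeds"),
    ("ExampleCo", "sells", "Example Kettle"),
]


def split_camel(predicate):
    """Turn 'headquarteredIn' into 'headquartered in'."""
    return re.sub(r"(?<=[a-z])(?=[A-Z])", " ", predicate).lower()


def verbalize(triple):
    """Render one (subject, predicate, object) triple as a plain sentence."""
    subject, predicate, obj = triple
    return f"{subject} {split_camel(predicate)} {obj}."


sentences = [verbalize(t) for t in TRIPLES]

# One JSON object per line is a common shape for training text files;
# the "text" key is a placeholder, not any particular vendor's schema.
jsonl = "\n".join(json.dumps({"text": s}) for s in sentences)
print(jsonl)
```

The output sentences are crude, and that is the point of the sketch: the facts leave the graph as ordinary text that a language model can be trained on.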

Juan Sequeda: So we're live brainstorming here how to interface knowledge graphs with inaudible. You've really been thinking about this first one of, hey, GPT already knows so much about JSON-LD and schema.org that it can, let's say extract basically knowledge graphs outside of GPT. So that's one approach that we're doing. A second one is, if it knows the context and knows the schema, it can generate the query using that inaudible. So that's the second one. A third one you just mentioned now is that you can fine-tune the large language models with the facts that come in from a knowledge graph. So a different approach, just turning the triples, the graph into text, and then passing that on, but fine-tuning. What about, how much have you been playing around with the prompt engineering and passing facts or triples, knowledge graphs through the prompts? I'm curious to hear your experiences around that.

Tony Seale: Yeah. Personally, not so much yet, but obviously eventually you could end up with this slightly two- way conversation where then you've got the graph that you've created, which represents your context of what you're doing at the moment, and then from that context, you generate a prompt to GPT. So I think that's probably how it's going to all pan out, something along those lines. And I think of it a bit like the kind of... I don't know if you read that Thinking, Fast and Slow book, but in the human brain, they talk about system one and system two, thinking fast and slow, and in some ways, I think maybe the interaction between the graph and the large language model might end up being something a bit like that where you've got these prompts with these very fast responses, but maybe a bit fuzzy that are then being grounded down into the system two, which is into the facts of the graph that you're then able to do a bit of inference off of that then will engineer a prompt, and that will then come back. But I think as you point out, and it's probably worth highlighting, we're right at the start of this, and we decided to have this conversation in order to put our thinking out in the open, but there's so much thinking to be done on it, and all these kind of frontiers that need to be going into. What we haven't talked about yet is vector and vector mappings. So it gets a bit more technical at that point, but-

Juan Sequeda: No, share your thoughts please.

Tony Seale: Okay. So just to briefly explain, with these large language models, what you do is, you get an encoding, which is basically I can put a bunch of text into a large language model, and it's going to give me back a string of numbers. And with these large language models, they're large because they're trained on so much data, but also because they have so many parameters in it. So you could see something with just two dimensions on it. I could have a map drawn with two dimensions on it, and I can give you two coordinates, and that's going to let you go and find the exact position on there. So now maybe I add three coordinates in, and it could either be up or down in the air, but still you're going to be able to go and find that. And these large language models, they're doing that for language, but they're doing it, if you imagine on a kind of inaudible however many billion or whatever points it is. So there's so many dimensions going on this, and each word or word part is existing somewhere on this really highly multidimensional space. And even with the early embeddings, you're able to take the number, the embedding vector for king, and then you're able to say, here's the embedding vector for female, add these two together, what's now going to be the closest embedding vector to this? Well, it's queen, because in this multidimensional space, the things are mapping to each other. And on the other side, you can pull out embedding vectors from graph neural networks as well. So you can go and run a graph neural network over your graph store, and you're going to get an embedding vector which is kind of operating in a similar space, not on words now, but in nodes that are in the graph. 
And there's a rise of these embedding vector databases like Pinecone and stuff where basically it's another way of interacting with GPT that doesn't take graphs there, but you can basically go and scan in a whole document or a chunk of a document, you can get the embedding vector for all of that, which kind of says this is in the space of words and concepts, this is where this thing is going to exist, and I can go and store that chunk of that document down into Pinecone or one of these vector embedding databases, and then I could have a question, and I've got this large embedding vector database, and I can just get the embedding vector for my question, and I can go and search the database and say, okay, of the embedding vectors I've already created from having scanned in all of these various documents, which ones are closest to this? And then you can go back to GPT and say, okay, well now examine these documents and summarize them and whatnot. So there's then on the edge of this stuff, there's this interesting idea of, well, can we bring these embedding vectors closer together? Can the graph embeddings be somehow mapped into the embeddings that we're getting off of the semantics of coming out of natural language via some kind of mapping between these embedding vectors? So I think that's really exciting. And ultimately if you look at the work of people like Michael Bronstein, they're looking for what they call these geometric deep learning patterns, and some people, there are certain papers which are saying, well, actually transformers are a type of graph neural network. 
I did a post on it, and some people disputed it, so I think it's fair to point out there's a bit of controversy around that, but the way I think of it is, if you could take it up to the next level where actually if we were able to discover a blueprint that is able to work over graphs and text, then perhaps the difference between structured and unstructured data would actually pretty much collapse at that point. But again, it's more like far future speculation.
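
The coordinate picture and the "king plus female lands near queen" example can be played out with a toy in Python. The three-dimensional vectors below are hand-made so the arithmetic works out; real embeddings have hundreds or thousands of learned dimensions, and a vector database performs the same nearest-neighbour lookup at scale with approximate indexes rather than a linear scan.

```python
import math

# Hand-made toy vectors, NOT real embeddings.
# Dimensions (by construction): royalty, person, female.
TOY_VECTORS = {
    "king":   [1.0, 1.0, 0.0],
    "queen":  [1.0, 1.0, 1.0],
    "man":    [0.0, 1.0, 0.0],
    "woman":  [0.0, 1.0, 1.0],
    "female": [0.0, 0.0, 1.0],
}


def cosine(a, b):
    """Cosine similarity: 1.0 means the vectors point the same way."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm


def nearest(query, store, exclude=()):
    """Return the key in `store` whose vector is closest to `query`."""
    candidates = {k: v for k, v in store.items() if k not in exclude}
    return max(candidates, key=lambda name: cosine(query, candidates[name]))


# Add the two vectors element-wise, then look for the closest stored vector.
combined = [k + f for k, f in zip(TOY_VECTORS["king"], TOY_VECTORS["female"])]
closest = nearest(combined, TOY_VECTORS, exclude=("king", "female"))
print(closest)
```

Document search over a vector store follows the same recipe: embed the question, call something like `nearest` against the stored chunk vectors, and hand the winning chunks back to the language model.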

Tim Gasper: Fascinating. This is very interesting in terms of looking to the future and where this may all go. Tony, just before we go into our lightning round here, I want to ask you one last question, which is that what happens if we don't use knowledge graphs with GPT? What if that just doesn't happen? What are we losing here? And can that happen? Should it not happen?

Tony Seale: Yeah, I guess it can not happen. And I think the vast majority of people are not doing that. So most people who are trying to do internal stuff with it are doing a combination of the fine-tuning and they're doing that Pinecone thing. So they're basically using tools like LangChain to chop up their internal documents, create vector databases, and then chuck them in to do prompt engineering. Yes, you can do that, but I think the elephant in the room from my mind is the data integration problem. We've got a whole bunch of unstructured data, but most of the time what you're wanting to do is actually ask questions related to facts within your databases. You want certainty, you want explainability, you want to be able to ground these things out. If you're working for a bank or something and want to use a model like this, there's quite a lot of, quite rightly, hoops that you have to jump through to prove that this model is not hallucinating and not going to do something really dodgy. You have to be able to ground it out, and that's where I think knowledge graphs interacting with large language models are really going to shine.

Juan Sequeda: Based on that comment, this is what I've been thinking about it, is that for all the cool things that people are thinking about, knowledge graphs are probably not on the top of the mind. They won't even need it. That's for the cool. But when you want to combine the cool and the useful, and where useful means that you need to have, let's use the word governance here, trustworthiness, hallucination is not an option. At that point, for enterprises, they need to be able to have a way to go make sure that hallucinations don't exist, or if somebody has doubt and has an explainability around this. My position is here... This is my strong position. I'm curious what your point is. Knowledge graphs are the way to go do this, because knowledge graphs are basically the way how we are presenting the facts what we have. What are your thoughts about this?

Tony Seale: You know you're preaching to the choir there, and I'm obviously going to agree with you there. They're the perfect system for capturing facts, because they can take any data source that you've got, and they atomize the facts down to the lowest possible-

Juan Sequeda: Is there a... Again, honest, no BS here. Could we do this in a different way that it's not knowledge graphs? (inaudible) Snowflake data warehouse and GPT, combine that.

Tony Seale: Yeah, I think you could, it's just whether you should. And this is where the whole semantic layer comes in as well, because it's going to get harder and harder to understand what these black boxes are doing, what's going on inside you. And I think the only... And we kind of see that with what... So basically the way that the GPT model works is that, they've created those embedded vectors, they've trained it by just training it over the data, and then that was kind of unusable. It was so crazy that people couldn't really interact with it. The knowledge was embedded in there, they compressed the web, but then no one could get hold of it. So what they then did is use reinforcement learning with human people, and then that is what has allowed it to be actually useful. And I think for an enterprise, that's hopefully what the knowledge graph and the semantic layer is going to let you do. Because if you're looking at something really complicated, and it's got this high dimensionality structure, and it's all over the place in it, then for a human brain, what you need to be able to do is, you need to be able to look out to a higher, more abstract level. You need to be able to step back from the details and look at it at a conceptual level. And of course that's exactly what an ontology does. If you've got this semantic layer over the top of your data with the examples of the data, A, you're able to feed those in as training sets to be, oh yeah, here's another example of this thing, here's an example of this thing, but B, when the answers come back, it's aware of your conceptual framework. And I know, Juan, you already said to me, " When I go and sit with business people, I just say to them, take your business and tell me the words that you use to describe your business, and get them up on the whiteboard there." 
And if you had a semantic layer that is well thought out and is in business terms explaining what this business is doing, connected into the data, and then with a large language model kind of interacting through that, you're able to then do this thing where you can scan out from the perspective and go from the lower level kind of, okay, this is my answer to this thing, but how does that conceptually fit in with how my business operates? Well, I've been training on your ontology. So look, from this ontological perspective, this human perspective, this is what I'm talking about here.
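
Editor's sketch: one way to picture the semantic layer Tony describes, where low-level facts are tied to business concepts and handed to a language model as grounding. Everything here is hypothetical and for illustration only: the `ex:` terms, the account data, the tiny class hierarchy and the `grounded_prompt` helper are assumptions, not part of any system discussed in the episode.

```python
# Hypothetical facts as (subject, predicate, object) triples.
FACTS = [
    ("acct:123", "rdf:type", "ex:SavingsAccount"),
    ("acct:123", "ex:balance", "2500"),
    ("acct:123", "ex:heldBy", "cust:42"),
]

# Hypothetical ontology fragment: each class maps to its parent class.
ONTOLOGY = {
    "ex:SavingsAccount": "ex:Account",
    "ex:Account": "ex:FinancialProduct",
}


def facts_about(entity: str) -> list[tuple[str, str, str]]:
    """Return every triple whose subject is the given entity."""
    return [t for t in FACTS if t[0] == entity]


def concept_path(cls: str) -> list[str]:
    """Walk up the class hierarchy to the most abstract concept."""
    path = [cls]
    while path[-1] in ONTOLOGY:
        path.append(ONTOLOGY[path[-1]])
    return path


def grounded_prompt(entity: str, question: str) -> str:
    """Build a prompt that carries both the facts and their business concepts."""
    facts = facts_about(entity)
    types = [o for _, p, o in facts if p == "rdf:type"]
    fact_lines = [f"{s} {p} {o} ." for s, p, o in facts]
    concepts = [" -> ".join(concept_path(t)) for t in types]
    return (
        "Answer using only these facts:\n"
        + "\n".join(fact_lines)
        + "\nBusiness concepts: "
        + "; ".join(concepts)
        + "\nQuestion: "
        + question
    )


prompt = grounded_prompt("acct:123", "What is the balance?")
```

The point of the sketch is the shape of the idea: the answer can be traced back to explicit facts, and the class hierarchy lets a reader step from the detail up to the conceptual level.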

Juan Sequeda: Well, this has been a fascinating conversation, and we have so much notes here, so we're going to go quick lightning round, and then hit to our takeaway. So let's kick it off here really quick. So lightning round question on the fly, yes and no answers, and we'll keep going. So question number one, will GPT and LLMs drive significant adoption of schema.org and other knowledge graph approaches?

Tony Seale: Yes.

Juan Sequeda: You go, Tim.

Tim Gasper: All right. Second question. You mentioned exponential growth in the agricultural revolution, the industrial revolution, a counterpoint is to look at the recent big data craze or the recent deep learning craze, huge advancements, but then it plateaued for a bit, do you think we're going to hit a plateau with LLMs and GPT before it really takes off, or are we really headed for the moon here?

Tony Seale: I can't answer that with a yes or no, but I think that large language models are a progression of the deep learning thing. So they are deep learning. And it won't necessarily be large language models. I kind of got this hope that it'll be like we get this architectural blueprint with something more natively graph-based, but we're only at the beginning of the journey of what these things can do. The world has changed, but there's a lot more coming.

Tim Gasper: Okay.

Juan Sequeda: And final question here. Well, I got my own final question. So for companies and enterprises, will private LLMs be more popular than the public ones?

Tony Seale: Yeah, unknown. Really unknown. I think everyone's just got to start thinking about that boundary, which is another old thing from the semantic web. What do we mean by public and private, and what about individuals as well? Companies are part of it, but at the moment we just allow all of our data to be shared with everybody, and it's kind of being used, but is it being used in our interest? If you look at the stuff that Tim Berners-Lee's done with Solid, that and Inrupt. Then should we actually be having our own private data lockers? Should we be having our own where we give access to the information, it's collated and we hold that data there? Should we be training our own models that are working just for us, that are then off interacting with other ones, as opposed to basically just opening ourselves up completely with every aspect of our lives that then other people are learning and training their AI on? Everybody's got to think about this. The way I see it is, as a society, we're like an archer that is pulling back their bow, and they're about to shoot an arrow off. Once it's left the shaft, we have much less control over where it's going. So it's really worth sitting back and thinking about where we're going to aim this thing.

Tim Gasper: All right. An exciting and frightening metaphor there. All right. Very final lightning round question here. Do you think chat will continue to be the main interface for LLMs, or are you excited about a different interface?

Tony Seale: Yeah, I think probably chat. Chat's here to stay, but obviously it's going multimodal, so you're going to get image and sound will come out into it. And I hope in the corner of that that there's a place for graph as well so that we can switch into this more datary, conceptual, ontological view of it.

Juan Sequeda: Well, with that takeaway, Tony, we got so much notes. Tim, how well can we summarize? This is a fascinating conversation.

Tim Gasper: All right. I'm going to try to keep this snappy here. So many good takeaways today. So I think one of the biggest things that you mentioned was just how impactful LLMs have been here, and this exponential curve of innovation that's taking off, the agricultural revolution, the industrial revolution, maybe we might even say the internet revolution or something like that, and now this AI revolution that we have going on here, which is maybe just an extension of the internet revolution, an amplification of it, but it would be a huge shame if we lowered diversity through this entire movement. Especially thinking about these larger companies with this deep knowledge and expertise and capability in different areas, it would be a huge pity if we threw that all away or didn't leverage it properly simply because we put all this trust in, let's call it the big tech fame companies and things like that, and just that high level AI that they've been building and creating. It's the marriage of the knowledge and capabilities that the great diversity of the different organizations and groups and people across the world have with this great new AI superpower that unlocks, I think... And I think Tony, you kind of implied the real innovation here and the real opportunity here. And this is where knowledge graphs come in. If you can take the knowledge that you have, consolidate it, connect it, and have this interconnected semantically and conceptually meaningful dataset that you can run AI together with, it becomes not only more powerful for everyone, but it also taps into all that work and that knowledge that that organization has been doing. And you mentioned there will be pressure to put data out on the web as well. So this is going to kind of accelerate the interconnectivity and the accessibility of knowledge and data onto the web.
Things like schema.org are obviously great opportunities where you can really connect things to fact that's publicly accessible, and you mentioned that this is a pattern that we're going to see as we go forward here, the unstructured text grounded in truth and facts and those things working together. Juan, what about you? What are your big takeaways?

Juan Sequeda: All right. So we discussed a lot about the interfaces between knowledge graphs and large language models. We're kind of brainstorming here live. So your hypothesis is that GPT knows already so much about JSON-LD and schema.org because 40% of the web has that. So you can actually... And we've been testing this. You can ask natural language questions, and what companies are doing well and whatever context, and then it will respond, and you say respond giving me JSON-LD and using schema.org as the ontology, and you'll get that back. So you're actually extracting knowledge from GPT that it knows, and put it inside of a knowledge graph to be consumed. So that's one approach. It already knows how to write SPARQL and any other graph query language. Everybody's been testing these things with query languages. So that's another approach. Fine-tuning large language models with facts from the knowledge graph, turning triples into text. I think an approach to consider also is around prompt engineering, giving the facts through the prompt. But the important takeaway is, underneath all of this there's data. The algorithms and the whole user experience is fascinating of course, but the data underneath is what really, really matters. And your view here is that GPT has been crawling or just reading everything from the web. You might as well want to create your own internal web and let GPT be able to go train on that. Now if you use internally the same things that GPT has done externally, basically do JSON-LD, do schema.org, you combine that stuff, and the hypothesis is that it can be super, super powerful. Something to look into is the rise of embedding vector databases. Things like Pinecone and LangChain are becoming very popular, and now having all these embeddings and mapping graph embeddings to text embeddings is going to make it even super, super powerful.
But playing devil's advocate here and saying, we've talked about knowledge graphs and large language models, what if we don't use knowledge graphs? So I think a lot can be done without a knowledge graph, all the cool stuff, but where you really depend on truths, then it becomes much more critical. So for all the use cases in the enterprise world... In business, you need to have the facts, and the knowledge graphs is the perfect system for capturing these facts. And these AI models, they're just so black box oriented, so you need to be able to provide that abstracted conceptual model, which is basically the ontology, and that's what's going to provide a very trustworthy, dependable AI system. So I think the main takeaway here is that enterprises who want to go use large language models, GPT, if you're going to use this for seriousness in your enterprise, large language models must be tied with knowledge graphs. If you don't tie the knowledge graphs, then you're really just using it for fun, for play, for cool. But if you want to bring that in for something serious, knowledge graphs need to be part of your strategy. If they're not, you're just playing around and wasting people's time.
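
Editor's sketch: two of the interfaces Juan summarizes, taking a JSON-LD reply that uses schema.org terms and loading it into a graph as triples, and turning triples back into text for fine-tuning or prompting, might look roughly like this. The `reply` string is made-up illustrative data, not real model output, and `jsonld_to_triples` is a hypothetical helper that handles only a single flat node; it is not a full JSON-LD processor.

```python
import json

# Hypothetical model reply (made-up values), shaped as flat JSON-LD with schema.org terms.
reply = """{
  "@context": "https://schema.org",
  "@type": "Organization",
  "@id": "https://example.com/#acme",
  "name": "Acme Corp",
  "foundingDate": "1999"
}"""


def jsonld_to_triples(doc: dict) -> list[tuple[str, str, str]]:
    """Flatten one flat JSON-LD node into (subject, predicate, object) triples."""
    subject = doc["@id"]
    triples = [(subject, "rdf:type", "schema:" + doc["@type"])]
    for key, value in doc.items():
        if not key.startswith("@"):  # skip JSON-LD keywords such as @context
            triples.append((subject, "schema:" + key, str(value)))
    return triples


def triple_to_text(triple: tuple[str, str, str]) -> str:
    """Render a triple as a plain sentence, e.g. for a fine-tuning corpus or a prompt."""
    s, p, o = triple
    return f"{s} has {p.split(':', 1)[1]} {o}."


triples = jsonld_to_triples(json.loads(reply))
sentences = [triple_to_text(t) for t in triples]
```

For real data, a proper JSON-LD library would be needed to handle nested nodes, arrays and context expansion; the sketch only shows the direction of travel in each case.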

Tony Seale: Yeah, beautiful-

Juan Sequeda: Probably fun, but wasting your enterprise's time and money.

Tony Seale: Yeah. And actually I just had a conversation with someone from the business who pretty much said exactly what you're saying now, that tying these two things together actually looks like something we can concretely (inaudible).

Juan Sequeda: How did we do our takeaways? Anything we missed?

Tony Seale: Yeah, amazing. It's lovely to hear it, to summarize so well back there. Yeah.

Juan Sequeda: All right. So just to wrap up here very, very quickly, three questions. What's your advice, who should we invite next, and what resources do you follow?

Tony Seale: Okay. So my advice is to start looking in... If you haven't got a knowledge graph, then begin looking at your knowledge graph, initiating your knowledge graph project immediately, and do it in a decentralized way. So check out what the data catalog pattern is, because the data catalog pattern fits in really nicely with a knowledge graph. People think knowledge graph, oh, that's one big central database, that's the wrong direction if you're going down there. We're looking at a decentralized data mesh here with a catalog over the top of it, which is semantically organized. So start looking into that yesterday. Sorry, what was your next question?

Juan Sequeda: Who should we invite next?

Tony Seale: Oh. Well, if you could get Michael Bronstein or someone like that on there, then he would be able to dig much deeper into the kind of machine learning side of it, and he would be able to tell you a lot about the kind of geometric graph neural networks. But anybody in that space would be really interesting, I think, because that's then the evolution of it once you start getting into this. So there's been a division at the moment. Gartner even had knowledge graph on two parts on the hype curve. You had it cycled through on the data side, and then about two years later, it's now at the top on the AI side. That's the same thing, it's just a knowledge graph, but the two communities, they're not connected together. So if you guys could reach out to the other side of the fence and get people there in, that's where the future lies in bringing the AI with the data.

Juan Sequeda: Awesome. And then final, what resources do you follow, people, blogs, conferences, books, podcasts?

Tony Seale: Well, I guess a plug for the Knowledge Graph Conference, which is coming up soon. I'm going to be speaking there. I guess you guys will be there too. And LinkedIn is a fantastic source actually. That's probably my main source of information these days. They're obviously backed by a big knowledge graph.

Juan Sequeda: All right. Well, Tony, thank you so much for this fascinating conversation. It's very, very timely. And I think it's one where we're seeing so many people who are asking about large language models, and we kind of had a great final takeaway that, if you're in the enterprise looking about this stuff, you also need to look at knowledge graphs. So pay attention. And with that, cheers, Tony. Appreciate it. Have a good one.

Tony Seale: Cheers.

Tim Gasper: Cheers, Tony.

Tony Seale: Cheers. Thank you.

DESCRIPTION

Knowledge graphs and large language models. A match made in heaven. This pair will drive centralization and eventually work with AI to further consolidate and connect your data.

So why aren’t people prepared for it and what’s going to happen when AI steps in?

Join us on this week's episode of Catalog & Cocktails with hosts Tim, Juan, and special guest Tony Seale, Knowledge Graph Engineer.

Today's Host

Tim Gasper | VP of Product, data.world

Juan Sequeda | Principal Scientist & Head of AI Lab, data.world
