Unstructured Data - turning data swamps into insight

Michael Bird (00:09):
Hello and welcome back to Technology Now, a weekly show from Hewlett Packard Enterprise, where we take what's happening in the world and explore how it's changing the way organizations are using technology. We're your hosts, Michael Bird...

Aubrey Lovell (00:23):
And Aubrey Lovell, and in this episode, we are looking at an area which impacts every business in the world, unstructured data. That is, how can we start to squeeze insight from the piles of text, audio, video, and every other type of data that doesn't fit into a neat table? We'll be exploring why this data is so hard to use, and we'll be asking what can be gained from analyzing it. Plus, we'll be asking if the effort is worth the insight.

Michael Bird (00:48):
Sounds great. So, if you are the kind of person who needs to know why what's going on in the world matters to your organization, this podcast is for you. And if you haven't yet, make sure you subscribe on your podcast app of choice so you don't miss out. Right, Aubrey, let's get into it.

Aubrey Lovell (01:04):
Let's do it.

Michael Bird (01:08):
Unstructured data is an increasing issue, or an emerging opportunity, depending on how you look at it. According to research from Statista, in 2024, we generated over 147 zettabytes, that's with a Z, of data globally. That's over 402 million terabytes every single day. My goodness, Aubrey, that's a lot of data.

Aubrey Lovell (01:09):
Wow.
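
As a quick sanity check on those figures, the conversion from zettabytes per year to terabytes per day can be sketched as below. This is only illustrative arithmetic: it assumes decimal (SI) units and a 365-day year, neither of which is stated in the episode.

```python
# Back-of-the-envelope check of the figures quoted in the episode.
# Assumptions (not from the episode): SI units and a 365-day year.
ZETTABYTE = 10**21  # bytes
TERABYTE = 10**12   # bytes

zettabytes_per_year = 147  # figure quoted in the episode
days_per_year = 365

terabytes_per_day = zettabytes_per_year * ZETTABYTE / days_per_year / TERABYTE
print(f"{terabytes_per_day / 1e6:.1f} million TB per day")  # prints "402.7 million TB per day"
```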

Michael Bird (01:34):
Now, increasingly, that information isn't just in nicely collated text form. It's speech from our call logs, it's video, it's information from over 200 billion IoT devices in the world, and we've cited that estimate in the show notes. It's voice requests we've made to our smart devices, of which there are over 8 million in circulation according to research from Global Web Index.

Aubrey Lovell (02:01):
Now, all of that disparate information is known as unstructured data. Left alone, it's a huge storage problem. But carefully analyzed, it can contain valuable insight to be compared against other traditional metrics such as sales figures, voting intent, or economic results. Sifting through it all though, that's an issue, and it's one organizations are only now able to tackle.

Michael Bird (02:24):
That's right, Aubrey. And a short while ago, I had the chance to talk about this with Gokul Sathiacama, VP of Data Storage for AI at Hewlett Packard Enterprise. He heads up HPE's work into unstructured data platforms.

Aubrey Lovell (02:38):
So, Gokul, welcome to the show.

Gokul Sathiacama (02:40):
Thank you.

Aubrey Lovell (02:40):
Thank you so much for joining us. So, first question, what exactly is unstructured data?

Gokul Sathiacama (02:46):
Well, if you look at many systems of 10, 15 years ago, you created a file system and then behind the file system, you had an object storage solution. And today, we are looking at data that is in natural form, whether it's images or videos or text, that needs to be stored in natural form, or converted into things like Parquet or JSON formats, and needs to be stored in a very rudimentary, file-level kind of way.

Michael Bird (03:19):
And so, how does that differ to structured data? So, structured data is, what, sat in a database?

Gokul Sathiacama (03:24):
Yes, it is like a database. It's in columns and rows, typical structured format where you store information so that you can easily query that information and get data out of it. In unstructured, there's no form.

Michael Bird (03:40):
So, it's photos, videos, blocks of text that you can't necessarily format in that sense?

Gokul Sathiacama (03:46):
Correct. And then, you need to know what the name of the file is in order to retrieve it.

Michael Bird (03:46):
Right, okay.

Gokul Sathiacama (03:51):
In a database or a structured format, you can actually go to the column, you can do a query on a particular column or row and get access to that data itself.

Michael Bird (04:01):
Okay. So, if I had a database full of weather data and I had a database full of photos of weather data, in the database full of weather data, I could say, "Now, I want to know what the temperature was like on the Thursday the 15th of August."

Gokul Sathiacama (04:13):
Yeah.

Michael Bird (04:14):
Or maybe I want to know what days it was sunny. Whereas if I've got photos of weather data, I can't necessarily do that because I just have the file names.

Gokul Sathiacama (04:22):
Correct. But you can create something like a semi-structured environment, right? So, you have some context of the data, but it's stored in a natural form.
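
The weather example from this exchange can be sketched in a few lines. This is a minimal illustration, not anything described by the speakers: the table name, column names, readings, and file names are all invented, and Python's built-in `sqlite3` module stands in for "a database".

```python
import sqlite3

# Structured data: rows and columns that can be queried directly.
# All table/column names and readings here are made up for illustration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE weather (day TEXT, temperature_c REAL, conditions TEXT)")
conn.executemany(
    "INSERT INTO weather VALUES (?, ?, ?)",
    [
        ("2024-08-14", 19.5, "cloudy"),
        ("2024-08-15", 24.0, "sunny"),
        ("2024-08-16", 22.5, "sunny"),
    ],
)

# "What was the temperature on Thursday the 15th of August?"
temperature = conn.execute(
    "SELECT temperature_c FROM weather WHERE day = ?", ("2024-08-15",)
).fetchone()[0]

# "Which days were sunny?"
sunny_days = [
    row[0]
    for row in conn.execute(
        "SELECT day FROM weather WHERE conditions = 'sunny' ORDER BY day"
    )
]

# Unstructured data: a folder of photos offers only file names to search on.
photo_files = ["IMG_0001.jpg", "IMG_0002.jpg", "IMG_0003.jpg"]
sunny_photos = [name for name in photo_files if "sunny" in name]

print(temperature)   # 24.0
print(sunny_days)    # ['2024-08-15', '2024-08-16']
print(sunny_photos)  # [] because the file names carry no context
```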

Michael Bird (04:32):
Right, okay. So, that is, going back to my example. So, you're essentially saying you're creating some additional metadata about those images.

Gokul Sathiacama (04:32):
Correct.

Michael Bird (04:40):
So you're saying, okay, this is an image, but maybe I'll add some notes to say this is an image of a sunny day.

Gokul Sathiacama (04:46):
And a dog sitting on a rock or on a stand-up paddle board, whatever it is.
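
The idea of adding notes to an image can be sketched as a small metadata catalog kept alongside the files. This is a hypothetical illustration: `ImageRecord`, `find_by_tags`, the tags, and the file names are invented, and real object stores attach metadata in their own ways.

```python
from dataclasses import dataclass, field


@dataclass
class ImageRecord:
    """A file kept in its natural form, plus descriptive metadata about it."""
    path: str
    tags: set = field(default_factory=set)
    caption: str = ""


# Hypothetical catalog: the images are untouched, the context lives beside them.
catalog = [
    ImageRecord("IMG_0001.jpg", {"sunny", "dog", "paddle board"},
                "A dog sitting on a stand-up paddle board"),
    ImageRecord("IMG_0002.jpg", {"cloudy", "dog", "rock"},
                "A dog sitting on a rock"),
    ImageRecord("IMG_0003.jpg", {"sunny", "beach"},
                "An empty beach"),
]


def find_by_tags(records, *wanted):
    """Return the paths of records that carry every requested tag."""
    wanted_set = set(wanted)
    return [record.path for record in records if wanted_set <= record.tags]


print(find_by_tags(catalog, "sunny"))         # ['IMG_0001.jpg', 'IMG_0003.jpg']
print(find_by_tags(catalog, "sunny", "dog"))  # ['IMG_0001.jpg']
```

With the tags in place, the photos can be found by what they show rather than by file name alone.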

Michael Bird (04:53):
Right, right, right. Now, so doing that at scale must be pretty tricky because a human going through every single photo and saying, "Oh yeah, that's a dog," or that's-

Gokul Sathiacama (04:53):
Yes.

Michael Bird (05:02):
... That's quite challenging. So, is this sort of the crux of the issue? This is where managing unstructured data becomes the challenge?

Gokul Sathiacama (05:09):
Exactly.

Michael Bird (05:09):
Right.

Gokul Sathiacama (05:09):
I mean, think about this. You have a camera and you're taking videos and photographs, and in your mind you have to create a catalog of when that image or video was taken and who was in front of the camera and what context that video or image was taken.

Michael Bird (05:28):
Yeah, yeah.

Gokul Sathiacama (05:29):
And that's what that semi-structured environment is doing. Your brain is the semi-structured format, and your camera is actually where the data is stored.

Michael Bird (05:39):
Right, okay.

Gokul Sathiacama (05:40):
And so, what we are trying to do at HPE is create context for the data so when AI applications want to train from the data that they've collected or want to get inference or compare them with other data in a RAG format, that's when you need to provide that context.

Michael Bird (06:02):
So, how do we provide that context?

Gokul Sathiacama (06:04):
Very good question. So typically, you put it in something like a vector database, and that vector database enables you to create vector embeddings that create that context for the data.

Michael Bird (06:17):
Right. So, vector database, can you just define what a vector database is, how it differs from a standard database?

Gokul Sathiacama (06:25):
So-

Michael Bird (06:25):
Sequential database, I guess?

Gokul Sathiacama (06:27):
... Yeah. So, if you look at a structured database, you have a table with columns and rows. That's how you create the context. But if you want to do analysis of that data, you put it into a data warehousing solution. You extract, transform, load to a data warehousing solution before you can do business intelligence.

Michael Bird (06:47):
Got it, got it.

Gokul Sathiacama (06:47):
So, what a vector database allows you to do is move unstructured data into a format that can be easily queried against.

Michael Bird (06:58):
But how does it do that? Because the data isn't magic. You still have these pieces of unstructured data that you don't necessarily have any context around. So, what exactly is it doing?

Gokul Sathiacama (07:09):
So, there are programs that are being developed, for instance, for images, that create that context for the data. So, when you get an image, you can run it through this program and then it can say, oh, it's an image of a dog sitting on a stand-up paddle board, and then store that information in a vector database, then start querying it from there.
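
The describe-then-query flow outlined here can be sketched with a toy in-memory index. Everything in it is an assumption for illustration: the captions are typed in by hand rather than produced by an image model, `embed` uses simple word counts where a real system would use learned embeddings, and a plain dictionary stands in for a vector database.

```python
import math
from collections import Counter


def embed(text):
    """Toy embedding: a word-count vector. Real systems use learned models."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[word] * b[word] for word in a if word in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


# Hypothetical captions, standing in for the output of an image-description program.
captions = {
    "IMG_0001.jpg": "a dog sitting on a stand-up paddle board",
    "IMG_0002.jpg": "a production line in a factory",
    "IMG_0003.jpg": "a sunny day at the beach",
}

# The "vector database": each file path mapped to the vector of its caption.
index = {path: embed(caption) for path, caption in captions.items()}


def search(query, top_k=1):
    """Return the top_k file paths whose caption vectors are closest to the query."""
    query_vector = embed(query)
    ranked = sorted(index, key=lambda path: cosine(query_vector, index[path]), reverse=True)
    return ranked[:top_k]


print(search("dog on a paddle board"))  # ['IMG_0001.jpg']
```

The point of the sketch is the shape of the workflow: the original files stay as they are, and the query runs against vectors derived from their descriptions.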

Michael Bird (07:32):
So, let's take this into a real world scenario. What sorts of organizations are generating a lot of unstructured data? Because I mean, one would think that most organizations, I don't know, let's think retail, they've got products on the shelves, each product has a barcode, a serial number, maybe a supplier name. But I can't think of many off the top of my head, many examples of organizations that would be creating unstructured data. I'm sure I'm wrong.

Gokul Sathiacama (07:57):
Everybody's creating unstructured data.

Michael Bird (07:59):
Okay, okay.

Gokul Sathiacama (08:00):
For instance, financial institutions.

Michael Bird (08:02):
Oh, really?

Gokul Sathiacama (08:03):
Yeah. Customer support, right? So, whether it's check images or even conversations with customer support, everything is being tracked, everything is being recorded, and that needs to be stored somehow. And how would you normally store it? If it's an audio file, you would store it in an audio format.

Michael Bird (08:22):
Yeah. So, what other industries are creating lots of unstructured data, or is what you're saying, every industry is now creating lots of unstructured data?

Gokul Sathiacama (08:31):
Every single industry is creating. I mean, you can go to manufacturing, right? So, when things are going down the production line, they're capturing images, they're capturing data in terms of who is handling it. So, they can go back and trace certain things or where things are falling through the cracks in a manufacturing line.

Michael Bird (08:48):
I'm guessing, I don't know, 10, 15, 20 years ago, we didn't have the capabilities to be able to do that. The storage, I mean, 15 years ago you wouldn't have had the space to store videos or images at any scale.

Gokul Sathiacama (08:58):
Yeah. So, last year alone, 3.5 exabytes of data was stored per day.

Michael Bird (09:06):
Wow.

Gokul Sathiacama (09:07):
Per day. That's a lot of capacity, and most of it is stored in its natural form, and that's unstructured data.

Michael Bird (09:17):
Okay. So, how should organizations be storing this data? What's the solution?

Gokul Sathiacama (09:22):
So, you need performance, right? You're storing all this information or all this data, and in order to do anything with the data, you need information about the data.

Michael Bird (09:32):
Yeah.

Gokul Sathiacama (09:33):
So, being able to create context or intelligence about the data is going to be critical. And that means performance of a storage solution is going to be very, very important. And as I said, the amount of storage or capacity that you need is important as well. So, you need capacity scale, performance scale, and then as the data grows, you want simplicity of operations.

Michael Bird (09:58):
Yeah.

Gokul Sathiacama (09:59):
Because there aren't enough people to manage these systems.

Michael Bird (10:02):
Yeah, okay.

Gokul Sathiacama (10:03):
So, we are trying to simplify that by providing easy to use workflows to minimize the infrastructure management.

Michael Bird (10:12):
Yeah, and that workflow, I guess, is the critical thing because I think possibly what many organizations are doing is that they're just dumping this data somewhere-

Gokul Sathiacama (10:20):
Yeah.

Michael Bird (10:21):
... without making sense of it. But what you're saying is actually there's a workflow where you say, yes, we will store the data, but once we store the data, we want to actually then process that information and make sense of it, and then store that processed information somewhere else.

Gokul Sathiacama (10:32):
Somewhere else.

Michael Bird (10:33):
Then when you want to reference that data-

Gokul Sathiacama (10:33):
Exactly.

Michael Bird (10:35):
... you can easily pull it up.

Gokul Sathiacama (10:35):
Correct.

Michael Bird (10:37):
We generate lots of audio files, and if we just had a folder full of all the audio files, you'd probably get some information about when you recorded it and how long it was for, but you wouldn't know who you were interviewing necessarily or what the content of that was.

Gokul Sathiacama (10:50):
Yeah, I mean, if you look at how you're going to save this audio file, you won't say 1, 2, 3, 4, 5, 6, 7, 8, 9, right?

Michael Bird (10:50):
Yeah, yeah, yeah.

Gokul Sathiacama (10:57):
You would say, "[inaudible 00:10:59], add, discover, and the date and time," so that you provide the context as a file name. So, instead of doing that, how do you create context without creating an entire long file name? And that is where metadata comes in, and that's why object storage is an important element. But in many cases, customers are hoarding data as opposed to advancing data or activating their data for intelligent applications or AI or analytics.

Michael Bird (11:31):
Yeah, that is just sitting there doing nothing.

Gokul Sathiacama (11:32):
Doing nothing, right? They're just hoarding data, and a data lake becomes a data swamp at that point. It's a bog, it's mired. There's no context. And in order to activate, in order to take advantage of that data, now either you've got to move it to a different solution or ignore the data you have, which means you're just not doing anything with it.

Michael Bird (11:56):
Organizations have lots of that structured data, don't they? But what you are saying is that if we can make sense of the unstructured data, actually we can augment that into making better business decisions.

Gokul Sathiacama (12:08):
Absolutely, right? If you want to store it as natural form, unstructured data storage is the right way to go. If you look at what we've done in the past over the last 10, 15 years, it was all about big data.

Michael Bird (12:08):
Yeah.

Gokul Sathiacama (12:21):
And we had to do this extract, transform, load from natural form to a database form to create context. But now, we are moving from batch processing to real-time processing, and that's the main difference between analytics and AI.

Michael Bird (12:36):
Yeah, okay. AI can do it real-time.

Gokul Sathiacama (12:37):
Exactly.

Aubrey Lovell (12:41):
All right, thanks Michael. What an awesome interview, and we'll be back with more in a moment, so don't go anywhere.

(12:48):
Okay. Now it's time for Today, I Learned, the part of the show where we take a look at something happening in the world we think you should know about.

Michael Bird (12:55):
Yep, and it's one for me this week, Aubrey, and potentially a bit of a celebration. Norway is on track to become the first nation to eliminate the sale of internal combustion-powered private vehicles this year. So, back in 2017, the country set a goal of eliminating gasoline and diesel vehicle sales by 2025. As of last year, 89% of cars sold in the country were electric, making Norway pretty well on track to achieve the goal this year, or in 2026 at a push.

(13:26):
To give you some comparison, 20% of vehicles sold in the UK and just 8% in the US were EVs. Norway is crediting the milestone to both long-term political planning with buy-in across parties, and to supplying an advanced charging network. Norway has five times more chargers than the UK per 100,000 people. That means that despite its often chilly climate, which makes EVs less efficient, there are plenty of places to recharge. So, less a breakthrough and more the end of a long-term policy, but pretty awesome nonetheless.

Aubrey Lovell (13:59):
Pretty awesome indeed. Thanks for that, Michael.

Michael Bird (14:04):
Right. Now, it's time to return to my interview with Gokul Sathiacama, VP of Data Storage for AI at Hewlett Packard Enterprise to talk about unstructured data.

(14:13):
So, have you seen any examples where customers have used a large language model or a similar model with a data set, with unstructured data and they've been able to make use of that in a good way?

Gokul Sathiacama (14:25):
In natural form, text form, a lot of companies are doing that, right, to take audio, put it into a text form so that they can take that and then interpret what the dialogue was between two people.

Michael Bird (14:39):
Yeah.

Gokul Sathiacama (14:40):
Whether it's a customer support environment in banking or retail, or whether it's a complaint environment in retail. So, there are so many different applications that you have for generative AI, if you will, to figure out what the conversation was or what the transaction log was, and figure out how to optimize or automate some of these processes that customers do repeatedly.

Michael Bird (15:07):
Presumably there's an additional privacy thing here with unstructured data because, yeah, some unstructured data can be quite personal. It's one thing having a photo or an audio file or a video, but actually once you've added context, that file takes on a life of its own to some extent.

Gokul Sathiacama (15:24):
Correct.

Michael Bird (15:25):
You're extracting more information out of that than the original image. So, I guess from that angle, there are additional safeguards you'd probably have to put around that data.

Gokul Sathiacama (15:33):
Correct. And it depends on what data you're using, who is going to use that data, and for what purpose.

Michael Bird (15:40):
Yeah.

Gokul Sathiacama (15:41):
What's the purpose of that data?

Michael Bird (15:43):
Yeah.

Gokul Sathiacama (15:43):
Privacy is an interesting question because it's not only about safeguarding information that you store, but think about it as privacy within a country or within a region, as opposed to going global. In some cases, you want to extract that information so you can do comparisons between different data sets in the world, and customers use masking technology on the data. So, not all the information is given to a broader set of customers that may want to augment their analysis. So, that's how they're taking care of that privacy, by masking most of the personal information.
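
Masking of the kind mentioned here can be sketched with simple pattern matching. This is a deliberately rough illustration and not a description of any particular product: the two regular expressions, the placeholder labels, and the sample sentence are all invented, and production masking tools handle far more cases.

```python
import re

# Rough, illustrative patterns only; real masking covers many more formats.
EMAIL = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")
PHONE = re.compile(r"\+?\d[\d\s-]{7,}\d")


def mask_personal_info(text):
    """Replace email addresses and phone-like numbers with placeholders."""
    text = EMAIL.sub("[EMAIL]", text)
    return PHONE.sub("[PHONE]", text)


# Hypothetical line from a customer support transcript.
transcript = "Caller jane.doe@example.com asked us to ring back on +44 20 7946 0958."
masked = mask_personal_info(transcript)
print(masked)  # Caller [EMAIL] asked us to ring back on [PHONE].
```

The masked text can then be shared for analysis while the original stays with the organization that collected it.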

Michael Bird (16:29):
Yeah. All right. So, Gokul, final question. We ask all of our guests, why should organizations care about the possibilities of unstructured data?

Gokul Sathiacama (16:39):
You're using data, you're storing data for something that may not materialize, but if you really want to take advantage of the data you're storing and getting that context to do analysis, to improve your organization, to add value and intelligence to the processes, the business practices you do, this is where unstructured data can help you.

Michael Bird (17:05):
Amazing.

Gokul Sathiacama (17:06):
And I tell all my customers, do not store data just for the sake of storing data, thinking that at some point you may use it.

Michael Bird (17:15):
Yeah.

Gokul Sathiacama (17:16):
Only store the data you need, accelerate the data's time to value, because time to intelligence is important, and make the right choices.

Michael Bird (17:24):
Yeah, okay. Gokul, thank you so much.

Gokul Sathiacama (17:26):
Thank you for having me.

Aubrey Lovell (17:28):
Thanks so much, Michael, for bringing us that. It's been fascinating, and you can find more on the topics discussed in today's episode in the show notes.

Michael Bird (17:37):
Right. Well, we are getting towards the end of the show, which means it is time for This Week in History, a look at monumental events in the world of business and technology which have changed our lives. Aubrey, what was last week's clue?

Aubrey Lovell (17:50):
Well, the clue last week was it's 1941, and this drug was a real cleanser. And I actually think you got this right, Michael. You guessed it last time.

Michael Bird (17:59):
Well, my guess was something to do with penicillin, but what was the answer?

Aubrey Lovell (18:04):
You're right. It was the first human trial of penicillin. British scientist Alexander Fleming is widely credited with discovering penicillin in 1928, but he never experimented with purifying or testing it as a drug in animals or humans. And that role was taken on by a team led by Howard Florey, who began by experimenting in mice in May of 1940. By February 1941, the drug was ready for human trials. Now, the first subject was Albert Alexander, a 43-year-old policeman, and it's widely believed that he scratched his face in a rosebush, which then became horribly infected, though it's now claimed he may have been injured in an air raid. Either way, he had just days to live, and Florey's team gave him a dose of penicillin every four days. He made a miraculous recovery. Unfortunately though, the drug ran out and the infection returned, but it was a massive step in antibiotic research and spurred the mass production of the drug.

Michael Bird (19:00):
That's incredible. Now, the clue for next week is, it's 1977 and this flight boldly went, but it was not alone. Oh, cryptic. Any ideas?

Aubrey Lovell (19:14):
I don't know. This one could be tricky.

Michael Bird (19:16):
Okay. Well, that brings us to the end of Technology Now for this week. And a huge thank you to our guest, Gokul Sathiacama, VP of Data Storage for AI at Hewlett Packard Enterprise. And of course to you, thank you so much for joining us.

Aubrey Lovell (19:28):
Technology Now is hosted by Michael Bird and myself, Aubrey Lovell. This episode was produced by Sam Datta-Pollen, with production support from Harry Morton, Zoe Anderson, Lincoln Vander Westhoizen, Alison Paisley, and Elissa Mitri. Our social editorial team is Rebecca Wissinger, Judy Ann Goldman, Katie Guarino, and our social media designers are Alejandra Garcia and Ambar Maldonado.

Michael Bird (19:51):
Technology Now is a Lower Street Production for Hewlett Packard Enterprise. And we'll see you the same time, the same place next week. Cheers.

Hewlett Packard Enterprise