How do you solve a network outage?

MICHAEL BIRD
I want to pose a scenario for you… you’re doing something online – shopping perhaps – you get to the checkout, and the page doesn’t load. What do you do?

AUBREY LOVELL
Well, I think past the moment of frustration, I would probably reload the page.

MICHAEL BIRD
Yeah, I try to stay calm. I usually screenshot the page and then I reload it. And then if it doesn't work, I usually then go and check my Wi-Fi router to see if something's happened. I imagine most of us would assume there's just a small bug in the system somewhere. We try turning our phone or our computer on and off again, something like that.

AUBREY LOVELL
That's true. It's an age old fix.

MICHAEL BIRD
Oh it works! Do you know what, I worked in an IT department and I remember frustratingly often turning it off and on again to fix the problem and the most worrying thing is I never knew why.

AUBREY LOVELL
We’re really completely at the mercy of the people trying to fix that problem in the back end.

MICHAEL BIRD
Exactly and, in this episode, we’ll find out why.

I’m Michael Bird

AUBREY LOVELL
I'm Aubrey Lovell

MICHAEL BIRD
And welcome to Technology Now from HPE.

MICHAEL BIRD
As we hinted last week, Technology Now has changed ever so slightly. Think of it as 2.0.

AUBREY LOVELL
I’m excited!

MICHAEL BIRD
Now we’re still bringing you the latest from the world of Technology, but with a new feel and sound.

Right, that’s the admin over, let’s get onto this week’s topic...

And it’s all about connectivity and more specifically… networks.

They are so integrated into modern society that we barely notice them at all. As you know, they cover everything from the internet through to mobile phones, to private networks within companies, and these networks allow us to transmit data to friends, family, co-workers, pretty much anyone you want.

AUBREY LOVELL
There's so many things that can go super right and then there's so many things that can go super wrong.

MICHAEL BIRD
Yeah, and later on we'll be hearing from Sarah Tovar, a principal network engineer on the advanced customer engineering team. Honestly, fascinating conversation.

AUBREY LOVELL
But, before we get to Sarah, I want to take you back to the beginning of the first networks... It’s time for Technology Then.

AUBREY LOVELL
One of the most important networks was born in the sixties...

The US Department of Defence were focussed on one of the oldest concepts in the history of humanity, and it was driving innovation…

[pause]

War

MICHAEL BIRD
Well I guess it's the 60s, the Cold War?

AUBREY LOVELL
Correct.

And the question on everyone’s mind was: what’s more important… communication, or weapons?

MICHAEL BIRD
Communications are like vital, aren't they, during war?

AUBREY LOVELL
They absolutely are. Obviously, that's what the department chose, right, was to focus on the communication. And in a big move, they diverted $1 million from ballistic missile defense to an experimental program designed to link computers together. We might ask the question, why? Well, the networked computers would decentralize the system. If the Cold War were to heat up, no pun intended, there wouldn't be any single command center which could be attacked to take down all communications. So very smart.

MICHAEL BIRD
Like self-healing…

AUBREY LOVELL
Exactly. And this network was called the ARPANET. So I've actually found a picture, Michael. Well, I should say Producer Harry found this picture. We're looking at it right now. What do you kind of see in this picture of the original ARPANET network?

MICHAEL BIRD
Okay so it's quite a simple drawing, hand drawn, and it's of three sort of interconnected nodes, one node hanging off one of the connections.

AUBREY LOVELL
It definitely looks like something I could draw, which is not saying much because I am not an artist when it comes to drawing. But 20 years later in Switzerland, innovation is kind of now being driven by this, you know, holistic scientific collaboration. There's obviously an international institute that kind of runs experiments for particle physics, but there was one problem back in 1987. They needed lots of information to be shared to researchers all around the world. So what does a British scientist create?

A little old network called the World Wide Web. Obviously something that we all rely on in today's age. And get this, to prevent it from being accidentally switched off, the computer had a handwritten label in red ink saying, this machine is a server, do not power it down. And that was kind of the fundamental shift for networks, right? Like now we have networks everywhere. We need to keep them working and obviously major outages affect people around the globe and can cost companies billions of dollars. So it's pretty important to try and avoid them at all costs.

MICHAEL BIRD
It's like sort of the backbone of society. And I would say it's hidden, isn't it? We don't really think about our wider networks. So I think understanding why they fail is incredibly important. And to find out more, I spoke with Sarah Tovar. She's a principal network engineer on the advanced customer engineering team at HPE. And we started off by looking at why networks go down.

SARAH TOVAR
A lot of times it's because of human error. There are many components to networks and they're becoming more complex and human error really causes the most issues I would say in networks.

MICHAEL BIRD
Why is that? I mean, sorry, why have networks become more complex?

SARAH TOVAR
Well, if you think about a long time ago, it was just really the need for people to have a wire in a switch and you'd have a wire at your desk if you really needed connectivity. And most of your resources were local. You weren't accessing the cloud. You didn't have wireless. And applications were really simple.

MICHAEL BIRD
Is it common for the problem to be software or hardware related? I suppose software is changing something somewhere. Hardware might be sticking a screwdriver in something, so I suspect it might be software.

SARAH TOVAR
Yeah, it could really be either, but typically humans make configuration mistakes in networks and sometimes you can make a configuration mistake so bad it can actually impact and mess up the hardware and then you need to replace the hardware too. But the underlying driver on a piece of hardware is the software that really runs it. And humans code the software, so humans are prone to mistakes. Humans even design the hardware, so a human mistake in the hardware manufacturing process can also create issues. So we've got config, software, hardware, it all comes back to a human somewhere.

MICHAEL BIRD
Does a fault-based failure differ from a cyberattack bringing down a network?

SARAH TOVAR
Absolutely. So when it comes to a human-based or a fault-based error, for example, if it's a configuration mistake, if you can figure out what it is and fix that mistake, then you're off to the races. But when it comes to a cyber attack, a lot of times you'll see the attack, mitigate it, and then the entity performing the attack will change, and they'll do something different. And then you'll have to mitigate that, and it can become a never-ending cycle.

MICHAEL BIRD
But presumably, if you had a network outage or failure within your network, you won't necessarily know if it was a cyber attack or if it was a fault-based failure.

SARAH TOVAR
Yeah, it can be really difficult to determine if it was something that was an attack because attacks can actually be so severe that they can cause hardware errors. And I've actually experienced myself one time we had an attack that caused one of our devices to consume so much CPU and memory that a module failed in that device. So we thought the whole time we were just having an issue with our hardware, replaced the hardware, and then it started happening again and we're like, wait a minute. And at that point we had the necessary tools in place to determine that that's what it was, but it can happen.

MICHAEL BIRD
So Sarah, how do you diagnose the cause of an outage? Like where do you start?

SARAH TOVAR
Hopefully you have really good monitoring because that's your number one place to go to. For a cyber attack, there's a lot of tools that you can get that will help you determine if it is a cyber attack. So really you need to go to your monitoring software or whatever you've set up in place to monitor. That's your first place to go.

MICHAEL BIRD
We sort of touched on this, but is it easy to tell when a network fails because it's being attacked?

SARAH TOVAR
It depends. So if it's a very simple denial of service attack from a single server and you look at your firewall and you're like, hey, this one server is eating up this much bandwidth. That's easy, but that's like 0.001% of attacks these days. Now attacks are very complicated and sometimes they're very packet level. They're extremely difficult to determine. So you really need to plan for these attacks. You're going to have them. See how your application is expected to perform and expected to behave. And then when you see anomalies in that, that may be a good trigger for you to know that there may be an attack going on here.
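
[Editor's sketch] The two checks Sarah describes, spotting one source eating up the bandwidth and spotting a deviation from expected behaviour, could look roughly like the Python below. This is a minimal illustration only and not a tool mentioned in the episode. The function names, the 50% share threshold, the three-standard-deviation cut-off and the sample figures are all assumptions made for the example.

```python
from statistics import mean, stdev


def top_talker(bytes_by_source, share_threshold=0.5):
    """Return the source using more than share_threshold of all traffic, else None.

    bytes_by_source is a hypothetical mapping of source address -> bytes seen,
    e.g. something exported from firewall counters.
    """
    total = sum(bytes_by_source.values())
    if total == 0:
        return None
    source, used = max(bytes_by_source.items(), key=lambda item: item[1])
    return source if used / total > share_threshold else None


def is_anomalous(baseline, observed, z_threshold=3.0):
    """Return True if observed is more than z_threshold standard deviations
    away from the mean of the baseline samples (the expected behaviour)."""
    if len(baseline) < 2:
        return False  # not enough history to define "expected"
    mu = mean(baseline)
    sigma = stdev(baseline)
    if sigma == 0:
        return observed != mu
    return abs(observed - mu) / sigma > z_threshold


# Made-up requests-per-minute figures for an application on a normal day.
baseline_rpm = [100, 104, 98, 101, 97, 102, 99, 103]

print(is_anomalous(baseline_rpm, 105))  # within normal variation
print(is_anomalous(baseline_rpm, 250))  # far outside the baseline
print(top_talker({"10.0.0.5": 900, "10.0.0.6": 50, "10.0.0.7": 50}))
```

A simple threshold like this only catches the easy case. As Sarah notes, most attacks are packet level and much harder to spot, so a check of this kind is a trigger to investigate and not a diagnosis.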

MICHAEL BIRD
But one would imagine the first thing that you might think when you have some outage, some issue with your network, I wonder if most people jump to this is a cyber attack because that's probably the most sensible thing to think straight away.

SARAH TOVAR
A lot of times you do, and I mean, I've been so guilty many times of I just jump to what the answer is without actually really fully understanding what's going on. So you have to really get an idea of what the problem actually is, not what you think it is, and not what other people are telling you it is. And that could be hard.

MICHAEL BIRD
How long do you think it would take to diagnose the sort of root cause of an issue?

SARAH TOVAR
It really depends on the issue. If it's a very complicated issue and if it's a very advanced cyber attack, it can take a long time before you know. And during that time, the stress levels are high. If it's a critical severity one outage, everything's down and you have people yelling at you, you sometimes start panicking and then just trying to figure something out and just do something, maybe I'll fix this. Usually, that panic and just acting without really understanding the situation will result in a longer outage or even a worse outage.

MICHAEL BIRD
So I guess you've diagnosed the cause of the outage. How do you recover from that outage? Like what steps do you take?

SARAH TOVAR
So once you've diagnosed and you know what you need to do to recover, you don't want to just start going on the keyboard and just making a bunch of changes, and I've been there before, been on many outage calls where that has happened. And you have to get buy-in for what you're going to do. Even though you think you may know what it is, maybe you don't know everything. So you should tell everybody on the call, I think I see this issue, I want to fix this, let's go ahead and get everybody to buy in and understand and fix whatever that is. You don't just quietly fix it in the background and then be like, hey, I fixed it, because that can cause all sorts of issues as well.

MICHAEL BIRD
How do you make sure that the people who are fixing the issue can get on with fixing the issue, but also can communicate how they're fixing it and the time scales and all that sort of stuff?

SARAH TOVAR
So, ideally your organization has already set out a plan for who's going to own an incident. And usually that's an incident manager. And that person typically will make the judgment on the call if they need to have separate breakout calls for different parts of the incident. You should have one person that's responsible from each stream that would join the main call and give updates to the incident manager or via chat so that that person really knows everything that's going on.

Sometimes people think somebody else is going to do something and they just sit back and wait, and that's another reason that an incident manager really helps, because they're telling people and calling them out by name, you're going to do this action and then you're going to report back to me, because it makes people feel like they have a job and a role and they're not just sitting back and waiting.

MICHAEL BIRD
So network has gone down, you and your team have swooped in and have solved the problem, everything is back up and running, what next? You just sort of put it behind you and forget that it ever happened?

SARAH TOVAR
So you need to learn from that incident. You spend a lot of money, if you think about it in those terms, to give you the opportunity to better your application. And you need to take and figure out all the contributing factors that caused this issue and see what you can do to mitigate each one of those. But really do not waste that opportunity.

MICHAEL BIRD
I mean, is there an element of, okay, it might be human error, but actually there's a failure in a system or a process that we have in place.

SARAH TOVAR
Every human error has reasons behind it. And it's your job to figure out what those reasons were and to mitigate them. So a good example of that, someone may have made a configuration mistake on a network device. Why did they do that? Maybe the configuration was too complicated. Very rarely is somebody just going out there maliciously making configuration mistakes for a company, and if they were, there's still a reason behind them doing that and you need to figure that out and prevent them from doing that in the future.

MICHAEL BIRD
How do you then prevent the same issue happening again?

SARAH TOVAR
Sometimes the outage occurred because the network wasn't well documented. So in that circumstance you would want to add in documentation. And sometimes outages were delayed because you couldn't get a hold of the people that would fix a certain component or you didn't even know who they were. In those circumstances you need to plan for that, if you discover while learning from incident analysis that that was a contributing factor to either causing a longer outage or causing the outage to begin with.

MICHAEL BIRD
When we spoke before, Sarah, you had some strong feelings about the term root cause analysis, which I think is quite a popular term in our industry. But why exactly do you dislike this term so much?

SARAH TOVAR
So I actually took a class from the late great Dr. Richard Cook several years ago on learning from incidents versus root cause analysis. And it really changed my perspective completely. You're usually like, OK, the root cause is this. And I'm going to blame this person. And then we're going to move on because we know the blame. But very, very rarely is there a single root cause to an issue.

And very rarely is there going to be just one thing that you can do to prevent it from occurring again. It's multifaceted. And if you really think about it in terms of learning from that incident and then using those learnings to make your application better and more resilient to issues or prevent issues entirely if you can, that's really the goal. That should be what you want to do, not just point your finger at somebody in a root cause and say, hey, network team, you made a configuration error.

MICHAEL BIRD
Prevention is of course better than a cure. So what are the most important things companies can do to prevent a network outage occurring?

SARAH TOVAR
Take previous incidents, learn from them, and the biggest thing that you can do is to plan. Plan for what you would do if you would have an outage, because you're going to have one at some point. And that planning phase includes multiple steps. You need to plan for what you're going to do if you have an outage, all the people that need to be involved to fix it as fast as possible.

MICHAEL BIRD
Are there any good examples that you can give where a company has either resolved something exceptionally well or had a spectacular issue that could have been avoided?

SARAH TOVAR
There was a major incident and we didn't have an incident manager for this one, and it was actually a very major application issue, and I was looking at it from the network perspective and I was seeing some hardware module failures on our firewall. At the same time, when I was looking at the firewall and I was making the determination that I need to fail the firewall over to fix this problem, the application team was looking at their application. Now they didn't know that there was no problem with their application, that it was a firewall issue. So I just said on the call, hey, I'm gonna fail the firewall over, and the application team at the same time decided that they were going to restart their app.

Unfortunately, when they restarted the app, it did not restart cleanly. So then the problem is resolved by me failing the firewall over. But now we have an application issue because we didn't say, hey, I'm going to do this step. You need to not do anything while I'm doing this. If we would have had an incident manager on that call, the incident manager could have said, okay, Sarah, fail the firewall over. Application team, wait, don't do anything until that action is done. If that would have happened, then the application would have been fine because the issue was the firewall.

MICHAEL BIRD
So something as simple as just like getting on a phone or having an incident manager would have solved a load of problems.

SARAH TOVAR
Yeah, yeah and if you... are a junior network engineer, which I was at the time, sometimes you can be a little intimidated and a little afraid to just take charge. Now, I'm more seasoned, and I am not afraid to take charge. If there's nobody that's leading the call or an incident, I will stand up and I will just lead it because I know it needs to be done and tell people, hey, you're doing this, give me an update when you're done. You wait until this action is done and then it runs a lot smoother.

MICHAEL BIRD
So looking forward to the way that networks are becoming more and more complex, what is the best thing that people can do to try and prevent major incidents with their networks going down?

SARAH TOVAR
As we talked about before, you're going to have network outages. So what you need to do is you need to plan for it and make sure that one, your application is built resilient to any kind of network outage. So if different components fail, your application should be able to withstand those failures if it's a critical application. You need to document very well so anybody could pick up the phone and take care of an outage. You need to plan for the person that knows the application best to not be around. A junior engineer should be able to take care of that outage.

Documentation does suck. That's probably the least favorite part of everybody's job. Nobody likes to do it. But it can save you a lot of time and money. Every second that your application or your network is down, it's costing you money, and if you could do just an hour’s worth of documentation to prevent that, the cost savings are just immense. You need to do it.

Even if you don't want to, you have to. It's a necessary evil, so to say.

AUBREY LOVELL
Nowadays, everything seems so complicated. Even if we tried to make it simpler, there’s probably too much going on to actually make any large changes right?

MICHAEL BIRD
Yeah, and this is a known issue but I’ll hand it back to Sarah to explain how we’ve ended up in this position.

SARAH TOVAR
If you can, keep it simple, try to, but really applications these days are very complicated. It's not typically a single server and a single router that is going to be involved with your application.
I don't think we're ever going to get back to the point where your application is so simple that it's not going to have any kind of complexity. I wish we could, but we're past that.

AUBREY LOVELL
Okay that brings us to the end of Technology Now for this week.

Thank you to our guest, Sarah,

And of course, to our listeners.

Thank you so much for joining us.

If you’ve enjoyed this episode, please do let us know – rate and review us wherever you listen to episodes and if you want to get in contact with us, send us an email to technology now AT hpe.com.

MICHAEL BIRD
Technology Now is hosted by Aubrey Lovell and myself, Michael Bird.
This episode was produced by Harry Lampert and Izzie Clarke with production support from Alysha Kempson-Taylor, Beckie Bird, Alissa Mitry and Renee Edwards.

AUBREY LOVELL
Our social editorial team is Rebecca Wissinger, Judy-Anne Goldman and Jacqueline Green, and our social media designers are Alejandra Garcia and Ambar Maldonado.

MICHAEL BIRD
Technology Now is a Fresh Air Production for Hewlett Packard Enterprise.

And we’ll see you next week. Cheers!
