When I started in IT, almost 40 years ago, I always felt that I needed to learn a lot, quickly. I took university and industry courses. I read a lot. But I learned the most from whiteboard sessions with people who had real, hands-on experience and had learned in the ‘real world.’ I still learn a lot from listening to people who are at the top of their game.
So when I got a chance to interview Guillaume Neron, Security and Resiliency Practice Leader – Canada, at Kyndryl, I jumped at it. The result was a bit of an experiment – to see if I could get an interview-style piece that really turned into a ‘thought piece.’ In honour of one of my old mentors, I called this ‘chalk talk.’
It was a fascinating conversation and we’ve prepared this summary from the transcript. We’ve done a little editing to make a casual conversation more ‘readable’ but this is a very accurate rendition of our talk. I’ve also added another element: I’ve tried to share not just what I asked but, to some degree, what I was thinking and my response to what was said.
Also, not just to maintain the informal aspect of our conversation, but also because even for me, it’s jarring to see the word Love in bold throughout the piece, I’ve stuck to our first names.
– Jim Love, chief content officer, IT World Canada
There’s been a real shift in the conversation in cyber security. While the ideas of prevention, detection and remediation are still prominent, increasingly I’ve been hearing the word “resiliency.”
Resiliency envisions building an organization that is prepared. One that can effectively respond and recover when attacked. It’s a great concept. In fact, I was so fascinated that we dedicated our annual MapleSEC conference to the idea of preparation and response.
But is this idea of resiliency real? Or is it just another buzzword?
I sat down for a “virtual coffee” with Guillaume Neron and took the opportunity to pursue this line of thought.
Neron spoke about what it was like to be in a company that was both a big established enterprise and a startup – at the same time.
“When we became Kyndryl, there were practices that were transferred from IBM. The one exception was my practice. We had the opportunity in the case of the security practice to rebuild from scratch, since IBM kept its security practice.
On one side we have security; on the other is resiliency, which has actually been in existence for over 60 years. As an example, think about when IBM had mainframes, and put men on the moon; we obviously couldn’t be running mission-critical things like this without making them resilient.”
We moved on to the topic of resiliency. While I do believe in the concept, I think we should give any emerging or new IT term a healthy amount of skepticism. So I pushed the point with Guillaume. Is this a buzzword? Is there a real thing we can call “resiliency”?
Guillaume surprised me with an answer, although, having been in IT for forty years, it made eminent sense. Resiliency isn’t new, according to Neron. It’s a natural extension from the idea of “high availability.”
“Operational resiliency has evolved. It started by ensuring high availability, but eventually moved on to data, to sites, and so forth. It has continued to evolve over time. Today, looking at our practice, we have security and resiliency. We no longer run those two practices separately because we [see them as] a continuation of each other.”
But even if resiliency is an idea that has been around for many years, it is, for many, still a new concept. Guillaume told a story that made this point clearly:
“Yesterday I was at a conference in Ottawa for federal government workers. We had our VP of growth […] from the UK who was basically doing a keynote. In preparation for the keynote, the conference organizer [sent] some questions to the conference people. And the questions were like, ‘Do you have backups?’ and ‘Are you confident [in your backups]?’ A vast majority of the people answered yes.
And then I asked, ‘How many of you have actually tested your ability to recover from [adversity]?’ [Of the roughly] 150 people that responded, 121 [said they] never tested it.
I couldn’t believe it.
When we look at why resiliency is so important, [given the current reality of] cyberspace, it’s no longer a question of will you get breached, but when. There’s a quote in cybersecurity that says, ‘there are two types of companies: those who are breached and those who don’t know they’ve been breached.’”
So what is the definition of resiliency in the modern sense?
“Resiliency is about your ability to recover. What we’re seeing right now is that [many] organizations have not [fully adapted] to the reality of that merger of security and resiliency. If you look at NIST CSF [National Institute of Standards and Technology’s Cybersecurity Framework] as an example, the fifth pillar is recoverability.”
The next comment from Guillaume had a proverbial ‘light bulb’ going off in my head. One could question whether IT needed another C-level position; the creation of the role of Chief Information Security Officer reflected a strong focus on security. But did it equally emphasize the idea of resiliency?
“In many organizations, CISOs are not responsible for backups. Many CISOs are not responsible for [business continuity planning]. So, when a situation occurs, and a company’s servers get wiped by ransomware, [CISOs] will be responsible for incident response; but once the situation is under control, maybe the CEO asks, ‘What are you doing about getting the system back?’ The CISO then says ‘Backups are not my responsibility.’”
There are some fundamentals in the idea of resiliency that we all recognize. Personally, as a self-confessed ‘data geek,’ I found the next idea very consistent with what I think is core to effective cyber security. Cyber security isn’t something separate. It’s about protecting the business. To understand what it is that we are protecting, we have to understand our data.
“Resiliency is critically important as it is your lifeline to keeping your company going. It is — or should be — integrated with and into everything. Most companies we have talked to have some sort of understanding of what their crown jewels are. Unfortunately, many do not go to the full extent of what they need to do in order to ensure full recovery in the event of a breach.”
So how do we make this practical? There is a dilemma. Few, if any, companies could do what we did in the old days and ‘go manual.’ I questioned whether it was even possible for companies to operate manually. Coincidentally, Guillaume had been asked this very question by an audience member the day before.
“There was a question yesterday from [someone in] the audience as to whether we would be willing to go back to paper-level processing and preparation [in the case of total catastrophe]. I think that’s a very difficult question to answer. Companies plan to not get there. However, there are real-life examples of companies who [were forced to] do just that.
There was a shipping company with offices in the Ukraine that was affected by NotPetya — malware executed by the Russian government targeting Ukrainian companies. Within 25 minutes, 100 per cent of this company’s Windows systems were destroyed. Their phone and OT systems stopped working.
What eventually allowed them to get back up was not good planning but extremely good luck.
One of their offices in Africa, which apparently had poor connectivity, went offline just before the attack. Result: their Active Directory was [intact].”
So it might be possible for some companies to continue operations. Yet how many companies could do this? And those of us who quote Murphy’s Law know that there are more stories about bad luck than good luck.
Without the ability to ‘go manual,’ the math of ‘mean time to recovery’ on large systems that have been effectively attacked means that it could take days or even weeks to recover. How do you deal with the need to get the business running versus the realities of recovery? Guillaume had a great way of looking at this – the ‘minimum viable company.’ Another light bulb turned on.
“There’s a concept called minimal viable company (MVC) — these are the systems that a company needs — at minimum — to run their most critical business process. So let’s say you have SAP Enterprise Resource Planning (ERP). It’s your crown jewel, and you cannot run without the SAP ERP taking its authentication from Active Directory, so Active Directory is part of the MVC. It requires network and DNS, so the DNS server is part of the MVC. There’s a backend database — you need a database to run SAP etc. So you have all these systems that are important for your SAP system.
We’re also seeing a strong connection between business and technology. I gave the example of SAP, but SAP is probably not the best example. What’s really important is the business process that SAP is supporting. Do you need to be able to pay your suppliers within 24 hours? Yes? Well then, in order to do that you may need SAP. So, going higher in the stack and lower in the stack is also something absolutely required now by companies in order to have better resilience.”
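The reasoning in that answer, starting from a business process and following everything it depends on, can be sketched as a simple dependency walk. The short Python example below is only an illustration: the dependency map, the system names and the function `minimum_viable_company` are hypothetical, loosely based on the SAP example above, and are not something Guillaume or Kyndryl described.

```python
from collections import deque

# Hypothetical dependency map, loosely following the SAP example above.
# Each key is a business process or system; each value lists what it needs directly.
DEPENDENCIES = {
    "pay_suppliers": ["sap_erp"],
    "sap_erp": ["active_directory", "dns", "database", "network"],
    "active_directory": ["dns", "network"],
    "dns": ["network"],
    "database": ["network"],
    "network": [],
    "intranet_wiki": ["active_directory"],  # exists, but no critical process needs it
}


def minimum_viable_company(critical_processes, dependencies):
    """Return the critical processes plus every system they depend on,
    directly or indirectly (a breadth-first walk of the dependency map)."""
    required = set()
    queue = deque(critical_processes)
    while queue:
        item = queue.popleft()
        if item in required:
            continue  # already counted, so shared dependencies are visited once
        required.add(item)
        queue.extend(dependencies.get(item, []))
    return required


mvc = minimum_viable_company(["pay_suppliers"], DEPENDENCIES)
print(sorted(mvc))
```

In this made-up map, the walk pulls in SAP, Active Directory, DNS, the database and the network, but leaves out the intranet wiki because nothing critical depends on it. A real inventory would be far larger and would have to come from the business, not from a hard-coded dictionary.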
What I liked about Guillaume’s approach was that his idea of resiliency incorporated some of the fundamental principles that we’ve known for years. Security, like any good defence strategy, is about layers. We assume that any layer of a defence can be breached. So you need multiple layers. And if you can’t prevent an attack, you slow them down to give yourself time to respond. A resilient organization adopts that kind of realism.
“You have to think about how you will protect yourself. That goes through network isolation; that goes through hardening of systems; that goes through putting the right controls in the right place; that goes into planning and gamification of your recovery posture. Making sure you’re able to get back up, not only on paper, is one of the things we’ve started doing for some of our customers. [This is an] engagement where we take security and resilience and bring it to the same table.
So let’s say we send security testers to go and penetrate the Active Directory. Successful or not, we’ll then do a tabletop to see how a company handles incident response, and how it handles recovery initiation. Because Active Directory is a crown jewel, you can’t wait for the attack to have completely run its course; your company is no longer running, and you need to get back on your feet ASAP. So you move quickly to recovery, into actually recovering from backup. From there we’ll take the chaos up a level. And we go, ‘Okay, now we’re targeting your backup.’ We see if your backup servers can be compromised. Then we go into a tabletop saying, ‘You no longer have backups, you no longer have your Active Directory — what are you going to do?’ If the company has evolved, we try to recover from the vault. Then we increase chaos one more level. We target the vault. And the idea is [that] in doing that, we’ll often [find] that everything works on paper, but a company won’t know its breaking point. Identifying this breaking point, then, becomes key.
One of the things that get in the way is that it’s relatively easy to think in terms of building your defenses. People have that as their job, [but] my job is to run the firewall, to protect, to do backups. We don’t have anybody whose job it is to ensure the organization can recover, and therefore they have no resources. So it always gets kicked down the road. It’s a major thing.”
The resilient organization must have another type of realism. It accepts that no defence is perfect – hence the idea of layers. It also accepts that people are not perfect. We do not always instinctively make the right moves in a crisis. That’s why it’s also so important that we create rules to guide us through the uncertainty of an attack and, hopefully, a recovery.
“I think rules of engagement are critically important, and they need to cover what executives can and cannot ask of the technical resource because, inside an incident, whether it’s security-related or a natural disaster, people tend to panic; panic leads to improvisation, and improvisation leads to chaos. So it’s important to have clear rules of engagement as it relates to people — the people problem related to recovery — because no company staffs themselves to be ready to rebuild 30, 50, or 100 per cent of their ecosystem.”
It’s this realism and the ability to question how we will cope in a real disaster that is where the concept of resiliency really shows its worth. It’s not about asking, ‘what have we got covered’ and patting yourself on the back. It’s about asking, ‘what have we missed?’ I confess, as a company with small budgets, I’ve envied friends who kept a recovery company ‘on retainer.’
But if you look past my envy, there is another question. ‘What else do we need?’
“When we looked at the industry, we saw [cases where an] incident response retainer was signed with a company. This person is called in the case of an emergency. They’ll come in, detect the source of the compromise, contain it, and produce a nice report. Then they leave you, saying, ‘You’re good, everything’s fine, you no longer have a problem.’ But then you might turn around and your data centres will be ‘on fire’ because you no longer trust 50 per cent of your servers. You need to go back to backups. You need to ‘re-hydrate data.’ You need to validate that the […] data is itself not compromised. [You’re trying to restore] some application you set up once, and you no longer have the expertise for setup and implementation on your team because a third party did it. And then you need to rebuild everything.
[Unfortunately, most] companies do not add end-to-end recovery plans that cover rebuilding everything.”
After a full conversation without one sales or marketing message in it, I think Guillaume earned the right to talk about what his company does and brag a little. So here’s his message:
“We launched a new layer of retainer that specifically addresses recovery. We have a recovery retainer […], and the idea is we give you Service Level Agreements (SLAs) and Service Level Objectives (SLOs). You get access to the resources you need. I think we’re critically well placed for that as we have over 7,000 people in our resiliency practice, and a total of 90,000 people in all lines of technology, from mainframe to cloud to application data and AI to network.
We’re a managed service company that comes from the DNA of a strategic outsourcer. So we’re really good at technology. ‘People and process’ is where we shine. We’re not a Lego firm. That being said, we can run tabletops with customers to help them pinpoint their deficiencies. We run an exercise for customers we manage in order to ensure they have the ability to recover from an incident. We have a service around the automation of recovery so we can basically take the most critical workloads and work toward automating their recovery so as to reduce the time these companies will be down.”
That’s ‘Chalk Talk’ – my conversation with knowledgeable experts. We’d love to hear your comments. We’d welcome suggestions of topics and experts that we can interview. The rules are simple. This is not about getting across a sales or marketing message. It’s about sharing expertise with peers. If you are up for this, or know someone who is, contact me and we’ll arrange a recorded discussion. If it makes the grade, I’ll publish it.