General thoughts

Put the Security in SRE: Proactive Defense

Dec 15, 2021

Paul-Arthur Jonville

Some would argue that defense must be proactive to face the ever-increasing and unpredictable security environment. To that end, they put security in SRE (Site Reliability Engineering) and rely on its core principles.

The modern software industry is increasingly distributed, rapidly iterative, and predominantly stateless. Development and operations teams have to deal with these conditions at an ever-faster pace. To do so, they devised ways to increase software's reliability and delivery speed: SRE on the operations side and DevOps on the development side.

We can also assess reliability in security, a field that remains predominantly preventative and tied to the system's state at a given moment.

We're focusing on SRE and what motivated its creation at Google in 2003, as well as whether or not this applies to cybersecurity. The goal is to have a different perspective on how we conceive security.

Here's what we'll cover:

  • Can you tell me what SRE is?

  • Can I have some security in this?

  • A new approach to cybersecurity.


Can you tell me what SRE is?

Traditionally, development and operations are two distinct fields. But splitting them into two fields usually fragments tools, metrics, and goals, which leads to a lack of communication between the teams.

The idea behind SRE was to merge development and operations around the same goal: reliability.

The term reliability has to be understood from the customer's perspective. What makes software reliable to them? To SRE's founder, Benjamin Treynor Sloss, reliability encompasses multiple variables: availability, latency, performance, efficiency, change management, monitoring, emergency response, and capacity planning.

It's easy to view these variables from your customers' perspective. When someone wants to watch a video on their favorite streaming platform, they care about being able to watch it whenever they want: availability. They also want a smooth experience, with no endless buffering or error codes: latency and performance. And in case of failure, they want a fast response from the provider and a quick return to normal: emergency response.

We could go on about what customers expect when using software or a service, but you get the point. In their eyes, all of these characteristics make it reliable.

As we said, this was typically handled by operations teams, who relied on manual labor to resolve issues.

However, SRE takes a different approach to your software's reliability. Whereas operations traditionally rely on manual tasks to overcome problems arising in production, SRE brings automation to handle them at scale and deliver more reliable software.

In SRE, engineers take a software approach to handling these issues. Through code, they can address problems in a much more scalable and sustainable way than traditional processes.

The SRE's mantra is: "Do it once manually, automate it the second time."

Some would say that SRE was DevOps before the term was coined. That's partly true, since both bring one team's mindset and tools to another. Still, SRE is more about reducing failure rates as the software evolves than about streamlining change in the development pipeline. To distinguish the two, let's say that:

  • DevOps automates development speed;

  • SRE automates reliability at the production scale.

To improve reliability, one of the core principles of SRE, besides automation, is chaos engineering, where failures aren't feared but embraced as valuable lessons.

SRE and chaos engineering: continuous experimentation

Zero risk doesn't exist in SRE. This initial postulate changes the way SRE approaches reliability. Instead of hoping systems will run smoothly forever, it expects them to break at some point.

As you inferred, chaos engineering derives from chaos theory, a cross-disciplinary scientific field that studies how dynamical systems evolve from, and are sensitive to, their initial conditions.

Do you picture Ian Malcolm doing his thing? That's right. He nailed the chaos theory! Now let's transpose this in the software field:

Software keeps evolving, even through minimal changes from its initial condition. These updates and changes pile up on the infrastructure. Even if each piece of software is deterministic, behaving exactly as its creators intended, every update lands in a complex, distributed computing system linked by networking and resource sharing, potentially impacting everything beneath it. The wider this distributed system, the more unexpectedly it can behave when changes are introduced.

This can yield widely diverging outcomes in a dynamic system and make long-term prediction impossible; the system ends up looking randomized, or chaotic.

Now, Chaos engineering focuses on this random and unpredictable behavior to identify weaknesses through experiments done in a controlled environment. The goal is to stay ahead of the unexpected or even an attack. Break your system on purpose instead of waiting for someone or something to do it when you're asleep.

To that end, you're going to conduct experiments and tests. Why both? When you start an experiment, you don't know what will happen. In a test, you do: it's run to confirm a known outcome, not to discover something new. In both cases, you compare the results against your initial hypothesis and the theoretical steady state.
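As a minimal sketch of this pattern, consider the following Python snippet. Everything here is assumed for illustration: `probe_service` stands in for probing a real endpoint, and the failure rates are made up. The point is the shape of a chaos experiment: state a steady-state hypothesis, inject a fault, observe, and compare.

```python
import random

def probe_service(inject_fault=False):
    """Simulated request: returns True on success.
    A hypothetical stand-in for hitting a real endpoint; the
    failure rates below are invented for the sketch."""
    failure_rate = 0.5 if inject_fault else 0.01
    return random.random() > failure_rate

def run_experiment(n_requests=1000, hypothesis_min_success=0.95):
    """Chaos experiment: state a steady-state hypothesis (success
    rate >= 95%), inject a fault, and compare observation to hypothesis."""
    successes = sum(probe_service(inject_fault=True) for _ in range(n_requests))
    observed = successes / n_requests
    return {
        "observed_success_rate": observed,
        "hypothesis_held": observed >= hypothesis_min_success,
    }
```

A falsified hypothesis is the valuable outcome here: it tells you, in a controlled setting, how the system actually degrades under that fault.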

Okay, that's cool, but can I have some security in SRE?

Unreliability can stem from anything in your company, from hardware malfunctions to bugs embedded in your code. But reliability also has a security dimension.

A security breach often implies snowballing consequences for end users. Ransomware, for instance, can force a company to shut down its delivery system, dramatically impacting its customers and the company's reliability.

Roughly put, security can be considered part of reliability. The SRE and security teams share the goal of having the most resilient system to reduce incidents as much as possible.

The traditional security approach would be acceptable if one could capture all the risks in an audit. Experience proved us wrong: relying on that approach dooms you to face unexpected events barely prepared, since you're focused only on the risks listed in your audits. In other words, when the tide goes out, you realize you've been swimming naked.

You need to prepare yourself for the unexpected incidents arising from this chaos. The solution is to apply SRE and chaos engineering to security.

Security and chaos engineering = security experimentation

Above, we said that without an SRE approach, we often discover failures only after an incident. The same goes for security: most failures come to light only after a breach. Learning about a failure once it has materialized unexpectedly is too late; the damage is done.

This is why some engineers wanted to take a more proactive approach to security. To them, you have to embrace failure and anticipate it. Yes, security failures are going to happen at some point. Yes, you better be prepared for them.

To ensure more robust security, you must test and experiment for known and unknown vulnerabilities to understand what happened and stay at the edge of security as it evolves and changes. Thus, you create a feedback loop around your controlled experimentations. This combination of chaos engineering and security gives us security experimentation.
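To make the feedback loop concrete, here is a minimal sketch of a security experiment in Python. The detection rule, the IP allowlist, and the event shape are all assumptions invented for the example; in practice, the injected event would target staging infrastructure and the control would be a real detection pipeline.

```python
def detection_control(event):
    """Hypothetical detection rule: flag logins from unknown source IPs."""
    allowed_ips = {"10.0.0.1", "10.0.0.2"}  # invented allowlist
    return event["src_ip"] not in allowed_ips

def security_experiment():
    """Inject a simulated malicious event alongside a benign one and
    verify the control fires on the right one. The loop: inject,
    observe, feed the result back into the control's design."""
    benign = {"src_ip": "10.0.0.1"}
    malicious = {"src_ip": "203.0.113.7"}  # TEST-NET address: simulated attacker
    return {
        "false_positive": detection_control(benign),
        "detected": detection_control(malicious),
    }
```

If `detected` ever comes back false, the experiment has surfaced a blind spot before an attacker did, which is exactly the point of security experimentation.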

Use risk analysis metrics to assess the workload.

SRE relies on metrics to evaluate a service's reliability and the work to be done: Service Level Agreements (SLAs), Service Level Indicators (SLIs), and Service Level Objectives (SLOs). SRE teaches us that the closest to perfection you can realistically target is 99.999%. Given the facts stated above, 100% security isn't achievable either; chasing it wastes resources and money you could employ better elsewhere. Your error budget is built on this principle: stay within a metric's bounds, and you're free to focus on the next one.

In security, you have to assess your metrics according to this concept. Use known ones such as: 

  • Key Performance Indicators to assess the effectiveness of your activities: Mean Time To Detect, Mean Time To Repair, Alert Time To Triage, and so on;

  • Key Risks Indicators: internal assets deemed critical, known vulnerabilities, etc.

Not relying on such metrics and precise lower and upper bounds will lead you to chase ghosts when time, money, and people matter.
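The error-budget arithmetic behind these bounds is simple enough to sketch. The function names and the 30-day period below are assumptions for illustration, not a standard API:

```python
def error_budget(slo=0.999, period_minutes=30 * 24 * 60):
    """Minutes of allowed unavailability in a period, given an availability SLO.
    A 99.9% SLO over 30 days leaves a 43.2-minute error budget."""
    return (1 - slo) * period_minutes

def mttd(detection_delays_minutes):
    """Mean Time To Detect: average delay, in minutes, between
    incident start and detection, over a list of past incidents."""
    return sum(detection_delays_minutes) / len(detection_delays_minutes)
```

The same shape applies to the security KPIs above: compute the indicator, compare it against its agreed bound, and spend effort only where the bound is breached.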

Tests aren't enough to face the unknown.

Modern distributed systems are immense, constantly changing, and largely stateless, which makes it nearly impossible to understand them fully at any given moment. Moreover, in security, the main factor is the human one, which is unpredictable. Every system has to work this factor into its security schemes.

No security system can remain idle in an ever-evolving threat environment. The set-it-and-forget-it mantra is doomed to fail because it's static. Create dynamic security through feedback loops and experimentation.

In a few words, test and experiment as many times as possible. You want to be ready when it comes.

  • Test: the validation or assessment of a previously known outcome;

  • Experiment: the derivation of new insights and information that were previously unknown.
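The distinction can be sketched in code. Here, a toy latency model (entirely invented: the `10 + 2 * load` formula is an assumption, not a real service) shows a test asserting a known expectation, while an experiment records behavior we haven't characterized yet:

```python
import random

def service_latency_ms(load):
    """Toy latency model: baseline plus load penalty plus noise (all invented)."""
    return 10 + 2 * load + random.gauss(0, 1)

def run_test():
    """Test: validate a known outcome (baseline latency under no load
    stays well under 50 ms)."""
    return service_latency_ms(0) < 50

def run_experiment(loads=range(0, 200, 20)):
    """Experiment: sweep loads we haven't characterized and record
    what happens, with no predetermined pass/fail."""
    return {load: service_latency_ms(load) for load in loads}
```

The test protects what you already know; the experiment's output becomes the next hypothesis to test.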

Do it once manually and automate it.

One of the main particularities of SRE is to rely on automation once the vulnerabilities have been found and acknowledged. As such, you're increasing your readiness when hard times come.

Automation also dramatically amplifies human labor. Remediating issues at machine speed frees us to move on to the next problem once we've discovered one, fixed it, and automated the handling of its future occurrences.

A new approach to cybersecurity

We're developing new measures and tools to handle security threats every day. Cybersecurity spending is already over $50 billion annually and is projected to keep increasing at a fast pace. Still, the issue only seems to get worse: more attacks and new exploits are reported every day.

Interestingly, most attacks aren't due to advanced threats like APTs. Most are simpler than that: incomplete implementations, misconfigurations, or design flaws. Human error and system glitches (system errors, misconfigurations) account for most security breaches, and unpatched vulnerabilities or human error (stolen credentials, phishing, accidental data loss) account for more than 50% of data-breach root causes.

Malicious or criminal attacks succeed more often due to initial human errors and system glitches. We know these factors. They're testable and measurable. This is about taking a proactive approach to designing, building, implementing, operating, and monitoring our security controls.

In a dynamic environment such as cyberspace, continuous instrumentation and validation of security capabilities are needed to create a real sense of confidence in our system's ability to defend against incoming threats. This means relying on something other than hope while we wait for the threat.

