How to fix alarm overload with cloud security automation
So you’ve got your new security monitoring system set up. You have alarms that send you emails when the root user logs in. You get a Slack message when someone launches an RDS instance open to the internet. And everyone gets a text message when an S3 bucket is created without encryption.
This is great, but like pretty much everyone, you’re swamped with notifications all day long.
At CloudSheriff, we believe it’s cruel and lazy to set up a security monitoring system just to send notifications. Many products on the market do a great job of telling you what is wrong (look, this bad S3 bucket has a red label on it!), and that’s it.
We want to take care of your security so you can innovate and sleep well at night. That’s why we offer automated remediation.
What is automated remediation? Let’s break it down. In cloud security, “remediation” is just a fancy word for “fixing something”. When you find something broken, you fix it, or remediate it. If you find an S3 bucket with public access, you remediate it by changing the access policies to be private.
Automated remediation is just fixing security problems automatically. That means when problems pop up (and pop up they will), they get fixed in the background. The only notification you get is that the problem was fixed. Now that’s a notification anyone would like to get!
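Here is a minimal Python sketch of the public S3 bucket example, assuming the fix is to turn on S3 Block Public Access for the bucket (one common approach, not necessarily how CloudSheriff does it). `put_public_access_block` is a real method on boto3’s S3 client; the function name `remediate_public_bucket`, the `RecordingS3Client` stand-in, and the message wording are hypothetical. The client is passed in as a parameter so the fix can be tried without AWS credentials.

```python
"""Minimal sketch: remediate a publicly accessible S3 bucket, then report the fix.

Assumption: `s3_client` is any object exposing boto3's S3 client method
`put_public_access_block`, e.g. `boto3.client("s3")`. The function and class
names here are hypothetical, for illustration only.
"""


def remediate_public_bucket(s3_client, bucket_name: str) -> str:
    """Block all public access on `bucket_name` and return a 'fixed' notice."""
    # Turn on all four S3 Block Public Access settings for this one bucket.
    s3_client.put_public_access_block(
        Bucket=bucket_name,
        PublicAccessBlockConfiguration={
            "BlockPublicAcls": True,        # reject requests that set public ACLs
            "IgnorePublicAcls": True,       # ignore any existing public ACLs
            "BlockPublicPolicy": True,      # reject bucket policies granting public access
            "RestrictPublicBuckets": True,  # restrict access if a public policy already exists
        },
    )
    # The only message a human sees: the problem has already been fixed.
    return f"Fixed: blocked public access on S3 bucket '{bucket_name}'."


class RecordingS3Client:
    """Hypothetical stand-in for boto3's S3 client that records calls instead of calling AWS."""

    def __init__(self):
        self.calls = []

    def put_public_access_block(self, **kwargs):
        self.calls.append(kwargs)
        return {}


if __name__ == "__main__":
    fake = RecordingS3Client()
    print(remediate_public_bucket(fake, "example-bucket"))
```

Against a real account you would pass `boto3.client("s3")` instead of the recording stand-in, using credentials allowed to call `s3:PutBucketPublicAccessBlock`.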
CloudSheriff’s best practices for avoiding security alarm overload with automated remediation:
- Build a remediation plan for a wide variety of common problems. For example, when you use AWS Security Hub, it launches with dozens of Center for Internet Security (CIS) recommendations already implemented as AWS Config rules. Those rules only detect problems, though; you will have to code the remediation actions yourself.
- Use human-readable policies so your team has a fighting chance at compliance. At CloudSheriff, we use Cloud Custodian, an open source project that lets you build guardrails with human-friendly YAML. We give you 100 such policies out of the box so you don’t have to reinvent the wheel.
- Reserve fully automated remediation for the most egregious policy violations, such as an Amazon Relational Database Service (RDS) cluster open to the public internet: violations you know for sure you will always want fixed, without hesitation.
- For policy violations that are more of a gray area, take a “human-in-the-loop” approach to remediation. These are problems that are easily fixed, but you’d like to know about them before you fix them. For example, if someone opened a port on a security group that isn’t normally authorized, you might want to inspect it or ask around the team before you take action. It could be a valid change.
- For the least cognitive load on your team, offer a choice of resolutions: “We can fix this by doing X or Y, or do nothing. Which do you prefer?” Let the team member just click a button instead of planning out what needs to be done.
- Plan remediation in advance. Nothing kills productivity like having to context-switch completely away from value-producing work to think about how to fix a policy violation.
- Planning in advance will push you to think about baseline rules that everyone should follow. A very common one is a simple tagging policy that requires every resource to be tagged. CloudSheriff automates this by enforcing global tag policies. This gives you a clean view of your entire inventory, which is a necessity for a healthy security posture: you can’t secure resources that you don’t know about.
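The tiers in the list above (fix automatically, ask a human with one-click choices, or just notify) can be sketched as a small routing table written in advance. This is a hypothetical Python illustration, not a CloudSheriff or Cloud Custodian API: the `Tier`, `Finding`, `PLAN`, and `REQUIRED_TAGS` names, the finding kinds, their tier assignments, and the option labels are all assumptions chosen to mirror the examples in the text.

```python
"""Hypothetical sketch of tiered remediation routing.

Nothing here is a CloudSheriff or Cloud Custodian API. Finding kinds,
tier assignments, and option labels are illustrative assumptions.
"""
from dataclasses import dataclass, field
from enum import Enum
from typing import Dict, List, Optional, Tuple


class Tier(Enum):
    AUTO = "auto-remediate"        # egregious: always fix, no questions asked
    APPROVE = "human-in-the-loop"  # gray area: offer choices before acting
    NOTIFY = "notify only"         # no plan exists yet, so just alert


@dataclass
class Finding:
    kind: str  # hypothetical finding type, e.g. "rds_public"
    resource_id: str
    tags: Dict[str, str] = field(default_factory=dict)


# The remediation plan, written in advance. Each known finding kind maps to a
# tier plus the one-click choices offered to a reviewer (empty for AUTO).
PLAN: Dict[str, Tuple[Tier, List[str]]] = {
    "rds_public": (Tier.AUTO, []),
    "sg_unusual_port": (Tier.APPROVE, ["Close the port", "Keep it open", "Do nothing"]),
    "missing_required_tags": (Tier.APPROVE, ["Apply default tags", "Do nothing"]),
}

# Hypothetical baseline tag policy: every resource needs these tag keys.
REQUIRED_TAGS = {"owner", "environment"}


def check_tags(resource_id: str, tags: Dict[str, str]) -> Optional[Finding]:
    """Return a finding if the resource is missing any required tag key, else None."""
    if not REQUIRED_TAGS.issubset(tags):  # iterating a dict yields its keys
        return Finding("missing_required_tags", resource_id, tags)
    return None


def route(finding: Finding) -> Tuple[Tier, str]:
    """Look up the pre-planned tier for a finding and build the message to send."""
    tier, options = PLAN.get(finding.kind, (Tier.NOTIFY, []))
    target = f"{finding.kind} on {finding.resource_id}"
    if tier is Tier.AUTO:
        # A real system would run the pre-written fix here, then report it.
        return tier, f"Auto-remediating {target}"
    if tier is Tier.APPROVE:
        return tier, f"Review {target}. Options: " + " | ".join(options)
    return tier, f"No remediation plan for {target}"
```

For example, `route(Finding("sg_unusual_port", "sg-0abc"))` returns `Tier.APPROVE` with the message `Review sg_unusual_port on sg-0abc. Options: Close the port | Keep it open | Do nothing`. Any finding kind missing from `PLAN` falls through to `Tier.NOTIFY`, which doubles as a signal that the plan has a gap worth filling before the next alarm.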
With the wide variety of security offerings in AWS today, it is possible to become so overloaded with alarms that you become numb to them. We call that “notification fatigue” or “alarm overload”. If everything is urgent, then nothing is urgent. A better approach is to automate the fix for common and obvious policy violations. This can save tens, if not hundreds, of developer hours per year that are better spent pushing the business forward. Do yourself and your business a favor: think through the security policies you want to enforce, and automate the remediation of policy violations.