Securing Mission-Critical Workloads at Runtime

This session examines how federal security teams can maintain continuous runtime visibility, detect active threats as they occur, and apply policy enforcement consistently across Kubernetes and containerized environments, even when systems are isolated or mission-bound. Delivered at a practitioner level, the webinar focuses on producing defensible, real-time evidence to support Zero Trust enforcement and FedRAMP High requirements.

1. Runtime protection, zero drift tolerance (block, don't just alert)
2. Reduce vulnerability noise and focus remediation where risk is real
3. Runtime-captured evidence to close Zero Trust audit gaps
4. FedRAMP High demonstration requirements

Ideal for teams that must validate operationally across hybrid and air-gapped environments and produce defensible security evidence.
Duration: 32:06
Presented By:
Tsvi Korren
Field CTO, Aqua
Transcript
Hello, and thank you for joining us. On behalf of Aqua Security and Carahsoft, I would like to welcome you to our Aqua webinar, Securing Mission Critical Workloads at Runtime. Before we get started, I would like to go over a few housekeeping items. Please don't hesitate to ask any questions you may have during or after the presentation.

You can direct your questions to the chat pod at the bottom of your screen. If for some reason we are not able to get to your question, we will follow up with you after the webinar. This webcast is being recorded, and a link will be sent out in a follow-up email for you to review or share with a colleague. Before we introduce today's speakers, I would like to tell you a little bit more about Carahsoft.

At this time, I would like to introduce our speakers for today: Tsvi Korren, Field CTO, Aqua Security; Ryan Ahlers, Senior Director, Sales Engineering, Aqua Security; and Jeffrey Harnois, Senior Security Engineer, Aqua Security.

Ryan, the floor is all yours.

Excellent. Thanks, Eli.

Joining me today are Tsvi and Jeff. Tsvi is our Field CTO, and Jeff is a longtime hands-on cloud security practitioner.

Both of these gentlemen are very much involved in both the direction of our product and the success of our customers. I have one slide to show today. We're going to talk about four big topics, and you guys should feel free to ask questions as we go through. I'll moderate them in the Q&A field and make sure I ask Tsvi or Jeff accordingly.

So, let's begin at the beginning. These are the four main topics we plan to cover today.

Here at Aqua, we truly are leaders in the capability of protecting workloads at runtime. The four big topics we're going to talk about are vulnerability management, compliance, runtime risks, and workload protection. These are topics that we hold near and dear to us. These are things that we work on every single day.

We help our customers become successful in them. We're happy to share some best practices around them. I'm going to go through and I'll ask a big question on each of these.
Feel free to follow up with any questions that you may have, and I'll make sure we interlace them into the talk here. The first question, gentlemen, is all about the first topic there, which is vulnerability management in context. And the question is: finding vulnerabilities is easy. Fixing the right ones is not.

What processes or feedback loops can agencies adopt to move from detection to remediation? And how should they link vulnerability data with runtime insights to focus on actively exploitable risk instead of raw CVE lists, which is a big topic we run into every day? Tsvi, what do you got?

Yeah. I think we can think of this. So, hi, everybody. Thanks for joining.

The topic of vulnerability management: if you're on this webinar, you have probably had to deal with some aspect of open source vulnerabilities in the last few years. We are currently drowning in vulnerabilities. The rate of vulnerability findings keeps climbing, and open source components, which are usually where the vulnerabilities live, are being used in many organizations to support critical missions. But most of those components come with a baggage of vulnerabilities. The question, as Ryan said, is not really finding the vulnerabilities, because we can scan and get a whole list of items to deal with. It is understanding which are the more important ones to take care of, understanding that we cannot really fix all the vulnerabilities. Sometimes that's because they just don't have a fix yet, but a lot of the time it's because we just don't have the capacity in our development organizations to do everything needed to remediate them, understanding that a lot of the time that requires massive changes in the code base.

We have a lot of vulnerabilities in the world, and one of the things that we are looking to do is to give the people that need to fix them some guidance on how to do that. What we found is that when we're dealing with applications that run in cloud native environments, specifically containers or even sometimes functions in the cloud, we have a lot of information on the runtime side that can guide us to the vulnerabilities that are most important to fix. What we're doing in Aqua today is that we have the ability to classify vulnerabilities based on the objective data that we get from the NVD and other sources.

So that would be severity score, exploitability, and so on. But then, for everyone trying to remediate those vulnerabilities, we need to overlay what's actually happening in runtime. It can be as coarse as: is the container image that we are scanning actually being used at runtime?

We have a backlog of images that are kept for historical purposes and not deployed, or maybe images that are not ready. We really only need to concentrate on the vulnerabilities in the images and repos that are actively being used in production. And the second thing, if we can go even a step further, is to understand what is actually running.

Out of all the components that are available inside of the image, usually only a fraction of them are actually being used in a runtime environment. The runtime data that we get about the usage of software within your containers feeds a system that reclassifies vulnerabilities based on their importance and prevalence in production, and that allows us to gain better focus and fix them faster. On top of that, there is other data that comes out of the runtime environment. If the workload that we're dealing with has a direct path to the open Internet, if it's servicing customer requests, there's a lot more danger, because then malformed requests can make use of those vulnerabilities.

The more we see our workload being used actively, being called on by different services, either internal or external, and the more the components running there correspond to known vulnerable components, all of these are going to help us classify and reevaluate which vulnerabilities we need to fix in the environment, which is going to elevate our security posture overall.
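The kind of runtime-informed reprioritization described above can be sketched roughly as follows. This is a minimal illustration, not Aqua's actual scoring model; the field names and the weight values are assumptions made for the example.

```python
from dataclasses import dataclass

@dataclass
class Finding:
    cve_id: str
    base_severity: float   # static CVSS base score from the NVD (0-10)
    image_deployed: bool   # is the image actually running anywhere?
    package_loaded: bool   # was the vulnerable component observed executing?
    internet_facing: bool  # does the workload serve external traffic?

def runtime_priority(f: Finding) -> float:
    """Re-rank a CVE using runtime context on top of the static score."""
    if not f.image_deployed:
        return 0.0                 # backlog image: defer, or delete the image
    score = f.base_severity
    if f.package_loaded:
        score *= 1.5               # the component actually executes at runtime
    if f.internet_facing:
        score *= 1.5               # reachable by malformed external requests
    return score

findings = [
    Finding("CVE-2024-0001", 9.8, image_deployed=False,
            package_loaded=False, internet_facing=False),
    Finding("CVE-2024-0002", 6.5, image_deployed=True,
            package_loaded=True, internet_facing=True),
]
# A mid-severity CVE that is deployed, loaded, and internet-facing
# outranks a critical CVE sitting in an undeployed image.
ranked = sorted(findings, key=runtime_priority, reverse=True)
```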

Yes. And to put a little bit of context to my answer, I work on the FedRAMP SaaS that we have. For those of you familiar with FedRAMP, we're starting to see a shift right now with the 20x motions coming out of FedRAMP, and that's where we're starting to look at the reachability of CVEs rather than just all of the vulnerabilities. Prior to that, we would generally have one of two remediation paths.

One would be to do a false positive documentation analysis for every single one that we determine is not actually vulnerable, or two, we would just fix everything regardless of its exploitability. But with this change, we're starting to focus on what's actually being exposed to the Internet: looking at your baselines, looking at the attack surface, and maybe shadow IT infrastructure that you weren't aware of that might be running in production. Those can all be found via something like the auto discovery that some of the tools allow you to do.
This is what we're starting to see as our priority for what we need to be fixing. These front-facing applications will constantly be showing up in the runtime detections and the alerts that we have, and then we can start prioritizing those as the ones that we need to remediate.

I would also add, just from a workflow point of view, that we need to shift our focus away from vulnerabilities in images that just sit in the registry, because after probably ten years of containerization efforts, the market and most of the people that we talk to have accumulated quite a lot of images in the registry. When we scan and rescan those images, especially as new vulnerabilities are being discovered on old code bases, you might get a situation where you're alerted on a few thousand or maybe tens of thousands of “new vulnerabilities” even though they're not new, because they just live in existing images.

They just sit stale in the registry. So an actionable item, if you have the ability to do that, is to clean the unused images out of your registries. It's going to do wonders for your vulnerability stance, because if those images are not in use anymore, there's really no need to keep them available to be downloaded into the production clusters; then we don't need to scan them, and we can clean out that code base. It could be as easy as going back and removing images from the registry.
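A registry cleanup pass like the one suggested above can be sketched as a simple filter: keep anything that is currently running or recently used, and flag the rest for deletion. The record shape below is an assumption standing in for whatever your registry's API actually reports.

```python
from datetime import datetime, timedelta

def stale_images(registry_images, running_digests, max_age_days=90, now=None):
    """Return image digests that are candidates for deletion: not running
    in any cluster, and not pulled within the retention window.

    `registry_images` is a list of dicts like
    {"digest": str, "last_pulled": datetime}, a stand-in for the metadata
    your registry exposes. `running_digests` is the set of digests your
    clusters report as deployed."""
    now = now or datetime.now()
    cutoff = now - timedelta(days=max_age_days)
    return [
        img["digest"]
        for img in registry_images
        if img["digest"] not in running_digests
        and img["last_pulled"] < cutoff
    ]
```

Anything this returns is an image nobody deploys and nobody pulls, so it no longer needs to be scanned, tracked, or kept downloadable by production clusters.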

And then for ongoing work, we have to understand exactly what's running so that we can be a lot more precise about the vulnerabilities that we need to fix.

Perfect. I can confirm, even on the solutions architect side, that we see this challenge continuously. Even in a small evaluation environment, we will see thousands, if not tens of thousands, of vulnerabilities show up. Without any sort of context on where they are, what they're running, or what they could affect, how do you know where to begin?

I don't see any questions so far, so I want to continue on with the compliance piece, and then I've got a couple of follow-up questions there as well. Shifting from vulnerability management over to compliance, the question is: how can continuous vulnerability assessment and runtime visibility help agencies maintain compliance with OMB, NIST, and Zero Trust mandates? How can they prove that protection is actually happening in production?

I can take that as well. The theoretical basis for this is that, especially in cloud native and containerized environments, doing vulnerability management is a control in most compliance frameworks.

You need to understand what the level of risk is in the software that you push out. And then we have other controls that we need to execute on: anti-malware, logging, file integrity, and so on. All of these are runtime controls that can be implemented in a containerized environment fairly easily.

Because even though your containers might be bloated, the images might be big, and you might have done some lift and shift in the past and have full operating systems inside of your containers, the containers are still only going to do one or two things. And it now becomes easy to understand, for instance, that we don't need any administrative functions to happen inside of a container. Nobody should log on to a container.

Nobody should modify it on the fly. So we institute the kinds of controls that prevent administrative access: exec into containers, replacing executables inside of the containers, trying to patch things inside of the containers. All of these things really should not happen at runtime, and the vulnerability management processes really contribute to that.

A good, effective vulnerability management process is one where everything can be discovered right on the image, hopefully even before you push the image to the cluster. We have the ability to simulate, or actually get a real understanding of, what the vulnerability posture of a workload is going to be just by scanning its image. By doing the work to remediate vulnerabilities quickly before anything gets into the cluster, and then making sure that the container does not change whilst it's in the cluster, those two controls are going to give us a lot of confidence that we can implement the compliance controls and demonstrate compliance.

For instance, one of the things that you could do, if you want to use Aqua, is to not allow any new executables to instantiate inside of a container after it has started. We call that Drift. What Drift is doing is preventing the workload from changing after you've done your vulnerability analysis on the image. So the story goes like this.
We do a vulnerability analysis on the image. We let the container run. If we can make sure that the container does not change at all during its execution, then we can be guaranteed that the vulnerability posture that we know about is actually the effective vulnerability posture of the container. So runtime controls are an important part of that. First of all, they're going to answer some of the compliance controls for file integrity, anti-malware, and the rest of it, but they're also going to ensure and demonstrate that your vulnerability management is effective, because whatever you caught inside of the container does not change. Then you have time to remediate, generate a new image, and replace it over time so that you have a better vulnerability posture overall.
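The core of a drift-prevention control like the one just described is a small decision: compare every exec observed at runtime against the set of executables that existed when the image was scanned. This is a conceptual sketch of that logic, not Aqua's implementation.

```python
def make_drift_gate(image_executables):
    """Given the set of executable paths found when the image was scanned,
    return a callback deciding whether an exec observed at runtime is
    allowed. Anything not present at scan time is drift and gets blocked,
    which keeps the scanned vulnerability posture equal to the effective one."""
    baseline = frozenset(image_executables)

    def on_exec(path: str) -> str:
        return "allow" if path in baseline else "block"

    return on_exec

gate = make_drift_gate({"/usr/bin/python3", "/app/server"})
gate("/app/server")   # "allow": shipped in the scanned image
gate("/tmp/xmrig")    # "block": dropped in after start, i.e. drift
```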

And from the FedRAMP side, or NIST 800-53, the big successes that we see are around the runtime policies with the agents. When it comes to controls like least functionality, compute access, or any sort of system monitoring, those all include a remediation plan, like having an intrusion detection system and an intrusion prevention system.
These runtime agents run in the background, monitoring everything and staying up to date on what's happening in the environment. That's a powerful tool to have: an IDS and IPS wrapped in one that's looking at all of those NIST 800-53 controls within those runtime policies.

Some other use cases in our FedRAMP environment include the controls we mentioned: allow lists for executables or images, malware detection. All of those come up in those controls, and that's all wrapped into these runtime agents that can act in real time, which is such a highly valued return on investment when it comes to these containerized environments.
That's perfect. I actually have a follow-up question on something you touched on a little bit, and, Jeff, you as well. The follow-up is all around Zero Trust. When you're running Zero Trust and you have these compliance requirements, we're constantly being asked to prove that protections are active in production.

What does that kind of real-time evidence actually look like in runtime?

You can prove this in basically two ways. First of all, you can make sure that all the workloads are in scope for those controls; you can always get a list of the containers or workloads that are subject to runtime policies and which controls they're subject to, which demonstrates that the controls are in effect. But the other side of that is the logging.
Every time that you see an event that might trigger a control, you get an audit alert on it. And you have those continuous audit alerts. Again, it doesn't really mean that there is a security incident per se, because a lot of those are actually artifacts of not adhering to best practices of containerization.

Your container might load an executable from time to time. It's not what we wanted to happen, but we do understand that there are some occasions where that might happen. But if you see that event, that means that the control is active. You can do it in two ways.

You can get from the system itself an understanding of what workloads are covered, and then a continuous stream of audit events also demonstrates that those controls are in place, and then some of them might need to be acted on.

Perfect. Thank you, sir. Alright. We're going to move on to our third topic, because I'm sure we'll get a bunch of questions around runtime as well as workload protection, what Aqua is famous for. Let's talk a little bit about runtime risks inside of modern architecture. Threats are very different at runtime, as we all know.

How are teams adapting detection and response to protect workloads as they run?

That's a short question, but it actually hides inside of it a whole change in practice.

Again, we're dealing with cloud native workloads. Cloud native workloads are small, immutable, and orchestrated, all the things that we know and like about containers. But the biggest change, I think, for security is in the security response.

It's mostly because, again, if the application is architected correctly, a single container, or the absence of a single container, is not going to impact the application that much, because it's scaled and replicated across multiple deployments inside, let's say, a Kubernetes environment or a cloud environment. What that means for us as security practitioners is that we can choose to be a little bit more heavy-handed in the response to what we perceive to be security incidents. If we see, let's say, an event where a new executable has dropped into a container and we don't anticipate that, the response can be as simple as isolating that node, maybe tainting that node if you're running in Kubernetes, which will then reschedule all the containers on another node.

We can then preserve that node for investigation and analysis, if we have the ability to do that. Although we also have to understand that sometimes containers do disappear off the map and get deleted, and that's why it's important to get continuous log and artifact preservation on those containers and the things that we find. To me, the runtime response side is really the one that has been most disrupted by the move to containerization, for those reasons.
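The node-level quarantine just described can be sketched as a pair of standard kubectl commands: cordon so nothing new lands on the node, then a NoExecute taint so the orchestrator reschedules existing pods elsewhere while the node itself is preserved for forensics. The helper below just composes those commands; the taint key is an illustrative convention, not a Kubernetes built-in.

```python
def isolation_commands(node: str, incident_id: str) -> list[str]:
    """Compose the kubectl commands for a heavy-handed node quarantine:
    1. cordon: mark the node unschedulable, so no new pods land on it;
    2. taint with NoExecute: evict running pods, letting the orchestrator
       reschedule them on healthy nodes, while the node stays up for
       investigation and artifact collection."""
    return [
        f"kubectl cordon {node}",
        f"kubectl taint nodes {node} incident/{incident_id}=quarantine:NoExecute",
    ]
```

A usage note: because the workload is replicated, running these against one suspect node should not degrade service; the application's other replicas absorb the traffic while forensics proceed.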

Containers will work flawlessly until they don't. And when they don't, it's actually not a huge problem, because, hopefully, your cloud and your clusters are architected in such a way that they can compensate for any workload that is impacted by, let's say, a security incident and needs to be isolated and shut down. It also means that we don't have to be super accurate in our ability to detect security risks.

There are still some false positives. We deal with a system where sometimes we need to guess what the intent of the developer or the attacker was if we see some event in the container. But those are not crucial decisions, because we have the ability to have some tolerance in the system in case we need to take a container out of commission and then investigate what happened to that container.

The advice that I have for everybody here is, first of all, get an understanding of the ways that containers should be working properly. We can start by understanding what the application component is doing and what executables are expected to run, and then we can put in controls that detect any misuse of the container, any misuse of the intended parameters that we expect the container to run with. Then any response can be as heavy-handed as you want it. If you want to retain the container, you're free to do that. If you want to clean it out of the way, hopefully your orchestration system is going to replace it, and you won't see any degradation of service.

Perfect. Jeff, your thoughts?

Let me take a little bit of a different approach with a FedRAMP use case. Take a really annoying control, for example, like file integrity monitoring. This one goes off so often that I have to go back and readjust the baseline.

A crucial part of it is establishing that baseline, because things will change constantly in your environment, especially when you have something like updates occurring. Your SIEM solution will go off all the time, generating new alerts.

But one of the greatest parts about having these responsive agents is that you can constantly have that feedback loop of what's running. So, if you put a new update into your dev or staging environments, you can learn, as the new update is running, what kind of changes it will have on your files. New temporary files, or a new package that you integrated, might change that baseline. You can learn as you run, and you don't have to deal with any sort of old school waterfall methodology where you would put it out, wait for x number of issues to come back, fix those, and redeploy.

Because that's a very lengthy process. You can have that kind of feedback loop and learn from the changes in your environment by having these responsive agents, so you don't have to waste all of that time.
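The baseline feedback loop Jeff describes boils down to one comparison run in two modes: in dev or staging, absorb observed file changes into the baseline (learn); in production, report every deviation (alert). A minimal sketch, with the hashing simplified so file contents are passed in directly:

```python
import hashlib

def file_digests(files: dict) -> dict:
    """Map each path to the SHA-256 of its contents.
    For the sketch, contents are passed in as bytes rather than read
    from disk."""
    return {path: hashlib.sha256(data).hexdigest()
            for path, data in files.items()}

def fim_check(baseline: dict, observed: dict, learn: bool = False) -> list:
    """Compare observed digests against the baseline.
    learn=True (dev/staging): absorb changes into the baseline instead of
    alerting, which is the feedback loop described above.
    learn=False (production): return every path that deviates."""
    alerts = [path for path, digest in observed.items()
              if baseline.get(path) != digest]
    if learn:
        baseline.update(observed)
        return []
    return alerts
```

Running an update through staging with `learn=True` updates the baseline in place, so the same update rolled to production no longer floods the SIEM with expected changes.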

Good. That all makes sense. I'm going to move to workload protection, and then I'll have some follow-ups. I'm sure we'll get some questions from the folks on the line as well.

But we're talking about workload protection. That's good. That's where Aqua has made its name for the past ten plus years, and workload protection doesn't just exist in one environment.
It works across all of your different environments if you truly want to secure your workload. So my question to you both is: as containerized workloads span on-prem data centers and multiple clouds, how can agencies maintain consistent runtime protection and controls, securing workloads dynamically without disrupting mission-critical operations?

That really relates to what we just talked about: making sure that those controls are consistent across all the environments. And, again, here we get more benefit from running in cloud native and containers, because you can establish your container image once and then run it in multiple environments. The old adage: make the image once and run it everywhere.

And we think that those security controls really need to travel with that workload, because the security controls for a container don't rely on the place that it runs. The same software components can run, for testing, in a Docker machine under somebody's desk, or in a private cloud or a private data center, or in a public cloud infrastructure, either on the commercial side or the FedRAMP side. Because containers travel with all their prerequisites, they should also travel with all the security controls that are required.

One of the things that we recommend to Aqua customers is that you have a complete set of nonnegotiable controls, the things that are required for compliance, like file integrity or malware or eliminating Bitcoin miners and so on. Those can be deployed as a global policy where, regardless of where the workload is running, if it's protected by Aqua, it is going to be subject to those controls. And then if there are controls that are required for a specific environment, let's say a cloud environment or your on-prem environment needs a slightly different variation of a control, there is enough flexibility in the policies to accommodate that, so that you can start with a global baseline that every single workload, regardless of where it runs, needs to be covered under.

And then you can add other controls that might be supplemental, controls that are dedicated to a particular environment, and then you have the best of both worlds. So even if a workload or an image is taken from your on-prem environment, where you might have just a baseline of detection, and you then want to put it in a cloud environment that might have more protections, then just by virtue of moving that container to this environment, it will inherit all the necessary controls that are built for that environment.

A baseline of nonnegotiable, absolute must-have controls, the basics, and then you can add in separate environments anything that makes sense for those environments. Again, make sure that your posture has at least a baseline, and then anything else can be added on top of that.
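The layering just described, a global nonnegotiable baseline plus per-environment supplements, can be sketched as a simple policy merge. The control names and modes below are illustrative, not Aqua's actual policy schema.

```python
GLOBAL_BASELINE = {            # nonnegotiable controls, applied everywhere
    "file_integrity": "enforce",
    "anti_malware": "enforce",
    "crypto_miners": "block",
}

ENV_OVERLAYS = {               # supplemental, per-environment controls
    "fedramp-cloud": {"drift_prevention": "enforce",
                      "exec_into_container": "block"},
    "on-prem-dev":   {"drift_prevention": "detect"},
}

def effective_policy(env: str) -> dict:
    """Baseline first, then environment-specific controls layered on top.
    A workload moved into a stricter environment automatically inherits
    that environment's extra controls; an unknown environment still gets
    the full baseline."""
    return {**GLOBAL_BASELINE, **ENV_OVERLAYS.get(env, {})}
```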

Yeah. Makes sense.

Not too much to add there. I think we both covered it. The number one thing is those runtime policies.

The great thing we mentioned about cloud native apps is that they're portable. You can move them around. You can have them on prem. You can have them on whatever cloud you want.
But what doesn't change is the set of policies that you might want to enact as global across all of the different environments that you're running your applications on. Whether it's NIST 800-53, or any sort of SOC, or any other compliance framework.

Those are all going to be built into your runtime policies that you can enable, and you can have very similar controls that will meet all of those different compliance requirements across all of the different clouds. You don't have to worry about any sort of redeployment if you're moving from AWS to Azure to GCP or to Oracle or whatever it is.
You don't have to redeploy those rules to fit those different environments. It's all containerized, so you can move those controls over and meet whatever you need to for any sort of compliance with just those runtime policies.

Perfect. Speaking of controls, I've got a follow-up to that. This is something we run into on the solution architect side a lot of the time.

Basically, people out there are starting to run more containerized AI workloads. Are there runtime controls that can detect when those workloads start behaving in unexpected or AI specific ways?

Absolutely. AI is going to be one of the main drivers for more containerized workloads, and those workloads actually behave in a very predictable way. Whether it's a client application that is starting to send prompts towards an internal model or maybe an external service, or an MCP infrastructure that coordinates the inner workings of your AI service between the tooling and the model and the inference server, all of these are really consistent and can be discovered fairly easily.

One of the things that we've added recently is the ability to detect whether or not there are AI-based communications between the different components. Even if you only have the client application on your side, and the AI service is hosted somewhere else or maybe not under your control, we can still figure out whether or not there is an AI-based transaction there. But if you have more components in your environment, if you do have an MCP server, a tooling server, an inference server, even the model itself, you can get a complete map of what models and what services are being used in your environment.

And that is good for two purposes. One is that sometimes there is shadow AI, so you're not even sure that the application is authorized to, or should, access a model, and we are seeing a lot of AI capabilities being embedded in third party and commercial applications. If you inherit some kind of a component or a complete image, it's important to understand whether or not it will, at some point in the life cycle of the application, perform some AI function.

The other thing is that organizations actually do want to absorb AI functions into their environment. In environments where you do anticipate using AI engines, there are also, a lot of the time, controls over what AI models can be used, what AI services can be used, and what version of the components you get from the vendor of the AI system. We have to understand that this is all kind of new, and we are seeing more and more vulnerabilities being found in AI systems. If you get an MCP server outright from Anthropic, it's probably going to have a few vulnerabilities on it, and we need to account for those as well. AI actually presents several nuances to the work that we do today. First of all, there's the fact that applications are starting to use AI transactions whether or not they're authorized.

The next is to differentiate which AI models and which AI services are being used, whether AI tooling is being used, whether agents are being used. The more autonomy we give to those AI systems, the more they can do, and the more we need to make sure they're not doing it from a privileged position. We don't want to run containers as root. We don't want to run containers as privileged in the system.

If you're running a tool server or an agent server for your AI deployment, it's better not to run it under root, because there's no way to tell what somebody might ask it to do. If the AI engine can write its own Python program and execute it, we really need to contain its privileges. And then there's the auditing and the logging, and the fact that we sometimes have just a small allowlist of the models and the services that can be used, and we need to identify anything other than that.
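The model-allowlist audit just mentioned is a small check in principle: compare every observed AI transaction against the short list of approved models and flag the rest as potential shadow AI. A minimal sketch, with the model names and event shape assumed for illustration:

```python
# Illustrative allowlist: the handful of models the agency has approved.
APPROVED_MODELS = {"internal-llm-v2", "approved-hosted-model"}

def audit_model_calls(observed_calls: list) -> list:
    """Return every observed AI transaction whose model is not on the
    allowlist. Each call is a dict like {"workload": str, "model": str},
    standing in for whatever your runtime detection actually emits.
    Anything returned is potential shadow AI to investigate."""
    return [call for call in observed_calls
            if call["model"] not in APPROVED_MODELS]

events = [
    {"workload": "billing-api",  "model": "internal-llm-v2"},
    {"workload": "legacy-batch", "model": "unknown-hosted-model"},
]
audit_model_calls(events)  # flags only the legacy-batch transaction
```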

Everything that we talked about today is security that is very important for running AI workloads, plus there are some nuances for AI that actually make the effort a lot more worthwhile, because the danger from a rogue AI agent is fairly great, especially if you give it quite a lot of privileges in the system.

Oh, for sure. We see that sort of detection of the models and which ones are being used as a huge value add for customers. Even in the evaluation stage, we'll drop in the enforcers, and they will find things that customers had no idea were running. You standardize on one particular model, and maybe there are ten others hidden away in different containers. We're able to find them. So that's great insight to me.

Alright. Eli, I don't see a lot of questions here inside of the chat. I'm going to stop the share, because we don't have to stay within the constraints of those four bullets. We're happy to answer whatever questions you may have.