
· 16 min read
Anthony Maia

Introduction

Context

This blog post results from research and the implementation of various solutions for real-world scenarios to secure the runtime on AWS EKS. It aims to present different approaches to runtime security and compare these solutions, helping readers select the best option for their environment and requirements.

Let's start with defining the topic. Runtime security refers to protecting a system or an application while actively running. This involves monitoring, detecting, preventing and responding to threats in real-time, as opposed to merely securing the code during the development phase or ensuring the environment is secure before execution. Runtime security is crucial because it addresses threats that can arise during the execution of applications, including zero-day vulnerabilities, malicious insiders, and sophisticated attacks that bypass traditional security measures.

In dynamic and complex environments like those found on AWS EKS, where applications and services interact and evolve continuously, runtime security offers a crucial layer of protection that can adapt to these changes. While AWS manages the Kubernetes control plane, customers are responsible for securing Kubernetes nodes, applications, and data. As part of the shared responsibility model, it is possible to delegate more security responsibilities to AWS, by using AWS Fargate for example. However, as we will explore in this article, delegating runtime security may not be the optimal choice.

Instrumentation Techniques

Effective runtime security relies on robust instrumentation techniques to monitor and manage the behavior of applications and workloads. Understanding the instrumentation methods below allows for a tailored approach by selecting the right solutions and configurations based on the specific dynamics and requirements of environments, ensuring effective integration with existing tools and infrastructure.

The three primary techniques used for instrumentation are LD_PRELOAD, Ptrace and kernel instrumentation, each offering unique benefits and limitations.

LD_PRELOAD is an environment variable that directs the operating system to load a specified dynamic library before any other when a program is run. This technique allows for the interception and modification of function calls within the libc dynamic library.

Advantages:

  • This method is generally efficient, introducing minimal overhead.
  • Simple to implement and can be applied without altering the original application code.

Limitations:

  • Limited accuracy with programs that bypass the libc library and make direct syscalls, such as applications written in Go.
  • Primarily useful for intercepting standard library calls, but less effective for low-level system interactions.
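The bypass limitation is easy to verify in practice. The sketch below (assuming a standard Linux host with `ldd` available) reports whether a binary even consults the dynamic linker, which is the precondition for LD_PRELOAD interception:

```shell
# LD_PRELOAD shims are loaded by the dynamic linker (ld.so), so they only
# apply to dynamically linked executables. Static binaries -- common for
# Go programs -- never consult ld.so and silently bypass any preloaded hook.
is_preloadable() {
  if ldd "$1" >/dev/null 2>&1; then
    echo "$1 is dynamic: LD_PRELOAD hooks apply"
  else
    echo "$1 is static or non-ELF: LD_PRELOAD is bypassed"
  fi
}

is_preloadable /bin/sh
```

On most distributions /bin/sh links against libc and reports as dynamic, while a statically compiled Go binary would report as bypassed.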

Ptrace is a syscall that allows one process to observe and control the execution of another process. It is commonly used for debugging but can also be employed for monitoring and security purposes.

Advantages:

  • Provides detailed control over the monitored process, allowing for inspection and manipulation of its state.
  • Can be used for a variety of tasks, including debugging, tracing, and runtime security.
  • Ptrace is a method that can be used in user-land and does not require kernel privileges. It can be used in environments like AWS Fargate where access to the host is impossible.

Limitations:

  • The level of detail provided by Ptrace comes with a performance cost. Collecting more data increases overhead, potentially slowing down the system.
  • To broaden coverage to all user-land processes, Ptrace can attach to PID 1. This approach is fragile in practice: any instability in the tracer, which is very likely, will crash the container.
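Ptrace attachment is also observable from inside the container, which matters to attackers and defenders alike: on Linux, the kernel records the tracer's PID for every process. A quick check:

```shell
# TracerPid in /proc/<pid>/status is non-zero whenever a ptrace-based
# monitor (debugger, sandbox, or security agent) is attached to the process.
grep TracerPid /proc/self/status
```

A value of 0 means no tracer is attached; this is the same field that classic anti-debugging tricks inspect.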

Kernel instrumentation involves inserting modules or hooks into the kernel to monitor and control syscalls and other low-level operations. eBPF is an efficient method for performing kernel instrumentation.

Advantages:

  • eBPF offers low overhead and high accuracy, making it suitable for real-time monitoring and security.
  • Provides granular visibility into system behavior, enabling detailed analysis and response to security events.

Limitations:

  • Kernel instrumentation requires elevated privileges, which may not be feasible in certain environments, such as managed cloud services.

This analysis shows that Ptrace is the only viable option for managed services without root access to the host, at the cost of either global instability or restricting tracing to specific processes and losing significant visibility. Kernel instrumentation (especially eBPF) is the best solution for unmanaged services where you have root access to the host.

Linux Security Features

Several well-known Linux security features enhance the runtime environment. SELinux uses a configuration language to meticulously define permissible actions within user space, while seccomp and seccomp_bpf restrict syscall actions at a granular level. Implementing and maintaining these solutions is exceptionally challenging in dynamic and complex environments like AWS EKS.

The value of these security features lies in the precision of the policies and rules they enforce, effectively creating a firewall between applications and the kernel. However, poorly designed or maintained rules can adversely impact system performance and availability. Therefore, these tools are best suited for static environments, specific use cases, or as foundational elements for building more tailored security solutions as the ones we will present later in this article.
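To make the precision of such policies concrete, here is a sketch of a strict seccomp profile in the JSON format Docker and Kubernetes accept; the allowed syscall list is purely illustrative and would have to be tuned per application:

```shell
# A minimal allow-list profile: any syscall not named below kills the
# process. Real applications need a far longer list, which is exactly
# why these policies are hard to maintain in dynamic environments.
cat > /tmp/strict-profile.json <<'EOF'
{
  "defaultAction": "SCMP_ACT_KILL",
  "syscalls": [
    {
      "names": ["read", "write", "exit", "exit_group", "futex", "epoll_wait"],
      "action": "SCMP_ACT_ALLOW"
    }
  ]
}
EOF
echo "seccomp profile written to /tmp/strict-profile.json"
```

With Docker such a profile is applied via `--security-opt seccomp=/tmp/strict-profile.json`; in Kubernetes, via the pod's `securityContext.seccompProfile`.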

Protection vs. Detection

While these approaches are not mutually exclusive and can be combined, they have significant differences that need to be considered and discussed early on.

Monitoring and Detection

Monitoring solutions are less intrusive regarding performance and availability. They work by observing system behavior and generating alerts for any suspicious activity. However, this approach requires someone to review these alerts and take action quickly, as it is not proactive in stopping threats.

Monitoring solutions have the advantage of being application agnostic, meaning they can be implemented without requiring deep knowledge of the specific applications being monitored. Additionally, they typically require less initial setup and ongoing maintenance compared to protection solutions. Despite these benefits, monitoring solutions still require significant privileges to gather comprehensive data, allowing precise answers to "Who, What, When, and Where" questions. While monitoring can be done at the user-space level, it tends to be less accurate than kernel-level modules. User-space monitoring often adds substantial overhead, similar to protection tools, reducing its overall effectiveness.

Preventative Measures

Preventative security solutions actively block harmful actions, such as unauthorized syscalls, providing a proactive defense against potential threats. However, configuring these solutions can be challenging. Determining which syscalls to block for each process can be complex and highly dependent on the specific application and its libraries. Any updates to the application might require re-tuning of the blocked syscalls, particularly if new libraries are introduced.

This fine-tuning is crucial but can be disruptive if not managed properly. If the configuration is too restrictive, it might prevent the application from functioning correctly. Conversely, if it is too lenient, it might fail to block malicious activities effectively.

Runtime Security Strategies

While monitoring and preventative measures have distinct advantages and challenges, combining them can provide a more robust security posture. Monitoring can help identify unusual activities and provide data to refine preventative rules. Nevertheless, as of today, such solutions are not mature enough on AWS EKS for heavy workloads and real-world cases without suffering integration costs and/or significant performance issues.

Choosing between monitoring and preventative approaches, or deciding to combine them, depends on the specific requirements and constraints of your environment. In dynamic environments like AWS EKS, where workloads are constantly evolving, understanding these differences is crucial for implementing an effective runtime security strategy.

Runtime Protection

There are numerous solutions available, but I will only detail two mature solutions designed and tested against heavy workloads because they embody opposite philosophies aimed at protecting the runtime.

gVisor

gVisor is an open-source container runtime sandbox developed by Google. It enhances the security and isolation of containerized applications by acting as a user-space kernel. This approach intercepts and handles syscalls from applications running inside containers, preventing direct interaction with the host kernel. By doing so, gVisor significantly reduces the attack surface and minimizes the risk of host system compromise. Designed for compatibility with container orchestration tools like Docker and Kubernetes, gVisor integrates seamlessly into existing workflows, particularly within Google Kubernetes Engine (GKE). However, it is interesting to note that gVisor is not officially supported on AWS. It supports two platforms: ptrace and KVM.

gVisor

Reference

The ptrace platform leverages the ptrace syscall to intercept and emulate syscalls made by containerized applications. Sentry, gVisor's user-space application kernel, intercepts these syscalls, minimizing the attack surface by preventing direct interaction with the host kernel. Sentry emulates the syscalls in a secure environment using a limited set of API calls enforced through seccomp, ensuring that only necessary syscalls are permitted. Unlike a simple Ptrace sandbox, Sentry interprets the syscalls and reflects the resulting register state back into the tracee before continuing execution, maintaining application behavior while enhancing security. Additionally, the gofer component provides secure file system access, acting as an intermediary between the container and the host file system. Since it still relies on Ptrace, applications heavy on syscalls will experience performance penalties.

The KVM platform, on the other hand, uses the Kernel-based Virtual Machine (KVM) to run Sentry in a lightweight virtual machine, providing an extra layer of isolation and security through hardware virtualization features. This approach enhances isolation by leveraging hardware-level security and can optimize performance by reducing the overhead associated with syscall interception and emulation. The drawback of the KVM platform is that it requires nested virtualization or bare-metal instances.
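To illustrate how gVisor plugs into Kubernetes, the sketch below registers runsc as a RuntimeClass; it assumes runsc and its containerd shim are already installed on the worker nodes (EKS EC2, not Fargate).

```shell
# RuntimeClass is the standard Kubernetes mechanism for selecting an
# alternative container runtime. The handler name must match the runsc
# runtime configured in containerd on each node.
cat > /tmp/gvisor-runtimeclass.yaml <<'EOF'
apiVersion: node.k8s.io/v1
kind: RuntimeClass
metadata:
  name: gvisor
handler: runsc
EOF
# Pods then opt in with: spec.runtimeClassName: gvisor
echo "apply with: kubectl apply -f /tmp/gvisor-runtimeclass.yaml"
```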

You can find a PoC of the gVisor setup on EKS EC2 and test it yourself here.

Fargate

Fargate is a serverless compute engine designed for running containers without the need to manage the underlying infrastructure. As part of the shared responsibility model, AWS assumes more responsibility with Fargate for securing the infrastructure, including the runtime environment of the containers.

It is important to note that by opting for Fargate, you are transferring the responsibility for runtime security to AWS. However, as with any black-box service, the inner workings of AWS's operations remain opaque to users. For a deeper exploration of this topic, you can refer to this great article. It explains that choosing Fargate does not inherently guarantee a secure runtime environment.

While you can still implement monitoring, relinquishing control over your containers means you lack direct access to the host. This poses several key limitations for monitoring solutions: there is no root access to the host, and CAP_SYS_ADMIN (the Linux kernel capability that grants a process a broad range of administrative privileges over the system) is not supported. Consequently, it is not possible to deploy kernel modules or leverage eBPF for tracing syscalls; file system and network monitoring lose visibility and accuracy; and sidecar architectures become incompatible with monitoring solutions.

These factors underscore the trade-offs associated with using Fargate: while it delegates the responsibility, it also imposes restrictions on advanced monitoring and security practices. As a side note, Fargate is supported by GuardDuty Runtime Monitoring (presented later in this article), but only on ECS and not EKS.

Runtime Monitoring

Falco

Falco leverages a kernel module or an eBPF probe (including the newer "modern eBPF" probe) to collect syscall events from the underlying host kernel. This allows Falco to observe low-level system activities such as file access, network communication, and process creation.

In a Kubernetes cluster, Falco is deployed as a DaemonSet. This means there is an instance of the Falco agent running on each node within the cluster. These agents continuously collect and analyze syscall events from their respective nodes.

Collected syscall events are processed by the Falco event processor. The processor applies rules to detect suspicious or unauthorized activities. When a rule is triggered, an alert is generated.

Falco provides various alerting options, such as sending alerts to syslog, writing to a file, or integrating with external alerting and monitoring tools like CloudWatch.

In addition to runtime events, Falco relies on audit logs (Kubernetes API server and control plane logs) to enhance its contextual awareness. It can access information about pods, containers, and their relationships, which is valuable for understanding the runtime environment.

Falco_Architecture

Reference

Falco stands out with its exceptional customization capabilities. It allows security engineers to tailor detection rules based on specific syscalls, container events, and network activities. This granularity minimizes false positives and enhances the accuracy of threat detection. Following the same logic, it is also possible to customize the output of the alerts thanks to the rule engine. Falco uses a custom rule language in YAML format; below is a custom rule to detect a reverse shell inside a pod.

- rule: Reverse shell
  desc: Detect reverse shell established remote connection
  condition: evt.type=dup and container and fd.num in (0, 1, 2) and fd.type in ("ipv4", "ipv6")
  output: Reverse shell connection (user=%user.name %container.info process=%proc.name parent=%proc.pname cmdline=%proc.cmdline terminal=%proc.tty)
  priority: WARNING
  tags: [container, shell, mitre_execution]
  append: false

You can find more details about this rule here.
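Shipping your own rules follows the same pattern: drop an extra YAML file next to the defaults (typically under /etc/falco/rules.d, or via the Helm chart's custom rules value). The rule below is a hypothetical example detecting writes under /etc from inside a container:

```shell
# Hypothetical custom rule, staged locally for illustration. The condition
# uses container.id != host so it does not depend on macros defined in the
# default ruleset.
cat > /tmp/custom-rules.yaml <<'EOF'
- rule: Write below etc
  desc: Detect any file opened for writing under /etc inside a container
  condition: evt.type in (open, openat, openat2) and evt.is_open_write=true and fd.name startswith /etc and container.id != host
  output: File opened for writing below /etc (file=%fd.name process=%proc.name container=%container.name)
  priority: ERROR
  tags: [filesystem, container]
EOF
echo "rule staged at /tmp/custom-rules.yaml"
```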

As with gVisor, here is the PoC to install and test Falco on EKS EC2 with Fluent Bit and CloudWatch.

GuardDuty EKS Runtime Protection

Similar to Falco, GuardDuty EKS runtime protection relies on an agent. This agent, known as the AWS EKS integration agent, is deployed as a DaemonSet in the EKS cluster. It runs on each node and is responsible for monitoring and collecting data. The agent collects data related to network traffic, DNS requests, and other activities within the EKS cluster. This data is sent to GuardDuty for analysis.

GuardDuty employs machine learning and threat intelligence to analyze the collected data. It looks for patterns and anomalies that may indicate security threats or malicious activities.
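For reference, runtime monitoring is toggled as a feature of an existing GuardDuty detector. The sketch below stages the feature payload; the detector ID is a placeholder, and the feature names should be verified against the current AWS CLI documentation:

```shell
# Enables EKS Runtime Monitoring and lets GuardDuty manage the agent
# DaemonSet itself (EKS add-on management). The aws command is echoed
# rather than executed here.
cat > /tmp/gd-features.json <<'EOF'
[
  {
    "Name": "EKS_RUNTIME_MONITORING",
    "Status": "ENABLED",
    "AdditionalConfiguration": [
      { "Name": "EKS_ADDON_MANAGEMENT", "Status": "ENABLED" }
    ]
  }
]
EOF
echo "aws guardduty update-detector --detector-id <detector-id> --features file:///tmp/gd-features.json"
```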

GuardDuty_Architecture

Reference

GuardDuty integrates with other AWS services, such as CloudTrail and VPC Flow Logs, to gain a comprehensive view of activity across the AWS environment, including the EKS cluster.

GuardDuty_EKS_Protection

Reference

When GuardDuty detects suspicious or malicious behavior, it generates alerts. These alerts can be configured to trigger notifications via AWS services like Amazon SNS (Simple Notification Service) or sent to AWS CloudWatch for further analysis and action. These findings can be integrated with AWS SecurityHub, which acts as a central repository for security-related information. SecurityHub can be configured to trigger a CloudWatch Event or EventBridge event when a new finding is created. Upon the event trigger, a Lambda function can be invoked to send a notification to a designated Slack channel, for example.
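The alerting pipeline described above hinges on an event pattern. As a sketch, the EventBridge pattern below matches GuardDuty findings on EKS clusters with at least medium severity; the field names follow the GuardDuty finding format, but check them against your actual findings:

```shell
# Event pattern for an EventBridge rule whose target would be the
# Slack-notifier Lambda. The severity filter (>= 4) drops low-severity noise.
cat > /tmp/guardduty-eks-pattern.json <<'EOF'
{
  "source": ["aws.guardduty"],
  "detail-type": ["GuardDuty Finding"],
  "detail": {
    "resource": { "resourceType": ["EKSCluster"] },
    "severity": [ { "numeric": [">=", 4] } ]
  }
}
EOF
echo "attach with: aws events put-rule --name guardduty-eks --event-pattern file:///tmp/guardduty-eks-pattern.json"
```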

Falco vs. GuardDuty

To determine the optimal solution, we will employ the following criteria:

1) Threat Detection: Evaluate the effectiveness of each solution in accurately detecting security threats, minimizing false positives, and identifying actual security incidents.

2) Customization: Assess the level of customization each solution offers, particularly in defining security policies and rules tailored to our environment.

3) Performance: Analyze the impact of each solution to ensure it does not adversely affect service performance.

4) Operational Overhead: Evaluate the operational overhead, including deployment, configuration, and ongoing management of each solution.

I will not be covering pricing as a criterion in this article because the results of the cost analysis were inconclusive. Although Falco and GuardDuty runtime protection utilize identical architectures, the volume of events processed impacts the cost.

Dependencies

In terms of dependencies, to run Falco in the least-privileged mode with the eBPF driver, the requirements vary based on the kernel version. On kernels below 5.8, Falco requires CAP_SYS_ADMIN, CAP_SYS_RESOURCE, and CAP_SYS_PTRACE. For kernel versions 5.8 and above, the required capabilities are CAP_BPF, CAP_PERFMON, CAP_SYS_RESOURCE, and CAP_SYS_PTRACE, as CAP_BPF and CAP_PERFMON were split out of CAP_SYS_ADMIN. For GuardDuty EKS Runtime Monitoring, the prerequisites are outlined in the Amazon GuardDuty documentation.
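The kernel-version dependency can be checked directly on a node. A minimal sketch:

```shell
# CAP_BPF and CAP_PERFMON were split out of CAP_SYS_ADMIN in Linux 5.8,
# which determines the capability set Falco's eBPF driver needs.
kver=$(uname -r | cut -d. -f1-2)
major=${kver%%.*}
minor=${kver#*.}
if [ "$major" -gt 5 ] || { [ "$major" -eq 5 ] && [ "$minor" -ge 8 ]; }; then
  echo "kernel $kver: CAP_BPF, CAP_PERFMON, CAP_SYS_RESOURCE, CAP_SYS_PTRACE"
else
  echo "kernel $kver: CAP_SYS_ADMIN, CAP_SYS_RESOURCE, CAP_SYS_PTRACE"
fi
```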

SWOT Tables

SWOT_Falco

SWOT Table Falco

SWOT_GD

SWOT Table GuardDuty EKS Runtime Protection

Comparison Criteria

1) Threat Detection

Both Falco and EKS Runtime Protection utilize eBPF probes, ensuring comparable accuracy in threat detection. Falco offers a significant advantage with its high degree of customization, allowing for fine-tuning and tailoring detection rules to specific needs. On the other hand, GuardDuty ingests logs from various AWS sources, providing a better context for the investigation.

In both cases, you will have to deal with a significant number of false positives, especially in an EKS environment. For GuardDuty, this issue is particularly evident with findings such as New Binary Executed and New Library Executed. For Falco, the rule for tampered logs (fluentd) outlines similar challenges. While it is easier to filter GuardDuty alerts, some filtering parameters may be missing, leading to an inability to filter or overly restrictive filtering that might miss important cases.

The machine learning module in GuardDuty is effective for detecting anomalous behaviors within EKS clusters, such as container compromises from the internet or developer account compromises. It can also detect internal abuses, like developers executing commands in production containers or the misuse of an incident-specific role in AWS for unintended purposes. Additionally, the threat intelligence module relies on third-party providers such as Proofpoint and CrowdStrike, offering a reliable source of intelligence with no observed false positives.

Overall, Falco offers more possibilities and benefits from a larger native library of detection rules compared to GuardDuty runtime and audit logs findings.

2) Customization

Falco excels in customization, offering extensive capabilities to customize the output of the alerts and define security rules tailored specifically to the environment or to specific CVEs, like this example with Log4j (CVE-2021-44228). This flexibility is crucial for adapting to evolving security threats and aligning with unique requirements. By contrast, customization is nonexistent in GuardDuty.

3) Performance

EKS Runtime Protection benefits from AWS's scalability and performance optimizations for EKS clusters. It is less likely to adversely affect service performance, especially at scale. That said, no performance impact has been observed on services where Falco has been deployed in production environments with heavy workloads.

4) Operational Overhead

As a managed service, EKS Runtime Protection significantly reduces operational overhead. It integrates seamlessly with EKS clusters. It is designed to work harmoniously with AWS services, ensuring straightforward setup, maintenance and operational use. In contrast, Falco requires more effort in deployment, configuration, maintenance and operational use, making EKS Runtime Protection a more convenient option for reducing administrative burdens.

Conclusion

There is no one-size-fits-all solution. The best choice depends on the specific requirements, environment and team capacity. Understanding the nuances of runtime security and the strengths of each solution allows for informed decision-making to enhance overall security posture.

Future directions involve further exploration of monitoring solutions integrating prevention capabilities as they evolve such as Tetragon, and the integration of distinct monitoring and preventive solutions. An example of this is Falco with gVisor on GKE (Google), leveraging the strengths of both tools to provide robust runtime security in containerized environments where gVisor sandbox is used as a source for creating and fine-tuning detection rules.

· 9 min read
Anthony Maia

Context

This website was initially created in May 2019 on a VPS (Scaleway) using Hugo and the docdock theme.

It had a homepage piosky.fr and two subdomains:

  • cs.piosky.fr to host the cheatsheet
  • cve.piosky.fr to reference all CVEs

Despite the relatively low time investment and cost, this setup still required ongoing management. The cost started at 3€ per month and increased over time, reaching more than 7€ per month. Additionally, some important features were missing natively in Hugo and the docdock theme. As a result, the goal was to identify a more efficient hosting solution that eliminates maintenance requirements, reduces costs, and offers minimal web analytics that respect visitors' privacy.

Solution

Cloudflare Pages

Cloudflare Pages is a static site hosting service that makes it easy to deploy and host your website. One of the great things about Cloudflare Pages is that it supports deploying websites built with static site generators, such as Hugo, Jekyll, and Docusaurus. The list of all supported frameworks is available here.

There are a number of pros and cons to using Cloudflare Pages to host a personal website. Some of the main benefits of using Cloudflare Pages include:

  • It's easy to use. With Cloudflare Pages, you can quickly and easily deploy your website without worrying about setting up and maintaining a web server.
  • It's fast. Cloudflare Pages is powered by Cloudflare's global content delivery network (CDN), which means your website will be delivered to users quickly, no matter where they are in the world.
  • It's secure. Cloudflare Pages offers built-in security features such as DDoS protection, SSL/TLS encryption, and a web application firewall (WAF) to protect your website from attacks.
  • It's free, although the Cloudflare free plan does impose some limits.

However, there are also some potential drawbacks to using Cloudflare Pages to host a personal website. These include:

  • It's a static site hosting service. Cloudflare Pages is designed for hosting static websites, which means it does not support dynamic features such as server-side scripting or a database. If you need these features, you may need to look for a different hosting solution.
  • It has limitations on file size and storage. Cloudflare Pages has limitations on the size of individual files and the total storage space available for your website. If your website has a large number of files or is particularly large, you may need to look for a different hosting solution.

Overall, the pros and cons of using Cloudflare Pages to host a personal website will depend on the specific needs of your website. If you have a small, static website and are looking for an easy-to-use, fast, and secure hosting solution, Cloudflare Pages may be a good choice. However, if you have a large or dynamic website, or if you need a lot of storage, you may need to look for a different hosting solution.

Docusaurus

Docusaurus is a static site generator specifically designed for creating documentation for open source projects. It has a number of features that make it great for creating documentation and blog posts, including easy navigation with responsive design, dark theme, and search.

  • It's meant for documentation and blogs. Docusaurus has built-in features to deploy full-featured blog and documentation sections.
  • It's easy to use. Docusaurus has a simple and intuitive interface that makes it easy to create and manage your website, even if you're new to static site generators.
  • It has a customizable home page. Docusaurus allows you to easily customize the home page for your website, including adding a banner, navigation links, and featured content.
  • It has built-in support for versioning. Docusaurus makes it easy to manage multiple versions of your website, allowing you to easily switch between different versions and manage the content for each version.
  • It has built-in search functionality. Docusaurus includes a built-in search engine that allows users to easily search for content on your website.
  • It has a responsive design and a dark theme. Docusaurus is built with a responsive design, which means your website will look great on a wide range of devices, from desktop computers to mobile phones.

Setup

Creating a Website with Docusaurus

To create a website with Docusaurus, you will need Node.js version 18.x. You can check your Node.js version by running the following command in your terminal:

node -v

The next step is to create a new Docusaurus project and run the development server:

npx create-docusaurus@latest my-website classic
cd my-website
npm run start

You can now access your website from http://localhost:3000.

Hosting a Website with Cloudflare Pages

Cloudflare Pages integrates seamlessly with GitHub, allowing you to easily deploy your static website from a GitHub repository to Cloudflare Pages. This integration makes it easy to automate the process of building and deploying your website, allowing you to quickly and easily update your website with the latest changes from your GitHub repository.

To use the GitHub integration with Cloudflare Pages, you will first need to have a Cloudflare account and have added your website to Cloudflare. Once you have done this, you can go to the Deploy tab for your website in the Cloudflare dashboard and click on the Connect to GitHub button. This will open a window where you can log in to your GitHub account and authorize Cloudflare to access your repositories.

Github_settings

After you have connected your GitHub account, you can select the repository that contains your website and specify the branch that you want to deploy.

Cloudflare

You can also configure the build settings for your website, such as the build command and the directory where the built files should be placed. For Docusaurus, you will need to create the NODE_VERSION variable with the value 18 in Build settings > Environment variables.

Once you have configured the settings for your website, you can click on the Deploy Site button to deploy your website to Cloudflare Pages. This will trigger a build of your website using the settings you specified, and the built files will be deployed to Cloudflare Pages.

After your website has been deployed, you can go to the Overview tab for your website in the Cloudflare dashboard to see the status of your deployment. From here, you can also configure automatic deployments, which will automatically deploy your website whenever new changes are pushed to your GitHub repository.

To use a custom domain that is managed by Cloudflare, click on the Custom domains tab of your project, then on Set up a custom domain, and enter the FQDN. This will create a CNAME entry in the DNS management section, where the Name is your domain and the Content is the domain automatically created by Cloudflare when it deploys the website. It is still possible to do the same configuration from another DNS manager.

Custom_Domain

This setup will save you time and effort, and make it easier to keep your website up-to-date and running smoothly.

Optional setup

Cloudflare Page Rules

Cloudflare Page Rules allow you to control how Cloudflare processes requests to your website. With Page Rules, you can create up to 3 rules in the free version that match specific URLs on your website and apply actions to those URLs, such as redirecting the request, modifying the cache settings, or adding / removing security features.

Each rule consists of a URL pattern that specifies which URLs the rule should apply to, and one or more actions that specify what should happen when a request is made to a matching URL.

Page Rules are useful for a number of different situations, but in this case it has been used to redirect users using the old subdomains to the new endpoints.

Pages Rules

In relation to the screenshot above, you will also need to create a DNS entry for the domain cs that points to the domain created by Cloudflare.

Protecting a Website with Cloudflare Zero Trust

In addition to migrating your website to Cloudflare Pages, you can also protect your website or certain parts of it by giving access only to authorized users using Cloudflare Zero Trust.

In this section, I present the fastest and easiest method to protect your website using Cloudflare Zero Trust with the free plan. However, it is worth noting that this service offers more features than what I will cover in this post, particularly with the paid version.

For small businesses or personal projects with basic use cases, the free plan can be a viable option in the context of this blog post. Nonetheless, it has some limitations, such as a maximum of 50 users and the lack of support for advanced authentication methods such as SSO.

To use Cloudflare Zero Trust to protect your website, follow these simple two steps:

Create an Access Group: To create an Access Group, navigate to the Access section and click on Access Groups. Here, you can give your group a name and define rules that determine who can access your website. If you are using this for personal use, select Emails and add your email address(es). This method has the advantage of not relying on the Cloudflare Access agent (WARP), and users do not need a password. Instead, they only need to provide their email address when authenticating, and they will receive an email with either a magic link to authenticate or a code to copy. For small business usage, select Emails ending in and enter your domain name @domain.com. If you require different criteria for user access, or if you also want to rely on the WARP agent, refer to the official documentation.

Apply the Access Group policy to your website: Navigate to the Access section, click on Applications and select Self-Hosted. You can then select your domain managed by Cloudflare and optionally the subdomain and the path of the website to protect (such as "private").

Application Policy Configuration

Proceed to the next page and provide a name for the policy. Then, select the access group you created earlier. After completing the setup, allow a minute before attempting to access your website to test whether it is functioning properly.

Login Page

Conclusion

Overall, this solution is a zero-cost strategy (custom domain excluded). It provides more possibilities, and maintenance and security are handled by Cloudflare, which also provides great analytics.