Posts from April 2023

gVisor improves performance with root filesystem overlay

Friday, April 21, 2023

Overview

Container technology is an integral part of modern application ecosystems, making container security an increasingly important topic. Since containers are often used to run untrusted, potentially malicious code, it is imperative to secure the host machine from the container.

A container's security depends on its security boundaries, such as user namespaces (which isolate security-related identifiers and attributes), seccomp rules (which restrict the syscalls available), and Linux Security Module configuration. Popular container management products like Docker and Kubernetes relax these and other security boundaries to increase usability, which means that users need additional container security tools to provide a much stronger isolation boundary between the container and the host.

The gVisor open source project, developed by Google, provides an OCI-compatible container runtime called runsc. It is used in production at Google to run untrusted workloads securely. Runsc (run sandbox container) is compatible with Docker and Kubernetes and runs containers in a gVisor sandbox. The gVisor sandbox has an application kernel, written in Golang, that implements a substantial portion of the Linux system call interface. All application syscalls are intercepted by the sandbox and handled in the user-space kernel.

Although gVisor does not introduce large fixed overheads, sandboxing does add some performance overhead to certain workloads. gVisor has made several improvements recently that help containerized applications run faster inside the sandbox, including an improvement to the container root filesystem, which we will dive deeper into.

Costly Filesystem Access in gVisor

gVisor uses a trusted filesystem proxy process (“gofer”) to access the filesystem on behalf of the sandbox. The sandbox process is considered untrusted in gVisor’s security model. As a result, it is not given direct access to the container filesystem and its seccomp filters do not allow filesystem syscalls.

In gVisor, the container rootfs and bind mounts are configured to be served by a gofer.

Gofer mounts configuration in gVisor

When the container needs to perform a filesystem operation, it makes an RPC to the gofer which makes host system calls and services the RPC. This is quite expensive due to:

  1. RPC cost: This is the cost of communicating with the gofer process, including process scheduling, message serialization and IPC system calls.
    • To ameliorate this, gVisor recently developed a purpose-built protocol called LISAFS which is much more efficient than its predecessor.
    • gVisor is also experimenting with giving the sandbox direct access to the container filesystem in a secure manner. This would essentially nullify RPC costs as it avoids the gofer being in the critical path of filesystem operations.
  2. Syscall cost: This is the cost of making the host syscall that actually accesses or modifies the container filesystem. Syscalls are expensive because they perform context switches into the kernel and back into userspace.
    • To help with this, gVisor heavily caches the filesystem tree in memory. So operations like stat(2) on cached files are serviced quickly. But other operations like mkdir(2) or rename(2) still need to make host syscalls.

Container Root Filesystem

In Docker and Kubernetes, the container’s root filesystem (rootfs) is based on the filesystem packaged with the image. The image’s filesystem is immutable. Any change a container makes to the rootfs is stored separately and is destroyed with the container. This way, the image’s filesystem can be shared efficiently with all containers running the same image. This is different from bind mounts, which allow containers to access the bound host filesystem tree. Changes to bind mounts are always propagated to the host and persist after the container exits.

Docker and Kubernetes both use the overlay filesystem by default to configure container rootfs. Overlayfs mounts are composed of one upper layer and multiple lower layers. The overlay filesystem presents a merged view of all these filesystem layers at its mount location and ensures that lower layers are read-only while all changes are held in the upper layer. The lower layer(s) constitute the “image layer” and the upper layer is the “container layer”. When the container is destroyed, the upper layer mount is destroyed as well, discarding the root filesystem changes the container may have made. Docker’s overlayfs driver documentation has a good explanation.
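
To make the layering concrete, the following is a minimal Go sketch (Linux-only, standard library) of how such an overlay mount is assembled. The directory paths are purely illustrative and are not the layout Docker or gVisor actually uses.

package main

import (
    "log"
    "syscall"
)

func main() {
    // lowerdir: the read-only image layers (colon-separated, topmost first).
    // upperdir: the writable container layer. workdir: overlayfs scratch space.
    opts := "lowerdir=/img/layer2:/img/layer1,upperdir=/ctr/upper,workdir=/ctr/work"
    if err := syscall.Mount("overlay", "/merged", "overlay", 0, opts); err != nil {
        log.Fatalf("overlay mount failed: %v", err)
    }
    // /merged now presents the merged view; all writes land in /ctr/upper,
    // while the image layers under /img remain untouched.
}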

Rootfs Configuration Before

Let’s consider an example where the image has files foo and baz. The container overwrites foo and creates a new file bar. The diagram below shows how the root filesystem was previously configured in gVisor: the sandbox went through the gofer to access and mutate the overlaid directory on the host. It also shows the state of the host overlay filesystem.

Rootfs configuration in gVisor earlier

Opportunity! Sandbox Internal Overlay

Given that the upper layer is destroyed with the container and that it is expensive to access/mutate a host filesystem from the sandbox, why keep the upper layer on the host at all? Instead we can move the upper layer into the sandbox.

The idea is to overlay the rootfs using a sandbox-internal overlay mount. We can use a tmpfs upper (container) layer and a read-only lower layer served by the gofer client. Any changes to rootfs would be held in tmpfs (in-memory). Accessing/mutating the upper layer would not require any gofer RPCs or syscalls to the host. This really speeds up filesystem operations on the upper layer, which contains newly created or copied-up files and directories.

Using the same example as above, the following diagram shows what the rootfs configuration would look like using a sandbox-internal overlay.

Rootfs configuration in gVisor with internal overlay

Host-Backed Overlay

By default, the tmpfs mount will use the sandbox process’s memory to back all the file data in the mount. This can cause sandbox memory usage to blow up and exhaust the container’s memory limits, so it’s important to back the upper layer’s file data with disk instead of sandbox memory. We need a tmpfs-backing “filestore” on the host filesystem. Using the example from above, this filestore on the host will store file data for foo and bar.

This essentially flattens all regular files in tmpfs into one host file. The sandbox can mmap(2) the filestore into its address space, which allows it to access and mutate the filestore very efficiently, without incurring gofer RPC or syscall overheads.
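
As a rough illustration of the mechanism, the Go sketch below (Linux-only, standard library) opens a host file, reserves some space, and maps it into the process address space. The file name and size are hypothetical and do not reflect gVisor's actual filestore implementation.

package main

import (
    "log"
    "os"
    "syscall"
)

func main() {
    // Open (or create) the host-backed filestore.
    f, err := os.OpenFile("/tmp/filestore", os.O_RDWR|os.O_CREATE, 0600)
    if err != nil {
        log.Fatal(err)
    }
    defer f.Close()

    const size = 1 << 20 // reserve 1 MiB of backing space (illustrative)
    if err := f.Truncate(size); err != nil {
        log.Fatal(err)
    }

    // Map the filestore into the address space. Reads and writes through
    // this slice go straight to the host file's pages, with no per-access
    // syscalls or gofer RPCs.
    data, err := syscall.Mmap(int(f.Fd()), 0, size,
        syscall.PROT_READ|syscall.PROT_WRITE, syscall.MAP_SHARED)
    if err != nil {
        log.Fatal(err)
    }
    defer syscall.Munmap(data)

    copy(data, []byte("file data for foo"))
}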

Self-Backed Overlay

In Kubernetes, you can set local ephemeral storage limits. The upper layer of the rootfs overlay (the writable container layer) on the host contributes towards this limit. The kubelet enforces this limit by traversing the entire upper layer, stat(2)-ing all files, and summing up their stat.st_blocks * block_size. If we move the upper layer into the sandbox, then the host upper layer is empty and the kubelet will not be able to enforce these limits.
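
For illustration, the Go sketch below performs the same du-style accounting: walk a directory tree, stat(2) every entry, and sum st_blocks (which Linux reports in 512-byte units). The path is hypothetical and this is not the kubelet's actual implementation.

package main

import (
    "fmt"
    "io/fs"
    "log"
    "path/filepath"
    "syscall"
)

func main() {
    var usedBytes int64
    err := filepath.WalkDir("/var/lib/example/upper", func(path string, d fs.DirEntry, err error) error {
        if err != nil {
            return err
        }
        var st syscall.Stat_t
        if err := syscall.Lstat(path, &st); err != nil {
            return err
        }
        usedBytes += st.Blocks * 512 // st_blocks is counted in 512-byte units
        return nil
    })
    if err != nil {
        log.Fatal(err)
    }
    fmt.Printf("ephemeral storage used: %d bytes\n", usedBytes)
}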

To address this issue, we introduced “self-backed” overlays, which create the filestore in the host upper layer. This way, when the kubelet scans the host upper layer, the filestore will be detected and its stat.st_blocks should be representative of the total file usage in the sandbox-internal upper layer. It is also important to hide this filestore from the containerized application to avoid confusing it. We do so by creating a whiteout in the sandbox-internal upper layer, which blocks this file from appearing in the merged directory.

The following diagram shows what rootfs configuration would finally look like today in gVisor.

Rootfs configuration in gVisor with self-backed internal overlay

Performance Gains

Let’s look at some filesystem-intensive workloads to see how rootfs overlay impacts performance. These benchmarks were run on a gLinux desktop with the KVM platform.

Micro Benchmark

The Linux Test Project provides an fsstress binary. This program performs a large number of filesystem operations concurrently, creating and modifying a large filesystem tree of all sorts of files. We ran this program on the container's root filesystem. The exact usage was:

sh -c "mkdir /test && time fsstress -d /test -n 500 -p 20 -s 1680153482 -X -l 10"

You can use the -v flag (verbose mode) to see what filesystem operations are being performed.

The results were astounding! Rootfs overlay reduced the time to run this fsstress program from 262.79 seconds to 3.18 seconds! However, note that such microbenchmarks are not representative of real-world applications and we should not extrapolate these results to real-world performance.

Real-world Benchmark

Build jobs are very filesystem-intensive workloads. They read a lot of source files, compile, and write out binaries and object files. Let’s consider building the abseil-cpp project with Bazel. Bazel performs a lot of filesystem operations in the rootfs, in its cache located at ~/.cache/bazel/.

This is representative of the real-world because many other applications also use the container root filesystem as scratch space due to the handy property that it disappears on container exit. To make this more realistic, the abseil-cpp repo was attached to the container using a bind mount, which does not have an overlay.

When measuring performance, we care about reducing the sandboxing overhead and bringing gVisor performance as close as possible to unsandboxed performance. Sandboxing overhead can be calculated using the formula overhead = (s - n) / n, where ‘s’ is the time taken to run a workload inside the gVisor sandbox and ‘n’ is the time taken to run the same workload natively (unsandboxed). The following graph shows that rootfs overlay halved the sandboxing overhead for the abseil build!

The impact of rootfs overlay on sandboxing overhead for abseil build

Conclusion

Rootfs overlay in gVisor substantially improves performance for many filesystem-intensive workloads, so that developers no longer have to make large tradeoffs between performance and security. We recently made this optimization the default in runsc. This is part of our ongoing efforts to improve gVisor performance. You can learn more about gVisor at gvisor.dev. You can also use gVisor in GKE with GKE Sandbox. Happy sandboxing!

By Ayush Ranjan, Software Engineer, gVisor

Google’s Open Source Security Upstream Team: One Year Later

Tuesday, April 18, 2023

The Google Open Source Security Team (GOSST) was created in 2020 as a response to the increase of software supply chain attacks—those targeting software through vulnerabilities in other projects or infrastructure they depend on. In May 2022, GOSST announced the creation of the Google Open Source Maintenance Crew (now simply "GOSST Upstream Team"), a dedicated staff of GOSST engineers who spend 100% of their time working closely with upstream maintainers on improving the security of critical open source projects. This post takes a look back at the first year of the team’s activities, including goals, successes, and lessons learned along the way.

Why GOSST Upstream?

Every year, open source software becomes an even greater part of the software landscape—in fact, 2022 saw the most downloads of open source software ever. However, this popularity means open source software becomes an ever more appealing target for malicious actors.

Often, open source maintainers cite lack of time or resources to make security improvements to their projects or maintain them long-term. As a response, GOSST started working hands-on with some of the projects most critical to the open source ecosystem to help reduce the burden of implementing security enhancements, and assist with any additional maintenance incurred. Our goal with our contributions is to convey our respect, gratitude, and support for their valuable work.

What we do

We based our approach on examples from open source communities and recommendations from experienced maintainers and contributors:

  • A GOSST engineer manually analyzes each project to understand their context and security needs.
  • We follow the project's contribution guidelines to suggest the most impactful improvement selected specifically for the project's situation.
  • We file issues before creating any PRs to initiate a discussion and address any maintainer questions or concerns.
  • We focus on improvements that can be made via pull request. Maintainers are busy, and we want to do as much of the work as possible for them to reduce their workload.
  • We treat our contributions as conversations. Maintainers know their project better than we do, and their input helps us ensure our security improvements satisfy their unique requirements.
  • We welcome all feedback on our work and related technologies, and if necessary, we work with the relevant teams to make improvements.
  • Depending on the need, we follow up with additional improvements and check in to address any ongoing maintenance needs for changes we’ve made.

This approach allows us to suggest and support improvements while still respecting maintainers’ time and effort.

Achievements since last year's announcement

Following this working model, we have proposed security improvements and made ourselves available to more than 181 critical open source projects since mid-2022 including widely-used projects such as numpy, etcd, xgboost, ruby, typescript, llvm, curl, docker, and more. We contacted them using whichever method was requested in the project’s guidelines: most interactions were via GitHub issues/PRs, but some were via email mailing lists, Jira boards, or internal forums.

To analyze each project we used OpenSSF Scorecard, which evaluates how well a project follows a set of security best practices. The result of the analysis is used to gather insight on possible future enhancements for the team to implement for the project, usually aiming to fix known vulnerabilities or to take advantage of opportunities for security improvement. The contributions from this approach often include many common best practices, such as:

  • Adding a Security Policy (example)
  • Limiting unnecessary permissions on workflows (example)
  • Rewriting workflows that include dangerous patterns, such as allowing remote code execution in the repository context (example)
  • Adding the OpenSSF Scorecard action, to automatically check security status and report results (example)
  • Pinning dependencies (example)
  • Adopting a dependency-updating tool, and more

The team also looked for ways that projects could adopt the OpenSSF's Supply-chain Levels for Software Artifacts (SLSA) framework to harden their build and release processes. These contributions help secure projects against tampering and guarantee the integrity of their released packages.

A significant part of our contributions also helped projects adopt the OpenSSF Scorecard GitHub Action. When it's added to a project, each change to the code base triggers a review of security best practices and alerts projects to changes that could regress or harm their security posture. It also suggests actionable and specific changes that could enhance this posture. Additionally, the tool integrates with the OSV Scanner, which evaluates a project's transitive dependencies looking for known vulnerabilities.

As a result of these interactions with open source maintainers, the team received and conveyed valuable feedback for the tools we suggested. For example, we were able to address potential friction points for maintainers who adopt Scorecard, such as improving scoring criteria to better reflect their efforts to secure their projects. We’ve also gathered frequently asked questions we received from maintainers and created an FAQ to answer them. Additionally, the team added to documentation for the SLSA framework by clarifying the decoding process to access SLSA provenance created with the OpenSSF SLSA GitHub generator tool.

Still to come in 2023

In the year ahead, we’re going to expand our support to even more repositories, while still focusing on the projects most critical to the open source ecosystems. We'll also revisit projects we've already contributed to and see if there's more support we can offer. At the same time, we’re going to double down on our efforts to improve usability and documentation for security tools to make it even easier for maintainers to adopt security improvements with as little effort as possible.

Moving forward, we’ll also further encourage maintainers to take advantage of the OpenSSF's Secure Open Source Rewards program. This Linux Foundation program financially rewards developers for improving the security of important open source projects, and has given $274,014 to open source maintainers to date. Maintainers can then choose to either have us raise PRs and implement the changes for them, or be financially rewarded for making the changes themselves; we'd still be available to answer questions that come up along the way.

Takeaways and how to get in touch

Based on the positive responses from open source maintainers this past year, we’re happy to count our efforts as time well spent. It’s clear that many open source maintainers welcome help from the companies that rely on their projects.

We’re constantly getting in touch with open source communities and love talking to folks about how we can help address security issues. The door is always open for community members to contact us too! Please reach out about anything related to our upstream work at gosst-upstream@google.com.

We’re just getting started and look forward to giving back to even more projects. We hope other companies will join us and also invest in full-time developer teams dedicated to supporting the open source communities we all rely on.

By Diogo Sant'Anna and Joyce Brum, Google Open Source Security Team