How automated reasoning helps Amazon S3 innovate at scale
When we launched Amazon S3 a little over fifteen years ago, we put simplicity at the center of the customer experience. S3’s core API is just four REST commands (GET, PUT, LIST, and DELETE). Our customers tell us that they love the simplicity of S3 not just because of its API, but especially because of the things that they don’t have to think about. S3 is elastic, meaning that customers can store as much or as little data as they like without worrying about provisioning storage space ahead of time. S3 is durable, offering an “11 9s” design for data durability that keeps data safe even if an entire building is lost. In addition to these, S3 is secure, highly available, and offered as a foundational storage service in all AWS Regions around the world.
Over the fifteen years since S3 launched, we’ve learned a lot about offering this simplicity at scale. S3 is built from over 300 software “microservices” that collaborate to form a distributed storage system. Hundreds of S3 engineers work every day to operate, evolve, and improve these microservices. One of the biggest lessons we’ve learned as S3 has grown is the power of building tools that help our engineers write correct software up front and to catch and fix bugs as early as possible. These mechanisms help us move fast while also holding a high bar for quality and reliability. For example, we’ve talked at re:Invent about the different ways that we think about durability and the mechanisms that we use to protect it.
Automatic correctness-checking tools are powerful because they make our engineers more confident and productive. Like a well-built test suite, these tools give engineers the peace of mind that they can innovate without introducing regressions. But unlike a test suite, these tools can check literally billions of combinations of inputs and system states to validate correct behavior. For us, this is a critical element in scaling S3. As the systems that we work with get bigger and more advanced, we need to be more deliberate about preventing this scale from impacting the engineering process.
The future we’re working towards: provably correct systems
We are working towards a world in which the systems we build are provably correct. That is, to mathematically define specifications of the important properties of our software and have formal proofs that our systems never violate those specifications. These specifications will allow us to give strong assurances about durability, security, availability, and performance both to our engineers internally and to our customers. Provable correctness (or, as it’s often referred to in computer science research, formal verification) isn’t a new idea at all. There have been huge advances recently that we’re excited about. Researchers have demonstrated approaches to provable correctness on all sorts of large systems. We’re already using provable correctness to power new S3 features like Zelkova that give customers strong assurances about the security of their data.
But there are still some challenging problems to solve to make these ideas practical at S3’s scale. Specifications are often large and complex and need to be written by experts. Verifying implementation against a specification needs to be automated, often referred to as automated reasoning. For large systems like S3, this automation is computationally challenging. Most importantly, we need to find mechanisms for provable correctness that are agile and flexible enough to be evolved alongside S3 microservices that are growing and improving every day.
Where are we today? Using lightweight formal methods to validate a key-value storage node
At the upcoming ACM Symposium on Operating Systems Principles, engineers from Amazon S3 will present a paper on our progress in solving some of these provable correctness challenges. The paper describes our experiences designing and building ShardStore, our new storage node microservice. ShardStore is the service we run on our storage hardware, responsible for durably storing S3 object data. ShardStore is a ground-up re-thinking of how we store and access data at the lowest level of S3, incorporating over 15 years of experience managing data on disks so that we can get the greatest possible drive-level efficiency. ShardStore is also an essential component for durability in S3, and so we want to catch bugs early and comprehensively.
Traditionally, provable correctness comes with high overhead, with the formal proof needing up to 10x more effort than just building the system itself. For ShardStore, we developed a new lightweight automated reasoning approach that gives us nearly all of the benefits of traditional formal proofs but with far lower overhead. Our approach employs techniques like property-based testing and model checking to automate the validation of ShardStore’s Rust implementation code in a variety of scenarios, including transient and permanent storage device failures and Availability Zone power failures. The result is that for ShardStore, automated reasoning requires only about 13% more code on top of the implementation.
We’re focused on building automated mechanisms to help our engineers ensure correctness of every change they make, so whatever we built for ShardStore needed to give strong assurances not just once, but on future changes as well. This challenge starts with specifications. As we evolve and improve S3, the expected behavior of ShardStore also evolves, and so we need to keep the specification up to date. We do this by embedding the specification into the ShardStore code base and writing it in Rust rather than a more formal language like TLA+. This embedding makes it simple for our engineers to update the specification at the same time as they implement new features or optimizations. Our team can then code-review both the code and specification changes together. In short, applying automated reasoning becomes a standard part of the software engineering process, not a one-time or “extra” bit of work.
Once the specification is updated, we need to validate that future code changes are still correct. We use our lightweight automated reasoning techniques to validate every single deployment of ShardStore. Before any change reaches production, we check its behavior in hundreds of millions of scenarios by running our automated tools using AWS Batch. To support this type of scalable checking, we developed and open-sourced the new Shuttle model checker for Rust code, which we use to validate concurrency properties of ShardStore. Together, these approaches provide a continuous and automated correctness mechanism for one of S3’s most important microservices.
Our work on ShardStore is just the beginning of our efforts to make provable correctness a day-to-day part of our engineering processes at AWS. We’re continuing to innovate both on the fundamentals of automated reasoning and on its applications at S3 scale. Within S3, our Automated Reasoning Group is a team of scientists and engineers building the tools and infrastructure we need to make this goal possible (and they’re hiring!). We’re excited for provable correctness to continue to make S3 the best storage on the planet for data used by cloud-based applications.