At Scalr we provide SaaS and on-premises software to manage cloud infrastructure (and it’s open-source), so we end up making lots and lots of API calls to the AWS APIs.
What’s great about scale is that when you’re making thousands of API calls a day, events with a 0.1% probability of occurring tend to happen multiple times a day, sometimes resulting in surprise, confusion, and hair pulling! One of those low-probability events is the subject of today’s blog post.
Let’s start with a quick test, shall we? Can you spot what’s wrong with this code?
import boto.ec2

# Connect to EC2
conn = boto.ec2.connect_to_region("us-east-1")  # Credentials are in the environment!

# Create a new SG
security_group = conn.create_security_group("test-sg", "Just making a point!")

# Define SG rules
ip_rules = [("tcp", 80, "0.0.0.0/0"), ("tcp", 443, "0.0.0.0/0")]

# Update the SG with the rules
for protocol, port, network in ip_rules:
    security_group.authorize(protocol, port, port, network)
If you can’t: read on!
The AWS APIs Are Fundamentally Eventually Consistent
When you hit the AWS API to create a security group, or allocate an Elastic IP, the response you get usually includes an ID for the resource you just created — and when it doesn’t, you get an error message explaining what went wrong, so you can fix your API call.
Now, popular belief (and the AWS documentation) suggests that as soon as you have the resource ID, you can make further API requests against it, like adding security rules, or associating an EIP, or adding tags to an instance. And in most cases, this will work.
But the truth is that the AWS APIs are in fact eventually consistent: the mere fact that you have a resource ID does not guarantee that the underlying resource actually exists.
In practice, this means that once in a while, your API call will return an ID that you can’t use right away, because the resource doesn’t exist yet. And if you try to use it anyway, you’ll get an error message like:
The security group 'sg-xxxxxxxx' does not exist
The allocation ID 'eipalloc-xxxxxxxx' does not exist
The instance ID 'i-xxxxxxxx' does not exist
Of course, when you retry the call a few minutes later (after you’ve gotten around to checking your logs), the security group is there, and so is the Elastic IP, and you’re left wondering: “Oh AWS, why can’t I use the security group I just created?”
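Since the resource does show up if you wait, one way to cope is to retry the follow-up call for a short while instead of failing on the first "does not exist" error. What follows is only a minimal sketch of that idea, not code from Scalr: the helper name retry_until_visible and its parameters are made up for illustration, and it assumes the exception raised by your client exposes an error_code attribute (as boto's EC2ResponseError does) whose value ends in ".NotFound", as in InvalidGroup.NotFound.

```python
import time


def retry_until_visible(call, attempts=5, initial_delay=0.5, sleep=time.sleep):
    """Run call(), retrying with exponential backoff on '*.NotFound' errors.

    Hypothetical helper: it assumes the raised exception carries an
    `error_code` attribute such as "InvalidGroup.NotFound".
    """
    delay = initial_delay
    for attempt in range(1, attempts + 1):
        try:
            return call()
        except Exception as exc:
            code = getattr(exc, "error_code", None) or ""
            # Re-raise anything that is not a "not found yet" error,
            # and give up once the last attempt has failed.
            if not code.endswith(".NotFound") or attempt == attempts:
                raise
            sleep(delay)
            delay *= 2  # wait longer before each new attempt
```

With the quiz code above, the loop body would then become something like retry_until_visible(lambda: security_group.authorize(protocol, port, port, network)). The attempt count and delays are arbitrary starting points, not measured values.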