Amazon Cloud Cost Management
“Quick! Cast a fireball or something!” cried the warrior as the undead horde gnawed at her shield.
“I can’t,” replied the wizard apologetically. “No more magic power; I blew it all on teleporting us here.”
“Well, that’s poor capacity planning. Didn’t you know how much magic power teleporting us would cost?” asked the warrior, her shield almost giving way under the weight of her enemies.
“Nope,” the wizard said simply. “I just know generally how much I have.”
“You’re not a very good wizard, are you?”
They were then both consumed by the immense undead workload.
Sorry, I’ve been playing a lot of video games lately. But the point is this: the wizard didn’t have visibility into how much each spell costs; he only had the lump sum of magic power he could spend.
In this article I’d like to discuss cost management in enterprise Cloud environments. To kick things off, let’s consider an example (don’t worry, this one actually has something to do with the point).
A few months ago a customer of ours had a contractor come in to train some IT folks on a piece of software. The training was 3 days long and required about 40 Linux servers for the students to install software on and play around with. An m3.medium general purpose instance on AWS EC2 costs about $1.60 per day, so 40 of them cost about $64 per day. That’s training infrastructure for 40 people for 3 days at just under $200 - not the cheapest deal you could get but not a bad one (but don’t forget the storage volumes cost money, too!).
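The arithmetic above is easy to script. Here’s a minimal sketch; the $1.60/day rate is my rough approximation, and real pricing varies by region and changes over time:

```python
# Back-of-the-envelope cost model for a short-lived training environment.
# The daily rate is an approximation; real cloud pricing varies by region
# and over time, and storage volumes add to the bill.

def training_compute_cost(daily_rate, server_count, days):
    """Total compute cost, ignoring storage and data transfer."""
    return daily_rate * server_count * days

print(training_compute_cost(daily_rate=1.60, server_count=40, days=3))   # ~$192 as planned
print(training_compute_cost(daily_rate=1.60, server_count=40, days=47))  # ~$3000 if forgotten for ~7 weeks
```

The second line shows how quickly "a few weeks of virtual dust" compounds into real money.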
You can probably see where I’m going with this. After the 3 days were up the contractor moved on, the students went back to their cubicles and the servers were left to gather virtual dust.
After a few weeks an administrator finally noticed the overspend and terminated the machines. The cost incurred was roughly $3000. This type of thing happens all the time in different variations. Let’s try to break down the challenges enterprises face around Cloud cost management and what can be done about them.
Even good wizards struggle with visibility
The first challenge enterprises face tends to be visibility: who’s spending, how much are they spending, and what are they spending on?
A cloud platform is typically the solution to a productivity issue. Developers need to deliver solutions fast and traditional datacenter workflows simply aren’t agile enough. Visibility becomes an issue when teams in different business units turn to different cloud services.
One of the issues financial administrators traditionally had with datacenters was the lack of deep visibility: at the end of the month, finance would be presented with a large lump sum representing how much was spent on infrastructure.
Funnily enough, the distributed nature of self-service cloud consumption in large enterprises produces the exact same problem! Since the various business units might have different requirements from a cloud platform, multiple services are used and central visibility becomes challenging. At the end of the month, finance is presented with a lump sum that represents how much “The Cloud” costs. Just like the old days.
Controlling the (undead) sprawl
I think my Dungeons & Dragons analogy has played itself out, but humor me just one more time.
Our inept wizard has a spell that reanimates the dead. The spell works like this: every time an enemy dies in the wizard’s vicinity - that enemy comes back as a zombie. The spell costs the wizard 2 magic power points per zombie. This means that taking out a large group of enemies might result in overspend. He needs to somehow control the undead sprawl. Now I’m done, I swear.
Here’s a fun fact: EC2 accounts for about 75% of AWS spend. That means enterprises are spending three times more on virtual machines than on all other cloud services combined, which makes controlling virtual machine sprawl a major part of reining in cloud costs.
As we mentioned earlier, this challenge manifests in many ways. In the training example the machines were left on, but what if the machines were terminated but all EBS volumes were orphaned? Or if someone provisioned powerful mega-servers when the most modest of instances might do the trick?
If the first half of the cost challenge is visibility, the second half is context: you’ve detected that you’re overspending and over-provisioning, so what are you going to do about it?
The context and control issues tie into the visibility challenge: we need to be able to quickly and easily determine the context of a piece of infrastructure. What application is that server a part of? What happens if it goes away? If we know that 40 servers were provisioned for a training session and that session is over, we know we can get rid of them safely.
But what about a more complicated scenario? What if a developer provisioned an application stack for a project but then left the company? How do we know what shutting that stack down will affect?
How are enterprises approaching these challenges?
“We get it, Ron” you might say, “are you just going to talk about how hard cloud cost management is or are you going to give us something actionable?”
Well I’m glad you asked! Let’s talk about the ways enterprises approach these challenges.
The control, context and visibility challenges are all connected. For example, deeper visibility through tags will provide you with information about who’s spending what, but will also provide context that might help rein in costs.
Since all the challenges we discussed affect one another it makes more sense to treat them as one and the same. When we discuss solutions we need to consider approaches that find a balance between the problems. It doesn’t make sense to just solve visibility or to just solve sprawl, and although this article discusses financial challenges it also doesn’t make sense to solve those without considering the needs of other groups in the enterprise.
Let’s consider two general approaches to cost management: reactive controls and preventative controls.
Let’s revisit the training example. We had 40 servers that were left running and incurred a charge of about $3000. I mentioned that a few weeks after the training an administrator finally noticed the overspend and terminated the instances. What I didn’t mention was that it was actually a reactive control that notified her about the spend.
A reactive control is a form of policy that detects usage anomalies. For example, we usually spend $500 on compute resources over the span of a few weeks; when all of a sudden the actual spend goes up to $3000, the alarm goes off. If the reactive control wasn’t in place, the company might have spent $30,000 instead of $3000 - that’s $27,000 of loss averted!
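The core of such a control is simple to sketch. This is a deliberately minimal illustration of the idea, not any particular cloud provider’s alerting API; the baseline and threshold values are assumptions:

```python
# Minimal sketch of a reactive spend control: compare actual spend against a
# baseline and raise an alert when an anomaly threshold is crossed. The
# baseline figure and 2x threshold are illustrative assumptions.

def spend_alert(baseline, actual, threshold=2.0):
    """Return an alert message when actual spend exceeds baseline * threshold, else None."""
    if actual > baseline * threshold:
        return f"ALERT: spend ${actual:.0f} exceeds {threshold}x the usual ${baseline:.0f}"
    return None

print(spend_alert(baseline=500, actual=3000))  # fires: six times the usual spend
print(spend_alert(baseline=500, actual=600))   # within tolerance: None
```

In practice the baseline would come from historical billing data rather than a constant, but the decision logic stays the same.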
The issue with reactive controls is typically agility: most cloud environments are dynamic and constantly changing. A reactive control fixes something once it’s broken or raises a flag once a line is crossed. This works well for many situations but complicates the problem in others: if a reactive control detects a mis-provisioned server it might scale it back down, but a cost might have already been incurred and a storage volume might be orphaned and forgotten about.
It’s a balance; reactive controls are perfect for certain use-cases. It’s generally bad practice to terminate resources immediately when a budget is exceeded - you don’t want your developer to lose her work, and doing so in production would be disastrous. Instead, a reactive control could simply notify the administrator when a certain percentage of a budget has been consumed, so a more informed decision could be made.
Once again, let’s take a look at our training example. While a reactive control helped the company avoid an even steeper case of overspend, a preventative control would have prevented the mess from ever happening. A preventative control is a guardrail that prevents users from making poor choices.
For example, a preventative control could have been set in place to give the training instances a set lifetime with an option for extension. After 3 days, all resources related to the training session would have been shut down. If for whatever reason the training was extended, a request with business justification would have been submitted.
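The lifetime guardrail can be expressed as a tiny policy function. The function names and the 3-day default here are illustrative assumptions, not a real provisioning system’s API:

```python
from datetime import datetime, timedelta

# Sketch of a lifetime guardrail: every provisioned resource gets an expiry
# deadline, and an extension requires an explicit (business-justified) request.
# Names and the 3-day default are illustrative assumptions.

def should_reclaim(provisioned_at, now, lifetime_days=3, extension_days=0):
    """True once the (possibly extended) lifetime has elapsed."""
    deadline = provisioned_at + timedelta(days=lifetime_days + extension_days)
    return now >= deadline

start = datetime(2016, 3, 1)
print(should_reclaim(start, now=datetime(2016, 3, 5)))                     # past the 3-day lifetime
print(should_reclaim(start, now=datetime(2016, 3, 5), extension_days=3))  # extension keeps it alive
```

A scheduler would evaluate this check periodically and shut down anything that answers true.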
In order to create these types of guardrails, the policy enforcement point needs to be a part of the provisioning workflow.
With these approaches in mind, let’s examine 5 things enterprises do to control and manage cloud costs:
1. Visibility Through Tagging
As we mentioned earlier the initial challenge is visibility. There are many things that can be done to gain deeper visibility into cloud spend. Out of those possible solutions the most popular is tagging.
Most cloud platforms employ some form of tagging. For reference, here’s how tags look on the major clouds:
For all clouds, tags (or labels or metadata) are key:value pairings. From a cost management perspective tags can be used to associate resources with cost centers, owners, projects, customers and more. Through tags administrators can gain better answers to the Who and the What questions.
Some general best practices around tags are:
- Use a standardized format for tags and implement it consistently across workloads
- Group your tags into categories such as “Cost Management”, “Technical Tags” and “Security”
- A good way to come up with a tagging strategy is to gather all the questions you need answered and create tags that answer those questions
Something you’ll often hear about tagging is that it’s a “garbage in, garbage out” paradigm. That’s true of most things in life, but it certainly applies to tagging: irrelevant tags are useless. For example, there’s no reason to tag information you can easily find through the console or an API call. A tag should answer a question like “Which application is using this server?”, “Who owns this resource?” or “What cost center is this resource associated with?”
The challenge, it would seem, is not coming up with a tagging strategy but implementing it. Let’s consider a simple 3-tier application where each tier has a single server and a single storage volume. For this example, we’d like to tag each resource with a cost center, owner and environment tag. That’s already 18 tags for a very simple application stack.
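Nobody should be typing 18 tags by hand, which is why tagging works best when it’s automated at provisioning time. Here’s a small sketch of that idea; the three required keys and the example values are assumptions for illustration:

```python
# Sketch of automated tagging at provisioning time: one standard tag set is
# applied to every resource in a stack, and provisioning fails fast if a
# required key is missing. The required keys are illustrative assumptions.

REQUIRED_KEYS = {"cost-center", "owner", "environment"}

def tag_stack(resources, tags):
    """Attach the same tag dict to every resource in the stack."""
    missing = REQUIRED_KEYS - tags.keys()
    if missing:
        raise ValueError(f"missing required tags: {sorted(missing)}")
    return {resource: dict(tags) for resource in resources}

stack = ["web-server", "web-volume", "app-server", "app-volume", "db-server", "db-volume"]
tagged = tag_stack(stack, {"cost-center": "CC-1234", "owner": "ron", "environment": "dev"})
print(sum(len(t) for t in tagged.values()))  # 6 resources x 3 tags = 18
```

The point is that the 18 tags become a single policy decision rather than 18 manual steps.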
2. Communicate Costs to End-Users
The self-service nature of Cloud makes it easy to just get what you need and move on. Typically a developer at a large enterprise will not lose any sleep over the financials of the company.
We might argue that it’s fine, the end-user shouldn’t care, that’s the whole point right? Letting users get what they need without having to worry about policy and budgets? Well, sort of.
Communicating the financial consequence of a provisioned resource to the developer creates another check in the “should this be provisioned?” decision tree. Ideally, the process will go along these lines:
1. Developer decides she needs to provision resources.
2. If she has the permissions and capacity the system allows her to select the desired resources from a catalog.
3. Before provisioning, the price and budget consumption (the financial consequence) of the desired resources is presented to the developer.
4. Resources are provisioned with the required policies applied.
Step 3, showing the financial consequence to the developer before provisioning, helps the end-user determine where this piece of infrastructure fits in an overall plan. It also serves as a reminder that someone will actually have to pay for these resources and that they might come at the expense of something else.
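A sketch of what step 3 might compute before the request is submitted. The catalog prices and budget figure are made-up illustrations, not real pricing:

```python
# Sketch of a pre-provisioning "financial consequence" preview: total the
# requested resources and express them as a share of the remaining budget.
# Prices and the budget figure are illustrative assumptions.

def financial_consequence(item_prices, quantity, days, budget_remaining):
    """Summarize what a provisioning request would cost."""
    total = sum(item_prices) * quantity * days
    pct = 100 * total / budget_remaining
    return f"${total:.2f} for {days} days ({pct:.0f}% of remaining budget)"

# e.g. a server + volume bundle at $1.60 + $0.10 per day, 5 instances for a week
print(financial_consequence([1.60, 0.10], quantity=5, days=7, budget_remaining=500))
```

Even a one-line summary like this is enough to make a developer pause before provisioning a mega-server.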
3. Reclamation Policies
Provisioning applications often drags in a multitude of resources we tend to forget about, namely storage volumes. Automated reclamation policies are controls set in place to rein in resources and prevent sprawl.
Examples of reclamation policies:
- Put all “orphaned” storage volumes in a clean-up queue and delete the ones that will not be reused.
- Set a lifetime for each provisioned application stack. Once the lifetime is exceeded terminate the application infrastructure. Use persistent storage and desired state engines to make sure you can easily power the stack back up if needed.
Reclamation means returning something to its initial state, and that’s exactly what reclamation policies are: automation for reclaiming resources once they are no longer needed. Automation is the key word here, since developers can’t be expected to remember when and under what circumstances application resources should be reclaimed.
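The first policy above, queueing orphaned volumes for cleanup, might look something like this. The volume records are simplified stand-ins for what a real cloud API would return:

```python
# Sketch of an orphaned-volume reclamation policy: volumes with no attachment
# go into a cleanup queue, except those explicitly flagged for reuse.
# The record format is a simplified stand-in for a real cloud API response.

def cleanup_queue(volumes):
    """Return the ids of unattached volumes that nobody has reserved for reuse."""
    orphaned = [v for v in volumes if v["attached_to"] is None]
    return [v["id"] for v in orphaned if not v.get("reuse", False)]

volumes = [
    {"id": "vol-1", "attached_to": "i-abc123"},          # still in use
    {"id": "vol-2", "attached_to": None},                # orphaned, reclaim
    {"id": "vol-3", "attached_to": None, "reuse": True}, # orphaned but reserved
]
print(cleanup_queue(volumes))  # ['vol-2']
```

Run on a schedule, a check like this stops forgotten volumes from quietly accruing charges.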
4. Lifecycle Management Through Autoscaling
One of the main reasons it’s difficult to run a cost-effective datacenter is over-provisioning. This translates into one of the major selling points for Cloud usage in general: you can use only what you need.
Having said that, the self-service model lends itself to over-provisioning as well. Autoscaling allows us to achieve the goal of only using what we need. Autoscaling can be used as a lifecycle management policy for provisioned infrastructure and prevent applications from costing more than they should.
Examples of lifecycle management policies with autoscaling:
- Scale down all development workloads during nighttime and power them back up during business hours.
- Scale up once a bandwidth threshold is met to accommodate traffic at peak times, without paying for extra resources when traffic is slower.
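The first policy above, powering down development workloads overnight, reduces to a simple schedule function. The hours and capacities here are assumptions for illustration:

```python
# Sketch of scheduled scaling for development workloads: full capacity during
# business hours, scaled down to nothing overnight. The hours and capacity
# figures are illustrative assumptions.

def desired_dev_capacity(hour, business_capacity=4, night_capacity=0,
                         open_hour=8, close_hour=19):
    """Desired instance count for a dev workload at a given hour (0-23)."""
    return business_capacity if open_hour <= hour < close_hour else night_capacity

print([desired_dev_capacity(h) for h in (3, 9, 14, 22)])  # [0, 4, 4, 0]
```

With roughly 11 business hours out of 24, a schedule like this cuts a dev workload’s compute bill by more than half without anyone lifting a finger.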
5. Standardized Provisioning Process
As we discussed earlier, the challenges of managing costs and their solutions are intertwined. Beyond that, the challenges enterprises face with running Cloud in general are all connected. If developers have an easier time getting the resources they need, they won’t turn to platforms that management has no visibility into.
A standardized provisioning process is the heart of effective Cloud usage. It’s the balance between all the challenges we discussed and the ones we didn’t.
Essentially, a standardized process means that every action or request is placed in the right funnel and has the appropriate policies enforced on it. On one end the developer gets the quick and easy process she needs to get her job done; on the other, Finance, IT and Security gain visibility and have their policies enforced.
Example of a possible workflow:
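One possible shape for that funnel, tying together the budget preview, automatic tagging and lifetime policies discussed earlier. Every name and policy here is illustrative, a sketch of the idea rather than any particular product:

```python
# Sketch of a standardized provisioning funnel: a single request passes a
# budget check, gets auto-tagged, and receives a lifetime before anything is
# provisioned. All names, policies, and figures are illustrative assumptions.

def provision_request(user, item, cost_per_day, days, budget_remaining):
    """Run one catalog request through the policy funnel."""
    estimated_cost = cost_per_day * days
    if estimated_cost > budget_remaining:
        return {"approved": False, "reason": "budget exceeded"}
    return {
        "approved": True,
        "tags": {"owner": user, "item": item},  # visibility for finance
        "lifetime_days": days,                  # reclamation guardrail
        "estimated_cost": estimated_cost,       # financial consequence, up front
    }

print(provision_request("dev1", "linux-vm", cost_per_day=1.60, days=3, budget_remaining=100))
print(provision_request("dev1", "gpu-vm", cost_per_day=200.0, days=30, budget_remaining=100))
```

The developer sees one quick request; finance, IT and security each get their policy enforced inside the same funnel.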