Why Ably Switched From Terraform Cloud To Scalr

Huw McNamara

Senior Site Reliability Engineer at Ably

Here at Ably, we’ve engineered a serverless WebSocket platform that makes it easy to reliably handle realtime data distribution to millions of web and mobile apps at the edge. Terraform is central to our infrastructure management. We use it to define and manage resources across a wide variety of providers including AWS, Snowflake, PagerDuty, and Grafana.

The ability to use the same workflow and tooling to control different providers is a major selling point for Terraform, and one of the reasons we switched from AWS CloudFormation.

We initially used HashiCorp’s Terraform Cloud (TFC) as our managed Terraform service provider. It gave us remote state management along with remote operations, so engineers were not deploying to production from their local machines.

Please Wait in the Queue

As Ably grows, so does our infrastructure footprint, which means more and more Terraform workspaces and Terraform operations. We began noticing that runs were getting stuck in a queue. Digging into the details of our TFC plan, we found that it only included one runner. This was sufficient when we first started using Terraform, but it was time for an upgrade.

HashiCorp does not provide detailed pricing information on their website, so we set up a call with their sales team to find out more. What came back in the quotes led us to evaluate other providers.

Ably’s Terraform Requirements

When evaluating different Terraform service offerings, the key areas we looked at were:

  1. Migration effort - will we need to drastically change our current code base and ways of working?
  2. State management - is state storage part of the service?
  3. SLA - Terraform is critical to our workflow, so we need a reliable service with an SLA to back it.
  4. Support terms - if something goes wrong, how quickly can we expect a response?
  5. Cost, including future growth - if we grow by 10x, will the pricing still be competitive?
  6. Disaster recovery - are we going to be dead in the water if the provider suffers an outage?

We discovered that Scalr is the only other Terraform service provider to support the Remote, and now Cloud, Terraform backend configurations. This was vital both for the developer experience and from an operational perspective. Without these features, we would have had to spend significant resources rolling out our own state management system, and setting up access control in such a way that engineers could still perform a terraform plan locally but not have permission to run terraform apply. Being able to perform a terraform plan against a production workspace without committing code gives engineers the short feedback loops they need.

Thankfully, Scalr also ticked all the other boxes, offering a 99.9% uptime SLA, 2-hour ticket response time, and a very easy-to-use calculator on their pricing page. 

Scalr’s custom hooks give us the ability to run scripts at specific events in the Terraform workflow, such as post-apply. This supports disaster recovery (DR), by automatically exporting state to an external store after every successful apply, as well as Terraform CDK, which we wanted to use in future projects.
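As an illustration of the DR export idea, here is a minimal Python sketch of a script that a post-apply hook could invoke. It is not our actual hook: the function names and the use of a local directory as a stand-in for the external store are assumptions made for the example. The only Terraform specifics it relies on are the `terraform state pull` command and the top-level `serial` and `lineage` fields of a state file.

```python
import json
import subprocess
from pathlib import Path


def pull_state() -> str:
    """Fetch the current state as JSON text using the Terraform CLI."""
    result = subprocess.run(
        ["terraform", "state", "pull"],
        check=True,
        capture_output=True,
        text=True,
    )
    return result.stdout


def backup_path(state_text: str, dest_dir: Path) -> Path:
    """Name the backup after the state's lineage and serial (hypothetical scheme)."""
    state = json.loads(state_text)
    return dest_dir / f"{state['lineage']}-{state['serial']}.tfstate"


def export_state(state_text: str, dest_dir: Path) -> Path:
    """Write the state to dest_dir, a stand-in for a real external store."""
    dest_dir.mkdir(parents=True, exist_ok=True)
    target = backup_path(state_text, dest_dir)
    target.write_text(state_text)
    return target
```

A hook would call `export_state(pull_state(), Path(...))` after a successful apply. Because every apply that changes the state increments its serial, naming backups by lineage and serial keeps a history of copies rather than overwriting a single file.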

Switching Over from HashiCorp Terraform Cloud to Scalr

There were two stages to our switch from TFC to Scalr:

  1. Creating all the prerequisite resources, such as Scalr environments and workspaces, and AWS IAM roles (the build-up).
  2. Migrating the state for all of the workspaces from TFC to Scalr.

The goal throughout this process was to minimize any disruption to engineers and allow them to use TFC until the last second, a feat we accomplished with around 20 minutes of downtime at the end of it all.

Build-up

First things first: access control to AWS. Scalr supports AWS IAM role delegation, which meant we could use temporary credentials to give Scalr access and remove the AWS IAM user we had needed for TFC.

We used the Scalr Terraform Provider to effectively make Scalr manage itself. This required an initial bootstrap step: manually creating one workspace and a Scalr service account for it to use.

At the time of migration, Scalr only supported attaching one set of cloud credentials to an environment. As a result, we opted to have an environment per AWS account we deploy to. With the newer provider configurations, and the ability to have multiple sets of AWS credentials in the same environment, we would likely have grouped our workspaces by project rather than by AWS deployment account.

All of our workspaces in TFC were created using Terraform, with the code stored in our main infrastructure repo. This meant creating the workspaces in Scalr was straightforward: we branched off of main, added the Scalr Terraform provider and changed tfe_workspace to scalr_workspace, and updated a few variable names. Scalr would then execute a run and create the needed resources. If a new workspace was added to TFC, the Scalr VCS branch was updated to match.

State Migration

Now that all of the workspaces were available, it was time to move the state. Using both Scalr’s and TFC’s APIs, we wrote a bash script to migrate state for every workspace in our repo. For each workspace, the script:

  1. Parsed the workspace name from code.
  2. Found the corresponding workspace_id in both TFC and Scalr.
  3. Downloaded the existing state from TFC and Scalr.
  4. If a state in Scalr existed, the serials of the two copies were compared to check whether the Scalr copy needed updating. The lineages were also compared as an extra safety check, to ensure nothing had gone drastically wrong.
  5. If the TFC state serial was larger or the state in Scalr did not exist, the state was uploaded to Scalr.
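The decision made in steps 4 and 5 can be sketched as follows. Our script was written in bash; this is a Python illustration of the same logic rather than the script itself, and the names `needs_upload` and `LineageMismatch` are hypothetical. It assumes each state has already been downloaded and parsed from JSON, and uses the top-level `serial` and `lineage` fields found in every Terraform state file.

```python
from typing import Optional


class LineageMismatch(Exception):
    """Raised when the two copies do not belong to the same state history."""


def needs_upload(tfc_state: dict, scalr_state: Optional[dict]) -> bool:
    """Return True when the TFC state should be uploaded to Scalr.

    scalr_state is None when the Scalr workspace has no state yet.
    """
    # No state in Scalr yet, so the TFC copy is always uploaded.
    if scalr_state is None:
        return True

    # Safety check: a different lineage means something has gone badly wrong,
    # so stop instead of overwriting the Scalr copy.
    if tfc_state["lineage"] != scalr_state["lineage"]:
        raise LineageMismatch(
            f"TFC lineage {tfc_state['lineage']!r} does not match "
            f"Scalr lineage {scalr_state['lineage']!r}"
        )

    # Upload only when TFC holds a newer revision of the state.
    return tfc_state["serial"] > scalr_state["serial"]
```

Treating a lineage mismatch as a hard error, rather than skipping the workspace quietly, is a choice made for this sketch: it forces a human to look before any state is overwritten.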

With the checks for state serial, we could run this script on demand to transfer the latest state files from TFC to Scalr. This allowed us to easily run a plan on each Scalr workspace to check that it had the correct permissions and variables, and that Terraform did not show any differences in the plan output compared to a run on TFC.

The final preparation step was updating the backend configuration blocks of any workspace in our repo to use the new Scalr values. We did this in a separate branch to ensure a PR was ready to merge after migration.

Going Live

On the go-live day, we locked all TFC workspaces, ran the script to migrate state once more, and executed plans on all Scalr workspaces one last time. After they had passed, we merged the PRs we had open and asked engineers to update any feature branches they were working on.
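For anyone planning a similar cut-over, locking every workspace can be scripted. The sketch below is one way to do it in Python and is an assumption about approach, not a record of how we did it. It uses the workspace lock action of the TFC API (`POST /workspaces/:workspace_id/actions/lock`); the function names, workspace IDs, and token handling are hypothetical.

```python
import json
import urllib.request

TFC_API = "https://app.terraform.io/api/v2"


def build_lock_request(workspace_id: str, token: str, reason: str) -> urllib.request.Request:
    """Build (but do not send) the request that locks one TFC workspace."""
    url = f"{TFC_API}/workspaces/{workspace_id}/actions/lock"
    body = json.dumps({"reason": reason}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        method="POST",
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/vnd.api+json",
        },
    )


def lock_workspaces(workspace_ids: list, token: str, reason: str) -> None:
    """Lock each workspace in turn; any HTTP error propagates to the caller."""
    for workspace_id in workspace_ids:
        request = build_lock_request(workspace_id, token, reason)
        with urllib.request.urlopen(request) as response:
            response.read()
```

Keeping request construction separate from sending makes it possible to inspect exactly what would be sent before running it against production workspaces.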

It all went without a hitch!

Our Experience With Scalr

We have been using Scalr for over 3 months now, and we’re very happy with the experience. As well as Scalr being reasonably priced, we have a great relationship with the Scalr team, who are always eager to hear product suggestions and help out with any questions.

Quite a few of the features we have asked for have now made their way into the product, such as git submodule support and pre-init custom hooks.

We look forward to all the features currently in development!

About Ably

Ably gives you the capabilities to deliver the live experiences your customers demand without go-live delays, runaway costs, and unhappy users. Ably’s Serverless WebSocket platform reliably handles high-scale realtime data distribution to web and mobile apps at the edge, so engineering teams can focus on core product innovation without having to provision and maintain complex realtime infrastructure.

Developers at companies like HubSpot, Toyota, and Webflow use our APIs and global edge network to power things like business-critical live chat, food order delivery tracking, and document collaboration for more than 300 million people each month.

If this sounds like something you’d like to be part of, have a look at our open roles (all remote-first) and come join us.