Broken drift detection with Terraform

When the resources of a Terraform provider (e.g. Azure) define “computed” optional attributes, these are not properly covered by Terraform’s drift detection. As I will demonstrate, this can lead to non-reproducible infrastructure. I also examine how Terraform behaves depending on what you declare in your configuration file, the content of your state file, and the state of your actual infrastructure.

Introduction

Terraform is one of the most popular and most established tools to create cloud infrastructure. In contrast to proprietary tools built by the cloud vendors (such as Azure’s Bicep, GCP Cloud Deployment Manager, or AWS’s Cloud Development Kit), which are all “stateless”, Terraform puts internal state data in a tfstate JSON file, which it stores either locally or in a remote storage backend.

This state file is needed for many reasons: most importantly, it helps Terraform correlate infrastructure that you defined in your configuration file (something.tf) with the actual infrastructure that Terraform created for you. This lets Terraform differentiate between infrastructure managed elsewhere (e.g. that you created by clicking in the AWS or Azure web portal) and infrastructure that is under management by Terraform. There are several other reasons for having a state file, as described in Terraform’s documentation on the purpose of state.

If you look at the state file, you will notice that it not only relates the resources of your configuration file to the (cloud-generated) resource-IDs, but it also contains all attributes of the resource (including optional ones and their values), for caching reasons.
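To make this concrete, here is a minimal Go sketch that reads a state file and lists the attributes cached for each resource. It assumes the version 4 state layout (`resources[].instances[].attributes`). The `sampleState` document is hand-written and heavily trimmed for illustration; it is not real Terraform output, and the `id` value is a placeholder.

```go
package main

import (
	"encoding/json"
	"fmt"
	"sort"
)

// stateFile models only the parts of a version 4 tfstate file used here.
type stateFile struct {
	Resources []struct {
		Type      string `json:"type"`
		Name      string `json:"name"`
		Instances []struct {
			Attributes map[string]any `json:"attributes"`
		} `json:"instances"`
	} `json:"resources"`
}

// sampleState is a hand-written, trimmed example (assumption, not real output).
const sampleState = `{
  "version": 4,
  "resources": [{
    "mode": "managed",
    "type": "azurerm_storage_account",
    "name": "example",
    "instances": [{
      "attributes": {
        "id": "<resource-id>",
        "account_tier": "Standard",
        "access_tier": "Hot"
      }
    }]
  }]
}`

// listAttributes returns "type.name.attribute = value" lines, sorted so the
// output is stable regardless of map iteration order.
func listAttributes(raw []byte) ([]string, error) {
	var st stateFile
	if err := json.Unmarshal(raw, &st); err != nil {
		return nil, err
	}
	var out []string
	for _, r := range st.Resources {
		for _, inst := range r.Instances {
			for k, v := range inst.Attributes {
				out = append(out, fmt.Sprintf("%s.%s.%s = %v", r.Type, r.Name, k, v))
			}
		}
	}
	sort.Strings(out)
	return out, nil
}

func main() {
	lines, err := listAttributes([]byte(sampleState))
	if err != nil {
		panic(err)
	}
	for _, l := range lines {
		fmt.Println(l)
	}
}
```

Note that `access_tier` shows up in the state even if the configuration file never mentions it.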

Examining divergences in configuration, state and infrastructure

When you define infrastructure and have Terraform create it, it is actually defined in three different places:

  • Your configuration file (e.g. main.tf)
  • Your state file
  • The actual infrastructure

Whenever there are divergences between any of these three places, it is easy to understand what Terraform does, on a conceptual level: Terraform either creates infrastructure that is only defined in your configuration, or it deletes infrastructure that exists in reality and in your state, but is missing in your configuration.

However, there are edge cases no one ever cared to discuss – so let’s look at them now. For instance, what will Terraform do if you both deleted the actual infrastructure and removed the corresponding resource entry from your configuration, but it is still present in the state file? The following table shows all possible combinations, tested with an azurerm_storage_account. A ✅ in the table indicates that the storage account is defined in the corresponding place:

| Configuration (.tf file) | State | Actual infra | Effect |
|---|---|---|---|
| ✅ | ✅ | ✅ | Nothing: infrastructure is already up-to-date |
| ✅ | ✅ | ❌ | Terraform recreates actual infrastructure, updates the state |
| ✅ | ❌ | ✅ | Happens if you “lose” your state. Terraform plan only checks (“refreshes”) those resources present in the state, and therefore thinks that it should create the resource (not knowing that it already exists). When Terraform apply tries to create the resource, it fails with an error such as Error: A resource with the ID ... already exists - to be managed via Terraform this resource needs to be imported into the State. Please see the resource documentation for "<resource-type>" for more information. After you manually import the state, the situation equals row #1 of this table. |
| ✅ | ❌ | ❌ | Terraform creates actual infrastructure, adds it to the state |
| ❌ | ✅ | ✅ | Terraform deletes the actual infrastructure and removes the resource from the state |
| ❌ | ✅ | ❌ | plan and apply find no differences. Terraform removes the resource from the state. However, after printing that no changes were found, the plan command still exits with the following error: Error: Resource node has no configuration attached. The graph node for <path-to-resource> has no configuration attached to it. This suggests a bug in Terraform's apply graph builder; please report it! |
| ❌ | ❌ | ✅ | Nothing: the existing actual infrastructure is not under Terraform’s management, so Terraform leaves it alone |
| ❌ | ❌ | ❌ | Nothing |
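The eight combinations can also be written down as a small decision function. The Go sketch below is only a model of the behavior I observed, not Terraform’s actual implementation. The function name `effect` and the short result strings are made up for illustration, and the mapping of each combination to its effect is read off the effect descriptions in the table.

```go
package main

import "fmt"

// effect models the observed outcome for one resource, depending on whether
// it is present in the configuration, the state, and the real infrastructure.
func effect(inConfig, inState, inInfra bool) string {
	switch {
	case inConfig && inState && inInfra:
		return "no-op"
	case inConfig && inState && !inInfra:
		return "recreate infrastructure, update state"
	case inConfig && !inState && inInfra:
		return "apply fails: resource already exists, import required"
	case inConfig && !inState && !inInfra:
		return "create infrastructure, add to state"
	case !inConfig && inState && inInfra:
		return "delete infrastructure, remove from state"
	case !inConfig && inState && !inInfra:
		return "remove from state"
	default:
		// Neither in the configuration nor in the state:
		// Terraform does not know about the resource at all.
		return "no-op"
	}
}

func main() {
	bools := []bool{true, false}
	for _, c := range bools {
		for _, s := range bools {
			for _, i := range bools {
				fmt.Printf("config=%-5v state=%-5v infra=%-5v -> %s\n", c, s, i, effect(c, s, i))
			}
		}
	}
}
```

The `default` branch covers both rows where the resource is absent from configuration and state: whether or not it exists in reality makes no difference to Terraform.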

The above table examines only creation and deletion – but what about modification? Let’s see what happens if we change an attribute (that can be updated in-place, not forcing a recreation), tested with the static_website attribute of an Azure storage account:

| Case no. | Configuration (.tf file) | State | Actual infra | Effect |
|---|---|---|---|---|
| 1 | A | A | A | Nothing |
| 2 | A | A | B | Terraform corrects the value in the actual infrastructure to A (“drift detection”) |
| 3 | A | B | A | Terraform detects no changes, updates the state to A |
| 4 | A | B | B | Terraform changes actual infrastructure to A, updates state to A |
| 5 | A | B | C | Terraform changes actual infrastructure to A, updates state to A |

As we can see, the value stored in the state is completely irrelevant for Terraform’s decision regarding making changes to your actual infrastructure. For instance, Terraform behaves exactly the same in cases #1 and #3 (or for the cases #2 and #4). Only the value defined in your configuration file counts.
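Expressed as code, the rule from the five cases is short. The following Go sketch models the observed behavior only; the names `result` and `apply` are hypothetical and have nothing to do with Terraform’s internals. The `state` parameter is accepted but deliberately never read, which is the whole point.

```go
package main

import "fmt"

// result describes what happens to one in-place updatable attribute.
type result struct {
	ChangeInfra bool   // does Terraform modify the real resource?
	NewInfra    string // value in the real infrastructure afterwards
	NewState    string // value in the state file afterwards
}

// apply models the observed behavior: only configuration and actual
// infrastructure are compared. The state value is intentionally ignored
// and simply overwritten with the configured value.
func apply(config, state, infra string) result {
	return result{
		ChangeInfra: config != infra,
		NewInfra:    config,
		NewState:    config,
	}
}

func main() {
	cases := [][3]string{
		{"A", "A", "A"},
		{"A", "A", "B"},
		{"A", "B", "A"},
		{"A", "B", "B"},
		{"A", "B", "C"},
	}
	for n, c := range cases {
		r := apply(c[0], c[1], c[2])
		fmt.Printf("case %d: config=%s state=%s infra=%s -> change infra: %v\n",
			n+1, c[0], c[1], c[2], r.ChangeInfra)
	}
}
```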

Non-reproducible infrastructure due to Computed attributes

When you create resources with Terraform, they are typically offered by some provider, such as Terraform’s AWS or Azure provider. The developers of these providers define a set of attributes for each resource. For instance, the account_tier attribute of the azurerm_storage_account resource is defined (in Golang) as follows (source):

"account_tier": {
	Type:     pluginsdk.TypeString,
	Required: true,
	ForceNew: true,
	ValidateFunc: validation.StringInSlice([]string{
		"Standard",
		"Premium",
	}, false),
}

For each attribute, the developer can configure the data type (string, boolean, …), limit the set of allowed values, and configure whether the attribute is required or optional (in your configuration file). Attributes that are optional come in two flavors:

  • Optional with a default value, which is set in the Golang code,
  • Optional with the Computed field set to true, which means that if the user does not specify a value in the configuration file, the cloud platform chooses the default value for you

In other words: for any optional attribute that you omit in your configuration file, the default value is either enforced by Terraform (on the client-side), or by the cloud platform (server-side). Unfortunately, the generated documentation might not disclose which case it is, which can lead to non-reproducible infrastructure and broken drift detection.

Let’s take a look at a concrete example where this happens. The access_tier attribute of the azurerm_storage_account is documented as follows:

Official documentation for the access_tier attribute

access_tier – (Optional) Defines the access tier for BlobStorage, FileStorage and StorageV2 accounts. Valid options are Hot and Cool, defaults to Hot

However, as the source code shows, the access_tier attribute is an optional computed attribute. If you do not specify its value in your configuration file, the value in your actual infrastructure could be either Hot or Cool – Terraform won’t detect any changes either way, and won’t enforce anything – there is no drift detection for this attribute!
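The difference between the two flavors of optional attributes can be sketched in a few lines of Go. This is a simplified model under stated assumptions, not the provider SDK: `attrSchema` is a hypothetical stand-in that keeps only the fields discussed here, and `desired` and `driftDetected` are names I made up. The `withDefault` schema shows how the attribute would behave if it had a client-side default of Hot, which it does not.

```go
package main

import "fmt"

// attrSchema is a simplified, hypothetical stand-in for the SDK's schema
// struct; only the fields relevant to this discussion are modelled.
type attrSchema struct {
	Optional bool
	Computed bool
	Default  string // empty means "no client-side default"
}

// desired returns the value Terraform would enforce, and whether it enforces
// anything at all. configured is nil when the attribute is omitted.
func desired(s attrSchema, configured *string) (string, bool) {
	switch {
	case configured != nil:
		return *configured, true
	case s.Default != "":
		return s.Default, true
	default:
		// Omitted and no client-side default: whatever value the
		// cloud platform reports is accepted as-is.
		return "", false
	}
}

// driftDetected reports whether a plan would propose changing the actual value.
func driftDetected(s attrSchema, configured *string, actual string) bool {
	want, enforced := desired(s, configured)
	return enforced && want != actual
}

func main() {
	computed := attrSchema{Optional: true, Computed: true}
	withDefault := attrSchema{Optional: true, Default: "Hot"} // hypothetical
	hot := "Hot"

	fmt.Println(driftDetected(computed, nil, "Cool"))    // omitted, computed
	fmt.Println(driftDetected(withDefault, nil, "Cool")) // omitted, client-side default
	fmt.Println(driftDetected(computed, &hot, "Cool"))   // set explicitly in the config
}
```

Only the first call reports no drift: with a computed attribute that is omitted from the configuration, an actual value of Cool goes unnoticed.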

As a consequence, your infrastructure is non-reproducible: if you recreate your infrastructure in a different region or AZ (e.g. in case of a disaster recovery scenario), Azure will choose the Hot default value, even if your previously existing infrastructure used the Cool value (configured manually by an admin). This might affect your bill 🤣, or have other repercussions.

This behavior also breaks your (reasonable) assumption that if you import existing infrastructure (using the terraform import command), you would only need to copy those attributes from the state to your configuration file that are either required, or are optional but have a different value than the documented default value. As demonstrated, this assumption is unfortunately wrong.

Conclusion: Set all attributes

The way I see it, the only reliable solution is an ugly one: you have to set all the attributes that you care about in your configuration file, including those optional ones with a non-nil default value. If you import existing infrastructure, just copy all attributes from the state to your configuration file, and let terraform apply tell you which attributes you must delete from the configuration file again because they are purely computed.
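Finding the attributes that still need to be copied is a simple set difference. The Go sketch below is a hypothetical helper, not an existing tool: `unpinned` and the sample maps are illustrative, and it assumes you have already extracted the attribute names and values from the state and from your configuration by some other means.

```go
package main

import (
	"fmt"
	"sort"
)

// unpinned returns the names of attributes that exist in the state but are
// not set in the configuration, i.e. candidates for copying over.
func unpinned(stateAttrs, configAttrs map[string]string) []string {
	var out []string
	for name := range stateAttrs {
		if _, ok := configAttrs[name]; !ok {
			out = append(out, name)
		}
	}
	sort.Strings(out) // stable output regardless of map iteration order
	return out
}

func main() {
	// Illustrative values only; "<resource-id>" is a placeholder.
	state := map[string]string{
		"account_tier": "Standard",
		"access_tier":  "Cool",
		"id":           "<resource-id>",
	}
	config := map[string]string{
		"account_tier": "Standard",
	}
	for _, name := range unpinned(state, config) {
		fmt.Printf("%s = %q  # from state, review before copying\n", name, state[name])
	}
}
```

The result deliberately includes purely computed attributes such as `id`. The sketch cannot tell them apart from optional computed ones, so Terraform still has to reject those for you.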

While this inflates your configuration file, you get 100% reproducible infrastructure. Also, you are safe from unexpected changes in the default values that the cloud provider (or the developer of the Terraform provider) could make at any time, which are easy to miss (unless you study every change log in detail).

Did you run into similar issues or unexpected behavior with Terraform? Please let me know in the comments.
