A story of Go, Terraform, Gitlab and AWS Lambda
Find the sources here.
Introduction
If you have the right attitude, interesting problems will find you.
As with many stories, this began at a bar, involving beer. I was explaining to a co-worker and friend that I was bored at the day job. I said something like
A month in [the new day job assignment] and I felt like I knew everything I needed to know.
One part of his answer was simple:
Well, if you really think so, try to build everything you're working on from scratch.
This wasn't really a wake-up call. I always try to build something new and to acquire new skills, in a lot of different domains. It sure felt like a challenge. But what could I build?
Then there was this curiosity, and these questions I'd had in mind for some time about some techs. From this and that came an idea, and then a plan. I won't tell you the questions, the idea or the plan, because:
- Suspense is important
- It would be too chatty, and this article would be less to the point
- I don't want to commit, in some way, to writing and coding more in my free time. While I enjoyed this, I need to do it at my own pace.
The first part
So, with this dramatic suspense in place, what's left to do is explain the goals of this first part:
- Build a basic AWS Lambda (Function as a Service) with Go
- Create the infrastructure using Terraform
- Implement a build/test/deploy CI/CD pipeline
- Have an easy way to deploy several environments (development, testing, production ...)
As you can guess, this set of goals lays out the strong foundations of any serious project:
- Choosing a stack (in this case, a cheap one)
- Being able to deploy easily and quickly
- Keeping a high degree of confidence (because the CI will run automatic tests).
Getting those basics right will give you a short feedback loop, meaning you can code the smallest changes, test them quickly, push them to production with a click, and then move on to the next thing. A quick feedback loop allows you to embrace change and put out those production fires quickly. Having and maintaining tests means you will never fix the same bug twice. In short, it lets you add value at a steady pace.
I built this project by mixing bits of these articles:
- Terraform's own tutorial using Node.js
- The article that does exactly what I wanted
- How to add Terraform commands in the CI
- How to push artifacts from Gitlab CI to S3
To avoid useless repetition and to keep things short and to the point, I won't comment on and explain all the source code here. I believe it's simple enough, and in case of doubt all the code explanations are provided by the links above. Instead, I focused on:
- The problems I encountered (hoping that someone else will run into them too, and find an answer here)
- The key decisions
Humble beginnings
When I started implementing this, I was building and deploying manually. That meant:
- Running tests (the motto is that if it's not tested, it doesn't work)
- Compiling
- Zipping the binary
- Running Terraform CLI
Getting the first Hello World
The first problem I ran into was an unhelpful error from AWS, and what appears to be a missing validation in the Terraform AWS provider.
API Gateway is an AWS service that lets you expose APIs to the world. We use it as the proxy in front of our Lambda invocation. It's organized this way:
- API ("root URL")
  - Resource (a path part of your URL)
    - HTTP Method
      - Integration (how to get the response - in our case we invoke the Lambda)
    - HTTP Method
  - Resource (a path part of your URL)
Then you create a deployment of this structure (for example, you can deploy the same API for development, testing and production).
The problem was, the deployment needs a stage name, but Terraform didn't force me to set one, and, well, I didn't. Everything was green during the Terraform run, and AWS happily accepted the missing stage name.
But when hitting the URL I got a mysterious "Forbidden" ... I quickly saw that I had forgotten to return the 200 status code from my function. But it still didn't work. Then I thought it was CORS (why would it be, since I was requesting the API Gateway hostname directly, and not through another website?). Still no luck. I re-read the tutorials I had used and finally spotted my mistake.
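For reference, here is a minimal sketch of the wiring described above, including the stage name that was missing - resource names and values are illustrative, not my exact configuration:

```
# API ("root URL")
resource "aws_api_gateway_rest_api" "hello" {
  name = "helloaws"
}

# Resource: a path part of the URL
resource "aws_api_gateway_resource" "hello" {
  rest_api_id = aws_api_gateway_rest_api.hello.id
  parent_id   = aws_api_gateway_rest_api.hello.root_resource_id
  path_part   = "hello"
}

# HTTP method on that resource
resource "aws_api_gateway_method" "get" {
  rest_api_id   = aws_api_gateway_rest_api.hello.id
  resource_id   = aws_api_gateway_resource.hello.id
  http_method   = "GET"
  authorization = "NONE"
}

# Integration: invoke the Lambda (the function is defined elsewhere in the configuration)
resource "aws_api_gateway_integration" "lambda" {
  rest_api_id             = aws_api_gateway_rest_api.hello.id
  resource_id             = aws_api_gateway_resource.hello.id
  http_method             = aws_api_gateway_method.get.http_method
  integration_http_method = "POST"
  type                    = "AWS_PROXY"
  uri                     = aws_lambda_function.hello.invoke_arn
}

# Deployment of the whole structure
resource "aws_api_gateway_deployment" "hello" {
  rest_api_id = aws_api_gateway_rest_api.hello.id
  # The part I had forgotten: without a stage name, the API answers "Forbidden".
  stage_name  = "development"
  depends_on  = [aws_api_gateway_integration.lambda]
}
```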
A taste of future problems
At this early stage, I had already run into the most important question of this exercise. The Lambda platform takes your code in the form of a zip file. You can either provide it by uploading the zip when creating or updating the Lambda, or you can point to an S3-stored object. If you use the former, you must call the AWS API each time with the latest version of your zip, and you won't run into any trouble.
But if you want to store your build artifact (the zip file) somewhere as a backup, then S3 is the way to go. You still need to notify AWS that the Lambda's code changed.
So here's the problem: how do we let AWS know, through Terraform, that our zip changed? The first way I tried was through the S3 object's ETag. I wanted Terraform to compute a hash of the zip and set it as the object's ETag. It would then detect that the file had changed, upload it again, and be smart enough to see that the Lambda depends on this object and should therefore be updated. But as I quickly moved on to the complete CI/CD workflow, I never got to check that this method works consistently.
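For what it's worth, that attempt boils down to a content hash on the S3 object; a rough sketch with placeholder bucket and file names (the ETag of a simple, non-multipart upload is an MD5 of the content, hence filemd5):

```
# Rough sketch of the ETag idea: hash the local zip so Terraform notices
# when it changes and re-uploads the object.
resource "aws_s3_bucket_object" "function_zip" {
  bucket = "helloaws-artifacts"   # placeholder bucket name
  key    = "helloaws.zip"
  source = "build/helloaws.zip"
  etag   = filemd5("build/helloaws.zip")
}
```

The Lambda resource then references this object through its s3_bucket and s3_key arguments; whether that dependency alone redeploys the function consistently is exactly what I never got to verify.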
Deployment must be one click away
One of the most valuable lessons I learned recently is that deployment must be one or two clicks away. As explained in the introduction, it's very important to have the quickest/shortest feedback loop, and automatic, quick and easy deployment is part of that.
Storing the infrastructure state somewhere
To achieve that, the first problem to deal with is Terraform state. After deploying an infrastructure, Terraform will keep a record of its existence, to be able to destroy it, or update it by applying only the necessary changes. Thanks to the state records, merging a branch or two with infrastructure changes is not hazardous when it comes to deployment.
When starting out with the Terraform CLI, you keep this state on your local file system. This is dangerous because it has no backup. You could version it in your VCS, but if you work in a team, that soon becomes very dangerous too, as you don't want concurrent Terraform runs modifying the state file in different branches of your VCS. Good luck merging the state file so that it really represents the existing, resulting infrastructure!
To deal with this, Terraform has various backends that can lock deployments so that only a single Terraform run happens at a time, and store the resulting state in a safe place. Two of them caught my attention:
- Terraform Cloud, which stores the state but can also integrate with your VCS service to automatically deploy and update your infrastructure.
  - Obviously, this is the "golden" solution, since it's made by the same company that created Terraform in the first place.
  - It sits somewhat outside your infrastructure, since you use it to set up your infrastructure ... which just feels better.
- S3/DynamoDB, which stores the state as an S3 object and uses DynamoDB for locking (see the sketch just after this list).
  - The advantage is that it avoids adding yet another cloud provider/service to my CI/CD stack (I already use AWS and Gitlab.com).
  - But it makes sense not to store your infrastructure state with the same cloud provider that hosts the infrastructure ... right?
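For completeness, the S3/DynamoDB option is just another backend block; a minimal sketch with placeholder names:

```
# Minimal sketch of the S3/DynamoDB backend: the state lives in an S3 object,
# and a DynamoDB table provides the lock. All names are placeholders.
terraform {
  backend "s3" {
    bucket         = "my-terraform-state"
    key            = "helloaws/terraform.tfstate"
    region         = "eu-west-1"
    dynamodb_table = "terraform-locks"
    encrypt        = true
  }
}
```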
So I chose Terraform Cloud. This involved:
- Creating a Terraform Cloud account
- Creating a Gitlab.com Terraform Cloud application on my Gitlab account. This means Terraform Cloud has access to all my repositories. The procedure is described here.
- Creating an AWS IAM access key with full privileges to let Terraform Cloud do its administrative job. While this might seem dangerous:
  - You can always audit all the actions performed with this access key using AWS CloudTrail.
  - You could also try to reduce this key's permissions to the bare minimum, but that would involve some guessing and a lot of back-and-forth experimentation ...
Build artifacts & Terraform Cloud
Another problem quickly arose: since I built my binary and Lambda zip file with Gitlab CI/CD, I obtained artifacts. But Terraform Cloud, when automatically polling for updates from Gitlab.com, does not have access to your CI artifacts. In fact, it doesn't even wait for your CI pipeline to complete successfully (so how could it get the artifacts?).
If you're worried at this point, let me put you at ease: Terraform Cloud does not apply your infrastructure changes automatically unless you enable it. When it gets updates from your VCS, by default it will only make a plan to update your infrastructure. That means it will compute what to do from the last stored state and the new Terraform configuration, and tell you what would change. You are then given buttons to apply (deploy) the plan or not.
The solution I applied is quite simple. Instead of having Terraform push the Lambda's zip package, I had Gitlab do it in the CI. I had to create another AWS access key for Gitlab (although this one is constrained to uploading files to a specific S3 bucket).
But then again, how do we let AWS know that our Lambda's code has been updated? Enter S3 object versioning. I enabled versioning on my function artifacts bucket, and asked Terraform to fetch the object's latest version id and use that specific version as the code to run in the Lambda.
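A minimal sketch of that idea, with placeholder bucket and key names (and using the older, inline versioning syntax of the S3 bucket resource):

```
# The artifacts bucket, with versioning enabled.
resource "aws_s3_bucket" "artifacts" {
  bucket = "helloaws-artifacts"   # placeholder name

  versioning {
    enabled = true
  }
}

# Read the latest version of the zip that the Gitlab CI pushed.
data "aws_s3_bucket_object" "function_zip" {
  bucket = aws_s3_bucket.artifacts.bucket
  key    = "helloaws.zip"
}

resource "aws_lambda_function" "hello" {
  function_name     = "helloaws"
  s3_bucket         = aws_s3_bucket.artifacts.bucket
  s3_key            = data.aws_s3_bucket_object.function_zip.key
  # Pinning the object version is what tells AWS the code changed.
  s3_object_version = data.aws_s3_bucket_object.function_zip.version_id
  handler           = "helloaws"
  runtime           = "go1.x"
  role              = aws_iam_role.lambda.arn   # execution role, defined elsewhere
}
```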
But ... since Terraform does not wait for the end of the Gitlab CI pipeline to start (especially the planning part, where it reads the S3 object's version id), it doesn't "just work": the new artifact is not in S3 yet. You have to manually queue another Terraform run after the CI run.
Still, I like this model because:
- I have remote storage, backup and versioning of my built functions.
- It would work well if I were delivering to a customer: my job is done once the artifact is pushed to the customer's S3 bucket. The customer can get the Terraform configuration automatically, and use their own AWS access keys and various Terraform input parameters to deploy the new version to whichever environment they want.
Environmental collision
About environments ... at some point you need at least two of them:
- One you can throw away, break and generally give a hard time, for testing changes and integrations (testing the merged result of parallel work on the project, working with external systems you can't run/call locally ...).
- One for live production, with data you can't just throw away, and the required level of stability and quality.
- You may also want one for QA, kept stable enough to run end-to-end tests before deploying to the live system.
It doesn't hurt to keep open the possibility of creating any number of environments, for experiments and for any future need. The list above is not a closed one, and the Terraform/CI/CD configuration should make changes to the "environment list" easy.
So I set out to enable multiple environments. It's my understanding that with Terraform, you do that with workspaces. A project (the same source repository, with the same Terraform configuration files) can have multiple workspaces. A workspace:
- Stores its own infrastructure state
- Sets input variable values for the Terraform run
- Manages a Terraform run queue
As mentioned before, the Terraform configuration can be made dynamic using input variables that are set when you run the tool. One prime example for this is a database connection string that your application might depend on.
In your Terraform configuration file, you declare and describe input variables and use them to set runtime or environment parameters of the application you're deploying. You then create one Terraform workspace per environment and set all those variables' values for each environment. That way your development environment will use a development database, and so on ...
With that said, I think it's best to limit the number of input variables, and to have Terraform create all the resources you depend upon using a naming pattern based on your environment/stage name. So if you need a database - one you should not share between applications/services - don't create it separately: declare it in your project's Terraform configuration, and derive its name and other unique identifiers from a stage name input variable.
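A minimal sketch of that pattern, with a hypothetical stage_name variable and an illustrative DynamoDB table standing in for "the database you depend on":

```
# The stage name is the input that varies between workspaces/environments.
variable "stage_name" {
  type        = string
  description = "Name of the environment (development, testing, production, ...)"
}

# Every resource the service owns is named after the stage, so two
# environments can never collide.
resource "aws_dynamodb_table" "data" {
  name         = "helloaws-${var.stage_name}"
  billing_mode = "PAY_PER_REQUEST"
  hash_key     = "id"

  attribute {
    name = "id"
    type = "S"
  }
}
```

The Lambda's function name, the API Gateway names and so on get the same "-${var.stage_name}" suffix.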
But there I made a rookie mistake again: while using variables for everything (unique identifiers and such, which end up looking like helloaws-development after the patch), I missed one - the resource that grants API Gateway permission to invoke the Lambda.
So when I deployed to production, the production infrastructure stole the Lambda invoke permission from the development infrastructure ... The development environment then started failing with a mysterious "Internal error" response.
With real-world software, this kind of collision would be a disaster: to fix it I had to destroy both infrastructures (so that includes the supposedly production one 😊) and re-deploy.
Given the worthlessness of this hello-world playground software, I admit I got a little impatient and hit the red switch. But with some more time and thought, and when required, it's probably fixable with a more delicate touch, one that wouldn't delete your entire production infrastructure.
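For reference, the resource in question is the Lambda permission. Parameterized by stage like everything else, it would look roughly like this (a sketch, not necessarily my exact configuration):

```
# Grants this stage's API Gateway the right to invoke this stage's Lambda.
# The key point: every identifier derives from var.stage_name, so one
# environment cannot steal another's permission.
resource "aws_lambda_permission" "apigw" {
  statement_id  = "AllowAPIGatewayInvoke-${var.stage_name}"
  action        = "lambda:InvokeFunction"
  function_name = aws_lambda_function.hello.function_name
  principal     = "apigateway.amazonaws.com"
  source_arn    = "${aws_api_gateway_rest_api.hello.execution_arn}/*/*"
}
```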
What we got so far
So here's what we have as code:
- A rather simple Lambda function written in Go
- An infrastructure configuration setting up the Lambda and the proxy in front of it, which can be used to set up multiple environments
- A CI script that builds, tests and pushes the function up to a delivery point.
- A SaaS that takes care of deploying and tracking infrastructure
The only thing that still bothers me a little is that the pipeline is not completely automatic. One solution would be to disable the VCS integration in Terraform Cloud, use the Terraform Cloud backend only to store state and prevent parallel deployments, and run the Terraform CLI automatically from the CI pipeline (which already runs Terraform to validate the configuration syntax).
On the other hand, I like the clean separation between build and deployment ... This needs some more practice and experimentation.
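For the record, that "state and locking only" setup would boil down to a backend block along these lines, with the VCS integration turned off (organization and workspace prefix are placeholders):

```
# Minimal sketch of using Terraform Cloud purely as a state/locking backend,
# with the CLI driven from the CI pipeline. Names are placeholders.
terraform {
  backend "remote" {
    hostname     = "app.terraform.io"
    organization = "my-organization"

    workspaces {
      # One workspace per environment: helloaws-development, helloaws-production, ...
      prefix = "helloaws-"
    }
  }
}
```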
Nothing is perfect, and another enhancement would be to split the Terraform configuration into multiple files. But at some point, you gotta sleep.