DJANGO DEVOPS

Deploying your Django static files to AWS (Part 4)

Code deploy

19/03/2020


This is the last part of this series. In the first 3 posts, we saw how to serve our static files through a CDN and host them on the AWS Elastic File System, using Whitenoise to serve them to CloudFront. By using EFS, we were able to share the static files amongst various instances, thereby enabling us to set up an autoscaling group behind a load balancer. In this part we'll go through the steps required to deploy new versions of your Django project.


The previous parts of this series can be found here: part 1, part 2 and part 3.

Part 4: How to deploy

Deploying is now a bit trickier than with a single instance. Here are the steps you'd normally follow on a single instance:

  • ssh into your instance
  • git pull the latest code on your master branch
  • manage.py migrate if there are any database migrations
  • manage.py collectstatic to move the static files to the static root folder (now your EFS volume), if you made any changes to your static files.
  • Restart gunicorn in order to reload the Django framework and your app.

I'm describing the steps as if they were manual, but this can of course be automated using a deploy script. You could use Python Fabric to write your deploy script, or use a more elaborate framework like Ansible or Chef. I'm not going to cover those tools in this tutorial.
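If you want a starting point, here's a minimal sketch of such a deploy script using Fabric 2; the host, project path, virtualenv location and gunicorn service name are placeholders for your own setup.

import sys

from fabric import Connection


def deploy(host='deploy@www.example.com', project_dir='/srv/myproject'):
    """Run the single-instance deploy steps listed above over ssh."""
    c = Connection(host)
    with c.cd(project_dir):
        c.run('git pull origin master')
        c.run('venv/bin/python manage.py migrate --noinput')
        c.run('venv/bin/python manage.py collectstatic --noinput')
    # Restart gunicorn so the new code is loaded (assumes a systemd unit called gunicorn)
    c.sudo('systemctl restart gunicorn')


if __name__ == '__main__':
    deploy(*sys.argv[1:])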

In the new situation, we need to make sure that all instances in your autoscaling group are updated, and that the launch template is updated too, since that's what your autoscaling group uses to launch new instances. Otherwise you'd still be launching outdated instances.

There are quite a few different deployment strategies, like blue-green deployments, canary deployments, A/B testing and more. I'm going to focus on a ramped deployment, because it's the most straightforward in most cases. A ramped deployment is one where you gradually replace old instances with new ones and have them running side by side during your deployment.

So in the new situation, the steps to deploy are:

  1. Create a new instance (outside the autoscaling group), which we will call management instance (management.example.com), from the launch template. This instance has the same code as your currently running instances.
  2. ssh into the management instance
  3. git pull to update the code on the instance
  4. manage.py migrate to apply migrations to your database. In 99% of cases this is safe to do; we'll cover the remaining 1% later.
  5. manage.py collectstatic to move the static files to the static root folder and create the static manifest file.
  6. Restart gunicorn on the management instance.
  7. Check that the website is working by browsing to management.example.com. Now there are some complications here which I will explain later.
  8. Create an AMI from the management instance. 
  9. Create a new version of the launch template using this AMI and its associated snapshot.
  10. Replace the instances in your autoscaling group one by one: start by making a list of all current instances in the group. Add a new instance first by changing the desired capacity (and, if needed, the maximum capacity) to x + 1, where x is your current desired capacity; AWS will automatically launch a new instance based on the latest version of the launch template. Then de-register an old instance from the target group, checking its instance ID against the list you made at the beginning, and a new instance will replace it. Continue until all old instances have been removed, then adjust the desired capacity back to x. (A rough boto3 sketch of steps 8 to 10 follows after this list.)
  11. Now you can stop your management instance. I'll explain later why it's a good idea to always have a management instance running.
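For reference, here's a rough boto3 sketch of steps 8 to 10. All IDs and ARNs are placeholders, and it takes a shortcut compared to the console steps above: it explicitly terminates each old instance after de-registering it and simply sleeps while the autoscaling group launches the replacement, where a real script should poll the target group until the new instance is healthy.

import time

import boto3

ec2 = boto3.client('ec2')
autoscaling = boto3.client('autoscaling')
elbv2 = boto3.client('elbv2')

MGMT_INSTANCE_ID = 'i-0123456789abcdef0'     # your management instance
LAUNCH_TEMPLATE_ID = 'lt-0123456789abcdef0'  # your launch template
ASG_NAME = 'my-asg'                          # your autoscaling group
TARGET_GROUP_ARN = 'arn:aws:elasticloadbalancing:eu-central-1:123456789012:targetgroup/my-targets/abc123'

# Step 8: create an AMI from the management instance and wait until it's available
image = ec2.create_image(InstanceId=MGMT_INSTANCE_ID, Name='deploy-%d' % int(time.time()))
ec2.get_waiter('image_available').wait(ImageIds=[image['ImageId']])

# Step 9: register a new launch template version with the new AMI and make it the default
version = ec2.create_launch_template_version(
    LaunchTemplateId=LAUNCH_TEMPLATE_ID,
    SourceVersion='$Latest',
    LaunchTemplateData={'ImageId': image['ImageId']},
)
ec2.modify_launch_template(
    LaunchTemplateId=LAUNCH_TEMPLATE_ID,
    DefaultVersion=str(version['LaunchTemplateVersion']['VersionNumber']),
)

# Step 10: list the old instances, add capacity, then replace them one by one
# (if desired + 1 exceeds your MaxSize, raise it first with update_auto_scaling_group)
asg = autoscaling.describe_auto_scaling_groups(AutoScalingGroupNames=[ASG_NAME])['AutoScalingGroups'][0]
old_instances = [i['InstanceId'] for i in asg['Instances']]
desired = asg['DesiredCapacity']

autoscaling.set_desired_capacity(AutoScalingGroupName=ASG_NAME, DesiredCapacity=desired + 1)
for instance_id in old_instances:
    # Stop sending traffic to the old instance, then let the ASG replace it
    elbv2.deregister_targets(TargetGroupArn=TARGET_GROUP_ARN, Targets=[{'Id': instance_id}])
    autoscaling.terminate_instance_in_auto_scaling_group(
        InstanceId=instance_id, ShouldDecrementDesiredCapacity=False)
    time.sleep(120)  # crude: give the replacement time to launch and pass its health checks

autoscaling.set_desired_capacity(AutoScalingGroupName=ASG_NAME, DesiredCapacity=desired)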

Enabling both versions of your static files to live side by side

In step 7 (and in step 10), we have two versions of the app in production and also two versions of our static files: When we run collectstatic, we don't replace the old files, we add the new files with a new hash. 

E.g. let's say we have a file app.js which in the previous version was stored as app.55e7cbb9ba48.js. Assuming we made changes to app.js in our new version, it will now be saved as app.af03944e0dda.js. Remember when I explained in part 1 that Django automatically replaces the name correctly when using {% static 'js/app.js' %}? The file staticfiles.json, which is created during collectstatic, provides the mapping from app.js to app.af03944e0dda.js.
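Abbreviated, the staticfiles.json written by collectstatic looks roughly like this (the exact version number depends on your Django version):

{
  "version": "1.0",
  "paths": {
    "js/app.js": "js/app.af03944e0dda.js"
  }
}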

Now we have a problem: staticfiles.json points to the new file name, i.e. the new version of our app.js. But old instances are still serving old versions of our app. If they use the same staticfiles.json (which is centrally stored in the static root folder), they will serve new versions of our JavaScript inside old templates, where the JavaScript might not work. So I've chosen to make staticfiles.json local, i.e. keep a copy outside of the static root so that it's always in sync with the corresponding code on each instance.

To do this, you need to subclass CompressedManifestStaticFilesStorage from Whitenoise:

import json
import os

from django.conf import settings
from django.core.files.base import ContentFile
from django.core.files.storage import FileSystemStorage

from whitenoise.storage import CompressedManifestStaticFilesStorage


class LocalManifestStaticFilesStorage(CompressedManifestStaticFilesStorage):
    """
    Saves and looks up staticfiles.json in Project directory

    This makes it possible to have the code in sync with the static files: When deploying a new version
    of the code on an instance, staticfiles.json needs to be updated. But instances still running the old version should still use the old static files.

    Follow these steps for deploying:
        - Deploy to a master instance, including `collectstatic` to create a new staticfiles.json and upload the latest versions of the static files.
        - Now deploy the code to another instance and copy the staticfiles.json from the master instance to the source directory. The new instance will now use the latest version of the static files that match the code.
        - Repeat for each instance in your auto-scaling group.
    """
    manifest_location = os.path.abspath(settings.BASE_DIR)
    manifest_storage = FileSystemStorage(location=manifest_location)

    def read_manifest(self):
        try:
            with self.manifest_storage.open(self.manifest_name) as manifest:
                return manifest.read().decode('utf-8')
        except IOError:
            return None

    def save_manifest(self):
        payload = {'paths': self.hashed_files, 'version': self.manifest_version}
        if self.manifest_storage.exists(self.manifest_name):
            self.manifest_storage.delete(self.manifest_name)
        contents = json.dumps(payload).encode('utf-8')
        self.manifest_storage._save(self.manifest_name, ContentFile(contents))

This reads and writes the staticfiles.json file in the project folder (settings.BASE_DIR) instead of the static root folder. Now all you need to do is point the STATICFILES_STORAGE setting at this class (adjust the module path to wherever you saved it):

STATICFILES_STORAGE = 'myproject.storage.LocalManifestStaticFilesStorage'

To summarise: When you run collectstatic on your management instance, only the staticfiles.json manifest on the management instance will be changed, not on your running old instances. This way, you can browse to old and new instances and still receive the correct static files depending on which version of your website you're on.

Note 1: In step 7, when browsing to management.example.com, the HTML returned will point to the new versions of your static files. But your CDN will first need to fetch them, and it will fetch them from your old instances, since they are the ones behind the load balancer whose hostname you configured in your CDN. That's not a problem in itself: Whitenoise will happily serve any file on your EFS volume. But it only works if you restart Whitenoise (gunicorn) on your old instances, since Whitenoise caches the contents of the static root folder at startup for quick access. Without a restart, Whitenoise will return a 404 to your CDN. Unfortunately, this means sshing into each of your running instances and restarting gunicorn (write a script!).
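A minimal sketch of such a restart script with Fabric 2, assuming a systemd unit called gunicorn; the host list is a placeholder and should contain the instances currently behind your load balancer:

from fabric import Connection

# Private DNS names (or IPs) of the web instances currently in your target group
WEB_INSTANCES = [
    'deploy@ip-172-31-29-68.eu-central-1.compute.internal',
    # ...
]


def restart_gunicorn():
    """Restart gunicorn on every web instance so Whitenoise re-scans the static root."""
    for host in WEB_INSTANCES:
        Connection(host).sudo('systemctl restart gunicorn')


if __name__ == '__main__':
    restart_gunicorn()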

Note 2: If by mistake you try to access your new version of app.js before having restarted your old instances and get a 404 back, your CDN will cache that for a while. In order to speed up correct retrieval, you can invalidate the caches on CloudFront:

  1. Go to the CloudFront Dashboard
  2. Select your static files distribution
  3. Click on the Invalidations tab
  4. Click on Create Invalidation
  5. Enter the path to the file, e.g. /static/js/app.af03944e0dda.js
  6. Click Invalidate

Now wait. This can take quite some time. You'll see the status of your invalidation (In Progress to start with). Only when the status is Completed should you refresh your page.
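If you'd rather script this step, the same invalidation can be created with boto3; the distribution ID below is a placeholder:

import time

import boto3

cloudfront = boto3.client('cloudfront')
cloudfront.create_invalidation(
    DistributionId='E1234567890ABC',  # the ID of your static files distribution
    InvalidationBatch={
        'Paths': {'Quantity': 1, 'Items': ['/static/js/app.af03944e0dda.js']},
        'CallerReference': str(time.time()),  # any string that's unique per request
    },
)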

Database migrations and side-by-side application versions

This is the most important criterion for deciding whether or not you can apply a ramped deployment strategy: can the old version of your app still run against the new version of your database?

You usually add columns or tables to your db, and this doesn't affect your old instances. When you make a query to fetch a model instance from your db (e.g. Product.objects.get(id=1)), Django explicitly lists all the columns that should be retrieved in the SELECT query, so additional columns don't affect the initialisation of your model. One notable exception is if you're using raw queries: those might cause your models to be initialised with unexpected columns and crash.
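You can check this yourself in the Django shell; assuming a hypothetical Product model, the generated SQL names every column explicitly instead of using SELECT *:

# In python manage.py shell, with a hypothetical Product model:
>>> from myapp.models import Product
>>> print(Product.objects.filter(id=1).query)
SELECT "myapp_product"."id", "myapp_product"."name", ... FROM "myapp_product" WHERE "myapp_product"."id" = 1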

But beware! There are changes that you can make that are breaking changes to your database, for example renaming columns, adding constraints at the database level (e.g. adding a unique constraint or adding null=False to a column) or deleting columns and tables. In such cases, your old instances won't be able to continue working during the deploy: You have to choose a different deploy strategy. A zero-downtime strategy is very difficult in such a case, even with a blue-green deploy, because you need to have two copies of your database simultaneously (and keep them in sync). 

I try to avoid any breaking changes to my database so I never have to face this situation. In the rare case this happens, I will accept downtime: I point all my traffic to an Under construction page, deploy my new version and then point the traffic back to my app, preferably during the least busy time of the day.

Health checks

So how does the load balancer know whether or not an instance is healthy? Remember, the same health status is also used by the autoscaling group to decide when to terminate an instance and replace it. How it works is described in this document by AWS. When you created your target group, AWS created a default health check for it. That's not much more than a ping. If the instance is alive, the health check passes. It says nothing about your app's health.

I've created an API endpoint which the load balancer calls to check that my Django app is basically running. Here's how to configure it:

  1. In your EC2 dashboard, in the left column under Load balancers, go to Target groups. You'll see the target group you created in part 3.
  2. Select the target group, which will show its details below.
  3. Select the Health checks tab and then Edit health check
  4. Select the HTTP protocol, enter the path to the view you'll create (see below) and choose port 80. For the timings I've chosen an interval of 30 seconds, an unhealthy threshold of 2 and a healthy threshold of 3.

    These settings mean: it takes 1 minute to detect an unhealthy instance (2 checks with a wrong status or a timeout) and 1.5 minutes to ascertain that an instance is healthy (3 checks with the correct status). A boto3 sketch of the same settings follows after this list.
  5. Click Save
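For reference, here's the same health check configuration done with boto3; the target group ARN and health check path are placeholders and the timeout value is my own assumption:

import boto3

elbv2 = boto3.client('elbv2')
elbv2.modify_target_group(
    TargetGroupArn='arn:aws:elasticloadbalancing:eu-central-1:123456789012:targetgroup/my-targets/abc123',  # placeholder
    HealthCheckProtocol='HTTP',
    HealthCheckPort='80',
    HealthCheckPath='/health_check/www.example.com',  # the path nginx rewrites, see below
    HealthCheckIntervalSeconds=30,
    HealthyThresholdCount=3,
    UnhealthyThresholdCount=2,
    HealthCheckTimeoutSeconds=5,  # assumption: any value shorter than the interval
)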

Now this is assuming you already have a view that responds with a status 200 OK. Here's my view:

from django.contrib.sites.models import Site
from django.http import JsonResponse
from django.views import View


class ApplicationStatus(View):

    def get(self, request, **kwargs):
        try:
            site = Site.objects.get_current()
        except Exception as e:
            return JsonResponse({'error': str(e)}, status=500)
        else:
            return JsonResponse({'status': 'ok'})

Why not just return status ok? Because that would only verify that my Django app is running. By fetching something from the database, I'm also verifying that it's able to reach the database. In my case I just fetch the current Site object (if you haven't enabled django.contrib.sites, fetch something else you know for sure should be present in the db).
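For completeness, here's roughly how the view is wired up in urls.py, using the URL path that the nginx rewrite below points at (adjust the import to wherever you defined ApplicationStatus):

from django.urls import path

from core.views import ApplicationStatus  # placeholder module path

urlpatterns = [
    # ... your other URL patterns
    path('api/v1.0/core/status', ApplicationStatus.as_view(), name='application-status'),
]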

Note that in my case, my Django app only allows connections over HTTPS, so the path I gave to the load balancer isn't the path defined in urls.py. I have a rewrite rule in nginx that maps it to the correct path and pretends the request came in over HTTPS:

server {
  listen 80 default_server;
  # listen 443 default_server;

  server_name _;

  location /health_check/ {
    try_files $uri @proxy;
  }

  location @proxy {
    rewrite ^/health_check/(?<domain>[a-zA-Z0-9\.]+) /api/v1.0/core/status break;
    # Lie about incoming protocol, to avoid the backend issuing a 301 redirect from insecure->secure,
    #  which would not be considered successful.
    proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
    proxy_set_header X-Forwarded-Proto 'https';
    proxy_set_header Host $domain;
    proxy_redirect off;
    proxy_pass http://django_app_server;
  }

  location / {
    return 404;
  }
}

This maps the path /health_check/www.dedi.co to /api/v1.0/core/status (which is the URL path I defined in urls.py for my ApplicationStatus view).

This also allows the load balancer to actually send requests to my targets using their local DNS name (ip-172-31-29-68.eu-central-1.compute.internal) which wouldn't be handled by my normal nginx configuration.

Summary

In these 4 parts, we've learned how to handle and deploy static files in a production environment with auto-scaling. We have used and configured the following components:

  • Django-whitenoise for managing our static files
  • CloudFront for serving our static files
  • Elastic File System (EFS) for storing our static files
  • Elastic Load Balancer (ELB), a target group and an autoscaling group to ensure zero-downtime deployments and resiliency of our website.

There are a few things you'll still want to improve or fix:

  • Automate your deployments. I'm not necessarily talking about CI/CD (continuous integration/continuous deployment), but first a script that lets you deploy at the push of a button. This script to automatically roll out new instances in an autoscaling group helped me.
  • You might have to think about processes that need to run only once, e.g. cron jobs (or Celery tasks) that perform regular work, like emailing reminders to your users or cleaning things up in the database. They should not run on all instances, otherwise you're duplicating them.
  • I have two types of instances: my management instance, which is the one I deploy from (it's in my VPC and, with the AWS CLI, it can access all my instances, create new launch templates, etc.). It's also the one that runs my background tasks, so they only run once. And my normal instances (web instances), which don't run background tasks and just handle the live web traffic.
  • If you want to have more information on my deployment scripts, hit me up via twitter @dirkgroten and I'll be happy to share some more stuff.

Dirk Groten is a respected tech personality in the Dutch tech and startup scene, running some of the earliest mobile internet services at KPN and Talpa and a well known pioneer in AR on smartphones. He was CTO at Layar and VP of Engineering at Blippar. He now runs and develops Dedico.

Things to remember

  • Your health check API should launch a Django view and fetch something from the db.
  • Make the static files manifest local to each instance
  • Run collectstatic once each time you deploy
  • Use a separate management instance to deploy and create the AMI
