
Distributed Load Testing with Python Locust and Terraform: A Complete Guide

The first time I heard about Locust was at PyCon Sweden a few years back. I felt a bit stupid: how many other amazing Python libraries am I missing? The depth of the Python ecosystem is massive. Locust is one of those hidden-gem libraries that most people would miss. I mean, who expects a full-blown load testing framework with a web UI... in Python?!


What is Locust?

Locust is a simple, robust, customizable, high-level load testing framework, and yes, it includes a web interface! You can define your tests and set the concurrency, duration, and so on. You can watch test executions and results in real time in the browser, see error details and exceptions, and even download a nice report. And since it is written in Python, the possibilities for integrating it with other tools are endless.


Happy grasshopper testing the sh*t out of the application

First run

As a proof of concept, you may just want to get a feel for the tool. You can start by writing a simple hello-world test case on your laptop and running it to see what's going on.

For example, we can define a test case where we hit a public endpoint on the application we are testing. No login, no advanced features. Just hit an open endpoint, like the root or login pages.


Perfect, let's get to it.


First of all, install Locust:

pip3 install locust

In Locust, you define users as classes, the basic one is HttpUser.

It represents a simulated user that is going to execute the tests. Each method of the class decorated with @task will run in random order for the duration of the test session.

Once the class is instantiated, it creates a self.client object, which is a wrapper around a requests session pointed at the host property of the class. It already knows the base URL (host), so you just need to pass the suffix: the remainder of the URL path.


# hello_locust.py
from locust import HttpUser, task


class HelloWorld(HttpUser):
    host = "https://example.com"

    @task
    def get_root(self):
        self.client.get("/")

And that's it? Yes, that's it. Now, run it from the CLI:

locust -f hello_locust.py

It will start the web server (by default at 0.0.0.0:8089). Open it in the browser and you will be greeted with the test session configuration options:


  • Number of users: how many concurrent simulated users (how many instances of the class)

  • Spawn rate: how quickly to ramp up users

  • Host: defaults to the host property of the class, but you can override it here

  • Run time: duration in seconds, minutes or hours. If unset, it will run until you stop it.


There are CLI args to skip this (spawn rate: -r, users: -u), skip the UI entirely (--headless), and get the reports in CSV (--csv) or other formats; you are welcome to read the docs.


Test session options

Click on start and wait. The data on some tabs of the UI will be updated in real time.


Reading the results


You can either wait until the run completes or stop the session at any time; it will still give you all the data. You will see per-endpoint stats, failure counts, and charts.


Per-endpoint stats

In this example of a 2-minute run you can extract the following information:

  • ~390 requests/sec average

  • 36 ms average response time

  • 26k out of 27k requests failed


You won't get much else here; you can view more details in the Failures or Charts tabs.


Basic stats

On the total requests per second chart you can see how the tool warms up, getting a first batch of requests through successfully, then starting to get a few red errors after 20 seconds, and failing everything after 1 minute.

On the response times chart you see that the successful requests have a load time between 200 and 1500 ms, while the errors take just a few ms to return. This explains the low average response time (36.04 ms).

On the number of users chart you see how it slowly ramped up to 10 concurrent users.


These charts show that the app can't handle much traffic. They still don't say why it failed, but that becomes clearer if you jump to the Failures tab.


Error messages

You can get a full breakdown of all errors, grouped by occurrences and URL path.

This test example was executed against a basic online store website. A 429 status code is a controlled error (rate limiting), so this test session tells you that the website has rate-limiting/DDoS protection and that it got triggered.


By default, Locust executes each task right after the previous task completes, with no wait time. You can override that by defining a wait time between tasks to try to avoid the rate limit.


# hello_locust.py
from locust import HttpUser, task, constant, between


class HelloWorld(HttpUser):
    host = "https://example.com"
    wait_time = constant(1)
    # wait_time = between(min_wait=1, max_wait=5)

    @task
    def get_root(self):
        self.client.get("/")

Now, each user will wait 1 second between tasks. It can also be a random amount between min and max values. If we spawn 10 users and set wait_time to a fixed 1 second, we can expect roughly 10 requests per second. OK, let's see.


Stats for 10 users with 1s wait_time

There are still errors, but requests per second drop to 9.8, as expected (slightly under 10, since we still have the ramp-up delay while the users spawn).

The chart shows that the rate limit now takes 1 minute to trigger, better than last time, and that the real average response time sits between 120 and 175 milliseconds.


Chart with more accurate response times but still triggering Rate Limit
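
As a side note on wait times: if you want each user to target a fixed request rate rather than a fixed pause, Locust also ships the constant_pacing and constant_throughput helpers. A minimal sketch, reusing the same hypothetical host:

# hello_locust_throughput.py
from locust import HttpUser, task, constant_throughput


class HelloWorld(HttpUser):
    host = "https://example.com"
    # Each user aims for 1 task run per second, shrinking the wait to
    # compensate for response time (as long as responses come back fast enough)
    wait_time = constant_throughput(1)

    @task
    def get_root(self):
        self.client.get("/")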


So far, this is just basic Locust, but you already get an idea of what you can do.


Beyond the Hello World


Now, let's go one step further and create a more advanced test. In a real scenario, as a DevOps engineer, you may want to check how much load your application supports and how long it takes to load a page or perform an action, like completing a user story. Think about your deployments of Jenkins, GitLab, Grafana, Jira, Nexus, etc. For each of them, I am sure you have a few user stories defined somewhere, even if they only exist in your mind. You may or may not have tests for them, but with Locust you can write those tests in code and measure them.


Let's use the example of a source control tool. There are many self-hosted options, like GitLab, Gitea, Bitbucket. But, no matter what your tool is, the user stories are going to be quite similar.


As a user, I want to:

  • view my profile

  • load my dashboard

  • list projects

  • list repositories on a project

  • create a repository on a project

  • clone a repository

  • push code to a repository


"As a user" means you are a registered user, so the first thing you have to do is log in. Locust's self.client is a session, meaning that if you log in, the cookies and so on are saved in the client. You can also alter them, put a bearer token in the headers, generate a token and use it, or... you know, it's Python requests.

from locust import HttpUser, task, tag


class SCMUser(HttpUser):
    host = "https://example.com"

    def on_start(self):
        """
        Special method that runs after __init__ and before starting the tasks.
        Can be used to log in or run some pre-requirements.
        """
        self.client.post(
            "/login",
            {"username": "foo", "password": "bar"}
        )
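
As an aside, since self.client is just a requests session, token-based auth works just as well as cookies. A minimal sketch, assuming a hypothetical /api/token endpoint that returns a JSON body with a token field:

from locust import HttpUser, task


class APIUser(HttpUser):
    host = "https://example.com"

    def on_start(self):
        # Hypothetical token endpoint; adjust to whatever your API exposes
        res = self.client.post("/api/token", json={"username": "foo", "password": "bar"})
        token = res.json().get("token", "")
        # Every subsequent request made through self.client will carry this header
        self.client.headers["Authorization"] = f"Bearer {token}"

    @task
    def whoami(self):
        self.client.get("/api/user")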

Great! We have logged in. Now, every task will have the login cookies. Let's define four basic user stories for logged-in users. When you run this you will get stats for four different endpoints; login won't count since it is not part of a task. But it could be: you would just need to decorate the on_start special method with @task as well.


@task
def list_repositories(self):
    self.client.get("/projects/FOO")

@task
def view_profile(self):
    """ Every client call returns a response, and you can verify it.
    This way you not only check that the status code is < 400, you also
    test that the response is what you expect it to be.
    """
    result = self.client.get("/profile")
    assert result.status_code == 200
    assert result.url == self.host + '/profile'

@task(5)
def load_dashboard(self):
    """ The task decorator can receive a "weight" as a parameter.
    A weight of 5 means load_dashboard is 5 times more likely to be
    picked than a task with the default weight of 1.
    Without weights, Locust spreads the load evenly.
    """
    self.client.get("/dashboard")

@tag('my_tag')
@task
def list_projects(self):
    """ Tasks can also be tagged.
    When running the locust cli, you can pass the tags you want to run as an argument:
        "locust -f foo.py --tags my_tag" will only run the tasks with that tag
    """
    self.client.get("/projects")
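
A related trick: instead of a plain assert, you can pass catch_response=True and decide yourself what gets recorded as a failure in the stats. A minimal sketch (the endpoint and content check are made up), still inside the same class:

@task
def view_profile_strict(self):
    """ With catch_response=True the response acts as a context manager,
    and you can explicitly mark the request as passed or failed.
    """
    with self.client.get("/profile", catch_response=True) as res:
        if "Sign out" not in res.text:
            # Recorded as a failure even if the status code was 200
            res.failure("profile page did not render the expected content")
        else:
            res.success()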

Custom Metrics (e.g., git clone)


Now, we were talking about load testing, user stories, cloning repositories and so on. Since git clone is not an HTTP request, how can we do that? How can we measure arbitrary operations?


Calls to self.client (with catch_response=True) can be used as a context manager to generate a load-time record, and you can then add your custom metric to the request context. In this example we are going to measure the time it takes to do a git clone.

import shutil
import subprocess
import time

...

@task
def repo_clone(self):
    with self.client.get("/", catch_response=True) as res:
        # Time the clone ourselves, since git clone is not an HTTP request Locust can measure
        start = time.time()
        subprocess.call(
            f'git clone {self.host}/foo/repo.git /tmp/repo',
            shell=True,
            stderr=subprocess.PIPE,
            stdout=subprocess.PIPE
        )
        finish = time.time()
        clone_time = finish - start
        # Attach the custom metric to the request context so event listeners can read it
        res.request_meta['context']['clone_time'] = clone_time
        shutil.rmtree('/tmp/repo', ignore_errors=True)

Great, but how can I use that clone_time? Now, this is where things get interesting. We are going to generate new metrics for locust to display in the Web UI.


First, let's define an event listener to clear the stats cache. Otherwise, the custom metrics would persist across test sessions, as they don't get cleared automatically.

from locust import events

# Module-level store for our custom stats, keyed by request name
stats = {}


@events.reset_stats.add_listener
def on_reset_stats():
    global stats
    stats = {}

Then, let's define another event listener to add the metrics to the stats for every request. We don't really need to display the clone_time of each specific request; instead we add it to a running total and then calculate the average clone time based on the number of clones. You can extend this with as many metrics as you need.

@events.request.add_listener
def on_request(request_type, name, response_time, response_length, exception, context, **kwargs):
    if 'clone_time' in context:
        stats.setdefault(
            name,  # name is the url path
            {
                "clone-time": 0,
                "clone-time-total": 0,
                "clone-count": 0
            }
        )
        stats[name]["clone-count"] += 1
        stats[name]["clone-time-total"] += context['clone_time']
        stats[name]["clone-time"] = stats[name]["clone-time-total"] / stats[name]["clone-count"]
        
clone_count, clone_time, clone_avg...

Finally, you have to tell Locust to do something with all of that. Here is a way to create new routes in the underlying Flask app and add them to the web UI (note the extra Flask imports and the blueprint at the top of the snippet).


import json
import time
from html import escape

from flask import Blueprint, make_response, render_template, request

# Flask blueprint used to register our extra routes on the Locust web app.
# Depending on your Locust version you may need to point template_folder/static_folder
# at the web UI assets.
extend = Blueprint("extend", __name__)


@events.init.add_listener
def locust_init(environment, **kwargs):
    if environment.web_ui:
        def get_clone_time_stats():
            if stats:
                stats_tmp = []

                for name, inner_stats in stats.items():
                    clone_time = inner_stats["clone-time"]
                    clone_count = inner_stats["clone-count"]

                    stats_tmp.append(
                        {"name": name, "safe_name": escape(name, quote=False),
                         "clone_time": clone_time, "clone_count": clone_count}
                    )
                return stats_tmp[:500]  # truncate as it may crash
            return stats

        @environment.web_ui.app.after_request
        def extend_stats_response(response):
            """
                This will create a new tab called "clone-stats" in the UI
            """
            if request.path != "/stats/requests":
                return response

            response.set_data(
                json.dumps(
                    {**response.json, "extended_stats": [{"key": "clone-stats", "data": get_clone_time_stats()}]}
                )
            )

            return response

        @extend.route("/extend")
        def extend_web_ui():
            """
            Add route to access the extended web UI with our new tab.
            """
            environment.web_ui.update_template_args()

            return render_template(
                "index.html",
                template_args={
                    **environment.web_ui.template_args,
                    "extended_tabs": [{"title": "Clone Stats", "key": "clone-stats"}],
                    "extended_tables": [
                        {
                            "key": "clone-stats",
                            "structure": [
                                {"key": "name", "title": "Name"},
                                {"key": "clone_time", "title": "Avg clone time"},
                                {"key": "clone_count", "title": "Total clones"}
                            ],
                        }
                    ],
                    "extended_csv_files": [
                        {"href": "/clone-stats/csv", "title": "Download clone time statistics CSV"}
                    ],
                },
            )

        @extend.route("/clone-stats/csv")
        def request_clone_time_csv():
            """
            Add route to enable downloading of clone-time stats as CSV. Who doesn't like a nice spreadsheet?
            """
            response = make_response(clone_time_csv())
            file_name = f"clone_time{time.time()}.csv"
            disposition = f"attachment;filename={file_name}"
            response.headers["Content-type"] = "text/csv"
            response.headers["Content-disposition"] = disposition
            return response

        def clone_time_csv():
            """Returns the clone-time stats as CSV."""
            rows = [
                ",".join(
                    [
                        '"Name"',
                        '"Avg clone-time"',
                        '"Total clones"'
                    ]
                )
            ]

            if stats:
                for url, inner_stats in stats.items():
                    # One CSV row per URL: name, average clone time, clone count
                    rows.append(f"\"{url}\",{inner_stats['clone-time']:.2f},{inner_stats['clone-count']}")
            return "\n".join(rows)

        # register our new routes and extended UI with the Locust web UI
        environment.web_ui.app.register_blueprint(extend)

This is the result

To access this modern UI you need to add /extend to the url: http://127.0.0.1:8089/extend?tab=clone-stats

And the additional fancy CSV download for the clone stats:


locust custom ui csv download


And just like that you have hacked Locust.



Now it is not a basic load testing tool anymore; it is a powerful, fully customizable framework to test the sh*t out of your application.



Scaling Out: Terraform + AWS EC2


But... that is not the whole story. I assume that you, dear reader, being a seasoned DevOps engineer, have already spotted the bottleneck. Yes, indeed: your laptop! Even for basic load testing, running from a single machine will become a bottleneck at some point. Now consider that you are testing intense disk or network operations; running everything from a single machine won't give you real feedback, as you will be limited by that machine's throughput. Luckily, Locust integrates easily with cloud resources. Azure, AWS, Terraform, EC2, Kubernetes, the possibilities are endless (as long as there is some sort of node-to-node communication).

Let's focus on a simple one: we are going to execute the above tests on a fleet of EC2 instances on AWS using Terraform. The idea is to launch a controller (which will only host the UI, with no test execution), and a fleet of worker nodes that report back to it. Each worker node can run one or multiple users, and they all run in parallel. So, if you want to check how your SCM behaves when 3000 developers clone a repository at the same time, this will give you the answer.


Terraform


Again, the goal is to launch a controller and a swarm of workers, make the workers report to the controller (on ports 5557-5558), and have the controller display all the information in the web UI.


In this example code we will skip the basics of Terraform (AMI, subnets, VPC, Route53, certificates, etc.) and jump straight to the juicy part.


Security Group

One option is to create one SG for the controller and another for the workers, let them talk to each other and expose the controller SG for the Web UI. Or just one SG for all, and let it talk to itself.

resource "aws_security_group" "locust_sg" {
  name = "locust-sg"
  ...
}

resource "aws_security_group_rule" "ssh_ingress" {
  from_port         = 22
  protocol          = "tcp"
  security_group_id = aws_security_group.locust_sg.id
  to_port           = 22
  type              = "ingress"
  description       = "Allow SSH traffic for Terraform"
  cidr_blocks       = var.terraform_controller_cidr_blocks
}

resource "aws_security_group_rule" "web_ingress" {
  from_port         = 8089
  protocol          = "tcp"
  security_group_id = aws_security_group.locust_sg.id
  to_port           = 8089
  type              = "ingress"
  description       = "8089 as default HTTP or 443 if you plan to use Nginx"
  cidr_blocks       = var.web_ui_ingress_cidr_blocks
}

resource "aws_security_group_rule" "report_traffic" {
  from_port         = 5557
  protocol          = "tcp"
  security_group_id = aws_security_group.locust_sg.id
  to_port           = 5558
  type              = "ingress"
  description       = "Allow workers report to controller"
  self              = true
}

resource "aws_security_group_rule" "ssh_traffic" {
  from_port         = 22
  protocol          = "tcp"
  security_group_id = aws_security_group.locust_sg.id
  to_port           = 22
  type              = "ingress"
  description       = "Allow SSH traffic within nodes"
  self              = true
}

resource "aws_security_group_rule" "egress_traffic" {
  from_port         = 443
  protocol          = "tcp"
  security_group_id = aws_security_group.locust_sg.id
  to_port           = 443
  type              = "egress"
  description       = "Egress rule to reach target of the test"
  cidr_blocks       = var.egress_cidr_blocks
}

Workers send data to the controller on ports 5557 and 5558. Still, if you want to enable downloading CSV reports or any other custom generated files from the workers, you will need to configure SSH so the files can be sent to the controller. SSH can also be used by Terraform to set up the nodes. User data is an option, but with SSH you can orchestrate a synchronous setup.

If you decide to enable SSH, you will need to manage the PEM key. Meaning: create it, save it, create all the nodes with it, but also inject it into the nodes so they can transfer files between them (or skip this bit if you don't plan to transfer files). For now, this will create the key locally and in AWS.


resource "tls_private_key" "pem_key" {
  algorithm = "RSA"
  rsa_bits  = 4096
}

locals {
  export_pem_cmd = "echo '${tls_private_key.pem_key.private_key_pem}' > private-keypair.pem"
}

resource "aws_key_pair" "key_pair" {
  key_name   = "loadtest-keypair"
  public_key = tls_private_key.pem_key.public_key_openssh
}

resource "null_resource" "key_pair_exporter" {
  depends_on = [
    aws_key_pair.key_pair
  ]

  triggers = {
    always_run = timestamp()
  }

  provisioner "local-exec" {
    command = local.export_pem_cmd
  }

}

User data and initial config

You can create a custom AMI with everything bootstrapped, or just use any basic Linux marketplace AMI and prepare the setup yourself. If you go with the second option, there are a few ways to do it. You can go all in on user data, sprinkle some delays here and there and cross your fingers so the workers don't start before the controller, or mix user data with Terraform-controlled script provisioners to know exactly when everything is ready to roll, and then start the controller before the workers.


This is a basic user data script to install Locust, configure basic SSH, and let Terraform know that the user data has completed.


#!/bin/bash

yum install -y git

pip3 install locust

export PRIVATE_IP=$(hostname -I | awk '{print $1}')
echo "PRIVATE_IP=$PRIVATE_IP" >> /etc/environment

source ~/.bashrc

mkdir -p ~/.ssh
echo 'Host *' > ~/.ssh/config
echo 'StrictHostKeyChecking no' >> ~/.ssh/config

touch /tmp/finished-setup

On the EC2 instance resource in Terraform, we can use the remote-exec and file provisioners to copy the Locust scripts (called plans) to the instance, and configure SSH to enable file transfer.

resource "aws_instance" "worker" {

  count = var.worker_count
  # ami, instance size, public ip, subnet, iam role...

  vpc_security_group_ids = [aws_security_group.locust_sg.id]
  key_name               = aws_key_pair.key_pair.key_name
  user_data_base64       = var.user_data  # above user_data

  connection {
    host        = self.private_ip
    type        = "ssh"
    user        = var.ssh_user
    private_key = tls_private_key.pem_key.private_key_pem
    script_path = "/home/${var.ssh_user}/terraform_provisioner_%RAND%.sh"
  }

  # Inject pem key required to talk with controller. Don't show this to Security!
  provisioner "remote-exec" {
    inline = [
      "echo 'starting provisioner' > /tmp/test",
      "echo '${tls_private_key.pem_key.private_key_pem}' > /home/${var.ssh_user}/.ssh/id_rsa",
      "chmod 600 /home/${var.ssh_user}/.ssh/id_rsa",
      "sudo mkdir -p ${var.load_tests_plans_remote_dir}",
      "sudo chown ${var.ssh_user}:${var.ssh_user} ${var.load_tests_plans_remote_dir}"
    ]
  }

  # Copy Locust plans to the instance
  provisioner "file" {
    destination = var.load_tests_plans_remote_dir
    source      = var.load_tests_plans_local_dir
  }

  # Wait for user_data to finish
  provisioner "remote-exec" {
    inline = [
      "echo 'START EXECUTION'",
      "while [ ! -f /tmp/finished-setup ]; do echo 'waiting user_data to be completed'; sleep 5; done",
      "sleep 10"
    ]
  }

}

The controller EC2 instance follows the same pattern, except we don't need to copy the Locust plans: in this model, the controller won't execute any tests, it is just the orchestrator. But it still needs the user data to install Locust and the first script provisioner to configure SSH for file transfer.


So, if both instances are configured the same way, what makes them different? The entrypoint takes care of that.

After all instances have launched, Terraform can run some extra provisioners to execute commands on the remote machines, in this case to start Locust on all of them. The controller will be told it is the controller and how many workers to expect, while the workers will be started as worker nodes pointing at the controller's IP, so they can report their data.

We will focus only on the stats/metrics transfer, and leave file transfer for another day.


locals {
  controller_entrypoint = <<-EOT
      nohup locust \
          -f ${var.locust_plan_filename} \
          --web-port=8089 \
          --expect-workers=${var.worker_count} \
          --master -L DEBUG > locust-leader.out 2>&1 &
  EOT

  worker_entrypoint = <<-EOT
      nohup locust \
          -f ${var.locust_plan_filename} \
          --worker \
          --master-host=${aws_instance.controller.private_ip} -L DEBUG > locust-worker.out 2>&1 &
  EOT

  waiting_command = "while [ ! -f /tmp/finished-setup ]; do echo 'waiting for setup to be completed'; sleep 5; done"
}


resource "null_resource" "controller_entrypoint_setup" {

  depends_on = [
    aws_instance.controller,
    aws_instance.worker
  ]

  connection {
    host        = aws_instance.controller.private_ip
    type        = "ssh"
    user        = var.ssh_user
    private_key = tls_private_key.pem_key.private_key_pem
    script_path = "/home/${var.ssh_user}/terraform_provisioner_%RAND%.sh"
  }

  # wait until controller user_data is completed 
  provisioner "remote-exec" {
    inline = [
      "echo 'START EXECUTION'",
      local.waiting_command,
    ]
  }

  # Cleanup web resources
  provisioner "remote-exec" {
    inline = [
      "sudo rm -rf /var/www/html/*",
      "sudo chmod 777 /var/www/html -Rf",
      "sudo rm -rf ${var.load_tests_plans_local_dir}/logs",
    ]
  }

  # Trigger entrypoint
  provisioner "remote-exec" {
    inline = [
      "echo DIR: ${var.load_tests_plans_local_dir}",
      "cd ${var.load_tests_plans_local_dir}",
      "echo '${local.controller_entrypoint}'",
      "${local.controller_entrypoint}",
      "sleep 1"
    ]
  }

  triggers = {
    always_run = timestamp()
  }

}


resource "null_resource" "worker_entrypoint_setup" {

  count = var.worker_count

  depends_on = [
    aws_instance.controller,
    aws_instance.worker,
    null_resource.controller_entrypoint_setup,
  ]
  connection {
    host        = aws_instance.worker[count.index].private_ip
    type        = "ssh"
    user        = var.ssh_user
    private_key = tls_private_key.pem_key.private_key_pem
    script_path = "/home/${var.ssh_user}/terraform_provisioner_%RAND%.sh"
  }

  provisioner "remote-exec" {
    inline = [
      "echo SETUP NODES ${count.index}",
      "echo '${local.worker_entrypoint}'",
      "cd ${var.load_tests_plans_local_dir}",
      "${local.worker_entrypoint}",
      "sleep 1"
    ]
  }

  triggers = {
    always_run = timestamp()
  }

}

This is what will happen, in this exact order, when you run this Terraform project:

  • Create pem key / Security Group

  • Create X workers and 1 controller with pem key and SG

  • Run user_data on all to install locust

  • Copy locust plans to workers

  • Wait until user_data is completed

  • Execute locust on controller

  • Execute locust on workers


When the controller instance comes online, it won't kick off the tests right away (even though Locust can be configured to do that). Instead, it greets you with the same welcome screen you saw at the start of this post. From there, you can launch the test whenever you are ready and rerun it as many times as you like with different parameters, without having to redeploy anything. It will recognize the connected workers, but you can still control the number of users from the UI. So, if you have 10 workers connected to the controller and request 100 users, each worker will spawn 10 users.


Ready for Real Distributed Load Testing?


If you have made it this far, you are probably serious about putting your app through its paces. Whether you need a one-off load test, a full-scale distributed setup, or just some guidance on best practices, I can help. I offer consulting sessions and hands-on services to design, write, and interpret load tests that actually tell you something useful. If you want to see how your system holds up when the traffic spikes (and sleep better knowing it will) let's talk.


 
 
 
