Update: horde is a library that is “built to address some perceived shortcomings of Swarm’s design.” (from the introductory blog post), and is currently (as of December 2020) more actively maintained than Swarm. It also works together with libcluster. We used horde for some new use cases and we recommend you to also check it out if Swarm doesn’t fit your need.

We run Elixir on Kubernetes. While there were some rough edges for such a setup a few years ago, it has become much easier thanks to libraries such as libcluster and Swarm. Recently, we had a use case which those two libraries fit perfectly. However, since the libraries are relatively new, you might find the documentation and examples a bit lacking and it took us a while to get everything working together. We wrote this comprehensive walk-through to document our experience.

Note that you’d need to run your Elixir instances as releases, either with distillery or built-in releases (Elixir 1.9+), in order for this to work. This walk-through is based on our setup using Distillery.

libcluster and Swarm

The ability for different nodes to form a cluster and maintain location transparency has always been a huge selling point for the BEAM VM. libcluster is a library that helps with automatic node discovery and cluster formation, especially when the nodes are run in Kubernetes pods.

The main reason you might want a cluster is so you can organize and orchestrate processes across different nodes. This is where Swarm comes into the picture. Swarm maintains a global process registry and automates tasks such as process migration/restart after the cluster topology changes.

In our case, we have some permanent worker processes which should only have one instance of each running across all nodes at any given time. The problem is that we perform k8s rolling deployments with our CD pipeline, thus pods (and therefore nodes) will get destroyed and recreated throughout the day.

With Swarm, we register all above-mentioned worker processes globally, so whenever a pod gets destroyed, its worker processes will be automatically restarted on another healthy pod, ensuring that they are running exactly as envisioned all the time.

(Note: Initially, Swarm contained both the auto-clustering functionality and the process orchestration functionality. Later, the maintainer decided that it would be better to split them into two separate libraries, which become the libcluster and Swarm we see today.)

Clustering with libcluster

In our first step, we’re going to ensure that automatic cluster formation in Kubernetes takes place successfully.

Before we add libcluster to our project, we need to ensure that every Elixir node has a unique name. By default, in the rel/vm.args file generated by Distillery, we have a line just like:

-name <%= release_name %>@127.0.0.1

This means every node will be started as yourapp@127.0.0.1, which is not what we want.

We could first expose the Kubernetes pod IP as an environment variable in the kube configuration (e.g. kube/deploy.yaml) for the Elixir app:

---
apiVersion: extensions/v1beta1
kind: Deployment
metadata:
  name: your-app
  # ...
spec:
  # ...
  template:
    # ...
    spec:
      # ...
      containers:
      - name: your-app
        # ...
        env:
        - name: MY_POD_IP
          valueFrom:
            fieldRef:
              fieldPath: status.podIP

and then you can change the line in rel/vm.argsso the pod IP address will be substituted at runtime:

-name <%= release_name %>@${MY_POD_IP}

After ensuring unique node names, you can already test clustering manually via your console:

# On your local machine
~ kubectl exec -n your-namespace your-pod -c your-container -- sh

# Launch Elixir console after connecting to your k8s pod.
~ bin/console

iex(app@<POD_1_IP>)> Node.connect(":app@<POD_2_IP>")
iex(app@)> Node.list()

# If things are set up correctly, Node.list() should return [":app@<POD_2_IP>"]

Automatic Clustering

Having tested manual clustering, we can then move on to automatic clustering with libcluster.

You may notice that there are three clustering strategies for Kubernetes in libcluster and wonder which one you should use. In our experience, Cluster.Strategy.Kubernetes.DNS is the easiest to set up. All it requires is that you add a headless service to your cluster and no modifications are needed for the already existing (Elixir) pods.

---
apiVersion: extensions/v1beta1
kind: Deployment
metadata:
  name: your-app
  # ...
spec:
  template:
    metadata:
      # The selector for the headless service can
      # find the app by this label
      labels:
        app: your-app
        role: service
      # ...

---
apiVersion: v1
kind: Service
metadata:
  name: your-app-headless
spec:
  clusterIP: None
  # The selector is just a way to filter the pods on which the app is deployed.
  # It should match up with the labels specified above
  selector:
    app: your-app
    role: service

After adding the headless service, we’re finally ready for libcluster. You may add the ClusterSupervisor to the list of supervisors to be started with your app.

### In mix.exs
  defp deps do
    [
      ...
      {:libcluster, "~> 3.1"},
      ...
    ]
  end

### In config/prod.exs
config :libcluster,
  topologies: [
    your_app: [
      strategy: Cluster.Strategy.Kubernetes.DNS,
      config: [
        service: "your-app-headless",
        # The one as seen in node name yourapp@
        application_name: "yourapp",
        polling_interval: 10_000
      ]
    ]
  ]

### In lib/application.ex
  def start(_type, _args) do
    # List all child processes to be supervised
    children =
        [
          [
            {
              Cluster.Supervisor,
              [
                Application.get_env(:libcluster, :topologies),
                [name: Yourapp.ClusterSupervisor]
              ]
            }
          ],
          # Start the Ecto repository
          Yourapp.Repo,
          # Start the endpoint when the application starts
          YourappWeb.Endpoint,
        ]

    opts = [strategy: :one_for_one, name: Yourapp.Supervisor]

    Supervisor.start_link(children, opts)
  end

If everything goes well, your pods should already be automatically clustered when you start them. You can verify this by running Node.list() in the console, similar to what we did above.

Optional: Local cluster setup with docker-compose

During development, you will not want to deploy your changes every time to an actual k8s cluster in order to validate them or run a local k8s cluster for that matter. A more lightweight approach would be to make libcluster work together with docker-compose and form a local node cluster. We found Cluster.Strategy.Gossip the easiest to set up for this purpose.

The following assumes the app is started with docker-compose directly, without using Distillery releases.

First, we need to make sure that each Erlang node has a unique name, just as we did for the production environment. We will do it in our entrypoint script for docker-compose:

# In Dockerfile:
ENTRYPOINT ["/opt/app/docker-entrypoint-dev.sh"]

# In docker-entrypoint-dev.sh:
if [ -z ${NODE_IP+x} ]; then
    export NODE_IP="$(hostname -i | cut -f1 -d' ')"
fi

elixir --name yourapp@${NODE_IP} --cookie "your_dev_erlang_cookie" -S mix phx.server

Then, we need to scale the number of containers for our service to 2. You can easily do it by adding another service in your docker-compose.yml file:

services:
  your_app: &app
    build:
      context: ./app
    ports:
      - 4000:4000
    # ...

  your_app_2:
    <<: *app
    ports:
      - 4001:4000

(Note: Another way to achieve the same is to use the --scale flag of the docker-compose up command.)

### In config/config.exs
config :libcluster,
  topologies: [
    your_app: [
      strategy: Cluster.Strategy.Gossip,
      config: [
        port: 45892,
        if_addr: "0.0.0.0",
        multicast_addr: "230.1.1.251",
        multicast_ttl: 1
      ]
    ]
  ]

By default, the port and the multicast address should already have been available. If not, you can check your docker-compose configurations.

By this point, the two local nodes should be able to automatically find and connect to each other whenever you start your app via docker-compose.

Process registration with Swarm

After the foundation has been laid with libcluster, we may now move on to Swarm.

This sentence from the documentation is key to using Swarm to its maximum potential:

Therefore, we found the example from the documentation, which shows Swarm being used together with a normal Supervisor, to be slightly confusing: a normal Supervisor must start with some initial child worker processes, which will not be managed by Swarm. DynamicSupervisor seems to suit Swarm’s use case the most: we can start a DynamicSupervisor without any children and ensure all child processes are registered with Swarm before they are dynamically started later.

defmodule Yourapp.YourSupervisor do
  use DynamicSupervisor

  # See https://hexdocs.pm/elixir/Application.html
  # for more information on OTP Applications
  def start_link(state) do
    DynamicSupervisor.start_link(__MODULE__, state, name: __MODULE__)
  end

  def init(_) do
    DynamicSupervisor.init(strategy: :one_for_one)
  end

  def register(worker_name) do
    DynamicSupervisor.start_child(__MODULE__, worker_name)
  end
end

Note that in the init function we didn’t need to provide any actual children to be started.

The register function is a convenience function that needs to be provided to Swarm.register_name/4 whenever we want to start a worker process with Swarm. It simply calls start_child and would return the pid of the started worker.

This is how you would dynamically start your worker process anywhere in your app:

Swarm.register_name(
  :unique_name_for_this_worker_process,
  Yourapp.YourSupervisor,
  :register,
  [Yourapp.YourWorker]
)

Finally, we come to the definition for the worker process itself. Below is a minimal working example which would simply restart a killed worker process on another node, without preserving its state:

defmodule Yourapp.YourWorker do
  use GenServer

  def start_link(state) do
    GenServer.start_link(__MODULE__, state)
  end

  def init(opts) do
    initial_state = initialize_worker(opts)

    {:ok, initial_state}
  end

  def handle_call({:swarm, :begin_handoff}, _from, state) do
    {:reply, :restart, state}
  end

  def handle_info({:swarm, :die}, state) do
    {:stop, :shutdown, state}
  end

  defp initialize_worker(opts) do
    # ...
  end
end

Optionally, if you need to make use of the state handover functionality, you would need to make your worker more complicated with these additions:

  # Change the handling of :begin_handoff
  # This is triggered whenever a registered process is to be killed.
  def handle_call({:swarm, :begin_handoff}, _from, current_state) do
    {:reply, {:resume, produce_outgoing_state(current_state)}, current_state}
  end

  # Handle :end_handoff
  # This is triggered whenever a process has been restarted on a new node.
  def handle_call({:swarm, :end_handoff, incoming_state}, _from, current_state) do
    {:noreply, end_handoff_new_state(current_state, incoming_state)}
  end

Now, if you kill a node, you should see all the workers that were originally running on it automatically restarted on another node in the cluster.