Deploying QuasarDB in cloud-native environments has been possible for many years using the official Docker containers we provide. With the recent rise of container orchestration platforms such as Kubernetes, we can do much more: a single interface to provision, monitor and operate a QuasarDB cluster.
In this post, we will demonstrate an approach to deploy QuasarDB in a Kubernetes cluster, and discuss the various tradeoffs.
Choosing a Controller
Kubernetes provides a wide variety of workload controllers we can use. Before we can get started with writing the template, we need to decide how we are going to run QuasarDB, as this will determine what our templates will look like.
ReplicaSet
A ReplicaSet is a useful pattern when you need a certain number of identical pods running in parallel. It is very similar to a Deployment, and its canonical use case is stateless services such as webservers, which do not need to be aware of each other.
For this reason, it is not a good fit for QuasarDB: we definitely need the nodes to be able to discover each other.
StatefulSet
A StatefulSet is similar to a Deployment, but has an important feature: it provides guarantees about the ordering and uniqueness of pods. Each pod has a unique identity which persists across any rescheduling.
This is a great fit for QuasarDB for a number of reasons:
- Due to the persistent identity, we don't risk a pod being relaunched under a different name/identity, which would cause unnecessary data migration as well as cluster topology changes.
- The pod identities are deterministic, allowing us to predict the names of other pods in our StatefulSet. This will allow us to implement a mechanism for auto-discovery of pods in a cluster.
Service definition
Because of the way QuasarDB and StatefulSets interact, we are going to define two different types of services:
- A headless service, which will guide the identity and naming strategy for our StatefulSet pods. It will provide us with deterministic pod names.
- A regular service, which clients will use to connect to the cluster. With QuasarDB's cluster discovery mechanism, when a client connects to QuasarDB, the following happens:
  - A client connects to the public service's IP, for example qdb://qdb-cs:2836, and requests the QuasarDB node topology to discover all the nodes in the cluster.
  - Kubernetes forwards this request to one of the pods in the cluster.
  - The pods identify themselves using the headless service, and return an array such as ["qdb://qdb-hs-1:2836", "qdb://qdb-hs-2:2836", "qdb://qdb-hs-3:2836"].
  - Subsequent requests from the client go directly through the headless service hostnames.
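To make the naming concrete: assuming the StatefulSet is named quasardb and runs in the default namespace (both assumptions based on the template later in this post), the headless service gives each pod a stable, predictable DNS name following the pattern <pod-name>.<headless-service>.<namespace>.svc.cluster.local, for example:

quasardb-0.qdb-hs.default.svc.cluster.local
quasardb-1.qdb-hs.default.svc.cluster.local
quasardb-2.qdb-hs.default.svc.cluster.local

The exact hostnames depend on your StatefulSet name and namespace.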
We define both services as follows:
apiVersion: v1
kind: Service
metadata:
  name: qdb-hs
  labels:
    app: qdb
spec:
  ports:
  - port: 2836
    name: client
  - port: 2837
    name: control
  clusterIP: None
  selector:
    app: qdb
---
apiVersion: v1
kind: Service
metadata:
  name: qdb-cs
  labels:
    app: qdb
spec:
  ports:
  - port: 2836
    name: client
  - port: 2837
    name: control
  selector:
    app: qdb
We now have two distinct services which are mapped to any pods with the app: qdb selector.
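As a quick sanity check (assuming kubectl is pointed at the right cluster and namespace), you can list both services by their label; the headless service should show its CLUSTER-IP as None:

$ kubectl get svc -l app=qdb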
StatefulSet definition
We start out by declaring our StatefulSet, again around the app: qdb selector:
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: quasardb
spec:
  selector:
    matchLabels:
      app: qdb
  serviceName: qdb-hs
  replicas: 3
  podManagementPolicy: OrderedReady
  template:
    metadata:
      labels:
        app: qdb
    spec:
There are a few things worth noting here:
- We explicitly map the StatefulSet to the qdb-hs headless service we defined earlier, which means the pods will assume an identity based on this service's DNS names.
- By setting podManagementPolicy: OrderedReady, we instruct Kubernetes to launch the pods one-by-one, in order from 0 to N, and to terminate them in reverse order. It is possible to launch all pods in parallel as well, but this will cause the QuasarDB cluster stabilization to take much longer.
We can now continue actually defining the pod spec for the StatefulSet.
QuasarDB license key
The best way to manage your QuasarDB license key on Kubernetes is to make use of the built-in secrets management. Assuming you have received a license key, first create it inside Kubernetes as a generic secret:
kubectl create secret generic qdb-secrets --from-file=/my/path/to/quasardb.key
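To confirm the key was stored under the expected name, you can inspect the secret; kubectl describe lists the data keys and their sizes without printing the contents:

$ kubectl describe secret qdb-secrets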
And add this secret as a volume to our pod spec:
volumes:
- name: secrets
  secret:
    secretName: qdb-secrets
We will later mount this volume directly into the pod, exposing our license key to the QuasarDB daemon.
Container definition
Our pod will feature a single container, for which we can now start the definition. We will base it on the public QuasarDB docker image:
- name: qdbd
  image: bureau14/qdb:3.7
  imagePullPolicy: "Always"
  ports:
  - containerPort: 2836
    name: client
  - containerPort: 2837
    name: control
  resources:
    limits:
      memory: "2Gi"
Please note that QuasarDB needs two ports to be available: 2836, which carries most of the traffic between client and server, and 2837, which is a low-volume port for cluster control messages.
Per Kubernetes best practices, we recommend setting appropriate resource limits. QuasarDB will automatically detect (and use) the available memory.
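If you also want the scheduler to reserve capacity for each pod, you can add explicit requests alongside the limits. The figures below are purely illustrative, not a QuasarDB sizing recommendation:

resources:
  requests:
    cpu: "1"
    memory: "2Gi"
  limits:
    memory: "2Gi"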
Volume mounts
We need to mount two volumes for different purposes:
- One volume to expose the earlier defined secret containing the QuasarDB license key;
- One volume to provide persistent data storage outside of the pod.
As such, we will define these mounts:
volumeMounts:
- name: datadir
  mountPath: /opt/qdb/db
- name: secrets
  mountPath: "/var/secret/"
  readOnly: true
The mount path for the QuasarDB daemon datadir, /opt/qdb/db, is hardcoded in our QuasarDB docker image. If you use our official image, you must use this path for persistent storage.
We also expose our secrets into a special directory, /var/secret, and will next instruct QuasarDB to look for the license file there through an environment variable.
Environment variables
Earlier we defined a secret qdb-secrets containing our license key file quasardb.key. We have mounted this volume inside our pod at /var/secret/, and can now instruct our Docker container to pick up the license key using the special QDB_LICENSE_FILE environment variable:
env:
- name: QDB_LICENSE_FILE
  value: "/var/secret/quasardb.key"
If you're using our official Docker image, this will work out of the box.
Additionally, for our Kubernetes StatefulSet, we must add the following environment variable to our container:
- name: K8S_REPLICA_COUNT
  value: "3"
This value must exactly match the replicas: 3 value of the StatefulSet definition. The variable triggers an auto-discovery mechanism in the Docker container, explained below.
Probes
If a QuasarDB node is told to bootstrap with another peer, it does not start listening on port 2836 until it has successfully connected to the cluster. We can use this feature with Kubernetes readiness/liveness probes, so that Kubernetes is aware of when a node is fully ready.
This works well in conjunction with the "OrderedReady" pod management policy we defined: when provisioning a cluster, Kubernetes will wait for the previous node to become fully ready (i.e. successfully joined the cluster) before allocating new ones.
readinessProbe:
  tcpSocket:
    port: 2836
  initialDelaySeconds: 5
  periodSeconds: 10
livenessProbe:
  tcpSocket:
    port: 2836
  initialDelaySeconds: 15
  periodSeconds: 20
Finishing touches
Our official Docker container uses reduced permissions by default, running as the qdb user with user id 999. We can make our deployment more secure by telling Kubernetes to run the container as this user:
securityContext:
  runAsUser: 999
Additionally, as you saw above, we have yet to provision a volume claim template for our datadir volume. In this example we use a plain node-local filesystem. You will need to provision persistent volumes and adjust your volume claim template depending on your environment. The important part is that we require access to exclusive read/write volumes (i.e. the ReadWriteOnce access mode):
volumeClaimTemplates:
- metadata:
    name: datadir
  spec:
    accessModes: [ "ReadWriteOnce" ]
    volumeMode: Filesystem
    resources:
      requests:
        storage: 10Gi
    storageClassName: manual
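Because the claim template references the manual storage class, something has to supply matching volumes. As a sketch for a single-node test environment only (the volume name, host path and capacity below are illustrative, and you would need one such volume per replica), a node-local hostPath PersistentVolume could look like this:

apiVersion: v1
kind: PersistentVolume
metadata:
  name: qdb-data-0
spec:
  capacity:
    storage: 10Gi
  accessModes:
  - ReadWriteOnce
  volumeMode: Filesystem
  storageClassName: manual
  persistentVolumeReclaimPolicy: Retain
  hostPath:
    path: /mnt/data/qdb-0

In a cloud environment you would typically rely on a dynamic provisioner and its storage class instead of creating volumes by hand.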
Launching the cluster
Assuming you saved the template in quasardb.yaml, we can simply launch our 3-node cluster as follows:
$ kubectl apply -f quasardb.yaml
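Because of the OrderedReady policy, the pods are created one at a time, each waiting for its predecessor to pass its readiness probe. One way to follow the rollout (assuming the default namespace) is:

$ kubectl get pods -l app=qdb -w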
You will now see your cluster launch. We can verify by launching an ad-hoc pod for the QuasarDB shell:
$ kubectl run --generator=run-pod/v1 -ti qdbsh --image=bureau14/qdbsh:3.7 -- --cluster qdb://qdb-cs:2836/
If everything worked correctly, you will now see a qdbsh prompt. Note how the client connects to one single entrypoint, our client-service, and is able to directly communicate with our headless service endpoints from there.
A note about node discovery
We conveniently left this topic out of the discussion above, but it's worth mentioning: how do nodes discover each other in this configuration, and what is the topology?
As mentioned, we rely on the StatefulSet's deterministic pod identities: in a 3-node cluster, each node can predict the identity of the other 2 nodes. To make this work, we added some special "glue" to our official Docker container:
# Only run when launched inside a Kubernetes StatefulSet
if [[ ! -z ${K8S_REPLICA_COUNT} ]]
then
  HOST=$(hostname -s)
  DOMAIN=$(hostname -d)

  # StatefulSet pods are named <name>-<ordinal>; extract both parts
  if [[ $HOST =~ (.*)-([0-9]+)$ ]]
  then
    NAME=${BASH_REMATCH[1]}
    ORD=${BASH_REMATCH[2]}

    # QuasarDB node ids are 1-based: pod ordinal N becomes node (N+1)/<replica count>
    NODE_OFFSET=$((ORD + 1))
    NODE_ID="${NODE_OFFSET}/${K8S_REPLICA_COUNT}"

    echo "Setting node id to ${NODE_ID}"
    patch_conf ".local.chord.node_id" "\"${NODE_ID}\""

    BOOTSTRAP_PEERS=$(bootstrap_peers ${DOMAIN} ${NAME} ${ORD})
    echo "Setting bootstrap peers to ${BOOTSTRAP_PEERS}"
    patch_conf ".local.chord.bootstrapping_peers" "${BOOTSTRAP_PEERS}"
  fi
fi
What happens here is the following:
- If our Docker container detects that the K8S_REPLICA_COUNT environment variable is set, it assumes it is running inside a Kubernetes StatefulSet;
- It looks up the hostname and domain name of the current pod's complete DNS name, e.g. qdb-1.qdb-hs.default.svc.cluster.local;
- Based on the first host part, qdb-1, the pod determines its own position inside the StatefulSet; as the numbering starts with 0, qdb-1 would be position 2 in our cluster;
- Based on this position and the total cluster size, we can assign this node an id of 2/3;
- Based on this position, we can also determine the hostnames of all other pods in our StatefulSet, i.e. qdb-0 and qdb-2;
- We can then automatically patch our configuration:
  - Set our own node id to 2/3;
  - Set our bootstrapping peers appropriately.
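To make this concrete: for pod qdb-1 in a 3-replica StatefulSet, the two patch_conf calls would end up setting values roughly like the following. This is only a sketch; the surrounding layout of the QuasarDB configuration file is simplified here, and the peer IP is illustrative (the script resolves the qdb-0 hostname to its actual pod IP):

{
  "local": {
    "chord": {
      "node_id": "2/3",
      "bootstrapping_peers": ["10.42.0.5:2836"]
    }
  }
}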
We calculate our bootstrap peers by using the following helper function:
# Build a JSON-style array of "<ip>:2836" entries for every replica with a lower ordinal
function bootstrap_peers {
  DOMAIN=$1
  HOSTNAME=$2
  THIS_REPLICA=$3

  RET="["

  # Iterate over all replicas "before" ours, from THIS_REPLICA-1 down to 0
  for ((i=(${THIS_REPLICA} - 1); i>=0; i--))
  do
    # Add a separator for every entry after the first
    if [[ ! "${RET}" == "[" ]]
    then
      RET="${RET}, "
    fi

    THIS_HOST="${HOSTNAME}-${i}.${DOMAIN}"
    THIS_IP=$(host_to_ip ${THIS_HOST})

    RET="${RET}\"${THIS_IP}:2836\""
  done

  RET="${RET}]"
  echo ${RET}
}
As you can see, our bootstrapping function looks for all nodes "prior" to its own position in the cluster. To illustrate, this is how our three nodes would bootstrap:
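qdb-0: no bootstrap peers, starts on its own
qdb-1: bootstraps from qdb-0
qdb-2: bootstraps from qdb-1 and qdb-0

(This listing is inferred from the bootstrap_peers function above; hostnames are shown for readability, whereas the script actually resolves them to pod IPs.)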
By leveraging Kubernetes StatefulSets and adding a little bit of orchestration glue in a Docker container, we get automatic discovery of nodes.
Conclusion
We hope this article helps you get started with QuasarDB on Kubernetes. If this piqued your interest and you would like to discuss your configuration in more detail, feel free to contact one of our solution architects, who will be more than happy to discuss your needs.
Full template
For completeness, here is the entire template we implemented in this article:
apiVersion: v1
kind: Service
metadata:
  name: qdb-hs
  labels:
    app: qdb
spec:
  ports:
  - port: 2836
    name: client
  - port: 2837
    name: control
  clusterIP: None
  selector:
    app: qdb
---
apiVersion: v1
kind: Service
metadata:
  name: qdb-cs
  labels:
    app: qdb
spec:
  ports:
  - port: 2836
    name: client
  - port: 2837
    name: control
  selector:
    app: qdb
---
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: quasardb
spec:
  selector:
    matchLabels:
      app: qdb
  serviceName: qdb-hs
  replicas: 3
  podManagementPolicy: OrderedReady
  template:
    metadata:
      labels:
        app: qdb
    spec:
      volumes:
      - name: secrets
        secret:
          secretName: qdb-secrets
      containers:
      - name: qdbd
        image: bureau14/qdb:3.7
        imagePullPolicy: "Always"
        ports:
        - containerPort: 2836
          name: client
        - containerPort: 2837
          name: control
        resources:
          limits:
            memory: "2Gi"
        volumeMounts:
        - name: datadir
          mountPath: /opt/qdb/db
        - name: secrets
          mountPath: "/var/secret/"
          readOnly: true
        env:
        - name: QDB_LICENSE_FILE
          value: "/var/secret/quasardb.key"
        - name: K8S_REPLICA_COUNT
          value: "3"
        readinessProbe:
          tcpSocket:
            port: 2836
          initialDelaySeconds: 5
          periodSeconds: 10
        livenessProbe:
          tcpSocket:
            port: 2836
          initialDelaySeconds: 15
          periodSeconds: 20
        securityContext:
          runAsUser: 999
  volumeClaimTemplates:
  - metadata:
      name: datadir
    spec:
      accessModes: [ "ReadWriteOnce" ]
      volumeMode: Filesystem
      resources:
        requests:
          storage: 10Gi
      storageClassName: manual