Bythos platform updates, March 2024


This post will describe the current status and foreseeable direction of the Bythos platform.

Where are we at?

Bythos Platform Services

The image above represents the current deployment of Bythos running on our lab server, the Kubercraft. It is a good foundation to work with for deploying open source applications in your own private cloud. There is still a lot of work to do to integrate all of these services together, and we have many more platform services to add, but it works beautifully so far!

Let’s review what we have here.

Base

Kubernetes - We chose Kubernetes as the base for the platform because of the community that surrounds it. If you don’t know about the Cloud Native Computing Foundation (CNCF), it’s worth becoming familiar with. Our hope is that one day, the Bythos platform will be accepted as a CNCF project to accelerate its growth and adoption.

containerd - This is the default, industry-standard container runtime that comes with K3s. It’s possible to use cri-dockerd, but that will only be useful if you have existing Docker-centric workflows that you need to quickly migrate over. The better long-term solution is adopting Kubernetes-native workflows for your applications.

K3s - Lightweight Kubernetes. Easy to install, easy to upgrade, half the memory, all in a single Go binary of less than 100 MB.

Flannel - Default for K3s. Flannel is a lightweight provider of layer 3 network fabric that implements the Kubernetes Container Network Interface (CNI). It’s possible to disable Flannel and use another CNI plugin, such as Calico or Cilium, for more advanced networking options; we will be testing those in the future.

CoreDNS - Default cluster DNS provider for K3s. We don’t have any plans to attempt to replace this. It serves the cluster.local domain for internal service name resolution, unless you change that domain. External DNS (mentioned later) is used for external-facing services on our custom domain name.

Helm - The standard package manager for Kubernetes. We look for Helm charts to deploy services and applications into our cluster if they are available. It works great with Flux. We may start building Helm charts for applications that don’t have them yet if it makes deploying that open source project easier on our platform.
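
For example, deploying a service from a Helm chart through Flux comes down to two small manifests in the Git repo. This is only a rough sketch; the chart shown and the version range are placeholders, and the exact API versions depend on your Flux release:

apiVersion: source.toolkit.fluxcd.io/v1beta2
kind: HelmRepository
metadata:
  name: ingress-nginx
  namespace: flux-system
spec:
  interval: 1h
  url: https://kubernetes.github.io/ingress-nginx
---
apiVersion: helm.toolkit.fluxcd.io/v2beta1
kind: HelmRelease
metadata:
  name: ingress-nginx
  namespace: ingress-nginx
spec:
  interval: 1h
  chart:
    spec:
      chart: ingress-nginx
      version: "4.x"   # semver range; Flux upgrades within it automatically
      sourceRef:
        kind: HelmRepository
        name: ingress-nginx
        namespace: flux-system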

Controllers

Flux - Open and extensible continuous delivery solution for Kubernetes. GitOps for apps and infrastructure. The base of the Bythos platform is just Infrastructure as Code (IaC) stored in a Git repo outside the cluster. For now, we use flux bootstrap to deploy Bythos, pointing it at the Git repo. Any changes pushed to the repo are automatically synced with the cluster by the Flux controllers. It also automatically upgrades installed applications when new releases are available.
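
Under the hood, the bootstrap boils down to two resources that flux bootstrap generates and commits to the repo, roughly like the following; the repo URL and path are placeholders:

apiVersion: source.toolkit.fluxcd.io/v1
kind: GitRepository
metadata:
  name: flux-system
  namespace: flux-system
spec:
  interval: 1m
  url: ssh://git@gitlab.example.com/bythos/infrastructure.git   # placeholder repo URL
  ref:
    branch: main
  secretRef:
    name: flux-system   # SSH key created by flux bootstrap
---
apiVersion: kustomize.toolkit.fluxcd.io/v1
kind: Kustomization
metadata:
  name: flux-system
  namespace: flux-system
spec:
  interval: 10m
  path: ./clusters/kubercraft   # placeholder path within the repo
  prune: true                   # remove cluster objects that are deleted from Git
  sourceRef:
    kind: GitRepository
    name: flux-system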

There was a little shake-up recently with Weaveworks shutting down, but we’re happy to report that the Flux project and its maintainers are relatively unaffected. We were using the Weaveworks GitOps UI for observing the Flux status, but the future of that project is uncertain at the time of this writing. Because of that, we are currently testing out an alternative named Capacitor, but there are other ways to keep tabs on Flux if needed. We also like K9s and the Grafana dashboard.

An alternative to Flux that we may test in the future against our infrastructure code base is ArgoCD. While there are many websites that compare the two, we didn’t have any requirements to use one over the other; we just had to make a choice. Flux has worked great for us, and we don’t see a reason to switch, except to test Argo projects for comparison and to offer two different options to our users.

Ingress-NGINX - K3s comes with Traefik as an Ingress controller, but we disable that in the server configuration. Ingresses expose cluster services to the world outside the cluster, which could just be your LAN. There are several alternatives that we would like to test here and offer as options to our users. If you want to use Traefik, it’s recommended to install and manage it separately in your GitOps repo, as opposed to enabling the option in K3s.
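
For reference, that change is a single entry in the K3s server configuration, roughly like this (assuming the default config file location):

# /etc/rancher/k3s/config.yaml on the server node
disable:
  - traefik   # we deploy our own Ingress controller through Flux instead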

API gateways, such as Emissary-Ingress, can be used in place of Ingress controllers for more advanced traffic management. We chose Ingress-NGINX for its simplicity.

Note: NGINX Ingress Controller is actually a different product, created by NGINX/F5.

cert-manager - Automated issuance and renewal of the wildcard TLS certificate via Let’s Encrypt for our custom domain name. We use that as the default certificate for all applications with an Ingress. It connects to our account in Cloudflare via API key for DNS validation at issuance and renewal time.
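
A minimal sketch of that setup is a ClusterIssuer for the ACME account plus a Certificate for the wildcard. The domain, email, and secret names below are placeholders, and this shows a Cloudflare API token secret; the older global API key works similarly:

apiVersion: cert-manager.io/v1
kind: ClusterIssuer
metadata:
  name: letsencrypt-dns
spec:
  acme:
    email: admin@example.com
    server: https://acme-v02.api.letsencrypt.org/directory
    privateKeySecretRef:
      name: letsencrypt-account-key
    solvers:
      - dns01:
          cloudflare:
            apiTokenSecretRef:
              name: cloudflare-api-token
              key: api-token
---
apiVersion: cert-manager.io/v1
kind: Certificate
metadata:
  name: wildcard-example-com
  namespace: default
spec:
  secretName: wildcard-example-com-tls   # the default certificate referenced by the Ingress controller
  issuerRef:
    name: letsencrypt-dns
    kind: ClusterIssuer
  dnsNames:
    - "example.com"
    - "*.example.com"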

We are also using cert-manager as an on-cluster certificate authority (CA) for Linkerd, which needs it to automatically rotate the certificates that secure traffic between pods.

External DNS - This is set up to automatically sync DNS records in our Cloudflare account via API key for our custom domain name whenever an Ingress is created or removed from the cluster.
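
To give a sense of the configuration, the relevant slice of an External DNS Deployment looks roughly like this. It is a sketch only: the domain is a placeholder, credentials come from a Secret, and it is shown with an API token, though an API key plus account email is also supported:

containers:
  - name: external-dns
    image: registry.k8s.io/external-dns/external-dns:v0.14.0
    args:
      - --source=ingress              # watch Ingresses for hostnames
      - --provider=cloudflare
      - --domain-filter=example.com   # only manage records for our domain
      - --policy=sync                 # remove records when an Ingress is deleted
    env:
      - name: CF_API_TOKEN            # Cloudflare credentials
        valueFrom:
          secretKeyRef:
            name: cloudflare-api-token
            key: api-token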

Sealed Secrets - This provides a safe way to store secrets, such as passwords, in our Git repository for the GitOps workflow. It allows Flux to manage secrets for us instead of manual intervention.
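
For example, the sealed version of the database credentials used later in the CloudNativePG example could look something like this; the ciphertext values are truncated placeholders produced by the kubeseal CLI, and only the controller inside the cluster can decrypt them:

apiVersion: bitnami.com/v1alpha1
kind: SealedSecret
metadata:
  name: gitlab-db-secret
  namespace: gitlab
spec:
  encryptedData:
    username: AgBy3i4OJSWK...   # placeholder ciphertext
    password: AgCtr7fh92Mdx...  # placeholder ciphertext
  template:
    metadata:
      name: gitlab-db-secret
      namespace: gitlab
    type: kubernetes.io/basic-auth   # the unsealed Secret's type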

Single Sign-on

Authentik - We had started with Keycloak and OAuth2 Proxy for SSO, as those are more traditional. After testing Authentik, which has been much simpler to roll out, we’re sticking with it for now. The Keycloak and OAuth2 Proxy options will remain in the platform in case anyone wants to use them.

Dev & Deploy

Backstage - It’s a framework for building internal developer portals, which is exactly what we needed in order to not start from scratch building our own user interface for Bythos. Originally created at Spotify, and adopted by thousands of other companies, the ecosystem surrounding it is vibrant and growing. It really is the trusted standard toolbox (read: UX layer) for the open source infrastructure landscape. We can’t wait to further develop our platform with Backstage.

GitLab - We are using the Community Edition, since the Enterprise Edition is not open source. There are still many cool features to GitLab CE we want to explore, such as Auto DevOps, Web IDE, Container Registry, and the Kubernetes Agent (for local deployments), to name a few.

Eventually, we will have an alternate stack to offer users to avoid vendor lock-in. For example, Gitea + Drone + Harbor is used frequently instead of GitLab. But for now, using GitLab is reducing our time to market since it has so many features packed into one application.

Resource consumption is a concern for us as we intend to build a lightweight platform.

GitLab memory usage

Storage

Rook / Ceph - Inside the Kubercraft are four SSDs. We had originally set up the 8TB disks in another mirrored zpool, just like we are doing with the M.2 NVMe devices. However, using Rook / Ceph instead of ZFS gives us a few advantages.

Primarily, since Ceph is a distributed cluster storage system, we can add more servers into our Kubernetes cluster to expand the storage space, while maintaining a single interface for all persistent storage volumes. We can also provision file, block, or object storage volumes for pods, whereas our previous ZFS setup only gave us file-based storage.

While Ceph is generally installed on a minimum of three servers, it works just fine for us on a single server with two SSDs. Rook is the Kubernetes operator for Ceph, which simplifies deploying and managing Ceph inside Kubernetes.
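
On a single node, the main adjustment is telling Ceph to spread replicas across OSDs (disks) instead of hosts. A block pool for the two SSDs might look roughly like the sketch below, with a StorageClass backed by the Rook CSI driver sitting on top of it; the pool name is a placeholder:

apiVersion: ceph.rook.io/v1
kind: CephBlockPool
metadata:
  name: replicated-pool
  namespace: rook-ceph
spec:
  failureDomain: osd          # single node, so replicate across disks rather than hosts
  replicated:
    size: 2                   # one copy per SSD
    requireSafeReplicaSize: true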

Mail Services

Proton Mail Bridge - We are using this as a simple SMTP gateway that our cluster sends notifications through. We use Proton services for our business email accounts, and the bridge allows us to receive end-to-end encrypted notifications in our off-site inbox.

Unfortunately, this solution does not scale well and requires manual intervention from time to time. We will be testing different SMTP relays that offer more robust solutions, like Haraka and Postal, and possibly full-featured mail systems, such as Mailu.

Database

CloudNativePG - Using a cluster controller for Postgres databases allows us to easily define them in our GitOps workflow. Both Authentik and GitLab are using databases created via CNPG. While they are defined as Postgres clusters and generally deployed in groups of three instances, we currently only have a single server, so there is no replication.

Here is an example:

apiVersion: postgresql.cnpg.io/v1
kind: Cluster
metadata:
  name: gitlab-db
  namespace: gitlab
spec:
  instances: 1   # single instance for now; no replication on a one-server cluster
  bootstrap:
    initdb:
      database: gitlab-db
      owner: gitlab-admin
      secret:
        name: gitlab-db-secret   # Secret holding credentials for the database owner
  storage:
    size: 1Gi

Redis - Authentik and GitLab also use Redis. While there is a Redis Enterprise Operator for Kubernetes, we’re not going to put commercially-licensed software into our platform. We will be looking into creating “Redis as a Service” with Backstage as a completely free, open source alternative.

When new applications require different databases, like MySQL, we will implement more cluster controllers to support those databases.

Update: Due to the recent licensing change by Redis, we will be testing alternatives for this platform.

Observability

Prometheus, Grafana, Grafana Loki - These are grouped together because we used the flux2-monitoring-example repo for our deployment. It combines the kube-prometheus-stack by Prometheus, the loki-stack by Grafana, and some additional configuration for Flux.

Prometheus and Grafana are both widely used as standard monitoring tools. Loki is for log aggregation, and we have Loki dashboards in Grafana to review cluster events. These three tools provide a good foundation for observability, but we have more to come in the future.
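
As one concrete piece of that setup, Flux’s own controllers are scraped with a PodMonitor along these lines; this is a sketch of the pattern used in the monitoring example, so check that repo for the maintained version:

apiVersion: monitoring.coreos.com/v1
kind: PodMonitor
metadata:
  name: flux-system
  namespace: flux-system
spec:
  namespaceSelector:
    matchNames:
      - flux-system
  selector:
    matchExpressions:
      - key: app
        operator: In
        values:
          - source-controller
          - kustomize-controller
          - helm-controller
          - notification-controller
  podMetricsEndpoints:
    - port: http-prom   # metrics port exposed by the Flux controllers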

Linkerd - This application offers more than cluster observability. It also provides mTLS encryption among meshed pods, traffic shaping, and other advanced networking capabilities.

We chose Linkerd over other alternatives for its simplicity and lightweight footprint. Two of those alternatives that we may test and provide as options for our users in the future are Istio and Traefik Mesh.
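
Meshing workloads is mostly a matter of annotations. For example, annotating a namespace (a placeholder name below) gets every pod created in it a Linkerd sidecar, and traffic between meshed pods is then encrypted with mTLS automatically:

apiVersion: v1
kind: Namespace
metadata:
  name: my-app   # placeholder; any namespace you want meshed
  annotations:
    linkerd.io/inject: enabled   # Linkerd injects its sidecar proxy into pods in this namespace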

External Services

Cloudflare - We register domains with Cloudflare and then use their free DNS service for each domain. It integrates well with our cluster services, such as cert-manager and External DNS. Their prices are also low compared to the other domain registrars we looked at.

LXD - We run LXD alongside K3s at the OS level on the Kubercraft. We currently use two containers: Gitea for the GitOps repo that Flux works with, and a development/testing K3s installation. We are always on the lookout for ways to bridge the gap between LXD and Kubernetes in this system to have a centralized interface to manage everything.

Gitea - As previously mentioned, this is running in an LXD container external to Kubernetes to provide the GitOps repo for Flux (as opposed to storing our cluster configuration in GitHub).

We have a Service and Ingress defined in Kubernetes which points to this installation of Gitea. That allows us to take advantage of the wildcard TLS certificate managed by cert-manager, as well as automated DNS management with External DNS.

It’s actually set up as the backup repo for now. We currently have Flux pointed at GitLab to explore whether there are any integration benefits. We make changes locally in VS Code, push to GitLab running in the Kubernetes cluster, and then GitLab automatically does a mirror push to Gitea for backup.


With the above-mentioned platform services in place, we are going to shift our focus to finishing the integration of everything there, developing Backstage as our UI, and adding user applications from the CosmicForge marketplace. We will be migrating data over from another lab server when the relevant applications are ready to use.

Ultimately, we want to see how many applications we can run on the Kubercraft without overloading the little guy. There is still more to come on the platform-side of things.

That’s all for now, so please stay tuned for more updates!