Provisioning Grafana on DigitalOcean Kubernetes Service
https://zai.dev/2024/12/12/provisioning-grafana-on-digitalocean-kubernetes-service/ (Thu, 12 Dec 2024 22:00:10 +0000)

The LGTM stack is my essential observability stack, and deploying the architecture on a vendor-agnostic basis allows me to:

  • Maintain uptime in the event I need to switch providers due to cost increases
  • Re-deploy the stack to a client that has adopted another cloud provider

Anything I find I can monitor or pull metrics from ends up in Grafana. I currently use four data sources, with dashboards and alerts pulling and presenting information from all of them:

  • Prometheus: For monitoring real-time metrics such as CPU usage and weather
  • InfluxDB: For storing and monitoring historical metrics, such as stock market data
  • Loki: For monitoring system logs
  • Elasticsearch: For storing transactions, documents and RSS feeds.

I chose a Managed Kubernetes offering as a basis for deployment, as opposed to virtual machines or self-hosted Kubernetes, for two reasons:

  • Uptime is guaranteed by the vendor
  • I don’t have to maintain a Kubernetes cluster at a systems-level

Deploying a DigitalOcean Kubernetes Cluster

Overview of my DOKS cluster from the DigitalOcean Dashboard, where my cluster is a single Premium AMD node.

Droplet layout

I’m deploying my stack with two dedicated right-sized nodes living in two separate node pools. One is labelled as fixed and will only ever contain one instance, whereas the other is scalable up to a maximum of three. With this design, I accomplish two essentials for the architecture:

  • Cost Efficiency
    • I prevent over-provisioning by right-sizing, and rely on DigitalOcean’s control plane to scale the node pool when necessary, such as when a data source holds a larger dataset in memory
  • Availability & Scalability
    • At least one node should be available at all times, so applications keep running even if a single node fails. This is separate from an HA Control Plane, which offers a different benefit.

Cluster Location

lon1 is about 17km away from where I live, so I’ve deployed my cluster there. The location is not really important for my use case as I plan to do all data ingestion from within, but it’s cool to think it’s just a few streets down from me.

I played with the idea of taking availability even further by deploying to both ams1 and lon1 in a warm-standby style, but that’s a story for another post.

DOKS Cluster Resource in Terraform
resource "digitalocean_kubernetes_cluster" "primary" {
  name = "zai-lon1"
  region = "lon1"
  version = "1.30.4-do.0"
  vpc_uuid = digitalocean_vpc.primary.id
  auto_upgrade = false
  surge_upgrade = true
  ha = false
  registry_integration = true

  node_pool {
    name = "k8s-lon1-dedicated"
    size = "s-1vcpu-1gb-amd"
    node_count = 1
  }

  node_pool {
    name = "k8s-lon1-burst"
    size = "s-1vcpu-1gb-amd"
    node_count = 1
  }

  tags = local.resource_tags

  maintenance_policy {
    day = "saturday"
    start_time = "04:00"
  }
}

Deploying A Load Balancer

Traefik is my load balancer of choice. It’s written in Go, performant, and integrates very well into the Kubernetes ecosystem. That said, I’m not using it as a plain load balancer but as an ‘application gateway’, so that a single IP address can handle routing to many Kubernetes services based on request attributes such as the Host header or the path.

I’m also using it as a front for all of the web services in the cluster, so I can manage TLS certificates for every application from the same place rather than maintaining a unique configuration per application. I’m not so concerned about inter-service TLS at this point; my focus is traffic over the public internet.

Traefik has its own form of an operator that works with the Ingress resource definition within Kubernetes. When an Ingress resource is created, Traefik automatically creates a route based on its specification. This means that I can easily declare that https://grafana.zai.dev on the load balancer will route to the grafana service on port 80.
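As a hedged sketch, the kind of Ingress that Traefik reacts to could be declared from Terraform like this (the Grafana Helm chart later creates an equivalent resource for me, so the names here are purely illustrative):

resource "kubernetes_ingress_v1" "grafana" {
  metadata {
    name = "grafana"
  }

  spec {
    # IngressClass registered by the Traefik Helm chart (assumed default name)
    ingress_class_name = "traefik"

    rule {
      host = "grafana.zai.dev"
      http {
        path {
          path      = "/"
          path_type = "Prefix"
          backend {
            service {
              name = "grafana"
              port {
                number = 80
              }
            }
          }
        }
      }
    }
  }
}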

When a Service of type LoadBalancer is created within the Kubernetes cluster, DigitalOcean’s operator provisions a Load Balancer and bills for it accordingly.

The Traefik Helm chart creates such a Service by default, which causes DigitalOcean to reserve a static public IP address reachable from the public internet. I don’t need to provide any additional configuration.

resource "helm_release" "traefik" {
  name       = "traefik"
  repository = "https://traefik.github.io/charts"
  chart      = "traefik"
  version    = "30.1.0"
}

Deploying an Identity Provider

My public-facing Grafana instance requiring external authentication

Authentication is required since Grafana is accessible from the public internet. With the same mindset as for Traefik, I’d like to centrally control authentication & authorization rather than defining it at a per-application level.

I adopted Keycloak as it acts as an Identity Provider and / or Broker, supporting both OpenID Connect and SAML. OIDC is a common standard across many apps, including Grafana.

I use GitHub as an Identity Provider for Keycloak, and Keycloak as an Identity Provider for Grafana. I take this approach as it:

  • Allows me to integrate more OIDC or SAML compatible applications into my own provider
  • Reduces management of external accounts to a single point (rather than configuring GitHub per-application)
  • Allows me to add additional roles on top of GitHub accounts, which Grafana needs in order to recognize who is an Administrator.
  • Allows me to integrate LDAP in the future

I won’t go through deploying the Keycloak configuration in this post (a future one is coming with more detail on configuring Keycloak), but per the OAuth2 specification, I have available to me the client_id, client_secret, auth_url, token_url and api_url, which I pipe into grafana.ini in the next stage. I could also obtain these details from GitHub directly by creating an OAuth2 application.

Deploying Grafana with Helm

To be agile in my deployments, I’m isolating the Grafana container from its configuration by delivering all of that configuration through ConfigMaps. With that, I can truly version control the running Grafana deployment without worrying about losing stored work such as dashboards.

Provisionable elements, such as Dashboards, Alerts and Datasources, can be loaded into Grafana through its provisioning directory, defined below as /etc/grafana/provisioning. By using Kubernetes ConfigMaps, I can mount my configuration into Grafana from outside the instance itself.

By enabling the sidecar containers, I save myself from maintaining this volumeMount: the sidecars act as operators that watch for ConfigMaps with specific labels (described below), mount them into the Grafana pod and instruct Grafana to reload the configuration without restarting the instance.

Within the grafana.ini file, auth.generic_oauth instructs Grafana how to connect to an identity provider. Here, I pipe in the values received from Keycloak (or GitHub) above. To force credentials to come from Keycloak, I enable the disable_login_form setting.

The $__file{} operator reads a value from a file on disk, allowing me to further protect the OAuth2 credentials by storing them in a Secret. I use HashiCorp Vault, authenticated via ServiceAccount, to protect secrets, but that’s outside the scope of this post.
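I use Vault for the real thing, but a plain Kubernetes Secret mounted through the chart’s extraSecretMounts value would satisfy those $__file{} lookups just as well; a minimal sketch with assumed names and variables:

resource "kubernetes_secret" "grafana_oidc_credentials" {
  metadata {
    # hypothetical Secret name
    name = "grafana-oidc-credentials"
  }

  data = {
    id        = var.oidc_client_id
    secret    = var.oidc_client_secret
    auth_url  = var.oidc_auth_url
    token_url = var.oidc_token_url
    api_url   = var.oidc_api_url
  }
}

# Added to the Grafana chart values so the Secret appears at the path
# referenced by grafana.ini:
#
# extraSecretMounts = [{
#   name       = "oidc-credentials"
#   secretName = kubernetes_secret.grafana_oidc_credentials.metadata[0].name
#   mountPath  = "/etc/secrets/oidc_credentials"
#   readOnly   = true
# }]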

role_attribute_path lets me map user roles defined within Keycloak to Grafana roles, centralizing “how to define an administrator” across multiple applications, while scopes tells Keycloak what data Grafana requires in order to successfully authenticate and authorize.

Finally, ingress is the bridge between the Grafana instance and the load balancer. Within the Helm chart, an Ingress resource will be created that will point to the Service created by the chart, accessible on the domain grafana.zai.dev.

tls provides instructions on how to load the TLS Certificate associated with grafana.zai.dev. In my case, I store the certificate inside a Secret named grafana-tls.

resource "helm_release" "grafana" {
  name       = local.grafana_deployment_name
  repository = local.grafana_repository
  chart      = "grafana"
  version    = var.grafana_chart_version

  values = [
    yamlencode({ "grafana.ini" = {
      analytics = {
        check_for_updates = true
      },
      grafana_net = {
        url = "https://grafana.net"
      },
      log = {
        mode = "console"
      },
      paths = {
        data         = "/var/lib/grafana/",
        logs         = "/var/log/grafana",
        plugins      = "/var/lib/grafana/plugins",
        provisioning = "/etc/grafana/provisioning"
      },
      server = {
        domain   = "grafana.zai.dev",
        root_url = "https://grafana.zai.dev"
      },
      "auth.generic_oauth" = {
        enabled             = true,
        name                = "Keycloak",
        allow_sign_up       = true,
        client_id           = "$__file{/etc/secrets/oidc_credentials/id}",
        client_secret       = "$__file{/etc/secrets/oidc_credentials/secret}",
        disable_login_form  = true
        auth_url            = "$__file{/etc/secrets/oidc_credentials/auth_url}",
        token_url           = "$__file{/etc/secrets/oidc_credentials/token_url}",
        api_url             = "$__file{/etc/secrets/oidc_credentials/api_url}",
        scopes              = "openid profile email offline_access roles",
        role_attribute_path = "contains(realm_access.roles[*], 'admin') && 'Admin' || contains(realm_access.roles[*], 'editor') && 'Editor' || 'Viewer'"
      } 
      },
      "sidecar" = {
        "datasources" = { "enabled" = true },
        "alerts" = { "enabled" = true },
        "dashboards" = { "enabled" = true }
      },
      ingress = {
        enabled = true,
        hosts   = ["grafana.zai.dev"]
        tls = [
          {
            secretName = "grafana-tls",
            hosts = ["grafana.zai.dev"]
          }
        ]
      },
      assertNoLeakedSecrets = false,
    })
  ]
}
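The grafana-tls Secret referenced by the ingress values above is a standard TLS Secret. Certificate issuance is outside the scope of this post, but as a minimal sketch (the certificate file paths are placeholders):

resource "kubernetes_secret" "grafana_tls" {
  metadata {
    name = "grafana-tls"
  }

  type = "kubernetes.io/tls"

  data = {
    # placeholder paths to a certificate valid for grafana.zai.dev
    "tls.crt" = file("${path.module}/certs/grafana.zai.dev.crt")
    "tls.key" = file("${path.module}/certs/grafana.zai.dev.key")
  }
}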

Deploying Datasources for Grafana

PersistentVolumes are key to reliability. Without them, each data source has nowhere to store its data across crashes or reboots. By default, the Helm chart for each data source creates a PersistentVolumeClaim and relies on an external factor, human or automated, to create a matching PersistentVolume.

DigitalOcean’s Operator will create a Volume / Block Store whenever a PersistentVolumeClaim resource is created with any do-* storageClass.

By default, DOKS clusters have do-block-storage as a default storage class for PVCs. Once the block storage has been created, the operator will then create a PersistentVolume with matching labels so that the internal Kubernetes operator can take care of the binding between PVs and PVCs natively.
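The Helm charts create their claims for me, but as a hedged illustration, any claim of this shape (name and size assumed) triggers the provisioning described above:

resource "kubernetes_persistent_volume_claim" "example" {
  metadata {
    name = "example-data"
  }

  spec {
    access_modes       = ["ReadWriteOnce"]
    storage_class_name = "do-block-storage"

    resources {
      requests = {
        storage = "4Gi"
      }
    }
  }
}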

Deploying Prometheus

Prometheus is ideal for alerting on real-time numeric metrics, and doesn’t require much configuration for a small setup. The Helm chart includes the core Prometheus stack: Alertmanager, the Pushgateway and a node metrics exporter.

The chart also ships with Kubernetes service discovery: the default scrape configuration looks for prometheus.io/* annotations on Services and instructs Prometheus to start scraping any Service that carries them. At a minimum, these annotations look like:

  • prometheus.io/scrape=true
    • Tells prometheus to actively scrape this Service
  • prometheus.io/path=/metrics
    • Prometheus scrapes on HTTP. It will request this path.
  • prometheus.io/port=9090
    • Prometheus will connect to an HTTP server on this port within the Service’s Endpoints

This means that I don’t have to modify the Prometheus configuration directly when expanding the services that my Kubernetes cluster is hosting. By simply adding these annotations to any new Service that exposes metrics in the Prometheus format, I immediately get its data visible within Grafana.
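A hedged sketch of what that looks like for a hypothetical exporter Service (the name and port are assumptions):

resource "kubernetes_service" "my_exporter" {
  metadata {
    name = "my-exporter"
    annotations = {
      "prometheus.io/scrape" = "true"
      "prometheus.io/path"   = "/metrics"
      "prometheus.io/port"   = "9090"
    }
  }

  spec {
    selector = {
      app = "my-exporter"
    }

    port {
      port        = 9090
      target_port = 9090
    }
  }
}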

resource "helm_release" "prometheus" {
  name       = "prometheus"
  repository = "https://prometheus-community.github.io/helm-charts"
  chart      = "prometheus"
  version    = "25.26.0"
}

Deploying Elasticsearch

Elasticsearch is great for analyzing documents and transactions where the data type varies; at its core, it’s a search engine. I use this data source for analyzing articles and stock market transactions.

My first problem was how resource-hungry Elasticsearch is by nature. I had to dial down its memory usage to match the amount of content I was putting through it. A 512MB heap appears to be the right number for it to function, as 256MB causes it to fail to initialize. Increasing this value alongside the replicas value would give me higher availability.

Because of the 512MB requirement, I had to upsize my Kubernetes node, as the cluster would otherwise report insufficient memory to schedule Elasticsearch.

To get data into Elasticsearch, I use the Telegraf Elasticsearch output plugin and connect the input to either RabbitMQ, a WebSocket or an HTTP polling feed. When I’m generating data through Python or Node.js, I don’t push it directly from the code; I push it through RabbitMQ for Telegraf to handle. I do this so I can throttle the amount of data flowing into Elasticsearch that might otherwise take the service down.

resource "helm_release" "elasticsearch" {
  name       = "elasticsearch"
  repository = "https://helm.elastic.co"
  chart      = "elasticsearch"
  version    = "8.5.1"

  set {
    name  = "replicas"
    value = 1
  }

  set {
    name = "resources.requests.memory"
    value = "1Gi"
  }

  set {
    name = "resources.limits.memory"
    value = "1Gi"
  }

  set {
    name = "heapSize"
    value = "512Mi"
  }

  set {
    name  = "minimumMasterNodes"
    value = 1
  }

  set {
    name  = "volumeClaimTemplate.resources.requests.storage"
    value = "4Gi"
  }

  set {
    name = "cluster.initialMasterNodes"
    value = "elasticsearch-master"
  }
}
  • cluster.initialMasterNodes is needed in this Helm chart as it instructs Elasticsearch to “find itself”. elasticsearch-master is the name of the Kubernetes Service that gets created; kube-dns will return the IP address of the Elasticsearch instance whenever elasticsearch-master is requested.
  • I restrict the size of the DigitalOcean volume through volumeClaimTemplate.resources.requests.storage, as by default it’s around 20Gi.
  • minimumMasterNodes and replicas are restricted to 1 as I don’t need more than one instance of Elasticsearch. If I increase the number of replicas and begin to shard, Grafana shouldn’t need additional configuration to cater for that.

Deploying InfluxDB

InfluxDB is my time-series database of choice when working with historical data that will need batch processing at some point (e.g. Grafana Alerting), such as Apple HealthKit and stock market data. Flux, InfluxDB’s query language, is extremely powerful in comparison to PromQL, but with that extra power comes a performance hit.

I also use Telegraf to ingest data into InfluxDB, with inputs pointing solely at RabbitMQ. I use Node.js to listen to WebSocket streams and push data points to RabbitMQ for ingestion. Because of the amount of streaming data I plan to put into InfluxDB, I set persistence.size to a generous 12Gi.

As the chart hadn’t been updated in a while and pinned an image tag that was causing me some errors, I manually set image.tag to the latest available version.

resource "helm_release" "influx" {
  name = "influxdb"
  repository = "https://helm.influxdata.com/"
  chart = "influxdb2"
  version = "2.1.2"

  set {
    name = "image.tag"
    value = "2.7.10"
  }

  set {
    name = "persistence.size"
    value = "12Gi"
  }
}

Deploying Loki

Loki is the most complex to configure, but I find it the most intuitive (within Grafana) as a way to store system and application logs. I deploy it in a single-binary configuration, and use DigitalOcean Spaces as the backend storage for the logs themselves. Relying on block storage could prove problematic, as millions of messages would require constantly re-provisioning larger volumes.

resource "helm_release" "loki" {
  name = "loki"
  repository = "https://grafana.github.io/helm-charts"
  chart = "loki"
  version = "6.18.0"

  values = [
    yamlencode({
        loki = {
          commonConfig = {
            replication_factor = 1
          }
          storage = {
            type = "s3"
            bucketNames = {
              chunks = "<SPACES_BUCKET>",
              ruler = "<SPACES_BUCKET>",
              admin = "<SPACES_BUCKET>",
            },
            s3 = {
              s3 = "s3://<SPACES_URL>",
              endpoint = "lon1.digitaloceanspaces.com",
              region = "lon1",
              secretAccessKey = "<SPACES_KEY>",
              accessKeyId = "<SPACES_ID>",
            }
          }
          schemaConfig = {
            configs = [
              {
                from = "2024-04-01",
                store = "tsdb",
                object_store = "s3",
                schema = "v13",
                index = {
                  "prefix" = "loki_index_",
                  "period" = "24h"
                }
              }
            ]
          },
        },
        deploymentMode = "SingleBinary",
        backend = { replicas = 0 },
        read = { replicas = 0 },
        write = { replicas = 0 },
        singleBinary = { replicas = 1 },
        chunksCache = { allocatedMemory = 2048 }
      })
  ]
}

Pushing logs to Loki

Loki exposes an API endpoint for pushing logs to, similar in spirit to the Prometheus Pushgateway. One tool, Promtail, will follow the container logs created by every pod in a Kubernetes cluster and stream them to the Loki push API.

resource "helm_release" "promtail" {
  name = "promtail"
  repository = "https://grafana.github.io/helm-charts"
  chart = "promtail"
  version = "6.16.6"

  values = [
    yamlencode({
        config = {
          clients = [{url = "http://loki-gateway/loki/api/v1/push", tenant_id = "zai"}]
        }
      })
  ]
}
  • loki-gateway is the default name of the Kubernetes Service created by the Loki helm chart. The kube-dns service will return the Endpoint IP Address of the Loki instance.

Deploying Provisioned Components for Grafana

Deploying Grafana with sidecar containers provisions watchers that listen for ConfigMaps carrying specific labels for Dashboards, Alerts and Datasources. Put simply, each sidecar takes the contents of the ConfigMap and places them into Grafana’s provisioning directory.

Grafana’s provisioning directory is defined by paths.provisioning within grafana.ini, which can be injected upon deploying the Grafana helm chart within the "grafana.ini" key. In my case, this path is /etc/grafana/provisioning.

Grafana natively reads its provisioning directory and loads the contents into the instance, regardless of whether it’s containerized or running directly on the host.

Provisioning Dashboards

For dashboards, a label of grafana_dashboard needs to exist, but the value is irrelevant. I use templatefile() to load the file as a string into main.json. This will allow me in the future to handle the renaming of data sources used within a dashboard, or to manipulate a dashboard directly from Terraform.

I design dashboards directly within Grafana, export them as JSON and store them alongside the Terraform module for use by the ConfigMap. In the following resource, my exported dashboard will end up under /etc/grafana/provisioning/main.json.

Within the export menu, Grafana does provide the option to export using HCL (Terraform). I don’t opt for this option as this requires Grafana to be up and running in order to execute the resource. With the approach of declaring Dashboards via ConfigMap, I can re-deploy the dashboard in one go and remove the direct dependency on the Grafana instance running.

resource "kubernetes_config_map" "grafana_dashboards" {
  metadata {
    name = "grafana-dashboards"
    labels = {
      grafana_dashboard = "1"
    }
  }

  data = {
    "main.json" = templatefile("/path/to/dashboard.json", {})
  }
}
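As a sketch of that future flexibility, a templated variant might pass a data source name through to the dashboard; the ${datasource} placeholder inside the exported JSON is my own assumption, as Grafana’s export doesn’t add one by itself:

resource "kubernetes_config_map" "grafana_dashboards_templated" {
  metadata {
    name = "grafana-dashboards-templated"
    labels = {
      grafana_dashboard = "1"
    }
  }

  data = {
    # substitutes ${datasource} tokens inside the exported dashboard JSON
    "main.json" = templatefile("/path/to/dashboard.json", {
      datasource = var.prometheus_deployment_name
    })
  }
}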

Provisioning Alerts

I follow the same approach as above for declaring alerts, with the exception that grafana_alert is the expected label from the sidecar.


resource "kubernetes_config_map" "grafana_alerts" {
  metadata {
    name = "grafana-alerts"
    labels = {
      grafana_alert = "1"
    }
  }

  data = {
    "alerts.json" = templatefile("/path/to/alert.json", {})
  }
}

Provisioning Data-sources

I build the configuration myself when it comes to data sources. The specification varies between each data source. Thanks to using Terraform to deploy each data source, I can re-use the variables used to define the Service names of each data source so that Grafana can find them correctly.

Provisioning Prometheus as a data source

resource "kubernetes_config_map" "prometheus_grafana_discovery" {
  metadata {
    name = "prometheus-grafana-datasource"
    labels = {
      grafana_datasource = "prometheus"
    }
  }

  data = {
    "prometheus.yml" = yamlencode({
        apiVersion = 1,
        datasources = [
          {
            name = var.prometheus_deployment_name,
            type = "prometheus",
            url = "http://${var.prometheus_deployment_name}.${helm_release.prometheus.namespace}.svc.cluster.local",
            access = "proxy"
          }
        ]
    })
  }
}

With the above resource declared in Kubernetes, I then just adjust the datasources = [] list to match the following specification for each data source:

Specification for Loki

"apiVersion": 1
"datasources":
- "jsonData":
    "httpHeaderName1": "X-Scope-OrgID"
  "name": "prometheus-server"
  "secureJsonData":
    "httpHeaderValue1": "1"
  "type": "loki"
  "url": "http://loki.default.svc.cluster.local"
  • X-Scope-OrgID injects a tenant / organization ID into the HTTP header so that Loki accepts queries from Grafana; its value needs to line up with the tenant_id that Promtail pushes with.
  • loki is the default name of the Kubernetes Service created by the Helm chart

Specification for Elasticsearch

Elasticsearch needs one datasource declaration per index (if splitting the data by index). I create an index for each source of data being ingested into Elastic, and suffix it with the date of ingestion.

For authentication, I use the password for the elastic user as defined by the Helm chart. By default, this is randomly generated and stored within a Secret. I also set the tlsSkipVerify flag, as additional configuration would be needed for Elasticsearch to present a TLS certificate that Grafana trusts. Since the traffic is internal, I’m not that concerned about this at this point.

elasticsearch-master is the default name of the service created by the Helm chart.

"apiVersion": 1
"datasources":
- "basicAuth": true
  "basicAuthUser": "elastic"
  "jsonData":
    "index": "twelvedata-*"
    "timeField": "@timestamp"
    "tlsSkipVerify": true
  "name": "Elasticsearch (Twelve Data)"
  "secureJsonData":
    "basicAuthPassword": "<ELASTIC_PASSWORD>"
  "type": "elasticsearch"
  "url": "https://elasticsearch-master:9200"
- "basicAuth": true
  "basicAuthUser": "elastic"
  "jsonData":
    "index": "coinbase-*"
    "timeField": "@timestamp"
    "tlsSkipVerify": true
  "name": "Elasticsearch (Coinbase)"
  "secureJsonData":
    "basicAuthPassword": "<ELASTIC_PASSWORD>"
  "type": "elasticsearch"
  "url": "https://elasticsearch-master:9200"

Specification for InfluxDB

"apiVersion": 1
"datasources":
- "jsonData":
    "default_bucket": "default"
    "organization": "influxdata"
    "version": "Flux"
  "name": "InfluxDB"
  "secureJsonData":
    "token": "<API KEY>"
  "type": "influxdb"
  "url": "http://influxdb-influxdb2:80"
  • Setting version to Flux targets InfluxDB v2, which in turn requires a default_bucket and organization. These values are defined by the Helm chart; its defaults are used here.
  • token is also defined by the Helm chart and stored within a Secret. I opt for the randomly generated default.
  • influxdb-influxdb2 is the default name of the Service created by the Helm chart.

With all this in place, I have a Terraform module that deploys a Grafana stack onto DigitalOcean’s Kubernetes platform, while maintaining portability.

Idea: Adopting Serverless for Trading Operations
https://zai.dev/2024/10/05/%f0%9f%92%a1-adopting-serverless-for-trading-operations/ (Sat, 05 Oct 2024 21:20:54 +0000)

I’m not very into day-trading, but I see the potential in the market from time to time, so I came up with this idea to create an automated trading system using AWS services exclusively.

The system will use Lambda, Timestream, EventBridge, S3, SQS and SageMaker to create a serverless architecture for monitoring and trading on the stock market, using the Twelvedata and Coinbase APIs for pulling in market data and executing trades, respectively.

To start, I will use EventBridge as an alternative to cron jobs, adding symbols to an SQS queue for ingestion. This ties in with the serverless architecture. For forex, the schedule will run every hour; for crypto and stocks, it will run every 15 minutes. This is a good balance, as I’m not a professional trader and don’t need to burn too many API calls.
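If I build this with Terraform, as I do elsewhere, the 15-minute crypto schedule feeding the queue might look roughly like the following sketch (resource names and symbols are assumptions):

resource "aws_sqs_queue" "symbols" {
  name = "symbol-ingestion"
}

resource "aws_cloudwatch_event_rule" "crypto_symbols" {
  name                = "crypto-symbol-schedule"
  schedule_expression = "rate(15 minutes)"
}

resource "aws_cloudwatch_event_target" "crypto_symbols" {
  rule  = aws_cloudwatch_event_rule.crypto_symbols.name
  arn   = aws_sqs_queue.symbols.arn
  input = jsonencode({ symbols = ["BTC/USD", "ETH/USD"] })
}

# EventBridge needs explicit permission to send messages to the queue
resource "aws_sqs_queue_policy" "allow_events" {
  queue_url = aws_sqs_queue.symbols.id
  policy = jsonencode({
    Version = "2012-10-17"
    Statement = [{
      Effect    = "Allow"
      Principal = { Service = "events.amazonaws.com" }
      Action    = "sqs:SendMessage"
      Resource  = aws_sqs_queue.symbols.arn
    }]
  })
}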

I will have five Lambda functions:

  1. The first Lambda function will listen to the SQS queue and query Twelvedata for the mentioned symbols. It will then insert the data directly into Timestream.
  2. The second Lambda function will be triggered by an alert from Timestream when new data is available. For safety (and to start with), I have configured this alert to trigger hourly. The function will send the data to the SageMaker model. If the model predicts a positive yield, the Lambda function will pass the symbol to the third Lambda function via another SQS queue.
  3. The third Lambda function will execute a transaction on Coinbase.
  4. The fourth Lambda function will monitor Twelvedata and Coinbase for hot & trending symbols and add them to the monitoring queue, triggered by another EventBridge schedule.
  5. The fifth Lambda function will create a *.csv dataset from the data within Timestream.

I will use Secrets Manager to securely store the API keys for Twelvedata and Coinbase.

I’m not an AI expert and don’t know much about the specifics of training a model, so I’ll be using SageMaker Canvas to train it. Canvas is the easiest way into training AI models without having to write a Python script.

Finally, at the end of each day, I’ll extract a dataset from the Timestream database into a *.csv and store it in S3, then pass this file onto SageMaker for training. I’ll use one last EventBridge schedule to trigger this workflow.

Hopefully by following this approach, I’ll have a fully functioning market monitoring and trading system.

Overview: Extending my home network to the cloud
https://zai.dev/2024/09/11/overview-extending-my-home-network-to-the-cloud/ (Wed, 11 Sep 2024 22:02:52 +0000)

As a frequent traveller, I found it impractical to maintain a physical system infrastructure, so I relocated my home infrastructure to the cloud.

Establishing a VPN Connection

To begin, I set up a VPN connection from my OpenWRT router to the cloud provider using WireGuard. I created two VPCs in the cloud provider – one public and one private – to mimic the “WAN-LAN” scenario of at-home routers.

This setup provides isolation similar to a home network, where the resources on the private network can only be accessed by other resources on the same network, but are still able to communicate out to the outside world.

The intention is to have the private network as an extension to my “home” (at any given time).

Deploying a Cloud Router

I deployed a virtual machine that will act as a router spanning both networks. This needs to be across both networks as I need an endpoint to connect to (which requires an internet-exposed network) while still being able to access private resources.

I chose VyOS as the cloud router’s operating system because it is configuration-driven, allowing for an Infrastructure-as-Code (IaC) approach for easy re-deployment on any cloud provider.

Utilizing Object Storage for Plex Media Server

I adopted object storage to take advantage of the “unlimited” data offered by the cloud provider, and configured s3fs to mount the object storage on a specific node. With this, Plex can access data directly from the object storage bucket without many configuration changes or plugins to Plex.

The VPN connection allows me to access the Plex server securely, as if it were local, on both my PS5 and laptop. This setup keeps the Plex interface inaccessible to the public and avoids the bandwidth limit imposed when proxying via the official Plex servers.

Securely Pushing Metrics from In-House Devices

Using the VPN connection, I can push metrics directly from my in-house devices, such as weather sensors, without exposing my Prometheus instance to the public internet.

The VPN’s security layer wraps around all traffic, eliminating the need to implement a CA chain for Prometheus, as would be required with platforms such as AWS IoT or Grafana Cloud (where devices are expected to communicate with a public HTTPS endpoint).

Automating At-Home Devices with HomeAssistant

I use HomeAssistant within the cloud provider to automate my at-home devices without worrying about downtime or maintaining a device inside my home. HomeAssistant is scriptable, easily re-deployable, and can bridge a wide range of IoT devices under a single platform, such as HomeKit and Hue.

I can now utilize my old infrastructure without worrying about maintaining hardware, and plan to deploy many more services to the private cloud. Keep an eye out for a deeper breakdown of how I deployed and configured each element of my private cloud.

Exoscale Exporter for Prometheus
https://zai.dev/2024/09/04/exoscale-exporter-for-prometheus/ (Wed, 04 Sep 2024 11:18:20 +0000)

I’ve built a Prometheus exporter for Exoscale, allowing me to visualize cloud spending and resource usage from a central location alongside AWS and DigitalOcean.

The Exoscale exporter is built using Go and leverages the latest version of Exoscale’s Go API, egoscale v3. It includes basic integration tests and automatic package building for all major platforms and architectures.

Some of the metrics exported are:

  • Organization Information: Usage, Address, API Keys
  • Compute Resource Summary: Instances, Kubernetes, Node Pools
  • Storage Resource Summary: SOS Buckets & Usage, Block Volumes
  • Networking Resource Summary: Domain & Records, Load Balancers

By integrating organizational data from Exoscale into the Prometheus ecosystem, I can now configure alerts for spending or resource usage on either Exoscale specifically or for all platforms using AlertManager.

I can also identify where I may have left resources behind using Grafana, in the event I’m manually creating them or my IaC executions didn’t do a proper clean-up.

Metric Browser in Grafana; Showing some values exported from the Exporter

I decided to deploy the exporter to my Kubernetes cluster, scraping on a default interval of 2 minutes. This is roughly a good balance between:

  • When a new billing amount gets updated (hourly)
  • How often infrastructure elements themselves get updated (could be on a per-minute basis)
  • How much data gets consumed by the time-series

I chose the Kubernetes cluster rather than a serverless solution or a dedicated VM so that I can optimize the cost of running the exporter by sharing resources, in addition to abstracting the cloud provider away from the application.

Building AMD64 QEMU Images remotely using Libvirt and Packer
https://zai.dev/2024/05/24/building-amd64-qemu-images-remotely-using-libvirt-and-packer/ (Fri, 24 May 2024 22:05:58 +0000)

I need to build images for the AMD64 architecture while working from an ARM64 machine. While this is possible locally using the qemu-system-x86_64 binary, it tends to be extremely slow due to the overhead of emulating x86 on ARM.

Workbench

  • Ubuntu 22.04 LTS with libvirt installed
  • MacBook Pro M2 with the Packer build files

Configuring the Libvirt Plugin

Connecting to the libvirt host

When using the libvirt plugin, I need to provide a Libvirt URI.

source "libvirt" "image" {
    libvirt_uri = "qemu+ssh://${var.user}@${var.host}/session?keyfile=${var.keyfile}&no_verify=1"
}
  • qemu+ssh:// denotes that I’ll be using the QEMU / KVM backend and connecting via SSH. The connection method determines the remaining arguments of the URI
  • ${var.user}@${var.host} follows SSH syntax; this is the username and hostname of the machine running libvirt
  • /session isolates the running builds from those at the system level. /system would work just as well.
  • keyfile=${var.keyfile} is used to authenticate to the remote machine automatically, without the need for a password. This is useful in the future when I automatically trigger the Packer build from a Git repository
  • no_verify=1 is added so that I can throw the build at any machine and have it “just work”. This is usually advised against due to the risk of spoofing attacks.

Communicating with the libvirt guest

communicator {
    communicator                 = "ssh"
    ssh_username                 = var.username
    ssh_bastion_host             = var.host
    ssh_bastion_username         = var.user
    ssh_bastion_private_key_file = var.private_key
  }
  • The difference between ssh_* and ssh_bastion_* is that the former refers to the target virtual machine being built, and the latter refers to the “middle-man” machine.
    • I require this as I don’t plan to expose the VM to a network outside of the machine hosting it.
    • Since I won’t have access from my local workstation, I need to communicate with the virtual machine via the machine that is hosting it.
    • By adding the ssh_bastion_* arguments, I’m telling Packer that in order to communicate with the VM, it needs to access the bastion machine first and then execute all SSH commands through it.

Configuring the libvirt daemon

My Observations

I came across a “Permission Denied” error when attempting to upload an existing image (in my case, the KVM Ubuntu Server Image). This was due to AppArmor not being provided a trust rule upon creation of the domain. This error is first visible in the following form directly from Packer:

==> libvirt.example: DomainCreate.RPC: internal error: process exited while connecting to monitor: 2024-05-24T16:41:42.574660Z qemu-system-x86_64: -blockdev {"node-name":"libvirt-2-format","read-only":false,"driver":"qcow2","file":"libvirt-2-storage","backing":null}: Could not open '/var/lib/libvirt/images/packer-cp8c6ap1ijp2kss08iv0-ua-artifact': Permission denied

At first, I assumed there was an obvious permissions problem, and at first glance there did in fact appear to be one. Looking at the file upon creation, it was owned by root with permissions allowing only the root user to read/write.

# ls -lah /var/lib/libvirt/images
-rw------- 1 root root  925M May 24 16:41 packer-cp8c6ap1ijp2kss08iv0-ua-artifact

This makes sense, since libvirtd runs under the root user, which is the default configuration from the Ubuntu repository. I didn’t see any configuration option to change what the permissions should be after an upload through libvirt either. This looked like the problem, since all QEMU instances run under a non-root user, libvirt-qemu.

# ps -aux | grep libvirtd
# ps -aux | grep qemu

root      145945  0.4  0.1 1778340 28760 ?       Ssl  16:43   0:10 /usr/sbin/libvirtd
libvirt+    3312  2.2 11.1 4473856 1817572 ?     Sl   May12 405:19 /usr/bin/qemu-system-x86_64

My second observation was that all images created directly within libvirt (e.g. with virt-manager) had what looked like “correct” permissions, matching the user that QEMU would eventually run under:

# ls -lah /var/lib/libvirt/images
-rw-r--r-- 1 libvirt-qemu kvm   11G May 24 17:11 haos_ova-11.1.qcow2

Since no-one else had reported this particular issue when using the libvirt plugin, I had gone down the route of PEBKAC.

Allowing packer-uploaded images as backing store

Thanks to this discussion on Stack Overflow, I found that AppArmor had been blocking the request to the specific file in question.

# dmesg -w
[1081541.249157] audit: type=1400 audit(1716568577.970:119): apparmor="DENIED" operation="open" profile="libvirt-25106acc-cfd8-40f7-a7c6-f5c1c63bc16c" name="/var/lib/libvirt/images/packer-cp8c6ap1ijp2kss08iv0-ua-artifact" pid=43927 comm="qemu-system-x86" requested_mask="w" denied_mask="w" fsuid=64055 ouid=64055

Here, I can see that AppArmor is doing three things:

  • Denying an open request to the QEMU Image
    • apparmor="DENIED"
    • operation="open"
  • Denying writing to the QEMU Image
    • denied_mask="w"
  • Using a profile that is specific to the domain being launched
    • profile="libvirt-25106acc-cfd8-40f7-a7c6-f5c1c63bc16c"
    • This is achieved because libvirt will automatically push AppArmor rules upon creation of a domain. This also means that libvirt will be using some form of template file or specification to create rules.

This means that I need to find the template file that libvirt is using to design the rules, and allow for writing to packer-uploaded QEMU Images.

# /etc/apparmor.d/libvirt/TEMPLATE.qemu
# This profile is for the domain whose UUID matches this file.
# 

#include <tunables/global>

profile LIBVIRT_TEMPLATE flags=(attach_disconnected) {
  #include <abstractions/libvirt-qemu>
  /var/lib/libvirt/images/packer-** rwk,
}

As mentioned in the Stack Overflow post, simply adding /var/lib/libvirt/images/packer-** rwk, to the template file is enough to get past this issue.

End Result

By bringing everything together, I get a successful QCOW2 image visible in my default storage pool. I’m using the Ansible provisioner within the build block so that I can keep the execution steps separate from the Packer build script, and re-usable across different cloud providers.
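A minimal sketch of what that build block looks like (the playbook path is a placeholder):

build {
  sources = ["source.libvirt.image"]

  # the actual provisioning steps live in a re-usable Ansible playbook
  provisioner "ansible" {
    playbook_file = "./playbooks/base.yml"
  }
}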

Building a PKI using Terraform
https://zai.dev/2024/02/24/building-a-pki-using-terraform/ (Sat, 24 Feb 2024 21:09:08 +0000)

As part of building a hybrid infrastructure, I explored different technologies for achieving a stable VPN connection from on-premises to the AWS infrastructure and found AWS’ client-to-site feature (Client VPN) nested within AWS VPC. I explored this before AWS Site-to-Site VPN, as I didn’t have the right setup for handling IPsec/L2TP tunnels at the time and already had OpenVPN handy on my MacBook.

Since I would be using OpenVPN (as that’s what AWS Client VPN uses), I require TLS certificates as the method of authentication and encryption. While AWS provides certificate management features, they come at a cost, making them less suitable for my testing requirements.

I’ve opted to use Terraform to create a custom PKI solution locally, and to prepare for the re-use in larger infrastructure projects.

Working Environment

  • Machine
    • MacBook Pro M2
  • Technologies
    • Terraform

Terraform Module Breakdown

terraform {
  required_providers {
    tls = {
      source = "hashicorp/tls"
    }
    pkcs12 = {
      source = "chilicat/pkcs12"
    }
  }
}

I’m making use of the following providers within my Terraform project:

  • hashicorp/tls
    • For generating the private keys, certificate requests and certificates themselves
  • chilicat/pkcs12
    • For combining the private key & certificate together, a requirement for using the OpenVPN client without embedding the data inside the *.ovpn configuration file (which didn’t come out-of-the-box from AWS)
/**
 * Private key for use by the self-signed certificate, used for
 * future generation of child certificates. As long as the state
 * remains unchanged, the private key and certificate should not
 * re-update at every re-run unless any variable is changed.
 */
resource "tls_private_key" "pem_ca" {
  algorithm = var.algorithm
}

I’ve made the certificates’ algorithm controllable from a global variable, since customer requirements may call for a different level of encryption. This resource returns a PEM-formatted key.

/**
 * Generation of the CA Certificate, which is in turn used by
 * the client.tf and server.tf submodules to generate child
 * certificates
 */
resource "tls_self_signed_cert" "ca" {
  private_key_pem = tls_private_key.pem_ca.private_key_pem
  is_ca_certificate = true

  subject {
    country             = var.ca_country
    province            = var.ca_province
    locality            = var.ca_locality
    common_name         = var.ca_cn
    organization        = var.ca_org
    organizational_unit = var.ca_org_name
  }

  validity_period_hours = var.ca_validity

  allowed_uses = [
    "digital_signature",
    "cert_signing",
    "crl_signing",
  ]
}

I then used the tls_self_signed_cert resource to generate the CA certificate itself, providing the previously generated private key through the private_key_pem attribute. Again, by providing global variables for the CA subject and validity, I’m able to re-run the same Terraform module for multiple clients under different workspaces (or by referencing it from larger modules).

The subject fields I decided to expose describe exactly what and where the TLS certificate belongs to, without needing to dive back into the module.

Adding cert_signing and crl_signing to the allowed_uses list gives the certificate permission to sign child certificates. This is essential, as I still need to generate the certificates for the OpenVPN server and the client.

This resource returns a PEM-formatted certificate.

/**
 * Return the certificate itself. It's the responsibility of
 * the user of this module to determine whether the certificate should
 * be stored locally, transferred or submitted directly to a cloud
 * service
 */
output "ca_certificate" {
  value = tls_self_signed_cert.ca.cert_pem
  sensitive = true
  description = "generated ca certificate"
}

Finally, I return the CA certificate and its key from the module for the user to place wherever they need to be, for example:

To a local file

resource "local_file" "ca_key" {
  content_base64 = module.pki.ca_private_key
  filename = "${path.module}/certs/ca.key"
}
resource "local_file" "ca" {
  content_base64 = module.pki.ca_certificate
  filename = "${path.module}/certs/ca.crt"
}

To the AWS Certificate Manager

resource "aws_acm_certificate" "ca" {
  private_key = module.pki.ca_private_key
  certificate_body = module.pki.ca_certificate
}

Server & Client Certificates

resource "tls_cert_request" "csr" {
  for_each = var.clients # or var.servers
  private_key_pem = tls_private_key.pem_clients[each.key].private_key_pem
    # or pem_servers[each.key]
  dns_names = [each.key]

  subject {
    country = try(each.value.country, try(var.default_client_subject.country, var.default_subject.country))
    province = try(each.value.province, try(var.default_client_subject.province, var.default_subject.province))
    locality = try(each.value.locality, try(var.default_client_subject.locality, var.default_subject.locality))
    common_name = try(each.value.cn, try(var.default_client_subject.cn, var.default_subject.cn))
    organization = try(each.value.org, try(var.default_client_subject.org, var.default_subject.org))
    organizational_unit = try(each.value.ou, try(var.default_client_subject.ou, var.default_subject.ou))
  }
}

Regardless of whether I’m generating a server or a client TLS certificate, both need to go through the ‘certificate request’ process, which is to:

  1. Generate a private key for the server or client (sketched just below)
  2. Generate a certificate signing request based on the private key
  3. Use the CSR to obtain a CA-signed certificate
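The per-client private keys referenced by the request resource above aren’t shown elsewhere in this post; they follow the same pattern as the CA key. A minimal sketch:

resource "tls_private_key" "pem_clients" {
  # one key per client; an equivalent pem_servers resource covers the servers
  for_each  = var.clients
  algorithm = var.algorithm
}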

In this example, I made use of the try() function to achieve a value priority in the following order:

  1. Resource-level
    • Do I have a value specific to the server or client?
  2. Class-level
    • Do I have a value specific to the target type?
  3. Module-level
    • Do I have a global default?

The clients and servers variables are key/value maps with identical structure, where the key is the machine name and the value is the subject data. Here is a sample of the *.tfvars.json file that drives this behaviour.

{
  "clients": {
    "mbp": {
      "country": "GB",
      "locality": "GB",
      "org": "ZAI",
      "org_name": "ZAI",
      "province": "GB"
    }
  }
}

In an ideal (and secure) scenario, the private keys should never be transmitted over the wire; instead, you generate a CSR on the target machine and transmit that. Since this is aimed at test environments, security is not a concern for me. Should I want to do the generation securely, I’ve exposed the following variable as a way to override the CSR generation.

variable "client_csrs" {
  type = map
  description = "csrs to use instead of generating them within this module"
  default = {}
}

Getting the signed certificate

resource "tls_locally_signed_cert" "client" {
  for_each = var.clients
  cert_request_pem = tls_cert_request.csr_client[each.key].cert_request_pem
  ca_private_key_pem = tls_private_key.pem_ca.private_key_pem
  ca_cert_pem = tls_self_signed_cert.ca.cert_pem

  validity_period_hours = var.client_certificate_validity

  allowed_uses = [
    "digital_signature",
    "key_encipherment",
    "server_auth", # for server-side
    "client_auth", # for client-side
  ]
}

Once the *.csr is generated (or provided), I’m able to use the tls_locally_signed_cert resource type to sign that request with the CA certificate’s private key. The cert_request_pem, ca_private_key_pem and ca_cert_pem inputs allow me to do so using the raw PEM format, without needing to save anything to disk before passing the data in.

Relying on the data within the terraform state file allows me to also rule out any “external influence” when troubleshooting, as there will be only a single source of truth.

Adding either server_auth or client_auth (depending on use-case) to allowed_uses permits the use of the signed certificate for authentication, as required by OpenVPN.

Converting from *.PEM to PKCS12

resource "pkcs12_from_pem" "client" {
  for_each = var.clients
  ca_pem          = tls_self_signed_cert.ca.cert_pem
  cert_pem        = tls_locally_signed_cert.client[each.key].cert_pem
  private_key_pem = tls_private_key.pem_clients[each.key].private_key_pem
  password = "123" # Testing purposes
  encoding = "legacyRC2"
}

Using the pkcs12_from_pem resource type from chilicat makes this process simple, as long as I have access to the private key in addition to the certificate and CA.

For compatibility with the OpenVPN Connect application, I needed to enforce the encoding of legacyRC2, rather than the modern encryption that’s offered by easy-rsa.

Returning the certificates

output "client_certificates" {
  value = [ for cert in tls_locally_signed_cert.client : cert.cert_pem ]
  description = "generated client certificates in ordered list form"
  sensitive = true
}

Finally, I return the generated certificates and their *.p12 equivalent from the module. I mark this data as sensitive due to the inclusion of private keys.

For the value, I needed to iterate over the map of resources (as I had used for_each earlier to handle key/value pairs) and rebuild a single list from the result.

As mentioned above, it is then the responsibility of the user to determine what to do with the generated certificates, be it storing them locally or pushing them to AWS.

Authenticating DigitalOcean for Terraform OSS
https://zai.dev/2023/12/05/authenticating-digitalocean-for-terraform-oss/ (Tue, 05 Dec 2023 19:21:25 +0000)

Terraform DigitalOcean Provider with API tokens from DigitalOcean.

Scenario

Why?

I’m diving into Terraform as part of my adventure into the DevOps world, which I’ve taken an interest in over the past few months.

  • I use 2 workstations with DigitalOcean
    • MacBook; for when I’m out and about
    • ArchLinux; for when I’m at home

Generating the API Tokens

Under API, located within the dashboard’s menu (on the left-hand side), I’m presented with the option to Generate New Token.

This is followed by an interface to define:

  • Name
    • I typically name this token zai.dev or personal, as it will be shared across my devices. While this approach isn’t the most secure (ideally, I should have one token per machine), I’m going for the convenience of having one token for my user profile.
  • Expiry date
    • Since I’m sharing the token across workstations (including my laptop, which may be prone to theft), I set the expiration to the lowest possible value of 30 days.
  • Write permissions
    • Since I’ll be using Terraform, and its main purpose is to ‘sculpt’ infrastructure, the token it uses to connect to DigitalOcean needs write permissions.

Authenticating DigitalOcean Spaces

As the Terraform provider allows the creation of Spaces, DigitalOcean’s equivalent to AWS S3 buckets, I should also create keys for it. By navigating to the “Spaces Keys” tab under the API option, I can repeat the same steps as above.

Installing the Tokens

Continuing from the setup of environment variables in my Synchronizing environment variables across Workstations post, I need to add 3 environment variables for connecting to DigitalOcean.

  • DIGITALOCEAN_TOKEN
    • This is the value that is given to you after hitting “Generate Token” on the Tokens tab
  • SPACES_ACCESS_KEY_ID
    • This is the value that is given to you after hitting “Generate Token” on the Spaces Tokens tab
  • SPACES_SECRET_ACCESS_KEY
    • This is the one-time value that is given to you alongside the SPACES_ACCESS_KEY_ID value

Whilst I’m at it, I’m going to add the following environment variables so that I can use any S3-compatible tools to communicate with my object storage, such as the aws s3 cp command to push build artifacts:

  • AWS_ACCESS_KEY_ID=${SPACES_ACCESS_KEY_ID}
  • AWS_SECRET_ACCESS_KEY=${SPACES_SECRET_ACCESS_KEY}

To keep things tidy, I created a separate environment file for DigitalOcean, under ~/.config/zai/env/digitalocean.sh:

export DIGITALOCEAN_TOKEN="<DO_TOKEN>"
export SPACES_ACCESS_KEY_ID="<SPACES_KEY>"
export SPACES_SECRET_ACCESS_KEY="<SPACES_SECRET>"
export AWS_ACCESS_KEY_ID=${SPACES_ACCESS_KEY_ID}
export AWS_SECRET_ACCESS_KEY=${SPACES_SECRET_ACCESS_KEY}
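With those variables exported, the provider block itself can stay empty; a minimal sketch of how Terraform picks them up:

terraform {
  required_providers {
    digitalocean = {
      source = "digitalocean/digitalocean"
    }
  }
}

provider "digitalocean" {
  # token, spaces_access_id and spaces_secret_key fall back to the
  # DIGITALOCEAN_TOKEN, SPACES_ACCESS_KEY_ID and SPACES_SECRET_ACCESS_KEY
  # environment variables when not set here.
}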