Cloud – Zak Abdel-Illah

Provisioning Grafana on DigitalOcean Kubernetes Service

Zak — Thu, 12 Dec 2024 22:00:10 +0000

The LGTM stack is my essential observability stack, and deploying the architecture on a vendor-agnostic basis allows me to:

Guarantee up-time in the event I need to switch due to cost increases
Re-deploy the stack to a client that has adopted another cloud provider

Anything that I find I can monitor or pull metrics from ends up in Grafana. I currently use four data sources, with dashboards and alerts pulling and presenting information from all of them:

Prometheus: For monitoring real-time metrics such as CPU usage and weather
InfluxDB: For storing and monitoring historical metrics, such as stock market data
Loki: For monitoring system logs
Elasticsearch: For storing transactions, documents and RSS feeds.

I chose a Managed Kubernetes offering as a basis for deployment, as opposed to virtual machines or self-hosted Kubernetes, for two reasons:

Uptime is guaranteed by the vendor
I don’t have to maintain a Kubernetes cluster at a systems-level

Deploying a DigitalOcean Kubernetes Cluster

Overview of my DOKS cluster from the DigitalOcean Dashboard, where my cluster is a single Premium AMD node.

Droplet layout

I’m deploying my stack with two dedicated right-sized nodes living in two separate node pools. One is labelled as fixed and will contain only one instance, whereas the other is scalable with a maximum of three. With this design, I accomplish two essentials for the architecture:

Cost Efficiency
- I prevent over-provisioning by right-sizing, and rely on DigitalOcean’s control plane to scale the node pool when necessary, such as a larger dataset being held in memory by a data source
Availability & Scalability
- At least one node should be available at all times, so that a single node failure does not take the applications down. This is separate from a HA Control Plane, which offers a different benefit.

Cluster Location

lon1 is about 17km away from where I live, so I’ve deployed my cluster there. The location is not really important for my use case as I plan to do all data ingestion from within, but it’s cool to think it’s just a few streets down from me.

I played with the idea of taking availability even further by deploying to both ams1 and lon1 in a warm-standby style, but that’s a story for another post.

DOKS Cluster Resource in Terraform

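resource "digitalocean_kubernetes_cluster" "primary" {
  name                 = "zai-lon1"
  region               = "lon1"
  version              = "1.30.4-do.0"
  vpc_uuid             = digitalocean_vpc.primary.id
  auto_upgrade         = false
  surge_upgrade        = true
  ha                   = false
  registry_integration = true

  # Fixed pool: the cluster resource accepts a single (default) node_pool block
  node_pool {
    name       = "k8s-lon1-dedicated"
    size       = "s-1vcpu-1gb-amd"
    node_count = 1
  }

  tags = local.resource_tags

  maintenance_policy {
    day        = "saturday"
    start_time = "04:00"
  }
}

# Scalable pool: additional pools are declared as their own resource,
# here allowed to grow from one node up to a maximum of three
resource "digitalocean_kubernetes_node_pool" "burst" {
  cluster_id = digitalocean_kubernetes_cluster.primary.id
  name       = "k8s-lon1-burst"
  size       = "s-1vcpu-1gb-amd"
  auto_scale = true
  min_nodes  = 1
  max_nodes  = 3
}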

Deploying A Load Balancer

Traefik is my load balancer of choice. It’s written in Go, performant and integrates very well into the Kubernetes ecosystem. I’m using it less as a plain load balancer and more as an ‘application gateway’, so that a single IP address can handle routing to many Kubernetes services based on HTTP request attributes such as the hostname or the path.

I’m also using it as a front for all of the web services in the cluster, so I can manage TLS certificates from the same location for all applications, not a unique configuration per-application. I’m not so concerned about inter-service TLS communication at this point, but rather over the public internet.

Traefik has its own form of an Operator that works with the Ingress resource definition within Kubernetes. When an Ingress resource is created, Traefik will automatically create a route based on its specification. This means that I can easily declare that https://grafana.zai.dev on the LoadBalancer will route to the grafana service on port 80.
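To make that concrete, a hand-written Ingress for the Grafana route might look roughly like the sketch below. The Grafana Helm chart creates an equivalent resource for me later on, so this is purely illustrative: the resource label, the `traefik` ingress class name and the `grafana` Service name are assumptions rather than values taken from my module.

```hcl
# Illustrative sketch only: the Grafana chart normally creates this Ingress itself.
resource "kubernetes_ingress_v1" "grafana_example" {
  metadata {
    name = "grafana-example" # hypothetical name
  }

  spec {
    # Assumption: Traefik registered an IngressClass named "traefik".
    ingress_class_name = "traefik"

    rule {
      # Requests with this Host header...
      host = "grafana.zai.dev"

      http {
        path {
          path      = "/"
          path_type = "Prefix"

          # ...are routed to this Service and port.
          backend {
            service {
              name = "grafana" # assumed Service name
              port {
                number = 80
              }
            }
          }
        }
      }
    }
  }
}
```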

When a Service of type LoadBalancer is created within the Kubernetes cluster, DigitalOcean’s operator will proceed to create a Load Balancer, and the charge will be applied accordingly.

The Traefik helm chart by default creates such a LoadBalancer Service, which configures DigitalOcean to reserve a static public IP address that can be reached from the public internet. I don’t need to provide any additional configuration.

resource "helm_release" "traefik" {
  name       = "traefik"
  repository = "https://traefik.github.io/charts"
  chart      = "traefik"
  version    = "30.1.0"
}

Deploying an Identity Provider

My public-facing Grafana instance requiring external authentication

Authentication is required since Grafana is publicly accessible. With the same mindset as for Traefik, I’d like to centrally control authentication & authorization rather than defining it on a per-application level.

I adopted Keycloak as it acts as an Identity Provider and / or Broker, supporting both OpenID Connect and SAML. OIDC is a common standard across many apps, including Grafana.

I use GitHub as an Identity Provider for Keycloak, and Keycloak as an Identity Provider for Grafana. I take this approach as it:

Allows me to integrate more OIDC or SAML compatible applications into my own provider
Reduces management of external accounts to a single point (rather than configuring GitHub per-application)
Allows me to add additional roles on top of GitHub accounts, which Grafana requires to recognize who’s an Administrator
Allows me to integrate LDAP in the future

I won’t go through deploying the Keycloak configuration in this post (a future one is coming with more detail on configuring Keycloak), but based on the OAuth2 specification, I have available to me the client_id, client_secret, auth_url, token_url and api_url that I pipe into the grafana.ini in the next stage. I can receive these details from GitHub directly by creating an OAuth2 application.

Deploying Grafana with Helm

To be agile in my deployments, I’m isolating the Grafana container from any configuration by deploying any configuration to Grafana through ConfigMaps. With that, I can truly version control the running version of Grafana without worrying about losing any stored work such as dashboards.

Provisionable elements, such as Dashboards, Alerts and Datasources, can be loaded into Grafana by using its provisioning directory, defined below as /etc/grafana/provisioning. By using Kubernetes ConfigMaps, I can mount my configuration into Grafana from outside of the instance itself.

By enabling the sidecar containers, I save myself from needing to maintain this volumeMount, as these act as operators monitoring for ConfigMaps with specific labels (described below), mounting them into the Grafana pod and instructing Grafana to reload the configuration without restarting the instance.

Within the grafana.ini file, auth.generic_oauth instructs Grafana how to connect to an identity provider. Here, I pipe in the values received from Keycloak (or GitHub) above. To force credentials to come from Keycloak, I enable the disable_login_form setting.

The $__file{} operator reads a variable from a file on disk, allowing me to further protect the OAuth2 credentials by storing them in a Secret. I use HashiCorp Vault to protect secrets through ServiceAccount, but that’s outside the scope of this post.
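For the $__file{} paths to resolve, the Secret has to be mounted into the Grafana pod. The sketch below shows one way this could be done with the chart's `extraSecretMounts` value, with each key of the Secret becoming one file under the mount path. The Secret name `grafana-oidc-credentials`, the mount name and the local name are hypothetical, and this is not necessarily how my Vault-backed setup does it.

```hcl
# Sketch: extra Helm values that mount a Secret as files for $__file{} to read.
# "grafana-oidc-credentials" is a hypothetical Secret whose keys
# (id, secret, auth_url, token_url, api_url) each become one file.
locals {
  grafana_oidc_mount_values = yamlencode({
    extraSecretMounts = [
      {
        name       = "oidc-credentials"
        secretName = "grafana-oidc-credentials"
        mountPath  = "/etc/secrets/oidc_credentials"
        readOnly   = true
      }
    ]
  })
}
```

Appending `local.grafana_oidc_mount_values` to the `values` list of the Grafana `helm_release` would merge it with the rest of the chart values.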

role_attribute_path allows me to map user roles defined within Keycloak to Grafana roles, allowing me to centralize “how to define an administrator” across multiple applications, while scopes instructs Keycloak on what data Grafana requires in order to successfully authenticate and authorize.

Finally, ingress is the bridge between the Grafana instance and the load balancer. Within the Helm chart, an Ingress resource will be created that will point to the Service created by the chart, accessible on the domain grafana.zai.dev.

tls provides instructions on how to load the TLS Certificate associated with grafana.zai.dev. In my case, I store the certificate inside a Secret named grafana-tls.

resource "helm_release" "grafana" {
  name       = local.grafana_deployment_name
  repository = local.grafana_repository
  chart      = "grafana"
  version    = var.grafana_chart_version

  values = [
    yamlencode({ "grafana.ini" = {
      analytics = {
        check_for_updates = true
      },
      grafana_net = {
        url = "https://grafana.net"
      },
      log = {
        mode = "console"
      },
      paths = {
        data         = "/var/lib/grafana/",
        logs         = "/var/log/grafana",
        plugins      = "/var/lib/grafana/plugins",
        provisioning = "/etc/grafana/provisioning"
      },
      server = {
        domain   = "grafana.zai.dev",
        root_url = "https://grafana.zai.dev"
      },
      auth = {
        # disable_login_form is a setting of the [auth] section in grafana.ini
        disable_login_form = true
      },
      "auth.generic_oauth" = {
        enabled             = true,
        name                = "Keycloak",
        allow_sign_up       = true,
        client_id           = "$__file{/etc/secrets/oidc_credentials/id}",
        client_secret       = "$__file{/etc/secrets/oidc_credentials/secret}",
        auth_url            = "$__file{/etc/secrets/oidc_credentials/auth_url}",
        token_url           = "$__file{/etc/secrets/oidc_credentials/token_url}",
        api_url             = "$__file{/etc/secrets/oidc_credentials/api_url}",
        scopes              = "openid profile email offline_access roles",
        role_attribute_path = "contains(realm_access.roles[*], 'admin') && 'Admin' || contains(realm_access.roles[*], 'editor') && 'Editor' || 'Viewer'"
      }
      },
      "sidecar" = {
        "datasources" = { "enabled" = true },
        "alerts" = { "enabled" = true },
        "dashboards" = { "enabled" = true }
      },
      ingress = {
        enabled = true,
        hosts   = ["grafana.zai.dev"]
        tls = [
          {
            secretName = "grafana-tls",
            hosts = ["grafana.zai.dev"]
          }
        ]
      },
      assertNoLeakedSecrets = false,
    })
  ]
}

Deploying Datasources for Grafana

PersistentVolumes are key to reliability. Without them, each data source has nowhere to store its data across crashes or reboots. All the Helm charts for each data source, by default, create a PersistentVolumeClaim and rely on the creation of a PersistentVolume with matching labels by an external factor, human or automated.

DigitalOcean’s Operator will create a Volume / Block Store whenever a PersistentVolumeClaim resource is created with any do-* storageClass.

DOKS clusters ship with do-block-storage as the default storage class for PVCs. Once the block storage has been created, the operator will then create a PersistentVolume with matching labels so that the internal Kubernetes operator can take care of the binding between PVs and PVCs natively.
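The Helm charts create their own claims, but a standalone claim shows the mechanism in isolation. This is a minimal sketch, assuming the Terraform Kubernetes provider is configured against the cluster; the claim name and the 5Gi size are made up for illustration.

```hcl
# Sketch: a standalone claim that DigitalOcean would satisfy with a block storage volume.
resource "kubernetes_persistent_volume_claim_v1" "example" {
  metadata {
    name = "example-data" # hypothetical name
  }

  spec {
    # Block storage volumes attach to one node at a time.
    access_modes       = ["ReadWriteOnce"]
    storage_class_name = "do-block-storage"

    resources {
      requests = {
        storage = "5Gi" # illustrative size
      }
    }
  }
}
```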

Deploying Prometheus

Prometheus is ideal for alerting on real-time numeric metrics, and doesn’t require much configuration in a small setup. The Helm chart includes the entire Prometheus stack: AlertManager, Push Gateway and a node metrics exporter.

It provides Kubernetes service discovery by hooking into the Service creation loop: it looks for prometheus.io/* annotations and instructs Prometheus to start scraping the annotated Services. At a minimum, these annotations look like:

prometheus.io/scrape=true
- Tells prometheus to actively scrape this Service
prometheus.io/path=/metrics
- Prometheus scrapes on HTTP. It will request this path.
prometheus.io/port=9090
- Prometheus will connect to a HTTP server on this port within the services’ Endpoint

This means that I don’t have to modify the Prometheus configuration directly when expanding the services that my Kubernetes cluster is hosting. By simply adding annotations to any new service that exposes metrics in the Prometheus exposition format, I immediately get its data visible within Grafana.
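As a rough sketch of what that looks like from Terraform, the Service below carries the three annotations listed above. The `example-app` name, label selector and port are hypothetical stand-ins for a workload that serves metrics on /metrics.

```hcl
# Sketch: a Service that Prometheus would pick up through its annotations.
resource "kubernetes_service_v1" "example_app" {
  metadata {
    name = "example-app" # hypothetical workload

    annotations = {
      "prometheus.io/scrape" = "true"
      "prometheus.io/path"   = "/metrics"
      "prometheus.io/port"   = "9090"
    }
  }

  spec {
    # Assumed pod label; must match the pods exposing the metrics endpoint.
    selector = {
      app = "example-app"
    }

    port {
      port        = 9090
      target_port = 9090
    }
  }
}
```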

resource "helm_release" "prometheus" {
  name       = "prometheus"
  repository = "https://prometheus-community.github.io/helm-charts"
  chart      = "prometheus"
  version    = "25.26.0"
}

Deploying Elasticsearch

Elasticsearch is great for analyzing documents and transactions where the data type varies. It’s defined as a search engine. I use this data source for analyzing articles and stock market transactions.

My first problem was how resource-hungry Elasticsearch is by nature. I had to dial down its memory usage to match the amount of content I was putting through it. 512MB appears to be the right number for it to function, as 256MB causes it to fail to initialize. Increasing this value alongside the replicas value will give me higher availability.

Because of this memory requirement, I had to upsize my Kubernetes node, as Kubernetes would report that there was insufficient memory to deploy Elasticsearch.

To get data into Elasticsearch, I use Telegraf’s Elasticsearch output and connect its input to either RabbitMQ, a web socket or an HTTP polling feed. When I’m generating data through Python or Node.js, I don’t push the data directly from the code, but rather through RabbitMQ for Telegraf to handle. I do this so that I can throttle the amount of data reaching Elasticsearch, as a burst could otherwise take the service down.

resource "helm_release" "elasticsearch" {
  name       = "elasticsearch"
  repository = "https://helm.elastic.co"
  chart      = "elasticsearch"
  version    = "8.5.1"

  set {
    name  = "replicas"
    value = 1
  }

  set {
    name = "resources.requests.memory"
    value = "1Gi"
  }

  set {
    name = "resources.limits.memory"
    value = "1Gi"
  }

  set {
    name = "heapSize"
    value = "512Mi"
  }

  set {
    name  = "minimumMasterNodes"
    value = 1
  }

  set {
    name  = "volumeClaimTemplate.resources.requests.storage"
    value = "4Gi"
  }

  set {
    name = "cluster.initialMasterNodes"
    value = "elasticsearch-master"
  }
}

cluster.initialMasterNodes is needed in this helm chart as it instructs Elasticsearch to “find itself”. elasticsearch-master is the name of the Kubernetes Service that gets created, so the kube-dns service returns the IP address of the Elasticsearch instance when elasticsearch-master is requested.
I restrict the size of the DigitalOcean volume through volumeClaimTemplate.resources.requests.storage, as by default it’s around 20Gi.
minimumMasterNodes and replicas are restricted to 1 as I don’t need more than one instance of Elasticsearch. If I increase the number of replicas and begin to shard, Grafana shouldn’t need additional configuration to cater for that.

Deploying InfluxDB

InfluxDB is my time-series database of choice when working with historical data that will need batch processing at some point (e.g. Grafana Alerting), such as Apple HealthKit and stock market data. Flux, the query language used by InfluxDB, is extremely powerful in comparison to PromQL. But with more complexity comes a performance hit.

I also use Telegraf to ingest data into InfluxDB, with inputs pointing solely at RabbitMQ. I use Node.js to listen to websocket streams and push data points to RabbitMQ for ingestion. Because of the amount of streaming data I plan to put into InfluxDB, I set persistence.size to a high amount of 12Gi.

As the chart hadn’t been updated in a while and its default image tag was causing me some errors, I manually set image.tag to the latest available version.

resource "helm_release" "influx" {
  name = "influxdb"
  repository = "https://helm.influxdata.com/"
  chart = "influxdb2"
  version = "2.1.2"

  set {
    name = "image.tag"
    value = "2.7.10"
  }

  set {
    name = "persistence.size"
    value = "12Gi"
  }
}

Deploying Loki

Loki is the most complex to configure, but I find it more intuitive (for Grafana) as a way to store system and application logs. I deploy it in a single binary configuration, and use DigitalOcean Spaces as the backend storage for the logs themselves. Relying on block storage may prove problematic, as millions of messages would require constant re-provisioning of storage.

resource "helm_release" "loki" {
  name = "loki"
  repository = "https://grafana.github.io/helm-charts"
  chart = "loki"
  version = "6.18.0"

  values = [
    yamlencode({
        loki = {
          commonConfig = {
            replication_factor = 1
          }
          storage = {
            type = "s3"
            bucketNames = {
              chunks = "",
              ruler = "",
              admin = "",
            },
            s3 = {
              s3 = "s3://",
              endpoint = "lon1.digitaloceanspaces.com",
              region = "lon1",
              secretAccessKey = "",
              accessKeyId = "",
            }
          }
          schemaConfig = {
            configs = [
              {
                from = "2024-04-01",
                store = "tsdb",
                object_store = "s3",
                schema = "v13",
                index = {
                  "prefix" = "loki_index_",
                  "period" = "24h"
                }
              }
            ]
          },
        },
        deploymentMode = "SingleBinary",
        backend = { replicas = 0 },
        read = { replicas = 0 },
        write = { replicas = 0 },
        singleBinary = { replicas = 1 },
        chunksCache = { allocatedMemory = 2048 }
      })
  ]
}

Pushing logs to Loki

Loki exposes an HTTP API endpoint for pushing logs, much like the Prometheus Push Gateway does for metrics. One tool, Promtail, follows the logs of every container in every pod in the Kubernetes cluster and streams them to the Loki push API.

resource "helm_release" "promtail" {
  name = "promtail"
  repository = "https://grafana.github.io/helm-charts"
  chart = "promtail"
  version = "6.16.6"

  values = [
    yamlencode({
        config = {
          clients = [{url = "http://loki-gateway/loki/api/v1/push", tenant_id = "zai"}]
        }
      })
  ]
}

loki-gateway is the default name of the Kubernetes Service created by the Loki helm chart. The kube-dns service will return the Endpoint IP Address of the Loki instance.

Deploying Provisioned Components for Grafana

Deploying Grafana with sidecar containers provisions operators that listen for ConfigMaps with specific labels for Dashboards, Alerts and Datasources. Put simply, a sidecar takes the contents of the ConfigMap and puts them into Grafana’s provisioning directory.

Grafana’s provisioning directory is defined by paths.provisioning within grafana.ini, which can be injected upon deploying the Grafana helm chart within the "grafana.ini" key. In my case, this path is /etc/grafana/provisioning.

Grafana natively reads its provisioning directory and loads the contents into the instance, regardless of whether it’s containerized or running on the system directly.

Provisioning Dashboards

For dashboards, a label of grafana_dashboard needs to exist, but the value is irrelevant. I use templatefile() to load the file as string into main.json. This will allow me in the future to handle the renaming of data sources used within a Dashboard, or to manipulate a dashboard directly from Terraform.

I design dashboards directly within Grafana, export them as JSON and store them alongside the Terraform module for use by the ConfigMap. In the following resource, my exported dashboard will end up under /etc/grafana/provisioning/main.json.

Within the export menu, Grafana does provide the option to export using HCL (Terraform). I don’t opt for this option as this requires Grafana to be up and running in order to execute the resource. With the approach of declaring Dashboards via ConfigMap, I can re-deploy the dashboard in one go and remove the direct dependency on the Grafana instance running.

resource "kubernetes_config_map" "grafana_dashboards" {
  metadata {
    name = "grafana-dashboards"
    labels = {
      grafana_dashboard = "1"
    }
  }

  data = {
    "main.json" = templatefile("/path/to/dashboard.json", {})
  }
}

Provisioning Alerts

I follow the same approach as above for declaring alerts, with the exception that grafana_alert is the expected label from the sidecar.


resource "kubernetes_config_map" "grafana_alerts" {
  metadata {
    name = "grafana-alerts"
    labels = {
      grafana_alert = "1"
    }
  }

  data = {
    "alerts.json" = templatefile("/path/to/alert.json", {})
  }
}

Provisioning Data-sources

I build the configuration myself when it comes to data sources. The specification varies between each data source. Thanks to using Terraform to deploy each data source, I can re-use the variables used to define the Service names of each data source so that Grafana can find them correctly.

Provisioning Prometheus as a data source

resource "kubernetes_config_map" "prometheus_grafana_discovery" {
  metadata {
    name = "prometheus-grafana-datasource"
    labels = {
      grafana_datasource = "prometheus"
    }
  }

  data = {
    "prometheus.yml" = yamlencode({
        apiVersion = 1,
        datasources = [
          {
            name = var.prometheus_deployment_name,
            type = "prometheus",
            url = "http://${var.prometheus_deployment_name}.${helm_release.prometheus.namespace}.svc.cluster.local",
            access = "proxy"
          }
        ]
    })
  }
}

With the above resource declared in Terraform, I then just manipulate datasources = [] to match the following specifications for each data source:

Specification for Loki

"apiVersion": 1
"datasources":
- "jsonData":
    "httpHeaderName1": "X-Scope-OrgID"
  "name": "prometheus-server"
  "secureJsonData":
    "httpHeaderValue1": "1"
  "type": "loki"
  "url": "http://loki.default.svc.cluster.local"

X-Scope-OrgID is the HTTP header that carries an Organization (tenant) ID. Injecting it is a trick to get Grafana’s requests accepted by Loki.
loki is the default name of the Kubernetes Service created by the Helm chart.

Specification for Elasticsearch

Elasticsearch needs one declaration per index (if splitting the data by index). I create an index for each source of data being ingested into Elastic, and postfix it with the date of ingestion.

For authentication, I use the password for the elastic user as defined by the Helm chart. By default, this is randomly generated and stored within a Secret. I also use the tlsSkipVerify flag, as additional configuration is needed for Elasticsearch to use a TLS certificate that’s respected by Grafana. Since the traffic is internal, I’m not that concerned by this at this point.

elasticsearch-master is the default name of the service created by the Helm chart.

"apiVersion": 1
"datasources":
- "basicAuth": true
  "basicAuthUser": "elastic"
  "jsonData":
    "index": "twelvedata-*"
    "timeField": "@timestamp"
    "tlsSkipVerify": true
  "name": "Elasticsearch (Twelve Data)"
  "secureJsonData":
    "basicAuthPassword": ""
  "type": "elasticsearch"
  "url": "https://elasticsearch-master:9200"
- "basicAuth": true
  "basicAuthUser": "elastic"
  "jsonData":
    "index": "coinbase-*"
    "timeField": "@timestamp"
    "tlsSkipVerify": true
  "name": "Elasticsearch (Coinbase)"
  "secureJsonData":
    "basicAuthPassword": ""
  "type": "elasticsearch"
  "url": "https://elasticsearch-master:9200"
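Since the two declarations differ only in their name and index pattern, the list could be generated rather than written by hand. The following is a sketch of that idea using a `for` expression; the variable `elastic_password` and both local names are hypothetical, and the password would in practice come from the Secret created by the Helm chart.

```hcl
# Hypothetical input: the password of the "elastic" user.
variable "elastic_password" {
  type      = string
  sensitive = true
}

locals {
  # Display label => index pattern, mirroring the two declarations above.
  elastic_indices = {
    "Twelve Data" = "twelvedata-*"
    "Coinbase"    = "coinbase-*"
  }

  # One data source object per index pattern.
  elastic_datasources = [
    for label, index in local.elastic_indices : {
      name          = "Elasticsearch (${label})"
      type          = "elasticsearch"
      url           = "https://elasticsearch-master:9200"
      basicAuth     = true
      basicAuthUser = "elastic"
      jsonData = {
        index         = index
        timeField     = "@timestamp"
        tlsSkipVerify = true
      }
      secureJsonData = {
        basicAuthPassword = var.elastic_password
      }
    }
  ]
}
```

Passing `datasources = local.elastic_datasources` into the `yamlencode()` call of a data source ConfigMap would then produce one entry per index, and adding a new feed becomes a one-line change to the map.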

Specification for InfluxDB

"apiVersion": 1
"datasources":
- "jsonData":
    "defaultBucket": "default"
    "organization": "influxdata"
    "version": "Flux"
  "name": "InfluxDB"
  "secureJsonData":
    "token": ""
  "type": "influxdb"
  "url": "http://influxdb-influxdb2:80"

The Flux version forces InfluxDB v2, which in turn requires a defaultBucket and organization. These values are defined by the Helm chart, and its default values are used here.
token is also defined by the Helm chart and stored within a Secret. I opt for using the randomly generated default.
influxdb-influxdb2 is the default name of the Service created by the Helm chart.
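Putting this together in Terraform, the token can be read from the chart's Secret instead of being pasted in. This is a sketch under assumptions: I'm assuming the Secret is named `influxdb-influxdb2-auth` and stores the token under the key `admin-token`, which is worth verifying in the cluster before relying on it. The resource labels are hypothetical, and `defaultBucket` is the spelling Grafana's provisioning documentation uses for the default bucket key.

```hcl
# Assumption: the influxdb2 chart stores its generated token in a Secret named
# "<service name>-auth" under the key "admin-token". Verify this in your cluster.
data "kubernetes_secret_v1" "influxdb_auth" {
  metadata {
    name      = "influxdb-influxdb2-auth"
    namespace = helm_release.influx.namespace
  }

  # Defer the read until the chart has actually created the Secret.
  depends_on = [helm_release.influx]
}

resource "kubernetes_config_map" "influxdb_grafana_datasource" {
  metadata {
    name = "influxdb-grafana-datasource"
    labels = {
      grafana_datasource = "influxdb"
    }
  }

  data = {
    "influxdb.yml" = yamlencode({
      apiVersion = 1,
      datasources = [
        {
          name = "InfluxDB",
          type = "influxdb",
          url  = "http://influxdb-influxdb2:80",
          jsonData = {
            defaultBucket = "default",
            organization  = "influxdata",
            version       = "Flux"
          },
          secureJsonData = {
            token = data.kubernetes_secret_v1.influxdb_auth.data["admin-token"]
          }
        }
      ]
    })
  }
}
```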

With all this in place, I have a Terraform module that deploys a Grafana stack onto DigitalOcean’s Kubernetes platform, while maintaining portability.

Deploying AWS Site-to-Site on OpenWRT

Zak — Tue, 10 Dec 2024 16:33:06 +0000

I want to connect to resources on AWS from my home with the least operational overhead, leading me to deploy AWS Site-to-Site for connecting resources from my home to a VPC.

The Environment

Some resources I want to access are:

G4dn.xlarge EC2 instances used for streaming games
t2.micro EC2 instances hosting Home Assistant
RDS (PostgreSQL) instances for hosting OpenStreetMap data

Home Environment

When setting up a connection from AWS to my home, I have to consider the following specifications:

I live in West London, relatively close to the eu-west-2 data center
- I have a VPC in eu-west-2 running on the 10.1.0.0/16 network
I use a publicly-offered ISP for accessing the internet
There are two hops (routers) between the public internet and my home network
- The first hop is the router provided by the ISP to connect to the internet
  - This network lives on the 192.168.0.0/24 subnet
- The second hop is my own off-the-shelf router from ASUS running OpenWRT
  - My home network lives on the 10.0.0.0/24 subnet
  - The router has 8MB of usable storage for packages and configuration

Setting up AWS Site-to-Site

AWS Site-to-Site is one of Amazon’s offerings for bridging an external network to a VPC over the public internet. Some alternatives are:

AWS Client VPN (based on OpenVPN)
- More expensive
- More complex, often tends to be slower without optimization
Self-managed VPN
- Allows use of any VPN technology, such as Wireguard
- Allows custom metric monitoring
- Requires management of VPC topologies and firewalls
- Can be more expensive

I chose to use Site-to-Site on this occasion so I could learn how IPSec works in more detail, and I saw deploying it to OpenWRT as a challenge. It’s also a lot cheaper than a firewall license, EC2 rental and public IP charges.

Deploying a Virtual Private Gateway

A Virtual Private Gateway is the AWS-side endpoint of an IPSec tunnel. It also hosts the configuration of the local BGP instance, and is what drives the propagation of routes between the IPSec tunnels and the VPC routing tables.

Dashboard view of the Virtual Private Gateway. I rely on an ASN generated by Amazon for this instance.

resource "aws_vpn_gateway" "main" {
  vpc_id = data.aws_vpc.main.id
}

There’s not much to configure with the VPG, so I left it with its defaults.

Deploying a Customer Gateway

A customer gateway represents the local end of the IPSec tunnel and the BGP daemon running on it. In my case, this is the OpenWRT router.

Dashboard view of the customer gateway, which represents my OpenWRT Router itself and the BGP daemon running on it. AWS by default provides an ASN of 65000, and I don’t have any need to customize it.

resource "aws_customer_gateway" "openwrt" {
  bgp_asn    = 65000
  ip_address = ""
  type       = "ipsec.1"
}

Deploying a Site-to-Site VPN Connection

The VPN Connection itself is what connects a VPG (AWS Endpoint) to a customer gateway (Local endpoint) in the form of an IPSec VPN connection.

Dashboard view of the Site-to-Site VPN. Everything is left as the default. For the purposes of building the automated workflow and testing connectivity, local and remote network CIDRs are 0.0.0.0/0

resource "aws_vpn_connection" "main" {
  customer_gateway_id = aws_customer_gateway.openwrt.id
  vpn_gateway_id      = aws_vpn_gateway.main.id
  type                = aws_customer_gateway.openwrt.type

  tunnel1_ike_versions = ["ikev1", "ikev2"]
  tunnel1_phase1_dh_group_numbers = [2, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24]
  tunnel1_phase1_encryption_algorithms = ["AES128", "AES128-GCM-16", "AES256", "AES256-GCM-16"]
  tunnel1_phase1_integrity_algorithms = ["SHA1", "SHA2-256", "SHA2-384", "SHA2-512"]
  tunnel1_phase2_dh_group_numbers = [2, 5, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24]
  tunnel1_phase2_encryption_algorithms = ["AES128", "AES128-GCM-16", "AES256", "AES256-GCM-16"]
  tunnel1_phase2_integrity_algorithms = ["SHA1", "SHA2-256", "SHA2-384", "SHA2-512"]

}

These (tunnel1_*) are the default values set by AWS and should be locked down. For the purpose of testing, I left them all at their defaults. These settings are directly tied to the IPSec encryption settings described below.
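For reference, a locked-down variant could look something like the sketch below. The specific choices (IKEv2 only, AES256, SHA2-256, DH group 14) are my assumptions for illustration, picked from the allowed values listed above; whatever is chosen has to match the proposals strongSwan is configured with on the router. The resource label `hardened_example` is hypothetical, and the same arguments exist with a tunnel2_ prefix for the second tunnel.

```hcl
# Sketch of a locked-down variant of the connection above. The algorithm choices
# are illustrative assumptions and must match the router's strongSwan proposals.
resource "aws_vpn_connection" "hardened_example" {
  customer_gateway_id = aws_customer_gateway.openwrt.id
  vpn_gateway_id      = aws_vpn_gateway.main.id
  type                = aws_customer_gateway.openwrt.type

  # Phase 1 (IKE) negotiation
  tunnel1_ike_versions                 = ["ikev2"]
  tunnel1_phase1_dh_group_numbers      = [14]
  tunnel1_phase1_encryption_algorithms = ["AES256"]
  tunnel1_phase1_integrity_algorithms  = ["SHA2-256"]

  # Phase 2 (IPSec) negotiation
  tunnel1_phase2_dh_group_numbers      = [14]
  tunnel1_phase2_encryption_algorithms = ["AES256"]
  tunnel1_phase2_integrity_algorithms  = ["SHA2-256"]
}
```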

Connecting OpenWRT via IPSec

Ansible Role Variables

I’ve designed my Ansible role to be able to configure AWS IPSec tunnels with the bare minimum configuration. All information that the role requires is provided by Terraform upon provisioning of the AWS Site-to-Site configuration.

bgp_remote_as: "64512"
ipsec_tunnels:
  - ifid: "301"
    name: xfrm0
    inside_addr: 
    gateway: 
    psk: 
  - ifid: "302"
    name: xfrm1
    inside_addr: 
    gateway: 
    psk:

bgp_remote_as refers to the ASN of the Virtual Private Gateway, and is strictly used by the BGP Daemon offered by Quagga.
- BGP is used to propagate routes to-and-from AWS.
- When a Site-to-Site VPN is configured to use Dynamic routing, it will state that the tunnel is Down if AWS cannot reach the BGP instance.
ipsec_tunnels is used by XFRM and strongSwan to:
- Build one XFRM interface per-tunnel
- Build one alias interface bound to each XFRM interface for static routing
- Configure the static routing of the XFRM interfaces
- Configure the BGP daemon neighbours
- Configure one IPSec endpoint per-tunnel
- Configure one IPSec tunnel for each XFRM interface
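Since Terraform already knows every one of these values after provisioning, the role variables can be rendered straight from the VPN connection. The output below is a sketch of how that hand-off might look: the output name is hypothetical, the ifid and interface names are copied from the variables above, and I'm assuming inside_addr maps to the tunnel's inside CIDR, gateway to the tunnel's outside address and bgp_remote_as to the AWS-side ASN.

```hcl
# Sketch: render the Ansible role variables from the provisioned VPN connection.
output "ansible_vpn_vars" {
  # The pre-shared keys are sensitive, so the whole output must be marked as such.
  sensitive = true

  value = yamlencode({
    bgp_remote_as = tostring(aws_vpn_connection.main.tunnel1_bgp_asn)
    ipsec_tunnels = [
      {
        ifid        = "301"
        name        = "xfrm0"
        inside_addr = aws_vpn_connection.main.tunnel1_inside_cidr
        gateway     = aws_vpn_connection.main.tunnel1_address
        psk         = aws_vpn_connection.main.tunnel1_preshared_key
      },
      {
        ifid        = "302"
        name        = "xfrm1"
        inside_addr = aws_vpn_connection.main.tunnel2_inside_cidr
        gateway     = aws_vpn_connection.main.tunnel2_address
        psk         = aws_vpn_connection.main.tunnel2_preshared_key
      }
    ]
  })
}
```

Running `terraform output -raw ansible_vpn_vars` would then print YAML that can be saved as a variables file for the role.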

Required packages

I used three components for this workflow to function, and a last one for debugging security association errors.

strongswan-full
- strongSwan provides an IPSec implementation for OpenWRT with full support for UCI. The -full variation of the package is overkill, but you never know!
quagga-bgpd
- A BGP implementation light enough to run on OpenWRT. quagga comes in as a dependency
luci-proto-xfrm
- A virtual interface for use by IPSec, where a tunnel requires a vif to bind to.
ip-full
- Provides an xfrm argument for debugging IPSec connections with.

- name: install required packages
  opkg:
    name: "{{ item }}"
  loop:
    - strongswan-full
    - quagga-bgpd
    - luci-proto-xfrm
    - ip-full

Adding the XFRM Interface

OpenWRT LuCI dashboard, showing the final result of the interfaces tab. I declare two XFRM interfaces, one per VPN tunnel provided by AWS, each with an IPv4 assigned that matches the Inside IPv4 CIDRs defined within the AWS Site-to-site configuration. The IPv4 address is applied to an alias of the adapter rather than the adapter itself as the XFRM interface doesn’t support static IP addressing via UCI.

Ansible Task

- name: configure xfrm adapter
  uci:
    command: section
    key: network
    type: interface
    name: "{{ item.name }}"
    value:
      tunlink: 'wan'
      mtu: '1300'
      proto: xfrm
      ifid: "{{ item.ifid }}"
  loop: "{{ ipsec_tunnels }}"

/etc/config/network – UCI Configuration

config interface 'xfrm0'
	option ifid '301'
	option tunlink 'wan'
	option mtu '1300'
	option proto 'xfrm'

config interface 'xfrm1'
	option tunlink 'wan'
	option mtu '1300'
	option proto 'xfrm'
	option ifid '302'

I use the uci task to deploy adapter configurations. I create one interface per-tunnel provided by AWS.

tunlink sets the IPSec tunnel to connect to & from the wan interface
mtu is 1300 by default; I didn’t need to change this value
ifid is defined as strongSwan will use this to bind an IPSec tunnel to a network interface. This is separate from the name of the interface.

AWS needs to communicate with the BGP instance running on OpenWRT. The value of Inside IPv4 CIDR instructs AWS which IPs to listen on for their BGP instance, and which IP to connect to for fetching routes. The CIDRs will be restricted to the /30 prefix, which provides the range of 4 IP addresses, 2 of which are usable.

As an example, here is the Inside IPv4 CIDR of 169.254.181.60/30 and what that means.

IP Index	IP Address	Responsibility
0	`169.254.181.60`	Network address
1	`169.254.181.61`	IP Address reserved for AWS-side of the IPSec tunnel
2	`169.254.181.62`	IP Address reserved for the OpenWRT-side of the IPSec tunnel
3	`169.254.181.63`	Broadcast address

With this known, we know that;

On the AWS side of the IPSec tunnel	On the OpenWRT side of the IPSec tunnel
AWS has a BGP instance listening on the IP Address on the first index ( `169.254.181.61` )	OpenWRT needs to be configured to use the IP address on the second index (`169.254.181.62`)
AWS is expecting a BGP neighbour on the second index ( `169.254.181.62` )	The BGP daemon running on OpenWRT needs to be configured to use the neighbor at the first index (`169.254.181.61`)
AWS knows how to route traffic across the `169.254.181.60` network	OpenWRT needs to know to route traffic on the `169.254.181.60` network.
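
The /30 breakdown above can be reproduced with Python's standard ipaddress module. This is only an illustrative sketch: the function name and the dictionary keys are my own labels and are not part of AWS or OpenWRT.

```python
import ipaddress


def describe_inside_cidr(cidr: str) -> dict:
    """Split a /30 Inside IPv4 CIDR into the four roles from the table above."""
    network = ipaddress.ip_network(cidr)
    if network.prefixlen != 30:
        raise ValueError("expected a /30 network")
    return {
        "network": str(network[0]),       # index 0: network address
        "aws_side": str(network[1]),      # index 1: AWS end of the tunnel
        "openwrt_side": str(network[2]),  # index 2: OpenWRT end of the tunnel
        "broadcast": str(network[3]),     # index 3: broadcast address
    }


if __name__ == "__main__":
    for role, address in describe_inside_cidr("169.254.181.60/30").items():
        print(f"{role}: {address}")
```

Indexing the network object is the same idea as the index column in the table: index 1 is the BGP neighbour on the AWS side, index 2 is the address OpenWRT needs to hold.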

Configuring the IP Address on the IPSec tunnel

I create an alias on top of the originating XFRM interface so that I can utilize the static protocol within UCI to configure static routing in a declarative way.

Ansible Task

name: create xfrm alias for static addressing
uci:
  command: section
  key: network
  type: interface
  name: "{{ item.name }}_s"
  value:
    proto: static
    ipaddr:
      - "{{ item.inside_addr | ipaddr('net') | ipaddr(2) }}"
    device: "@{{ item.name }}"
loop: "{{ ipsec_tunnels }}"

/etc/config/network – UCI Configuration

config interface 'xfrm0_s'
	option proto 'static'
	option device '@xfrm0'
	list ipaddr '169.254.211.46/30'

config interface 'xfrm1_s'
	option proto 'static'
	list ipaddr '169.254.181.62/30'
	option device '@xfrm1'

I use ipaddr('net') | ipaddr(2) to simplify my Ansible configuration. inside_addr is 169.254.181.60/30 and these functions simply increase the IP address by two, giving the result of 169.254.181.62/30.

This will ensure two things;

The xfrm interface persistently holds the 169.254.181.62/30 IP Address
The Linux routing table holds a route of 169.254.181.60/30 via the xfrm interface

This resolves the issue of OpenWRT knowing what IP Address to use and how to route the traffic.

Setting up IPSec

Because I’m using strongSwan, I can also use UCI to configure the IPSec tunnel. With this workflow, IPSec configuration is broken down into three elements;

Endpoint
- Primarily what’s known as “IKE Phase 1”. This is the “How I will connect to the other end”.
Tunnel
- Primarily known as “IKE Phase 2”. This is the “How do I pass traffic through to the other end”.
Encryption
- A set of rules to describe how to handle the cryptography.

IPSec Encryption

What’s defined here drives whether Phase 1 will succeed, and must match the AWS VPN Encryption settings.

Ansible Task

name: define ipsec encryption
uci:
  command: section
  key: ipsec
  type: crypto_proposal
  name: "aws"
  value:
    is_esp: '1'
    dh_group: modp1024
    encryption_algorithm: aes128
    hash_algorithm: sha1

/etc/config/ipsec – UCI Configuration

config crypto_proposal 'aws'
	option is_esp '1'
	option dh_group 'modp1024'
	option encryption_algorithm 'aes128'
	option hash_algorithm 'sha1'

In my case, I’m;

Using AES128 for encryption of the traffic
Using SHA1 as the integrity algorithm for ensuring packets are correct upon arrival
Naming the crypto_proposal aws for use by the Endpoint and the Tunnel

AES128 and SHA1 are supported by the configuration defined on the VPN configuration above.

Declaring the IPSec Endpoint

Ansible Task

name: configure ipsec remote
uci:
  command: section
  key: ipsec
  type: remote
  name: "{{ item.name }}_ep"
  value:
    enabled: "1"
    gateway: "{{ item.gateway }}"
    local_gateway: ""
    local_ip: "10.0.0.1"
    crypto_proposal:
      - aws
    tunnel:
      - "{{ item.name }}"
    authentication_method: psk
    pre_shared_key: "{{ item.psk }}"
    fragmentation: yes
    keyingretries: '3'
    dpddelay: '30s'
    keyexchange: ikev2
loop: "{{ ipsec_tunnels }}"

/etc/config/ipsec – UCI Configuration

config remote 'xfrm0_ep'
	option enabled '1'
	option gateway ''
	option local_gateway ''
	option local_ip '10.0.0.1'
	list crypto_proposal 'aws'
	list tunnel 'xfrm0'
	option authentication_method 'psk'
	option pre_shared_key ''
	option fragmentation '1'
	option keyingretries '3'
	option dpddelay '30s'
	option keyexchange 'ikev2'

config remote 'xfrm1_ep'
	option enabled '1'
	option gateway ''
	option local_gateway ''
	option local_ip '10.0.0.1'
	list crypto_proposal 'aws'
	list tunnel 'xfrm1'
	option authentication_method 'psk'
	option pre_shared_key ''
	option fragmentation '1'
	option keyingretries '3'
	option dpddelay '30s'
	option keyexchange 'ikev2'

The gateway is known as the Outside IP Address on AWS
local_gateway points to the WAN Address of OpenWRT
local_ip points to the LAN address of OpenWRT
crypto_proposal points to aws (Defined above)
tunnel points to the name of the interface that this IPSec endpoint represents.
- Since there are two IPSec endpoints, two of these remotes are created. I use the interface name (from xfrm) across all duplicates to make sure that it’s visibly clear what’s being used where.
pre_shared_key is the PSK that gets generated (or set) within the VPN Tunnel.
- This is unique per-tunnel, meaning that there should be two different PSKs per Site-to-site VPN connection. They can be found under the Modify VPN Tunnel Options selection.

Configuring the IPSec Tunnel

The tunnel instructs strongSwan how to bind the IPSec tunnel to an interface. The key here is the ifid of the XFRM interfaces defined earlier.

Ansible Task

name: configure ipsec tunnel
uci:
  command: section
  key: ipsec
  type: tunnel
  name: "{{ item.name }}"
  value:
    startaction: start
    closeaction: start
    crypto_proposal: aws
    dpdaction: start
    if_id: "{{ item.ifid }}"
    local_ip: "10.0.0.1"
    local_subnet:
      - 0.0.0.0/0
    remote_subnet:
      - 0.0.0.0/0
loop: "{{ ipsec_tunnels }}"

/etc/config/ipsec – UCI Configuration

config tunnel 'xfrm0'
	option startaction 'start'
	option closeaction 'start'
	option crypto_proposal 'aws'
	option dpdaction 'start'
	option if_id '301'
	option local_ip '10.0.0.1'
	list local_subnet '0.0.0.0/0'
	list remote_subnet '0.0.0.0/0'

config tunnel 'xfrm1'
	option startaction 'start'
	option closeaction 'start'
	option crypto_proposal 'aws'
	option dpdaction 'start'
	option if_id '302'
	option local_ip '10.0.0.1'
	list local_subnet '0.0.0.0/0'
	list remote_subnet '0.0.0.0/0'

Like the AWS configuration, I define the local_subnet and remote_subnet to 0.0.0.0/0. This is so I can focus on testing connectivity.
if_id points to the XFRM interface that’s representing the tunnel in iteration.
- The if_id must match the tunnel in iteration, as the Inside IPv4 CIDRs have been bound to an interface.
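
Because the ifid and the Inside IPv4 CIDR have to line up per tunnel, a quick sanity check over the ipsec_tunnels variable can catch mistakes before the playbook runs. The sketch below is an assumption about how such a check could look: the dictionary keys mirror item.name, item.ifid and item.inside_addr from the tasks above, while check_tunnels itself is a hypothetical helper.

```python
import ipaddress


def check_tunnels(tunnels: list[dict]) -> list[str]:
    """Return a list of human-readable problems found in the tunnel definitions."""
    problems = []
    seen_ifids = {}
    for tunnel in tunnels:
        name, ifid = tunnel["name"], tunnel["ifid"]
        # Each XFRM interface needs its own ifid, otherwise strongSwan
        # cannot bind a tunnel to exactly one interface.
        if ifid in seen_ifids:
            problems.append(f"{name}: ifid {ifid} already used by {seen_ifids[ifid]}")
        else:
            seen_ifids[ifid] = name
        # AWS restricts the Inside IPv4 CIDR to a /30.
        network = ipaddress.ip_network(tunnel["inside_addr"], strict=False)
        if network.prefixlen != 30:
            problems.append(f"{name}: inside_addr must be a /30, got /{network.prefixlen}")
    return problems


if __name__ == "__main__":
    tunnels = [
        {"name": "xfrm0", "ifid": "301", "inside_addr": "169.254.211.44/30"},
        {"name": "xfrm1", "ifid": "302", "inside_addr": "169.254.181.60/30"},
    ]
    print(check_tunnels(tunnels) or "tunnel definitions look consistent")
```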

Configuring BGP on OpenWRT

In order to apply BGP routes on the AWS-side, route propagation must be enabled at the route table level. Otherwise, a static route pointing to my home network (10.0.0.0/24) via the Virtual Private Gateway must be declared.

I opted for Quagga when using BGP on OpenWRT.

router bgp 65000
bgp router-id {{ ipsec_inside_cidrs[0] | ipaddr('net') | ipaddr(2) | split('/') | first }}
{% for ipsec_inside_cidr in ipsec_inside_cidrs %}
neighbor {{ ipsec_inside_cidr | ipaddr('net') | ipaddr(1) | split('/') | first }} remote-as {{ bgp_remote_as }}
neighbor {{ ipsec_inside_cidr | ipaddr('net') | ipaddr(1) | split('/') | first }} soft-reconfiguration inbound
neighbor {{ ipsec_inside_cidr | ipaddr('net') | ipaddr(1) | split('/') | first }} distribute-list localnet in
neighbor {{ ipsec_inside_cidr | ipaddr('net') | ipaddr(1) | split('/') | first }} distribute-list all out
neighbor {{ ipsec_inside_cidr | ipaddr('net') | ipaddr(1) | split('/') | first }} ebgp-multihop 2
{% endfor %}

/etc/quagga/bgpd.conf – Rendered Template

router bgp 65000
bgp router-id 169.254.211.46
neighbor 169.254.211.45 remote-as 64512
neighbor 169.254.211.45 soft-reconfiguration inbound
neighbor 169.254.211.45 distribute-list localnet in
neighbor 169.254.211.45 distribute-list all out
neighbor 169.254.211.45 ebgp-multihop 2
neighbor 169.254.181.61 remote-as 64512
neighbor 169.254.181.61 soft-reconfiguration inbound
neighbor 169.254.181.61 distribute-list localnet in
neighbor 169.254.181.61 distribute-list all out
neighbor 169.254.181.61 ebgp-multihop 2

Like earlier, I use ipaddr('net') | ipaddr(1) to increment the IP address from the CIDR
remote-as defines the AWS-side ASN.
- BGP at its core defines routes based on the path to an AS, a layer on top of IP addresses.
- It’s designed to work with direct connections, not over-the-internet.
  - ISPs & exchanges will, however, use BGP at their level to forward the traffic on.
router bgp states what the ASN of the OpenWRT router is. Because I used the default of 65000 from AWS, I place that here.
bgp router-id is set to the first XFRM interface’s IP address, since the same BGP instance will be shared by both tunnels in the event that one tunnel goes down. AWS does not do a validation check on the router-id.
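
To make the template logic easier to follow outside of Jinja, here is a rough Python equivalent. It is a sketch only: render_bgpd is a hypothetical helper that mirrors the template above, taking the router-id from index 2 of the first CIDR and the neighbour from index 1 of each CIDR.

```python
import ipaddress


def render_bgpd(local_as: int, remote_as: int, inside_cidrs: list[str]) -> str:
    """Render the bgpd.conf lines the same way the Jinja template does."""
    networks = [ipaddress.ip_network(cidr, strict=False) for cidr in inside_cidrs]
    lines = [
        f"router bgp {local_as}",
        # The router-id is the OpenWRT-side address of the first tunnel.
        f"bgp router-id {networks[0][2]}",
    ]
    for network in networks:
        neighbor = network[1]  # the AWS-side address of this tunnel
        lines += [
            f"neighbor {neighbor} remote-as {remote_as}",
            f"neighbor {neighbor} soft-reconfiguration inbound",
            f"neighbor {neighbor} distribute-list localnet in",
            f"neighbor {neighbor} distribute-list all out",
            f"neighbor {neighbor} ebgp-multihop 2",
        ]
    return "\n".join(lines)


if __name__ == "__main__":
    print(render_bgpd(65000, 64512, ["169.254.211.44/30", "169.254.181.60/30"]))
```

With the two Inside IPv4 CIDRs used in this post, the output should line up with the rendered template shown above.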

Verifying the connection to IPSec

Using the swanctl command, I can identify whether my applied configuration is successful when logged into my OpenWRT router using SSH.

Start swanctl

I don’t use the legacy ipsec init script; instead, I use the swanctl one directly. Under the hood, this converts the UCI configuration into a strongSwan configuration located at /var/swanctl/swanctl.conf

/etc/init.d/swanctl start
ipsec statusall

Output of the ipsec statusall command, where both VPN tunnels are ESTABLISHED and INSTALLED. Established denotes that IKE Phase 1 (Encryption negotiation) was successful and Installed denotes that IKE Phase 2 (Authorization, the tunnel creation itself) was successful and is now in use.

Connection can also be verified from the AWS Console, by looking at the value of Details. If the connection doesn’t say IPSEC IS DOWN, the connection was successful. Status is only up when BGP can be reached from AWS. When using Dynamic (not static) routing in the configuration for Site-to-Site, AWS doesn’t declare a connection up unless BGP is reachable at the second address available in the Inside IPv4 CIDR.

Routing traffic to & from the XFRM Interface

Finally, I need to instruct OpenWRT to allow forwarding of packets that are destined for xfrm0 or xfrm1. The Linux routing table states that 10.1.0.0/24 is reached via xfrm0 (a route applied via BGP), which is enough for the firewall to know that either xfrm0 or xfrm1 is the interface required.

By default, a flag of REJECT is defined. By applying the following firewall rule, packets successfully go through to the AWS VPC.

Ansible Task

name: install firewall zone
uci:
  command: section
  key: firewall
  type: zone
  find_by:
    name: 'tunnel'
  value:
    input: REJECT
    output: ACCEPT
    forward: REJECT
    network:
      - xfrm0
      - xfrm1

name: install firewall forwarding
uci:
  command: section
  key: firewall
  type: forwarding
  find_by:
    dest: 'tunnel'
  value:
    src: lan

/etc/config/firewall – UCI Configuration

config zone
	option name 'tunnel'
	option input 'REJECT'
	option output 'ACCEPT'
	option forward 'REJECT'
	list network 'xfrm0'
	list network 'xfrm1'

config forwarding
	option src 'lan'
	option dest 'tunnel'

Final tasks

The final steps of the Ansible playbook are to instruct the UCI framework to save the changes to disk and to reload the configuration of all required services.

name: commit changes
uci:
  command: commit

name: enable required services
service:
  name: "{{ item }}"
  enabled: yes
  state: reloaded
loop:
  - swanctl
  - quagga
  - network

I then invoke the Ansible playbook by using a local-exec provisioner on a null_resource within Terraform, where the AWS Site-to-Site resource is a dependency. Along the lines of:

resource "null_resource" "cluster" {
  provisioner "local-exec" {
    command = <





This is a shortened version of what I have, but by simply piping the Ansible playbook with the outputs of the AWS Site-to-Site Resource, my router is automatically configured correctly when I create a Site-to-Site resource.



With IPSec now deployed, I can communicate directly with my resources hosted on AWS as if they were local.



Streaming a bike riders’ journey with AWS MediaLive
Zak — Mon, 25 Nov 2024 14:11:11 +0000

I want to showcase my bike journeys with live content from my GoPro and an overlay of my rough location on the stream. I’d like to use multiple GoPros mounted in different locations on my bike, with a ‘remote’ to switch between the main on-stream view with minimal effort.



Practical Considerations



When designing an approach, I had to consider:

Streaming over a mobile connection meant that there could be dead zones
- I’d need to ensure that the stream runs consistently in the event of any outage, such as providing a “waiting for connection” loop footage
The battery life of all equipment could mean I need to carry a lot of backup power
Risk of theft and too much weight if I carried too much equipment.
Determining whether 4G was enough or if I required 5G.
- Taking a look at what the bitrate of the GoPro in streaming mode is
Having a portable network that supports IPSec or OpenVPN to work with AWS-offered VPNs.
- By adopting cloud computing, I don’t need to carry a computer with me or need to maintain a machine’s uptime for processing the video streams






AWS Components




Site-to-site or client VPN from the network that the GoPro is connected to into a VPC in AWS.
- The VPC is located in the closest region to me to reduce latency.
- As RTMP is unsecured, I need an alternative method to protect the stream in transit
Stream the GoPro footage into Elemental MediaLive RTMP Inputs
Send GPS updates from my mobile phone to API Gateway
- AWS Batch or EC2 will generate overlay images
API Gateway and Lambda expose functions for my “remote” to control the stream.




The Technical Journey I’ll Undergo



Choosing the right equipment for the job



To start, I’ll be looking at the bigger picture and choosing the right physical equipment that covers my needs: long-lasting battery life and portable networking, with the least weight on the bike.



Exploring Configuration for IPSec & OpenVPN



I will explore the configuration of IPSec and OpenVPN for getting my GoPros into the AWS network.



Exploring MediaLive and the Elemental Suite



I’ll be investigating what MediaLive and the Elemental suite have to offer when it comes to streaming for my use-case and designing the optimal streaming pipeline for my rides.



Developing APIs for MediaLive and Exploring API Gateway



I will develop APIs for MediaLive and explore API Gateway for controlling the stream on the road. I’ll have to define the API specification and decide on the appropriate language to write the Lambda functions in.



Exploring AWS Batch or EC2 for On-Demand Element Rendering



To render overlay elements on the fly, I will explore AWS Batch or EC2. As the Elemental suite doesn’t cover this specific case of on-screen graphics (to the level of customization I want), I’d need to explore the best tool for the job that scales well.



Determining Which Codecs to Adopt and Where to Stream To



I’ll investigate the best way to distribute my stream, and how to conform my MediaLive channel pipeline to the target platform.



Storing All Configuration as Code



To ensure that the setup is reproducible and easy to maintain, I will store all the configuration as code using Terraform and Ansible.



I’ll also be travelling on a creative journey during the process, such as designing the graphics, choosing the best shots for the stream and improving the experience for the user.







Deploying Region-locked AWS Organizations using Terraform
Zak — Mon, 07 Oct 2024 12:30:03 +0000

As a solutions architect, I was tasked with building an AWS Organizations hierarchy for a Canadian startup that needed to comply with local laws and enable multi-site configurations for networking.



To get started, I built an AWS Organizations hierarchy using Terraform. I chose Terraform because it allows me to use the same workflow for building organizations across multiple clouds. This post will focus on building an Organizational Unit (OU) tree for regions and localities.



To create OUs, I have a “basic” Terraform module that is a wrapper on the aws_organizations_organizational_unit resource. To make it reusable, I expose the name and parent. I then specialize the “basic” Terraform module into ones more specific to each organization by injecting tags and appending a postfix to the name of the OU, such as the region or locality.



For compliance, I restrict at the OU-level which regions can be used by the AWS account and any IAM users assuming the role of this AWS account. I use a Service Control Policy (SCP) to deny access to all regions except for those specified in the local.regions value. Because a lot of core infrastructure for AWS is located within us-east-1 and us-east-2, such as Billing, I need to always include them in the local.regions value.



Since I need to cater for both compliance and multi-site, I used my modules to build the OUs in the following hierarchy:




Root Organization
- Region OU (e.g: North America)
  - Country OU (e.g: Canada)
    - Locality OU (e.g: Vancouver)










And with Terraform modules structured in the following way:




Root Organization

Client Module

Client Root Organizational Unit



Region / Country / Locality Module

Base OU Module

Region / Country / Locality Organizational Unit





Region Policies

SCP Policy












In the case of Vancouver, while the Seattle local zone or us-west-2 region is closer, it’s not located within Canada which may be a problem when looking at local labor laws and compliance, so Calgary (ca-west-1) is the next best thing. I’m waiting for the Vancouver local zone to become publicly available so that I can use that, but it will fall under Calgary anyway.



This means that my SCPs restrict the organizational units to the following regions:




North American OU
- us-west-1, us-west-2, us-east-1, us-east-2, ca-central-1, ca-west-1
Canadian OU
- us-east-1, us-east-2, ca-central-1, ca-west-1
Vancouver OU
- us-east-1, us-east-2, ca-west-1






Because of the hierarchy approach, I can have AWS accounts in parents with shared resources such as VPCs, Databases, S3 and EFS shares. This will be hugely beneficial when working across multiple sites.



My re-usable modules follow the following structure:



Core Organizational Unit



This holds the default values for all OUs within the organization, where tags for example would be shared.



resource "aws_organizations_organizational_unit" "root" {
  name      = var.name
  parent_id = var.parent

  tags = var.tags
}



Inheriting the Basic OU into Locality, Country & Region OUs



I re-use the basic module to make it follow a strict naming and tag convention based on the context (e.g: locality, country and region). This module is for the context and not specifically the region in question. The region in question will then re-use this module.



This makes sure that the NA Region and European Region have the same fundamentals between them.



module "basic" {
  source = "../basic"
  parent = var.parent
  name = "${var.name} - ${var.locality}"
  tags = local.tags
}



module "policies" {
  source = "../../../regions/policies"
  policy_name = local.policy_name
  target_id = module.basic.id
  regions = var.regions
}



Inheriting the Context OU Module into literal regions



Here, I take the region context module and adapt it specific to North America. The same logic applies to country and locality. This simply enforces that the tags and name of the OU contain the region and that the SCPs generated block all regions except the regions provided



module "region" {
  source = "../../templates/organization/region"
  region = "North America"
  parent = var.parent
  name = var.name
  tags = local.tags
  regions = [
    "us-east-1",
    "us-east-2",
    "us-west-1",
    "us-west-2",
    "ca-central-1",
    "ca-west-1"
    ]
}



Generating the SCPs from Terraform



policy_name here is the same as the name of an OU with spaces removed. Because the SCP is written as a Deny statement, the StringNotEquals test is needed: every request whose aws:RequestedRegion is not in the list is denied.



data "aws_iam_policy_document" "region_restriction" {
  statement {
    sid = "RestrictRegionFor${var.policy_name}"
    effect    = "Deny"
    actions   = ["*"]
    resources = ["*"]

    condition {
      test = "StringNotEquals"
      variable = "aws:RequestedRegion"
      values = local.regions
    }
  }
}



resource "aws_organizations_policy" "region_restriction" {
  name    = "RestrictRegionFor${var.policy_name}"
  content = data.aws_iam_policy_document.region_restriction.json
  type = "SERVICE_CONTROL_POLICY"
}
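
For readers who want to see the shape of the generated document without running Terraform, the sketch below builds the same kind of policy in Python. It is an approximation under assumptions: region_restriction_policy and ALWAYS_ALLOWED are hypothetical names, and the de-duplication step is my own addition rather than something the Terraform module is known to do.

```python
import json

# Regions that always stay reachable, as described earlier in this post.
ALWAYS_ALLOWED = ["us-east-1", "us-east-2"]


def region_restriction_policy(policy_name: str, regions: list[str]) -> dict:
    """Build a region-restricting SCP document as a plain dictionary."""
    # dict.fromkeys keeps the first occurrence of each region and preserves order.
    allowed = list(dict.fromkeys([*regions, *ALWAYS_ALLOWED]))
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": f"RestrictRegionFor{policy_name}",
                "Effect": "Deny",
                "Action": "*",
                "Resource": "*",
                "Condition": {
                    "StringNotEquals": {"aws:RequestedRegion": allowed}
                },
            }
        ],
    }


if __name__ == "__main__":
    print(json.dumps(region_restriction_policy("ZAIVancouver", ["ca-west-1"]), indent=2))
```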



Declaring a Regional OU for an Organization



Finally, I can use the North American OU to declare an OU that restricts any AWS Accounts inside to only create resources within North America.



module "region-na" {
  source = "../regions/north-america"
  parent = aws_organizations_organizational_unit.root.id
  name = var.name
  tags = local.tags
}



I can do the same with Canada, and Vancouver.



module "region-ca" {
  source = "../regions/north-america/canada"
  parent = module.region-na.id
  name = var.name
  tags = module.region-na.tags
}

module "region-yvr" {
  source = "../regions/north-america/canada/vancouver"
  parent = module.region-ca.id
  name = var.name
  tags = module.region-ca.tags
}



By the end of the deployment, my hierarchy looks like the following:







And the attached SCPs look like the following. Every SCP on the path from the root down to an AWS Account applies, so the most restrictive one, attached to the OU closest to the account, decides which regions are usable:



ZAI – North America



{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "RestrictRegionForZAINorthAmerica",
      "Effect": "Deny",
      "Action": "*",
      "Resource": "*",
      "Condition": {
        "StringNotEquals": {
          "aws:RequestedRegion": [
            "us-east-1",
            "us-east-2",
            "us-west-1",
            "us-west-2",
            "ca-central-1",
            "ca-west-1",
            "us-east-1",
            "us-east-2"
          ]
        }
      }
    }
  ]
}



ZAI – Canada



{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "RestrictRegionForZAICanada",
      "Effect": "Deny",
      "Action": "*",
      "Resource": "*",
      "Condition": {
        "StringNotEquals": {
          "aws:RequestedRegion": [
            "ca-central-1",
            "ca-west-1",
            "us-east-1",
            "us-east-2"
          ]
        }
      }
    }
  ]
}



ZAI – Vancouver



{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "RestrictRegionForZAIVancouver",
      "Effect": "Deny",
      "Action": "*",
      "Resource": "*",
      "Condition": {
        "StringNotEquals": {
          "aws:RequestedRegion": [
            "ca-west-1",
            "us-east-1",
            "us-east-2"
          ]
        }
      }
    }
  ]
}
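
Since a request is denied as soon as any one of these Deny statements matches, the regions that remain usable for an account are the ones present in every list on its path. A minimal sketch of that reasoning, assuming one region-restriction SCP per OU as above (effective_regions is a hypothetical helper, not an AWS API):

```python
def effective_regions(allow_lists: list[list[str]]) -> set[str]:
    """Regions that survive every region-restriction SCP on the path to an account."""
    allowed = set(allow_lists[0])
    for regions in allow_lists[1:]:
        # A region missing from any list is denied by that OU's SCP.
        allowed &= set(regions)
    return allowed


if __name__ == "__main__":
    north_america = ["us-east-1", "us-east-2", "us-west-1", "us-west-2", "ca-central-1", "ca-west-1"]
    canada = ["ca-central-1", "ca-west-1", "us-east-1", "us-east-2"]
    vancouver = ["ca-west-1", "us-east-1", "us-east-2"]
    print(sorted(effective_regions([north_america, canada, vancouver])))
```

This only models the region condition; an account still needs an Allow from its SCPs and IAM policies for a request to succeed.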



Here it is in action, when exploring a region that is blocked by the SCP Policy:







Idea: Adopting Serverless for Trading Operations
Zak — Sat, 05 Oct 2024 21:20:54 +0000

I’m not very into day-trading, but I see the potential in the market from time to time, so I came up with this idea to create an automated trading system using AWS services exclusively.



The system will use Lambda, Timestream, EventBridge, S3, SQS and SageMaker to create a serverless architecture for monitoring and trading on the stock market, using the Twelvedata and Coinbase APIs for pulling in market data and executing trades, respectively.



To start, I will use EventBridge as an alternative to cron-jobs to add symbols to an SQS queue for ingestion. This ties in with the use of serverless architecture. For FOREX, the schedule will run every hour, and for Crypto and stock market, it will run every 15 minutes. This is a good balance as I’m not a professional trader and don’t need to use too many API calls.



I will have five Lambda functions:




The first Lambda function will listen to the SQS queue and query Twelvedata for the mentioned symbols. It will then insert the data directly into Timestream.



The second Lambda function will be triggered by an alert from Timestream when new data is available. For safety (and to start with), I have configured this alert to trigger hourly. The function will throw the data at the SageMaker model. If the model predicts a positive yield, the Lambda function will pass the symbol to the third lambda function via another SQS Queue.



The third Lambda function will execute a transaction on Coinbase.



The fourth Lambda function will monitor Twelvedata and Coinbase for hot & trending symbols and add them to the monitoring queue, triggered by another EventBridge Schedule.



The fifth Lambda function will create a *.csv dataset from the data within Timestream.
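
As a starting point for the first function, the sketch below shows how a handler could pull symbols out of an SQS-triggered Lambda event. The Records and body fields are the standard shape of an SQS event, but the message format (a JSON object with a symbol key) and the function names are assumptions for illustration. The calls to Twelvedata and Timestream are deliberately left out.

```python
import json


def symbols_from_sqs_event(event: dict) -> list[str]:
    """Extract the symbols queued for ingestion from an SQS-triggered Lambda event."""
    symbols = []
    for record in event.get("Records", []):
        # Assumed message format: {"symbol": "<ticker>"}
        body = json.loads(record["body"])
        symbols.append(body["symbol"])
    return symbols


def handler(event: dict, context: object) -> dict:
    """Lambda entry point: report which symbols would be fetched."""
    symbols = symbols_from_sqs_event(event)
    # Querying Twelvedata and writing to Timestream would happen here.
    return {"symbols": symbols}


if __name__ == "__main__":
    sample = {"Records": [{"body": json.dumps({"symbol": "EUR/USD"})}]}
    print(handler(sample, None))
```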




I will use Secrets Manager to securely store the API keys for Twelvedata and Coinbase.



I’m not an AI expert and don’t know much about the specifics of training a model, so I’ll be using the SageMaker Canvas feature to train the model. Canvas is the easiest way into training AI models that doesn’t require writing a Python script.



Finally, at the end of each day, I’ll extract a dataset from the Timestream database into a *.csv and store it in S3, then pass this file onto SageMaker for training. I’ll use one last EventBridge schedule to trigger this workflow.



Hopefully by following this approach, I’ll have a fully functioning market monitoring and trading system. 







Outline: Extending a home network setup in AWS
Zak — Sun, 22 Sep 2024 12:30:18 +0000

As someone without a permanent base, I needed a secure and flexible cloud infrastructure that allowed me to spawn powerful machines when needed. To achieve this, I built an isolated network on AWS.



I began by creating a Terraform module that provisions the infrastructure needed, such as

VPCs
Subnets
Routing tables
EC2 instances

While the module is tailored to AWS, I plan to keep the variable names consistent with other modules that re-create the setup for different cloud platforms, such as Exoscale.




The isolated network is centered around an EC2 instance, which acts as a router between a public VPC and a private VPC, similar to an at-home router. The EC2 instance has two ENI adapters, one attached to the public VPC and the other attached to the private VPC. The EC2 instance is running VyOS, which I configured using Ansible and the local-exec provisioner in Terraform upon creation.



data "aws_ami" "vyos" {
  most_recent = true

  filter {
    name   = "name"
    values = ["VyOS 1.4.0-*"]
  }

  filter {
    name   = "virtualization-type"
    values = ["hvm"]
  }

  owners = ["679593333241"]
}



resource "aws_instance" "vyos" {
  ami           = data.aws_ami.vyos.id
  availability_zone = data.aws_availability_zones.region.names[0]
  instance_type = "t3.small"

  network_interface {
    network_interface_id = aws_network_interface.public.id
    device_index         = 0
  }

  network_interface {
    network_interface_id = aws_network_interface.local.id
    device_index         = 1
  }

  provisioner "local-exec" {
        command = "ansible-playbook -i \"${aws_eip.public.public_ip},\" "
    }
}



The public VPC has an internet gateway attached to it, and all instances in the public VPC have internet access. The router instance is the only instance that resides in the public VPC. Both VPCs have a subnet within a single availability zone (AZ), as a single EC2 instance cannot span two AZs.



resource "aws_internet_gateway" "gw" {
}

resource "aws_internet_gateway_attachment" "gw" {
  internet_gateway_id = aws_internet_gateway.gw.id
  vpc_id              = aws_vpc.public.id
}



Each VPC has a routing table to correctly route traffic. The public VPC routes all traffic towards the internet gateway, while the private VPC only routes traffic between hosts within its own subnet.



resource "aws_route_table" "public" {
  vpc_id = aws_vpc.public.id

  route {
    cidr_block = "0.0.0.0/0"
    gateway_id = aws_internet_gateway.gw.id
  }
}

resource "aws_route_table" "internal" {
  vpc_id = aws_vpc.internal.id
}

resource "aws_route_table_association" "internal" {
  subnet_id      = aws_subnet.internal.id
  route_table_id = aws_route_table.internal.id
}

resource "aws_route_table_association" "public" {
  subnet_id      = aws_subnet.public.id
  route_table_id = aws_route_table.public.id
}



I connect to my isolated network primarily through my OpenWRT-based router using WireGuard. I also use the WireGuard client on my Mac or phone to connect to the cluster when I’m outside. Keep an eye out for my posts detailing how I deploy VyOS on AWS and configure OpenWRT to connect to WireGuard.



I attached an Elastic IP to the router instance, which lets me destroy and re-build the instance without issue. This is useful when I don’t need the network running, like when I’m flying, or when I’m actively improving the instance.



resource "aws_eip" "public" {
  domain   = "vpc"
}

resource "aws_eip_association" "public" {
  network_interface_id   = aws_network_interface.public.id
  allocation_id = aws_eip.public.id
}



If I need to access any other AWS resource, I add a VPC Endpoint for that resource directly to the private VPC. For example, I use S3FS to mount S3 storage directly on the instance and DynamoDB for building JSONL files for machine learning tasks.



Use Cases



I create a Windows instance inside the subnet when I need to do remote work that involves downloading when I’m outdoors. I also create larger instances for working with AI/Machine Learning models when my Mac isn’t able to load them or when I don’t have storage at a given time.



Multi-Region Setup



To transfer the setup to another region, I simply change the region variable in my Terraform module, and it magically appears in the new region.



Overview: Extending my home network to the cloud
Zak — Wed, 11 Sep 2024 22:02:52 +0000

As a frequent traveller, I found it impractical to maintain a physical system infrastructure, so I relocated my home infrastructure to the cloud.



Establishing a VPN Connection



To begin, I set up a VPN connection from my OpenWRT router to the cloud provider using WireGuard. I created two VPCs in the cloud provider – one public and one private – to mimic the “WAN-LAN” scenario of at-home routers. 



This setup provides isolation similar to a home network, where the resources on the private network can only be accessed by other resources on the same network, but they are also able to communicate with the outside world.



The intention is to have the private network as an extension to my “home” (at any given time).



Deploying a Cloud Router



I deployed a virtual machine that will act as a router spanning both networks. This needs to be across both networks as I need an endpoint to connect to (which requires an internet-exposed network) while still being able to access private resources.



I chose VyOS as the cloud router’s operating system because it is configuration-driven, allowing for an Infrastructure-as-Code (IaC) approach for easy re-deployment on any cloud provider.



Utilizing Object Storage for Plex Media Server



I adopted object storage to take advantage of the “unlimited” data offered by the cloud provider, and configured s3fs to mount the object storage on a specific node. With this, Plex can access data directly from the object storage bucket without many configuration changes or plugins to Plex.



The VPN connection allows me to access the Plex server securely as if it were local on both my PS5 and laptop. This setup ensures that the Plex interface remains non-accessible to the public and bypasses the bandwidth limit when proxying via the official Plex servers.



Securely Pushing Metrics from In-House Devices



By using the VPN connection, I can push metrics from my in-house devices directly, such as weather sensors without exposing my Prometheus instance to the public internet. 



The VPN’s security layer wraps around all traffic, eliminating the need to implement a CA chain for Prometheus, as would be required with platforms such as AWS IoT or Grafana Cloud (where devices are expected to communicate with a public HTTPS endpoint).



Automating At-Home Devices with HomeAssistant



I use HomeAssistant within the cloud provider to automate my at-home devices without worrying about downtime or maintaining a device inside my home. HomeAssistant is scriptable, easily re-deployable, and can bridge a wide range of IoT devices under a single platform, such as HomeKit and Hue.



I can now utilize my old infrastructure without worrying about maintaining hardware, and plan to deploy many services to the private cloud. Keep an eye out for a deeper breakdown on how I deployed and configured each element of my private cloud.







Exoscale Exporter for Prometheus
Zak — Wed, 04 Sep 2024 11:18:20 +0000

Visit Repository



I built a Prometheus exporter for Exoscale, allowing me to visualize cloud spending and resource usage from a central location alongside AWS and DigitalOcean.



The Exoscale exporter is built using Go and leverages the latest version of Exoscale’s Go API, egoscale v3. It also includes basic integration tests and automatic package building for all major platforms and architectures.



Some of the metrics exported are;

Organization Information: Usage, Address, API Keys
Compute Resource Summary: Instances, Kubernetes, Node Pools
Storage Resource Summary: SOS Buckets & Usage, Block Volumes
Networking Resource Summary: Domain & Records, Load Balancers




By integrating organizational data from Exoscale into the Prometheus ecosystem, I can now configure alerts for spending or resource usage on either Exoscale specifically or for all platforms using AlertManager.



I can also use Grafana to identify resources I may have left behind, whether I created them manually or an IaC execution didn’t clean up properly.



Metric Browser in Grafana, showing some of the values exported by the exporter


I decided to deploy the exporter to my Kubernetes cluster, scraping at the default interval of 2 minutes. This strikes a reasonable balance between;

How often the billing amount is updated (hourly)
How often the infrastructure elements themselves get updated (potentially every minute)
How much data is consumed by the time series




I chose a Kubernetes cluster rather than a serverless solution or a dedicated VM so that I can optimize the cost of running the exporter by sharing resources, in addition to abstracting the cloud provider away from the application.







Using AWS CodeBuild to execute Ansible playbooks
Zak — Sat, 06 Apr 2024 19:31:19 +0000

I wanted a clean and automatable way to package third-party software into *.deb format (and others, if needed, in the future), and I had three ways to achieve that;

The simple way: Write a Bash script
The easy way: Write a Python script
My chosen method: Write an Ansible role




While all of the options would get me where I wanted, the Ansible route felt a lot cleaner: I can clearly state (and see) which packages I am building, either at the command-line level or at the playbook level. With the Bash or Python approaches, I would have to maintain a separate configuration file, in yet another format, to drive what to build and where.



The playbook approach also allows me to monitor and execute a build on a remote machine, should I wish to build cross-platform or need larger resources for testing. 



In this scenario, I’ll be executing the Ansible role locally on the CodeBuild instance.



Configuring the CodeBuild Environment



Using GitHub as a source



I have one git repository per Ansible playbook, so by linking CodeBuild to the repository in question I’m able to (eventually) automatically trigger the execution of CodeBuild upon a pushed commit on the main branch.



The only additional setting under sources that I define is the Source version, as I don’t want build executions happening for all branches (as that can get costly).



CodeBuild Environment




For the first iteration of this setup, I am installing the (same) required packages at every launch. This is not the best way to handle pre-installation in terms of cost and build speed. In this instance, I’ve chosen to ignore this and “brute-force” my way through to get a proof-of-concept.





Provisioning Model: On-demand

I’m not pushing enough packages to require a dedicated fleet, so spinning up VMs in response to a pushed commit (~5 times a week) is good enough.





Environment Image: Managed Image

As stated above, my focus was on proving that running Ansible under CodeBuild is possible. A custom image with pre-installed packages is the way to go in the long run.





Compute: EC2

Since I’m targeting *.deb format, I chose Ubuntu as the operating system. The playbook I’m expecting to execute doesn’t require GPU resources either.



AWS Lambda doesn’t support Ubuntu, nor can it execute Ansible directly. I’d have to write a Python wrapper to execute the Ansible playbook, which is more overhead.



Depending on the build time and the size of the resulting package, I had to adjust the memory accordingly. This may be because I’m making use of the /tmp directory by default.






buildspec.yml



I store the following file at the root level of the same Git repository that contains the Ansible playbook.



version: 0.2

phases:
  pre_build:
    commands:
      - apt install -y ansible python3-botocore python3-boto3
      - ansible-galaxy install -r requirements.yaml
      - ansible-galaxy collection install amazon.aws
  build:
    commands:
      - ansible-playbook build.yaml
artifacts:
  files:
    - /tmp/*.deb




As stated above, I’m always installing the required system packages prior to interacting with Ansible. This line (apt install) should be moved into a pre-built image that this CodeBuild environment will then source from.




I keep the role (and therefore, tasks) separate from the playbook itself, which is why I use ansible-galaxy to install the requirements. Each time the session is started, it pulls down a fresh copy of any requirements. This can differ from playbook to playbook.



I use the role for the execution steps, and the playbook (or inventory) to hold the settings that influence the execution, such as (in this scenario) what the package name is and how to package it.



I explicitly include the amazon.aws Ansible collection in this scenario as I’m using the S3 module to pull down sources (or builds of third-party software) and to push built packages up to S3. I’m doing this via Ansible rather than storing the files in Git because of their size, and rather than using CodeDeploy because I don’t plan on deploying the packages to infrastructure but to a repository.



I also had some issues using the Artifacts option within CodeBuild, which led to pushing from Ansible.



Finally, ansible-playbook can be executed once all the prerequisites are met. The only adaptation needed at the playbook level is that localhost is listed as a target. This ensures that the playbook will execute on the local machine.



---
- hosts: localhost



Once all the configuration and repository setup was done, the build executed successfully and I received my first Debian package via CodeBuild using Ansible.



Building a PKI using Terraform
Zak — Sat, 24 Feb 2024 21:09:08 +0000

View Source Code



As part of building a hybrid infrastructure, I explored different technologies for achieving a stable VPN connection from on-premises to AWS and found AWS Client VPN, the client-to-site feature nested within AWS VPC. I explored this before AWS Site-to-Site VPN as I didn’t have the right setup for handling IPSec/L2TP tunnels at the time, and already had OpenVPN handy on my MacBook.



Since I would be using OpenVPN (as that’s what AWS Client VPN uses), I require TLS certificates as a method of authentication and encryption. While AWS provides certificate management features, they come at a cost, making them less suitable for my testing requirements.



I’ve opted to use Terraform to create a custom PKI solution locally, and to prepare for the re-use in larger infrastructure projects.



Working Environment




Machine: MacBook Pro M2
Technologies: Terraform






Terraform Module Breakdown



terraform {
  required_providers {
    tls = {
      source = "hashicorp/tls"
    }
    pkcs12 = {
      source = "chilicat/pkcs12"
    }
  }
}



I’m making use of the following providers within my Terraform project;

hashicorp/tls: For generating the private keys, certificate requests and the certificates themselves
chilicat/pkcs12: For combining the private key and certificate together, a requirement for using the OpenVPN client without embedding the data inside the *.ovpn configuration file (which didn’t come out-of-the-box from AWS)






/**
 * Private key for use by the self-signed certificate, used for
 * future generation of child certificates. As long as the state
 * remains unchanged, the private key and certificate should not
 * re-update at every re-run unless any variable is changed.
 */
resource "tls_private_key" "pem_ca" {
  algorithm = var.algorithm
}



I’ve made the key algorithm controllable from a global variable, as customer requirements may call for a different algorithm. This resource returns a PEM-formatted key.
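
A minimal sketch of how such a variable might be declared. The default value and the validation rule are assumptions rather than part of the module shown here; the three listed values are the algorithms that the tls_private_key resource accepts.

```hcl
# Hypothetical declaration of var.algorithm; the default and the
# validation rule are assumptions, not taken from the original module.
variable "algorithm" {
  type        = string
  description = "key algorithm used for the private keys generated by this module"
  default     = "RSA"

  validation {
    # Algorithms accepted by the tls_private_key resource.
    condition     = contains(["RSA", "ECDSA", "ED25519"], var.algorithm)
    error_message = "The algorithm must be one of RSA, ECDSA or ED25519."
  }
}
```

With a declaration like this, an unsupported value is rejected before any key is generated, rather than surfacing as a provider error.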



/**
 * Generation of the CA Certificate, which is in turn used by
 * the client.tf and server.tf submodules to generate child
 * certificates
 */
resource "tls_self_signed_cert" "ca" {
  private_key_pem = tls_private_key.pem_ca.private_key_pem
  is_ca_certificate = true

  subject {
    country             = var.ca_country
    province            = var.ca_province
    locality            = var.ca_locality
    common_name         = var.ca_cn
    organization        = var.ca_org
    organizational_unit = var.ca_org_name
  }

  validity_period_hours = var.ca_validity

  allowed_uses = [
    "digital_signature",
    "cert_signing",
    "crl_signing",
  ]
}



I then used the tls_self_signed_cert resource to generate the CA certificate itself, passing the private key generated earlier into the private_key_pem attribute. Again, by providing global variables for the CA subject and validity, I’m able to re-run the same Terraform module for multiple clients under different workspaces (or by referencing it from larger modules).



The subject fields I decided to expose are a way to describe exactly what the TLS certificate is for and who it belongs to, without needing to dive back into the module.
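
As a rough illustration of re-using the module, a root configuration might pass the CA variables in as below. The source path and every value are placeholders, and the sketch assumes that the module's remaining inputs have defaults.

```hcl
# Hypothetical root configuration; the source path and all values
# are placeholders chosen for illustration only.
module "pki" {
  source = "./modules/pki"

  algorithm   = "RSA"
  ca_country  = "GB"
  ca_province = "London"
  ca_locality = "London"
  ca_cn       = "vpn-ca.example.internal"
  ca_org      = "ZAI"
  ca_org_name = "Infrastructure"
  ca_validity = 8760 # hours, i.e. one year
}
```

Running this under a separate workspace per client keeps each CA in its own state.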



Adding cert_signing and crl_signing to the allowed_uses list permits the certificate to sign child certificates. This is essential as I still need to generate the certificates for the OpenVPN server and the client.



This resource returns a PEM-formatted certificate.



/**
 * Return the certificate itself. It's the responsibility of
 * the user of this module to determine whether the certificate should
 * be stored locally, transferred or submitted directly to a cloud
 * service
 */
output "ca_certificate" {
  value = tls_self_signed_cert.ca.cert_pem
  sensitive = true
  description = "generated ca certificate"
}



Finally, I return the CA certificate and its key from the module for the user to place where they need to be, for example;



To a local file



# The module outputs are PEM text, so they are written with content
# (content_base64 expects base64-encoded input)
resource "local_file" "ca_key" {
  content  = module.pki.ca_private_key
  filename = "${path.module}/certs/ca.key"
}
resource "local_file" "ca" {
  content  = module.pki.ca_certificate
  filename = "${path.module}/certs/ca.crt"
}



To the AWS Certificate Manager



resource "aws_acm_certificate" "ca" {
  private_key = module.pki.ca_private_key
  certificate_body = module.pki.ca_certificate
}
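
Both examples above read module.pki.ca_private_key, while only the ca_certificate output is shown. A matching output could look like the following sketch, whose name is inferred from those references and whose description is an assumption.

```hcl
# Assumed counterpart to the ca_certificate output; inferred from the
# module.pki.ca_private_key references, not shown in the original module.
output "ca_private_key" {
  value       = tls_private_key.pem_ca.private_key_pem
  sensitive   = true
  description = "private key of the generated ca, in PEM format"
}
```

Marking the output as sensitive only hides it from CLI output; the key is still stored in plain text in the Terraform state.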




Server & Client Certificates



resource "tls_cert_request" "csr_client" { # or "csr_server"
  for_each        = var.clients # or var.servers
  private_key_pem = tls_private_key.pem_client[each.key].private_key_pem # or pem_server[each.key]
  dns_names       = [each.key]

  subject {
    # try() returns the first argument that evaluates without an error
    country             = try(each.value.country, var.default_client_subject.country, var.default_subject.country)
    province            = try(each.value.province, var.default_client_subject.province, var.default_subject.province)
    locality            = try(each.value.locality, var.default_client_subject.locality, var.default_subject.locality)
    common_name         = try(each.value.cn, var.default_client_subject.cn, var.default_subject.cn)
    organization        = try(each.value.org, var.default_client_subject.org, var.default_subject.org)
    organizational_unit = try(each.value.ou, var.default_client_subject.ou, var.default_subject.ou)
  }
}



Regardless of whether I’m generating a server or client TLS certificate, both need to go through the ‘certificate request’ process, which is to;

Generate a private key for the server or client
Generate a certificate signing request based on the private key
Use the CSR to get a CA-signed certificate
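
The first of these steps is not shown in the snippets here. A minimal sketch of it, assuming one key per entry in var.clients and the pem_client resource name that the later snippets refer to:

```hcl
# Step 1 (assumed): one private key per entry in var.clients.
# The resource name matches the pem_client references used elsewhere.
resource "tls_private_key" "pem_client" {
  for_each  = var.clients
  algorithm = var.algorithm
}
```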




In this example, I made use of the try function to achieve a value priority in the following order;

Resource-level: Do I have a value specific to the server or client?
Class-level: Do I have a value specific to the target type?
Module-level: Do I have a global default?






Here, each refers to a key/value pair that has the same shape for clients as it does for servers, where the key is the machine name and the value is the subject data. Here is a sample of the *.tfvars.json file that drives this behaviour.



{
  "clients": {
    "mbp": {
      "country": "GB",
      "locality": "GB",
      "org": "ZAI",
      "ou": "ZAI",
      "province": "GB"
    }
  }
}
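
The variables behind this file are not shown, so the declarations below are a sketch under assumptions: the types, descriptions and default values are all hypothetical. Plain maps are used deliberately, since looking up a missing key in a map raises an error that try can catch and fall through on.

```hcl
# Assumed declarations; the types are chosen so that a missing key raises
# an error, which lets try() fall through to the next level.
variable "clients" {
  type        = map(map(string))
  description = "client name => subject overrides (resource-level)"
  default     = {}
}

variable "default_client_subject" {
  type        = map(string)
  description = "subject defaults shared by all clients (class-level)"
  default     = {}
}

variable "default_subject" {
  type        = map(string)
  description = "subject defaults shared by every certificate (module-level)"
  default = {
    country = "GB"
    org     = "ZAI"
  }
}
```

Had these been objects with optional attributes, a missing field would evaluate to null rather than raise an error, so try would return null instead of moving on to the next level. If a field is missing at all three levels, try itself fails, so the module-level defaults should cover every field.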





In an ideal (and secure) scenario, the private keys should never be transmitted over the wire. Instead, you generate a CSR and transmit that. Since this is aimed at test environments, security is not a concern for me. Should I want to do the generation securely, I’ve exposed the following variable as a way to override the CSR generation.



variable "client_csrs" {
  type        = map(string) # client name => PEM-encoded CSR
  description = "csrs to use instead of generating them within this module"
  default     = {}
}
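
How the override is wired into the signing step is not shown. One possible approach, offered purely as an assumption, is a local value that prefers a supplied CSR and otherwise falls back to the one generated inside the module.

```hcl
# Assumed wiring: prefer a supplied CSR, otherwise use the one generated
# in-module. The local name client_csr_pem is hypothetical.
locals {
  client_csr_pem = {
    for name in keys(var.clients) :
    name => lookup(var.client_csrs, name, tls_cert_request.csr_client[name].cert_request_pem)
  }
}
```

The signing resource would then read cert_request_pem = local.client_csr_pem[each.key]. In this sketch a key and CSR are still generated for every client, including those with a supplied CSR; skipping that would mean filtering the for_each of the key and CSR resources as well.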





Getting the signed certificate



resource "tls_locally_signed_cert" "client" {
  for_each = var.clients
  cert_request_pem = tls_cert_request.csr_client[each.key].cert_request_pem
  ca_private_key_pem = tls_private_key.pem_ca.private_key_pem
  ca_cert_pem = tls_self_signed_cert.ca.cert_pem

  validity_period_hours = var.client_certificate_validity

  allowed_uses = [
    "digital_signature",
    "key_encipherment",
    "server_auth", # for server-side
    "client_auth", # for client-side
  ]
}



Once the *.csr is generated (or provided), I’m able to use the tls_locally_signed_cert resource type to sign it with the CA certificate and the CA’s private key. The cert_request_pem, ca_private_key_pem and ca_cert_pem inputs allow me to do so using the raw PEM format, without needing to save to disk before passing the data in.



Relying on the data within the Terraform state file also allows me to rule out any “external influence” when troubleshooting, as there is only a single source of truth.



Adding either server_auth or client_auth (depending on use-case) to allowed_uses permits the use of the signed certificate for authentication, as required by OpenVPN.
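
For the server side, the same resource could be narrowed to the server usage only. In this sketch, csr_server and server_certificate_validity are assumed names that mirror the client-side ones.

```hcl
# Hypothetical server-side counterpart; csr_server and
# server_certificate_validity are assumed names.
resource "tls_locally_signed_cert" "server" {
  for_each           = var.servers
  cert_request_pem   = tls_cert_request.csr_server[each.key].cert_request_pem
  ca_private_key_pem = tls_private_key.pem_ca.private_key_pem
  ca_cert_pem        = tls_self_signed_cert.ca.cert_pem

  validity_period_hours = var.server_certificate_validity

  allowed_uses = [
    "digital_signature",
    "key_encipherment",
    "server_auth", # server certificates only need the server-side usage
  ]
}
```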



Converting from PEM to PKCS12



resource "pkcs12_from_pem" "client" {
  for_each = var.clients
  ca_pem          = tls_self_signed_cert.ca.cert_pem
  cert_pem        = tls_locally_signed_cert.client[each.key].cert_pem
  private_key_pem = tls_private_key.pem_client[each.key].private_key_pem
  password = "123" # Testing purposes
  encoding = "legacyRC2"
}



Using the pkcs12_from_pem resource type from chilicat makes this process simple, as long as I have access to the private key in addition to the certificate and the CA certificate.




For compatibility with the OpenVPN Connect application, I needed to enforce the encoding of legacyRC2, rather than the modern encryption that’s offered by easy-rsa. 
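
To hand a bundle to the OpenVPN client, it can be written to disk. This is a sketch rather than part of the module: the resource name and file path are assumptions, and it relies on the provider exposing the archive as a base64-encoded result attribute.

```hcl
# Assumed helper: write each client's PKCS12 bundle to disk.
# result holds the base64-encoded archive, hence content_base64.
resource "local_file" "client_p12" {
  for_each       = var.clients
  content_base64 = pkcs12_from_pem.client[each.key].result
  filename       = "${path.module}/certs/${each.key}.p12"
}
```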




Returning the certificates



output "client_certificates" {
  value = [ for cert in tls_locally_signed_cert.client : cert.cert_pem ]
  description = "generated client certificates in ordered list form"
  sensitive = true
}



Finally, I return the generated certificates and their *.p12 equivalent from the module. I mark this data as sensitive due to the inclusion of private keys.



For the value, I needed to iterate over a map of resources (as I had used the for_each meta-argument earlier to handle a key/value pair) and re-build a single list with the result.
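
An alternative is to key the outputs by client name, which avoids relying on list order. Both output names below are hypothetical, and the second assumes the *.p12 data is exposed through the provider's base64-encoded result attribute.

```hcl
# Alternative sketch: key the output by client name instead of a list.
output "client_certificates_by_name" {
  value       = { for name, cert in tls_locally_signed_cert.client : name => cert.cert_pem }
  description = "generated client certificates, keyed by client name"
  sensitive   = true
}

# Assumed shape of a *.p12 output (base64-encoded bundles).
output "client_p12" {
  value       = { for name, bundle in pkcs12_from_pem.client : name => bundle.result }
  description = "generated PKCS12 bundles, keyed by client name"
  sensitive   = true
}
```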



As mentioned above, it is then the responsibility of the user to determine what to do with the generated certificates, be it storing them locally or pushing them to AWS.