Ralph k8s
2026-01-21 • Alexander Baxter
Deploying a Modular “Ralph” AI Agent to Kubernetes (Helm + Docker + ConfigMaps)
The idea is to create a Helm chart (`.helm/`) for running an autonomous coding agent (“Ralph”) as a Kubernetes Job. The core concepts are:
- Prompts are code (versioned, modular, templated).
- The agent runtime is a container image (CLIs + polyglot toolchain).
- Scripts + prompts ship as ConfigMaps (mounted into the Job).
- Workspace + credentials persist in a PVC (so runs can resume and auth survives across pods).
Below is the setup process, from prompt design through operational tooling.
Designing the prompt (and making it modular)
1) Split “base” rules from provider-specific rules
The easiest way to keep prompts maintainable is to have:
- a base prompt that defines how the agent works (iteration loop, task selection, boundaries, etc.)
- a provider prompt that defines model-specific guidance (Claude vs Gemini vs others)
This repo follows that structure under:
- `.helm/prompts/base-prompt.md`
- `.helm/prompts/claude-prompt.md`
- `.helm/prompts/gemini-prompt.md`
The chart compiles the final prompt using a ConfigMap template that concatenates the base + provider prompt and runs Helm templating (`tpl`) so you can inject values like `job.expertise`:
```yaml
# .helm/templates/configmap-prompts.yaml
data:
  system-prompt.md: |
    {{- $base := .Files.Get "prompts/base-prompt.md" -}}
    {{- $providerSpec := .Files.Get (printf "prompts/%s-prompt.md" .Values.job.aiProvider) -}}
    {{- $combined := printf "%s\n\n%s" $base $providerSpec -}}
    {{- tpl $combined . | nindent 4 -}}
```

2) Use a completion sentinel (but keep it safe)
Autonomous agents typically need a single unambiguous “done” signal so the outer loop can stop cleanly.
The base prompt and scripts use a strict completion sentinel tag. In this blog post, we’ll refer to it abstractly as:
<<COMPLETE_SENTINEL>>
…to avoid accidentally copy/pasting the real tag into places it doesn’t belong.
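In the scripts, detection can be a literal grep for the tag. A minimal sketch (the `SENTINEL` value here is the abstract placeholder from above, not the real tag, and the function name is illustrative):

```shell
# Abstract placeholder for the real completion tag.
SENTINEL="<<COMPLETE_SENTINEL>>"

# Returns success if a captured iteration output contains the sentinel.
run_is_complete() {
  local output_file="$1"
  # -F matches the tag literally, so <, >, etc. are not treated as a pattern.
  grep -qF "$SENTINEL" "$output_file"
}
```

Keeping the tag in a single variable makes it harder for the prompt and the scripts to drift apart.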
Setting up the script loop
The chart ships provider-specific loop scripts and selects one at runtime based on `AI_PROVIDER`.
Concretely:
- Scripts live in the chart under `.helm/scripts/` (e.g. `ralph-claude.sh`, `ralph-gemini.sh`)
- Helm packages them into a scripts ConfigMap (`.helm/templates/configmap-scripts.yaml`)
- The Job mounts that ConfigMap at `/app/scripts`
- The Job selects the right script at runtime with:

```bash
SCRIPT_PATH="/app/scripts/ralph-${AI_PROVIDER}.sh"
bash "$SCRIPT_PATH" "$REPO_NAME" <maxIterations>
```
Provider scripts live at:
- `.helm/scripts/ralph-claude.sh`
- `.helm/scripts/ralph-gemini.sh`
They both implement:
- prompt resolution (env `RALPH_PROMPT_FILE`, fallback to `/etc/ralph/system-prompt.md`)
- a loop capped at `maxIterations`
- completion sentinel detection
- rate limit detection + sleep-and-retry logic
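Put together, the shape of a provider loop script looks roughly like this (a simplified sketch: the actual provider CLI call is stubbed out, and the rate-limit handling is only marked by a comment):

```shell
# Simplified shape of ralph-<provider>.sh (illustrative).
PROMPT_FILE="${RALPH_PROMPT_FILE:-/etc/ralph/system-prompt.md}"
REPO_NAME="${1:-demo-repo}"    # repo to work on (placeholder default)
MAX_ITERATIONS="${2:-10}"
SENTINEL="<<COMPLETE_SENTINEL>>"

run_agent_once() {
  # Stand-in for the real provider CLI call that reads $PROMPT_FILE.
  echo "iteration output for $REPO_NAME"
}

i=0
while [ "$i" -lt "$MAX_ITERATIONS" ]; do
  i=$((i + 1))
  output="$(run_agent_once)"
  if printf '%s' "$output" | grep -qF "$SENTINEL"; then
    echo "complete after $i iteration(s)"
    break
  fi
  # Real scripts also detect rate-limit errors here and sleep before retrying.
done
```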
Building the Docker image (agents + runtimes installed)
The container image is the “agent workstation.” The Dockerfile builds a polyglot base:
- Ubuntu 24.04
- git, ssh, jq, coreutils, build tools
- Python 3, Node 20, Java 17
- Claude CLI + Gemini CLI
Excerpt:
```dockerfile
FROM ubuntu:24.04
RUN apt-get update && apt-get install -y --no-install-recommends \
    git openssh-client curl ca-certificates jq \
    python3 python3-pip python3-venv openjdk-17-jdk
RUN curl -fsSL https://deb.nodesource.com/setup_20.x | bash - \
    && apt-get install -y nodejs
RUN npm install -g @google/gemini-cli pnpm
RUN curl -fsSL https://claude.ai/install.sh | bash
```

This image is referenced from Helm values:
```yaml
# .helm/values.yaml
image:
  repository: registry/ralph-polyglot-base
  tag: latest
```

Wiring prompts + scripts via ConfigMaps
Prompts ConfigMap
`configmap-prompts.yaml` builds a single file, `system-prompt.md`, at deploy time (base + provider-specific prompt).
The Job then mounts it into `/etc/ralph` and points `RALPH_PROMPT_FILE` at it.
Scripts ConfigMap
`configmap-scripts.yaml` packages every `scripts/*.sh` into a ConfigMap:
```yaml
# .helm/templates/configmap-scripts.yaml
data:
  # Range through the 'scripts/' folder and include every .sh file
  {{- range $path, $_ := .Files.Glob "scripts/*.sh" }}
  {{ base $path }}: |-
{{ $.Files.Get $path | indent 4 }}
  {{- end }}
```

The Job mounts this ConfigMap at `/app/scripts` with `defaultMode: 0755` so scripts are executable.
The main workload: .helm/templates/job.yaml
At a high level, the Job does three things:
- An initContainer clones/refreshes the repo into a persistent workspace PVC (`/work`).
- The main container restores auth, runs the loop script, then syncs auth back to the PVC on exit.
- Volumes mount the PVC, SSH key Secret, scripts ConfigMap, and prompts ConfigMap.
1) InitContainer: git clone with guardrails
The init container:
- configures SSH (`/ssh-secret/id_ed25519`)
- clones the repo (or reuses an existing clone)
- refuses to run if there are uncommitted/untracked changes (excluding `ralph-log.txt`)
- checks out the target branch and hard-resets to origin
This prevents “silent corruption” of a long-lived PVC workspace.
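The guardrail can be expressed as a small check on `git status --porcelain`, filtering out the agent's log file before deciding. A sketch (function name and flag choices are illustrative):

```shell
# True (exit 0) if the workspace has uncommitted or untracked changes,
# ignoring the agent's own ralph-log.txt.
workspace_is_dirty() {
  local repo_dir="$1"
  git -C "$repo_dir" status --porcelain \
    | grep -v 'ralph-log\.txt$' \
    | grep -q .
}

# Init-container usage: bail out instead of clobbering unexpected work.
# if workspace_is_dirty /work/my-repo; then
#   echo "refusing to run: workspace has unexpected changes" >&2
#   exit 1
# fi
```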
2) Main container: restore creds, run provider script, sync creds back
Key behaviors:
- Restores credentials from the PVC into `$HOME` if present
- Picks a script by provider: `ralph-${AI_PROVIDER}.sh`
- On exit, copies `$HOME` credential directories back into `/work` so future Jobs can restore them
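A hedged sketch of the restore/sync pair (the `.claude-auth` / `.gemini` locations follow the auth-pod convention covered later; `WORK_DIR` is parameterized here purely for illustration):

```shell
WORK_DIR="${WORK_DIR:-/work}"

# Restore persisted credentials from the PVC into $HOME, if present.
restore_auth() {
  if [ -d "$WORK_DIR/.claude-auth" ]; then
    rm -rf "$HOME/.claude"
    cp -r "$WORK_DIR/.claude-auth" "$HOME/.claude"
  fi
  [ -f "$WORK_DIR/.claude.json" ] && cp "$WORK_DIR/.claude.json" "$HOME/"
  if [ -d "$WORK_DIR/.gemini" ]; then
    rm -rf "$HOME/.gemini"
    cp -r "$WORK_DIR/.gemini" "$HOME/.gemini"
  fi
  return 0
}

# Mirror image: copy credentials back so the next Job can reuse them.
sync_auth_exit() {
  if [ -d "$HOME/.claude" ]; then
    rm -rf "$WORK_DIR/.claude-auth"
    cp -r "$HOME/.claude" "$WORK_DIR/.claude-auth"
  fi
  [ -f "$HOME/.claude.json" ] && cp "$HOME/.claude.json" "$WORK_DIR/"
  if [ -d "$HOME/.gemini" ]; then
    rm -rf "$WORK_DIR/.gemini"
    cp -r "$HOME/.gemini" "$WORK_DIR/.gemini"
  fi
  return 0
}
```

The `rm -rf` before each `cp -r` avoids the classic pitfall of `cp -r` nesting the source directory inside an existing destination on repeat runs.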
Conceptually:
```bash
SCRIPT_PATH="/app/scripts/ralph-${AI_PROVIDER}.sh"
trap sync_auth_exit EXIT
bash "$SCRIPT_PATH" "$REPO_NAME" <maxIterations>
```

3) Volumes: PVC + Secret + ConfigMaps
The Job mounts:
- PVC at `/work` (workspace + persisted auth)
- SSH Secret at `/ssh-secret` (git clone/push)
- Scripts ConfigMap at `/app/scripts`
- Prompts ConfigMap at `/etc/ralph`
pod-auth.yaml: logging agents in (and persisting credentials)
Because provider logins can be browser/interactive, the chart includes an Auth Pod you can exec into.
pod-auth.yaml runs a simple long-lived pod that mounts the same workspace PVC:
```yaml
# .helm/templates/pod-auth.yaml
kind: Pod
spec:
  containers:
    - name: auth
      command: ["sleep", "infinity"]
      volumeMounts:
        - name: workspace
          mountPath: /work
```

Typical flow
- Deploy the auth pod.
- `kubectl exec` into it.
- Run `claude login` / `gemini login`.
- Copy auth state into `/work` so Jobs can restore it later.
```bash
# Claude
cp -r ~/.claude /work/.claude-auth
cp ~/.claude.json /work/
# Gemini
cp -r ~/.gemini /work/.gemini
```

job-cleanup.yaml: when things go wrong (rescue + reset)
Long-lived workspaces are great… until they aren’t. If a run leaves the repo in a messy state, the cleanup Job does two important things:
- Rescues changes to a timestamped branch (so you don’t lose work)
- Resets the workspace back to `origin/<integration branch>`
Key behaviors in `.helm/templates/job-cleanup.yaml`:
- Detects dirty working tree or untracked files
- Creates `ralph/rescue-<timestamp>`
- Commits everything (`git add -A`) and pushes the rescue branch
- Hard resets and cleans back to the integration branch
This is the operational “break glass” tool for PVC-backed sessions.
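The rescue-and-reset sequence can be sketched as one function (branch naming follows the bullets above; the function name is illustrative, and the integration branch is passed in rather than hard-coded):

```shell
# Rescue any uncommitted work to a timestamped branch, push it,
# then hard-reset the workspace to the integration branch.
rescue_and_reset() {
  local repo_dir="$1" integration_branch="$2"
  local rescue_branch="ralph/rescue-$(date +%Y%m%d-%H%M%S)"

  if [ -n "$(git -C "$repo_dir" status --porcelain)" ]; then
    git -C "$repo_dir" checkout -q -b "$rescue_branch"
    git -C "$repo_dir" add -A
    git -C "$repo_dir" commit -q -m "rescue: uncommitted work from Ralph run"
    git -C "$repo_dir" push -q origin "$rescue_branch"
  fi

  git -C "$repo_dir" checkout -q "$integration_branch"
  git -C "$repo_dir" reset -q --hard "origin/$integration_branch"
  git -C "$repo_dir" clean -qfd
}
```

Pushing the rescue branch before the hard reset means nothing is ever destroyed that wasn't first saved remotely.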
Justfile: ergonomic commands for humans
The Justfile is the operator UX: it wraps Docker + Helm + kubectl into repeatable commands.
Highlights:
- `just build`: build + push the image
- `just auth`: deploy auth pod, exec in, then uninstall
- `just start <name>`: deploy a job with a project-specific values file
- `just logs <name>`: tail logs
- `just cleanup <name>`: run the cleanup Job and stream logs
Example excerpt:
```just
auth:
    helm upgrade --install {{RELEASE_NAME}}-auth {{CHART_DIR}} -n {{NAMESPACE}} --set enableAuth=true --set enableJob=false -f {{VALUES_FILE}} --wait
    kubectl exec -it {{RELEASE_NAME}}-auth -n {{NAMESPACE}} -- bash
    helm uninstall {{RELEASE_NAME}}-auth -n {{NAMESPACE}}

start name:
    helm upgrade --install {{RELEASE_NAME}}-{{name}} {{CHART_DIR}} -n {{NAMESPACE}} --set enableAuth=false -f {{VALUES_FILE}} -f {{name}}-values.yaml
```

Putting it all together
- Design prompt: split base + provider prompts; keep it templated and versioned.
- Script loop: dispatch by provider/agent, detect completion sentinel, handle rate limits.
- Docker image: install CLIs + runtimes; run as non-root; include git/ssh/jq.
- ConfigMaps: mount scripts + compiled prompt into the Job.
- Job: init clone → run agent loop → sync auth/state → exit.
- Auth pod: interactive login; copy credentials into the PVC.
- Cleanup job: rescue branch + reset when the workspace gets stuck.
- Justfile: make all of the above easy to run consistently.