Run OpenClaw Securely in Docker Sandboxes https://www.docker.com/blog/run-openclaw-securely-in-docker-sandboxes/ Mon, 23 Feb 2026 14:00:00 +0000

Docker Sandboxes is a new primitive in the Docker ecosystem that lets you run AI agents, or any other workload, in isolated micro VMs. It provides strong isolation, a convenient developer experience, and a strong security boundary, with a network proxy that can be configured to deny agents connections to arbitrary internet hosts. The proxy can also inject API keys, like your ANTHROPIC_API_KEY or OPENAI_API_KEY, on the way out, so the agent never has access to them and cannot leak them.

In a previous article I showed how Docker Sandboxes lets you install any tools an AI agent might need, like a JDK for Java projects or some custom CLIs, into a container that’s isolated from the host. Today we’re going a step further: we’ll run OpenClaw, an open-source AI coding agent, on a local model via Docker Model Runner.

No API keys, no cloud costs, fully private. And you can do it in a handful of commands.

Quick Start

Make sure you have Docker Desktop and that Docker Model Runner is enabled (Settings → Docker Model Runner → Enable), then pull a model:

docker model pull ai/gpt-oss:20B-UD-Q4_K_XL

Now create and run the sandbox:

docker sandbox create --name openclaw -t olegselajev241/openclaw-dmr:latest shell .
docker sandbox network proxy openclaw --allow-host localhost
docker sandbox run openclaw

Inside the sandbox:

~/start-openclaw.sh
Running OpenClaw inside a Docker Sandbox

And that’s it. You’re in OpenClaw’s terminal UI, talking to a local gpt-oss model on your machine. The model runs in Docker Model Runner on your host, while OpenClaw runs fully isolated in the sandbox: it can only read and write files in the workspace you give it, and the network proxy denies connections to any host you haven’t allowed.

Cloud models work too

The sandbox proxy automatically injects API keys from your host environment. If you have ANTHROPIC_API_KEY or OPENAI_API_KEY set, OpenClaw can use cloud models; just select them in OpenClaw’s settings. The proxy takes care of credential injection, so your keys are never exposed inside the sandbox.

This means you can use free local models for experimentation, then switch to cloud models for serious work, all in the same sandbox. With cloud models you don’t even need to allow the proxy to reach the host’s localhost, so you can skip docker sandbox network proxy openclaw --allow-host localhost.

Choose Your Model

The startup script automatically discovers models available in your Docker Model Runner. List them:

~/start-openclaw.sh list

Use a specific model:

~/start-openclaw.sh ai/qwen2.5:7B-Q4_K_M

Any model you’ve pulled with docker model pull is available.

How it works (a bit technical)

The pre-built image (olegselajev241/openclaw-dmr:latest) is based on the shell sandbox template with three additions: Node.js 22, OpenClaw, and a tiny networking bridge.

The bridge is needed because Docker Model Runner runs on your host and binds to localhost:12434. But localhost inside the sandbox means the sandbox itself, not your host. The sandbox does have an HTTP proxy, at host.docker.internal:3128, that can reach host services, and we allow it to reach localhost with docker sandbox network proxy --allow-host localhost.

The problem is that OpenClaw is a Node.js application, and Node.js’s built-in HTTP client doesn’t respect the HTTP_PROXY environment variable. So we wrote a ~20-line bridge script that OpenClaw connects to at 127.0.0.1:54321; the bridge explicitly forwards requests through the proxy to reach Docker Model Runner on the host:

OpenClaw → bridge (localhost:54321) → proxy (host.docker.internal:3128) → Model Runner (host localhost:12434)

The start-openclaw.sh script starts the bridge, starts OpenClaw’s gateway (with proxy vars cleared so it hits the bridge directly), and runs the TUI.

Build Your Own

Want to customize the image or just see how it works? Here’s the full build process. 

1. Create a base sandbox and install OpenClaw

docker sandbox create --name my-openclaw shell .
docker sandbox network proxy my-openclaw --allow-host localhost
docker sandbox run my-openclaw

Now let’s install OpenClaw in the sandbox:

# Install Node 22 (OpenClaw requires it)
npm install -g n && n 22
hash -r

# Install OpenClaw
npm install -g openclaw@latest

# Run initial setup
openclaw setup

2. Create the Model Runner bridge

This is the magic piece — a tiny Node.js server that forwards requests through the sandbox proxy to Docker Model Runner on your host:

cat > ~/model-runner-bridge.js << 'EOF'
const http = require("http");
const { URL } = require("url");

const PROXY = new URL(process.env.HTTP_PROXY || "http://host.docker.internal:3128");
const TARGET = "localhost:12434";

http.createServer((req, res) => {
  const proxyReq = http.request({
    hostname: PROXY.hostname,
    port: PROXY.port,
    path: "http://" + TARGET + req.url,
    method: req.method,
    headers: { ...req.headers, host: TARGET }
  }, proxyRes => {
    res.writeHead(proxyRes.statusCode, proxyRes.headers);
    proxyRes.pipe(res);
  });
  proxyReq.on("error", e => { res.writeHead(502); res.end(e.message); });
  req.pipe(proxyReq);
}).listen(54321, "127.0.0.1");
EOF

3. Configure OpenClaw to use Docker Model Runner

Now merge the Docker Model Runner provider into OpenClaw’s config:

python3 -c "
import json
p = '$HOME/.openclaw/openclaw.json'
with open(p) as f: cfg = json.load(f)
cfg['models'] = cfg.get('models', {})
cfg['models']['mode'] = 'merge'
cfg['models']['providers'] = cfg['models'].get('providers', {})
cfg['models']['providers']['docker-model-runner'] = {
    'baseUrl': 'http://127.0.0.1:54321/engines/llama.cpp/v1',
    'apiKey': 'not-needed',
    'api': 'openai-completions',
    'models': [{
        'id': 'ai/qwen2.5:7B-Q4_K_M',
        'name': 'Qwen 2.5 7B (Docker Model Runner)',
        'reasoning': False, 'input': ['text'],
        'cost': {'input': 0, 'output': 0, 'cacheRead': 0, 'cacheWrite': 0},
        'contextWindow': 32768, 'maxTokens': 8192
    }]
}
cfg['agents'] = cfg.get('agents', {})
cfg['agents']['defaults'] = cfg['agents'].get('defaults', {})
cfg['agents']['defaults']['model'] = {'primary': 'docker-model-runner/ai/qwen2.5:7B-Q4_K_M'}
cfg['gateway'] = {'mode': 'local'}
with open(p, 'w') as f: json.dump(cfg, f, indent=2)
"

4. Save and share

Exit the sandbox and save it as a reusable image:

docker sandbox save my-openclaw my-openclaw-image:latest

Push it to a registry so anyone can use it:

docker tag my-openclaw-image:latest yourname/my-openclaw:latest
docker push yourname/my-openclaw:latest

Anyone with Docker Desktop (with sandboxes enabled) can spin up the same environment with:

docker sandbox create --name openclaw -t yourname/my-openclaw:latest shell .

What’s next

Docker Sandboxes make it easy to run any AI coding agent in an isolated, reproducible environment. With Docker Model Runner, you get a fully local AI coding setup: no cloud dependencies, no API costs, and complete privacy.

Try it out and let us know what you think.

How Medplum Secured Their Healthcare Platform with Docker Hardened Images (DHI) https://www.docker.com/blog/medplum-healthcare-docker-hardened-images/ Thu, 19 Feb 2026 14:00:00 +0000

Special thanks to Cody Ebberson and the Medplum team for their open-source contribution and for sharing their migration experience with the community. This is a real-world example of migrating a HIPAA-compliant EHR platform to DHI with minimal code changes.

Healthcare software runs on trust. When patient data is at stake, security isn’t just a feature but a fundamental requirement. For healthcare platform providers, proving that trust to enterprise customers is an ongoing challenge that requires continuous investment in security posture, compliance certifications, and vulnerability management.

That’s why we’re excited to share how Medplum, an open-source healthcare platform serving over 20 million patients, recently migrated to Docker Hardened Images (DHI). This migration demonstrates exactly what we designed DHI to deliver: enterprise-grade security with minimal friction. Medplum’s team made the switch with just 54 lines of changes across 5 files – a near net-zero code change that dramatically improved their security posture.

Medplum is a headless EHR; the platform handles patient data, clinical workflows, and compliance so developers can focus on building healthcare apps. Built by and for healthcare developers, the platform provides:

  • HIPAA and SOC2 compliance out of the box
  • FHIR R4 API for healthcare data interoperability
  • Self-hosted or managed deployment options
  • Support for 20+ million patients across hundreds of practices

With over 500,000 pulls on Docker Hub for their medplum-server image, Medplum has become a trusted foundation for healthcare developers worldwide. As an open-source project licensed under Apache 2.0, their entire codebase, including Docker configurations, is publicly available on GitHub. This transparency made their DHI migration a perfect case study for the community.

Diagram of Medplum as headless EHR

Caption: Medplum is a headless EHR; the platform handles patient data, clinical workflows, and compliance so developers can focus on building healthcare apps.

Medplum is developer-first. It’s not a plug-and-play low-code tool, it’s designed for engineering teams that want a strong FHIR-based foundation with full control over the codebase.

The Challenge: Vulnerability Noise and Security Toil

Healthcare software development comes with unique challenges. Integration with existing EHR systems, compliance with regulations like HIPAA, and the need for robust security all add complexity and cost to development cycles.

“The Medplum team found themselves facing a challenge common to many high-growth platforms: ‘Vulnerability Noise.’ Even with lean base images, standard distributions often include non-essential packages that trigger security flags during enterprise audits. For a company helping others achieve HIPAA compliance, every ‘Low’ or ‘Medium’ CVE (Common Vulnerabilities and Exposures entry) requires investigation and documentation, creating significant ‘security toil’ for their engineering team.”

Reshma Khilnani

CEO, Medplum

Medplum addresses this by providing a compliant foundation. But even with that foundation, their team found themselves facing another challenge common to high-growth platforms: “Vulnerability Noise.”

Healthcare is one of the most security-conscious industries. Medplum’s enterprise customers, including Series C and D funded digital health companies, don’t just ask about security; they actively verify it. These customers routinely scan Medplum’s Docker images as part of their security due diligence.

Even with lean base images, standard distributions often include non-essential packages that trigger security flags during enterprise audits. For a company helping others achieve HIPAA compliance, every “Low” or “Medium” CVE requires investigation and documentation. This creates significant “security toil” for their engineering team.

The First Attempt: Distroless

This wasn’t Medplum’s first attempt at solving the problem. Back in November 2024, the team investigated Google’s distroless images as a potential solution.

The motivations were similar to what DHI would later deliver:

  • Less surface area in production images, and therefore less CVE noise
  • Smaller images for faster deployments
  • Simpler build process without manual hardening scripts

The idea was sound. Distroless images strip away everything except the application runtime: no shell, no package manager, minimal attack surface. On paper, it was exactly what Medplum needed.

But the results were mixed. Image sizes actually increased. Build times went up. There were concerns about multi-architecture support for native dependencies. The PR was closed without merging.

The core problem remained: many CVEs in standard images simply aren’t actionable. Often there isn’t a fix available, so all you can do is document and explain why it doesn’t apply to your use case. And often the vulnerability is in a corner of the image you’re not even using, like Perl, which comes preinstalled on Debian but serves no purpose in a Node.js application.

Fully removing these unused components is the only real answer. The team knew they needed hardened images. They just hadn’t found the right solution yet.

The Solution: Docker Hardened Images

When Docker made Hardened Images freely available under Apache 2.0, Medplum’s team saw an opportunity to simplify their security posture while maintaining compatibility with their existing workflows.

By switching to Docker Hardened Images, Medplum was able to offload the repetitive work of OS-level hardening – like configuring non-root users and stripping out unnecessary binaries – to Docker. This allowed them to provide their users with a “Secure-by-Default” image that meets enterprise requirements without adding complexity to their open-source codebase.

This shift is particularly significant for an open-source project. Rather than maintaining custom hardening scripts that contributors need to understand and maintain, Medplum can now rely on Docker’s expertise and continuous maintenance. The security posture improves automatically with each DHI update, without requiring changes to Medplum’s Dockerfiles.

“By switching to Docker Hardened Images, Medplum was able to offload the repetitive work of OS-level hardening—like configuring non-root users and stripping out unnecessary binaries—to Docker. This allowed them to provide their users with a ‘Secure-by-Default’ image that meets enterprise requirements without adding complexity to their open-source codebase.”

Cody Ebberson

CTO, Medplum

The Migration: Real Code Changes

The migration was remarkably clean. Previously, Medplum’s Dockerfile required manual steps to ensure security best practices. By moving to DHI, they could simplify their configuration significantly.

Let’s look at what actually changed. Here’s the complete server Dockerfile after the migration:

# Medplum production Dockerfile
# Uses Docker "Hardened Images":
# https://hub.docker.com/hardened-images/catalog/dhi/node/guides

# Supported architectures: linux/amd64, linux/arm64

# Stage 1: Build the application and install production dependencies
FROM dhi.io/node:24-dev AS build-stage
ENV NODE_ENV=production
WORKDIR /usr/src/medplum
ADD ./medplum-server-metadata.tar.gz ./
RUN npm ci --omit=dev && \
  rm package-lock.json

# Stage 2: Create the runtime image
FROM dhi.io/node:24 AS runtime-stage
ENV NODE_ENV=production
WORKDIR /usr/src/medplum
COPY --from=build-stage /usr/src/medplum/ ./
ADD ./medplum-server-runtime.tar.gz ./

EXPOSE 5000 8103

ENTRYPOINT [ "node", "--require", "./packages/server/dist/otel/instrumentation.js", "packages/server/dist/index.js" ]

Notice what’s not there:

  • No groupadd or useradd commands: DHI runs as non-root by default
  • No chown commands: permissions are already correct
  • No USER directive: the default user is already non-privileged

Before vs. After: Server Dockerfile

Before (node:24-slim):

FROM node:24-slim
ENV NODE_ENV=production
WORKDIR /usr/src/medplum

ADD ./medplum-server.tar.gz ./

# Install dependencies, create non-root user, and set permissions
RUN npm ci && \
  rm package-lock.json && \
  groupadd -r medplum && \
  useradd -r -g medplum medplum && \
  chown -R medplum:medplum /usr/src/medplum

EXPOSE 5000 8103

# Switch to the non-root user
USER medplum

ENTRYPOINT [ "node", "--require", "./packages/server/dist/otel/instrumentation.js", "packages/server/dist/index.js" ]

After (dhi.io/node:24):

FROM dhi.io/node:24-dev AS build-stage
ENV NODE_ENV=production
WORKDIR /usr/src/medplum
ADD ./medplum-server-metadata.tar.gz ./
RUN npm ci --omit=dev && rm package-lock.json

FROM dhi.io/node:24 AS runtime-stage
ENV NODE_ENV=production
WORKDIR /usr/src/medplum
COPY --from=build-stage /usr/src/medplum/ ./
ADD ./medplum-server-runtime.tar.gz ./

EXPOSE 5000 8103

ENTRYPOINT [ "node", "--require", "./packages/server/dist/otel/instrumentation.js", "packages/server/dist/index.js" ]

The migration also introduced a cleaner multi-stage build pattern, separating metadata (package.json files) from runtime artifacts.
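The post doesn’t show how those two tarballs are produced, but the idea of the split is easy to demonstrate. Purely as a hypothetical illustration (file names and layout assumed, not Medplum’s actual build): the metadata archive carries just the package manifests the build stage needs for npm ci, while the runtime archive carries the compiled output.

```shell
# Hypothetical illustration of the metadata/runtime tarball split.
# Paths and file names are assumptions, not Medplum's actual build.
mkdir -p demo/packages/server/dist && cd demo
echo '{"name":"demo","private":true}' > package.json
echo '{}' > package-lock.json
echo 'console.log("ok")' > packages/server/dist/index.js

# Metadata tarball: only the manifests npm ci needs in the build stage.
tar czf medplum-server-metadata.tar.gz package.json package-lock.json

# Runtime tarball: the built artifacts copied into the runtime stage.
tar czf medplum-server-runtime.tar.gz packages/server/dist

tar tzf medplum-server-metadata.tar.gz
```

Keeping the manifests separate means the dependency-install layer only rebuilds when a manifest changes, not on every source change.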

Before vs. After: App Dockerfile (Nginx)

The web app migration was even more dramatic:

Before (nginx-unprivileged:alpine):

FROM nginxinc/nginx-unprivileged:alpine

# Start as root for permissions
USER root

COPY <<EOF /etc/nginx/conf.d/default.conf
# ... nginx config ...
EOF

ADD ./medplum-app.tar.gz /usr/share/nginx/html
COPY ./docker-entrypoint.sh /docker-entrypoint.sh

# Manual permission setup
RUN chown -R 101:101 /usr/share/nginx/html && \
    chown 101:101 /docker-entrypoint.sh && \
    chmod +x /docker-entrypoint.sh

EXPOSE 3000

# Switch back to non-root
USER 101

ENTRYPOINT ["/docker-entrypoint.sh"]

After (dhi.io/nginx:1):

FROM dhi.io/nginx:1

COPY <<EOF /etc/nginx/nginx.conf
# ... nginx config ...
EOF

ADD ./medplum-app.tar.gz /usr/share/nginx/html
COPY ./docker-entrypoint.sh /docker-entrypoint.sh

EXPOSE 3000

ENTRYPOINT ["/docker-entrypoint.sh"]

Results: Improved Security Posture

After merging the changes, Medplum’s team shared their improved security scan results. The migration to DHI resulted in:

  • Dramatically reduced CVE count – DHI’s minimal base means fewer packages to patch
  • Non-root by default – No manual user configuration required
  • No shell access in production – Reduced attack surface for container escape attempts
  • Continuous patching – All DHI images are rebuilt when upstream security updates are available

For organizations that require stronger guarantees, Docker Hardened Images Enterprise adds SLA-backed remediation timelines, image customizations, and FIPS/STIG variants.

Most importantly, all of this was achieved with zero functional changes to the application. The same tests passed, the same workflows worked, and the same deployment process applied.

CI/CD Integration

Medplum also updated their GitHub Actions workflow to authenticate with the DHI registry:

- name: Login to Docker Hub
  uses: docker/login-action@v2.2.0
  with:
    username: ${{ secrets.DOCKERHUB_USERNAME }}
    password: ${{ secrets.DOCKERHUB_TOKEN }}

- name: Login to Docker Hub Hardened Images
  uses: docker/login-action@v2.2.0
  with:
    registry: dhi.io
    username: ${{ secrets.DOCKERHUB_USERNAME }}
    password: ${{ secrets.DOCKERHUB_TOKEN }}

This allows their CI/CD pipeline to pull hardened base images during builds. The same Docker Hub credentials work for both standard and hardened image registries.

The Multi-Stage Pattern for DHI

One pattern worth highlighting from Medplum’s migration is the use of multi-stage builds with DHI variants:

  1. Build stage: Use dhi.io/node:24-dev which includes npm/yarn for installing dependencies
  2. Runtime stage: Use dhi.io/node:24 which is minimal and doesn’t include package managers

This pattern ensures that build tools never make it into the production image, further reducing the attack surface. It’s a best practice for any containerized Node.js application, and DHI makes it straightforward by providing purpose-built variants for each stage.

Medplum’s Production Architecture

Medplum’s hosted offering runs on AWS using containerized workloads. Their medplum/medplum-server image, built on DHI base images, now deploys to production.

Medplum production architecture

Here’s how the build-to-deploy flow works:

  1. Build time: GitHub Actions pulls dhi.io/node:24-dev and dhi.io/node:24 as base images
  2. Push: The resulting hardened image is pushed to medplum/medplum-server on Docker Hub
  3. Deploy: AWS Fargate pulls medplum/medplum-server:latest and runs the hardened container

The deployed containers inherit all DHI security properties (non-root execution, minimal attack surface, no shell) because they’re built on DHI base images. This demonstrates that DHI works seamlessly with production-grade infrastructure including:

  • AWS Fargate/ECS for container orchestration
  • Elastic Load Balancing for high availability
  • Aurora PostgreSQL for managed database
  • ElastiCache for Redis caching
  • CloudFront for CDN and static assets

No infrastructure changes were required. The same deployment pipeline, the same Fargate configuration, just a more secure base image.

Why This Matters for Healthcare

For healthcare organizations evaluating container security, Medplum’s migration offers several lessons:

1. Eliminating “Vulnerability Noise”

The biggest win from DHI isn’t just security, it’s reducing the operational burden of security. Fewer packages means fewer CVEs to investigate, document, and explain to customers. For teams without dedicated security staff, this reclaimed time is invaluable.

2. Compliance-Friendly Defaults

HIPAA requires covered entities to implement technical safeguards including access controls and audit controls. DHI’s non-root default and minimal attack surface align with these requirements out of the box. For companies pursuing SOC 2 Type 2 certification, which Medplum implemented from Day 1, or HITRUST certification, DHI provides a stronger foundation for the technical controls auditors evaluate.

3. Reduced Audit Surface

When security teams audit container configurations, DHI provides a cleaner story. Instead of explaining custom hardening scripts or why certain CVEs don’t apply, teams can point to Docker’s documented hardening methodology, SLSA Level 3 provenance, and independent security validation by SRLabs. This is particularly valuable during enterprise sales cycles where customers scan vendor images as part of due diligence.

4. Practicing What You Preach

For platforms like Medplum that help customers achieve compliance, using hardened images isn’t just good security, it’s good business. When you’re helping healthcare organizations meet regulatory requirements, your own infrastructure needs to set the example.

5. Faster Security Response

With DHI Enterprise, critical CVEs are patched within 7 days. For healthcare organizations where security incidents can have regulatory implications, this SLA provides meaningful risk reduction and a concrete commitment to share with customers.

Conclusion

Medplum’s migration to Docker Hardened Images demonstrates that improving container security doesn’t have to be painful. With minimal code changes (54 additions and 52 deletions), they achieved:

  • Secure-by-Default images that meet enterprise requirements
  • Automatic non-root execution
  • Dramatically reduced CVE surface
  • Simplified Dockerfiles with no manual hardening scripts
  • Less “security toil” for their engineering team
  • A stronger compliance story for enterprise customers

By offloading OS-level hardening to Docker, Medplum can focus on what they do best: building healthcare infrastructure while their security posture improves automatically with each DHI update.

For a platform with 500,000+ Docker Hub pulls serving healthcare organizations worldwide, this migration shows that DHI is ready for production workloads at scale. More importantly, it shows that security improvements can actually reduce operational burden rather than add to it.

For platforms helping others achieve compliance, practicing what you preach matters. With Docker Hardened Images, that just got a lot easier.

Ready to harden your containers? Explore the Docker Hardened Images documentation or browse the free DHI catalog to find hardened versions of your favorite base images.

Resources

Running NanoClaw in a Docker Shell Sandbox https://www.docker.com/blog/run-nanoclaw-in-docker-shell-sandboxes/ Mon, 16 Feb 2026 14:00:00 +0000

Ever wanted to run a personal AI assistant that monitors your WhatsApp messages 24/7, but worried about giving it access to your entire system? Docker Sandboxes’ new shell sandbox type is the perfect solution. In this post, I’ll show you how to run NanoClaw, a lightweight Claude-powered WhatsApp assistant, inside a secure, isolated Docker sandbox.

What is the Shell Sandbox?

Docker Sandboxes provides pre-configured environments for running AI coding agents like Claude Code, Gemini CLI, and others. But what if you want to run a different agent or tool that isn’t built in?

That’s where the shell sandbox comes in. It’s a minimal sandbox that drops you into an interactive bash shell inside an isolated microVM. No pre-installed agent, no opinions — just a clean Ubuntu environment with Node.js, Python, git, and common dev tools. You install whatever you need.

Why Run NanoClaw in a Sandbox?

NanoClaw already runs its agents in containers, so it’s security-conscious by design. But running the entire NanoClaw process inside a Docker sandbox adds another layer:

  1. Filesystem isolation – NanoClaw can only see the workspace directory you mount, not your home directory
  2. Credential management – API keys are injected via Docker’s proxy, never stored inside the sandbox
  3. Clean environment – No conflicts with your host’s Node.js version or global packages
  4. Disposability – Nuke it and start fresh anytime with docker sandbox rm

Prerequisites

  • Docker Desktop installed and running
  • Docker Sandboxes CLI (the docker sandbox command, v0.12.0; available in the nightly build as of Feb 13)
  • An Anthropic API key in an env variable

Setting It Up

Create the sandbox

Pick a directory on your host that will be mounted as the workspace inside the sandbox. This is the only part of your filesystem the sandbox can see:

mkdir -p ~/nanoclaw-workspace
docker sandbox create --name nanoclaw shell ~/nanoclaw-workspace

Connect to it

docker sandbox run nanoclaw

You’re now inside the sandbox – an Ubuntu shell running in an isolated VM. Everything from here on happens inside the sandbox.

Install Claude Code

The shell sandbox comes with Node.js 20 pre-installed, so we can install Claude Code directly via npm:

npm install -g @anthropic-ai/claude-code

Configure the API key

This is the one extra step needed in a shell sandbox. The built-in claude sandbox type does this automatically, but since we’re in a plain shell, we need to tell Claude Code to get its API key from Docker’s credential proxy:

mkdir -p ~/.claude && cat > ~/.claude/settings.json << 'EOF'
{
  "apiKeyHelper": "echo proxy-managed",
  "defaultMode": "bypassPermissions",
  "bypassPermissionsModeAccepted": true
}
EOF

What this does: apiKeyHelper tells Claude Code to run echo proxy-managed to get its API key. The sandbox’s network proxy intercepts outgoing API calls and swaps this sentinel value for your real Anthropic key, so the actual key never exists inside the sandbox.
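Conceptually, the swap the proxy performs looks something like the following sketch. This is an illustration of the idea only, not Docker’s actual proxy code; the function and header names are made up.

```python
# Conceptual sketch of sentinel-for-key substitution, as a credential proxy
# might do it. NOT Docker's actual implementation; names are invented here.
SENTINEL = "proxy-managed"

def inject_api_key(headers: dict, real_key: str) -> dict:
    """Replace the sentinel in auth-related headers with the real key.

    The real key lives only on the proxy (host) side; the sandboxed
    process only ever sees the sentinel value.
    """
    out = dict(headers)
    for name in ("x-api-key", "authorization"):
        value = out.get(name, "")
        if SENTINEL in value:
            out[name] = value.replace(SENTINEL, real_key)
    return out

# The agent inside the sandbox sends the sentinel...
sandbox_headers = {"x-api-key": "proxy-managed", "content-type": "application/json"}
# ...and the proxy rewrites it on the way out to the API.
upstream = inject_api_key(sandbox_headers, "sk-ant-example")
print(upstream["x-api-key"])  # → sk-ant-example
```

The useful property is directional: the rewrite happens outside the sandbox boundary, so nothing running inside it can ever read the real credential.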

Clone NanoClaw and install dependencies

cd ~/workspace
git clone https://github.com/qwibitai/nanoclaw
cd nanoclaw
npm install

Run Claude and set up NanoClaw

NanoClaw uses Claude Code for its initial setup – configuring WhatsApp authentication, the database, and the container runtime:

claude

Once Claude starts, run /setup and follow the prompts. Claude will walk you through scanning a WhatsApp QR code and configuring everything else.

Start NanoClaw

After setup completes, start the assistant:

npm start

NanoClaw is now running and listening for WhatsApp messages inside the sandbox.

Managing the Sandbox

# List all sandboxes
docker sandbox ls

# Stop the sandbox (stops NanoClaw too)
docker sandbox stop nanoclaw

# Start it again
docker sandbox start nanoclaw

# Remove it entirely
docker sandbox rm nanoclaw

What Else Could You Run?

The shell sandbox isn’t specific to NanoClaw. Anything that runs on Linux and talks to AI APIs is a good fit:

  • Custom agents built with the Claude Agent SDK, or any other AI agent: Claude Code, Codex, GitHub Copilot, OpenCode, Kiro, and more.
  • AI-powered bots and automation scripts
  • Experimental tools you don’t want running on your host

The pattern is always the same: create a sandbox, install what you need, configure credentials via the proxy, and run it.

docker sandbox create --name my-shell shell ~/my-workspace
docker sandbox run my-shell

Hardened Images Are Free. Now What? https://www.docker.com/blog/hardened-images-free-now-what/ Tue, 10 Feb 2026 14:00:00 +0000

Docker Hardened Images are now free, covering Alpine, Debian, and over 1,000 images including databases, runtimes, and message buses. For security teams, this changes the economics of container vulnerability management.

DHI includes security fixes from Docker’s security team, which simplifies security response. Platform teams can pull the patched base image and redeploy quickly. But free hardened images raise a question: how should this change your security practice? Here’s how our thinking is evolving at Docker.

What Changes (and What Doesn’t)

DHI gives you a security “waterline.” Below the waterline, Docker owns vulnerability management. Above it, you do. When a scanner flags something in a DHI layer, it’s not actionable by your team. Everything above the DHI boundary remains yours.

The scope depends on which DHI images you use. A hardened Python image covers the OS and runtime, shrinking your surface to application code and direct dependencies. A hardened base image with your own runtime on top sets the boundary lower. The goal is to push your waterline as high as possible.

Vulnerabilities don’t disappear. Below the waterline, you need to pull patched DHI images promptly. Above it, you still own application code, dependencies, and anything you’ve layered on top.

Supply Chain Isolation

DHI provides supply chain isolation beyond CVE remediation.

Community images like python:3.11 carry implicit trust assumptions: no compromised maintainer credentials, no malicious layer injection via tag overwrite, no tampering since your last pull. The Shai Hulud campaign(s) demonstrated the consequences when attackers exploit stolen PATs and tag mutability to propagate through the ecosystem.

DHI images come from a controlled namespace where Docker rebuilds from source with review processes and cooldown periods. Supply chain attacks that burn through community images stop at the DHI boundary. You’re not immune to all supply chain risk, but you’ve eliminated exposure to attacks that exploit community image trust models.

This is a different value proposition than CVE reduction. It’s isolation from an entire class of increasingly sophisticated attacks.

The Container Image as the Unit of Assessment

Security scanning is fragmented. Dependency scanning, SAST, and SCA all run in different contexts, and none has full visibility into how everything fits together at deployment time.

The container image is where all of this converges. It’s the actual deployment artifact, which makes it the checkpoint where you can guarantee uniform enforcement from developer workstation to production. The same evaluation criteria you run locally after docker build can be identical to what runs in CI and what gates production deployments.

This doesn’t need to replace earlier pipeline scanning altogether. It means the image is where you enforce policy consistency and build a coherent audit trail that maps directly to what you’re deploying.

Policy-Driven Automation

Every enterprise has a vulnerability management policy. The gap is usually between policy (PDFs and wikis) and practice (spreadsheets and Jira tickets).

DHI makes that gap easier to close by dramatically reducing the volume of findings that require policy decisions in the first place. When your scanner returns 50 CVEs instead of 500, even basic severity filtering becomes a workable triage system rather than an overwhelming backlog.

A simple, achievable policy might include the following:

  • High and critical severity vulnerabilities require remediation or documented exception
  • Medium and lower severity issues are accepted with periodic review
  • CISA KEV vulnerabilities are always in scope

Most scanning platforms support this level of filtering natively, including Grype, Trivy, Snyk, Wiz, Prisma Cloud, Aqua, and Docker Scout. You define your severity thresholds, apply them automatically, and surface only what requires human judgment.
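As a sketch of what that filtering can look like in a pipeline, Grype’s built-in severity threshold fails a build only when findings cross your policy line (the image name is a placeholder):

```shell
# Fail the pipeline only on high or critical findings;
# medium and below pass through for periodic review.
grype registry.example.com/my-app:latest --fail-on high
```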

For teams wanting tighter integration with DHI coverage data, Docker Scout evaluates policies against DHI status directly. Third-party scanners can achieve similar results through pipeline scripting or by exporting DHI coverage information for comparison.

The goal isn’t perfect automation but rather reducing noise enough that your existing policy becomes enforceable without burning out your engineers.

VEX: What You Can Do Today

Docker Hardened Images ship with VEX attestations that suppress CVEs Docker has assessed as not exploitable in context. The natural extension is for your teams to add their own VEX statements for application-layer findings.

Here’s what your security team can do today:

Consume DHI VEX data. Grype (v0.65+), Trivy, Wiz, and Docker Scout all ingest DHI VEX attestations automatically or via flags. Scanners without VEX support can still use extracted attestations to inform manual triage.
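For example, with Grype you can point the scanner at an extracted VEX document so assessed CVEs are filtered out of the report (the file name is a placeholder):

```shell
# Suppress findings covered by the VEX statements in the given document.
# Requires a Grype version with VEX support (v0.65+).
grype my-app:latest --vex ./dhi.vex.json
```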

Write your own VEX statements. OpenVEX provides the JSON format. Use vexctl to generate and sign statements.
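A minimal vexctl invocation might look like the following; the product purl, CVE ID, and justification are placeholders to replace with your own assessment:

```shell
# Author an OpenVEX statement marking one CVE as not affected.
vexctl create \
  --product "pkg:oci/my-app@sha256:abc123" \
  --vuln "CVE-2024-1234" \
  --status "not_affected" \
  --justification "vulnerable_code_not_in_execute_path" \
  > cve-2024-1234.vex.json
```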

Attach VEX to images. Docker recommends docker scout attestation add for attaching VEX to images already in a registry:

docker scout attestation add \
  --file ./cve-2024-1234.vex.json \
  --predicate-type https://openvex.dev/ns/v0.2.0 \
  <image>

Alternatively, COPY VEX documents into the image filesystem at build time, though this prevents updates without rebuilding.

Configure scanner VEX ingestion. The workflow: scan, identify investigated findings, document as VEX, feed back into scanner config. Future scans automatically suppress assessed vulnerabilities.

Compliance: What DHI Actually Provides

Compliance frameworks such as ISO 27001, SOC 2, and the EU Cyber Resilience Act require systematic, auditable vulnerability management. DHI addresses specific control requirements:

Vulnerability management documentation (ISO 27001 A.8.8, SOC 2 CC7.1). The waterline model provides a defensible answer to “how do you handle base image vulnerabilities?” Point to DHI, explain the attestation model, show policy for everything above the waterline.

Continuous monitoring evidence. DHI images rebuild and re-scan on a defined cadence. New digests mean current assessments. Combined with your scanner’s continuous monitoring, you demonstrate ongoing evaluation rather than point-in-time checks.

Remediation traceability. VEX attestations create machine-readable records of how each CVE was handled. When auditors ask about specific CVEs in specific deployments, you have answers tied to image digests and timestamps.

CRA alignment. The Cyber Resilience Act requires “due diligence” vulnerability handling and SBOMs. DHI images include SBOM attestations, and VEX aligns with CRA expectations for exploitability documentation.

This won’t satisfy every audit question, but it provides the foundation most organizations lack.

What to Do After You Read This Post

  1. Identify high-volume base images. Check Docker Hub’s Hardened Images catalog (My Hub → Hardened Images → Catalog) for coverage of your most-used images (Python, Node, Go, Alpine, Debian).
  2. Swap one image. Pick a non-critical service, change the FROM line to DHI equivalent, rebuild, scan, compare results.
  3. Configure policy-based filtering. Set up your scanner to distinguish DHI-covered vulnerabilities from application-layer findings. Use Docker Scout or Wiz for native VEX integration, or configure Grype/Trivy ignore policies based on extracted VEX data.
  4. Document your waterline. Write down what DHI covers and what remains your responsibility. This becomes your policy reference and audit documentation.
  5. Start a VEX practice. Convert one informally-documented vulnerability assessment into a VEX statement and attach it to the relevant image.
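For step 2, the swap itself is typically a one-line change. The image names below are illustrative; the actual DHI repository path comes from the Hardened Images catalog:

```dockerfile
# Before: community base image
# FROM python:3.11-slim

# After: the DHI equivalent (namespace and tag are placeholders;
# check the Hardened Images catalog for the real repository path)
FROM <your-org>/dhi-python:3.11
```

Rebuild, scan both digests with the same scanner, and compare the finding counts.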

DHI solves specific, expensive problems around base image vulnerabilities and supply chain trust. The opportunity is building a practice around it that scales.

The Bigger Picture

DHI coverage is expanding. Today it might cover your OS layer, tomorrow it extends through runtimes and into hardened libraries. Build your framework to be agnostic to where the boundary sits. The question is always the same: what has Docker attested to, and what remains yours to assess?

The methodology Docker uses for DHI (policy-driven assessment, VEX attestations, auditable decisions) extends into your application layer. Docker can’t assess your custom code, but we can provide the framework for consistent practices above the waterline. Whether you use Scout, Wiz, Grype, Trivy, or another scanner, the pattern is the same. You can let DHI handle what it covers, automate policy for what remains, and document decisions in formats that travel with artifacts.

At Docker, we’re using DHI internally to build this vulnerability management model. The framework stays constant regardless of how much of our stack is hardened today versus a year from now. Only the boundary moves.

The hardened images are free. The VEX attestations are included. What’s left is integrating these pieces into a coherent security practice where the container is the unit of truth, policy drives automation, and every vulnerability decision is documented by default.

For organizations that require stronger guarantees, FIPS-enabled and STIG-ready images, and customizations, DHI Enterprise is tailor-made for those use cases. Get in touch with the Docker team if you would like a demo. If you’re still exploring, take a look at the catalog (no signup needed) or take DHI Enterprise for a spin with a free trial.

]]>
Get Started with the Atlassian Rovo MCP Server Using Docker https://www.docker.com/blog/atlassian-remote-mcp-server-getting-started-with-docker/ Wed, 04 Feb 2026 13:52:53 +0000 https://www.docker.com/?p=85051 We’re excited to announce that the remote Atlassian Rovo MCP server is now available in Docker’s MCP Catalog and Toolkit, making it easier than ever to connect AI assistants to Jira and Confluence. With just a few clicks, technical teams can use their favorite AI agents to create and update Jira issues, epics, and Confluence pages without complex setup or manual integrations.

In this post, we’ll show you how to get started with the Atlassian remote MCP server in minutes and how to use it to automate everyday workflows for product and engineering teams.

Atlassian server figure 1

Figure 1: Discover over 300 MCP servers, including the remote Atlassian MCP server, in the Docker MCP Catalog.

What is the Atlassian Rovo MCP Server?

Like many teams, we rely heavily on Atlassian tools, especially Jira to plan, track, and ship product and engineering work. The Atlassian Rovo MCP server enables AI assistants and agents to interact directly with Jira and Confluence, closing the gap between where work happens and how teams want to use AI.

With the Atlassian Rovo MCP server, you can:

  • Create and update Jira issues and epics
  • Generate and edit Confluence pages
  • Use your preferred AI assistant or agent to automate everyday workflows

Traditionally, setting up and configuring MCP servers can be time-consuming and complex. Docker removes that friction, making it easy to get up and running securely in minutes.

Enable the Atlassian Rovo MCP Server with One Click

Docker’s MCP Catalog is a curated collection of 300+ MCP servers, including both local and remote options. It provides a reliable starting point for developers building with MCP so you don’t have to wire everything together yourself.

Prerequisites

To get started with the Atlassian remote MCP server:

  1. Open Docker Desktop and click the MCP Toolkit tab.
  2. Navigate to the Docker MCP Catalog.
  3. Search for the Atlassian Rovo MCP server.
  4. Select the remote version with the cloud icon.
  5. Enable it with a single click.

That’s it. No manual installs. No dependency wrangling.

Why use the Atlassian Rovo MCP server with Docker

Demo by Cecilia Liu: Set up the Atlassian Rovo MCP server with Docker with just a few clicks and use it to generate Jira epics with Claude Desktop

Seamless Authentication with Built-in OAuth

The Atlassian Rovo MCP server uses Docker’s built-in OAuth, so authorization is seamless. Docker securely manages your credentials and allows you to reuse them across multiple MCP clients. You authenticate once, and you’re good to go.

Behind the scenes, this frictionless experience is powered by the MCP Toolkit, which handles environment setup and dependency management for you.

Works with Your Favorite AI Agent

Once the Atlassian Rovo MCP server is enabled, you can connect it to any MCP-compatible client.

For popular clients like Claude Desktop, Claude Code, Codex, or Gemini CLI, connecting takes just one click: click Connect, restart Claude Desktop, and you’re ready to go.

From there, we can ask Claude to:

  • Write a short PRD about MCP
  • Turn that PRD into Jira epics and stories
  • Review the generated epics and confirm they’re correct

And just like that, Jira is updated.

One Setup, Any MCP Client

Sometimes AI assistants have hiccups. Maybe you hit a daily usage limit in one tool. That’s not a blocker here.

Because the Atlassian Rovo MCP server is connected through the Docker MCP Toolkit, the setup is completely client-agnostic. Switching to another assistant like Gemini CLI or Cursor is as simple as clicking Connect. No need for reconfiguration or additional setup!

Now we can ask any connected AI assistant, such as Gemini CLI, to check all new unassigned Jira tickets, for example. It just works.

Coming Soon: Share Atlassian-Based Workflows Across Teams

We’re working on new enhancements that will make Atlassian-powered workflows even more powerful and easy to share. Soon, you’ll be able to package complete workflows that combine MCP servers, clients, and configurations. Imagine a workflow that turns customer feedback into Jira tickets using Atlassian and Confluence, then shares that entire setup instantly with your team or across projects. That’s where we’re headed.

Frequently Asked Questions (FAQ)

What is the Atlassian Rovo MCP server?

The Atlassian Rovo MCP server enables AI assistants and agents to securely interact with Jira and Confluence. It allows AI tools to create and update Jira issues and epics, generate and edit Confluence pages, and automate everyday workflows for product and engineering teams.

How do I use the Atlassian Rovo MCP server with Docker? 

You can enable the Atlassian Rovo MCP server directly from Docker Desktop or CLI. Simply open the MCP Toolkit tab, search for the Atlassian MCP server, select the remote version, and enable it with one click. Connect to any MCP-compatible client. For popular tools like Claude Code, Codex, and Gemini, setup is even easier with one-click integration. 

Why use Docker to run the Atlassian Rovo MCP server?

Using Docker to run the Atlassian Rovo MCP server removes the complexity of setup, authentication, and client integration. Docker provides one-click enablement through the MCP Catalog, built-in OAuth for secure credential management, and a client-agnostic MCP Toolkit that lets teams connect any AI assistant or agent without reconfiguration so you can focus on automating Jira and Confluence workflows instead of managing infrastructure.

Less Setup. Less Context Switching. More Work Shipped.

That’s how easy it is to set up and use the Atlassian Rovo MCP server with Docker. By combining the MCP Catalog and Toolkit, Docker removes the friction from connecting AI agents to the tools teams already rely on.

Learn more

]]>
The 3Cs: A Framework for AI Agent Security https://www.docker.com/blog/the-3cs-a-framework-for-ai-agent-security/ Wed, 04 Feb 2026 02:02:19 +0000 https://www.docker.com/?p=85069 Every time execution models change, security frameworks need to change with them. Agents force the next shift.

The Unattended Laptop Problem

No developer would leave their laptop unattended and unlocked. The risk is obvious. A developer laptop has root-level access to production systems, repositories, databases, credentials, and APIs. If someone sat down and started using it, they could review pull requests, modify files, commit code, and access anything the developer can access.

Yet this is how many teams are deploying agents today. Autonomous systems are given credentials, tools, and live access to sensitive environments with minimal structure. Work executes in parallel and continuously, at a pace no human could follow. Code is generated faster than developers can realistically review, and they cannot monitor everything operating on their behalf.

Once execution is parallel and continuous, the potential for mistakes or cascading failures scales quickly. Teams will continue to adopt agents because the gains are real. What remains unresolved is how to make this model safe enough to operate without requiring manual approval for every action. Manual approval slows execution back down to human speed and eliminates the value of agents entirely. And consent fatigue is real.

Why AI Agents Break Existing Governance

Traditional security controls were designed around a human operator. A person sits at the keyboard, initiates actions deliberately, and operates within organizational and social constraints. Reviews worked because there was time between intent and execution. Perimeter security protected the network boundary, while automated systems operated within narrow execution limits.

But traditional security assumes something deeper: that a human is operating the machine.  Firewalls trust the laptop because an employee is using it. VPNs trust the connection because an engineer authenticated. Secrets managers grant access because a person requested it. The model depends on someone who can be held accountable and who operates at human speed.

Agents break this assumption. They act directly, reading repositories, calling APIs, modifying files, using credentials. They have root-level privileges and execute actions at machine speed.  

Legacy controls were never intended for this. The default response has been more visibility and approvals, adding alerts, prompts, and confirmations for every action. This does not scale and generates “consent fatigue”, annoying developers and undermining the very security it seeks to enforce. When agents execute hundreds of actions in parallel, humans cannot review them meaningfully. Warnings become noise.

AI Governance and the Execution Layer: The Three Cs Framework

Each major shift in computing has moved security closer to execution. Agents follow the same trajectory. If agents execute, security must operate at the agentic execution layer.

That shift maps governance to three structural requirements: the 3Cs.

Contain: Bound the Blast Radius

Every execution model relies on isolation. Processes required memory protection. Virtual machines required hypervisors. Containers required namespaces. Agents require an equivalent boundary. Containment limits failure so mistakes made by an agent don’t have permanent consequences for your data, workflows, and business. Unlocking full agent autonomy requires the confidence that experimentation won’t be reckless. Without it, autonomous execution fails.

Curate: Define the Agent’s Environment

What an agent can do is determined by what exists in its environment. The tools it can invoke, the code it can see, the credentials it can use, the context it operates within. All of this shapes execution before the agent acts.

Curation isn’t approval. It is construction. You are not reviewing what the agent wants to do. You are defining the world it operates in. Agents do not reason about your entire system. They act within the environment they are given. If that environment is deliberate, execution becomes predictable. If it is not, you have autonomy without structure, which is just risk.

Control: Enforce Boundaries in Real Time

Governance that exists only on paper has no effect on autonomous systems. Rules must apply as actions occur. File access, network calls, tool invocation, and credential use require runtime enforcement. This is where alert-based security breaks down. Logging and warnings explain what happened or ask permission after execution is already underway. 

Control determines what can happen, when, where, and who has the privilege to make it happen. Properly executed control does not remove autonomy. It defines its limits and removes the need for humans to approve every action under pressure. If this sounds like a policy engine, you aren’t wrong. But this must be dynamic and adaptable, able to keep pace with an agentic workforce.

Putting the 3Cs Into Practice

The 3Cs reinforce one another. Containment limits the cost of failure. Curation narrows what agents can attempt and makes them more useful to developers by applying semantic knowledge to craft tools and context to suit the specific environment and task. Control at the runtime layer replaces reactive approval with structural enforcement.

In practice, this work falls to platform teams. It means standardized execution environments with isolation by default, curated tool and credential surfaces aligned to specific use cases, and policy enforcement that operates before actions complete rather than notifying humans afterward. Teams that build with these principles can use agents effectively without burning out developers or drowning them in alerts. Teams that do not will discover that human attention is not a scalable control plane.

]]>
Docker Sandboxes: Run Claude Code and Other Coding Agents Unsupervised (but Safely) https://www.docker.com/blog/docker-sandboxes-run-claude-code-and-other-coding-agents-unsupervised-but-safely/ Fri, 30 Jan 2026 23:39:54 +0000 https://www.docker.com/?p=84895 We introduced Docker Sandboxes in experimental preview a few months ago. Today, we’re launching the next evolution with microVM isolation, available now for macOS and Windows. 

We started Docker Sandboxes to answer the question:

How do I run Claude Code or Gemini CLI safely?

Sandboxes provide disposable, isolated environments purpose-built for coding agents. Each agent runs in an isolated version of your development environment, so when it installs packages, modifies configurations, deletes files, or runs Docker containers, your host machine remains untouched.

This isolation lets you run agents like Claude Code, Codex CLI, Copilot CLI, Gemini CLI, and Kiro with autonomy. Since they can’t harm your computer, let them run free.

Since our first preview, Docker Sandboxes have evolved. They’re now more secure, easier to use, and more powerful.

Level 4 Coding Agent Autonomy

Claude Code and other coding agents fundamentally change how developers write and maintain code. But a practical question remains: how do you let an agent run unattended (without constant permission prompts), while still protecting your machine and data? 

Most developers quickly run into the same set of problems trying to solve this:

  • OS-level sandboxing interrupts workflows and isn’t consistent across platforms
  • Containers seem like the obvious answer, until the agent needs to run Docker itself
  • Full VMs work, but are slow, manual, and hard to reuse across projects

We started building Docker Sandboxes specifically to fill this gap.

Docker Sandboxes: MicroVM-Based Isolation for Coding Agents

Defense-in-depth, isolation by default

  • Each agent runs inside a dedicated microVM
  • Only your project workspace is mounted into the sandbox
  • Hypervisor-based isolation significantly reduces host risk

A real development environment

  • Agents can install system packages, run services, and modify files
  • Workflows run unattended, without constant permission approvals

Safe Docker access for coding agents

  • Coding agents can build and run Docker containers inside the microVM
  • They have no access to the host Docker daemon

One sandbox, many coding agents

  • Use the same sandbox experience with Claude Code, Copilot CLI, Codex CLI, Gemini CLI, and Kiro
  • More to come (and we’re taking requests!)

Fast reset, no cleanup

  • If an agent goes off the rails, delete the sandbox and spin up a fresh one in seconds

What’s New Since the Preview and What’s Next

The experimental preview validated the core idea: coding agents need an execution environment with clear isolation boundaries, not a stream of permission prompts. The early focus was developer experience, making it easy to spin up an environment that felt natural and productive for real workflows.

As Matt Pocock put it, “Docker Sandboxes have the best DX of any local AI coding sandbox I’ve tried.”

With this release, we’re making Sandboxes more powerful and secure with no compromise on developer experience.

What’s New

  • MicroVM-based isolation
    Sandboxes now run on dedicated microVMs, adding a hard security boundary.
  • Network isolation with allow and deny lists
    Control over coding agent network access.
  • Secure Docker execution for agents
    Docker Sandboxes are the only sandboxing solution we’re aware of that allows coding agents to build and run Docker containers while remaining isolated from the host system.

What’s Next

We’re continuing to expand Docker Sandboxes based on developer feedback:

  • Linux support
  • MCP Gateway support
  • Ability to expose ports to the host device and access host-exposed services
  • Support for additional coding agents

Docker Sandboxes were made for developers who want to run coding agents unattended, experiment freely, and recover instantly when something goes wrong. They extend the familiar isolation principles of containers, but with harder boundaries.

If you’ve been holding back on using agents because of permission prompts, system risk, or Docker-in-Docker limitations, Docker Sandboxes are built to remove those constraints.

We’re iterating quickly, and feedback from real-world usage will directly shape what comes next.

]]>
Your Dependencies Don’t Care About Your FIPS Configuration https://www.docker.com/blog/fips-dependencies-and-prebuilt-binaries/ Thu, 22 Jan 2026 14:00:00 +0000 https://www.docker.com/?p=84608 FIPS compliance is a great idea that makes the entire software supply chain safer. But teams adopting FIPS-enabled container images are running into strange errors that can be challenging to debug. What they are learning is that correctness at the base image layer does not guarantee compatibility across the ecosystem. Change is complicated, and changing complicated systems with intricate dependency webs often yields surprises. We are in the early adaptation phase of FIPS, and that actually provides interesting opportunities to optimize how things work. Teams that recognize this will rethink how they build FIPS-compliant images and get ahead of the game.

FIPS in practice

FIPS is a U.S. government standard for cryptography. In simple terms, if you say a system is “FIPS compliant,” that means the cryptographic operations like TLS, hashing, signatures, and random number generation are performed using a specific, validated crypto module in an approved mode. That sounds straightforward until you remember that modern software is built not as one compiled program, but as a web of dependencies that carry their own baggage and quirks.

The FIPS crypto error that caught us off guard

We got a ticket recently for a Rails application in a FIPS-enabled container image. On the surface, everything looked right. Ruby was built to use OpenSSL 3.x with the FIPS provider. The OpenSSL configuration was correct. FIPS mode was active.

However, the application started throwing cryptography errors from the pg RubyGem, the Postgres driver. Even more confusing, a minimal reproducer (a basic Ruby app connecting to a stock Postgres) did not trigger the error; the connection was established successfully. The issue only manifested when using ActiveRecord.

The difference came down to code paths. A basic Ruby script using the pg gem directly exercises a simpler set of operations. ActiveRecord triggers additional functionality that exercises different parts of libpq. The non-FIPS crypto was there all along, but only certain operations exposed it.

Your container image can be carefully configured for FIPS, and your application can still end up using non-FIPS crypto because a dependency brought its own crypto along for the ride. In this case, the culprit was a precompiled native artifact associated with the database stack. When you install pg, Bundler may choose to download a prebuilt binary dependency such as libpq.

Unfortunately those prebuilt binaries are usually built with assumptions that cause problems. They may be linked against a different OpenSSL than the one in your image. They may contain statically embedded crypto code. They may load crypto at runtime in a way that is not obvious.

This is the core challenge with FIPS adoption. Your base image can do everything right, but prebuilt dependencies can silently bypass your carefully configured crypto boundary.

Why we cannot just fix it in the base image yet

The practical fix for the Ruby case was adding this to your Gemfile.

gem "pg", "~> 1.1", force_ruby_platform: true

You also need to install libpq-dev to allow compiling from source. This forces Bundler to build the gem from source on your system instead of using a prebuilt binary. When you compile from source inside your controlled build environment, the resulting native extension is linked against the OpenSSL that is actually in your FIPS image.

Bundler also supports an environment/config knob for the same idea called BUNDLE_FORCE_RUBY_PLATFORM. The exact mechanism matters less than the underlying strategy of avoiding prebuilt native artifacts when you are trying to enforce a crypto boundary.
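If you prefer the config knob over editing the Gemfile, the equivalent Bundler setup might look like the following; note that, unlike the per-gem Gemfile option, it applies to every native gem:

```shell
# Per-project Bundler config: always build native gems from source.
bundle config set force_ruby_platform true

# Or via environment variable, e.g. in a Dockerfile build stage:
export BUNDLE_FORCE_RUBY_PLATFORM=true
bundle install
```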

You might reasonably ask why we do not just add BUNDLE_FORCE_RUBY_PLATFORM to the Ruby FIPS image by default. We discussed this internally, and the answer illustrates why FIPS complexity cascades.

Setting that flag globally is not enough on its own. You also need a C compiler and the relevant libraries and headers in the build stage. And not every gem needs this treatment. If you flip the switch globally, you end up compiling every native gem from source, which drags in additional headers and system libraries that you now need to provide. The “simple fix” creates a new dependency management problem.

Teams adopt FIPS images to satisfy compliance. Then they have to add back build complexity to make the crypto boundary real and verify that every dependency respects it. This is not a flaw in FIPS or in the tooling. It is an inherent consequence of retrofitting a strict cryptographic boundary onto an ecosystem built around convenience and precompiled artifacts.

The patterns we are documenting today will become the defaults tomorrow. The tooling will catch up. Prebuilt packages will get better. Build systems will learn to handle the edge cases. But right now, teams need to understand where the pitfalls are.

What to do if you are starting a FIPS journey

You do not need to become a crypto expert to avoid the obvious traps. You only need a checklist mindset. The teams working through these problems now are building real expertise that will be valuable as FIPS requirements expand across industries.

  • Treat prebuilt native dependencies as suspect. If a dependency includes compiled code, assume it might carry its own crypto linkage until you verify otherwise. You can use ldd on Linux to inspect dynamic linking and confirm that binaries link against your system OpenSSL rather than a bundled alternative.
  • Use a multi-stage build and compile where it matters. Keep your runtime image slim, but allow a builder stage with the compiler and headers needed to compile the few native pieces that must align with your FIPS OpenSSL.
  • Test the real execution path, not just “it starts.” For Rails, that means running a query, not only booting the app or opening a connection. The failures we saw appeared when using the ORM, not on first connection.
  • Budget for supply-chain debugging. The hard part is not turning on FIPS mode. The hard part is making sure all the moving parts actually respect it. Expect to spend time tracing crypto usage through your dependency graph.
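As a sketch of the first check in that list, the following locates the pg gem’s compiled extension and inspects its crypto linkage; the extension filename and the exact output vary by Ruby version and platform:

```shell
# Find the pg gem's native extension (filename varies by platform).
PG_SO=$(gem contents pg | grep -E 'pg_ext\.(so|bundle)$' | head -n 1)

# Confirm it links against the system OpenSSL, not a bundled copy.
ldd "$PG_SO" | grep -Ei 'ssl|crypto'
```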

Why this matters beyond government contracts

FIPS compliance has traditionally been seen as a checkbox for federal sales. That is changing. As supply chain security becomes a board-level concern across industries, validated cryptography is moving from “nice to have” to “expected.” The skills teams build solving FIPS problems today translate directly to broader supply chain security challenges.

Think about what you learn when you debug a FIPS failure. You learn to trace crypto usage through your dependency graph, to question prebuilt artifacts, to verify that your security boundaries are actually enforced at runtime. Those skills matter whether you are chasing a FedRAMP certification or just trying to answer your CISO’s questions about software provenance.

The opportunity in the complexity

FIPS is not “just a switch” you flip in a base image. View FIPS instead as a new layer of complexity that you might have to debug across your dependency graph. That can sound like bad news, but switch the framing and it becomes an opportunity to get ahead of where the industry is going.

The ecosystem will adapt and the tooling will improve. The teams investing in understanding these problems now will be the ones who can move fastest when FIPS or something like it becomes table stakes.

If you are planning a FIPS rollout, start by controlling the prebuilt native artifacts that quietly bypass the crypto module you thought you were using. Recognize that every problem you solve is building institutional knowledge that compounds over time. This is not just compliance work. It is an investment in your team’s security engineering capability.

]]>
Making (Very) Small LLMs Smarter https://www.docker.com/blog/making-small-llms-smarter/ Fri, 16 Jan 2026 13:47:26 +0000 https://www.docker.com/?p=84663 Hello, I’m Philippe, and I am a Principal Solutions Architect helping customers with their usage of Docker. I started getting seriously interested in generative AI about two years ago. What interests me most is the ability to run language models (LLMs) directly on my laptop (For work, I have a MacBook Pro M2 max, but on a more personal level, I run LLMs on my personal MacBook Air M4 and on Raspberry Pis – yes, it’s possible, but I’ll talk about that another time).

Let’s be clear: reproducing a Claude AI Desktop or ChatGPT on a laptop with small language models is not possible, especially since I limit myself to models that have between 0.5 and 7 billion parameters. But I find it an interesting challenge to see how far we can go with these small models. So, can we do really useful things with small LLMs? The answer is yes, but you need to be creative and put in a bit of effort.

I’m going to take a concrete use case, related to development (but in the future I’ll propose “less technical” use cases).

(Specific) Use Case: Code Writing Assistance

I need help writing code

Currently, I’m working in my free time on an open-source project, which is a Golang library for quickly developing small generative AI agents. It’s both to get my hands dirty with Golang and prepare tools for other projects. This project is called Nova; there’s nothing secret about it, you can find it here.

If I use Claude AI and ask it to help me write code with Nova: “I need a code snippet of a Golang Nova Chat agent using a stream completion.”

tiny model fig 1

The response will be quite disappointing, because Claude doesn’t know Nova (which is normal, it’s a recent project). But Claude doesn’t want to disappoint me and will still propose something which has nothing to do with my project.

And it will be the same with Gemini.

tiny model fig 2

So, you’ll tell me: just feed the source code of your repository to Claude AI or Gemini. OK, but imagine the following situation: I don’t have access to these services, for various reasons – confidentiality, for example, or working on a project where we don’t have the right to use the internet. That disqualifies Claude AI and Gemini. So how can I get help writing code? As you’ve guessed: with a local LLM – and moreover, a “very small” one.

Choosing a language model

When you develop a solution based on generative AI, the choice of language model(s) is crucial. You’ll have to do a lot of research, comparison, and testing to find the model that best fits your use case. Be aware that this is non-trivial work.

For this article (and also because I use it), I’m going to use hf.co/qwen/qwen2.5-coder-3b-instruct-gguf:q4_k_m, which you can find here. It’s a 3 billion parameter language model, optimized for code generation. You can install it with Docker Model Runner with the following command:

docker model pull hf.co/Qwen/Qwen2.5-Coder-3B-Instruct-GGUF:Q4_K_M

And to start chatting with the model, you can use the following command:

docker model run hf.co/qwen/qwen2.5-coder-3b-instruct-gguf:q4_k_m

Or use Docker Desktop:

tiny model fig 3

So, of course, as you can see in the illustration above, this little “Qwen Coder” doesn’t know my Nova library either. But we’re going to fix that.

Feeding the model with specific information

For my project, I have a markdown file in which I save the code snippets I use to develop examples with Nova. You can find it here. For now, there’s little content, but it will be enough to prove and illustrate my point.

So I could add the entire content of this file to a user prompt that I would give to the model. But that would be ineffective: small models have a relatively small context window. And even if my “Qwen Coder” were capable of ingesting the entire content of my markdown file, it would have trouble focusing on my request and on what it should do with this information. So,

  • 1st essential rule: when you use a very small LLM, the larger the content provided to the model, the less effective the model will be.
  • 2nd essential rule: the more you keep the conversation history, the more the content provided to the model will grow, and therefore it will decrease the effectiveness of the model.

So, to work around this problem, I’m going to use a technique called RAG (Retrieval-Augmented Generation). The principle is simple: instead of providing all the content to the model, we store this content in a vector database; when the user makes a request, we search that database for the most relevant information based on the request, and provide only that relevant information to the language model. For this blog post, the data will be kept in memory (which is not optimal, but sufficient for a demonstration).
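
The in-memory “vector database” mentioned above can be sketched in a few lines of JavaScript. This is an illustration only: the names (`VectorRecord`, `MemoryVectorStore`) mirror those used later in this article, but the project’s actual rag.js may differ.

```javascript
// Hypothetical sketch of an in-memory vector store; the project's real
// rag.js may differ. Each record pairs a text chunk with its embedding.
class VectorRecord {
  constructor(id, prompt, embedding) {
    this.id = id;               // optional identifier
    this.prompt = prompt;       // the original text chunk
    this.embedding = embedding; // its vector representation
  }
}

class MemoryVectorStore {
  constructor() { this.records = new Map(); }
  save(record) {
    // Fall back to an auto-incremented key when no id is provided.
    const key = record.id || String(this.records.size);
    this.records.set(key, record);
  }
}

const store = new MemoryVectorStore();
store.save(new VectorRecord('', 'chat agent snippet', [0.1, 0.2]));
console.log(store.records.size); // 1
```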

RAG?

There are already many articles on the subject, so I won’t go into detail. But here’s what I’m going to do for this blog post:

  1. My snippets file is composed of sections: a markdown title (## snippet name), possibly a free-text description, and a fenced code block (```golang … ```).
  2. I’m going to split this file by section into chunks of text,
  3. Then, for each section I’m going to create an “embedding” (vector representation of text == mathematical representation of the semantic meaning of the text) with the ai/embeddinggemma:latest model (a relatively small and efficient embedding model). Then I’m going to store these embeddings (and the associated text) in an in-memory vector database (a simple array of JSON objects).
  4. If you want to learn more about embeddings, please read this article: Run Embedding Models and Unlock Semantic Search with Docker Model Runner

Diagram of the vector database creation process:

tiny model fig 4
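
The section-based splitting from step 2 can be sketched as follows, assuming each snippet starts with a `## ` heading (the project’s real chunks.js may use a different implementation):

```javascript
// Hypothetical sketch: split a markdown file into one chunk per "## " section.
// The project's real chunks.js may differ.
function splitMarkdownBySections(markdown) {
  return markdown
    .split(/\n(?=## )/)            // break right before each "## " heading
    .map(chunk => chunk.trim())
    .filter(chunk => chunk.length > 0);
}

const doc = "## Chat agent\nsnippet A\n## Structured agent\nsnippet B";
console.log(splitMarkdownBySections(doc).length); // 2 chunks, one per section
```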

Similarity search and user prompt construction

Once I have this in place, when I make a request to the language model (so hf.co/qwen/qwen2.5-coder-3b-instruct-gguf:q4_k_m), I’m going to:

  1. Create an embedding of the user’s request with the embedding model.
  2. Compare this embedding with the embeddings stored in the vector database to find the most relevant sections (by calculating the distance between the vector representation of my question and the vector representations of the snippets). This is called a similarity search.
  3. From the most relevant sections (the most similar), I’ll be able to construct a user prompt that includes only the relevant information and my initial request.

Diagram of the search and user prompt construction process:

tiny model fig 5

So the final user prompt will contain:

  • The system instructions. For example: “You are a helpful coding assistant specialized in Golang and the Nova library. Use the provided code snippets to help the user with their requests.”
  • The relevant sections extracted from the vector database.
  • The user’s request.
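
Assembled as chat messages, the final prompt sketched in the bullets above might look like this (values are illustrative only):

```javascript
// Illustrative only: the final user prompt as a list of chat messages.
const systemInstructions =
  "You are a helpful coding assistant specialized in Golang and the Nova library. " +
  "Use the provided code snippets to help the user with their requests.";
const knowledgeBase =
  "KNOWLEDGE BASE:\n## Nova chat agent snippet\n(relevant section retrieved from the vector store)";
const userMessage = "I need a code snippet of a Golang Nova Chat agent using a stream completion";

const messages = [
  ["system", systemInstructions],
  ["system", knowledgeBase],
  ["user", userMessage],
];
console.log(messages.length); // 3
```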

Remarks:

  • I explain the principles and results, but all the source code (NodeJS with LangchainJS) used to arrive at my conclusions is available in this project.
  • To calculate distances between vectors, I used cosine similarity (A cosine similarity score of 1 indicates that the vectors point in the same direction. A cosine similarity score of 0 indicates that the vectors are orthogonal, meaning they have no directional similarity.)
  • You can find the JavaScript function I used here
  • And the piece of code that I use to split the markdown snippets file
  • Warning: embedding models are limited by the size of the text chunks they can ingest. So be careful not to exceed this size when splitting the source file. In some cases, you’ll have to change the splitting strategy (fixed-size chunks, for example, with or without overlap).
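
The cosine similarity and top-N search described in the remarks can be sketched like this (hypothetical function names; the JavaScript linked in the article may differ):

```javascript
// Hypothetical sketch of the similarity search described above;
// names and shapes differ from the project's rag.js.
function cosineSimilarity(a, b) {
  let dot = 0, normA = 0, normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Keep at most `topN` records whose similarity to the query exceeds `limit`.
function searchTopNSimilarities(records, queryEmbedding, limit, topN) {
  return records
    .map(r => ({ ...r, cosineSimilarity: cosineSimilarity(r.embedding, queryEmbedding) }))
    .filter(r => r.cosineSimilarity >= limit)
    .sort((x, y) => y.cosineSimilarity - x.cosineSimilarity)
    .slice(0, topN);
}

// Identical vectors score 1; orthogonal vectors score 0.
console.log(cosineSimilarity([1, 0], [1, 0])); // 1
console.log(cosineSimilarity([1, 0], [0, 1])); // 0
```

A score close to 1 means the snippet is semantically close to the question; the COSINE_LIMIT threshold used later in the article simply discards weak matches.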

Implementation and results, or creating my Golang expert agent

Now that we have the operating principle, let’s see how to put this into practice with LangchainJS, Docker Model Runner, and Docker Agentic Compose.

Docker Agentic Compose configuration

Let’s start with the Compose file of the Docker Agentic Compose project:

services:
  golang-expert:
    build:
      context: .
      dockerfile: Dockerfile
    environment:
      TERM: xterm-256color

      HISTORY_MESSAGES: 2
      MAX_SIMILARITIES: 3
      COSINE_LIMIT: 0.45

      OPTION_TEMPERATURE: 0.0
      OPTION_TOP_P: 0.75
      OPTION_PRESENCE_PENALTY: 2.2

      CONTENT_PATH: /app/data

    volumes:
      - ./data:/app/data

    stdin_open: true   # docker run -i
    tty: true          # docker run -t

    configs:
      - source: system.instructions.md
        target: /app/system.instructions.md

    models:
      chat-model:
        endpoint_var: MODEL_RUNNER_BASE_URL
        model_var: MODEL_RUNNER_LLM_CHAT

      embedding-model:
        endpoint_var: MODEL_RUNNER_BASE_URL
        model_var: MODEL_RUNNER_LLM_EMBEDDING


models:
  chat-model:
    model: hf.co/qwen/qwen2.5-coder-3b-instruct-gguf:q4_k_m

  embedding-model:
    model: ai/embeddinggemma:latest

configs:
  system.instructions.md:
    content: |
      Your name is Bob (the original replicant).
      You are an expert programming assistant in Golang.
      You write clean, efficient, and well-documented code.
      Always:
      - Provide complete, working code
      - Include error handling
      - Add helpful comments
      - Follow best practices for the language
      - Explain your approach briefly

      Use only the information available in the provided data and your KNOWLEDGE BASE.

What’s important here is:

I only keep the last 2 messages in my conversation history, and I select at most the 3 best similarity matches (to limit the size of the user prompt):

HISTORY_MESSAGES: 2
MAX_SIMILARITIES: 3
COSINE_LIMIT: 0.45

You can adjust these values according to your use case and your language model’s capabilities.

The models section, where I define the language models I’m going to use:

models:
  chat-model:
    model: hf.co/qwen/qwen2.5-coder-3b-instruct-gguf:q4_k_m

  embedding-model:
    model: ai/embeddinggemma:latest

One of the advantages of this section is that it will allow Docker Compose to download the models if they’re not already present on your machine.

As well as the models section of the golang-expert service, where I map the environment variables to the models defined above:

models:
  chat-model:
    endpoint_var: MODEL_RUNNER_BASE_URL
    model_var: MODEL_RUNNER_LLM_CHAT

  embedding-model:
    endpoint_var: MODEL_RUNNER_BASE_URL
    model_var: MODEL_RUNNER_LLM_EMBEDDING

And finally, the system instructions configuration file:

configs:
  - source: system.instructions.md
    target: /app/system.instructions.md

Which I define a bit further down in the configs section:

configs:
  system.instructions.md:
    content: |
      Your name is Bob (the original replicant).
      You are an expert programming assistant in Golang.
      You write clean, efficient, and well-documented code.
      Always:
      - Provide complete, working code
      - Include error handling
      - Add helpful comments
      - Follow best practices for the language
      - Explain your approach briefly

      Use only the information available in the provided data and your KNOWLEDGE BASE.

You can, of course, adapt these system instructions to your use case. And also persist them in a separate file if you prefer.

Dockerfile

It’s rather simple:

FROM node:22.19.0-trixie

WORKDIR /app
COPY package*.json ./
RUN npm install
COPY *.js .

# Create non-root user
RUN groupadd --gid 1001 nodejs && \
    useradd --uid 1001 --gid nodejs --shell /bin/bash --create-home bob-loves-js

# Change ownership of the app directory
RUN chown -R bob-loves-js:nodejs /app

# Switch to non-root user
USER bob-loves-js

Now that the configuration is in place, let’s move on to the agent’s source code.

Golang expert agent source code, a bit of LangchainJS with RAG

The JavaScript code is rather simple (probably improvable, but functional) and follows these main steps:

1. Initial configuration

  • Connection to both models (chat and embeddings) via LangchainJS
  • Loading parameters from environment variables

2. Vector database creation (at startup)

  • Reading the snippets.md file
  • Splitting into sections (chunks)
  • Generating an embedding for each section
  • Storing in an in-memory vector database

3. Interactive conversation loop

  • The user asks a question
  • Creating an embedding of the question
  • Similarity search in the vector database to find the most relevant snippets
  • Construction of the final prompt with: history + system instructions + relevant snippets + question
  • Sending to the LLM and displaying the response in streaming
  • Updating the history (limited to the last N messages)

import { ChatOpenAI } from "@langchain/openai";
import { OpenAIEmbeddings} from '@langchain/openai';

import { splitMarkdownBySections } from './chunks.js'
import { VectorRecord, MemoryVectorStore } from './rag.js';


import prompts from "prompts";
import fs from 'fs';

// Define [CHAT MODEL] Connection
const chatModel = new ChatOpenAI({
  model: process.env.MODEL_RUNNER_LLM_CHAT || `ai/qwen2.5:latest`,
  apiKey: "",
  configuration: {
    baseURL: process.env.MODEL_RUNNER_BASE_URL || "http://localhost:12434/engines/llama.cpp/v1/",
  },
  temperature: parseFloat(process.env.OPTION_TEMPERATURE) || 0.0,
  top_p: parseFloat(process.env.OPTION_TOP_P) || 0.5,
  presencePenalty: parseFloat(process.env.OPTION_PRESENCE_PENALTY) || 2.2,
});


// Define [EMBEDDINGS MODEL] Connection
const embeddingsModel = new OpenAIEmbeddings({
    model: process.env.MODEL_RUNNER_LLM_EMBEDDING || "ai/embeddinggemma:latest",
    configuration: {
    baseURL: process.env.MODEL_RUNNER_BASE_URL || "http://localhost:12434/engines/llama.cpp/v1/",
        apiKey: ""
    }
})

const maxSimilarities = parseInt(process.env.MAX_SIMILARITIES) || 3
const cosineLimit = parseFloat(process.env.COSINE_LIMIT) || 0.45

// ----------------------------------------------------------------
//  Create the embeddings and the vector store from the content file
// ----------------------------------------------------------------

console.log("========================================================")
console.log(" Embeddings model:", embeddingsModel.model)
console.log(" Creating embeddings...")
let contentPath = process.env.CONTENT_PATH || "./data"

const store = new MemoryVectorStore();

let contentFromFile = fs.readFileSync(contentPath+"/snippets.md", 'utf8');
let chunks = splitMarkdownBySections(contentFromFile);
console.log(" Number of documents read from file:", chunks.length);


// -------------------------------------------------
// Create and save the embeddings in the memory vector store
// -------------------------------------------------
console.log(" Creating the embeddings...");

for (const chunk of chunks) {
  try {
    // EMBEDDING COMPLETION:
    const chunkEmbedding = await embeddingsModel.embedQuery(chunk);
    const vectorRecord = new VectorRecord('', chunk, chunkEmbedding);
    store.save(vectorRecord);

  } catch (error) {
    console.error(`Error processing chunk:`, error);
  }
}

console.log(" Embeddings created, total of records", store.records.size);
console.log();


console.log("========================================================")


// Load the system instructions from a file
let systemInstructions = fs.readFileSync('/app/system.instructions.md', 'utf8');

// ----------------------------------------------------------------
// HISTORY: Initialize a Map to store conversations by session
// ----------------------------------------------------------------
const conversationMemory = new Map()

let exit = false;

// CHAT LOOP:
while (!exit) {
  const { userMessage } = await prompts({
    type: "text",
    name: "userMessage",
    message: `Your question (${chatModel.model}): `,
    validate: (value) => (value ? true : "Question cannot be empty"),
  });

  if (userMessage == "/bye") {
    console.log(" See you later!");
    exit = true;
    continue
  }

  // HISTORY: Get the conversation history for this session
  const history = getConversationHistory("default-session-id")

  // ----------------------------------------------------------------
  // SIMILARITY SEARCH:
  // ----------------------------------------------------------------
  // -------------------------------------------------
  // Create embedding from the user question
  // -------------------------------------------------
  const userQuestionEmbedding = await embeddingsModel.embedQuery(userMessage);

  // -------------------------------------------------
  // Use the vector store to find similar chunks
  // -------------------------------------------------
  // Create a vector record from the user embedding
  const embeddingFromUserQuestion = new VectorRecord('', '', userQuestionEmbedding);

  const similarities = store.searchTopNSimilarities(embeddingFromUserQuestion, cosineLimit, maxSimilarities);

  let knowledgeBase = "KNOWLEDGE BASE:\n";

  for (const similarity of similarities) {
    console.log(" CosineSimilarity:", similarity.cosineSimilarity, "Chunk:", similarity.prompt);
    knowledgeBase += `${similarity.prompt}\n`;
  }

  console.log("\n Similarities found, total of records", similarities.length);
  console.log();
  console.log("========================================================")
  console.log()

  // -------------------------------------------------
  // Generate CHAT COMPLETION:
  // -------------------------------------------------

  // MESSAGES== PROMPT CONSTRUCTION:
  let messages = [
      ...history,
      ["system", systemInstructions],
      ["system", knowledgeBase],
      ["user", userMessage]
  ]

  let assistantResponse = ''
  // STREAMING COMPLETION:
  const stream = await chatModel.stream(messages);
  for await (const chunk of stream) {
    assistantResponse += chunk.content
    process.stdout.write(chunk.content);
  }
  console.log("\n");

  // HISTORY: Add both user message and assistant response to history
  addToHistory("default-session-id", "user", userMessage)
  addToHistory("default-session-id", "assistant", assistantResponse)

}

// Helper function to get or create a conversation history
function getConversationHistory(sessionId, maxTurns = parseInt(process.env.HISTORY_MESSAGES)) {
  if (!conversationMemory.has(sessionId)) {
    conversationMemory.set(sessionId, [])
  }
  return conversationMemory.get(sessionId)
}

// Helper function to add a message to the conversation history
function addToHistory(sessionId, role, content) {
  const history = getConversationHistory(sessionId)
  history.push([role, content])

  // Keep only the last maxTurns conversations
  const maxTurns = parseInt(process.env.HISTORY_MESSAGES) // Adjust this value based on your needs
  if (history.length > maxTurns * 2) { // *2 because each turn has user & assistant message
    history.splice(0, 2) // Remove oldest turn (user + assistant messages)
  }
}

You can find it here

All that’s left is to launch everything and verify my hypotheses

In the project folder, run the following command:

docker compose up --build --no-log-prefix -d

Then connect to the container and launch the application:

docker compose exec golang-expert /bin/bash
node index.js

Nova Chat Agent with streaming completion

Once the application has started, it’s time to ask our question to our Golang expert agent about development with the Nova library:

========================================================
 Embeddings model: ai/embeddinggemma:latest
 Creating embeddings...
 Number of documents read from file: 4
 Creating the embeddings...
 Embeddings created, total of records 4

========================================================
?  Your question (hf.co/qwen/qwen2.5-coder-3b-instruct-gguf:q4_k_m):  › I need a code snippet of a Golang Nova Chat agent using a stream completion


After a quick similarity search, our agent found two relevant snippets in the vector database. And it was able to provide me with complete, functional, and commented code to answer my request:

Certainly! Below is a complete, working Go code snippet for a Nova Chat agent that uses streaming completion. This example demonstrates how to set up an agent with the necessary configurations and generate a stream completion response.

```go
package main

import (
        "context"
        "fmt"

        "github.com/snipwise/nova/nova-sdk/agents"
        "github.com/snipwise/nova/nova-sdk/agents/chat"
        "github.com/snipwise/nova/nova-sdk/messages"
        "github.com/snipwise/nova/nova-sdk/messages/roles"
        "github.com/snipwise/nova/nova-sdk/models"
)

func main() {

        ctx := context.Background()

        // Create a new chat agent with streaming capabilities
        agent, err := chat.NewAgent(
                ctx,
                agents.Config{
                        EngineURL:          "http://localhost:12434/engines/llama.cpp/v1",
                        SystemInstructions: "You are Bob, a helpful AI assistant.",
                        KeepConversationHistory: true,
                },
                models.Config{
                        Name:        "ai/qwen2.5:1.5B-F16",
                        Temperature: models.Float64(0.8),
                },
        )
        if err != nil {
                panic(err)
        }

        // Generate a stream completion response
        result, err := agent.GenerateStreamCompletion(
                []messages.Message{
                        {Role: roles.User, Content: "Who is James T Kirk?"},
                },
                func(chunk string, finishReason string) error {
                        // Simple callback that receives strings only
                        if chunk != "" {
                                fmt.Print(chunk)
                        }
                        if finishReason == "stop" {
                                fmt.Println()
                        }
                        return nil
                },
        )
        if err != nil {
                panic(err)
        }

        // Print the final response and finish reason
        fmt.Println("Final Response:", result.Response)
        fmt.Println("Finish Reason:", result.FinishReason)
}
```

### Explanation:

1. **Imports**: The code imports necessary packages from the Nova SDK to interact with chat agents.

2. **Context**: A context is created using `context.Background()` to manage the lifecycle of requests.

3. **Agent Creation**:
   - An agent is created using `chat.NewAgent` with specified configurations:
     - `EngineURL`: The URL of the engine to use for generating text.
     - `SystemInstructions`: Instructions for the AI assistant, such as who Bob is.
     - `KeepConversationHistory`: Whether to keep a history of previous messages.

4. **Model Configuration**:
   - A model configuration is provided with:
     - `Name`: The name of the model to use.
     - `Temperature`: Controls the randomness of the output (higher values make responses more random).
     - `MaxTokens`: Limits the maximum number of tokens in the response.

5. **Stream Completion Generation**:
   - The `GenerateStreamCompletion` method is called with a list of messages and a callback function that processes each chunk of text as it arrives.
   - The callback prints chunks to the console until the finish reason is "stop", indicating the completion has been reached.

6. **Output**:
   - The final response and finish reason are printed after the stream completes.

This code snippet demonstrates how to set up a streaming chat agent using Nova, allowing you to receive text responses incrementally as they become available.

I then asked other questions.

Nova Structured Agent

This time, I asked for a structured agent: “I need a code snippet of a Nova Structured Agent.”

And the result was conclusive:

Extract from the response:

Certainly! Below is a complete, working example of a Nova Structured Agent in Go. This agent will generate structured data about a specified country based on the user's input.

```go
package main

import (
        "context"
        "fmt"

        "github.com/snipwise/nova/nova-sdk/agents"
        "github.com/snipwise/nova/nova-sdk/agents/structured"
        "github.com/snipwise/nova/nova-sdk/messages"
        "github.com/snipwise/nova/nova-sdk/messages/roles"
        "github.com/snipwise/nova/nova-sdk/models"
)

type Country struct {
        Name       string   `json:"name"`
        Capital    string   `json:"capital"`
        Population int      `json:"population"`
        Languages  []string `json:"languages"`
}

func main() {
        ctx := context.Background()

        // Create a new Structured Agent
        agent, err := structured.NewAgent[Country](
                ctx,
                agents.Config{
                        EngineURL: "http://localhost:12434/engines/llama.cpp/v1",
                        SystemInstructions: `
                                Your name is Bob.
                                You are an assistant that answers questions about countries around the world.
                        `,
                },
                models.Config{
                        Name:        "hf.co/menlo/jan-nano-gguf:q4_k_m",
                        Temperature: models.Float64(0.0),
                },
        )
        if err != nil {
                panic(err)
        }

        // Generate structured data about Canada
        response, finishReason, err := agent.GenerateStructuredData([]messages.Message{
                {Role: roles.User, Content: "Tell me about Canada."},
        })

        if err != nil {
                panic(err)
        }

        // Print the generated country information
        fmt.Println("Country Name:", response.Name)
        fmt.Println("Capital:", response.Capital)
        fmt.Println("Population:", response.Population)
        fmt.Println("Languages:", strings.Join(response.Languages, ", "))
}
```

### Explanation:

1. **Imports**: The code imports necessary packages from the Nova SDK to interact with agents and models.

2. **Country Struct**: A simple struct `Country` is defined to represent the data structure expected by the agent.

3. **Main Function**:
   - A context is created using `context.Background()`.
   - A new Structured Agent is created for generating structured data about countries.
   - The system instructions are set to provide a friendly greeting and instruction on answering questions about countries.
   - The model configuration specifies the engine URL, temperature, and other parameters.

4. **GenerateStructuredData Method**: This method generates structured data based on the user's input. It takes an array of messages as input, where each message has a role (e.g., `User`) and content (e.g., "Tell me about Canada").

5. **Error Handling**: The code includes error handling to ensure that any issues during agent creation or data generation are properly caught and reported.

6. **Output**: The generated country information is printed, including the name, capital, population, and languages.

This example demonstrates how to use a Nova Structured Agent to generate structured data about countries based on user input. You can modify the system instructions and model configuration as needed for your specific use case.

Then, I moved on to the next question.

Nova RAG Agent with vector store. Be careful with your similarity settings!

This time, I asked for a “RAG” agent: “I need a snippet of a Nova RAG agent with a vector store.”

And once again, I got a relevant response.

However, when I tried with this question (after restarting the agent to start from a clean base without conversation history): “I need a snippet of a Nova RAG agent.”

The similarity search returned no relevant results (because the words “vector store” were not present in the snippets), and the agent responded with generic code that had nothing to do with Nova, or reused code from the Nova chat agent snippets.

There may be several possible reasons:

  • The embedding model is not suitable for my use case,
  • The embedding model is not precise enough,
  • The splitting of the code snippets file is not optimal (you can add metadata to chunks to improve similarity search, for example, but don’t forget that chunks must not exceed the maximum size that the embedding model can ingest).

In such cases, there’s a simple solution that works quite well: lower the similarity threshold and/or increase the maximum number of returned matches. This gives you more results to construct the user prompt with, but be careful not to exceed the language model’s maximum context size. You can also run tests with “bigger” LLMs (more parameters and/or a larger context window).

In the latest version of the snippets file, I added a KEYWORDS: … line below the markdown titles to help the similarity search, which greatly improved the results.

Conclusion

Using “Small Language Models” (SLM) or “Tiny Language Models” (TLM) requires a bit of energy and thought to work around their limitations. But it’s possible to build effective solutions for very specific problems. And once again, always think about the context size for the chat model and how you’ll structure the information for the embedding model. And by combining several specialized “small agents”, you can achieve very interesting results. This will be the subject of future articles.

Learn more

]]>
Deterministic AI Testing with Session Recording in cagent https://www.docker.com/blog/deterministic-ai-testing-with-session-recording-in-cagent/ Tue, 06 Jan 2026 19:16:04 +0000 https://www.docker.com/?p=84349 AI agents introduce a challenge that traditional software doesn’t have: non-determinism. The same prompt can produce different outputs across runs, making reliable testing difficult. Add API costs and latency to the mix, and developer productivity takes a hit.

Session recording in cagent addresses this directly. Record an AI interaction once, replay it indefinitely—with identical results, zero API costs, and millisecond execution times.

How session recording works

cagent implements the VCR pattern, a proven approach for HTTP mocking. During recording, cagent proxies requests to the AI provider, captures the full request/response cycle, and saves it to a YAML “cassette” file. During replay, incoming requests are matched against the recording and served from cache—no network calls required.
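
The record/replay idea behind the VCR pattern can be sketched in a few lines (an illustration only, not cagent’s actual Go implementation):

```javascript
// Minimal sketch of the VCR pattern: record a request/response pair once,
// then serve subsequent identical requests from the cassette.
// Illustration only; cagent's implementation differs.
class Cassette {
  constructor() { this.interactions = new Map(); }
  record(request, response) { this.interactions.set(JSON.stringify(request), response); }
  replay(request) { return this.interactions.get(JSON.stringify(request)); }
}

const cassette = new Cassette();
cassette.record({ prompt: "What is Docker?" }, { text: "A container platform." });
// Later runs hit the cassette instead of the provider:
console.log(cassette.replay({ prompt: "What is Docker?" }).text); // "A container platform."
```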

Getting started

Recording a session requires a single flag:

cagent run my-agent.yaml --record "What is Docker?"
# creates: cagent-recording-1736089234.yaml

cagent run my-agent.yaml --record="my-test" "Explain containers"
# creates: my-test.yaml

Replaying uses the --fake flag with the cassette path:

cagent exec my-agent.yaml --fake my-test "Explain containers"

The replay completes in milliseconds with no API calls.

One implementation detail worth noting: tool call IDs are normalized before matching. OpenAI generates random IDs on each request, which would otherwise break replay. cagent handles this automatically.
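
To illustrate why this matters, here is a hypothetical sketch of ID normalization (cagent’s actual Go implementation differs):

```javascript
// Hypothetical sketch of why tool call IDs must be normalized before
// cassette matching. Illustration only, not cagent's implementation.
function normalizeToolCallIds(requestBody) {
  // Replace provider-generated IDs like "call_aB3xY9" with stable placeholders
  // so the same logical request matches the recorded one byte-for-byte.
  let counter = 0;
  const seen = new Map();
  return requestBody.replace(/call_[A-Za-z0-9]+/g, id => {
    if (!seen.has(id)) seen.set(id, `call_${counter++}`);
    return seen.get(id);
  });
}

const live = '{"tool_calls":[{"id":"call_aB3xY9"},{"id":"call_aB3xY9"}]}';
console.log(normalizeToolCallIds(live)); // both occurrences map to call_0
```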

Example: CI/CD integration testing

Consider a code review agent:

# code-reviewer.yaml
agents:
  root:
    model: anthropic/claude-sonnet-4-0
    description: Code review assistant
    instruction: |
      You are an expert code reviewer. Analyze code for best practices,
      security issues, performance concerns, and readability.
    toolsets:
      - type: filesystem

Record the interaction with --yolo to auto-approve tool calls:

cagent exec code-reviewer.yaml --record="code-review" --yolo \
  "Review pkg/auth/handler.go for security issues"

In CI, replay without API keys or network access:

cagent exec code-reviewer.yaml --fake code-review \
  "Review pkg/auth/handler.go for security issues"

Cassettes can be version-controlled alongside test code. When agent instructions change significantly, delete the cassette and re-record to capture the new behaviour.

Other use cases

Cost-effective prompt iteration. Record a single interaction with an expensive model, then iterate on agent configuration against that recording. The first run incurs API costs; subsequent iterations are free.

cagent exec ./agent.yaml --record="expensive-test" "Complex task"
for i in {1..100}; do
  cagent exec ./agent-v$i.yaml --fake expensive-test "Complex task"
done

Issue reproduction. Users can record a session with --record bug-report and share the cassette file. Support teams replay the exact interaction locally for debugging.

Multi-agent systems. Recording captures the complete delegation graph: root agent decisions, sub-agent tool calls, and inter-agent communication.

Security and provider support

Cassettes automatically strip sensitive headers (Authorization, X-Api-Key) before saving, making them safe to commit to version control. The format is human-readable YAML:

version: 2
interactions:
  - id: 0
    request:
      method: POST
      url: https://api.openai.com/v1/chat/completions
      body: "{...}"
    response:
      status: 200 OK
      body: "data: {...}"

Session recording works with all supported providers: OpenAI, Anthropic, Google, Mistral, xAI, and Nebius.

Get started

Session recording is available now in cagent. To try it:

cagent run ./your-agent.yaml --record="my-session" "Your prompt here"

For questions, feedback, or feature requests, visit the cagent repository or join the GitHub Discussions.

]]>