Docker Desktop – Docker

Celebrating Women in AI: 3 Questions with Cecilia Liu on Leading Docker’s MCP Strategy

Yiwen Xu — Fri, 06 Mar 2026 12:59:30 +0000

To celebrate International Women’s Day, we sat down with Cecilia Liu, Senior Product Manager at Docker, for three questions about the vision and strategy behind Docker’s MCP solutions. From shaping product direction to driving AI innovation, Cecilia plays a key role in defining how Docker enables secure, scalable AI tooling.

Cecilia leads product management for Docker’s MCP Catalog and Toolkit, our solution for running MCP servers securely and at scale through containerization. She drives Docker’s AI strategy across both enterprise and developer ecosystems, helping organizations deploy MCP infrastructure with confidence while empowering individual developers to seamlessly discover, integrate, and use MCP in their workflows. With a technical background in AI frameworks and an MBA from NYU Stern, Cecilia bridges the worlds of AI infrastructure and developer tools, turning complex challenges into practical, developer-first solutions.

What products are you responsible for?

I own Docker’s MCP solution. At its core, it’s about solving the problems that anyone working with MCP runs into: how do you find the right MCP servers, how do you actually use them without a steep learning curve, and how do you deploy and manage them reliably across a team or organization.

How does Docker’s MCP solution benefit developers and enterprise customers?

Dev productivity is where my heart is. I want to build something that meaningfully helps developers at every stage of their cycle — and that’s exactly how I think about Docker’s MCP solution.

For end-user developers and vibe coders, the goal is simple: you shouldn’t need to understand the underlying infrastructure to get value from MCP. As long as you’re working with AI, we make it easy to discover, configure, and start using MCP servers without any of the usual setup headaches. One thing I kept hearing in user feedback was that people couldn’t even tell if their setup was actually working. That pushed us to ship in-product setup instructions that walk you through not just configuration, but how to verify everything is running correctly. It sounds small, but it made a real difference.

For developers building MCP servers and integrating them into agents, I’m focused on giving them the right creation and testing tools so they can ship faster and with more confidence. That’s a big part of where we’re headed.

And for security and enterprise admins, we’re solving real deployment pain, making it faster and cheaper to roll out and manage MCP across an entire organization. Custom catalogs, role-based access controls, audit logging, policy enforcement. The goal is to give teams the visibility and control they need to adopt AI tooling confidently at scale.

Customers love us for all of the above, and there’s one more thing that ties it together: the security that comes built-in with Docker. That trust doesn’t happen overnight, and it’s something we take seriously across everything we ship.

What are you excited about when it comes to the future of MCP?

What excites me most is honestly the pace of change itself. The AI landscape is shifting constantly, and with every new tool that makes AI more powerful, there’s a whole new set of developers who need a way to actually use it productively. That’s a massive opportunity.

MCP is where that’s happening right now, and the adoption we’re seeing tells me the need is real. But what gets me out of bed is knowing the problems we’re solving: discoverability, usability, deployment. They are all going to matter just as much for whatever comes next. We’re not just building for today’s tools. We’re building the foundation that developers will reach for every time something new emerges.

Cecilia is speaking about scaling MCP for enterprises at the MCP Dev Summit in NYC on 3rd of April, 2026. If you’re attending, be sure to stop by Docker’s booth (D/P9).

Learn more

Explore Docker’s MCP Catalog and Toolkit on our website.
Dive into our documentation to get started quickly.
Ready to go hands-on? Open Docker Desktop or the CLI and start using MCP to streamline and automate your development workflows.

Gordon (Beta): Docker’s AI Agent Just Got an Update

Srini Sekaran — Mon, 23 Feb 2026 14:13:00 +0000

AI agents are moving from demos to daily workflows. They write code, run commands, and complete multi-step tasks without constant hand-holding. But general-purpose agents don’t know Docker. They don’t understand your containers, your images, or your specific setup.

Gordon does. Just run docker ai in your terminal or try it in Docker Desktop.

Available today in Docker Desktop 4.61, still in beta, Gordon is an AI agent purpose-built for Docker. It has shell access, Docker CLI access, your filesystem, and deep knowledge of Docker best practices. Point it at a problem, approve its actions, and watch it work.

Figure 1: docker ai command launching Gordon in terminal interface

Figure 2: Gordon in Docker Desktop sidebar

Why Docker Needs Its Own Agent

When your container exits with code 137, Claude or ChatGPT will explain what OOM means. Gordon checks your container’s memory limit, inspects the logs, identifies the memory-hungry process, and proposes a fix. One approval, and it’s done.

When you need to containerize a Next.js app, Copilot might suggest a Dockerfile. Gordon examines your project structure, detects your dependencies, generates a production-ready Dockerfile with multi-stage builds, creates docker-compose.yml with the right services, and sets up your environment configs.

The difference is context and execution. Gordon knows what’s running on your machine. It can read your Docker state, access your filesystem, and take action. It’s not guessing – it’s working with your actual environment.

What Gordon Does

Debug and fix – Container won’t start. Service is unhealthy. Something is consuming all the memory. Gordon inspects logs, checks container status, identifies root cause, and proposes fixes. You approve, it executes.

Build and containerize – Take this application and make it run in Docker. Gordon examines your project, generates production-ready Dockerfiles with multi-stage builds, creates docker-compose.yml with the right services, handles environment configs and dependencies.

Execute and manage – Clean up disk space. Stop all containers. Pull and run specific images. Routine Docker operations should be conversational, not a trip to the docs.

Develop and optimize – Add health checks. Implement multi-stage builds. Apply security best practices. Reduce image sizes. Make existing Docker setups production-ready.

Gordon handles all of it.

Figure 3: Split screen showing Gordon debugging a mongodb container

How Gordon Works

Gordon is built on cagent, Docker’s agent framework included with Docker Desktop, and runs locally within Docker Desktop. It has access to:

Your shell – Can execute commands after approval
Your filesystem – Reads project structure, configs, logs

Docker CLI – Full access to Docker operations
Docker knowledge base – Documentation, best practices, common patterns

You can configure Gordon’s working directory to point to a specific codebase. This gives Gordon full context on your project structure, dependencies, and existing Docker setup.

The permission model is straightforward: Gordon shows you what it wants to do, you approve or reject, then it executes. Every command. Every file update. Every Docker operation. You’re not watching passively – you’re directing an agent that knows Docker inside and out.

Figure 4: Permissions request

Where to Find Gordon

Docker Desktop: Look for the Gordon icon in the left sidebar

CLI: Run docker ai from your terminal

Get started today

Download Docker Desktop 4.61+
Log in with your Docker account
Click the Gordon icon, select a project directory, and ask “Optimize my Dockerfile”
Explore the full documentation in Docker Docs

Gordon is available now in Docker Desktop 4.61 and later

The Multi-Model Database for AI Agents: Deploy SurrealDB with Docker Extension

Jennifer Kohl — Tue, 17 Feb 2026 14:00:00 +0000

When it comes to building dynamic and real-work solutions, developers need to stitch multiple databases (relational, document, graph, vector, time-series, search) together and build complex API layers to integrate them. This generates significant complexity, cost, and operational risk, and reduces speed of innovation. More often than not, developers end up focusing on building glue code and managing infrastructure rather than building application logic. For AI use cases, using multiple databases means AI Agents have fragmented data, context and memory, producing bad outputs at high latency.

Enter SurrealDB.

SurrealDB is a multi-model database built in Rust that unifies document, graph, relational, time-series, geospatial, key-value, and vector data into a single engine. Its SQL-like query language, SurrealQL, lets you traverse graphs, perform vector search, and query structured data – all in one statement.

Designed for data-intensive workloads like AI agent memory, knowledge graphs, real-time applications, and edge deployments, SurrealDB runs as a single binary anywhere: embedded in your app, in the browser via WebAssembly, at the edge, or as a distributed cluster.

What problem does SurrealDB solve?

Modern AI systems place very different demands on data infrastructure than traditional applications. SurrealDB addresses these pressures directly:

Single runtime for multiple data models – AI systems frequently combine vector search, graph traversal, document storage, real-time state, and relational data in the same request path. SurrealDB supports these models natively in one engine, avoiding brittle cross-database APIs, ETL pipelines, and consistency gaps.
Low-latency access to changing context – Voice agents, interactive assistants, and stateful agents are sensitive to both latency and data freshness. SurrealDB’s query model and real-time features serve up-to-date context without polling or background sync jobs.
Reduced system complexity – Replacing multiple specialized databases with a single multi-model store reduces services, APIs, and failure modes. This simplifies deployment, debugging, and long-term maintenance.
Faster iteration on data-heavy features – Opt in schemas definitions and expressive queries let teams evolve data models alongside AI features without large migrations. This is particularly useful when experimenting with embeddings, relationships, or agent memory structures.
Built-in primitives for common AI patterns – Native support for vectors, graphs, and transactional consistency enables RAG, graph-augmented retrieval, recommendation pipelines, and agent state management – without external systems or custom glue code.

In this article, you’ll see how to build a WhatsApp RAG chatbot using SurrealDB Docker Extension. You’ll learn how SurrealDB Docker Extension powers an intelligent WhatsApp chatbot that turns your chat history into searchable, AI-enhanced conversations with vector embeddings and precise source citations.

Understanding SurrealDB Architecture

SurrealDB’s architecture unifies multiple data models within a single database engine, eliminating the need for separate systems and synchronization logic (figure below).

Caption: SurrealDB Architecture diagram

Caption: Architecture diagram of SurrealDB showing a unified multi-model database with real-time capabilities. (more information at https://surrealdb.com/docs/surrealdb/introduction/architecture)

With SurrealDB, you can:

Model complex relationships using graph traversal syntax (e.g., ->bought_together->product)
Store flexible documents alongside structured relational tables
Subscribe to real-time changes with LIVE SELECT queries that push updates instantly
Ensure data consistency with ACID-compliant transactions across all models

Learn more about SurrealDB’s architecture and key features on the official documentation.

How does Surreal work?

SurrealDB separates storage from compute, enabling you to scale these independently without the need to manually shard your data.

The query layer (otherwise known as the compute layer) handles queries from the client, analyzing which records need to be selected, created, updated, or deleted.

The storage layer handles the storage of the data for the query layer. By scaling storage nodes, you are able to increase the amount of supported data for each deployment.

SurrealDB supports all the way from single-node to highly scalable fault-tolerant deployments with large amounts of data.

For more information, see https://surrealdb.com/docs/surrealdb/introduction/architecture.

Why should you run SurrealDB as a Docker Extension

For developers already using Docker Desktop, running SurrealDB as an extension eliminates friction. There’s no separate installation, no dependency management, no configuration files – just a single click from the Extensions Marketplace.

Docker provides the ideal environment to bundle and run SurrealDB in a lightweight, isolated container. This encapsulation ensures consistent behavior across macOS, Windows, and Linux, so what works on your laptop works identically in staging.

The Docker Desktop Extension includes:

Visual query editor with SurrealQL syntax highlighting
Real-time data explorer showing live updates as records change
Schema visualization for tables and relationships
Connection management to switch between local and remote instances
Built-in backup/restore for easy data export and import

With Docker Desktop as the only prerequisite, you can go from zero to a running SurrealDB instance in under a minute.

Getting Started

To begin, download and install Docker Desktop on your machine. Then follow these steps:

Open Docker Desktop and select Extensions in the left sidebar
Switch to the Browse tab
In the Filters dropdown, select the Database category
Find SurrealDB and click Install

Caption: Installing the SurrealDB Extension from Docker Desktop’s Extensions Marketplace.

Real-World Example

Smart Team Communication Assistant

Imagine searching through months of team WhatsApp conversations to answer the question: “What did we decide about the marketing campaign budget?”

Traditional keyword search fails, but RAG with SurrealDB and LangChain solves this by combining semantic vector search with relationship graphs.

This architecture analyzes group chats (WhatsApp, Instagram, Slack) by storing conversations as vector embeddings while simultaneously building a knowledge graph linking conversations through extracted keywords like “budget,” “marketing,” and “decision.” When queried, the system retrieves relevant context using both similarity matching and graph traversal, delivering accurate answers about past discussions, decisions, and action items even when phrased differently than the original conversation.

This project is inspired by Multi-model RAG with LangChain | GitHub Example

1. Clone the repository:

git clone https://github.com/Raveendiran-RR/surrealdb-rag-demo

2. Enable Docker Model Runner by visiting Docker Desktop > Settings > AI

Caption: Enable Docker Model Runner in Docker Desktop > settings > AI

3. Pull llama3.2 model from Docker Hub

Search for llama 3.2 under Models > Docker Hub and pull the right model.

Caption: Pull the Docker model llama3.2

4. Download the embeddinggemma model from Docker Hub

Caption: Click on Models > Search for embeddinggemma > download the model

5. Run this command to connect to the persistent surrealDB container

Browse to the directory where you have cloned the repository
Create directory “mydata”

mkdir -p mydata

6. Run this command:

docker run -d --name demo_data \
  -p 8002:8000 \
  -v "$(pwd)/mydata:/mydata" \
  surrealdb/surrealdb:latest \
  start --log debug --user root --pass root \
  rocksdb://mydata

Note: use the path based on the operating system.

For windows , use rocksdb://mydata
For linux and macOS, use rocksdb:/mydata

7. Open SurrealDB Docker Extension and connect with SurrealDB.

Caption: Connecting to SurrealDB through Docker Desktop Extension

Connection name: RAGBot

Remote address: http://localhost:8002

Username: root | password: root

Click on Create Connection

8. Run the setup instructions

9. Upload the whatsapp chat

Start the UI for the RAG bot (http://localhost:8080)

Caption: Create connection to the SurrealDB Docker container

10. Start chatting with the RAG bot and have fun

11. We can verify the correctness data in SurrealDB list

Ensure that you connect to the right namespace (whatsapp) and database (chats)

python3 load_whatsapp.py
python3 rag_chat_ui.py

Caption: connect to the “whatsapp” namespace and “chats” database

Caption: Data stored as vectors in SurrealDB

Caption: Interact with the RAG bot UI where it gives you the answer and exact reference for it

Using this chat bot, now you can get information about the chat.txt file that was ingested. You can also verify the information in the query editor as shown below when you can run custom queries to validate the results from the chat bot. You can ingest new messages through the load_whatsapp.py file, please ensure that the message format is same as in the sample whatsChatExport.txt file.

Learn more about SurrealQL here.

Caption: SurrealDB Query editor in the Docker Desktop Extension

Conclusion

The SurrealDB Docker Extension offers an accessible and powerful solution for developers building data-intensive applications – especially those working with AI agents, knowledge graphs, and real-time systems. Its multi-model architecture eliminates the need to stitch together separate databases, letting you store documents, traverse graphs, query vectors, and subscribe to live updates from a single engine.

With Docker Desktop integration, getting started takes seconds rather than hours. No configuration files, no dependency management – just install the extension and start building. The visual query editor and real-time data explorer make it easy to prototype schemas, test queries, and inspect data as it changes.

Whether you’re building agent memory systems, real-time recommendation engines, or simply looking to consolidate a sprawling database stack, SurrealDB’s Docker Extension provides an intuitive path forward. Install it today and see how a unified data layer can simplify your architecture.

If you have questions or want to connect with other SurrealDB users, join the SurrealDB community on Discord.

Learn More

Install the SurrealDB Docker Extension
Get the latest release of Docker Desktop
SurrealDB documentation
Vote on what’s next! Check out our public roadmap
Have questions? The Docker community is here to help
New to Docker? Get started

Get Started with the Atlassian Rovo MCP Server Using Docker

Yiwen Xu — Wed, 04 Feb 2026 13:52:53 +0000

We’re excited to announce that the remote Atlassian Rovo MCP server is now available in Docker’s MCP Catalog and Toolkit, making it easier than ever to connect AI assistants to Jira and Confluence. With just a few clicks, technical teams can use their favorite AI agents to create and update Jira issues, epics, and Confluence pages without complex setup or manual integrations.

In this post, we’ll show you how to get started with the Atlassian remote MCP server in minutes and how to use it to automate everyday workflows for product and engineering teams.

Figure 1: Discover over 300+ MCP servers including the remote Atlassian MCP server in Docker MCP Catalog.

What is the Atlassian Rovo MCP Server?

Like many teams, we rely heavily on Atlassian tools, especially Jira to plan, track, and ship product and engineering work. The Atlassian Rovo MCP server enables AI assistants and agents to interact directly with Jira and Confluence, closing the gap between where work happens and how teams want to use AI.

With the Atlassian Rovo MCP server, you can:

Create and update Jira issues and epics
Generate and edit Confluence pages
Use your preferred AI assistant or agent to automate everyday workflows

Traditionally, setting up and configuring MCP servers can be time-consuming and complex. Docker removes that friction, making it easy to get up and running securely in minutes.

Enable the Atlassian Rovo MCP Server with One Click

Docker’s MCP Catalog is a curated collection of 300+ MCP servers, including both local and remote options. It provides a reliable starting point for developers building with MCP so you don’t have to wire everything together yourself.

Prerequisites

Before you begin, make sure you have:
A machine with 8GB RAM minimum, ideally 16GB
Install Docker Desktop

To get started with the Atlassian remote MCP server:

Open Docker Desktop and click on the MCP Toolkit tab.
Navigate to Docker MCP Catalog
Search for the Atlassian Rovo MCP server.
Select the remote version with cloud icon
Enable it with a single click

That’s it. No manual installs. No dependency wrangling.

Why use the Atlassian Rovo MCP server with Docker

Demo by Cecilia Liu: Set up the Atlassian Rovo MCP server with Docker with just a few clicks and use it to generate Jira epics with Claude Desktop

Seamless Authentication with Built-in OAuth

The Atlassian Rovo MCP server uses Docker’s built-in OAuth, so authorization is seamless. Docker securely manages your credentials and allows you to reuse them across multiple MCP clients. You authenticate once, and you’re good to go.

Behind the scenes, this frictionless experience is powered by the MCP Toolkit, which handles environment setup and dependency management for you.

Works with Your Favorite AI Agent

Once the Atlassian Rovo MCP server is enabled, you can connect it to any MCP-compatible client.

For popular clients like Claude Desktop, Claude Code, Codex, or Gemini CLI, connecting is just one click. Just click Connect, restart Claude Desktop, and now we’re ready to go.

From there, we can ask Claude to:

Write a short PRD about MCP
Turn that PRD into Jira epics and stories
Review the generated epics and confirm they’re correct

And just like that, Jira is updated.

One Setup, Any MCP Client

Sometimes AI assistants have hiccups. Maybe you hit a daily usage limit in one tool. That’s not a blocker here.

Because the Atlassian Rovo MCP server is connected through the Docker MCP Toolkit, the setup is completely client-agnostic. Switching to another assistant like Gemini CLI or Cursor is as simple as clicking Connect. No need for reconfiguration or additional setup!

Now we can ask any connected AI assistant such as Gemini CLI to, for example, check all new unassigned Jira tickets. It just works.

Coming Soon: Share Atlassian-Based Workflows Across Teams

We’re working on new enhancements that will make Atlassian-powered workflows even more powerful and easy to share. Soon, you’ll be able to package complete workflows that combine MCP servers, clients, and configurations. Imagine a workflow that turns customer feedback into Jira tickets using Atlassian and Confluence, then shares that entire setup instantly with your team or across projects. That’s where we’re headed.

Frequently Asked Questions (FAQ)

What is the Atlassian Rovo MCP server?

The Atlassian MCP Rovo server enables AI assistants and agents to securely interact with Jira and Confluence. It allows AI tools to create and update Jira issues and epics, generate and edit Confluence pages, and automate everyday workflows for product and engineering teams.

How do I use the Atlassian Rovo MCP server with Docker?

You can enable the Atlassian Rovo MCP server directly from Docker Desktop or CLI. Simply open the MCP Toolkit tab, search for the Atlassian MCP server, select the remote version, and enable it with one click. Connect to any MCP-compatible client. For popular tools like Claude Code, Codex, and Gemini, setup is even easier with one-click integration.

Why use Docker to run the Atlassian Rovo MCP server?

Using Docker to run the Atlassian Rovo MCP server removes the complexity of setup, authentication, and client integration. Docker provides one-click enablement through the MCP Catalog, built-in OAuth for secure credential management, and a client-agnostic MCP Toolkit that lets teams connect any AI assistant or agent without reconfiguration so you can focus on automating Jira and Confluence workflows instead of managing infrastructure.

Less Setup. Less Context Switching. More Work Shipped.

That’s how easy it is to set up and use the Atlassian Rovo MCP server with Docker. By combining the MCP Catalog and Toolkit, Docker removes the friction from connecting AI agents to the tools teams already rely on.

Learn more

Get started with MCP Catalog and Toolkit
Explore the Docker MCP Catalog: Discover containerized, security-hardened MCP servers
Read more about the Docker MCP Toolkit: Official Documentation

Run Claude Code Locally with Docker Model Runner

Yiwen Xu — Mon, 26 Jan 2026 13:11:59 +0000

We recently showed how to pair OpenCode with Docker Model Runner for a privacy-first, cost-effective AI coding setup. Today, we’re bringing the same approach to Claude Code, Anthropic’s agentic coding tool.

This post walks through how to configure Claude Code to use Docker Model Runner, giving you full control over your data, infrastructure, and spend.

Figure 1: Using local models like gpt-oss to power Claude Code

What Is Claude Code?

Claude Code is Anthropic’s command-line tool for agentic coding. It lives in your terminal, understands your codebase, and helps you code faster by executing routine tasks, explaining complex code, and handling git workflows through natural language commands.

Docker Model Runner (DMR) allows you to run and manage large language models locally. It exposes an Anthropic-compatible API, making it straightforward to integrate with tools like Claude Code.

Install Claude Code

Install Claude Code:
macOS / Linux:

curl -fsSL https://claude.ai/install.sh | bash

Windows PowerShell:

irm https://claude.ai/install.ps1 | iex

Using Claude Code with Docker Model Runner

Claude Code supports custom API endpoints through the ANTHROPIC_BASE_URL environment variable. Since Docker Model Runner exposes an Anthropic-compatible API, integrating the two is simple.

Note for Docker Desktop users:
If you are running Docker Model Runner via Docker Desktop, make sure TCP access is enabled:

docker desktop enable model-runner --tcp

Once enabled, Docker Model Runner will be accessible at http://localhost:12434.

Increasing Context Size

For coding tasks, context length matters. While models like glm-4.7-flash, qwen3-coder and devstral-small-2 come with 128K context by default, gpt-oss defaults to 4,096 tokens.

Docker Model Runner makes it easy to repackage any model with an increased context size:

docker model pull gpt-oss
docker model package --from ai/gpt-oss --context-size 32000 gpt-oss:32k

Once packaged, use it with Claude Code:

ANTHROPIC_BASE_URL=http://localhost:12434 claude --model gpt-oss:32k

ANTHROPIC_BASE_URL=http://localhost:12434 claude --model gpt-oss "Describe this repo."

That’s it. Claude Code will now send all requests to your local Docker Model Runner instance.

Run Claude Code locally with gpt-oss using Docker Model Runner

Here’s what it looks like in action:

Figure 2: Claude Code powered by Docker Model Runner keeps everything local and in your control

$ docker model pull gpt-oss

$ docker model package --from ai/gpt-oss --context-size 32000 gpt-oss:32k

$ ANTHROPIC_BASE_URL=http://localhost:12434 claude --model gpt-oss:32k "Describe this repo."

 ▐▛███▜▌   Claude Code v2.1.12
▝▜█████▛▘  gpt-oss:32k · API Usage Billing
  ▘▘ ▝▝    ~/workspace/model-runner

❯ Describe this repo.

 Read(README.md)
  ⎿  Read 423 lines

 Docker Model Runner (DMR)
  - A Go-based tool that lets developers pull, run, and serve AI models 
    directly from Docker Hub or any OCI-compatible registry.
  - Consists of two main components:
    a. model-runner – a server/daemon that manages model lifecycle, 
       inference, and metrics.
    b. model-cli – a CLI plugin (docker model) that communicates with 
       the server, listing models, pulling new ones, and executing inference.

  - Supports multiple inference backends: llama.cpp, vLLM, and others.
  - Built with Docker Desktop in mind, but also works with Docker Engine 
    on Linux.
  - Provides a REST API for model management and inference, plus a 
    /metrics endpoint for Prometheus-style monitoring.

Claude Code reads your repository, reasons about its structure, and provides an accurate summary, all while keeping your code entirely on your local machine.

Monitor the requests sent by Claude Code

Want to see exactly what Claude Code sends to Docker Model Runner? Use the docker model requests command:

docker model requests --model gpt-oss:32k | jq .

Figure 3: Monitor requests sent by Claude Code to the LLM

This outputs the raw requests, which is useful for understanding how Claude Code communicates with the model and debugging any compatibility issues.

Making It Persistent
For convenience, set the environment variable in your shell profile:

# Add to ~/.bashrc, ~/.zshrc, or equivalent
export ANTHROPIC_BASE_URL=http://localhost:12434

Then simply run:

claude --model gpt-oss:32k "Describe this repo."

How You Can Get Involved

The strength of Docker Model Runner lies in its community, and there’s always room to grow. To get involved:

Star the repository: Show your support by starring the Docker Model Runner repo.
Contribute your ideas: Create an issue or submit a pull request. We’re excited to see what ideas you have!
Spread the word: Tell your friends and colleagues who might be interested in running AI models with Docker.

We’re incredibly excited about this new chapter for Docker Model Runner, and we can’t wait to see what we can build together. Let’s get to work!

Learn More

Read the companion post: OpenCode with Docker Model Runner for Private AI Coding
Check out the Docker Model Runner General Availability announcement
Visit our Model Runner GitHub repo
Get started with a simple hello GenAI application
Learn more about Claude Code from Anthropic’s documentation

Using MCP Servers: From Quick Tools to Multi-Agent Systems

Yiwen Xu — Thu, 22 Jan 2026 19:35:33 +0000

Model Context Protocol (MCP) servers are a spec for exposing tools, models, or services to language models through a common interface. Think of them as smart adapters: they sit between a tool and the LLM, speaking a predictable protocol that lets the model interact with things like APIs, databases, and agents without needing to know implementation details.

But like most good ideas, the devil’s in the details.

The Promise—and the Problems of Running MCP Servers

Running an MCP sounds simple: spin up a Python or Node server that exposes your tool. Done, right? Not quite.

You run into problems fast:

Runtime friction: If an MCP is written in Python, your environment needs Python (plus dependencies, plus maybe a virtualenv strategy, plus maybe GPU drivers). Same goes for Node. This multiplies fast when you’re managing many MCPs or deploying them across teams.
Secrets management: MCPs often need credentials (API keys, tokens, etc.). You need a secure way to store and inject those secrets into your MCP runtime. That gets tricky when different teams, tools, or clouds are involved.
N×N integration pain: Let’s say you’ve got three clients that want to consume MCPs, and five MCPs to serve up. Now you’re looking at 15 individual integrations. No thanks.

To make MCPs practical, you need to solve these three core problems: runtime complexity, secret injection, and client-to-server wiring.

If you’re wondering where I’m going with all this, take a look at those problems. We already have a technology that has been used by developers for over a decade that helps solve them: Docker containers.

In the rest of this blog I’ll walk through three different approaches, going from least complex to most complex, for integrating MCP servers into your developer experience.

Option 1 — Docker MCP Toolkit & Catalog

For the developer who already uses containers and wants a low-friction way to start with MCP.

If you’re already comfortable with Docker but just getting your feet wet with MCP, this is the sweet spot. In the raw MCP world, you’d clone Python/Node servers, manage runtimes, inject secrets yourself, and hand-wire connections to every client. That’s exactly the pain Docker’s MCP ecosystem set out to solve.

Docker’s MCP Catalog is a curated, containerized registry of MCP servers. Each entry is a prebuilt container with everything you need to run the MCP server.

The MCP Toolkit (available via Docker Desktop) is your control panel: search the catalog, launch servers with secure defaults, and connect them to clients.

How it helps:

No language runtimes to install
Built-in secrets management
One-click enablement via Docker Desktop
Easily wire the MCPs to your existing agents (Claude Desktop, Copilot in VS Code, etc)
Centralized access via the MCP Gateway

Figure 1: Docker MCP Catalog: Browse hundreds of MCP servers with filters for local or remote and clear distinctions between official and community servers

A Note on the MCP Gateway
One important piece working behind the scenes in both the MCP Toolkit and cagent (a framework for easily building multi-agent applications that we cover below) is the MCP Gateway, an open-source project from Docker that acts as a centralized frontend for all your MCP servers. Whether you’re using a GUI to start containers or defining agents in YAML, the Gateway handles all the routing, authentication, and translation between clients and tools. It also exposes a single endpoint that custom apps or agent frameworks can call directly, making it a clean bridge between GUI-based workflows and programmatic agent development.

Moving on: Using MCP servers alongside existing AI agents is often the first step for many developers. You wire up a couple tools, maybe connect to a calendar or a search API, and use them in something like Claude, ChatGPT, or a small custom agent. For step-by-step tutorials on how to automate dev workflows with Docker’s MCP Catalog and Toolkit with popular clients, check out these guides on ChatGPT, Claude Desktop,Codex, Gemini CLI, and Claude Code.
Once that pattern clicks, the next logical step is to use those same MCP servers as tools inside a multi-agent system.

Option 2 — cagent: Declarative Multi-Agent Apps

For the developer who wants to build custom multi-agent applications but isn’t steeped in traditional agentic frameworks.

If you’re past simple MCP servers and want agents that can delegate, coordinate, and reason together, cagent is your next step. It’s Docker’s open-source, YAML-first framework for defining and running multi-agent systems—without needing to dive into complex agent SDKs or LLM loop logic.

Cagent lets you describe:

The agents themselves (model, role, instructions)
Who delegates to whom
What tools each agent can access (via MCP or local capabilities)

Below is an example of a pirate flavored chat bot:

agents:
  root:
    description: An agent that talks like a pirate
    instruction: Always answer by talking like a pirate.
    welcome_message: |
      Ahoy! I be yer pirate guide, ready to set sail on the seas o' knowledge! What be yer quest? 
    model: auto


cagent run agents.yaml

You don’t write orchestration code. You describe what you want, and Cagent runs the system.

Why it works:

Tools are scoped per agent
Delegation is explicit
Uses MCP Gateway behind the scene
Ideal for building agent systems without writing Python

If you’d like to give cagent a try, we have a ton of examples in the project’s GitHub repository. Check out this guide on building multi-agent systems in 5 minutes.

Option 3 — Traditional Agent Frameworks (LangGraph, CrewAI, ADK)

For developers building complex, custom, fully programmatic agent systems.

Traditional agent frameworks like LangGraph, CrewAI, or Google’s Agent Development Kit (ADK) let you define, control, and orchestrate agent behavior directly in code. You get full control over logic, state, memory, tools, and workflows.

They shine when you need:

Complex branching logic
Error recovery, retries, and persistence
Custom memory or storage layers
Tight integration with existing backend code

Example: LangGraph + MCP via Gateway


import requests
from langgraph.graph import StateGraph
from langchain.agents import Tool
from langchain_openai import ChatOpenAI

# Discover MCP endpoint from Gateway
resp = requests.get("http://localhost:6600/v1/servers")
servers = resp.json()["servers"]
duck_url = next(s["url"] for s in servers if s["name"] == "duckduckgo")

# Define a callable tool
def mcp_search(query: str) -> str:
    return requests.post(duck_url, json={"input": query}).json()["output"]

search_tool = Tool(name="web_search", func=mcp_search, description="Search via MCP")

# Wire it into a LangGraph loop
llm = ChatOpenAI(model="gpt-4")
graph = StateGraph()
graph.add_node("agent", llm.bind_tools([search_tool]))
graph.add_edge("agent", "agent")
graph.set_entry_point("agent")

app = graph.compile()
app.invoke("What’s the latest in EU AI regulation?")

In this setup, you decide which tools are available. The agent chooses when to use them based on context, but you’ve defined the menu.
And yes, this is still true in the Docker MCP Toolkit: you decide what to enable. The LLM can’t call what you haven’t made visible.

Choosing the Right Approach

Approach	Best For	You Manage	You Get
Docker MCP Toolkit + Catalog	Devs new to MCP, already using containers	Tool selection	One-click setup, built-in secrets, Gateway integration
Cagent	YAML-based multi-agent apps without custom code	Roles & tool access	Declarative orchestration, multi-agent workflows
LangGraph / CrewAI / ADK	Complex, production-grade agent systems	Full orchestration	Max control over logic, memory, tools, and flow

Wrapping Up
Whether you’re just connecting a tool to Claude, designing a custom multi-agent system, or building production workflows by hand, Docker’s MCP tooling helps you get started easily and securely.

Check out the Docker MCP Toolkit, cagent, and MCP Gateway for example code, docs, and more ways to get started.

Making (Very) Small LLMs Smarter

Yiwen Xu — Fri, 16 Jan 2026 13:47:26 +0000

Hello, I’m Philippe, and I am a Principal Solutions Architect helping customers with their usage of Docker. I started getting seriously interested in generative AI about two years ago. What interests me most is the ability to run language models (LLMs) directly on my laptop (For work, I have a MacBook Pro M2 max, but on a more personal level, I run LLMs on my personal MacBook Air M4 and on Raspberry Pis – yes, it’s possible, but I’ll talk about that another time).

Let’s be clear, reproducing a Claude AI Desktop or Chat GPT on a laptop with small language models is not possible. Especially since I limit myself to models that have between 0.5 and 7 billion parameters. But I find it an interesting challenge to see how far we can go with these small models. So, can we do really useful things with small LLMs? The answer is yes, but you need to be creative and put in a bit of effort.

I’m going to take a concrete use case, related to development (but in the future I’ll propose “less technical” use cases).

(Specific) Use Case: Code Writing Assistance

I need help writing code

Currently, I’m working in my free time on an open-source project, which is a Golang library for quickly developing small generative AI agents. It’s both to get my hands dirty with Golang and prepare tools for other projects. This project is called Nova; there’s nothing secret about it, you can find it here.

If I use Claude AI and ask it to help me write code with Nova: “I need a code snippet of a Golang Nova Chat agent using a stream completion.”

The response will be quite disappointing, because Claude doesn’t know Nova (which is normal, it’s a recent project). But Claude doesn’t want to disappoint me and will still propose something which has nothing to do with my project.

And it will be the same with Gemini.

So, you’ll tell me, give the “source code of your repository to feed” to Claude AI or Gemini. OK, but imagine the following situation: I don’t have access to these services, for various reasons. Some of these reasons could be confidentiality, the fact that I’m on a project where we don’t have the right to use the internet, for example. That already disqualifies Claude AI and Gemini. How can I get help writing code with a small local LLM? So as you guessed, with a local LLM. And moreover, a “very small” LLM.

Choosing a language model

When you develop a solution based on generative AI, the choice of language model(s) is crucial. And you’ll have to do a lot of technology watching, research, and testing to find the model that best fits your use case. And know that this is non-negligible work.

For this article (and also because I use it), I’m going to use hf.co/qwen/qwen2.5-coder-3b-instruct-gguf:q4_k_m, which you can find here. It’s a 3 billion parameter language model, optimized for code generation. You can install it with Docker Model Runner with the following command:

docker model pull hf.co/Qwen/Qwen2.5-Coder-3B-Instruct-GGUF:Q4_K_M

And to start chatting with the model, you can use the following command:

docker model run hf.co/qwen/qwen2.5-coder-3b-instruct-gguf:q4_k_m

Or use Docker Desktop:

So, of course, as you can see in the illustration above, this little “Qwen Coder” doesn’t know my Nova library either. But we’re going to fix that.

Feeding the model with specific information

For my project, I have a markdown file in which I save the code snippets I use to develop examples with Nova. You can find it here. For now, there’s little content, but it will be enough to prove and illustrate my point.

So I could add the entire content of this file to a user prompt that I would give to the model. But that will be ineffective. Indeed, small models have a relatively small context window. But even if my “Qwen Coder” was capable of ingesting all the content of my markdown file, it would have trouble focusing on my request and on what it should do with this information. So,

1st essential rule: when you use a very small LLM, the larger the content provided to the model, the less effective the model will be.
2nd essential rule: the more you keep the conversation history, the more the content provided to the model will grow, and therefore it will decrease the effectiveness of the model.

So, to work around this problem, I’m going to use a technique called RAG (Retrieval Augmented Generation). The principle is simple: instead of providing all the content to the model, we’re going to store this content in a “vector” type database, and when the user makes a request, we’re going to search in this database for the most relevant information based on the user’s request. Then, we’re going to provide only this relevant information to the language model. For this blog post, the data will be kept in memory (which is not optimal, but sufficient for a demonstration).

RAG?

There are already many articles on the subject, so I won’t go into detail. But here’s what I’m going to do for this blog post:

My snippets file is composed of sections: a markdown title (## snippet name), possibly a description in free text, and a code block (golang … ).
I’m going to split this file by sections into chunks of text (we also talk about “chunks”),
Then, for each section I’m going to create an “embedding” (vector representation of text == mathematical representation of the semantic meaning of the text) with the ai/embeddinggemma:latest model (a relatively small and efficient embedding model). Then I’m going to store these embeddings (and the associated text) in an in-memory vector database (a simple array of JSON objects).
If you want to learn more about embedding, please read this article:Run Embedding Models and Unlock Semantic Search with Docker Model Runner

Diagram of the vector database creation process:

Similarity search and user prompt construction

Once I have this in place, when I make a request to the language model (so hf.co/qwen/qwen2.5-coder-3b-instruct-gguf:q4_k_m), I’m going to:

Create an embedding of the user’s request with the embedding model.
Compare this embedding with the embeddings stored in the vector database to find the most relevant sections (by calculating the distance between the vector representation of my question and the vector representations of the snippets). This is called a similarity search.
From the most relevant sections (the most similar), I’ll be able to construct a user prompt that includes only the relevant information and my initial request.

Diagram of the search and user prompt construction process:

So the final user prompt will contain:

The system instructions. For example: “You are a helpful coding assistant specialized in Golang and the Nova library. Use the provided code snippets to help the user with their requests.”
The relevant sections were extracted from the vector database.
The user’s request.

Remarks:

I explain the principles and results, but all the source code (NodeJS with LangchainJS) used to arrive at my conclusions is available in this project
To calculate distances between vectors, I used cosine similarity (A cosine similarity score of 1 indicates that the vectors point in the same direction. A cosine similarity score of 0 indicates that the vectors are orthogonal, meaning they have no directional similarity.)
You can find the JavaScript function I used here:
And the piece of code that I use to split the markdown snippets file:
Warning: embedding models are limited by the size of text chunks they can ingest. So you have to be careful not to exceed this size when splitting the source file. And in some cases, you’ll have to change the splitting strategy (fixed-size chunk,s for example, with or without overlap)

Implementation and results, or creating my Golang expert agent

Now that we have the operating principle, let’s see how to put this into music with LangchainJS, Docker Model Runner, and Docker Agentic Compose.

Docker Agentic Compose configuration

Let’s start with the Docker Agentic Compose project structure:

services:
  golang-expert:
    build:
      context: .
      dockerfile: Dockerfile
    environment:
      TERM: xterm-256color

      HISTORY_MESSAGES: 2
      MAX_SIMILARITIES: 3
      COSINE_LIMIT: 0.45

      OPTION_TEMPERATURE: 0.0
      OPTION_TOP_P: 0.75
      OPTION_PRESENCE_PENALTY: 2.2

      CONTENT_PATH: /app/data

    volumes:
      - ./data:/app/data

    stdin_open: true   # docker run -i
    tty: true          # docker run -t

    configs:
      - source: system.instructions.md
        target: /app/system.instructions.md

    models:
      chat-model:
        endpoint_var: MODEL_RUNNER_BASE_URL
        model_var: MODEL_RUNNER_LLM_CHAT

      embedding-model:
        endpoint_var: MODEL_RUNNER_BASE_URL
        model_var: MODEL_RUNNER_LLM_EMBEDDING


models:
  chat-model:
    model: hf.co/qwen/qwen2.5-coder-3b-instruct-gguf:q4_k_m

  embedding-model:
    model: ai/embeddinggemma:latest

configs:
  system.instructions.md:
    content: |
      Your name is Bob (the original replicant).
      You are an expert programming assistant in Golang.
      You write clean, efficient, and well-documented code.
      Always:
      - Provide complete, working code
      - Include error handling
      - Add helpful comments
      - Follow best practices for the language
      - Explain your approach briefly

      Use only the information available in the provided data and your KNOWLEDGE BASE.

What’s important here is:

I only keep the last 2 messages in my conversation history, and I only select the 2 or 3 best similarities found at most (to limit the size of the user prompt):

HISTORY_MESSAGES: 2
MAX_SIMILARITIES: 3
COSINE_LIMIT: 0.45

You can adjust these values according to your use case and your language model’s capabilities.

The models section, where I define the language models I’m going to use:

models:
  chat-model:
    model: hf.co/qwen/qwen2.5-coder-3b-instruct-gguf:q4_k_m

  embedding-model:
    model: ai/embeddinggemma:latest

One of the advantages of this section is that it will allow Docker Compose to download the models if they’re not already present on your machine.

As well as the models section of the golang-expert service, where I map the environment variables to the models defined above:

models:
    chat-model:
    endpoint_var: MODEL_RUNNER_BASE_URL
    model_var: MODEL_RUNNER_LLM_CHAT

    embedding-model:
    endpoint_var: MODEL_RUNNER_BASE_URL
    model_var: MODEL_RUNNER_LLM_EMBEDDING

And finally, the system instructions configuration file:

configs:
    - source: system.instructions.md
    target: /app/system.instructions.md

Which I define a bit further down in the configs section:

configs:
  system.instructions.md:
    content: |
      Your name is Bob (the original replicant).
      You are an expert programming assistant in Golang.
      You write clean, efficient, and well-documented code.
      Always:
      - Provide complete, working code
      - Include error handling
      - Add helpful comments
      - Follow best practices for the language
      - Explain your approach briefly

      Use only the information available in the provided data and your KNOWLEDGE BASE.

You can, of course, adapt these system instructions to your use case. And also persist them in a separate file if you prefer.

Dockerfile

It’s rather simple:

FROM node:22.19.0-trixie

WORKDIR /app
COPY package*.json ./
RUN npm install
COPY *.js .

# Create non-root user
RUN groupadd --gid 1001 nodejs && \
    useradd --uid 1001 --gid nodejs --shell /bin/bash --create-home bob-loves-js

# Change ownership of the app directory
RUN chown -R bob-loves-js:nodejs /app

# Switch to non-root user
USER bob-loves-js

Now that the configuration is in place, let’s move on to the agent’s source code.

Golang expert agent source code, a bit of LangchainJS with RAG

The JavaScript code is rather simple (probably improvable, but functional) and follows these main steps:

1. Initial configuration

Connection to both models (chat and embeddings) via LangchainJS
Loading parameters from environment variables

2. Vector database creation (at startup)

Reading the snippets.md file
Splitting into sections (chunks)
Generating an embedding for each section
Storing in an in-memory vector database

3. Interactive conversation loop

The user asks a question
Creating an embedding of the question
Similarity search in the vector database to find the most relevant snippets
Construction of the final prompt with: history + system instructions + relevant snippets + question
Sending to the LLM and displaying the response in streaming
Updating the history (limited to the last N messages)

import { ChatOpenAI } from "@langchain/openai";
import { OpenAIEmbeddings} from '@langchain/openai';

import { splitMarkdownBySections } from './chunks.js'
import { VectorRecord, MemoryVectorStore } from './rag.js';


import prompts from "prompts";
import fs from 'fs';

// Define [CHAT MODEL] Connection
const chatModel = new ChatOpenAI({
  model: process.env.MODEL_RUNNER_LLM_CHAT || `ai/qwen2.5:latest`,
  apiKey: "",
  configuration: {
    baseURL: process.env.MODEL_RUNNER_BASE_URL || "http://localhost:12434/engines/llama.cpp/v1/",
  },
  temperature: parseFloat(process.env.OPTION_TEMPERATURE) || 0.0,
  top_p: parseFloat(process.env.OPTION_TOP_P) || 0.5,
  presencePenalty: parseFloat(process.env.OPTION_PRESENCE_PENALTY) || 2.2,
});


// Define [EMBEDDINGS MODEL] Connection
const embeddingsModel = new OpenAIEmbeddings({
    model: process.env.MODEL_RUNNER_LLM_EMBEDDING || "ai/embeddinggemma:latest",
    configuration: {
    baseURL: process.env.MODEL_RUNNER_BASE_URL || "http://localhost:12434/engines/llama.cpp/v1/",
        apiKey: ""
    }
})

const maxSimilarities = parseInt(process.env.MAX_SIMILARITIES) || 3
const cosineLimit = parseFloat(process.env.COSINE_LIMIT) || 0.45

// ----------------------------------------------------------------
//  Create the embeddings and the vector store from the content file
// ----------------------------------------------------------------

console.log("========================================================")
console.log(" Embeddings model:", embeddingsModel.model)
console.log(" Creating embeddings...")
let contentPath = process.env.CONTENT_PATH || "./data"

const store = new MemoryVectorStore();

let contentFromFile = fs.readFileSync(contentPath+"/snippets.md", 'utf8');
let chunks = splitMarkdownBySections(contentFromFile);
console.log(" Number of documents read from file:", chunks.length);


// -------------------------------------------------
// Create and save the embeddings in the memory vector store
// -------------------------------------------------
console.log(" Creating the embeddings...");

for (const chunk of chunks) {
  try {
    // EMBEDDING COMPLETION:
    const chunkEmbedding = await embeddingsModel.embedQuery(chunk);
    const vectorRecord = new VectorRecord('', chunk, chunkEmbedding);
    store.save(vectorRecord);

  } catch (error) {
    console.error(`Error processing chunk:`, error);
  }
}

console.log(" Embeddings created, total of records", store.records.size);
console.log();


console.log("========================================================")


// Load the system instructions from a file
let systemInstructions = fs.readFileSync('/app/system.instructions.md', 'utf8');

// ----------------------------------------------------------------
// HISTORY: Initialize a Map to store conversations by session
// ----------------------------------------------------------------
const conversationMemory = new Map()

let exit = false;

// CHAT LOOP:
while (!exit) {
  const { userMessage } = await prompts({
    type: "text",
    name: "userMessage",
    message: `Your question (${chatModel.model}): `,
    validate: (value) => (value ? true : "Question cannot be empty"),
  });

  if (userMessage == "/bye") {
    console.log(" See you later!");
    exit = true;
    continue
  }

  // HISTORY: Get the conversation history for this session
  const history = getConversationHistory("default-session-id")

  // ----------------------------------------------------------------
  // SIMILARITY SEARCH:
  // ----------------------------------------------------------------
  // -------------------------------------------------
  // Create embedding from the user question
  // -------------------------------------------------
  const userQuestionEmbedding = await embeddingsModel.embedQuery(userMessage);

  // -------------------------------------------------
  // Use the vector store to find similar chunks
  // -------------------------------------------------
  // Create a vector record from the user embedding
  const embeddingFromUserQuestion = new VectorRecord('', '', userQuestionEmbedding);

  const similarities = store.searchTopNSimilarities(embeddingFromUserQuestion, cosineLimit, maxSimilarities);

  let knowledgeBase = "KNOWLEDGE BASE:\n";

  for (const similarity of similarities) {
    console.log(" CosineSimilarity:", similarity.cosineSimilarity, "Chunk:", similarity.prompt);
    knowledgeBase += `${similarity.prompt}\n`;
  }

  console.log("\n Similarities found, total of records", similarities.length);
  console.log();
  console.log("========================================================")
  console.log()

  // -------------------------------------------------
  // Generate CHAT COMPLETION:
  // -------------------------------------------------

  // MESSAGES== PROMPT CONSTRUCTION:
  let messages = [
      ...history,
      ["system", systemInstructions],
      ["system", knowledgeBase],
      ["user", userMessage]
  ]

  let assistantResponse = ''
  // STREAMING COMPLETION:
  const stream = await chatModel.stream(messages);
  for await (const chunk of stream) {
    assistantResponse += chunk.content
    process.stdout.write(chunk.content);
  }
  console.log("\n");

  // HISTORY: Add both user message and assistant response to history
  addToHistory("default-session-id", "user", userMessage)
  addToHistory("default-session-id", "assistant", assistantResponse)

}

// Helper function to get or create a conversation history
function getConversationHistory(sessionId, maxTurns = parseInt(process.env.HISTORY_MESSAGES)) {
  if (!conversationMemory.has(sessionId)) {
    conversationMemory.set(sessionId, [])
  }
  return conversationMemory.get(sessionId)
}

// Helper function to add a message to the conversation history
function addToHistory(sessionId, role, content) {
  const history = getConversationHistory(sessionId)
  history.push([role, content])

  // Keep only the last maxTurns conversations
  const maxTurns = parseInt(process.env.HISTORY_MESSAGES) // Adjust this value based on your needs
  if (history.length > maxTurns * 2) { // *2 because each turn has user & assistant message
    history.splice(0, 2) // Remove oldest turn (user + assistant messages)
  }
}

You can find it here:

All that’s left is to launch to verify my hypotheses

In the project folder, run the following command:

docker compose up --build --no-log-prefix -d

Then connect to the container and launch the application:

docker compose exec golang-expert /bin/bash
node index.js

Nova Chat Agent with streaming completion

Once the application started, it’s time to ask our question to our Golang expert agent in development with the Nova library:

========================================================
 Embeddings model: ai/embeddinggemma:latest
 Creating embeddings...
 Number of documents read from file: 4
 Creating the embeddings...
 Embeddings created, total of records 4

========================================================
?  Your question (hf.co/qwen/qwen2.5-coder-3b-instruct-gguf:q4_k_m):  › I need a code snippet of a Golang Nova Chat agent using a stream completion

After a quick similarity search, our agent found two relevant snippets in the vector database. And it was able to provide me with complete, functional, and commented code to answer my request:

Certainly! Below is a complete, working Go code snippet for a Nova Chat agent that uses streaming completion. This example demonstrates how to set up an agent with the necessary configurations and generate a stream completion response.

```go
package main

import (
        "context"
        "fmt"

        "github.com/snipwise/nova/nova-sdk/agents"
        "github.com/snipwise/nova/nova-sdk/agents/chat"
        "github.com/snipwise/nova/nova-sdk/messages"
        "github.com/snipwise/nova/nova-sdk/messages/roles"
        "github.com/snipwise/nova/nova-sdk/models"
)

func main() {

        ctx := context.Background()

        // Create a new chat agent with streaming capabilities
        agent, err := chat.NewAgent(
                ctx,
                agents.Config{
                        EngineURL:          "http://localhost:12434/engines/llama.cpp/v1",
                        SystemInstructions: "You are Bob, a helpful AI assistant.",
                        KeepConversationHistory: true,
                },
                models.Config{
                        Name:        "ai/qwen2.5:1.5B-F16",
                        Temperature: models.Float64(0.8),
                },
        )
        if err != nil {
                panic(err)
        }

        // Generate a stream completion response
        result, err := agent.GenerateStreamCompletion(
                []messages.Message{
                        {Role: roles.User, Content: "Who is James T Kirk?"},
                },
                func(chunk string, finishReason string) error {
                        // Simple callback that receives strings only
                        if chunk != "" {
                                fmt.Print(chunk)
                        }
                        if finishReason == "stop" {
                                fmt.Println()
                        }
                        return nil
                },
        )
        if err != nil {
                panic(err)
        }

        // Print the final response and finish reason
        fmt.Println("Final Response:", result.Response)
        fmt.Println("Finish Reason:", result.FinishReason)
}

### Explanation:

1. **Imports**: The code imports necessary packages from the Nova SDK to interact with chat agents.

2. **Context**: A context is created using `context.Background()` to manage the lifecycle of requests.

3. **Agent Creation**:
   - An agent is created using `chat.NewAgent` with specified configurations:
     - `EngineURL`: The URL of the engine to use for generating text.
     - `SystemInstructions`: Instructions for the AI assistant, such as who Bob is.
     - `KeepConversationHistory`: Whether to keep a history of previous messages.

4. **Model Configuration**:
   - A model configuration is provided with:
     - `Name`: The name of the model to use.
     - `Temperature`: Controls the randomness of the output (higher values make responses more random).
     - `MaxTokens`: Limits the maximum number of tokens in the response.

5. **Stream Completion Generation**:
   - The `GenerateStreamCompletion` method is called with a list of messages and a callback function that processes each chunk of text as it arrives.
   - The callback prints chunks to the console until the finish reason is "stop", indicating the completion has been reached.

6. **Output**:
   - The final response and finish reason are printed after the stream completes.

This code snippet demonstrates how to set up a streaming chat agent using Nova, allowing you to receive text responses incrementally as they become available.

I then asked other questions.

Nova Structured Agent

This time, I asked for a structured agent: “I need a code snippet of a Nova Structured Agent.”

And the result was conclusive:

Extract from the response:

Certainly! Below is a complete, working example of a Nova Structured Agent in Go. This agent will generate structured data about a specified country based on the user's input.

```go
package main

import (
        "context"
        "fmt"

        "github.com/snipwise/nova/nova-sdk/agents"
        "github.com/snipwise/nova/nova-sdk/agents/structured"
        "github.com/snipwise/nova/nova-sdk/messages"
        "github.com/snipwise/nova/nova-sdk/messages/roles"
        "github.com/snipwise/nova/nova-sdk/models"
)

type Country struct {
        Name       string   `json:"name"`
        Capital    string   `json:"capital"`
        Population int      `json:"population"`
        Languages  []string `json:"languages"`
}

func main() {
        ctx := context.Background()

        // Create a new Structured Agent
        agent, err := structured.NewAgent[Country](
                ctx,
                agents.Config{
                        EngineURL: "http://localhost:12434/engines/llama.cpp/v1",
                        SystemInstructions: `
                                Your name is Bob.
                                You are an assistant that answers questions about countries around the world.
                        `,
                },
                models.Config{
                        Name:        "hf.co/menlo/jan-nano-gguf:q4_k_m",
                        Temperature: models.Float64(0.0),
                },
        )
        if err != nil {
                panic(err)
        }

        // Generate structured data about Canada
        response, finishReason, err := agent.GenerateStructuredData([]messages.Message{
                {Role: roles.User, Content: "Tell me about Canada."},
        })

        if err != nil {
                panic(err)
        }

        // Print the generated country information
        fmt.Println("Country Name:", response.Name)
        fmt.Println("Capital:", response.Capital)
        fmt.Println("Population:", response.Population)
        fmt.Println("Languages:", strings.Join(response.Languages, ", "))
}

### Explanation:

1. **Imports**: The code imports necessary packages from the Nova SDK to interact with agents and models.

2. **Country Struct**: A simple struct `Country` is defined to represent the data structure expected by the agent.

3. **Main Function**:
   - A context is created using `context.Background()`.
   - A new Structured Agent is created for generating structured data about countries.
   - The system instructions are set to provide a friendly greeting and instruction on answering questions about countries.
   - The model configuration specifies the engine URL, temperature, and other parameters.

4. **GenerateStructuredData Method**: This method generates structured data based on the user's input. It takes an array of messages as input, where each message has a role (e.g., `User`) and content (e.g., "Tell me about Canada").

5. **Error Handling**: The code includes error handling to ensure that any issues during agent creation or data generation are properly caught and reported.

6. **Output**: The generated country information is printed, including the name, capital, population, and languages.

This example demonstrates how to use a Nova Structured Agent to generate structured data about countries based on user input. You can modify the system instructions and model configuration as needed for your specific use case.

Then, I moved on to the next question.

Nova RAG Agent with vector store. Be careful with your similarity settings!

This time, I asked for a “RAG” agent: “I need a snippet of a Nova RAG agent with a vector store.”

And once again, I got a relevant response.

However, when I tried with this question (after restarting the agent to start from a clean base without conversation history): “I need a snippet of a Nova RAG agent.”

The similarity search returned no relevant results (because the words “vector store” were not present in the snippets). And the agent responded with generic code that had nothing to do with Nova or was using code from Nova Chat Agents.

There may be several possible reasons:

The embedding model is not suitable for my use case,
The embedding model is not precise enough,
The splitting of the code snippets file is not optimal (you can add metadata to chunks to improve similarity search, for example, but don’t forget that chunks must not exceed the maximum size that the embedding model can ingest).

In that case, there’s a simple solution that works quite well: you lower the similarity thresholds and/or increase the number of returned similarities. This allows you to have more results to construct the user prompt, but be careful not to exceed the maximum context size of the language model. And you can also do tests with other “bigger” LLMs (more parameters and/or larger context window).

In the latest version of the snippets file, I added a KEYWORDS: … line below the markdown titles to help with similarity search. Which greatly improved the results obtained.

Conclusion

Using “Small Language Models” (SLM) or “Tiny Language Models” (TLM) requires a bit of energy and thought to work around their limitations. But it’s possible to build effective solutions for very specific problems. And once again, always think about the context size for the chat model and how you’ll structure the information for the embedding model. And by combining several specialized “small agents”, you can achieve very interesting results. This will be the subject of future articles.

Learn more

Check out Docker Model Runner
Learn more about Docker Agentic Compose
Read more about embedding in our recent blog Run Embedding Models and Unlock Semantic Search with Docker Model Runner

Docker Model Runner now included with the Universal Blue family

Yiwen Xu — Tue, 16 Dec 2025 12:24:50 +0000

Running large language models (LLMs) and other generative AI models can be a complex, frustrating process of managing dependencies, drivers, and environments. At Docker, we believe this should be as simple as docker model run.

That’s why we built Docker Model Runner, and today, we’re thrilled to announce a new collaboration with Universal Blue. Thanks to the fantastic work of these contributors, Docker Model Runner is now included in OSes such as Aurora and Bluefin, giving developers a powerful, out-of-the-box AI development environment.

What is Docker Model Runner?

For those who haven’t tried it yet, Docker Model Runner is our new “it just works” experience for running generative AI models.

Our goal is to make running a model as simple as running a container.

Here’s what makes it great:

Simple UX: We’ve streamlined the process down to a single, intuitive command: docker model run .
Broad GPU Support: While we started with NVIDIA, we’ve recently added Vulkan support. This is a big deal—it means Model Runner works on pretty much any modern GPU, including AMD and Intel, making AI accessible to more developers than ever.
vLLM: Perform high-throughput inference with an NVIDIA GPU

The Perfect Home for Model Runner

If you’re new to it, Universal Blue is a family of next-generation, developer-focused Linux desktops. They provide modern, atomic, and reliable environments that are perfect for “cloud-native” workflows.

As Jorge Castro who leads developer relations at Cloud Native Computing Foundation explains, “Bluefin and Aurora are reference architectures for bootc, which is a CNCF Sandbox Project. They are just two examples showing how the same container pattern used by application containers can also apply to operating systems. Working with AI models is no different – one common set of tools, built around OCI standards.”

The team already ships Docker as a core part of its developer-ready experience. By adding Docker Model Runner to the default installation (specifically in the -dx mode for developers), they’ve created a complete, batteries-included AI development environment.

There’s no setup, no config. If you’re on Bluefin/Aurora, you just open a terminal and start running models.

Get Started Today

If you’re running the latest Bluefin LTS, you’re all set when you turn on developer mode. The Docker engine and Model Runner CLI are already installed and waiting for you. Aurora’s enablement instructions are documented here.

You can run your first model in seconds:

This command will download the model (if not already cached) and run it, ready for you to interact with.

If you’re on another Linux, you can get started just as easily. Just follow the instructions on our GitHub repository.

What’s Next?

This collaboration is a fantastic example of community-driven innovation. We want to give a huge shoutout to the greater bootc enthusiast community for their forward-thinking approach and for integrating Docker Model Runner so quickly.

This is just the beginning. We’re committed to making AI development accessible, powerful, and fun for all developers.

How You Can Get Involved

The strength of Docker Model Runner lies in its community, and there’s always room to grow. We need your help to make this project the best it can be. To get involved, you can:

Star the repository: Show your support and help us gain visibility by starring the Docker Model Runner repo.
Contribute your ideas: Have an idea for a new feature or a bug fix? Create an issue to discuss it. Or fork the repository, make your changes, and submit a pull request. We’re excited to see what ideas you have!
Spread the word: Tell your friends, colleagues, and anyone else who might be interested in running AI models with Docker.

We’re incredibly excited about this new chapter for Docker Model Runner, and we can’t wait to see what we can build together. Let’s get to work!

Docker Model Runner now supports vLLM on Windows

Yiwen Xu — Thu, 11 Dec 2025 16:12:39 +0000

Great news for Windows developers working with AI models: Docker Model Runner now supports vLLM on Docker Desktop for Windows with WSL2 and NVIDIA GPUs!

Until now, vLLM support in Docker Model Runner was limited to Docker Engine on Linux. With this update, Windows developers can take advantage of vLLM’s high-throughput inference capabilities directly through Docker Desktop, leveraging their NVIDIA GPUs for accelerated local AI development.

What is Docker Model Runner?

For those who haven’t tried it yet, Docker Model Runner is our new “it just works” experience for running generative AI models.

Our goal is to make running a model as simple as running a container.

Here’s what makes it great:

Simple UX: We’ve streamlined the process down to a single, intuitive command: docker model run .
Broad GPU Support: While we started with NVIDIA, we’ve recently added Vulkan support. This is a big deal—it means Model Runner works on pretty much any modern GPU, including AMD and Intel, making AI accessible to more developers than ever.
vLLM: Perform high-throughput inference with an NVIDIA GPU

What is vLLM?

vLLM is a high-throughput inference engine for large language models. It’s designed for efficient memory management of the KV cache and excels at handling concurrent requests with impressive performance. If you’re building AI applications that need to serve multiple requests or require high-throughput inference, vLLM is an excellent choice. Learn more here.

Prerequisites

Before getting started, make sure you have the prerequisites for GPU support:

Docker Desktop for Windows (starting with Docker Desktop 4.54)
WSL2 backend enabled in Docker Desktop
NVIDIA GPU with updated drivers with compute capability >= 8.0
GPU support configured in Docker Desktop

Getting Started

Step 1: Enable Docker Model Runner

First, ensure Docker Model Runner is enabled in Docker Desktop. You can do this through the Docker Desktop settings or via the command line:

docker desktop enable model-runner --no-tcp

Step 2: Install the vLLM Backend

In order to be able to use vLLM, install the vLLM runner with CUDA support:

docker model install-runner --backend vllm --gpu cuda

Step 3: Verify the Installation

Check that both inference engines are running:

docker model install-runner --backend vllm --gpu cuda

You should see output similar to:

Docker Model Runner is running

Status:
llama.cpp: running llama.cpp version: c22473b
vllm: running vllm version: 0.12.0

Step 4: Run a Model with vLLM

Now you can pull and run models optimized for vLLM. Models with the -vllm suffix on Docker Hub are packaged for vLLM:

docker model run ai/smollm2-vllm "Tell me about Docker."

Troubleshooting Tips

GPU Memory Issues

If you encounter an error like:

ValueError: Free memory on device (6.96/8.0 GiB) on startup is less than desired GPU memory utilization (0.9, 7.2 GiB).

You can configure the GPU memory utilization for a specific mode:

docker model configure --gpu-memory-utilization 0.7 ai/smollm2-vllm

This reduces the memory footprint, allowing the model to run alongside other GPU workloads.

Why This Matters

This update brings several benefits for Windows developers:

Production parity: Test with the same inference engine you’ll use in production
Unified workflow: Stay within the Docker ecosystem you already know
Local development: Keep your data private and reduce API costs during development

How You Can Get Involved

The strength of Docker Model Runner lies in its community, and there’s always room to grow. We need your help to make this project the best it can be. To get involved, you can:

Star the repository: Show your support and help us gain visibility by starring the Docker Model Runner repo.
Contribute your ideas: Have an idea for a new feature or a bug fix? Create an issue to discuss it. Or fork the repository, make your changes, and submit a pull request. We’re excited to see what ideas you have!
Spread the word: Tell your friends, colleagues, and anyone else who might be interested in running AI models with Docker.

We’re incredibly excited about this new chapter for Docker Model Runner, and we can’t wait to see what we can build together. Let’s get to work!

Announcing vLLM v0.12.0, Ministral 3 and DeepSeek-V3.2 for Docker Model Runner

Yiwen Xu — Fri, 05 Dec 2025 13:28:12 +0000

At Docker, we are committed to making the AI development experience as seamless as possible. Today, we are thrilled to announce two major updates that bring state-of-the-art performance and frontier-class models directly to your fingertips: the immediate availability of Mistral AI’s Ministral 3 and DeepSeek-V3.2, alongside the release of vLLM v0.12.0 on Docker Model Runner.

Whether you are building high-throughput serving pipelines or experimenting with edge-optimized agents on your laptop, today’s updates are designed to accelerate your workflow.

Meet Ministral 3: Frontier Intelligence, Edge Optimized

While vLLM powers your production infrastructure, we know that development needs speed and efficiency right now. That’s why we are proud to add Mistral AI’s newest marvel, Ministral 3, to the Docker Model Runner library on Docker Hub.

Ministral 3 is Mistral AI’s premier edge model. It packs frontier-level reasoning and capabilities into a dense, efficient architecture designed specifically for local inference. It is perfect for:

Local RAG applications: Chat with your docs without data leaving your machine.
Agentic Workflows: Fast reasoning steps for complex function-calling agents.
Low-latency prototyping: Test ideas instantly without waiting for API calls.

DeepSeek-V3.2: The Open Reasoning Powerhouse

We are equally excited to introduce support for DeepSeek-V3.2. Known for pushing the boundaries of what open-weights models can achieve, the DeepSeek-V3 series has quickly become a favorite for developers requiring high-level reasoning and coding proficiency.

DeepSeek-V3.2 brings Mixture-of-Experts (MoE) architecture efficiency to your local environment, delivering performance that rivals top-tier closed models. It is the ideal choice for:

Complex Code Generation: Build and debug software with a model specialized in programming tasks.
Advanced Reasoning: Tackle complex logic puzzles, math problems, and multi-step instructions.
Data Analysis: Process and interpret structured data with high precision.

Run Them with One Command

With Docker Model Runner, you don’t need to worry about complex environment setups, python dependencies, or weight downloads. We’ve packaged both models so you can get started immediately.

To run Ministral 3:

docker model run ai/ministral3

To run DeepSeek-V3.2:

docker model run ai/deepseek-v3.2-vllm

These commands automatically pull the model, set up the runtime, and drop you into an interactive chat session. You can also point your applications to them using our OpenAI-compatible local endpoint, making them drop-in replacements for your cloud API calls during development.

vLLM v0.12.0: Faster, Leaner, and Ready for What’s Next

We are excited to highlight the release of vLLM v0.12.0. vLLM has quickly become the gold standard for high-throughput and memory-efficient LLM serving, and this latest version raises the bar again.

Version 0.12.0 brings critical enhancements to the engine, including:

Expanded Model Support: Day-0 support for the latest architecture innovations, ensuring you can run the newest open-weights models (like DeepSeek V3.2 and Ministral 3) the moment they drop.
Optimized Kernels: Significant latency reductions for inference on NVIDIA GPUs, making your containerized AI applications snappier than ever.
Enhanced PagedAttention: Further optimizations to memory management, allowing you to batch more requests and utilize your hardware to its full potential.

Why This Matters

The combination of Ministral 3, DeepSeek-V3.2, and vLLM v0.12.0 represents the maturity of the open AI ecosystem.

You now have access to a serving engine that maximizes data center performance, alongside a choice of models to fit your specific needs—whether you prioritize the edge-optimized speed of Ministral 3 or the deep reasoning power of DeepSeek-V3.2. All of this is easily accessible via Docker Model Runner.

How You Can Get Involved

The strength of Docker Model Runner lies in its community, and there’s always room to grow. We need your help to make this project the best it can be. To get involved, you can:

Star the repository: Show your support and help us gain visibility by starring the Docker Model Runner repo.
Contribute your ideas: Have an idea for a new feature or a bug fix? Create an issue to discuss it. Or fork the repository, make your changes, and submit a pull request. We’re excited to see what ideas you have!
Spread the word: Tell your friends, colleagues, and anyone else who might be interested in running AI models with Docker.

We’re incredibly excited about this new chapter for Docker Model Runner, and we can’t wait to see what we can build together. Let’s get to work!