How to solve the context size issues with context packing with Docker Model Runner and Agentic Compose
https://www.docker.com/blog/context-packing-context-window/ (Fri, 13 Feb 2026)

If you’ve worked with local language models, you’ve probably run into the context window limit, especially when using smaller models on less powerful machines. While it’s an unavoidable constraint, techniques like context packing make it surprisingly manageable.

Hello, I’m Philippe, and I am a Principal Solutions Architect helping customers with their usage of Docker.  In my previous blog post, I wrote about how to make a very small model useful by using RAG. I had limited the message history to 2 to keep the context length short.

But in some cases, you’ll need to keep more messages in your history. For example, a long conversation to generate code:

- generate an http server in golang
- add a human structure and a list of humans
- add a handler to add a human to the list
- add a handler to list all humans
- add a handler to get a human by id
- etc...

Let’s imagine we have a conversation for which we want to keep 10 messages in the history. Moreover, we’re using a very verbose model (one that generates a lot of tokens), so we’ll quickly encounter this type of error:

error: {
    code: 400,
    message: 'request (8860 tokens) exceeds the available context size (8192 tokens), try increasing it',
    type: 'exceed_context_size_error',
    n_prompt_tokens: 8860,
    n_ctx: 8192
  },
  code: 400,
  param: undefined,
  type: 'exceed_context_size_error'
}


What happened?

Understanding context windows and their limits in local LLMs

Our LLM has a context window, which has a limited size. This means that once the conversation becomes too long, the request fails with an error like the one above.

This window is the total number of tokens the model can process at once, like a short-term working memory. Read this IBM article for a deep dive on context windows.

In the error above, this size was set to 8192 tokens, a common default for the engines that power local LLMs, such as Docker Model Runner, Ollama, and llama.cpp.

This window includes everything: system prompt, user message, history, injected documents, and the generated response. Refer to this Redis post for more info. 

Example: if the model has 32k context, the sum (input + history + generated output) must remain ≤ 32k tokens. Learn more here.  

It’s possible to change the default context size (up or down) in the compose.yml file:

models:
  chat-model:
    model: hf.co/qwen/qwen2.5-coder-3b-instruct-gguf:q4_k_m
    # Increased context size for better handling of larger inputs
    context_size: 16384

You can also do this with the Docker CLI using the following command: docker model configure --context-size 8192 ai/qwen2.5-coder

This solves the problem, but only partly. There is no guarantee that your model supports a larger context size (like 16384), and even when it does, a larger context can quickly degrade the model’s performance.

Thus, with hf.co/qwen/qwen2.5-coder-3b-instruct-gguf:q4_k_m, when the number of tokens in the context approaches 16384 tokens, generation can become (much) slower (at least on my machine). Again, this will depend on the model’s capacity (read its documentation). And remember, the smaller the model, the harder it will be to handle a large context and stay focused.

Tip: always provide an option (a /clear command, for example) in your application to empty or reduce the message list, whether automatic or manual. Keep the initial system instructions, though.

So we’re at an impasse. How can we go further with our small models?

Well, there is still a solution, which is called context packing.

Using context packing to fit more information into limited context windows

We can’t increase the context size indefinitely. To fit more information into the context anyway, we can use a technique called “context packing”: have the model itself (or another model) summarize the previous messages, then replace the history with that summary, freeing up space in the context.

So we decide that from a certain token limit, we’ll have the history of previous messages summarized, and replace this history with the generated summary.

I’ve therefore modified my example to add a context packing step. For the exercise, I decided to use another model to do the summarization.

Modification of the compose.yml file

I added a new model in the compose.yml file: ai/qwen2.5:1.5B-F16

models:
  chat-model:
    model: hf.co/qwen/qwen2.5-coder-3b-instruct-gguf:q4_k_m

  embedding-model:
    model: ai/embeddinggemma:latest

  context-packing-model:
    model: ai/qwen2.5:1.5B-F16

Then:

  • I added the model in the models section of the service that runs our program.
  • I increased the number of messages in the history to 10 (instead of 2 previously).
  • I set a token limit at 5120 before triggering context compression.
  • And finally, I defined instructions for the “context packing” model, asking it to summarize previous messages.

excerpt from the service:

golang-expert-v3:
  build:
    context: .
    dockerfile: Dockerfile
  environment:
    HISTORY_MESSAGES: 10
    TOKEN_LIMIT: 5120
    # ...

  configs:
    - source: system.instructions.md
      target: /app/system.instructions.md
    - source: context-packing.instructions.md
      target: /app/context-packing.instructions.md

  models:
    chat-model:
      endpoint_var: MODEL_RUNNER_BASE_URL
      model_var: MODEL_RUNNER_LLM_CHAT

    context-packing-model:
      endpoint_var: MODEL_RUNNER_BASE_URL
      model_var: MODEL_RUNNER_LLM_CONTEXT_PACKING

    embedding-model:
      endpoint_var: MODEL_RUNNER_BASE_URL
      model_var: MODEL_RUNNER_LLM_EMBEDDING

You’ll find the complete version of the file here: compose.yml

System instructions for the context packing model

Still in the compose.yml file, I added a new system instruction for the “context packing” model, in a context-packing.instructions.md file:

context-packing.instructions.md:
  content: |
    You are a context packing assistant.
    Your task is to condense and summarize provided content to fit within token limits while preserving essential information.
    Always:
    - Retain key facts, figures, and concepts
    - Remove redundant or less important details
    - Ensure clarity and coherence in the condensed output
    - Aim to reduce the token count significantly without losing critical information

    The goal is to help fit more relevant information into a limited context window for downstream processing.

All that’s left is to implement the context packing logic in the assistant’s code.

Applying context packing to the assistant’s code

First, I define the connection with the context packing model in the Setup part of my assistant:

const contextPackingModel = new ChatOpenAI({
  model: process.env.MODEL_RUNNER_LLM_CONTEXT_PACKING || `ai/qwen2.5:1.5B-F16`,
  apiKey: "",
  configuration: {
    baseURL: process.env.MODEL_RUNNER_BASE_URL || "http://localhost:12434/engines/llama.cpp/v1/",
  },
  temperature: 0.0,
  top_p: 0.9,
  presencePenalty: 2.2,
});

I also retrieve the system instructions I defined for this model, as well as the token limit:

let contextPackingInstructions = fs.readFileSync('/app/context-packing.instructions.md', 'utf8');

let tokenLimit = parseInt(process.env.TOKEN_LIMIT) || 7168

Once in the conversation loop, I’ll estimate the number of tokens consumed by previous messages, and if this number exceeds the defined limit, I’ll call the context packing model to summarize the history of previous messages and replace this history with the generated summary (the assistant-type message: [“assistant”, summary]). Then I continue generating the response using the main model.

excerpt from the conversation loop:

 let estimatedTokenCount = messages.reduce((acc, [role, content]) => acc + Math.ceil(content.length / 4), 0);
  console.log(` Estimated token count for messages: ${estimatedTokenCount} tokens`);

  if (estimatedTokenCount >= tokenLimit) {
    console.log(` Warning: Estimated token count (${estimatedTokenCount}) exceeds the model's context limit (${tokenLimit}). Compressing conversation history...`);

    // Calculate original history size
    const originalHistorySize = history.reduce((acc, [role, content]) => acc + Math.ceil(content.length / 4), 0);

    // Prepare messages for context packing
    const contextPackingMessages = [
      ["system", contextPackingInstructions],
      ...history,
      ["user", "Please summarize the above conversation history to reduce its size while retaining important information."]
    ];

    // Generate summary using context packing model
    console.log(" Generating summary with context packing model...");
    let summary = '';
    const summaryStream = await contextPackingModel.stream(contextPackingMessages);
    for await (const chunk of summaryStream) {
      summary += chunk.content;
      process.stdout.write('\x1b[32m' + chunk.content + '\x1b[0m');
    }
    console.log();

    // Calculate compressed size
    const compressedSize = Math.ceil(summary.length / 4);
    const reductionPercentage = ((originalHistorySize - compressedSize) / originalHistorySize * 100).toFixed(2);

    console.log(` History compressed: ${originalHistorySize} tokens → ${compressedSize} tokens (${reductionPercentage}% reduction)`);

    // Replace all history with the summary
    conversationMemory.set("default-session-id", [["assistant", summary]]);

    estimatedTokenCount = compressedSize

    // Rebuild messages with compressed history
    messages = [
      ["assistant", summary],
      ["system", systemInstructions],
      ["system", knowledgeBase],
      ["user", userMessage]
    ];
  }

You’ll find the complete version of the code here: index.js

All that’s left is to test our assistant and have it hold a long conversation, to see context packing in action.

docker compose up --build -d
docker compose exec golang-expert-v3 node index.js

And after a while in the conversation, you should see the warning message about the token limit, followed by the summary generated by the context packing model, and finally, the reduction in the number of tokens in the history:

Estimated token count for messages: 5984 tokens
Warning: Estimated token count (5984) exceeds the model's context limit (5120). Compressing conversation history...
Generating summary with context packing model...
Sure, here's a summary of the conversation:

1. The user asked for an example in Go of creating an HTTP server.
2. The assistant provided a simple example in Go that creates an HTTP server and handles GET requests to display "Hello, World!".
3. The user requested an equivalent example in Java.
4. The assistant presented a Java implementation that uses the `java.net.http` package to create an HTTP server and handle incoming requests.

The conversation focused on providing examples of creating HTTP servers in both Go and Java, with the goal of reducing the token count while retaining essential information.
History compressed: 4886 tokens → 153 tokens (96.87% reduction)

This way, we ensure that our assistant can handle a long conversation while maintaining good generation performance.

Summary

The context window is an unavoidable constraint when working with local language models, particularly with small models and on machines with limited resources. However, by using techniques like context packing, you can easily work around this limitation. Using Docker Model Runner and Agentic Compose, you can implement this pattern to support long, verbose conversations without overwhelming your model.

All the source code is available on Codeberg: context-packing. Give it a try! 

Making (Very) Small LLMs Smarter
https://www.docker.com/blog/making-small-llms-smarter/ (Fri, 16 Jan 2026)

Hello, I’m Philippe, and I am a Principal Solutions Architect helping customers with their usage of Docker. I started getting seriously interested in generative AI about two years ago. What interests me most is the ability to run language models (LLMs) directly on my laptop. (For work, I have a MacBook Pro M2 Max, but I also run LLMs on my personal MacBook Air M4 and on Raspberry Pis. Yes, it’s possible, but I’ll talk about that another time.)

Let’s be clear: reproducing Claude Desktop or ChatGPT on a laptop with small language models is not possible, especially since I limit myself to models with between 0.5 and 7 billion parameters. But I find it an interesting challenge to see how far we can go with these small models. So, can we do really useful things with small LLMs? The answer is yes, but you need to be creative and put in a bit of effort.

I’m going to take a concrete use case, related to development (but in the future I’ll propose “less technical” use cases).

(Specific) Use Case: Code Writing Assistance

I need help writing code

Currently, I’m working in my free time on an open-source project, which is a Golang library for quickly developing small generative AI agents. It’s both to get my hands dirty with Golang and prepare tools for other projects. This project is called Nova; there’s nothing secret about it, you can find it here.

If I use Claude AI and ask it to help me write code with Nova: “I need a code snippet of a Golang Nova Chat agent using a stream completion.”

tiny model fig 1

The response will be quite disappointing, because Claude doesn’t know Nova (which is normal, it’s a recent project). But Claude doesn’t want to disappoint me and will still propose something which has nothing to do with my project.

And it will be the same with Gemini.

tiny model fig 2

So, you’ll tell me, feed the source code of your repository to Claude AI or Gemini. OK, but imagine the following situation: I don’t have access to these services, for various reasons: confidentiality, for example, or a project where internet access isn’t allowed. That already disqualifies Claude AI and Gemini. So how can I get help writing code? As you guessed: with a local LLM. And moreover, a “very small” one.

Choosing a language model

When you develop a solution based on generative AI, the choice of language model(s) is crucial. You’ll have to do a lot of research, monitoring, and testing to find the model that best fits your use case. Know that this is non-negligible work.

For this article (and also because I use it), I’m going to use hf.co/qwen/qwen2.5-coder-3b-instruct-gguf:q4_k_m, which you can find here. It’s a 3 billion parameter language model, optimized for code generation. You can install it with Docker Model Runner with the following command:

docker model pull hf.co/Qwen/Qwen2.5-Coder-3B-Instruct-GGUF:Q4_K_M

And to start chatting with the model, you can use the following command:

docker model run hf.co/qwen/qwen2.5-coder-3b-instruct-gguf:q4_k_m

Or use Docker Desktop:

tiny model fig 3

So, of course, as you can see in the illustration above, this little “Qwen Coder” doesn’t know my Nova library either. But we’re going to fix that.

Feeding the model with specific information

For my project, I have a markdown file in which I save the code snippets I use to develop examples with Nova. You can find it here. For now, there’s little content, but it will be enough to prove and illustrate my point.

So I could add the entire content of this file to a user prompt that I would give to the model. But that would be ineffective. Indeed, small models have a relatively small context window. But even if my “Qwen Coder” were capable of ingesting all the content of my markdown file, it would have trouble focusing on my request and on what it should do with this information. So:

  • 1st essential rule: when you use a very small LLM, the larger the content provided to the model, the less effective the model will be.
  • 2nd essential rule: the more you keep the conversation history, the more the content provided to the model will grow, and therefore it will decrease the effectiveness of the model.

So, to work around this problem, I’m going to use a technique called RAG (Retrieval Augmented Generation). The principle is simple: instead of providing all the content to the model, we’re going to store this content in a “vector” type database, and when the user makes a request, we’re going to search in this database for the most relevant information based on the user’s request. Then, we’re going to provide only this relevant information to the language model. For this blog post, the data will be kept in memory (which is not optimal, but sufficient for a demonstration).

RAG?

There are already many articles on the subject, so I won’t go into detail. But here’s what I’m going to do for this blog post:

  1. My snippets file is composed of sections: a markdown title (## snippet name), possibly a description in free text, and a fenced Go code block.
  2. I’m going to split this file by sections into chunks of text.
  3. Then, for each section I’m going to create an “embedding” (a vector representation of the text, i.e., a mathematical representation of its semantic meaning) with the ai/embeddinggemma:latest model (a relatively small and efficient embedding model), and store these embeddings (and the associated text) in an in-memory vector database (a simple array of JSON objects).
  4. If you want to learn more about embeddings, please read this article: Run Embedding Models and Unlock Semantic Search with Docker Model Runner

Diagram of the vector database creation process:

tiny model fig 4

Similarity search and user prompt construction

Once I have this in place, when I make a request to the language model (so hf.co/qwen/qwen2.5-coder-3b-instruct-gguf:q4_k_m), I’m going to:

  1. Create an embedding of the user’s request with the embedding model.
  2. Compare this embedding with the embeddings stored in the vector database to find the most relevant sections (by calculating the distance between the vector representation of my question and the vector representations of the snippets). This is called a similarity search.
  3. From the most relevant sections (the most similar), I’ll be able to construct a user prompt that includes only the relevant information and my initial request.

Diagram of the search and user prompt construction process:

tiny model fig 5

So the final user prompt will contain:

  • The system instructions. For example: “You are a helpful coding assistant specialized in Golang and the Nova library. Use the provided code snippets to help the user with their requests.”
  • The relevant sections extracted from the vector database.
  • The user’s request.

Remarks:

  • I explain the principles and results, but all the source code (NodeJS with LangchainJS) used to arrive at my conclusions is available in this project 
  • To calculate distances between vectors, I used cosine similarity (A cosine similarity score of 1 indicates that the vectors point in the same direction. A cosine similarity score of 0 indicates that the vectors are orthogonal, meaning they have no directional similarity.)
  • You can find the JavaScript function I used here
  • And the piece of code that I use to split the markdown snippets file
  • Warning: embedding models are limited by the size of text chunks they can ingest. So you have to be careful not to exceed this size when splitting the source file. In some cases, you’ll have to change the splitting strategy (fixed-size chunks, for example, with or without overlap).

Implementation and results, or creating my Golang expert agent

Now that we have the operating principle, let’s see how to put this into practice with LangchainJS, Docker Model Runner, and Docker Agentic Compose.

Docker Agentic Compose configuration

Let’s start with the Docker Agentic Compose project structure:

services:
  golang-expert:
    build:
      context: .
      dockerfile: Dockerfile
    environment:
      TERM: xterm-256color

      HISTORY_MESSAGES: 2
      MAX_SIMILARITIES: 3
      COSINE_LIMIT: 0.45

      OPTION_TEMPERATURE: 0.0
      OPTION_TOP_P: 0.75
      OPTION_PRESENCE_PENALTY: 2.2

      CONTENT_PATH: /app/data

    volumes:
      - ./data:/app/data

    stdin_open: true   # docker run -i
    tty: true          # docker run -t

    configs:
      - source: system.instructions.md
        target: /app/system.instructions.md

    models:
      chat-model:
        endpoint_var: MODEL_RUNNER_BASE_URL
        model_var: MODEL_RUNNER_LLM_CHAT

      embedding-model:
        endpoint_var: MODEL_RUNNER_BASE_URL
        model_var: MODEL_RUNNER_LLM_EMBEDDING


models:
  chat-model:
    model: hf.co/qwen/qwen2.5-coder-3b-instruct-gguf:q4_k_m

  embedding-model:
    model: ai/embeddinggemma:latest

configs:
  system.instructions.md:
    content: |
      Your name is Bob (the original replicant).
      You are an expert programming assistant in Golang.
      You write clean, efficient, and well-documented code.
      Always:
      - Provide complete, working code
      - Include error handling
      - Add helpful comments
      - Follow best practices for the language
      - Explain your approach briefly

      Use only the information available in the provided data and your KNOWLEDGE BASE.

What’s important here is:

I only keep the last 2 messages in my conversation history, and I select at most the 3 best similarities found (to limit the size of the user prompt):

HISTORY_MESSAGES: 2
MAX_SIMILARITIES: 3
COSINE_LIMIT: 0.45

You can adjust these values according to your use case and your language model’s capabilities.

The models section, where I define the language models I’m going to use:

models:
  chat-model:
    model: hf.co/qwen/qwen2.5-coder-3b-instruct-gguf:q4_k_m

  embedding-model:
    model: ai/embeddinggemma:latest

One of the advantages of this section is that it will allow Docker Compose to download the models if they’re not already present on your machine.

As well as the models section of the golang-expert service, where I map the environment variables to the models defined above:

models:
  chat-model:
    endpoint_var: MODEL_RUNNER_BASE_URL
    model_var: MODEL_RUNNER_LLM_CHAT

  embedding-model:
    endpoint_var: MODEL_RUNNER_BASE_URL
    model_var: MODEL_RUNNER_LLM_EMBEDDING

And finally, the system instructions configuration file:

configs:
  - source: system.instructions.md
    target: /app/system.instructions.md

Which I define a bit further down in the configs section:

configs:
  system.instructions.md:
    content: |
      Your name is Bob (the original replicant).
      You are an expert programming assistant in Golang.
      You write clean, efficient, and well-documented code.
      Always:
      - Provide complete, working code
      - Include error handling
      - Add helpful comments
      - Follow best practices for the language
      - Explain your approach briefly

      Use only the information available in the provided data and your KNOWLEDGE BASE.

You can, of course, adapt these system instructions to your use case. And also persist them in a separate file if you prefer.

Dockerfile

It’s rather simple:

FROM node:22.19.0-trixie

WORKDIR /app
COPY package*.json ./
RUN npm install
COPY *.js .

# Create non-root user
RUN groupadd --gid 1001 nodejs && \
    useradd --uid 1001 --gid nodejs --shell /bin/bash --create-home bob-loves-js

# Change ownership of the app directory
RUN chown -R bob-loves-js:nodejs /app

# Switch to non-root user
USER bob-loves-js

Now that the configuration is in place, let’s move on to the agent’s source code.

Golang expert agent source code, a bit of LangchainJS with RAG

The JavaScript code is rather simple (probably improvable, but functional) and follows these main steps:

1. Initial configuration

  • Connection to both models (chat and embeddings) via LangchainJS
  • Loading parameters from environment variables

2. Vector database creation (at startup)

  • Reading the snippets.md file
  • Splitting into sections (chunks)
  • Generating an embedding for each section
  • Storing in an in-memory vector database

3. Interactive conversation loop

  • The user asks a question
  • Creating an embedding of the question
  • Similarity search in the vector database to find the most relevant snippets
  • Construction of the final prompt with: history + system instructions + relevant snippets + question
  • Sending to the LLM and displaying the response in streaming
  • Updating the history (limited to the last N messages)

import { ChatOpenAI } from "@langchain/openai";
import { OpenAIEmbeddings } from '@langchain/openai';

import { splitMarkdownBySections } from './chunks.js'
import { VectorRecord, MemoryVectorStore } from './rag.js';


import prompts from "prompts";
import fs from 'fs';

// Define [CHAT MODEL] Connection
const chatModel = new ChatOpenAI({
  model: process.env.MODEL_RUNNER_LLM_CHAT || `ai/qwen2.5:latest`,
  apiKey: "",
  configuration: {
    baseURL: process.env.MODEL_RUNNER_BASE_URL || "http://localhost:12434/engines/llama.cpp/v1/",
  },
  temperature: parseFloat(process.env.OPTION_TEMPERATURE) || 0.0,
  top_p: parseFloat(process.env.OPTION_TOP_P) || 0.5,
  presencePenalty: parseFloat(process.env.OPTION_PRESENCE_PENALTY) || 2.2,
});


// Define [EMBEDDINGS MODEL] Connection
const embeddingsModel = new OpenAIEmbeddings({
  model: process.env.MODEL_RUNNER_LLM_EMBEDDING || "ai/embeddinggemma:latest",
  apiKey: "",
  configuration: {
    baseURL: process.env.MODEL_RUNNER_BASE_URL || "http://localhost:12434/engines/llama.cpp/v1/",
  },
});

const maxSimilarities = parseInt(process.env.MAX_SIMILARITIES) || 3
const cosineLimit = parseFloat(process.env.COSINE_LIMIT) || 0.45

// ----------------------------------------------------------------
//  Create the embeddings and the vector store from the content file
// ----------------------------------------------------------------

console.log("========================================================")
console.log(" Embeddings model:", embeddingsModel.model)
console.log(" Creating embeddings...")
let contentPath = process.env.CONTENT_PATH || "./data"

const store = new MemoryVectorStore();

let contentFromFile = fs.readFileSync(contentPath+"/snippets.md", 'utf8');
let chunks = splitMarkdownBySections(contentFromFile);
console.log(" Number of documents read from file:", chunks.length);


// -------------------------------------------------
// Create and save the embeddings in the memory vector store
// -------------------------------------------------
console.log(" Creating the embeddings...");

for (const chunk of chunks) {
  try {
    // EMBEDDING COMPLETION:
    const chunkEmbedding = await embeddingsModel.embedQuery(chunk);
    const vectorRecord = new VectorRecord('', chunk, chunkEmbedding);
    store.save(vectorRecord);

  } catch (error) {
    console.error(`Error processing chunk:`, error);
  }
}

console.log(" Embeddings created, total of records", store.records.size);
console.log();


console.log("========================================================")


// Load the system instructions from a file
let systemInstructions = fs.readFileSync('/app/system.instructions.md', 'utf8');

// ----------------------------------------------------------------
// HISTORY: Initialize a Map to store conversations by session
// ----------------------------------------------------------------
const conversationMemory = new Map()

let exit = false;

// CHAT LOOP:
while (!exit) {
  const { userMessage } = await prompts({
    type: "text",
    name: "userMessage",
    message: `Your question (${chatModel.model}): `,
    validate: (value) => (value ? true : "Question cannot be empty"),
  });

  if (userMessage == "/bye") {
    console.log(" See you later!");
    exit = true;
    continue
  }

  // HISTORY: Get the conversation history for this session
  const history = getConversationHistory("default-session-id")

  // ----------------------------------------------------------------
  // SIMILARITY SEARCH:
  // ----------------------------------------------------------------
  // -------------------------------------------------
  // Create embedding from the user question
  // -------------------------------------------------
  const userQuestionEmbedding = await embeddingsModel.embedQuery(userMessage);

  // -------------------------------------------------
  // Use the vector store to find similar chunks
  // -------------------------------------------------
  // Create a vector record from the user embedding
  const embeddingFromUserQuestion = new VectorRecord('', '', userQuestionEmbedding);

  const similarities = store.searchTopNSimilarities(embeddingFromUserQuestion, cosineLimit, maxSimilarities);

  let knowledgeBase = "KNOWLEDGE BASE:\n";

  for (const similarity of similarities) {
    console.log(" CosineSimilarity:", similarity.cosineSimilarity, "Chunk:", similarity.prompt);
    knowledgeBase += `${similarity.prompt}\n`;
  }

  console.log("\n Similarities found, total of records", similarities.length);
  console.log();
  console.log("========================================================")
  console.log()

  // -------------------------------------------------
  // Generate CHAT COMPLETION:
  // -------------------------------------------------

  // MESSAGES== PROMPT CONSTRUCTION:
  let messages = [
      ...history,
      ["system", systemInstructions],
      ["system", knowledgeBase],
      ["user", userMessage]
  ]

  let assistantResponse = ''
  // STREAMING COMPLETION:
  const stream = await chatModel.stream(messages);
  for await (const chunk of stream) {
    assistantResponse += chunk.content
    process.stdout.write(chunk.content);
  }
  console.log("\n");

  // HISTORY: Add both user message and assistant response to history
  addToHistory("default-session-id", "user", userMessage)
  addToHistory("default-session-id", "assistant", assistantResponse)

}

// Helper function to get or create a conversation history
function getConversationHistory(sessionId, maxTurns = parseInt(process.env.HISTORY_MESSAGES)) {
  if (!conversationMemory.has(sessionId)) {
    conversationMemory.set(sessionId, [])
  }
  return conversationMemory.get(sessionId)
}

// Helper function to add a message to the conversation history
function addToHistory(sessionId, role, content) {
  const history = getConversationHistory(sessionId)
  history.push([role, content])

  // Keep only the last maxTurns conversations
  const maxTurns = parseInt(process.env.HISTORY_MESSAGES) // Adjust this value based on your needs
  if (history.length > maxTurns * 2) { // *2 because each turn has user & assistant message
    history.splice(0, 2) // Remove oldest turn (user + assistant messages)
  }
}

You can find it here

All that’s left is to launch it and verify my hypotheses.

In the project folder, run the following command:

docker compose up --build --no-log-prefix -d

Then connect to the container and launch the application:

docker compose exec golang-expert /bin/bash
node index.js

Nova Chat Agent with streaming completion

Once the application has started, it’s time to ask our Golang expert agent our question about developing with the Nova library:

========================================================
 Embeddings model: ai/embeddinggemma:latest
 Creating embeddings...
 Number of documents read from file: 4
 Creating the embeddings...
 Embeddings created, total of records 4

========================================================
?  Your question (hf.co/qwen/qwen2.5-coder-3b-instruct-gguf:q4_k_m):  › I need a code snippet of a Golang Nova Chat agent using a stream completion


After a quick similarity search, our agent found two relevant snippets in the vector database and was able to provide complete, functional, commented code in response to my request:

Certainly! Below is a complete, working Go code snippet for a Nova Chat agent that uses streaming completion. This example demonstrates how to set up an agent with the necessary configurations and generate a stream completion response.

```go
package main

import (
        "context"
        "fmt"

        "github.com/snipwise/nova/nova-sdk/agents"
        "github.com/snipwise/nova/nova-sdk/agents/chat"
        "github.com/snipwise/nova/nova-sdk/messages"
        "github.com/snipwise/nova/nova-sdk/messages/roles"
        "github.com/snipwise/nova/nova-sdk/models"
)

func main() {

        ctx := context.Background()

        // Create a new chat agent with streaming capabilities
        agent, err := chat.NewAgent(
                ctx,
                agents.Config{
                        EngineURL:          "http://localhost:12434/engines/llama.cpp/v1",
                        SystemInstructions: "You are Bob, a helpful AI assistant.",
                        KeepConversationHistory: true,
                },
                models.Config{
                        Name:        "ai/qwen2.5:1.5B-F16",
                        Temperature: models.Float64(0.8),
                },
        )
        if err != nil {
                panic(err)
        }

        // Generate a stream completion response
        result, err := agent.GenerateStreamCompletion(
                []messages.Message{
                        {Role: roles.User, Content: "Who is James T Kirk?"},
                },
                func(chunk string, finishReason string) error {
                        // Simple callback that receives strings only
                        if chunk != "" {
                                fmt.Print(chunk)
                        }
                        if finishReason == "stop" {
                                fmt.Println()
                        }
                        return nil
                },
        )
        if err != nil {
                panic(err)
        }

        // Print the final response and finish reason
        fmt.Println("Final Response:", result.Response)
        fmt.Println("Finish Reason:", result.FinishReason)
}
```

### Explanation:

1. **Imports**: The code imports necessary packages from the Nova SDK to interact with chat agents.

2. **Context**: A context is created using `context.Background()` to manage the lifecycle of requests.

3. **Agent Creation**:
   - An agent is created using `chat.NewAgent` with specified configurations:
     - `EngineURL`: The URL of the engine to use for generating text.
     - `SystemInstructions`: Instructions for the AI assistant, such as who Bob is.
     - `KeepConversationHistory`: Whether to keep a history of previous messages.

4. **Model Configuration**:
   - A model configuration is provided with:
     - `Name`: The name of the model to use.
     - `Temperature`: Controls the randomness of the output (higher values make responses more random).
     - `MaxTokens`: Limits the maximum number of tokens in the response.

5. **Stream Completion Generation**:
   - The `GenerateStreamCompletion` method is called with a list of messages and a callback function that processes each chunk of text as it arrives.
   - The callback prints chunks to the console until the finish reason is "stop", indicating the completion has been reached.

6. **Output**:
   - The final response and finish reason are printed after the stream completes.

This code snippet demonstrates how to set up a streaming chat agent using Nova, allowing you to receive text responses incrementally as they become available.

I then asked other questions.

Nova Structured Agent

This time, I asked for a structured agent: “I need a code snippet of a Nova Structured Agent.”

And the result was convincing:

Extract from the response:

Certainly! Below is a complete, working example of a Nova Structured Agent in Go. This agent will generate structured data about a specified country based on the user's input.

```go
package main

import (
        "context"
        "fmt"

        "github.com/snipwise/nova/nova-sdk/agents"
        "github.com/snipwise/nova/nova-sdk/agents/structured"
        "github.com/snipwise/nova/nova-sdk/messages"
        "github.com/snipwise/nova/nova-sdk/messages/roles"
        "github.com/snipwise/nova/nova-sdk/models"
)

type Country struct {
        Name       string   `json:"name"`
        Capital    string   `json:"capital"`
        Population int      `json:"population"`
        Languages  []string `json:"languages"`
}

func main() {
        ctx := context.Background()

        // Create a new Structured Agent
        agent, err := structured.NewAgent[Country](
                ctx,
                agents.Config{
                        EngineURL: "http://localhost:12434/engines/llama.cpp/v1",
                        SystemInstructions: `
                                Your name is Bob.
                                You are an assistant that answers questions about countries around the world.
                        `,
                },
                models.Config{
                        Name:        "hf.co/menlo/jan-nano-gguf:q4_k_m",
                        Temperature: models.Float64(0.0),
                },
        )
        if err != nil {
                panic(err)
        }

        // Generate structured data about Canada
        response, finishReason, err := agent.GenerateStructuredData([]messages.Message{
                {Role: roles.User, Content: "Tell me about Canada."},
        })

        if err != nil {
                panic(err)
        }

        // Print the generated country information
        fmt.Println("Country Name:", response.Name)
        fmt.Println("Capital:", response.Capital)
        fmt.Println("Population:", response.Population)
        fmt.Println("Languages:", strings.Join(response.Languages, ", "))
}
```

### Explanation:

1. **Imports**: The code imports necessary packages from the Nova SDK to interact with agents and models.

2. **Country Struct**: A simple struct `Country` is defined to represent the data structure expected by the agent.

3. **Main Function**:
   - A context is created using `context.Background()`.
   - A new Structured Agent is created for generating structured data about countries.
   - The system instructions are set to provide a friendly greeting and instruction on answering questions about countries.
   - The model configuration specifies the engine URL, temperature, and other parameters.

4. **GenerateStructuredData Method**: This method generates structured data based on the user's input. It takes an array of messages as input, where each message has a role (e.g., `User`) and content (e.g., "Tell me about Canada").

5. **Error Handling**: The code includes error handling to ensure that any issues during agent creation or data generation are properly caught and reported.

6. **Output**: The generated country information is printed, including the name, capital, population, and languages.

This example demonstrates how to use a Nova Structured Agent to generate structured data about countries based on user input. You can modify the system instructions and model configuration as needed for your specific use case.

Then, I moved on to the next question.

Nova RAG Agent with vector store. Be careful with your similarity settings!

This time, I asked for a “RAG” agent: “I need a snippet of a Nova RAG agent with a vector store.”

And once again, I got a relevant response.

However, when I tried this question (after restarting the agent to start from a clean state, without conversation history): “I need a snippet of a Nova RAG agent.”

The similarity search returned no relevant results (the words “vector store” were not present in the snippets), and the agent responded with generic code that had nothing to do with Nova, or reused code from the Nova Chat Agent snippets.

There may be several possible reasons:

  • The embedding model is not suitable for my use case,
  • The embedding model is not precise enough,
  • The splitting of the code snippets file is not optimal (you can add metadata to chunks to improve similarity search, for example, but remember that chunks must not exceed the maximum input size of the embedding model).

In that case, there’s a simple solution that works quite well: lower the similarity threshold and/or increase the number of returned results. This gives you more material to construct the user prompt, but be careful not to exceed the maximum context size of the language model. You can also experiment with “bigger” LLMs (more parameters and/or a larger context window).
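To illustrate, here is a sketch of what such a retrieval helper could look like; the cosine function, toy vectors, and option names (`threshold`, `topK`) are invented for this example and are not the article's actual vector-store API:

```javascript
// Sketch: cosine-similarity retrieval with a tunable threshold and result
// count. The store and vectors are toy values, not a real vector database.
function cosine(a, b) {
  let dot = 0, na = 0, nb = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    na += a[i] * a[i];
    nb += b[i] * b[i];
  }
  return dot / (Math.sqrt(na) * Math.sqrt(nb));
}

function search(store, queryVec, { threshold = 0.7, topK = 2 } = {}) {
  return store
    .map(rec => ({ ...rec, score: cosine(rec.vector, queryVec) }))
    .filter(rec => rec.score >= threshold)
    .sort((x, y) => y.score - x.score)
    .slice(0, topK);
}

const store = [
  { id: "chat-agent", vector: [1, 0, 0] },
  { id: "rag-agent", vector: [0.6, 0.8, 0] },
  { id: "structured-agent", vector: [0, 0, 1] },
];

// A strict threshold misses the second-best snippet; loosening it brings it back.
const strict = search(store, [1, 0.2, 0], { threshold: 0.9, topK: 3 });
const loose = search(store, [1, 0.2, 0], { threshold: 0.5, topK: 3 });
```

With `threshold: 0.9` only the closest snippet comes back; relaxing it to `0.5` also surfaces the second-best match, giving the prompt builder more material to work with.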

In the latest version of the snippets file, I added a KEYWORDS: … line below the markdown titles to help with similarity search, which greatly improved the results.
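Such a KEYWORDS line can be injected mechanically while splitting the snippets file. A hedged sketch (the chunk format and the keyword map are illustrative, not the article's actual file):

```javascript
// Sketch: inject a KEYWORDS line under each markdown title before the chunk
// is sent to the embedding model. The keyword map is illustrative.
const keywordMap = {
  "## Nova RAG Agent": ["rag", "retrieval", "vector store", "embeddings"],
};

function enrichChunk(chunk) {
  const [title, ...body] = chunk.split("\n");
  const keywords = keywordMap[title];
  if (!keywords) return chunk;  // no keywords known for this title
  return [title, `KEYWORDS: ${keywords.join(", ")}`, ...body].join("\n");
}

const enriched = enrichChunk("## Nova RAG Agent\nfunc main() { /* ... */ }");
```

Queries like “RAG agent” now have explicit terms to match against, even when the snippet body itself never uses those words.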

Conclusion

Using “Small Language Models” (SLMs) or “Tiny Language Models” (TLMs) requires some energy and thought to work around their limitations, but it is possible to build effective solutions for very specific problems. Once again, always think about the chat model’s context size and how you structure the information for the embedding model. By combining several specialized “small agents”, you can achieve very interesting results. That will be the subject of future articles.

Learn more

From Compose to Kubernetes to Cloud: Designing and Operating Infrastructure with Kanvas
https://www.docker.com/blog/compose-to-kubernetes-to-cloud-kanvas/ (Mon, 08 Dec 2025 14:00:00 +0000)

Docker has long been the simplest way to run containers. Developers start with a docker-compose.yml file, run docker compose up, and get things running fast.

As teams grow and workloads expand into Kubernetes and integrate into cloud services, simplicity fades. Kubernetes has become the operating system of the cloud, but your clusters rarely live in isolation. Real-world platforms are a complex intermixing of proprietary cloud services – AWS S3 buckets, Azure Virtual Machines, Google Cloud SQL databases – all running alongside your containerized workloads. You and your teams are working with clusters and clouds in a sea of YAML.

Managing this hybrid sprawl often means context switching between Docker Desktop, the Kubernetes CLI, cloud provider consoles, and infrastructure as code. Simplicity fades as you juggle multiple distinct tools.

Bringing clarity back from this chaos is the new Docker Kanvas Extension from Layer5 – a visual, collaborative workspace built right into Docker Desktop that allows you to design, deploy, and operate not just Kubernetes resources, but your entire cloud infrastructure across AWS, GCP, and Azure.


What Is Kanvas?

Kanvas is a collaborative platform designed for engineers to visualize, manage, and design multi-cloud and Kubernetes-native infrastructure. Kanvas transforms the concept of infrastructure as code into infrastructure as design. This means your architecture diagram is no longer just documentation – it is the source of truth that drives your deployment. Built on top of Meshery (one of the Cloud Native Computing Foundation’s highest-velocity open source projects), Kanvas moves beyond simple Kubernetes manifests by using Meshery Models – definitions that describe the properties and behavior of specific cloud resources. This allows Kanvas to support a massive catalog of Infrastructure-as-a-Service (IaaS) components: 

  • AWS: 55+ services (e.g., EC2, Lambda, RDS, DynamoDB).
  • Azure: 50+ components (e.g., Virtual Machines, Blob Storage, VNet).
  • GCP: 60+ services (e.g., Compute Engine, BigQuery, Pub/Sub).

Kanvas bridges the gap between abstract architecture and concrete operations through two integrated modes: Designer and Operator.

Designer Mode (declarative mode)

Designer mode serves as a “blueprint studio” for cloud architects and DevOps teams, emphasizing declarative modeling – describing what your infrastructure should look like rather than how to build it step-by-step – making it ideal for GitOps workflows and team-based planning. 

  • Build and iterate collaboratively: Add annotations, comments for design reviews, and connections between components to visualize data flows, architectures, and relationships.
  • Dry-run and validate deployments: Before touching production, simulate your deployments by performing a dry-run to verify that your configuration is valid and that you have the necessary permissions. 
  • Import and export: Bring in brownfield designs by connecting your existing clusters or importing Helm charts from your GitHub repositories.
  • Reuse patterns, clone, and share: Pick from a catalog of reference architectures, sample configurations, and infrastructure templates, so you can start from proven blueprints rather than a blank design. Share designs just as you would a Google Doc. Clone designs just as you would a GitHub repo. Merge designs just as you would in a pull request.

Operator Mode (imperative mode)

Kanvas Operator mode transforms static diagrams into live, managed infrastructure. When you switch to Operator mode, Kanvas stops being a configuration tool and becomes an active infrastructure console, using Kubernetes controllers (like AWS Controllers for Kubernetes (ACK) or Google Config Connector) to actively manage your designs.

Operator mode allows you to:

  • Load testing and performance management: With Operator’s built-in load generator, you can execute stress tests and characterize service behavior by analyzing latency and throughput against predefined performance profiles, establishing baselines to measure the impact of infrastructure configuration changes made in Designer mode.
  • Multi-player, interactive terminal: Open a shell session with your containers and execute commands, stream and search container logs without leaving the visual topology. Streamline your troubleshooting by sharing your session with teammates. Stay in-context and avoid context-switching to external command-line tools like kubectl.
  • Integrated observability: Use the Prometheus integration to overlay key performance metrics (CPU usage, memory, request latency) and quickly find spot “hotspots” in your architecture visually. Import your existing Grafana dashboards for deeper analysis.
  • Multi-cluster, multi-cloud operations: Connect multiple Kubernetes clusters (across different clouds or regions) and manage workloads that span a GKE cluster and an EKS cluster, all from a single topology view in Kanvas.

While Kanvas Designer mode is about intent (what you want to build), Operator mode is about reality (what is actually running). Designer mode and Operator mode are simply two tightly integrated sides of the same coin.

With this understanding, let’s see both modes in action in Docker Desktop.

Walk-Through: From Compose to Kubernetes in Minutes

With the Docker Kanvas extension (install from Docker Hub), you can take any existing Docker Compose file and instantly see how it translates into Kubernetes, making it incredibly easy to understand, extend, and deploy your application at scale.

The Docker Samples repository offers a plethora of samples. Let’s use the Spring-based PetClinic example below. 

# sample docker-compose.yml

services:
  petclinic:
    build:
      context: .
      dockerfile: Dockerfile.multi
      target: development
    ports:
      - 8000:8000
      - 8080:8080
    environment:
      - SERVER_PORT=8080
      - MYSQL_URL=jdbc:mysql://mysqlserver/petclinic
    volumes:
      - ./:/app
    depends_on:
      - mysqlserver
    
  mysqlserver:
    image: mysql:8
    ports:
      - 3306:3306
    environment:
      - MYSQL_ROOT_PASSWORD=
      - MYSQL_ALLOW_EMPTY_PASSWORD=true
      - MYSQL_USER=petclinic
      - MYSQL_PASSWORD=petclinic
      - MYSQL_DATABASE=petclinic
    volumes:
      - mysql_data:/var/lib/mysql
      - mysql_config:/etc/mysql/conf.d
volumes:
  mysql_data:
  mysql_config:

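For a rough idea of the translation Kanvas performs, here is a hypothetical sketch (not Kanvas's actual implementation) of how a single Compose service could map onto a Kubernetes Deployment object; it only covers image, ports, and environment:

```javascript
// Hedged sketch: a naive Compose-service-to-Deployment mapping. A real
// translator also handles builds, volumes, and dependencies; this only
// illustrates the shape of the translation.
function toDeployment(name, service) {
  return {
    apiVersion: "apps/v1",
    kind: "Deployment",
    metadata: { name },
    spec: {
      replicas: 1,
      selector: { matchLabels: { app: name } },
      template: {
        metadata: { labels: { app: name } },
        spec: {
          containers: [{
            name,
            image: service.image,
            // "host:container" port strings keep only the container side.
            ports: (service.ports || []).map(p => ({
              containerPort: Number(String(p).split(":").pop()),
            })),
            // "KEY=value" strings become name/value pairs.
            env: (service.environment || []).map(e => {
              const [k, ...v] = e.split("=");
              return { name: k, value: v.join("=") };
            }),
          }],
        },
      },
    },
  };
}

const mysql = toDeployment("mysqlserver", {
  image: "mysql:8",
  ports: ["3306:3306"],
  environment: ["MYSQL_USER=petclinic", "MYSQL_DATABASE=petclinic"],
});
```

A real translator also has to handle builds, volumes, depends_on ordering, and generate matching Service objects; that busywork is exactly what Kanvas hides behind the import.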

With your Docker Kanvas extension installed:

  1. Import sample app: Save the PetClinic docker-compose.yml file to your computer, then click to import or drag and drop the file onto Kanvas.

Kanvas renders an interactive topology of your stack showing services, dependencies (like MySQL), volumes, ports, and configurations, all mapped to their Kubernetes equivalents. Kanvas performs this rendering in phases, applying an increasing degree of scrutiny in the evaluation performed in each phase. Let’s explore the specifics of this tiered evaluation process in a moment.

  2. Enhance the PetClinic design

From here, you can enhance the generated design in a visual, no-YAML way:

  • Add a LoadBalancer, Ingress, or ConfigMap
  • Configure Secrets for your database URL or sensitive environment variables
  • Modify service relationships or attach new components
  • Add comments or any other annotations.

Importantly, Kanvas saves your design as you make changes. This gives you production-ready deployment artifacts generated directly from your Compose file.

  3. Deploy to a cluster

With one click, deploy the design to any cluster connected to Docker Desktop or any other remote cluster. Kanvas handles the translation and applies your configuration.

  4. Switch modes and interact with your app

After deploying (or when managing an existing workload), switch to Operator mode to observe and manage your deployed design. You can:

  • Inspect Deployments, Services, Pods, and their relationships.
  • Open a terminal session with your containers for quick debugging.
  • Tail and search your container logs and monitor resource metrics.
  • Generate traffic and analyze the performance of your deployment under heavy load.
  • Share your Operator view with teammates for collaborative management.

Within minutes, a Compose-based project becomes a fully managed Kubernetes workload, all without leaving Docker Desktop. This seamless flow from a simple Compose file to a fully managed, operable workload highlights the ease with which infrastructure can be visually managed, and leads us to the underlying principle of Infrastructure as Design.

Infrastructure as Design

Infrastructure as design elevates the visual layout of your stack to be the primary driver of its configuration: adjusting the proximity and connectedness of components is one and the same as configuring your infrastructure. In other words, the presence, absence, proximity, or connectedness of individual components (all of which affect how one component relates to another) augments the underlying configuration of each. Kanvas understands, at a granular level of detail, how each individual component relates to all the others and augments their configuration accordingly.

Kanvas renders the topology of your stack’s architecture in phases. The initial rendering performs a lightweight analysis of each component, establishing a baseline for the contents of your new design. A subsequent phase applies a more sophisticated analysis: Kanvas introspects the configuration of each of your stack’s components and their interdependencies, and proactively evaluates how each component relates to the others. Kanvas will add, remove, and update the configuration of your components as a result of this relationship evaluation.


This process of relationship evaluation is ongoing. Every time you make a change to your design, Kanvas re-evaluates each component configuration.

To offer an example, if you bring a Kubernetes Deployment into the vicinity of a Kubernetes Namespace, you will find that one magnetizes to the other: your Deployment is visually placed inside the Namespace, and at the same time the Deployment’s configuration is mutated to include its new Namespace designation. Kanvas proactively evaluates and mutates the configuration of the infrastructure resources in your design as you make changes.
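To picture this, the Deployment-into-Namespace behavior can be modeled as a tiny relationship rule. This is a purely illustrative sketch (the component shapes and containment check are invented; Kanvas's real engine is policy-driven and far richer):

```javascript
// Illustrative sketch: when a component is dropped "inside" a Namespace node,
// a relationship rule patches its configuration.
function contains(outer, inner) {
  return inner.x > outer.x && inner.x < outer.x + outer.w &&
         inner.y > outer.y && inner.y < outer.y + outer.h;
}

function evaluateRelationships(components) {
  const namespaces = components.filter(c => c.kind === "Namespace");
  for (const c of components) {
    if (c.kind === "Namespace") continue;
    for (const ns of namespaces) {
      // Visual containment mutates the component's configuration.
      if (contains(ns, c)) c.config.metadata.namespace = ns.name;
    }
  }
  return components;
}

const design = [
  { kind: "Namespace", name: "petclinic", x: 0, y: 0, w: 100, h: 100 },
  { kind: "Deployment", name: "web", x: 10, y: 10,
    config: { metadata: { name: "web" } } },
];
evaluateRelationships(design);
```

The point is that a purely visual action (placement) has a deterministic configuration side effect; Kanvas re-runs this kind of evaluation on every design change.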

This ability for Kanvas to intelligently interpret and adapt to changes in your design—automatically managing configuration and relationships—is the key to achieving infrastructure as design. This power comes from a sophisticated system that gives Kanvas a level of intelligence, but with the reliability of a policy-driven engine.

AI-like Intelligence, Anchored by Deterministic Truth

In an era where generative AI dramatically accelerates infrastructure design, the risk of “hallucinations”—plausible but functionally invalid configurations—remains a critical bottleneck. Kanvas solves this by pairing the generative power of AI with a rigid, deterministic policy engine.


This engine acts as an architectural guardrail, offering you precise control over the degree to which AI is involved in assessing configuration correctness. It transforms designs from simple visual diagrams into validated, deployable blueprints.

While AI models function probabilistically, Kanvas’s policy engine functions deterministically, automatically analyzing designs to identify, validate, and enforce connections between components based on ground-truth rules. Each of these rules is statically defined and versioned in its respective Kanvas model.

  • Deep Contextualization: The evaluation goes beyond simple visualization. It treats relationships as context-aware and declarative, interpreting how components interact (e.g., data flows, dependencies, or resource sharing) to ensure designs are not just imaginative, but deployable and compliant.
  • Semantic Rigor: The engine distinguishes between semantic relationships (infrastructure-meaningful, such as a TCP connection that auto-configures ports) and non-semantic relationships (user-defined visuals, like annotations). This ensures that aesthetic choices never compromise infrastructure integrity.

Kanvas acknowledges that trust is not binary. You maintain sovereignty over your designs through granular controls that dictate how the engine interacts with AI-generated suggestions:

  • “Human-in-the-Loop” Slider: You can modulate the strictness of the policy evaluation. You might allow the AI to suggest high-level architecture while enforcing strict policies on security configurations (e.g., port exposure or IAM roles).
  • Selective Evaluation: You can disable evaluations via preferences for specific categories. For example, you may trust the AI to generate a valid Kubernetes Service definition, but rely entirely on the policy engine to validate the Ingress controller linking to it.

Kanvas does not just flag errors; it actively works to resolve them using sophisticated detection and correction strategies.

  • Intelligent Scanning: The engine scans for potential relationships based on component types, kinds, and subtypes (e.g., a Deployment linking to a Service via port exposure), catching logical gaps an AI might miss.
  • Patches and Resolvers: When a partial or hallucinated configuration is detected, Kanvas applies patches to either propagate missing configuration or dynamically adjust configurations to resolve conflicts, ensuring the final infrastructure-as-code export (e.g., Kubernetes manifests, Helm chart) is clean, versionable, and secure.

Turn Complexity into Clarity

Kanvas takes the guesswork out of managing modern infrastructure. For developers used to Docker Compose, it offers a natural bridge to Kubernetes and cloud services — with visibility and collaboration built in.

| Capability | How It Helps You |
| --- | --- |
| Import and Deploy Compose Apps | Move from Compose, Helm, or Kustomize to Kubernetes in minutes. |
| Visual Designer | Understand your architecture through connected, interactive diagrams. |
| Design Catalog | Use ready-made templates and proven infrastructure patterns. |
| Terminal Integration | Debug directly from the Kanvas UI, without switching tools. |
| Sharable Views | Collaborate on live infrastructure with your team. |
| Multi-Environment Management | Operate across local, staging, and cloud clusters from one dashboard. |

Kanvas brings visual design and real-time operations directly into Docker Desktop. Import your Compose files, Kubernetes Manifests, Helm Charts, and Kustomize files to explore the catalog of ready-to-use architectures, and deploy to Kubernetes in minutes — no YAML wrangling required.

Designs can also be exported in a variety of formats, including as OCI-compliant images and shared through registries like Docker Hub, GitHub Container Registry, or AWS ECR — keeping your infrastructure as design versioned and portable.

Install the Kanvas Extension from Docker Hub and start designing your infrastructure today.

From Shell Scripts to Science Agents: How AI Agents Are Transforming Research Workflows
https://www.docker.com/blog/ai-science-agents-research-workflows/ (Thu, 02 Oct 2025 13:00:00 +0000)

It’s 2 AM in a lab somewhere. A researcher has three terminals open, a half-written Jupyter notebook on one screen, an Excel sheet filled with sample IDs on another, and a half-eaten snack next to shell commands. They’re juggling scripts to run a protein folding model, parsing CSVs from the last experiment, searching for literature, and Googling whether that one Python package broke in the latest update, again.

This isn’t the exception; it’s the norm. Scientific research today is a patchwork of tools, scripts, and formats, glued together by determination and late-night caffeine. Reproducibility is a wishlist item. Infrastructure is an afterthought. And while automation exists, it’s usually hand-rolled and stuck on someone’s laptop.

But what if science workflows could be orchestrated, end-to-end, by an intelligent agent?

What if instead of writing shell scripts and hoping the dependencies don’t break, a scientist could describe the goal, “read this CSV of compounds and proteins, search for literature, admet, and more”, and an AI agent could plan the steps, spin up the right tools in containers, execute the tasks, and even summarize the results?

That’s the promise of science agents. AI-powered systems that don’t just answer questions like ChatGPT, but autonomously carry out entire research workflows. And thanks to the convergence of LLMs, GPUs, Dockerized environments, and open scientific tools, this shift isn’t theoretical anymore.

It’s happening now.


What is a Science Agent?

A Science Agent is more than just a chatbot or a smart prompt generator; it’s an autonomous system designed to plan, execute, and iterate on entire scientific workflows with minimal human input.

Instead of relying on one-off questions like “What is ADMET?” or “Summarize this paper,” a science agent operates like a digital research assistant. It understands goals, breaks them into steps, selects the right tools, runs computations, and even reflects on results.

CrewAI: AI agents framework -> https://www.crewai.com/
ADMET: how a drug is absorbed, distributed, metabolized, and excreted, and its toxicity

Let’s make it concrete:

Take this multi-agent system you might build with CrewAI:

  • Curator: Data-focused agent whose primary role is to ensure data quality and standardization.
  • Researcher: Literature specialist. Its main goal is to find relevant academic papers on PubMed for the normalized entities provided by the Curator.
  • Web Scraper: Specialized agent for extracting information from websites.
  • Analyst: Predicts ADMET properties and toxicity using models or APIs.
  • Reporter: Compiles all results into a clean Markdown report.

Each of these agents acts independently but works as part of a coordinated system. Together, they automate, reproducibly and in minutes, what would take a human team hours or even days.
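The division of labor above can be sketched as a simple sequential pipeline. This is plain JavaScript standing in for a framework like CrewAI, with stub agents and invented data:

```javascript
// Sketch: a curator -> researcher -> analyst -> reporter pipeline. Each
// "agent" is a stub; in a real crew each step would call an LLM and external
// tools (PubMed, an ADMET API, etc.).
const curator = rows => rows.map(r => ({ ...r, name: r.name.trim().toLowerCase() }));
const researcher = entities => entities.map(e => ({ ...e, papers: [`pubmed:${e.name}`] }));
const analyst = entities => entities.map(e => ({ ...e, admet: { toxic: false } }));
const reporter = entities =>
  entities.map(e => `- ${e.name}: ${e.papers.length} paper(s), toxic=${e.admet.toxic}`).join("\n");

function runCrew(rows) {
  // Each agent enriches the shared entities before handing them on.
  return reporter(analyst(researcher(curator(rows))));
}

const report = runCrew([{ name: "  Aspirin " }, { name: "Caffeine" }]);
console.log(report);
// - aspirin: 1 paper(s), toxic=false
// - caffeine: 1 paper(s), toxic=false
```

In a real crew, each stage would be an LLM-backed agent with its own tools; the point here is only the hand-off: each agent enriches the shared entities before passing them on.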

Why This Is Different from ChatGPT

You’ve probably used ChatGPT to summarize papers, write Python code, or explain complex topics. And while it might seem like a simple question-answer engine, there’s often more happening behind the scenes: prompt chains, context windows, and latent loops of reasoning. But even with those advances, these interactions are still mostly human-in-the-loop: you ask, it answers.

Science agents are a different species entirely.

Instead of waiting for your next prompt, they plan and execute entire workflows autonomously. They decide which tools to use based on context, how to validate results, and when to pivot. Where ChatGPT responds, agents act. They’re less like assistants and more like collaborators.

Let’s break down the key differences:

| Feature | LLMs (ChatGPT & similar) | Science Agents (CrewAI, LangGraph, etc.) |
| --- | --- | --- |
| Interaction | Multi-turn, often guided by user prompts or system instructions | Long-running, autonomous workflows across multiple tools |
| Role | Assistant with agentic capabilities abstracted away | Explicit research collaborator executing role-specific tasks |
| Autonomy | Semi-autonomous; requires external prompting or embedded system orchestration | Fully autonomous planning, tool selection, and iteration |
| Tool Use | Some tools used via plugins/functions (e.g., browser, code interpreter) | Explicit tool integration (APIs, simulations, databases, Dockerized tools) |
| Memory | Short- to medium-term context (limited per session or chat, non-explicit workspace) | Persistent long-term memory (vector DBs, file logs, databases; explicit and programmable) |
| Reproducibility | Very limited, without the ability to define agents’ roles/tasks and their tools | Fully containerized, versioned, reproducible workflows with defined agent roles/tasks |

Try it yourself

If you’re curious, here’s a two-container demo you can run in minutes.

git repo: https://github.com/estebanx64/docker_blog_ai_agents_research

We just have two containers/services for this example:


Prerequisites

  • Docker and Docker Compose
  • OpenAI API key (for GPT-4o model access)
  • Sample CSV file with biological entities

Follow the instructions in the README.md in our repo to set up your OpenAI API key.

Running the example workflow included in our repo will cost roughly 1–2 USD in OpenAI API usage.

Run the workflow:

docker compose up
[screenshot: workflow logs from docker compose up]

The logs above demonstrate how our agents autonomously plan and execute a complete workflow.

  1. Ingest CSV file: The agents load and parse the input CSV dataset.
  2. Query PubMed: They automatically search PubMed for relevant scientific articles.
  3. Generate literature summaries: The retrieved articles are summarized into concise, structured insights.
  4. Calculate ADMET properties: The agents call an external API to compute ADMET (Absorption, Distribution, Metabolism, Excretion, Toxicity) predictions.
  5. Compile results into a Markdown report: All findings are aggregated and formatted into a structured report.md.

Output Files

  • report.md – Comprehensive research report.
  • JSON files – Contain normalized entities, literature references, and ADMET predictions.

This showcases the agents’ ability to make decisions, use tools, and coordinate tasks without manual intervention.

If you want to explore further, please check the README.md included in the GitHub repository.

Imagine your lab could run 100 experiments overnight: what would you discover first?

But to make this vision real, the hard part isn’t just the agents, it’s the infrastructure they need to run.

Infrastructure: The Bottleneck

AI science agents are powerful, but without the right infrastructure, they break quickly or can’t scale. Real research workflows involve GPUs, complex dependencies, and large datasets. Here’s where things get challenging, and where Docker becomes essential.

The Pain Points

  • Heavy workloads: Running tools like AlphaFold or Boltz requires high-performance GPUs and smart scheduling (e.g., EKS, Slurm).
  • Reproducibility chaos: Different systems = broken environments. Scientists spend hours debugging libraries instead of doing science.
  • Toolchain complexity: Agents rely on multiple scientific tools (RDKit, PyMOL, Rosetta, etc.), each with their own dependencies.
  • Versioning hell: Keeping track of dataset/model versions across runs is non-trivial, especially when collaborating.

Why Containers Matter

  • Standardized environments: Package your tools once, run them anywhere, from a laptop to the cloud.
  • Reproducible workflows: Every step of your agent’s process is containerized, making it easy to rerun or share experiments.
  • Composable agents: Treat each step (e.g., literature search, folding, ADMET prediction) as a containerized service.
  • Smooth orchestration: Frameworks like CrewAI can spin up containers to isolate tasks, such as running or validating generated code, without compromising the host.
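As a concrete illustration of that last point, an orchestrator can isolate a task simply by shaping the `docker run` invocation, for example by disabling networking. This helper only builds the argument list; the image name and flags are illustrative choices, not any specific framework's API:

```python
def sandboxed_run_args(image, command, workdir="/workspace", network=False):
    """Build a `docker run` argv that isolates a single agent task."""
    args = ["docker", "run", "--rm",
            "--workdir", workdir]
    if not network:
        args += ["--network", "none"]  # generated code cannot reach the network
    args += [image] + list(command)
    return args

argv = sandboxed_run_args("python:3.12-slim", ["python", "-c", "print('ok')"])
print(" ".join(argv))
```

Passing the resulting argv to `subprocess.run` would execute the step in a throwaway, network-less container.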

Open Challenges & Opportunities

Science agents are powerful, but still early. There’s a growing list of challenges where developers, researchers, and hackers can make a huge impact.

Unsolved Pain Points

  • Long-term memory: Forgetful agents aren’t useful. We need better semantic memory systems (e.g., vector stores, file logs) for scientific reasoning over time.
  • Orchestration frameworks: Complex workflows require robust pipelines. Temporal, Kestra, Prefect, and friends could be game changers for bio.
  • Safety & bounded autonomy: How do we keep agents focused and avoid “hallucinated science”? Guardrails are still missing.
  • Benchmarking agents: There’s no standard to compare science agents. We need tasks, datasets, and metrics to measure real-world utility.

Ways to Contribute

  • Containerize more tools (models, pipelines, APIs) to plug into agent systems.
  • Create tests and benchmarks for evaluating agent performance in scientific domains.

Conclusion

We’re standing at the edge of a new scientific paradigm, one where research isn’t just accelerated by AI, but partnered with it. Science agents are transforming what used to be days of fragmented work into orchestrated workflows that run autonomously, reproducibly, and at scale.

This shift from messy shell scripts and notebooks to containerized, intelligent agents isn’t just about convenience. It’s about opening up research to more people, compressing discovery cycles, and building infrastructure that’s as powerful as the models it runs.

Science is no longer confined to the lab. It’s being automated in containers, scheduled on GPUs, and shipped by developers like you.

Check out the repo and try building your own science agent. What workflow would you automate first?

]]>
How to Build Secure AI Coding Agents with Cerebras and Docker Compose https://www.docker.com/blog/cerebras-docker-compose-secure-ai-coding-agents/ Wed, 17 Sep 2025 16:00:00 +0000 https://www.docker.com/?p=77621 In the recent article, Building Isolated AI Code Environments with Cerebras and Docker Compose, our friends at Cerebras showcased how to build a coding agent using the world’s fastest Cerebras AI inference API, Docker Compose, ADK-Python, and MCP servers.

In this post, we’ll dive deeper into the underlying technologies and show how the pieces come together to build an AI agent environment that’s portable, secure, and fully containerized. You’ll learn how to create multi-agent systems, run some agents with local models in Docker Model Runner, and integrate custom tools as MCP servers into your AI agent’s workflow.

We’ll also touch on how to build a secure sandbox for executing the code your agent writes, an ideal use case for containers in real-world development. 

Getting Started

To begin, clone the repository from GitHub and navigate into the project directory.

Get the code for the agent:

git clone https://github.com/dockersamples/docker-cerebras-demo && cd docker-cerebras-demo

Next, prepare the .env file to provide your Cerebras API key. You can get a key from the Cerebras Cloud platform.

# This copies the sample environment file to your local .env file
cp .env-sample .env

Now, open the .env file in your favorite editor and add your API key to the CEREBRAS_API_KEY line. Once that’s done, run the system using Docker Compose:

docker compose up --build

The first run may take a few minutes to pull the necessary Docker images and the AI model. Once it’s running, you can access the agent’s interface at http://localhost:8000. From there, you can interact with your agent and issue commands like “write code,” “initialize the sandbox environment,” or request specific tools like “cerebras, curl docker.com for me please.”

Understanding the Architecture

This demo follows the architecture from our Compose for Agents repository, which breaks down an agent into three core components:

  1. The Agentic Loop: This is the main application logic that orchestrates the agent’s behavior. In our case, it’s an ADK-Python-based application. The ADK-Python framework also includes a visualizer that lets you inspect tool calls and trace how the system reached specific decisions.

DevDuck architecture

  2. The MCP Tools: These are the external tools the agent can use. We provide them securely via the Docker MCP Gateway. In this app we use the context7 and node sandbox MCP servers.
  3. The AI Model: You can define any local or remote AI model you want to use. Here, we’re using a local Qwen model for routing between the local agent and the more powerful Cerebras agent, which uses the Cerebras API.

Cerebras Cloud serves as a specialized, high-performance inference backend. It can run massive models, like a half-trillion parameter Qwen coder, at thousands of tokens per second. While our simple demo doesn’t require this level of speed, such performance is a game-changer for real-world applications.

Most of the prompts and responses are a few hundred tokens long, as they are simple commands to initialize a sandbox or write some JavaScript code in it. You’re welcome to make the agent work harder and see Cerebras’ performance on more verbose requests. 

For example, you can ask the Cerebras agent to write some JavaScript code and watch it call the functions from the MCP tools to read, write, and run the files, as shown in the screenshot below.

DevDuck architecture calling MCP tools

Building a Custom Sandbox as an MCP Server

A key feature of this setup is the ability to create a secure sandbox for code execution. To do this, we’ll build a custom MCP server. In our example, we enable two MCP servers:

  • context7: This gives our agent access to the latest documentation for various application frameworks.
  • node-code-sandbox: This is our custom-made sandbox for executing the code our agent writes.

You can find the implementation of our Node.js sandbox server in the node-sandbox-mcp GitHub repository. It’s a Quarkus application written in Java that exposes itself as a stdio MCP server and uses the awesome Testcontainers library to create and manage the sandbox containers programmatically.

An important detail is that you have full control over the sandbox configuration. We start the container with a common Node.js development image and, as a crucial security measure, disable its networking. But since it’s a custom MCP server, you can enable any security measures you deem necessary. 

Here’s a snippet of the Testcontainers-java code used to create the container:

GenericContainer<?> sandboxContainer =
        new GenericContainer<>("mcr.microsoft.com/devcontainers/javascript-node:20")
                .withNetworkMode("none") // disable network!!
                .withWorkingDirectory("/workspace")
                .withCommand("sleep", "infinity");

sandboxContainer.start();

Testcontainers provides a flexible, idiomatic API to interact with the sandbox. Running a command or writing a file becomes a simple one-line method call:

// To execute a command inside the sandbox
sandbox.execInContainer(command);

// To write a file into the sandbox
sandbox.copyFileToContainer(Transferable.of(contents.getBytes()), filename);

The actual implementation has a bit more glue code for managing background processes or selecting the correct sandbox if you’ve created multiple, but these one-liners are the core of the interaction.

Packaging and Using the Custom Server

To use our custom server, we first need to package it as a Docker image. For Quarkus applications, a single command does the trick:

./mvnw package -DskipTests=true -Dquarkus.container-image.build=true

This command produces a local Docker image and outputs its name, something like:

[INFO] [io.quarkus.container.image.docker.deployment.DockerProcessor] Built container image shelajev/node-sandbox:1.0.0-SNAPSHOT

Since we’re running everything locally, we don’t even need to push this image to a remote registry. You can inspect this image in Docker Desktop and find its hash, which we’ll use in the next step.

DevDuck - Docker Desktop image layers

Integrating the Sandbox via the MCP Gateway

With our custom MCP server image ready, it’s time to plug it into the MCP Gateway. We’ll create a custom catalog file (mcp-gateway-catalog.yaml) that enables both the standard context7 server and our new node-code-sandbox.

Currently, creating this file is a manual process, but we’re working on simplifying it. The result is a portable catalog file that mixes standard and custom MCP servers.

Notice two key things in the configuration for the node-code-sandbox MCP server in the catalog:

  • longLived: true: This tells the gateway that our server needs to persist between the tool calls to track the sandbox’s state. 
  • image:: We reference the specific Docker image using its sha256 hash to ensure reproducibility.

If you’re building the custom server for the sandbox MCP, you can replace the image reference with the one your build step produced. 

    longLived: true
    image: olegselajev241/node-sandbox@sha256:44437d5b61b6f324d3bb10c222ac43df9a5b52df9b66d97a89f6e0f8d8899f67

Finally, we update our docker-compose.yml to mount this catalog file and enable both servers:

  mcp-gateway:
    # mcp-gateway secures your MCP servers
    image: docker/mcp-gateway:latest
    use_api_socket: true
    command:
      - --transport=sse
      # add any MCP servers you want to use
      - --servers=context7,node-code-sandbox
      - --catalog=/mcp-gateway-catalog.yaml
    volumes:
      - ./mcp-gateway-catalog.yaml:/mcp-gateway-catalog.yaml:ro

When you run docker compose up, the gateway starts, which in turn starts our node-sandbox MCP server. When the agent requests a sandbox, a third container is launched – the actual isolated environment. 

DevDuck launched node-sandbox in isolated container

You can use tools like Docker Desktop to inspect all running containers, view files, or even open a shell for debugging.

DevDuck Docker Desktop inspect running containers

The Security Benefits of Containerized Sandboxes 

This containerized sandbox approach is a significant security win. Containers provide a well-understood security boundary with a smaller vulnerability profile than running random internet code on your host machine, and you can harden them as needed.

Remember how we disabled networking in the sandbox container? This means any code the agent generates cannot leak local secrets or data to the internet. If you ask the agent to run code that tries to access, for example, google.com, it will fail.

DevDuck containerized sandbox showing inability to access google.com

This demonstrates a key advantage: granular control. While the sandbox is cut off from the network, other tools are not. The context7 MCP server can still access the internet to fetch documentation, allowing the agent to write better code without compromising the security of the execution environment.

DevDuck demo showing sandbox access to provided docs

Oh, and a neat detail: when you stop the containers managed by Compose, it also kills the sandbox MCP server, which in turn triggers Testcontainers to clean up all the sandbox containers, just like it cleans up after a typical test run.

Next Steps and Extensibility

This coding agent is a great starting point, but it isn’t production-ready. For a real-world application, you might want to grant controlled access to resources like the npm registry. You could, for example, achieve this by mapping your local npm cache from the host system into the sandbox. This way, you, the developer, control exactly which npm libraries are accessible.

Because the sandbox is a custom MCP server, the possibilities are endless. You can build it yourself, tweak it however you want, and integrate any tools or constraints you need.

Conclusion

In this post, we demonstrated how to build a secure and portable AI coding agent using Docker Compose and the MCP Toolkit. By creating a custom MCP server with Testcontainers, we built a sandboxed execution environment that offers granular security controls, like disabling network access, without limiting the agent’s other tools. We connected this coding agent to the Cerebras API, which gives us incredible inference speed. This architecture provides a powerful and secure foundation for building your own AI agents. We encourage you to clone the repository and experiment with the code! You probably already have Docker, and you can sign up for a Cerebras API key here.

]]>
Hybrid AI Isn’t the Future — It’s Here (and It Runs in Docker using the Minions protocol) https://www.docker.com/blog/hybrid-ai-and-how-it-runs-in-docker/ Thu, 04 Sep 2025 13:00:44 +0000 https://www.docker.com/?p=76250 Running large AI models in the cloud gives access to immense capabilities, but it doesn’t come for free. The bigger the models, the bigger the bills, and with them, the risk of unexpected costs.

Local models flip the equation. They safeguard privacy and keep costs predictable, but their smaller size often limits what you can achieve. 

For many GenAI applications, like analyzing long documents or running workflows that need a large context, developers face a tradeoff between quality and cost. But there might be a smarter way forward: a hybrid approach that combines the strengths of remote intelligence with local efficiency. 

This idea is well illustrated by the Minions protocol, which coordinates lightweight local “minions” with a stronger remote model to achieve both cost reduction and accuracy preservation. By letting local agents handle routine tasks while deferring complex reasoning to a central intelligence, Minions demonstrates how organizations can cut costs without sacrificing quality.

With Docker and Docker Compose, the setup becomes simple, portable, and secure. 

In this post, we’ll show how to use Docker Compose, Model Runner, and the MinionS protocol to deploy hybrid models and break down the results and trade-offs.  

What’s Hybrid AI

Hybrid AI combines the strengths of powerful cloud models with efficient local models, creating a balance between performance, cost, and privacy. Instead of choosing between quality and affordability, Hybrid AI workflows let developers get the best of both worlds.

Next, let’s see an example of how this can be implemented in practice.

The Hybrid Model: Supervisors and Minions

Think of it as a teamwork model:

  • Remote Model (Supervisor): Smarter and more capable, but expensive. It doesn’t do the heavy lifting; it directs the workflow.
  • Local Models (Minions): Lightweight and inexpensive. They handle the bulk of the work in parallel, following the supervisor’s instructions.

Here’s how it plays out in practice in our new Dockerized Minions integration:

  1. Spin up the Minions application server with docker compose up
  2. A request is sent to the remote model. Instead of processing all the data directly, it generates executable code that defines how to split the task into smaller jobs.
  3. Execute that orchestration code inside the Minions application server, which runs in a Docker container and provides sandboxed isolation.
  4. Local models run those subtasks, analyzing chunks of a large document, summarizing sections, or performing classification in parallel.
  5. The results are sent back to the remote model, which aggregates them into a coherent answer.

The remote model acts like a supervisor, while the local models are the team members doing the work. The result is a division of labor that’s efficient, scalable, and cost-effective.
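The five-step flow above can be condensed into a short sketch. Here the supervisor and minions are plain Python callables standing in for the remote and local model APIs; the chunking strategy and function names are assumptions for illustration (in the real protocol, the remote model generates this orchestration code itself):

```python
from concurrent.futures import ThreadPoolExecutor

def supervisor_plan(document, chunk_size=100):
    """Stand-in for the remote model: split the task into subtasks."""
    return [document[i:i + chunk_size]
            for i in range(0, len(document), chunk_size)]

def minion_work(chunk):
    """Stand-in for a local model analyzing one chunk."""
    return f"summary of {len(chunk)} chars"

def supervisor_aggregate(partials):
    """Stand-in for the remote model merging partial results."""
    return " | ".join(partials)

def run_hybrid(document):
    subtasks = supervisor_plan(document)           # 1 remote call
    with ThreadPoolExecutor() as pool:             # N cheap local calls
        partials = list(pool.map(minion_work, subtasks))
    return supervisor_aggregate(partials)          # 1 remote call

result = run_hybrid("x" * 250)
print(result)
```

The remote model sees only the plan and the partial results, never the full document, which is where the token savings come from.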

Why Hybrid?

  • Cost Reduction: Local models handle most of the tokens and context, thereby reducing cloud model usage.
  • Scalability: By splitting large jobs into smaller ones, workloads scale horizontally across local models.
  • Security: The application server runs in a Docker container, and orchestration code is executed there in a sandboxed environment.
  • Quality: Hybrid protocols pair the cost savings of local execution with the coherence and higher-level reasoning of remote models, delivering better results than local-only setups.
  • Developer Simplicity: Docker Compose ties everything together into a single configuration file, with no messy environment setup.

Research Benchmarks: Validating the Hybrid Approach

The ideas behind this hybrid architecture aren’t just theoretical, they’re backed by research. In this recent research paper Minions: Cost-efficient Collaboration Between On-device and Cloud Language Models, the authors evaluated different ways of combining smaller local models with larger remote models.

The results demonstrate the value of the hybrid design where a local and remote model collaborate on a task:

  • Minion Protocol: A local model interacts directly with the remote model, which reduces cloud usage significantly. This setup achieves a 30.4× reduction in remote inference costs, while maintaining about 87% of the performance of relying solely on the remote model.
  • MinionS Protocol: A local model executes parallel subtasks defined by code generated by the remote model. This structured decomposition achieves a 5.7× cost reduction while preserving ~97.9% of the remote model’s performance.

This is an important validation: hybrid AI architectures can deliver nearly the same quality as high-end proprietary APIs, but at a fraction of the cost.
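To see what a 5.7× reduction means in practice, here is a back-of-the-envelope calculation. The per-token price is a purely hypothetical figure for illustration, not a quote from any provider:

```python
# Hypothetical remote price: $5 per million tokens (illustrative only).
price_per_mtok = 5.00
remote_only_tokens = 1_000_000      # everything sent to the cloud
minions_reduction = 5.7             # MinionS: 5.7x fewer remote tokens

remote_only_cost = remote_only_tokens / 1e6 * price_per_mtok
minions_cost = remote_only_cost / minions_reduction

print(f"remote-only: ${remote_only_cost:.2f}")
print(f"MinionS:     ${minions_cost:.2f}")
```

Local inference isn't free (you pay in hardware and latency instead), but the remote bill, usually the dominant cost, shrinks by the reduction factor.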

For developers, this means you don’t need to choose between quality and cost, you can have both. Using Docker Compose as the orchestration layer, the hybrid MinionS protocol becomes straightforward to implement in a real-world developer workflow.

Compose-Driven Developer Experience

What makes this approach especially attractive for developers is how little configuration it actually requires. 

With Docker Compose, setting up a local AI model doesn’t involve wrestling with dependencies, library versions, or GPU quirks. Instead, the model can be declared as a service in a few simple lines of YAML, making the setup both transparent and reproducible.

models:
  worker:
    model: ai/llama3.2
    context_size: 10000

This short block is all it takes to bring up a worker running a local Llama 3.2 model with a 10k context window. Under the hood, Docker ensures that this configuration is portable across environments, so every developer runs the same setup, without ever needing to install or manage the model manually. 

Please note that, depending on the environment you are running in, Docker Model Runner might run as a host process (Docker Desktop) instead of in a container (Docker CE) to ensure optimal inference performance.
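When a service binds to such a model, Compose injects the endpoint URL and model name as environment variables. The variable names below are assumptions (they depend on your Compose file's endpoint_var/model_var settings); this sketch just builds an OpenAI-compatible chat request from them without sending it:

```python
import json
import os

# Variable names are assumptions; match them to your Compose model binding.
base_url = os.environ.get("WORKER_URL",
                          "http://model-runner.docker.internal/engines/v1/")
model = os.environ.get("WORKER_MODEL", "ai/llama3.2")

def chat_request(prompt):
    """Build (url, body) for an OpenAI-compatible /chat/completions call."""
    url = base_url.rstrip("/") + "/chat/completions"
    body = json.dumps({
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    })
    return url, body

url, body = chat_request("Hello!")
print(url)
```

Any OpenAI-compatible client works the same way: point its base URL at the injected endpoint and pass the injected model name.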

Beyond convenience, containerization adds something essential: security.

In a hybrid system like this, the remote model generates code to orchestrate local execution. By running that code inside a Docker container, it’s safely sandboxed from the host machine. This makes it possible to take full advantage of dynamic orchestration without opening up security risks.

The result is a workflow that feels effortless: declare the model in Compose, start it with a single command, and trust that Docker takes care of both reproducibility and isolation. Hybrid AI becomes not just powerful and cost-efficient, but also safe and developer-friendly.

You can find a complete example ready to use here. In practice, using ai/qwen3 as a local model can cut cloud usage significantly. For a typical workload, only ~15,000 remote tokens are needed, about half the amount required if everything ran on the remote model. 

This reduction comes with a tradeoff: because tasks are split, orchestrated, and processed locally before aggregation, responses may take longer to generate (up to ~10× slower). For many scenarios, the savings in cost and control over data can outweigh the added latency.

Conclusion

Hybrid AI is no longer just an interesting idea, it is a practical path forward for developers who want the power of advanced models while keeping costs low. 

The research behind Minions shows that this approach can preserve nearly all the quality of large remote models while reducing cloud usage dramatically. Docker, in turn, makes the architecture simple to run, easy to reproduce, and secure by design.

By combining remote intelligence with local efficiency, and wrapping it all in a developer-friendly Compose setup, we can better control the tradeoff between capability and cost. What emerges is an AI workflow that is smarter, more sustainable, and accessible to any developer, not just those with deep infrastructure expertise.

This shows a realistic direction for GenAI: not always chasing bigger models, but finding smarter, safer, and more efficient ways to use them. By combining Docker and MinionS, developers already have the tools to experiment with this hybrid approach and start building cost-effective, reproducible AI workflows today. Try it yourself by visiting the project GitHub repo.

Learn more 

]]>
Building AI Agents with Docker MCP Toolkit: A Developer’s Real-World Setup https://www.docker.com/blog/docker-mcp-ai-agent-developer-setup/ Tue, 19 Aug 2025 14:59:46 +0000 https://www.docker.com/?p=75806 Building AI agents in the real world often involves more than just making model calls — it requires integrating with external tools, handling complex workflows, and ensuring the solution can scale in production.

In this post, we’ll walk through a real-world developer setup for creating an agent using the Docker MCP Toolkit.

To make things concrete, I’ve built an agent that takes a Git repository as input and can answer questions about its contents — whether it’s explaining the purpose of a function, summarizing a module, or finding where a specific API call is made. This simple but practical use case serves as a foundation for exploring how agents can interact with real-world data sources and respond intelligently.

I built and ran it using the Docker MCP Toolkit, which made setup and integration fast, portable, and repeatable. This blog walks you through that developer setup and explains why Docker MCP is a game changer for building and running agents.

Use Case: GitHub Repo Question-Answering Agent

The goal: Build an AI agent that can connect to a GitHub repository, retrieve relevant code or metadata, and answer developer questions in plain language.

Example queries:

  • “Summarize this repo: https://github.com/owner/repo”
  • “Where is the authentication logic implemented?”
  • “List main modules and their purpose.”
  • “Explain the function parse_config and show where it’s used.”

This goes beyond a simple code demo — it reflects how developers work in real-world environments:

  • The agent acts like a code-aware teammate you can query anytime.
  • The MCP Gateway handles tooling integration (GitHub API) without bloating the agent code.
  • Docker Compose ties the environment together so it runs the same in dev, staging, or production.

Role of Docker MCP Toolkit

Without MCP Toolkit, you’d spend hours wiring up API SDKs, managing auth tokens, and troubleshooting environment differences.

With MCP Toolkit:

  1. Containerized connectors – Run the GitHub MCP Gateway as a ready-made service (docker/mcp-gateway:latest), no SDK setup required.
  2. Consistent environments – The container image has fixed dependencies, so the setup works identically for every team member.
  3. Rapid integration – The agent connects to the gateway over HTTP; adding a new tool is as simple as adding a new container.
  4. Iterate faster – Restart or swap services in seconds using docker compose.
  5. Focus on logic, not plumbing – The gateway handles the GitHub-specific heavy lifting while you focus on prompt design, reasoning, and multi-agent orchestration.

Role of Docker Compose 

Running everything via Docker Compose means you treat the entire agent environment as a single deployable unit:

  • One-command startup – docker compose up brings up the MCP Gateway (and your agent, if containerized) together.
  • Service orchestration – Compose ensures dependencies start in the right order.
  • Internal networking – Services talk to each other by name (http://mcp-gateway-github:8080) without manual port wrangling.
  • Scaling – Run multiple agent instances for concurrent requests.
  • Unified logging – View all logs in one place for easier debugging.

Architecture Overview

Architecture overview: Docker MCP Gateway to GitHub

This setup connects a developer’s local agent to GitHub through a Dockerized MCP Gateway, with Docker Compose orchestrating the environment. Here’s how it works step-by-step:

  1. User Interaction
  • The developer runs the agent from a CLI or terminal.
  • They type a question about a GitHub repository — e.g., “Where is the authentication logic implemented?”
  2. Agent Processing
  • The Agent (LLM + MCPTools) receives the question.
  • The agent determines that it needs repository data and issues a tool call via MCPTools.
  3. MCPTools → MCP Gateway
  • MCPTools sends the request using streamable-http to the MCP Gateway running in Docker.
  • This gateway is defined in docker-compose.yml and configured for the GitHub server (--servers=github --port=8080).
  4. GitHub Integration
  • The MCP Gateway handles all GitHub API interactions — listing files, retrieving content, searching code — and returns structured results to the agent.
  5. LLM Reasoning
  • The agent sends the retrieved GitHub context to OpenAI GPT-4o as part of a prompt.
  • The LLM reasons over the data and generates a clear, context-rich answer.
  6. Response to User
  • The agent prints the final answer back to the CLI, often with file names and line references.

Code Reference & File Roles

The full source code for this setup is available at this link.

Rather than walk through it line-by-line, here’s what each file does in the real-world developer setup:

docker-compose.yml

  • Defines the MCP Gateway service for GitHub.
  • Runs the docker/mcp-gateway:latest container with GitHub as the configured server.
  • Exposes the gateway on port 8080.
  • Can be extended to run the agent and additional connectors as separate services in the same network.
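Putting those bullets together, the gateway service likely looks something like this. This is a sketch reconstructed from the flags and service name mentioned in this post, not a copy of the repo's actual file:

```yaml
services:
  mcp-gateway-github:
    image: docker/mcp-gateway:latest
    command:
      - --servers=github
      - --port=8080
    ports:
      - "8080:8080"
```

Because the agent in this setup runs on the host, the gateway's port is published; a containerized agent could instead reach it over the Compose network by service name.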

app.py

  • Implements the GitHub Repo Summarizer Agent.
  • Uses MCPTools to connect to the MCP Gateway over streamable-http.
  • Sends queries to GitHub via the gateway, retrieves results, and passes them to GPT-4o for reasoning.
  • Handles the interactive CLI loop so you can type questions and get real-time responses.

In short: the Compose file manages infrastructure and orchestration, while the Python script handles intelligence and conversation.
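A stripped-down version of that division of labor might look like this. The function names and return values are illustrative stand-ins, not the repo's actual code; the real app.py uses MCPTools against the gateway and GPT-4o in place of the stubs:

```python
def ask_gateway(question):
    """Stand-in for an MCP tool call to the GitHub gateway."""
    return {"files": ["auth.py"], "snippet": "def login(): ..."}

def ask_llm(question, context):
    """Stand-in for the GPT-4o reasoning step over retrieved context."""
    return f"Answer to {question!r} using {len(context['files'])} file(s)"

def agent_turn(question):
    # 1. Fetch repository context through the MCP Gateway.
    context = ask_gateway(question)
    # 2. Let the LLM reason over the retrieved context.
    return ask_llm(question, context)

print(agent_turn("Where is the authentication logic?"))
```

The interactive CLI loop is then just `while True: print(agent_turn(input("Enter your query: ")))` around this single turn.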

Setup and Execution

  1. Clone the repository

git clone https://github.com/rajeshsgr/mcp-demo-agents.git
cd mcp-demo-agents

  2. Configure environment

Create a .env file in the root directory and add your OpenAI API key:

OPEN_AI_KEY=<<Insert your OpenAI API key>>

  3. Configure GitHub Access

To allow the MCP Gateway to access GitHub repositories, set your GitHub personal access token:

docker mcp secret set github.personal_access_token=<YOUR_GITHUB_TOKEN>

  4. Start MCP Gateway

Bring up the GitHub MCP Gateway container using Docker Compose:

docker compose up -d

  5. Install Dependencies & Run Agent

python -m venv .venv && source .venv/bin/activate
pip install -r requirements.txt
python app.py

  6. Ask Queries

Enter your query: Summarize https://github.com/owner/repo

Real-World Agent Development with Docker, MCP, and Compose

This setup is built with production realities in mind:

  • Docker ensures each integration (GitHub, databases, APIs) runs in its own isolated container with all dependencies preconfigured.
  • MCP acts as the bridge between your agent and real-world tools, abstracting away API complexity so your agent code stays clean and focused on reasoning.
  • Docker Compose orchestrates all these moving parts, managing startup order, networking, scaling, and environment parity between development, staging, and production.

From here, it’s easy to add:

  • More MCP connectors (Jira, Slack, internal APIs).
  • Multiple agents specializing in different tasks.
  • CI/CD pipelines that spin up this environment for automated testing.

Final Thoughts

By combining Docker for isolation, MCP for seamless tool integration, and Docker Compose for orchestration, we’ve built more than just a working AI agent — we’ve created a repeatable, production-ready development pattern. This approach removes environment drift, accelerates iteration, and makes it simple to add new capabilities without disrupting existing workflows. Whether you’re experimenting locally or deploying at scale, this setup ensures your agents are reliable, maintainable, and ready to handle real-world demands from day one.

Before vs. After: The Developer Experience

| Aspect | Without Docker + MCP + Compose | With Docker + MCP + Compose |
| --- | --- | --- |
| Environment Setup | Manual SDK installs, dependency conflicts, “works on my machine” issues. | Prebuilt container images with fixed dependencies ensure identical environments everywhere. |
| Integration with Tools (GitHub, Jira, etc.) | Custom API wiring in the agent code; high maintenance overhead. | MCP handles integrations in separate containers; agent code stays clean and focused. |
| Startup Process | Multiple scripts/terminals; manual service ordering. | docker compose up launches and orchestrates all services in the right order. |
| Networking | Manually configuring ports and URLs; prone to errors. | Internal Docker network with service name resolution (e.g., http://mcp-gateway-github:8080). |
| Scalability | Scaling services requires custom scripts and reconfigurations. | Scale any service instantly with docker compose up --scale. |
| Extensibility | Adding a new integration means changing the agent’s code and redeploying. | Add new MCP containers to docker-compose.yml without modifying the agent. |
| CI/CD Integration | Hard to replicate environments in pipelines; brittle builds. | Same Compose file works locally, in staging, and in CI/CD pipelines. |
| Iteration Speed | Restarting services or switching configs is slow and error-prone. | Containers can be stopped, replaced, and restarted in seconds. |

]]>
Build a Recipe AI Agent with Koog and Docker https://www.docker.com/blog/build-a-recipe-ai-agent-with-koog-and-docker/ Fri, 08 Aug 2025 16:23:59 +0000 https://www.docker.com/?p=75484 Hi, I’m Philippe Charriere, a Principal Solutions Architect at Docker. I like to test new tools and see how they fit into real-world workflows. Recently, I set out to see if JetBrains’ Koog framework could run with Docker Model Runner, and what started as a quick test turned into something a lot more interesting than I expected. In this new blog post, we’ll explore how to create a small Koog agent specializing in ratatouille recipes using popular Docker AI tools (disclaimer: I’m French). We’ll be using:

Prerequisites: Kotlin project initialization

Step 1: Gradle Configuration

Here’s my build configuration: build.gradle.kts

plugins {
    kotlin("jvm") version "2.1.21"
    application
}

group = "kitchen.ratatouille"
version = "1.0-SNAPSHOT"

repositories {
    mavenCentral()
}

dependencies {
    testImplementation(kotlin("test"))
    implementation("ai.koog:koog-agents:0.3.0")
    implementation("org.slf4j:slf4j-simple:2.0.9")

}

application {
    mainClass.set("kitchen.ratatouille.MainKt")
}

tasks.test {
    useJUnitPlatform()
}

tasks.jar {
    duplicatesStrategy = DuplicatesStrategy.EXCLUDE
    manifest {
        attributes("Main-Class" to "kitchen.ratatouille.MainKt")
    }
    from(configurations.runtimeClasspath.get().map { if (it.isDirectory) it else zipTree(it) })
}

kotlin {
    jvmToolchain(23)
}

Step 2: Docker Compose Project Configuration

The new “agentic” feature of Docker Compose allows defining the models to be used by Docker Compose services.

With the content below, I define that I will use the hf.co/menlo/lucy-128k-gguf:q4_k_m model from Hugging Face for my “Koog agent”.

models:
  app_model:
    model: hf.co/menlo/lucy-128k-gguf:q4_k_m

Then, at the service level, I “link” the koog-app service to the app_model model as follows:

models:
      app_model:
        endpoint_var: MODEL_RUNNER_BASE_URL
        model_var: MODEL_RUNNER_CHAT_MODEL

Docker Compose will automatically inject the MODEL_RUNNER_BASE_URL and MODEL_RUNNER_CHAT_MODEL environment variables into the koog-app service, which allows the Koog agent to connect to the model.

If you entered interactive mode in the koog-app container, you could verify that the environment variables are properly defined with the command:

env | grep '^MODEL_RUNNER'

And you would get something like:

MODEL_RUNNER_BASE_URL=http://model-runner.docker.internal/engines/v1/
MODEL_RUNNER_CHAT_MODEL=hf.co/menlo/lucy-128k-gguf:q4_k_m

It’s entirely possible to define multiple models.
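For example, a second model could be declared at the top level and mapped to its own pair of environment variables at the service level. The sketch below is illustrative (the second model and the variable names are assumptions, not part of the original project):

```yaml
models:
  app_model:
    model: hf.co/menlo/lucy-128k-gguf:q4_k_m
  # Hypothetical second model, e.g. for embeddings
  embedding_model:
    model: ai/mxbai-embed-large

services:
  koog-app:
    models:
      app_model:
        endpoint_var: MODEL_RUNNER_BASE_URL
        model_var: MODEL_RUNNER_CHAT_MODEL
      embedding_model:
        # Each model gets its own variables, injected alongside the first pair
        endpoint_var: EMBEDDING_BASE_URL
        model_var: EMBEDDING_MODEL
```

Each mapped model produces its own pair of environment variables in the service, so the application code can address them independently.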

The complete compose.yaml file looks like this:

services:

  koog-app:
    build:
      context: .
      dockerfile: Dockerfile
    environment:
      SYSTEM_PROMPT: You are a helpful cooking assistant.
      AGENT_INPUT: How to cook a ratatouille?
    models:
      app_model:
        endpoint_var: MODEL_RUNNER_BASE_URL
        model_var: MODEL_RUNNER_CHAT_MODEL

models:
  app_model:
    model: hf.co/menlo/lucy-128k-gguf:q4_k_m

Step 3: Dockerfile

Next, we’ll need a Dockerfile to build the Docker image of our Koog application. The Dockerfile uses a multi-stage build to optimize the final image size, so it’s divided into two stages: one for building the application (build) and one for running it (runtime). Here’s the content of the Dockerfile:

# Stage 1: Build
FROM eclipse-temurin:23-jdk-noble AS build

WORKDIR /app

COPY gradlew .
COPY gradle/ gradle/
COPY build.gradle.kts .
COPY settings.gradle.kts .

RUN chmod +x ./gradlew

COPY src/ src/

# Build
RUN ./gradlew clean build

# Stage 2: Runtime
FROM eclipse-temurin:23-jre-noble AS runtime

WORKDIR /app

COPY --from=build /app/build/libs/ratatouille-1.0-SNAPSHOT.jar app.jar
CMD ["java", "-jar", "app.jar"]

Step 4: Kotlin Side

Connecting to Docker Model Runner

Now, here’s the source code of our application, in the src/main/kotlin/Main.kt file to be able to use Docker Model Runner. The API exposed by Docker Model Runner is compatible with the OpenAI API, so we’ll use Koog’s OpenAI client to interact with our model:

package kitchen.ratatouille

import ai.koog.prompt.executor.clients.openai.OpenAIClientSettings
import ai.koog.prompt.executor.clients.openai.OpenAILLMClient

suspend fun main() {

    // Docker Model Runner does not check API keys, but the client requires one
    val apiKey = "nothing"
    // Both variables are injected by Docker Compose (see the models section of compose.yaml)
    val customEndpoint = System.getenv("MODEL_RUNNER_BASE_URL").removeSuffix("/")
    val model = System.getenv("MODEL_RUNNER_CHAT_MODEL")

    val client = OpenAILLMClient(
        apiKey = apiKey,
        settings = OpenAIClientSettings(customEndpoint)
    )
}

First Koog Agent

Creating an agent with Koog is relatively simple as you can see in the code below. We’ll need:

  • a SingleLLMPromptExecutor that will use the OpenAI client we created previously to execute requests to the model.
  • an LLModel that will define the model we’re going to use.
  • an AIAgent that will encapsulate the model and the prompt executor to execute requests.

Regarding the prompt, I use the SYSTEM_PROMPT environment variable to define the agent’s system prompt, and AGENT_INPUT to define the agent’s input (the “user message”). These variables were defined in the compose.yaml file previously:

environment:
      SYSTEM_PROMPT: You are a helpful cooking assistant.
      AGENT_INPUT: How to cook a ratatouille?

And here’s the complete code of the Koog agent in the src/main/kotlin/Main.kt file:

package kitchen.ratatouille

import ai.koog.agents.core.agent.AIAgent
import ai.koog.prompt.executor.clients.openai.OpenAIClientSettings
import ai.koog.prompt.executor.clients.openai.OpenAILLMClient
import ai.koog.prompt.executor.llms.SingleLLMPromptExecutor
import ai.koog.prompt.llm.LLMCapability
import ai.koog.prompt.llm.LLMProvider
import ai.koog.prompt.llm.LLModel

suspend fun main() {

    val apiKey = "nothing"
    val customEndpoint = System.getenv("MODEL_RUNNER_BASE_URL").removeSuffix("/")
    val model = System.getenv("MODEL_RUNNER_CHAT_MODEL")

    val client = OpenAILLMClient(
        apiKey=apiKey,
        settings = OpenAIClientSettings(customEndpoint)
    )

    val promptExecutor = SingleLLMPromptExecutor(client)

    val llmModel = LLModel(
        provider = LLMProvider.OpenAI,
        id = model,
        capabilities = listOf(LLMCapability.Completion)
    )

    val agent = AIAgent(
        executor = promptExecutor,
        systemPrompt = System.getenv("SYSTEM_PROMPT"),
        llmModel = llmModel,
        temperature = 0.0
    )

    val recipe = agent.run(System.getenv("AGENT_INPUT"))

    println("Recipe:\n $recipe")

}

Running the project

All that’s left is to launch the project with the following command:

docker compose up --build --no-log-prefix

Then wait a moment; depending on your machine, the build and completion will take more or less time. I nevertheless chose Lucy 128k because it can run on small configurations, even without a GPU. This model also has the advantage of being quite good at “function calling” detection despite its small size (however, it doesn’t support parallel tool calls). You should finally get something like this in the console:

Recipe:
 Sure! Here's a step-by-step guide to cooking a classic ratatouille:

---

### **Ingredients**  
- 2 boneless chicken thighs or 1-2 lbs rabbit (chicken is common, but rabbit is traditional)  
- 1 small onion (diced)  
- 2 garlic cloves (minced)  
- 1 cup tomatoes (diced)  
- 1 zucchini (sliced)  
- 1 yellow squash or eggplant (sliced)  
- 1 bell pepper (sliced)  
- 2 medium potatoes (chopped)  
- 1 red onion (minced)  
- 2 tbsp olive oil  
- 1 tbsp thyme (or rosemary)  
- Salt and pepper (to taste)  
- Optional: 1/4 cup wine (white or red) to deglaze the pan  

---

### **Steps**  
1. **Prep the Ingredients**  
   - Dice the onion, garlic, tomatoes, zucchini, squash, bell pepper, potatoes.  
   - Sauté the chicken in olive oil until browned (about 10–15 minutes).  
   - Add the onion and garlic, sauté for 2–3 minutes.  

2. **Add Vegetables & Flavor**  
   - Pour in the tomatoes, zucchini, squash, bell pepper, red onion, and potatoes.  
   - Add thyme, salt, pepper, and wine (if using). Stir to combine.  
   - Add about 1 cup water or stock to fill the pot, if needed.  

3. **Slow Cook**  
   - Place the pot in a large pot of simmering water (or use a Dutch oven) and cook on low heat (around 200°F/90°C) for about 30–40 minutes, or until the chicken is tender.  
   - Alternatively, use a stovetop pot with a lid to cook the meat and vegetables together, simmering until the meat is cooked through.  

4. **Finish & Serve**  
   - Remove the pot from heat and let it rest for 10–15 minutes to allow flavors to meld.  
   - Stir in fresh herbs (like rosemary or parsley) if desired.  
   - Serve warm with crusty bread or on the plate as is.  

---

### **Tips**  
- **Meat Variations**: Use duck or other meats if you don't have chicken.  
- **Vegetables**: Feel free to swap out any vegetables (e.g., mushrooms, leeks).  
- **Liquid**: If the mixture is too dry, add a splash of water or stock.  
- **Serving**: Ratatouille is often eaten with bread, so enjoy it with a side of crusty bread or a simple salad.  

Enjoy your meal! 

As you can see, it’s quite simple to create an agent with Koog and Docker Model Runner! 

But we have a problem. I told you I was French, and the ratatouille recipe proposed by Lucy 128k doesn’t really suit me: there’s no rabbit, chicken, or duck in a ratatouille! But let’s see how to fix that.

Let’s add superpowers to our Koog agent with the Docker MCP Gateway

What I’d like to do now is have my application first search for information about ratatouille ingredients, and then have the Koog agent use this information to improve the recipe. For this, I’d like to use the DuckDuckGo MCP server that’s available on the Docker MCP Hub. And to make my life easier, I’m going to use the Docker MCP Gateway to access this MCP server.

Configuring the Docker MCP Gateway in Docker Compose

To use the Docker MCP Gateway, I’ll first modify the compose.yaml file to add the gateway configuration.

Configuring the gateway in the compose.yaml file

Here’s the configuration I added for the gateway in the compose.yaml file:

 mcp-gateway:
    image: docker/mcp-gateway:latest
    command:
      - --port=8811
      - --transport=sse
      - --servers=duckduckgo
      - --verbose
    volumes:
      - /var/run/docker.sock:/var/run/docker.sock

This configuration will create an mcp-gateway service that will listen on port 8811 and use the sse (Server-Sent Events) transport to communicate with MCP servers.

Important:

  • with --servers=duckduckgo, I can filter the available MCP servers to only use the DuckDuckGo server.
  • the MCP Gateway will automatically pull the available MCP servers from the Docker MCP Hub.

The MCP Gateway is an open-source project that you can find here
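Note that with this configuration, the gateway port is only reachable on the internal Compose network. If you want to inspect the SSE endpoint from your host machine (with curl, for example), you could additionally publish the port; this is an optional debugging tweak, not part of the setup in this post:

```yaml
  mcp-gateway:
    image: docker/mcp-gateway:latest
    command:
      - --port=8811
      - --transport=sse
      - --servers=duckduckgo
      - --verbose
    volumes:
      - /var/run/docker.sock:/var/run/docker.sock
    ports:
      - "8811:8811"  # optional: expose the SSE endpoint on the host for debugging
```

The koog-app service doesn’t need this mapping, since it reaches the gateway through the internal network by service name.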

Next, I’ll modify the koog-app service so it can communicate with the gateway by adding the MCP_HOST environment variable that will point to the gateway URL, as well as the dependency on the mcp-gateway service:

environment:
      MCP_HOST: http://mcp-gateway:8811/sse
    depends_on:
      - mcp-gateway      

I’ll also modify the system prompt and user message:

environment:
      SYSTEM_PROMPT: |
        You are a helpful cooking assistant.
        Your job is to understand the user prompt and decide if you need to use tools to run external commands.
      AGENT_INPUT: |
        Search for the ingredients to cook a ratatouille, max result 1
        Then, from these found ingredients, generate a yummy ratatouille recipe
        Do it only once

So here’s the complete compose.yaml file with the MCP Gateway configuration and the modifications made to the koog-app service:

services:

  koog-app:
    build:
      context: .
      dockerfile: Dockerfile
    environment:
      SYSTEM_PROMPT: |
        You are a helpful cooking assistant.
        Your job is to understand the user prompt and decide if you need to use tools to run external commands.
      AGENT_INPUT: |
        Search for the ingredients to cook a ratatouille, max result 1
        Then, from these found ingredients, generate a yummy ratatouille recipe
        Do it only once
      MCP_HOST: http://mcp-gateway:8811/sse
    depends_on:
      - mcp-gateway
    models:
      app_model:
        # NOTE: populate the environment variables with the model runner endpoint and model name
        endpoint_var: MODEL_RUNNER_BASE_URL
        model_var: MODEL_RUNNER_CHAT_MODEL

  mcp-gateway:
    image: docker/mcp-gateway:latest
    command:
      - --port=8811
      - --transport=sse
      - --servers=duckduckgo
      - --verbose
    volumes:
      - /var/run/docker.sock:/var/run/docker.sock

models:
  app_model:
    model: hf.co/menlo/lucy-128k-gguf:q4_k_m 

Now, let’s modify the Kotlin code to use the MCP Gateway and search for ratatouille ingredients.

Modifying the Kotlin code to use the MCP Gateway

The modification is extremely simple; you just need to:

  • define the MCP transport (SseClientTransport) with the gateway URL: val transport = McpToolRegistryProvider.defaultSseTransport(System.getenv("MCP_HOST"))
  • create the MCP tools registry from the transport: val toolRegistry = McpToolRegistryProvider.fromTransport(transport = transport, name = "sse-client", version = "1.0.0")
  • and finally, pass the tools registry to the Koog agent constructor: toolRegistry = toolRegistry

Extremely important: I added capabilities = listOf(LLMCapability.Completion, LLMCapability.Tools) for the LLM model, because we’re going to use its “function calling” capabilities (the tools are defined and provided by the MCP server).

Here’s the complete code of the Koog agent modified to use the MCP Gateway in the src/main/kotlin/Main.kt file:

package kitchen.ratatouille

import ai.koog.agents.core.agent.AIAgent
import ai.koog.agents.mcp.McpToolRegistryProvider
import ai.koog.prompt.executor.clients.openai.OpenAIClientSettings
import ai.koog.prompt.executor.clients.openai.OpenAILLMClient
import ai.koog.prompt.executor.llms.SingleLLMPromptExecutor
import ai.koog.prompt.llm.LLMCapability
import ai.koog.prompt.llm.LLMProvider
import ai.koog.prompt.llm.LLModel

suspend fun main() {

    val transport = McpToolRegistryProvider.defaultSseTransport(System.getenv("MCP_HOST"))
    // Create a tool registry with tools from the MCP server
    val toolRegistry = McpToolRegistryProvider.fromTransport(
        transport = transport,
        name = "sse-client",
        version = "1.0.0"
    )
    println(toolRegistry.tools)

    val apiKey = "nothing"
    val customEndpoint = System.getenv("MODEL_RUNNER_BASE_URL").removeSuffix("/")
    val model = System.getenv("MODEL_RUNNER_CHAT_MODEL")

    val client = OpenAILLMClient(
        apiKey=apiKey,
        settings = OpenAIClientSettings(customEndpoint)
    )

    val promptExecutor = SingleLLMPromptExecutor(client)

    val llmModel = LLModel(
        provider = LLMProvider.OpenAI,
        id = model,
        capabilities = listOf(LLMCapability.Completion, LLMCapability.Tools)
    )

    val agent = AIAgent(
        executor = promptExecutor,
        systemPrompt = System.getenv("SYSTEM_PROMPT"),
        llmModel = llmModel,
        temperature = 0.0,
        toolRegistry = toolRegistry
    )

    val recipe = agent.run(System.getenv("AGENT_INPUT"))

    println("Recipe:\n $recipe")

}

Launching the project with the MCP Gateway

Let’s launch the project again with the command:

docker compose up --build --no-log-prefix

And after a while, you should get a new ratatouille recipe, but this time the LLM will have relied on the search results from the DuckDuckGo MCP server (via the MCP Gateway) to improve the recipe. The response time will be a bit longer because the LLM first queries the MCP server to get the ratatouille ingredients, then generates the recipe. The DuckDuckGo MCP server will search for links and then retrieve the content of those links (indeed, it exposes 2 tools: search and fetch_content).

Here’s an example of what you might get with an improved and more “authentic” ratatouille recipe:

Recipe:
 Here's a **complete and easy-to-follow version** of **Ratatouille**, based on the recipe you provided, with tips and variations to suit your preferences:

---

###  **What Is Ratatouille?**  
A classic French vegetable stew, traditionally made with eggplant, tomatoes, zucchini, bell peppers, onions, and mushrooms. It's often seasoned with herbs like parsley, thyme, or basil and paired with crusty bread or a light sauce.

---

###  **Ingredients** (for 4 servings):  
- **1/2 cup olive oil** (divided)  
- **2 tbsp olive oil** (for the skillet)  
- **3 cloves garlic**, minced  
- **1 eggplant**, cubed  
- **2 zucchinis**, sliced  
- **2 large tomatoes**, chopped  
- **2 cups fresh mushrooms**, sliced  
- **1 large onion**, sliced  
- **1 green or red bell pepper**, sliced  
- **1/2 tsp dried parsley**  
- **Salt to taste**  
- **1/2 cup grated Parmesan cheese** (or pecorino, as you mentioned)  

---

###  **How to Make Ratatouille**  
**Preheat oven** to 350°F (175°C).  

1. **Prepare the dish**: Coat a 1½-quart casserole dish with 1 tbsp olive oil.  
2. **Cook the base**: In a skillet, sauté garlic until fragrant (about 1–2 minutes). Add eggplant, parsley, and salt; cook for 10 minutes until tender.  
3. **Layer the vegetables**: Spread the eggplant mixture in the dish, then add zucchini, tomatoes, mushrooms, onion, and bell pepper. Top with Parmesan.  
4. **Bake**: Cover and bake for 45 minutes. Check for tenderness; adjust time if needed.  

**Cook's Note**:  
- Add mushrooms (optional) or omit for a traditional flavor.  
- Use fresh herbs like thyme or basil if preferred.  
- Substitute zucchini with yellow squash or yellow bell pepper for color.  

---

###  **How to Serve**  
- **Main dish**: Serve with crusty French bread or rice.  
- **Side**: Pair with grilled chicken or fish.  
- **Guilt-free twist**: Add black olives or a sprinkle of basil/others for a lighter version.  

---

Conclusion

This blog post perfectly illustrates the modern containerized AI ecosystem that Docker is building. By combining Docker Model Runner, Agentic Compose, Docker MCP Gateway, and the Koog framework (but we could of course use other frameworks), we were able to create an “intelligent” agent quite simply.

  • Docker Model Runner allowed us to use an AI model locally.
  • Agentic Compose simplified the integration of the model into our application by automatically injecting the necessary environment variables.
  • The Docker MCP Gateway transformed our little agent into a system capable of interacting with the outside world.
  • The Koog framework allowed us to orchestrate these components in Kotlin.

Soon, I’ll go deeper into the MCP Gateway and how to use it with your own MCP servers, and not just with Koog. And I’m continuing my explorations with Koog and Docker Model Runner. The entire source code of this project is available here.

Learn more

Compose Editing Evolved: Schema-Driven and Context-Aware https://www.docker.com/blog/compose-editing-evolved-schema-driven-and-context-aware/ Tue, 22 Jul 2025 15:00:00 +0000

Every day, thousands of developers are creating and editing Compose files. At Docker, we are regularly adding more features to Docker Compose, such as the new provider services capability that lets you run AI models as part of your multi-container applications with Docker Model Runner. We know that providing a first-class editing experience for Compose files is key to empowering our users to ship amazing products that will delight their customers. We are pleased to announce today some new additions to the Docker Language Server that will make authoring Compose files easier than ever before.

Schema-Driven Features

To help you stay on the right track as you edit your Compose file, the Docker Language Server brings the Compose specification into the editor to help minimize window switching and keeps you in your editor where you are most productive.

Figure 1: Leverage hover tooltips to quickly understand what a specific Compose attribute is for.

Context-Aware Intelligence

Although attribute names and types can be inferred from the Compose specification, certain attributes have a contextual meaning on them and reference values of different attributes or content from another file. The Docker Language Server understands these relationships and will suggest the available values so that there is no guesswork on your part.

Figure 2: Code completion understands how your files are connected and will only give you suggestions that are relevant in your current context.

Freedom of Choice

The Docker Language Server is built on the Language Server Protocol (LSP) which means you can connect it with any LSP-compatible editor of your choosing. Whatever editor you like using, we will be right there with you to guide you along your software development journey.

Figure 3: The Docker Language Server can run in any LSP-compliant editor such as the JetBrains IDE with the LSP4IJ plugin.

Conclusion

Docker Compose is a core part of hundreds of companies’ development cycles. By offering a feature-rich editing experience with the Docker Language Server, developers everywhere can test and ship their products faster than ever before. Install the Docker DX extension for Visual Studio Code today or download the Docker Language Server to integrate it with your favourite editor.

What’s Next

Your feedback is critical in helping us improve and shape the Docker DX extension and the Docker Language Server.

If you encounter any issues or have ideas for enhancements that you would like to see, please let us know:

We’re listening and excited to keep making things better for you!

Learn More

Docker Unveils the Future of Agentic Apps at WeAreDevelopers https://www.docker.com/blog/wearedevelopers-docker-unveils-the-future-of-agentic-apps/ Mon, 21 Jul 2025 22:53:28 +0000

Agentic applications – what actually are they, and how do we make them easier to build, test, and deploy? At WeAreDevelopers, we defined agentic apps as those that use LLMs to define execution workflows based on desired goals, with access to your tools, data, and systems.

While there are new elements to this application stack, there are many aspects that feel very similar. In fact, many of the same problems experienced with microservices now exist with the evolution of agentic applications.

Therefore, we feel strongly that teams should be able to use the same processes and tools, but with the new agentic stack. Over the past few months, we’ve been working to evolve the Docker tooling to make this a reality and we were excited to share it with the world at WeAreDevelopers.

Let’s unpack those announcements, as well as dive into a few other things we’ve been working on!


Docker Captain Alan Torrance from JPMC with Docker COO Mark Cavage and WeAreDevelopers organizers

WeAreDevelopers keynote announcements

Mark Cavage, Docker’s President and COO, and Tushar Jain, Docker’s EVP of Product and Engineering, took the stage for a keynote at WeAreDevelopers and shared several exciting new announcements – Compose for agentic applications, native Google Cloud support for Compose, and Docker Offload. Watch the keynote in its entirety here.

WeAreDevelopers World Congress 2025 © WARDA NETWORK

Docker EVP, Product & Engineering Tushar Jain delivering the keynote at WeAreDevelopers

Compose has evolved to support agentic applications

Agentic applications need three things – models, tools, and your custom code that glues it all together. 

The Docker Model Runner provides the ability to download and run models.

The newly open-sourced MCP Gateway provides the ability to run containerized MCP servers, giving your application access to the tools it needs in a safe and secure manner.

With Compose, you can now define and connect all three in a single compose.yaml file! 

Here’s an example of a Compose file bringing it all together:

# Define the models
models:
  gemma3:
    model: ai/gemma3

services:
  # Define a MCP Gateway that will provide the tools needed for the app
  mcp-gateway:
    image: docker/mcp-gateway
    command: --transport=sse --servers=duckduckgo
    use_api_socket: true

  # Connect the models and tools with the app
  app:
    build: .
    models:
      gemma3:
        endpoint_var: OPENAI_BASE_URL
        model_var: OPENAI_MODEL
    environment:
      MCP_GATEWAY_URL: http://mcp-gateway:8811/sse

The application can leverage a variety of agentic frameworks – ADK, Agno, CrewAI, Vercel AI SDK, Spring AI, and more. 

Check out the newly released compose-for-agents sample repo for examples using a variety of frameworks and ideas.

Taking Compose to production with native cloud-provider support

During the keynote, we shared the stage with Google Cloud’s Engineering Director Yunong Xiao to demo how you can now easily deploy your Compose-based applications with Google Cloud Run. With this capability, the same Compose specification works from dev to prod – no rewrites and no reconfig. 


Google Cloud’s Engineering Director Yunong Xiao announcing the native Compose support in Cloud Run

With Google Cloud Run (via gcloud run compose up) and soon Microsoft Azure Container Apps, you can deploy apps to serverless platforms with ease. It already has support for the newly released model support too!

Compose makes the entire journey from dev to production consistent, portable, and effortless – just the way applications should be.

Learn more about the Google Cloud Run support with their announcement post here.

Announcing Docker Offload – access to cloud-based compute resources and GPUs during development and testing

Running LLMs requires a significant amount of compute resources and large GPUs. Not every developer has access to those resources on their local machines. Docker Offload allows you to run your containers and models using cloud resources, yet still feel local. Port publishing and bind mounts? It all just works.

No complex setup, no GPU shortages, no configuration headaches. It’s a simple toggle switch in Docker Desktop. Sign up for our beta program and get 300 free GPU minutes!

Getting hands-on with our new agentic Compose workshop


Selfie at the end of the workshop with attendees

At WeAreDevelopers, we released and ran a workshop to enable everyone to get hands-on with the new Compose capabilities, and the response blew us away!

In the room, every seat was filled, and a line remained outside well into the workshop, hoping folks would leave early to open up a spot. But not a single person left early! It was thrilling to see attendees stay fully engaged for the entire workshop.

During the workshop, participants were able to learn about the agentic application stack, digging deep into models, tools, and agentic frameworks. They used Docker Model Runner, the Docker MCP Gateway, and the Compose integrations to package it all together. 

Want to try the workshop yourself? Check it out on GitHub at dockersamples/workshop-agentic-compose.

Lightning talks that sparked ideas


Lightning talk on testing with LLMs in the Docker booth

In addition to the workshop, we hosted a rapid-fire series of lightning talks in our booth on a range of topics. These talks were intended to inspire additional use cases and ideas for agentic applications:

  • Digging deep into the fundamentals of GenAI applications
  • Going beyond the chatbot with event-driven agentic applications
  • Using LLMs to perform semantic testing of applications and websites
  • Using Gordon to build safer and more secure images with Docker Hardened Images

These talks made it clear: agentic apps aren’t just a theory—they’re here, and Docker is at the center of how they get built.

Stay tuned for future blog posts that dig deeper into each of these topics. 

Sharing our industry insights and learnings

At WeAreDevelopers, our UX Research team was able to present their findings and insights after analyzing the past three years of Docker-sponsored industry research. And interestingly, the AI landscape is already starting to have an impact in language selection, attitudes toward trends like shifted-left security, and more!


Julia Wilson on stage sharing insights from the Docker UX research team

To learn more about the insights, view the talk here.

Bringing a European Powerhouse to North America

In addition to the product announcements, we announced a major co‑host partnership between Docker and WeAreDevelopers, launching WeAreDevelopers World Congress North America, set for September 2026. Personally, I’m super excited for this because WeAreDevelopers is a genuine developer-first conference – it covers topics at all levels, has an incredibly fun and exciting atmosphere, live coding hackathons, and helps developers find jobs and further their career!

The 2026 WeAreDevelopers World Congress North America will mark the event’s first major expansion outside Europe. This creates a new, developer-first alternative to traditional enterprise-style conferences, with high-energy talks, live coding, and practical takeaways tailored to real builders.


Docker Captains Mohammad-Ali A’râbi (left) and Francesco Ciulla (right) in attendance with Docker Principal Product Manager Francesco Corti (center)

Try it, build it, contribute

We’re excited to support this next wave of AI-native applications. If you’re building with agentic AI, try out these tools in your workflow today. Agentic apps are complex, but with Docker, they don’t have to be hard. Let’s build cool stuff together.
