LLM – Docker
https://www.docker.com

Permission-Aware RAG: End-to-End Testing with the SpiceDB Testcontainer
https://www.docker.com/blog/rag-permission-testing-testcontainers-spicedb/
Thu, 15 Jan 2026 14:00:00 +0000

We use GenAI in every facet of technology now – internal knowledge bases, customer support systems, and code review bots, to name just a few use cases. And in nearly every one of these, someone eventually asks:

What stops the model from returning something the user shouldn’t see?

This is a roadblock that companies building RAG features or AI agents eventually hit – the moment an LLM returns data from a document the user was not authorized to access, introducing potential legal, financial, and reputational risk to all parties. Unfortunately, traditional methods of authorization are not suited to the hierarchical, dynamic nature of access control in RAG. This is exactly where modern authorization systems such as SpiceDB shine: building fine-grained authorization for filtering content in your AI-powered applications.

In fact, OpenAI uses SpiceDB to secure 37 billion documents for 5 million users of ChatGPT Connectors – a feature that brings your data from sources such as Google Drive, Dropbox, and GitHub into ChatGPT.

This blog post shows how you can pair SpiceDB with Testcontainers to test the permission logic inside your RAG pipeline, end-to-end, automatically, with zero infrastructure dependencies. The example repo can be found here.

Quick Primer on Authorization

Before diving into implementation, let’s clarify two foundational concepts: Authentication (verifying who a user is) and Authorization (deciding what they can access).

Authorization is commonly implemented via techniques such as:

  • Access Control Lists (ACLs)
  • Role-Based Access Control (RBAC)
  • Attribute-Based Access Control (ABAC)

However, for complex, dynamic, and context-rich applications like RAG pipelines, traditional methods such as RBAC or ABAC fall short. The new kid on the block, ReBAC (Relationship-Based Access Control), is ideal because it models access as a graph of relationships rather than fixed rules, providing the flexibility and scalability required.

ReBAC was popularized in Google Zanzibar, the internal authorization system Google built to manage permissions across all its products (e.g., Google Docs, Drive). Zanzibar systems are optimized for low-latency, high-throughput authorization checks, and global consistency – requirements that are well-suited for RAG systems.
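The core ReBAC idea can be illustrated with a toy graph walk. This is a language-agnostic sketch in Python, not how SpiceDB is implemented; the data and the `read = owner + viewer` rule are made up here to mirror the Google Docs-style example that follows.

```python
# Toy ReBAC check: "can user U read doc D?" is answered by walking a
# relationship graph, not by consulting a static role table.
RELATIONSHIPS = {
    "document:doc1": {"owner": {"user:emilia"}, "viewer": {"user:beatrice"}},
}

def can_read(user: str, doc: str) -> bool:
    rels = RELATIONSHIPS.get(doc, {})
    # permission read = owner + viewer
    return user in rels.get("owner", set()) | rels.get("viewer", set())

print(can_read("user:emilia", "document:doc1"))   # owner -> True
print(can_read("user:charlie", "document:doc1"))  # no relationship -> False
```

A real system like SpiceDB evaluates the same kind of question over a distributed relationship graph, with indirection (groups, folders, inheritance) resolved recursively.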

SpiceDB is the most scalable open-source implementation of Google’s Zanzibar authorization model. It stores access as a relationship graph, where the fundamental check reduces to: 

Is this actor allowed to perform this action on this resource?

For a Google Docs-style example:

definition user {}
definition document {
  relation reader: user
  relation writer: user

  permission read = reader + writer
  permission write = writer
}

This schema defines object types (user and document), explicit Relations between the objects (reader, writer), and derived Permissions (read, write). SpiceDB evaluates the relationship graph in microseconds, enabling real-time authorization checks at massive scale.

Access Control for RAG 

RAG (Retrieval-Augmented Generation) is an architectural pattern that enhances Large Language Models (LLMs) by letting them consult an external knowledge base, typically involving a Retriever component finding document chunks and the LLM generating an informed response.

This pattern is now used by businesses and enterprises for apps like chatbots that query sensitive data such as customer playbooks or PII – all stored in a vector database for performance. However, the fundamental risk in this flow is data leakage: the Retriever component ignores permissions, and the LLM will happily summarize unauthorized data. In fact, OWASP's Top 10 Risks for Large Language Model Applications list includes Sensitive Information Disclosure, Excessive Agency, and Vector and Embedding Weaknesses. The consequences of this leakage can be severe, ranging from loss of customer trust to massive financial and reputational damage from compliance violations.

This setup desperately needs fine-grained authorization, and that's where SpiceDB comes in. SpiceDB can post-filter retrieved documents by performing real-time authorization checks, ensuring the model only uses data the querying user is permitted to see. The only requirement is that the documents have metadata indicating where the information came from. But testing this critical permission logic without mocks, manual Docker setup, or flaky Continuous Integration (CI) environments is tricky. Testcontainers provides the perfect solution, allowing you to spin up a real, production-grade, and disposable SpiceDB instance inside your unit tests to deterministically verify that your RAG pipeline respects permissions end-to-end.
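Conceptually, the post-filtering step is a permission check per retrieved chunk before anything reaches the LLM. Here is a minimal Python sketch of that pattern; `check_permission` is a stand-in for a real SpiceDB call, and the chunk metadata fields are assumptions for the illustration.

```python
# Hypothetical sketch of permission-aware post-filtering in a RAG pipeline.
# check_permission stands in for a SpiceDB CheckPermission call; only the
# filtering pattern itself is the point here.

def check_permission(user_id: str, doc_id: str) -> bool:
    # Fake permission data mirroring the example later in the post.
    allowed = {
        ("emilia", "doc1"),
        ("beatrice", "doc2"),
        ("emilia", "doc3"), ("beatrice", "doc3"), ("charlie", "doc3"),
    }
    return (user_id, doc_id) in allowed

def filter_retrieved(user_id, chunks):
    """Keep only chunks whose source document the user may read."""
    return [c for c in chunks if check_permission(user_id, c["doc_id"])]

chunks = [
    {"doc_id": "doc1", "text": "Internal roadmap..."},
    {"doc_id": "doc2", "text": "Customer playbook..."},
    {"doc_id": "doc3", "text": "Public FAQ..."},
]
print([c["doc_id"] for c in filter_retrieved("emilia", chunks)])
```

Whatever fails the check is simply never included in the LLM's context.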

Spin Up Real Authorization for Every Test

Instead of mocking your authorization system or manually running it on your workstation, you can add this line of code in your test:

container, _ := spicedbcontainer.Run(ctx, "authzed/spicedb:v1.47.1")

And Testcontainers will:

  • Pull the real SpiceDB image
  • Start it in a clean, isolated environment
  • Assign it dynamic ports
  • Wait for it to be ready
  • Hand you the gRPC endpoint
  • Clean up afterwards

Because Testcontainers handles the full lifecycle – pulling the container, exposing dynamic ports, and tearing it down automatically – you eliminate manual steps such as running Docker commands and writing cleanup scripts. This isolation ensures that every single test runs with a fresh, clean authorization graph, preventing data conflicts and making your permission tests completely reproducible in your IDE and across parallel Continuous Integration (CI) builds.

Suddenly you have a real, production-grade, Zanzibar-style permissions engine inside your unit test. 

Using SpiceDB & Testcontainers

Here’s a walkthrough of how you can achieve end-to-end permissions testing using SpiceDB and Testcontainers. The source code for this tutorial can be found here.

1. Testing Our RAG 

For the sake of simplicity, the RAG pipeline is minimal and its retrieval mechanism is trivial.

We’re going to test three documents whose doc_ids (doc1, doc2, doc3) act as metadata.

  • doc1: Internal roadmap
  • doc2: Customer playbook
  • doc3: Public FAQ

And three users:

  • Emilia owns doc1
  • Beatrice can view doc2
  • Charlie (or anyone) can view doc3

This SpiceDB schema defines a user and a document object type. A user has read permission on a document if they are the direct viewer or the owner of the document.

definition user {}

definition document {
  relation owner: user
  relation viewer: user
  permission read = owner + viewer
}

2. Starting the Testcontainer 

Here’s the line of code that launches the disposable SpiceDB instance at the start of a test:

container, err := spicedbcontainer.Run(ctx, "authzed/spicedb:v1.47.1")
require.NoError(t, err)

Next, we connect to the running containerized service:

host, _ := container.Host(ctx)
port, _ := container.MappedPort(ctx, "50051/tcp")
endpoint := fmt.Sprintf("%s:%s", host, port.Port())

client, err := authzed.NewClient(
    endpoint,
    grpc.WithTransportCredentials(insecure.NewCredentials()),
    grpcutil.WithInsecureBearerToken("somepresharedkey"),
)

This is now a fully-functional SpiceDB instance running inside your test runner.

3. Load the Schema + Test Data

The test seeds data the same way your application would:

_, err := client.WriteSchema(ctx, &apiv1.WriteSchemaRequest{Schema: schema})
require.NoError(t, err)

Then:

rel("document", "doc1", "owner", "user", "emilia")
rel("document", "doc2", "viewer", "user", "beatrice")
rel("document", "doc3", "viewer", "user", "emilia")
rel("document", "doc3", "viewer", "user", "beatrice")
rel("document", "doc3", "viewer", "user", "charlie")

We now have a predictable, reproducible authorization graph for every test run.

4. Post-Filtering With SpiceDB

Before the LLM sees anything, we check permissions with SpiceDB, which acts as the source of truth for document permissions.

resp, err := r.spiceClient.CheckPermission(ctx, &apiv1.CheckPermissionRequest{
    Resource:   docObject,
    Permission: "read",
    Subject:    userSubject,
})

If SpiceDB says no, the doc is never fed into the LLM, thereby ensuring the user gets an answer to their query only based on what they have permissions to read.

This avoids:

  • Accidental data leakage
  • Overly permissive vector search
  • Compliance problems

Traditional access controls break down when data becomes embeddings, so guardrails like this are what prevent leakage.

End-to-End Permission Checks in a Single Test

Here’s what the full test asserts:

Emilia queries “roadmap” → gets doc1
Because she’s the owner.

Beatrice queries “playbook” → gets doc2
Because she’s a viewer.

Charlie queries “public” → gets doc3
Because it’s the only doc he can read, as it’s public.

If there is a single failing permission rule, the end-to-end test will immediately fail, which is critical given the constant changes in RAG pipelines (such as new retrieval modes, embeddings, document types, or permission rules). 

What If Your RAG Pipeline Isn’t in Go?

First, a shoutout to Guillermo Mariscal for his original contribution to the SpiceDB Go Testcontainers module.

What if your RAG pipeline is written in a different language such as Python? Not to worry, there’s also a community Testcontainers module written in Python that you can use similarly. The module can be found here.

Typically, you would integrate it in your integration tests like this:

# Your RAG pipeline test
def test_rag_pipeline_respects_permissions():
    with SpiceDBContainer() as spicedb:
        # Set up permissions schema
        client = create_spicedb_client(
            spicedb.get_endpoint(),
            spicedb.get_secret_key()
        )

        # Load your permissions model
        client.WriteSchema(your_document_permission_schema)

        # Write test relationships
        # User A can access Doc 1
        # User B can access Doc 2

        # Test RAG pipeline with User A
        results = rag_pipeline.search(query="...", user="A")
        assert "Doc 1" in results
        assert "Doc 2" not in results  # Should be filtered out!

Similar to the Go module, this container gives you a clean, isolated SpiceDB instance for every test run.

Why This Approach Matters

Authorization testing in RAG pipelines can be tricky given their scale and latency requirements, and trickier still in systems handling sensitive data. By integrating the flexibility and scale of SpiceDB with the automated, isolated environments of Testcontainers, you shift to a completely reliable, deterministic approach to authorization.

Every time your code ships, a fresh, production-grade authorization engine is spun up, loaded with test data, and torn down cleanly, guaranteeing zero drift between your development machine and CI. This pattern can ensure that your RAG system is safe, correct, and permission-aware as it scales from three documents to millions.

Try It Yourself

The complete working example in Go along with a sample RAG pipeline is here:
https://github.com/sohanmaheshwar/spicedb-testcontainer-rag
Clone it.
Run go test -v.
Watch it spin up a fresh SpiceDB instance, load permissions, and assert RAG behavior.
Also, find the community modules for the SpiceDB testcontainer in Go and Python.

Building AI agents shouldn’t be hard. According to theCUBE Research, Docker makes it easy
https://www.docker.com/blog/building-ai-agents-shouldnt-be-hard-according-to-thecube-research-docker-makes-it-easy/
Tue, 02 Dec 2025 15:00:00 +0000

For most developers, getting started with AI is still too complicated. Different models, tools, and platforms don’t always play nicely together. But with Docker, that’s changing fast.

Docker is emerging as essential infrastructure for standardized, portable, and scalable AI environments. By bringing composability, simplicity, and GPU accessibility to the agentic era, Docker is helping developers and the enterprises they support move faster, safer, and with far less friction. 

Real results: Faster AI delivery with Docker

The platform is accelerating innovation: According to the latest report from theCUBE Research, 88% of respondents reported that Docker reduced the time-to-market for new features or products, with nearly 40% achieving efficiency gains of more than 25%. Docker is playing an increasingly vital role in AI development as well. 52% of respondents cut AI project setup time by over 50%, while 97% report increased speed for new AI product development.

Reduced AI project failures and delays

Reliability remains a key performance indicator for AI initiatives, and Docker is proving instrumental in minimizing risk. 90% of respondents indicated that Docker helped prevent at least 10% of project failures or delays, while 16% reported prevention rates exceeding 50%. Additionally, 78% significantly improved testing and validation of AI models. These results highlight how Docker’s consistency, isolation, and repeatability not only speed development but also reduce costly rework and downtime, strengthening confidence in AI project delivery.

Build, share, and run agents with Docker, easily and securely

Docker’s mission for AI is simple: make building and running AI and agentic applications as easy, secure, and shareable as any other kind of software.

Instead of wrestling with fragmented tools, developers can now rely on Docker’s trusted, container-based foundation with curated catalogs of verified models and tools, and a clean, modular way to wire them together. Whether you’re connecting an LLM to a database or linking services into a full agentic workflow, Docker makes it plug-and-play.

With Docker Model Runner, you can pull and run large language models locally with GPU acceleration. The Docker MCP Catalog and Toolkit connect agents to over 300 MCP servers from partners like Stripe, Elastic, and GitHub. And with Docker Compose, you can define the whole AI stack of models, tools, and services in a single YAML file that runs the same way locally or in the cloud. Cagent, our open-source agent builder, lets you easily build, run, and share AI agents, with behavior, tools, and persona all defined in a single YAML file. And with Docker Sandboxes, you can run coding agents like Claude Code in a secure, local environment, keeping your workflows isolated and your data protected.

Conclusion 

Docker’s vision is clear: to make AI development as simple and powerful as the workflows developers already know and love. And it’s working: theCUBE reports 52% of users cut AI project setup time by more than half, while 87% say they’ve accelerated time-to-market by at least 26%.

Learn more

Tooling ≠ Glue: Why changing AI workflows still feels like duct tape
https://www.docker.com/blog/why-changing-ai-workflows-still-feels-like-duct-tape/
Mon, 11 Aug 2025 16:00:00 +0000

There’s a weird contradiction in modern AI development. We have better tools than ever. We’re building smarter systems with cleaner abstractions. And yet, every time you try to swap out a component in your stack, things fall apart. Again.

This isn’t just an inconvenience. It’s become the norm.

You’d think with all the frameworks and libraries out there (LangChain, Hugging Face, MLflow, Airflow) we’d be past this by now. These tools were supposed to make our workflows modular and composable. Swap an embedding model? No problem. Try a new vector store? Easy. Switch from OpenAI to an open-source LLM? Go ahead. That was the dream.

But here’s the reality: we’ve traded monoliths for a brittle patchwork of microtools, each with its own assumptions, quirks, and “standard interfaces.” And every time you replace one piece, you end up chasing down broken configs, mismatched input/output formats, and buried side effects in some YAML file you forgot existed.

Tooling was supposed to be the glue. But most days, it still feels like duct tape.

The composability myth

A lot of the tooling that’s emerged in AI came with solid intentions. Follow the UNIX philosophy. Build small pieces that do one thing well. Expose clear interfaces. Make everything swappable.

In theory, this should’ve made experimentation faster and integration smoother. But in practice, most tools were built in isolation. Everyone had their own take on what an embedding is, how prompts should be formatted, what retry logic should look like, or how to chunk a document.

So instead of composability, we got fragmentation. Instead of plug-and-play, we got “glue-and-hope-it-doesn’t-break.”

And this fragmentation isn’t just annoying; it slows everything down. Want to try a new RAG strategy? You might need to re-index your data, adjust your chunk sizes, tweak your scoring functions, and retrain your vector DB schema. None of that should be necessary. But it is.

The stack is shallow and wide

AI pipelines today span a bunch of layers:

  • Data ingestion
  • Feature extraction or embeddings
  • Vector storage and retrieval
  • LLM inference
  • Orchestration (LangChain, LlamaIndex, etc.)
  • Agent logic or RAG strategies
  • API / frontend layers

Each one looks like a clean block on a diagram. But under the hood, they’re often tightly coupled through undocumented assumptions about tokenization quirks, statefulness, retry behavior, latency expectations, etc.

The result? What should be a flexible stack is more like a house of cards. Change one component, and the whole thing can wobble.

Why everything keeps breaking

The short answer: abstractions leak — a lot.

Every abstraction simplifies something. And when that simplification doesn’t match the underlying complexity, weird things start to happen.

Take LLMs, for example. You might start with OpenAI’s API and everything just works. Predictable latency, consistent token limits, clean error handling. Then you switch to a local model. Suddenly:

  • The input format is different
  • You have to manage batching and GPU memory
  • Token limits aren’t well documented
  • Latency increases dramatically
  • You’re now in charge of quantization and caching

What was once a simple llm.predict() call becomes a whole new engineering problem. The abstraction has leaked, and you’re writing glue code again.

This isn’t just a one-off annoyance. It’s structural. We’re trying to standardize a landscape where variability is the rule, not the exception.
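One practical defense is to pin down the narrow interface your application actually depends on, so swapping a backend becomes an adapter change rather than a rewrite. A minimal Python sketch, with illustrative names (these classes and methods are not from any real SDK):

```python
# Sketch: application code depends on a small protocol, never on a
# provider SDK, so backend quirks stay behind the adapter boundary.
from dataclasses import dataclass
from typing import Protocol

@dataclass
class Completion:
    text: str
    tokens_used: int

class TextModel(Protocol):
    def complete(self, prompt: str, max_tokens: int) -> Completion: ...

class FakeLocalModel:
    """Stand-in for a local backend; batching, quantization, and caching
    would all live behind this adapter in a real system."""
    def complete(self, prompt: str, max_tokens: int) -> Completion:
        return Completion(text=prompt.upper()[:max_tokens], tokens_used=max_tokens)

def summarize(model: TextModel, doc: str) -> str:
    # Application code only sees the contract, never the backend quirks.
    return model.complete(f"Summarize: {doc}", max_tokens=10).text

print(summarize(FakeLocalModel(), "quarterly results"))
```

The adapter does not stop the abstraction from leaking, but it confines the leak to one place you can test.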

Where are the standards?

One big reason for the current mess is the lack of solid standards for interoperability.

In other fields, we’ve figured this out:

  • Containers → OCI, Docker
  • APIs → OpenAPI
  • Observability → OpenTelemetry
  • Data formats → Parquet, JSON Schema, Avro

In AI? We’re not there yet. Most tools define their own contracts. Few agree on what’s universal. And as a result, reuse is hard, swapping is risky, and scaling becomes painful.

But in AI tooling?

  • There’s still no widely adopted standard for model I/O signatures.
  • Prompt formats, context windows, and tokenizer behavior vary across providers.
  • We do see promising efforts like MCP (Model Context Protocol) emerging, and that’s a good sign, but in practice, most RAG pipelines, agent tools, and vector store integrations still lack consistent, enforced contracts.
  • Error handling? It’s mostly improvised: retries, timeouts, fallbacks, and silent failures become your responsibility.

So yes, standards like MCP are starting to show up, and they matter. But today, most teams are still stitching things together manually. Until these protocols become part of the common tooling stack, supported by vendors and respected across libraries, the glue will keep leaking.

Local glue ≠ global composability

It’s tempting to say: “But it worked in the notebook.”

Yes, and that’s the problem.

The glue logic that works for your demo, local prototype, or proof-of-concept often breaks down in production. Why?

  • Notebooks aren’t production environments—they don’t have retries, monitoring, observability, or proper error surfaces.
  • Chaining tools with Python functions is different from composing them with real-time latency constraints, concurrency, and scale in mind.
  • Tools like LangChain often make it easy to compose components, until you hit race conditions, cascading failures, or subtle bugs in state management.

Much of today’s tooling is optimized for developer ergonomics during experimentation, not for durability in production. The result: we demo pipelines that look clean and modular, but behind the scenes are fragile webs of assumptions and implicit coupling.

Scaling this glue logic, making it testable, observable, and robust, requires more than clever wrappers. It requires system design, standards, and real engineering discipline.

The core problem: Illusion of modularity

What makes this even more dangerous is the illusion of modularity. On the surface, everything looks composable – API blocks, chain templates, toolkits – but the actual implementations are tightly coupled, poorly versioned, and frequently undocumented.

The AI stack doesn’t break because developers are careless. It breaks because the foundational abstractions are still immature, and the ecosystem hasn’t aligned on how to communicate, fail gracefully, or evolve in sync.

Until we address this, the glue will keep breaking, no matter how shiny the tools become.

Interface contracts, not SDK hype

Many AI tools offer SDKs filled with helper functions and syntactic sugar. But this often hides the actual interfaces and creates tight coupling between your code and a specific tool. Instead, composability means exposing formal interface contracts, like:

  • OpenAPI for REST APIs
  • Protocol Buffers for efficient, structured messaging
  • JSON Schema for validating data structures

These contracts:

  • Allow clear expectations for inputs/outputs.
  • Enable automated validation, code generation, and testing.
  • Make it easier to swap out models/tools without rewriting your code.
  • Encourage tool-agnostic architecture rather than SDK lock-in.
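To make the idea concrete, here is a deliberately tiny validator enforcing a contract at the boundary. The schema shape loosely mimics JSON Schema, but this is a hand-rolled sketch, not the jsonschema library:

```python
# Sketch: validate a model response against an explicit contract instead
# of trusting SDK sugar. The contract format here is invented for the demo.
RESPONSE_CONTRACT = {
    "required": {"text": str, "score": float},
}

def validate(payload: dict, contract: dict) -> list:
    """Return a list of contract violations (empty means valid)."""
    errors = []
    for field, ftype in contract["required"].items():
        if field not in payload:
            errors.append(f"missing field: {field}")
        elif not isinstance(payload[field], ftype):
            errors.append(f"{field}: expected {ftype.__name__}")
    return errors

good = {"text": "hello", "score": 0.92}
bad = {"text": "hello"}
print(validate(good, RESPONSE_CONTRACT))  # []
print(validate(bad, RESPONSE_CONTRACT))
```

In production you would reach for a real schema language (OpenAPI, Protocol Buffers, JSON Schema) so the same contract also drives code generation and testing.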

Build for failure, not just happy paths

Most current AI systems assume everything works smoothly (“happy path”). But in reality:

  • Models time out
  • APIs return vague errors
  • Outputs may be malformed or unsafe

A truly composable system should:

  • Provide explicit error types (e.g., RateLimitError, ModelTimeout, ValidationFailed)
  • Expose retry and fallback mechanisms natively (not hand-rolled)
  • Offer built-in observability—metrics, logs, traces
  • Make failure handling declarative and modular (e.g., try model B if model A fails)

Shift toward declarative pipelines

Today, most AI workflows are written in procedural code:

response = model.generate(prompt)
if response.score > 0.8:
    store(response)

But this logic is hard to:

  • Reuse across tools
  • Observe or debug
  • Cache intermediate results

A declarative pipeline describes the what, not the how:

pipeline:
  - step: generate
    model: gpt-4
    input: ${user_input}
  - step: filter
    condition: score > 0.8
  - step: store
    target: vector_database

Benefits of declarative pipelines:

  • Easier to optimize and cache
  • Tool-agnostic, works across providers
  • More maintainable and easier to reason about
  • Supports dynamic reconfiguration instead of rewrites
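A declarative pipeline is just data plus an interpreter. Here is a toy interpreter for the YAML example above, with the pipeline expressed as plain Python data; the step implementations are stand-ins, not a real orchestration framework:

```python
# Toy interpreter for the declarative pipeline: the pipeline is data,
# the engine decides *how* each step runs.
PIPELINE = [
    {"step": "generate", "model": "gpt-4", "input": "${user_input}"},
    {"step": "filter", "min_score": 0.8},
    {"step": "store", "target": "vector_database"},
]

def run_pipeline(pipeline, user_input):
    stored = []
    state = None
    for step in pipeline:
        if step["step"] == "generate":
            # Stand-in for a model call; the score is fabricated for the demo.
            state = {"text": f"answer to {user_input}", "score": 0.9}
        elif step["step"] == "filter":
            if state["score"] <= step["min_score"]:
                return stored  # filtered out: nothing reaches storage
        elif step["step"] == "store":
            stored.append((step["target"], state["text"]))
    return stored

print(run_pipeline(PIPELINE, "hello"))
```

Swapping the model, the filter threshold, or the storage target is now an edit to the data, not a code rewrite.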

Key takeaways for developers

1. Be skeptical of “seamless” tools without contracts

Be skeptical of tools that promise seamless plug-and-play but lack strong interface contracts.

If a tool markets itself as easy to integrate but doesn’t offer:

  • A clear interface contract (OpenAPI, Protobuf, JSON schema)
  • Versioned APIs
  • Validation rules for input/output
  • Language-agnostic interfaces

Then the “plug-and-play” claim is misleading. These tools often lock you into an SDK and hide the true cost of integration.

2. Design defensively

Design your workflows defensively: isolate components, standardize formats, and expect things to break.

Good system design assumes things will fail.

  • Isolate responsibilities: e.g., don’t mix prompting, retrieval, and evaluation in one block of code.
  • Standardize formats: Use common schemas across tools (e.g., JSON-LD, shared metadata, or LangChain-style message objects).
  • Handle failures: Build with fallbacks, timeouts, retries, and observability from the start.

Tip: Treat every tool like an unreliable network service, even if it’s running locally.

3. Prefer declarative, interoperable pipelines

Embrace declarative and interoperable approaches: less code, more structure.

Declarative tools (e.g., YAML workflows, JSON pipelines) offer:

  • Clarity: You describe what should happen, not how.
  • Modularity: You can replace steps without rewriting everything.
  • Tool-neutrality: Works across providers or frameworks.

This is the difference between wiring by hand and using a circuit board. Declarative systems give you predictable interfaces and reusable components.

Examples:

  • LangGraph
  • Flowise
  • PromptLayer + OpenAPI specs
  • Tools that use JSON as input/output with clear schemas

Conclusion

We’ve all seen what’s possible: modular pipelines, reusable components, and AI systems that don’t break every time you swap a model or change a backend. But let’s be honest, we’re not there yet. And we won’t get there just by waiting for someone else to fix it. If we want a future where AI workflows are truly composable, it’s on us, the people building and maintaining these systems, to push things forward.

That doesn’t mean reinventing everything. It means starting with what we already control: write clearer contracts, document your internal pipelines like someone else will use them (because someone will), choose tools that embrace interoperability, and speak up when things are too tightly coupled. The tooling landscape doesn’t change overnight, but with every decision we make, every PR we open, and every story we share, we move one step closer to infrastructure that’s built to last, not just duct-taped together.

Publishing AI models to Docker Hub
https://www.docker.com/blog/publish-ai-models-on-docker-hub/
Wed, 11 Jun 2025 12:16:00 +0000

When we first released Docker Model Runner, it came with built-in support for running AI models published and maintained by Docker on Docker Hub. This made it simple to pull a model like llama3.2 or gemma3 and start using it locally with familiar Docker-style commands.

Model Runner now supports three new commands: tag, push, and package. These enable you to share models with your team, your organization, or the wider community. Whether you’re managing your own fine-tuned models or curating a set of open-source models, Model Runner now lets you publish them to Docker Hub or any other OCI Artifact-compatible container registry. For teams using Docker Hub, enterprise features like Registry Access Management (RAM) provide policy-based controls and guardrails to help enforce secure, consistent access.

Tagging and pushing to Docker Hub

Let’s start by republishing an existing model from Docker Hub under your own namespace.

# Step 1: Pull the model from Docker Hub
$ docker model pull ai/smollm2

# Step 2: Tag it for your own organization
$ docker model tag ai/smollm2 myorg/smollm2

# Step 3: Push it to Docker Hub
$ docker model push myorg/smollm2

That’s it! Your model is now available at myorg/smollm2 and ready to be consumed using Model Runner by anyone with access.

Pushing to other container registries

Model Runner supports other container registries beyond Docker Hub, including GitHub Container Registry (GHCR).

# Step 1: Tag for GHCR
$ docker model tag ai/smollm2 ghcr.io/myorg/smollm2

# Step 2: Push to GHCR
$ docker model push ghcr.io/myorg/smollm2

Authentication and permissions work just like they do with regular Docker images in the context of GHCR, so you can leverage your existing workflow for managing registry credentials.

Packaging a custom GGUF file

Want to publish your own model file? You can use the package command to wrap a .gguf file into a Docker-compatible OCI artifact and directly push it into a Container Registry, such as Docker Hub.

# Step 1: Download a model, e.g. from HuggingFace
$ curl -L -o model.gguf https://huggingface.co/TheBloke/Mistral-7B-v0.1-GGUF/resolve/main/mistral-7b-v0.1.Q4_K_M.gguf

# Step 2: Package and push it
$ docker model package --gguf "$(pwd)/model.gguf" --push myorg/mistral-7b-v0.1:Q4_K_M

You’ve now turned a raw model file in GGUF format into a portable, versioned, and sharable artifact that works seamlessly with docker model run.

Conclusion

We’ve seen how easy it is to publish your own models using Docker Model Runner’s new tag, push, and package commands. These additions bring the familiar Docker developer experience to the world of AI model sharing. Teams and enterprises using Docker Hub can securely manage access and control for their models, just like with container images, making it easier to scale GenAI applications across teams.

Stay tuned for more improvements to Model Runner that will make packaging and running models even more powerful and flexible.

Learn more

Introducing Docker Model Runner: A Better Way to Build and Run GenAI Models Locally
https://www.docker.com/blog/introducing-docker-model-runner/
Wed, 09 Apr 2025 13:00:44 +0000

Generative AI is transforming software development, but building and running AI models locally is still harder than it should be. Today’s developers face fragmented tooling, hardware compatibility headaches, and disconnected application development workflows, all of which hinder iteration and slow down progress.

That’s why we’re launching Docker Model Runner — a faster, simpler way to run and test AI models locally, right from your existing workflow. Whether you’re experimenting with the latest LLMs or deploying to production, Model Runner brings the performance and control you need, without the friction.

We’re also teaming up with some of the most influential names in AI and software development, including Google, Continue, Dagger, Qualcomm Technologies, HuggingFace, Spring AI, and VMware Tanzu AI Solutions, to give developers direct access to the latest models, frameworks, and tools. These partnerships aren’t just integrations, they’re a shared commitment to making AI innovation more accessible, powerful, and developer-friendly. With Docker Model Runner, you can tap into the best of the AI ecosystem from right inside your Docker workflow.

LLM development is evolving: We’re making it local-first 

Local development for applications powered by LLMs is gaining momentum, and for good reason. It offers several advantages on key dimensions such as performance, cost, and data privacy. But today, local setup is complex.  

Developers are often forced to manually integrate multiple tools, configure environments, and manage models separately from container workflows. Running a model varies by platform and depends on available hardware. Model storage is fragmented because there is no standard way to store, share, or serve models. 

The result? Rising cloud inference costs and a disjointed developer experience. With our first release, we’re focused on reducing that friction, making local model execution simpler, faster, and easier to fit into the way developers already build.

Docker Model Runner: The simple, secure way to run AI models locally

Docker Model Runner is designed to make AI model execution as simple as running a container. With this Beta release, we’re giving developers a fast, low-friction way to run models, test them, and iterate on application code that uses models locally, without all the usual setup headaches. Here’s how:

Running models locally 

With Docker Model Runner, running AI models locally is now as simple as running any other service in your inner loop. Docker Model Runner delivers this by including an inference engine as part of Docker Desktop, built on top of llama.cpp and accessible through the familiar OpenAI API. No extra tools, no extra setup, and no disconnected workflows. Everything stays in one place, so you can test and iterate quickly, right on your machine.
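Because Model Runner speaks the OpenAI API, you can call it with plain HTTP. The sketch below is illustrative only: the host endpoint (localhost:12434) and the model name (ai/smollm2) are assumptions based on Docker Desktop defaults and Docker Hub’s ai/ namespace, and the helper simply returns None when Model Runner isn’t reachable locally.

```python
import json
import urllib.request

# Assumed Docker Desktop host endpoint for Model Runner's OpenAI-compatible API.
ENDPOINT = "http://localhost:12434/engines/v1/chat/completions"

def chat(prompt, model="ai/smollm2"):
    """Send a chat completion request; return the reply text, or None if
    Model Runner isn't running locally."""
    payload = {"model": model, "messages": [{"role": "user", "content": prompt}]}
    req = urllib.request.Request(
        ENDPOINT,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    try:
        with urllib.request.urlopen(req, timeout=60) as resp:
            return json.load(resp)["choices"][0]["message"]["content"]
    except OSError:
        return None  # Model Runner not enabled or not reachable

print(chat("Say hello in one short sentence."))
```

Because the endpoint is OpenAI-compatible, any existing OpenAI client library should work the same way by pointing its base URL at the local engine.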

Enabling GPU acceleration (Apple silicon)

GPU acceleration on Apple silicon helps developers get fast inference and the most out of their local hardware. By using host-based execution, we avoid the performance limitations of running models inside virtual machines. This translates to faster inference, smoother testing, and better feedback loops.

Standardizing model packaging with OCI Artifacts

Model distribution today is messy. Models are often shared as loose files or behind proprietary download tools with custom authentication. With Docker Model Runner, we package models as OCI Artifacts, an open standard that allows you to distribute and version them through the same registries and workflows you already use for containers. Today, you can easily pull ready-to-use models from Docker Hub. Soon, you’ll also be able to push your own models, integrate with any container registry, connect them to your CI/CD pipelines, and use familiar tools for access control and automation.

Building momentum with a thriving GenAI ecosystem

To make local development seamless, it needs an ecosystem. That starts with meeting developers where they are, whether they’re testing model performance on their local machines or building applications that run these models. 

That’s why we’re launching Docker Model Runner with a powerful ecosystem of partners on both sides of the AI application development process. On the model side, we’re collaborating with industry leaders like Google and community platforms like HuggingFace to bring you high-quality, optimized models ready for local use. These models are published as OCI artifacts, so you can pull and run them using standard Docker commands, just like any container image.

But we aren’t stopping at models. We’re also working with application, language, and tooling partners like Dagger, Continue, Spring AI, and VMware Tanzu to ensure applications built with Model Runner integrate seamlessly into real-world developer workflows. Additionally, we’re working with hardware partners like Qualcomm Technologies to ensure high-performance inference on all platforms.

As Docker Model Runner evolves, we’ll work to expand its ecosystem of partners, allowing for ample distribution and added functionality.

Where We’re Going

This is just the beginning. With Docker Model Runner, we’re making it easier for developers to bring AI model execution into everyday workflows, securely, locally, and with a low barrier of entry. Soon, you’ll be able to run models on more platforms, including Windows with GPU acceleration, customize and publish your own models, and integrate AI into your dev loop with even greater flexibility (including Compose and Testcontainers). With each Docker Desktop release, we’ll continue to unlock new capabilities that make GenAI development easier, faster, and way more fun to build with.

Try it out now! 

Docker Model Runner is now available as a Beta feature in Docker Desktop 4.40. To get started:

  1. On a Mac with Apple silicon, update to Docker Desktop 4.40
  2. Pull models developed by our partners at Docker’s GenAI Hub and start experimenting
  3. For more information, check out our documentation here.

Try it out and let us know what you think!

How can I learn more about Docker Model Runner?

Check out our available assets today! 

Turn your Mac into an AI playground YouTube tutorial
A Quickstart Guide to Docker Model Runner 
Docker Model Runner on Docker Docs 
Create Local AI Agents with Dagger and Docker Model Runner

Come meet us at Google Cloud Next! 

Swing by booth 1530 in the Mandalay Bay Convention Center for hands-on demos and exclusive content.

How to Get Started with the Weaviate Vector Database on Docker https://www.docker.com/blog/how-to-get-started-weaviate-vector-database-on-docker/ Tue, 19 Sep 2023 17:11:50 +0000 https://www.docker.com/?p=46033 Vector databases have been getting a lot of attention since the developer community realized how they can enhance large language models (LLMs). Weaviate is an open source vector database that enables modern search capabilities, such as vector search, hybrid search, and generative search. With Weaviate, you can build advanced LLM applications, next-level search systems, recommendation systems, and more.

This article explains what vector databases are and highlights key features of the Weaviate vector database. Learn how to install Weaviate on Docker using Docker Compose so you can take advantage of semantic search within your Dockerized environment.


Introducing the Weaviate vector database

The core feature of vector databases is storing vector embeddings of data objects. This functionality is especially helpful with the growing amount of unstructured data (e.g., text or images), which is difficult to manage and process with traditional relational databases. The vector embeddings are a numerical representation of the data objects — usually generated by a machine learning (ML) model — and enable the search and retrieval of data based on semantic similarity (vector search).

Vector databases do much more than just store vector embeddings: As you can imagine, retrieving data based on similarity requires a lot of comparing between objects and thus can take a long time. In contrast to other types of databases that can store vector embeddings, a vector database can retrieve data fast. To enable low-latency search queries, vector databases use specific algorithms to index the data.
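The core idea can be illustrated in a few lines: embeddings are compared with a similarity metric such as cosine similarity, and the closest stored objects win. The tiny 3-dimensional vectors below are toy stand-ins for real ML-generated embeddings; a real vector database additionally builds an index so it doesn’t have to compare the query against every object.

```python
import math

# Toy illustration of vector search: data objects are stored as embeddings,
# and a query retrieves the most semantically similar object.
def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

docs = {
    "cat": [0.90, 0.10, 0.00],
    "kitten": [0.85, 0.20, 0.05],
    "car": [0.10, 0.90, 0.30],
}

query = [0.88, 0.15, 0.02]  # e.g., the embedding of "feline"
best = max(docs, key=lambda name: cosine(query, docs[name]))
print(best)  # the closest stored object by cosine similarity
```

A brute-force scan like this is O(n) per query; vector databases such as Weaviate replace it with approximate nearest-neighbor indexes to keep latency low at scale.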

Additionally, some vector databases, like Weaviate, store the vector embeddings and the original data object, which lets you combine traditional search with modern vector search for more accurate search results.

With these functionalities, vector databases are usually used in search or similar tasks (e.g., recommender systems). With the recent advancements in the LLM space, however, vector databases have also proven effective at providing long-term memory and domain-specific context to conversational LLMs. This means that you can leverage LLM capabilities on your private data or your specific field of expertise.

Key highlights of the Weaviate vector database include:

  • Open source: Weaviate is open source and available for anybody to use wherever they want. It is also available as a managed service with SaaS and hybrid SaaS options.
  • Horizontal scalability: You can scale seamlessly into billions of data objects for your exact needs, such as maximum ingestion, largest possible dataset size, maximum queries per second, etc.
  • Lightning-fast vector search: You can perform lightning-fast pure vector similarity search over raw vectors or data objects, even with filters. Weaviate typically performs nearest-neighbor searches of millions of objects in considerably less than 100ms (see our benchmark).
  • Combined keyword and vector search (hybrid search): You can store both data objects and vector embeddings. This approach allows you to combine keyword-based and vector searches for state-of-the-art search results.
  • Optimized for cloud-native environments: Weaviate has the fault tolerance of a cloud-native database, and the core Docker image is comparatively small at 18 MB.
  • Modular ecosystem for seamless integrations: You can use Weaviate standalone (aka “bring your own vectors”) or with various optional modules that integrate directly with OpenAI, Cohere, Hugging Face, etc., to enable easy use of state-of-the-art ML models. These modules can be used as vectorizers to automatically vectorize any media type (text, images, etc.) or as generative modules to extend Weaviate’s core capabilities (e.g., question answering, generative search, etc.).

Prerequisites

Ensure you have both the docker and the docker compose CLI tools installed. For the following section, we assume you have Docker 17.09.0 or higher and Docker Compose V2 installed. If your system has Docker Compose V1 installed instead of V2, use docker-compose instead of docker compose. You can check your Docker Compose version with:

$ docker compose version

How to Configure the Docker Compose File for Weaviate

To start Weaviate with Docker Compose, you need a Docker Compose configuration file, typically called docker-compose.yml. Usually, there’s no need to obtain individual images, as we distribute entire Docker Compose files.

You can obtain a Docker Compose file for Weaviate in two different ways:

  • Docker Compose configurator on the Weaviate website (recommended): The configurator allows you to customize your docker-compose.yml file for your purposes (including all module containers) and directly download it.
  • Manually: Alternatively, if you don’t want to use the configurator, copy and paste one of the example files from the documentation and manually modify it.

This article will review the steps to configure your Docker Compose file with the Weaviate Docker Compose configurator.

Step 1: Version

First, define which version of Weaviate you want to use (Figure 1). We recommend always using the latest version.

Dialog box recommending use of the latest Weaviate version. The text says: "You can also select an older version for compatibility's sake, but not all features might be available. If you are running on arm64 hardware, please select v1.4.0 or newer."
Figure 1: The first step when using the Weaviate Docker Compose configurator, suggesting that the latest version be used.

The following shows a minimal example of a Docker Compose setup for Weaviate:

version: '3.4'
services:
  weaviate:
    image: semitechnologies/weaviate:1.20.5
    ports:
    - 8080:8080
    restart: on-failure:0
    environment:
      QUERY_DEFAULTS_LIMIT: 25
      AUTHENTICATION_ANONYMOUS_ACCESS_ENABLED: 'true'
      PERSISTENCE_DATA_PATH: '/var/lib/weaviate'
      DEFAULT_VECTORIZER_MODULE: 'none'
      CLUSTER_HOSTNAME: 'node1'

Step 2: Persistent volume

Configure persistent volume for Docker Compose file (Figure 2):

Persistent volume dialog box stating: "It's recommended to set a persistent volume to avoid data loss and improve reading and writing speeds."
Figure 2: Weaviate Docker Compose configurator “Persistent Volume” configuration options.

It is recommended to set up a persistent volume to avoid data loss when you restart the container and to improve read and write speeds.

You can set a persistent volume in two ways:

  • With a named volume: Docker will create a named volume weaviate_data and mount it to the PERSISTENCE_DATA_PATH inside the container after starting Weaviate with Docker Compose:
    services:
      weaviate:
        volumes:
          - weaviate_data:/var/lib/weaviate
        # etc.

    volumes:
      weaviate_data:
  • With host binding: Docker will mount ./weaviate_data on the host to the PERSISTENCE_DATA_PATH inside the container after starting Weaviate with Docker Compose:
    services:
      weaviate:
        volumes:
          - ./weaviate_data:/var/lib/weaviate
        # etc.

Step 3: Modules

Weaviate can be used with various modules, which integrate directly with inferencing services like OpenAI, Cohere, or Hugging Face. These modules can be used to vectorize any media type at import and search time automatically or to extend Weaviate’s core capabilities with generative modules.

You can also use Weaviate without any modules (standalone). In this case, no model inference is performed at import or search time, meaning you need to provide your own vectors in both scenarios. If you don’t need any modules, you can skip to Step 4: Runtime.

Configure modules for Docker Compose file (Figure 3):

Dialog box asking: "Would you like to include modules or run Weaviate as a standalone vector database?"
Figure 3: The Weaviate Docker Compose configurator step to define if modules will be used, or if running standalone is desired.

Currently, Weaviate integrates three categories of modules:

  • Retriever and vectorizer modules automatically vectorize any media type (text, images, etc.) at import and search time. There are also re-ranker modules available for re-ranking search results.
  • Reader and generator modules can be used to extend Weaviate’s core capabilities after retrieving the data for generative search, question answering, named entity recognition (NER), and summarization.
  • Other modules are available for spell checking or for enabling using your custom modules.

Note that many modules (e.g., transformer models) are neural networks built to run on GPUs. Although you can run them on CPU, enabling the GPU with `ENABLE_CUDA=1`, if one is available, will result in faster inference.

The following shows an example of a Docker Compose setup for Weaviate with the sentence-transformers model:

version: '3.4'
services:
  weaviate:
    image: semitechnologies/weaviate:1.20.5
    restart: on-failure:0
    ports:
     - "8080:8080"
    environment:
      QUERY_DEFAULTS_LIMIT: 20
      AUTHENTICATION_ANONYMOUS_ACCESS_ENABLED: 'true'
      PERSISTENCE_DATA_PATH: "./data"
      DEFAULT_VECTORIZER_MODULE: text2vec-transformers
      ENABLE_MODULES: text2vec-transformers
      TRANSFORMERS_INFERENCE_API: http://t2v-transformers:8080
      CLUSTER_HOSTNAME: 'node1'
  t2v-transformers:
    image: semitechnologies/transformers-inference:sentence-transformers-multi-qa-MiniLM-L6-cos-v1
    environment:
      ENABLE_CUDA: '0' # set to 1 to enable GPU, if available

Step 4: Runtime

In the final step of the configurator, select Docker Compose for your runtime (Figure 4):

Dialog box for selecting desired runtime, which says "You can use this configuration generator to generate template for various runtimes, such as Docker-compose or Kubernetes (Helm)."
Figure 4: The final step of the Weaviate Docker Compose configurator where “Docker Compose” can be selected as the runtime.

Step 5: Download and further customization

Once your configuration is complete, you will see a snippet similar to the following to download the docker-compose.yml file, which has been adjusted to your selected configuration.

$ curl -o docker-compose.yml "https://configuration.weaviate.io/v2/docker-compose/docker-compose.yml?<YOUR-CONFIGURATION>"

After downloading the Docker Compose file from the configurator, you can directly start Weaviate on Docker or customize it further.

You can set additional environment variables to further customize your Weaviate setup (e.g., by defining authentication and authorization). Additionally, you can create a multi-node setup with Weaviate by defining a founding member and other members in the cluster.

Founding member: Set up one node as a “founding” member by configuring CLUSTER_GOSSIP_BIND_PORT and CLUSTER_DATA_BIND_PORT:

  weaviate-node-1:  # Founding member service name
    ...  # truncated for brevity
    environment:
      CLUSTER_HOSTNAME: 'node1'
      CLUSTER_GOSSIP_BIND_PORT: '7100'
      CLUSTER_DATA_BIND_PORT: '7101'

Other members in the cluster: For each further node, configure CLUSTER_GOSSIP_BIND_PORT and CLUSTER_DATA_BIND_PORT and configure to join the founding member’s cluster using the CLUSTER_JOIN variable:

  weaviate-node-2:
    ...  # truncated for brevity
    environment:
      CLUSTER_HOSTNAME: 'node2'
      CLUSTER_GOSSIP_BIND_PORT: '7102'
      CLUSTER_DATA_BIND_PORT: '7103'
      CLUSTER_JOIN: 'weaviate-node-1:7100'  # This must be the service name of the "founding" member node.

Optionally, you can set a hostname for each node using CLUSTER_HOSTNAME.

Note that it’s a Weaviate convention to set the CLUSTER_DATA_BIND_PORT to 1 higher than CLUSTER_GOSSIP_BIND_PORT.

How to run Weaviate on Docker

Once you have your Docker Compose file configured to your needs, you can run Weaviate in your Docker environment.

Start Weaviate

Before starting Weaviate on Docker, ensure that the Docker Compose file is named exactly docker-compose.yml and that you are in the same folder as the Docker Compose file.

Then, you can start up with the whole setup by running:

$ docker compose up -d

The -d option runs containers in detached mode. This means that your terminal will not attach to the log outputs of all the containers.

If you want to attach to the logs of specific containers (e.g., Weaviate), you can run the following command:

$ docker compose up -d && docker compose logs -f weaviate

Congratulations! Weaviate is now running and is ready to be used.
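As a quick smoke test, you can hit Weaviate’s readiness endpoint (/v1/.well-known/ready) from the host. The sketch below assumes the 8080 port mapping used in the Compose examples above and simply reports False if the container isn’t up yet.

```python
import urllib.request

# Probe Weaviate's readiness endpoint; returns True once the node is ready
# to serve requests, False if it is unreachable or still starting up.
def weaviate_ready(base_url="http://localhost:8080"):
    try:
        with urllib.request.urlopen(base_url + "/v1/.well-known/ready", timeout=5) as resp:
            return resp.status == 200
    except OSError:
        return False  # container not running or still starting

print(weaviate_ready())
```

The same endpoint is what orchestrators typically use for readiness probes, so this check mirrors what Kubernetes would do against a Weaviate pod.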

Stop Weaviate

To avoid losing your data, shut down Weaviate with the following command:

$ docker compose down

This will write all the files from memory to disk.

Conclusion

This article introduced vector databases and how they can enhance LLM applications. Specifically, we highlighted the open source vector database Weaviate, whose advantages include fast vector search at scale, hybrid search, and integration modules to state-of-the-art ML models from OpenAI, Cohere, Hugging Face, etc.

We also provided a step-by-step guide on how to install Weaviate on Docker using Docker Compose, noting that you can obtain a docker-compose.yml file from the Weaviate Docker Compose configurator, which helps you to customize your Docker Compose file for your specific needs.

Visit our AI/ML page and read the article collection to learn more about how developers are using Docker to accelerate the development of their AI/ML applications.

Learn more
