Running LLMs Locally with Docker Model Runner and Python
This is the second tutorial in the Docker Model Runner series.
In the previous issue, we introduced Docker Model Runner (DMR) and explored how to run and interact with LLMs locally using both the terminal and the Docker Desktop UI. In this tutorial, we'll take the next step by using the OpenAI Python SDK to work with DMR directly from Python.
By the end, you'll be able to run and interact with LLMs locally through Python code.
Tutorial level: Beginner
In the next issue, we'll cover how to pull models from Hugging Face into DMR.
Let’s get started!
One of the great features of Docker Model Runner is its compatibility with the OpenAI API SDKs. This makes it easy to adapt existing code that uses the OpenAI API to work with DMR and interact with locally running LLMs. In this tutorial, we'll focus on the Python SDK, though the same approach applies to other OpenAI SDKs like JavaScript, Java, Go, .NET, and more.
The prerequisite for running DMR is Docker Engine (Linux), Docker Desktop 4.40 or later on macOS, or Docker Desktop 4.41 or later on Windows. For hardware requirements, please check the Docker Model Runner documentation.
DMR runs as a standalone server so that you can connect to it from both containerized environments and regular local Python environments.
This flexibility gives you the choice to either containerize your Python setup or stick with a traditional virtual environment—whatever works best for your workflow.
OpenAI API Python SDK Workflow
Now, let’s walk through how to configure the OpenAI Python SDK to interact with models running locally via DMR.
A typical OpenAI workflow in Python involves the following steps:
1. Import the openai library
2. Set up a client by specifying API details (e.g., base URL, API key)
3. Define a prompt and send it as a request to the OpenAI API
4. Parse and use the API response
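For reference, here is a minimal sketch of that workflow against the official OpenAI endpoint; the API key and model name are placeholders you would replace with your own. The rest of this tutorial swaps in DMR's endpoint and a local model.

import openai

# 1. Set up the client (placeholder API key; with no base_url it targets the official OpenAI API)
client = openai.OpenAI(api_key="YOUR_API_KEY")

# 2. Define a prompt and send the request
completion = client.chat.completions.create(
    model="gpt-4o-mini",  # example model name
    messages=[{"role": "user", "content": "Hello!"}],
)

# 3. Parse and use the response
print(completion.choices[0].message.content)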
DMR follows this same workflow, with just one key difference: you'll use the DMR server’s URL instead of the official OpenAI endpoint when configuring the client. In addition, we'll specify which of the locally available models we want to use.
In the next example, we'll show you how to "chat" with an LLM using the OpenAI Python SDK and a locally running DMR instance.
Docker Model Runner Settings
Before we dive into the Python code, let’s first make sure that DMR is up and running. You can do this from the command line using the status command:
docker model status
If everything is working correctly, you should see output similar to this:
Docker Model Runner is running
Status:
llama.cpp: running llama.cpp latest-metal (sha256:3f229388659b4b647d9dec47e1af843f874ca8bf8e43a336828362d9afad394c) version: 79e0b68
Next, list the available local models using:
docker model list
Currently, the following two models are available locally:
MODEL NAME          PARAMETERS  QUANTIZATION    ARCHITECTURE  MODEL ID      CREATED       SIZE
ai/llama3.2:latest  3.21 B      IQ2_XXS/Q4_K_M  llama         436bb282b419  4 months ago  1.87 GiB
ai/gemma3n:latest   6.87 B      IQ2_XXS/Q4_K_M  gemma3n       800c2ac86449  4 weeks ago   3.94 GiB
Before connecting to the model from Python, you'll need to enable Transmission Control Protocol (TCP) access and set the port number. This exposes the DMR server on a local host address so that the OpenAI SDK can communicate with it.
You can enable TCP and configure the port via the Docker Desktop Settings dashboard (highlighted in white):
Or simply run the following command in the CLI:
docker desktop enable model-runner --tcp=12434
This will expose the server on port 12434.
Working with the Python SDK
Now let’s switch over to Python and send a simple prompt to the Llama model. We’ll start by importing the openai library:
import openai
Next, we’ll define the client using the OpenAI class. To connect to the DMR server, we need to set the base_url parameter. The value of this URL depends on whether you're running the code inside a container or from your local environment.
If you're running the code inside a container, use the following:
base_url = "http://model-runner.docker.internal/engines/v1"
If you're running it locally (outside of a container), use:
base_url = "http://localhost:12434/engines/v1"
Note that the localhost URL must include the TCP port we enabled earlier, in this case 12434.
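If the same script might run both inside and outside a container, you can pick the base URL at runtime. Here is a small sketch; RUNNING_IN_CONTAINER is a hypothetical environment variable you would set yourself in your container image:

import os

# RUNNING_IN_CONTAINER is a hypothetical flag set in your own container image;
# adapt the detection logic to whatever fits your setup.
if os.environ.get("RUNNING_IN_CONTAINER"):
    base_url = "http://model-runner.docker.internal/engines/v1"
else:
    base_url = "http://localhost:12434/engines/v1"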
Now, initialize the OpenAI client:
client = openai.OpenAI(
    base_url=base_url,
    api_key="docker"
)
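As a quick sanity check, you can list the models DMR exposes through the same client. This is a sketch that assumes DMR implements the standard OpenAI models endpoint; the IDs should match the output of docker model list:

# List the models available to the client; IDs should match `docker model list`
for model in client.models.list().data:
    print(model.id)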
With the client set up, the rest of the workflow is exactly the same as using the OpenAI API. For example, here’s how you can use chat.completions.create to ask the following question:
What is the capital of the United States of America?
completion = client.chat.completions.create(
    model="ai/llama3.2:latest",
    messages=[
        {"role": "system", "content": "You are a helpful AI assistant."},
        {"role": "user", "content": "What is the capital of the United States of America?"}
    ],
)
We use the model argument to reference the LLM, in this case Llama 3.2, and set the prompt using the messages argument. Last but not least, we parse and print the answer returned by the LLM:
print(completion.choices[0].message.content)
This returns the following output:
The capital of the United States of America is Washington, D.C.
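Because DMR speaks the OpenAI API, other SDK features carry over in the same way. As one example, here is a sketch of streaming the answer token by token instead of waiting for the full response (assuming the local engine supports streaming, as llama.cpp generally does):

stream = client.chat.completions.create(
    model="ai/llama3.2:latest",
    messages=[
        {"role": "user", "content": "What is the capital of the United States of America?"}
    ],
    stream=True,
)

for chunk in stream:
    # Each chunk carries an incremental piece of the generated answer
    if chunk.choices and chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)
print()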
Summary
In this tutorial, we explored how to interact with LLMs running locally via Docker Model Runner using Python. Thanks to DMR's compatibility with the OpenAI API SDK, adapting your Python code to run models locally is simple and seamless.
In the next issue, we’ll walk through how to download and run LLMs from Hugging Face using DMR.
Resources
Getting Started with Docker Model Runner - link
Docker Desktop documentation - https://docs.docker.com/desktop/
Docker Model Runner Documentation - https://docs.docker.com/ai/model-runner/
Available LLMs on Docker Hub - https://hub.docker.com/u/ai
OpenAI API SDK - https://platform.openai.com/docs/libraries