# Running on CPU-only Hardware
LLM Extractinator works fine on machines without a GPU – as long as you pick a small enough model and are okay with slower runtimes.
This page explains:
- What to change (Docker / local)
- How to run CPU-only
- Why you should pick a smaller model, e.g. `qwen3:8b`
!!! note
    Using a GPU is strongly recommended, as it allows you to run larger models, which generally perform better.
## 1. Core idea
LLM Extractinator itself doesn’t “talk” to your GPU. It just talks to Ollama.
Whether inference runs on GPU or CPU is controlled by:
- How Ollama is installed/configured
- Whether your Docker container is started with GPU access
So to go CPU-only you mainly need to:
- Start Ollama without GPU access, and
- Use a small model such as `qwen3:8b`.
Everything else (tasks, parsers, CLI flags) stays the same.
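A quick way to confirm which backend Ollama actually uses is `ollama ps`, which lists loaded models together with a `PROCESSOR` column:

```bash
# Load the model with a throwaway prompt, then check where it runs.
ollama run qwen3:8b "Say hi" > /dev/null
ollama ps  # PROCESSOR reads e.g. "100% CPU" on a CPU-only machine
```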
## 2. Choosing a CPU-friendly model
For larger models, Ollama will throw an error if no compatible GPU is found. You can see these errors by running the model with the `--verbose` flag.
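For example (assuming the model is already pulled):

```bash
# Run the model interactively with timing/debug output;
# any load errors are printed to the terminal.
ollama run qwen3:8b --verbose
```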
Therefore, for CPU-only hardware:

- Prefer models smaller than 10B parameters
- Use the default quantized variants from Ollama
- Good starting point: `qwen3:8b`
Example CLI call:

```bash
extractinate --task_id 1 --model_name "qwen3:8b" --reasoning_model
```
!!! note
    We use the `--reasoning_model` flag here because qwen3 is a reasoning-capable model. If you use a different model that does not emit intermediate reasoning, you can omit this flag.
If this still throws an error or inference is too slow, you can choose an even smaller model such as `qwen3:4b`, `qwen3:1.7b`, or even `qwen3:0.6b`. Alternatively, you can further tune `--num_predict`, `--max_context_len`, and `--num_examples` (see below).
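For instance, dropping down one size:

```bash
# Pull a smaller variant and rerun the same task with it.
ollama pull qwen3:4b
extractinate --task_id 1 --model_name "qwen3:4b" --reasoning_model
```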
!!! warning
    Very small models may struggle to follow complex instructions or produce high-quality outputs. Always spot-check the results to ensure they meet your requirements.
## 3. Docker: GPU vs CPU-only
When running LLM Extractinator via Docker, GPU access is controlled by the `--gpus` flag in the `docker run` command.
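For reference, a GPU-enabled run is the same command with `--gpus all` added (this requires the NVIDIA Container Toolkit on the host):

```bash
docker run --rm --gpus all \
  -p 127.0.0.1:8501:8501 \
  -p 11434:11434 \
  -v $(pwd)/data:/app/data \
  -v $(pwd)/examples:/app/examples \
  -v $(pwd)/tasks:/app/tasks \
  -v $(pwd)/output:/app/output \
  lmmasters/llm_extractinator:latest
```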
### 3.1 Run CPU-only in Docker
To run on CPU only, simply omit the GPU flag:

```bash
docker run --rm \
  -p 127.0.0.1:8501:8501 \
  -p 11434:11434 \
  -v $(pwd)/data:/app/data \
  -v $(pwd)/examples:/app/examples \
  -v $(pwd)/tasks:/app/tasks \
  -v $(pwd)/output:/app/output \
  lmmasters/llm_extractinator:latest
```
Inside the container:

- Ollama will run in CPU mode.
- The Studio (`launch-extractinator`) and CLI (`extractinate`) work exactly the same.
- The main difference is speed, so use a smaller model such as `qwen3:8b`.
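If you want to double-check that the container's Ollama really is on CPU, you can inspect it from the host (the container ID below is a placeholder; look yours up with `docker ps`):

```bash
# Find the running container, then check Ollama's PROCESSOR column inside it.
docker ps --filter ancestor=lmmasters/llm_extractinator:latest
docker exec -it <container-id> ollama ps
```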
## 4. Local installation: CPU-only
If you run everything directly on your machine instead of Docker:

1. Install Ollama.
2. Make sure the Ollama service is running.
3. Pull a small model:

    ```bash
    ollama pull qwen3:8b
    ```

4. Run Extractinator with that model:

    ```bash
    extractinate --task_id 1 --model_name "qwen3:8b" --reasoning_model
    ```
On a machine without a compatible GPU, Ollama will automatically fall back to CPU-only.
If you do have a GPU but want to force CPU, check Ollama’s configuration (e.g. setting GPU usage to 0) so it does not try to use the GPU.
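On NVIDIA hardware, one documented approach is to hide the GPU from Ollama by pointing it at an invalid device ID before starting the service (a sketch, assuming you start Ollama manually with `ollama serve`):

```bash
# An invalid device ID makes Ollama ignore the GPU and fall back to CPU.
export CUDA_VISIBLE_DEVICES=-1
ollama serve
```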
## 5. Tweaking settings for CPU runs
On CPU you pay more dearly for every token, so it’s worth dialling a few knobs back.
### 5.1 Limit generation length
Use a smaller `--num_predict`:

```bash
extractinate --task_id 1 --model_name "qwen3:8b" --num_predict 256 --reasoning_model
```
This caps the number of tokens the model is allowed to generate per response.
### 5.2 Limit context length
If your inputs are very long, you can reduce the effective context via `--max_context_len`:

```bash
extractinate --task_id 1 --model_name "qwen3:8b" --max_context_len 2048 --reasoning_model
```
This can significantly reduce compute on very long texts.
### 5.3 Use fewer examples
Each example you add with `--num_examples` increases prompt size and compute. For CPU-only runs, prefer `--num_examples 0`.
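For example:

```bash
extractinate --task_id 1 --model_name "qwen3:8b" --num_examples 0 --reasoning_model
```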
## 6. Quick CPU-only checklist
If you only have a CPU and want a sane setup:

- Pick a small model, e.g. `qwen3:8b`.
- If using Docker, remove `--gpus all` from the run command.
- If running locally, make sure Ollama is installed and running; pull the CPU-friendly model.
- Start with:

    ```bash
    extractinate --task_id 1 --model_name "qwen3:8b" --num_predict 256 --max_context_len 2048 --num_examples 0 --reasoning_model
    ```

- If that's fast enough, you can gradually increase `--num_predict` or context length as needed.