Local Open-Source AI Stack: Building Your Local Powerhouse

A guide to the essential software stack & HW for running high-performance AI models privately & locally, ensuring full control and no long-term costs.

Local Open-Source AI Stack: Building Your Local Powerhouse
Illustration generated with Nano Banana 2 Pro. Building a local open source AI stack

Table of Contents

    1. Why a Local AI Stack
    2. Architecture Overview
    3. Prerequisites
    4. Layer 1 - Ollama: Your AI Engine
    5. Layer 2 - OpenWebUI: The Browser Interface
    6. Layer 3 - AnythingLLM: Knowledge and RAG
    7. Layer 4 - ComfyUI: Creative Media Generation
    8. Layer 5 - Hermes Agent: Orchestration and Automation
    9. Model Routing with OpenRouter
    10. Pitfalls
    11. Verifying the Full Stack
    12. Conclusion and Next Steps

1. Why a Local AI Stack

Most AI setups fall into two extremes: fully cloud-hosted services where your data lives on someone else's servers, or half-baked local installs that never quite talk to each other. The goal is a fully local, open-source AI stack where every component runs on our own hardware, every connection is deliberate, and you own the data end to end.

The result is a five-layer architecture that covers everything from raw model inference to creative media generation and agent orchestration. This article walks through every step, every command, and every pitfall you might hit so you can build your own working system.

Building your local powerhouse

How it works under the hood: the car analogy

What are the benefits?

2. Architecture Overview

The stack is built from five distinct layers, each running as either a native macOS process or a Docker container. Here is the complete topology:

Layer 1 (Foundation)     Ollama [Native]
Layer 2 (Interface)      OpenWebUI [Docker]
Layer 3 (Knowledge/RAG)  AnythingLLM [Docker]
Layer 4 (Creative)       ComfyUI [Native]
Layer 5 (Orchestration)  Hermes Agent [Native]
Layer 5 (Management)     Hermes Desktop [Native GUI]
5-layer local AI stack

Deployment mode at a glance

ServiceDeploymentPortNotes
OllamaNative11434Local model inference engine
OpenWebUIDocker3000Browser-based chat interface
AnythingLLMDocker3001RAG knowledge base with vector search
ComfyUINative8000Local image and media generation
Hermes AgentNative8642Agent orchestration via OpenRouter
Hermes DesktopNative GUIN/ASession and skill management interface

The Docker networking concept that matters most

Ollama, ComfyUI, and Hermes all run natively on the host. OpenWebUI and AnythingLLM run inside Docker containers. For the containers to reach the native services, they must use host.docker.internal, NOT localhost.

This is the single most important networking concept in this entire setup. Inside a Docker container, localhost refers to the container itself, not the host machine. Any time a containerized service needs to talk to a natively running service, host.docker.internal is the hostname you need. We hit this wall more than once, so it is worth emphasizing upfront.

3. Prerequisites

Before starting, ensure you have the following in place:

  • macOS with Apple Silicon (tested on M-series chips, but Intel Macs work too)
  • Docker Desktop installed and running. This is critical. Docker Desktop must be running before you pull or start any images. We lost time assuming we could just run docker run on demand. Start Docker Desktop first.
  • Homebrew for native package management
  • Node.js (v18 or later) for Hermes Agent
  • An OpenRouter API key for model routing. Get one at openrouter.ai.
  • At least 16 GB RAM for the full stack with smaller models. 32 GB gives you more freedom with model sizes.

Model selection guidance

The amount of available RAM must be taken into account. Below you find a graphic to help you decide which LLM model is going to fit on your local machine, based on the unified RAM.

HardwareRecommended Starting ModelNotes
16+ GB RAMqwen3.5:9b or gemma4:e4bFits comfortably in memory
32 GB RAMUp to 14B parameter modelsCan run larger, more capable
64+ GB RAM30B+ modelsSerious local inference power

For our setup we choose qwen3.5:9b and gemma4:e4b . Both run well on 16+ GB hardware.

4. Layer 1 - Ollama: Your AI Engine

Ollama is the foundation. It is the local model runtime that serves model inference on http://localhost:11434. Everything else in this stack either talks to Ollama directly or talks to a service that talks to Ollama.

Step 1: Install Ollama

brew install ollama

Step 2: Start the Ollama service

ollama serve

On macOS with Homebrew, Ollama installs as a background service that starts automatically. You can verify it is running:

curl <http://localhost:11434/api/tags>

A successful response returns a JSON object listing your available models. If you get a connection refused error, Ollama is not running.

Step 3: Pull your first model

We recommend starting with a small model that runs responsively on consumer hardware, such as Qwen3.5 9B:

ollama pull qwen3.5:9b

We also pulled Gemma 4:

ollama pull gemma4:e4b

Step 4: Test a model directly

ollama run qwen3.5:9b "Hello from the local stack"

If you get a response, Layer 1 is working. This is the checkpoint to hit before moving forward.

5. Layer 2 - OpenWebUI: The Browser Interface

OpenWebUI provides the daily browser-based experience. It connects to Ollama for inference and gives you a polished chat interface with conversation history, model switching, and plugin support.

Step 1: Ensure Docker Desktop is running

This sounds obvious, but it is the most common mistake on macOS. If Docker Desktop is not running, docker run silently fails or hangs. Check the Docker Desktop app or run:

docker info

If you get connection errors, start Docker Desktop and wait for the engine to be ready.

Step 2: Launch OpenWebUI

docker run -d --name open-webui \\\\
  -p 3000:8080 \\\\
  -e OLLAMA_BASE_URL=http://host.docker.internal:11434 \\\\
  --add-host=host.docker.internal:host-gateway \\\\
  ghcr.io/open-webui/open-webui:main

Let us break down what each flag does:

FlagPurpose
-dRun detached in the background
--name open-webuiContainer name for easy management
-p 3000:8080Map host port 3000 to container port 8080
-e OLLAMA_BASE_URL=http://host.docker.internal:11434Tell OpenWebUI where Ollama lives
--add-host=host.docker.internal:host-gatewayEnable host.docker.internal DNS resolution

The critical piece here is OLLAMA_BASE_URL. We use host.docker.internal:11434 because Ollama runs on the host, not inside the container. Using localhost:11434 here would cause OpenWebUI to look inside its own container for Ollama, which does not exist.

Step 3: Verify

Open your browser to http://localhost:3000. You should see the OpenWebUI setup page. Create an admin account, select a model from the dropdown, and send a test message.

If the model responds, your chat interface is connected to Ollama. Layer 2 is complete.

6. Layer 3 - AnythingLLM: Knowledge and RAG

AnythingLLM is where local documents become searchable knowledge. It handles document ingestion, chunking, embedding, vector storage, retrieval, and prompt augmentation. In RAG terminology, it turns your files into an LLM-accessible wiki.

Step 1: Launch AnythingLLM

docker run -d --name anythingllm \\\\
  -p 3001:3001 \\\\
  --add-host=host.docker.internal:host-gateway \\\\
  mintplexlabs/anythingllm

AnythingLLM serves its web interface on port 3001. The --add-host flag enables it to reach Ollama on the host when we configure it in the setup wizard.

Step 2: Complete the setup wizard

Open http://localhost:3001 in your browser. The setup wizard walks you through three critical configuration choices:

  • LLM Provider: Select Ollama. Enter the connection URL as http://host.docker.internal:11434. This is the same host.docker.internal pattern we used in OpenWebUI.
  • Embedding Model: Select Ollama again with the same URL http://host.docker.internal:11434. For the embedding model name, we recommend nomic-embed-text which you can pull with:
ollama pull nomic-embed-text
    • This is a lightweight embedding model designed for local RAG pipelines. It produces vectors that AnythingLLM uses for semantic search across your documents.
  • Vector Database: Select LanceDB. This is the local-first option that stores vector embeddings on your filesystem rather than sending them to a cloud service. It is the privacy-conscious choice and perfectly performant for personal knowledge bases.

Step 3: Create a workspace and ingest documents

Once configured, create a workspace in AnythingLLM, upload documents (PDFs, text files, markdown, web pages), and let it process them. The documents get chunked, embedded, and stored in LanceDB as vector records.

When you query the workspace, AnythingLLM retrieves the most relevant chunks, injects them into the prompt, and sends the augmented prompt to Ollama. The LLM then answers with knowledge drawn from your documents.

7. Layer 4 - ComfyUI: Creative Media Generation

ComfyUI is a node-based interface for Stable Diffusion and other generative image models. It provides local image generation, video generation, and complex media workflows.

Step 1: Download ComfyUI for macOS

Unlike most Python tools in this stack, ComfyUI is available as a native .app download for macOS. Head to the official ComfyUI releases and download the desktop application.

Do not attempt a pip install for this layer unless you need a custom development setup. The .app is the fastest path to a working installation on macOS.

Step 2: Launch and verify

Open the ComfyUI app. It starts a local server on port 8000. The default workflow loads with a basic text-to-image pipeline. Run a test generation to confirm everything is working.

ComfyUI runs natively on the host, so other services reach it at http://localhost:8000 from the host or http://host.docker.internal:8000 from Docker containers.

8. Layer 5 - Hermes Agent: Orchestration and Automation

Hermes Agent by Nous Research is the orchestration layer. It exposes an OpenAI-compatible API server that coordinates model calls, skills, and agent sessions. This is where your stack transitions from a chat tool to an autonomous agent that can plan, execute, and chain tasks.

Step 1: Install Hermes Agent natively

We tried the Docker approach first. It does not work reliably on macOS Docker Desktop because of a loopback binding issue -- port 8642 binds to 127.0.0.1 inside the container, making it unreachable from the host network. After wrestling with this, we switched to a native install which works cleanly.

curl -fsSL <https://raw.githubusercontent.com/NousResearch/hermes-agent/main/scripts/install.sh> | bash

Step 2: Source your shell config

The installer adds Hermes to your PATH. Source your shell config to pick it up immediately:

source ~/.zshrc

Verify the installation:

hermes --version

Step 3: Configure the API key

Hermes requires an API key in ~/.hermes/.env. This is where you set your OpenRouter key for model routing:

# In ~/.hermes/.env:
OPENROUTER_API_KEY=your-key-here

This is a critical file. We had a production failure where this file got corrupted with ANSI escape sequences, and all LLM calls failed silently with HTTP 400 responses. See the pitfalls section below for details.

Step 4: Start the Hermes gateway

hermes gateway run

This starts the agent server on port 8642. It exposes OpenAI-compatible /v1 endpoints, so any OpenAI SDK or tool can talk to Hermes by pointing its base URL at http://localhost:8642/v1.

Step 5: Install Hermes Desktop

Hermes Desktop is the companion GUI application for session and skill management. It is a native macOS app that provides a visual interface for creating agent sessions, configuring skills, and monitoring execution. Download it from the Nous Research releases page.

With Hermes running, you have a full agent layer that can coordinate calls across your local models and OpenRouter hosted models.

9. Model Routing with OpenRouter

OpenRouter is the managed inference API that sits behind Hermes Agent. It provides access to 200+ models from providers including OpenAI, Anthropic, Google, Meta, and others – all through a single API endpoint.

The key advantage is model switching without code changes. Your applications talk to Hermes on port 8642, and Hermes routes to whatever model OpenRouter is configured to use. Changing models is a configuration update, not a code refactor.

Models we tested through OpenRouter

ModelProviderNotes
qwen/qwen3.6-plusQwenStrong general-purpose reasoning
anthropic/claude-sonnet-4-6-20250514AnthropicHigh-capability reasoning with tool use

For local-only inference with smaller models, Ollama handles the workload directly. For heavier reasoning tasks or when you need a stronger model than your hardware can run locally, Hermes routes the request through OpenRouter. This gives you the best of both worlds: local control for everyday tasks and cloud-scale models when you need them.

10. Pitfalls

These are the issues that cost us hours. Read this section before you start, and you will save yourself the same pain.

Pitfall 1: Docker Desktop must be running first

On macOS, Docker Desktop is not always auto-starting. Running docker run while Docker Desktop is down produces confusing errors or silent hangs. Always verify:

docker info

If Docker Desktop shows as unavailable, open the app and wait for the engine indicator to turn green before pulling or running any images.

Pitfall 2: Hermes Docker container has a loopback binding issue on macOS

We initially tried to run Hermes Agent in Docker using a standard port mapping. On macOS Docker Desktop, port 8642 binds to 127.0.0.1 inside the container, making it inaccessible from the host network. The container starts fine, the port maps fine from Docker's perspective, but the process inside the container only listens on the container's own loopback interface. Nothing outside the container can reach it.

The fix is straightforward: install Hermes natively using the official install script instead of Docker. The native install starts the gateway process directly on the host, bound to the correct interface, and everything works.

Pitfall 3: Hermes API key corruption with ANSI escape sequences

The ~/.hermes/.env file is where Hermes stores its API keys. We encountered a situation where this file got corrupted with ANSI escape sequences (terminal color codes leaked into the file somehow, likely from a poorly formatted redirect or copy-paste).

The symptom was that all LLM calls through Hermes failed silently with HTTP 400 responses. No error messages, no helpful logs – just silent failures. The request went through, but the corrupted API key was rejected by the upstream provider.

The fix:

  1. Open ~/.hermes/.env in a plain text editor
  2. Remove any non-printable characters or escape sequences
  3. Ensure the file contains only plain text in the format KEY=value
  4. Restart the Hermes gateway (see below)

Pitfall 4: Hermes gateway must be restarted after .env changes

After editing ~/.hermes/.env, the running Hermes gateway process does not automatically pick up the new configuration. You must stop and restart it:

# Stop the running gateway (Ctrl+C in the terminal where it runs)
# Then restart:
hermes gateway run

If you skip this step, Hermes continues using the old configuration, and you will wonder why your changes did not take effect.

Pitfall 5: AnythingLLM setup wizard requires manual configuration

AnythingLLM does not auto-detect Ollama or your embedding model. The setup wizard requires you to manually select each component:

  • LLM provider must be set to Ollama with the URL http://host.docker.internal:11434
  • Embedding provider must be set to Ollama with the same URL
  • Vector database must be manually selected as LanceDB

If you skip any of these steps or use localhost instead of host.docker.internal, AnythingLLM cannot reach Ollama from inside its container.

Pitfall 6: ComfyUI is a .app on macOS, not a pip install

ComfyUI documentation often references Python installation instructions that work on Linux. On macOS, the correct path is to download the native .app from the official releases. Pip installations require additional dependencies, build tools, and troubleshooting that the .app sidesteps entirely.

11. Verifying the Full Stack

Once all layers are running, verify connectivity end to end.

Quick health checks

# Layer 1: Ollama
curl <http://localhost:11434/api/tags> | python3 -m json.tool

# Layer 2: OpenWebUI
curl -s -o /dev/null -w "%{http_code}" <http://localhost:3000>

# Layer 3: AnythingLLM
curl -s -o /dev/null -w "%{http_code}" <http://localhost:3001>

# Layer 4: ComfyUI
curl -s -o /dev/null -w "%{http_code}" <http://localhost:8000>

# Layer 5: Hermes Agent
curl -s -o /dev/null -w "%{http_code}" <http://localhost:8642/v1/models>

Expected responses

ServiceExpected StatusWhat it means
Ollama200Model engine is serving
OpenWebUI200Chat interface is up
AnythingLLM200RAG platform is running
ComfyUI200Creative suite is ready
Hermes200Agent API is accessible

Full stack test

With everything healthy, the end-to-end flow looks like this:

  1. Open OpenWebUI at http://localhost:3000 and chat with a local Ollama model
  2. Open AnythingLLM at http://localhost:3001, create a workspace, upload a document, and ask a question that requires document retrieval
  3. Launch ComfyUI and generate an image from a text prompt
  4. Open Hermes Desktop and create an agent session that coordinates across models

If all four work, your stack is fully operational.

12. Conclusion and Next Steps

Where to go from here

We built a five-layer local AI stack that runs entirely under our own control. Ollama provides the inference foundation, OpenWebUI is the daily interface, AnythingLLM turns documents into searchable knowledge, ComfyUI handles creative media generation, and Hermes Agent orchestrates the whole system through OpenRouter managed inference.

The hardest lessons we learned:

  • Docker containers reach native services through host.docker.internal, not localhost
  • Hermes does not work reliably in Docker on macOS – install it natively
  • API key files must be plain text, with no escape sequences
  • Restart services after configuration changes
  • Always start Docker Desktop before running container commands

All of these are documented in the pitfalls section above. If you read that section before starting, you will avoid most of the friction we encountered.

Where to go from here

  • Build RAG workflows in AnythingLLM with structured document ingestion and workspace-specific knowledge bases
  • Experiment with ComfyUI node workflows for automated image generation pipelines
  • Create Hermes Agent skills that chain together model calls, file operations, and API requests
  • Add n8n or another workflow orchestrator to coordinate triggers and scheduled agent tasks
  • Swap models freely between Ollama local models and OpenRouter hosted models without changing any application code

The stack is designed to be modular. You can run the full system, or pick individual layers based on your needs. Start with Ollama and OpenWebUI, add RAG when you need document search, add Hermes when you want agent capabilities, and add ComfyUI when you need local media generation. Each layer is independently useful, and together they form a complete local AI platform.