Branch: Ollama (Local)

Provider: Ollama (Local)

No API key, no cloud. Everything runs on your Mac.

Get Your Agent to Help

Install a local LLM skill:

npx skills add bobmatnyc/claude-mpm-skills@local-llm-ops

Or the ollama-specific one:

npx skills add jeremylongshore/claude-code-plugins-plus-skills@ollama-setup

Then ask: "help me set up ollama and build the daily-digest run script using local inference"

Install Ollama

brew install ollama
ollama serve &
ollama pull llama3.2:3b

Verify:

curl -s http://localhost:11434/api/tags | python3 -c "import json,sys; [print(m['name']) for m in json.load(sys.stdin)['models']]"
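The same check works from TypeScript against the `/api/tags` endpoint; a minimal sketch, where `extractNames` and `listModels` are names of my own:

```typescript
// Shape of Ollama's GET /api/tags response (only the field we use).
type TagsResponse = { models: Array<{ name: string }> };

// Pure helper: pull the model names out of the /api/tags payload.
function extractNames(data: TagsResponse): string[] {
  return data.models.map((m) => m.name);
}

// Fetch the installed-model list from a running Ollama server.
async function listModels(): Promise<string[]> {
  const resp = await fetch("http://localhost:11434/api/tags");
  if (!resp.ok) throw new Error(`Ollama ${resp.status}`);
  return extractNames((await resp.json()) as TagsResponse);
}
```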

The Pattern

Two endpoints. Use /api/generate for simple one-shot prompts, /api/chat for multi-turn:

// Simple: /api/generate
async function ask(prompt: string, model = "llama3.2:3b"): Promise<string> {
  const resp = await fetch("http://localhost:11434/api/generate", {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({ model, prompt, stream: false }),
  });

  if (!resp.ok) throw new Error(`Ollama ${resp.status}`);
  const data = (await resp.json()) as any;
  return data.response;
}

// Multi-turn: /api/chat (same shape as OpenAI)
async function chat(messages: Array<{role: string, content: string}>): Promise<string> {
  const resp = await fetch("http://localhost:11434/api/chat", {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({ model: "llama3.2:3b", messages, stream: false }),
  });

  if (!resp.ok) throw new Error(`Ollama ${resp.status}`);
  const data = (await resp.json()) as any;
  return data.message.content;
}
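Both snippets pass stream: false because Ollama streams by default: with streaming on, the server sends one JSON object per line, each carrying a fragment of the reply in its response field. A sketch of reassembling those chunks (assembleStream is a name of my own):

```typescript
// Each streamed line from /api/generate looks like:
//   {"model":"llama3.2:3b","response":"Hel","done":false}
// and the final line has "done": true with no further fragment.
function assembleStream(ndjson: string): string {
  return ndjson
    .split("\n")
    .filter((line) => line.trim().length > 0)
    .map((line) => JSON.parse(line) as { response?: string })
    .map((chunk) => chunk.response ?? "")
    .join("");
}
```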

OpenAI-Compatible Endpoint

Ollama 0.1.24+ has an OpenAI-compatible endpoint at /v1/chat/completions. If you already have code using the OpenAI format, just change the base URL:

const resp = await fetch("http://localhost:11434/v1/chat/completions", {
  method: "POST",
  headers: { "Content-Type": "application/json" },
  body: JSON.stringify({
    model: "llama3.2:3b",
    messages: [{ role: "user", content: prompt }],
  }),
});
// Response shape matches OpenAI: data.choices[0].message.content
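Because the body matches OpenAI's, any parsing code you already have carries over unchanged; a minimal sketch of that parsing step (extractReply is a name of my own):

```typescript
// Minimal slice of the OpenAI chat-completion response shape.
type ChatCompletion = { choices: Array<{ message: { content: string } }> };

// Pull the assistant's reply out of an OpenAI-shaped body.
function extractReply(data: ChatCompletion): string {
  return data.choices[0].message.content;
}

// Usage against the fetch call above:
//   const data = (await resp.json()) as ChatCompletion;
//   const reply = extractReply(data);
```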

Models

ollama pull llama3.2:3b    # 2GB, fast, good default
ollama pull llama3.2:1b    # 1.3GB, fastest, simple tasks
ollama pull mistral         # 4GB, stronger reasoning
ollama pull phi4            # Microsoft, small and fast

The Ollama-as-launchd-Job Pattern

Ollama must be running when your job fires. Make it self-sustaining — manage ollama with the same scheduler:

mkdir -p system-jobs/ollama-server
echo '{"type": "periodic", "seconds": 60, "runAtLoad": true}' > system-jobs/ollama-server/schedule
cat > system-jobs/ollama-server/run << 'EOF'
#!/bin/bash
pgrep -x ollama > /dev/null || /opt/homebrew/bin/ollama serve
EOF
chmod 755 system-jobs/ollama-server/run
bun run sync

Your job scheduler keeps its own LLM server alive.

Test It

bun run src/cli.ts kick daily-digest
bun run src/cli.ts logs daily-digest

Companion Notes

Branch: Ollama (Local Inference)

No API key. No cloud. Everything runs on your Mac. Best for privacy, offline use, or just wanting to own the whole stack.

Setup

# Install ollama
brew install ollama

# Start the server (runs in background)
ollama serve &

# Pull a model
ollama pull llama3.2

Verify it's running:

curl -s http://localhost:11434/api/tags | python3 -m json.tool

The Pattern

#!/usr/bin/env bun
async function ask(prompt: string, model = "llama3.2"): Promise<string> {
  const resp = await fetch("http://localhost:11434/api/generate", {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({
      model,
      prompt,
      stream: false,
    }),
  });

  if (!resp.ok) {
    throw new Error(`Ollama ${resp.status}: ${await resp.text()}`);
  }

  const data = (await resp.json()) as any;
  return data.response;
}

Model Selection

# List available models
ollama list

# Pull a model
ollama pull llama3.2      # 2GB, fast, good for most tasks
ollama pull mistral        # 4GB, strong reasoning
ollama pull codellama      # code-focused
ollama pull phi4           # small, fast, Microsoft

| Model | Size | Speed | Best For |
|---|---|---|---|
| llama3.2 | 2GB | Fast | Default — summaries, classification, short tasks |
| llama3.2:1b | 1.3GB | Fastest | Very simple tasks, low-memory machines |
| mistral | 4GB | Mid | Better reasoning, longer outputs |
| codellama | 4GB | Mid | Code-related jobs |

Start with llama3.2. It's the sweet spot for scheduled jobs.
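Different machines will have different models pulled, so a job can fall back gracefully instead of failing on a missing model. A sketch, assuming a preference order (pickModel is a name of my own; installed is the name list from GET /api/tags):

```typescript
// Pick the first preferred model that is actually installed.
// Ollama reports full tags like "llama3.2:latest", so match either
// the exact name or the prefix before the ":".
function pickModel(installed: string[], preferred = ["llama3.2", "mistral"]): string {
  for (const want of preferred) {
    const hit = installed.find((name) => name === want || name.startsWith(`${want}:`));
    if (hit) return hit;
  }
  throw new Error(`none of [${preferred.join(", ")}] is installed; run ollama pull first`);
}
```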

The Ollama Gotcha

Ollama must be running when the job fires. If the server isn't up, the job fails silently (connection refused → stderr.log).

Option 1: Keep ollama running (recommended)

# Add to your shell profile
ollama serve &

Option 2: Start ollama as its own launchd job

Create system-jobs/ollama-server/:

mkdir -p system-jobs/ollama-server
echo '{"type": "periodic", "seconds": 60, "runAtLoad": true}' > system-jobs/ollama-server/schedule
cat > system-jobs/ollama-server/run << 'EOF'
#!/bin/bash
# Keep ollama alive — if it's already running, this is a no-op.
# Use the full path: launchd jobs don't inherit your shell's PATH.
pgrep -x ollama > /dev/null || /opt/homebrew/bin/ollama serve
EOF
chmod 755 system-jobs/ollama-server/run

Now your job scheduler manages its own LLM server. This is the "tiny claw" moment — the system sustains itself.
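Whichever option you choose, the job itself can also guard against a slow server start: probe Ollama and retry briefly before doing any work, so a race at boot doesn't become a silent failure. A sketch (waitForOllama is a name of my own):

```typescript
// Wait until Ollama answers /api/tags, retrying with a short delay.
// Returns true once the server responds, false after `attempts` tries.
async function waitForOllama(attempts = 5, delayMs = 2000): Promise<boolean> {
  for (let i = 0; i < attempts; i++) {
    try {
      const resp = await fetch("http://localhost:11434/api/tags");
      if (resp.ok) return true;
    } catch {
      // connection refused; server not up yet
    }
    await new Promise((r) => setTimeout(r, delayMs));
  }
  return false;
}
```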

Chat API (Multi-turn)

For more complex jobs that need conversation context:

async function chat(messages: Array<{role: string, content: string}>, model = "llama3.2"): Promise<string> {
  const resp = await fetch("http://localhost:11434/api/chat", {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({ model, messages, stream: false }),
  });

  if (!resp.ok) {
    throw new Error(`Ollama ${resp.status}: ${await resp.text()}`);
  }

  const data = (await resp.json()) as any;
  return data.message.content;
}

// Use it
const result = await chat([
  { role: "system", content: "You summarize notes concisely." },
  { role: "user", content: noteContent },
]);

Verification

# Test ollama directly
curl -s http://localhost:11434/api/generate \
  -d '{"model":"llama3.2","prompt":"Say hello in 3 words","stream":false}' | python3 -c "import json,sys; print(json.load(sys.stdin)['response'])"

# Test through the job
bun run src/cli.ts kick daily-digest
bun run src/cli.ts logs daily-digest