## Get Your Agent to Help

Install a local LLM skill:

```bash
npx skills add bobmatnyc/claude-mpm-skills@local-llm-ops
```

Or the ollama-specific one:

```bash
npx skills add jeremylongshore/claude-code-plugins-plus-skills@ollama-setup
```

Then ask: "help me set up ollama and build the daily-digest run script using local inference"
## Install Ollama

```bash
brew install ollama
ollama serve &
ollama pull llama3.2:3b
```

Verify:

```bash
curl -s http://localhost:11434/api/tags | python3 -c "import json,sys; [print(m['name']) for m in json.load(sys.stdin)['models']]"
```
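The same check works from TypeScript if a job wants to verify models programmatically. A minimal sketch; the `TagsResponse` type here only models the one field the curl above reads, so it's an assumption about the full payload:

```typescript
// Parse the /api/tags payload into a list of model names.
// Only the fields we actually read are typed here.
interface TagsResponse {
  models: Array<{ name: string }>;
}

function modelNames(tags: TagsResponse): string[] {
  return tags.models.map((m) => m.name);
}

// Example payload, trimmed to the fields we read:
const sample: TagsResponse = {
  models: [{ name: "llama3.2:3b" }, { name: "mistral:latest" }],
};
console.log(modelNames(sample)); // prints both model names
```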
## The Pattern

Two endpoints: use `/api/generate` for simple one-shot prompts, `/api/chat` for multi-turn:

```typescript
// Simple one-shot: /api/generate
async function ask(prompt: string, model = "llama3.2:3b"): Promise<string> {
  const resp = await fetch("http://localhost:11434/api/generate", {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({ model, prompt, stream: false }),
  });
  if (!resp.ok) throw new Error(`Ollama ${resp.status}`);
  const data = (await resp.json()) as any;
  return data.response;
}

// Multi-turn: /api/chat (same message shape as OpenAI)
async function chat(messages: Array<{ role: string; content: string }>): Promise<string> {
  const resp = await fetch("http://localhost:11434/api/chat", {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({ model: "llama3.2:3b", messages, stream: false }),
  });
  if (!resp.ok) throw new Error(`Ollama ${resp.status}`);
  const data = (await resp.json()) as any;
  return data.message.content;
}
```
## OpenAI-Compatible Endpoint

Ollama 0.1.24+ has an OpenAI-compatible endpoint at `/v1/chat/completions`. If you already have code using the OpenAI format, just change the base URL:

```typescript
const resp = await fetch("http://localhost:11434/v1/chat/completions", {
  method: "POST",
  headers: { "Content-Type": "application/json" },
  body: JSON.stringify({
    model: "llama3.2:3b",
    messages: [{ role: "user", content: prompt }],
  }),
});
// Response shape matches OpenAI: data.choices[0].message.content
```
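When reading that response, a small guard around the field path avoids cryptic `undefined` errors. The `choices[0].message.content` path is the standard OpenAI shape; the helper itself is just a sketch:

```typescript
// Minimal OpenAI-style response shape: only the fields we read.
interface ChatCompletion {
  choices: Array<{ message: { content: string } }>;
}

function firstMessage(data: ChatCompletion): string {
  const content = data.choices?.[0]?.message?.content;
  if (content === undefined) throw new Error("unexpected response shape");
  return content;
}

const sampleResponse: ChatCompletion = {
  choices: [{ message: { content: "Hello there" } }],
};
console.log(firstMessage(sampleResponse)); // "Hello there"
```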
Models
ollama pull llama3.2:3b # 2GB, fast, good default
ollama pull llama3.2:1b # 1.3GB, fastest, simple tasks
ollama pull mistral # 4GB, stronger reasoning
ollama pull phi4 # Microsoft, small and fast
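If a job shouldn't hard-code one model, a small helper can pick the first pulled model from a preference list (a sketch; `pickModel` and its name-matching rule are assumptions, not part of the scheduler):

```typescript
// Given the names from /api/tags and a preference order, pick the first
// model that's actually pulled. Matches exact names or any tag of the model.
function pickModel(pulled: string[], preferred: string[]): string {
  for (const want of preferred) {
    if (pulled.some((name) => name === want || name.startsWith(want + ":"))) {
      return want;
    }
  }
  throw new Error(`none of [${preferred.join(", ")}] is pulled`);
}

console.log(pickModel(["llama3.2:3b", "mistral:latest"], ["phi4", "mistral"])); // "mistral"
```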
## The Ollama-as-launchd-Job Pattern

Ollama must be running when your job fires. Make the setup self-sustaining by managing ollama with the same scheduler:

```bash
mkdir -p system-jobs/ollama-server
echo '{"type": "periodic", "seconds": 60, "runAtLoad": true}' > system-jobs/ollama-server/schedule
cat > system-jobs/ollama-server/run << 'EOF'
#!/bin/bash
pgrep -x ollama > /dev/null || /opt/homebrew/bin/ollama serve
EOF
chmod 755 system-jobs/ollama-server/run
bun run sync
```

Your job scheduler keeps its own LLM server alive.
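Under the hood, a periodic schedule like that one presumably translates into launchd's `StartInterval`/`RunAtLoad` plist keys. A hypothetical sketch of that mapping: the `Schedule` type mirrors the JSON above, but the translation function is illustrative, not the scheduler's actual code:

```typescript
// Translate the schedule JSON into the launchd plist keys it implies.
// StartInterval and RunAtLoad are real launchd keys; only the "periodic"
// type is handled in this sketch.
interface Schedule {
  type: string;
  seconds: number;
  runAtLoad?: boolean;
}

function toLaunchdKeys(s: Schedule): Record<string, number | boolean> {
  if (s.type !== "periodic") throw new Error(`unsupported type: ${s.type}`);
  return { StartInterval: s.seconds, RunAtLoad: s.runAtLoad ?? false };
}

console.log(toLaunchdKeys({ type: "periodic", seconds: 60, runAtLoad: true }));
// { StartInterval: 60, RunAtLoad: true }
```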
## Test It

```bash
bun run src/cli.ts kick daily-digest
bun run src/cli.ts logs daily-digest
```
## Companion Notes

### Branch: Ollama (Local Inference)

No API key. No cloud. Everything runs on your Mac. Best for privacy, offline use, or just wanting to own the whole stack.
### Setup

```bash
# Install ollama
brew install ollama

# Start the server (runs in background)
ollama serve &

# Pull a model
ollama pull llama3.2
```

Verify it's running:

```bash
curl -s http://localhost:11434/api/tags | python3 -m json.tool
```
### The Pattern

```typescript
#!/usr/bin/env bun
async function ask(prompt: string, model = "llama3.2"): Promise<string> {
  const resp = await fetch("http://localhost:11434/api/generate", {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({
      model,
      prompt,
      stream: false,
    }),
  });
  if (!resp.ok) {
    throw new Error(`Ollama ${resp.status}: ${await resp.text()}`);
  }
  const data = (await resp.json()) as any;
  return data.response;
}
```
### Model Selection

```bash
# List available models
ollama list

# Pull a model
ollama pull llama3.2   # 2GB, fast, good for most tasks
ollama pull mistral    # 4GB, strong reasoning
ollama pull codellama  # code-focused
ollama pull phi4       # ~9GB, Microsoft's 14B model
```

| Model | Size | Speed | Best For |
|---|---|---|---|
| llama3.2 | 2GB | Fast | Default — summaries, classification, short tasks |
| llama3.2:1b | 1.3GB | Fastest | Very simple tasks, low-memory machines |
| mistral | 4GB | Mid | Better reasoning, longer outputs |
| codellama | 4GB | Mid | Code-related jobs |

Start with llama3.2. It's the sweet spot for scheduled jobs.
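If different jobs want different models, the table above can be encoded as a small lookup. A sketch; the task-kind names are invented for illustration:

```typescript
// Map a job's task kind to a model, following the table above.
// The task-kind names are made up for this sketch.
type TaskKind = "summarize" | "classify" | "reason" | "code";

const MODEL_FOR: Record<TaskKind, string> = {
  summarize: "llama3.2",
  classify: "llama3.2",
  reason: "mistral",
  code: "codellama",
};

function modelFor(kind: TaskKind): string {
  return MODEL_FOR[kind] ?? "llama3.2";
}

console.log(modelFor("reason")); // "mistral"
```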
### The Ollama Gotcha

Ollama must be running when the job fires. If the server isn't up, the job fails silently (connection refused → stderr.log).

**Option 1: Keep ollama running (recommended)**

```bash
# Add to your shell profile
ollama serve &
```

**Option 2: Start ollama as its own launchd job**

Create `system-jobs/ollama-server/`:

```bash
mkdir -p system-jobs/ollama-server
echo '{"type": "periodic", "seconds": 60, "runAtLoad": true}' > system-jobs/ollama-server/schedule
```

```bash
#!/bin/bash
# system-jobs/ollama-server/run
# Keep ollama alive — if it's already running, this is a no-op
pgrep -x ollama > /dev/null || ollama serve
```

Now your job scheduler manages its own LLM server. This is the "tiny claw" moment: the system sustains itself.
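Even with the keep-alive job, a run can still fire during the few seconds ollama takes to come back up. A small retry wrapper inside the job script rides out that window (a sketch; the attempt count and delay are arbitrary):

```typescript
// Retry an async call a few times with a fixed delay, enough to ride out
// the window where ollama is still starting up.
async function withRetry<T>(
  fn: () => Promise<T>,
  attempts = 3,
  delayMs = 2000,
): Promise<T> {
  let lastErr: unknown;
  for (let i = 0; i < attempts; i++) {
    try {
      return await fn();
    } catch (err) {
      lastErr = err;
      if (i < attempts - 1) await new Promise((r) => setTimeout(r, delayMs));
    }
  }
  throw lastErr;
}

// Usage: wrap the ask() helper from the pattern above, e.g.
// const summary = await withRetry(() => ask("Summarize: ..."));
```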
### Chat API (Multi-turn)

For more complex jobs that need conversation context:

```typescript
async function chat(
  messages: Array<{ role: string; content: string }>,
  model = "llama3.2",
): Promise<string> {
  const resp = await fetch("http://localhost:11434/api/chat", {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({ model, messages, stream: false }),
  });
  if (!resp.ok) throw new Error(`Ollama ${resp.status}`);
  const data = (await resp.json()) as any;
  return data.message.content;
}

// Use it
const result = await chat([
  { role: "system", content: "You summarize notes concisely." },
  { role: "user", content: noteContent },
]);
```
### Verification

```bash
# Test ollama directly
curl -s http://localhost:11434/api/generate \
  -d '{"model":"llama3.2","prompt":"Say hello in 3 words","stream":false}' | python3 -c "import json,sys; print(json.load(sys.stdin)['response'])"

# Test through the job
bun run src/cli.ts kick daily-digest
bun run src/cli.ts logs daily-digest
```