Three Local LLMs on an M1 Pro: A One-Week Shakedown
Updated May 2026. Sixteen months after the original test, with corrected model names, sharper claims about where each model wins and breaks, and the stack I actually run today added at the end.
I ran three local LLMs on a 16GB M1 Pro MacBook for a week of normal work. Inbox triage, code refactors, summaries, the occasional 2am debugging spiral.
This is not a benchmark. There is no rubric, no tokens-per-second chart, no eval suite. It is a shakedown. The output is the shape of where each model wins, where each model fails, and when you should reach for a different tool entirely.
I ran them through Ollama because I wanted to know how a normal developer's machine handles this stuff, not how a 192GB Mac Studio does.
Here is what survived contact with real work.
The Setup
Three models, all pulled fresh:
ollama pull mistral:7b
ollama pull deepseek-r1:14b
ollama pull llama3.2:3b
Machine: 16GB unified memory M1 Pro. Nothing fancy. The same laptop you probably already have.
Ollama is the only piece of plumbing you need. Install, pull, run. The whole stack is up in the time it takes to make coffee. There is no Docker. There is no Python virtualenv. There is no quantization decision for you to make. The Ollama defaults are sensible.
Two things to know that the marketing does not say:
The 14B is going to push 16GB of memory hard. Close your browser tabs. Quit Slack. The laptop will not crash but it will not be happy.
The 3B will feel free. You will be tempted to leave it running. Resist this if you care about battery.
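The memory squeeze is mostly arithmetic. Here is a back-of-envelope sketch, assuming roughly 4.5 bits per weight for a 4-bit quantized model. That figure is an assumption for illustration, not a number taken from Ollama, and the estimate counts the weights only.

```python
def estimate_weight_gb(params_billion: float, bits_per_weight: float) -> float:
    """Rough size of the model weights alone, in decimal gigabytes."""
    total_bytes = params_billion * 1e9 * bits_per_weight / 8
    return total_bytes / 1e9


# bits_per_weight=4.5 is an assumed average for a 4-bit quantization.
for name, size_b in [("3B", 3), ("7B", 7), ("14B", 14)]:
    print(f"{name}: ~{estimate_weight_gb(size_b, 4.5):.1f} GB of weights")
```

The context cache, the runtime, and macOS itself all sit on top of that number, which is why the 14B leaves so little room for browser tabs on a 16GB machine.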
Mistral 7B
The model you reach for when you have a hundred small jobs and no patience.
Where it won during a week of actual use: inbox triage, single-shot classification, paragraph-level summarization, "rewrite this Slack message so it sounds less passive-aggressive." Tasks where the answer is short, the context is small, and the cost of being wrong is one redo.
Where it broke: multi-step reasoning where each step depends on the prior. Anything that required tracking state across more than one chunk of code. Tone-shifting in the same response. Asking it for a friendly version and a formal version usually got one or the other, not both.
The shape of its failure is "did the obvious thing, did not see the second thing." You learn quickly to give it one job per prompt and accept the trade.
If your use case is volume, this is the model. First token is near-instant on the M1 Pro. You can chain 50 of these in the time it takes the 14B to finish one.
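The one-job-per-prompt habit can be sketched like this. It assumes Ollama is serving on its default local port and uses its /api/generate endpoint with streaming turned off. The three labels and the prompt wording are made up for illustration, not what I actually ran.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # Ollama's default local endpoint


def build_triage_request(message: str, model: str = "mistral:7b") -> dict:
    """One message, one question, one short answer."""
    prompt = (
        "Classify this message as exactly one of: urgent, reply, archive.\n"
        "Answer with the single label only.\n\n"
        f"Message:\n{message}"
    )
    return {"model": model, "prompt": prompt, "stream": False}


def ask(payload: dict) -> str:
    """Send one non-streaming request to a locally running Ollama server."""
    req = urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["response"].strip()


def triage(messages: list) -> list:
    # One request per message instead of one giant prompt with many jobs in it.
    return [ask(build_triage_request(m)) for m in messages]
```

Calling triage on a list of message strings returns one label per message. Each request carries a single small job, which is the shape this model handles well.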
DeepSeek R1:14B
The one that thinks out loud.
DeepSeek R1 surfaces its chain of reasoning in the response itself. You can read the model arguing with itself before it commits to an answer. For one-shot reasoning problems this is genuinely useful. You can see where the reasoning breaks. You can see where you under-specified the prompt. The transparency is not just a flex. It is a debug surface.
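If you want the reasoning and the answer as separate strings, a small splitter is enough. This sketch assumes the model wraps its reasoning in think tags inside the response text. Depending on your Ollama version and settings the reasoning may arrive in a separate field instead, so treat the tag format as an assumption to check against your own output.

```python
import re

THINK_BLOCK = re.compile(r"<think>(.*?)</think>", re.DOTALL)


def split_reasoning(raw: str) -> tuple:
    """Return (reasoning, answer) from a response containing <think> blocks."""
    reasoning = "\n".join(part.strip() for part in THINK_BLOCK.findall(raw))
    answer = THINK_BLOCK.sub("", raw).strip()
    return reasoning, answer


raw = "<think>Two options. A is simpler.</think>\nGo with option A."
reasoning, answer = split_reasoning(raw)
print(answer)  # Go with option A.
```

Log the reasoning, show the answer. When a response goes wrong, the logged reasoning is where you look first.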
Where it won: planning prompts. Architectural sketch-outs. "Walk me through how I would structure X." It thinks longer and the extra thinking shows up in the output. For ten-minute problems it can be worth the wait.
Where it broke: multi-turn. Around turn four or five of a dense technical conversation the model would lose track of which file we were editing, which variant of a function we had settled on, which constraints we had ruled out. It is a one-shot reasoner pretending to be a chat partner. If you treat it like a chat partner, you will get burned.
It also hallucinates package names and API surfaces with confidence. Where the actual API has changed since its training data was collected, the model will quietly make something up. Always verify the imports.
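For generated Python, part of that verification can be automated. This is a minimal sketch, not a full check: it parses the code without running it and lists top-level modules that are not installed in the current environment. The package name in the example is deliberately made up.

```python
import ast
import importlib.util


def missing_imports(source: str) -> list:
    """Top-level modules imported by `source` that cannot be found locally."""
    modules = set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            modules.update(alias.name.split(".")[0] for alias in node.names)
        elif isinstance(node, ast.ImportFrom) and node.module and node.level == 0:
            modules.add(node.module.split(".")[0])
    return sorted(m for m in modules if importlib.util.find_spec(m) is None)


# totally_made_up_pkg_xyz stands in for a hallucinated package name.
generated = "import json\nfrom totally_made_up_pkg_xyz import loads\n"
print(missing_imports(generated))
```

This only catches packages that do not exist on your machine. A real package with an invented function on it will pass, so you still have to read the calls.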
And the obvious one. It is heavy. 14B on a 16GB Mac is the upper end of comfortable. First token is slow. If you are doing anything else memory-intensive, you will feel the laptop think.
Llama 3.2:3B
Solid. Forgettable. Useful for both reasons.
The 3B was the model I stopped thinking about. Which is praise. It handled normal chat, normal Q&A, normal "explain this concept to me" prompts without drama. It loaded fast, responded fast, and used a sensible amount of memory.
Where it won: anything where speed mattered more than depth. Real-time interface use. Drafting a quick reply. Asking a question while compiling something else.
Where it broke: niche topics. Anything specialist. Anything where the answer needs domain expertise. The 3B is a generalist that knows a little about a lot. When you push it into a corner where you need depth, it answers in vague-confident prose that sounds right and reads wrong.
If you only run one local model on a 16GB Mac and you want it to feel snappy, this is the one.
When To Reach For Which
The decision framework is simple once you have run them all.
Reach for Mistral 7B when you have volume and small tasks. Inbox triage, classification, short summaries, one-line rewrites. The model that earns its keep on throughput.
Reach for DeepSeek R1:14B when you have one hard problem and the time to let it think. Planning. Architecture. Standalone reasoning. Watch for hallucinated imports.
Reach for Llama 3.2:3B when you want speed and you want to stop thinking about model selection. The default. The boring choice.
Reach for none of them when:
- You need a real context window. Run this way, these are short-context models: Ollama's default context is small, and on 16GB there is little room to raise it. A long document or a real codebase will not fit.
- The cost of hallucination is high. Anything touching money, legal, medical, or production code that ships.
- You need reliable tool use. None of these three was strong at structured tool invocation. The paid frontier models still own this lane.
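The whole framework fits in one small function. This is a sketch of the rules above, not a router I actually ran. The task labels and field names are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Task:
    kind: str                   # "batch", "planning", or "chat" (hypothetical labels)
    high_stakes: bool = False   # money, legal, medical, production code that ships
    needs_tools: bool = False   # structured tool invocation
    long_context: bool = False  # long document or a real codebase


def pick_model(task: Task) -> Optional[str]:
    """Return an Ollama model tag, or None when no local model fits."""
    if task.high_stakes or task.needs_tools or task.long_context:
        return None  # reach for a different tool entirely
    if task.kind == "batch":
        return "mistral:7b"
    if task.kind == "planning":
        return "deepseek-r1:14b"
    return "llama3.2:3b"  # the boring default


print(pick_model(Task(kind="batch")))  # mistral:7b
```

The disqualifiers are checked first on purpose. A task that trips any of them should never reach the part where you pick by speed or depth.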
Local LLMs are not free. They cost your laptop's attention, your fan's lifespan, and your willingness to babysit memory pressure. The question is not "can I run this." The question is whether the price is worth what you save on tokens.
For high-volume narrow tasks, yes. For high-stakes one-shot reasoning, sometimes. For everything in between, your laptop's workload decides.
What I Would Do Differently
Looking back on the original test:
I should have logged tokens per second instead of writing "smooth as butter." A vibe is not a measurement.
I should have run the same set of prompts across all three models instead of letting the conversations diverge. Without controlled input you compare impressions, not behavior.
I should have measured time to first token. That is the perceived latency. It is what determines whether the model feels usable.
I should have stress-tested at 80% memory pressure, not on a fresh-boot system. You almost never use these models on a fresh-boot system.
Most of all I should have defined what "good" looked like before I started. Without success criteria you end up grading each answer against whatever the model happened to say, not against what the task needed.
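A harness for the first three of those fixes could look something like this. It is a sketch I did not run for the original test. It assumes Ollama on its default port and relies on the eval_count and eval_duration fields that Ollama reports in the final chunk of a streamed response. The prompts are placeholders.

```python
import json
import time
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"
MODELS = ["mistral:7b", "deepseek-r1:14b", "llama3.2:3b"]
PROMPTS = [  # placeholder prompts; fix your own set before you start
    "Summarize in one sentence: the deploy failed because the migration timed out.",
    "Explain what a mutex is in two sentences.",
]


def tokens_per_second(final_chunk: dict) -> float:
    """Generation speed from eval_count and eval_duration (nanoseconds)."""
    return final_chunk["eval_count"] / (final_chunk["eval_duration"] / 1e9)


def measure(model: str, prompt: str) -> dict:
    """Stream one response; record time to first token and tokens per second."""
    body = json.dumps({"model": model, "prompt": prompt, "stream": True}).encode()
    req = urllib.request.Request(
        OLLAMA_URL, data=body, headers={"Content-Type": "application/json"}
    )
    start = time.perf_counter()
    first_token_s = None
    with urllib.request.urlopen(req) as resp:
        for line in resp:  # the stream is one JSON object per line
            if not line.strip():
                continue
            chunk = json.loads(line)
            if first_token_s is None and chunk.get("response"):
                first_token_s = time.perf_counter() - start
            if chunk.get("done"):
                return {
                    "model": model,
                    "prompt": prompt,
                    "ttft_s": first_token_s,
                    "tok_per_s": tokens_per_second(chunk),
                }
    raise RuntimeError("stream ended without a final chunk")


def run_all() -> list:
    # Same prompts against every model, so the inputs are controlled.
    return [measure(m, p) for m in MODELS for p in PROMPTS]
```

The first request to each model includes load time in its time to first token, so run each model once as a warm-up or report cold and warm numbers separately. Run it again with your usual apps open to get the memory-pressure case.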
The version of this post that defined those things up front would have been a benchmark. This one is a shakedown. Both are useful. They are not the same thing, and pretending otherwise was the original sin of the first draft.
What I Run Now
Sixteen months later the stack looks different. I do not chase models week to week.
The local-primary slot is Qwen 2.5:7B through Ollama. It replaced everything I used to run, because it is good enough at most things that I no longer need three models. Above that sits a free-at-margin provider chain I route through Hermes: codex_cli via my ChatGPT Plus quota, claude_cli, and GitHub Copilot via a gh-auth credential. Paid Anthropic sits at the bottom as the final fallback. The whole point is to keep token spend near zero while keeping the option to escalate when a task earns it.
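The fallback idea itself is simple: try each provider in order and take the first answer that comes back. The sketch below is generic and is not Hermes's API or configuration. The provider functions are stand-ins that only borrow names from the chain for illustration.

```python
from typing import Callable

Provider = Callable[[str], str]  # takes a prompt, returns text, raises on failure


def first_success(chain: list, prompt: str) -> tuple:
    """Try (name, provider) pairs in order; return (name, reply) from the first that works."""
    errors = []
    for name, call in chain:
        try:
            return name, call(prompt)
        except Exception as exc:  # quota exhausted, timeout, auth failure, ...
            errors.append(f"{name}: {exc}")
    raise RuntimeError("all providers failed: " + "; ".join(errors))


# Stand-in providers for illustration only.
def out_of_quota(prompt: str) -> str:
    raise RuntimeError("quota exhausted")


def echo(prompt: str) -> str:
    return f"echo: {prompt}"


chain = [("codex_cli", out_of_quota), ("claude_cli", echo)]
print(first_success(chain, "hello"))  # ('claude_cli', 'echo: hello')
```

Order the chain by marginal cost, cheapest first, and the paid option only ever sees the prompts that everything above it failed on.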
The 14B-on-a-laptop era is mostly over for me. If I need that level of reasoning I send it up the chain to a model that has the hardware behind it. The 3B-as-a-default era is also over. Qwen 2.5:7B occupies that slot now and does it better.
The thing that did not change is the reason for running locally at all. Privacy on the small stuff. Latency on the smaller stuff. And not paying for tokens on a job that does not need them.
Close
Models change.
The reason to run them locally does not.