Open source models have gotten unbelievably good, which has opened up a world of home uses for enthusiasts of all kinds. With no AI background you can have Ollama serving models on any laptop or desktop machine you have lying around, and with a little more work you can turn that into your own free1 service to power your RAG, personal finance, or even code editing projects.
tl;dr – a recent Mac is just fine for your development, and you can get far with a modest GPU investment. Current Intel/AMD CPU-only machines (even large cloud VMs) are not worthwhile.
But what can you expect from your personal hardware, and what can you get if you’re willing to upgrade it? I don’t mean leasing a GPU farm for a year, just spending a little on your laptop or video card to improve your inference speed and the size of model you can run. I tested four open source models on a consumer-grade NVIDIA graphics card, modern Apple silicon, and a reasonably powerful cloud (CPU-only) VM to see what would work for me for development and home service.
To do that I wrote a small LLM benchmark tool I call Unladen SwaLLM2 that takes a file of prompts (or a single prompt on the command line) and runs them against a set of LLMs on a given Ollama server, then outputs JSON stats and, optionally, the responses for evaluation (for which I’d recommend a public AI service). I ran this against the three platforms I had available.
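If you’re curious what a run boils down to, here’s a minimal sketch of the core idea against Ollama’s REST API (the /api/generate endpoint) – the model list and prompt are placeholders, and this isn’t the actual unladen-swallm code:

```python
# benchmark_sketch.py -- one prompt against several models, recording
# wall-clock latency per request. Assumes Ollama is serving on
# localhost:11434 and the listed models are already pulled.
import json
import time

import requests

OLLAMA_URL = "http://localhost:11434/api/generate"
MODELS = ["llama3.1:8b-instruct-q4_K_M", "mistral:latest"]  # placeholders

def run_prompt(model: str, prompt: str, timeout: float = 90.0) -> dict:
    """Send one non-streaming generate request and return basic stats."""
    start = time.perf_counter()
    resp = requests.post(
        OLLAMA_URL,
        json={"model": model, "prompt": prompt, "stream": False},
        timeout=timeout,
    )
    resp.raise_for_status()
    return {
        "model": model,
        "latency_s": round(time.perf_counter() - start, 2),
        "response": resp.json().get("response", ""),
    }

if __name__ == "__main__":
    prompt = "Explain the sieve of Eratosthenes in two sentences."
    print(json.dumps([run_prompt(m, prompt) for m in MODELS], indent=2))
```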
Hardware Tested
| Platform | CPU | RAM | GPU | Upfront Cost | Operating Cost |
|---|---|---|---|---|---|
| Apple M4 Laptop | Apple M4 (10-core) | 16GB | Integrated | $1,099 | ~$0/hr |
| GPU Server | Intel i7-12700KF | 32GB | RTX 3070 8GB | ~$1,200 | ~$0.20/hr* |
| Cloud VM | AMD EPYC 7J13 (12 vCPU) | 128GB | None | Pay-as-you-go | ~$0.50/hr |
*Electricity cost when running inference
So the good news so far is that you don’t even need an internet connection to do AI development on your laptop. But most real-world applications will require some degree of concurrency, so I also tested with 5 and 10 asynchronous requests at a time.
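The concurrency test is conceptually simple: fire several requests at the same model at once and see how per-request latency degrades. A rough sketch of how you could reproduce that yourself (plain threads against the same endpoint; the model name and prompt are again placeholders, not the tool’s implementation):

```python
# concurrency_sketch.py -- fire N identical requests at one model at once
# and report per-request latency. Assumes Ollama is serving on localhost:11434.
import statistics
import time
from concurrent.futures import ThreadPoolExecutor

import requests

OLLAMA_URL = "http://localhost:11434/api/generate"
MODEL = "llama3.1:8b-instruct-q4_K_M"  # placeholder; any pulled model works
PROMPT = "Summarize the rules of chess in three sentences."
CONCURRENCY = 10

def one_request(_: int) -> float:
    """Send one blocking generate request and return its wall-clock latency."""
    start = time.perf_counter()
    resp = requests.post(
        OLLAMA_URL,
        json={"model": MODEL, "prompt": PROMPT, "stream": False},
        timeout=300,
    )
    resp.raise_for_status()
    return time.perf_counter() - start

if __name__ == "__main__":
    with ThreadPoolExecutor(max_workers=CONCURRENCY) as pool:
        latencies = list(pool.map(one_request, range(CONCURRENCY)))
    print(f"median latency {statistics.median(latencies):.1f}s, "
          f"max {max(latencies):.1f}s at concurrency {CONCURRENCY}")
```

Worth noting: Ollama limits how many requests it actually serves in parallel (the OLLAMA_NUM_PARALLEL setting), so at higher concurrency some of the extra latency may be queueing rather than compute.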
Performance
The first thing to review is raw speed, and no shock: GPUs rule as long as the model fits in memory (~4.9GB model files seem to be the sweet spot for the 3070’s 8GB of VRAM, leaving room for the KV cache and runtime overhead).
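As an aside on where speed numbers can come from: Ollama’s non-streaming responses include the generated token count and generation time (eval_count and eval_duration, the latter in nanoseconds), so tokens-per-second falls straight out of the response body. A sketch of the arithmetic (not necessarily how unladen-swallm reports its stats):

```python
# Tokens per second from an Ollama /api/generate response body;
# eval_duration is reported in nanoseconds.
def tokens_per_second(body: dict) -> float:
    return body["eval_count"] / (body["eval_duration"] / 1e9)
```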

A note about the testing – The unladen-swallm repo has the prompts file I used, and the full command line was:
```
swallm benchmark -m llama3.1:8b-instruct-q4_K_M -m cogito:8b -m deepseek-r1:8b -m mistral:latest -t 90 -c 10 -P ./more_eval_prompts.txt -r -o output_concurrent_10.json
```
This tests the four models, assuming Ollama is running on localhost (otherwise pass -H <hostname:port>), with a concurrency of 10 requests per model and a timeout of 90 seconds. The -r flag includes the responses in the JSON output, which I then gave to Claude Opus 4.5 (new model!) to review for quality. Some surprises there, too.
Quality
All four models failed this prompt:
You have a backpack with a 15kg weight limit. Choose the optimal combination of these items—camera (5kg), laptop (3kg), tent (7kg), food pack (6kg), and water (2kg)—to maximize utility for a 2-day hiking trip. Explain your reasoning in under 120 words.
(Tent, food, and water: 7 + 6 + 2 = exactly 15kg. C’mon now.)
For code generation, all four models handled utility scripts competently – none will refactor a large codebase, but they’ll write your sieve of Eratosthenes. Cogito was the only one to produce a subtly incorrect bug fix. For tasks with strict format constraints (word counts, paragraph limits), deepseek-r1 was the most reliable but also the slowest. YMMV. Instruction following was similar: no model was perfect, but all were good, though mistral had the most trouble with format instructions (e.g., when told to “write 3 paragraphs” it produced a numbered list).
Concurrency
Add concurrency and the speed gaps become more pronounced. The CPU-only VM becomes unusable, and the M4 loses a lot more ground to the GPU: its latency goes up 4x at 10 concurrent prompts versus closer to 3x for the 3070. So for production home inference, you should really pop for the $300 GPU.
Conclusion
My conclusion: Mac laptop for development, GPU for anything real, skip CPU-only AMD/Intel inference entirely (for now). If you want to test your own setup, my code is on GitHub. Next I’m wondering about the backend – is the convenience of Ollama leaving performance on the table?
1. “Free” apart from the electric bill ↩︎
2. Apologies to Monty Python ↩︎

