If you were following along last week, I was deep into the territory of running open, small-scale Large Language Models (LLMs) locally on a laptop in the free LM Studio environment. There are lots of reasons you’d want to run these mini chatbots, including the educational, environmental, and security aspects.
I finished off with a very cursory mention of an even more mobile vehicle for these, PocketPal. This free, open-source app (available for Android and iOS) makes it easy – no computer science degree required – to search for, download and run LLMs on smartphones and tablets. And, despite the resource limitations of mobile devices compared with full computer hardware, they run surprisingly well.
PocketPal is such a powerful and unique tool, and definitely worth a spotlight of its own. So, this week, I thought I’d share some tips and tricks I’ve found for smooth running of these language models in your pocket.
Full-Fat LLMs?
First off, even small, compact models can be (as you’d expect) unwieldy, resource-heavy files. Compressed, self-contained LLM models are available as .gguf files from sources like Hugging Face, and they can be colossal. There’s a process you’ll hear mentioned a lot in the AI world called quantisation, which shrinks a model by storing its weights at lower numerical precision. Generally speaking, the heavier the compression, the worse the model performs. But even the most highly compressed small models can weigh in at 2 GB and above. After downloading them, these mammoth blobs then load into memory, ready to be prompted. That’s a lot of data for your system to be hanging onto!
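You can get a feel for those file sizes with a back-of-the-envelope calculation – this is just my own rough rule of thumb, not a formal spec, and real .gguf files usually come out a bit larger because some tensors (like embeddings) are kept at higher precision:

```python
def estimated_gguf_gb(n_params: float, bits_per_weight: float) -> float:
    """Rough .gguf size estimate: parameter count times bits per weight,
    converted to gigabytes. Ignores metadata and the small overhead
    that quantisation formats add."""
    return n_params * bits_per_weight / 8 / 1e9

# A 4-billion-parameter model quantised to 8 bits per weight
# lands around 4 GB -- in the ballpark of the sizes listed below.
print(round(estimated_gguf_gb(4e9, 8), 1))  # → 4.0
```

Halve the bits and you roughly halve the download – which is exactly the trade-off quantisation is making against output quality.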
That said, with disk space, a good internet connection, and decent RAM, it’s quite doable. On a newish MacBook, I was comfortably downloading and running .gguf files of 8 GB and above in LM Studio. And you don’t need to downgrade your expectations too much to run models in PocketPal, either.
For reference, I’m using a 2023 iPad Pro with the M2 chip – quite a modest spec now – and a 2024 iPhone 16. On both of them, the sweet spot seems to be a .gguf size of around 4 GB – you can go larger, but there’s a noticeable slowdown and sluggishness beyond that. A couple of the models I’ve been getting good, sensible and usable results from on mobile recently are:
- Qwen3-4B-Instruct (8-bit quantised version) – 4.28 GB
- Llama-3.2-3B-Instruct (6-bit quantised version) – 3.26 GB
The ‘instruct’ in those model names refers to the fact that they’ve been trained to follow instructions particularly keenly – one of the reasons they give such decent practical prompt responses with a small footprint.
Optimising PocketPal
Once you have them downloaded, there are a couple of things you can tweak in PocketPal to eke out even more performance.
The first is to head to the settings and switch on Metal, Apple’s low-level graphics and compute API. Then increase the “Layers on GPU” setting to around 80 or so – you can experiment with this to see what your system is happy with. The performance improvement should be instantaneous, with the LLM spitting out tokens at several times the default speed.
What’s happening with this change is that iOS shifts some of the processing from the device’s CPU to its GPU, or graphics processing unit. That may seem odd, but modern graphics chips are built for exactly the kind of intense matrix arithmetic LLMs rely on, and this small switch recruits them into doing some of the heavy lifting.
Additionally, on some recent devices, switching on “Flash Attention” can bring extra performance gains. Flash Attention changes how the attention scores – the weights an LLM assigns between tokens – are computed, avoiding the need to hold the full attention matrix in memory during generation. Whether it makes a difference is pot luck, depending on device spec, but I see a little boost.
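To see why avoiding that matrix matters, here’s a rough sketch of how much memory the full attention scores would take for a hypothetical model – the head count and context length here are illustrative figures of my own, not the specs of any particular model:

```python
def attention_matrix_mb(context_len: int, n_heads: int,
                        bytes_per_value: int = 2) -> float:
    """Memory (in MB) needed to materialise one layer's full attention
    scores: a context_len x context_len matrix per head, at the given
    precision (2 bytes = fp16). This is the allocation Flash Attention
    sidesteps by computing attention in small tiles instead."""
    return context_len ** 2 * n_heads * bytes_per_value / 1e6

# A hypothetical 4k-token context with 32 heads in fp16:
# over a gigabyte per layer if written out in full.
print(round(attention_matrix_mb(4096, 32)))  # → 1074
```

On a phone with a few gigabytes of usable RAM, trimming allocations like that is the difference between smooth generation and a stall.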

Tweaking PocketPal’s settings to run LLMs more efficiently
Making Pals – Your Own Custom Bots
When you’re all up and running with your PocketPal LLMs, there’s another great feature you can play with to get very domain-specific results – “Pal” creation. Pals are just system prompts – the instructions that set the boundaries and parameters for a conversation – in a nice wrapper. And you can be as specific as you want with them, instructing the LLM to behave as a language-learning assistant, a nutrition expert, a habits coach, and the like – with as many rules and output notes as you see fit. It’s an easy way to turn a very generalised tool into something focused, with real-world application.
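Under the hood, a Pal boils down to a system message prepended to the chat history. Here’s a minimal sketch of the idea in the standard chat-message format most local LLM tooling uses – the helper function and the prompt wording are my own invention, not PocketPal’s internals:

```python
def make_pal(system_prompt: str) -> list[dict[str, str]]:
    """Start a chat history seeded with the Pal's instructions.
    The system message shapes every response that follows."""
    return [{"role": "system", "content": system_prompt}]

# A hypothetical nutrition-expert Pal:
nutrition_pal = make_pal(
    "You are a nutrition expert. Keep answers brief, give rough "
    "calorie figures, and always suggest one practical food swap."
)

# Each user turn is then appended to the same history:
nutrition_pal.append(
    {"role": "user", "content": "Is porridge a good breakfast?"}
)
```

Because the system prompt rides along with every exchange, even a small 3–4B model stays usefully on-task instead of drifting into generic chatbot mode.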
So that’s my PocketPal in-a-nutshell power guide. I hope you can see why it’s worth much more than just a cursory mention at the end of last week’s post! Tools like PocketPal and LM Studio put you right at the centre of LLM development, and I must admit it’s turned me into a models geek – I’m already looking forward to what new open LLMs will be unleashed next.
So what have you set your mobile models doing? Please share your tips and experiences in the comments!