I’ve been singing the praises of local models of late, for so many reasons. From intelligent OCR to data crunching with enhanced privacy, there are gains to be had and they’re easy to access with free inferencing software like LM Studio and Ollama.
That said, there’s a moment that happens to a lot of people who work adjacent to tech – linguists, teachers, researchers – where they think: I’d love to tinker with these AI models properly – and maybe even build them directly into my own tech projects.
This post addresses that tinkering itch. The good news: it’s genuinely easier than you think, and you can get something running in an afternoon.
Why Python?
I ask this a lot, myself, coming from a totally different development background (full-stack and native web app coding). Going back into academia, Python seems to be everywhere.
Python has become the de facto language of AI and data science for a reason. Its syntax is readable almost like pseudocode, its libraries are extraordinarily well-developed and vast, and – linked to that last point – calling an API takes a handful of lines, not pages of custom routines. If you’re coming from a research or humanities background, Python also has the advantage of being widely taught in academic contexts, which means the community, tutorials, and Stack Overflow threads are abundant.
Compare calling an LLM in Python to doing the same in JavaScript or Swift, and you’ll understand immediately why the ‘AI for academia’ world standardised on Python.
And a big plus – it’s probably already installed on your machine. Open your terminal / command prompt interface, and type python --version or python3 --version. If you see a version number come back, you’re good to go. If not, head to python.org/downloads and grab the latest stable release – it’s a straightforward installer on every platform.
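If you'd like to check from inside Python itself, here is a minimal sketch. The 3.9 floor is my own assumption for illustration, not a requirement stated by any library in this post, so check each library's documentation for its real minimum.

```python
import sys

# Assumed minimum, for illustration only; adjust to what your libraries require.
MINIMUM = (3, 9)


def python_is_recent(minimum=MINIMUM):
    """Return True if the running interpreter is at least `minimum` (major, minor)."""
    return sys.version_info[:2] >= tuple(minimum)


print("Running Python", sys.version.split()[0])
print("Recent enough:", python_is_recent())
```

Save it as a file and run it with whichever command (python or python3) returned a version number for you.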
Two Ways In: Cloud or Local
Option 1: Hugging Face’s Free Inference API (great for experimenting, zero cost)
Hugging Face is essentially the GitHub of AI models – hundreds of thousands of open-source models, all in one place. The Serverless Inference API lets you call many of them without setting up any infrastructure, and the free tier is perfectly generous for tinkering and learning. You’ll hit rate limits if you go overboard, but for exploration it’s hard to beat.
Here’s what you need to get started:
- Create a free account at huggingface.co
- Go to Settings → Access Tokens and generate a token with Read permissions
- Install the library:
pip install huggingface_hub
Then you can call a model like this:
from huggingface_hub import InferenceClient
client = InferenceClient(
    model="meta-llama/Llama-3.2-11B-Vision-Instruct",
    token="hf_your_token_here"
)
response = client.text_generation("Explain enregisterment in simple terms.")
print(response)
That’s genuinely it for a first experiment. A few lines. No GPU. No cloud bill.
One gotcha: some popular models require you to accept their licence terms on the Hugging Face website before you can access them via the API. If you get a 403 error, that’s almost certainly why — head to the model page, accept the terms, and try again.
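Instruction-tuned models are usually happier with a chat-style request than with raw text generation, so here is a hedged sketch of the same call using the client's chat_completion method, with the 403 case caught explicitly. The helper names build_messages and ask are my own, the max_tokens value is arbitrary, and I'm assuming the error arrives as huggingface_hub's HfHubHTTPError carrying the HTTP response.

```python
def build_messages(question, system_prompt=None):
    """Build the chat-style message list that instruction-tuned models expect."""
    messages = []
    if system_prompt:
        messages.append({"role": "system", "content": system_prompt})
    messages.append({"role": "user", "content": question})
    return messages


def ask(question, model, token):
    """Send one question to a hosted model and return the reply text."""
    # Imported here so build_messages() works even without the library installed.
    from huggingface_hub import InferenceClient
    from huggingface_hub.utils import HfHubHTTPError

    client = InferenceClient(model=model, token=token)
    try:
        result = client.chat_completion(
            messages=build_messages(question),
            max_tokens=200,  # arbitrary cap for a short answer
        )
    except HfHubHTTPError as err:
        response = getattr(err, "response", None)
        if getattr(response, "status_code", None) == 403:
            return "403: accept the model's licence terms on its Hugging Face page, then retry."
        raise  # anything else is a different problem, so let it surface
    return result.choices[0].message.content


# The message list is plain data, so you can inspect it without any network access.
print(build_messages("Explain enregisterment in simple terms."))

# Example call (needs network access and a valid token):
# print(ask("Explain enregisterment in simple terms.",
#           model="meta-llama/Llama-3.2-11B-Vision-Instruct",
#           token="hf_your_token_here"))
```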
Option 2: LM Studio (run models locally, completely private)
If you’d rather not send your data to any external service – which matters for research involving sensitive text – LM Studio is still a brilliant solution. It gives you a clean interface to download and run open-source models on your own machine, with no internet connection required once the model is downloaded.
The local model landscape has improved dramatically. Models like Qwen3 (the 4B and 14B variants especially) are genuinely impressive on a modern laptop or desktop with a decent amount of RAM. You wouldn’t have believed this was possible two years ago.
LM Studio exposes a local API that mimics the OpenAI format, so you can call it from Python with the official openai library (pip install openai), exactly as you would call OpenAI itself:
from openai import OpenAI
client = OpenAI(
    base_url="http://localhost:1234/v1",
    api_key="not-needed"  # LM Studio doesn't require auth locally
)
response = client.chat.completions.create(
    model="qwen3-14b",  # whatever model you've loaded in LM Studio
    messages=[{"role": "user", "content": "Hello, what can you do?"}]
)
print(response.choices[0].message.content)
The openai library here is just a convenient HTTP client — you’re not actually talking to OpenAI. You’re talking to a model running on your own machine.
Common stumbling block: LM Studio’s server needs to be running and a model needs to be loaded before your script will work. The error when it’s not running is a bit cryptic: the openai library typically raises an APIConnectionError that says little more than "Connection error.", while lower-level HTTP clients show ConnectionRefusedError or similar. If you see either, it just means you haven’t started the server yet.
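A quick pre-flight check saves some head-scratching. This sketch uses only the standard library and asks the server which models it knows about. The function name is my own, and I'm assuming the default address used above and an OpenAI-style /models reply of the form {"data": [{"id": ...}]}.

```python
import json
import urllib.request


def lm_studio_models(base_url="http://localhost:1234/v1", timeout=2.0):
    """Return the model ids the local server reports, or None if it is unreachable."""
    try:
        with urllib.request.urlopen(f"{base_url}/models", timeout=timeout) as reply:
            payload = json.load(reply)
    except (OSError, ValueError):
        # OSError covers refused connections and timeouts;
        # ValueError covers replies that are not valid JSON.
        return None
    if not isinstance(payload, dict):
        return None
    return [entry.get("id") for entry in payload.get("data", [])]


models = lm_studio_models()
if models is None:
    print("LM Studio's server isn't reachable: start it, then try again.")
elif not models:
    print("The server is up but reports no models: download or load one first.")
else:
    print("Models reported by the server:", models)
```

Run this before your main script; if it prints the first message, the fix is in LM Studio rather than in your Python.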
Making the Output Actually Readable
Once you’re getting responses back, the next temptation is to do something with them in your terminal – loop through results, display analysis, format comparisons. The default print() approach quickly gets messy.
My namesake, the rich library, is a revelation here (how nice to have a Python library named after me). It adds colour, formatting, tables, and syntax highlighting to terminal output with almost no effort:
pip install rich
from rich.console import Console
from rich.markdown import Markdown
console = Console()
# client is the Hugging Face InferenceClient created in Option 1
response_text = client.text_generation("Write a haiku about Python.")
console.print(Markdown(response_text))
If the model returns markdown (which most do), rich will render it beautifully right in your terminal. Headers, code blocks, bold text — all of it. This is genuinely transformative for readability when you’re doing exploratory work.
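Tables are where rich really earns its keep when you compare several models side by side. Here is a small sketch: the helper names are my own, and the replies are made-up placeholder strings rather than real model output.

```python
def shorten(text, width=60):
    """Collapse whitespace and trim long replies so each fits on one table row."""
    flat = " ".join(text.split())
    return flat if len(flat) <= width else flat[: width - 3] + "..."


def show_comparison(replies):
    """Print a two-column table of model name versus (shortened) reply."""
    # Imported here so shorten() works even where rich is not installed.
    from rich.console import Console
    from rich.table import Table

    table = Table(title="Model comparison")
    table.add_column("Model", style="cyan")
    table.add_column("Reply")
    for model_name, reply in replies.items():
        table.add_row(model_name, shorten(reply))
    Console().print(table)


# Placeholder data for illustration only: swap in real responses from your own calls.
replies = {
    "model-a": "Placeholder reply from the first model.",
    "model-b": "Placeholder reply\nfrom the second model, spread over two lines.",
}

try:
    show_comparison(replies)
except ImportError:
    print("Install rich first: pip install rich")
```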
Don’t Stop at Chat: Sentence Transformers Are Worth Knowing About
Here’s where it gets interesting for researchers and linguists in particular. Large language models are great for generation — producing text, summarising, answering questions. But there’s a whole other class of model designed for understanding text semantically: sentence transformers.
The Sentence Transformers library (also known as SBERT) lets you turn text into numerical vectors that capture meaning. Two sentences that mean the same thing will have vectors that are close together; two unrelated sentences won’t. These vectors are called semantic embeddings.
Why does this matter? A few examples:
- Corpus linguistics for semantics: Automatically cluster dialect examples by semantic similarity rather than just keyword matching
- Research assistants: Find the most relevant papers or passages from a large collection based on meaning, not just exact words
- Teaching tools: Build a quiz that detects when a learner’s answer is semantically equivalent to the model answer, even if the wording is different
pip install sentence-transformers
from sentence_transformers import SentenceTransformer, util
model = SentenceTransformer("all-MiniLM-L6-v2")
sentences = [
    "The dialect features of the Black Country are highly distinctive.",
    "Black Country speech has unique phonological characteristics.",
    "The weather in Edinburgh is famously miserable."
]
embeddings = model.encode(sentences)
similarity = util.cos_sim(embeddings[0], embeddings[1])
print(f"Similarity: {similarity.item():.3f}") # will be high
This runs entirely locally (the model downloads once and caches), is fast even on a modest laptop, and opens up a whole world of computational approaches to language that go well beyond chatting with an LLM.
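If you're curious what a cosine similarity score actually measures, it is simple enough to write by hand. The sketch below uses toy three-dimensional vectors that I've invented for illustration (real all-MiniLM-L6-v2 embeddings have 384 dimensions), and the ranking helper is my own addition to show how "find the most relevant passage" falls out of the same arithmetic.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1.0 means same direction, 0.0 unrelated."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def rank_by_similarity(query_vec, corpus_vecs):
    """Return corpus indices ordered from most to least similar to the query."""
    scores = [cosine_similarity(query_vec, vec) for vec in corpus_vecs]
    return sorted(range(len(corpus_vecs)), key=lambda i: scores[i], reverse=True)


# Toy vectors, invented for illustration; they are not real embeddings.
dialect_a = [0.9, 0.1, 0.0]
dialect_b = [0.8, 0.2, 0.1]
weather = [0.0, 0.1, 0.9]

print("dialect_a vs dialect_b:", round(cosine_similarity(dialect_a, dialect_b), 3))
print("dialect_a vs weather:  ", round(cosine_similarity(dialect_a, weather), 3))
print("Ranking:", rank_by_similarity(dialect_a, [weather, dialect_b]))  # [1, 0]
```

Swap the toy lists for rows of the embeddings array from the example above and the same ranking function gives you a bare-bones semantic search.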
Getting Set Up: The Boring-but-Important Bit
There are just a few things I’ve learnt from my initial tinkerings that will save you headaches.
Use a virtual environment. Every time. Before you install anything for a new project, do:
python -m venv venv # or python3 -m venv venv, whichever command worked earlier
source venv/bin/activate # on Mac/Linux
venv\Scripts\activate # on Windows
This keeps your project’s dependencies isolated and prevents the infuriating “but it worked yesterday” problem where one project’s libraries silently break another’s.
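Not sure whether the environment is actually active? Python can tell you. This is a small sketch that relies on the standard venv behaviour, where sys.prefix differs from sys.base_prefix inside an environment; the function name is my own.

```python
import sys


def in_virtualenv():
    """True when the interpreter is running from inside a venv-style environment."""
    return sys.prefix != sys.base_prefix


print("Virtual environment active:", in_virtualenv())
print("Interpreter in use:", sys.executable)
```

If the first line prints False, activate the environment again before installing anything.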
Keep API secrets out of your code. Don’t paste your Hugging Face token directly into a script you might share or commit to GitHub. Use a .env file and the python-dotenv library:
pip install python-dotenv
# .env file (this file stays off GitHub — add it to .gitignore)
HF_TOKEN=hf_your_token_here
# your script
from dotenv import load_dotenv
import os
load_dotenv()
token = os.getenv("HF_TOKEN")
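One catch: os.getenv quietly returns None when the variable is missing, and you only find out later as a puzzling authentication failure. A tiny wrapper that fails early is worth having. The name require_env is my own, not part of python-dotenv.

```python
import os


def require_env(name):
    """Return the named environment variable, or raise a readable error if it is unset."""
    value = os.environ.get(name, "").strip()
    if not value:
        raise RuntimeError(
            f"{name} is not set: add it to your .env file or export it in your shell."
        )
    return value


# Typical use, after load_dotenv() has run:
# token = require_env("HF_TOKEN")
```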
Read error messages. This sounds obvious, but: most Python errors from LLM libraries tell you exactly what went wrong. A 401 means authentication failed (wrong or missing token). A 503 means the model is loading on the server side – wait a moment and retry. A ConnectionRefusedError from a local API almost always means LM Studio’s server isn’t running.
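Those status codes are easy to turn into friendlier messages, and the "wait and retry" advice for a 503 is easy to automate. The sketch below is library-agnostic: I'm assuming a hypothetical call function of your own that returns a (status_code, body) pair, and the attempt count and wait time are arbitrary.

```python
import time

# Hints taken from the status codes discussed in this post.
HINTS = {
    401: "Authentication failed: check that your token is present and correct.",
    403: "Access denied: accept the model's licence terms on its model page.",
    503: "The model is still loading on the server: wait a moment and retry.",
}


def hint_for_status(status_code):
    """Translate an HTTP status code into a plain-English next step."""
    return HINTS.get(status_code, f"Unexpected status {status_code}: read the full error message.")


def retry_while_loading(call, attempts=3, wait_seconds=5.0, sleep=time.sleep):
    """Run `call()` up to `attempts` times, pausing between tries while it returns 503."""
    status, body = call()
    for _ in range(attempts - 1):
        if status != 503:
            break
        sleep(wait_seconds)
        status, body = call()
    return status, body


print(hint_for_status(401))
print(hint_for_status(503))
```

Passing sleep as a parameter is a small design choice that lets you test the retry logic without actually waiting.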
What Next?
Once you’ve got a basic script running, the natural next steps are:
- Build a simple chat loop that keeps track of conversation history and lets you have a back-and-forth with a model
- Experiment with system prompts to give the model a persona or set of instructions
- Try different models on the same prompts and compare the results – it’s illuminating
- Start combining LLMs with sentence transformers for retrieval-augmented approaches where you search a corpus semantically before feeding results to a generative model
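To make the first two of those concrete: conversation history is just a growing list of messages, and a system prompt is simply the first entry in it. Here is a minimal sketch that runs offline. The names chat_turn and echo_model are my own, and echo_model is a stand-in that repeats your words back so you can see the mechanics without a model loaded.

```python
def chat_turn(history, user_text, send):
    """Append the user's message, get a reply via `send`, and record the reply too."""
    history.append({"role": "user", "content": user_text})
    reply = send(history)
    history.append({"role": "assistant", "content": reply})
    return reply


def echo_model(messages):
    """Stand-in for a real model call so this sketch runs without a server."""
    return f"(echo) {messages[-1]['content']}"


# The system prompt sits at the start of the history and shapes every later reply.
history = [{"role": "system", "content": "You are a concise assistant for linguists."}]

chat_turn(history, "What is a dialect?", echo_model)
chat_turn(history, "Give an example.", echo_model)

for message in history:
    print(f"{message['role']:>9}: {message['content']}")
```

To talk to a real model, replace echo_model with a function that passes the whole message list to client.chat.completions.create(...) from the LM Studio example and returns response.choices[0].message.content. Because the full history goes out on every turn, the model can refer back to earlier messages.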
The Python AI ecosystem is genuinely exciting right now, and the barrier to entry has never been lower. You don’t need a GPU, you don’t need a cloud account, and you don’t need to be a professional developer. You just need an afternoon and a bit of curiosity.
Have questions or want to share what you built? Drop a comment below.




