Fine-Tune an LLM on Your MacBook with LoRA: A Hands-On Guide

Introduction

Fine-tuning a large language model used to mean renting a cluster of A100s and setting aside a weekend. That's no longer true. With modern parameter-efficient techniques and the hardware in a recent MacBook, you can meaningfully adapt an open-source model to a specific domain - locally, in minutes, without writing a single cloud provider a check.

This project is a small proof-of-concept built to test exactly that. The idea was simple: take a pre-trained model, feed it a focused document corpus, and measure whether the resulting model gives noticeably better answers on that domain compared to the base model out of the box.

The approach used here is LoRA (Low-Rank Adaptation) - a technique that freezes the base model's weights and introduces a small set of trainable matrices alongside the existing layers. Instead of updating billions of parameters on every training step, you're only updating a fraction of them. The base model stays intact; the adapter learns the delta. It's fast, memory-efficient, and surprisingly effective for domain adaptation.

The document corpus chosen for this POC is the GDPR - the EU's General Data Protection Regulation. It's a natural fit: dense, technical, domain-specific, and the kind of text that generic language models handle poorly. Most models have seen summaries and commentary about GDPR, not the regulation itself. Fine-tuning directly on the source gives the model something closer to first-hand knowledge.

The full pipeline - from PDF ingestion to a queryable web interface - runs locally on an Apple M1 Max, with no cloud services or external APIs involved. One document, one epoch, one focused experiment. The goal was to validate the method, not ship a product.

The Problem with Generic LLMs

Out-of-the-box language models are trained on the internet. They've seen summaries of GDPR. They've seen blog posts about it. But they haven't deeply internalized the actual text - the specific articles, recitals, and obligations that matter when you're trying to answer a real compliance question.

The fix isn't to write a better prompt. The fix is to teach the model the source material directly. That's fine-tuning. (If you'd rather keep the base model untouched and just retrieve the right chunks at query time, that's RAG - a different tool for a different job.)

The Stack: LoRA, Phi-2, PEFT, Streamlit

Model: Microsoft Phi-2 (2.7B parameters)
Fine-tuning method: LoRA (Low-Rank Adaptation)
Hardware: Apple M1 Max, 32GB unified memory
Interface: Streamlit web app
PDF processing: PyPDF2 + LangChain text splitters
Training: HuggingFace Transformers + PEFT

Nothing exotic. Everything open source.

Step 1: Turning a PDF into Training Data

The first hurdle is getting text out of a regulatory PDF in a format a model can actually learn from.

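The pipeline uses PyPDF2 to extract raw text, then LangChain's RecursiveCharacterTextSplitter to cut it into digestible chunks. (The splitter measures length in characters by default; sizes expressed in tokens need a tokenizer-based length function, for example via its from_huggingface_tokenizer constructor.)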

Chunk size: 128 tokens
Overlap: 25 tokens

The overlap is important. You don't want a sentence about data subject rights split across two chunks with no shared context - the model would learn two orphaned half-thoughts instead of one coherent idea.

One GDPR PDF produces about 1,500 usable chunks. That's the pool your training data comes from; the run described below trains on 500 of them.
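
To make the effect of overlap concrete, here is a minimal sliding-window sketch. It is a simplified stand-in, not LangChain's algorithm: RecursiveCharacterTextSplitter also tries to break on paragraph and sentence boundaries, and the whitespace "tokens" used here are an assumption for illustration (a real run would count model tokens). The function name chunk_with_overlap is hypothetical.

```python
def chunk_with_overlap(tokens, chunk_size=128, overlap=25):
    """Split a token list into fixed-size windows that share `overlap` tokens."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    step = chunk_size - overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # this window already reaches the end of the text
    return chunks


# Whitespace splitting is a stand-in for a real tokenizer.
words = ("the data subject shall have the right to obtain " * 40).split()
chunks = chunk_with_overlap(words)

# The tail of one chunk is repeated at the head of the next.
print(chunks[0][-25:] == chunks[1][:25])
```

Each window advances by chunk_size minus overlap (103 tokens with the settings above), so a sentence that straddles a boundary appears intact in at least one of the two neighbouring chunks, provided it is shorter than the overlap.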

Step 2: Why LoRA Instead of Full Fine-Tuning

Fully fine-tuning a 2.7B-parameter model means updating all 2.7 billion weights on every training step. On a laptop, that's not just slow - it's a non-starter.
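
A back-of-the-envelope estimate shows why. The sketch below assumes fp32 weights and gradients plus the two fp32 moment buffers that Adam-style optimizers keep per parameter (16 bytes per parameter in total). It ignores activations, so treat it as a lower bound rather than a measurement; the function name is hypothetical.

```python
def full_finetune_bytes(n_params, bytes_weight=4, bytes_grad=4, bytes_optim=8):
    """Lower-bound memory for full fine-tuning: weights + gradients + Adam moments."""
    return n_params * (bytes_weight + bytes_grad + bytes_optim)


n_params = 2_700_000_000  # 2.7B parameters, as stated above
needed_gb = full_finetune_bytes(n_params) / 1e9
print(f"{needed_gb:.1f} GB needed vs 32 GB of unified memory")
```

Under those assumptions the total is about 43 GB before a single activation is stored, which already exceeds the 32 GB available on the M1 Max used here.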

LoRA is the practical alternative. Instead of touching every weight, you freeze the base model and insert small trainable matrices alongside the attention and feed-forward layers. The result: you're only updating a small fraction of total parameters (in this config, well under 1% - whatever model.print_trainable_parameters() reports for your setup), while still achieving meaningful domain adaptation.
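
The mechanics can be sketched in a few lines of plain Python. This is an illustration under stated assumptions rather than PEFT's implementation: the toy matrices and the helper names (matvec, lora_forward) are hypothetical, and the alpha / r scaling follows the standard LoRA formulation.

```python
def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def lora_forward(W, A, B, x, alpha, r):
    """Frozen path W @ x plus the scaled low-rank update B @ (A @ x)."""
    base = matvec(W, x)              # frozen base weights, never updated
    delta = matvec(B, matvec(A, x))  # trainable path through the rank-r bottleneck
    scale = alpha / r
    return [b + scale * d for b, d in zip(base, delta)]


W = [[1.0, 0.0], [0.0, 1.0]]  # frozen 2x2 base weight (toy values)
A = [[0.5, -0.5]]             # r x d_in, with r = 1 in this toy example
B_init = [[0.0], [0.0]]       # d_out x r, zero-initialised at the start of training
x = [2.0, 1.0]

# With B at zero the adapter contributes nothing: output equals the base model.
print(lora_forward(W, A, B_init, x, alpha=2, r=1))     # [2.0, 1.0]

# Once B has been trained away from zero, the output shifts by the learned delta.
B_trained = [[1.0], [0.0]]
print(lora_forward(W, A, B_trained, x, alpha=2, r=1))  # [3.0, 1.0]
```

Only A and B receive gradient updates; W is read but never written. With the settings used in this project (r=4, lora_alpha=8) the scale factor is likewise 8 / 4 = 2.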

The target layers in this project are q_proj, v_proj, fc1, and fc2 - the attention projections and feed-forward components where domain knowledge tends to live. The LoRA config looks like this:

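from peft import LoraConfig

lora_config = LoraConfig(
    r=4,               # rank of the low-rank update matrices
    lora_alpha=8,      # scaling numerator; effective scale is lora_alpha / r = 2
    lora_dropout=0.1,  # dropout applied on the adapter path during training
    target_modules=["q_proj", "v_proj", "fc1", "fc2"],
)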

Low rank. Lightweight. Effective enough for a proof-of-concept.
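
To see where "well under 1%" comes from, the adapter size can be worked out by hand. The sketch below assumes Phi-2's published dimensions (hidden size 2560, feed-forward size 10240, 32 transformer layers) and roughly 2.78 billion total parameters; check these against the config of the checkpoint you actually load. Each adapted linear layer with input width d_in and output width d_out gains r * (d_in + d_out) trainable parameters.

```python
R = 4
HIDDEN, FFN, LAYERS = 2560, 10240, 32  # assumed Phi-2 dimensions
TOTAL_PARAMS = 2_780_000_000           # assumed approximate base model size

# (d_in, d_out) for each target module in one transformer layer
TARGET_SHAPES = {
    "q_proj": (HIDDEN, HIDDEN),
    "v_proj": (HIDDEN, HIDDEN),
    "fc1": (HIDDEN, FFN),
    "fc2": (FFN, HIDDEN),
}


def lora_param_count(shapes, rank, n_layers):
    """A is rank x d_in and B is d_out x rank, so each module adds rank * (d_in + d_out)."""
    per_layer = sum(rank * (d_in + d_out) for d_in, d_out in shapes.values())
    return per_layer * n_layers


trainable = lora_param_count(TARGET_SHAPES, R, LAYERS)
print(f"{trainable:,} trainable parameters ({trainable / TOTAL_PARAMS:.2%} of the base model)")
```

Under those assumptions the adapter has about 4.6 million trainable parameters, roughly 0.17% of the model. model.print_trainable_parameters() remains the authoritative figure for any given setup.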

Step 3: Training on Apple Silicon with MPS

PyTorch's MPS (Metal Performance Shaders) backend runs on the GPU cores of Apple's M-series chips, which lets you train on the laptop's own GPU - no NVIDIA required.
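
Selecting the backend is a short check. The sketch below is one way to do it, not necessarily how this project does it: the helper name pick_device is hypothetical, and the fallback order (MPS, then CUDA, then CPU) is an assumption.

```python
def pick_device():
    """Return "mps" on Apple silicon when available, else "cuda", else "cpu"."""
    try:
        import torch
    except ImportError:
        return "cpu"  # PyTorch is not installed, so there is nothing to accelerate
    mps = getattr(torch.backends, "mps", None)  # absent on very old PyTorch builds
    if mps is not None and mps.is_available():
        return "mps"
    if torch.cuda.is_available():
        return "cuda"
    return "cpu"


device = pick_device()
print(f"Using device: {device}")
```

The returned string can be passed to model.to(device) and to each tensor in a batch, so the same script runs unchanged on machines without an Apple GPU.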

The training loop is a custom implementation (bypassing HuggingFace's Trainer) to give tighter control over memory usage and real-time progress feedback. Configuration is straightforward:

model_name: "microsoft/phi-2"
learning_rate: 0.0005
batch_size: 1
num_epochs: 1
max_length: 2048

Training on 500 samples for one epoch took about 6.5 minutes on an M1 Max. Loss dropped from 0.4566 to 0.1959 - solid convergence for a POC run.
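
A custom loop of this kind can be quite short. The following is a sketch of the general shape, not the project's actual code: train_one_epoch and its parameters are hypothetical names. It assumes model returns an object with a .loss attribute when called with a batch that includes labels (as HuggingFace causal language models do), that each batch is a dict of tensors already on the training device, and that optimizer follows PyTorch's step() / zero_grad() interface.

```python
def train_one_epoch(model, batches, optimizer, log_every=50):
    """Run one pass over `batches` and return the per-step loss values."""
    model.train()  # enable dropout, including the adapter's lora_dropout
    losses = []
    for step, batch in enumerate(batches, start=1):
        outputs = model(**batch)  # batch holds input_ids, attention_mask, labels
        loss = outputs.loss
        loss.backward()           # with the base frozen, only adapter weights get gradients
        optimizer.step()
        optimizer.zero_grad()     # clear gradients before the next step
        losses.append(loss.item())
        if step % log_every == 0:
            print(f"step {step}: loss {losses[-1]:.4f}")
    return losses


# Typical wiring (assumes torch is installed and `model` is the PEFT-wrapped model;
# AdamW is an assumed optimizer choice, lr matches the config above):
#   trainable = (p for p in model.parameters() if p.requires_grad)
#   optimizer = torch.optim.AdamW(trainable, lr=5e-4)
#   losses = train_one_epoch(model, dataloader, optimizer)
```

Writing the loop by hand makes it easy to add per-step progress reporting for a UI, which is the kind of control the custom implementation is after.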

Step 4: Merging the Model for Clean Deployment

After training, you have two things: the original base model and the LoRA adapter weights. At inference time you can load both together - but that means your deployment requires the PEFT library and the overhead of composing adapters on the fly.

The cleaner production path is to merge them:

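model = model.merge_and_unload()  # fuses LoRA weights into the base model and returns it without PEFT wrappers
model.save_pretrained("output/merged_model/")
# Assumes `tokenizer` is the tokenizer loaded for training; saving it makes the folder self-contained
tokenizer.save_pretrained("output/merged_model/")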

The result is a single standalone model (~5.5 GB) that loads and runs like any normal HuggingFace model - no PEFT, no adapters, no extra dependencies.

Step 5: The Interface

The Streamlit app wraps everything into a four-page UI: Home, Preprocessing, Fine-tuning, and Inference.

At query time, the prompt is formatted as:

Question: What are the rights of a data subject under GDPR?

The app checks for a merged model first, falls back to the LoRA adapter if none exists, and displays the generated response along with token counts so you can see exactly how much of the context window you're consuming.
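
The fallback logic can be expressed as a small path check. This is a sketch under assumptions, not the app's code: resolve_model_source is a hypothetical helper, the adapter directory name is made up for illustration, and the check relies on the files that save_pretrained conventionally writes (config.json for a full model, adapter_config.json for a PEFT adapter).

```python
from pathlib import Path


def resolve_model_source(merged_dir="output/merged_model",
                         adapter_dir="output/lora_adapter"):  # adapter path is hypothetical
    """Prefer the merged model, fall back to the LoRA adapter, else report nothing found."""
    merged, adapter = Path(merged_dir), Path(adapter_dir)
    if (merged / "config.json").is_file():
        return "merged", merged
    if (adapter / "adapter_config.json").is_file():
        return "adapter", adapter
    return "none", None


kind, path = resolve_model_source()
print(f"Model source: {kind} ({path})")
```

Returning the kind alongside the path lets the caller decide whether PEFT needs to be imported at all: the merged case loads like any other HuggingFace model, while the adapter case needs the base model plus the adapter.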

Does It Actually Work?

Short answer: yes, noticeably.

Side-by-side comparisons between the base Phi-2 and the fine-tuned version show a clear difference in how much GDPR-specific detail the answers contain. The fine-tuned model references specific obligations, data subject rights, and regulatory terminology that the base model glosses over or hallucinates.

It's not perfect - one epoch on 500 chunks is a POC, not a production system. But it validates the pipeline. The model learns from the source material.

And if you're shipping any of this to production, the operational risks of LLM APIs - rate limits, prompt injection, data-exfiltration surfaces - deserve their own look before you put a fine-tuned model in front of users.

The Takeaway

A regulatory AI assistant doesn't have to be generic. It can be grounded in the source, run on a laptop, and cost nothing to iterate on. That's the norm now - not the exception.