📅 13/6/2025
My curiosity about large language models began when I took an ML & AI class at university. For our group project, we had to build a basic GPT-style model. The goal was simple: given a prompt like "I like", the model should predict the next word, say "pizza". We were encouraged to use the wine-review dataset. Our first attempt (built in PyTorch; you can view the code here) didn't go so well. For example, when we gave the model this input:
wine review : US : Oregon : Pinot Gris : A wine
it would generate gibberish rather than anything resembling a real wine description.
Clearly, the model was hallucinating nonsense. We used the GPT-2 tokenizer (with a vocabulary size of 50,257), which resulted in a model with ~13.9 million parameters, far too large for our dataset of only ~120,000 reviews. We even tried a trigram model, and surprisingly, it outperformed our neural net. We experimented with different layer sizes, learning rates, and decoding strategies (top-k, top-p, beam search, repetition penalties). Still, our best model was a bloated 188.4 MB, and the text quality remained poor. We chalked it up to limited hardware and time.
A few months later, while working on time series models and doing lots of data cleaning, I revisited the wine-review dataset. That's when I found arhamshahbaz's Kaggle notebook. Their approach was similar to ours, yet with a 5.8M-parameter model (just 22.16 MB) their results were significantly better. The key difference? They used a custom tokenizer implemented in NumPy instead of the full GPT-2 tokenizer. Inspired, I decided to build my own tiny language model from scratch with a custom tokenizer, and that's how Smoll-GPT was born.
I’m using the zynicide/wine-reviews dataset, which has 129,907 rows of wine reviews scraped from WineEnthusiast.com on June 15, 2017. The original author was inspired by the documentary Somm and aimed to build a model that could “identify wines through blind tasting like a master sommelier.” My goal is a bit different: I want Smoll-GPT to generate a wine description given its Country, Province, and Variety.
Funnily enough, the dataset comes completely uncleaned. I found about 10K duplicates and filtered out rows with missing values, leaving me with 119,905 unique reviews. Since I'm more comfortable with numerical data, diving into NLP preprocessing was a new challenge. In my setup, each processed sample is structured as:
wine review : {Country} : {Province} : {Variety} : {Description} [EOD]
[EOD] is a special token I use to signify the end of a document.
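For reference, here's a minimal sketch of that cleaning and formatting step in pandas. It isn't the exact code from the repo; the CSV filename, column names, and the dataset.txt output path are assumptions.

```python
import pandas as pd

# Assumed Kaggle CSV and column names (country, province, variety, description).
df = pd.read_csv("winemag-data-130k-v2.csv")

cols = ["country", "province", "variety", "description"]
df = df.drop_duplicates(subset=cols).dropna(subset=cols)  # drop ~10K duplicates and rows with missing fields

# Format each row as one training sample, terminated by the [EOD] token.
samples = (
    "wine review : " + df["country"] + " : " + df["province"]
    + " : " + df["variety"] + " : " + df["description"] + " [EOD]"
)

with open("dataset.txt", "w", encoding="utf-8") as f:
    f.write("\n".join(samples))
```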
To build the tokenizer vocabulary, I also used a random subset of 250 reviews plus three Wikipedia articles (Wine, Alcoholic Beverage, and Old World Wine). All that text gets merged into a single file called tokenizer.txt.
The heart of this project is a custom Byte Pair Encoding (BPE) tokenizer, similar in spirit to GPT-2’s. I learned how BPE works from Andrej Karpathy’s tokenizer video. Here’s a high-level overview of how my tokenizer works:
Unicode and Special Tokens
Start with the 256 possible byte values as the first 256 tokens, so any UTF-8 string can be encoded.
Reserve a special [EOD] token for “end of document.”
The rest of the vocabulary is learned by merging frequent byte-pairs until we reach a vocab size of 5,000.
Pretokenization
I use the same regex as GPT-2 (which is fine for wine reviews; no need for the more advanced patterns that newer models use). The regex is `'s|'t|'re|'ve|'m|'ll|'d| ?\p{L}+| ?\p{N}+| ?[^\s\p{L}\p{N}]+|\s+(?!\S)|\s+`.
For example, wine review : Spain : Catalonia splits into:
['wine', ' review', ' :', ' Spain', ' :', ' Catalonia']
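A quick sketch of that pretokenization step, using the third-party `regex` package (the standard library `re` doesn't support the `\p{L}` / `\p{N}` classes):

```python
import regex as re  # pip install regex; needed for \p{L} / \p{N}

# GPT-2's pretokenization pattern.
GPT2_SPLIT = re.compile(
    r"""'s|'t|'re|'ve|'m|'ll|'d| ?\p{L}+| ?\p{N}+| ?[^\s\p{L}\p{N}]+|\s+(?!\S)|\s+"""
)

print(GPT2_SPLIT.findall("wine review : Spain : Catalonia"))
# ['wine', ' review', ' :', ' Spain', ' :', ' Catalonia']
```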
Byte-Pair Merging
Start with an initial vocab of 257 tokens (0–255 for bytes, plus one for [EOD]).
Map each pretoken to its bytes. E.g., “wine” → [119, 105, 110, 101].
Count all adjacent byte-pairs (e.g., (105, 110), (110, 101), (32, 114), etc.).
Merge the most frequent pair (say, (105, 110) becomes a new token “in” with ID 257).
Repeat the process until you have 5,000 tokens.
Along the way, build vocab.json (mapping tokens to IDs) and merges.txt (list of merged byte-pairs in order).
For example, after merging, "wine review : Spain : Catalonia" might tokenize to [1147, 332, 275, 676, 275, 2435], down from 31 byte-level tokens to just 6. When decoding, you reverse the merges:
1147 → (119, 283) → (119, (257, 101)) → (119, 105, 110, 101) → wine
Building a 5,000-token vocab required 4,743 merge operations. With about 147,048 initial tokens (25K words from the 250 samples plus ~75K words from the Wikipedia articles), the naive approach is O(M × N), where M = 4,743 merges and N = 147,048 tokens, and it took about 7 minutes and 16 seconds on Colab's free CPU tier. That's fine for M ≈ 5K, but obviously not optimal.
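To make the steps above concrete, here's a minimal sketch of the naive merge loop and the recursive decode. It's a simplification, not the code from the repo; names like `EOD_ID` and `train_bpe` are mine.

```python
from collections import Counter

EOD_ID = 256      # 0-255 are raw bytes, 256 is the [EOD] special token
VOCAB_SIZE = 5000

def train_bpe(pretokens: list[str], vocab_size: int = VOCAB_SIZE) -> dict:
    """Naive BPE training: one full pass over the whole corpus per merge."""
    # Each pretoken starts as its raw UTF-8 byte IDs, e.g. "wine" -> [119, 105, 110, 101].
    seqs = [list(p.encode("utf-8")) for p in pretokens]
    merges = {}               # (left_id, right_id) -> new token ID
    next_id = EOD_ID + 1      # the first learned token gets ID 257

    while next_id < vocab_size:
        # Count every adjacent pair across the corpus (this is the O(N) part).
        pair_counts = Counter()
        for seq in seqs:
            for pair in zip(seq, seq[1:]):
                pair_counts[pair] += 1
        if not pair_counts:
            break
        best = pair_counts.most_common(1)[0][0]
        merges[best] = next_id

        # Replace every occurrence of the best pair with the new token ID.
        for i, seq in enumerate(seqs):
            merged, j = [], 0
            while j < len(seq):
                if j + 1 < len(seq) and (seq[j], seq[j + 1]) == best:
                    merged.append(next_id)
                    j += 2
                else:
                    merged.append(seq[j])
                    j += 1
            seqs[i] = merged
        next_id += 1
    return merges

def decode_token(token_id: int, merges: dict) -> bytes:
    """Recursively expand a token back into raw bytes, reversing the merges."""
    if token_id < 256:
        return bytes([token_id])
    if token_id == EOD_ID:
        return b""  # the special token has no byte representation
    left, right = next(pair for pair, tid in merges.items() if tid == token_id)
    return decode_token(left, merges) + decode_token(right, merges)
```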
Explore the tokenizer in an interactive playground inspired by TikTokenizer.
My first implementation was slow: naive pair-counting with Python
dictionaries and repeated full scans made it roughly O(t²) on each BPE step, where t = number of byte-tokens in a single sequence.
On the full dataset (~35 M characters), even with 5 threads, it was projected
to take up to 2 hours and 20 minutes—way too slow.
So I rewrote the tokenizer using more efficient data
structures:
Heap-Based Priority Queue
Counter and defaultdict
Inverse Vocab Map
Precompiled Regex
With these changes, each sequence's merge loop is effectively O(t log t). Tokenizing the full dataset now takes 35 seconds, an over-200× speedup. I save the final token IDs as a PyTorch tensor in data.pt; in practice, you can skip re-tokenizing by simply loading data.pt directly.
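Here's a rough sketch of the heap-based merge loop for encoding a single sequence. It assumes `ranks` (the order each pair was learned) and `merges` (pair → new token ID) dictionaries from training; the real implementation differs in details, but the idea is the same.

```python
import heapq

def encode(byte_ids: list[int], ranks: dict, merges: dict) -> list[int]:
    """Apply learned BPE merges to one sequence in roughly O(t log t)."""
    n = len(byte_ids)
    if n < 2:
        return list(byte_ids)

    toks = list(byte_ids)
    nxt = list(range(1, n)) + [-1]   # linked list over positions, so merges are O(1)
    prv = [-1] + list(range(n - 1))
    alive = [True] * n

    heap = []  # (merge rank, position of left token, pair)
    for i in range(n - 1):
        pair = (toks[i], toks[i + 1])
        if pair in ranks:
            heapq.heappush(heap, (ranks[pair], i, pair))

    while heap:
        rank, i, pair = heapq.heappop(heap)
        j = nxt[i]
        # Skip stale heap entries: the pair may have changed since it was pushed.
        if not alive[i] or j == -1 or (toks[i], toks[j]) != pair:
            continue
        # Merge position j into position i.
        toks[i] = merges[pair]
        alive[j] = False
        nxt[i] = nxt[j]
        if nxt[j] != -1:
            prv[nxt[j]] = i
        # Push the new pairs formed with the left and right neighbours.
        if prv[i] != -1 and (toks[prv[i]], toks[i]) in ranks:
            p = (toks[prv[i]], toks[i])
            heapq.heappush(heap, (ranks[p], prv[i], p))
        if nxt[i] != -1 and (toks[i], toks[nxt[i]]) in ranks:
            p = (toks[i], toks[nxt[i]])
            heapq.heappush(heap, (ranks[p], i, p))

    return [toks[i] for i in range(n) if alive[i]]
```

Each merge pushes at most two new heap entries, so the total work per sequence is bounded by a constant number of heap operations per token, which is where the O(t log t) comes from.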
With my optimized tokenizer in hand, I implemented a small transformer language model, following Karpathy's nanoGPT video (which starts from a simple bigram model and builds up to a full GPT). Here are the key specs for Smoll-GPT:
Vocabulary Size: 5,000
Context Length (Block Size): 32 tokens
Embedding Dimension: 64
Number of Transformer Blocks: 4
Number of Heads per Block: 4
Feedforward Hidden Size: 256
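For the curious, here's roughly what that architecture looks like in PyTorch. This is a nanoGPT-style sketch written to match the specs above, not the exact code from the repo; bias and layout choices differ slightly, which is why its parameter count lands near, rather than exactly at, the total below.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Hyperparameters from the spec above.
VOCAB_SIZE, BLOCK_SIZE, N_EMBD, N_LAYER, N_HEAD, FF_HIDDEN = 5000, 32, 64, 4, 4, 256

class CausalSelfAttention(nn.Module):
    """Multi-head self-attention with a causal mask, all heads in one matmul."""
    def __init__(self):
        super().__init__()
        self.qkv = nn.Linear(N_EMBD, 3 * N_EMBD)
        self.proj = nn.Linear(N_EMBD, N_EMBD)
        mask = torch.tril(torch.ones(BLOCK_SIZE, BLOCK_SIZE)).view(1, 1, BLOCK_SIZE, BLOCK_SIZE)
        self.register_buffer("mask", mask)

    def forward(self, x):
        B, T, C = x.shape
        q, k, v = self.qkv(x).split(N_EMBD, dim=2)
        q = q.view(B, T, N_HEAD, C // N_HEAD).transpose(1, 2)  # (B, heads, T, head_dim)
        k = k.view(B, T, N_HEAD, C // N_HEAD).transpose(1, 2)
        v = v.view(B, T, N_HEAD, C // N_HEAD).transpose(1, 2)
        att = (q @ k.transpose(-2, -1)) * (C // N_HEAD) ** -0.5
        att = att.masked_fill(self.mask[:, :, :T, :T] == 0, float("-inf"))
        y = F.softmax(att, dim=-1) @ v
        y = y.transpose(1, 2).contiguous().view(B, T, C)
        return self.proj(y)

class Block(nn.Module):
    def __init__(self):
        super().__init__()
        self.ln1, self.ln2 = nn.LayerNorm(N_EMBD), nn.LayerNorm(N_EMBD)
        self.attn = CausalSelfAttention()
        self.ffwd = nn.Sequential(
            nn.Linear(N_EMBD, FF_HIDDEN), nn.ReLU(), nn.Linear(FF_HIDDEN, N_EMBD)
        )

    def forward(self, x):
        x = x + self.attn(self.ln1(x))
        return x + self.ffwd(self.ln2(x))

class SmollGPT(nn.Module):
    def __init__(self):
        super().__init__()
        self.tok_emb = nn.Embedding(VOCAB_SIZE, N_EMBD)
        self.pos_emb = nn.Embedding(BLOCK_SIZE, N_EMBD)
        self.blocks = nn.Sequential(*[Block() for _ in range(N_LAYER)])
        self.ln_f = nn.LayerNorm(N_EMBD)
        self.lm_head = nn.Linear(N_EMBD, VOCAB_SIZE)

    def forward(self, idx):
        B, T = idx.shape
        x = self.tok_emb(idx) + self.pos_emb(torch.arange(T, device=idx.device))
        return self.lm_head(self.ln_f(self.blocks(x)))

print(sum(p.numel() for p in SmollGPT().parameters()))  # roughly 0.85 M parameters
```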
Below is a summary of the parameter counts:
Layer (type) | Output Shape | Param # | Description |
---|---|---|---|
Embedding | [B, T, 64] | 320,000 | token_embedding_table: (5000, 64) |
Embedding | [B, T, 64] | 2,048 | position_embedding_table: (32, 64) |
Transformer Block × 4 | [B, T, 64] | ~475,136 | Each block: MultiHeadAttention + FeedForward + LN |
└─ MultiHeadAttention (4 heads) | [B, T, 64] | ~20.5 K | (4 × 3 × 64 × 16) + (64 × 64) + biases per block |
└─ FeedForward | [B, T, 64] | ~17.2 K | (64 × 256) + (256 × 64) + biases per block |
└─ LayerNorm × 2 | [B, T, 64] | negligible | (2 × 64 params per block) |
LayerNorm (ln_f) | [B, T, 64] | 128 | Final layer norm |
Linear (lm_head) | [B, T, 5000] | 320,000 | (64 × 5000) + bias |
Total Parameters: 846,304
Model Size on Disk: 3.5 MB
For training, I split the tokenized data into a 90/10 training/validation split. I monitored training and validation loss and manually stopped at 10,000 iterations, which seemed to strike a good balance between underfitting and overfitting. There’s definitely room for further hyperparameter tuning, but this was a solid first pass.
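A condensed view of the training loop, reusing the SmollGPT sketch above; the batch size and learning rate here are placeholders, not the values I actually used.

```python
import torch
import torch.nn.functional as F

data = torch.load("data.pt")  # 1-D tensor of token IDs from the tokenizer step
n = int(0.9 * len(data))
train_data, val_data = data[:n], data[n:]

def get_batch(split, block_size=32, batch_size=64):
    d = train_data if split == "train" else val_data
    ix = torch.randint(len(d) - block_size - 1, (batch_size,))
    x = torch.stack([d[i:i + block_size] for i in ix])
    y = torch.stack([d[i + 1:i + block_size + 1] for i in ix])
    return x, y

model = SmollGPT()
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4)

for step in range(10_000):
    xb, yb = get_batch("train")
    logits = model(xb)
    loss = F.cross_entropy(logits.view(-1, logits.size(-1)), yb.view(-1))
    optimizer.zero_grad(set_to_none=True)
    loss.backward()
    optimizer.step()
    if step % 500 == 0:  # watch validation loss to catch overfitting
        with torch.no_grad():
            xv, yv = get_batch("val")
            v_logits = model(xv)
            val_loss = F.cross_entropy(v_logits.view(-1, v_logits.size(-1)), yv.view(-1))
        print(f"step {step}: train {loss.item():.3f}, val {val_loss.item():.3f}")

torch.save(model.state_dict(), "smoll_gpt.pt")  # hypothetical checkpoint name
```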
By comparison:
Model | Parameters | Size on Disk |
---|---|---|
Initial (PyTorch) Model | 13.9 M | 188.4 MB |
arhamshahbaz’s Model | 5.8 M | 22.16 MB |
Smoll-GPT | 846.3 K | 3.5 MB |
Smoll-GPT is therefore ~16× smaller (in parameters) than our original attempt, and ~6.9× smaller than arhamshahbaz’s version—yet produces comparable (if not better) output for wine descriptions.
I export the trained model to ONNX so it can run in a Docker
container. There’s a simple SvelteKit web app that wraps the
model—fun fact: I rewrote the tokenizer in JavaScript for it so that
everything runs client-side in the browser.
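The export itself is essentially one call to `torch.onnx.export`; here's a sketch, where the file name, tensor names, and opset version are placeholders.

```python
import torch

model.eval()
dummy = torch.zeros((1, 32), dtype=torch.long)  # one dummy sequence at the 32-token context length

torch.onnx.export(
    model,
    dummy,
    "smoll_gpt.onnx",
    input_names=["tokens"],
    output_names=["logits"],
    dynamic_axes={"tokens": {0: "batch", 1: "time"}},  # allow prompts shorter than 32 tokens
    opset_version=17,
)
```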
To try it out, spin up the Docker container from the repo, or just play with it online at Smoll-GPT.
Building Smoll-GPT taught me tons about tokenization, BPE, and how much impact a lightweight implementation can have on speed and model size. My custom tokenizer went from a projected 2+ hours to tokenize the dataset down to about 35 seconds, all by switching from naive pair-counting to efficient data structures.
Moving forward, here are a few ideas I’m excited about:
Rewrite the Tokenizer in Rust or C++.
Hugging Face’s tokenizers are written in Rust for speed—maybe
I can learn from their implementation and make mine even faster.
Improve Dataset Cleaning.
The raw wine reviews still contain some weird characters (accented letters, stray punctuation; 154 unique codepoints in total across the dataset), and I'd like to strip out or normalize the problematic ones.
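A quick way to take that inventory and do a first-pass cleanup; this is just a sketch, where `dataset.txt` is the merged sample file from earlier and NFKD-plus-strip-accents is only one possible normalization.

```python
import unicodedata

with open("dataset.txt", encoding="utf-8") as f:
    text = f.read()

# Inventory of every distinct codepoint in the corpus.
chars = sorted(set(text))
print(len(chars))       # e.g. 154 unique codepoints
print("".join(chars))

# First-pass cleanup: decompose accents (NFKD) and drop the combining marks,
# so "é" becomes "e" while plain ASCII passes through untouched.
cleaned = "".join(
    c for c in unicodedata.normalize("NFKD", text)
    if not unicodedata.combining(c)
)
```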
Hyperparameter Tuning & Model Scaling.
Try more transformer blocks, larger embeddings, or different learning rate schedules.
Consider sparse attention or low-rank factorizations to squeeze out more performance.
Evaluation Metrics.
Right now, I eyeball
the generated wine descriptions. It’d be cool to implement automatic
metrics (e.g., perplexity on held-out data, BLEU/ROUGE vs. real
descriptions) and maybe even some human evaluations (have sommeliers
score how “realistic” the descriptions feel).
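Perplexity would be the easiest one to add, since it falls straight out of the validation loss. Something along these lines, sketched against the training setup above:

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def perplexity(model, tokens, block_size=32):
    """Per-token perplexity of a held-out token stream: exp(mean cross-entropy)."""
    model.eval()
    total_loss, total_tokens = 0.0, 0
    # Walk over the stream in non-overlapping windows of block_size tokens.
    for i in range(0, len(tokens) - block_size - 1, block_size):
        x = tokens[i:i + block_size].unsqueeze(0)
        y = tokens[i + 1:i + block_size + 1].unsqueeze(0)
        logits = model(x)
        loss = F.cross_entropy(logits.view(-1, logits.size(-1)), y.view(-1))
        total_loss += loss.item() * block_size
        total_tokens += block_size
    return float(torch.exp(torch.tensor(total_loss / total_tokens)))

# e.g. perplexity(model, val_data)
```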
Retrieval-Augmented Generation (RAG).
Imagine a
model that, given a query like “Tell me about an Oregon Pinot
Gris,” first retrieves a few real reviews or facts from a database,
then crafts a blended description. That could make Smoll-GPT
more factual and reduce hallucinations.
Smoll-GPT is just the beginning. By understanding how tokenizers and small transformers work under the hood, I feel much better equipped to tackle larger NLP projects—and hopefully make models that are both lightweight and useful.
Thanks for reading! If you have any suggestions or want to fork the code, check out the repo and let me know what you think.