Smoll-GPT

📅 13/6/2025

My curiosity about large language models began when I took an ML & AI class at university. For our group project, we had to build a basic GPT-style model. The goal was simple: given a prompt like "I like", the model should predict the next word, like "pizza". We were encouraged to use the wine-review dataset. Our first attempt (built in PyTorch—you can view the code here) didn’t go so well. For example, when we gave the model the input wine review : US : Oregon : Pinot Gris : A wine, it might generate gibberish like:

80 thick grand curedhoively estateniently lacking ke easy historicornia excellence allow lane head tough cho

or

managed concentration rust adding lingeringster honey50 goatpieble remains lou toward elegant distinctly coun maintains pairs.

Clearly, the model was hallucinating nonsense. We used the GPT-2 tokenizer (with a vocabulary size of 50,257), which resulted in a model with ~13.9 million parameters, far too large for our dataset of only ~120,000 reviews. We even tried a trigram model, and surprisingly, it outperformed our neural net. We experimented with different layer sizes, learning rates, and decoding strategies (top-k, top-p, beam search, repetition penalties). Still, our best model was a bloated 188.4 MB, and the text quality remained poor. We chalked it up to limited hardware and time.

A few months later, while working on time series models and doing lots of data cleaning, I revisited the wine-review dataset. That’s when I found arhamshahbaz’s Kaggle notebook. Their approach was similar to ours, but with a 5.8M-parameter model (just 22.16 MB) their results were significantly better. The key difference? They used a custom tokenizer implemented in NumPy instead of the full GPT-2 tokenizer. Inspired, I decided to build my own tiny language model from scratch with a custom tokenizer, and that’s how Smoll-GPT was born.


Dataset

I’m using the zynicide/wine-reviews dataset, which has 129,907 rows of wine reviews scraped from WineEnthusiast.com on June 15, 2017. The original author was inspired by the documentary Somm and aimed to build a model that could “identify wines through blind tasting like a master sommelier.” My goal is a bit different: I want Smoll-GPT to generate a wine description given its Country, Province, and Variety.

Funnily enough, the dataset comes completely uncleaned. I found about 10,000 duplicates and filtered out rows with missing values, leaving me with 119,905 unique reviews. Since I’m more comfortable with numerical data, diving into NLP preprocessing was a new challenge. In my setup, each processed sample is structured as:

wine review : <country> : <province> : <variety> : <description>[EOD]

[EOD] is a special token I use to signify the end of a document.
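Assembling these samples is a few lines of pandas. Here is a minimal sketch, assuming the Kaggle CSV’s filename and column names (country, province, variety, description); treat those as my assumptions rather than the project’s exact code:

```python
import pandas as pd

EOD = "[EOD]"  # special end-of-document token

# zynicide/wine-reviews; filename and column names assumed from the Kaggle CSV
df = pd.read_csv("winemag-data-130k-v2.csv")
df = df.drop_duplicates(subset=["description"])                        # ~10K duplicate reviews
df = df.dropna(subset=["country", "province", "variety", "description"])

def to_sample(row: pd.Series) -> str:
    # wine review : <country> : <province> : <variety> : <description>[EOD]
    return (f"wine review : {row['country']} : {row['province']} : "
            f"{row['variety']} : {row['description']}{EOD}")

samples = df.apply(to_sample, axis=1).tolist()
```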

To build the tokenizer vocabulary, I also used a random subset of 250 reviews plus three Wikipedia articles (Wine, Alcoholic Beverage, and Old World Wine). All that text gets merged into a single file called tokenizer.txt.


Tokenizer

The heart of this project is a custom Byte Pair Encoding (BPE) tokenizer, similar in spirit to GPT-2’s. I learned how BPE works from Andrej Karpathy’s tokenizer video. Here’s a high-level overview of how my tokenizer works:


  1. Unicode and Special Tokens

    • Start with UTF-8 byte encodings for all characters (the first 256 tokens).

    • Reserve a special [EOD] token for “end of document.”

    • The rest of the vocabulary is learned by merging frequent byte-pairs until we reach a vocab size of 5,000.


  2. Pretokenization

    • I use the same regex as GPT-2 (which is fine for wine reviews—no need for the more advanced patterns that newer models use). The regex is:

      /(?:'s|'t|'re|'ve|'m|'ll|'d)|\s?\p{L}+|\s?\p{N}+|\s?[^\s\p{L}\p{N}]+|\s+(?!\S)|\s+/gu
    • For example, wine review : Spain : Catalonia splits into:
      ['wine', ' review', ' :', ' Spain', ' :', ' Catalonia']


  3. Byte-Pair Merging

    • Start with an initial vocab of 257 tokens (0–255 for bytes, plus one for [EOD]).

    • Map each pretoken to its bytes. E.g., “wine” → [119, 105, 110, 101].

    • Count all adjacent byte-pairs (e.g., (105, 110), (110, 101), (32, 114), etc.).

    • Merge the most frequent pair (say, (105, 110) becomes a new token “in” with ID 257).

    • Repeat the process until you have 5,000 tokens.

    • Along the way, build vocab.json (mapping tokens to IDs) and merges.txt (list of merged byte-pairs in order).
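Putting those three steps together, a naive version of the training loop looks roughly like this. It’s a sketch rather than my exact code: special-token handling for [EOD] is omitted, and the third-party regex module stands in for re because of \p{L}:

```python
import regex as re  # the `regex` module: the stdlib `re` has no \p{L} / \p{N}

GPT2_SPLIT = re.compile(
    r"(?:'s|'t|'re|'ve|'m|'ll|'d)|\s?\p{L}+|\s?\p{N}+|\s?[^\s\p{L}\p{N}]+|\s+(?!\S)|\s+"
)
EOD_ID = 256          # byte values occupy IDs 0-255, [EOD] gets 256
VOCAB_SIZE = 5000

def merge_pair(seq, pair, new_id):
    """Replace every occurrence of `pair` in `seq` with `new_id`."""
    out, i = [], 0
    while i < len(seq):
        if i < len(seq) - 1 and (seq[i], seq[i + 1]) == pair:
            out.append(new_id); i += 2
        else:
            out.append(seq[i]); i += 1
    return out

def train_bpe(text: str) -> dict:
    # 1. pretokenize, then map each pretoken to its UTF-8 bytes
    seqs = [list(pt.encode("utf-8")) for pt in GPT2_SPLIT.findall(text)]
    merges, next_id = {}, EOD_ID + 1      # first learned token gets ID 257
    while next_id < VOCAB_SIZE:
        # 2. count adjacent pairs across all pretokens (full rescan: the slow part)
        counts = {}
        for seq in seqs:
            for pair in zip(seq, seq[1:]):
                counts[pair] = counts.get(pair, 0) + 1
        if not counts:
            break
        # 3. merge the most frequent pair into a new token ID
        best = max(counts, key=counts.get)
        merges[best] = next_id
        seqs = [merge_pair(seq, best, next_id) for seq in seqs]
        next_id += 1
    return merges                         # insertion order = merge order

merges = train_bpe(open("tokenizer.txt", encoding="utf-8").read())
```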


For example, after merging, “wine review : Spain : Catalonia” might tokenize to [1147, 332, 275, 676, 275, 2435], down from 31 byte-level tokens to just 6. When decoding, you reverse the merges:
1147 → (119, 283) → (119, (257, 101)) → (119, 105, 110, 101) → wine

Building a 5,000-token vocab required 4,743 merge operations. With about 147,048 initial tokens (~25K words from the 250 samples + ~75K words from the Wikipedia articles), the naive merging was O(M × N) (4,743 × 147,048), which took about 7 minutes and 16 seconds on Colab’s free CPU tier. That’s fine for M ≈ 5K, but obviously not optimal.
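Encoding a pretoken then just replays the learned merges (earliest-learned first), and decoding walks the merge table backwards until only raw bytes remain. Roughly, and again as a sketch rather than the exact implementation:

```python
def encode_pretoken(pretoken: str, merges: dict) -> list:
    """Greedily apply learned merges to one pretoken, earliest-learned first."""
    ids = list(pretoken.encode("utf-8"))
    while len(ids) >= 2:
        candidates = [p for p in set(zip(ids, ids[1:])) if p in merges]
        if not candidates:
            break
        pair = min(candidates, key=lambda p: merges[p])   # lowest merge rank
        new_id, out, i = merges[pair], [], 0
        while i < len(ids):
            if i < len(ids) - 1 and (ids[i], ids[i + 1]) == pair:
                out.append(new_id); i += 2
            else:
                out.append(ids[i]); i += 1
        ids = out
    return ids

def decode(ids: list, merges: dict, eod_id: int = 256) -> str:
    """Expand merged token IDs back into raw bytes, then into text."""
    unmerge = {new_id: pair for pair, new_id in merges.items()}
    out, stack = bytearray(), list(reversed(ids))
    while stack:
        tok = stack.pop()
        if tok == eod_id:
            continue                    # drop (or render) the [EOD] marker
        elif tok in unmerge:
            a, b = unmerge[tok]
            stack.extend([b, a])        # expand so `a` comes off the stack first
        else:
            out.append(tok)             # plain byte
    return out.decode("utf-8", errors="replace")
```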

Explore the tokenizer in an interactive playground inspired by TikTokenizer.


Optimizing the Tokenizer

My first implementation was slow: naive pair-counting with Python dictionaries and repeated full scans made it roughly O(t²) on each BPE step, where t = number of byte-tokens in a single sequence. On the full dataset (~35 M characters), even with 5 threads, it was projected to take up to 2 hours and 20 minutes—way too slow.
So I rewrote the tokenizer using more efficient data structures (a short sketch follows the list):


  1. Heap-Based Priority Queue

    • Store merge candidates in a max-heap keyed by frequency, so we can pop the highest-frequency pair in O(log N) instead of scanning all pairs.
  2. Counter and defaultdict

    • Track pair frequencies and update counts in O(1) per merge, rather than using expensive Python min() or full-dictionary scans.
  3. Inverse Vocab Map

    • Map tokens → IDs and IDs → tokens for constant-time lookups instead of linear searches.
  4. Precompiled Regex

    • Compile the pretokenization regex once to avoid repeated overhead.
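The core of the change is the pair-count bookkeeping. Below is a minimal sketch of the heap idea using lazy deletion (stale heap entries are skipped on pop instead of being removed in place); this is my reconstruction of the approach described above, not the project’s exact code:

```python
import heapq
from collections import Counter

class PairHeap:
    """Max-heap over pair frequencies with lazy deletion."""

    def __init__(self, counts: Counter):
        self.counts = counts
        self.heap = [(-c, pair) for pair, c in counts.items()]
        heapq.heapify(self.heap)                      # O(N) build

    def update(self, pair, delta: int):
        # O(1) count update + O(log N) push; the old heap entry just goes stale
        self.counts[pair] += delta
        if self.counts[pair] > 0:
            heapq.heappush(self.heap, (-self.counts[pair], pair))

    def pop_max(self):
        # Skip entries whose stored count no longer matches the live Counter
        while self.heap:
            neg_c, pair = heapq.heappop(self.heap)
            if self.counts.get(pair, 0) == -neg_c:
                return pair, -neg_c
        return None

# Usage: after each merge, call update() for the pairs whose counts changed,
# and pop_max() to fetch the next most frequent pair without a full rescan.
```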

With these changes, each sequence’s merge loop is effectively O(t log t). Tokenizing the full dataset now takes 35 seconds—over 200× speedup. I save the final token IDs into a PyTorch tensor called data.pt. In practice, you can skip re-tokenizing by simply loading data.pt directly.


Language Model

With my optimized tokenizer in hand, I implemented a small transformer language model, built up from the bigram baseline in Karpathy’s nanoGPT video. Here are the key specs for Smoll-GPT (a config sketch follows the list):

  • Vocabulary Size: 5,000

  • Context Length (Block Size): 32 tokens

  • Embedding Dimension: 64

  • Number of Transformer Blocks: 4

  • Number of Heads per Block: 4

  • Feedforward Hidden Size: 256
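In code, that boils down to the usual nanoGPT-style block of hyperparameters (variable names follow Karpathy’s video; dropout is not stated above, so treat it as an assumption):

```python
import torch

# Smoll-GPT hyperparameters (nanoGPT-style names)
vocab_size = 5000     # custom BPE tokenizer
block_size = 32       # context length in tokens
n_embd     = 64       # embedding dimension
n_head     = 4        # attention heads per transformer block
n_layer    = 4        # transformer blocks
ffwd_size  = 256      # feedforward hidden size (4 * n_embd)
dropout    = 0.1      # assumed; not stated in the post
device     = "cuda" if torch.cuda.is_available() else "cpu"
```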


Below is a summary of the parameter counts:

| Layer (type) | Output Shape | Param # | Description |
|---|---|---|---|
| Embedding | [B, T, 64] | 320,000 | token_embedding_table: (5000, 64) |
| Embedding | [B, T, 64] | 2,048 | position_embedding_table: (32, 64) |
| Transformer Block × 4 | [B, T, 64] | ~475,136 | Each block: MultiHeadAttention + FeedForward + LN |
| └─ MultiHeadAttention (4 heads) | [B, T, 64] | ~20.5 K | (4 × 3 × 64 × 16) + (64 × 64) + biases, per block |
| └─ FeedForward | [B, T, 64] | ~17.2 K | (64 × 256) + (256 × 64) + biases, per block |
| └─ LayerNorm × 2 | [B, T, 64] | negligible | 2 × 64 params per block |
| LayerNorm (ln_f) | [B, T, 64] | 128 | Final layer norm |
| Linear (lm_head) | [B, T, 5000] | 320,000 | (64 × 5000) + bias |

  • Total Parameters: 846,304

  • Model Size on Disk: 3.5 MB


For training, I split the tokenized data into a 90/10 training/validation split. I monitored training and validation loss and manually stopped at 10,000 iterations, which seemed to strike a good balance between underfitting and overfitting. There’s definitely room for further hyperparameter tuning, but this was a solid first pass.
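The data side of training is the standard nanoGPT recipe: load the token IDs from data.pt, split 90/10, and sample random block_size-length windows. A sketch (batch_size here is an assumption):

```python
import torch

block_size = 32
batch_size = 64                       # assumed; not stated in the post
device = "cuda" if torch.cuda.is_available() else "cpu"

data = torch.load("data.pt")          # 1-D tensor of token IDs from the tokenizer
n = int(0.9 * len(data))              # 90/10 train/validation split
train_data, val_data = data[:n], data[n:]

def get_batch(split: str):
    d = train_data if split == "train" else val_data
    ix = torch.randint(len(d) - block_size, (batch_size,))
    x = torch.stack([d[i:i + block_size] for i in ix])            # inputs
    y = torch.stack([d[i + 1:i + block_size + 1] for i in ix])    # next-token targets
    return x.to(device), y.to(device)
```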


By comparison:

| Model | Parameters | Size on Disk |
|---|---|---|
| Initial (PyTorch) Model | 13.9 M | 188.4 MB |
| arhamshahbaz’s Model | 5.8 M | 22.16 MB |
| Smoll-GPT | 846.3 K | 3.5 MB |

Smoll-GPT is therefore ~16× smaller (in parameters) than our original attempt, and ~6.9× smaller than arhamshahbaz’s version—yet produces comparable (if not better) output for wine descriptions.


Usage

I export the trained model to ONNX so it can run in a Docker container. There’s a simple SvelteKit web app that wraps the model—fun fact: I rewrote the tokenizer in JavaScript for it so that everything runs client-side in the browser.
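The export step itself is a single torch.onnx.export call. Here is a sketch, assuming the model’s forward pass returns just the logits when called with token IDs only; the file name and axis names are mine, not the project’s:

```python
import torch

model.eval()  # `model` is the trained Smoll-GPT nn.Module

# dummy batch of token IDs matching the 32-token context
dummy_idx = torch.zeros((1, 32), dtype=torch.long)

torch.onnx.export(
    model,
    (dummy_idx,),
    "smoll_gpt.onnx",
    input_names=["idx"],
    output_names=["logits"],
    dynamic_axes={"idx": {0: "batch", 1: "time"},
                  "logits": {0: "batch", 1: "time"}},
    opset_version=17,
)
```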

To try it out, run:

docker run --rm -p 3000:3000 akashcs13/wine-review:latest

or just play with it online at Smoll-GPT.


Conclusion

Building Smoll-GPT taught me tons about tokenization, BPE, and how much impact a lightweight implementation can have on speed and model size. My custom tokenizer went from taking 2+ hours to tokenize the dataset to under 35 seconds, all by switching from naive pair-counting to efficient data structures.


Moving forward, here are a few ideas I’m excited about:

  1. Rewrite the Tokenizer in Rust or C++.
    Hugging Face’s tokenizers are written in Rust for speed—maybe I can learn from their implementation and make mine even faster.

  2. Improve Dataset Cleaning.
    The raw wine reviews still have some weird characters (e.g., accented letters, stray punctuation). I’d like to strip out or normalize the bad ones. Here’s the full list of characters in the dataset (154 unique codepoints):

    (!"#$%&'()*+,-./0123456789:;=?ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]_`abcdefghijklmnopqrstuvwxyz| ¡¨¬­°´º½ÀÃÇÉÕÖÜàáâãäçèéêëìíîïñòóôõöøùúûüýÿăćčğıŠšŽžǎș–—‘’“”•…)
  3. Hyperparameter Tuning & Model Scaling.

    • Try more transformer blocks, larger embeddings, or different learning rate schedules.

    • Consider sparse attention or low-rank factorizations to squeeze out more performance.

  4. Evaluation Metrics.
    Right now, I eyeball the generated wine descriptions. It’d be cool to implement automatic metrics (e.g., perplexity on held-out data, BLEU/ROUGE vs. real descriptions) and maybe even some human evaluations (have sommeliers score how “realistic” the descriptions feel).

  5. Retrieval-Augmented Generation (RAG).
    Imagine a model that, given a query like “Tell me about an Oregon Pinot Gris,” first retrieves a few real reviews or facts from a database, then crafts a blended description. That could make Smoll-GPT more factual and reduce hallucinations.


Smoll-GPT is just the beginning. By understanding how tokenizers and small transformers work under the hood, I feel much better equipped to tackle larger NLP projects—and hopefully make models that are both lightweight and useful.


Thanks for reading! If you have any suggestions or want to fork the code, check out the repo and let me know what you think.