Building a Custom BPE Tokenizer: From Wikipedia to Real-Time Text Processing

July 12, 2025

Building a Custom BPE Tokenizer from Scratch

After building my autograd engine and movie title generator, I wanted to dive deeper into the fundamental building blocks of language models. Tokenization is where everything starts - it's how we bridge human language and machine understanding.

Being a huge Celtics fan, I thought it would be fun to train a tokenizer on Jayson Tatum's Wikipedia page. He's been absolutely crushing it for us, and I figured his page would give me a solid basketball-focused dataset to work with. Plus, it's way more interesting than training on generic text!

The Tokenization Engine

From Raw Text to Meaningful Tokens

[Figure: visualization of the tokenizer training process]

The tokenization process: breaking down complex text into manageable, meaningful units that machines can understand.

What I Built

This project ended up being a complete tokenization pipeline - from scraping Wikipedia data to building a real-time interactive demo. I implemented a custom BPE tokenizer inspired by GPT-4's approach, trained it on Jayson Tatum's Wikipedia page, and built a visualization tool so you can see exactly how text gets broken down into tokens. The final result is a 512-token vocabulary that's surprisingly good at compressing basketball-related text.
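At a high level, the whole pipeline is three steps: pull the article text, learn the merges, and encode new text. Here's a minimal sketch of that flow - `train_bpe` and `bpe_encode` are hypothetical names I flesh out in the implementation sections below, and the MediaWiki API call is just one of several ways to grab the page text (not necessarily how the project actually scrapes it):

```python
import requests

# Grab the plain-text extract of the article from the MediaWiki API
# (one way to do it; the actual project may fetch the page differently)
resp = requests.get(
    "https://en.wikipedia.org/w/api.php",
    params={
        "action": "query",
        "prop": "extracts",
        "explaintext": 1,
        "format": "json",
        "titles": "Jayson Tatum",
    },
)
page = next(iter(resp.json()["query"]["pages"].values()))
text = page["extract"]

# Learn a 512-token vocabulary, then encode a test sentence
merges = train_bpe(text, vocab_size=512)              # sketched below
ids = bpe_encode("Jayson Tatum is a Celtic", merges)  # sketched below
print(len(ids), "tokens for", len("Jayson Tatum is a Celtic"), "characters")
```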

Key Performance Metrics

  • "Jayson Tatum is a Celtic": 24 characters → 7 tokens (343% compression)
  • "Basketball is the second most famous sport behind football": 58 characters → 28 tokens (207% compression)
  • "General Text (Taylor Swift)": 27 characters → 18 tokens (150% compression)
  • "This is a tokenizer in action!": 30 characters → 18 tokens (167% compression)
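To be clear about what those percentages mean: I'm reporting characters per token, expressed as a percentage, so higher is better. The arithmetic for the first example looks like this:

```python
text, num_tokens = "Jayson Tatum is a Celtic", 7
compression = len(text) / num_tokens * 100    # 24 characters / 7 tokens
print(round(compression))                     # 343
```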

The cool thing is how much better it handles basketball content compared to general text. Try typing "Celtics" or "Jayson" in the demo below - you'll see how efficiently it tokenizes terms that appeared frequently in the training data.

Try It Out

Here's the interactive demo I built. Type anything you want and watch it get tokenized in real-time. The fun part is trying basketball terms like "Celtics", "Jayson", "Tatum", or "basketball" - you'll see how much more efficiently it handles these compared to random text.

[Interactive demo: live token count, character count, compression ratio, and color-coded tokenized output as you type]

Each color represents a different token. Basketball terms (highlighted with yellow rings) get better compression due to domain-specific training!

The Custom BPE Implementation
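The trainer itself is short. What follows is a minimal sketch of the byte-level BPE approach, with function names of my own choosing rather than the exact code from the project: start from raw UTF-8 bytes, repeatedly count adjacent pairs, merge the most frequent pair into a new token id, and stop once the vocabulary hits 512 (256 raw bytes plus 256 learned merges). Encoding then just replays those merges on new text, earliest-learned first.

```python
def _merge(ids, pair, new_id):
    """Replace every adjacent occurrence of `pair` in `ids` with `new_id`."""
    out, i = [], 0
    while i < len(ids):
        if i < len(ids) - 1 and (ids[i], ids[i + 1]) == pair:
            out.append(new_id)
            i += 2
        else:
            out.append(ids[i])
            i += 1
    return out


def train_bpe(text, vocab_size=512):
    """Learn BPE merges over raw UTF-8 bytes. Returns {(id, id): new_id}."""
    ids = list(text.encode("utf-8"))             # ids 0-255 are the raw bytes
    merges = {}
    for new_id in range(256, vocab_size):        # 256 merges -> 512 total tokens
        counts = {}
        for pair in zip(ids, ids[1:]):           # count adjacent pairs
            counts[pair] = counts.get(pair, 0) + 1
        if not counts:
            break
        best = max(counts, key=counts.get)       # most frequent adjacent pair
        ids = _merge(ids, best, new_id)
        merges[best] = new_id
    return merges


def bpe_encode(text, merges):
    """Encode new text by replaying the learned merges, earliest-learned first."""
    ids = list(text.encode("utf-8"))
    while len(ids) >= 2:
        pairs = set(zip(ids, ids[1:]))
        # apply the merge that was learned earliest (lowest new token id)
        pair = min(pairs, key=lambda p: merges.get(p, float("inf")))
        if pair not in merges:
            break                                # nothing left to merge
        ids = _merge(ids, pair, merges[pair])
    return ids
```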

Performance Analysis & Insights
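The compression numbers earlier come from nothing fancier than encoding a handful of test sentences and comparing token counts to character counts. A tiny harness like this, using the sketch functions above, is enough to reproduce the comparison once you have a trained merges table:

```python
test_sentences = [
    "Jayson Tatum is a Celtic",
    "Basketball is the second most famous sport behind football",
    "General Text (Taylor Swift)",
    "This is a tokenizer in action!",
]

for sentence in test_sentences:
    ids = bpe_encode(sentence, merges)          # merges from train_bpe above
    pct = len(sentence) / len(ids) * 100        # characters per token, as a %
    print(f"{sentence!r}: {len(sentence)} chars -> {len(ids)} tokens ({pct:.0f}%)")
```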

Technical Implementation Details
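Two smaller details round things out: decoding token ids back to text, and getting a per-token string for each id so the demo can color each piece separately. Again, this is a sketch under the same assumptions as above; the real implementation may organize it differently.

```python
def build_vocab(merges):
    """Expand every merge back into the byte string it represents."""
    vocab = {i: bytes([i]) for i in range(256)}
    for (a, b), new_id in merges.items():       # insertion order = training order
        vocab[new_id] = vocab[a] + vocab[b]
    return vocab


def bpe_decode(ids, merges):
    """Token ids -> original text."""
    vocab = build_vocab(merges)
    return b"".join(vocab[i] for i in ids).decode("utf-8", errors="replace")


def token_strings(ids, merges):
    """One readable string per token - this is what the demo colors individually."""
    vocab = build_vocab(merges)
    return [vocab[i].decode("utf-8", errors="replace") for i in ids]
```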

Wrapping Up

This project ended up being way more fun than I expected. What started as "let me train a tokenizer on basketball content" turned into a deep dive into how language models actually process text.

What I learned:

  1. Domain-specific training makes a huge difference - on Celtics content the tokenizer needs only about 38% as many tokens as characters, versus roughly 56% on general text
  2. BPE is elegantly simple - the algorithm just counts adjacent pairs and merges the most frequent one, over and over, so it learns whatever patterns dominate your data
  3. Interactive demos are worth building - seeing tokenization happen live makes everything click
  4. Building from scratch teaches you so much more - you understand every piece when you implement it yourself

The coolest part was watching the tokenizer essentially become a Celtics fan through training. It learned that "Jayson", "Tatum", and "Celtics" are important enough to get their own tokens. That's the power of specialized models - they adapt to your specific domain.

What's next? I'm thinking about:

  • Training a small language model using this tokenizer (basketball chatbot anyone?)
  • Extending to other sports or domains
  • Building more interactive NLP tools
  • Maybe training on the entire Celtics roster's pages?

The interactive demo above is just the start. I want to keep building tools that make AI concepts accessible and fun to play with.

If you're working on similar stuff, have questions about the implementation, or just want to talk about the Celtics' championship chances, feel free to reach out!

This project captures what I love about building AI - taking something complex and making it both understandable and personally meaningful. More projects like this coming soon! 🏀