Building a Custom BPE Tokenizer from Scratch
After building my autograd engine and movie title generator, I wanted to dive deeper into the fundamental building blocks of language models. Tokenization is where everything starts - it's how we bridge human language and machine understanding.
Being a huge Celtics fan, I thought it would be fun to train a tokenizer on Jayson Tatum's Wikipedia page. He's been absolutely crushing it for us, and I figured his page would give me a solid basketball-focused dataset to work with. Plus, it's way more interesting than training on generic text!
The Tokenization Engine
From Raw Text to Meaningful Tokens

The tokenization process: breaking down complex text into manageable, meaningful units that machines can understand.
What I Built
This project ended up being a complete tokenization pipeline - from scraping Wikipedia data to building a real-time interactive demo. I implemented a custom BPE tokenizer inspired by GPT-4's approach, trained it on Jayson Tatum's Wikipedia page, and built a visualization tool so you can see exactly how text gets broken down into tokens. The final result is a 512-token vocabulary that's surprisingly good at compressing basketball-related text.
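The first step was just getting the data. If you want to pull the same text yourself, a few lines against the MediaWiki extracts API will hand you the page as plain text - something along these lines (the output filename and user agent are placeholders, not anything the project depends on):

```python
import requests

# Fetch the plain-text extract of Jayson Tatum's Wikipedia page via the MediaWiki API.
params = {
    "action": "query",
    "format": "json",
    "prop": "extracts",
    "explaintext": 1,          # plain text instead of HTML
    "titles": "Jayson Tatum",
}
resp = requests.get(
    "https://en.wikipedia.org/w/api.php",
    params=params,
    headers={"User-Agent": "bpe-tokenizer-demo/0.1"},  # placeholder user agent
    timeout=30,
)
page = next(iter(resp.json()["query"]["pages"].values()))
with open("tatum_wiki.txt", "w", encoding="utf-8") as f:  # placeholder filename
    f.write(page["extract"])
```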
Key Performance Metrics
"Jayson Tatum is a Celtic"
24 characters ā 7 tokens
Compression: 343%
"Basketball is the second most famous sport behind football"
58 characters ā 28 tokens
Compression: 207%
"General Text (Taylor Swift)"
27 characters ā 18 tokens
Compression: 150%
This is a tokenizer in action!
30 characters ā 18 tokens
Compression: 167%
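The compression percentage here is just characters divided by tokens, so higher means fewer tokens per character. A quick sanity check on those numbers (the token counts are the ones reported by the tokenizer above):

```python
# Sanity-check the compression numbers: ratio = characters / tokens * 100.
samples = [
    ("Jayson Tatum is a Celtic", 7),
    ("Basketball is the second most famous sport behind football", 28),
    ("This is a tokenizer in action!", 18),
]
for text, n_tokens in samples:
    ratio = len(text) / n_tokens * 100
    print(f"{len(text)} chars -> {n_tokens} tokens: {ratio:.0f}% compression")
# 24 chars -> 7 tokens: 343% compression
# 58 chars -> 28 tokens: 207% compression
# 30 chars -> 18 tokens: 167% compression
```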
The cool thing is how much better it handles basketball content compared to general text. Try typing "Celtics" or "Jayson" in the demo below - you'll see how efficiently it tokenizes terms that appeared frequently in the training data.
Try It Out
Here's the interactive demo I built. Type anything you want and watch it get tokenized in real time. The fun part is trying basketball terms like "Celtics", "Jayson", "Tatum", or "basketball" - you'll see how much more efficiently it handles these compared to random text.
[Interactive demo widget: shows the live token count, character count, compression ratio, and the color-coded tokenized output as you type.]
Each color represents a different token. Basketball terms (highlighted with yellow rings) get better compression due to domain-specific training!
The Custom BPE Implementation
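The heart of the tokenizer is the classic BPE training loop: start from the 256 raw byte values and keep merging the most frequent adjacent pair until the vocabulary reaches 512. The sketch below is a stripped-down version of that idea rather than the project's exact code (a GPT-4-style tokenizer would also pre-split the text with a regex before merging), but it's the essential algorithm:

```python
from collections import Counter

def pair_counts(ids):
    """Count how often each adjacent pair of token ids occurs."""
    return Counter(zip(ids, ids[1:]))

def apply_merge(ids, pair, new_id):
    """Replace every occurrence of `pair` in `ids` with the single token `new_id`."""
    out, i = [], 0
    while i < len(ids):
        if i < len(ids) - 1 and (ids[i], ids[i + 1]) == pair:
            out.append(new_id)
            i += 2
        else:
            out.append(ids[i])
            i += 1
    return out

def train_bpe(text, vocab_size=512):
    """Learn (vocab_size - 256) merges on top of the 256 raw byte tokens."""
    ids = list(text.encode("utf-8"))        # start from raw UTF-8 bytes
    merges = {}                             # (id, id) -> new token id, in training order
    for new_id in range(256, vocab_size):
        counts = pair_counts(ids)
        if not counts:
            break
        best = counts.most_common(1)[0][0]  # most frequent adjacent pair wins
        ids = apply_merge(ids, best, new_id)
        merges[best] = new_id
    return merges

# merges = train_bpe(open("tatum_wiki.txt", encoding="utf-8").read(), vocab_size=512)
```

The 256 merges it learns are exactly the byte sequences that show up most often in the training text - which is why names like "Jayson" and "Celtics" end up cheap to encode.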
Performance Analysis & Insights
Technical Implementation Details
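Encoding and decoding fall straight out of the merge table. Here's a self-contained sketch in the same spirit as the training loop above - the tiny hand-written toy_merges is only there so the example runs on its own; in the real tokenizer it would be the 256 merges learned from the Tatum page:

```python
def encode(text, merges):
    """Tokenize by repeatedly applying the earliest-learned (lowest-id) merge."""
    ids = list(text.encode("utf-8"))
    while len(ids) >= 2:
        pairs = set(zip(ids, ids[1:]))
        # pick the adjacent pair that was merged earliest during training
        pair = min(pairs, key=lambda p: merges.get(p, float("inf")))
        if pair not in merges:
            break                      # nothing left to merge
        new_id, out, i = merges[pair], [], 0
        while i < len(ids):
            if i < len(ids) - 1 and (ids[i], ids[i + 1]) == pair:
                out.append(new_id)
                i += 2
            else:
                out.append(ids[i])
                i += 1
        ids = out
    return ids

def decode(ids, merges):
    """Map token ids back to bytes, then to text."""
    vocab = {i: bytes([i]) for i in range(256)}
    for (a, b), idx in merges.items():          # dict preserves training order
        vocab[idx] = vocab[a] + vocab[b]
    return b"".join(vocab[i] for i in ids).decode("utf-8", errors="replace")

# Toy merge table just for this example: 256 = "er", 257 = "er" + "s"
toy_merges = {(101, 114): 256, (256, 115): 257}
ids = encode("ballers versus centers", toy_merges)
print(ids, "->", decode(ids, toy_merges))
```

The one subtlety is ordering: encode always applies the lowest-id mergeable pair first, so it mirrors the order the merges were learned in during training.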
Wrapping Up
This project ended up being way more fun than I expected. What started as "let me train a tokenizer on basketball content" turned into a deep dive into how language models actually process text.
What I learned:
- Domain-specific training makes a huge difference - on Celtics content the basketball-focused tokenizer needs only about 38% as many tokens as characters, versus about 56% on general text
- BPE is elegantly simple - the algorithm just keeps merging the most frequent pairs it sees in your data
- Interactive demos are worth building - seeing tokenization happen live makes everything click
- Building from scratch teaches you so much more - you understand every piece when you implement it yourself
The coolest part was watching the tokenizer essentially become a Celtics fan through training. It learned that "Jayson", "Tatum", and "Celtics" are important enough to get their own tokens. That's the power of specialized models - they adapt to your specific domain.
What's next? I'm thinking about:
- Training a small language model using this tokenizer (basketball chatbot anyone?)
- Extending to other sports or domains
- Building more interactive NLP tools
- Maybe training on the entire Celtics roster's pages?
The interactive demo above is just the start. I want to keep building tools that make AI concepts accessible and fun to play with.
If you're working on similar stuff, have questions about the implementation, or just want to talk about the Celtics' championship chances, feel free to reach out!
This project captures what I love about building AI - taking something complex and making it both understandable and personally meaningful. More projects like this coming soon!