SQLite FTS5 Tokenizers: `unicode61` and `ascii`

by Audrey M. Roy Greenfeld | Mon, Jan 13, 2025

The SQLite FTS5 (full-text search) extension includes the built-in tokenizers unicode61, ascii, porter, and trigram, implemented in [fts5_tokenize.c](https://github.com/sqlite/sqlite/blob/master/ext/fts5/fts5_tokenize.c). [APSW](https://github.com/rogerbinns/apsw) provides a Python API for using them. Here we explore the default tokenizer `unicode61` and the built-in tokenizer `ascii` in detail.


Overview

When you use SQLite's FTS5 extension, you create an FTS5 virtual table and specify which of its tokenizers to use (or a custom one of your own).

The tokenizer defines how text is split into tokens, and how it'll be matched in searches.
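
For example, here's a minimal sketch of creating such a table. The table and column names are made up for illustration, and this assumes a recent APSW with Connection.execute:

import apsw
connection = apsw.Connection(":memory:")
# tokenize= selects the tokenizer; unicode61 is what you'd get by default anyway
connection.execute(
    "CREATE VIRTUAL TABLE posts USING fts5(title, body, tokenize='unicode61')"
)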

Setup

import apsw
import apsw.ext

# open (or create) the database file
connection = apsw.Connection("dbfile")
connection

The Default SQLite FTS5 Tokenizer unicode61

This is implemented in SQLite's fts5_tokenize.c starting at line 175, with a Python wrapper in APSW's fts5.py.

tokenizer = connection.fts5_tokenizer("unicode61")
tokenizer
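
unicode61 accepts optional arguments, passed as a list of strings. As a sketch (the option names come from the SQLite FTS5 docs), here's a variant that keeps diacritics and treats hyphens and underscores as token characters:

# remove_diacritics 0 keeps accents; tokenchars adds characters to tokens
tokenizer_keep = connection.fts5_tokenizer(
    "unicode61", ["remove_diacritics", "0", "tokenchars", "-_"]
)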

I like the apsw tokenizer docs' test text:

txt = """🤦🏼‍♂️ v1.2 Grey Ⅲ ColOUR! Don't jump -  🇫🇮你好世界 Straße
    हैलो वर्ल्ड Déjà vu Résumé SQLITE_ERROR"""
encoded = txt.encode("utf8")
encoded

Yeah, remember txt ends in SQLITE_ERROR; that's expected (it's part of the test text, not an actual error). Also interesting: fts5_tokenizer doesn't encode the text for us. It works on UTF-8 bytes, so we call .encode() ourselves.

tokenized = tokenizer(encoded, apsw.FTS5_TOKENIZE_DOCUMENT, None, include_offsets=False)
tokenized
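
If we leave include_offsets at its default of True, each entry also carries the token's start and end byte offsets into the encoded text, which is useful for things like highlighting matches:

# each entry is (start, end, token...), with byte offsets into `encoded`
tokenizer(encoded, apsw.FTS5_TOKENIZE_DOCUMENT, None)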

Observations about tokenization with unicode61:

| Observation | Example |
| --- | --- |
| Emojis are preserved as single tokens | 🤦🏼‍♂️ → 🤦🏼 |
| Numbers are split at decimal points | v1.2 → v1, 2 |
| Case is normalized to lowercase | ColOUR → colour |
| Single Roman numeral Unicode characters are preserved (and lowercased) | Ⅲ → ⅲ |
| Apostrophes split words | Don't → don, t |
| Hyphens and spaces are removed | jump - 🇫🇮你好世界 → jump, 你好世界 |
| Chinese characters stay together (Google Translate says 你好世界 is "Hello World") | 你好世界 kept as one token |
| German eszett (ß) is preserved | Straße → straße |
| Hindi written in Devanagari script is split into parts (Google Translate detects हैलो वर्ल्ड as Hindi for "Hello World") | हैलो split into multiple tokens |
| Diacritics are removed | Déjà → deja |
| Underscores split words | SQLITE_ERROR → sqlite, error |
| Emojis as part of a word are ignored (?) | 🇫🇮 in 🇫🇮你好世界 not tokenized |
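
Since queries run through the same tokenizer, these normalizations apply to matching too. A small sketch (the demo table is hypothetical) showing that a lowercase, accent-free query matches the original text:

connection.execute("CREATE VIRTUAL TABLE demo USING fts5(body)")
connection.execute("INSERT INTO demo VALUES ('ColOUR Déjà vu')")
# matches despite the case and accent differences in the stored text
connection.execute("SELECT * FROM demo WHERE demo MATCH 'colour deja'").fetchall()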

The Built-in SQLite FTS5 Tokenizer ascii

This is implemented in SQLite's fts5_tokenize.c starting at line 18, with a Python wrapper in APSW's fts5.py.

tokenizer_ascii = connection.fts5_tokenizer("ascii")
tokenizer_ascii
tokenized_ascii = tokenizer_ascii(encoded, apsw.FTS5_TOKENIZE_DOCUMENT, None, include_offsets=False)
tokenized_ascii
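
To spot the differences quickly, we can compare the two token lists (this assumes the unicode61 result from earlier is still in tokenized):

# tokens produced by one tokenizer but not the other
set(tokenized) ^ set(tokenized_ascii)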

Comparing our ascii tokenization here with the above unicode61 tokenization:

| Feature | unicode61 | ascii | Example |
| --- | --- | --- | --- |
| Emoji handling | Preserves simple emojis | Keeps composite emojis with modifiers | 🤦🏼 vs 🤦🏼\u200d♂️ |
| Number splitting | Splits at decimal points | Same | v1.2 → v1, 2 |
| Case normalization | Converts to lowercase | Same | ColOUR → colour |
| Roman numeral Unicode characters | Preserves and lowercases as equivalent Unicode character | Preserves original Unicode character | Ⅲ → ⅲ vs Ⅲ → Ⅲ |
| Apostrophes | Splits words | Same | Don't → don, t |
| Spaces/hyphens | Removes | Same | jump - → jump |
| CJK characters | Keeps together | Same | 你好世界 stays as one token |
| German eszett | Preserves | Same | Straße → straße |
| Devanagari script | Splits into parts | Keeps words together | हैलो split into parts vs हैलो |
| Diacritics | Removes | Preserves | Déjà → deja vs déjà |
| Underscores | Splits words | Same | SQLITE_ERROR → sqlite, error |
| Emoji as part of a word | Ignores | Keeps with following text | 你好世界 vs 🇫🇮你好世界 |
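
The diacritics row is the one most likely to matter for English-adjacent text. A quick sketch (again with a hypothetical table) of how it changes matching:

connection.execute("CREATE VIRTUAL TABLE demo_ascii USING fts5(body, tokenize='ascii')")
connection.execute("INSERT INTO demo_ascii VALUES ('Déjà vu')")
# no match: ascii keeps the accents, so the stored token is 'déjà'
connection.execute("SELECT * FROM demo_ascii WHERE demo_ascii MATCH 'deja'").fetchall()
# matches once the query carries the accents too
connection.execute("SELECT * FROM demo_ascii WHERE demo_ascii MATCH 'déjà'").fetchall()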

Choosing a Tokenizer

There's no best tokenizer: choose based on your use case. Here I'm considering the use case of implementing search for a blog or simple personal website, such as this one.

Since I write my notebooks in English and don't really use emoji, the default of unicode61 is fine.
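
And since unicode61 is the default, the tokenize option can simply be left out (table and column names again just for illustration):

connection.execute("CREATE VIRTUAL TABLE notes USING fts5(title, body)")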

Search

To be continued...

Credits

Thanks to Daniel Roy Greenfeld for reviewing.