When you use SQLite's FTS5 extension, you create an FTS5 virtual table and specify which of its built-in tokenizers to use (or supply a custom one of your own).
The tokenizer defines how text is split into tokens, and therefore how terms are matched in searches.
The `unicode61` tokenizer is implemented in SQLite's [fts5_tokenize.c starting at line 175](https://github.com/sqlite/sqlite/blob/master/ext/fts5/fts5_tokenize.c#L175), with a Python wrapper in APSW's [fts5.py](https://github.com/rogerbinns/apsw/blob/master/apsw/fts5.py).
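To make the observations below concrete, here's a minimal sketch of creating an FTS5 table with the `unicode61` tokenizer using Python's standard `sqlite3` module (assuming your SQLite build includes FTS5; the table and column names are illustrative, not from this site's code):

```python
import sqlite3

con = sqlite3.connect(":memory:")
# Create an FTS5 virtual table; unicode61 is also the default tokenizer.
con.execute("CREATE VIRTUAL TABLE docs USING fts5(body, tokenize = 'unicode61')")
con.execute("INSERT INTO docs(body) VALUES ('Don''t jump - 你好世界')")
# MATCH runs the same tokenizer over the query, so 'don' matches "Don't".
print(con.execute("SELECT body FROM docs WHERE docs MATCH 'don'").fetchall())
```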
Observations about tokenization with `unicode61`:
| Observation | Example |
|------------|---------|
| Emojis are preserved as single tokens | `🤦🏼` |
| Numbers are separated on decimal points | `v1.2` → `v1`, `2` |
| Case is normalized to lowercase | `ColOUR` → `colour` |
| Single Roman numeral Unicode characters are preserved (lowercased) | `Ⅲ` → `ⅲ` |
| Apostrophes split words | `Don't` → `don`, `t` |
| Hyphens and spaces are removed | `jump - 🇫🇮你好世界` → `jump`, `你好世界` |
| Chinese characters stay together (Google Translate says `你好世界` is "Hello World") | `你好世界` kept as one token |
| German eszett (ß) is preserved | `Straße` → `straße` |
| Hindi written in Devanagari script is split into parts (Google Translate detects `हैलो वर्ल्ड` as Hindi for "Hello World") | `हैलो` → `ह`, `ल` |
| Diacritics are removed | `Déjà` → `deja` |
| Underscores split words | `SQLITE_ERROR` → `sqlite`, `error` |
| Emojis attached to a word are dropped (?) | `🇫🇮` in `🇫🇮你好世界` is not tokenized |
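One way to check these observations yourself (a sketch, not necessarily how the table above was produced) is SQLite's `fts5vocab` virtual table, which exposes the terms the tokenizer actually indexed:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE VIRTUAL TABLE docs USING fts5(body, tokenize = 'unicode61')")
con.execute("INSERT INTO docs(body) VALUES ('ColOUR Déjà SQLITE_ERROR v1.2')")
# An fts5vocab table reads an FTS5 index; 'row' mode gives one row per term.
con.execute("CREATE VIRTUAL TABLE docs_vocab USING fts5vocab(docs, 'row')")
print([t for (t,) in con.execute("SELECT term FROM docs_vocab ORDER BY term")])
# Expect: ['2', 'colour', 'deja', 'error', 'sqlite', 'v1']
```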
## Included SQLite FTS5 Tokenizer `ascii`
The `ascii` tokenizer is implemented in SQLite's [fts5_tokenize.c starting at line 18](https://github.com/sqlite/sqlite/blob/master/ext/fts5/fts5_tokenize.c#L18), with a Python wrapper in APSW's [fts5.py](https://github.com/rogerbinns/apsw/blob/master/apsw/fts5.py).
Comparing `ascii` tokenization with the `unicode61` tokenization above:
| Feature | unicode61 | ascii | Example |
|---------|-----------|--------|---------|
| Emoji handling | Preserves simple emojis | Keeps composite emojis with modifiers | `🤦🏼` vs `🤦🏼\u200d♂️` |
| Number splitting | Splits on decimal points | Same | `v1.2` → `v1`, `2` |
| Case normalization | Converts to lowercase | Same | `ColOUR` → `colour` |
| Roman numeral Unicode characters | Preserves, lowercased to the equivalent Unicode character | Preserves the original character unchanged | `Ⅲ` → `ⅲ` vs `Ⅲ` |
| Apostrophes | Splits words | Same | `Don't` → `don`, `t` |
| Spaces/hyphens | Removes | Same | `jump -` → `jump` |
| CJK characters | Keeps together | Same | `你好世界` stays as one token |
| German eszett | Preserves | Same | `Straße` → `straße` |
| Devanagari script | Splits into parts | Keeps words together | `हैलो` → `ह`, `ल` vs `हैलो` |
| Diacritics | Removes | Preserves | `Déjà` → `deja` vs `déjà` |
| Underscores | Splits words | Same | `SQLITE_ERROR` → `sqlite`, `error` |
| Emoji as part of a word | Drops | Keeps with following text | `🇫🇮你好世界` |
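The same `fts5vocab` trick (again a sketch; table and column names are illustrative) shows the `ascii` tokenizer leaving non-ASCII characters untouched:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE VIRTUAL TABLE docs USING fts5(body, tokenize = 'ascii')")
con.execute("INSERT INTO docs(body) VALUES ('Déjà Straße हैलो')")
con.execute("CREATE VIRTUAL TABLE docs_vocab USING fts5vocab(docs, 'row')")
print([t for (t,) in con.execute("SELECT term FROM docs_vocab ORDER BY term")])
# Expect: ['déjà', 'straße', 'हैलो'] - only ASCII letters are case-folded,
# and non-ASCII codepoints pass through as token characters.
```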
## Choosing a Tokenizer
There's no single best tokenizer: choose based on your use case. Here I'm considering implementing search for a blog or simple personal website, such as this one.
Since I write my notebooks in English and don't really use emoji, the default of `unicode61` is fine.
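If a default behaviour bothers you, `unicode61` also accepts options documented in the FTS5 manual. As a sketch, passing `tokenchars` tells it to treat extra characters as token characters, so `SQLITE_ERROR` and `v1.2` survive as single tokens (the expected output is my assumption from the documented behaviour):

```python
import sqlite3

con = sqlite3.connect(":memory:")
# tokenchars adds characters that unicode61 would otherwise treat as separators.
con.execute(
    """CREATE VIRTUAL TABLE docs
       USING fts5(body, tokenize = "unicode61 tokenchars '_.'")"""
)
con.execute("INSERT INTO docs(body) VALUES ('SQLITE_ERROR v1.2')")
con.execute("CREATE VIRTUAL TABLE docs_vocab USING fts5vocab(docs, 'row')")
print([t for (t,) in con.execute("SELECT term FROM docs_vocab ORDER BY term")])
# Expect: ['sqlite_error', 'v1.2']
```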
## Search
To be continued...
## Credits
Thanks to [Daniel Roy Greenfeld](https://daniel.feldroy.com) for reviewing.