Constructing SQLite Tables for Notebooks and Search

SQLite full text search setup via APSW for all the notebooks on this website, inspired by the [APSW FTS5 Tour]((https://rogerbinns.github.io/apsw/example-fts.html)).
## Setup
```python from typing import Optional, Iterator, Any from pprint import pprint import re import functools import apsw import apsw.ext import apsw.fts5 import apsw.fts5aux import apsw.fts5query from execnb.nbio import read_nb from fastcore.all import * from pathlib import Path, PosixPath ```
```python print("FTS5 available:", "ENABLE_FTS5" in apsw.compile_options) ```
FTS5 available: True
## Create a Notebooks Database
Until now I haven't had a SQLite database for anything on this site. Let's create one.
```python connection = apsw.Connection("notebooks.db") connection ```
<apsw.Connection at 0x10982ca90>
## Create and Populate Table `notebooks`
```python nbs = L(Path("../arg-blog-fasthtml/nbs").glob("*.ipynb")).sorted(reverse=True) nbs ```
(#35) [Path('../arg-blog-fasthtml/nbs/2025-01-12-A-Better-Notebook-Index-Page.ipynb'),Path('../arg-blog-fasthtml/nbs/2025-01-11-NBClassic-Keyboard-Shortcuts-in-Command-and-Dual-Mode.ipynb'),Path('../arg-blog-fasthtml/nbs/2025-01-10-Understanding-FastHTML-Routes-Requests-and-Redirects.ipynb'),Path('../arg-blog-fasthtml/nbs/2025-01-09-Reading-and-Writing-Jupyter-Notebooks-With-Python.ipynb'),Path('../arg-blog-fasthtml/nbs/2025-01-08-HTML-Title-Tag-in-FastHTML.ipynb'),Path('../arg-blog-fasthtml/nbs/2025-01-07-Verifying-Bluesky-Domain-in-FastHTML.ipynb'),Path('../arg-blog-fasthtml/nbs/2025-01-06-Understanding-FastHTML-Headers.ipynb'),Path('../arg-blog-fasthtml/nbs/2025-01-05-SSH-Agent-to-Save-Passphrase-Typing.ipynb'),Path('../arg-blog-fasthtml/nbs/2025-01-04-Claude-Artifacts-in-Notebooks.ipynb'),Path('../arg-blog-fasthtml/nbs/2025-01-03-Using-zip.ipynb'),Path('../arg-blog-fasthtml/nbs/2025-01-02-FastHTML-Piano-Part-3.ipynb'),Path('../arg-blog-fasthtml/nbs/2025-01-02-FastHTML-Piano-Part-2.ipynb'),Path('../arg-blog-fasthtml/nbs/2025-01-01-FastHTML-Piano-Part-1.ipynb'),Path('../arg-blog-fasthtml/nbs/2025-01-01-Command-Substitution-in-Bash.ipynb'),Path('../arg-blog-fasthtml/nbs/2024-12-31-Note-Box-FastTag.ipynb'),Path('../arg-blog-fasthtml/nbs/2024-12-30-Images-In-Every-Way-In-Notebooks.ipynb'),Path('../arg-blog-fasthtml/nbs/2024-12-29-Bash-in-Jupyter-Notebooks.ipynb'),Path('../arg-blog-fasthtml/nbs/2024-12-28-Minimal-Typography-for-FastHTML-Apps.ipynb'),Path('../arg-blog-fasthtml/nbs/2024-12-27-Notebook-Names-to-Cards.ipynb'),Path('../arg-blog-fasthtml/nbs/2024-12-27-CSS-Scope-Leakage-Pygments.ipynb')...]
All notebooks, sorted from newest to oldest.
```python nb = read_nb(nbs[0]) nb.cells[0] ```
{ 'cell_type': 'markdown',
  'id': 'b8ca3733',
  'idx_': 0,
  'metadata': {},
  'source': '# A Better Notebook Index Page'}
A Markdown cell looks like this.
```python connection.execute("""CREATE TABLE IF NOT EXISTS notebooks ( id INTEGER PRIMARY KEY, path TEXT NOT NULL, markdown_content TEXT)""") ```
<apsw.Cursor at 0x10ac60510>
We create a table to put notebooks' paths and Markdown content into. (At this point we skip code cells to make things simple.)
```python def is_md_cell(c): return c.cell_type == 'markdown' md_cells = L(nb.cells).filter(is_md_cell) md_cells ```
(#39) [{'cell_type': 'markdown', 'id': 'b8ca3733', 'metadata': {}, 'source': '# A Better Notebook Index Page', 'idx_': 0},{'cell_type': 'markdown', 'id': '0adccd3d', 'metadata': {}, 'source': "I've made good progress on creating a notebook every day. Now I have so many notebooks that my index page needs an overhaul, including:\n\n* Dates with datetime\n* Cards with execnb to grab notebook titles\n* The cache decorator to make that fast\n* Subtle CSS tweaks to increase information density", 'idx_': 1},{'cell_type': 'markdown', 'id': '9f9ab0bb', 'metadata': {}, 'source': '## List Live Posts', 'idx_': 3},{'cell_type': 'markdown', 'id': '8e35e71e', 'metadata': {}, 'source': 'According to this, I have 34 notebooks in `arg-blog-fasthtml/nbs`, which matches the 34 cards on audrey.feldroy.com.', 'idx_': 5},{'cell_type': 'markdown', 'id': '506c2763', 'metadata': {}, 'source': '## Pathlib, User Directory, and PosixPath', 'idx_': 6},{'cell_type': 'markdown', 'id': 'c57a8f71', 'metadata': {}, 'source': 'To specify the path in terms of my home directory `~`, I use `PosixPath`.', 'idx_': 8},{'cell_type': 'markdown', 'id': 'f8f7b5f2', 'metadata': {}, 'source': 'Here I expand `~` into `/Users/arg/`, list files that end in .ipynb, and convert the generator object into a readable list with a fastcore `L` list.', 'idx_': 10},{'cell_type': 'markdown', 'id': '8f1183a3', 'metadata': {}, 'source': '## Display the Notebooks List Nicely', 'idx_': 11},{'cell_type': 'markdown', 'id': '33cf8a98', 'metadata': {}, 'source': 'If we just print the filenames, we can see my current approach of naming them with the date and TitleCase title.', 'idx_': 13},{'cell_type': 'markdown', 'id': '96344324', 'metadata': {}, 'source': '## New Approach', 'idx_': 14},{'cell_type': 'markdown', 'id': '4bd7dc56', 'metadata': {}, 'source': 'Doing this has given me insight about how to improve my site:\n\n* Keep naming files as before\n* Now for the index page, get the titles from the notebooks instead of the filenames', 'idx_': 15},{'cell_type': 'markdown', 'id': 'e63668fc', 'metadata': {}, 'source': 'When there were just a few notebooks, these cards were great. Now I want a more information-dense layout with tighter cards, and with titles containing proper punctuation coming from the notebooks themselves.', 'idx_': 16},{'cell_type': 'markdown', 'id': '7daf2fca', 'metadata': {}, 'source': '## Revisit Date Parsing', 'idx_': 17},{'cell_type': 'markdown', 'id': 'd0204966', 'metadata': {}, 'source': "I'm currently getting dates from ISO 8601-prefixed filenames with this not-great code that I hacked together quickly:", 'idx_': 18},{'cell_type': 'markdown', 'id': '5e7a96cd', 'metadata': {}, 'source': "I talked with Claude 3.5 Sonnet about it. It generated code that looked awesome at first but wasn't my favorite when I experimented with it carefully. But something good resulted: out of that I learned about `datetime.fromisoformat` and looked it up in the [Python datetime docs](https://docs.python.org/3/library/datetime.html#datetime.datetime.fromisoformat).", 'idx_': 20},{'cell_type': 'markdown', 'id': '4bdadc6d', 'metadata': {}, 'source': "It's actually nice to have `datetime` objects here because I can get the parts like:", 'idx_': 22},{'cell_type': 'markdown', 'id': '5956ab70', 'metadata': {}, 'source': 'And print them with f-strings:', 'idx_': 24},{'cell_type': 'markdown', 'id': '41130f57', 'metadata': {}, 'source': "I like that combination of readability and abbreviations. I'll try it and see if I still like it later.", 'idx_': 26},{'cell_type': 'markdown', 'id': '3f2adffb', 'metadata': {}, 'source': '## Iterate on ISO 8601 Date Parsing', 'idx_': 27},{'cell_type': 'markdown', 'id': '0a9e4008', 'metadata': {}, 'source': 'My improved function:', 'idx_': 28}...]
A list of only Markdown cells from a notebook looks like this.
```python def cell_source(c): return c.source md = md_cells.map(cell_source) md ```
(#39) ['# A Better Notebook Index Page',"I've made good progress on creating a notebook every day. Now I have so many notebooks that my index page needs an overhaul, including:\n\n* Dates with datetime\n* Cards with execnb to grab notebook titles\n* The cache decorator to make that fast\n* Subtle CSS tweaks to increase information density",'## List Live Posts','According to this, I have 34 notebooks in `arg-blog-fasthtml/nbs`, which matches the 34 cards on audrey.feldroy.com.','## Pathlib, User Directory, and PosixPath','To specify the path in terms of my home directory `~`, I use `PosixPath`.','Here I expand `~` into `/Users/arg/`, list files that end in .ipynb, and convert the generator object into a readable list with a fastcore `L` list.','## Display the Notebooks List Nicely','If we just print the filenames, we can see my current approach of naming them with the date and TitleCase title.','## New Approach','Doing this has given me insight about how to improve my site:\n\n* Keep naming files as before\n* Now for the index page, get the titles from the notebooks instead of the filenames','When there were just a few notebooks, these cards were great. Now I want a more information-dense layout with tighter cards, and with titles containing proper punctuation coming from the notebooks themselves.','## Revisit Date Parsing',"I'm currently getting dates from ISO 8601-prefixed filenames with this not-great code that I hacked together quickly:","I talked with Claude 3.5 Sonnet about it. It generated code that looked awesome at first but wasn't my favorite when I experimented with it carefully. But something good resulted: out of that I learned about `datetime.fromisoformat` and looked it up in the [Python datetime docs](https://docs.python.org/3/library/datetime.html#datetime.datetime.fromisoformat).","It's actually nice to have `datetime` objects here because I can get the parts like:",'And print them with f-strings:',"I like that combination of readability and abbreviations. I'll try it and see if I still like it later.",'## Iterate on ISO 8601 Date Parsing','My improved function:'...]
Map the cells to just their Markdown source. For now we don't care about the rest.
```python def extract_markdown_content(nbpath): """Extract all markdown cell content from a notebook""" nb = read_nb(nbpath) md_cells = L(nb.cells).filter(is_md_cell) return "\n".join(md_cells.map(cell_source)) extract_markdown_content(nbs[0]) ```
"# A Better Notebook Index Page\nI've made good progress on creating a notebook every day. Now I have so many notebooks that my index page needs an overhaul, including:\n\n* Dates with datetime\n* Cards with execnb to grab notebook titles\n* The cache decorator to make that fast\n* Subtle CSS tweaks to increase information density\n## List Live Posts\nAccording to this, I have 34 notebooks in `arg-blog-fasthtml/nbs`, which matches the 34 cards on audrey.feldroy.com.\n## Pathlib, User Directory, and PosixPath\nTo specify the path in terms of my home directory `~`, I use `PosixPath`.\nHere I expand `~` into `/Users/arg/`, list files that end in .ipynb, and convert the generator object into a readable list with a fastcore `L` list.\n## Display the Notebooks List Nicely\nIf we just print the filenames, we can see my current approach of naming them with the date and TitleCase title.\n## New Approach\nDoing this has given me insight about how to improve my site:\n\n* Keep naming files as before\n* Now for the index page, get the titles from the notebooks instead of the filenames\nWhen there were just a few notebooks, these cards were great. Now I want a more information-dense layout with tighter cards, and with titles containing proper punctuation coming from the notebooks themselves.\n## Revisit Date Parsing\nI'm currently getting dates from ISO 8601-prefixed filenames with this not-great code that I hacked together quickly:\nI talked with Claude 3.5 Sonnet about it. It generated code that looked awesome at first but wasn't my favorite when I experimented with it carefully. But something good resulted: out of that I learned about `datetime.fromisoformat` and looked it up in the [Python datetime docs](https://docs.python.org/3/library/datetime.html#datetime.datetime.fromisoformat).\nIt's actually nice to have `datetime` objects here because I can get the parts like:\nAnd print them with f-strings:\nI like that combination of readability and abbreviations. I'll try it and see if I still like it later.\n## Iterate on ISO 8601 Date Parsing\nMy improved function:\nNote: on my first pass writing this notebook, I didn't use the `@cache` decorator. I waited until the end to cache, to make debugging easier for myself. It'll become clear in the next section on notebook titles why caching is good here.\nNow let's test it by grabbing a filename and passing it in.\n## Iterate on Getting Notebook Titles\nInstead of parsing filenames, I'm going to grab the first cell of each notebook and remove the `# ` prefix.\n### Make That a Function\n### Try It on All Titles\n### Find the Broken Titles\n### Fix Broken Titles\nYou can imagine how running this on every notebook would be slow! So we add `@cache`.\n## Iterate on Cards\nMy cards were created with this FastTag:\nLet's rebuild it from scratch with our new functions.\nHere I had to set a root element so `rem` font sizes would show correctly here in nbclassic, where I'm working from. There's more to get it to look right here, but I'm getting a bit tired.\nYou may be wondering about how there's a lot of CSS getting repeated in each card instance. There aren't that many cards right now, and this is still way smaller than a React/Tailwind app. I have some interesting ideas here that I'll save for another day.\n## Bringing Changes Over to My Blog App\nMy blog app [arg-blog-fasthtml](https://github.com/audreyfeldroy/arg-blog-fasthtml) has a `main.py` that isn't notebook-generated. \n\nNote: At this point, considering how many functions I rewrote in this notebook, it would be nice to move that `main.py` to notebooks. I originally had started writing it in notebooks but had moved to the simple main.py to make troubleshooting deployment on a PaaS easier. \n\nFor now, I've updated `main.py` with all of the above manually. I'm like a manual version of `nbdev_export` and that tells me that I should automate.\n## Caching\nWith `@cache` on all the functions above, the index page is super snappy locally! Yes, that's the unbounded cache, but I have few and small enough things to cache that I won't worry about it for now. I can explore it another time.\n## Summary\n\nI've made good progress improving my blog's index page!\n\n1. Better date handling using `datetime.fromisoformat()` instead of regex parsing\n2. Getting proper titles from notebook first cells instead of filenames\n3. Tighter, more information-dense cards with improved typography\n4. Caching with `@cache` to keep things snappy\n\nIt's working well locally, and we'll see what happens when I deploy. I manually updated my blog app's `main.py` with these changes, though doing this made me realize I should probably move that code into notebooks and use `nbdev_export` instead of copying by hand.\n\nThere's still room for improvement, but I'm happy with my progress!"
Join all the Markdown cells' content for a notebook together, separated by 1 new line between each pair of cells.
```python def populate_notebooks_table(): connection.execute("DELETE FROM notebooks") for nb_path in nbs: markdown_text = extract_markdown_content(nb_path) connection.execute( "INSERT INTO notebooks (path, markdown_content) VALUES (?, ?)", (str(nb_path), markdown_text) ) populate_notebooks_table() ```
```python connection.execute("SELECT count(*) FROM notebooks").get ```
35
Now we have notebook paths and their contents in a SQLite table.
## Create Search Table
```python if not connection.table_exists("main", "search"): search_table = apsw.fts5.Table.create( connection, "search", content="notebooks", columns=None, generate_triggers=True, tokenize=["simplify","casefold","true","strip","true","unicodewords"]) else: search_table = apsw.fts5.Table(connection, "search") ```
The SQLite FTS5 extension has you create a virtual search table, where you point it at the name of the table containing the content to search.
```python print("quoted name", search_table.quoted_table_name) ```
quoted name "main"."search"
The notebooks' content:
```python print(connection.execute( "SELECT sql FROM sqlite_schema WHERE name='notebooks'" ).get) ```
CREATE TABLE notebooks (
    id INTEGER PRIMARY KEY,
    path TEXT NOT NULL,
    markdown_content TEXT)
```python pprint(search_table.structure) ```
FTS5TableStructure(name='search',
                   columns=('id', 'path', 'markdown_content'),
                   unindexed=set(),
                   tokenize=('simplify',
                             'casefold',
                             'true',
                             'strip',
                             'true',
                             'unicodewords'),
                   prefix=set(),
                   content='notebooks',
                   content_rowid='_ROWID_',
                   contentless_delete=None,
                   contentless_unindexed=None,
                   columnsize=True,
                   tokendata=False,
                   locale=False,
                   detail='full')
```python print(f"{search_table.config_rank()=}") ```
search_table.config_rank()='bm25()'
```python print(f"{search_table.row_count=}") ```
search_table.row_count=35
```python print(f"{search_table.tokens_per_column=}") ```
search_table.tokens_per_column=[35, 373, 8700]
To be continued...