Constructing SQLite Tables for Notebooks and Search

by Audrey M. Roy Greenfeld | Tue, Jan 14, 2025

Setting up SQLite full-text search via APSW for all the notebooks on this website, inspired by the [APSW FTS5 Tour](https://rogerbinns.github.io/apsw/example-fts.html).


Setup

from typing import Optional, Iterator, Any

from pprint import pprint
import re
import functools

import apsw
import apsw.ext
import apsw.fts5
import apsw.fts5aux
import apsw.fts5query

from execnb.nbio import read_nb
from fastcore.all import *
from pathlib import Path, PosixPath
print("FTS5 available:", "ENABLE_FTS5" in apsw.compile_options)

Create a Notebooks Database

Until now I haven't had a SQLite database for anything on this site. Let's create one.

connection = apsw.Connection("notebooks.db")
connection

Create and Populate Table notebooks

nbs = L(Path(".").glob("*.ipynb")).sorted(reverse=True)
nbs

All notebooks, sorted from newest to oldest.

nb = read_nb(nbs[0])
nb.cells[0]

A Markdown cell looks like this.
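For reference, a Markdown cell in a raw .ipynb file follows the nbformat JSON shape sketched below (the content here is made up). Note that in raw notebook JSON, `source` may be a list of strings; execnb's `read_nb` presents it as one joined string.

```python
# A sketch of the nbformat shape of a Markdown cell (content is made up).
cell = {
    "cell_type": "markdown",
    "metadata": {},
    "source": "# Constructing SQLite Tables\n\nIntro paragraph here.",
}
first_line = cell["source"].splitlines()[0]
print(cell["cell_type"], "->", first_line)  # markdown -> # Constructing SQLite Tables
```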

connection.execute("""CREATE TABLE IF NOT EXISTS notebooks (
    id INTEGER PRIMARY KEY,
    path TEXT NOT NULL,
    markdown_content TEXT)""")

We create a table to hold each notebook's path and Markdown content. (We skip code cells for now to keep things simple.)

def is_md_cell(c): return c.cell_type == 'markdown'
md_cells = L(nb.cells).filter(is_md_cell)
md_cells

A list of only Markdown cells from a notebook looks like this.

def cell_source(c): return c.source
md = md_cells.map(cell_source)
md

Map the cells to just their Markdown source. For now we don't care about the rest.

def extract_markdown_content(nbpath):
    """Extract all markdown cell content from a notebook"""
    nb = read_nb(nbpath)
    md_cells = L(nb.cells).filter(is_md_cell)
    return "\n".join(md_cells.map(cell_source))
extract_markdown_content(nbs[0])

Join all the Markdown cells' content for a notebook together, with a single newline between each pair of cells.
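The same filter-and-join step can be sketched in plain Python without execnb or fastcore, working directly on a notebook's JSON dict. This `extract_md` helper is hypothetical, not part of the post's code, and it also handles the list-of-strings `source` form found in raw .ipynb files:

```python
def extract_md(nb_dict):
    """Join the source of every Markdown cell with single newlines."""
    md = []
    for c in nb_dict["cells"]:
        if c["cell_type"] != "markdown":
            continue  # skip code and raw cells
        src = c["source"]
        # Raw .ipynb JSON may store source as a list of line strings.
        md.append("".join(src) if isinstance(src, list) else src)
    return "\n".join(md)

nb_dict = {"cells": [
    {"cell_type": "markdown", "source": "# Title"},
    {"cell_type": "code", "source": "print('hi')"},
    {"cell_type": "markdown", "source": ["Two ", "parts."]},
]}
print(extract_md(nb_dict))  # "# Title\nTwo parts."
```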

def populate_notebooks_table():
    connection.execute("DELETE FROM notebooks")
    
    for nb_path in nbs:
        markdown_text = extract_markdown_content(nb_path)
        connection.execute(
            "INSERT INTO notebooks (path, markdown_content) VALUES (?, ?)",
            (str(nb_path), markdown_text)
        )
populate_notebooks_table()
connection.execute("SELECT count(*) FROM notebooks").get

Now we have notebook paths and their contents in a SQLite table.
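The same schema and insert pattern works with the stdlib sqlite3 module if you don't have APSW handy. Here is an in-memory sketch with made-up notebook paths and content:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("""CREATE TABLE IF NOT EXISTS notebooks (
    id INTEGER PRIMARY KEY,
    path TEXT NOT NULL,
    markdown_content TEXT)""")
# Parameterized inserts, same as in the APSW version above.
rows = [("2025-01-14-Search.ipynb", "# Search\nNotes..."),
        ("2025-01-13-Hello.ipynb", "# Hello\nMore notes...")]
con.executemany(
    "INSERT INTO notebooks (path, markdown_content) VALUES (?, ?)", rows)
count = con.execute("SELECT count(*) FROM notebooks").fetchone()[0]
print(count)  # 2
```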

Create Search Table

if not connection.table_exists("main", "search"):
    search_table = apsw.fts5.Table.create(
        connection,
        "search",
        content="notebooks",
        columns=None,
        generate_triggers=True,
        tokenize=["unicode61"])
else:
    search_table = apsw.fts5.Table(connection, "search")

Here we check whether a table named `search` already exists in the `main` database of `connection`. If it doesn't, we create it.

`search` is a virtual table pointing at the `notebooks` table. It doesn't store a copy of the text itself! It holds the full-text index for that table, along with a table-like interface for searching it.
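APSW's `Table.create` generates the FTS5 virtual table and the triggers that keep it in sync with `notebooks`. Under the hood this is roughly equivalent to raw SQL like the following sketch, written with the stdlib sqlite3 module and assuming FTS5 is compiled in (it is in most Python builds). Only the insert trigger is shown; APSW also generates update and delete triggers:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("""CREATE TABLE notebooks (
    id INTEGER PRIMARY KEY, path TEXT, markdown_content TEXT)""")
# External-content FTS5 table: indexes notebooks without copying its text.
con.execute("""CREATE VIRTUAL TABLE search USING fts5(
    path, markdown_content,
    content='notebooks', content_rowid='id',
    tokenize='unicode61')""")
# Keep the index in sync when rows are inserted.
con.execute("""CREATE TRIGGER notebooks_ai AFTER INSERT ON notebooks BEGIN
    INSERT INTO search(rowid, path, markdown_content)
    VALUES (new.id, new.path, new.markdown_content);
END""")
con.execute("INSERT INTO notebooks (path, markdown_content) VALUES (?, ?)",
            ("2025-01-14-Search.ipynb", "SQLite full text search via APSW"))
# unicode61 tokenizing is case-insensitive, so 'sqlite' matches 'SQLite'.
hits = con.execute(
    "SELECT path FROM search WHERE search MATCH 'sqlite'").fetchall()
print(hits)  # [('2025-01-14-Search.ipynb',)]
```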

Check the Tables

print("quoted name", search_table.quoted_table_name)

This verifies that the database schema `main` and table name `search` have been set up.

The notebooks' content:

print(connection.execute(
        "SELECT sql FROM sqlite_schema WHERE name='notebooks'"
    ).get)
pprint(search_table.structure)
print(f"{search_table.config_rank()=}")
print(f"{search_table.row_count=}")
print(f"{search_table.tokens_per_column=}")

Optional Cleanup

At one point I had to run this to drop the search table, so I could recreate it with a different tokenizer:

connection.execute("DROP TABLE IF EXISTS search")

What Next?

To be continued...