Live system · Last refresh: 44d ago

Mastascusa Holdings · Case study

Nine websites in. One map out.
Every six hours, on the dot.

This is the live map reverse-engineered: every box, every decision, every line of math. Plain-English up top, the engineering chops underneath. Same pipeline pointed at your data is a thing I sell.


Items right now: 4,232
Sources: 9
Cadence: every 6 hours
Window: 2026-04-29 → 2026-05-01

Numbers above pulled from kb-preview.json at build time — refreshed every six hours by the pipeline you're about to read about.

Plain English

A baby could follow this:

1. Computer reads nine AI websites every six hours.
2. It remembers what it's already seen so it never shows the same thing twice.
3. It turns each headline into a dot on a map. Headlines about the same idea sit next to each other.
4. It saves the map. The website updates. You're looking at it.

Engineering version

Five stages. Each one is a deliberate choice.

Ingest — Read nine websites.

feedparser pulls RSS from 8 publishers. Anthropic has no feed, so httpx walks the sitemap and regex-extracts the title and meta description from each URL under /news, /research, and /engineering. Hacker News comes in through the Algolia search API, filtered to stories with points ≥ 100 that match a keyword regex.
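The sitemap-and-regex path might look roughly like the sketch below. It is not the crawler's actual code: the function names, both regexes, and the sample XML/HTML are assumptions made for illustration, and the httpx fetch is left out so the snippet runs offline.

```python
import re
from html import unescape
from urllib.parse import urlparse
from xml.etree import ElementTree

# The three path prefixes named in the text.
SECTIONS = ("/news", "/research", "/engineering")
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

# Hypothetical patterns. The meta regex assumes name= comes before content=.
TITLE_RE = re.compile(r"<title[^>]*>(.*?)</title>", re.IGNORECASE | re.DOTALL)
DESC_RE = re.compile(
    r"""<meta\s+name=["']description["']\s+content=(["'])(.*?)\1""",
    re.IGNORECASE | re.DOTALL,
)

def article_urls(sitemap_xml):
    """Return sitemap <loc> URLs whose path sits under one of SECTIONS."""
    root = ElementTree.fromstring(sitemap_xml)
    keep = []
    for loc in root.findall("sm:url/sm:loc", SITEMAP_NS):
        url = (loc.text or "").strip()
        path = urlparse(url).path
        # Match "/news" and "/news/...", but not "/newsletter".
        if any(path == s or path.startswith(s + "/") for s in SECTIONS):
            keep.append(url)
    return keep

def title_and_description(html):
    """Pull <title> and the description meta tag out of raw HTML."""
    title = TITLE_RE.search(html)
    desc = DESC_RE.search(html)
    return (
        unescape(title.group(1)).strip() if title else "",
        unescape(desc.group(2)).strip() if desc else "",
    )

sample_sitemap = (
    '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">'
    "<url><loc>https://example.com/news/launch</loc></url>"
    "<url><loc>https://example.com/careers</loc></url>"
    "<url><loc>https://example.com/newsletter</loc></url>"
    "<url><loc>https://example.com/research/paper</loc></url>"
    "</urlset>"
)
sample_html = (
    "<html><head><title>Launch &amp; notes</title>"
    '<meta name="description" content="What\'s new this week">'
    "</head></html>"
)

urls = article_urls(sample_sitemap)
meta = title_and_description(sample_html)
```

Regex extraction is brittle against attribute reordering, which is an acceptable trade when the target is one known site and the fallback is an empty string.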

Fingerprint — Give every item a unique ID.

SHA-256 of the URL. Collisions are effectively impossible at this volume: about 10⁻⁷⁷ for any single pair of URLs, and still only about 10⁻⁷⁰ across every pair among ~4,000 items. The fingerprint is the SQLite primary key, so the pipeline is idempotent: run it 100 times in a row and it stores exactly one row per item.
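A minimal sketch of the fingerprint and the collision arithmetic behind it. It assumes the URL is hashed as-is with no normalization; `fingerprint` and `collision_bound` are illustrative names, not necessarily the ones in the crawler.

```python
import hashlib

def fingerprint(url: str) -> str:
    """Hex SHA-256 of the URL's UTF-8 bytes (64 hex characters)."""
    return hashlib.sha256(url.encode("utf-8")).hexdigest()

def collision_bound(n_items: int, bits: int = 256) -> float:
    """Birthday upper bound on any collision: (number of pairs) / 2**bits."""
    pairs = n_items * (n_items - 1) // 2
    return pairs / 2**bits

fp = fingerprint("https://example.com/news/launch")
bound = collision_bound(4232)  # item count shown at the top of this page
```

For 4,232 items the bound comes out near 8 × 10⁻⁷¹. The 10⁻⁷⁷ figure is roughly 1/2²⁵⁶, the chance that one specific pair collides.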

Embed — Turn each title into a vector of numbers.

TF-IDF on title + summary (1–2 gram, 20k features, sublinear TF), then TruncatedSVD down to 64 dimensions. Cached to .npz keyed by fingerprint — re-runs only embed the deltas, so the 6-hour cadence stays cheap.
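The embedding step, sketched with scikit-learn under the parameters listed above. The `embed` helper, the `seed` argument, and the toy documents are assumptions. The .npz cache is omitted, and the component count is capped only so the demo runs on a four-document corpus.

```python
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer

def embed(texts, n_dims=64, seed=0):
    """TF-IDF (1-2 grams, 20k features, sublinear TF), then TruncatedSVD."""
    tfidf = TfidfVectorizer(
        ngram_range=(1, 2), max_features=20_000, sublinear_tf=True
    )
    X = tfidf.fit_transform(texts)
    # A tiny corpus cannot support 64 components, so cap by the matrix size.
    k = max(1, min(n_dims, X.shape[0] - 1, X.shape[1] - 1))
    svd = TruncatedSVD(n_components=k, random_state=seed)
    return svd.fit_transform(X)

# Each document is title + summary joined into one string (toy data).
docs = [
    "Open-weights language model released. Weights and evals are public.",
    "New language model tops coding benchmark. Evals cover ten tasks.",
    "Chip maker announces inference accelerator. Ships next quarter.",
    "Accelerator benchmarks published. Inference throughput doubles.",
]
vectors = embed(docs)
```

On a real corpus of a few thousand items the cap never triggers and the output has the full 64 columns.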

Project — Squish those vectors into a 2-D map.

Mean-center, take the top-2 principal components via SVD, scale by the 99th-percentile absolute value, clip to [-1, 1]. The map you see is literally the maximum-variance 2-D linear projection of the corpus.
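Those four operations fit in a few lines of NumPy. This sketch assumes a single scale factor shared by both axes, which the description does not spell out. `project_2d` is an illustrative name and the input here is random data.

```python
import numpy as np

def project_2d(vectors: np.ndarray) -> np.ndarray:
    """Mean-center, project onto the top-2 principal axes, scale, clip."""
    centered = vectors - vectors.mean(axis=0)
    # Rows of vt are principal directions, ordered by decreasing singular value.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    coords = centered @ vt[:2].T
    # Assumed: one scale for both axes, taken from all absolute coordinates.
    scale = np.percentile(np.abs(coords), 99)
    if scale > 0:
        coords = coords / scale
    return np.clip(coords, -1.0, 1.0)

rng = np.random.default_rng(0)
xy = project_2d(rng.normal(size=(200, 64)))
```

Scaling by the 99th percentile rather than the maximum keeps a handful of outliers from squeezing everything else into the middle of the map. The clip then pins those outliers to the border.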

Publish — Save it. Push to GitHub. Site rebuilds.

Two outputs: a chronological markdown archive grouped by ISO week, and a JSON snapshot embedded in this site. The publish step is a git commit on the website repo — Vercel detects it and redeploys in ~30 seconds. The page you're reading is the proof.
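Grouping the archive by ISO week could be done along these lines. The markdown layout, the `weekly_archive` name, and the sample items are assumptions. Only the ISO-week grouping itself comes from the description above.

```python
from collections import defaultdict
from datetime import date

def weekly_archive(items):
    """items: iterable of (published_date, title, url). Returns markdown text."""
    by_week = defaultdict(list)
    for published, title, url in items:
        iso = published.isocalendar()  # (ISO year, ISO week, ISO weekday)
        by_week[(iso[0], iso[1])].append((published, title, url))

    lines = []
    for year, week in sorted(by_week, reverse=True):  # newest week first
        lines.append(f"## {year}-W{week:02d}")
        for published, title, url in sorted(by_week[(year, week)], reverse=True):
            lines.append(f"- {published.isoformat()} [{title}]({url})")
        lines.append("")
    return "\n".join(lines)

archive = weekly_archive([
    (date(2026, 5, 1), "Friday post", "https://example.com/a"),
    (date(2026, 4, 29), "Wednesday post", "https://example.com/b"),
    (date(2026, 4, 24), "Last week's post", "https://example.com/c"),
    (date(2025, 12, 29), "Year-boundary post", "https://example.com/d"),
])
```

Using the ISO year from `isocalendar()` rather than `published.year` matters at year boundaries: 2025-12-29 belongs to week 2026-W01.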

Receipts

The dedup, in one short function.

The whole pipeline is ~290 lines of Python. This is the load-bearing chunk — the reason re-running the crawler doesn't corrupt the database.

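# Idempotent ingest. Every item's primary key is SHA-256(url).
# Run this 1× or 1000× and the table is identical.
from datetime import datetime, timezone

def store_new(conn, items):
    """Insert unseen items and return the ones that were actually new."""
    # One UTC timestamp per batch, stored as the last column.
    now = datetime.now(timezone.utc).isoformat()
    new = []
    for it in items:
        # Skip anything whose fingerprint is already in the table.
        if conn.execute(
            "SELECT 1 FROM items WHERE fingerprint=?",
            (it.fingerprint,),
        ).fetchone():
            continue
        conn.execute(
            "INSERT INTO items VALUES (?,?,?,?,?,?,?)",
            (it.fingerprint, it.source, it.title, it.url,
             it.published.isoformat(), it.summary, now),
        )
        new.append(it)
    conn.commit()
    return new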
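As a variation, the primary key can do the duplicate check itself. The sketch below is an alternative, not the pipeline's code: it swaps the SELECT for `INSERT OR IGNORE` and reads `cursor.rowcount` to tell new rows from ignored ones. The schema (column names especially) and the `store_new_or_ignore` name are assumptions that only match the seven-value INSERT above in shape.

```python
import sqlite3
from datetime import datetime, timezone
from types import SimpleNamespace

# Assumed schema: seven columns, fingerprint as the primary key.
SCHEMA = """
CREATE TABLE IF NOT EXISTS items (
    fingerprint TEXT PRIMARY KEY,
    source TEXT, title TEXT, url TEXT,
    published TEXT, summary TEXT, first_seen TEXT
)
"""

def store_new_or_ignore(conn, items):
    """Insert items, letting the primary key reject duplicates."""
    now = datetime.now(timezone.utc).isoformat()
    new = []
    for it in items:
        cur = conn.execute(
            "INSERT OR IGNORE INTO items VALUES (?,?,?,?,?,?,?)",
            (it.fingerprint, it.source, it.title, it.url,
             it.published.isoformat(), it.summary, now),
        )
        if cur.rowcount == 1:  # 0 means the row already existed
            new.append(it)
    conn.commit()
    return new

conn = sqlite3.connect(":memory:")
conn.execute(SCHEMA)
item = SimpleNamespace(
    fingerprint="abc123", source="demo", title="Hello",
    url="https://example.com/hello",
    published=datetime(2026, 4, 29, tzinfo=timezone.utc),
    summary="A placeholder item.",
)
first = store_new_or_ignore(conn, [item])
second = store_new_or_ignore(conn, [item])
row_count = conn.execute("SELECT COUNT(*) FROM items").fetchone()[0]
```

Running the same batch twice leaves one row in the table and returns an empty list the second time, which is the idempotency property in miniature.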

Decisions I'll defend on a whiteboard

Every choice has a "why."

Why SHA-256, not a softer dedup?
Title-based dedup is fragile (publishers retitle posts mid-flight). URL-fingerprint is bit-exact and cheap. Boring decision, but boring is correct here.

Why TF-IDF + SVD instead of a giant transformer?
For 4,000 short documents the variance is in the vocabulary, not the deep semantics. TF-IDF + SVD takes 800 ms cold; sentence-transformers takes a minute and adds nothing the eye can see in 2-D. Pick the floor of the model that solves the problem.

Why SQLite, not a real database?
One file. No server. No connection pool. Backups are `cp`. For a single-writer pipeline this is the right answer. Switching to Postgres would buy zero capability and cost one ops headache.

Why publish via git push?
It's already the source of truth for the site. Routing the data through it means the website state is reproducible from a single commit hash, with no separate "freshness pipeline" to monitor.

Stack

Ingestion: Python · feedparser · httpx · regex
Storage: SQLite (one file, single-writer)
Embeddings: scikit-learn · TF-IDF · TruncatedSVD · .npz cache
Projection: NumPy SVD, 2-D principal components
Scheduling: Windows Task Scheduler · 6-hour interval
Publish: git push → Vercel autodeploy → Astro static render

DIY

Want to run it yourself?

The whole thing is one Python file, one SQLite DB, and a batch script. No Docker. No cloud account. Five minutes from install to first crawl.

# What it looks like end-to-end:
python -m venv .venv && .venv\Scripts\activate
pip install -r requirements.txt
python crawler.py
# → SQLite + markdown archive populated. Done.

Point ARXIV_FEEDS and BLOG_FEEDS in crawler.py at any RSS feeds and the same pipeline reads whatever you give it. The dedup, embeddings, and publish loop are all source-agnostic.

Email me for the source →

For your organization

Want this on your data?

Research literature. Contract repositories. Customer call transcripts. Regulatory filings. Internal wikis. The architecture above is source-agnostic — I swap the adapters, you get the map.

Put this on my data → Or see the three Build SKUs
