Live system
Last refresh · 44d ago
Mastascusa Holdings · Case study
Nine websites in. One map out.
Every six hours, on the dot.
This is the live map reverse-engineered: every box, every decision, every line of math. Plain-English up top, the engineering chops underneath. Same pipeline pointed at your data is a thing I sell.
Window: 2026-04-29 → 2026-05-01
Numbers above pulled from kb-preview.json at build time — refreshed every six hours by the pipeline you're about to read about.
Plain English
A baby could follow this:
1. Computer reads nine AI websites every six hours.
2. It remembers what it's already seen so it never shows the same thing twice.
3. It turns each headline into a dot on a map. Headlines about the same idea sit next to each other.
4. It saves the map. The website updates. You're looking at it.
Engineering version
Five stages. Each one is a deliberate choice.
01
Ingest — Read nine websites.
feedparser pulls RSS from 8 publishers. Anthropic has no feed, so httpx walks the sitemap and regex-extracts title + meta-description from each URL under /news, /research, /engineering. Hacker News comes from the Algolia search API, filtered to stories with points ≥ 100 whose titles match a keyword regex.
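A minimal sketch of the two non-RSS paths described above, written as pure functions so it runs without a network. The keyword list, the helper names, and the exact regexes are assumptions for illustration, not the pipeline's actual code. The `points` and `title` fields are the ones the Algolia Hacker News API returns on each hit.

```python
import re
from html import unescape
from urllib.parse import urlparse

# Path prefixes named in the text; everything else on the sitemap is skipped.
SECTIONS = ("/news", "/research", "/engineering")

TITLE_RE = re.compile(r"<title[^>]*>(.*?)</title>", re.IGNORECASE | re.DOTALL)
# Assumes name= comes before content= inside the tag (hypothetical simplification).
META_DESC_RE = re.compile(
    r'<meta\s+[^>]*?name=["\']description["\'][^>]*?content=(["\'])(.*?)\1',
    re.IGNORECASE | re.DOTALL,
)

# Hypothetical keyword list; the real regex is not shown in the source.
KEYWORDS_RE = re.compile(r"\b(ai|llm|gpt|claude|transformer)\b", re.IGNORECASE)
MIN_POINTS = 100


def in_scope(url: str) -> bool:
    """True if the URL path sits under one of the crawled sections."""
    path = urlparse(url).path
    return any(path == s or path.startswith(s + "/") for s in SECTIONS)


def extract_title_and_description(html: str) -> tuple[str, str]:
    """Regex-extract <title> and the meta description from raw HTML."""
    title = TITLE_RE.search(html)
    desc = META_DESC_RE.search(html)
    return (
        unescape(title.group(1)).strip() if title else "",
        unescape(desc.group(2)).strip() if desc else "",
    )


def keep_hn_hit(hit: dict) -> bool:
    """Keep a Hacker News hit only if it clears both gates."""
    points = hit.get("points") or 0
    title = hit.get("title") or ""
    return points >= MIN_POINTS and bool(KEYWORDS_RE.search(title))
```

Regex extraction is brittle against unusual markup, which is an acceptable trade for a handful of known page templates. A general crawler would want a real HTML parser.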
02
Fingerprint — Give every item a unique ID.
SHA-256 of the URL. A collision is practically impossible at this volume: about 10⁻⁷⁷ for any given pair of URLs, and still only around 10⁻⁷⁰ across roughly 4,000 items. The fingerprint is the SQLite primary key, so the pipeline is idempotent: run it 100 times in a row and it stores exactly one row per item.
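As a rough illustration, the fingerprint and the collision estimate can be sketched with the standard library. The function names are hypothetical, and the sketch hashes the URL exactly as given (no normalization of trailing slashes or query strings), which is an assumption about the pipeline.

```python
import hashlib


def fingerprint(url: str) -> str:
    """SHA-256 of the URL, as a 64-character hex string."""
    return hashlib.sha256(url.encode("utf-8")).hexdigest()


def collision_bound(n_items: int, bits: int = 256) -> float:
    """Birthday bound: P(any collision) <= n(n-1)/2 / 2**bits."""
    return n_items * (n_items - 1) / 2 / 2**bits


fp = fingerprint("https://example.com/post")
bound = collision_bound(4_000)  # on the order of 1e-70
```

Because the hash is deterministic, the same URL always maps to the same primary key, which is what makes re-runs safe.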
03
Embed — Turn each title into a vector of numbers.
TF-IDF on title + summary (1–2 gram, 20k features, sublinear TF), then TruncatedSVD down to 64 dimensions. Cached to .npz keyed by fingerprint — re-runs only embed the deltas, so the 6-hour cadence stays cheap.
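The embedding stage could look roughly like this with scikit-learn. The vectorizer settings mirror the ones listed above. The function name, the seed, and the cap on components for tiny corpora are assumptions added so the sketch runs on a handful of documents. Caching is left out.

```python
import numpy as np
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer


def embed(texts: list[str], n_components: int = 64, seed: int = 0) -> np.ndarray:
    """TF-IDF (1-2 grams, 20k features, sublinear TF) reduced by TruncatedSVD."""
    vectorizer = TfidfVectorizer(
        ngram_range=(1, 2),
        max_features=20_000,
        sublinear_tf=True,
    )
    tfidf = vectorizer.fit_transform(texts)
    # SVD cannot return more components than the matrix supports, so cap
    # the target for small inputs (assumption; a full corpus gets all 64).
    k = max(1, min(n_components, min(tfidf.shape) - 1))
    svd = TruncatedSVD(n_components=k, random_state=seed)
    return svd.fit_transform(tfidf)


docs = [
    "New language model tops reasoning benchmark",
    "Language model benchmark results questioned",
    "Robotics lab releases open hardware arm",
    "Open hardware robotics arm gets firmware update",
    "Chip maker announces faster inference accelerator",
    "Inference accelerator benchmarks show mixed results",
]
vectors = embed(docs)
```

One caveat worth stating: TF-IDF weights and SVD axes depend on the corpus they were fitted on, so embedding only new items assumes the fitted vectorizer and SVD are reused between runs.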
04
Project — Squish those vectors into a 2-D map.
Mean-center, take the top-2 principal components via SVD, scale by the 99th-percentile absolute value, clip to [-1, 1]. The map you see is literally the maximum-variance linear projection of the embedded corpus.
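Those four steps translate almost line for line into NumPy. This is a sketch under one stated assumption: the 99th-percentile scale is computed once over both axes, not per axis. The function name is hypothetical.

```python
import numpy as np


def project_2d(vectors: np.ndarray) -> np.ndarray:
    """Project row vectors onto their top-2 principal components in [-1, 1]."""
    # 1. Mean-center each dimension.
    centered = vectors - vectors.mean(axis=0)
    # 2. Rows of vt are principal directions, ordered by singular value.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    coords = centered @ vt[:2].T
    # 3. Scale by the 99th-percentile absolute coordinate so a few
    #    outliers do not shrink the whole map.
    scale = np.percentile(np.abs(coords), 99)
    if scale == 0:
        return np.zeros_like(coords)
    # 4. Clip whatever still falls outside the unit square.
    return np.clip(coords / scale, -1.0, 1.0)


rng = np.random.default_rng(0)
xy = project_2d(rng.normal(size=(200, 64)))
```

Scaling by a percentile instead of the maximum means roughly 1% of coordinates get pinned to the edge, and the rest use the full canvas.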
05
Publish — Save it. Push to GitHub. Site rebuilds.
Two outputs: a chronological markdown archive grouped by ISO week, and a JSON snapshot embedded in this site. The publish step is a git commit on the website repo — Vercel detects it and redeploys in ~30 seconds. The page you're reading is the proof.
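The weekly grouping in the markdown archive can be sketched with `date.isocalendar()`. The heading format, the item tuple layout, and the function name are assumptions. The source only says the archive is chronological and grouped by ISO week.

```python
from collections import defaultdict
from datetime import date


def render_archive(items: list[tuple[date, str, str]]) -> str:
    """Render (published, title, url) tuples as markdown, newest week first."""
    weeks = defaultdict(list)
    for published, title, url in items:
        iso = published.isocalendar()  # (ISO year, ISO week, weekday)
        weeks[(iso[0], iso[1])].append((published, title, url))

    lines = []
    for year, week in sorted(weeks, reverse=True):
        lines.append(f"## {year}-W{week:02d}")
        for published, title, url in sorted(weeks[(year, week)], reverse=True):
            lines.append(f"- {published.isoformat()} [{title}]({url})")
        lines.append("")
    return "\n".join(lines)


archive = render_archive([
    (date(2026, 4, 20), "Older post", "https://example.com/older"),
    (date(2026, 4, 29), "First post", "https://example.com/first"),
    (date(2026, 5, 1), "Second post", "https://example.com/second"),
])
```

Using the ISO year alongside the ISO week matters around New Year, when the first days of January can belong to the last week of the previous ISO year.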
Receipts
The dedup, in nine lines.
The whole pipeline is ~290 lines of Python. This is the load-bearing chunk — the reason re-running the crawler doesn't corrupt the database.
# Idempotent ingest. Every item's primary key is SHA-256(url).
# Run this 1× or 1000× and the table is identical.
from datetime import datetime, timezone

def store_new(conn, items):
    # `now` is not defined in the excerpt; assumed to be a UTC fetch timestamp.
    now = datetime.now(timezone.utc).isoformat()
    new = []
    for it in items:
        # Skip anything whose fingerprint is already stored.
        if conn.execute(
            "SELECT 1 FROM items WHERE fingerprint=?",
            (it.fingerprint,),
        ).fetchone():
            continue
        conn.execute(
            "INSERT INTO items VALUES (?,?,?,?,?,?,?)",
            (it.fingerprint, it.source, it.title, it.url,
             it.published.isoformat(), it.summary, now),
        )
        new.append(it)
    conn.commit()
    return new
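To see the idempotence claim in action, here is a self-contained sketch against an in-memory database. It is a variant, not the code above: it leans on the primary key with `INSERT OR IGNORE` where the original uses a SELECT-then-INSERT check. The schema, column names, and function name are assumptions.

```python
import sqlite3
from datetime import datetime, timezone

# Hypothetical schema: seven columns, fingerprint as the primary key.
SCHEMA = """
CREATE TABLE IF NOT EXISTS items (
    fingerprint TEXT PRIMARY KEY,
    source TEXT, title TEXT, url TEXT,
    published TEXT, summary TEXT, fetched_at TEXT
)
"""


def store_new_rows(conn: sqlite3.Connection, rows: list[tuple]) -> list[str]:
    """Insert rows, letting the primary key reject duplicates.

    Returns the fingerprints that were actually inserted.
    """
    now = datetime.now(timezone.utc).isoformat()
    inserted = []
    for row in rows:
        cur = conn.execute(
            "INSERT OR IGNORE INTO items VALUES (?,?,?,?,?,?,?)",
            (*row, now),
        )
        if cur.rowcount == 1:  # 0 means the fingerprint already existed
            inserted.append(row[0])
    conn.commit()
    return inserted


conn = sqlite3.connect(":memory:")
conn.execute(SCHEMA)
row = ("f" * 64, "example", "A title", "https://example.com/a",
       "2026-04-29T00:00:00+00:00", "A summary")
first = store_new_rows(conn, [row])
second = store_new_rows(conn, [row])
total = conn.execute("SELECT COUNT(*) FROM items").fetchone()[0]
```

Both approaches end with one row per fingerprint. The variant saves a query per item, while the explicit SELECT keeps the skip logic visible in Python.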
Decisions I'll defend on a whiteboard
Every choice has a "why."
- Why SHA-256, not a softer dedup?
- Title-based dedup is fragile (publishers retitle posts mid-flight). URL-fingerprint is bit-exact and cheap. Boring decision, but boring is correct here.
- Why TF-IDF + SVD instead of a giant transformer?
- For 4,000 short documents the variance is in the vocabulary, not the deep semantics. TF-IDF + SVD takes 800 ms cold; sentence-transformers takes a minute and adds nothing the eye can see in 2-D. Pick the simplest model that solves the problem.
- Why SQLite, not a real database?
- One file. No server. No connection pool. Backups are `cp`. For a single-writer pipeline this is the right answer. Switching to Postgres would buy zero capability and cost one ops headache.
- Why publish via git push?
- It's already the source of truth for the site. Routing the data through it means the website state is reproducible from a single commit hash — no separate "freshness pipeline" to monitor.
Stack
Ingestion: Python · feedparser · httpx · regex
Storage: SQLite (one file, single-writer)
Embeddings: scikit-learn · TF-IDF · TruncatedSVD · .npz cache
Projection: NumPy SVD, 2-D principal components
Scheduling: Windows Task Scheduler · 6-hour interval
Publish: git push → Vercel autodeploy → Astro static render
DIY
Want to run it yourself?
The whole thing is one Python file, one SQLite DB, and a batch script. No Docker. No cloud account. Five minutes from install to first crawl.
# What it looks like end-to-end:
python -m venv .venv && .venv\Scripts\activate
pip install -r requirements.txt
python crawler.py
# → SQLite + markdown archive populated. Done.
In crawler.py, point ARXIV_FEEDS and BLOG_FEEDS at any RSS feeds and the same pipeline reads whatever you give it. The dedup, embeddings, and publish loop are all source-agnostic.
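For illustration, the two settings might look like the following, assuming they are plain lists of feed URLs (the source does not show their shape). The example.com URLs are placeholders and `all_feeds` is a hypothetical helper, not part of crawler.py.

```python
# Placeholder URLs: swap in the RSS feeds you want the pipeline to read.
ARXIV_FEEDS = [
    "https://example.com/arxiv/cs.AI.rss",
]
BLOG_FEEDS = [
    "https://example.com/blog/feed.xml",
    "https://example.org/news/rss",
]


def all_feeds():
    """Yield (kind, url) pairs, skipping any URL listed twice."""
    seen = set()
    for kind, urls in (("arxiv", ARXIV_FEEDS), ("blog", BLOG_FEEDS)):
        for url in urls:
            if url not in seen:
                seen.add(url)
                yield kind, url


feeds = list(all_feeds())
```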
Email me for the source →
For your organization
Want this on your data?
Research literature. Contract repositories. Customer call transcripts. Regulatory filings. Internal wikis. The architecture above is source-agnostic — I swap the adapters, you get the map.