How It Works


Every day at 5 AM UTC, an automated pipeline pairs a major world news story with a compelling historical parallel from old American newspapers. At its core, the system matches the day's top stories against a historical newspaper archive using embedding-based similarity search. A language model handles selection and fixes scanning errors in old text, but all displayed content originates from news feeds, historical newspapers, and Wikipedia.

The Daily Pipeline

1. News Selection

The pipeline reads today’s headlines from BBC, CNN, DW, NPR, France 24, and Sky News. A large language model picks a handful of the most globally significant stories, then prepares several search angles for each one: the core topic, key terms, and the broader themes at play. Stories that overlap with those featured in recent days are excluded so the same story arc doesn’t repeat.

Under the hood

Headlines are pulled via RSS feeds. The model reads all items in a single prompt with a cached system message and outputs structured JSON: the selected story, a cleaned headline for search, key terms, and narrative themes.
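As a rough illustration of what consuming that structured output could look like, here is a minimal Python sketch. The field names (headline, search_headline, key_terms, themes) and the top-level "stories" key are assumptions made for this example; the actual schema is not shown on this page.

```python
from __future__ import annotations

import json
from dataclasses import dataclass


@dataclass
class StorySelection:
    """One selected story. All field names are hypothetical."""
    headline: str          # headline as it appeared in the feed
    search_headline: str   # cleaned headline used for the archive search
    key_terms: list[str]   # key terms for the keyword search
    themes: list[str]      # narrative themes for the pattern search


def parse_selection(raw: str) -> list[StorySelection]:
    """Parse the model's JSON reply.

    Raises KeyError if an expected field is missing and
    json.JSONDecodeError if the reply is not valid JSON.
    """
    payload = json.loads(raw)
    return [
        StorySelection(
            headline=item["headline"],
            search_headline=item["search_headline"],
            key_terms=list(item["key_terms"]),
            themes=list(item["themes"]),
        )
        for item in payload["stories"]  # "stories" key is an assumption
    ]
```

Failing loudly on a missing field keeps a malformed reply from silently producing an empty search.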

2. Searching Old Newspapers

The system searches through tens of thousands of historical newspaper articles to find ones that resonate with today’s news. Three searches run in parallel, each looking for a different kind of similarity:

Headline match: articles with similar wording
Keyword match: articles about the same key topics
Pattern match: articles with the same kind of event, regardless of who was involved (e.g., “leader threatens rival nation” matches even if the leaders and nations are different)

The results are combined into about 25 candidates, with recently used articles filtered out to keep things fresh.

Under the hood

Each search query is converted into a vector embedding (a numerical representation of its meaning). For the pattern match, named entity recognition replaces specific names and places with placeholders, so “Blair pushes case for Iraq intervention” becomes something like “[PERSON] pushes case for [COUNTRY] intervention”, allowing structural matching across eras.
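The masking step can be sketched in a few lines of Python. This version assumes the entity spans have already been produced by an NER model and are passed in as (start, end, label) character offsets; the function name and the label strings are illustrative, not the pipeline's actual code.

```python
def mask_entities(text: str, spans: list[tuple[int, int, str]]) -> str:
    """Replace each (start, end, label) character span with a [LABEL] placeholder."""
    parts = []
    cursor = 0
    for start, end, label in sorted(spans):
        if start < cursor:
            continue  # skip a span that overlaps one already masked
        parts.append(text[cursor:start])  # untouched text before the entity
        parts.append(f"[{label}]")        # placeholder instead of the name
        cursor = end
    parts.append(text[cursor:])           # untouched text after the last entity
    return "".join(parts)


headline = "Blair pushes case for Iraq intervention"
# Spans are hand-written here; in practice they would come from an NER model.
spans = [(0, 5, "PERSON"), (22, 26, "COUNTRY")]
print(mask_entities(headline, spans))
# prints: [PERSON] pushes case for [COUNTRY] intervention
```

The masked string, rather than the original headline, is what would then be embedded for the pattern search.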

Vector search uses cosine similarity on a dedicated vector database. Each strategy returns its top results, which are then merged: a number of slots is reserved for each strategy, and the remaining slots are filled until the list reaches about 25 candidates.
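A minimal sketch of the two ideas above, cosine similarity and a merge with reserved slots, is shown below. The slot count per strategy, the rule for filling the remainder (best remaining score across all strategies), and the function names are assumptions for illustration; only the overall shape comes from the description above.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def merge_with_reserved_slots(results, reserved=5, total=25, exclude=frozenset()):
    """Merge per-strategy rankings into one candidate list.

    results: dict mapping strategy name -> list of (article_id, score),
             each list sorted by score, best first.
    exclude: article ids used recently, never returned.
    """
    picked = []
    seen = set(exclude)
    # Pass 1: every strategy gets up to `reserved` of its own best results.
    for ranked in results.values():
        taken = 0
        for article_id, score in ranked:
            if taken >= reserved:
                break
            if article_id in seen:
                continue
            seen.add(article_id)
            picked.append((article_id, score))
            taken += 1
    # Pass 2: fill the remaining slots with the best leftovers from any strategy.
    leftovers = sorted(
        (item for ranked in results.values() for item in ranked if item[0] not in seen),
        key=lambda item: item[1],
        reverse=True,
    )
    for article_id, score in leftovers:
        if len(picked) >= total:
            break
        if article_id in seen:
            continue  # same article surfaced by two strategies
        seen.add(article_id)
        picked.append((article_id, score))
    return picked[:total]
```

Reserving slots keeps one strategy with uniformly high scores from crowding out the other two before the reranker sees the list.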

3. Picking the Best Match

A large language model reviews the candidates and picks the most compelling parallel: the pair where you’d most think “history repeats itself.” It looks for the same kind of event and similar phrasing, but with different people and a different era. If the winning article has garbled text from the original newspaper scanning process, a separate pass fixes character errors without rewriting any words.

Under the hood

The model acts as a reranker with strict criteria: headline echo > event parallel > thematic parallel. It sees each candidate’s headline, similarity score, themes, and a text excerpt, then picks a winner with a brief rationale. Text cleanup corrects garbled characters introduced by the newspaper scanning process, such as “tbe” for “the”, and does not rephrase anything.
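One way to guard the "fix characters, don't rephrase" rule is to compare the cleaned text with the original word by word. The check below is a hypothetical safeguard written for this explanation, not a description of the pipeline's own code; the function name and the 0.5 threshold are assumptions.

```python
from difflib import SequenceMatcher


def is_character_level_fix(original: str, cleaned: str, min_word_ratio: float = 0.5) -> bool:
    """Return True if `cleaned` looks like a character-level repair of `original`.

    The word count must be unchanged, and every word must stay reasonably
    similar to the word it replaces (case-insensitive similarity ratio).
    """
    orig_words = original.split()
    clean_words = cleaned.split()
    if len(orig_words) != len(clean_words):
        return False  # words were added, removed, split, or merged
    for before, after in zip(orig_words, clean_words):
        ratio = SequenceMatcher(None, before.lower(), after.lower()).ratio()
        if ratio < min_word_ratio:
            return False  # word was replaced, not repaired
    return True


print(is_character_level_fix("tbe senate met", "the senate met"))         # True
print(is_character_level_fix("tbe senate met", "the congress convened"))  # False
```

The equal-word-count rule is deliberately strict: it would also reject a legitimate repair that rejoins a word split by the scanner, so a real check might relax it.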

4. Timeline

Finally, a “What Happened Next” timeline is built: real events that followed the historical moment. Timeline excerpts are sourced from Wikipedia articles, selected and fixed by a language model but not rephrased. The goal is verbatim Wikipedia text, though the selection process means minor differences can occur.

Under the hood

A language model runs a multi-step tool-use loop, searching Wikipedia, reading article sections, verifying dates fall after the original newspaper article, and confirming chronological order. The same model then selects 2–3 key sentences verbatim from each Wikipedia article. A validation pass checks every entry for date accuracy and relevance, repairing or dropping ones that don’t hold up.
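The date checks in that validation pass can be sketched as follows. This is a simplified illustration that only drops entries dated on or before the article and sorts the rest; the entry fields and function name are assumptions, and the relevance check and repair step described above are not modelled.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class TimelineEntry:
    """One "What Happened Next" item. Field names are hypothetical."""
    event_date: date
    excerpt: str


def validate_timeline(entries: list[TimelineEntry], article_date: date) -> list[TimelineEntry]:
    """Keep only events dated after the article, in chronological order."""
    kept = [entry for entry in entries if entry.event_date > article_date]
    return sorted(kept, key=lambda entry: entry.event_date)


entries = [
    TimelineEntry(date(1925, 1, 1), "later event"),
    TimelineEntry(date(1919, 1, 1), "predates the article"),
    TimelineEntry(date(1921, 6, 1), "earlier event"),
]
for entry in validate_timeline(entries, article_date=date(1920, 5, 1)):
    print(entry.event_date.isoformat(), entry.excerpt)
# prints:
# 1921-06-01 earlier event
# 1925-01-01 later event
```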

Building the Corpus

Before the daily pipeline could run, tens of thousands of historical articles had to be prepared. The source is American Stories, a public research dataset from Harvard (CC-BY 4.0) containing digitized American newspapers. A one-time process gets them ready for searching:

Extraction: front-page articles with legible text are pulled from the dataset
Cleanup: a language model fixes scanning errors and filters out ads, obituaries, and sports scores
Indexing: each article is converted into a searchable format and stored

Under the hood

The dataset provides properly segmented newspaper articles with separate headline and body text. Enrichment runs through the Batch API with structured output. Each article gets both a headline embedding and an entity-masked embedding (via spaCy NER).

Data Sources

Historical articles: American Stories (Harvard, CC-BY 4.0), properly segmented articles from newspapers across the country.
Newspaper images: Chronicling America at the Library of Congress, original newspaper page scans (public domain).
Timeline sources: Wikipedia. Excerpts are sourced from Wikipedia articles, selected by a language model.
Current news: Headlines and excerpts from public RSS feeds of major outlets. Only headlines and excerpts are shown, with a link to the original. Full text is never republished.