
Data Engineer

SRJC Course Outline Archive

Turned hundreds of unstructured catalog pages into a queryable archive.

Course outline data at a community college is public but not queryable: it lives in hundreds of individual web pages with no structured export. This pipeline scrapes and normalizes the full Santa Rosa Junior College catalog, extracting course codes, titles, units, objectives, and outcomes into a consistent archive built for curriculum research.


The problem

Course outline data was public but unqueryable, scattered across hundreds of catalog pages with no structured export.

What we built

A scrape-and-normalize pipeline extracts course codes, titles, units, objectives, and outcomes from the full SRJC catalog into a consistent archive ready for curriculum analysis.
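To give a sense of what the normalize step might look like, here is a minimal Node.js sketch. It is an illustration under stated assumptions, not the project's actual code: the function names (`cleanText`, `splitList`, `normalizeCourse`), the shape of the raw input, and the sample record are all hypothetical. It assumes the scraper has already pulled each field off a page as a loose string, and it shows one way to bring those strings into a consistent record.

```javascript
// Hypothetical sketch: the field names and input shape are assumptions,
// not the archive's real schema.

// Collapse any run of whitespace (newlines, tabs, non-breaking spaces) to one space.
function cleanText(value) {
  return String(value ?? "").replace(/\s+/g, " ").trim();
}

// Split a scraped text block into list items, one per line,
// dropping leading numbering ("1.", "2)") or bullet characters.
function splitList(value) {
  return String(value ?? "")
    .split(/\r?\n/)
    .map((line) => cleanText(line).replace(/^(?:\d+[.)]|[-*•])\s*/, ""))
    .filter((line) => line.length > 0);
}

// Turn one raw scraped record into a consistently shaped course record.
function normalizeCourse(raw) {
  const units = Number.parseFloat(cleanText(raw.units));
  return {
    code: cleanText(raw.code).toUpperCase(),
    title: cleanText(raw.title),
    // Keep units numeric; use null when the page gives nothing parseable.
    units: Number.isNaN(units) ? null : units,
    objectives: splitList(raw.objectives),
    outcomes: splitList(raw.outcomes),
  };
}

// Made-up sample input, for illustration only.
const sample = normalizeCourse({
  code: " xyz  101 ",
  title: "Intro to\n  Example Studies",
  units: "4.00 ",
  objectives: "1. Describe the topic\n2. Apply the method\n",
  outcomes: "- Explain key terms",
});

console.log(JSON.stringify(sample, null, 2));
```

Keeping normalization as pure functions over plain strings, separate from the fetching code, means the same rules can be re-run over saved pages without hitting the catalog again.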

What was delivered

Outcomes

Services: Node.js, Scraping, Structured extraction