Data Engineer

SRJC Course Outline Archive

Turned hundreds of unstructured catalog pages into a queryable archive.

Course outline data at a community college is public but not queryable: it lives in hundreds of individual web pages with no structured export. This pipeline scrapes and normalizes the full Santa Rosa Junior College catalog, extracting course codes, titles, units, objectives, and outcomes into a consistent archive built for curriculum research.

The problem

Course outlines were public but scattered across hundreds of catalog pages with no structured export, so answering any question that spanned courses meant hours of manual collection.

What we built

A scrape-and-normalize pipeline extracts course codes, titles, units, objectives, and outcomes from the full SRJC catalog into a consistent archive ready for curriculum analysis.
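
As a rough illustration of the normalize step, here is a minimal sketch in TypeScript. The type names, field names, and cleanup rules are assumptions made for this example, not the project's actual schema or code. It takes the five fields named above as raw scraped strings and returns one consistent record.

```typescript
// Hypothetical shapes for illustration only; not the project's real schema.
interface RawCourse {
  code: string;
  title: string;
  units: string;
  objectives: string[];
  outcomes: string[];
}

interface Course {
  code: string;
  title: string;
  units: number | null; // null when no number can be found
  objectives: string[];
  outcomes: string[];
}

// Collapse whitespace runs (JS \s also matches non-breaking spaces) and trim.
function cleanText(s: string): string {
  return s.replace(/\s+/g, " ").trim();
}

// Uppercase the code and put a single space between subject and number,
// e.g. "cs10a" becomes "CS 10A". Unrecognized shapes are returned cleaned.
function normalizeCode(raw: string): string {
  const code = cleanText(raw).toUpperCase();
  const match = code.match(/^([A-Z]+)\s*(\d[\dA-Z.]*)$/);
  return match ? `${match[1]} ${match[2]}` : code;
}

// Pull the first decimal number out of a units string such as "4.00 Units".
function parseUnits(raw: string): number | null {
  const match = raw.match(/\d+(?:\.\d+)?/);
  return match ? parseFloat(match[0]) : null;
}

// Clean each list item, strip leading numbering or bullets, drop empty items.
function cleanList(items: string[]): string[] {
  return items
    .map((item) => cleanText(item).replace(/^(?:\d+[.)]|[-•*])\s*/, ""))
    .filter((item) => item.length > 0);
}

function normalizeCourse(raw: RawCourse): Course {
  return {
    code: normalizeCode(raw.code),
    title: cleanText(raw.title),
    units: parseUnits(raw.units),
    objectives: cleanList(raw.objectives),
    outcomes: cleanList(raw.outcomes),
  };
}
```

With these rules, a raw code of " cs10a " becomes "CS 10A" and a units string of "4.00 Units" becomes the number 4. Keeping each rule in its own small function makes it easy to adjust one field when the catalog's markup changes.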

What was delivered

Scraper for the full SRJC course catalog
Normalization logic for course metadata fields
Consistent archive output for downstream research
Operational pattern for refresh cycles
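
One way a refresh cycle could work is to compare each new scrape against the previous archive and report what changed. The sketch below is an assumption about how that might look, not a description of the delivered pattern. The `ArchiveRecord` shape and the `diffArchive` name are hypothetical.

```typescript
// Hypothetical record shape: any normalized course keyed by its code.
interface ArchiveRecord {
  code: string;
  [field: string]: unknown;
}

interface RefreshDiff {
  added: string[];
  changed: string[];
  removed: string[];
}

// Compare two snapshots of the archive by course code.
// Assumes both snapshots come from the same normalizer, so object keys are
// in a consistent order and JSON.stringify works as a fingerprint.
function diffArchive(previous: ArchiveRecord[], current: ArchiveRecord[]): RefreshDiff {
  const before = new Map<string, string>();
  for (const record of previous) {
    before.set(record.code, JSON.stringify(record));
  }

  const diff: RefreshDiff = { added: [], changed: [], removed: [] };
  for (const record of current) {
    const old = before.get(record.code);
    if (old === undefined) {
      diff.added.push(record.code);
    } else if (old !== JSON.stringify(record)) {
      diff.changed.push(record.code);
    }
    // Whatever is left in `before` at the end no longer exists in the catalog.
    before.delete(record.code);
  }
  diff.removed = Array.from(before.keys());
  return diff;
}
```

A refresh run would then only need to rewrite the added and changed records and flag the removed ones, which keeps repeat runs cheap and makes catalog changes visible.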

Outcomes

Hours of manual collection replaced with a repeatable pipeline
Analysis-ready archive across the full catalog
Reliable refresh cycles for ongoing updates
Reusable pattern for similar institutional datasets

Services: Node.js, Scraping, Structured extraction
