Data Engineer
SRJC Course Outline Archive
Turned hundreds of unstructured catalog pages into a queryable archive.
Course outline data at a community college is public but not queryable: it lives in hundreds of individual web pages with no structured export. This pipeline scrapes and normalizes the full Santa Rosa Junior College catalog, extracting course codes, titles, units, objectives, and outcomes into a consistent archive built for curriculum research.
The problem
Each course outline sat on its own catalog page with no structured export, so answering any cross-catalog question meant collecting data by hand from hundreds of pages.
What we built
A scrape-and-normalize pipeline extracts course codes, titles, units, objectives, and outcomes from the full SRJC catalog into a consistent archive ready for curriculum analysis.
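To make the normalize step concrete, here is a minimal Node.js sketch of how scraped field text might be turned into a consistent record. The raw field names, the course code pattern, and the helper names (collapseWhitespace, normalizeCode, parseUnits, splitList, normalizeCourse) are illustrative assumptions, not the project's actual code.

```javascript
// Sketch only: the raw record shape and all helper names here are hypothetical.

// Collapse runs of whitespace (including newlines) and trim the ends.
function collapseWhitespace(text) {
  return String(text ?? "").replace(/\s+/g, " ").trim();
}

// Assumed code pattern: department letters, a number, optional letter suffix.
// Returns null when the text does not look like a course code.
function normalizeCode(rawCode) {
  const cleaned = collapseWhitespace(rawCode).toUpperCase();
  const match = cleaned.match(/^([A-Z]+)\s*(\d+(?:\.\d+)?[A-Z]*)$/);
  return match ? `${match[1]} ${match[2]}` : null;
}

// Pull the first number out of text such as "4.00 Units"; null if none is found.
function parseUnits(rawUnits) {
  const match = collapseWhitespace(rawUnits).match(/\d+(?:\.\d+)?/);
  return match ? Number(match[0]) : null;
}

// Split a multi-line block into items, dropping list markers like "1." or "-".
function splitList(rawBlock) {
  return String(rawBlock ?? "")
    .split(/\r?\n/)
    .map((line) => collapseWhitespace(line.replace(/^\s*(?:\d+[.)]|[-•*])\s*/, "")))
    .filter((line) => line.length > 0);
}

// Build one consistent record from the raw text scraped off a catalog page.
function normalizeCourse(raw) {
  return {
    code: normalizeCode(raw.code),
    title: collapseWhitespace(raw.title),
    units: parseUnits(raw.units),
    objectives: splitList(raw.objectives),
    outcomes: splitList(raw.outcomes),
  };
}

// Example input, invented for illustration.
const example = normalizeCourse({
  code: " cs  10a ",
  title: "Intro  to\nProgramming ",
  units: "4.00 Units",
  objectives: "1. Write programs.\n2. Debug code.\n",
  outcomes: "",
});
console.log(example);
```

Returning null for unparseable codes and units, rather than guessing, keeps bad pages visible in the archive so they can be reviewed instead of silently skewing analysis.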
What was delivered
- Scraper for the full SRJC course catalog
- Normalization logic for course metadata fields
- Consistent archive output for downstream research
- Operational pattern for refresh cycles
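The write-up does not say how refresh cycles detect changes. One common approach, sketched here purely as an assumption, is to fingerprint each normalized record and compare fingerprints between runs. The names fingerprint and diffRefresh and the record fields are hypothetical; createHash comes from Node's built-in crypto module.

```javascript
// Sketch only: one possible refresh pattern, not the project's documented method.
const { createHash } = require("node:crypto");

// Hash the fields in a fixed order so the same content always gives the same digest.
function fingerprint(course) {
  const canonical = JSON.stringify([
    course.code,
    course.title,
    course.units,
    course.objectives,
    course.outcomes,
  ]);
  return createHash("sha256").update(canonical).digest("hex");
}

// Compare the previous run's hashes (Map of code -> hash) with the current records.
function diffRefresh(previousHashes, currentCourses) {
  const nextHashes = new Map();
  const added = [];
  const changed = [];
  for (const course of currentCourses) {
    const hash = fingerprint(course);
    nextHashes.set(course.code, hash);
    if (!previousHashes.has(course.code)) {
      added.push(course.code);
    } else if (previousHashes.get(course.code) !== hash) {
      changed.push(course.code);
    }
  }
  // Codes present last time but missing now were removed from the catalog.
  const removed = [...previousHashes.keys()].filter((code) => !nextHashes.has(code));
  return { added, changed, removed, nextHashes };
}

// Example with invented records: the first run sees everything as added.
const firstRun = diffRefresh(new Map(), [
  { code: "CS 10A", title: "Intro to Programming", units: 4, objectives: [], outcomes: [] },
]);
console.log(firstRun.added);
```

Persisting nextHashes after each run gives the following refresh a baseline, so only added, changed, or removed courses need to be rewritten in the archive.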
Outcomes
- Hours of manual collection replaced with a repeatable pipeline
- Analysis-ready archive across the full catalog
- Reliable refresh cycles for ongoing updates
- Reusable pattern for similar institutional datasets
Services: Node.js, Scraping, Structured extraction