How we find and process job postings
Job Seek is a company-tracking tool — it monitors thousands of company career pages so users can build watchlists and get alerts on new postings. This page documents exactly how our crawler behaves, the controls that keep it polite, and how jobs ultimately land in the index.
Crawling assurances
Respectful pacing.
Every retry window uses exponential backoff so we never hammer an origin, and we bail if a host keeps timing out.
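The retry behaviour described above can be sketched as follows. This is a minimal illustration, not the production code: the function name `fetch_with_backoff`, the attempt cap, and the one-second base delay are assumptions chosen for the example.

```python
import time
from typing import Callable, Optional


def fetch_with_backoff(
    fetch: Callable[[], str],
    max_attempts: int = 4,      # assumed cap, not a documented value
    base_delay: float = 1.0,    # assumed first wait, in seconds
    sleep: Callable[[float], None] = time.sleep,
) -> Optional[str]:
    """Call fetch(); after each timeout wait base_delay * 2**attempt, then retry.

    Returns None once max_attempts calls have all timed out (the "bail" case).
    """
    for attempt in range(max_attempts):
        try:
            return fetch()
        except TimeoutError:
            if attempt == max_attempts - 1:
                break  # the host keeps timing out, so give up
            sleep(base_delay * 2 ** attempt)  # waits of 1s, 2s, 4s, ...
    return None
```

Passing `sleep` as a parameter keeps the sketch easy to exercise without real waiting; a caller would normally leave it at `time.sleep`.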
Robots, attribution, and TDM reservation.
Our crawler reads robots.txt, honours disallow rules, identifies itself via User-Agent, and respects the W3C TDM-Reservation header: if a page signals reservation, we skip it.
One page per minute.
Even after discovery we retrieve job detail pages at a strict limit of one request per site per minute.
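To make the three checks above concrete, here is a hedged sketch of a politeness gate. The user-agent string `JobSeekBot`, the helper names, and the interpretation of a `tdm-reservation: 1` header value as "reserved" are assumptions made for illustration; the real crawler may structure this differently.

```python
import time
from urllib.parse import urlsplit
from urllib.robotparser import RobotFileParser

USER_AGENT = "JobSeekBot"   # hypothetical name, used only in this example
MIN_INTERVAL = 60.0         # one request per site per minute


def allowed_by_robots(robots_txt: str, url: str) -> bool:
    """Check a URL against the text of an already downloaded robots.txt."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(USER_AGENT, url)


def tdm_reserved(headers: dict) -> bool:
    """True when the response carries a TDM-Reservation header set to 1 (assumed)."""
    for name, value in headers.items():
        if name.lower() == "tdm-reservation":
            return value.strip() == "1"
    return False


class PerHostLimiter:
    """Tracks the last request time per host and reports how long to wait."""

    def __init__(self, min_interval: float = MIN_INTERVAL, clock=time.monotonic):
        self._min_interval = min_interval
        self._clock = clock
        self._last = {}

    def wait_time(self, url: str) -> float:
        last = self._last.get(urlsplit(url).netloc)
        if last is None:
            return 0.0  # first request to this host
        return max(0.0, self._min_interval - (self._clock() - last))

    def record(self, url: str) -> None:
        self._last[urlsplit(url).netloc] = self._clock()
```

A caller would check `allowed_by_robots` before requesting a page, sleep for `wait_time(url)` seconds, call `record(url)` after the request, and discard the response if `tdm_reserved` returns true.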
How postings enter the index
We look for structured feeds before scraping raw HTML. First we check for sitemaps, then client-side JSON APIs, and only parse full pages when neither exists.
- Sitemap first. We look for a sitemap that already lists every careers or job detail page, ideally linked from robots.txt, and rely on it whenever possible.
- Client APIs second. If no sitemap exists, we inspect the client application for JSON APIs it calls; when found, we hit those endpoints directly to enumerate posting URLs without scraping the DOM.
- Graceful page parsing. As a last resort we parse the careers pages themselves, preferring newest-first sorts and stopping once previously indexed roles reappear instead of crawling every page.
- Selective storage. Once we fetch an individual posting we store only the job-specific metadata (title, role description, location, compensation notes, posting URL, and timestamps) plus extracted structured fields. We do not archive unrelated site content.
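Two of the steps in the list above lend themselves to a short sketch: reading posting URLs out of a sitemap, and stopping a newest-first crawl as soon as an already indexed posting reappears. The function names and the shape of the `pages` input are assumptions for illustration only.

```python
import xml.etree.ElementTree as ET
from typing import Iterable, List, Set

# Namespace used by the standard sitemap protocol.
SITEMAP_NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"


def urls_from_sitemap(xml_text: str) -> List[str]:
    """Return every <loc> URL listed in a sitemap document."""
    root = ET.fromstring(xml_text)
    return [loc.text.strip() for loc in root.iter(f"{SITEMAP_NS}loc") if loc.text]


def new_postings(pages: Iterable[List[str]], seen: Set[str]) -> List[str]:
    """Walk newest-first listing pages and stop at the first known URL.

    `pages` is assumed to yield one list of posting URLs per listing page,
    so later pages are never requested once a known posting shows up.
    """
    fresh: List[str] = []
    for page in pages:
        for url in page:
            if url in seen:
                return fresh  # everything after this is already indexed
            fresh.append(url)
    return fresh
```

Because `new_postings` accepts any iterable, a lazy generator that fetches one listing page at a time works well here: pages beyond the first known posting are simply never requested.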
We strongly encourage publishing an easily discoverable sitemap for your careers section. Without it, we periodically send lightweight HEAD requests to previously discovered job URLs to confirm they are still live, which introduces unnecessary traffic.
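For illustration, a liveness check of this kind might look like the sketch below. The user-agent string, the function names, and the mapping from status codes to states (2xx as live, 404 and 410 as gone) are assumptions rather than a description of the deployed logic.

```python
import urllib.error
import urllib.request

USER_AGENT = "JobSeekBot"  # hypothetical name, used only in this example


def head_status(url: str, timeout: float = 10.0) -> int:
    """Send a HEAD request (no response body) and return the HTTP status code."""
    request = urllib.request.Request(
        url, method="HEAD", headers={"User-Agent": USER_AGENT}
    )
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as err:
        return err.code  # 4xx and 5xx responses arrive as exceptions


def posting_state(status: int) -> str:
    """Assumed mapping from a status code to the state of a posting."""
    if 200 <= status < 300:
        return "live"
    if status in (404, 410):
        return "gone"
    return "unknown"  # e.g. server errors: retry later rather than delist
```

A sitemap makes most of these requests unnecessary, because a posting that disappears from the sitemap can be treated as closed without probing its URL.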
Opt-out or questions
If you notice unexpected activity from our crawler or prefer that your careers site not be indexed, please email us at business@colophon-group.org and we will respond promptly.
Our stance on automation
We oppose handing hiring or job-search decisions over to black-box automation, whether on the employer or applicant side. Every outbound link we share includes utm_source=jobseek so recruiters recognise the traffic, and we continuously review usage patterns and add friction to deter scripted applications.
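As a small illustration of the link tagging mentioned above, the sketch below appends utm_source=jobseek to a posting URL. The function name `tag_outbound` is hypothetical, and replacing an existing utm_source value instead of duplicating it is an assumption of this example.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def tag_outbound(url: str) -> str:
    """Return the URL with utm_source=jobseek set, keeping other parameters."""
    parts = urlsplit(url)
    # Drop any existing utm_source so the parameter appears exactly once.
    query = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key != "utm_source"
    ]
    query.append(("utm_source", "jobseek"))
    return urlunsplit(parts._replace(query=urlencode(query)))
```

For example, `tag_outbound("https://example.com/jobs/42")` yields `https://example.com/jobs/42?utm_source=jobseek`.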
Open-source crawlers
Transparency matters, so the code for our job link collection service and extraction pipeline is open source. Browse the repository at github.com/colophon-group/jobseek-indexing.
Need to reach us?
If you notice unusual crawler behaviour, prefer that we do not index your content, or have suggestions on how to improve our safeguards, please reach out at business@colophon-group.org.
