Indexing policy

How we find and process job postings

Job Seek is a company-tracking tool — it monitors thousands of company career pages so users can build watchlists and get alerts on new postings. This page documents exactly how our crawler behaves, the controls that keep it polite, and how jobs ultimately land in the index.

Crawling assurances

Respectful pacing.
Every retry uses exponential backoff so we never hammer an origin, and we stop trying a host that keeps timing out.
Robots, attribution, and TDM reservation.
Our crawler reads robots.txt, honours disallow rules, identifies itself via User-Agent, and respects the W3C TDM-Reservation header—if a page signals reservation, we skip it.
One page per minute.
Even after discovery we retrieve job detail pages at a strict limit of one request per site per minute.
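
As an illustration only, the three assurances above could be expressed as small checks like the following. This is a minimal sketch, not the code the crawler runs: the "JobSeekBot" token, the function names, and the backoff parameters (base, factor, cap, retry count) are assumptions made for the example. The header check assumes the TDM Reservation Protocol convention that a "tdm-reservation" value of 1 means rights are reserved.

```python
import urllib.robotparser

USER_AGENT = "JobSeekBot"  # hypothetical token; the real User-Agent string is not stated here
MIN_INTERVAL_S = 60.0      # one request per site per minute, as described above


def backoff_delays(base_s=1.0, factor=2.0, max_retries=5, cap_s=300.0):
    """Return the wait before each retry: base, base*factor, ... capped at cap_s."""
    return [min(cap_s, base_s * factor ** attempt) for attempt in range(max_retries)]


def allowed_by_robots(robots_txt, url, user_agent=USER_AGENT):
    """Check a URL against robots.txt text that has already been downloaded."""
    parser = urllib.robotparser.RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


def tdm_reserved(headers):
    """True when the response headers carry 'tdm-reservation: 1' (names are case-insensitive)."""
    lowered = {name.lower(): value.strip() for name, value in headers.items()}
    return lowered.get("tdm-reservation") == "1"


def seconds_until_next_fetch(last_fetch_s, now_s, min_interval_s=MIN_INTERVAL_S):
    """How long to wait before the same site may be requested again."""
    if last_fetch_s is None:
        return 0.0  # first request to this site
    return max(0.0, min_interval_s - (now_s - last_fetch_s))
```

A scheduler built on these helpers would skip any URL where allowed_by_robots is false or tdm_reserved is true, and sleep for seconds_until_next_fetch before each request to the same site.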


How postings enter the index

We look for structured feeds before scraping raw HTML. First we check for sitemaps, then client-side JSON APIs, and only parse full pages when neither exists.

Sitemap first. We look for a sitemap that already lists every careers or job detail page—ideally linked from robots.txt—and rely on it whenever possible.
Client APIs second. If no sitemap exists we inspect the client application for JSON APIs it calls; when found we hit those endpoints directly to enumerate posting URLs without scraping the DOM.
Graceful page parsing. As a last resort we parse the careers pages themselves, preferring newest-first sorts and stopping once previously indexed roles reappear instead of crawling every page.
Selective storage. Once we fetch an individual posting we store only the job-specific metadata (title, role description, location, compensation notes, posting URL, and timestamps) plus extracted structured fields. We do not archive unrelated site content.
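
The discovery order above can be sketched roughly as follows. This is a hedged illustration under stated assumptions, not the pipeline's actual code: all function names are hypothetical, the sitemap is assumed to be a plain urlset in the standard sitemaps.org namespace, and listing pages are assumed to arrive newest first as lists of posting URLs.

```python
import xml.etree.ElementTree as ET

# Namespace used by standard sitemap files (sitemaps.org protocol).
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def choose_strategy(has_sitemap, has_client_api):
    """Pick the cheapest discovery source available, in the order described above."""
    if has_sitemap:
        return "sitemap"
    if has_client_api:
        return "client_api"
    return "page_parsing"


def urls_from_sitemap(xml_text):
    """Extract every <loc> entry from a sitemap <urlset> document."""
    root = ET.fromstring(xml_text)
    return [loc.text.strip() for loc in root.findall("sm:url/sm:loc", SITEMAP_NS) if loc.text]


def collect_new_postings(pages, already_indexed):
    """Walk newest-first listing pages and stop at the first known posting.

    `pages` may be a lazy iterable, so later pages are never requested
    once a previously indexed URL reappears.
    """
    new_urls = []
    for page in pages:
        for url in page:
            if url in already_indexed:
                return new_urls
            new_urls.append(url)
    return new_urls
```

Passing a generator as `pages` keeps the last-resort path cheap: a site with hundreds of listing pages costs only as many requests as it takes to reach the first posting that is already indexed.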

We strongly encourage publishing an easily discoverable sitemap for your careers section. Without it, we periodically send lightweight HEAD requests to previously discovered job URLs to confirm they are still live, which adds traffic that a sitemap would make unnecessary.
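
For readers curious what such a liveness check looks like, here is a minimal sketch using only the Python standard library. The function names, the "JobSeekBot" token, the timeout, and the rule for which status codes count as "gone" are all assumptions for illustration; the crawler's real rules may differ.

```python
import urllib.error
import urllib.request


def head_status(url, user_agent="JobSeekBot", timeout_s=10.0):
    """Send a HEAD request (no response body is transferred) and return the status code."""
    request = urllib.request.Request(url, method="HEAD", headers={"User-Agent": user_agent})
    try:
        with urllib.request.urlopen(request, timeout=timeout_s) as response:
            return response.status
    except urllib.error.HTTPError as error:
        # urlopen raises for 4xx/5xx responses; the code is still what we want.
        return error.code


def classify(status):
    """Map a status code to a liveness verdict (assumed rule, for illustration)."""
    if 200 <= status < 300:
        return "live"
    if status in (404, 410):
        return "gone"
    return "unknown"  # e.g. 403 or 5xx: worth retrying later rather than delisting
```

A posting would only be marked closed on a "gone" verdict, while "unknown" results would be retried later under the same pacing limits as any other request.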

Opt-out or questions

If you notice unexpected activity from our crawler or prefer that your careers site not be indexed, please email us at business@colophon-group.org and we will respond promptly.

Our stance on automation

We oppose handing hiring or job-search decisions over to black-box automation, whether on the employer or applicant side. Every outbound link we share includes utm_source=jobseek so recruiters recognise the traffic, and we continuously review usage patterns and add friction to deter scripted applications.
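
The link tagging mentioned above can be illustrated with a short sketch. The function name is hypothetical, and replacing any utm_source value already present on the URL is an assumption of this example, not a documented behaviour.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def tag_outbound(url, source="jobseek"):
    """Return the URL with utm_source set, keeping other query parameters and the fragment."""
    parts = urlsplit(url)
    # Drop any existing utm_source so the parameter appears exactly once.
    query = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key != "utm_source"
    ]
    query.append(("utm_source", source))
    return urlunsplit(parts._replace(query=urlencode(query)))
```

For example, tag_outbound("https://example.com/jobs?id=7#apply") keeps the id parameter and the fragment and appends utm_source=jobseek to the query string.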

Open-source crawlers

Transparency matters, so the code for our job link collection service and extraction pipeline is open source. Browse the repository at github.com/colophon-group/jobseek-indexing.

Need to reach us?

If you notice unusual crawler behaviour, prefer that we do not index your content, or have suggestions on how to improve our safeguards, please reach out at business@colophon-group.org.