What this is

A searchable full-text database of every collective bargaining agreement currently in force for New York City municipal employees, sourced from the Office of Labor Relations.

Why it exists

The City publishes its collective bargaining agreements as scattered PDFs. Many are scans of paper documents, so the text isn't selectable and you can't even use Ctrl-F inside them. Comparing how, say, the discipline process works for sanitation workers versus teachers requires opening dozens of PDFs side by side. This site puts every clause from every contract in one searchable, taggable, citable place.

Sources

If a stated term appears expired (e.g. "2017–2025" PBA), the contract is still in force under New York State's Triborough Amendment, which keeps public-sector contracts in effect until a successor is signed. Expirations are flagged but do not mean a contract has lapsed.
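The expiration flag described above can be sketched roughly as follows. This is a minimal illustration, not the site's actual code: the function name term_status, the status labels, and the year-only comparison are all assumptions.

```python
import re
from datetime import date
from typing import Optional

# Matches a stated term such as "2017–2025" or "2017-2025" (en dash or hyphen).
TERM_RE = re.compile(r"(\d{4})\s*[–-]\s*(\d{4})")


def term_status(stated_term: str, today: Optional[date] = None) -> str:
    """Classify a stated term (hypothetical helper, labels are illustrative)."""
    match = TERM_RE.search(stated_term)
    if match is None:
        return "unknown"
    end_year = int(match.group(2))
    today = today or date.today()
    # Assumption: only years are known, so the term is treated as running
    # through the end of its final year.
    if today.year > end_year:
        # Flagged, but still in force under the Triborough Amendment.
        return "expired-but-in-force"
    return "current"
```

A status like "expired-but-in-force" keeps the flag visible without implying that the contract has lapsed.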

Pipeline

  1. Inventory. scripts/inventory.py scrapes the OLR Recent Agreements page, finds every PDF link, and assigns each one an ID, label, source URL, and stated term years.
  2. Download. scripts/scrape.py downloads each PDF to data/pdfs/<id>.pdf.
  3. Extract. scripts/extract.py processes each PDF page-by-page:
    • Tables are detected and rendered as pipe-delimited markdown so column boundaries survive into the search corpus. (Naive PDF text extraction tends to flatten columns into a left-to-right wall of words; this preserves wage schedules, longevity tables, etc.)
    • Multi-column pages are detected via character-position analysis and split into left/right columns before extraction, so reading order is preserved.
    • Each page is scored on word count and alphabetic-character ratio. If a page falls below the threshold (because the source PDF has no text layer), the page is rendered at 300 dpi and run through macOS Vision OCR via the ocrmac Python binding. Vision-based OCR handles real-world contract pages well, including stamps, signatures, and side-by-side tables.
    • OCR'd pages are flagged with the OCR badge in the UI so users know the text was reconstructed and may have minor errors.
  4. Segment. scripts/segment.py splits each contract into clauses using article/section heading regexes (Article I, Section 1, all-caps headings, and numbered headings). Each clause carries an article, section, heading, page number, and OCR flag.
  5. Tag. scripts/tag.py applies a topic taxonomy via keyword regex against each clause: wages, longevity, overtime, holidays, vacation, sick leave, parental leave, health & welfare, pension, grievance, discipline, layoff, hours, shift differential, uniform allowance, training, safety, no-strike, management rights, work rules, union security, recognition, promotion, telework, anti-discrimination, workforce composition. A clause can carry multiple topics.
  6. Index. The frontend loads data/clauses.json and builds a FlexSearch full-text index in the browser. There is no server-side query; everything runs client-side.
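The page scoring in step 3 and the regex-driven segmenting and tagging in steps 4 and 5 can be sketched as below. This is a simplified illustration under stated assumptions: the thresholds, the heading pattern, the three-topic taxonomy, and the function names are hypothetical, and the real scripts/extract.py, scripts/segment.py, and scripts/tag.py may differ.

```python
import re
from typing import Dict, List

# Hypothetical thresholds; the real values are not stated in this document.
MIN_WORDS = 30
MIN_ALPHA_RATIO = 0.6


def needs_ocr(page_text: str) -> bool:
    """Score a page on word count and alphabetic-character ratio."""
    words = page_text.split()
    non_space = [ch for ch in page_text if not ch.isspace()]
    if not non_space:
        return True  # no text layer at all
    alpha_ratio = sum(ch.isalpha() for ch in non_space) / len(non_space)
    return len(words) < MIN_WORDS or alpha_ratio < MIN_ALPHA_RATIO


# Illustrative heading pattern: "Article IV ..." or "Section 3 ..." at line start.
HEADING_RE = re.compile(
    r"^(?:ARTICLE\s+[IVXLC]+|Section\s+\d+)\b.*$",
    re.IGNORECASE | re.MULTILINE,
)


def split_clauses(text: str) -> List[Dict[str, str]]:
    """Split contract text into clauses, one per matched heading."""
    matches = list(HEADING_RE.finditer(text))
    clauses = []
    for i, m in enumerate(matches):
        end = matches[i + 1].start() if i + 1 < len(matches) else len(text)
        clauses.append({"heading": m.group(0).strip(),
                        "text": text[m.end():end].strip()})
    return clauses


# A tiny stand-in for the topic taxonomy (keyword regex per topic).
TOPIC_PATTERNS = {
    "overtime": re.compile(r"\bovertime\b", re.IGNORECASE),
    "sick leave": re.compile(r"\bsick\s+leave\b", re.IGNORECASE),
    "grievance": re.compile(r"\bgrievances?\b", re.IGNORECASE),
}


def tag_topics(clause_text: str) -> List[str]:
    """Return every topic whose pattern matches; a clause can carry several."""
    return [t for t, pat in TOPIC_PATTERNS.items() if pat.search(clause_text)]
```

Keeping each stage a pure function of text makes it easy to re-run one stage on its own when a threshold or pattern changes.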

Coverage beyond the OLR Recent Agreements page

The bulk of this corpus comes from OLR's "Recent Agreements" page, which is the authoritative list of contracts where the City of New York is the direct employer. Several major NYC public-sector unions whose contracts are not on that page are also included, sourced directly from OLR's contract download server, the City University of New York's labor relations page, and the unions themselves:

These five contracts cover roughly 50,000 additional NYC public-sector unionized employees. Each is flagged in the contracts directory with its source ("olr-direct" or "cuny-direct") to distinguish it from the OLR Recent Agreements set.

Still missing: the post-2023 NYSNA H+H pay-parity successor agreement, and any UFA/UFOA award from the current arbitration round. Both are tracked here for future ingestion when published.

Limitations

Wage tracker — verification

The wage pattern tracker shows curated general wage increase (GWI) schedules for the 12 largest contracts. Every percentage and effective date has been verified directly against the OCR'd contract text. Each entry on that page carries a verification badge:

The $3,000 ratification bonus appears only in the civilian-pattern contracts; the uniformed pattern doesn't include it. PBA's prior 2017–2025 settlement also has no $3,000 bonus.

Natural-language Q&A

For natural-language questions across the corpus — "compare the discipline procedures for sanitation workers and teachers," "which contracts have parental leave," "what's the longest grievance timeline" — use the companion NotebookLM notebook, which has been loaded with the same Markdown corpus this site is built from. NotebookLM provides citations back to the source documents and runs on Google's infrastructure (free for users with a Google account).

This site itself is search-only by design: it doesn't call any LLM API at query time, so it stays free to host and free to use. The two tools complement each other. Use the search and topic pivot here for keyword-precise lookups and citations, and use NotebookLM for cross-document synthesis questions.

Markdown export

Every contract is also published as a standalone Markdown file under /data/markdown/<contract-id>.md, with YAML frontmatter for metadata, page numbers and OCR flags inline, and tables preserved as pipe-delimited rows. A single ZIP bundle of all 94 contracts (~750 KB) is also available, plus a per-contract download button on every contract detail page. The files are built by scripts/export_markdown.py from the same source data the search index uses.
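As a rough picture of what such an export might look like, the sketch below renders one contract to Markdown with YAML frontmatter. The field names (id, label, source_url, term), the "[page N (OCR)]" marker format, and the function name are assumptions for illustration only; the actual output of scripts/export_markdown.py may be laid out differently.

```python
import json


def render_contract_markdown(contract: dict) -> str:
    """Render one contract as Markdown with YAML frontmatter (hypothetical layout)."""
    front = ["---"]
    for key in ("id", "label", "source_url", "term"):
        # JSON string quoting is also valid YAML double-quoted scalar syntax,
        # which keeps colons in URLs from breaking the frontmatter.
        front.append(f"{key}: {json.dumps(contract[key], ensure_ascii=False)}")
    front.append("---")

    body = []
    for page in contract["pages"]:
        flag = " (OCR)" if page["ocr"] else ""
        # Inline page marker; the exact marker format is an assumption.
        body.append(f"[page {page['number']}{flag}]")
        # Table rows are assumed to be pipe-delimited already and pass through as-is.
        body.append(page["text"].strip())
    return "\n".join(front) + "\n\n" + "\n\n".join(body) + "\n"
```

Carrying the page number and OCR flag inline means a quoted passage can still be cited to a page, and a reader can tell when the text was reconstructed.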

Refresh

The corpus is regenerated by re-running the inventory + scrape + extract + segment + tag pipeline. The footer of the home page shows the build timestamp.
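A refresh of this kind could be driven by a small runner like the one below. It is a sketch that assumes each stage lives at scripts/<name>.py, as in the pipeline description; the names STAGES, stage_commands, and refresh are hypothetical and not part of the repository.

```python
import subprocess
import sys

# Stage order follows the pipeline described above.
STAGES = ["inventory", "scrape", "extract", "segment", "tag"]


def stage_commands(python: str = sys.executable):
    """Build one command per stage, e.g. [python, 'scripts/inventory.py']."""
    return [[python, f"scripts/{name}.py"] for name in STAGES]


def refresh(run=subprocess.run):
    """Run every stage in order from the repository root."""
    for cmd in stage_commands():
        # check=True raises CalledProcessError, stopping at the first failing stage.
        run(cmd, check=True)
```

Calling refresh() from the repository root runs the stages in order. Stopping on the first failure avoids building later artifacts from a stale or partial earlier stage.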

Reproduce locally

git clone https://github.com/joshgreenman1973/nyc-labor-contracts
cd nyc-labor-contracts
python3 -m venv .venv && source .venv/bin/activate
pip install pypdf pdfplumber requests beautifulsoup4 lxml pypdfium2 ocrmac
python scripts/inventory.py
python scripts/scrape.py
python scripts/extract.py
python scripts/segment.py
python scripts/tag.py
python scripts/build_manifest.py
python -m http.server 8000   # then open http://localhost:8000