classify_unknown_domains.py: enforce concept-parity across ~60 languages (#765)

Multilingual detectors previously held English at full breadth (e.g. Healthcare = hospital + clinic + pharmacy + healthcare + pharmaceutical industry + nursing home + medical center) while many non-English sections covered the same concept set with only one or two transliterated words. As a result, every language other than English under-detected pages that used the operator's natural compound terms.

Reworked every detector so that each language now expresses the same English concept set in idiomatic compounds, never inventing calques where no natural form exists.

Added 32 new languages (Macedonian, Belarusian, Azerbaijani, Armenian, Georgian, Kazakh, Uzbek, Mongolian, Khmer, Burmese, Lao, Nepali, Sinhala, Amharic, Yoruba, Hausa, Igbo, Zulu, Pashto, Kurdish, Tajik, Kyrgyz, Maltese, Luxembourgish, Haitian Creole, Frisian, Yiddish, Faroese, Tatar, Javanese, Sundanese, Cebuano) on top of the existing pool, again applied per concept rather than as token presence.

Also added British / American spelling pairs where they diverge (`tire`/`tyre`, `defense`/`defence`, `center`/`centre`, etc.) and a handful of new English concepts that had been implicit (`tire shop`, `car parts`, `oil exploration`, `olympic committee`, ...), each with its multilingual equivalents in the same edit.

AGENTS.md: codified the rule under "Maintaining the reverse DNS maps" so future edits are bound by it: every language section must cover the same concept set the English section covers, using idiomatic compounds rather than calques; skip rather than invent when no natural form exists; and add any new English keyword in parallel across the existing language set.

Final shape: 11,777 alternations / 175,556 chars across 45 detectors. Ruff check + format clean. Module compiles.

Known limitation (pre-existing, unchanged): Python's `re` does not treat Unicode Mn / Mc combining marks as word characters, so Brahmic-script words ending in vowel signs / virama won't match the outer `\b…\b`. Affects pre-existing and new entries equally; fixable later by switching to the `regex` module.

Co-authored-by: Sean Whalen <seanthegeek@users.noreply.github.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
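
The known limitation described in the commit message can be reproduced with the standard library alone. The sketch below is illustrative: the two Hindi sample terms, the sample sentences, and the helper names `bounded`, `lookaround`, and `last_char_info` are assumptions of this example, not code from `classify_unknown_domains.py`. The lookaround variant is shown only as a contrast; the commit itself names the `regex` module as the intended later fix.

```python
import re
import unicodedata


def bounded(term):
    # Wrap a literal term in an outer \b...\b, as the commit message describes.
    return re.compile(r"\b" + re.escape(term) + r"\b")


def lookaround(term):
    # Stdlib-only contrast (an assumption of this sketch, not the commit's fix):
    # require "no word character" on each side instead of a \b transition.
    return re.compile(r"(?<!\w)" + re.escape(term) + r"(?!\w)")


def last_char_info(term):
    # Unicode general category of the final code point, and whether Python
    # counts it as alphanumeric (which is what \w relies on for str patterns).
    last = term[-1]
    return unicodedata.category(last), last.isalnum()


# Hindi sample terms chosen for this sketch.
HOSPITAL = "अस्पताल"  # ends in a consonant letter
PHARMACY = "फार्मेसी"  # ends in a dependent vowel sign (combining mark)

hospital_hit = bounded(HOSPITAL).search("शहर का अस्पताल बंद है") is not None
pharmacy_hit = bounded(PHARMACY).search("नई फार्मेसी खुली है") is not None
pharmacy_lookaround_hit = (
    lookaround(PHARMACY).search("नई फार्मेसी खुली है") is not None
)

print(last_char_info(HOSPITAL), hospital_hit)
print(last_char_info(PHARMACY), pharmacy_hit, pharmacy_lookaround_hit)
```

The second term is missed because its final code point is a combining mark, so there is no word / non-word transition between it and the following space. The lookaround form avoids that miss, but it can over-match when a term is immediately followed by further combining marks, so it is not a drop-in replacement.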
2026-07-06 00:35:09 +00:00 · 2026-05-07 19:01:15 -04:00
parent 3b705aeaa8
commit 06d277686d
2 changed files with 6878 additions and 446 deletions
@@ -225,7 +225,17 @@ When `unknown_base_reverse_dns.csv` has new entries, follow this order rather th
 - `find_unknown_base_reverse_dns.py` — regenerates `unknown_base_reverse_dns.csv` from `base_reverse_dns.csv` by subtracting what is already mapped or known-unknown. Enforces the no-full-IP privacy rule at ingest. Translates non-domain-shaped `source_name` rows (raw MMDB `as_name` strings surfaced by the ASN-fallback path in `utils.py:get_ip_address_info` when the IP had no PTR and the `as_domain` was uncategorized) to their corresponding `as_domain` via the bundled MMDB, so the row enters the pipeline as a researchable domain (and drops out automatically if that `as_domain` is already mapped). Run after merging a batch.
 - `detect_psl_overrides.py` — scans the lists for clustered IP-containing patterns, auto-adds brand suffixes to `psl_overrides.txt`, folds affected entries to their base, and removes any remaining full-IP entries. Run before the collector on any new batch.
 - `collect_domain_info.py` — the bulk enrichment collector described above. Respects `psl_overrides.txt` and skips full-IP entries. Two derived columns surface drift signals that are also useful during initial classification: `rebrand_signal` combines a body-text regex (matches "now X", "formerly known as X", "is now part of X", etc.) with a path/alt-text regex (matches "rebrand", "brand-launch", "brand-announcement", "name-change", "our-new-name") so that image-only acquisition banners — `<a href="…/brand-launch-…"><img alt="Brand announcement"></a>` — also fire. `external_links` lists the homepage's non-self, non-social outbound link hosts; useful as review context but not a flag trigger by default in the drift sweep (most external links are to partners / customers / vendors and don't indicate a rebrand).
-- `classify_unknown_domains.py` — regex-based multilingual classifier that consumes a `collect_domain_info.py` TSV and emits map / known-unknown additions. Useful for both lookup paths into `base_reverse_dns_map.csv`: the original PTR-side flow (classifying reverse-DNS base domains discovered from DMARC report source IPs) and the MMDB-coverage flow (classifying ASN domains lifted from the bundled IPinfo Lite MMDB). Detectors cover all 44 industry types in the README, with concept-translation parity across ~30 languages on the high-volume detectors (Healthcare, Travel, Government, Retail, Finance, ISP, Web Host, Manufacturing, Logistics, Real Estate, Automotive, Legal, Agriculture) and ~10–20 languages on the smaller ones. The classifier is the regex baseline of step 4 of the unknown-domain workflow (see "Workflow for classifying unknown domains" above) — it catches the obvious cases at scale and leaves the genuinely ambiguous to manual / LLM review. Append the script's `--map-out` to `base_reverse_dns_map.csv` and `--ku-out` to `known_unknown_base_reverse_dns.txt` (after the per-batch brand cleanup pass), then run `sortlists.py`. The HAND dict at the top of the script is an extension point for batch-specific overrides (e.g. acquisition aliases, brand-name corrections that don't fit any detector).
+- `classify_unknown_domains.py` — regex-based multilingual classifier that consumes a `collect_domain_info.py` TSV and emits map / known-unknown additions. Useful for both lookup paths into `base_reverse_dns_map.csv`: the original PTR-side flow (classifying reverse-DNS base domains discovered from DMARC report source IPs) and the MMDB-coverage flow (classifying ASN domains lifted from the bundled IPinfo Lite MMDB). Detectors cover all 44 industry types in the README, and every detector aims for **concept parity across the same broad language pool** — see the concept-parity rule below. The classifier is the regex baseline of step 4 of the unknown-domain workflow (see "Workflow for classifying unknown domains" above) — it catches the obvious cases at scale and leaves the genuinely ambiguous to manual / LLM review. Append the script's `--map-out` to `base_reverse_dns_map.csv` and `--ku-out` to `known_unknown_base_reverse_dns.txt` (after the per-batch brand cleanup pass), then run `sortlists.py`. The HAND dict at the top of the script is an extension point for batch-specific overrides (e.g. acquisition aliases, brand-name corrections that don't fit any detector).
+
+  **Concept parity rule for multilingual detectors.** When editing or extending any detector regex in `classify_unknown_domains.py`, every language section must cover the **same set of distinct concepts** that the English section covers — not just one or two transliterated keywords. The English section is the spec; each non-English section is an attempt to express that same concept set in idiomatic terms.
+
+  - **Concept, not keyword.** If the English section covers `{hospital, clinic, pharmacy, healthcare, pharmaceutical industry, nursing home, medical center}`, the Spanish / Russian / Japanese / Khmer / Yoruba sections must each independently express *each* of those concepts using natural compound terms in that language — not a single bare word. A single-word entry per language is the antipattern this rule exists to prevent.
+  - **Idiom over calque.** Use the compound term a native speaker would actually write on a homepage. Don't translate word-by-word; if the language pluralizes, compounds, or marks an institution differently, follow the language's own pattern. Don't invent calques to force a 1:1 mapping to English.
+  - **Skip rather than invent.** If a concept genuinely has no idiomatic compound in the language (e.g. some concepts have no native term in smaller-corpus languages), omit it for that language. A natural gap is fine; an invented phrase that no native page uses is not — it bloats the regex without matching anything and makes the file misleading.
+  - **When you add a new English keyword, add the parallel concept in every language that already has coverage in that detector.** Adding `tire shop` to English without adding `pneuservis` (cs/sk), `шиномонтаж` (ru), `lastik bayii` (tr), `タイヤ販売` (ja), etc. fails parity. Conversely, when you add a new language to a detector, cover all the existing English concepts that have natural translations — don't drop in a single token.
+  - **British vs American spellings.** Where US/UK English diverge (`tire`/`tyre`, `defense`/`defence`, `center`/`centre`, `color`/`colour`), include both in the English section so the detector matches both spellings.
+
+  This rule applies equally to the smaller detectors (MSSP, IaaS/PaaS/SaaS, Defense, Conglomerate, Energy, etc.) — but for those, "skip rather than invent" does most of the work, since many languages have no native compound for "managed security services" or "infrastructure as a service" and the English term is itself loanword-shaped in most contexts.
 - `detect_rebrands.py` — drift sweep that re-fetches every key in `base_reverse_dns_map.csv` with the same machinery as `collect_domain_info.py` and emits a TSV of rows where `rebrand_signal` or `redirect_changed` (final URL host doesn't sit under the input domain) fired. **Run once a year, not more often** — operator rebrands accumulate slowly and a yearly cadence is enough to keep the map current without spending review effort on near-empty diffs. Not part of the standard per-batch workflow. Output is for periodic review — a single signal is one corroborating source; promoting a flagged row still needs a second source per the two-corroborating-sources rule. Resume-safe via `-o`. Use `--limit N` to spot-check a slice; `--include-clean` to also emit non-flagged rows; `--flag-external-links` to additionally flag rows whose only signal is an outbound non-self host (off by default to keep partner/vendor noise out of the review queue).
 - `find_bad_utf8.py` — locates invalid UTF-8 bytes (used after past encoding corruption).
 - `sortlists.py` — case-insensitive sort + dedupe + `type`-column validator for the list files; the authoritative sorter run after every batch edit.
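
The concept-parity rule added in the hunk above could be checked mechanically. The following is a minimal sketch under stated assumptions: the `concept -> language -> terms` layout, the `AUTOMOTIVE` sample, and the helpers `parity_gaps` and `build_pattern` are hypothetical and do not reflect how `classify_unknown_domains.py` is actually organized. Only terms quoted in the AGENTS.md text are used, and `car parts` is deliberately left English-only to show how a gap would surface.

```python
import re

# Hypothetical layout for this sketch: concept -> language code -> terms.
AUTOMOTIVE = {
    "tire shop": {
        "en": ["tire shop", "tyre shop"],  # US and UK spellings side by side
        "cs": ["pneuservis"],
        "ru": ["шиномонтаж"],
        "tr": ["lastik bayii"],
        "ja": ["タイヤ販売"],
    },
    "car parts": {
        "en": ["car parts"],
    },
}


def parity_gaps(detector):
    """Map each language to the concepts it does not cover yet.

    A language counts as covering a concept when it has an entry for it,
    even an empty list; an empty list records a deliberate
    "skip rather than invent" decision.
    """
    languages = sorted({lang for by_lang in detector.values() for lang in by_lang})
    gaps = {}
    for lang in languages:
        missing = [c for c, by_lang in detector.items() if lang not in by_lang]
        if missing:
            gaps[lang] = missing
    return gaps


def build_pattern(detector):
    """Compile every term into one alternation inside an outer word boundary."""
    terms = {t for by_lang in detector.values() for ts in by_lang.values() for t in ts}
    # Longest first so a longer compound is tried before a shorter prefix.
    ordered = sorted(terms, key=lambda t: (-len(t), t))
    body = "|".join(re.escape(t) for t in ordered)
    return re.compile(r"\b(?:" + body + r")\b", re.IGNORECASE)


gaps = parity_gaps(AUTOMOTIVE)
pattern = build_pattern(AUTOMOTIVE)
print(gaps)
print(bool(pattern.search("Best Tyre Shop in town")))
```

Treating an explicit empty list as "covered" keeps intentional gaps distinguishable from forgotten ones, which is the distinction the "skip rather than invent" bullet draws.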