Commit classify_unknown_domains.py — regex-based multilingual classifier (#764)

* Commit classify_unknown_domains.py: regex-based multilingual classifier

Promotes the transient `/tmp/classify_b<N>.py` script that grew across the b5–b13 reverse-DNS-map batches into a tracked tool. The classifier takes a `collect_domain_info.py` TSV and emits a CSV of map additions plus a text file of known-unknown additions: the regex baseline that makes step 4 of the unknown-domain workflow ("classify from the TSV, not by re-fetching") tractable at scale.

Coverage:
- Detectors for all 44 industry types in the README.
- Concept-translation parity across ~30 languages on the high-volume detectors (Healthcare, Travel, Government, Retail, Finance, ISP, Web Host, Manufacturing, Logistics, Real Estate, Automotive, Legal, Agriculture).
- ~10–20 languages with 1–3 keywords each on the smaller detectors (Photography, Sports, MSSP, Conglomerate, Search Engine, Social Media, Defense, IaaS/PaaS/SaaS, Beauty, Print, Publishing, Religion, Science, Event Planning, Staffing, Email Security/Provider, Marketing, Construction, Industrial, Utilities, Energy, Government Media, Physical Security, News, Nonprofit, Entertainment, Technology, Consulting).

Brand-name selection prefers MMDB `as_name` → page title's first segment → non-redacted WHOIS registrant → domain-derived fallback, with a `clean_brand` pass that strips legal-form suffixes (LLC / GmbH / Ltda / EIRELI / sp. z o.o. / s.c.a r.l / UAB / etc.) and prefixes (PT, OOO). When the title has multiple segments, the segment whose simplified form contains the domain root is preferred: accessmontana.com with as_name "MONTANA WEST, L.L.C." and title "Internet, Phone & TV Bundles | Access Montana" maps to "Access Montana", not "Montana West".

A small mojibake fixer normalizes the most common UTF-8-as-Latin-1 re-encodings ("Ã³" → "ó", etc.) so Spanish/Portuguese/French homepages that `collect_domain_info.py` mishandled still classify.

The empty HAND dict at the top of the file is an extension point for batch-specific overrides (e.g. acquisition aliases or brand-name corrections that don't fit any detector); each `domain → ("Brand", "Type")` entry wins over the auto-classifier.

Wired into AGENTS.md's "Related utility scripts" section and documented in `parsedmarc/resources/maps/README.md` alongside the rest of the maps utilities.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* classify_unknown_domains.py: clarify dual-purpose framing

The classifier serves both lookup paths into base_reverse_dns_map.csv: the original PTR-side flow (reverse-DNS base domains derived from DMARC report source IPs) and the MMDB-coverage flow (AS domains lifted from the bundled IPinfo Lite MMDB). The initial commit's docstring/comments emphasized the MMDB-coverage flow because that's where the script grew up across the b5–b13 batches, but it was always equally applicable to PTR-side domains.

Updates:
- Top docstring rewritten to lead with the dual-purpose framing.
- README.md adds an explicit "useful for either lookup path" paragraph referencing the original DMARC-report flow and the MMDB-coverage flow.
- AGENTS.md "Related utility scripts" entry updated to mention both flows.
- Drops a stale "happen to have ASN registrations" aside in the RETAIL_RE comment.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Sean Whalen <seanthegeek@users.noreply.github.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
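The mojibake repair mentioned in the commit message can be illustrated with a small sketch. This is not the script's code: the names `fix_mojibake`, `_repair`, and `_SUSPECT` are hypothetical, and the round-trip technique used here (re-encode a suspicious run as Latin-1, then decode it as UTF-8) may differ from whatever the classifier actually does. It only targets the Latin-1 case and would not catch Windows-1252 variants.

```python
import re

# Hypothetical sketch: a UTF-8 lead byte (0xC2-0xF4) followed by one to three
# continuation bytes (0x80-0xBF), each surfacing as its own Latin-1 character.
_SUSPECT = re.compile(r"[\u00c2-\u00f4][\u0080-\u00bf]{1,3}")


def _repair(match: re.Match) -> str:
    """Re-encode one suspicious run as Latin-1 and decode it as UTF-8."""
    chunk = match.group(0)
    try:
        return chunk.encode("latin-1").decode("utf-8")
    except UnicodeDecodeError:
        # Not valid UTF-8 after all, so leave the original text alone.
        return chunk


def fix_mojibake(text: str) -> str:
    """Repair UTF-8-read-as-Latin-1 runs while leaving clean text untouched."""
    return _SUSPECT.sub(_repair, text)


if __name__ == "__main__":
    print(fix_mojibake("InformaciÃ³n"))  # the "Ã³" pair collapses to "ó"
    print(fix_mojibake("café"))          # already clean, printed unchanged
```

Repairing only matched runs, instead of round-tripping the whole string, keeps correctly encoded accented characters intact when a page mixes clean and damaged text.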
2026-07-05 16:25:09 +00:00 · 2026-05-07 17:16:23 -04:00
parent 9aa930f7cc
commit 3b705aeaa8
3 changed files with 2995 additions and 0 deletions
@@ -225,6 +225,7 @@ When `unknown_base_reverse_dns.csv` has new entries, follow this order rather th
 - `find_unknown_base_reverse_dns.py` — regenerates `unknown_base_reverse_dns.csv` from `base_reverse_dns.csv` by subtracting what is already mapped or known-unknown. Enforces the no-full-IP privacy rule at ingest. Translates non-domain-shaped `source_name` rows (raw MMDB `as_name` strings surfaced by the ASN-fallback path in `utils.py:get_ip_address_info` when the IP had no PTR and the `as_domain` was uncategorized) to their corresponding `as_domain` via the bundled MMDB, so the row enters the pipeline as a researchable domain (and drops out automatically if that `as_domain` is already mapped). Run after merging a batch.
 - `detect_psl_overrides.py` — scans the lists for clustered IP-containing patterns, auto-adds brand suffixes to `psl_overrides.txt`, folds affected entries to their base, and removes any remaining full-IP entries. Run before the collector on any new batch.
 - `collect_domain_info.py` — the bulk enrichment collector described above. Respects `psl_overrides.txt` and skips full-IP entries. Two derived columns surface drift signals that are also useful during initial classification: `rebrand_signal` combines a body-text regex (matches "now X", "formerly known as X", "is now part of X", etc.) with a path/alt-text regex (matches "rebrand", "brand-launch", "brand-announcement", "name-change", "our-new-name") so that image-only acquisition banners — `<a href="…/brand-launch-…"><img alt="Brand announcement"></a>` — also fire. `external_links` lists the homepage's non-self, non-social outbound link hosts; useful as review context but not a flag trigger by default in the drift sweep (most external links are to partners / customers / vendors and don't indicate a rebrand).
+- `classify_unknown_domains.py` — regex-based multilingual classifier that consumes a `collect_domain_info.py` TSV and emits map / known-unknown additions. Useful for both lookup paths into `base_reverse_dns_map.csv`: the original PTR-side flow (classifying reverse-DNS base domains discovered from DMARC report source IPs) and the MMDB-coverage flow (classifying ASN domains lifted from the bundled IPinfo Lite MMDB). Detectors cover all 44 industry types in the README, with concept-translation parity across ~30 languages on the high-volume detectors (Healthcare, Travel, Government, Retail, Finance, ISP, Web Host, Manufacturing, Logistics, Real Estate, Automotive, Legal, Agriculture) and ~10–20 languages on the smaller ones. The classifier is the regex baseline of step 4 of the unknown-domain workflow (see "Workflow for classifying unknown domains" above) — it catches the obvious cases at scale and leaves the genuinely ambiguous to manual / LLM review. Append the script's `--map-out` to `base_reverse_dns_map.csv` and `--ku-out` to `known_unknown_base_reverse_dns.txt` (after the per-batch brand cleanup pass), then run `sortlists.py`. The HAND dict at the top of the script is an extension point for batch-specific overrides (e.g. acquisition aliases, brand-name corrections that don't fit any detector).
 - `detect_rebrands.py` — drift sweep that re-fetches every key in `base_reverse_dns_map.csv` with the same machinery as `collect_domain_info.py` and emits a TSV of rows where `rebrand_signal` or `redirect_changed` (final URL host doesn't sit under the input domain) fired. **Run once a year, not more often** — operator rebrands accumulate slowly and a yearly cadence is enough to keep the map current without spending review effort on near-empty diffs. Not part of the standard per-batch workflow. Output is for periodic review — a single signal is one corroborating source; promoting a flagged row still needs a second source per the two-corroborating-sources rule. Resume-safe via `-o`. Use `--limit N` to spot-check a slice; `--include-clean` to also emit non-flagged rows; `--flag-external-links` to additionally flag rows whose only signal is an outbound non-self host (off by default to keep partner/vendor noise out of the review queue).
 - `find_bad_utf8.py` — locates invalid UTF-8 bytes (used after past encoding corruption).
 - `sortlists.py` — case-insensitive sort + dedupe + `type`-column validator for the list files; the authoritative sorter run after every batch edit.
@@ -138,6 +138,30 @@ The TSV also carries two derived columns that surface drift signals (and double

 The output of `collect_domain_info.py`. Tab-separated, one row per researched domain. Not tracked by Git — it is regenerated on demand and contains transient third-party WHOIS/HTML data.

+## classify_unknown_domains.py
+
+Regex-based multilingual classifier that consumes a `domain_info.tsv` (from `collect_domain_info.py`) and emits two outputs: a CSV of map additions (`domain,name,type` rows) and a text file of known-unknown additions.
+
+Useful for either lookup path that reads `base_reverse_dns_map.csv`:
+
+- The original PTR-side flow that classifies reverse-DNS base domains derived from DMARC report source IPs (`base_reverse_dns.csv` → `unknown_base_reverse_dns.csv` → `domain_info.tsv` → this classifier).
+- The MMDB-coverage flow that classifies ASN domains lifted from the bundled IPinfo Lite MMDB (the b5–b13 batches that drove distinct AS-domain coverage from ~10% to ~50% used this classifier as their regex baseline).
+
+Run it from this directory:
+
+```bash
+python classify_unknown_domains.py \
+    -i /tmp/batch_info.tsv \
+    --map-out /tmp/additions.csv \
+    --ku-out /tmp/ku_additions.txt
+```
+
+Detectors cover all 44 industry types listed in [base_reverse_dns_map.csv](#base_reverse_dns_mapcsv) above. Multilingual coverage is broadest for the high-volume detectors — Healthcare, Travel, Government, Retail, Finance, ISP, Web Host, Manufacturing, Logistics, Real Estate, Automotive, Legal, Agriculture have concept-translation parity across ~30 languages with multiple synonyms per language. Smaller detectors (Photography, Sports, MSSP, Conglomerate, Search Engine, Social Media, Defense, IaaS, PaaS, SaaS, Beauty, Print, Publishing, Religion, Science, Event Planning, Staffing, Email Security, Email Provider, Marketing, Construction, Industrial, Utilities, Energy, Government Media, Physical Security, News, Nonprofit, Entertainment, Technology, Consulting) have ~10–20 languages with 1–3 keywords each. Each successive batch is expected to refine multilingual coverage as new patterns surface in the unclassified pool.
+
+Brand-name selection prefers (in order): the MMDB `as_name` for the domain; the page title's first segment; non-redacted WHOIS registrant org; domain-derived fallback. A `clean_brand` step strips common legal-form suffixes (LLC / GmbH / Ltda / EIRELI / sp. z o.o. / etc.) and prefixes (PT, OOO). When the title has multiple segments separated by `|` / `-` / `—` etc., the segment whose simplified form contains the domain root is preferred — so e.g. accessmontana.com whose `as_name` is "MONTANA WEST, L.L.C." but whose title is "Internet, Phone & TV Bundles | Access Montana" maps to "Access Montana", not "Montana West".
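The preference order and the title-segment rule described above can be sketched roughly as follows. The function names (`simplify`, `pick_title_segment`, `choose_brand`), the separator set, and the naive "first label" notion of the domain root are all assumptions made for illustration; the script's real helpers are not shown in this diff, and this sketch does not include the `clean_brand` pass.

```python
import re
from typing import Optional


def simplify(text: str) -> str:
    """Lowercase and drop everything except ASCII letters and digits."""
    return re.sub(r"[^a-z0-9]", "", text.lower())


def pick_title_segment(title: str, domain: str) -> Optional[str]:
    """Return the title segment that contains the domain root, else the first one."""
    # Split on pipes and en/em dashes; split on hyphens only when spaced.
    parts = re.split(r"\s*[|\u2013\u2014]\s*|\s+-\s+", title)
    segments = [p.strip() for p in parts if p.strip()]
    if not segments:
        return None
    root = simplify(domain.split(".")[0])  # assumption: root = first label
    for segment in segments:
        if root and root in simplify(segment):
            return segment
    return segments[0]


def choose_brand(domain: str, as_name: str = "", title: str = "",
                 registrant: str = "") -> str:
    """Apply the documented order; a domain-matching title segment wins outright."""
    root = simplify(domain.split(".")[0])
    segment = pick_title_segment(title, domain)
    if segment and root and root in simplify(segment):
        return segment
    for candidate in (as_name, segment, registrant):
        if candidate:
            return candidate
    return domain.split(".")[0].capitalize()  # domain-derived fallback


if __name__ == "__main__":
    print(choose_brand(
        "accessmontana.com",
        as_name="MONTANA WEST, L.L.C.",
        title="Internet, Phone & TV Bundles | Access Montana",
    ))  # prints "Access Montana"
```

Under these assumptions the accessmontana.com example resolves to "Access Montana" because the simplified second segment, `accessmontana`, contains the domain root, which overrides the `as_name`.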
+
+The classifier is the regex baseline of step 4 of the [Workflow for classifying unknown domains](../../../AGENTS.md#workflow-for-classifying-unknown-domains) — it catches obvious cases at scale and leaves only the genuinely ambiguous to manual / LLM review. The empty `HAND` dict at the top of the script is an extension point for batch-specific overrides (e.g. acquisition aliases, brand-name corrections that don't fit any detector); each `domain → ("Brand", "Type")` entry wins over the auto-classifier.
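As a rough illustration of how a `HAND` entry could take precedence, the sketch below checks the override table before calling the automatic path. Only the `HAND` name and the `domain → ("Brand", "Type")` shape come from the text above; the `classify` wrapper, the `auto_classify` callback, and the sample entry are hypothetical.

```python
from typing import Callable, Dict, Tuple

Result = Tuple[str, str]  # (brand name, industry type)

# Empty by default; entries are added per batch. The commented line is a
# made-up example of the documented shape, not a real mapping.
HAND: Dict[str, Result] = {
    # "example.net": ("Example Networks", "ISP"),
}


def classify(domain: str, auto_classify: Callable[[str], Result]) -> Result:
    """Return the manual override when one exists, else the automatic result."""
    override = HAND.get(domain.lower())
    if override is not None:
        return override
    return auto_classify(domain)


if __name__ == "__main__":
    HAND["example.net"] = ("Example Networks", "ISP")
    detector = lambda d: ("Auto Guess", "Technology")  # stand-in for the regex detectors
    print(classify("example.net", detector))  # the override wins
    print(classify("other.org", detector))    # falls through to the detector
```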
+
 ## detect_rebrands.py

 **Cadence: run roughly once a year.** Not part of the standard mapping workflow — operator rebrands and acquisitions accumulate slowly, and a yearly sweep is sufficient to keep `base_reverse_dns_map.csv` from drifting out of date. There is no benefit to running it more often.