Retroactive promotions: 3,171 KU rows reclassified by expanded multilingual classifier (#763)

Re-ran the expanded multilingual classifier (the PR #762 classifier with
broader language coverage for Healthcare, Travel, Government, Retail,
Finance, ISP, Manufacturing, Logistics, Real Estate, Automotive, Legal,
and Agriculture, plus Finance-via-body-text matching that catches
insurance, investment, and asset management) against every cached TSV
from prior batches (b6–b13).
3,171 domains that previously couldn't be auto-classified (and were
therefore added to the known-unknown (KU) list,
known_unknown_base_reverse_dns.txt) now match the new detectors.
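
As a rough illustration of the re-scan, the sketch below runs a keyword
detector over cached TSV rows and collects the rows that now match. The
detector table, the `domain`/`title`/`body` column names, and the
`classify`/`rescan` helpers are all hypothetical; the real classifier and
TSV layout are not shown in this commit.

```python
import csv
import io

# Hypothetical detector table: type -> keywords in several languages.
DETECTORS = {
    "Healthcare": ("hospital", "clinic", "krankenhaus", "hôpital", "clínica"),
    "Finance": ("bank", "insurance", "versicherung", "banque", "seguros"),
}


def classify(text):
    """Return the first type whose keyword occurs in text, else None."""
    lowered = text.casefold()
    for type_name, keywords in DETECTORS.items():
        if any(keyword in lowered for keyword in keywords):
            return type_name
    return None


def rescan(tsv_text):
    """Yield (domain, type) for cached rows that match a detector."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    for row in reader:
        hit = classify(f"{row['title']} {row['body']}")
        if hit is not None:
            yield row["domain"], hit


# Assumed column layout: domain, title, body (illustrative only).
sample = (
    "domain\ttitle\tbody\n"
    "example.de\tStadtkrankenhaus\t\n"
    "example.org\tWelcome\tNothing here\n"
)
promotions = dict(rescan(sample))
```

Only the German-language row matches here, so `promotions` holds a single
entry; the unmatched row would stay in KU.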

These domains are promoted out of KU and into the map under their newly
classified `(name, type)` pairs.
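
A minimal sketch of the promotion step, assuming the KU list is held as a
set of domains and the map as a dict keyed by domain. The `promote`
helper and the in-memory shapes are illustrative assumptions, not the
project's actual tooling.

```python
def promote(ku_domains, domain_map, promotions):
    """Move newly classified domains from the KU set into the map.

    promotions maps domain -> (name, type). A domain is moved only if it
    is currently in KU and not already mapped, so existing map entries
    are never overwritten. Returns the number of domains moved.
    """
    moved = 0
    for domain, name_type in promotions.items():
        if domain in ku_domains and domain not in domain_map:
            ku_domains.remove(domain)
            domain_map[domain] = name_type
            moved += 1
    return moved


# Hypothetical data for illustration.
ku = {"example.de", "example.net"}
mapping = {"example.com": ("Example Corp", "Technology")}
count = promote(ku, mapping, {"example.de": ("Example Klinik", "Healthcare")})
```

Each promoted domain leaves one file and enters the other, which would be
consistent with the matching addition and deletion counts reported for
this commit.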

Type distribution of promotions:
  Finance         736   Logistics        179   Real Estate     105   Healthcare    68
  ISP             323   Retail           159   Education       110   Marketing     66
  Manufacturing   207   Technology       142   Consulting       99   Nonprofit     64
  Government      136   Travel           123   Utilities        71   Legal         53
  + smaller volumes across ~25 other industry types

ASN-domain coverage of the bundled IPinfo Lite MMDB after these promotions:
  - by domain count:  32,254 / 63,993  (50.40%, up from 45.45%)
  - by IPv4 weight:   98.45%
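
The domain-count figures can be checked with simple arithmetic. The
sketch below assumes every promoted domain is an ASN domain in the MMDB
that was not covered before, and it treats "IPv4 weight" as the number of
IPv4 addresses attributed to each domain; both are assumptions, and the
helper names and toy data are hypothetical.

```python
def coverage_pct(covered, total):
    """Percentage of covered items, rounded to two decimals."""
    return round(100 * covered / total, 2)


TOTAL_ASN_DOMAINS = 63_993
COVERED_AFTER = 32_254
PROMOTED = 3_171

after_pct = coverage_pct(COVERED_AFTER, TOTAL_ASN_DOMAINS)
# Assumes all 3,171 promotions were previously uncovered ASN domains.
before_pct = coverage_pct(COVERED_AFTER - PROMOTED, TOTAL_ASN_DOMAINS)


def ipv4_weighted_pct(address_counts, mapped):
    """Coverage weighted by each domain's IPv4 address count."""
    total = sum(address_counts.values())
    covered = sum(n for d, n in address_counts.items() if d in mapped)
    return round(100 * covered / total, 2)


# Toy data: one large mapped network dominates the weighted figure.
toy_counts = {"big.example": 900, "small.example": 100}
toy_weighted = ipv4_weighted_pct(toy_counts, {"big.example"})
```

Under these assumptions `after_pct` and `before_pct` come out to 50.4 and
45.45, matching the figures above. The toy data maps one of two domains
(50% by count) yet covers 90% by weight, which shows how weighted
coverage can sit far above coverage by domain count.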

Scope note: the multilingual classifier achieves "concept parity" for
the five highest-volume detectors (Healthcare, Travel, Government,
Retail, Finance) across ~30 languages. Smaller detectors (Photography,
Conglomerate, Sports, Defense, MSSP, IaaS/PaaS/SaaS, etc.) still cover
only ~10-15 languages, with 1-3 keywords each. Further per-detector
multilingual parity is a follow-up effort; each subsequent expansion
recovers fewer domains as the long tail of language-specific phrasings
shrinks.

Co-authored-by: Sean Whalen <seanthegeek@users.noreply.github.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Sean Whalen
2026-05-07 17:00:25 -04:00
committed by GitHub
parent c25bf28c1c
commit 9aa930f7cc
2 changed files with 3171 additions and 3171 deletions
File diffs suppressed because they are too large