mirror of
https://github.com/domainaware/parsedmarc.git
synced 2026-05-20 10:55:24 +00:00
Retroactive promotions: 3,171 KU rows reclassified by expanded multilingual classifier (#763)
Re-ran the expanded-multilingual classifier (PR #762's classifier with broader language coverage on Healthcare, Travel, Government, Retail, Finance, ISP, Manufacturing, Logistics, Real Estate, Automotive, Legal, Agriculture, plus Finance-via-body-text catching insurance/investment/asset-management) against every cached TSV from prior batches (b6–b13).

3,171 domains that previously couldn't be auto-classified (and were therefore added to known_unknown_base_reverse_dns.txt) now match the new detectors. These domains are promoted out of KU and into the map under their newly classified `(name, type)` pairs.

Type distribution of promotions:

  Finance        736   Logistics   179   Real Estate  105   Healthcare  68
  ISP            323   Retail      159   Education    110   Marketing   66
  Manufacturing  207   Technology  142   Consulting    99   Nonprofit   64
  Government     136   Travel      123   Utilities     71   Legal       53
  + smaller volumes across ~25 other industry types

ASN-domain coverage of the bundled IPinfo Lite MMDB after these promotions:

- by domain count: 32,254 / 63,993 (50.40%, up from 45.45%)
- by IPv4 weight: 98.45%

Honest scope note: the multilingual classifier achieves "concept parity" for the top-5 high-volume detectors (Healthcare, Travel, Government, Retail, Finance) across ~30 languages. Smaller detectors (Photography, Conglomerate, Sports, Defense, MSSP, IaaS/PaaS/SaaS, etc.) still have ~10-15 languages with 1-3 keywords each. Further per-detector multilingual parity is a follow-up effort; each subsequent expansion recovers fewer domains as the long tail of language-specific phrasings shrinks.

Co-authored-by: Sean Whalen <seanthegeek@users.noreply.github.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
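
The promotion step described above can be sketched roughly as follows. This is an illustrative outline only, not the code used in this commit: the function names `promote_known_unknowns` and `coverage_pct`, the in-memory data shapes, and the sample domains are all assumptions made for the example. It shows KU entries being split into those that now have a `(name, type)` classification and those that stay unknown, plus the by-domain-count coverage arithmetic.

```python
from typing import Dict, Iterable, List, Tuple

# Hypothetical shape: base reverse-DNS domain -> (name, type)
Classification = Tuple[str, str]


def promote_known_unknowns(
    known_unknown: Iterable[str],
    classified: Dict[str, Classification],
) -> Tuple[Dict[str, Classification], List[str]]:
    """Split KU domains into promoted entries and those that remain unknown.

    Assumes `classified` holds the classifier's new results keyed by
    lowercase domain. Blank lines in the KU list are skipped.
    """
    promoted: Dict[str, Classification] = {}
    remaining: List[str] = []
    for line in known_unknown:
        domain = line.strip().lower()
        if not domain:
            continue
        if domain in classified:
            # Now classifiable: move out of KU and into the map.
            promoted[domain] = classified[domain]
        else:
            remaining.append(domain)
    return promoted, remaining


def coverage_pct(mapped: int, total: int) -> float:
    """Percentage of domains covered by the map, rounded to two decimals."""
    return round(100.0 * mapped / total, 2)


if __name__ == "__main__":
    # Made-up sample data, for illustration only.
    ku = ["example-bank.test\n", "mystery-host.test\n", "\n"]
    new_results = {"example-bank.test": ("Example Bank", "Finance")}
    promoted, remaining = promote_known_unknowns(ku, new_results)
    print(promoted)
    print(remaining)
    print(coverage_pct(32254, 63993))
```

The coverage helper reproduces the by-domain-count figure quoted in the message from its two counts (32,254 of 63,993); the IPv4-weighted figure would need per-ASN address counts that are not given here.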