Classify reverse DNS map: final cleanup batch (~2,650 unmapped MMDB ASN domains) (#762)

Final cleanup pass to clear the remaining MMDB AS-domain queue. Applied an
expanded multilingual classifier covering all 44 README industry types
plus an Energy concept (mapped to Utilities pending a README addition).
Per-detector keyword lists now include Spanish, Portuguese, French,
Italian, German, Dutch, Russian, Polish, Czech, Turkish, Greek, Chinese
(simplified and traditional), Japanese, Korean, Arabic, Hebrew, Hindi,
Vietnamese, Indonesian, and Thai where the concept has a recognizable
local-language equivalent.
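
A minimal sketch of how per-detector multilingual keyword lists could
drive classification, including the Energy -> Utilities fallback. The
keyword lists, DETECTORS, CONCEPT_TO_TYPE, and classify_text are all
hypothetical names for illustration; the actual classifier is not part
of this commit and may work differently.

```python
import unicodedata

# Hypothetical keyword lists; the real per-detector lists are not shown here.
DETECTORS = {
    "Education": ["university", "universidad", "universität", "université", "大学"],
    "Healthcare": ["hospital", "krankenhaus", "clínica", "病院"],
    "Energy": ["energy", "energía", "energie", "électricité"],
}

# Concepts with no README industry type yet are mapped to an existing type.
CONCEPT_TO_TYPE = {"Energy": "Utilities"}


def _norm(text):
    # Normalize Unicode form and case so accented keywords compare reliably.
    return unicodedata.normalize("NFKC", text).casefold()


def classify_text(text):
    """Return an industry type for page text, or None if nothing matches."""
    haystack = _norm(text)
    for concept, keywords in DETECTORS.items():
        # Plain substring matching: crude, but works for scripts without
        # word boundaries (Chinese, Japanese, Thai).
        if any(_norm(kw) in haystack for kw in keywords):
            return CONCEPT_TO_TYPE.get(concept, concept)
    return None
```

In this sketch a None result would correspond to a row that ends up in
the known-unknown list rather than the map.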

- 980 added to base_reverse_dns_map.csv (ISP 193, Education 193, Finance
  155, Government 109, Healthcare 93, Web Host 37, MSP 31, Manufacturing
  22, Logistics 17, Real Estate 12, Travel 11, Consulting 10, Tech 9,
  Nonprofit 9, Legal 9, Food 9, Retail 8, Religion 8, Utilities 7, plus
  smaller volumes across 14 more types).
- 1,669 added to known_unknown_base_reverse_dns.txt — the residual
  unfetchable / parked / Cloudflare-challenged / non-recognized-content
  rows.

ASN-domain coverage of the bundled IPinfo Lite MMDB after this batch:
  - by domain count:  29,083 / 63,993  (45.45%)
  - by IPv4 weight:   98.36%
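
The gap between the two figures can be illustrated with a small sketch.
It assumes "IPv4 weight" means the share of IPv4 addresses announced
under mapped AS-domains; that reading, the coverage function, and the
toy data are assumptions, not taken from the project's tooling.

```python
def coverage(ipv4_counts, mapped):
    """Return (percent by domain count, percent by IPv4 weight).

    ipv4_counts: dict of AS-domain -> number of IPv4 addresses (toy data).
    mapped: set of AS-domains that have a row in the map.
    """
    hits = [d for d in ipv4_counts if d in mapped]
    by_count = 100 * len(hits) / len(ipv4_counts)
    by_weight = 100 * sum(ipv4_counts[d] for d in hits) / sum(ipv4_counts.values())
    return by_count, by_weight


# Toy example: one large mapped network dominates the address space.
toy = {"big-isp.example": 9000, "small-a.example": 600, "small-b.example": 400}
by_count, by_weight = coverage(toy, {"big-isp.example"})

# The domain-count figure above, recomputed from the stated totals.
reported = round(100 * 29083 / 63993, 2)
```

A few large networks hold most of the address space, so mapping under
half of the domains can still cover nearly all IPv4 addresses.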

Total since batch 5: ~16,400 map rows + ~17,400 known-unknown rows added
across 9 batches. Remaining unmapped pool size: 0. Every MMDB AS-domain
has now been processed: either classified in base_reverse_dns_map.csv or
recorded in known_unknown_base_reverse_dns.txt.

Co-authored-by: Sean Whalen <seanthegeek@users.noreply.github.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Sean Whalen
2026-05-07 16:48:56 -04:00
committed by GitHub
parent fa03b8f2c2
commit c25bf28c1c
2 changed files with 2649 additions and 0 deletions