Classify reverse DNS map: next 5000 unmapped MMDB ASN domains (#755)

This batch is 5x the typical size, chasing complete ASN-domain coverage.
Small ISPs and web hosts are high-value targets for spam/phishing abuse,
so the long tail of unmapped operators is worth the review effort. Each
candidate at this depth represents 3,072–6,144 IPv4 addresses (well below
the 100K+ seen in head batches); the auto-classification rate is 43.5%,
similar to the prior batch.

- 2,177 added to base_reverse_dns_map.csv (ISP 1,477, Web Host 296,
  Education 214, MSP 65, Government 56, Healthcare 40, Finance 29).
- 2,823 added to known_unknown_base_reverse_dns.txt: parked,
  Cloudflare-challenged, or generic server test pages; homepages in less
  common languages with no telecom-keyword cognates the classifier
  recognized; and rows where WHOIS, MMDB as_name, and homepage did not
  yield two corroborating sources.
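
As a quick sanity check on the split above, the per-type counts can be
summed and compared against the batch size. This is a minimal sketch using
only the numbers quoted in this message; the variable names are
illustrative and are not taken from the repository.

```python
# Counts copied from the bullets above; names are illustrative only.
added_by_type = {
    "ISP": 1477,
    "Web Host": 296,
    "Education": 214,
    "MSP": 65,
    "Government": 56,
    "Healthcare": 40,
    "Finance": 29,
}
BATCH_SIZE = 5000
KNOWN_UNKNOWN = 2823

# Domains that received a type and went into base_reverse_dns_map.csv.
classified = sum(added_by_type.values())

# Share of the batch that was classified, as a percentage.
auto_rate = round(100 * classified / BATCH_SIZE, 1)

print(classified, classified + KNOWN_UNKNOWN, auto_rate)
```

The per-type counts add up to the 2,177 mapped domains, the two files
together account for the full batch, and the ratio matches the quoted
auto-classification rate.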

ASN-domain coverage of the bundled IPinfo Lite MMDB after this batch:
  - by domain count:  12,678 / 63,993  (19.81%, up from 15.86%)
  - by IPv4 weight:   97.85%           (up from 97.55%)
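
The by-domain-count figure is a plain ratio and can be reproduced from the
two totals quoted above. A minimal sketch follows; `coverage_pct` is a
hypothetical helper, not a function from the repository.

```python
def coverage_pct(mapped: int, total: int) -> float:
    """Return mapped / total as a percentage rounded to two decimals."""
    return round(100 * mapped / total, 2)


# Totals quoted in this commit message.
print(coverage_pct(12_678, 63_993))
```

The IPv4-weighted figure cannot be rechecked this way, since it depends on
per-domain address counts that are not listed in this message.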

Reused the batch-5 classifier:

- MMDB as_name as the primary brand source, with domain-root-aware
  title-segment selection
- multilingual ISP/Web Host/MSP keyword regex
- government and education TLD lists
- Communications fallback with a media-context guard
- deep brand-suffix cleanup for EPP/EIRELI/UAB/Druzstvo/etc., plus the
  UTF-8-as-Latin-1 mojibake fix

No new classifier changes this batch.
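
For readers unfamiliar with the last two cleanup steps, here is a rough
sketch of the general techniques. It is not the classifier's actual code:
`fix_mojibake`, `strip_legal_suffix`, and the suffix list are hypothetical,
and the list contains only the examples named in this message.

```python
import re


def fix_mojibake(text: str) -> str:
    """Undo UTF-8 bytes that were decoded as Latin-1.

    Returns the input unchanged when the round trip is not possible,
    which is the case for text that was decoded correctly.
    """
    try:
        return text.encode("latin-1").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        return text


# Hypothetical suffix list, limited to the examples named above.
LEGAL_SUFFIX = re.compile(r"[\s,]+(?:EPP|EIRELI|UAB|Druzstvo)\.?$", re.IGNORECASE)


def strip_legal_suffix(name: str) -> str:
    """Remove trailing legal-form suffixes, repeating until none remain."""
    previous = None
    while previous != name:
        previous = name
        name = LEGAL_SUFFIX.sub("", name).strip()
    return name


print(fix_mojibake("Telef\u00c3\u00b3nica"))
print(strip_legal_suffix("Acme Telecom EIRELI EPP"))
```

The loop in `strip_legal_suffix` is what makes the cleanup "deep" in this
sketch: stacked suffixes are peeled off one at a time. Real-world mojibake
often involves Windows-1252 rather than strict Latin-1, so a production fix
would likely need to handle that case as well.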

Co-authored-by: Sean Whalen <seanthegeek@users.noreply.github.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Sean Whalen
2026-05-07 12:33:23 -04:00
committed by GitHub
parent 34518585b6
commit 7ef153b4da
2 changed files with 5000 additions and 0 deletions