mirror of
https://github.com/domainaware/parsedmarc.git
synced 2026-05-21 11:25:23 +00:00
Classify reverse DNS map: next 5000 unmapped MMDB ASN domains (#755)
5x the typical batch size to chase complete ASN-domain coverage. Small ISPs and web hosts are high-value targets for spam/phishing abuse, so the long tail of unmapped operators is worth investing review effort in. Each candidate at this depth represents 3,072–6,144 IPv4 addresses (well below the 100K+ that head-batches saw); auto-classification rate is 43.5%, similar to the prior batch.

- 2,177 added to base_reverse_dns_map.csv (ISP 1,477, Web Host 296, Education 214, MSP 65, Government 56, Healthcare 40, Finance 29).
- 2,823 added to known_unknown_base_reverse_dns.txt: parked / Cloudflare-challenged / generic-server-test pages, obscure-language homepages without telecom-keyword cognates the classifier recognized, or rows whose WHOIS / MMDB as_name / homepage couldn't combine into two corroborating sources.

ASN-domain coverage of the bundled IPinfo Lite MMDB after this batch:

- by domain count: 12,678 / 63,993 (19.81%, up from 15.86%)
- by IPv4 weight: 97.85% (up from 97.55%)

Reused the batch-5 classifier (MMDB as_name as primary brand source with domain-root-aware title-segment selection, multilingual ISP/Web Host/MSP keyword regex, government and education TLD lists, Communications-with-media-context-guard fallback, and the deep brand-suffix cleanup for EPP/EIRELI/UAB/Druzstvo/etc. plus the UTF-8-as-Latin-1 mojibake fix). No new classifier changes this batch.

Co-authored-by: Sean Whalen <seanthegeek@users.noreply.github.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
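
The batch totals and the two coverage figures quoted above can be cross-checked with a short sketch. The per-category counts and the 12,678 / 63,993 figures come from the commit message. The `coverage` helper, its input shape (a dict of ASN domain to IPv4 address count), and the `.example` domains are assumptions made for illustration, not parsedmarc code. It shows why coverage by IPv4 weight can sit far above coverage by domain count when the long tail consists of small operators.

```python
# Per-category additions as reported in the commit message.
added = {
    "ISP": 1477, "Web Host": 296, "Education": 214, "MSP": 65,
    "Government": 56, "Healthcare": 40, "Finance": 29,
}
mapped = sum(added.values())        # rows added to base_reverse_dns_map.csv
known_unknown = 2823                # rows added to known_unknown_base_reverse_dns.txt
batch = mapped + known_unknown      # candidates reviewed in this batch


def pct(part, whole):
    """Percentage rounded to two decimals."""
    return round(100 * part / whole, 2)


def coverage(ipv4_by_domain, covered):
    """Return (coverage by domain count, coverage by IPv4 weight) in percent.

    ipv4_by_domain: hypothetical dict of ASN domain -> IPv4 address count.
    covered: set of domains treated as covered (an assumption about the metric).
    """
    total_addresses = sum(ipv4_by_domain.values())
    covered_addresses = sum(n for d, n in ipv4_by_domain.items() if d in covered)
    by_count = pct(len(covered & ipv4_by_domain.keys()), len(ipv4_by_domain))
    by_weight = pct(covered_addresses, total_addresses)
    return by_count, by_weight


# Made-up data: one large operator covered, two small ones not.
sample = {"big-isp.example": 100_000, "small-a.example": 3_072, "small-b.example": 6_144}
by_count, by_weight = coverage(sample, {"big-isp.example"})

print(mapped, batch, pct(mapped, batch))   # category sum, batch size, auto-classification rate
print(pct(12_678, 63_993))                 # reported coverage by domain count
print(by_count, by_weight)                 # toy data: weight coverage exceeds count coverage
```

With the reported figures, the category counts sum to 2,177, the batch comes to 5,000, and 12,678 / 63,993 rounds to 19.81%.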