Classify reverse DNS map: next 1000 unmapped MMDB ASN domains (#754)

This batch covers the next 1000 domains by aggregate IPv4 weight. All sit
in the long tail (each candidate ASN holds ~7,400 IPv4 addresses, ~0.21%
of total v4 weight), so the auto-classification rate is lower than it was
for the head batches:

- 460 added to base_reverse_dns_map.csv (ISP 344, Web Host 60, Education 21,
  MSP 12, Healthcare 8, Government 8, Finance 7).
- 540 added to known_unknown_base_reverse_dns.txt: domains whose homepages
  were parked, sat behind a Cloudflare bot challenge, returned a generic
  server test page, or were in obscure languages with no telecom-keyword
  cognates the classifier recognized, plus domains whose WHOIS / MMDB
  as_name didn't combine with any homepage signal to reach two
  corroborating sources.
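
For illustration, the two-corroborating-sources rule can be sketched in
Python as below. The function name, the source labels, and the dict shape
are assumptions made for this sketch, not the classifier's actual code.

```python
from collections import Counter
from typing import Optional


def classify_with_corroboration(signals: dict[str, Optional[str]],
                                min_sources: int = 2) -> Optional[str]:
    """Return a category only when enough independent sources agree.

    `signals` maps a hypothetical source label (e.g. "as_name", "title",
    "whois") to the category that source suggests, or None for no hint.
    """
    votes = Counter(cat for cat in signals.values() if cat)
    if not votes:
        return None
    category, count = votes.most_common(1)[0]
    # Fewer than `min_sources` agreeing sources: leave the domain
    # unclassified so it can go to the known-unknown list instead.
    return category if count >= min_sources else None
```

A domain with only an as_name hint and no matching homepage signal
returns None under this rule and would land in the known-unknown file.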

Classifier improvements applied this batch (relative to prior batches' code):

- MMDB as_name is the primary brand source, with cleaned title as fallback
  and domain-derived as last resort (WHOIS is mostly privacy-redacted at
  this depth in the long tail).
- Title-segment selection now prefers the segment whose simplified form
  contains the domain root, catching cases like accessmontana.com whose
  as_name is the holding company "MONTANA WEST, L.L.C." but whose title
  surfaces the operator brand "Access Montana".
- as_name fallback for ISP now matches "Communications" (with a
  media-context guard so "Christian Broadcasting Network" doesn't hit)
  plus the bare "Internet" / "Cable" / "Telephone Co." patterns common in
  rural-US ISP brands.
- Government TLD list expanded for .go.id, .gv.at, .gov.cn, .gob.cl/ar/gt,
  .admin.ch, etc.; Education TLD list expanded for .ac.kr / .ac.za /
  .ac.nz / .edu.cn / .edu.tw / .edu.sg / .edu.my / .edu.ph / .edu.eg.
- MSP detection re-added (`it solutions` / `managed it support` /
  `managed tech` patterns) for marconet.com / odyssey.uk / vmi.se type
  long-tail managed-IT shops.
- Brand cleanup deepened to handle Brazilian EPP / EIRELI ME, Italian
  s.c.a r.l, Polish sp z o.o variants, Lithuanian UAB, Czech Druzstvo,
  Venezuelan C.A., trailing-single-letter artifacts, and double-spaces.
- Encoding-mojibake fixer for the common UTF-8-as-Latin-1 cases
  ("Fibra Ã³ptica" → "Fibra óptica") so Spanish/Portuguese ISP pages
  classify even when collect_domain_info.py mishandled the encoding.
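
A minimal sketch of the UTF-8-as-Latin-1 repair, assuming the damage is
a single wrong decode step. fix_mojibake is a hypothetical name, and the
fixer in this commit may handle more cases than this.

```python
def fix_mojibake(text: str) -> str:
    """Undo UTF-8 text that was mistakenly decoded as Latin-1."""
    try:
        # Recover the original bytes, then decode them as UTF-8.
        return text.encode("latin-1").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        # Not representable in Latin-1, or not valid UTF-8 once
        # re-encoded: the text was probably fine, so leave it alone.
        return text


# "Fibra \u00c3\u00b3ptica" is the mojibake form of "Fibra \u00f3ptica".
repaired = fix_mojibake("Fibra \u00c3\u00b3ptica")
```

Correctly decoded text such as "Fibra óptica" fails the UTF-8 decode
step and is returned unchanged, so the function can be applied to every
page without a separate detection pass.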

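The title-segment preference described above could look roughly like
this. The separator set, the helper names, the example title, and
taking the first domain label as the "domain root" are all assumptions
for this sketch.

```python
import re

# Separators assumed to split page titles: | - : and en/em dashes.
_TITLE_SEPARATORS = re.compile(r"\s+[|\-:\u2013\u2014]\s+")


def simplify(text: str) -> str:
    """Lowercase and drop everything except ASCII letters and digits."""
    return re.sub(r"[^a-z0-9]", "", text.lower())


def pick_title_segment(title: str, domain: str) -> str:
    """Prefer the segment whose simplified form contains the domain root."""
    # Simplification: treat the first label as the root ("accessmontana").
    root = simplify(domain.split(".")[0])
    segments = [s.strip() for s in _TITLE_SEPARATORS.split(title) if s.strip()]
    if not segments:
        return title.strip()
    for segment in segments:
        if root and root in simplify(segment):
            return segment
    # No segment matches the domain root: fall back to the first one.
    return segments[0]


# Hypothetical title for accessmontana.com, used only as an example.
brand = pick_title_segment("Home | Access Montana", "accessmontana.com")
```

With that hypothetical title, the "Access Montana" segment is chosen
because its simplified form equals the domain root.
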
Co-authored-by: Sean Whalen <seanthegeek@users.noreply.github.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Sean Whalen
2026-05-07 12:06:22 -04:00
committed by GitHub
parent 769b16bb03
commit 34518585b6
2 changed files with 1000 additions and 0 deletions