Expand reverse-DNS map and PSL overrides from the live PSL (#716)

* Expand reverse-DNS map and PSL overrides from the live PSL

Parses the private-domains section of the live Public Suffix List and
adds 269 brand-owned suffixes as PSL overrides paired with map
entries, so customer subdomains on shared hosting / SaaS / PaaS
platforms fold to the operator's brand. Adds 33 ASN-domain entries
for the subset of these brands whose IP space is registered under a
different corporate domain in the MMDB, so both the PTR-derived
lookup and the ASN-fallback lookup hit the same (name, type). Also
normalizes ``a2hosting.com`` from ``A2Hosting`` to ``A2 Hosting``
for spelling consistency.

PTR-path wins (overrides + map entries)
- Web hosts: A2 Hosting, alwaysdata, Antagonist, Beget, bplaced,
  Bytemark, Combell, cyber_Folks, cyon, DreamHost, EasyWP, Gehirn,
  HelioHost, home.pl, HostyHosting, Hypernode, IONOS (6 suffixes),
  Jotelulu, JouwWeb, KaasHosting, Keyweb, LCube, LiquidNet, McHost,
  Memset, Mittwald, Mythic Beasts, NearlyFreeSpeech, Nimbus Hosting,
  One.com (20 ccTLD variants), OwnProvider, Pantheon, Planet-Work,
  prgmr, Rackmaze, Rad Web Hosting, Raidboxes, Servebolt,
  SpeedPartner, Uberspace, Whatbox, WP Engine, ZAP-Hosting, Zitcom.
- Dynamic DNS: DuckDNS, DynDNS (24), No-IP (22), Now-DNS, dynv6,
  freemyip, nsupdate.info, ddnss.de, GoIP, DrayTek.
- PaaS/SaaS/IaaS: Netlify, Vercel (6), Heroku, fly.io, Render,
  Firebase/GCP (4), Azure (5), AWS (4), DigitalOcean (2), Red Hat
  OpenShift, Hasura, Supabase, Snowflake/Streamlit, Read the Docs,
  PythonAnywhere, GitHub, GitLab, Adobe Magento.
- Hosted sites/stores: Hatena (6), Notion, Figma, Webflow, Wix (4),
  Shopify, Shopware, Sellfy, Spreadshop (19 ccTLDs), Datto.
- Email/Marketing: Fastmail, ActiveTrail, Leadpages, Heyflow, Carrd,
  Typeform.
- CDN/Technology: Akamai (7), Fastly (3), Yandex Cloud.

ASN-path wins (MMDB coverage now attributes 1,184,256 more IPv4
addresses to a named brand, 85.04% -> 85.08%): yandex.com, ya.ru,
hosting.com (A2 Hosting), beget.com, cyberfolks.pl, fly.io,
bytemark.co.uk, cyberfolks.ro, keyweb.de, mittwald.de, memset.com,
zap-hosting.com, datto.com, jotelulu.com, yandex.cloud, github.com,
asavie.com (Akamai), and 16 others.

Entries are curated from the live PSL rather than any bundled copy;
brand / as_name attribution was verified against the CLAUDE.md rule
that the IP-WHOIS signal is only trusted when the domain name itself
matches the host's name (name-collisions in MMDB were skipped —
Hypernode AU, goipgroup.com, liquidnet.com, One.com substring noise,
nimbusitsolutions.com, etc.). Types follow
``base_reverse_dns_types.txt``; ``sortlists.py`` re-sorts + dedupes +
validates after the batch.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* Document PSL-derived override workflow and load_psl_overrides gotcha

Adds three pieces of map-maintenance context learned while building
this PR:

- New subsection "Discovering overrides from the live PSL
  private-domains section" — distinct source from live DMARC data
  (unknown_base_reverse_dns.csv) and MMDB coverage-gap analysis. The
  private section is itself a list of brand-owned suffixes; each is a
  candidate (psl_override + map entry) pair. Emphasizes ruthless
  selectivity — most of the 600+ private-section orgs are dev
  sandboxes or hobby zones that will never appear in DMARC reports.

- Two-path coverage as a single linked step, not two round-trips:
  when adding a PSL override for a hosted-content suffix
  (netlify.app), also add a map row for the brand's corporate
  as_domain (netlify.com) in the same pass. The override fixes the
  PTR path; the ASN-domain alias fixes the ASN-fallback path.

- The load_psl_overrides() fetch-first gotcha. The no-arg form pulls
  the file from master on GitHub, so end-to-end testing of local
  overrides silently uses the old remote version. offline=True is
  required to test local changes against get_base_domain().

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Sean Whalen <seanthegeek@users.noreply.github.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Sean Whalen
2026-04-23 09:12:32 -04:00
committed by GitHub
parent 2cda5bf59b
commit 2978436d89
3 changed files with 594 additions and 1 deletions


@@ -172,6 +172,28 @@ for d, c, n in miss[:50]:
Apply the same classification rules above (precedence, naming consistency, skip-if-ambiguous, privacy). Many top misses will be brands already in the map under a different rDNS-base key — the goal there is to alias the ASN domain to the same `(name, type)` so both lookup paths hit. For ASN domains with no obvious brand identity (small resellers, parked ASNs), don't map them — the attribution code falls back to the raw `as_name` from the MMDB, which is better than a guess.
### Discovering overrides from the live PSL private-domains section
Separately from live DMARC data and the MMDB, the [Public Suffix List](https://publicsuffix.org/list/public_suffix_list.dat) is itself a source of override candidates. Every entry between `===BEGIN PRIVATE DOMAINS===` and `===END PRIVATE DOMAINS===` is a brand-owned suffix by definition (registered by the operator under their own name), so each is a candidate for a `(psl_override + map entry)` pair — folding `customer.brand.tld` → `brand.tld` and attributing it to the operator.
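The folding described above can be pictured with a minimal sketch. This is a simplified model for illustration, not the `parsedmarc` implementation: `fold_with_overrides` is a hypothetical helper, and the longest-match ordering is an assumption made here so that a more specific suffix wins.

```python
from typing import Iterable, Optional


def fold_with_overrides(hostname: str, overrides: Iterable[str]) -> Optional[str]:
    """Return the brand base domain if an override suffix matches, else None."""
    host = hostname.lower().rstrip(".")
    # Assumption: try the longest override first so specific suffixes win.
    for override in sorted(overrides, key=len, reverse=True):
        if host.endswith(override):
            # Drop the leading separator ("." or "-") to get the base domain.
            return override.lstrip(".-")
    return None


overrides = [".netlify.app", ".myshopify.com", "-tataidc.co.in"]
print(fold_with_overrides("host01.netlify.app", overrides))
print(fold_with_overrides("mail.example.org", overrides))
```

A hostname that matches no override returns `None` in this sketch; real code would then fall through to the regular PSL lookup.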
Workflow:
1. Fetch the live PSL file and parse the private section by `// Org` comment blocks → `{org: [suffixes]}`.
2. Cross-reference against `base_reverse_dns_map.csv` keys and existing `psl_overrides.txt` entries to drop already-covered orgs.
3. **Be ruthlessly selective.** The private section has 600+ orgs, most of which are dev sandboxes, obscure dynamic DNS services, IPFS gateways, single-person hobby domains, or registry subzones that will never appear in a DMARC report. Keep only orgs that clearly host email senders: shared web hosts, PaaS / SaaS where customers publish mail-sending sites, email/marketing platforms, major ISPs, and the subset of dynamic-DNS services that home mail servers actually use.
4. For each kept org, emit one override and one map row for every suffix it owns (`.brand.tld` per the `psl_overrides.txt` format), all pointing at the same `(name, type)`. Apply the README precedence rules for `type`. Grep existing map keys for the brand name before inventing a new one, so that each operator ends up with a single canonical display name.
5. **Same-PR follow-up: two-path coverage.** For every brand added this way, also check whether the brand's corporate domain (e.g. `netlify.com` for `netlify.app`, `shopify.com` for `myshopify.com`, `beget.com` for `beget.app`) is an `as_domain` in the MMDB, and add a map row for it with the same `(name, type)`. The PSL override fixes the PTR path; the ASN-domain alias fixes the ASN-fallback path. Do these together — one pass, not two.
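Steps 1, 2, and 4 can be sketched roughly as follows. This is a sketch under stated assumptions, not the script used for this PR: `parse_private_section` and `candidate_rows` are hypothetical helpers, the first `//` comment after a blank line is assumed to be the org header, wildcard rules are folded to their parent suffix, coverage is checked per suffix rather than per org, and the `SAMPLE` text and `(name, type)` values are illustrative stand-ins rather than verbatim PSL or map content.

```python
from collections import defaultdict

BEGIN = "===BEGIN PRIVATE DOMAINS==="
END = "===END PRIVATE DOMAINS==="


def parse_private_section(psl_text):
    """Group private-section suffixes under the first `//` comment of each block."""
    orgs = defaultdict(list)
    in_private = False
    new_block = True  # True until the current block's header comment is seen
    org = "unknown"
    for raw in psl_text.splitlines():
        line = raw.strip()
        if BEGIN in line:
            in_private = True
            continue
        if END in line:
            break
        if not in_private:
            continue
        if not line:
            new_block = True  # a blank line separates org blocks
            continue
        if line.startswith("//"):
            if new_block:
                # "// Org : https://..." -> "Org"
                org = line[2:].split(" : ")[0].strip()
                new_block = False
            continue
        suffix = line.split()[0]
        if suffix.startswith("!"):
            continue  # exception rule, not a suffix of its own
        if suffix.startswith("*."):
            suffix = suffix[2:]  # simplification: treat a wildcard as its parent
        orgs[org].append(suffix)
    return dict(orgs)


def candidate_rows(orgs, keep, map_keys, overrides):
    """Return (override_line, map_row) pairs for kept orgs, skipping covered suffixes.

    `keep` maps a PSL org name to the hand-picked (name, type) from step 3.
    """
    rows = []
    for org, (name, type_) in keep.items():
        for suffix in orgs.get(org, []):
            override = "." + suffix
            if suffix in map_keys or override in overrides:
                continue  # already covered by the map or by an override
            rows.append((override, (suffix, name, type_)))
    return rows


SAMPLE = """\
// ===BEGIN PRIVATE DOMAINS===

// Netlify : https://www.netlify.com
// Submitted by Example Person <person@example.com>
netlify.app

// Shopify : https://www.shopify.com
myshopify.com

// Hobby Zone : https://hobby.example
*.sandbox.example
// ===END PRIVATE DOMAINS===
"""

orgs = parse_private_section(SAMPLE)
keep = {"Netlify": ("Netlify", "Web Host"), "Shopify": ("Shopify", "SaaS")}
for override, row in candidate_rows(orgs, keep, map_keys={"myshopify.com"}, overrides=set()):
    print(override, ",".join(row))
```

Keeping the selection in an explicit `keep` mapping leaves step 3 as a manual decision; the code only automates the mechanical parts.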
### The `load_psl_overrides()` fetch-first gotcha
`parsedmarc.utils.load_psl_overrides()` with no arguments fetches the overrides file from `raw.githubusercontent.com/domainaware/parsedmarc/master/...` *first* and only falls back to the bundled local file on network failure. This means end-to-end testing of local `psl_overrides.txt` changes via `get_base_domain()` silently uses the old remote version until the PR merges. When testing local changes, explicitly pass `offline=True`:
```python
from parsedmarc.utils import load_psl_overrides, get_base_domain

# offline=True skips the GitHub fetch, so the local psl_overrides.txt is used
load_psl_overrides(offline=True)
assert get_base_domain("host01.netlify.app") == "netlify.app"
```
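To see why the default is surprising, here is a small model of a fetch-first loader. It is a hypothetical illustration of the behavior described above, not the `parsedmarc` source: `load_overrides`, `fetch_remote`, and `read_local` are invented names, and the `.example` suffixes are placeholders.

```python
from typing import Callable, List


def load_overrides(
    fetch_remote: Callable[[], str],
    read_local: Callable[[], str],
    offline: bool = False,
) -> List[str]:
    """Hypothetical fetch-first loader: the remote copy wins unless offline."""
    text = None
    if not offline:
        try:
            text = fetch_remote()
        except OSError:
            text = None  # network failure: fall back to the local file
    if text is None:
        text = read_local()
    return [ln.strip() for ln in text.splitlines() if ln.strip()]


def remote() -> str:
    return ".existing.example\n"  # stand-in for the copy on master


def local() -> str:
    return ".existing.example\n.added.example\n"  # stand-in for the edited file


print(load_overrides(remote, local))                # local addition is missing
print(load_overrides(remote, local, offline=True))  # local addition is present
```

In this model the local edit only takes effect when the fetch fails or `offline=True` is passed, which mirrors the gotcha.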
### After a batch merge
- Re-sort `base_reverse_dns_map.csv` alphabetically (case-insensitive) by the first column and write it out with CRLF line endings.
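
A minimal sketch of that re-sort, assuming the file has a single header row (the column names in the sample are an assumption). `sort_map_csv` is a hypothetical helper, not the project's `sortlists.py`, and it skips the deduplication and validation that script performs.

```python
import csv
import io


def sort_map_csv(text: str) -> str:
    """Sort data rows case-insensitively by first column; emit CRLF line endings."""
    rows = [r for r in csv.reader(io.StringIO(text, newline="")) if r]
    header, body = rows[0], rows[1:]
    body.sort(key=lambda r: r[0].lower())
    out = io.StringIO(newline="")  # newline="" keeps the writer's "\r\n" untouched
    csv.writer(out, lineterminator="\r\n").writerows([header] + body)
    return out.getvalue()


sample = (
    "base_reverse_dns,name,type\n"
    "netlify.app,Netlify,Web Host\n"
    "Beget.app,Beget,Web Host\n"
    "a2hosted.com,A2 Hosting,Web Host\n"
)
print(repr(sort_map_csv(sample)))
```

When working with the real file, open it with `newline=""` for both reading and writing so the `csv` module controls the line endings.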

File diff suppressed because it is too large


@@ -11,19 +11,288 @@
-tataidc.co.in
-veloxfiber.com.br
-wconect.com.br
.123hjemmeside.dk
.123hjemmeside.no
.123homepage.it
.123kotisivu.fi
.123minsida.se
.123miweb.es
.123paginaweb.pt
.123siteweb.fr
.123webseite.at
.123webseite.de
.123website.be
.123website.ch
.123website.lu
.123website.nl
.3utilities.com
.a2hosted.com
.activetrail.biz
.akadns.net
.akamai.net
.akamaiedge.net
.akamaihd.net
.akamaized.net
.alwaysdata.net
.amazonaws.com
.amplifyapp.com
.antagonist.cloud
.app-ionos.space
.apps-1and1.com
.apps-1and1.net
.appspot.com
.awsapprunner.com
.azureedge.net
.azurestaticapps.net
.azurewebsites.net
.basicserver.io
.beget.app
.begetcdn.cloud
.bounceme.net
.box.ca
.bplaced.com
.bplaced.de
.bplaced.net
.carrd.co
.cfolks.pl
.cloudaccess.net
.cloudapp.net
.cloudfront.net
.cloudfunctions.net
.cloudsite.builders
.cprapid.com
.cpserver.com
.crd.co
.customer.speedpartner.de
.cyon.link
.cyon.site
.dattorelay.com
.dattoweb.com
.ddns.net
.ddnsgeek.com
.ddnsking.com
.ddnss.de
.ddnss.org
.deltahost-ptr
.dh.bytemark.co.uk
.digitaloceanspaces.com
.dnsalias.com
.dnsalias.net
.dnsalias.org
.dnsup.net
.drayddns.com
.dreamhosters.com
.duckdns.org
.dyn-ip24.de
.dyndns.biz
.dyndns.info
.dyndns.org
.dyndns.tv
.dyndns.ws
.dyndns1.de
.dynv6.net
.e4.cz
.edgecompute.app
.edgekey.net
.edgesuite.net
.editorx.io
.elasticbeanstalk.com
.enterprisecloud.nu
.ewp.live
.fastlylb.net
.fastvps-server.com
.figma.site
.firebaseapp.com
.fly.dev
.freeddns.us
.freemyip.com
.freetls.fastly.net
.gehirn.ne.jp
.git-repos.de
.github.io
.githubusercontent.com
.gitlab.io
.goip.de
.gotdns.com
.gotdns.org
.gotpantheon.com
.hasura-app.io
.hasura.app
.hateblo.jp
.hatenablog.com
.hatenablog.jp
.hatenadiary.com
.hatenadiary.jp
.hatenadiary.org
.helioho.st
.heliohost.us
.herokuapp.com
.heyflow.page
.heyflow.site
.home-webserver.de
.homeftp.net
.homeftp.org
.homeip.net
.homelinux.net
.homelinux.org
.homesklep.pl
.homeunix.net
.homeunix.org
.hopto.me
.hopto.org
.hostedpi.com
.hosting-cluster.nl
.hostyhosting.io
.hypernode.io
.in-addr-arpa
.in-addr.arpa
.jote.cloud
.jotelulu.cloud
.jouwweb.site
.kaas.gg
.kasserver.com
.keymachine.de
.khplay.nl
.kicks-ass.net
.kicks-ass.org
.kinghost.net
.lcube-server.de
.leadpages.co
.linode.com
.linodeusercontent.com
.live-website.com
.lpages.co
.lpusercontent.com
.magentosite.cloud
.mcdir.me
.mcdir.ru
.mcpre.ru
.memset.net
.miniserver.com
.mittwald.info
.mittwaldserver.info
.mydatto.com
.mydatto.net
.mydbserver.com
.myftp.biz
.myftp.org
.myhome-server.de
.myradweb.net
.myrdbx.io
.myshopify.com
.myspreadshop.at
.myspreadshop.be
.myspreadshop.ca
.myspreadshop.ch
.myspreadshop.co.uk
.myspreadshop.com
.myspreadshop.com.au
.myspreadshop.de
.myspreadshop.dk
.myspreadshop.es
.myspreadshop.fi
.myspreadshop.fr
.myspreadshop.ie
.myspreadshop.it
.myspreadshop.net
.myspreadshop.nl
.myspreadshop.no
.myspreadshop.pl
.myspreadshop.se
.na4u.ru
.netlify.app
.nfshost.com
.nh-serv.co.uk
.nimsite.uk
.no-ip.biz
.no-ip.ca
.no-ip.co.uk
.no-ip.info
.no-ip.net
.no-ip.org
.noip.me
.noip.us
.notion.site
.now-dns.net
.now-dns.org
.now.sh
.nsupdate.info
.on-web.fr
.ondigitalocean.app
.onrender.com
.own.pm
.ownip.net
.ownprovider.com
.pantheonsite.io
.plesk.page
.podzone.net
.podzone.org
.pythonanywhere.com
.rackmaze.com
.rackmaze.net
.readthedocs-hosted.com
.readthedocs.io
.redirectme.net
.rhcloud.com
.sakura.ne.jp
.selfip.com
.selfip.net
.selfip.org
.sellfy.store
.serveblog.net
.servebolt.cloud
.servehttp.com
.serveminecraft.net
.servername.us
.service.one
.shopware.shop
.shopware.store
.simplesite.com
.simplesite.com.br
.simplesite.gr
.simplesite.pl
.site.rb-hosting.io
.snowflake.app
.square7.ch
.square7.de
.square7.net
.streamlit.app
.streamlitapp.com
.supabase.co
.supabase.in
.supabase.net
.svn-repos.de
.sytes.net
.trafficmanager.net
.typeform.com
.typo3server.info
.uber.space
.uk0.bigv.io
.user.fm
.usercontent.jp
.v0.build
.vercel.app
.vercel.dev
.vercel.run
.virtualserver.io
.vm.bytemark.co.uk
.vpndns.net
.vusercontent.net
.we.bs
.web.app
.webadorsite.com
.webflow.io
.webhosting.be
.website.one
.websitebuilder.online
.webspace-host.com
.webspaceconfig.de
.wixsite.com
.wixstudio.com
.wixstudio.io
.wpenginepowered.com
.xen.prgmr.com
.yandexcloud.net
.zap.cloud
.zapto.org
tigobusiness.com.ni