Sean Whalen 851560a9b1 Bulk reverse-DNS map coverage: top-500 ASN audit + KU re-research + curl fallback (#729)
* collect_domain_info.py: add curl fallback for blocked/broken fetches

Many sites that returned no usable homepage under the original requests
fetch turned out to be soft-failures: misconfigured TLS certs (self-signed,
hostname mismatch, weak chain), 403/captcha pages from User-Agent-based
bot filters, or redirect chains the requests stack rejected. None of those
recover under a single retry with the same client config.

This wires a curl fallback into _fetch_homepage that triggers when the
primary attempt errors or returns a non-2xx status. Curl runs with
-k (skip TLS verify), -L (follow redirects), --max-time bound, and a
real-browser User-Agent string -- enough to clear the common UA-block
and bad-cert classes of failure that small ISPs and regional telcos
routinely ship. A 2xx-with-empty-head response is left alone (parked
pages do not improve on retry). When both attempts fail, the error
column carries both signatures so it is obvious that the fallback was
tried.
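
A minimal sketch of the fallback flow described above, not the actual
_fetch_homepage code. The names fetch_with_fallback, run_curl, and
build_curl_command, the User-Agent string, and the 20-second bound are
all assumptions made for illustration; the primary fetcher is passed in
as a callable so the sketch does not depend on the requests setup.

```python
import subprocess

# Hypothetical values; the commit does not state the real ones.
BROWSER_UA = (
    "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/124.0 Safari/537.36"
)
MAX_TIME = 20  # seconds


def build_curl_command(url, max_time=MAX_TIME, user_agent=BROWSER_UA):
    """curl argv: skip TLS verify, follow redirects, bounded time, browser UA."""
    return [
        "curl", "-k", "-L", "-sS",
        "--max-time", str(max_time),
        "-A", user_agent,
        "-w", "\n%{http_code}",  # append the final status code on its own line
        url,
    ]


def run_curl(url):
    """Run curl and return (status, body, error)."""
    try:
        proc = subprocess.run(
            build_curl_command(url), capture_output=True,
            encoding="utf-8", errors="replace", timeout=MAX_TIME + 5,
        )
    except (OSError, subprocess.TimeoutExpired) as exc:
        return None, "", f"curl: {type(exc).__name__}"
    if proc.returncode != 0:
        return None, "", f"curl: exit {proc.returncode}"
    body, _, code = proc.stdout.rpartition("\n")
    status = int(code) if code.isdigit() else None
    return status, body, None


def is_2xx(status):
    return status is not None and 200 <= status < 300


def fetch_with_fallback(url, primary, fallback=run_curl):
    """Try primary(url); on error or non-2xx, try the curl fallback.

    Both callables return (status, body, error). Returns (body, error).
    """
    status, body, error = primary(url)
    if error is None and is_2xx(status):
        return body, None  # 2xx is kept as-is, even with an empty <head>
    first = error or f"primary: HTTP {status}"
    status, body, error = fallback(url)
    if error is None and is_2xx(status):
        return body, None
    second = error or f"curl: HTTP {status}"
    return "", f"{first}; {second}"  # both signatures in the error column
```

Passing the fetchers as callables keeps the decision logic testable
without network access.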

Smoke-tested against eight previously-failed cert-error domains: six
recovered full title/description (as1101.net, citictel-cpc.com,
xtrim.com.ec, etecsa.cu, zillion.network, sandia.gov), two remained
genuinely unreachable. Happy-path domains take the primary path
unchanged.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* Bulk reverse-DNS map coverage: top-500 ASN audit + KU re-research

Two passes against the bundled IPinfo Lite MMDB and the existing
known-unknown list, both classified under the two-corroborating-sources
rule (AGENTS.md):

1. Top-500 unmapped ASN-domain audit. Walked every record in
   ipinfo_lite.mmdb to find as_domain values not yet in the map,
   ranked by routed IPv4 count, took the top 500 (>= ~/15 each), and
   ran them through collect_domain_info.py. Yield: 435 new map rows
   from operators with two or more independent corroborating sources;
   65 entries to known-unknown for operators where homepage and WHOIS
   were both unavailable from the test environment. Recovered domains
   span ISPs, web hosts, IaaS/MSP/MSSP, education networks, government
   agencies, and a long tail of major industrials.

2. Full re-research of the existing 3,606-entry known-unknown file
   using the new curl fallback (separate commit). The fallback
   recovered homepage content for 1,686 of 3,670 (45.9%) previously
   dark domains. Of those, 770 had a corroborating WHOIS or as_name
   alongside; 508 cleared the strict service-category test and were
   promoted out of known-unknown into the map. The remaining 262
   recovered titles were brand-only / login-portal / under-construction
   pages where service category could not be assigned with confidence.
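
The ranking in pass 1 can be sketched roughly as follows. This is an
illustration only: rank_unmapped is a hypothetical helper, reading the
MMDB is left out, and the (cidr, as_domain) tuples are assumed to have
been extracted from ipinfo_lite.mmdb beforehand.

```python
import ipaddress
from collections import Counter


def rank_unmapped(records, already_handled, top_n=500):
    """Rank as_domain values by routed IPv4 address count.

    records: iterable of (cidr, as_domain) pairs (assumed input shape).
    already_handled: domains already in the map or known-unknown list.
    """
    totals = Counter()
    for cidr, as_domain in records:
        net = ipaddress.ip_network(cidr)
        if net.version != 4 or not as_domain:
            continue  # only IPv4 space counts toward the ranking
        domain = as_domain.lower()
        if domain in already_handled:
            continue
        totals[domain] += net.num_addresses
    return totals.most_common(top_n)


# Made-up sample data using reserved example names.
sample = [
    ("10.0.0.0/15", "a.example"),
    ("10.2.0.0/16", "b.example"),
    ("10.3.0.0/16", "a.example"),
    ("2001:db8::/32", "c.example"),
    ("192.0.2.0/24", "mapped.example"),
]
ranked = rank_unmapped(sample, {"mapped.example"})
```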

Also removed a stale "#name?" Excel auto-correction artifact from the
known-unknown file (it would never have matched any real reverse-DNS
base domain).

Cumulative result: base_reverse_dns_map.csv 3,946 -> 4,889 rows
(+943, +23.9%); known_unknown_base_reverse_dns.txt 3,606 -> 3,162
(-444 net after both batches plus the artifact). Every promotion has
two independent sources for the operator's identity and a homepage or
MMDB-as_name signal sufficient to assign a service type.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* Fix chello.sk classification: UPC, not Liberty Global

The original classification aliased chello.sk to "Liberty Global" based
on the IP-WHOIS netname (LGI-INFRASTRUCTURE) plus a stale homepage
redirect to ziggo.nl that the collector observed at fetch time. This
broke the AGENTS.md rule that IP-WHOIS only counts as a corroborating
source when the domain name matches the netname -- "chello" does not
match "LGI", so the IP-WHOIS should not have been treated as a source.

The WHOIS was unambiguous: UPC BROADBAND SLOVAKIA, s.r.o. UPC retains
its consumer brand in Slovakia (unlike Ireland, where upc.ie was
rebranded as Virgin Media Ireland in the existing map). Reverting to
the operator brand per WHOIS.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* Fix vodafone.is classification: Sýn, not Vodafone

Same pattern as the chello.sk fix in the previous commit: the historic
brand recorded in the MMDB as_name (Vodafone Iceland) is no longer the
operator. Sýn acquired Vodafone Iceland's operations and the homepage
redirects to syn.is, presenting Vodafone only as a partner relationship
rather than an active sub-brand. Following the upc.ie -> Virgin Media
Ireland precedent for rebranded markets, the canonical attribution is
the current operator.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* AGENTS.md: codify the homepage-redirect disambiguation rule

Four classification mistakes during the bulk batch (chello.sk,
vodafone.is, telia.dk, apogee.us) all came from the same gap in the
workflow: when a homepage's final URL is a different host from the
domain being classified, the right brand depends on the *relationship*
between the two domains, not on the WHOIS or as_name in isolation.

Adds a new step 6 to the unknown-domain classification workflow that
spells out the three patterns and the disambiguator:

- Acquisition / rebrand: the homepage shows the acquiring operator's
  marketing site. Use the new operator. MMDB as_name and IP-WHOIS
  netname are commonly stale for years post-acquisition; do not let
  them override an unambiguous current-operator homepage.
- Sister brand / shared infrastructure: the homepage redirects to a
  *sibling* brand under the same parent group, but the WHOIS for the
  original domain still names a *specific* current operator. Use the
  WHOIS operator, not the redirect target. Canonical cautionary tale:
  chello.sk (WHOIS: UPC BROADBAND SLOVAKIA) was originally classified
  as Liberty Global because the homepage redirected to ziggo.nl (a
  sibling Liberty Global brand). The right answer was UPC.
- TLD or subdomain variant: same operator, different domain. Trivial.

Renumbers the remaining steps. The IP-WHOIS rule (step 5) and the
two-source rule (now step 8) are unchanged but cross-referenced.
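
A rough sketch of how the three patterns resolve to a brand, under the
assumption that the relationship between the two domains has already
been judged by the person doing the classification. Relationship,
crosses_hosts, and pick_operator are hypothetical names, not part of
AGENTS.md or the collector.

```python
from enum import Enum
from urllib.parse import urlsplit


class Relationship(Enum):
    ACQUISITION = "acquisition"    # acquiring operator's marketing site
    SISTER_BRAND = "sister_brand"  # sibling brand under the same parent
    VARIANT = "variant"            # TLD or subdomain variant


def crosses_hosts(domain, final_url):
    """True when the final URL is on a host outside the classified domain."""
    host = (urlsplit(final_url).hostname or "").lower()
    domain = domain.lower()
    return not (host == domain or host.endswith("." + domain))


def pick_operator(domain, final_url, relationship,
                  whois_operator, redirect_operator):
    """Return the brand to record, or None when step 6 does not apply."""
    if not crosses_hosts(domain, final_url):
        return None  # no cross-host redirect; other steps decide
    if relationship is Relationship.SISTER_BRAND:
        return whois_operator  # WHOIS names the specific current operator
    # Acquisition: use the new operator. Variant: same operator either way.
    return redirect_operator


# Operator labels below are illustrative shorthand.
assert pick_operator("chello.sk", "https://www.ziggo.nl/",
                     Relationship.SISTER_BRAND, "UPC", "Ziggo") == "UPC"
```

The relationship itself stays a manual judgment; the sketch only
encodes which source wins once that judgment is made.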

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* Apply homepage-redirect rule to telia.dk and apogee.us

Same pattern as chello.sk and vodafone.is in earlier commits -- the
historic operator name in the MMDB as_name and WHOIS does not reflect
who actually runs the IPs after an acquisition. The homepage redirect
is the current ground truth.

- telia.dk -> Norlys: Norlys acquired Telia Denmark; homepage now
  redirects to shop.norlys.dk and presents Norlys throughout.
- apogee.us -> Boldyn: Boldyn acquired Apogee Telecom; homepage now
  redirects to boldyn.com and shows the Boldyn marketing site for
  higher-education managed services.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* Bulk reverse-DNS map coverage: next-500 unmapped ASN-domain audit

Same workflow as the first top-500 batch in this branch, applied to
the next tier of unmapped MMDB as_domain values (ranked 501..1000 by
routed IPv4 count, each ~/15 to /14.5). Pre-screened against the
current state of base_reverse_dns_map.csv and
known_unknown_base_reverse_dns.txt.

Yield: 414 newly-classified map entries + 86 known-unknown additions.
Type breakdown skews ISP-heavy as expected at this scale, with strong
representation from Education (universities now reaching deeper into
the long tail), Government (state/county/national agencies), Web Host
(regional hosting providers), and IaaS (mid-market cloud).

Applied AGENTS.md step 6 (homepage-redirect disambiguation) on every
case where the homepage's final_url crossed hosts: kept new operator
when the redirect target was an acquiring brand (e.g. atlanticmetro.net
-> 365 Data Centers, performive.com -> CloudFirst, fasternet.com.br ->
Desktop, eatel.com -> REV, blic.net -> Supernova, dimensiondata.com ->
NTT DATA, virtela.net -> NTT Communications), used WHOIS operator when
the redirect was sister-brand or shared infra, used the same operator
when the redirect was a TLD/subdomain variant.

Coverage delta: 88.89% -> 90.40% of MMDB IPv4 (+1.51 pp, ~47M IPv4).
Cumulative for this PR: 85.10% -> 90.40% (+5.30 pp, ~165M IPv4).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* Reclassify the 262 left-dark KU re-research candidates with relaxed heuristic

Of the 770 two-source candidates from the curl-fallback KU re-research
pass earlier in this branch, 262 had homepage content and a corroborating
WHOIS/as_name but were left in known-unknown because the homepage was
brand-only or a login portal that didn't directly describe service
category.

Relaxing the heuristic on a re-pass: when the WHOIS legal name itself
contains a regulated-telecom keyword (TELECOM, TELECOMUNICAÇÕES,
INTERNET, FIBRA, BROADBAND, PROVEDOR DE INTERNET, NET TELECOM), that
*is* a service-category source -- in Brazil, Argentina, Chile, and
peers, operators must register under specific legal naming and the
registration is a regulator-vetted signal. Combined with two-source
identity, that clears the bar without forcing the homepage to also
spell out the service.
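
One way the keyword check might look, as a sketch rather than the code
actually used for the re-pass. matched_keyword and can_promote are
hypothetical names; accent folding and whole-word matching are
assumptions added so that "TELECOMUNICAÇÕES" matches regardless of
encoding and "Internetwork" does not count as INTERNET.

```python
import re
import unicodedata

# Keyword list taken from the commit message above.
TELECOM_KEYWORDS = (
    "TELECOM", "TELECOMUNICAÇÕES", "INTERNET", "FIBRA",
    "BROADBAND", "PROVEDOR DE INTERNET", "NET TELECOM",
)


def _fold(text):
    """Uppercase and strip accents so matching is encoding-tolerant."""
    decomposed = unicodedata.normalize("NFKD", text)
    stripped = "".join(c for c in decomposed if not unicodedata.combining(c))
    return stripped.upper()


# Longest keywords first so multi-word phrases win over their parts.
_PATTERN = re.compile(
    r"\b(?:"
    + "|".join(
        re.escape(_fold(k))
        for k in sorted(TELECOM_KEYWORDS, key=len, reverse=True)
    )
    + r")\b"
)


def matched_keyword(legal_name):
    """Return the first keyword found in a WHOIS legal name, else None."""
    match = _PATTERN.search(_fold(legal_name))
    return match.group(0) if match else None


def can_promote(has_two_source_identity, legal_name):
    """Keyword counts only on top of an existing two-source identity."""
    return has_two_source_identity and matched_keyword(legal_name) is not None
```

For example, matched_keyword("Acme Telecomunicações Ltda") finds a
keyword, while a legal name such as "Comércio de Produtos de
Informática" does not.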

Same goes for brand-name-as-service signals: "X Server Limited" with a
customer-portal homepage and matching WHOIS reasonably maps to Web Host;
"X Fiber" + matching as_name maps to ISP. These are what readers would
naturally infer from the operator's own self-naming.

Yield: 95 promotions out of 262 (36% of the left-dark subset). The
remaining 167 stay in known-unknown because the homepage was a generic
placeholder ("Index of /", "Coming Soon", default Apache page), the
brand on the homepage didn't match the WHOIS, the operator was clearly
a non-telecom (e.g. INPASUPRI = supplies for IT, malugainfor =
Comércio de Produtos de Informática, hugel = pharma), or the service
category was genuinely ambiguous.

MMDB IPv4 delta is small (+0.03 pp, +888K IPv4) since most of these are
long-tail operators with low or zero MMDB footprint -- the value is in
PTR-side attribution coverage when these brands appear in actual
reverse-DNS reports.

Cumulative for this PR: map 4,889 -> 5,398 rows; KU 3,162 -> 3,153 lines;
MMDB IPv4 coverage 88.89% -> 90.42% (+1.53 pp from the next-500 batch
plus this re-pass).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Sean Whalen <seanthegeek@users.noreply.github.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-26 15:15:32 -04:00

parsedmarc

A screenshot of DMARC summary charts in Kibana

parsedmarc is a Python module and CLI utility for parsing DMARC reports. When used with Elasticsearch and Kibana (or Splunk), it works as a self-hosted open-source alternative to commercial DMARC report processing services such as Agari Brand Protection, Dmarcian, OnDMARC, ProofPoint Email Fraud Defense, and Valimail.

Note

Domain-based Message Authentication, Reporting, and Conformance (DMARC) is an email authentication protocol.

Sponsors

This project is maintained by one developer. Please consider sponsoring my work if you or your organization benefit from it.

Features

  • Parses draft and 1.0 standard aggregate/rua DMARC reports
  • Parses forensic/failure/ruf DMARC reports
  • Parses reports from SMTP TLS Reporting
  • Can parse reports from an inbox over IMAP, Microsoft Graph, or Gmail API
  • Transparently handles gzip or zip compressed reports
  • Consistent data structures
  • Simple JSON and/or CSV output
  • Optionally email the results
  • Optionally send the results to Elasticsearch, OpenSearch, and/or Splunk, for use with premade dashboards
  • Optionally send reports to Apache Kafka

Python Compatibility

This project supports the following Python versions, which are either actively maintained or are the default versions for RHEL or Debian.

Version | Supported | Reason
------- | --------- | ------
< 3.6   | No        | End of Life (EOL)
3.6     | No        | Used in RHEL 8, but not supported by project dependencies
3.7     | No        | End of Life (EOL)
3.8     | No        | End of Life (EOL)
3.9     | No        | Used in Debian 11 and RHEL 9, but not supported by project dependencies
3.10    | Yes       | Actively maintained
3.11    | Yes       | Actively maintained; supported until June 2028 (Debian 12)
3.12    | Yes       | Actively maintained; supported until May 2035 (RHEL 10)
3.13    | Yes       | Actively maintained; supported until June 2030 (Debian 13)
3.14    | Yes       | Supported (requires imapclient>=3.1.0)
Readme Apache-2.0 119 MiB