Sean Whalen e681da2f35 Run --use-search-fallback against 10,544 bot-blocked KU rows; +474 promotions
Also expands the search-fallback trigger regex to recognize self-signed
TLS interception (firewall block via cert) and a wider class of
local-firewall block-page strings.

Mechanics

1. Identified 10,544 KU rows from the 34,647-row prior TSV that looked
   bot-blocked (via the new `_looks_bot_blocked` detector).
2. Ran `collect_domain_info.py --use-search-fallback` against just
   those rows. Throughput was ~3.4 rows/sec at 32 workers / 3s HTTP
   timeout / 5s WHOIS timeout. ~50 min wall time.
3. Audited the resulting TSV and discovered 2,078 rows whose homepage
   fetch had silently returned a corporate firewall's block page
   (Fortinet "Web Filter Violation" being the most common, 1,419 of
   them). The original `_SEARCH_FALLBACK_TRIGGER_RE` didn't recognize
   those strings, so search-fallback wasn't firing — the firewall's
   block-page text was being fed to the classifier as if it were the
   operator's homepage. Almost no false promotions resulted (block-page
   text doesn't match industry detectors), but the rows weren't
   recovering either.
4. Expanded the trigger regex to catch web-filter block pages, then
   re-fetched just the 2,078 affected rows.
5. Final classifier pass: 474 unambiguous map adds, 41 ambiguous, 1
   silently dropped (adult content), 10,066 still in KU.
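
The trigger expansion in step 4 can be sketched as below. This is a minimal illustration, not the contents of `_SEARCH_FALLBACK_TRIGGER_RE`: the pattern list and the `needs_search_fallback` helper are hypothetical, and only the Fortinet "Web Filter Violation" string comes from the audit in step 3.

```python
import re

# Hypothetical pattern list for illustration only; the real trigger regex
# in collect_domain_info.py is not reproduced here.
BLOCK_PAGE_RE = re.compile(
    r"web filter violation"   # Fortinet block page (named in the audit)
    r"|attention required"    # assumed: challenge-page title wording
    r"|access denied"         # assumed: generic block-page wording
    r"|just a moment",        # assumed: interstitial challenge title
    re.IGNORECASE,
)


def needs_search_fallback(title: str, description: str) -> bool:
    """Return True when fetched metadata is empty or looks like a block page."""
    text = f"{title} {description}".strip()
    if not text:
        # Nothing usable came back, so search-fallback should run.
        return True
    return BLOCK_PAGE_RE.search(text) is not None
```

With a check like this, a row whose title is "Web Filter Violation" is routed to search-fallback instead of being passed to the classifier as if it were the operator's homepage.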

Self-signed-cert detection

A separate fix lands in this commit: when the primary fetch fails with
an SSL cert verification error matching "self-signed certificate", the
collector skips the verify=False browser fallback. Rationale: TLS-
intercepting firewalls (corporate or personal-network) present their
own self-signed cert specifically when blocking. The verify=False
fallback would happily retrieve the firewall's block page, which then
poisons the row's title/description. Skipping that path leaves the
row's metadata empty so search-fallback can recover real content.
Other cert errors (hostname mismatch, weak DH, legacy renegotiation)
keep the existing fallback path because they're typically real
operators with misconfigured TLS rather than firewall interception.
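
A minimal sketch of that decision, assuming the collector has the SSL error message as a string. The helper name `allow_insecure_retry` is hypothetical, and accepting both the hyphenated and unhyphenated spellings is an assumption about how different OpenSSL versions word the error.

```python
import re

# Assumption: OpenSSL words this error as "self signed certificate" (1.1.x)
# or "self-signed certificate" (3.x), so the sketch accepts both spellings.
SELF_SIGNED_RE = re.compile(r"self[- ]signed certificate", re.IGNORECASE)


def allow_insecure_retry(ssl_error_message: str) -> bool:
    """Decide whether a verify=False retry is worth attempting.

    Self-signed certs are treated as likely TLS interception, so the retry
    is skipped and the row's metadata stays empty for search-fallback.
    Every other cert error keeps the existing fallback path.
    """
    return SELF_SIGNED_RE.search(ssl_error_message) is None
```

Note that this pattern would also match "self-signed certificate in certificate chain"; whether that case should skip the fallback too is a policy choice the sketch does not settle.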

Numbers

  Map:  37,640 → 38,114 (+474)
  KU:   32,324 → 31,886 (−438)

  Disjoint check: 0 shared keys
  Unknown CSV: regenerated, just the header
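
The deltas and the disjoint check can be reproduced with a few lines. This is a sketch only: `check_disjoint` is a hypothetical helper, and how the map and KU keys are loaded from their files is left out.

```python
def check_disjoint(map_keys: set[str], ku_keys: set[str]) -> set[str]:
    """Return keys present in both files; an empty set means the check passes."""
    return map_keys & ku_keys


# Row counts taken from the Numbers section above.
map_before, map_after = 37_640, 38_114
ku_before, ku_after = 32_324, 31_886

map_delta = map_after - map_before  # rows added to the map
ku_delta = ku_after - ku_before     # net change in KU rows
```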

Type distribution of the 474 promotions

  162  ISP                  17  MSP                 4   MSSP / Marketing
   72  Web Host             16  Technology          4   Beauty / Agriculture
   41  Finance              14  Healthcare          3   IaaS / Science / Legal
   19  Government           11  Travel              2   Search / Religion / SaaS
   10  Logistics            8   Manufacturing       2   Email Sec / Email Provider
    9  Education / Retail   8   News                2   Entertainment
    7  Utilities / Phys Sec 6   Real Estate         1   Auto / Staff / PaaS
                            6   Food / Consulting / Industrial / Conglomerate / Nonprofit

Most of the gains are network operators (162 ISPs, 72 Web Hosts) —
the population that's most likely to be Cloudflare-walled or DDoS-
Guard-walled at the homepage layer but show up clearly in DDG
abstracts.

Smoke audit on a 30-row random sample of map adds: 28 plausible, 2
borderline (`es.graphicpkg.com → Food` could also be Industrial since
Graphic Packaging makes packaging *for* the food industry, but the
vertically-specialized rule applies; `annuairesante.ameli.fr` →
Finance via French health-insurance vocabulary, defensible). The 41
ambiguous rows stay in KU per the established workflow — they need
the same one-row-at-a-time human triage as PR #766 used.
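
A repeatable way to draw such a sample, as a sketch: `smoke_sample` and its fixed seed are assumptions for illustration, not how the sample above was drawn.

```python
import random


def smoke_sample(rows: list[str], k: int = 30, seed: int = 0) -> list[str]:
    """Pick up to k distinct rows uniformly at random.

    A fixed seed makes the audit repeatable, so a reviewer can re-draw
    the same rows later.
    """
    rng = random.Random(seed)
    return rng.sample(rows, min(k, len(rows)))
```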

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-08 01:55:10 -04:00

parsedmarc


A screenshot of DMARC summary charts in Kibana

parsedmarc is a Python module and CLI utility for parsing DMARC reports. When used with Elasticsearch and Kibana (or Splunk), it works as a self-hosted open-source alternative to commercial DMARC report processing services such as Agari Brand Protection, Dmarcian, OnDMARC, ProofPoint Email Fraud Defense, and Valimail.

Note

Domain-based Message Authentication, Reporting, and Conformance (DMARC) is an email authentication protocol.

Sponsors

This project is maintained by one developer. Please consider sponsoring my work if you or your organization benefit from it.

Features

  • Parses draft and 1.0 standard aggregate/rua DMARC reports
  • Parses forensic/failure/ruf DMARC reports
  • Parses reports from SMTP TLS Reporting
  • Can parse reports from an inbox over IMAP, Microsoft Graph, or Gmail API
  • Transparently handles gzip or zip compressed reports
  • Consistent data structures
  • Simple JSON and/or CSV output
  • Optionally email the results
  • Optionally send the results to Elasticsearch, OpenSearch, and/or Splunk, for use with premade dashboards
  • Optionally send reports to Apache Kafka
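
To illustrate what transparent handling of compressed reports involves, here is a minimal standard-library sketch. It is not parsedmarc's own implementation: the `extract_report_bytes` helper is hypothetical, and it assumes a zip archive holds a single report file.

```python
import gzip
import io
import zipfile

GZIP_MAGIC = b"\x1f\x8b"     # first two bytes of a gzip stream
ZIP_MAGIC = b"PK\x03\x04"    # local file header of a non-empty zip archive


def extract_report_bytes(payload: bytes) -> bytes:
    """Return the raw report, decompressing gzip or zip input when detected."""
    if payload.startswith(GZIP_MAGIC):
        return gzip.decompress(payload)
    if payload.startswith(ZIP_MAGIC):
        with zipfile.ZipFile(io.BytesIO(payload)) as archive:
            # Assumption: the archive holds exactly one report file.
            return archive.read(archive.namelist()[0])
    # Anything else is passed through as an uncompressed report.
    return payload
```

Detecting the format from the leading bytes rather than the file extension means the caller can hand over an email attachment without knowing how the sender packaged it.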

Python Compatibility

This project supports the following Python versions, which are either actively maintained or are the default versions for RHEL or Debian.

Version  Supported  Reason
< 3.6    No         End of Life (EOL)
3.6      No         Used in RHEL 8, but not supported by project dependencies
3.7      No         End of Life (EOL)
3.8      No         End of Life (EOL)
3.9      No         Used in Debian 11 and RHEL 9, but not supported by project dependencies
3.10     Yes        Actively maintained
3.11     Yes        Actively maintained; supported until June 2028 (Debian 12)
3.12     Yes        Actively maintained; supported until May 2035 (RHEL 10)
3.13     Yes        Actively maintained; supported until June 2030 (Debian 13)
3.14     Yes        Supported (requires imapclient>=3.1.0)
Readme Apache-2.0 206 MiB