mirror of https://github.com/domainaware/parsedmarc.git synced 2026-06-11 04:59:43 +00:00

Sean Whalen e681da2f35 Run --use-search-fallback against 10,544 bot-blocked KU rows; +473 promotions

Also expands the search-fallback trigger regex to recognize self-signed
TLS interception (firewall block via cert) and a wider class of
local-firewall block-page strings.

Mechanics

1. Identified 10,544 KU rows from the 34,647-row prior TSV that looked
   bot-blocked (via the new `_looks_bot_blocked` detector).
2. Ran `collect_domain_info.py --use-search-fallback` against just
   those rows. Throughput was ~3.4 rows/sec at 32 workers / 3s HTTP
   timeout / 5s WHOIS timeout. ~50 min wall time.
3. Audited the resulting TSV and discovered 2,078 rows whose homepage
   fetch had silently returned a corporate firewall's block page
   (Fortinet "Web Filter Violation" being the most common, 1,419 of
   them). The original `_SEARCH_FALLBACK_TRIGGER_RE` didn't recognize
   those strings, so search-fallback wasn't firing — the firewall's
   block-page text was being fed to the classifier as if it were the
   operator's homepage. Almost no false promotions resulted (block-page
   text doesn't match industry detectors), but the rows weren't
   recovering either.
4. Expanded the trigger regex to catch web-filter block pages, then
   re-fetched just the 2,078 affected rows.
5. Final classifier pass: 474 unambiguous map adds, 41 ambiguous, 1
   silently dropped (adult content), 10,066 still in KU.
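
The trigger expansion in step 4 might look roughly like the sketch below. This is a minimal illustration, not the actual `_SEARCH_FALLBACK_TRIGGER_RE`: only the "Web Filter Violation" string comes from this commit message, and the other patterns, the constant name, and the `needs_search_fallback` helper are hypothetical.

```python
import re

# Hypothetical patterns; the real _SEARCH_FALLBACK_TRIGGER_RE is not shown here.
BLOCK_PAGE_TRIGGER_RE = re.compile(
    r"web filter violation"  # Fortinet block page named in the commit message
    r"|access denied"  # assumed generic block-page wording
    r"|this (?:site|page|website) (?:is|has been) blocked",  # assumed wording
    re.IGNORECASE,
)


def needs_search_fallback(title: str, description: str) -> bool:
    """Return True when fetched metadata is empty or looks like a block page."""
    text = f"{title} {description}".strip()
    if not text:
        return True  # nothing was fetched, so there is nothing to classify
    return BLOCK_PAGE_TRIGGER_RE.search(text) is not None
```

With a check like this, a row whose title is a firewall's block-page heading is routed to search-fallback instead of being classified on the block-page text.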

Self-signed-cert detection

A separate fix lands in this commit: when the primary fetch fails with
an SSL cert verification error matching "self-signed certificate", the
collector skips the verify=False browser fallback. Rationale: TLS-
intercepting firewalls (corporate or personal-network) present their
own self-signed cert specifically when blocking. The verify=False
fallback would happily retrieve the firewall's block page, which then
poisons the row's title/description. Skipping that path leaves the
row's metadata empty so search-fallback can recover real content.
Other cert errors (hostname mismatch, weak DH, legacy renegotiation)
keep the existing fallback path because they're typically real
operators with misconfigured TLS rather than firewall interception.

Numbers

  Map:  37,640 → 38,114 (+474)
  KU:   32,324 → 31,886 (−438)

  Disjoint check: 0 shared keys
  Unknown CSV: regenerated, just the header
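
The disjoint check amounts to a set intersection over the two key columns. A minimal sketch, assuming keys are compared case-insensitively after trimming whitespace (the `shared_keys` helper and that normalization are illustrative, not taken from the repository):

```python
def shared_keys(map_keys, ku_keys):
    """Return the sorted keys that appear in both collections."""

    def normalize(key):
        # Assumed normalization: trim whitespace and lowercase.
        return key.strip().lower()

    return sorted({normalize(k) for k in map_keys} & {normalize(k) for k in ku_keys})


# Deltas quoted above, re-derived from the before/after totals.
assert 38_114 - 37_640 == 474
assert 31_886 - 32_324 == -438
```

An empty result corresponds to the "0 shared keys" line above.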

Type distribution of the 474 promotions

  162  ISP                   17  MSP            4  MSSP / Marketing
   72  Web Host              16  Technology     4  Beauty / Agriculture
   41  Finance               14  Healthcare     3  IaaS / Science / Legal
   19  Government            11  Travel         2  Search / Religion / SaaS
   10  Logistics              8  Manufacturing  2  Email Sec / Email Provider
    9  Education / Retail     8  News           2  Entertainment
    7  Utilities / Phys Sec   6  Real Estate    1  Auto / Staff / PaaS
                              6  Food / Consulting / Industrial / Conglomerate / Nonprofit

Most of the gains are network operators (162 ISPs, 72 Web Hosts) —
the population that's most likely to be Cloudflare-walled or DDoS-
Guard-walled at the homepage layer but show up clearly in DDG
abstracts.

Smoke audit on a 30-row random sample of map adds: 28 plausible, 2
borderline. `es.graphicpkg.com` → Food could also be Industrial, since
Graphic Packaging makes packaging *for* the food industry, but the
vertically-specialized rule applies. `annuairesante.ameli.fr` →
Finance matched on French health-insurance vocabulary and is
defensible. The 41 ambiguous rows stay in KU per the established
workflow: they need the same one-row-at-a-time human triage that
PR #766 used.
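
Drawing the audit sample can be sketched as below. This is an illustration only: the `sample_for_audit` helper and its optional seed are hypothetical, and the commit does not say how the 30 rows were actually picked.

```python
import random


def sample_for_audit(rows, size=30, seed=None):
    """Return up to `size` distinct rows chosen uniformly at random."""
    rng = random.Random(seed)  # a fixed seed makes the audit repeatable
    rows = list(rows)
    return rng.sample(rows, min(size, len(rows)))
```

Passing a seed lets a second reviewer re-draw the same rows when checking the audit.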

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

2026-05-08 01:55:10 -04:00

Name	Last commit	Date
.claude	SIGHUP-based configuration reload for watch mode (#697)	2026-03-21 16:14:48 -04:00
.github	Align Kibana dashboards with OpenSearch Dashboards source-of-truth (#737)	2026-04-27 01:30:48 -04:00
.vscode	Update dashboard documentation	2026-05-03 12:36:06 -04:00
dashboards	Fix splunk SMTP TLS dashboard: add additional renames for failure details and adjust stats query	2026-05-03 19:58:29 -04:00
docs	docs: update installation instructions for IPinfo Lite and MaxMind GeoLite2 databases	2026-05-04 18:52:18 -04:00
parsedmarc	Run --use-search-fallback against 10,544 bot-blocked KU rows; +473 promotions	2026-05-08 01:55:10 -04:00
samples	Add example google SMTP-TLS report email	2024-09-04 20:03:51 -04:00
.dockerignore	Add Dockerfile & build/push task (#316)	2022-05-05 21:06:38 -04:00
.gitattributes	Add additional samples and ensure git does not touch CRLF (#456)	2024-01-02 16:29:06 -05:00
.gitignore	9.7.0 (#709)	2026-04-19 21:20:41 -04:00
AGENTS.md	collect/classify: link-following + alias map rows for placeholder DDG titles	2026-05-08 00:26:38 -04:00
build.sh	Format on build	2025-12-12 15:56:52 -05:00
CHANGELOG.md	Bump mailsuite to >=2.0.2 for 9.11.1 release (#743)	2026-04-30 11:59:11 -04:00
ci.ini	Skip DNS lookups in GitHub Actions to prevent test timeouts (#657)	2026-02-18 18:19:28 -05:00
CLAUDE.md	Add AGENTS.md for AI agent guidance and link from CLAUDE.md	2026-03-03 21:00:55 -05:00
codecov.yml	Tune Codecov statuses for small PRs (#678)	2026-03-09 17:43:34 -04:00
CONTRIBUTING.md	Add contributing guide (#685)	2026-03-09 18:16:47 -04:00
dashboard-dev-bootstrap.sh	Align Kibana dashboards with OpenSearch Dashboards source-of-truth (#737)	2026-04-27 01:30:48 -04:00
docker-compose.dashboard-dev.yml	9.4.0	2026-03-23 17:08:26 -04:00
docker-compose.yml	Update OpenSearch healthcheck to use HTTPS and include authentication	2026-03-16 17:53:37 -04:00
Dockerfile	Updated default python docker base image to 3.13-slim (#618)	2025-10-29 22:34:06 -04:00
LICENSE	First commit	2018-02-05 20:23:07 -05:00
publish-docs.sh	Add publish-docs.sh	2022-10-04 18:45:57 -04:00
pyproject.toml	collect_domain_info.py: opt-in DuckDuckGo search fallback for bot-blocked rows	2026-05-08 00:14:19 -04:00
README.md	Update sponsorship section in README and documentation	2026-04-04 22:14:38 -04:00
SECURITY.md	Add security policy (#688)	2026-03-09 18:24:16 -04:00
tests.py	Offload mailbox layer to mailsuite>=2.0.0 (#741)	2026-04-28 00:58:36 -04:00

README.md

parsedmarc

parsedmarc is a Python module and CLI utility for parsing DMARC reports. When used with Elasticsearch and Kibana (or Splunk), it works as a self-hosted open-source alternative to commercial DMARC report processing services such as Agari Brand Protection, Dmarcian, OnDMARC, ProofPoint Email Fraud Defense, and Valimail.

Note

Domain-based Message Authentication, Reporting, and Conformance (DMARC) is an email authentication protocol.

Features

Parses draft and 1.0 standard aggregate/rua DMARC reports
Parses forensic/failure/ruf DMARC reports
Parses reports from SMTP TLS Reporting
Can parse reports from an inbox over IMAP, Microsoft Graph, or Gmail API
Transparently handles gzip or zip compressed reports
Consistent data structures
Simple JSON and/or CSV output
Optionally email the results
Optionally send the results to Elasticsearch, OpenSearch, and/or Splunk, for use with premade dashboards
Optionally send reports to Apache Kafka
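
As an illustration of what transparent handling of compressed reports can mean in practice, the sketch below inspects a payload's leading magic bytes and decompresses accordingly. It uses only the standard library and is not parsedmarc's own code; the `extract_report_bytes` name and the choice to read only the first zip member are assumptions.

```python
import gzip
import io
import zipfile


def extract_report_bytes(payload: bytes) -> bytes:
    """Return the decompressed report, or the payload unchanged if it is plain."""
    if payload[:2] == b"\x1f\x8b":  # gzip magic number
        return gzip.decompress(payload)
    if payload[:4] == b"PK\x03\x04":  # zip local file header
        with zipfile.ZipFile(io.BytesIO(payload)) as archive:
            # Assumes the first member of the archive is the report.
            return archive.read(archive.namelist()[0])
    return payload
```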

Python Compatibility

This project supports the following Python versions, which are either actively maintained or are the default versions for RHEL or Debian.

Version	Supported	Reason
< 3.6	❌	End of Life (EOL)
3.6	❌	Used in RHEL 8, but not supported by project dependencies
3.7	❌	End of Life (EOL)
3.8	❌	End of Life (EOL)
3.9	❌	Used in Debian 11 and RHEL 9, but not supported by project dependencies
3.10	✅	Actively maintained
3.11	✅	Actively maintained; supported until June 2028 (Debian 12)
3.12	✅	Actively maintained; supported until May 2035 (RHEL 10)
3.13	✅	Actively maintained; supported until June 2030 (Debian 13)
3.14	✅	Supported (requires `imapclient>=3.1.0`)
