mirror of https://github.com/domainaware/parsedmarc.git synced 2026-06-11 13:09:44 +00:00


Sean Whalen 3839cfff6f collect_domain_info.py: opt-in DuckDuckGo search fallback for bot-blocked rows

A meaningful share of KU domains return a Cloudflare / DDoS-Guard / "Are
you a robot?" / px-captcha interstitial instead of real homepage content
— even after the curl-style relaxed-TLS fallback runs. For those rows we
have neither homepage signal nor (often) a usable as_name, and they fall
through to KU even though the operator is a real (often well-known)
business that the classifier could trivially handle if it could just see
the page.

Added an opt-in `--use-search-fallback` flag that asks DuckDuckGo for
`site:<domain>` when the homepage fetch returned a bot-block / parking /
empty result, and uses the top result's title and description (only if
the result host belongs to the input domain — anti-SEO-spam guard).

Mechanism

- New optional `ddgs` dependency, listed under the `[build]` extras.
  `from ddgs import DDGS` is wrapped in a try/except — the script runs
  without ddgs installed as long as `--use-search-fallback` isn't
  passed; the flag check exits with a helpful install message
  otherwise.
- `_SEARCH_FALLBACK_TRIGGER_RE` — title/description patterns that look
  like a bot-block / WAF interstitial / parked / placeholder. Triggers
  the fallback. Same shape as the classifier's TITLE_NOISE_RE /
  PARKED_PAGE_RE; the search fallback is the recovery path for
  exactly the rows that filter excludes.
- `_looks_bot_blocked()` — combined check: trigger regex matches OR
  title and description are both empty (typical of WAF interstitials
  that strip <title>/<meta> entirely).
- `_hosts_match()` — same-domain SEO-spam guard. A search result is
  accepted only when its host is exactly the input domain or a
  subdomain of it. Third-party SEO-spam pages that scraped the domain
  name are silently skipped.
- `_search_fallback_fetch()` — runs `site:<domain>` through DDG, walks
  results in rank order, returns the first one whose host passes the
  guard. Returns empty if no result matches (caller leaves the row's
  homepage data alone in that case).
- `_collect_one()` now takes a `use_search_fallback` flag, calls the
  fallback after the homepage fetch when the homepage looks
  bot-blocked, and writes `title_source = "homepage"` or
  `"search"` so reviewers can audit which rows came from where.
- New `title_source` column in the TSV.
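
The optional import and the bot-block check described above might look roughly like the sketch below. It is not the code from `collect_domain_info.py`: the trigger patterns, the un-prefixed function names, and the wording of the install message are all assumptions made for illustration. Only the `ddgs` package name and the "regex match OR both fields empty" rule come from the list above.

```python
import re

# Optional dependency: a failed import is fine until the flag is actually used.
try:
    from ddgs import DDGS
except ImportError:
    DDGS = None

# Illustrative patterns only (assumption); the real
# _SEARCH_FALLBACK_TRIGGER_RE is not shown in this commit message.
TRIGGER_RE = re.compile(
    r"just a moment|attention required|are you a robot|access denied"
    r"|ddos-guard|domain (is )?for sale|coming soon",
    re.IGNORECASE,
)


def looks_bot_blocked(title, description):
    """True if the trigger regex matches or both fields are empty."""
    title = (title or "").strip()
    description = (description or "").strip()
    if not title and not description:
        # WAF interstitials often strip <title>/<meta> entirely.
        return True
    return bool(TRIGGER_RE.search(title) or TRIGGER_RE.search(description))


def require_search_backend(use_search_fallback):
    """Exit with an install hint only when the flag needs the missing package."""
    if use_search_fallback and DDGS is None:
        raise SystemExit(
            "--use-search-fallback requires the optional 'ddgs' package "
            "(for example: pip install ddgs)"
        )
```

Keeping the availability check separate from the import means rows are still processed normally when the flag is off, which matches the opt-in behaviour described above.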

Smoke test

Test set: bbc.com (real homepage, no fallback expected) plus 5 known
Cloudflare-walled rows (1800contacts.com, americaneagle.com,
broadwaytechnology.com, health.gov.il, mfa.gov.il).

Result: bbc.com classified via homepage; the other 5 all recovered
title + description via search and got `title_source=search`. The
same-domain guard was validated independently: for
broadwaytechnology.com it correctly rejects bloomberg.com and accepts
support.broadwaytechnology.com (Broadway was acquired by Bloomberg, but
the search fallback returns the broadway-domain snippet, not the
parent's bloomberg.com product page).
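
A minimal sketch of the same-domain guard and the rank-order walk, using the broadwaytechnology.com case as input. The function names, the `href` / `title` / `body` result keys, and the placeholder result list are assumptions for illustration; no search is performed and the snippets are not real search output.

```python
from urllib.parse import urlsplit


def hosts_match(result_url, domain):
    """Accept only the input domain itself or one of its subdomains."""
    host = (urlsplit(result_url).hostname or "").lower().rstrip(".")
    domain = domain.lower().rstrip(".")
    # The leading dot stops "notexample.com" from matching "example.com".
    return host == domain or host.endswith("." + domain)


def first_same_domain_result(results, domain):
    """Walk results in rank order; return the first that passes the guard."""
    for result in results:
        if hosts_match(result.get("href", ""), domain):
            return result
    return {}  # caller leaves the row's homepage data alone


# Placeholder results (hypothetical), ordered by rank.
results = [
    {"href": "https://www.bloomberg.com/", "title": "placeholder", "body": ""},
    {"href": "https://support.broadwaytechnology.com/", "title": "placeholder", "body": ""},
]
picked = first_same_domain_result(results, "broadwaytechnology.com")
print(picked.get("href"))
```

Comparing against `"." + domain` rather than using a plain suffix test is what keeps look-alike hosts from slipping past the guard.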

Caveats codified in AGENTS.md

- Search snippets are still untrusted text (data-not-instructions rule
  applies the same way it does to homepage HTML).
- DDG's index can lag a homepage rebrand by months — when a row
  classified via `title_source=search` disagrees with a fresh manual
  fetch, prefer the manual verification. The fallback is a recovery
  aid, not a tiebreaker against fresh content.
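
For the audit step, one way to pull out the rows that deserve a manual re-check is to filter the TSV on `title_source`. This is only a sketch: the `title_source` column comes from the commit message, while the `domain` and `title` column names and the sample rows are assumptions.

```python
import csv
import io


def rows_from_search(tsv_text):
    """Return the rows whose title came from the search fallback."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    return [row for row in reader if row.get("title_source") == "search"]


# Hypothetical sample data; real output has more columns.
sample = (
    "domain\ttitle\ttitle_source\n"
    "example.com\tExample\thomepage\n"
    "example.org\tExample Org\tsearch\n"
)
flagged = rows_from_search(sample)
for row in flagged:
    print(row["domain"], row["title"])
```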

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

2026-05-08 00:14:19 -04:00

.claude

SIGHUP-based configuration reload for watch mode (#697)

2026-03-21 16:14:48 -04:00

.github

Align Kibana dashboards with OpenSearch Dashboards source-of-truth (#737)

2026-04-27 01:30:48 -04:00

.vscode

Update dashboard documentation

2026-05-03 12:36:06 -04:00

dashboards

Fix splunk SMTP TLS dashboard: add additional renames for failure details and adjust stats query

2026-05-03 19:58:29 -04:00

docs

docs: update installation instructions for IPinfo Lite and MaxMind GeoLite2 databases

2026-05-04 18:52:18 -04:00

parsedmarc

collect_domain_info.py: opt-in DuckDuckGo search fallback for bot-blocked rows

2026-05-08 00:14:19 -04:00

samples

Add example google SMTP-TLS report email

2024-09-04 20:03:51 -04:00

.dockerignore

Add Dockerfile & build/push task (#316)

2022-05-05 21:06:38 -04:00

.gitattributes

Add additional samples and ensure git does not touch CRLF (#456)

2024-01-02 16:29:06 -05:00

.gitignore

9.7.0 (#709)

2026-04-19 21:20:41 -04:00

AGENTS.md

collect_domain_info.py: opt-in DuckDuckGo search fallback for bot-blocked rows

2026-05-08 00:14:19 -04:00

build.sh

Format on build

2025-12-12 15:56:52 -05:00

CHANGELOG.md

Bump mailsuite to >=2.0.2 for 9.11.1 release (#743)

2026-04-30 11:59:11 -04:00

ci.ini

Skip DNS lookups in GitHub Actions to prevent test timeouts (#657)

2026-02-18 18:19:28 -05:00

CLAUDE.md

Add AGENTS.md for AI agent guidance and link from CLAUDE.md

2026-03-03 21:00:55 -05:00

codecov.yml

Tune Codecov statuses for small PRs (#678)

2026-03-09 17:43:34 -04:00

CONTRIBUTING.md

Add contributing guide (#685)

2026-03-09 18:16:47 -04:00

dashboard-dev-bootstrap.sh

Align Kibana dashboards with OpenSearch Dashboards source-of-truth (#737)

2026-04-27 01:30:48 -04:00

docker-compose.dashboard-dev.yml

9.4.0

2026-03-23 17:08:26 -04:00

docker-compose.yml

Update OpenSearch healthcheck to use HTTPS and include authentication

2026-03-16 17:53:37 -04:00

Dockerfile

Updated default python docker base image to 3.13-slim (#618)

2025-10-29 22:34:06 -04:00

LICENSE

First commit

2018-02-05 20:23:07 -05:00

publish-docs.sh

Add publish-docs.sh

2022-10-04 18:45:57 -04:00

pyproject.toml

collect_domain_info.py: opt-in DuckDuckGo search fallback for bot-blocked rows

2026-05-08 00:14:19 -04:00

README.md

Update sponsorship section in README and documentation

2026-04-04 22:14:38 -04:00

SECURITY.md

Add security policy (#688)

2026-03-09 18:24:16 -04:00

tests.py

Offload mailbox layer to mailsuite>=2.0.0 (#741)

2026-04-28 00:58:36 -04:00

README.md

parsedmarc

parsedmarc is a Python module and CLI utility for parsing DMARC reports. When used with Elasticsearch and Kibana (or Splunk), it works as a self-hosted open-source alternative to commercial DMARC report processing services such as Agari Brand Protection, Dmarcian, OnDMARC, Proofpoint Email Fraud Defense, and Valimail.

Note

Domain-based Message Authentication, Reporting, and Conformance (DMARC) is an email authentication protocol.

Features

- Parses draft and 1.0 standard aggregate/rua DMARC reports
- Parses forensic/failure/ruf DMARC reports
- Parses reports from SMTP TLS Reporting
- Can parse reports from an inbox over IMAP, Microsoft Graph, or Gmail API
- Transparently handles gzip or zip compressed reports
- Consistent data structures
- Simple JSON and/or CSV output
- Optionally email the results
- Optionally send the results to Elasticsearch, OpenSearch, and/or Splunk, for use with premade dashboards
- Optionally send reports to Apache Kafka

Python Compatibility

This project supports the following Python versions, which are either actively maintained or are the default versions for RHEL or Debian.

| Version | Supported | Reason |
|---------|-----------|--------|
| < 3.6 | ❌ | End of Life (EOL) |
| 3.6 | ❌ | Used in RHEL 8, but not supported by project dependencies |
| 3.7 | ❌ | End of Life (EOL) |
| 3.8 | ❌ | End of Life (EOL) |
| 3.9 | ❌ | Used in Debian 11 and RHEL 9, but not supported by project dependencies |
| 3.10 | ✅ | Actively maintained |
| 3.11 | ✅ | Actively maintained; supported until June 2028 (Debian 12) |
| 3.12 | ✅ | Actively maintained; supported until May 2035 (RHEL 10) |
| 3.13 | ✅ | Actively maintained; supported until June 2030 (Debian 13) |
| 3.14 | ✅ | Supported (requires `imapclient>=3.1.0`) |
