mirror of https://github.com/domainaware/parsedmarc.git synced 2026-06-11 04:59:43 +00:00

Sean Whalen c431a9cd99 collect/classify: link-following + alias map rows for placeholder DDG titles

When the search fallback ran on the original 6-domain smoke set, two of
the recovered titles were essentially placeholder pointers carrying no
classifier signal — DDG returned `Link to fcs.health.gov.il` for one
input and a bare `yangon.mfa.gov.il` for another. Those snippets are
DDG's way of saying "I have an indexed subdomain but no real abstract
to give you", and feeding them to the regex classifier produces no
better signal than the parking-page result we were already trying to
recover from.

This commit teaches the collector to recognize both placeholder shapes,
follow the pointer to the target hostname, and use *that* hostname's
real content for the row. The classifier then emits the original input
and the link target as **two map rows under the same (name, type)** so
both keys are looked up against future DMARC reports.

collect_domain_info.py
- New `_LINK_TO_TITLE_RE` / `_BARE_HOSTNAME_RE` and an
  `_extract_link_target` helper that returns the target hostname when
  the search title is `Link to <hostname>` or a bare hostname, and ""
  when the title carries real content.
- After the search-fallback path, if the title looks like a pointer
  and the target differs from the input, `_fetch_homepage(target)` is
  called once. When the target's fetch returns real (non-bot-blocked)
  content, the row's title / description / final_url / rebrand_signal
  / external_links are replaced with the target's, and `title_source`
  becomes `search→<target>` so reviewers can audit the path.
- New `link_target_domain` column records the followed target whether
  or not its fetch succeeded.

classify_unknown_domains.py
- When a row's `link_target_domain` is set and differs from the input
  domain, the classifier emits a second map row for the target with
  the same `(name, type)`. The input domain and the target DDG pointed
  us at both end up in the map as aliases. The same handling applies
  on the ambiguous-bucket path, so a single human adjudication covers
  both.
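
The alias emission can be sketched as follows. This is an illustration under assumptions, not the classifier's code: `emit_map_rows` and the `domain` row key are hypothetical names, and map rows are modelled as plain `(domain, name, type)` tuples.

```python
from typing import Iterator, Tuple

MapRow = Tuple[str, str, str]  # (domain, name, type)


def emit_map_rows(row: dict, name: str, type_: str) -> Iterator[MapRow]:
    """Yield one map row for the input domain, plus one for its link target."""
    domain = row["domain"].lower()
    yield (domain, name, type_)
    target = (row.get("link_target_domain") or "").lower()
    # Emit the alias only when a target was followed and it is a different key.
    if target and target != domain:
        yield (target, name, type_)
```

For the `mfa.gov.il` smoke-test row this yields two tuples that share the same name and type and differ only in the domain key.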

Smoke test on the original 6-domain set:

  bbc.com                  homepage   → BBC Home – Breaking News, …
  1800contacts.com         search     → 1800contacts
  health.gov.il            search     → Homepage – COVID Information Center
                                        of the Israel Ministry of Health
  americaneagle.com        search     → Americaneagle.com | Web Design …
  broadwaytechnology.com   search     → Bloomberg Completes Acquisition of …
  mfa.gov.il               search→yangon.mfa.gov.il
                                      → Home | Ministry of Foreign Affairs
                                        link_target_domain=yangon.mfa.gov.il

The mfa.gov.il row triggered the new path: DDG returned `yangon.mfa.gov.il`
as the title, the collector followed it, the target's homepage gave us
"Home | Ministry of Foreign Affairs", and the classifier emitted both
`mfa.gov.il, Ministry of foreign affairs, Government` and
`yangon.mfa.gov.il, Ministry of foreign affairs, Government`.

AGENTS.md updated with the link-following / alias rules under the
search-fallback subsection.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

2026-05-08 00:26:38 -04:00

Name	Last commit message	Last commit date
.claude	SIGHUP-based configuration reload for watch mode (#697)	2026-03-21 16:14:48 -04:00
.github	Align Kibana dashboards with OpenSearch Dashboards source-of-truth (#737)	2026-04-27 01:30:48 -04:00
.vscode	Update dashboard documentation	2026-05-03 12:36:06 -04:00
dashboards	Fix splunk SMTP TLS dashboard: add additional renames for failure details and adjust stats query	2026-05-03 19:58:29 -04:00
docs	docs: update installation instructions for IPinfo Lite and MaxMind GeoLite2 databases	2026-05-04 18:52:18 -04:00
parsedmarc	collect/classify: link-following + alias map rows for placeholder DDG titles	2026-05-08 00:26:38 -04:00
samples	Add example google SMTP-TLS report email	2024-09-04 20:03:51 -04:00
.dockerignore	Add Dockerfile & build/push task (#316)	2022-05-05 21:06:38 -04:00
.gitattributes	Add additional samples and ensure git does not touch CRLF (#456)	2024-01-02 16:29:06 -05:00
.gitignore	9.7.0 (#709)	2026-04-19 21:20:41 -04:00
AGENTS.md	collect/classify: link-following + alias map rows for placeholder DDG titles	2026-05-08 00:26:38 -04:00
build.sh	Format on build	2025-12-12 15:56:52 -05:00
CHANGELOG.md	Bump mailsuite to >=2.0.2 for 9.11.1 release (#743)	2026-04-30 11:59:11 -04:00
ci.ini	Skip DNS lookups in GitHub Actions to prevent test timeouts (#657)	2026-02-18 18:19:28 -05:00
CLAUDE.md	Add AGENTS.md for AI agent guidance and link from CLAUDE.md	2026-03-03 21:00:55 -05:00
codecov.yml	Tune Codecov statuses for small PRs (#678)	2026-03-09 17:43:34 -04:00
CONTRIBUTING.md	Add contributing guide (#685)	2026-03-09 18:16:47 -04:00
dashboard-dev-bootstrap.sh	Align Kibana dashboards with OpenSearch Dashboards source-of-truth (#737)	2026-04-27 01:30:48 -04:00
docker-compose.dashboard-dev.yml	9.4.0	2026-03-23 17:08:26 -04:00
docker-compose.yml	Update OpenSearch healthcheck to use HTTPS and include authentication	2026-03-16 17:53:37 -04:00
Dockerfile	Updated default python docker base image to 3.13-slim (#618)	2025-10-29 22:34:06 -04:00
LICENSE	First commit	2018-02-05 20:23:07 -05:00
publish-docs.sh	Add publish-docs.sh	2022-10-04 18:45:57 -04:00
pyproject.toml	collect_domain_info.py: opt-in DuckDuckGo search fallback for bot-blocked rows	2026-05-08 00:14:19 -04:00
README.md	Update sponsorship section in README and documentation	2026-04-04 22:14:38 -04:00
SECURITY.md	Add security policy (#688)	2026-03-09 18:24:16 -04:00
tests.py	Offload mailbox layer to mailsuite>=2.0.0 (#741)	2026-04-28 00:58:36 -04:00

README.md

parsedmarc

parsedmarc is a Python module and CLI utility for parsing DMARC reports. When used with Elasticsearch and Kibana (or Splunk), it works as a self-hosted open-source alternative to commercial DMARC report processing services such as Agari Brand Protection, Dmarcian, OnDMARC, ProofPoint Email Fraud Defense, and Valimail.

Note

Domain-based Message Authentication, Reporting, and Conformance (DMARC) is an email authentication protocol.

Features

Parses draft and 1.0 standard aggregate/rua DMARC reports
Parses forensic/failure/ruf DMARC reports
Parses reports from SMTP TLS Reporting
Can parse reports from an inbox over IMAP, Microsoft Graph, or Gmail API
Transparently handles gzip or zip compressed reports
Consistent data structures
Simple JSON and/or CSV output
Optionally email the results
Optionally send the results to Elasticsearch, OpenSearch, and/or Splunk, for use with premade dashboards
Optionally send reports to Apache Kafka
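
As an illustration of what transparent handling of gzip or zip compressed reports can look like, here is a minimal standard-library sketch that picks a decompressor from the file's leading magic bytes. It is not parsedmarc's implementation; `extract_report_bytes` is a hypothetical name, and the sketch assumes a zip archive holds a single report as its first member.

```python
import gzip
import io
import zipfile

GZIP_MAGIC = b"\x1f\x8b"
ZIP_MAGIC = b"PK\x03\x04"


def extract_report_bytes(data: bytes) -> bytes:
    """Return the raw report payload, decompressing gzip or zip input."""
    if data.startswith(GZIP_MAGIC):
        return gzip.decompress(data)
    if data.startswith(ZIP_MAGIC):
        with zipfile.ZipFile(io.BytesIO(data)) as archive:
            # Assumption: the report is the first (usually only) member.
            return archive.read(archive.namelist()[0])
    return data  # already uncompressed
```

Checking magic bytes rather than file extensions means a mislabelled attachment is still handled correctly.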

Python Compatibility

The table below lists Python versions that are either actively maintained or are the default versions for RHEL or Debian, and shows whether this project supports each one.

Version	Supported	Reason
< 3.6	❌	End of Life (EOL)
3.6	❌	Used in RHEL 8, but not supported by project dependencies
3.7	❌	End of Life (EOL)
3.8	❌	End of Life (EOL)
3.9	❌	Used in Debian 11 and RHEL 9, but not supported by project dependencies
3.10	✅	Actively maintained
3.11	✅	Actively maintained; supported until June 2028 (Debian 12)
3.12	✅	Actively maintained; supported until May 2035 (RHEL 10)
3.13	✅	Actively maintained; supported until June 2030 (Debian 13)
3.14	✅	Supported (requires `imapclient>=3.1.0`)
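
A script that depends on parsedmarc could guard against unsupported interpreters with a check like the sketch below. The 3.10 minimum is taken from the table above; `MINIMUM_PYTHON` and `is_supported` are hypothetical names, not part of parsedmarc.

```python
import sys

# Oldest version marked as supported in the compatibility table.
MINIMUM_PYTHON = (3, 10)


def is_supported(version_info=sys.version_info) -> bool:
    """Return True when the given (major, minor, ...) version is supported."""
    return tuple(version_info[:2]) >= MINIMUM_PYTHON


if not is_supported():
    print("Warning: this Python version is older than 3.10", file=sys.stderr)
```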
