mirror of https://github.com/domainaware/parsedmarc.git synced 2026-07-06 00:35:09 +00:00

Sean Whalen 3b705aeaa8 Commit classify_unknown_domains.py — regex-based multilingual classifier (#764 )

* Commit classify_unknown_domains.py: regex-based multilingual classifier

Promotes the transient `/tmp/classify_b<N>.py` script that grew across
the b5–b13 reverse-DNS-map batches into a tracked tool. The classifier
takes a `collect_domain_info.py` TSV and emits a CSV of map additions
plus a text file of known-unknown additions — the regex baseline that
makes step 4 of the unknown-domain workflow ("classify from the TSV, not
by re-fetching") tractable at scale.

Coverage:

- Detectors for all 44 industry types in the README.
- Concept-translation parity across ~30 languages on the high-volume
  detectors (Healthcare, Travel, Government, Retail, Finance, ISP, Web
  Host, Manufacturing, Logistics, Real Estate, Automotive, Legal,
  Agriculture).
- ~10–20 languages with 1–3 keywords each on the smaller detectors
  (Photography, Sports, MSSP, Conglomerate, Search Engine, Social Media,
  Defense, IaaS/PaaS/SaaS, Beauty, Print, Publishing, Religion, Science,
  Event Planning, Staffing, Email Security/Provider, Marketing,
  Construction, Industrial, Utilities, Energy, Government Media,
  Physical Security, News, Nonprofit, Entertainment, Technology,
  Consulting).

Brand-name selection prefers MMDB `as_name` → page title's first
segment → non-redacted WHOIS registrant → domain-derived fallback, with
a `clean_brand` pass that strips legal-form suffixes (LLC / GmbH / Ltda
/ EIRELI / sp. z o.o. / s.c.a r.l / UAB / etc.) and prefixes (PT, OOO).
When the title has multiple segments, the segment whose simplified form
contains the domain root is preferred — accessmontana.com with as_name
"MONTANA WEST, L.L.C." and title "Internet, Phone & TV Bundles | Access
Montana" maps to "Access Montana", not "Montana West".
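
The selection order above can be sketched roughly as follows. This is a minimal illustration, not the code in `classify_unknown_domains.py`: the function names `simplify` and `pick_brand`, the abbreviated suffix list, and the title separators are all assumptions made for the example.

```python
import re

# Hypothetical, abbreviated suffix list; the tracked script covers many more forms.
_LEGAL_SUFFIX_RE = re.compile(
    r"[\s,]+(?:l\.?l\.?c\.?|inc\.?|gmbh|ltda\.?|uab)$", re.IGNORECASE
)


def clean_brand(name):
    """Strip one trailing legal-form suffix such as ', L.L.C.' or ' GmbH'."""
    return _LEGAL_SUFFIX_RE.sub("", name.strip()).strip(" ,")


def simplify(text):
    """Lowercase and drop everything except ASCII letters and digits."""
    return re.sub(r"[^a-z0-9]", "", text.lower())


def pick_brand(domain, as_name=None, title=None, registrant=None):
    """Pick a brand name; all inputs other than the domain are optional."""
    root = simplify(domain.split(".")[0])
    # Assumed separators: ' | ' and ' - ' between title segments.
    segments = [s.strip() for s in re.split(r"\s+[|\-]\s+", title or "") if s.strip()]
    # With a multi-segment title, a segment containing the domain root wins.
    if len(segments) > 1:
        for segment in segments:
            if root and root in simplify(segment):
                return clean_brand(segment)
    # Otherwise fall through: as_name, first title segment, registrant.
    for candidate in (as_name, segments[0] if segments else None, registrant):
        if candidate:
            return clean_brand(candidate)
    # Domain-derived fallback.
    return domain.split(".")[0].capitalize()


print(pick_brand(
    "accessmontana.com",
    as_name="MONTANA WEST, L.L.C.",
    title="Internet, Phone & TV Bundles | Access Montana",
))
```

In this sketch the domain-root match is checked before `as_name`, which is what the accessmontana.com example implies; whether the real script orders the checks the same way is not stated here.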

A small mojibake fixer normalizes the most common UTF-8-as-Latin-1
re-encodings ("Ã³" → "ó", etc.) so Spanish/Portuguese/French homepages
that `collect_domain_info.py` mishandled still classify.
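
One common way to undo this kind of damage is a Latin-1 round trip, sketched below. The commit does not say how the script's fixer is implemented (it may well use a replacement table instead), so treat `fix_mojibake` as a hypothetical name and this approach as an assumption.

```python
def fix_mojibake(text):
    """Undo UTF-8 bytes that were wrongly decoded as Latin-1.

    If the text cannot be re-encoded as Latin-1, or the resulting bytes are
    not valid UTF-8, the input was probably fine and is returned unchanged.
    """
    try:
        return text.encode("latin-1").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        return text


print(fix_mojibake("InformaciÃ³n"))  # a damaged Spanish word
print(fix_mojibake("Información"))   # already correct, left alone
```

The try/except is what keeps correctly decoded text safe: a lone "ó" encodes to a byte that is not valid UTF-8 on its own, so the function falls back to the original string.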

The empty HAND dict at the top of the file is an extension point for
batch-specific overrides — e.g. acquisition aliases or brand-name
corrections that don't fit any detector; each `domain → ("Brand",
"Type")` entry wins over the auto-classifier.
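
A minimal sketch of how such an override table can take precedence over automatic classification. The entry, the `classify` wrapper, and the `auto_classify` callback are hypothetical; only the `HAND` name and the `domain → ("Brand", "Type")` shape come from the description above.

```python
# Hypothetical entry for illustration only; the tracked file ships HAND empty.
HAND = {
    "example-acquired.test": ("Example Corp", "Technology"),
}


def classify(domain, auto_classify):
    """Return (brand, type), letting a HAND entry win over the detectors."""
    override = HAND.get(domain.lower())
    if override is not None:
        return override
    # auto_classify stands in for the regex-based detectors.
    return auto_classify(domain)


print(classify("example-acquired.test", lambda d: ("Guess", "Retail")))
print(classify("other.test", lambda d: ("Other", "Retail")))
```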

Wired into AGENTS.md's "Related utility scripts" section and documented
in `parsedmarc/resources/maps/README.md` alongside the rest of the
maps utilities.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* classify_unknown_domains.py: clarify dual-purpose framing

The classifier serves both lookup paths into base_reverse_dns_map.csv —
the original PTR-side flow (reverse-DNS base domains derived from DMARC
report source IPs) and the MMDB-coverage flow (AS domains lifted from
the bundled IPinfo Lite MMDB). The initial commit's docstring/comments
emphasized the MMDB-coverage flow because that's where the script grew
up across the b5–b13 batches, but it was always equally applicable to
PTR-side domains.

Updates:

- Top docstring rewritten to lead with the dual-purpose framing.
- README.md adds an explicit "useful for either lookup path" paragraph
  referencing the original DMARC-report flow and the MMDB-coverage flow.
- AGENTS.md "Related utility scripts" entry updated to mention both
  flows.
- Drops a stale "happen to have ASN registrations" aside in the
  RETAIL_RE comment.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Sean Whalen <seanthegeek@users.noreply.github.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

2026-05-07 17:16:23 -04:00

Name	Last commit	Date
.claude	SIGHUP-based configuration reload for watch mode (#697)	2026-03-21 16:14:48 -04:00
.github	Align Kibana dashboards with OpenSearch Dashboards source-of-truth (#737)	2026-04-27 01:30:48 -04:00
.vscode	Update dashboard documentation	2026-05-03 12:36:06 -04:00
dashboards	Fix splunk SMTP TLS dashboard: add additional renames for failure details and adjust stats query	2026-05-03 19:58:29 -04:00
docs	docs: update installation instructions for IPinfo Lite and MaxMind GeoLite2 databases	2026-05-04 18:52:18 -04:00
parsedmarc	Commit classify_unknown_domains.py — regex-based multilingual classifier (#764)	2026-05-07 17:16:23 -04:00
samples	Add example google SMTP-TLS report email	2024-09-04 20:03:51 -04:00
.dockerignore	Add Dockerfile & build/push task (#316)	2022-05-05 21:06:38 -04:00
.gitattributes	Add additional samples and ensure git does not touch CRLF (#456)	2024-01-02 16:29:06 -05:00
.gitignore	9.7.0 (#709)	2026-04-19 21:20:41 -04:00
AGENTS.md	Commit classify_unknown_domains.py — regex-based multilingual classifier (#764)	2026-05-07 17:16:23 -04:00
build.sh	Format on build	2025-12-12 15:56:52 -05:00
CHANGELOG.md	Bump mailsuite to >=2.0.2 for 9.11.1 release (#743)	2026-04-30 11:59:11 -04:00
ci.ini	Skip DNS lookups in GitHub Actions to prevent test timeouts (#657)	2026-02-18 18:19:28 -05:00
CLAUDE.md	Add AGENTS.md for AI agent guidance and link from CLAUDE.md	2026-03-03 21:00:55 -05:00
codecov.yml	Tune Codecov statuses for small PRs (#678)	2026-03-09 17:43:34 -04:00
CONTRIBUTING.md	Add contributing guide (#685)	2026-03-09 18:16:47 -04:00
dashboard-dev-bootstrap.sh	Align Kibana dashboards with OpenSearch Dashboards source-of-truth (#737)	2026-04-27 01:30:48 -04:00
docker-compose.dashboard-dev.yml	9.4.0	2026-03-23 17:08:26 -04:00
docker-compose.yml	Update OpenSearch healthcheck to use HTTPS and include authentication	2026-03-16 17:53:37 -04:00
Dockerfile	Updated default python docker base image to 3.13-slim (#618)	2025-10-29 22:34:06 -04:00
LICENSE	First commit	2018-02-05 20:23:07 -05:00
publish-docs.sh	Add publish-docs.sh	2022-10-04 18:45:57 -04:00
pyproject.toml	Bump mailsuite to >=2.0.2 for 9.11.1 release (#743)	2026-04-30 11:59:11 -04:00
README.md	Update sponsorship section in README and documentation	2026-04-04 22:14:38 -04:00
SECURITY.md	Add security policy (#688)	2026-03-09 18:24:16 -04:00
tests.py	Offload mailbox layer to mailsuite>=2.0.0 (#741)	2026-04-28 00:58:36 -04:00

README.md

parsedmarc

parsedmarc is a Python module and CLI utility for parsing DMARC reports. When used with Elasticsearch and Kibana (or Splunk), it works as a self-hosted open-source alternative to commercial DMARC report processing services such as Agari Brand Protection, Dmarcian, OnDMARC, Proofpoint Email Fraud Defense, and Valimail.

Note

Domain-based Message Authentication, Reporting, and Conformance (DMARC) is an email authentication protocol.

Features

- Parses draft and 1.0 standard aggregate/rua DMARC reports
- Parses forensic/failure/ruf DMARC reports
- Parses reports from SMTP TLS Reporting
- Can parse reports from an inbox over IMAP, Microsoft Graph, or Gmail API
- Transparently handles gzip or zip compressed reports
- Consistent data structures
- Simple JSON and/or CSV output
- Optionally email the results
- Optionally send the results to Elasticsearch, OpenSearch, and/or Splunk, for use with premade dashboards
- Optionally send reports to Apache Kafka
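
To illustrate what transparent handling of compressed reports can look like, here is a minimal standard-library sketch that picks a decompressor from the payload's leading magic bytes. It is not parsedmarc's implementation; `extract_report_bytes` is a hypothetical helper, and the sketch assumes a zip archive holds the report as its first member.

```python
import gzip
import io
import zipfile


def extract_report_bytes(payload):
    """Return the raw report bytes from a gzip, zip, or uncompressed payload."""
    if payload[:2] == b"\x1f\x8b":  # gzip magic number
        return gzip.decompress(payload)
    if payload[:4] == b"PK\x03\x04":  # zip local file header signature
        with zipfile.ZipFile(io.BytesIO(payload)) as archive:
            # Assumption: the report is the first member of the archive.
            return archive.read(archive.namelist()[0])
    return payload  # treat anything else as already uncompressed


xml = b"<feedback/>"
print(extract_report_bytes(gzip.compress(xml)) == xml)
```

Sniffing the content rather than trusting a file extension matters for email attachments, whose names are chosen by the sender.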

Python Compatibility

This project supports the following Python versions, which are either actively maintained or are the default versions for RHEL or Debian.

Version	Supported	Reason
< 3.6	❌	End of Life (EOL)
3.6	❌	Used in RHEL 8, but not supported by project dependencies
3.7	❌	End of Life (EOL)
3.8	❌	End of Life (EOL)
3.9	❌	Used in Debian 11 and RHEL 9, but not supported by project dependencies
3.10	✅	Actively maintained
3.11	✅	Actively maintained; supported until June 2028 (Debian 12)
3.12	✅	Actively maintained; supported until May 2035 (RHEL 10)
3.13	✅	Actively maintained; supported until June 2030 (Debian 13)
3.14	✅	Supported (requires `imapclient>=3.1.0`)
