mirror of
https://github.com/domainaware/parsedmarc.git
synced 2026-05-20 10:55:24 +00:00
3b705aeaa87df85beaee0bf6a010e208801c9025
* Commit classify_unknown_domains.py: regex-based multilingual classifier
Promotes the transient `/tmp/classify_b<N>.py` script that grew across
the b5–b13 reverse-DNS-map batches into a tracked tool. The classifier
takes a `collect_domain_info.py` TSV and emits a CSV of map additions
plus a text file of known-unknown additions — the regex baseline that
makes step 4 of the unknown-domain workflow ("classify from the TSV, not
by re-fetching") tractable at scale.
Coverage:
- Detectors for all 44 industry types in the README.
- Concept-translation parity across ~30 languages on the high-volume
detectors (Healthcare, Travel, Government, Retail, Finance, ISP, Web
Host, Manufacturing, Logistics, Real Estate, Automotive, Legal,
Agriculture).
- ~10–20 languages with 1–3 keywords each on the smaller detectors
(Photography, Sports, MSSP, Conglomerate, Search Engine, Social Media,
Defense, IaaS/PaaS/SaaS, Beauty, Print, Publishing, Religion, Science,
Event Planning, Staffing, Email Security/Provider, Marketing,
Construction, Industrial, Utilities, Energy, Government Media,
Physical Security, News, Nonprofit, Entertainment, Technology,
Consulting).
Brand-name selection prefers MMDB `as_name` → page title's first
segment → non-redacted WHOIS registrant → domain-derived fallback, with
a `clean_brand` pass that strips legal-form suffixes (LLC / GmbH / Ltda
/ EIRELI / sp. z o.o. / s.c.a r.l / UAB / etc.) and prefixes (PT, OOO).
When the title has multiple segments, the segment whose simplified form
contains the domain root is preferred — accessmontana.com with as_name
"MONTANA WEST, L.L.C." and title "Internet, Phone & TV Bundles | Access
Montana" maps to "Access Montana", not "Montana West".
A small mojibake fixer normalizes the most common UTF-8-as-Latin-1
re-encodings ("ó" → "ó", etc.) so Spanish/Portuguese/French homepages
that `collect_domain_info.py` mishandled still classify.
The empty HAND dict at the top of the file is an extension point for
batch-specific overrides — e.g. acquisition aliases or brand-name
corrections that don't fit any detector; each `domain → ("Brand",
"Type")` entry wins over the auto-classifier.
Wired into AGENTS.md's "Related utility scripts" section and documented
in `parsedmarc/resources/maps/README.md` alongside the rest of the
maps utilities.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* classify_unknown_domains.py: clarify dual-purpose framing
The classifier serves both lookup paths into base_reverse_dns_map.csv —
the original PTR-side flow (reverse-DNS base domains derived from DMARC
report source IPs) and the MMDB-coverage flow (AS domains lifted from
the bundled IPinfo Lite MMDB). The initial commit's docstring/comments
emphasized the MMDB-coverage flow because that's where the script grew
up across the b5–b13 batches, but it was always equally applicable to
PTR-side domains.
Updates:
- Top docstring rewritten to lead with the dual-purpose framing.
- README.md adds an explicit "useful for either lookup path" paragraph
referencing the original DMARC-report flow and the MMDB-coverage flow.
- AGENTS.md "Related utility scripts" entry updated to mention both
flows.
- Drops a stale "happen to have ASN registrations" aside in the
RETAIL_RE comment.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
---------
Co-authored-by: Sean Whalen <seanthegeek@users.noreply.github.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
parsedmarc
parsedmarc is a Python module and CLI utility for parsing DMARC
reports. When used with Elasticsearch and Kibana (or Splunk), it works
as a self-hosted open-source alternative to commercial DMARC report
processing services such as Agari Brand Protection, Dmarcian, OnDMARC,
ProofPoint Email Fraud Defense, and Valimail.
Note
Domain-based Message Authentication, Reporting, and Conformance (DMARC) is an email authentication protocol.
Sponsors
This is a project is maintained by one developer. Please consider sponsoring my work if you or your organization benefit from it.
Features
- Parses draft and 1.0 standard aggregate/rua DMARC reports
- Parses forensic/failure/ruf DMARC reports
- Parses reports from SMTP TLS Reporting
- Can parse reports from an inbox over IMAP, Microsoft Graph, or Gmail API
- Transparently handles gzip or zip compressed reports
- Consistent data structures
- Simple JSON and/or CSV output
- Optionally email the results
- Optionally send the results to Elasticsearch, Opensearch, and/or Splunk, for use with premade dashboards
- Optionally send reports to Apache Kafka
Python Compatibility
This project supports the following Python versions, which are either actively maintained or are the default versions for RHEL or Debian.
| Version | Supported | Reason |
|---|---|---|
| < 3.6 | ❌ | End of Life (EOL) |
| 3.6 | ❌ | Used in RHEL 8, but not supported by project dependencies |
| 3.7 | ❌ | End of Life (EOL) |
| 3.8 | ❌ | End of Life (EOL) |
| 3.9 | ❌ | Used in Debian 11 and RHEL 9, but not supported by project dependencies |
| 3.10 | ✅ | Actively maintained |
| 3.11 | ✅ | Actively maintained; supported until June 2028 (Debian 12) |
| 3.12 | ✅ | Actively maintained; supported until May 2035 (RHEL 10) |
| 3.13 | ✅ | Actively maintained; supported until June 2030 (Debian 13) |
| 3.14 | ✅ | Supported (requires imapclient>=3.1.0) |
Description
Languages
Python
98.2%
Shell
1.7%
