Sean Whalen 3b705aeaa8 Commit classify_unknown_domains.py — regex-based multilingual classifier (#764)
* Commit classify_unknown_domains.py: regex-based multilingual classifier

Promotes the transient `/tmp/classify_b<N>.py` script that grew across
the b5–b13 reverse-DNS-map batches into a tracked tool. The classifier
takes a `collect_domain_info.py` TSV and emits a CSV of map additions
plus a text file of known-unknown additions — the regex baseline that
makes step 4 of the unknown-domain workflow ("classify from the TSV, not
by re-fetching") tractable at scale.

Coverage:

- Detectors for all 44 industry types in the README.
- Concept-translation parity across ~30 languages on the high-volume
  detectors (Healthcare, Travel, Government, Retail, Finance, ISP, Web
  Host, Manufacturing, Logistics, Real Estate, Automotive, Legal,
  Agriculture).
- ~10–20 languages with 1–3 keywords each on the smaller detectors
  (Photography, Sports, MSSP, Conglomerate, Search Engine, Social Media,
  Defense, IaaS/PaaS/SaaS, Beauty, Print, Publishing, Religion, Science,
  Event Planning, Staffing, Email Security/Provider, Marketing,
  Construction, Industrial, Utilities, Energy, Government Media,
  Physical Security, News, Nonprofit, Entertainment, Technology,
  Consulting).

Brand-name selection prefers MMDB `as_name` → page title's first
segment → non-redacted WHOIS registrant → domain-derived fallback, with
a `clean_brand` pass that strips legal-form suffixes (LLC / GmbH / Ltda
/ EIRELI / sp. z o.o. / s.c.a r.l / UAB / etc.) and prefixes (PT, OOO).
When the title has multiple segments, the segment whose simplified form
contains the domain root is preferred — accessmontana.com with as_name
"MONTANA WEST, L.L.C." and title "Internet, Phone & TV Bundles | Access
Montana" maps to "Access Montana", not "Montana West".

A small mojibake fixer normalizes the most common UTF-8-as-Latin-1
re-encodings ("ó" → "ó", etc.) so Spanish/Portuguese/French homepages
that `collect_domain_info.py` mishandled still classify.

The empty HAND dict at the top of the file is an extension point for
batch-specific overrides — e.g. acquisition aliases or brand-name
corrections that don't fit any detector; each `domain → ("Brand",
"Type")` entry wins over the auto-classifier.

Wired into AGENTS.md's "Related utility scripts" section and documented
in `parsedmarc/resources/maps/README.md` alongside the rest of the
maps utilities.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* classify_unknown_domains.py: clarify dual-purpose framing

The classifier serves both lookup paths into base_reverse_dns_map.csv —
the original PTR-side flow (reverse-DNS base domains derived from DMARC
report source IPs) and the MMDB-coverage flow (AS domains lifted from
the bundled IPinfo Lite MMDB). The initial commit's docstring/comments
emphasized the MMDB-coverage flow because that's where the script grew
up across the b5–b13 batches, but it was always equally applicable to
PTR-side domains.

Updates:

- Top docstring rewritten to lead with the dual-purpose framing.
- README.md adds an explicit "useful for either lookup path" paragraph
  referencing the original DMARC-report flow and the MMDB-coverage flow.
- AGENTS.md "Related utility scripts" entry updated to mention both
  flows.
- Drops a stale "happen to have ASN registrations" aside in the
  RETAIL_RE comment.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Sean Whalen <seanthegeek@users.noreply.github.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-07 17:16:23 -04:00
2026-05-03 12:36:06 -04:00
2026-04-19 21:20:41 -04:00
2025-12-12 15:56:52 -05:00
2026-03-09 18:16:47 -04:00
2026-03-23 17:08:26 -04:00
2018-02-05 20:23:07 -05:00
2022-10-04 18:45:57 -04:00
2026-03-09 18:24:16 -04:00

parsedmarc

Build
Status Code
Coverage PyPI
Package PyPI - Downloads

A screenshot of DMARC summary charts in Kibana

parsedmarc is a Python module and CLI utility for parsing DMARC reports. When used with Elasticsearch and Kibana (or Splunk), it works as a self-hosted open-source alternative to commercial DMARC report processing services such as Agari Brand Protection, Dmarcian, OnDMARC, ProofPoint Email Fraud Defense, and Valimail.

Note

Domain-based Message Authentication, Reporting, and Conformance (DMARC) is an email authentication protocol.

Sponsors

This is a project is maintained by one developer. Please consider sponsoring my work if you or your organization benefit from it.

Features

  • Parses draft and 1.0 standard aggregate/rua DMARC reports
  • Parses forensic/failure/ruf DMARC reports
  • Parses reports from SMTP TLS Reporting
  • Can parse reports from an inbox over IMAP, Microsoft Graph, or Gmail API
  • Transparently handles gzip or zip compressed reports
  • Consistent data structures
  • Simple JSON and/or CSV output
  • Optionally email the results
  • Optionally send the results to Elasticsearch, Opensearch, and/or Splunk, for use with premade dashboards
  • Optionally send reports to Apache Kafka

Python Compatibility

This project supports the following Python versions, which are either actively maintained or are the default versions for RHEL or Debian.

Version Supported Reason
< 3.6 End of Life (EOL)
3.6 Used in RHEL 8, but not supported by project dependencies
3.7 End of Life (EOL)
3.8 End of Life (EOL)
3.9 Used in Debian 11 and RHEL 9, but not supported by project dependencies
3.10 Actively maintained
3.11 Actively maintained; supported until June 2028 (Debian 12)
3.12 Actively maintained; supported until May 2035 (RHEL 10)
3.13 Actively maintained; supported until June 2030 (Debian 13)
3.14 Supported (requires imapclient>=3.1.0)
S
Description
No description provided
Readme Apache-2.0 160 MiB
Languages
Python 98.2%
Shell 1.7%