Sean Whalen 06d277686d classify_unknown_domains.py: enforce concept-parity across ~60 languages (#765)
Multilingual detectors previously held English at full breadth (e.g. Healthcare =
hospital + clinic + pharmacy + healthcare + pharmaceutical industry + nursing
home + medical center) while many non-English sections covered the same concept
set with only one or two transliterated words. This left every language other
than English under-detecting against pages that used the operator's natural
compound terms.

Reworked every detector so each language now expresses the same English concept
set in idiomatic compounds — never inventing calques where no natural form
exists. Added ~32 new languages (Macedonian, Belarusian, Azerbaijani, Armenian,
Georgian, Kazakh, Uzbek, Mongolian, Khmer, Burmese, Lao, Nepali, Sinhala,
Amharic, Yoruba, Hausa, Igbo, Zulu, Pashto, Kurdish, Tajik, Kyrgyz, Maltese,
Luxembourgish, Haitian Creole, Frisian, Yiddish, Faroese, Tatar, Javanese,
Sundanese, Cebuano) on top of the existing pool, again applied per-concept
rather than as token presence.

Also added British / American spelling pairs where they diverge (`tire`/`tyre`,
`defense`/`defence`, `center`/`centre`, etc.) and a handful of new English
concepts that had been implicit (`tire shop`, `car parts`, `oil exploration`,
`olympic committee`, ...) — each with its multilingual equivalents in the same
edit.

AGENTS.md: codified the rule under "Maintaining the reverse DNS maps" so future
edits are bound by it: every language section must cover the same concept set
the English section covers, with idiomatic compounds rather than calques, skip
rather than invent when no natural form exists, and any new English keyword
must be added in parallel across the existing language set.

Final shape: 11,777 alternations / 175,556 chars across 45 detectors. Ruff
check + format clean. Module compiles.

Known limitation (pre-existing, unchanged): Python's `re` does not treat
Unicode Mn / Mc combining marks as word characters, so Brahmic-script words
ending in vowel signs / virama won't match the outer `\b…\b`. Affects
pre-existing and new entries equally; fixable later by switching to the
`regex` module.

Co-authored-by: Sean Whalen <seanthegeek@users.noreply.github.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-07 19:01:15 -04:00
2026-05-03 12:36:06 -04:00
2026-04-19 21:20:41 -04:00
2025-12-12 15:56:52 -05:00
2026-03-09 18:16:47 -04:00
2026-03-23 17:08:26 -04:00
2018-02-05 20:23:07 -05:00
2022-10-04 18:45:57 -04:00
2026-03-09 18:24:16 -04:00

parsedmarc

Build
Status Code
Coverage PyPI
Package PyPI - Downloads

A screenshot of DMARC summary charts in Kibana

parsedmarc is a Python module and CLI utility for parsing DMARC reports. When used with Elasticsearch and Kibana (or Splunk), it works as a self-hosted open-source alternative to commercial DMARC report processing services such as Agari Brand Protection, Dmarcian, OnDMARC, ProofPoint Email Fraud Defense, and Valimail.

Note

Domain-based Message Authentication, Reporting, and Conformance (DMARC) is an email authentication protocol.

Sponsors

This is a project is maintained by one developer. Please consider sponsoring my work if you or your organization benefit from it.

Features

  • Parses draft and 1.0 standard aggregate/rua DMARC reports
  • Parses forensic/failure/ruf DMARC reports
  • Parses reports from SMTP TLS Reporting
  • Can parse reports from an inbox over IMAP, Microsoft Graph, or Gmail API
  • Transparently handles gzip or zip compressed reports
  • Consistent data structures
  • Simple JSON and/or CSV output
  • Optionally email the results
  • Optionally send the results to Elasticsearch, Opensearch, and/or Splunk, for use with premade dashboards
  • Optionally send reports to Apache Kafka

Python Compatibility

This project supports the following Python versions, which are either actively maintained or are the default versions for RHEL or Debian.

Version Supported Reason
< 3.6 End of Life (EOL)
3.6 Used in RHEL 8, but not supported by project dependencies
3.7 End of Life (EOL)
3.8 End of Life (EOL)
3.9 Used in Debian 11 and RHEL 9, but not supported by project dependencies
3.10 Actively maintained
3.11 Actively maintained; supported until June 2028 (Debian 12)
3.12 Actively maintained; supported until May 2035 (RHEL 10)
3.13 Actively maintained; supported until June 2030 (Debian 13)
3.14 Supported (requires imapclient>=3.1.0)
S
Description
No description provided
Readme Apache-2.0 160 MiB
Languages
Python 98.2%
Shell 1.7%