parsedmarc

mirror of https://github.com/domainaware/parsedmarc.git synced 2026-05-26 05:35:24 +00:00

Author	SHA1	Message	Date
Sean Whalen	2c8b2c0f14	Bump mailsuite to >=2.2.1 (release 10.0.2) (#783) * Bump mailsuite to >=2.2.1; release 10.0.2 mailsuite 2.2.1 raises the transitive mail-parser floor to >=4.2.1, which stops mail-parser from returning a phantom ('', '') entry for absent address headers (verified against samples/failure/* with mail-parser 4.2.1: cc/bcc now parse to [] instead of [{address: ""}]). parsedmarc reads the mail-parser object directly via its own parse_email(), so this previously caused an empty {address: ""} Cc/Bcc entry to be indexed for every failure-report sample in Elasticsearch/OpenSearch and emitted in JSON/S3/Kafka output. The Reply-To-always-empty behavior in parsedmarc's own parse_email() (a hyphen-vs-underscore key mismatch, not an upstream issue) and the failure dashboards are out of scope here and tracked separately. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * docs: note CVE-2023-27043 hardening from mail-parser 4.2.1 in 10.0.2 Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> --------- Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-24 12:57:50 -04:00
Sean Whalen	3f64e30f6f	Update version to 10.0.1 and bump mailsuite requirement to >=2.2.0	2026-05-23 22:08:34 -04:00
Sean Whalen	a6778707d7	Finish forensic→failure rename: archive-folder migration + dashboard/doc cleanup (#776 ) The forensic→failure rename (#659) left a few loose ends and one deliberate hold-back. This closes them. Leftover rename misses (broken paths / stale canonical names): - CONTRIBUTING.md, dashboard-dev-bootstrap.sh: samples/forensic/* → samples/failure/* - dashboard-dev-bootstrap.sh, dashboards/README.md: dmarc_forensic_dashboard.xml → dmarc_failure_dashboard.xml (the file was already renamed; the import path and view name were not) - docs/source/usage.md: PARSEDMARC_GENERAL_SAVE_FORENSIC → ..._SAVE_FAILURE example - samples/parsedmarc.ini: save_forensic → save_failure - pyproject.toml, README.md: canonical "failure" naming (ci.ini intentionally keeps save_forensic to smoke-test the deprecated alias.) Archive subfolder rename + on-startup migration: - New failure reports now archive to <archive>/Failure (was <archive>/Forensic). - _migrate_forensic_archive_folder() runs once on startup (best-effort): renames Forensic→Failure when no Failure folder exists yet, merges the two when both exist, no-ops when there's no legacy folder, and logs-and-skips a mailbox it can't reorganize (warn, don't crash). This consolidates pre- and post-rename failure reports into one folder, replacing the previously documented decision to keep the folder named Forensic to avoid a split archive. Uses the folder-management API (folder_exists / rename_folder / merge_folders) added in mailsuite 2.1.0; the pin is bumped to >=2.1.0. Grafana dashboard (the rename PR updated OSD/Splunk/ES-OS but not Grafana): - Forensic panel titles + the datasource label → Failure; the fo-column display label and its linked byName field-override matcher both → "Failure Policy" (changed together so the column-width override keeps matching). 
- dev-bootstrap Grafana ES datasource: dmarc_forensic* → dmarc_f* (matches both pre-rename dmarc_forensic* and post-rename dmarc_failure, like the OSD/Kibana dashboards); RESEED wipe loop now also clears dmarc_failure indices. - Removed dashboards/grafana/Grafana-DMARC_Reports.json-new_panel.json, an orphan export accidentally committed in #736 and referenced by nothing. Tests (tests/test_init.py): - TestMigrateForensicArchiveFolderMaildir: real on-disk Maildir round-trips via mailsuite's MaildirConnection (no mocks) — rename, merge, no-op, and the full get_dmarc_reports_from_mailbox orchestration. Runs in CI (no network/creds). - TestMigrateForensicArchiveFolderErrorHandling: the one path a real Maildir can't reproduce — a backend that raises mid-operation must warn, not crash. Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-21 12:29:40 -04:00
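The rename / merge / no-op decision described for `_migrate_forensic_archive_folder()` in the commit above can be sketched roughly as follows. This is an illustration only, not parsedmarc's code: `FakeMailbox` is a hypothetical in-memory stand-in, the `Archive` folder name is made up, and while the `folder_exists` / `rename_folder` / `merge_folders` method names come from the commit message, their signatures here are assumptions.

```python
import logging

logger = logging.getLogger("migration_sketch")


class FakeMailbox:
    """Hypothetical in-memory mailbox; method signatures are assumptions."""

    def __init__(self, folders):
        # Map folder name -> list of message identifiers.
        self.folders = {name: list(msgs) for name, msgs in folders.items()}

    def folder_exists(self, name):
        return name in self.folders

    def rename_folder(self, old, new):
        self.folders[new] = self.folders.pop(old)

    def merge_folders(self, source, target):
        # Move every message from source into target, then drop source.
        self.folders[target].extend(self.folders.pop(source))


def migrate_forensic_folder(mailbox, archive="Archive"):
    """Return the action taken: 'noop', 'rename', 'merge' or 'skipped'."""
    legacy = f"{archive}/Forensic"
    current = f"{archive}/Failure"
    try:
        if not mailbox.folder_exists(legacy):
            return "noop"  # nothing to migrate
        if mailbox.folder_exists(current):
            mailbox.merge_folders(legacy, current)
            return "merge"
        mailbox.rename_folder(legacy, current)
        return "rename"
    except Exception as error:  # best-effort: warn, don't crash
        logger.warning("Could not migrate %s: %s", legacy, error)
        return "skipped"
```

Returning a label for each branch keeps the four cases (rename, merge, no-op, backend error) easy to assert on in tests, which mirrors the cases the commit says its test classes cover.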
Fabio Scaccabarozzi	327fcff2b9	Add optional PostgreSQL storage backend (#667 ) Adds a PostgreSQL output backend as a lighter-weight alternative to Elasticsearch/OpenSearch, configured via a [postgresql] section (host/port/user/password/database or a libpq connection_string). Tables are created automatically on first run; a Grafana dashboard is included. - psycopg is an optional extra (pip install parsedmarc[postgresql]); the import is guarded so `import parsedmarc` works without it, and PostgreSQLClient raises a clear install hint when constructed without the driver. Binary wheels aren't available for every platform. - Schema captures the RFC 9990 / DMARCbis aggregate fields: np, testing, discovery_method, generator, xml_namespace, and per-result human_result on the DKIM/SPF auth-result tables. - forensic -> failure naming throughout (table dmarc_failure_report, save_failure_report_to_postgresql, dashboard, docs) to match #659. - Failure-report de-duplication mirrors the Elasticsearch backend exactly: arrival date + From + To + Subject (NULL-safe via IS NOT DISTINCT FROM; semantic JSONB equality). Aggregate and SMTP-TLS use ON CONFLICT. - PostgreSQLClient.close() for clean CLI shutdown; comment documents why the two timestamp helpers must stay distinct (report dates are local, record/SMTP-TLS dates are UTC). - CLI: config parse raises ConfigurationError on missing host/connection_string; wired into _init_output_clients + save loops. - Tests in tests/test_postgres.py (helpers, mocked-DB save assertions, create_tables, connect/error wrapping, dedup, real-sample round trip) and tests/test_cli.py (config parse + end-to-end save wiring incl. AlreadySaved/PostgreSQLError handling). postgres.py at 99% line coverage; only _main's output-client-init retry path is left. Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-21 09:17:49 -04:00
Sean Whalen	5b08627eaa	Split tests.py into per-module tests/test_<module>.py (#774 ) * Split tests.py into per-module tests/test_<module>.py The 5174-line tests.py monolith is split into per-module files under tests/, mirroring the checkdmarc layout: tests/test_init.py parsedmarc/__init__.py parsing surface tests/test_cli.py parsedmarc/cli.py + config / env-vars / SIGHUP tests/test_utils.py parsedmarc/utils.py (DNS, IP info, PSL, etc.) tests/test_webhook.py parsedmarc/webhook.py tests/test_kafkaclient.py parsedmarc/kafkaclient.py tests/test_splunk.py parsedmarc/splunk.py tests/test_syslog.py parsedmarc/syslog.py tests/test_loganalytics.py parsedmarc/loganalytics.py tests/test_gelf.py parsedmarc/gelf.py tests/test_s3.py parsedmarc/s3.py tests/test_maps.py parsedmarc/resources/maps/ maintainer scripts The split is purely a redistribution — no test bodies changed, no tests added or removed. All 276 existing tests pass under the new layout. The current tests.py contains two kitchen-sink classes (`Test` at line 54 and `TestEnvVarConfig` at line 2360) holding tests that span many modules. Their methods are routed to the correct per-module file by name prefix; the wholly-thematic classes (TestExtractReport, TestUtilsXxx, TestSighupReload, etc.) move whole. Each target file gets its own `class Test(unittest.TestCase)` for the redistributed kitchen-sink methods, plus the thematic classes verbatim. Wiring updates: - `.github/workflows/python-tests.yml`: `pytest ... tests.py` → `python -m pytest ... tests/` (also switches to `python -m pytest` per the checkdmarc convention so cwd lands on the project root). - `pyproject.toml`: adds `[tool.pytest.ini_options] testpaths = ["tests"]` and `[tool.coverage.run] source = ["parsedmarc"]` with an `omit` for `parsedmarc/resources/maps/.py`. The maps scripts are maintainer-only batch tooling that ships out of the wheel; excluding them from coverage makes the headline number reflect only installed library code. 
Runtime coverage on the new layout is 59% (was 45% with maps counted), and PR-B will push it to 90%+. - `AGENTS.md`: documents the new layout and how to run individual files / tests; tells future contributors not to reintroduce a monolithic tests.py. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> Restore 66.9% coverage baseline (count tests/ + parsedmarc) Master's headline 66.9% number on Codecov includes the tests.py file itself (99.35% covered) being measured alongside parsedmarc/. The original tests.py had no `[tool.coverage.run]` block, so coverage's default — "measure every file imported during the run" — counted the test code as if it were product code. The split commit added `source = ["parsedmarc"]` which suppressed measurement of the test files (correct in principle, since test files aren't shipped code), and that alone made the headline number drop by ~8 percentage points without any actual loss of testing. This commit swaps `source` for an explicit `include = ["parsedmarc/", "tests/"]` so both halves are measured the way they were on master. Verified: 276 tests, 66.96% line coverage (effectively unchanged from master's 66.90%). If you want the shipped-code-only number (was the headline that this commit overrides), run `pytest --cov=parsedmarc tests/`. That number is currently 59% and is the focus of the upcoming coverage-expansion PR. Also adds junit.xml to .gitignore so the CI artefact doesn't get accidentally committed. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> Restrict coverage to shipped code (`source = ["parsedmarc"]`) Reverts the prior commit's `include = ["tests/*"]`. Counting the test files toward coverage was wrong — it conflates "shipped code exercised by tests" with "test code that pytest auto-runs", inflates the headline number, and rewards writing more tests rather than tests that verify more code. 
Master's apparent 66.9% was an artefact of the old monolithic tests.py having no [tool.coverage.run] block at all; coverage's default behaviour measured every imported file, including the test file itself at ~99% "covered", which added ~8 percentage points to the displayed number without any real testing signal. Restricting to `source = ["parsedmarc"]` plus the existing maps omit gives a meaningful baseline: 59% of shipped code is exercised by the test suite today. That's the number the next PR is targeting to lift to 90%+ before the 10.0.0 release; the Codecov "drop" here is a measurement correction, not a regression. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> --------- Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-20 19:29:09 -04:00
Sean Whalen	ff6f75d740	Map-data build hygiene: README single source of truth, drop maintainer scripts from wheel (9.11.2) (#768 ) * Drop base_reverse_dns_types.txt; sortlists.py now reads types from README.md The .txt file duplicated the README's industry list and introduced drift risk — twice in the project's history we had to add types to the .txt only because the README had been updated independently. Make the README the single source of truth. - Add `<!-- types-list:start -->` / `<!-- types-list:end -->` HTML comment markers around the bullet list in parsedmarc/resources/maps/README.md. Markers don't render in GitHub's preview. - New `load_types_from_readme()` in sortlists.py parses the bullet items between the markers and returns them. Errors clearly if the README is missing or the markers are absent. - Delete base_reverse_dns_types.txt. - Fix a pre-existing typo in README precedence rule 4: `Web Hosting` → `Web Host` (matches the canonical type used in 4,176 map rows). Smoke test: feeding a row with a bogus type still triggers the validator (`'NotARealType' is not an allowed value for 'type'`), confirming the README-derived list flows through identically. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * sortlists.py: normalize README types-list block in place Before validating the map, the validator now sorts the <!-- types-list:start --> / <!-- types-list:end --> block in README.md alphabetically (case-insensitively), trims leading and trailing whitespace from each item, and deduplicates case- insensitively, rewriting the README in place if any of those need fixing. Errors clearly when two entries differ only by casing (which would otherwise silently lose one). Adding a new category is now just inserting a `- New Type` line anywhere inside the markers — `sortlists.py` will tidy it on the next run. Same shape as how the validator already normalizes known_unknown_base_reverse_dns.txt and psl_overrides.txt. 
The pure read path is preserved as `load_types_from_readme()` for callers that don't want a side-effecting rewrite (tests, downstream tooling). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * Stop shipping maintainer scripts; bump to 9.11.2 The exclude list in [tool.hatch.build] was originally meant to keep maintainer-only batch tooling under parsedmarc/resources/maps/ out of the wheel and sdist (it lists `find_bad_utf8.py`, `find_unknown_base_reverse_dns.py`, the renamed-and-removed `sortmaps.py`). The list never grew when new tools were added, so `collect_domain_info.py`, `classify_unknown_domains.py`, `detect_psl_overrides.py`, `detect_rebrands.py`, and `sortlists.py` all started shipping in distributions despite contributing nothing to runtime functionality. Replace the per-file basename list with a single glob pattern: parsedmarc/resources/maps/[!_]*.py The leading-`_` exception keeps `__init__.py` shipping (required so that `importlib.resources.files(parsedmarc.resources.maps)` can locate the bundled CSV/TXT data files), while excluding any other .py file under that directory, including future maintainer scripts that haven't been written yet. Drop the now-redundant per-file entries from the exclude list: `find_bad_utf8.py`, `find_unknown_base_reverse_dns.py`, and the already-removed `sortmaps.py`. The non-.py exclusions stay (`base_reverse_dns.csv`, `unknown_base_reverse_dns.csv`, `README.md`, `*.bak`). Verified with `hatch build`: - Wheel under parsedmarc/resources/maps/: __init__.py + 3 data files (CSV/TXTs), no maintainer .py - sdist matches - Clean-venv install of the built wheel loads 298 PSL overrides and `get_base_domain('host01.netlify.app')` returns `netlify.app` Bump to 9.11.2 since this changes shipped artifacts. 
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> --------- Co-authored-by: Sean Whalen <seanthegeek@users.noreply.github.com> Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-08 12:36:48 -04:00
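The README-as-single-source-of-truth approach from the commit above (parse the bullets between the `types-list` markers, then trim, deduplicate and sort them case-insensitively) could look roughly like the sketch below. The function names `load_types_sketch` and `normalize_types_sketch` are hypothetical, and the real `load_types_from_readme()` in sortlists.py may differ in its signature and error handling; this version works on a string rather than a file so it stays side-effect free.

```python
import re

START = "<!-- types-list:start -->"
END = "<!-- types-list:end -->"


def load_types_sketch(readme_text):
    """Return the bullet items between the markers, in document order."""
    start = readme_text.find(START)
    end = readme_text.find(END)
    if start == -1 or end == -1 or end < start:
        raise ValueError("types-list markers not found in README text")
    block = readme_text[start + len(START):end]
    types = []
    for line in block.splitlines():
        # Accept "- Item" or "* Item" bullets; ignore everything else.
        match = re.match(r"^\s*[-*]\s+(.*\S)\s*$", line)
        if match:
            types.append(match.group(1))
    return types


def normalize_types_sketch(types):
    """Trim, dedupe case-insensitively, and sort case-insensitively."""
    seen = {}
    for item in types:
        item = item.strip()
        key = item.casefold()
        if key in seen and seen[key] != item:
            # Two spellings that differ only by case would silently lose one.
            raise ValueError(f"{seen[key]!r} and {item!r} differ only by casing")
        seen[key] = item
    return sorted(seen.values(), key=str.casefold)
```

Keeping the read step separate from the normalizing step follows the split the commit describes between a pure read path and a rewriting one.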
Sean Whalen	053195581b	collect_domain_info.py: opt-in DuckDuckGo search fallback for bot-blocked rows (#767 ) * collect_domain_info.py: opt-in DuckDuckGo search fallback for bot-blocked rows A meaningful share of KU domains return a Cloudflare / DDoS-Guard / "Are you a robot?" / px-captcha interstitial instead of real homepage content — even after the curl-style relaxed-TLS fallback runs. For those rows we have neither homepage signal nor (often) a usable as_name, and they fall through to KU even though the operator is a real (often well-known) business that the classifier could trivially handle if it could just see the page. Added an opt-in `--use-search-fallback` flag that asks DuckDuckGo for `site:<domain>` when the homepage fetch returned a bot-block / parking / empty result, and uses the top result's title and description (only if the result host belongs to the input domain — anti-SEO-spam guard). Mechanism - New optional `ddgs` dependency, listed under the `[build]` extras. `from ddgs import DDGS` is wrapped in a try/except — the script runs without ddgs installed as long as `--use-search-fallback` isn't passed; the flag check exits with a helpful install message otherwise. - `_SEARCH_FALLBACK_TRIGGER_RE` — title/description patterns that look like a bot-block / WAF interstitial / parked / placeholder. Triggers the fallback. Same shape as the classifier's TITLE_NOISE_RE / PARKED_PAGE_RE; the search fallback is the recovery path for exactly the rows that filter excludes. - `_looks_bot_blocked()` — combined check: trigger regex matches OR title and description are both empty (typical of WAF interstitials that strip <title>/<meta> entirely). - `_hosts_match()` — same-domain SEO-spam guard. A search result is accepted only when its host is exactly the input domain or a subdomain of it. Third-party SEO-spam pages that scraped the domain name are silently skipped. 
- `_search_fallback_fetch()` — runs `site:<domain>` through DDG, walks results in rank order, returns the first one whose host passes the guard. Returns empty if no result matches (caller leaves the row's homepage data alone in that case). - `_collect_one()` now takes a `use_search_fallback` flag, calls the fallback after the homepage fetch when the homepage looks bot-blocked, and writes `title_source = "homepage"` or `"search"` so reviewers can audit which rows came from where. - New `title_source` column in the TSV. Smoke test Test set: bbc.com (real homepage, no fallback expected) plus 5 known Cloudflare-walled rows (1800contacts.com, americaneagle.com, broadwaytechnology.com, health.gov.il, mfa.gov.il). Result: bbc.com classified via homepage; the other 5 all recovered title + description via search and got `title_source=search`. The same-domain guard validated independently — for broadwaytechnology.com the guard correctly rejects bloomberg.com and accepts support.broadwaytechnology.com (broadway was acquired by Bloomberg, but the search fallback returns the broadway-domain snippet, not the parent's bloomberg.com product page). Caveats codified in AGENTS.md - Search snippets are still untrusted text (data-not-instructions rule applies the same way it does to homepage HTML). - DDG's index can lag a homepage rebrand by months — when a row classified via `title_source=search` disagrees with a fresh manual fetch, prefer the manual verification. The fallback is a recovery aid, not a tiebreaker against fresh content. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * collect/classify: link-following + alias map rows for placeholder DDG titles When the search fallback ran on the original 6-domain smoke set, two of the recovered titles were essentially placeholder pointers carrying no classifier signal — DDG returned `Link to fcs.health.gov.il` for one input and a bare `yangon.mfa.gov.il` for another. 
Those snippets are DDG's way of saying "I have an indexed subdomain but no real abstract to give you", and feeding them to the regex classifier produces no better signal than the parking-page result we were already trying to recover from. This commit teaches the collector to recognize both placeholder shapes, follow the pointer to the target hostname, and use that hostname's real content for the row. The classifier then emits the original input and the link target as two map rows under the same (name, type) so both keys are looked up against future DMARC reports. collect_domain_info.py - New `_LINK_TO_TITLE_RE` / `_BARE_HOSTNAME_RE` and an `_extract_link_target` helper that returns the target hostname when the search title is `Link to <hostname>` or a bare hostname, "" when the title carries real content. - After the search-fallback path, if the title looks like a pointer and the target differs from the input, `_fetch_homepage(target)` is called once. When the target's fetch returns real (non-bot-blocked) content, the row's title / description / final_url / rebrand_signal / external_links are replaced with the target's, and `title_source` becomes `search→<target>` so reviewers can audit the path. - New `link_target_domain` column records the followed target whether or not its fetch succeeded. classify_unknown_domains.py - When a row's `link_target_domain` is set and differs from the input domain, the classifier emits a second map row for the target with the same `(name, type)`. The original input is the "og" domain; the target is what DDG pointed us at — both end up in the map as aliases. Same handling applies on the ambiguous-bucket path so a single human adjudication covers both. 
Smoke test on the original 6-domain set: bbc.com homepage → BBC Home – Breaking News, … 1800contacts.com search → 1800contacts health.gov.il search → Homepage – COVID Information Center of the Israel Ministry of Health americaneagle.com search → Americaneagle.com \| Web Design … broadwaytechnology.com search → Bloomberg Completes Acquisition of … mfa.gov.il search→yangon.mfa.gov.il → Home \| Ministry of Foreign Affairs link_target_domain=yangon.mfa.gov.il The mfa.gov.il row triggered the new path: DDG returned `yangon.mfa.gov.il` as the title, the collector followed it, the target's homepage gave us "Home \| Ministry of Foreign Affairs", and the classifier emitted both `mfa.gov.il, Ministry of foreign affairs, Government` and `yangon.mfa.gov.il, Ministry of foreign affairs, Government`. AGENTS.md updated with the link-following / alias rules under the search-fallback subsection. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * Run --use-search-fallback against 10,544 bot-blocked KU rows; +473 promotions Also expands the search-fallback trigger regex to recognize self-signed TLS interception (firewall block via cert) and a wider class of local-firewall block-page strings. Mechanics 1. Identified 10,544 KU rows from the 34,647-row prior TSV that looked bot-blocked (via the new `_looks_bot_blocked` detector). 2. Ran `collect_domain_info.py --use-search-fallback` against just those rows. Throughput was ~3.4 rows/sec at 32 workers / 3s HTTP timeout / 5s WHOIS timeout. ~50 min wall time. 3. Audited the resulting TSV and discovered 2,078 rows whose homepage fetch had silently returned a corporate firewall's block page (Fortinet "Web Filter Violation" being the most common, 1,419 of them). The original `_SEARCH_FALLBACK_TRIGGER_RE` didn't recognize those strings, so search-fallback wasn't firing — the firewall's block-page text was being fed to the classifier as if it were the operator's homepage. 
Almost no false promotions resulted (block-page text doesn't match industry detectors), but the rows weren't recovering either. 4. Expanded the trigger regex to catch web-filter block pages, then re-fetched just the 2,078 affected rows. 5. Final classifier pass: 474 unambiguous map adds, 41 ambiguous, 1 silently dropped (adult content), 10,066 still in KU. Self-signed-cert detection A separate fix lands in this commit: when the primary fetch fails with an SSL cert verification error matching "self-signed certificate", the collector skips the verify=False browser fallback. Rationale: TLS- intercepting firewalls (corporate or personal-network) present their own self-signed cert specifically when blocking. The verify=False fallback would happily retrieve the firewall's block page, which then poisons the row's title/description. Skipping that path leaves the row's metadata empty so search-fallback can recover real content. Other cert errors (hostname mismatch, weak DH, legacy renegotiation) keep the existing fallback path because they're typically real operators with misconfigured TLS rather than firewall interception. Numbers Map: 37,640 → 38,114 (+474) KU: 32,324 → 31,886 (−438) Disjoint check: 0 shared keys Unknown CSV: regenerated, just the header Type distribution of the 474 promotions 162 ISP 17 MSP 4 MSSP / Marketing 72 Web Host 16 Technology 4 Beauty / Agriculture 41 Finance 14 Healthcare 3 IaaS / Science / Legal 19 Government 11 Travel 2 Search / Religion / SaaS 10 Logistics 8 Manufacturing 2 Email Sec / Email Provider 9 Education / Retail 8 News 2 Entertainment 7 Utilities / Phys Sec 6 Real Estate 1 Auto / Staff / PaaS 6 Food / Consulting / Industrial / Conglomerate / Nonprofit Most of the gains are network operators (162 ISPs, 72 Web Hosts) — the population that's most likely to be Cloudflare-walled or DDoS- Guard-walled at the homepage layer but show up clearly in DDG abstracts. 
Smoke audit on a 30-row random sample of map adds: 28 plausible, 2 borderline (`es.graphicpkg.com → Food` could also be Industrial since Graphic Packaging makes packaging for the food industry, but the vertically-specialized rule applies; `annuairesante.ameli.fr` → Finance via French health-insurance vocabulary, defensible). The 41 ambiguous rows stay in KU per the established workflow — they need the same one-row-at-a-time human triage as PR #766 used. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * Search-fallback batch (partial; outage-truncated): +226 promotions Hotspot-bypass collector run was interrupted ~6,300/10,107 in when the hotspot lost connectivity and the machine reverted to the firewalled connection. Stopping here to commit what was unambiguously classifiable; the remaining ~3,800 candidates (plus any rows whose homepage fetch was tainted by the firewall fallback during the transition) will be re-collected in a fresh run after network stability is restored. Promotions in this batch: - 219 auto-classified by the regex classifier on the partial TSV - 17 ambiguous rows resolved per LLM auto-resolution rules + user manual review - 5 KU rows the user adjudicated explicitly (Bielsko-Biała, Douala-IX, Ekol Logistics, ICB, Marcus Corporation) - 13 from earlier triage worklist with brands assigned - Net 226 net-new map entries after dedupe, alias-leak filtering (3 link-target subdomains dropped where the parent base was already in the adds), full-IP privacy filtering (2 dropped), and ~30 targeted brand/category cleanups for rows where the search-fallback snippet had picked up a wrong page or the title contained registrant cruft / corporate-suffix leaks. AGENTS.md updates: - Codifies the "LLM auto-resolution of high-confidence ambiguous rows" workflow with R1-R5 high-confidence rules, low-confidence surface-to-human criteria, and the one-line auto-decision output format for reviewer overrule. 
- Adds 7 triage lessons learned during this batch's bot-blocked-KU review (Polish/IT/ES/GR/RO city domains, "Sports Club" venues, vertically-specialized investment firms, sub-page fetch FPs, Telecom-suffix brand pinning, Hospital/Health-System suffix, IXP -ix brand pinning). Map and KU files are disjoint after this commit. unknown_base_reverse_dns.csv is empty (header-only) since every base_reverse_dns input is now either mapped or in KU. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * Search-fallback hotspot batch: +213 promotions Fresh hotspot run on the 9,881 still-bot-blocked KU candidates left after the prior outage-truncated batch. Classifier: 202 auto + 31 ambiguous (14 LLM auto-resolved per the R1-R5 high-confidence rules, 17 surfaced for interactive review) + 9,665 still KU + 1 dropped. Net 213 net-new map entries after dedupe, alias-leak filtering (13 link-target subdomains dropped where the parent base was already in the map or in this batch's adds), 1 full-IP privacy filter, 2 user-DROPs (1 alias of an as-numbered domain, 1 KU because the only signal was a cross-vertical client list), and ~8 targeted brand cleanups for rows where the search snippet had left a registrant-leak or domain-as-name placeholder. 
LLM auto-resolutions (R1-R5): africell.ao ISP wi-tribe.pk ISP ags.school.nz Education vwfs.com.au Finance allaria.com.ar Finance wanxp.com ISP asturias.org Government varendraisp.com ISP bdo.com.ph Finance titansi.com.my IaaS bikada.kz ISP redeyenetworks.com MSSP informatiq.org ISP plusinfo.ru ISP User-decided rows: admincomp.com Consulting korisp.com Web Host anrb.ru Science linkexplorer.net.br ISP arpc.ir Industrial novatech.bg MSP as63031.net Consulting reliable-nets.com ISP aviti.net Web Host satortech.com MSP binaryelements.com.au MSP skyworld.co.ke Finance juni.net.br ISP telegroup-ltd.com Technology west-webworld.fr Technology User KU/drops: itatec.com.py KU (cross-vertical client list, no operator signal) ns2.as63031.net DROP (alias of as63031.net) AGENTS.md addition: codifies the "Web Host vs Email Provider — bundled email-hosting is still Web Host" rule. Same shape as the existing CCaaS/CPaaS-vs-ISP and MSP-vs-MSSP rules: classify by the operator's primary product, not by every feature in their bundle. Prompted by the korisp.com triage during this batch. Map and KU files are disjoint after this commit. unknown_base_reverse_dns.csv remains header-only (every base_reverse_dns input is now mapped or in KU). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> --------- Co-authored-by: Sean Whalen <seanthegeek@users.noreply.github.com> Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-08 11:33:10 -04:00
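The same-domain guard described for `_hosts_match()` in the commit above (accept a search result only when its host is exactly the input domain or a subdomain of it) can be sketched as below. This is a minimal illustration under assumptions: the names `hosts_match_sketch` and `first_matching_result` are hypothetical, and the `href` key on each result dict is assumed rather than taken from the collector's actual data shape.

```python
from urllib.parse import urlsplit


def hosts_match_sketch(input_domain, result_url):
    """True when the result URL's host is the input domain or a subdomain."""
    domain = input_domain.strip().lower().rstrip(".")
    host = (urlsplit(result_url).hostname or "").lower().rstrip(".")
    if not domain or not host:
        return False
    # The leading dot prevents "evilexample.com" from matching "example.com".
    return host == domain or host.endswith("." + domain)


def first_matching_result(input_domain, results):
    """Walk results in rank order; return the first that passes the guard.

    `results` is an iterable of dicts with an assumed 'href' key.
    Returns None when nothing matches, so the caller can leave the row alone.
    """
    for result in results:
        if hosts_match_sketch(input_domain, result.get("href", "")):
            return result
    return None
```

Comparing on the parsed hostname rather than on the raw URL string means a third-party page that merely mentions the domain in its path is rejected.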
Sean Whalen	397378de8e	Bump mailsuite to >=2.0.2 for 9.11.1 release (#743) Addresses RuntimeError: Event loop is closed in the MS Graph mailbox backend (#742). Co-authored-by: Sean Whalen <seanthegeek@users.noreply.github.com> Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-30 11:59:11 -04:00
Sean Whalen	5d816a4e56	Offload mailbox layer to mailsuite>=2.0.0 (#741) mailsuite 2.0.0 extracted the IMAP, Microsoft Graph, Gmail, and Maildir connections out of parsedmarc into mailsuite.mailbox so other projects can reuse the same provider-agnostic interface. Replace the parsedmarc/mail submodules with a thin re-export of mailsuite.mailbox and drop the duplicated implementations. Per the migration note in seanthegeek/mailsuite#22, pass token_cache_name="parsedmarc" so existing AuthenticationRecord caches on disk continue to work without re-prompting users to authenticate. The existing graph_url config knob is forwarded unchanged. Drop direct dependencies that are now installed transitively via mailsuite[gmail,msgraph] (msgraph-core, imapclient, google-*). The extras are pulled in non-optionally so Gmail and Microsoft Graph support remain available out of the box. Drop nine test classes that were exercising mailsuite-side implementation internals (TestGmailConnection, TestGraphConnection, TestImapConnection, the _get_creds/_generate_credential half of TestGmailAuthModes, TestImapFallbacks, TestMSGraphFolderFallback, TestMaildirConnection, TestMaildirReportsFolder, TestMaildirUidHandling, TestTokenParentDirCreation); these are mailsuite's tests now. The CLI integration tests that mock parsedmarc.cli.{IMAP,Gmail,MSGraph}Connection are kept. Co-authored-by: Sean Whalen <seanthegeek@users.noreply.github.com> Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-28 00:58:36 -04:00
Sean Whalen	2ac8cb406e	Replace DB-IP Country Lite with IPinfo Lite (9.8.0) (#711) Switch the bundled IP-to-country database from DB-IP Country Lite to IPinfo Lite for greater lookup accuracy. The download URL, cached filename, and packaged module path all move from dbip/dbip-country-lite.mmdb to ipinfo/ipinfo_lite.mmdb. IPinfo Lite uses a different MMDB schema (flat country_code) that is incompatible with geoip2's Reader.country() helper, so get_ip_address_country() now uses maxminddb directly and handles both the IPinfo schema and the MaxMind/DBIP nested country.iso_code schema so users who drop in their own MMDB from any of these providers continue to work. Drop the geoip2 dependency (it was only used for the incompatible helper) and add maxminddb as a direct dependency; it was already installed transitively through geoip2. Callers that imported parsedmarc.resources.dbip directly need to switch to parsedmarc.resources.ipinfo. Old parsedmarc versions downloading from the dbip/ GitHub raw URL will 404 and fall back to their bundled copy; this is the documented behavior of load_ip_db(). Co-authored-by: Sean Whalen <seanthegeek@users.noreply.github.com> Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-23 00:31:54 -04:00
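Handling both MMDB record shapes named in the commit above (IPinfo's flat `country_code` and the MaxMind/DB-IP nested `country.iso_code`) might look like the sketch below. It is not the code in `get_ip_address_country()`: the function name `country_code_from_record` is hypothetical, the sample records are made up, and the sketch assumes the record is the plain dict (or `None`) that a lookup such as maxminddb's `Reader.get()` would return.

```python
def country_code_from_record(record):
    """Extract a country code from either supported MMDB record shape.

    Returns None when the record is missing or has neither shape.
    """
    if not isinstance(record, dict):
        return None  # no match for this IP in the database

    # IPinfo Lite style: {"country_code": "US", ...}
    flat = record.get("country_code")
    if isinstance(flat, str) and flat:
        return flat

    # MaxMind / DB-IP style: {"country": {"iso_code": "US", ...}, ...}
    nested = record.get("country")
    if isinstance(nested, dict):
        iso_code = nested.get("iso_code")
        if isinstance(iso_code, str) and iso_code:
            return iso_code

    return None
```

Checking the flat key first and falling back to the nested one lets a user-supplied database from either provider work without any configuration switch.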
Sean Whalen	2032438d3b	9.4.0 ### Added - Extracted `load_reverse_dns_map()` utility function in `utils.py` for loading the reverse DNS map independently of individual IP lookups. - SIGHUP reload now re-downloads/reloads the reverse DNS map, so changes take effect without restarting. - Add premade OpenSearch index patterns, visualizations, and dashboards ### Changed - When `index_prefix_domain_map` is configured, SMTP TLS reports for domains not in the map are now silently dropped instead of being output. Unlike DMARC, TLS-RPT has no DNS authorization records, so this filtering prevents processing reports for unrelated domains. - Bump OpenSearch support to `< 4` ### Fixed - Fixed `get_index_prefix` using wrong key (`domain` instead of `policy_domain`) for SMTP TLS reports, which prevented domain map matching from working for TLS reports. - Domain matching in `get_index_prefix` now lowercases the domain for case-insensitive comparison.	2026-03-23 17:08:26 -04:00
Kili	e98fdfa96b	Fix Python 3.14 support metadata and require imapclient 3.1.0 (#662 )	2026-03-04 12:36:15 -05:00
Copilot	2e3ee25ec9	Drop Python 3.9 support (#661 ) * Initial plan * Drop Python 3.9 support: update CI matrix, pyproject.toml, docs, and README Co-authored-by: seanthegeek <44679+seanthegeek@users.noreply.github.com> * Update Python 3.9 version table entry to note Debian 11/RHEL 9 usage Co-authored-by: seanthegeek <44679+seanthegeek@users.noreply.github.com> --------- Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com> Co-authored-by: seanthegeek <44679+seanthegeek@users.noreply.github.com>	2026-03-03 11:34:35 -05:00
Sean Whalen	dd9ef90773	9.0.10 - Support Python 3.14+	2026-01-17 14:09:18 -05:00
Sean Whalen	34fa0c145d	9.0.8 - Fix logging configuration not propagating to child parser processes (#646). - Update `mailsuite` dependency to `>=1.11.1` to solve issues with iCloud IMAP (#493).	2025-12-29 17:07:38 -05:00
Sean Whalen	bc1dae29bd	Update mailsuite dependency version to 1.11.0	2025-12-25 15:32:27 -05:00
Sean Whalen	af9ad568ec	Specify Python version requirements in pyproject.toml	2025-12-17 16:18:24 -05:00
Sean Whalen	cdd000e675	9.0.3 - Set `requires-python` to `>=3.9, <3.14` to avoid [this bug](https://github.com/python/cpython/issues/142307)	2025-12-05 10:43:28 -05:00
Anael Mobilia	4b98d795ff	Define minimal Python version on pyproject (#634 )	2025-12-01 20:22:49 -05:00
Anael Mobilia	00267c9847	Codestyle cleanup (#631 ) * Fix typos * Copyright - Update date * Codestyle xxx is False -> not xxx * Ensure "_find_label_id_for_label" always return str * PEP-8 : apiKey -> api_key + backward compatibility for config files * Duplicate variable initialization * Fix format	2025-11-30 19:13:57 -05:00
Sean Whalen	a05c230152	8.19.0 (#622 ) 8.19.0 - Add multi-tenant support via an index-prefix domain mapping file - PSL overrides so that services like AWS are correctly identified - Additional improvements to report type detection - Fix webhook timeout parsing (PR #623) - Output to STDOUT when the new general config boolean `silent` is set to `False` (Close #614) - Additional services added to `base_reverse_dns_map.csv` --------- Co-authored-by: Sean Whalen <seanthegeek@users.noreply.github.com> Co-authored-by: Félix <felix.debloisbeaucage@gmail.com>	2025-11-28 12:47:00 -05:00
Sean Whalen	4b3d32c5a6	Actual, actual Actual 6.18.7 release Revert back to using python csv instead of pandas to avoid conflicts with numpy in elasticsearch	2025-08-17 20:36:15 -04:00
Sean Whalen	5df5c10f80	Pin pandas and numpy versions	2025-08-17 19:59:53 -04:00
Sean Whalen	9f339e11f5	Actual 6.18.7 release	2025-08-17 19:34:14 -04:00
Sean Whalen	3feb478793	8.18.6 - Fix since option to correctly work with weeks (PR #604) - Add 183 entries to `base_reverse_dns_map.csv` - Add 57 entries to `known_unknown_base_reverse_dns.txt` - Check for invalid UTF-8 bytes in `base_reverse_dns_map.csv` at build - Remove unneeded items from the `parsedmarc.resources` module at build	2025-08-17 17:00:11 -04:00
Sean Whalen	607a091a5f	8.18.3 - Move `__version__` to `parsedmarc.constants` - Create a constant `USER_AGENT` - Use the HTTP `User-Agent` header value `parsedmarc/version` for all HTTP requests	2025-06-02 16:43:26 -04:00
Sean Whalen	f2133aacd4	Fix build dependencies	2024-12-25 18:52:42 -05:00
Sean Whalen	31917e58a9	Update build backend	2024-12-25 18:28:30 -05:00
Sean Whalen	976a3274e6	8.15.2	2024-10-24 18:04:19 -04:00
Jed Laundry	8444053476	Create optional dependency group for build, fix codecov (#567 ) * Create optional dependency groups for build and cli * revert cli optional-dependencies group	2024-10-07 13:47:35 -04:00
Sean Whalen	1ef3057110	8.15.1 - Proper IMAP namespace fix (Closes issue #557 and issue #563) - Require `mailsuite>=1.9.17` - Revert PR #552 - Add pre-flight check for nameservers (PR #562 closes issue #543) - Reformat code with `ruff`	2024-10-02 21:19:57 -04:00
Jason Lingohr	11e0461b9d	Add GELF support (#532 ) * Implement the ability to log to a GELF server/input, via the use of pygelf. * Fix flake8 style checks.	2024-08-24 11:28:55 -04:00
Patrick Linnane	f98dc6d452	build: move to `kafka-python-ng` (#510 ) Signed-off-by: Patrick Linnane <patrick@linnane.io>	2024-05-22 08:11:29 -04:00
Szasza Palmer	995bdbcd97	adding OpenSearch support, fixing minor typos, and code styling (#481 ) * adding OpenSearch support, fixing minor typos and code styling * documentation update	2024-03-04 10:06:26 -05:00
Sean Whalen	b8088505b1	Add support for SMTP TLS reports (#453 )	2024-02-19 18:45:38 -05:00
Jed Laundry	a06fdc586f	Change publicsuffix2 to publicsuffixlist (#406 ) * change to publicsuffixlist * update publicsuffixlist (now auto-updating) * Fix unused imports	2023-05-09 08:49:41 -04:00
rubeste	a7280988eb	Implemented Azure Log Analytics ingestion via Data Collection Rules (#394 ) * Implemented Azure Log Analytics ingestion via Data Collection Rules * Update loganalytics.py * Update cli.py * Update pyproject.toml * Fixed config bug Fixed a bug that causes the program to fail if you do not configure a Data stream. * Fixed code format	2023-05-03 15:54:25 -04:00
Sean Whalen	1e0fa9472c	Fix build	2022-09-09 16:46:57 -04:00
Sean Whalen	10e15d963b	8.3.1 - Handle unexpected xml parsing errors more gracefully	2022-09-09 16:22:28 -04:00

39 Commits
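
Commit 2ac8cb406e says the country lookup now handles two MMDB record layouts: the flat `country_code` field used by IPinfo Lite and the nested `country.iso_code` field used by MaxMind and DB-IP. The following is a minimal sketch of that dual-schema handling, not the project's actual code. The helper name `extract_country_code` is hypothetical, and the record is assumed to be a plain dict of the kind an MMDB reader returns for an IP address.

```python
from typing import Any, Mapping, Optional


def extract_country_code(record: Optional[Mapping[str, Any]]) -> Optional[str]:
    """Return a country code from an MMDB record, or None if absent.

    Hypothetical helper for illustration; not part of parsedmarc.
    """
    if not record:
        # No record for this IP address (or an empty record).
        return None

    # IPinfo Lite style: flat top-level "country_code" key.
    flat = record.get("country_code")
    if isinstance(flat, str) and flat:
        return flat

    # MaxMind / DB-IP style: nested "country" -> "iso_code".
    country = record.get("country")
    if isinstance(country, Mapping):
        iso = country.get("iso_code")
        if isinstance(iso, str) and iso:
            return iso

    return None


if __name__ == "__main__":
    print(extract_country_code({"country_code": "US"}))
    print(extract_country_code({"country": {"iso_code": "DE"}}))
    print(extract_country_code(None))
```

Checking the flat key first and then falling back to the nested one lets a single code path serve a database from any of the three providers, which is the compatibility goal the commit message describes.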