mirror of
https://github.com/domainaware/parsedmarc.git
synced 2026-05-25 21:35:22 +00:00
08db305e5af4736c731746953fa68c8acf216b29
1541 Commits
| Author | SHA1 | Message | Date | |
|---|---|---|---|---|
|
|
08db305e5a |
test: cover no-display-name Reply-To header flattening (#786)
The 10.0.3 Reply-To header flattening (elastic.py / opensearch.py line 711)
has two branches: display-name present ("Name <addr>") and absent (bare
address). The existing test only exercised the former, leaving the
empty-display-name branch uncovered — the two lines Codecov flagged on the
10.0.3 patch. Add a failure report whose Reply-To has no display name and
assert sample.headers["reply-to"] flattens to the bare address.
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
||
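The two Reply-To branches described in the commit above can be sketched as follows. This is a minimal illustration, not the code at line 711 of elastic.py / opensearch.py: the function name `flatten_address` and its parameters are hypothetical, and the output shapes are taken only from the commit message ("Name <addr>" versus a bare address).

```python
def flatten_address(display_name: str, address: str) -> str:
    """Flatten a (display name, address) pair into one header string.

    Hypothetical helper for illustration; the real flattening lives in
    parsedmarc's Elasticsearch/OpenSearch output modules.
    """
    if display_name:
        # Display-name branch: "Name <addr>"
        return f"{display_name} <{address}>"
    # No-display-name branch: the bare address
    return address


with_name = flatten_address("Jane Doe", "jane@example.com")
without_name = flatten_address("", "jane@example.com")
```

A test for the second branch only needs an input whose display name is empty, which is what the added failure report provides.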
|
|
e104f1118c |
Land 10.0.3 changes on master (#785)
PR #784 was stacked on the #783 branch and its base was never retargeted to master, so it merged into fix/mailsuite-2.2.1-empty-address instead of master. master therefore has 10.0.2 (#783's squash) but is missing the 10.0.3 changes.
This re-lands exactly that delta: the Reply-To/Delivered-To parser fix, the ES/OS Reply-To header flattening, and the Splunk/OpenSearch/Grafana failure dashboard fixes, with the version bumped to 10.0.3. No mailsuite re-bump (the >=2.2.1 floor is already on master from 10.0.2).
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
10.0.3 |
||
|
|
2c8b2c0f14 |
Bump mailsuite to >=2.2.1 (release 10.0.2) (#783)
* Bump mailsuite to >=2.2.1; release 10.0.2
mailsuite 2.2.1 raises the transitive mail-parser floor to >=4.2.1, which
stops mail-parser from returning a phantom ('', '') entry for absent address
headers (verified against samples/failure/* with mail-parser 4.2.1: cc/bcc
now parse to [] instead of [{address: ""}]). parsedmarc reads the mail-parser
object directly via its own parse_email(), so this previously caused an empty
{address: ""} Cc/Bcc entry to be indexed for every failure-report sample in
Elasticsearch/OpenSearch and emitted in JSON/S3/Kafka output.
The Reply-To-always-empty behavior in parsedmarc's own parse_email() (a
hyphen-vs-underscore key mismatch, not an upstream issue) and the failure
dashboards are out of scope here and tracked separately.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* docs: note CVE-2023-27043 hardening from mail-parser 4.2.1 in 10.0.2
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
---------
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
||
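The symptom described in the commit above can be illustrated with plain data. The commit itself fixes the problem by raising the dependency floor, not by filtering; the helper names below are hypothetical, and `drop_phantom_pairs` is only a sketch of a defensive alternative, assuming address headers arrive as `(display_name, address)` pairs.

```python
def addresses_to_entries(pairs):
    """Turn (display_name, address) pairs into the indexed entry shape."""
    return [{"address": address} for _name, address in pairs]


def drop_phantom_pairs(pairs):
    """Discard pairs whose address is empty, such as a phantom ('', '')."""
    return [(name, address) for name, address in pairs if address]


# Older behaviour: an absent Cc header still yields one phantom pair.
old_cc = addresses_to_entries([("", "")])
# Behaviour described for mail-parser 4.2.1: an absent header yields no pairs.
new_cc = addresses_to_entries([])
# Hypothetical defensive filter applied to mixed input.
filtered = drop_phantom_pairs([("", ""), ("Jane", "jane@example.com")])
```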
|
|
3f64e30f6f | Update version to 10.0.1 and bump mailsuite requirement to >=2.2.0 | ||
|
|
9e675bf43c | Update build command to use pytest with coverage reporting 10.0.0 | ||
|
|
d92593f2da | Add RFC 9990 fields to opensearch_dashboards.json | ||
|
|
180fc581fe |
fix: OSD Global-tenant import + dropped report files with glob metacharacters; validate dev stack on OpenSearch 3.x with PostgreSQL (#781)
* fix: import OpenSearch dashboards into the real Global tenant
dashboard-dev-bootstrap.sh sent `securitytenant: global_tenant`. The OpenSearch security plugin reads that header as a tenant *name*, and `global_tenant` is a sample custom tenant from the security demo config -- not the shared Global tenant, whose token is the literal `global`. The import therefore landed in a separate `global_tenant` tenant (its own `.kibana_<hash>_globaltenant_1` index) and the dashboards were invisible to anyone viewing the Global tenant in OpenSearch Dashboards.
Verified against the live dev cluster: `_find` under `securitytenant: global` returned 26 objects and `.kibana_1` (the Global tenant index the UI reads) went from 2 to 67 docs after re-importing with the fix. An empty/omitted header read 0 from Global -- it falls back to the user's configured default tenant -- so `global` is the only reliable token.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* fix: don't drop report files whose names contain glob metacharacters
The CLI expanded every file argument with glob(), which treats [, ], *, and ? as pattern syntax. A literal path like "[Netease DMARC Failure Report] Rent Reminder.eml" -- the bracketed shape many providers use for emailed failure reports -- was read as a character class, matched nothing, and was dropped before reaching the parser, with no error. File arguments that exist on disk are now taken literally; only non-existent paths are globbed, so shell-style wildcards still expand.
Also adds "postgresql" to _KNOWN_SECTIONS so PARSEDMARC_POSTGRESQL_* env vars (and their _FILE Docker-secret variants) resolve like every other backend -- the PostgreSQL backend is new in 10.0.0, so this completes the unreleased feature rather than fixing a released regression, and is documented under the PostgreSQL enhancement, not Bug fixes.
Regression tests added for both.
Verified end-to-end: all four samples/failure/*.eml now index (the bracketed Netease report included). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * dev: validate dashboards on OpenSearch 3.x and add PostgreSQL to the dev stack The dev stack ran OpenSearch Dashboards 3.x against OpenSearch 2.x, an unsupported cross-major pairing. Bump opensearch to :3 (validated on 3.6.0: OSD import into the Global tenant and all dashboards work). Add a postgresql service plus bootstrap wiring so the new PostgreSQL backend is exercised alongside the others: wait for PG, seed it via PARSEDMARC_POSTGRESQL_* env vars on the same parsedmarc run, wipe it on RESEED, create a Grafana grafana-postgresql-datasource (uid dmarc-pg), and import dashboards/grafana/Grafana-DMARC_Reports-PostgreSQL.json. PG seeding is gated on psycopg being importable: parsedmarc aborts the whole run (exit 1, nothing written to any backend) when a configured output backend can't initialize, so wiring in PG without the optional extra would silently zero ES/OS/Splunk too. When psycopg is absent the script warns and skips PG, leaving the other backends seeded. Also fix the Grafana admin password env: the container was given GRAFANA_PASSWORD, which Grafana ignores -- it reads GF_SECURITY_ADMIN_PASSWORD. Defaults to admin to match the script. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * docs: list PostgreSQL on the premade-dashboards features bullet PostgreSQL ships a premade Grafana dashboard (dashboards/grafana/Grafana-DMARC_Reports-PostgreSQL.json), so it belongs on the "for use with premade dashboards" bullet alongside Elasticsearch, OpenSearch, and Splunk rather than on the plain-output-destinations line. 
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * fix: clear stale org_email mapping conflict in the OpenSearch dashboards The aggregate index pattern in dashboards/opensearch/opensearch_dashboards.ndjson shipped a cached field-list snapshot where org_email was a text/object conflict, plus leftover org_email.#text and org_email.#text.keyword subfields. Those came from a cluster that had indexed a langAttrString email dict ({"#text": ..., "@lang": ...}) before the parser unwrapped it. org_email is mapped as Text() and parse_aggregate_report_xml now unwraps a dict email to a plain string, so current data is consistently text -- a clean cluster's _field_caps reports no conflict. Cleared the frozen conflict and the two artifact subfields, leaving org_email (text) and org_email.keyword, matching the live mapping. Verified: re-importing the corrected ndjson yields an index pattern with org_email as a plain text field and zero conflicts; only the aggregate index-pattern line changed, all other saved objects byte-identical. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * dev: seed the RFC 9990 (dmarc-2.0) aggregate samples samples/aggregate/rfc9990-sample.xml and rfc9990-example.net!...xml were not in the bootstrap's SAMPLE_FILES, so the dev stack only ever indexed RFC 7489 reports and the new DMARCbis fields (np, testing, discovery_method, generator, xml_namespace) never appeared in the OpenSearch/Kibana indices or were available to the dashboards. Added both samples (one declares the urn:ietf:params:xml:ns:dmarc-2.0 namespace, the other is namespaceless RFC 9990-shaped, covering both detection paths). Verified the seeded data now carries np/testing/ discovery_method/generator and xml_namespace=urn:ietf:params:xml:ns:dmarc-2.0; OpenSearch Dashboards surfaces them on an index-pattern field-list refresh. 
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * dev: auto-resolve (or create) a venv for the seed and ensure psycopg The seed previously required parsedmarc to be pre-installed and only warned-and-skipped PostgreSQL when psycopg was missing. Resolve the seed environment by precedence instead: 1. explicit PARSEDMARC_BIN -> used as-is, nothing installed 2. active $VIRTUAL_ENV 3. existing repo venv/ or .venv/ 4. otherwise create $REPO_ROOT/venv For cases 2-4, run `pip install -e .[postgresql]` only when the CLI or psycopg is missing, so the dev stack can populate Postgres out of the box without a manual install step. The explicit-PARSEDMARC_BIN path is left untouched (and the psycopg seed guard still warns/skips if that env lacks the extra). Verified: a RESEED run resolves the active venv, seeds ES/OS/Splunk/PG including the RFC 9990 fields, with no output-client errors. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> --------- Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com> |
||
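The glob-metacharacter fix in the commit above follows a simple rule: take a path literally if it exists, and glob it only otherwise. A minimal sketch of that rule, assuming a hypothetical helper name `expand_file_args` (the real CLI code is not shown in this log):

```python
import glob
import os
import tempfile


def expand_file_args(args):
    """Expand CLI file arguments without dropping literal bracketed names."""
    paths = []
    for arg in args:
        if os.path.exists(arg):
            # An existing path is taken literally, even if it contains [ ] * ?
            paths.append(arg)
        else:
            # Only non-existent paths are treated as glob patterns.
            paths.extend(sorted(glob.glob(arg)))
    return paths


with tempfile.TemporaryDirectory() as tmp:
    bracketed = os.path.join(tmp, "[Netease DMARC Failure Report] Rent Reminder.eml")
    plain = os.path.join(tmp, "plain.eml")
    for path in (bracketed, plain):
        with open(path, "w", encoding="utf-8") as handle:
            handle.write("placeholder")

    naive = glob.glob(bracketed)  # brackets are read as a character class
    literal = expand_file_args([bracketed])
    wildcard = expand_file_args([os.path.join(tmp, "*.eml")])
```

Checking `os.path.exists` first keeps shell-style wildcards working for callers that quote their patterns, while a real file with brackets in its name is no longer lost.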
|
|
411f5a8886 |
chore: tidy cSpell config; fix two doc typos (#779)
- Ignore data/export trees via cSpell.ignorePaths: parsedmarc/resources/** (maps tooling holds thousands of intentional foreign-language classifier keywords + bundled data), plus samples/** and dashboards/** (report samples and dashboard exports). These are data, not whitelist vocabulary, so excluding them keeps the editor quiet without bloating the word list.
- Add the remaining genuine false-positives across code, docs, CI workflows, and editor config to cSpell.words (technical terms, library names, SQL/identifier tokens, brand/operator and multilingual examples from AGENTS.md, plus charliermarsh/junitxml/mktemp/pipefail/seanthegeek).
- Fix two genuine typos found while triaging rather than whitelisting them: "maidir" -> "maildir" and "connexion" -> "connection".
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com> |
||
|
|
2b3cd32b9c |
chore: move PostgreSQL Grafana dashboard into dashboards/grafana/ (#780)
The PostgreSQL dashboard shipped at the repo-root grafana/ by oversight; every other dashboard source lives under dashboards/ (opensearch/, grafana/, splunk/). Move it next to the existing Grafana dashboard, list it in dashboards/README.md, and fix the CHANGELOG path reference. Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com> |
||
|
|
caac8e68f0 |
docs: note DMARC RFC support in the features list (#778)
* docs: note DMARC RFC support in the features list The features list only mentioned "draft and 1.0" aggregate reports. Spell out the standards parsedmarc parses: RFC 7489 (legacy DMARC) and the final DMARC standard RFC 9989 with RFC 9990 aggregate reports, RFC 6591 and RFC 9991 failure reports, and RFC 8460 SMTP TLS reports. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * docs: align Python compatibility table pipes (MD060) The emoji cells were padded for display width, leaving the source pipes misaligned by character count and tripping markdownlint MD060. Re-pad so every row's pipes line up by codepoint. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * docs: list all optional output destinations; fix table emoji alignment Expand the features list to cover every output sink: Elasticsearch, OpenSearch, Splunk, and PostgreSQL (premade dashboards), plus Kafka, Amazon S3, Azure Log Analytics (Microsoft Sentinel), Graylog (GELF), syslog, and HTTP webhooks. Also re-pad the Python compatibility table using display width (the status emoji render two columns wide), which is what markdownlint MD060 measures — the previous codepoint-based padding still tripped the rule. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * docs: separate PostgreSQL from the premade-dashboards clause PostgreSQL is a storage target without bundled premade dashboards, so it shouldn't sit inside the "for use with premade dashboards" phrase next to Elasticsearch/OpenSearch/Splunk. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * docs: move PostgreSQL to the non-dashboard outputs line Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * docs: use compact markdown tables Switch the markdown tables (Python compatibility, env-var section mapping) to compact single-space format. 
It reads cleanly in a text editor and sidesteps the column-alignment churn that emoji/variable-width content caused with padded tables (markdownlint MD060). The reStructuredText grid table in dmarc.md is left as-is — it relies on multi-line cells markdown can't express. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> --------- Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com> |
||
|
|
ef2fb84cc0 |
test: cover parsedmarc's mailbox processing loop end-to-end on a real Maildir (#777)
AGENTS.md notes get_dmarc_reports_from_mailbox was halted at low coverage
because honest testing needed a live IMAP server or mocks so deep they test
the mock. mailsuite's MaildirConnection is a real on-disk backend with no
network or credentials, so the fetch -> parse/classify -> route loop can now be
exercised for real in CI.
TestGetDmarcReportsFromMailboxMaildir delivers real sample reports (one
aggregate, failure, and SMTP-TLS email) plus an unparseable message into a
Maildir INBOX, runs get_dmarc_reports_from_mailbox offline, and asserts on
observable results — parsed report counts and which archive subfolder each
message physically lands in:
- each report type routed to Archive/{Aggregate,Failure,SMTP-TLS}, the junk
message to Archive/Invalid, INBOX drained
- delete=True removes processed messages instead of archiving them
- test=True parses and returns reports but moves nothing and creates no folders
setUp resets the module-global SEEN_AGGREGATE_REPORT_IDS dedup cache so test
order can't drop an already-"seen" aggregate report, and the maildir lives at a
fresh subpath so mailbox.Maildir(create=True) actually builds cur/new/tmp.
Lifts parsedmarc/__init__.py from 76% to 82%, honestly.
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
||
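The "fresh subpath" detail in the commit above comes from how the standard library's `mailbox.Maildir` treats `create=True`: it builds `cur/`, `new/`, and `tmp/` only when the mailbox directory itself does not exist yet. A small sketch of that behaviour, using a hypothetical helper `maildir_subdirs` and temporary paths chosen for illustration:

```python
import mailbox
import os
import tempfile


def maildir_subdirs(path):
    """Open a Maildir with create=True and report which subfolders exist."""
    mailbox.Maildir(path, create=True)
    return sorted(
        name for name in ("cur", "new", "tmp")
        if os.path.isdir(os.path.join(path, name))
    )


with tempfile.TemporaryDirectory() as tmp:
    # The temporary directory already exists, so nothing is created inside it.
    existing = maildir_subdirs(tmp)
    # A path that does not exist yet gets the full Maildir layout.
    fresh = maildir_subdirs(os.path.join(tmp, "maildir"))
```

This is why a test that points `Maildir` directly at a pre-created temporary directory can end up with a mailbox that has no `cur/new/tmp` folders.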
|
|
a6778707d7 |
Finish forensic→failure rename: archive-folder migration + dashboard/doc cleanup (#776)
The forensic→failure rename (#659) left a few loose ends and one deliberate hold-back. This closes them. Leftover rename misses (broken paths / stale canonical names): - CONTRIBUTING.md, dashboard-dev-bootstrap.sh: samples/forensic/* → samples/failure/* - dashboard-dev-bootstrap.sh, dashboards/README.md: dmarc_forensic_dashboard.xml → dmarc_failure_dashboard.xml (the file was already renamed; the import path and view name were not) - docs/source/usage.md: PARSEDMARC_GENERAL_SAVE_FORENSIC → ..._SAVE_FAILURE example - samples/parsedmarc.ini: save_forensic → save_failure - pyproject.toml, README.md: canonical "failure" naming (ci.ini intentionally keeps save_forensic to smoke-test the deprecated alias.) Archive subfolder rename + on-startup migration: - New failure reports now archive to <archive>/Failure (was <archive>/Forensic). - _migrate_forensic_archive_folder() runs once on startup (best-effort): renames Forensic→Failure when no Failure folder exists yet, merges the two when both exist, no-ops when there's no legacy folder, and logs-and-skips a mailbox it can't reorganize (warn, don't crash). This consolidates pre- and post-rename failure reports into one folder, replacing the previously documented decision to keep the folder named Forensic to avoid a split archive. Uses the folder-management API (folder_exists / rename_folder / merge_folders) added in mailsuite 2.1.0; the pin is bumped to >=2.1.0. Grafana dashboard (the rename PR updated OSD/Splunk/ES-OS but not Grafana): - Forensic panel titles + the datasource label → Failure; the fo-column display label and its linked byName field-override matcher both → "Failure Policy" (changed together so the column-width override keeps matching). - dev-bootstrap Grafana ES datasource: dmarc_forensic* → dmarc_f* (matches both pre-rename dmarc_forensic* and post-rename dmarc_failure*, like the OSD/Kibana dashboards); RESEED wipe loop now also clears dmarc_failure* indices. 
- Removed dashboards/grafana/Grafana-DMARC_Reports.json-new_panel.json, an orphan export accidentally committed in #736 and referenced by nothing. Tests (tests/test_init.py): - TestMigrateForensicArchiveFolderMaildir: real on-disk Maildir round-trips via mailsuite's MaildirConnection (no mocks) — rename, merge, no-op, and the full get_dmarc_reports_from_mailbox orchestration. Runs in CI (no network/creds). - TestMigrateForensicArchiveFolderErrorHandling: the one path a real Maildir can't reproduce — a backend that raises mid-operation must warn, not crash. Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com> |
||
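The migration rules in the commit above (rename, merge, no-op, warn instead of crash) can be sketched against an in-memory stand-in. Everything here is an assumption for illustration: `InMemoryMailbox` is not mailsuite's class, the method signatures for `folder_exists` / `rename_folder` / `merge_folders` are guessed from their names, and `migrate_forensic_folder` is not parsedmarc's `_migrate_forensic_archive_folder()`.

```python
import logging

logger = logging.getLogger("migration_sketch")


class InMemoryMailbox:
    """Hypothetical mailbox backend holding folders as lists of messages."""

    def __init__(self, folders):
        self.folders = {name: list(messages) for name, messages in folders.items()}

    def folder_exists(self, name):
        return name in self.folders

    def rename_folder(self, old, new):
        self.folders[new] = self.folders.pop(old)

    def merge_folders(self, source, destination):
        self.folders[destination].extend(self.folders.pop(source))


def migrate_forensic_folder(connection, archive="Archive"):
    """Consolidate <archive>/Forensic into <archive>/Failure, best-effort."""
    legacy = f"{archive}/Forensic"
    current = f"{archive}/Failure"
    try:
        if not connection.folder_exists(legacy):
            return "noop"  # nothing to migrate
        if connection.folder_exists(current):
            connection.merge_folders(legacy, current)
            return "merged"
        connection.rename_folder(legacy, current)
        return "renamed"
    except Exception as error:  # warn, don't crash
        logger.warning("Could not migrate %s: %s", legacy, error)
        return "skipped"


renamed = InMemoryMailbox({"Archive/Forensic": ["a"]})
renamed_result = migrate_forensic_folder(renamed)

merged = InMemoryMailbox({"Archive/Forensic": ["a"], "Archive/Failure": ["b"]})
merged_result = migrate_forensic_folder(merged)
```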
|
|
327fcff2b9 |
Add optional PostgreSQL storage backend (#667)
Adds a PostgreSQL output backend as a lighter-weight alternative to Elasticsearch/OpenSearch, configured via a [postgresql] section (host/port/user/password/database or a libpq connection_string). Tables are created automatically on first run; a Grafana dashboard is included.
- psycopg is an optional extra (pip install parsedmarc[postgresql]); the import is guarded so `import parsedmarc` works without it, and PostgreSQLClient raises a clear install hint when constructed without the driver. Binary wheels aren't available for every platform.
- Schema captures the RFC 9990 / DMARCbis aggregate fields: np, testing, discovery_method, generator, xml_namespace, and per-result human_result on the DKIM/SPF auth-result tables.
- forensic -> failure naming throughout (table dmarc_failure_report, save_failure_report_to_postgresql, dashboard, docs) to match #659.
- Failure-report de-duplication mirrors the Elasticsearch backend exactly: arrival date + From + To + Subject (NULL-safe via IS NOT DISTINCT FROM; semantic JSONB equality). Aggregate and SMTP-TLS use ON CONFLICT.
- PostgreSQLClient.close() for clean CLI shutdown; comment documents why the two timestamp helpers must stay distinct (report dates are local, record/SMTP-TLS dates are UTC).
- CLI: config parse raises ConfigurationError on missing host/connection_string; wired into _init_output_clients + save loops.
- Tests in tests/test_postgres.py (helpers, mocked-DB save assertions, create_tables, connect/error wrapping, dedup, real-sample round trip) and tests/test_cli.py (config parse + end-to-end save wiring incl. AlreadySaved/PostgreSQLError handling). postgres.py at 99% line coverage; only _main's output-client-init retry path is left.
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com> |
||
|
|
0a703172de | Reorder enhancements in the changelog | ||
|
|
bf37ded688 | Add support for Elastic Cloud Serverless projects (#770) | ||
|
|
535d9db1ad |
cli: support _FILE suffix on PARSEDMARC_* env vars for Docker secrets (#772)
Appending _FILE to any PARSEDMARC_{SECTION}_{KEY} env var reads the
value from the referenced file, with one trailing newline stripped.
This matches the Postgres/MariaDB/Redis container-image convention so
Docker Compose and Kubernetes secret mounts work without extra glue,
keeping credentials out of plain environment: blocks (and out of
docker inspect, container logs, and /proc/<pid>/environ).
When both the direct var and its _FILE companion are set, the file
wins. A missing or unreadable file raises ConfigurationError rather
than silently degrading to an empty credential. The four pre-existing
config keys whose own names end in _file ([general] log_file,
[msgraph] token_file, [gmail_api] credentials_file / token_file)
keep their direct-path semantics; pass their values via secret by
doubling the suffix (_FILE_FILE).
|
||
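The `_FILE` lookup rules in the commit above can be sketched as a small resolver. This is an illustration under stated assumptions, not parsedmarc's implementation: `resolve_env_value` is a hypothetical name, `ConfigurationError` is redefined locally as a stand-in, `PARSEDMARC_IMAP_PASSWORD` is just an example following the `PARSEDMARC_{SECTION}_{KEY}` pattern, and the special handling of config keys that already end in `_file` is left out.

```python
import os
import tempfile


class ConfigurationError(Exception):
    """Local stand-in for parsedmarc's ConfigurationError."""


def resolve_env_value(name, environ):
    """Return the value for `name`, preferring its `<name>_FILE` companion."""
    file_var = f"{name}_FILE"
    if file_var in environ:
        path = environ[file_var]
        try:
            with open(path, encoding="utf-8") as handle:
                value = handle.read()
        except OSError as error:
            # A missing or unreadable file is an error, not an empty credential.
            raise ConfigurationError(f"{file_var}: cannot read {path}: {error}") from error
        # Strip exactly one trailing newline; leave everything else intact.
        return value[:-1] if value.endswith("\n") else value
    return environ.get(name)


with tempfile.TemporaryDirectory() as tmp:
    secret_path = os.path.join(tmp, "imap_password")
    with open(secret_path, "w", encoding="utf-8") as handle:
        handle.write("s3cret\n")
    env = {
        "PARSEDMARC_IMAP_PASSWORD": "from-env",
        "PARSEDMARC_IMAP_PASSWORD_FILE": secret_path,
    }
    # Both are set, so the file wins.
    resolved = resolve_env_value("PARSEDMARC_IMAP_PASSWORD", env)
```

Passing the environment as a plain mapping keeps the sketch easy to test; a real caller would pass `os.environ`.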
|
|
b7b8383fa4 |
Expand honest test coverage from 59% to 83%; fix two latent bugs (#775)
* Expand honest test coverage from 59% to 83%; fix two latent bugs
271 new tests across the output modules, ES/OS clients, CLI config parsing, and the top-level parsing surface. Coverage measured against shipped code only (see [tool.coverage.run] source = ["parsedmarc"] omit = ["*/parsedmarc/resources/maps/*.py"] in pyproject.toml).
Per-module results:
  s3.py            38% → 100%  (also fixes SMTP-TLS-to-S3 bug below)
  gelf.py          40% → 100%
  syslog.py        46% → 100%
  kafkaclient.py   34% → 100%
  splunk.py        24% → 100%
  loganalytics.py  56% → 100%
  webhook.py       78% → 100%  (also removes redundant try/except)
  elastic.py       36% → 99%
  opensearch.py    40% → 99%
  cli.py           52% → 69%
  __init__.py      74% → 76%   (also fixes append_json bug below)
  utils.py         84%         (unchanged in this PR)
  TOTAL            59% → 83%
The remaining 17% is honest. The biggest unreached blocks are _main() in cli.py and the watch-mode mailbox iteration in __init__.py, both of which would require either standing up live subsystems (real Elasticsearch, real IMAP) or mocking deep enough that the test would verify the mock rather than the code. The PR-A AGENTS.md guidance ("if 90% requires faking it, ship 85% honestly") applies here.
Bugs fixed while writing tests:
1. parsedmarc/s3.py: SMTP-TLS-to-S3 was completely broken. save_report_to_s3 unconditionally read report["report_metadata"] when building S3 object metadata, but RFC 8460 §4.3 SMTP TLS reports are flat (no report_metadata sub-object). The CLI's surrounding try/except silently swallowed the KeyError, so every SMTP-TLS report quietly failed to upload. Also fixes a related issue: parse_smtp_tls_report_json stores begin_date as the raw ISO-8601 string from the report (per the SMTPTLSReport TypedDict and RFC 8460 §4.3), but the S3 code path assumed a datetime with .year / .month / .day attributes. Both fixed; the broken metadata-extraction branch now uses the flat-report fields, and the date branch normalizes via human_timestamp_to_datetime.
2. parsedmarc/__init__.py: append_json corrupted JSON output files on the second write. The original implementation opened files in "a+" mode, then seek()ed backwards to overwrite the trailing "]" with ",\n" before appending more elements. Python's docs are explicit (https://docs.python.org/3/library/functions.html#open): on POSIX, writes in "a"/"a+" mode always go to EOF regardless of seek() position. The result was that the second call produced [...]\n],\n[...] -style corrupted output instead of a single merged array. Replaced with a read-merge-write pattern: load the existing array (if any), append the new elements, rewrite the whole file. The CSV cousin append_csv was not affected: it doesn't seek backwards.
3. parsedmarc/webhook.py: removed redundant try/except blocks in save_aggregate_report_to_webhook / save_failure_report_to_webhook / save_smtp_tls_report_to_webhook. _send_to_webhook already catches every Exception itself, so the outer except blocks were unreachable dead code (covered nothing, defended against nothing, and inflated the source-line count without testing value).
Testing approach: mocks at SDK boundaries (boto3 resource, kafka producer, requests session, opensearch/elasticsearch Document/Search, azure LogsIngestionClient). Tests verify the parsedmarc-side transformation logic (document/event construction, index/topic naming, dedup queries, error wrapping) rather than asserting on mock invocations as a proxy for behaviour. Where a branch is defensive against a caller that doesn't exist in the codebase, the test is omitted (commented in code rather than hidden behind a pragma).
547 tests total (was 276), all passing. ruff check + format clean.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * Document the two bug fixes from this PR in the 10.0.0 changelog Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * Document testing standards in AGENTS.md Adds a "Testing standards" section covering the principles applied in PR-A (split) and PR-B (coverage expansion): - Coverage measures shipped code only — don't reintroduce tests/* to the scope, don't expand omit, don't use # pragma: no cover. - Honest tests assert on observable behaviour, not "the mock was called". Mock at SDK boundaries; parse the payload that gets sent. - "If 90% requires faking it, ship 85% honestly" — coverage is a tool, not a goal. PR-B's deliberate stops at cli.py 69% and __init__.py 76% are the documented precedent for when to halt. - Verify bug claims against the relevant RFC, internal types, installed SDK source, or upstream docs before changing code. Cite the source in the commit message and test docstring (RFC 8460 §4.3 and the Python open() docs for #775's two bug fixes are the pattern to follow). - Bugs found while writing tests are fixed in the same PR; the test doubles as the regression guard. - File layout (tests/test_<module>.py) is non-negotiable; module-level test loggers need fresh-handler setup so test ordering doesn't break assertLogs. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * Cover the corrupt-file fallback in append_json Codecov flagged 2 missing patch-coverage lines on PR #775: the except (json.JSONDecodeError, OSError) branch in append_json, which falls back to overwriting when the existing file isn't a parseable JSON array. Two new tests in tests/test_init.py:TestAppendJson exercise both paths: - test_corrupt_existing_file_is_overwritten_cleanly: existing file contains invalid JSON; append_json overwrites with the new array. 
- test_existing_file_with_non_list_root_is_overwritten: existing file parses as {"foo": ...} (dict, not list); the isinstance guard rejects it and we overwrite cleanly. Patch coverage now 100% on the bug fix. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> --------- Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com> |
||
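The read-merge-write pattern that replaced the seek-based `append_json` in the commit above can be sketched as follows. This is a minimal version built from the commit's description (load the existing array, fall back to overwriting on `json.JSONDecodeError`, `OSError`, or a non-list root); the name `append_json_sketch` and its signature are assumptions, not the function in parsedmarc/__init__.py.

```python
import json
import os
import tempfile


def append_json_sketch(path, new_items):
    """Append items to a JSON array file by rewriting the whole file."""
    try:
        with open(path, encoding="utf-8") as handle:
            existing = json.load(handle)
        if not isinstance(existing, list):
            # A non-list root cannot be extended; start over.
            existing = []
    except (json.JSONDecodeError, OSError):
        # Missing, unreadable, or corrupt file: fall back to a fresh array.
        existing = []
    existing.extend(new_items)
    # "w" mode truncates, so the output is always one well-formed array.
    with open(path, "w", encoding="utf-8") as handle:
        json.dump(existing, handle, indent=2)


with tempfile.TemporaryDirectory() as tmp:
    out = os.path.join(tmp, "aggregate.json")
    append_json_sketch(out, [{"id": 1}])
    append_json_sketch(out, [{"id": 2}])
    with open(out, encoding="utf-8") as handle:
        merged = json.load(handle)
```

Rewriting the whole file costs more I/O than appending, but it does not depend on where an append-mode write lands after `seek()`.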
|
|
5b08627eaa |
Split tests.py into per-module tests/test_<module>.py (#774)
* Split tests.py into per-module tests/test_<module>.py
The 5174-line tests.py monolith is split into per-module files under tests/, mirroring the checkdmarc layout:
  tests/test_init.py          parsedmarc/__init__.py parsing surface
  tests/test_cli.py           parsedmarc/cli.py + config / env-vars / SIGHUP
  tests/test_utils.py         parsedmarc/utils.py (DNS, IP info, PSL, etc.)
  tests/test_webhook.py       parsedmarc/webhook.py
  tests/test_kafkaclient.py   parsedmarc/kafkaclient.py
  tests/test_splunk.py        parsedmarc/splunk.py
  tests/test_syslog.py        parsedmarc/syslog.py
  tests/test_loganalytics.py  parsedmarc/loganalytics.py
  tests/test_gelf.py          parsedmarc/gelf.py
  tests/test_s3.py            parsedmarc/s3.py
  tests/test_maps.py          parsedmarc/resources/maps/ maintainer scripts
The split is purely a redistribution: no test bodies changed, no tests added or removed. All 276 existing tests pass under the new layout.
The current tests.py contains two kitchen-sink classes (`Test` at line 54 and `TestEnvVarConfig` at line 2360) holding tests that span many modules. Their methods are routed to the correct per-module file by name prefix; the wholly-thematic classes (TestExtractReport, TestUtilsXxx, TestSighupReload, etc.) move whole. Each target file gets its own `class Test(unittest.TestCase)` for the redistributed kitchen-sink methods, plus the thematic classes verbatim.
Wiring updates:
- `.github/workflows/python-tests.yml`: `pytest ... tests.py` → `python -m pytest ... tests/` (also switches to `python -m pytest` per the checkdmarc convention so cwd lands on the project root).
- `pyproject.toml`: adds `[tool.pytest.ini_options] testpaths = ["tests"]` and `[tool.coverage.run] source = ["parsedmarc"]` with an `omit` for `parsedmarc/resources/maps/*.py`. The maps scripts are maintainer-only batch tooling that ships out of the wheel; excluding them from coverage makes the headline number reflect only installed library code. Runtime coverage on the new layout is 59% (was 45% with maps counted), and PR-B will push it to 90%+.
- `AGENTS.md`: documents the new layout and how to run individual files / tests; tells future contributors not to reintroduce a monolithic tests.py. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * Restore 66.9% coverage baseline (count tests/ + parsedmarc) Master's headline 66.9% number on Codecov includes the tests.py file itself (99.35% covered) being measured alongside parsedmarc/*. The original tests.py had no `[tool.coverage.run]` block, so coverage's default — "measure every file imported during the run" — counted the test code as if it were product code. The split commit added `source = ["parsedmarc"]` which suppressed measurement of the test files (correct in principle, since test files aren't shipped code), and that alone made the headline number drop by ~8 percentage points without any actual loss of testing. This commit swaps `source` for an explicit `include = ["parsedmarc/*", "tests/*"]` so both halves are measured the way they were on master. Verified: 276 tests, 66.96% line coverage (effectively unchanged from master's 66.90%). If you want the shipped-code-only number (was the headline that this commit overrides), run `pytest --cov=parsedmarc tests/`. That number is currently 59% and is the focus of the upcoming coverage-expansion PR. Also adds junit.xml to .gitignore so the CI artefact doesn't get accidentally committed. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * Restrict coverage to shipped code (`source = ["parsedmarc"]`) Reverts the prior commit's `include = ["tests/*"]`. Counting the test files toward coverage was wrong — it conflates "shipped code exercised by tests" with "test code that pytest auto-runs", inflates the headline number, and rewards writing more tests rather than tests that verify more code. 
Master's apparent 66.9% was an artefact of the old monolithic tests.py having no [tool.coverage.run] block at all; coverage's default behaviour measured every imported file, including the test file itself at ~99% "covered", which added ~8 percentage points to the displayed number without any real testing signal. Restricting to `source = ["parsedmarc"]` plus the existing maps omit gives a meaningful baseline: 59% of shipped code is exercised by the test suite today. That's the number the next PR is targeting to lift to 90%+ before the 10.0.0 release; the Codecov "drop" here is a measurement correction, not a regression. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> --------- Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com> |
||
|
|
ae1e5adb66 |
Add RFC 9989/9990/9991 (final DMARC) report support; rename forensic→failure project-wide (#659)
* Add DMARCbis report support; rename forensic→failure project-wide
Rebased on top of master @
|
||
|
|
8c5f63620c |
Fix Validate-dashboards CI: heredoc was redirecting itself to stdin (#773)
`echo "$response" | python3 - <<'PY' ... PY` redirected the heredoc to python3's stdin (where it was correctly read as the script body), but sys.stdin was then at EOF when the script called json.load(sys.stdin), so the assertion blew up with 'Expecting value: line 1 column 1' even when Kibana's import had succeeded. Pass the response via env var instead.
The OSD ndjson import itself was working all along (successCount: 26, success: true); only the assertion step was broken, so master has been showing a red Validate-dashboards run since the workflow was introduced. |
||
|
|
2d3e896f6d | Fix pytest command line argument typo | ||
|
|
c5b2fcec54 |
Enhance CI with JUnit XML output and Codecov results
Added JUnit XML output for pytest and Codecov test results upload. |
||
|
|
a6ea169df5 |
chore: update IPinfo Lite MMDB (#771)
Co-authored-by: seanthegeek <44679+seanthegeek@users.noreply.github.com> |
||
|
|
1fc1134f77 |
chore: update IPinfo Lite MMDB (#769)
Co-authored-by: seanthegeek <44679+seanthegeek@users.noreply.github.com> |
||
|
|
ff6f75d740 |
Map-data build hygiene: README single source of truth, drop maintainer scripts from wheel (9.11.2) (#768)
* Drop base_reverse_dns_types.txt; sortlists.py now reads types from README.md

The .txt file duplicated the README's industry list and introduced drift risk: twice in the project's history we had to add types to the .txt only because the README had been updated independently. Make the README the single source of truth.

- Add `<!-- types-list:start -->` / `<!-- types-list:end -->` HTML comment markers around the bullet list in parsedmarc/resources/maps/README.md. Markers don't render in GitHub's preview.
- New `load_types_from_readme()` in sortlists.py parses the bullet items between the markers and returns them. Errors clearly if the README is missing or the markers are absent.
- Delete base_reverse_dns_types.txt.
- Fix a pre-existing typo in README precedence rule 4: `Web Hosting` → `Web Host` (matches the canonical type used in 4,176 map rows).

Smoke test: feeding a row with a bogus type still triggers the validator (`'NotARealType' is not an allowed value for 'type'`), confirming the README-derived list flows through identically.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* sortlists.py: normalize README types-list block in place

Before validating the map, the validator now sorts the `<!-- types-list:start -->` / `<!-- types-list:end -->` block in README.md alphabetically (case-insensitively), trims leading and trailing whitespace from each item, and deduplicates case-insensitively, rewriting the README in place if any of those need fixing. Errors clearly when two entries differ only by casing (which would otherwise silently lose one).

Adding a new category is now just inserting a `- New Type` line anywhere inside the markers; `sortlists.py` will tidy it on the next run. Same shape as how the validator already normalizes known_unknown_base_reverse_dns.txt and psl_overrides.txt.

The pure read path is preserved as `load_types_from_readme()` for callers that don't want a side-effecting rewrite (tests, downstream tooling).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* Stop shipping maintainer scripts; bump to 9.11.2

The exclude list in [tool.hatch.build] was originally meant to keep maintainer-only batch tooling under parsedmarc/resources/maps/ out of the wheel and sdist (it lists `find_bad_utf8.py`, `find_unknown_base_reverse_dns.py`, the renamed-and-removed `sortmaps.py`). The list never grew when new tools were added, so `collect_domain_info.py`, `classify_unknown_domains.py`, `detect_psl_overrides.py`, `detect_rebrands.py`, and `sortlists.py` all started shipping in distributions despite contributing nothing to runtime functionality.

Replace the per-file basename list with a single glob pattern:

    parsedmarc/resources/maps/[!_]*.py

The leading-`_` exception keeps `__init__.py` shipping (required so that `importlib.resources.files(parsedmarc.resources.maps)` can locate the bundled CSV/TXT data files), while excluding any other .py file under that directory, including future maintainer scripts that haven't been written yet.

Drop the now-redundant per-file entries from the exclude list: `find_bad_utf8.py`, `find_unknown_base_reverse_dns.py`, and the already-removed `sortmaps.py`. The non-.py exclusions stay (`base_reverse_dns.csv`, `unknown_base_reverse_dns.csv`, `README.md`, `*.bak`).

Verified with `hatch build`:
- Wheel under parsedmarc/resources/maps/: __init__.py + 3 data files (CSV/TXTs), no maintainer .py
- sdist matches
- Clean-venv install of the built wheel loads 298 PSL overrides and `get_base_domain('host01.netlify.app')` returns `netlify.app`

Bump to 9.11.2 since this changes shipped artifacts.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
---------
Co-authored-by: Sean Whalen <seanthegeek@users.noreply.github.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
9.11.2 |
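The README-as-source-of-truth idea in this commit can be illustrated roughly as follows. This is a sketch under assumptions, not the code in `sortlists.py`: the function names `parse_types_block` and `normalize_types` are hypothetical, and only the marker strings and the described behaviour (parse bullets between markers, sort case-insensitively, trim, dedupe, error on casing-only differences) come from the commit message.

```python
import re

START = "<!-- types-list:start -->"
END = "<!-- types-list:end -->"


def parse_types_block(readme_text):
    """Return the bullet items found between the two HTML comment markers."""
    start = readme_text.find(START)
    end = readme_text.find(END)
    if start == -1 or end == -1 or end < start:
        raise ValueError("types-list markers not found in README")
    block = readme_text[start + len(START):end]
    items = []
    for line in block.splitlines():
        # Accept "- Item" or "* Item"; surrounding whitespace is trimmed.
        match = re.match(r"\s*[-*]\s+(.*\S)\s*$", line)
        if match:
            items.append(match.group(1))
    return items


def normalize_types(items):
    """Dedupe exact repeats, reject casing-only variants, sort case-insensitively."""
    seen = {}
    for item in items:
        key = item.casefold()
        if key in seen and seen[key] != item:
            raise ValueError(f"entries differ only by casing: {seen[key]!r} / {item!r}")
        seen[key] = item
    return sorted(seen.values(), key=str.casefold)
```

Keeping the parse step free of side effects mirrors the split the commit describes between a pure read path and an in-place rewrite.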
||
|
|
053195581b |
collect_domain_info.py: opt-in DuckDuckGo search fallback for bot-blocked rows (#767)
* collect_domain_info.py: opt-in DuckDuckGo search fallback for bot-blocked rows A meaningful share of KU domains return a Cloudflare / DDoS-Guard / "Are you a robot?" / px-captcha interstitial instead of real homepage content — even after the curl-style relaxed-TLS fallback runs. For those rows we have neither homepage signal nor (often) a usable as_name, and they fall through to KU even though the operator is a real (often well-known) business that the classifier could trivially handle if it could just see the page. Added an opt-in `--use-search-fallback` flag that asks DuckDuckGo for `site:<domain>` when the homepage fetch returned a bot-block / parking / empty result, and uses the top result's title and description (only if the result host belongs to the input domain — anti-SEO-spam guard). Mechanism - New optional `ddgs` dependency, listed under the `[build]` extras. `from ddgs import DDGS` is wrapped in a try/except — the script runs without ddgs installed as long as `--use-search-fallback` isn't passed; the flag check exits with a helpful install message otherwise. - `_SEARCH_FALLBACK_TRIGGER_RE` — title/description patterns that look like a bot-block / WAF interstitial / parked / placeholder. Triggers the fallback. Same shape as the classifier's TITLE_NOISE_RE / PARKED_PAGE_RE; the search fallback is the recovery path for exactly the rows that filter excludes. - `_looks_bot_blocked()` — combined check: trigger regex matches OR title and description are both empty (typical of WAF interstitials that strip <title>/<meta> entirely). - `_hosts_match()` — same-domain SEO-spam guard. A search result is accepted only when its host is exactly the input domain or a subdomain of it. Third-party SEO-spam pages that scraped the domain name are silently skipped. - `_search_fallback_fetch()` — runs `site:<domain>` through DDG, walks results in rank order, returns the first one whose host passes the guard. 
Returns empty if no result matches (caller leaves the row's homepage data alone in that case). - `_collect_one()` now takes a `use_search_fallback` flag, calls the fallback after the homepage fetch when the homepage looks bot-blocked, and writes `title_source = "homepage"` or `"search"` so reviewers can audit which rows came from where. - New `title_source` column in the TSV. Smoke test Test set: bbc.com (real homepage, no fallback expected) plus 5 known Cloudflare-walled rows (1800contacts.com, americaneagle.com, broadwaytechnology.com, health.gov.il, mfa.gov.il). Result: bbc.com classified via homepage; the other 5 all recovered title + description via search and got `title_source=search`. The same-domain guard validated independently — for broadwaytechnology.com the guard correctly rejects bloomberg.com and accepts support.broadwaytechnology.com (broadway was acquired by Bloomberg, but the search fallback returns the broadway-domain snippet, not the parent's bloomberg.com product page). Caveats codified in AGENTS.md - Search snippets are still untrusted text (data-not-instructions rule applies the same way it does to homepage HTML). - DDG's index can lag a homepage rebrand by months — when a row classified via `title_source=search` disagrees with a fresh manual fetch, prefer the manual verification. The fallback is a recovery aid, not a tiebreaker against fresh content. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * collect/classify: link-following + alias map rows for placeholder DDG titles When the search fallback ran on the original 6-domain smoke set, two of the recovered titles were essentially placeholder pointers carrying no classifier signal — DDG returned `Link to fcs.health.gov.il` for one input and a bare `yangon.mfa.gov.il` for another. 
Those snippets are DDG's way of saying "I have an indexed subdomain but no real abstract to give you", and feeding them to the regex classifier produces no better signal than the parking-page result we were already trying to recover from. This commit teaches the collector to recognize both placeholder shapes, follow the pointer to the target hostname, and use *that* hostname's real content for the row. The classifier then emits the original input and the link target as **two map rows under the same (name, type)** so both keys are looked up against future DMARC reports. collect_domain_info.py - New `_LINK_TO_TITLE_RE` / `_BARE_HOSTNAME_RE` and an `_extract_link_target` helper that returns the target hostname when the search title is `Link to <hostname>` or a bare hostname, "" when the title carries real content. - After the search-fallback path, if the title looks like a pointer and the target differs from the input, `_fetch_homepage(target)` is called once. When the target's fetch returns real (non-bot-blocked) content, the row's title / description / final_url / rebrand_signal / external_links are replaced with the target's, and `title_source` becomes `search→<target>` so reviewers can audit the path. - New `link_target_domain` column records the followed target whether or not its fetch succeeded. classify_unknown_domains.py - When a row's `link_target_domain` is set and differs from the input domain, the classifier emits a second map row for the target with the same `(name, type)`. The original input is the "og" domain; the target is what DDG pointed us at — both end up in the map as aliases. Same handling applies on the ambiguous-bucket path so a single human adjudication covers both. 
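The same-domain guard and the pointer-title detection described above could look something like the sketch below. The function names and regular expressions are illustrative assumptions; the real `_hosts_match`, `_LINK_TO_TITLE_RE`, `_BARE_HOSTNAME_RE`, and `_extract_link_target` in `collect_domain_info.py` may be written differently.

```python
import re

# Hypothetical hostname shape: one or more dot-separated labels plus a TLD.
_HOST = r"(?:[a-z0-9-]+\.)+[a-z]{2,}"
_LINK_TO_RE = re.compile(r"^\s*link to\s+(" + _HOST + r")\s*$", re.IGNORECASE)
_BARE_HOST_RE = re.compile(r"^\s*(" + _HOST + r")\s*$", re.IGNORECASE)


def hosts_match(result_host, input_domain):
    """Accept a search result only for the input domain or one of its subdomains."""
    host = result_host.lower().rstrip(".")
    domain = input_domain.lower().rstrip(".")
    # The leading dot stops "notexample.com" from passing as "example.com".
    return host == domain or host.endswith("." + domain)


def extract_link_target(title):
    """Return the hostname a placeholder title points at, or "" for real content."""
    for pattern in (_LINK_TO_RE, _BARE_HOST_RE):
        match = pattern.match(title)
        if match:
            return match.group(1).lower()
    return ""
```

Under these assumptions, `hosts_match("support.broadwaytechnology.com", "broadwaytechnology.com")` is accepted while `bloomberg.com` is rejected, which is the behaviour the smoke test above reports.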
Smoke test on the original 6-domain set: bbc.com homepage → BBC Home – Breaking News, … 1800contacts.com search → 1800contacts health.gov.il search → Homepage – COVID Information Center of the Israel Ministry of Health americaneagle.com search → Americaneagle.com | Web Design … broadwaytechnology.com search → Bloomberg Completes Acquisition of … mfa.gov.il search→yangon.mfa.gov.il → Home | Ministry of Foreign Affairs link_target_domain=yangon.mfa.gov.il The mfa.gov.il row triggered the new path: DDG returned `yangon.mfa.gov.il` as the title, the collector followed it, the target's homepage gave us "Home | Ministry of Foreign Affairs", and the classifier emitted both `mfa.gov.il, Ministry of foreign affairs, Government` and `yangon.mfa.gov.il, Ministry of foreign affairs, Government`. AGENTS.md updated with the link-following / alias rules under the search-fallback subsection. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * Run --use-search-fallback against 10,544 bot-blocked KU rows; +473 promotions Also expands the search-fallback trigger regex to recognize self-signed TLS interception (firewall block via cert) and a wider class of local-firewall block-page strings. Mechanics 1. Identified 10,544 KU rows from the 34,647-row prior TSV that looked bot-blocked (via the new `_looks_bot_blocked` detector). 2. Ran `collect_domain_info.py --use-search-fallback` against just those rows. Throughput was ~3.4 rows/sec at 32 workers / 3s HTTP timeout / 5s WHOIS timeout. ~50 min wall time. 3. Audited the resulting TSV and discovered 2,078 rows whose homepage fetch had silently returned a corporate firewall's block page (Fortinet "Web Filter Violation" being the most common, 1,419 of them). The original `_SEARCH_FALLBACK_TRIGGER_RE` didn't recognize those strings, so search-fallback wasn't firing — the firewall's block-page text was being fed to the classifier as if it were the operator's homepage. 
Almost no false promotions resulted (block-page text doesn't match industry detectors), but the rows weren't recovering either. 4. Expanded the trigger regex to catch web-filter block pages, then re-fetched just the 2,078 affected rows. 5. Final classifier pass: 474 unambiguous map adds, 41 ambiguous, 1 silently dropped (adult content), 10,066 still in KU. Self-signed-cert detection A separate fix lands in this commit: when the primary fetch fails with an SSL cert verification error matching "self-signed certificate", the collector skips the verify=False browser fallback. Rationale: TLS- intercepting firewalls (corporate or personal-network) present their own self-signed cert specifically when blocking. The verify=False fallback would happily retrieve the firewall's block page, which then poisons the row's title/description. Skipping that path leaves the row's metadata empty so search-fallback can recover real content. Other cert errors (hostname mismatch, weak DH, legacy renegotiation) keep the existing fallback path because they're typically real operators with misconfigured TLS rather than firewall interception. Numbers Map: 37,640 → 38,114 (+474) KU: 32,324 → 31,886 (−438) Disjoint check: 0 shared keys Unknown CSV: regenerated, just the header Type distribution of the 474 promotions 162 ISP 17 MSP 4 MSSP / Marketing 72 Web Host 16 Technology 4 Beauty / Agriculture 41 Finance 14 Healthcare 3 IaaS / Science / Legal 19 Government 11 Travel 2 Search / Religion / SaaS 10 Logistics 8 Manufacturing 2 Email Sec / Email Provider 9 Education / Retail 8 News 2 Entertainment 7 Utilities / Phys Sec 6 Real Estate 1 Auto / Staff / PaaS 6 Food / Consulting / Industrial / Conglomerate / Nonprofit Most of the gains are network operators (162 ISPs, 72 Web Hosts) — the population that's most likely to be Cloudflare-walled or DDoS- Guard-walled at the homepage layer but show up clearly in DDG abstracts. 
Smoke audit on a 30-row random sample of map adds: 28 plausible, 2 borderline (`es.graphicpkg.com → Food` could also be Industrial since Graphic Packaging makes packaging *for* the food industry, but the vertically-specialized rule applies; `annuairesante.ameli.fr` → Finance via French health-insurance vocabulary, defensible). The 41 ambiguous rows stay in KU per the established workflow — they need the same one-row-at-a-time human triage as PR #766 used. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * Search-fallback batch (partial; outage-truncated): +226 promotions Hotspot-bypass collector run was interrupted ~6,300/10,107 in when the hotspot lost connectivity and the machine reverted to the firewalled connection. Stopping here to commit what was unambiguously classifiable; the remaining ~3,800 candidates (plus any rows whose homepage fetch was tainted by the firewall fallback during the transition) will be re-collected in a fresh run after network stability is restored. Promotions in this batch: - 219 auto-classified by the regex classifier on the partial TSV - 17 ambiguous rows resolved per LLM auto-resolution rules + user manual review - 5 KU rows the user adjudicated explicitly (Bielsko-Biała, Douala-IX, Ekol Logistics, ICB, Marcus Corporation) - 13 from earlier triage worklist with brands assigned - Net 226 net-new map entries after dedupe, alias-leak filtering (3 link-target subdomains dropped where the parent base was already in the adds), full-IP privacy filtering (2 dropped), and ~30 targeted brand/category cleanups for rows where the search-fallback snippet had picked up a wrong page or the title contained registrant cruft / corporate-suffix leaks. AGENTS.md updates: - Codifies the "LLM auto-resolution of high-confidence ambiguous rows" workflow with R1-R5 high-confidence rules, low-confidence surface-to-human criteria, and the one-line auto-decision output format for reviewer overrule. 
- Adds 7 triage lessons learned during this batch's bot-blocked-KU review (Polish/IT/ES/GR/RO city domains, "Sports Club" venues, vertically-specialized investment firms, sub-page fetch FPs, Telecom-suffix brand pinning, Hospital/Health-System suffix, IXP -ix brand pinning). Map and KU files are disjoint after this commit. unknown_base_reverse_dns.csv is empty (header-only) since every base_reverse_dns input is now either mapped or in KU. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * Search-fallback hotspot batch: +213 promotions Fresh hotspot run on the 9,881 still-bot-blocked KU candidates left after the prior outage-truncated batch. Classifier: 202 auto + 31 ambiguous (14 LLM auto-resolved per the R1-R5 high-confidence rules, 17 surfaced for interactive review) + 9,665 still KU + 1 dropped. Net 213 net-new map entries after dedupe, alias-leak filtering (13 link-target subdomains dropped where the parent base was already in the map or in this batch's adds), 1 full-IP privacy filter, 2 user-DROPs (1 alias of an as-numbered domain, 1 KU because the only signal was a cross-vertical client list), and ~8 targeted brand cleanups for rows where the search snippet had left a registrant-leak or domain-as-name placeholder. 
LLM auto-resolutions (R1-R5): africell.ao ISP wi-tribe.pk ISP ags.school.nz Education vwfs.com.au Finance allaria.com.ar Finance wanxp.com ISP asturias.org Government varendraisp.com ISP bdo.com.ph Finance titansi.com.my IaaS bikada.kz ISP redeyenetworks.com MSSP informatiq.org ISP plusinfo.ru ISP User-decided rows: admincomp.com Consulting korisp.com Web Host anrb.ru Science linkexplorer.net.br ISP arpc.ir Industrial novatech.bg MSP as63031.net Consulting reliable-nets.com ISP aviti.net Web Host satortech.com MSP binaryelements.com.au MSP skyworld.co.ke Finance juni.net.br ISP telegroup-ltd.com Technology west-webworld.fr Technology User KU/drops: itatec.com.py KU (cross-vertical client list, no operator signal) ns2.as63031.net DROP (alias of as63031.net) AGENTS.md addition: codifies the "Web Host vs Email Provider — bundled email-hosting is still Web Host" rule. Same shape as the existing CCaaS/CPaaS-vs-ISP and MSP-vs-MSSP rules: classify by the operator's primary product, not by every feature in their bundle. Prompted by the korisp.com triage during this batch. Map and KU files are disjoint after this commit. unknown_base_reverse_dns.csv remains header-only (every base_reverse_dns input is now mapped or in KU). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> --------- Co-authored-by: Sean Whalen <seanthegeek@users.noreply.github.com> Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com> |
||
|
|
b31a9e022f |
Reclassify KU pool: 2,248 promotions + new ambiguous-output worklist (#766)
* Reclassify KU pool: 2,248 promotions; surface 78 ambiguous rows for review
Re-fetched homepage / WHOIS / DNS for all 34,647 domains in
known_unknown_base_reverse_dns.txt via collect_domain_info.py and re-ran the
classifier. The classifier itself was extended in several directions while
auditing the unclassified pool — the changes are listed below.
Numbers
- 2,248 KU rows promoted to base_reverse_dns_map.csv (unambiguous matches).
- 78 rows surfaced as ambiguous (two or more distinct detector categories
fired) — these are NOT auto-promoted; they need human adjudication.
- 32,399 rows remain in KU (genuinely no signal — most have privacy-only
WHOIS, parked / blocked / Cloudflare-walled homepages, or empty MMDB
enrichment).
- Disjoint invariant verified: comm -12 of map keys and KU prints nothing.
- Unknown-list regenerated via find_unknown_base_reverse_dns.py.
Classifier changes (classify_unknown_domains.py)
1. Three output buckets via new --ambiguous-out flag. Per-row outcome is now
one of: map (auto-promote), ambiguous (worklist for human review), or
ku (no signal). When ≥2 distinct detector categories fire on a row, the
classifier picks a primary in precedence order but does NOT auto-promote
— instead it writes the row to the ambiguous TSV with the alternatives
listed. Rationale: the operator-typology question ("is this a SaaS
company or an Energy company?") is a judgment call the classifier
shouldn't make on its own.
2. Plural-matching fix: outer `\b` boundary changed to `s?\b` across all 46
detectors so `dedicated server` matches `dedicated servers`,
`law firm` matches `law firms`, etc. This was silently dropping the
majority of English-text matches.
3. TLD-only signal classification: bare-TLD rows (gov.kh / ac.id / mil.bd /
.jus.br etc.) now classify even when title/desc/as_name are all empty.
Previously short-circuited at "need some signal".
4. TLD lists massively expanded:
- Education: ~85 TLDs (every gov-restricted edu / ac suffix worldwide)
- Government: ~110 TLDs incl. judicial branch (.jus.br) and legislative
(.leg.br); covers Eastern Europe, MENA, SE Asia, Africa, Caribbean,
Pacific
- Military: ~45 .mil.* suffixes
- Plus US K-12 regex (.k12.<state>.us)
5. New concrete-vocabulary patterns added based on KU-pool audit:
- cybersecurity / cyber security for business → MSSP
- autonomous system / asn owner / network operator / peering exchange
/ IXP → ISP
- ICANN registrar / domain registrar / domain name platform / CDN /
WAF / anti-DDoS → Web Host
- BPM platform / CXM / CCaaS / CPaaS / contact center platform /
compliance software → SaaS
- katılım bankası / pensioen en verzekeringen / empréstimo consignado
/ credit (scores|reports|cards|comparison|bureau) /
stock and commodity market → Finance
- aeroportos de / passagem de ônibus / bilişim şirketi / havacılık →
Travel & Tech variants
- acciaio inossidabile / laminati piani → Industrial
- Russian football-club declension forms (футбольного клуба, etc.)
- tv channel / movie streaming / video streaming platform →
Entertainment
- genetic sequencing / next-generation sequencing /
clinical diagnostic → Healthcare
- punto vendita → Italian Retail
- electrolyser / electrolyzer / green hydrogen → Energy
6. Mojibake table extended for Western European compounds: ã/â/ê/î/ô
(Portuguese ã, French/PT â/ê/ô) plus uppercase variants.
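The plural-matching fix in item 2 is easy to reproduce in isolation. The sketch below uses a hypothetical `build_detector` helper and two phrases quoted in the commit message; the real detectors are far larger and may be assembled differently.

```python
import re


def build_detector(phrases, allow_plural=True):
    """Compile an alternation of phrases with an outer word boundary.

    With allow_plural=True the closing boundary is `s?\\b`, so a trailing
    plural "s" is tolerated; with False it is the plain `\\b` used before.
    """
    body = "|".join(re.escape(phrase) for phrase in phrases)
    tail = r"s?\b" if allow_plural else r"\b"
    return re.compile(r"\b(?:" + body + r")" + tail, re.IGNORECASE)


old_style = build_detector(["dedicated server", "law firm"], allow_plural=False)
new_style = build_detector(["dedicated server", "law firm"])
```

With the plain `\b`, "dedicated servers" fails because there is no boundary between the final "r" and the "s" that follows it; the `s?` consumes that letter before the boundary is checked.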
Bug fixes from cross-language collisions
The audit pass exposed three short tokens that meant one thing in the
language they were added for and something completely different in another
language the classifier also targets:
- `por` (added as Luxembourgish for "parish" → Religion). Also the Spanish
and Portuguese preposition "for / by", which appears on roughly every
Spanish-language page. Was producing ~34 Religion false positives on
Mexican ISPs, Brazilian utilities, etc.
- `pura` (added as Indonesian/Sundanese/Balinese for "Hindu temple" →
Religion). Also the feminine of "pure" in Portuguese / Spanish / Italian,
and a frequent brand-name fragment ("Pura Energia", "Angkasa Pura").
Was misclassifying Brazilian electric utilities and Indonesian aviation
services.
- bare `broker` (added as Luxembourgish for Finance). Matched any English
text containing "broker" / "brokers" — including Cushman & Wakefield's
"real estate brokers" line, which forced the row into Finance instead
of Real Estate.
All three removed; AGENTS.md now codifies the rule.
AGENTS.md additions
- "Three output buckets" subsection: documents map / ambiguous / ku output
and how PRs should call out ambiguous review counts.
- "No taglines / slogans" rule: marketing copy ("we make it easy",
"smarter decisions") doesn't belong in any detector.
- "No ambiguous signals" rule: cross-category bare words (gazette / academy
/ society / club / studio) are forbidden as classifier keywords; use the
pinning compound instead. Same rule applies in every language.
- "Cross-language grammar / lexical overlap" rule: short tokens that mean X
in language A often mean a function word / adjective / brand fragment in
language B. Cites the por / pura / broker incidents.
- "Classify by what the operator literally provides" rule: clusters by
acronym suffix (UCaaS / CCaaS / CPaaS) tempt mis-grouping; CCaaS is SaaS
not ISP, etc. Includes the root-cause analysis of the
contact-center-as-ISP mistake.
- "Genuinely-ambiguous-between-two-types" rule: phrases like
"energy management software" that fit equally on a SaaS startup, an
Industrial conglomerate, and a consultancy belong in NO detector — leave
the row unmapped and rely on more-specific compounds.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* Triage 78 ambiguous rows + new classifier filters and rules
Interactive triage of the 78 rows the v1 classifier surfaced as ambiguous
(two or more distinct categories fired). Net result of this commit, on
top of the v1 promotions already in the branch:
- 74 ambiguous rows promoted to map with a human-adjudicated category
(and 10 of those with a corrected human-cleaned brand vs. the noisy
as_name / title-bleed the v1 classifier captured).
- 1 row dropped silently per the AGENTS.md adult-content rule.
- 3 rows kept in KU (personal projects and parked pages that the
classifier caught mid-triage, which we surfaced and then confirmed).
Map: 37,566 → 37,640 (+74). KU: 32,399 → 32,324 (−75). Disjoint clean.
Three new classifier filters added during triage as recurring patterns
surfaced — these run before category detectors and short-circuit to KU
or DROP rather than letting the operator-typology detectors fire on
parking-page / personal-page / adult-page text:
1. PARKED_PAGE_RE — Media Temple "automatically generated default server
page", Hostinger Horizons, Apache default, parked-by-registrar pages,
"site has shut down", "has completed its journey". Cloudflare /
DDoS-Guard / "Are you a robot?" interstitials are explicitly NOT
filtered (they leave the TLD-signal path open for gov / edu / mil
sites that are bot-blocked).
2. PERSONAL_PROJECT_RE — "personal BGP project", "personal website and
CV", "homelab", "hobby project", "side project". Hobbyists running
their own ASN aren't commercial operators.
3. ADULT_CONTENT_RE — adult web design / adult-entertainment hosting /
xxx / escort directory etc. Returns a sentinel ("DROP", None) so the
caller drops the domain from both map and KU per the AGENTS.md
content rule.
The classifier API now also writes a fourth output file (--dropped-out)
listing domains the adult-content filter caught, so the caller can
remove them from any tracked list files they currently sit in.
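A rough sketch of how pre-filters can short-circuit before the category detectors run. Only the filter names, the `("DROP", None)` sentinel, and the map / ambiguous / ku outcomes come from the commit message; the patterns are tiny samples, and the `"KU"`, `"MAP"`, and `"AMBIGUOUS"` return shapes and the check order are assumptions made for illustration.

```python
import re

PARKED_PAGE_RE = re.compile(
    r"automatically generated default server page|site has shut down", re.IGNORECASE
)
PERSONAL_PROJECT_RE = re.compile(
    r"personal bgp project|homelab|hobby project|side project", re.IGNORECASE
)
ADULT_CONTENT_RE = re.compile(
    r"adult-entertainment hosting|escort directory", re.IGNORECASE
)


def prefilter(text):
    """Return a short-circuit verdict, or None when no filter fires."""
    if PARKED_PAGE_RE.search(text) or PERSONAL_PROJECT_RE.search(text):
        return ("KU", None)
    if ADULT_CONTENT_RE.search(text):
        return ("DROP", None)  # caller removes the domain from map and KU
    return None


def classify(text, detectors):
    """`detectors` is a list of (category, compiled regex) in precedence order."""
    verdict = prefilter(text)
    if verdict is not None:
        return verdict
    fired = list(dict.fromkeys(cat for cat, rx in detectors if rx.search(text)))
    if not fired:
        return ("KU", None)
    if len(fired) >= 2:
        return ("AMBIGUOUS", fired)  # primary first, alternatives after it
    return ("MAP", fired[0])
```

Running the filters first means parking-page or hobby-page text never reaches the operator-typology detectors, which is the failure mode the triage exposed.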
Title-noise list extended to catch: "attention required" / "are you a
robot" / "checking your browser" / "please enable javascript" /
"ddos-guard" / "px-captcha" / "site is not available" / "page is not
available" / "access to this page has been denied". This stops these
strings from bleeding into the brand column when TLD-only classification
fires (the `health.gov.il → "Attention Required!"` shape of bug).
Several cross-language false positives caught during the triage — same
shape as the por / pura / broker incidents the previous commit fixed:
- bare French `e?mailing` matched "Mailing Solutions" (mail-server
infrastructure on a Cisco VAR's product list, not marketing). Required
to start with `e` to keep the email-marketing meaning while losing the
bare-mailing collision.
- Norwegian / Danish bare `avis` (newspaper) matched "Avis Romania" car
rental and any French text saying "avis" (notice/opinion). Replaced
with compound forms (`dagsavis`, `lokalavis`, `morgenavis`, etc.).
- Vietnamese bare `bộ` (ministry) matched "bộ phim" (movie set), "bộ
sưu tập" (collection), and the founding-text references on Vietnam
Eximbank's about page. Replaced with compound forms (`bộ trưởng`, `bộ
tài chính`, `bộ ngoại giao`, etc.).
- Russian bare `провайдер` (provider) matched "хостинг провайдер"
(hosting provider, Web Host) on a Tajikistan domain registrar. Removed
the bare form; only the internet-specific compounds remain.
- Luxembourgish bare `broker` (Finance) matched "real estate brokers"
on Cushman & Wakefield's homepage and any English page mentioning
brokers. Removed the bare form entirely.
- Turkish bare `vakıf` (foundation) matched "Vakıf Katılım Bankası" —
for-profit Islamic-finance bank whose brand uses the word. Replaced
with nonprofit-specific compounds (`yardım vakfı`, `hayır vakfı`,
`kamu yararına vakıf`).
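Two of the collisions listed above can be reproduced with a few lines. The patterns below are simplified stand-ins for the real detector fragments, written only to show why a bare token over-matches and a compound does not.

```python
import re

# Bare Norwegian/Danish "avis" versus the compound forms that replaced it.
BARE_AVIS = re.compile(r"\bavis\b", re.IGNORECASE)
COMPOUND_AVIS = re.compile(r"\b(?:dagsavis|lokalavis|morgenavis)\b", re.IGNORECASE)

# French "e?mailing" versus the stricter form that must start with "e".
OLD_MAILING = re.compile(r"\be?mailing\b", re.IGNORECASE)
NEW_MAILING = re.compile(r"\bemailing\b", re.IGNORECASE)


def fires(pattern, text):
    """True when the detector fragment matches anywhere in the text."""
    return pattern.search(text) is not None
```

The bare forms fire on "Avis Romania car rental" and "Mailing Solutions"; the compound and stricter forms stay silent on both while still matching genuine newspaper and email-marketing text.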
New positive-classification keywords added based on triage gaps:
- MSP rescue path now matches the SMB-IT-shop idiom in Polish
(`usługi IT dla biznesu`, `obsługa informatyczna firm`,
`outsourcing IT`), Spanish (`servicios informáticos para empresas`),
German (`IT-Dienstleister für`, `managed-IT-services`), French
(`infogérance`, `prestataire de services informatiques`), Italian
(`servizi informatici gestiti`, `outsourcing informatico`),
Portuguese (`serviços de TI gerenciados`, `terceirização de TI`),
Dutch (`beheerd-IT`, `IT-beheer`), and Indonesian
(`penyedia solusi IT`, `solusi IT terpadu/berbasis`).
- Finance now matches `accounting firm` / `cpa firm` /
`certified public accountants` / `chartered accountants` /
`tax preparation` / `tax advisory` / `audit firm` plus equivalents in
Spanish, Portuguese, French, German, Italian, and Polish.
- SaaS now matches CCaaS / CPaaS / `contact-center-as-a-service` /
`communications-platform-as-a-service` / `compliance software` /
`regulatory management software` and CCaaS no longer lives in ISP
(carryover from the user-flagged "contact centers are not ISPs"
correction).
AGENTS.md additions:
- "Triage heuristics learned from the 78-row interactive review of
PR #766's ambiguous bucket" subsection codifying every adjudication
rule the user applied during the review:
* pick the main-focus category (first / most-mentioned)
* clients are not operator typology
* vertically-specialized firms take the vertical
* stream-hosting infrastructure is Web Host
* multi-service SMB IT shops are MSP
* VARs are Technology
* CCaaS / CPaaS / UCaaS are SaaS
* gov/edu/mil/jus TLD signal trumps Cloudflare interstitials
* esports tournament organizers are Entertainment
* personal projects / parked pages / adult content go to KU or DROP
* brand quality is its own dimension — capture corrected brand
during triage rather than shipping the noisy as_name
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
---------
Co-authored-by: Sean Whalen <seanthegeek@users.noreply.github.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
||
|
|
06d277686d |
classify_unknown_domains.py: enforce concept-parity across ~60 languages (#765)
Multilingual detectors previously held English at full breadth (e.g. Healthcare = hospital + clinic + pharmacy + healthcare + pharmaceutical industry + nursing home + medical center) while many non-English sections covered the same concept set with only one or two transliterated words. This left every language other than English under-detecting against pages that used the operator's natural compound terms.

Reworked every detector so each language now expresses the same English concept set in idiomatic compounds, never inventing calques where no natural form exists.

Added ~32 new languages (Macedonian, Belarusian, Azerbaijani, Armenian, Georgian, Kazakh, Uzbek, Mongolian, Khmer, Burmese, Lao, Nepali, Sinhala, Amharic, Yoruba, Hausa, Igbo, Zulu, Pashto, Kurdish, Tajik, Kyrgyz, Maltese, Luxembourgish, Haitian Creole, Frisian, Yiddish, Faroese, Tatar, Javanese, Sundanese, Cebuano) on top of the existing pool, again applied per-concept rather than as token presence.

Also added British / American spelling pairs where they diverge (`tire`/`tyre`, `defense`/`defence`, `center`/`centre`, etc.) and a handful of new English concepts that had been implicit (`tire shop`, `car parts`, `oil exploration`, `olympic committee`, ...), each with its multilingual equivalents in the same edit.

AGENTS.md: codified the rule under "Maintaining the reverse DNS maps" so future edits are bound by it: every language section must cover the same concept set the English section covers, with idiomatic compounds rather than calques, skip rather than invent when no natural form exists, and any new English keyword must be added in parallel across the existing language set.

Final shape: 11,777 alternations / 175,556 chars across 45 detectors. Ruff check + format clean. Module compiles.

Known limitation (pre-existing, unchanged): Python's `re` does not treat Unicode Mn / Mc combining marks as word characters, so Brahmic-script words ending in vowel signs / virama won't match the outer `\b…\b`. Affects pre-existing and new entries equally; fixable later by switching to the `regex` module.

Co-authored-by: Sean Whalen <seanthegeek@users.noreply.github.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
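The known limitation can be demonstrated with the standard library alone. The two Devanagari words below are illustrative choices, written as escapes so the code points are unambiguous. The lookaround variant is an assumption added for comparison; it is not the fix the commit proposes, which is switching to the `regex` module.

```python
import re
import unicodedata

# Ends in U+0940 DEVANAGARI VOWEL SIGN II (category Mc, not a word character in `re`).
ENDS_IN_VOWEL_SIGN = "\u0939\u093f\u0902\u0926\u0940"
# Ends in U+0924 DEVANAGARI LETTER TA (category Lo, a word character).
ENDS_IN_LETTER = "\u092d\u093e\u0930\u0924"


def bounded(word):
    """Wrap a literal word in the outer word boundaries the detectors use."""
    return re.compile(r"\b" + re.escape(word) + r"\b")


def lookaround_bounded(word):
    """Hypothetical alternative: forbid adjacent word characters instead of
    requiring a boundary, so a trailing combining mark does not block a match."""
    return re.compile(r"(?<!\w)" + re.escape(word) + r"(?!\w)")


last_char_category = unicodedata.category(ENDS_IN_VOWEL_SIGN[-1])
```

After the final vowel sign there is no word character on either side, so `\b` finds no boundary and the first word never matches, while the word ending in a plain letter matches normally.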
||
|
|
3b705aeaa8 |
Commit classify_unknown_domains.py — regex-based multilingual classifier (#764)
* Commit classify_unknown_domains.py: regex-based multilingual classifier
Promotes the transient `/tmp/classify_b<N>.py` script that grew across
the b5–b13 reverse-DNS-map batches into a tracked tool. The classifier
takes a `collect_domain_info.py` TSV and emits a CSV of map additions
plus a text file of known-unknown additions — the regex baseline that
makes step 4 of the unknown-domain workflow ("classify from the TSV, not
by re-fetching") tractable at scale.
Coverage:
- Detectors for all 44 industry types in the README.
- Concept-translation parity across ~30 languages on the high-volume
detectors (Healthcare, Travel, Government, Retail, Finance, ISP, Web
Host, Manufacturing, Logistics, Real Estate, Automotive, Legal,
Agriculture).
- ~10–20 languages with 1–3 keywords each on the smaller detectors
(Photography, Sports, MSSP, Conglomerate, Search Engine, Social Media,
Defense, IaaS/PaaS/SaaS, Beauty, Print, Publishing, Religion, Science,
Event Planning, Staffing, Email Security/Provider, Marketing,
Construction, Industrial, Utilities, Energy, Government Media,
Physical Security, News, Nonprofit, Entertainment, Technology,
Consulting).
Brand-name selection prefers MMDB `as_name` → page title's first
segment → non-redacted WHOIS registrant → domain-derived fallback, with
a `clean_brand` pass that strips legal-form suffixes (LLC / GmbH / Ltda
/ EIRELI / sp. z o.o. / s.c.a r.l / UAB / etc.) and prefixes (PT, OOO).
When the title has multiple segments, the segment whose simplified form
contains the domain root is preferred — accessmontana.com with as_name
"MONTANA WEST, L.L.C." and title "Internet, Phone & TV Bundles | Access
Montana" maps to "Access Montana", not "Montana West".
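The title-segment preference described above could look roughly like the following. This is a sketch under stated assumptions: the function names, the separator set, and taking the first DNS label as the "domain root" are all hypothetical, not the script's actual code.

```python
import re


def simplify(text):
    """Lowercase and drop everything except ASCII letters and digits."""
    return re.sub(r"[^a-z0-9]", "", text.lower())


def pick_title_segment(title, domain):
    """Prefer the title segment whose simplified form contains the domain root.

    Assumption: segments are separated by " | ", " - ", or a spaced en/em
    dash, and the domain root is the first label of the domain.
    """
    root = simplify(domain.split(".")[0])
    parts = re.split(r"\s+[|\u2013\u2014-]\s+", title)
    segments = [p.strip() for p in parts if p.strip()]
    for segment in segments:
        if root and root in simplify(segment):
            return segment
    # Fall back to the first segment when nothing contains the root.
    return segments[0] if segments else ""


print(pick_title_segment(
    "Internet, Phone & TV Bundles | Access Montana", "accessmontana.com"
))
```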
A small mojibake fixer normalizes the most common UTF-8-as-Latin-1
re-encodings ("Ã³" → "ó", etc.) so Spanish/Portuguese/French homepages
that `collect_domain_info.py` mishandled still classify.
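One common way to undo UTF-8-as-Latin-1 mojibake is a round trip through the wrong codec. The sketch below uses that technique as an illustration; the commit describes a fixer for "the most common" cases, which may well be a substitution table instead, and `fix_mojibake` is a hypothetical name.

```python
def fix_mojibake(text):
    """Undo UTF-8 bytes that were wrongly decoded as Latin-1.

    Re-encode as Latin-1 to recover the original bytes, then decode them as
    UTF-8. If either step fails, the text was not this kind of mojibake, so
    it is returned unchanged.
    """
    try:
        return text.encode("latin-1").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        return text


# "\u00c3\u00b3" is the two-character garbling of "\u00f3" (o with acute).
garbled = "Fibra \u00c3\u00b3ptica"
print(fix_mojibake(garbled))
```

Text that is already correct passes through untouched because a lone Latin-1 accented byte is not valid UTF-8, which triggers the fallback.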
The empty HAND dict at the top of the file is an extension point for
batch-specific overrides — e.g. acquisition aliases or brand-name
corrections that don't fit any detector; each `domain → ("Brand",
"Type")` entry wins over the auto-classifier.
Wired into AGENTS.md's "Related utility scripts" section and documented
in `parsedmarc/resources/maps/README.md` alongside the rest of the
maps utilities.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* classify_unknown_domains.py: clarify dual-purpose framing
The classifier serves both lookup paths into base_reverse_dns_map.csv —
the original PTR-side flow (reverse-DNS base domains derived from DMARC
report source IPs) and the MMDB-coverage flow (AS domains lifted from
the bundled IPinfo Lite MMDB). The initial commit's docstring/comments
emphasized the MMDB-coverage flow because that's where the script grew
up across the b5–b13 batches, but it was always equally applicable to
PTR-side domains.
Updates:
- Top docstring rewritten to lead with the dual-purpose framing.
- README.md adds an explicit "useful for either lookup path" paragraph
referencing the original DMARC-report flow and the MMDB-coverage flow.
- AGENTS.md "Related utility scripts" entry updated to mention both
flows.
- Drops a stale "happen to have ASN registrations" aside in the
RETAIL_RE comment.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
---------
Co-authored-by: Sean Whalen <seanthegeek@users.noreply.github.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
||
|
|
9aa930f7cc |
Retroactive promotions: 3,171 KU rows reclassified by expanded multilingual classifier (#763)
Re-ran the expanded-multilingual classifier (PR #762's classifier with broader language coverage on Healthcare, Travel, Government, Retail, Finance, ISP, Manufacturing, Logistics, Real Estate, Automotive, Legal, Agriculture, plus Finance-via-body-text catching insurance/investment/asset-management) against every cached TSV from prior batches (b6–b13).
3,171 domains that previously couldn't be auto-classified (and were therefore added to known_unknown_base_reverse_dns.txt) now match the new detectors. These domains are promoted out of KU and into the map under their newly classified `(name, type)` pairs.
Type distribution of promotions:
Finance 736        Logistics 179    Real Estate 105    Healthcare 68
ISP 323            Retail 159       Education 110      Marketing 66
Manufacturing 207  Technology 142   Consulting 99      Nonprofit 64
Government 136     Travel 123       Utilities 71       Legal 53
+ smaller volumes across ~25 other industry types
ASN-domain coverage of the bundled IPinfo Lite MMDB after these promotions:
- by domain count: 32,254 / 63,993 (50.40%, up from 45.45%)
- by IPv4 weight: 98.45%
Honest scope note: the multilingual classifier achieves "concept parity" for the top-5 high-volume detectors (Healthcare, Travel, Government, Retail, Finance) across ~30 languages. Smaller detectors (Photography, Conglomerate, Sports, Defense, MSSP, IaaS/PaaS/SaaS, etc.) still have ~10-15 languages with 1-3 keywords each. Further per-detector multilingual parity is a follow-up effort; each subsequent expansion recovers fewer domains as the long tail of language-specific phrasings shrinks.
Co-authored-by: Sean Whalen <seanthegeek@users.noreply.github.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com> |
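The by-domain-count percentages quoted in these batch commits are plain ratios against the 63,993 ASN domains, and can be re-derived as a quick sanity check. The helper below is a hypothetical sketch; only the numbers come from the commit messages (29,083 is the mapped count reported by #762).

```python
TOTAL_ASN_DOMAINS = 63_993


def coverage_pct(mapped, total=TOTAL_ASN_DOMAINS):
    """Return mapped/total as a percentage rounded to two decimals."""
    return round(100 * mapped / total, 2)


before = 29_083    # mapped count reported after #762
promoted = 3_171   # KU rows promoted by this commit
after = before + promoted

print(after, coverage_pct(before), coverage_pct(after))
```

The sum matches the 32,254 figure reported above, which is consistent with every promoted row being a new map entry.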
||
|
|
c25bf28c1c |
Classify reverse DNS map: final cleanup batch (~2,650 unmapped MMDB ASN domains) (#762)
Final cleanup pass to clear the remaining MMDB AS-domain queue. Applied an expanded multilingual classifier covering all 44 README industry types plus an Energy concept (mapped to Utilities pending a README addition). Per-detector keyword lists now include Spanish, Portuguese, French, Italian, German, Dutch, Russian, Polish, Czech, Turkish, Greek, Chinese (simplified and traditional), Japanese, Korean, Arabic, Hebrew, Hindi, Vietnamese, Indonesian, and Thai where the concept has a recognizable local-language equivalent.
- 980 added to base_reverse_dns_map.csv (ISP 193, Education 193, Finance 155, Government 109, Healthcare 93, Web Host 37, MSP 31, Manufacturing 22, Logistics 17, Real Estate 12, Travel 11, Consulting 10, Tech 9, Nonprofit 9, Legal 9, Food 9, Retail 8, Religion 8, Utilities 7, plus smaller volumes across 14 more types).
- 1,669 added to known_unknown_base_reverse_dns.txt: the residual unfetchable / parked / Cloudflare-challenged / non-recognized-content rows.
ASN-domain coverage of the bundled IPinfo Lite MMDB after this batch:
- by domain count: 29,083 / 63,993 (45.45%)
- by IPv4 weight: 98.36%
Total since batch 5: ~16,400 map rows + ~17,400 known-unknown rows added across 9 batches. Remaining unmapped pool size: 0; every MMDB AS-domain has now been processed (either classified or recorded in known-unknown).
Co-authored-by: Sean Whalen <seanthegeek@users.noreply.github.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com> |
||
|
|
fa03b8f2c2 |
Classify reverse DNS map: next 10000 unmapped MMDB ASN domains (#761)
Batch 12. Auto-rate dropped to 23% (2330/9998), significantly lower than batch 11. The deeper into the long tail, the more candidates fall into non-classifier-recognized industries (retail, manufacturing, hospitality, local services) where the ISP/Web Host/MSP regex doesn't fire even though the page is fetchable.
- 2,330 added to base_reverse_dns_map.csv (ISP 991, Education 295, Finance 290, Government 265, Web Host 229, Healthcare 134, MSP 126).
- 7,667 added to known_unknown_base_reverse_dns.txt.
ASN-domain coverage of the bundled IPinfo Lite MMDB after this batch:
- by domain count: 27,824 / 63,993 (43.48%, up from 40.27%)
- by IPv4 weight: 98.36%
Same classifier as prior batches (no new code).
Co-authored-by: Sean Whalen <seanthegeek@users.noreply.github.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com> |
||
|
|
e6716c9e80 |
Classify reverse DNS map: next 10000 unmapped MMDB ASN domains (#760)
Batch 11. Auto-rate 30.6%, slightly lower than batch 10's 35.6%, consistent with continuing long-tail descent.
- 3,063 added to base_reverse_dns_map.csv (ISP 1,734, Finance 270, Web Host 254, Education 253, Government 249, MSP 153, Healthcare 150).
- 6,934 added to known_unknown_base_reverse_dns.txt.
ASN-domain coverage of the bundled IPinfo Lite MMDB after this batch:
- by domain count: 25,773 / 63,993 (40.27%, up from 35.49%)
- by IPv4 weight: 98.32% (up from 98.27%)
Same classifier as prior batches (no new code).
Co-authored-by: Sean Whalen <seanthegeek@users.noreply.github.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com> |
||
|
|
ee9bda7228 |
Classify reverse DNS map: next 10000 unmapped MMDB ASN domains (#759)
First of the 10K-batch series toward complete coverage. Auto-rate 35.6% this round, consistent with the long-tail descent.
- 3,556 added to base_reverse_dns_map.csv (ISP 2,410, Web Host 348, Education 247, Finance 201, Government 124, MSP 121, Healthcare 105).
- 6,442 added to known_unknown_base_reverse_dns.txt.
ASN-domain coverage of the bundled IPinfo Lite MMDB after this batch:
- by domain count: 22,710 / 63,993 (35.49%, up from 29.93%)
- by IPv4 weight: 98.27% (up from 98.18%)
Same classifier as batches 5-9 (no new code).
Co-authored-by: Sean Whalen <seanthegeek@users.noreply.github.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com> |
||
|
|
80a132801d |
Classify reverse DNS map: next 5000 unmapped MMDB ASN domains (#758)
Auto-classification rate jumped back to 50% (2502/4999) from 36.5% in batch 8; this slice happens to contain a higher proportion of small ISPs with conventional homepages, lifting the regex hit rate.
- 2,502 added to base_reverse_dns_map.csv (ISP 2,065, Web Host 133, Education 96, Finance 67, MSP 60, Government 58, Healthcare 23).
- 2,496 added to known_unknown_base_reverse_dns.txt.
ASN-domain coverage of the bundled IPinfo Lite MMDB after this batch:
- by domain count: 19,154 / 63,993 (29.93%, up from 26.02%)
- by IPv4 weight: 98.18% (up from 98.09%)
Same classifier as batches 5-8 (no new code).
Co-authored-by: Sean Whalen <seanthegeek@users.noreply.github.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com> |
||
|
|
c523d0da9c |
Classify reverse DNS map: next 5000 unmapped MMDB ASN domains (#757)
Continuing the long-tail sweep. Auto-classification rate dropped to 36.5% this round (1826/5000) from ~43% in prior batches: the further into the tail we go, the more candidates have parked / Cloudflare-challenged / sparse homepages where the regex can't match.
- 1,826 added to base_reverse_dns_map.csv (ISP 1,187, Web Host 267, Education 112, MSP 80, Finance 65, Government 62, Healthcare 53).
- 3,174 added to known_unknown_base_reverse_dns.txt.
ASN-domain coverage of the bundled IPinfo Lite MMDB after this batch:
- by domain count: 16,652 / 63,993 (26.02%, up from 23.17%)
- by IPv4 weight: 98.09% (up from 98.01%)
Same classifier as batches 5-7 (no new code).
Co-authored-by: Sean Whalen <seanthegeek@users.noreply.github.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com> |
||
|
|
4446702b84 |
Classify reverse DNS map: next 5000 unmapped MMDB ASN domains (#756)
Continuing the long-tail sweep toward complete ASN-domain coverage. This batch's candidates each represent only 2,048-3,072 IPv4 addresses (smaller than batch 6's 3K-6K range) so by-weight gains are diminishing, but each classified operator is one more small ISP / web host the project can name in DMARC reports.
- 2,148 added to base_reverse_dns_map.csv (ISP 1,605, Web Host 236, Education 126, Government 56, MSP 46, Finance 43, Healthcare 36).
- 2,852 added to known_unknown_base_reverse_dns.txt: homepages that were parked / Cloudflare-challenged / generic-server-test pages, homepages in obscure languages without telecom-keyword cognates the classifier recognized, or rows whose WHOIS / MMDB as_name / homepage couldn't combine into two corroborating sources.
ASN-domain coverage of the bundled IPinfo Lite MMDB after this batch:
- by domain count: 14,826 / 63,993 (23.17%, up from 19.81%)
- by IPv4 weight: 98.01% (up from 97.85%)
Same classifier as batch 6 (no new code).
Co-authored-by: Sean Whalen <seanthegeek@users.noreply.github.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com> |
||
|
|
7ef153b4da |
Classify reverse DNS map: next 5000 unmapped MMDB ASN domains (#755)
5x the typical batch size to chase complete ASN-domain coverage. Small ISPs and web hosts are high-value targets for spam/phishing abuse, so the long tail of unmapped operators is worth investing review effort in. Each candidate at this depth represents 3,072–6,144 IPv4 addresses (well below the 100K+ that head-batches saw); auto-classification rate is 43.5%, similar to the prior batch.
- 2,177 added to base_reverse_dns_map.csv (ISP 1,477, Web Host 296, Education 214, MSP 65, Government 56, Healthcare 40, Finance 29).
- 2,823 added to known_unknown_base_reverse_dns.txt: parked / Cloudflare-challenged / generic-server-test pages, obscure-language homepages without telecom-keyword cognates the classifier recognized, or rows whose WHOIS / MMDB as_name / homepage couldn't combine into two corroborating sources.
ASN-domain coverage of the bundled IPinfo Lite MMDB after this batch:
- by domain count: 12,678 / 63,993 (19.81%, up from 15.86%)
- by IPv4 weight: 97.85% (up from 97.55%)
Reused the batch-5 classifier (MMDB as_name as primary brand source with domain-root-aware title-segment selection, multilingual ISP/Web Host/MSP keyword regex, government and education TLD lists, Communications-with-media-context-guard fallback, and the deep brand-suffix cleanup for EPP/EIRELI/UAB/Druzstvo/etc. plus the UTF-8-as-Latin-1 mojibake fix). No new classifier changes this batch.
Co-authored-by: Sean Whalen <seanthegeek@users.noreply.github.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com> |
||
|
|
34518585b6 |
Classify reverse DNS map: next 1000 unmapped MMDB ASN domains (#754)
The next 1000 by aggregate IPv4 weight, all sitting in the long tail (each
candidate ASN holds ~7,400 IPv4 addresses, ~0.21% of total v4 weight), so
auto-classification rate is modest compared to head-batches:
- 460 added to base_reverse_dns_map.csv (ISP 344, Web Host 60, Education 21,
MSP 12, Healthcare 8, Government 8, Finance 7).
- 540 added to known_unknown_base_reverse_dns.txt — homepages that were
parked, behind a Cloudflare bot challenge, returning a generic-server test
page, in obscure languages with no telecom-keyword cognates the classifier
recognized, or whose WHOIS / MMDB as_name didn't combine with any
homepage signal to clear two corroborating sources.
Classifier improvements applied this batch (relative to prior batches' code):
- MMDB as_name is the primary brand source, with cleaned title as fallback
and domain-derived as last resort (WHOIS is mostly privacy-redacted at
this depth in the long tail).
- Title-segment selection now prefers the segment whose simplified form
contains the domain root, catching cases like accessmontana.com whose
as_name is the holding company "MONTANA WEST, L.L.C." but whose title
surfaces the operator brand "Access Montana".
- as_name fallback for ISP added "Communications" (with a media-context
guard so "Christian Broadcasting Network" doesn't hit) plus bare
"Internet" / "Cable" / "Telephone Co." patterns common in rural-US ISP
brands.
- Government TLD list expanded for .go.id, .gv.at, .gov.cn, .gob.cl/ar/gt,
.admin.ch, etc.; Education TLD list expanded for .ac.kr / .ac.za /
.ac.nz / .edu.cn / .edu.tw / .edu.sg / .edu.my / .edu.ph / .edu.eg.
- MSP detection re-added (`it solutions` / `managed it support` /
`managed tech` patterns) for marconet.com / odyssey.uk / vmi.se type
long-tail managed-IT shops.
- Brand cleanup deepened to handle Brazilian EPP / EIRELI ME, Italian
s.c.a r.l, Polish sp z o.o variants, Lithuanian UAB, Czech Druzstvo,
Venezuelan C.A., trailing-single-letter artifacts, and double-spaces.
- Encoding-mojibake fixer for the common UTF-8-as-Latin-1 cases
("Fibra Ã³ptica" → "Fibra óptica") so Spanish/Portuguese ISP pages
classify even when collect_domain_info.py mishandled the encoding.
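The brand-cleanup bullet in the list above (stripping legal-form suffixes, double spaces, and similar artifacts) might be approximated as follows. This is an illustrative sketch only: the suffix list is a small assumed subset, and `clean_brand_sketch` is a hypothetical stand-in, not the script's `clean_brand`.

```python
import re

# Assumed subset of legal-form suffixes; the real list is longer.
_SUFFIX_RE = re.compile(
    r"[\s,]+(?:l\.?l\.?c\.?|gmbh|ltda\.?|eireli(?:\s+me)?|epp|uab"
    r"|sp\.?\s*z\s*o\.?\s*o\.?|c\.a\.)$",
    re.IGNORECASE,
)
# Assumed subset of legal-form prefixes.
_PREFIX_RE = re.compile(r"^(?:pt|ooo)\s+", re.IGNORECASE)


def clean_brand_sketch(name):
    """Collapse whitespace, then strip trailing suffixes and a leading prefix."""
    name = re.sub(r"\s{2,}", " ", name).strip()
    previous = None
    # Repeat so stacked suffixes are peeled off one at a time.
    while previous != name:
        previous = name
        name = _SUFFIX_RE.sub("", name).strip(" ,")
    return _PREFIX_RE.sub("", name)


print(clean_brand_sketch("MONTANA WEST, L.L.C."))
```

Requiring a separator (whitespace or comma) before each suffix keeps names that merely end in the same letters from being truncated.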
Co-authored-by: Sean Whalen <seanthegeek@users.noreply.github.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
||
|
|
769b16bb03 |
Drift-detect rebrands: tighten regex; promote 11 verified rebrand-aliased map keys (#753)
* Tighten rebrand regex to drop CTA, third-party-mention, and CSS-asset FPs
The first run of detect_rebrands.py against the live map surfaced systemic false-positive categories that drowned the real signals. Tightening over two rounds of FP triage:
REBRAND_RE: drop bare "now <Cap>" and "joined the X" branches:
- "Buy Now PROMO", "Apply Now Who", "Order Now Free Shipping": modern marketing pages saturate body text with CTA fragments and ~95% of bare "now <Capital>" matches were these. Replaced with the linguistically meaningful pattern "(is|are|was|were|am) now (?:(?:a )?part of)?" which still catches "BankOnIT is now Navanta", "We are now Cencora", "is now part of Lumen", etc.
- "joined the Festo Certified System Integrator Program", "joined the ClimateCAP Initiative", "joined the Fredonia Women's Rugby team": the "joined the X" pattern was too generic; real "joined the X family" rebrand banners are rare enough that dropping the branch is the right trade.
REBRAND_RE: add `\b` word boundary at the start so triggers don't match mid-word: "Stre*am* now Mystery" was matching `am now <Cap>` because the last two letters of "Stream" satisfied the verb alternation.
REBRAND_PATH_RE: drop bare `rebrand`, `name change`, `new name for`, and `brand-update` / `brand-refresh` patterns. They appeared too often as CSS class names (`class="rebrand-page"`), CSS variables (`--rebrand-underline-color`), image filenames (`bms-rebrand-logo.svg`, `brand-update.css`), and JSON/JS strings (`"name change"` user-account labels). Adding `\b` boundaries doesn't help because dashes are non-word characters. The remaining narrow patterns (`brand-launch`, `brand-announcement`, `brand-reveal`, `our-new-name`, `our-new-brand`, `acquisition-announcement`, `merger-announcement`) still catch the canonical bankonitusa.com case via its `brand-launch-frequently-asked-questions` URL slug and `Brand announcement` alt text.
_REBRAND_NOISE: make the comparison case-insensitive and add "included", "iso", "secure", "part" to suppress "is now ON" / "is now LIVE" / "is now ISO 27001 certified" / "is now Secure Managed Wi-Fi" / "is now Part of" patterns. Twitter/Facebook/Square (the social-platform rebrand mentions in footers like "X (formerly Twitter)") moved to lowercase since the comparison is now case-insensitive.
Net effect on a full sweep over the ~13,100-key map: rebrand-signal flagged-row count dropped from ~270 (initial run) to 108 (round-3), clearing the dominant FP categories while every real signal (verified against the bankonitusa.com canonical case plus 11 other actual rebrands) still fires.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* Promote 11 verified rebrands found by drift sweep; alias 4 acquirer domains
Renames produced by `detect_rebrands.py` running against the full ~13,100-key map and verified by re-reading each operator's homepage. Type column unchanged for every row; only the canonical `name` shifts to the new operator. Where the new operator's primary domain wasn't already in the map, a case-1 alias row is added pointing to the same `(name, type)`.
Renames:
- amerisourcebergen.com: AMERISOURCEBERGEN → Cencora
- aurorahealthcare.org: Aurora Health Care → Advocate Health
- consolidated.com: Consolidated Communications → Fidium Fiber
- databridgesites.com: Meridian Parkway Data Center Owner → TierPoint
- emarsys.com: SAP Emarsys → SAP Engagement Cloud
- rig.net: RigNet → Viasat
- rxlightning.com: RxLightning → CoverMyMeds
- telepoint.bg: Telepoint → Digital Realty
- thehostgroup.com: The Host Group → HostGo
- ultisat.com: Globecomm Services Maryland → UltiSat
- unifiedpostgroup.com: Unifiedpost Group → Banqup
New aliases (operator's primary domain not previously mapped):
- cencora.com → Cencora, Healthcare
- advocatehealth.com → Advocate Health, Healthcare
- covermymeds.com → CoverMyMeds, Healthcare
- banqup.com → Banqup, SaaS
Five sweep hits intentionally deferred for lack of a clear second source: megatel.co.nz → Nova (`nova.co.nz` is for sale via a domain broker; unclear which Nova entity), pogozone.com → NeuBeam (NeuBeam's homepage doesn't acknowledge the PogoZone acquisition), prempub.com → Ingenious Media (ingeniousmedia.com fetch failed), voltagepark.com → ? (merger with Lightning AI rather than a clean rebrand), and a handful of more ambiguous Synopsys/Ansys/OmniAccess/Rakuten/Indigital/Synthite signals that need manual research.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* Document detect_rebrands.py cadence as run-once-a-year
The drift sweep is for catching operator rebrands and acquisitions that accumulated since the previous run; M&A activity over the mapped operator set is slow enough that yearly is sufficient. Annotate the script's own docstring, the maps README, and the AGENTS.md "Related utility scripts" entry so a future contributor doesn't mistake it for a per-batch step.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
---------
Co-authored-by: Sean Whalen <seanthegeek@users.noreply.github.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com> |
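The tightened body-text trigger described in this commit can be sketched as below. The verb alternation, the leading `\b`, and the noise-word idea come from the message; the capture group, the exact noise set, and the names `REBRAND_SKETCH_RE`, `NOISE_SKETCH`, and `rebrand_target` are assumptions for illustration, not the script's actual definitions.

```python
import re

# Leading \b stops "am" inside "Stream" from satisfying the verb alternation.
REBRAND_SKETCH_RE = re.compile(
    r"\b(?:is|are|was|were|am) now (?:(?:a )?part of )?([A-Z]\w*)"
)

# Assumed noise words, compared case-insensitively.
NOISE_SKETCH = {"on", "live", "included", "iso", "secure", "part"}


def rebrand_target(text):
    """Return the first capitalized name after a '<verb> now' trigger, if any."""
    for match in REBRAND_SKETCH_RE.finditer(text):
        name = match.group(1)
        if name.lower() not in NOISE_SKETCH:
            return name
    return None


for sample in ("BankOnIT is now Navanta", "Buy Now PROMO", "Stream now Mystery"):
    print(sample, "->", rebrand_target(sample))
```

Because the pattern is case-sensitive on "now" and requires a preceding verb, CTA fragments such as "Buy Now" never reach the noise filter at all.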
||
|
|
c752e776de |
Detect map-key rebrands via homepage drift sweep (#752)
Adds two complementary pieces of M&A drift detection over base_reverse_dns_map.csv:
- `collect_domain_info.py` gains two derived columns. `rebrand_signal` combines
a body-text regex ("now X" / "formerly known as X" / "we became X" / ...)
with a narrow path-and-alt-text regex ("rebrand", "brand-launch",
"brand-announcement", "name-change", "our-new-name", ...) that runs against
the JSON-unescaped page bytes, so URL slugs and image alt attributes inside
Elementor / hydration script blobs are reachable. The two-regex split is
what catches image-only acquisition banners like bankonitusa.com's "now
Navanta" — a `<a href="https://navanta.com/brand-launch-..."><img
alt="Brand announcement"></a>` with no visible text — that pure body-text
scanning misses. `external_links` collects the homepage's non-self,
non-social outbound link hosts as review context only.
- `detect_rebrands.py` is a new sibling drift sweep. It re-fetches every key
in base_reverse_dns_map.csv with the same fetch machinery, evaluates two
default flag triggers (`rebrand_signal` matched, or final URL host doesn't
sit under the input domain), and writes a compact TSV of just the flagged
rows. `external_links` is captured into the row as context but is not a
default trigger — most outbound links are to partners / customers / vendors,
and flagging them would flood review with noise. `--flag-external-links`
opts into that signal for thorough sweeps. Resume-safe via `-o`.
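The two default flag triggers and the opt-in third one, as listed above, could be expressed roughly like this. It is a sketch under assumptions: the function names and argument shapes are hypothetical, and only `urllib.parse.urlsplit` from the standard library is used.

```python
from urllib.parse import urlsplit


def host_sits_under(final_url, domain):
    """True when the final URL's host equals the domain or is a subdomain of it."""
    host = (urlsplit(final_url).hostname or "").lower().rstrip(".")
    domain = domain.lower().rstrip(".")
    return host == domain or host.endswith("." + domain)


def should_flag(domain, final_url, rebrand_signal,
                external_links=(), flag_external_links=False):
    """Apply the default triggers; external links count only when opted in."""
    if rebrand_signal:
        return True
    if not host_sits_under(final_url, domain):
        return True
    return bool(flag_external_links and external_links)


print(should_flag("bankonitusa.com", "https://navanta.com/", ""))
```

Matching on a leading dot (`"." + domain`) avoids treating `notexample.com` as sitting under `example.com`.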
Output is review fodder, not automated map mutation: a single signal is one
corroborating source, and promoting a flagged row into the map still requires
a second source per the two-corroborating-sources rule.
README and AGENTS.md updated to document the new columns and script.
Co-authored-by: Sean Whalen <seanthegeek@users.noreply.github.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
||
|
|
6fa561d172 |
Classify reverse DNS map: ~2,100 unmapped MMDB ASN domains; bankonitusa.com → Navanta (#751)
Adds ~2,125 ASN-domain classifications carried out across four ~1,000-domain batches in a prior session that wasn't pushed before #748/#749 merged. The overlap with those merged batches is dropped — origin/master's classifications are kept as authoritative — and only the genuinely-new domains land here. 188 known-unknown rows are promoted out to the map for the same reason. Also updates bankonitusa.com from BankOnIT to Navanta and adds navanta.com as an alias after a spot check observed the operator's "now Navanta" rebrand banner. Two corroborating sources: the banner on bankonitusa.com itself (image-only `<a href="https://navanta.com/brand-launch-..."><img alt="Brand announcement"></a>`) and the rebrand explainer on navanta.com ("Why We Became Navanta", "MyBankonIT has been rebranded to MyBPC"). The MMDB still names the pre-rebrand entity (BankOnIT, L.L.C.) — typical years-of-lag pattern. Co-authored-by: Sean Whalen <seanthegeek@users.noreply.github.com> Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com> |
||
|
|
bf526f4e12 |
docs(AGENTS.md): require fresh branch off origin/master per batch (#750)
* docs(AGENTS.md): require fresh branch off origin/master per batch
Add a "Starting the next batch" subsection to the reverse-DNS-maps workflow. Each batch must start from a fresh checkout of origin/master, not from the previous batch's branch.
The trap: if the previous batch's commit has already merged via a PR pushed from elsewhere (a co-worker's session, an unsynced laptop, an earlier session), the local copy of that commit still sits on the old branch. Stacking new work on top makes the new PR conflict with master, because the merged commit and the local copy insert identical map rows at identical sorted positions and the same lines collide.
Hit live this batch (PR #749) and recovered via `git rebase --onto origin/master <stale-commit> <branch>` plus a force-push, then a PR-description trim. Documenting the failure mode and the recovery so the next contributor avoids the trap entirely.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* docs(AGENTS.md): also check for open map PRs before starting a batch
Add a pre-flight `gh pr list --search` step ahead of the branch-fresh-off-master rule. Same scenario in mind: a previous batch's PR is still in flight, started from a different machine or session, and starting a new batch in parallel duplicates effort or splits attention across two competing PRs touching the same files. Cheap one-liner; cost of forgetting it is the kind of conflict #749 already documented at the branch-hygiene level.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
---------
Co-authored-by: Sean Whalen <seanthegeek@users.noreply.github.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com> |
||
|
|
7ef31f8083 |
Classify reverse DNS map: next 1000 unmapped MMDB ASN domains (#749)
Continued the MMDB ASN-domain coverage walk into the 14k-10k IPv4-weight band. Added 883 new map entries and 117 new known-unknown entries from the top 1,000 unmapped candidates.
ASN-domain coverage by IPv4 weight: 96.5% -> 96.8%.
ASN-domain coverage by domain count: 11.0% -> 12.4%.
Composition: ~50 globally-known brands (Vanguard, AIG, Aon, Equifax, Mercedes-Benz USA, BP, BHP, Bechtel, Tetra Pak, Anheuser-Busch, Air Canada, Maersk, NFL, NHL, MGM Resorts, Wolfram, Red Hat, Palo Alto Networks, New Relic, Travelport, Epicor, IQVIA, Dassault Systemes, Disney+, Valve, Seagate, Analog Devices, Renesas, Dow Jones, Lee Enterprises, IGN, Mondadori, AtkinsRealis, Eiffage, Ogilvy, Interpublic, Ooredoo Maldives, MTN Zambia, Movistar Costa Rica, Telekom Romania Mobile, Sparkle, Vodafone Ireland, etc.); ~30 universities and government / state agencies (City of San Jose, City of Phoenix, Bulgarian gov, Region Uppsala, Weld County, Long Beach Unified, Escambia School District, Region 4 ESC, Merced COE, Santa Cruz COE, Politechnika Warszawska, Bogazici, KAIST-affiliated Korean universities, Ural Federal University, etc.); the long tail of regional ISPs / hosters / MSPs / data-center operators classified via MMDB as_name + homepage / WHOIS corroboration.
117 added to known-unknown where the two-corroborating-sources bar wasn't met (Cloudflare-blocked sites with privacy-redacted WHOIS, generic-token AS-names with empty homepages, parked domains, etc.). Files remain disjoint per the workflow guardrail.
sortlists.py validates clean (types, sort, dedupe). CRLF preserved.
Co-authored-by: Sean Whalen <seanthegeek@users.noreply.github.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com> |
||
|
|
ab9d4e93f5 |
Classify reverse DNS map: next 1000 unmapped MMDB ASN domains (#748)
Continued the MMDB ASN-domain coverage walk into the 18k–14k IPv4-weight band. Added 941 new map entries and 59 new known-unknown entries from the top 1,000 unmapped candidates. ASN-domain coverage by IPv4 weight: 96.0% → 96.5%. ASN-domain coverage by domain count: 9.5% → 11.0%. Composition: ~50 universities and government / state agencies (HMRC, SSA, DHS, DOJ, BART, Pittsburgh, Charlotte, NY courts, Bank of Canada, MTA, gov.si, gov.ru, KAUST, Sharif University, Karolinska Institutet, IIT, KTH, etc.), ~70 globally-known brands (Nvidia, AMD, BMW, Mastercard, Nasdaq, NetApp, Allianz, Honeywell, JPMorgan, Goldman Sachs, Mitel, Arista, Take-Two, Universal Music, Disney Go, Fox, Nike, Cigna, Aetna, Humana, AbbVie, Mitsubishi Electric, Saint-Gobain, Reliance Industries, Hyundai Autoever, Square Enix, NEXON, Riot Games, Mahidol University, Hong Kong HSBC, Standard Chartered, etc.), and the long tail of regional ISPs / hosters / MSPs / data center operators classified via MMDB as_name + homepage corroboration. 59 added to known-unknown where the two-corroborating-sources bar wasn't met. Files remain disjoint per the workflow guardrail. sortlists.py validates clean (types, sort, dedupe). CRLF preserved. Co-authored-by: Sean Whalen <seanthegeek@users.noreply.github.com> Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com> |
||
|
|
1fd833bbf0 |
Classify reverse DNS map: next 1000 unmapped MMDB ASN domains (#747)
Continued the MMDB ASN-domain coverage walk into the 30k–18k IPv4-weight band. Added 971 new map entries and 32 new known-unknown entries from the top 1,000 unmapped candidates. ASN-domain coverage by IPv4 weight: 95.3% → 96.0%. ASN-domain coverage by domain count: 8.0% → 9.5%. Composition: ~30 universities and government / state agencies (maryland.gov, ok.gov, nj.gov, NIA Korea, NICTEC Thailand, etc.), ~80 globally-known brands (Nvidia, Tesla, Intel, Ford, GM, Volvo, Disney, EA, Roblox, Riot Games, Sony PlayStation, JPMorgan, Goldman Sachs, Morgan Stanley, Charles Schwab, AXA, Cigna, Cargill, Hallmark, Pepsi, Kroger, Random House, NBCUniversal, Qualcomm, Deutsche Bank, UBS, Citi, Lloyds Banking, Westpac, CommBank, Adobe, Broadcom, NXP, Schaeffler, Saint-Gobain, Hanwha, Doosan, Hyundai Autoever, Square Enix, Garena, etc.), and the long tail of regional ISPs / hosters / MSPs classified via MMDB as_name + homepage corroboration. 1 entry promoted out of known_unknown_base_reverse_dns.txt; 32 added where the two-corroborating-sources bar still wasn't met. Files remain disjoint per the workflow guardrail. sortlists.py validates clean (types, sort, dedupe). CRLF preserved. Co-authored-by: Sean Whalen <seanthegeek@users.noreply.github.com> Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com> |
||
|
|
05adb9c831 |
Classify reverse DNS map: top ~1,950 unmapped MMDB ASN domains (#746)
Walked the bundled IPinfo Lite MMDB for ASN domains absent from base_reverse_dns_map.csv, processed the top ~1,950 by IPv4 weight across five batches (collect_domain_info.py + tier-based classification per AGENTS.md), and added 1,664 new map entries. ASN-domain coverage by IPv4 weight: ~92.9% → 95.3%. ASN-domain coverage by domain count: 5.4% → 8.0%. Composition: ~250 universities/government (Tier 0 — restricted TLD + MMDB as_name), ~80 globally-known brands (Saudi Telecom, JAXA, RailTel, LY Corporation, Tesla, Intel, Citi, Schwab, Disney, EA, Volvo, Mitsubishi Electric, Cargill, Hallmark, Medtronic, Banco do Brasil, Petrobras, etc.), direct aliases for already-mapped brands (HKBN, Tata Teleservices, Cox, NTT, T-Mobile, etc.), and the long tail of regional ISPs / hosters / DC operators classified via MMDB as_name + homepage corroboration. 66 entries promoted out of known_unknown_base_reverse_dns.txt where the new collector data cleared the two-corroborating-sources bar; 55 added where the bar still wasn't met. Files remain disjoint per the workflow guardrail. sortlists.py validates clean (types, sort, dedupe). CRLF preserved. Co-authored-by: Sean Whalen <seanthegeek@users.noreply.github.com> Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com> |
||
|
|
7ba078bff1 |
Translate AS-name source rows via MMDB; classify reverse DNS batch (#745)
* feat(maps): translate AS-name source rows via MMDB
When parsedmarc's ASN-fallback path in utils.get_ip_address_info surfaces a raw MMDB as_name (e.g. "Vodafone Group PLC") for an IP that has no PTR and whose as_domain isn't in the map, find_unknown_base_reverse_dns.py now looks the as_name up in the bundled ipinfo_lite.mmdb and substitutes the matching as_domain so the row enters the unknown pipeline as a researchable domain instead of being dropped or polluting the list.
Normalize non-breaking spaces (U+00A0) and runs of whitespace when building and querying the as_name index: the source CSV and MMDB disagree on NBSP placement for several names (e.g. "UDomain\xa0Web Hosting Company Ltd" in the CSV vs. "UDomain Web Hosting Company Ltd" in the MMDB), causing exact-match lookups to miss otherwise-identical entries.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* chore(maps): classify a batch of unknown reverse DNS base domains
40 map additions (35 source domains + 5 redirect-target/promotion aliases) and 35 known-unknown additions, covering the 71-entry unknown_base_reverse_dns.csv refresh.
Newly mapped operators include several MMDB-AS-translated regional ISPs (Babilon-T/TJ, MegaFon Tajikistan, Ucell, Ufone, PinPro, Teraline Telecom, Transtelecom Kazakhstan, Satis, AlmaTV, Radius-NET, Burlington Telecom), aliases of existing brands (Telstra/bigpond.net.au, UDomain/udomain.hk, AG Telekom/katv1.net, EWE/ewe-ip-backbone.de, Hostinger/hstgr.cloud, Docusign/docusign.net, Brevo/sp2-brevo.net, MegaFon/megafon.tj, Beeline/beeline.uz), Tier-0 brands (Visa, Tripster, Verde Agritech), one healthcare entry (Sanwakai Hospital), one government entry (Special Communication Service of Azerbaijan), one education entry (KazRENA), and an MSP (Otava).
Redirect-target aliases added for burlingtontelecom.com, alma.plus, cn.at, and teraline-telecom.net per the post-batch sweep rule.
fea.net promoted out of known-unknown to West Coast Internet (WCI) after its homepage redirect-target was already mapped.
Domains with single-source corroboration (privacy WHOIS plus unreachable site, parked-domain pages, ambiguous categorizations) went to known_unknown_base_reverse_dns.txt rather than the map.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
---------
Co-authored-by: Sean Whalen <seanthegeek@users.noreply.github.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com> |
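The as_name normalization described in the feat(maps) commit above can be sketched roughly as follows. This is a minimal illustration, not the code in find_unknown_base_reverse_dns.py: the helper names (`normalize_as_name`, `build_as_name_index`, `lookup_as_domain`) are hypothetical, the index is built from an in-memory list rather than from ipinfo_lite.mmdb, and the sample as_domain value is assumed for the example.

```python
import re

# Matches any run of whitespace. For str patterns, \s also matches U+00A0.
_WHITESPACE_RUN = re.compile(r"\s+")


def normalize_as_name(name: str) -> str:
    """Hypothetical helper: make NBSPs and whitespace runs compare equal."""
    # Replace NBSP explicitly for clarity, then collapse runs and trim.
    return _WHITESPACE_RUN.sub(" ", name.replace("\u00a0", " ")).strip()


def build_as_name_index(records):
    """Map normalized as_name -> as_domain (first occurrence wins).

    `records` is an iterable of (as_name, as_domain) pairs; a real run
    would presumably gather these from the MMDB instead.
    """
    index = {}
    for as_name, as_domain in records:
        index.setdefault(normalize_as_name(as_name), as_domain)
    return index


def lookup_as_domain(index, as_name):
    """Normalize the query the same way as the index keys before lookup."""
    return index.get(normalize_as_name(as_name))


# Illustrative data only: the as_domain value is assumed for this example.
mmdb_records = [("UDomain Web Hosting Company Ltd", "udomain.hk")]
index = build_as_name_index(mmdb_records)

# The CSV-side spelling carries an NBSP but still resolves after normalization.
csv_name = "UDomain\xa0Web Hosting Company Ltd"
print(lookup_as_domain(index, csv_name))
```

Applying the same normalization on both the build side and the query side is what lets names that differ only in NBSP placement or spacing match each other.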
||
|
|
6ff6261df9 | docs: update installation instructions for IPinfo Lite and MaxMind GeoLite2 databases | ||
|
|
06fd3f2b09 | docs: update installation instructions and usage notes for parsedmarc |