mirror of
https://github.com/domainaware/parsedmarc.git
synced 2026-05-20 19:05:24 +00:00
8317ffcde814897b3659ac13f919c5f334a00c8a
1489 Commits
| Author | SHA1 | Message | Date | |
|---|---|---|---|---|
|
|
8317ffcde8 | Fix rename syntax for parsed_sample headers in Splunk DMARC forensic dashboard | ||
|
|
3b9e678533 |
Refactor SMTP TLS dashboard with base search
Refactored the SMTP TLS Splunk dashboard to use a base search for improved query efficiency and maintainability. Updated input token names and adjusted search queries for better organization and clarity. |
||
|
|
5ba72d2783 | Add source AS name to fillnull and search queries in DMARC aggregate dashboard | ||
|
|
e40b53da64 | Enhance Splunk DMARC aggregate dashboard: add source AS name dropdown and update search queries | ||
|
|
fe296ca869 |
Update dashboard documentation
- Introduced a new README.md for dashboard development with detailed instructions. - Removed outdated README files for Grafana and Splunk dashboards. |
||
|
|
397378de8e |
Bump mailsuite to >=2.0.2 for 9.11.1 release (#743)
Addresses RuntimeError: Event loop is closed in the MS Graph mailbox backend (#742). Co-authored-by: Sean Whalen <seanthegeek@users.noreply.github.com> Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>9.11.1 |
||
|
|
5d816a4e56 |
Offload mailbox layer to mailsuite>=2.0.0 (#741)
mailsuite 2.0.0 extracted the IMAP, Microsoft Graph, Gmail, and Maildir connections out of parsedmarc into mailsuite.mailbox so other projects can reuse the same provider-agnostic interface. Replace the parsedmarc/mail submodules with a thin re-export of mailsuite.mailbox and drop the duplicated implementations. Per the migration note in seanthegeek/mailsuite#22, pass token_cache_name="parsedmarc" so existing AuthenticationRecord caches on disk continue to work without re-prompting users to authenticate. The existing graph_url config knob is forwarded unchanged. Drop direct dependencies that are now installed transitively via mailsuite[gmail,msgraph] (msgraph-core, imapclient, google-*). The extras are pulled in non-optionally so Gmail and Microsoft Graph support remain available out of the box. Drop nine test classes that were exercising mailsuite-side implementation internals (TestGmailConnection, TestGraphConnection, TestImapConnection, the _get_creds/_generate_credential half of TestGmailAuthModes, TestImapFallbacks, TestMSGraphFolderFallback, TestMaildirConnection, TestMaildirReportsFolder, TestMaildirUidHandling, TestTokenParentDirCreation); these are mailsuite's tests now. The CLI integration tests that mock parsedmarc.cli.{IMAP,Gmail,MSGraph}Connection are kept. Co-authored-by: Sean Whalen <seanthegeek@users.noreply.github.com> Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>9.11.0 |
||
|
|
900ee22525 | Make map and country list side by side in the Splunk DMARC aggregate dashboard XML | ||
|
|
e709839f79 | Fix typo in source ip viz | ||
|
|
e7f6e1b5e7 | Update map files | ||
|
|
26f54b1269 | Add content rule to exclude adult websites from domain lists | ||
|
|
44fd1aa555 |
Coerce malformed <email> in aggregate report metadata to None (#740)
xmltodict turns stray angle brackets in <email> (e.g. "<bad-xml@bad-xml.net>") into a nested dict, which then flows through parse_aggregate_report_xml as the org_email value. Parsing succeeds, but Elasticsearch / OpenSearch reject the document at index time because the org_email mapping is text — observed as document_parsing_exception / mapper_parsing_exception with a "{#text=..., bad-xml=null}" preview. When report_metadata["email"] comes back as a dict, log it at debug and discard. The rest of the report still ingests with org_email=None instead of failing the whole document downstream. Co-authored-by: Sean Whalen <seanthegeek@users.noreply.github.com> Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com> |
||
|
|
f3a2e894e0 |
chore: update IPinfo Lite MMDB (#739)
Co-authored-by: seanthegeek <44679+seanthegeek@users.noreply.github.com> |
||
|
|
265bf64240 |
Align Grafana dashboard with OpenSearch Dashboards source-of-truth (#738)
* Align Grafana dashboard with OpenSearch Dashboards source-of-truth Adds the two aggregate-DMARC panels that exist on the OSD dashboard but were missing from the bundled Grafana dashboard: - "Message sources by name and type" — buckets by source_name + source_type, sums message_count per (name, type) tuple. Mirrors the OSD viz from 9.4.x. - "Message sources by Autonomous System" — buckets by source_asn + source_as_name + source_as_domain, sums message_count per ASN. Mirrors the OSD viz added in 9.9.0 with the IPinfo Lite ASN integration. Both panels are patterned on the existing "Reporting Organisations" panel (same datasource $datasourceag, same sum(message_count) metric, same gradient-gauge "Messages" column with rename transforms). They sit at the bottom of the existing layout (gridPos y=129 and y=140) so the existing panel positions are unchanged. Verified against the bundled grafana/grafana:12.3.0: dashboard import returns status=success, both panels render with real data from the sample-corpus indexes, and the ES aggregations (terms on source_name + source_type, numeric terms on source_asn) return the expected results. Out of scope: - Extras in the Grafana dashboard that aren't on OSD (SPF/DKIM Results Over Time, Alignment Over Time, Stat overview, Published Policies, Forensic IP / country tables) are left in place. They were community-contributed and likely valued by some users. - Migrating the deprecated `graph` and `grafana-worldmap-panel` panel types to modern timeseries / geomap is a separate, larger task. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * Grafana: migrate deprecated graph and worldmap panels Replaces the 6 legacy `graph` panels with `timeseries` panels and the 2 legacy `grafana-worldmap-panel` panels with `geomap` panels. Both deprecated plugins still rendered in Grafana 12 via auto-migration but were flagged for removal; this ships the modern saved shape. graph -> timeseries (6 panels): SPF Results Over Time, DKIM Results Over Time, SPF Alignment Over Time, DKIM Alignment Over Time, DMARC Passage Over Time, Message Disposition Over Time. Panel `aliasColors` (e.g. {true: dark-green, false: dark-red}) are translated into per-series `fieldConfig.overrides` so the green/red by-pass-fail colorings carry forward; legacy graph fields (lines, fill, yaxes, tooltip etc.) are dropped in favor of the new `fieldConfig.defaults.custom` block and `options.legend` / `options.tooltip`. worldmap -> geomap (2 panels): Map of Message Source Countries (aggregate), Forensic Sample Sources by Country (forensic). The legacy `locationData=countries` lookup-by-ISO becomes a geomap markers layer with `location.mode=lookup`, `gazetteer=public/gazetteer/countries.json`, and `lookup=source_country.keyword` — same input data, modern renderer. Drops the date_histogram bucket from the geomap targets since the map is a snapshot over the panel time range, not a time series. Verified against the bundled grafana/grafana:12.3.0: dashboard imports with status=success and `version=19`, live panel types now report `{timeseries: 6, geomap: 2, table: 14, grafana-piechart-panel: 3, stat: 1, row: 3}` — no more `graph` or `grafana-worldmap-panel` entries. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> --------- Co-authored-by: Sean Whalen <seanthegeek@users.noreply.github.com> Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com> |
||
|
|
4e8c28bbc0 |
Align Kibana dashboards with OpenSearch Dashboards source-of-truth (#737)
* Align Kibana dashboards with OpenSearch Dashboards source-of-truth
OSD is a fork of Kibana 7.10 and Kibana 8.x's saved-object migration
handlers accept OSD's saved-object format directly. Replace the legacy
Kibana export with a byte-identical copy of the OSD ndjson, so the two
backends ship the same panels, metric aggregations, panel titles, and
field assignments instead of drifting independently.
Verified against Kibana 8.19.7: import returns successCount=26 with no
errors and Kibana auto-migrates each viz / dashboard to its current
saved-object schema (typeMigrationVersion 8.5.0 for visualizations,
10.3.0 for dashboards) on import.
Net effects for Kibana users on import:
- Picks up the metric-aggregation fix from 9.10.3 — pies, tables, and
the choropleth now sum(message_count) instead of counting OS docs,
giving real message volume rather than distinct source-row counts.
- Adds "Message sources by Autonomous System" and "Message sources by
name and type" panels (previously only on OSD).
- Forensic dashboard simplified to OSD's two-panel layout (markdown
intro + samples table) — drops the Kibana-only IP-address and
country-ISO tables and the choropleth.
- Adds the "SMTP TLS reporting" dashboard (was absent from the bundled
Kibana export).
- Drops the extraneous "Evolution DMARC par source_reverse_DNS" Lens
visualization that snuck in via a community contribution.
Updates docs/source/kibana.md to reflect the new dashboard names
("DMARC aggregate reports" / "DMARC failure reports") and adds a brief
section on the SMTP TLS reporting dashboard.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* Drop the duplicate Kibana ndjson; point Kibana users at the OSD file
Kibana 8.x's saved-object migration handlers accept the OpenSearch
Dashboards saved-object format directly (verified by import returning
successCount=26 with no errors), so a separate kibana/export.ndjson
was just two copies of the same bytes that would inevitably drift. Drop
it and update the bootstrap script and docs to point at the existing
dashboards/opensearch/opensearch_dashboards.ndjson.
Add a path-filtered CI workflow (.github/workflows/dashboards.yml) that
fires only when the OSD ndjson changes. It stands up an Elasticsearch +
Kibana 8.19.7 service pair, POSTs the file at the saved-objects import
endpoint, and asserts success=true with no errors. That keeps the
single-file source compatible with Kibana on every change.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
---------
Co-authored-by: Sean Whalen <seanthegeek@users.noreply.github.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
||
|
|
826e78c390 |
Fix DMARC dashboard metrics (OSD + Splunk) and add dashboard-dev bootstrap (#736)
* OSD: fix aggregate dashboard metrics to sum(message_count) 13 panels on the DMARC aggregate dashboard were aggregating with `count` (number of OSD docs) when they should have been summing `message_count`. Each parsedmarc OSD doc represents one (source_ip, auth_results) tuple from the XML and carries an integer message_count, so doc-counting reports "distinct sources" rather than "messages". Panels with titles like "Message volume by header from", "DMARC passage over time", etc. were producing misleading numbers. Affected panels: SPF/DKIM/Passed-DMARC pies; Reporting orgs; Sources by reverse DNS / header from / name+type / ASN / country / IP; Map; SPF and DKIM details. (DMARC failure email samples kept count — one OSD doc per RUF sample, so it's correct. SMTP TLS panels untouched — they sum the right session-count fields.) Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * Splunk: align dashboards with OSD and fix query bugs Aggregate dashboard: - Add "Message sources by Autonomous System" panel (source_asn / as_name / as_domain), formatted "AS<n>" at render with eval, matching the OSD addition. - DKIM details: add the missing dkim_aligned column. - SPF details: reorder columns to OSD order (spf_aligned at end). - Map / country titles renamed to match OSD ("Map of message sources by country", "Message sources by country"). - Map widget: stats count by Country -> stats sum(message_count) by Country, so the choropleth shades by message volume not record count. - fillnull "none"/"unknown" applied to source_reverse_dns, source_base_domain, source_country to mirror OSD's missing-bucket labels. - charting.fieldColors {true: green, false: red} on SPF/DKIM/Passed-DMARC pies and the DMARC-passage timechart. Forensic dashboard: - Restructure to match OSD's two-panel layout (markdown + samples table). - Drop the country map / IP table / country-ISO table panels (not in OSD). - Samples table columns aligned to OSD: arrival_date_utc, source.ip_address, from, subject, reply_to, authentication_results. - Tolerate null headers in the base_search filter (was: parsed_sample.headers.From=* required field to exist; LinkedIn RUF sample with null From was filtered out). SMTP TLS dashboard: - Reorder metrics to OSD order (successful before failed). - Domains panel: add policy_type bucket. - Failure details: replace search-time `failed_session_count>0` (which doesn't evaluate against multivalued JSON paths in Splunk) with `result_type=*` for presence + post-stats `where failed_sessions>0`. Drop _time/successful_sessions columns; reorder to match OSD. - Wire the existing policy_type input into all three searches. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * Add dashboard-dev bootstrap script and VSCode task dashboard-dev-bootstrap.sh brings up docker-compose.dashboard-dev.yml, seeds parsedmarc sample data into ES + OS + Splunk via parsedmarc-dev.ini, and re-imports every dashboard into Kibana, OpenSearch Dashboards, Grafana, and Splunk. Idempotent: existence checks skip provisioning that's already done; only the dashboard imports re-run unconditionally on every invocation (that's the point of running it after a dashboard edit). Notable provisioning quirks the script handles: - Splunk's auto-created HEC token (from the SPLUNK_HEC_TOKEN env) ships with indexes=[] and index=default; rewrites it to allow the email index. - ES 8.x rejects wildcard DELETEs by default; RESEED=1 enumerates daily parsedmarc indexes via _cat/indices and deletes one at a time. - Splunk has no clean-in-place REST endpoint for live indexes; RESEED=1 deletes and recreates the email index (then re-applies the HEC token). - OSD security plugin tenants: imports target global_tenant explicitly via the securitytenant header so they're visible to the shared workspace rather than landing in the API user's private tenant. Override with OSD_TENANT=<name>. - Splunk ships an in-product announcement view (scheduled_export_dashboard) with sharing=global; the script narrows it to sharing=app so it stops showing up in every app's dashboards list. Adds a "Dev Dashboard: Bootstrap" task to .vscode/tasks.json that runs the script. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * CHANGELOG: 9.10.3 entry for the dashboard metric fix and alignment work Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * Bump version to 9.10.3 Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * CHANGELOG: warn against the "Create new objects with unique IDs" import mode OSD's import dialog has two modes: the default "Check for existing objects" (which honors saved-object IDs and overwrites in place when "Automatically overwrite conflicts" is on) and "Create new objects with unique IDs" (which imports under fresh UUIDs and leaves the buggy originals untouched). Picking the second one means the dashboards keep rendering the wrong numbers because the originals are never replaced. Spell that out so users don't fall into the trap. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * OSD: label the metric column "messages" instead of "Sum of message_count" OSD's table column header defaults to "Sum of message_count" when the metric agg has no customLabel. "messages" reads better and matches what the panels are actually counting. Applies to all 15 aggregate-DMARC visualizations that use sum(message_count). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * CHANGELOG: tighten the 9.10.3 entry — clearer and more actionable Trim the verbose technical exposition; lead each fix with the user-visible symptom. Move the action-required call out to its own header in upgrade notes so the re-import instructions don't get lost in a wall of text. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * Move per-tool dashboard exports under a single dashboards/ directory Consolidates the four sibling top-level folders (kibana/, opensearch/, grafana/, splunk/) into dashboards/{kibana,opensearch,grafana,splunk}/. Updates the only path references in tracked files: bootstrap script (5 lines), CHANGELOG.md (1 line), and the kibana/export.ndjson raw URL in docs/source/elasticsearch.md. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * OSD: restore the "DKIM alignment" panel title on the aggregate dashboard The DKIM alignment panel had no title override in panelsJSON, so OSD fell back to the visualization's own name ("Aggregate DMARC DKIM alignment"). Every other pie/table on the same dashboard sets a clean title (SPF alignment, Passed DMARC, etc.) — this was a stray regression. Set the panel title to "DKIM alignment" to match. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * Splunk: color the message-disposition timechart by severity Reject is red, quarantine is yellow, none is green — same semantic mapping as the SPF/DKIM/Passed-DMARC pies and the DMARC-passage timechart, applied via charting.fieldColors. Matches OSD's existing color overrides on the equivalent viz. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * CHANGELOG: clarify that "Create new objects with unique IDs" is the default The OSD import dialog defaults to that mode — users have to actively switch away from it, not just avoid picking it. Reword the upgrade note to lead with the switch and explain why the default would silently preserve the bug. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> --------- Co-authored-by: Sean Whalen <seanthegeek@users.noreply.github.com> Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>9.10.3 |
||
|
|
8cc017fe84 |
ASN-domain coverage sweep #3: 516 new map entries (#735)
* Add Tier 0 to the verification triage: globally-known brand at primary domain In the previous ASN-domain coverage sweep, the agent ran web searches for entries like `bestbuy.com → Best Buy`, `ups.com → United Parcel Service`, `usps.gov → US Postal Service`, `marriott.com → Marriott`, `henkel.cn → Henkel`, `experian.com → Experian`, `jd.com → JD.com`, `ing.com → ING`, `verisign.com → Verisign`. For each of these the domain ↔ brand pairing is encyclopedic — same outcome a few seconds slower. The two-corroborating-sources rule (rule 8) was being applied mechanically: "MMDB as_name alone is one source, must fetch a second." But for globally-known brands at their primary domain, the brand identity itself is the second source. Searching for confirmation that Best Buy owns bestbuy.com is the kind of busywork the tier system exists to avoid. Adds Tier 0 with explicit guardrails — must be globally known (multinational or top-tier-national, decades-old, single canonical entity), must be the entity's primary marketing/corporate domain (not a tracking subdomain or regional ccTLD where ownership is non-obvious), and no recent acquisition/rebrand status in question. Cross-references the existing parent-too-generic sub-rule and warns against stretching to mid-size brands the agent happens to recognize. When in doubt: drop to Tier 3 and search. Also generalizes the section's lead from "redirect-target candidates" to cover MMDB coverage-gap and PSL private-domain candidates — the tier logic transfers cleanly across all three workflows. Updates the Tier 1 description with an explicit MMDB-coverage-gap analog. Refreshes the held-back-review split stat to 0 / 109 / 2 / 34 / 35 (Tier 0 didn't apply to that batch because every candidate was a redirect target that needed to inherit the *source row's* existing canonical name, not its own brand identity). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * ASN-domain coverage sweep #3: 516 new map entries Third pass against the IPinfo Lite MMDB coverage gap, processing the top ~500 unmapped as_domain entries by IPv4 weight after the prior two sweeps. Verifies each entry against AGENTS.md's tiered triage: - **Tier 0** (globally-known brand at primary domain, no search needed): Barclays, Liberty Mutual, Zurich Insurance, ABN AMRO, Swedbank, CIBC, Allstate, Julius Baer, MUFG, Travelers, USPS-Bank, ING, Florida Blue, AgriBank, Energy Transfer, FirstEnergy, Scania, Evonik, Merck KGaA, Agfa, Bosch, Iveco, Applied Materials, Micron, Andritz, Whirlpool, Leonardo, QinetiQ, Atlas Elektronik, Draper, Airbus, Jacobs Engineering, Teledyne, Dropbox, Autodesk, Wind River, Stratus, Unisys, ByteDance, Chevron, BBC, CDC, NEC, HPE, Kimberly-Clark, U.S. Bank, NATO, EUROCONTROL, Federal Reserve, NIST, NSF, DARPA, Library of Congress, IMF, FAO, IAEA, ITU, several US state/county/city governments, Australian state/federal departments, European national agencies, United Airlines, Alaska Airlines, Rakuten Mobile, Coles, Woolworths. - **Tier 1** (MMDB as_name lexically matches candidate domain, no search needed): ~150+ ISPs / hosters / cable TV operators where the as_name itself is the second corroborating source — major national/regional telcos (BTC Botswana, Uganda Telecom, ONE Albania, Tanzania Telecommunications, Kyrgyztelecom, Uzbektelekom, Telecom Algeria, MTN Rwanda, Vodacom Tanzania, Celcom Axiata, Triple T Broadcasting/Jasmine Thailand, MyRepublic Indonesia, Northwestel Canada, JT Jersey, Liberty Networks Colombia, ARLINK Argentina, Cable & Wireless Dominica, SETAR Aruba, AR Telecom Portugal), regional fiber providers (Trooli, Allied Telecom, OEC Fiber, Conexon Connect, Ben Lomand, Great Plains, BrightNet Oklahoma, All West, SDN, Tularosa, Blackfoot, Greeneville Energy, Avanti Broadband, Net at Once, Avanti, Aura Fiber, Stichting Breedband Delft), regional cable TV operators across Japan/Korea/Taiwan (Miyazaki Cable, Toyohashi Cable, Nagasaki Cable, Cable TV Toyama, Kurashiki Cable, Himeji Cable, Keumgang Cable Network), data center operators (eStruxture, PureVoltage, Hyonix, NovoServe, Voxility, Webzilla, Worldstream, Atman Poland, EO Data Center). - **Education** (TLD-restricted .edu / .ac.* / .edu.* — restriction is itself a corroborating source): 200+ universities and research institutions across US, Canada, Europe, Asia, and Australia, including Notre Dame, Washington State, U Texas Rio Grande Valley / Arlington / El Paso / San Antonio / Medical Branch, McMaster, U Ottawa, U Calgary, U Waterloo, Memorial U Newfoundland, U Auckland, U Otago, TU Munich, U Cologne, Goethe Frankfurt, Ruhr-Bochum, U Warwick, Chalmers, Lund, Gothenburg, Luleå, Osaka, Yonsei, Kasetsart, Pusan, Kuwait U, Aristotle Thessaloniki, Ł Tech U, Vienna U Economics, several Cancer Research Centers (MSKCC, Fred Hutchinson, MD Anderson, Cold Spring Harbor), national research institutes (KEK, IAEA, ITRI Taiwan, ETRI, IPM Iran, Smithsonian, ucar, Jefferson Lab, CSHL, mbari, Lam Research, Andritz Hydropower, sri.com, GSI Germany, Max Delbrück, jhuapl). - **Government** (.gov / .gov.* TLD-restricted, or as_name unambiguously names a government entity): NIST, NSF, NATO, DARPA, ITU, FAO, IAEA, IMF, US Centers for Disease Control, Federal Reserve, Library of Congress, Idaho/Chicago/King County/Pierce County/State of New York, Indianapolis, Tacoma, Fairfax County, Sweden's Vägverket and Forsakringskassan, Hessen GWDG, ANSTO Australia, South Florida Water Management District, Communications Research Centre Canada, Dataport Germany, Cenitex Victoria, EUROCONTROL. Skipped: Cox Enterprises (multi-product parent, no clean type fit), Tucows already added, sknt.ru already added, etc. Full triage shows 1 duplicate-skip from the apply pass. Sortlists.py runs cleanly. All 516 type values validate against base_reverse_dns_types.txt. No collisions. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> --------- Co-authored-by: Sean Whalen <seanthegeek@users.noreply.github.com> Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com> |
||
|
|
d6d50a45e5 |
Add Tier 0 to the verification triage: globally-known brand at primary domain (#734)
In the previous ASN-domain coverage sweep, the agent ran web searches for entries like `bestbuy.com → Best Buy`, `ups.com → United Parcel Service`, `usps.gov → US Postal Service`, `marriott.com → Marriott`, `henkel.cn → Henkel`, `experian.com → Experian`, `jd.com → JD.com`, `ing.com → ING`, `verisign.com → Verisign`. For each of these the domain ↔ brand pairing is encyclopedic — same outcome a few seconds slower. The two-corroborating-sources rule (rule 8) was being applied mechanically: "MMDB as_name alone is one source, must fetch a second." But for globally-known brands at their primary domain, the brand identity itself is the second source. Searching for confirmation that Best Buy owns bestbuy.com is the kind of busywork the tier system exists to avoid. Adds Tier 0 with explicit guardrails — must be globally known (multinational or top-tier-national, decades-old, single canonical entity), must be the entity's primary marketing/corporate domain (not a tracking subdomain or regional ccTLD where ownership is non-obvious), and no recent acquisition/rebrand status in question. Cross-references the existing parent-too-generic sub-rule and warns against stretching to mid-size brands the agent happens to recognize. When in doubt: drop to Tier 3 and search. Also generalizes the section's lead from "redirect-target candidates" to cover MMDB coverage-gap and PSL private-domain candidates — the tier logic transfers cleanly across all three workflows. Updates the Tier 1 description with an explicit MMDB-coverage-gap analog. Refreshes the held-back-review split stat to 0 / 109 / 2 / 34 / 35 (Tier 0 didn't apply to that batch because every candidate was a redirect target that needed to inherit the *source row's* existing canonical name, not its own brand identity). Co-authored-by: Sean Whalen <seanthegeek@users.noreply.github.com> Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com> |
||
|
|
6926e69d01 |
ASN-domain coverage sweep #2: 142 new map entries (#733)
* Add 105 ASN-domain coverage-gap entries (commercial brands, universities, ISPs) Sweeps the top ~250 unmapped as_domain entries from the IPinfo Lite MMDB by IPv4 weight. Three buckets: 1. Globally-known commercial brands where the as_name and the well-established public brand identity match (UPS, Best Buy, Marriott, ING, Raytheon, Henkel, Experian, Tucows, Verisign, JD.com, Newfold Digital — alias from enduranceinternational.com, Hyundai Home Shopping, Qihoo 360, Kingsoft, SIAC). 2. Accredited educational institutions where the .edu / .ac.* / .edu.* TLD restriction is itself a corroborating source alongside the MMDB as_name (Texas Tech, U Wyoming, U Alaska, Western Washington, U Guadalajara, UNC Greensboro, Northern Arizona, U Miami, Texas Tech HSC, U Hong Kong + 3 sister HK universities, U Melbourne, JAIST, Maria Curie-Sklodowska, DoDEA, Clark County School District, AIST, Japan Atomic Energy Agency, Connecticut State Colleges, Kennesaw State, RESTENA Luxembourg, NKN India). 3. Regional ISPs / MSPs / hosters verified per-case via web search for two-corroborating-sources confirmation: Spectranet (Nigeria), Brisanet (Brazil), Hondutel (Honduras), WestCall (Russia), AKADO Telecom (formerly Comcor), HT Eronet (Bosnia), Trooli (UK), Spitfire (UK), Intermax (Netherlands), Sogetel (Quebec), Synoptek, Union Wireless (Wyoming), Bigleaf Networks, OzarksGo (Arkansas), Acantho (Hera Group, Italy), Istekki (Finland), AIS Advanced Wireless Network (Thailand), CSI Piemonte, Baxet Group, Verixi (Belgium), SBA Edge, Iron Mountain Data Centers (formerly Web Werks India), CITIC Telecom CPC (acquired Linx Telecommunications), Optus (Singtel), Tele2 Kazakhstan, Movistar (Telefónica México), C Spire (Mississippi), Wananchi Group (Kenya), Asiatech, Respina, Fanap Telecom, Sabanet, Mobinnet, Pishgaman (Iran), Power Line Datacenter (HK), Airtek Solutions (Venezuela), Tata Teleservices, ParsOnline, WorldLink Communications (Nepal), Sarenet (Spain), CETIN (Serbia), IPKO (Kosovo), Sure (Channel Islands), Swoop (Australia), Deutsche Glasfaser, ePLDT, Epic (formerly Vodafone Malta), Tigo Bolivia, Multipolar Technology, Silversky, YOU Broadband (Vodafone Idea India). Also adds: - Government / civic: USPS, DC, City of Toronto, City of Boston, Canton of Bern, Networking Tasmania, St. Joseph's Health Care London, Enoch Pratt Free Library. - Logistics: UPS, JR East, Post Danmark. - MSP: Otsuka Corporation, ANS (UK). - IaaS: IABG Teleport. Skipped — single-source / parking / parent-too-generic concerns: globalcapacity.com (post-acquisition operator unclear), various opaque AS-id-named domains, cox enterprises (multi-product conglomerate, no clean type fit). Sortlists.py runs cleanly. All 105 type values validate against base_reverse_dns_types.txt. No collisions with existing map keys. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * Add 37 more ASN-domain coverage entries (Asian telcos, regional ISPs) Continues the coverage-gap sweep. Each entry verified per-case via web search for two-corroborating-sources confirmation (domain-WHOIS / homepage content + MMDB as_name + an established third-party directory like Wikipedia or industry trade press). - Major established brands: Fujitsu (web.ad.jp), True Corporation (Thailand), One NZ (formerly Vodafone NZ), Partners Telecom Colombia (formerly WOM), Angola Telecom, Gabon Telecom (Maroc Telecom subsidiary), Sony Global Solutions, BEKKOAME (now part of GMO Internet), CS Loxinfo (now AIS), National Telecom (Thailand, formerly CAT Telecom). - Regional cable / fiber operators in Japan (ZTV, Oita Cable Telecom, StarCat Cable Network, Community Network Center), Korea (Hyundai HCN, Areum Broadcasting Network), Taiwan (Peicity / TaipeiNet, Taiwan Optical Platform), China (Shaanxi Broadcast & TV, Qinghai Telecom under China Telecom umbrella, China Telecom Tianjin under same), Russia (Almatel, Seven Sky / Iskratelecom, Good Line / E-Light-Telecom in Kuzbass). - Other regional ISPs / hosters: Orange Jordan (go.com.jo via Jordan Telecom Group), FASTtelco (Kuwait), Cyberzone (Panama-based hosting), Moselle Télécom (French regional), Africa on Cloud (South African IaaS), Computer Engineering & Consulting (CEC, Japan MSP), Macquarie Government (Australian sovereign data centers), Meteverse (Canadian/Korean edge cloud), Ningxia West Cloud Data (operator of AWS China Ningxia region), 21Vianet (Chinese hosting), China Broadcasting Network, China Networks Inter-Exchange (CNIX). - Education: MANDA Darmstadt (TU Darmstadt + Hochschule Darmstadt shared MAN). Skipped — single source / ambiguous: globalcapacity.com (post-GTT- acquisition operator unclear), abcle.co.kr (single source, type unclear), dr.com.tr (Andromeda TV connection couldn't be confirmed). Sortlists.py runs cleanly. All type values validate against base_reverse_dns_types.txt. No collisions. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> --------- Co-authored-by: Sean Whalen <seanthegeek@users.noreply.github.com> Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com> |
||
|
|
e8f1525757 |
Full-map redirect-target alias sweep (#732)
* Full-map redirect-target alias sweep: 146 new aliases Follow-up to PR #730 — runs the same redirect-target-alias analysis against the entire current map (5,509 rows) instead of only the rows added in PR #729. The map predates this session by several years, so acquisitions and rebrands accumulated without paired aliases. Method: re-ran collect_domain_info.py against every existing map entry (via --map /tmp/nonexistent.csv to bypass the skip-already-mapped filter). For each row whose homepage's final_url base differs from the domain, classified the redirect target as a same-operator alias or a sister/placeholder/etTLD that should be skipped. Three confidence tiers from 334 raw redirect-mismatch candidates: - Multi-source (>=2 mapped domains redirect to the same target): 20 aliases, all auto-included. Notable: hatena.blog (6 src — Hatena blog platform's brand consolidation), vercel.com (4 src — now.sh, vercel.app, vercel.dev), mailchimp.com (3 src — Mailchimp's tracking domains), liquid.tech (3 src — Liquid Intelligent Technologies after Neotel acquisition), supabase.com, streamlit.io (Snowflake), xfinity .com (Comcast). - Single-source with lexical-token overlap between source brand and target host: 128 aliases. These are TLD/subdomain variants (ais.co .th -> ais.th, neubox.net -> neubox.com, duck.com -> duckduckgo.com) and obvious near-rebrands (slic.com -> slicfiber.com, soverin.net -> soverin.com). - Single-source with no token overlap: 180 candidates. Held back from auto-promotion because token-mismatched single-source redirects are the bucket where false positives concentrate (small-operator pages redirecting to unrelated portals). Surfaced separately in a PR comment for hand review — many are real acquisitions (messagelabs .com -> broadcom.com, cincinnatibell.com -> altafiber.com, sparkpostmail.com -> bird.com, modis.com -> akkodis.com) that just need a maintainer's eye to confirm before mapping. Manual overrides for 5 multi-source cases where the heuristic picked the wrong source row's (name, type): - ziggo.nl: chello.sk's UPC redirect was the case-2 sister-brand pattern AGENTS.md step 6 already calls out; the legitimate source is ziggozakelijk.nl. Mapped to Ziggo, ISP. - zetaglobal.com: source rows pointed at Sailthru and Selligent (both acquired by Zeta Global). Canonical -> Zeta Global, Marketing. - crisis24.com: source rows pointed at One Call Now and Topo.ai (both acquired by Crisis24). Canonical -> Crisis24, SaaS. - directnic.com: heuristic picked "Directnic.com" from one source's name string; aligned to "Directnic" (matches the dnchosting.com source's convention). - fortinet.com: source rows pointed at Fortinet FortiMail product and Perception Point (Fortinet acquisition). Canonical -> Fortinet, Email Security (parent brand). Two false positives skipped from auto-promotion after sampling: - aichi-colony.jp -> aichi.jp: a healthcare operator's homepage redirected to the Aichi prefecture government portal — different operator (case-2 sister-host equivalent). - illinois.net -> illinois.gov: Illinois Century Network (academic) is not the State of Illinois government. Cumulative map size: 5,509 -> 5,655 rows. MMDB IPv4 coverage stays at ~90.47% (these aliases are mostly non-as_domain hosts, so they don't move the IPv4 metric — the win is PTR-side attribution coverage when DMARC reports cite the redirect target's domain). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * Hand-review of held-back single-source aliases Adds 143 aliases from the held-back single-source-no-token-overlap list and updates 25 source rows to the post-rebrand brand name so both the source and alias rows resolve to the same canonical brand. Verification per case via public sources (acquisition press releases, rebrand announcements, official corporate documentation). Cases where the redirect target is a generic parent-company domain spanning many products were skipped — broadcom.com being the explicit exception where the alias uses the full product name "Broadcom Enterprise Messaging Security" so DMARC reports tagged with broadcom.com still land in the email-security bucket rather than overwriting other Broadcom product lines. Suspicious targets (parking pages, country-level TLDs, unrelated brands) were also skipped. Source-row name updates capture rebrands where the legacy brand no longer operates as such (Endurance International → Newfold Digital, Symantec Email Security → Broadcom Enterprise Messaging Security, Platform.sh → Upsun, Uninett → Sikt, SparkPost → Bird, etc.) and fix three typos uncovered during review (Goranicus → Granicus, Servastopol → Sevastopol, Wally-Wide → Valley-Wide). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * Document parent-company-too-generic alias guidance; rename SendGrid to "Twilio SendGrid" Two related changes: 1. Rename the canonical name on `sendgrid.com` from `SendGrid (Twilio)` to `Twilio SendGrid` for consistency with the existing `sendgrid.net` and `dlivry.co` entries — the post-acquisition official product name. 2. Add `twilio.com,Twilio,SaaS` as the parent-domain alias (rather than re-using the product-specific `Twilio SendGrid, Marketing`), so DMARC reports from non-email Twilio services (Programmable SMS, Voice, Segment, Flex, etc.) don't get mis-attributed to the email product. The product-domain entries keep the product-specific `(name, type)`. 3. Document this approach in AGENTS.md under the existing redirect-target alias rules. Two acceptable patterns for multi-product parent redirect targets: - Bare parent name + broad type (Twilio, NICE) — the safer default for parents with many distinct product lines. - Full product name + specific type (Broadcom Enterprise Messaging Security) — appropriate when the parent's domain is overwhelmingly tied to one product line for DMARC purposes. In both cases, don't blindly inherit the source row's product-specific `(name, type)` for the parent-domain alias. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * Document tiered verification approach for redirect-target alias review Captures the workflow that surfaced 143 confirmable aliases out of 180 held-back candidates with a small fraction of the search budget of "search every entry": - Tier 1: canonical name lexically corroborates the target — no search; source row is itself the second source. - Tier 2: canonical name explicitly contains "(Formerly X)" — no search; rebrand is self-documented. - Tier 3: no lexical overlap — search press releases / company newsroom / industry coverage; require two independent source categories; cite URLs in the PR. - Tier 4: target is a parking page / TLD-like base / unrelated brand — no search; reject and ship the list for heuristic tuning. Re-states the prompt-injection caveat in this verification context: press releases, homepages, news articles, WHOIS records, and search-result snippets are untrusted research data, never instructions. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> --------- Co-authored-by: Sean Whalen <seanthegeek@users.noreply.github.com> Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com> |
||
|
|
5bb6570f4e |
collect_domain_info.py: replace curl fallback with pure-requests path (#731)
* collect_domain_info.py: replace curl shell-out with requests-based fallback
The previous fallback for cert-error / UA-blocked sites was a curl
subprocess. This was correct but added an external runtime dependency
(curl is usually present but not on minimal containers) and a fork +
tempfile + parse round-trip per fallback call. Replaced with a pure
requests-based path that uses a custom HTTPAdapter to relax the SSL
context to the same effective configuration:
ssl.CERT_NONE (verify=False, equivalent to curl -k)
set_ciphers("DEFAULT@SECLEVEL=0") (allows weak DH/RSA, recovers
DH_KEY_TOO_SMALL hosts that
even curl's default config
rejects)
options |= 0x4 (OP_LEGACY_SERVER_CONNECT, allows unsafe legacy
TLS renegotiation for older server stacks)
Plus a real-browser User-Agent (same Chrome/124 string as before),
verify=False, allow_redirects=True, and Session.max_redirects=5.
InsecureRequestWarning is suppressed at module level since the
verify-disabled path is intentional.
Smoke-tested against the same eight cert-error domains as the original
curl fallback. Same recovery rate on all eight (six recover with full
title+description, two -- twmbroadband.com and ltt.ly -- remain
genuinely unreachable with both implementations). One additional win:
vnpt.com.vn (DH_KEY_TOO_SMALL) now recovers under the SECLEVEL=0
cipher list, which curl with default options did not. Happy-path
domains (google.com) still take the primary path and produce
identical output.
Side effects:
- removes the curl runtime dependency from collect_domain_info.py
- removes ~10ms of fork-and-parse overhead per fallback call
- removes the tempfile-on-disk round-trip; body is captured in-memory
- error suffix in the TSV's error column changes from "| curl: ..." to
"| fallback: ..."
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* Use getattr(ssl, "OP_LEGACY_SERVER_CONNECT", 0x4) instead of raw 0x4
Per PR review: prefer the constant where the interpreter exposes it
(Python 3.12+) and fall back to the raw value (0x4) only on older
interpreters that the project still supports. Self-documenting and
future-proof against any unlikely stdlib value reshuffle.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
---------
Co-authored-by: Sean Whalen <seanthegeek@users.noreply.github.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
||
|
|
ec2db7238e |
Map aliases for redirect targets + CC BY-SA 4.0 attribution (#730)
* README: declare base_reverse_dns_map.csv under CC BY-SA 4.0
The map is now a curated derivative of the bundled IPinfo Lite MMDB
(as_domain / as_name fields, walked for unmapped operators and
classified via the workflow in AGENTS.md). IPinfo Lite is licensed
under Creative Commons Attribution-ShareAlike 4.0, which propagates
to derivative works, so the CSV is distributed under CC BY-SA 4.0
with attribution to IPinfo for the underlying network identification
data.
Also updates the file-size estimate in the README from "over 1,400"
to "over 5,000" to reflect the current state.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* Alias redirect targets into the map and codify the practice in AGENTS.md
When a domain's homepage redirects to a different host *for the same
operator* (acquisition target's site, or a TLD/subdomain variant), PTR
reverse-DNS reports observed in the wild may reference either domain.
Mapping only the original loses attribution for the redirect target.
Adds 91 aliases discovered during the previous bulk PR's classification
work — every redirect target where the original was newly mapped, the
target wasn't already in the map, and the target was the same operator
(not a sister brand and not a placeholder/bot/parking page). Notable
examples: apogee.us + boldyn.com both -> Boldyn ISP; sungardas.com +
1111systems.com both -> 11:11 Systems MSP; vodafone.is + syn.is both
-> Sýn ISP; sendinblue.com + brevo.com both -> Brevo (Sendinblue)
Marketing; tigo.com + millicom.com both -> Tigo ISP; rockwellcollins.com
+ collinsaerospace.com both -> Collins Aerospace Defense.
Codifies the alias-target practice as a new paragraph under AGENTS.md
step 6 (the homepage-redirect disambiguation rule). Key guardrails:
- Alias only for case 1 (acquisition) and case 3 (TLD variant). Do
NOT alias for case 2 (sister brand / shared infra) -- aliasing the
redirect target there mis-attributes the redirect target's email.
Cited example: do not alias ziggo.nl to UPC after the chello.sk fix.
- Skip generic-placeholder, bot-management, and TLD/eTLD redirect
targets (example.com, perfdrive.com, umbler.com, co.uk, com.br...).
- When in doubt, drop the alias rather than commit it. A missing alias
is recoverable; a wrong one mis-attributes mail.
Also fixes four canonical-naming inconsistencies surfaced during the
brand-mismatch sweep, aligning recent additions to pre-existing entries:
- ga.gov: "Georgia Government" -> "State of Georgia" (matches existing
georgia.gov)
- goco.ca, radiant.net: "Telus" -> "TELUS" (matches existing telus.com)
- vee.com.tw: "VeeTime" -> "VeeTIME" (matches existing veetime.com)
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* Promote 21 inbound-redirect aliases from KU to map
Sweeping the session's collector TSVs for the inverse pattern of the
91 outbound aliases in commit
|
||
|
|
851560a9b1 |
Bulk reverse-DNS map coverage: top-500 ASN audit + KU re-research + curl fallback (#729)
* collect_domain_info.py: add curl fallback for blocked/broken fetches Many sites that returned no usable homepage under the original requests fetch turned out to be soft-failures: misconfigured TLS certs (self-signed, hostname mismatch, weak chain), 403/captcha pages from User-Agent-based bot filters, or redirect chains the requests stack rejected. None of those recover under a single retry with the same client config. This wires a curl fallback into _fetch_homepage that triggers when the primary attempt errors or returns a non-2xx status. Curl runs with -k (skip TLS verify), -L (follow redirects), --max-time bound, and a real-browser User-Agent string -- enough to clear the common UA-block and bad-cert classes of failure that small ISPs and regional telcos routinely ship. A 2xx-with-empty-head response is left alone (parked pages do not improve on retry). When both attempts fail, the error column carries both signatures so it is obvious that the fallback was tried. Smoke-tested against eight previously-failed cert-error domains: six recovered full title/description (as1101.net, citictel-cpc.com, xtrim.com.ec, etecsa.cu, zillion.network, sandia.gov), two remained genuinely unreachable. Happy-path domains take the primary path unchanged. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * Bulk reverse-DNS map coverage: top-500 ASN audit + KU re-research Two passes against the bundled IPinfo Lite MMDB and the existing known-unknown list, both classified under the two-corroborating-sources rule (AGENTS.md): 1. Top-500 unmapped ASN-domain audit. Walked every record in ipinfo_lite.mmdb to find as_domain values not yet in the map, ranked by routed IPv4 count, took the top 500 (>= ~/15 each), and ran them through collect_domain_info.py. Yield: 435 new map rows from operators with two or more independent corroborating sources; 65 entries to known-unknown for operators where homepage and WHOIS were both unavailable from the test environment. Recovered domains span ISPs, web hosts, IaaS/MSP/MSSP, education networks, government agencies, and a long tail of major industrials. 2. Full re-research of the existing 3,606-entry known-unknown file using the new curl fallback (separate commit). The fallback recovered homepage content for 1,686 of 3,670 (45.9%) previously dark domains. Of those, 770 had a corroborating WHOIS or as_name alongside; 508 cleared the strict service-category test and were promoted out of known-unknown into the map. The remaining 262 recovered titles were brand-only / login-portal / under-construction pages where service category could not be assigned with confidence. Also removed a stale "#name?" Excel auto-correction artifact from the known-unknown file (it would never have matched any real reverse-DNS base domain). Cumulative result: base_reverse_dns_map.csv 3,946 -> 4,889 rows (+943, +23.9%); known_unknown_base_reverse_dns.txt 3,606 -> 3,162 (-444 net after both batches plus the artifact). Every promotion has two independent sources for the operator's identity and a homepage or MMDB-as_name signal sufficient to assign a service type. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * Fix chello.sk classification: UPC, not Liberty Global The original classification aliased chello.sk to "Liberty Global" based on the IP-WHOIS netname (LGI-INFRASTRUCTURE) plus a stale homepage redirect to ziggo.nl that the collector observed at fetch time. This broke the AGENTS.md rule that IP-WHOIS only counts as a corroborating source when the domain name matches the netname -- "chello" does not match "LGI", so the IP-WHOIS should not have been treated as a source. The WHOIS was unambiguous: UPC BROADBAND SLOVAKIA, s.r.o. UPC retains its consumer brand in Slovakia (unlike Ireland, where upc.ie was rebranded as Virgin Media Ireland in the existing map). Reverting to the operator brand per WHOIS. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * Fix vodafone.is classification: Sýn, not Vodafone Same pattern as the chello.sk fix in the previous commit: the historic brand recorded in the MMDB as_name (Vodafone Iceland) is no longer the operator. Sýn acquired Vodafone Iceland's operations and the homepage redirects to syn.is, presenting Vodafone only as a partner relationship rather than an active sub-brand. Following the upc.ie -> Virgin Media Ireland precedent for rebranded markets, the canonical attribution is the current operator. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * AGENTS.md: codify the homepage-redirect disambiguation rule Three classification mistakes during the bulk batch (chello.sk, vodafone.is, telia.dk, apogee.us) all came from the same gap in the workflow: when a homepage's final URL is a different host from the domain being classified, the right brand depends on the *relationship* between the two domains, not on the WHOIS or as_name in isolation. Adds a new step 6 to the unknown-domain classification workflow that spells out the three patterns and the disambiguator: - Acquisition / rebrand: the homepage shows the acquiring operator's marketing site. Use the new operator. MMDB as_name and IP-WHOIS netname are commonly stale for years post-acquisition; do not let them override an unambiguous current-operator homepage. - Sister brand / shared infrastructure: the homepage redirects to a *sibling* brand under the same parent group, but the WHOIS for the original domain still names a *specific* current operator. Use the WHOIS operator, not the redirect target. Canonical cautionary tale: chello.sk (WHOIS: UPC BROADBAND SLOVAKIA) was originally classified as Liberty Global because the homepage redirected to ziggo.nl (a sibling Liberty Global brand). The right answer was UPC. - TLD or subdomain variant: same operator, different domain. Trivial. Renumbers the remaining steps. The IP-WHOIS rule (step 5) and the two-source rule (now step 8) are unchanged but cross-referenced. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * Apply homepage-redirect rule to telia.dk and apogee.us Same pattern as chello.sk and vodafone.is in earlier commits — the historic operator name in the MMDB as_name and WHOIS does not reflect who actually runs the IPs after an acquisition. The homepage redirect is the current ground truth. - telia.dk -> Norlys: Norlys acquired Telia Denmark; homepage now redirects to shop.norlys.dk and presents Norlys throughout. - apogee.us -> Boldyn: Boldyn acquired Apogee Telecom; homepage now redirects to boldyn.com and shows the Boldyn marketing site for higher-education managed services. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * Bulk reverse-DNS map coverage: next-500 unmapped ASN-domain audit Same workflow as the first top-500 batch in this branch, applied to the next tier of unmapped MMDB as_domain values (ranked 501..1000 by routed IPv4 count, each ~/15 to /14.5). Pre-screened against the current state of base_reverse_dns_map.csv and known_unknown_base_reverse_dns.txt. Yield: 414 newly-classified map entries + 86 known-unknown additions. Type breakdown skews ISP-heavy as expected at this scale, with strong representation from Education (universities now reaching deeper into the long tail), Government (state/county/national agencies), Web Host (regional hosting providers), and IaaS (mid-market cloud). Applied AGENTS.md step 6 (homepage-redirect disambiguation) on every case where the homepage's final_url crossed hosts: kept new operator when the redirect target was an acquiring brand (e.g. atlanticmetro.net -> 365 Data Centers, performive.com -> CloudFirst, fasternet.com.br -> Desktop, eatel.com -> REV, blic.net -> Supernova, dimensiondata.com -> NTT DATA, virtela.net -> NTT Communications), used WHOIS operator when the redirect was sister-brand or shared infra, used the same operator when the redirect was a TLD/subdomain variant. Coverage delta: 88.89% -> 90.40% of MMDB IPv4 (+1.51 pp, ~47M IPv4). Cumulative for this PR: 85.10% -> 90.40% (+5.30 pp, ~165M IPv4). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * Reclassify the 262 left-dark KU re-research candidates with relaxed heuristic Of the 770 two-source candidates from the curl-fallback KU re-research pass earlier in this branch, 262 had homepage content and a corroborating WHOIS/as_name but were left in known-unknown because the homepage was brand-only or a login portal that didn't directly describe service category. Relaxing the heuristic on a re-pass: when the WHOIS legal name itself contains a regulated-telecom keyword (TELECOM, TELECOMUNICAÇÕES, INTERNET, FIBRA, BROADBAND, PROVEDOR DE INTERNET, NET TELECOM), that *is* a service-category source -- in Brazil, Argentina, Chile, and peers, operators must register under specific legal naming and the registration is a regulator-vetted signal. Combined with two-source identity, that clears the bar without forcing the homepage to also spell out the service. Same goes for brand-name-as-service signals: "X Server Limited" with a customer-portal homepage and matching WHOIS reasonably maps to Web Host; "X Fiber" + matching as_name maps to ISP. These are what readers would naturally infer from the operator's own self-naming. Yield: 95 promotions out of 262 (36% of the left-dark subset). The remaining 167 stay in known-unknown because the homepage was a generic placeholder ("Index of /", "Coming Soon", default Apache page), the brand on the homepage didn't match the WHOIS, the operator was clearly a non-telecom (e.g. INPASUPRI = supplies for IT, malugainfor = Comércio de Produtos de Informática, hugel = pharma), or the service category was genuinely ambiguous. MMDB IPv4 delta is small (+0.03 pp, +888K IPv4) since most of these are long-tail operators with low or zero MMDB footprint -- the value is in PTR-side attribution coverage when these brands appear in actual reverse-DNS reports. Cumulative for this PR: map 4,889 -> 5,398 rows; KU 3,162 -> 3,153 lines; MMDB IPv4 coverage 88.89% -> 90.42% (+1.53 pp from the next-500 batch plus this re-pass). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> --------- Co-authored-by: Sean Whalen <seanthegeek@users.noreply.github.com> Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com> |
||
|
|
b3a608735f | Revise classification guidelines to enforce two-corroborating-sources rule and clarify handling of unidentified domains | ||
|
|
d04eb89035 | Clarify handling of TLS errors and user network issues in classification guidelines | ||
|
|
55a1e79066 | Add kamatera.com entry to base_reverse_dns_map | ||
|
|
c87aa3de08 |
fixes the incomplete changing of the headers in the SMTP TLS Reporting dashboard visualizations to match the rest of the project (lowercase words separated by _
|
||
|
|
85554c2344 |
OpenSearch Dashboards: Restructure SMTP TLS dashboard to match Splunk layout (#728)
The bundled `splunk/smtp_tls_dashboard.xml` is three tables — Reporting
organizations, Domains, Failure details — sharing the same TLS-RPT data.
The OSD dashboard had drifted into five panels (two pies + three tables)
that didn't line up with what the Splunk one shows. Replace them with
three `data_table` viz mirroring the Splunk layout.
Each table uses sum-only metric aggs (no count column) on the per-policy
or per-failure-detail session-count fields. OSD's Visualize agg pipeline
auto-wraps each terms/sum on a `policies.*` or `policies.failure_details.*`
field in the right `nested:{path: …}` agg, so per-policy and per-detail
totals come out correctly without any schema or write-path changes.
Reuse the existing IDs of the three drop-in replacements so re-importing
overwrites in place:
- 4f3b4cb0… (was "TLSRPT reporting organizations") → "Reporting organizations"
- eeb47eb0… (was "TLSRPT policies by domain") → "Domains"
- 5cbcd040… (was "SMTP TLS failures") → "Failure details"
The two pie-chart viz removed by this change have no equivalent in the
new layout. Upgraders will need to delete the orphans manually from OSD's
Saved Objects management page:
- 25f321e0-26d0-11f1-96a6-fb3734bd0b21 ("SMTP TLS sessions")
- 12065020-26d1-11f1-96a6-fb3734bd0b21 ("TLSRPT policies")
Co-authored-by: Sean Whalen <seanthegeek@users.noreply.github.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
||
|
|
342b467590 |
Mark maildir messages as read after they are read (#726)
MaildirConnection.fetch_message() previously returned the message body without touching the on-disk file, so messages stayed in new/ with no "S" (Seen) flag and any MUA scanning the same maildir kept showing them as unread. The call site now passes mark_read=not test (mirroring the existing MSGraphConnection plumbing); on True, the message is moved to cur/ and gains the S flag. Test mode leaves the maildir unmodified. Co-authored-by: Sean Whalen <seanthegeek@users.noreply.github.com>9.10.2 |
||
|
|
adf36ca6a3 | Add bluevps.com entry to base_reverse_dns_map | ||
|
|
81a0d4ce56 | Add additional entries for 3z.net and 3zden.cloud to base_reverse_dns_map | ||
|
|
a4a2155ab0 | OpenSearch Dashboards: Show rows in the Message sources by Autonomous System viz even if some fields are missing | ||
|
|
168244af95 |
Add Message sources by Autonomous System to Opensearch Dashboards (#725)
Co-authored-by: Sean Whalen <seanthegeek@users.noreply.github.com> |
||
|
|
c989f27983 |
Add six base_reverse_dns_map entries from MMDB coverage-gap analysis (#722)
* Cover ASN-fallback path for the Evolus operator family Only evolus-ix.com (the Internet Exchange product) was in the map, so ASN-fallback lookups for IPs without PTR fell through to the raw as_name string with no service type. The bundled IPinfo Lite MMDB stores the same operator's blocks under two other as_domain values: - evolus-it.com (the corporate domain, Evolus IT Solutions GmbH) - evolusfibre.com (their consumer fiber ISP brand) Both resolve to as_name "Evolus IT Solutions GmbH" in the MMDB, confirming they're the same operator. WHOIS on evolus-it.com and the evolusfibre.com homepage both pin the company to Austria. Added both as aliases pointing at the existing (Evolus IX, ISP) entry so all three product brands cluster under one display name, matching the comcast.net / comcast.com pattern documented in AGENTS.md. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * Add aliases for centrilogic, 1gservers, etherni, globconnex Four additional ASN-domain aliases discovered via coverage-gap analysis against the bundled IPinfo Lite MMDB. None of the four brands are currently represented in the map under any key, so these are new brand entries (not alias-of-existing). - centrilogic.com → Centrilogic, MSP 82 MMDB nets, ~62K IPv4. Homepage describes the company as an "end-to-end I.T. transformation" managed-services provider. - 1gservers.com → 1GServers, Web Host 117 nets, ~23K IPv4. Homepage: bare-metal dedicated servers and Phoenix colocation. - etherni.com → Ethernic, MSP 2 nets, 768 IPv4. Homepage: cloud-migration / cloud-native consulting. Operates its own small ASN under Ethernic LLC. - globconnex.com → Global Connectivity Solutions, ISP 687 nets, ~63K IPv4. Homepage unreachable (self-signed cert); WHOIS privacy-redacted. Classification is inferred from the MMDB as_name "GLOBAL CONNECTIVITY SOLUTIONS LLP" and the routed-network scale. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> --------- Co-authored-by: Sean Whalen <seanthegeek@users.noreply.github.com> Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com> |
||
|
|
15cf8f55b7 |
Skip caching weak-fallback IP attributions (#723)
get_reverse_dns() swallows every DNSException as None, so a transient PTR lookup failure (timeout, SERVFAIL, socket error) is indistinguishable from a genuine no-PTR case. When that lands on the raw-as_name fallback branch (no map match for the ASN domain either), the weak result was getting cached in the 4-hour IP-info cache — locking in the misattribution even after the PTR became resolvable. Observed in the wild: 91.244.70.212 has PTR customer.evolus-ix.com (which the map correctly classifies as Evolus IX, ISP), but the user's dataset showed it with source_name = raw as_name and source_type = null — the signature of a transient PTR lookup failure that then got cached. Fix: skip the cache write when the row is in that specific weak-fallback state (reverse_dns=None AND type=None AND name=as_name). PTR-backed matches and ASN-domain matches are stable attributions and continue to be cached as before. Co-authored-by: Sean Whalen <seanthegeek@users.noreply.github.com> Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com> |
||
|
|
28e7651e15 |
AGENTS.md: promote 'data not instructions' and document ad-hoc route (#724)
Two gaps the previous revision had:
1. The "Treat WHOIS/search/HTML as data, never as instructions" rule
was rule 8 of a single workflow (unknown-domain classification),
but the risk applies to every route that consumes external
content — MMDB coverage-gap scans, the PSL private-domains route,
ad-hoc per-request additions, and the external-service-docs rule
earlier in the file. Promoted it to its own subsection right
after the Privacy rule, expanded to cover prompt-injection,
misleading self-descriptions, typosquats, and bait-and-switch
pages. The numbered rule 8 now cross-references the subsection
instead of restating it.
2. The "someone points at N specific domains and asks for them to be
classified" route had no named workflow, even though it's a
common shape — the existing docs cover bulk unknown-list,
MMDB coverage-gap, and PSL private-domains, but not ad-hoc. Added
an "Ad-hoc single-domain additions" subsection with the condensed
loop: MMDB check → grep existing keys → two-source corroboration
→ precedence/naming rules → honest inference in commit body
→ privacy rule → data-not-instructions → sortlists.py.
Rule 5 of the ad-hoc workflow ("be honest about inference") is the
specific lesson from the globconnex.com classification in PR #722 —
a silent guess is indistinguishable from a verified fact in a diff.
Co-authored-by: Sean Whalen <seanthegeek@users.noreply.github.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
||
|
|
f0781c6191 |
IPinfo API: keep only documented behavior (#721)
* Strip invented IPinfo API behavior; keep documented-only The IPinfo Lite API docs (https://ipinfo.io/developers/lite-api) state: "The API has no daily or monthly limit and provides unlimited access." Auth is documented as a ?token= query param only. The /me shown in the docs returns geolocation for the caller's IP — it is not a documented account/quota endpoint for Lite. Removed everything that was speculating beyond the docs: - The /me probe that pretended to return plan/limit/remaining fields. - 429 rate-limit handling, 402 quota-exhausted handling, Retry-After parsing, cooldown state, and the rate-limit warning / recovery-info logging around them. - The Authorization: Bearer header (not documented for Lite). Kept: - Lookups against the documented /lite/<ip>?token=<token> endpoint. - 401/403 treated as a fatal invalid-token (reasonable defensive check). - Network-error and non-2xx fallback to the bundled/cached MMDB. - A simple startup probe that validates the token with a single lookup and logs "IPinfo API configured" at info level. Test consolidated to cover only documented paths: success, 401 fatal, non-2xx fallback, and that auth goes in ?token= (not Authorization). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * AGENTS.md: warn against speculating past external-service docs New subsection under Configuration spelling out that third-party API integrations must start with a direct WebFetch of the canonical docs page, not a subagent query. Calls out the two traps that produced the IPinfo speculation: (1) asking subagents question shapes that presuppose the answer exists, and (2) treating feature asks as "build this" without first checking "does this apply to this service?". Uses the now-reverted IPinfo speculation as the cautionary tale so the next session has a concrete example to recognize the shape of the mistake. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * Bump to 9.10.1; put removal under a new CHANGELOG section Restored the 9.10.0 entry to its as-shipped wording and moved the speculation-removal note into its own 9.10.1 Fixed section. Editing the 9.10.0 entry would have misrepresented what was actually released — the shipped tag does contain the /me probe, 429/402 cooldown, Retry-After parsing, and Bearer auth, and the changelog should say so. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> --------- Co-authored-by: Sean Whalen <seanthegeek@users.noreply.github.com> Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>9.10.1 |
||
|
|
9d1152d4f8 |
chore: update IPinfo Lite MMDB (#720)
Co-authored-by: seanthegeek <44679+seanthegeek@users.noreply.github.com> |
||
|
|
f0f377311e |
Rename asn_name/asn_domain to as_name/as_domain (#719)
Match the IPinfo Lite MMDB's native field names across the output schemas — JSON source records now emit asn, as_name, as_domain, and CSV / Elasticsearch / OpenSearch / Splunk integrations now emit source_asn, source_as_name, source_as_domain. The integer asn / source_asn field is unchanged. Co-authored-by: Sean Whalen <seanthegeek@users.noreply.github.com> Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>9.10.0 |
||
|
|
5785cb2072 |
Add weekly workflow to refresh the bundled IPinfo Lite MMDB (#718)
Runs Mondays at 06:00 UTC (and on workflow_dispatch), downloads the latest MMDB using an IPINFO_TOKEN secret, validates it with a sample lookup, and opens a PR if the file changed. Co-authored-by: Sean Whalen <seanthegeek@users.noreply.github.com> Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com> |
||
|
|
c5f432c460 |
Add optional IPinfo Lite REST API with MMDB fallback (#717)
* Add optional IPinfo Lite REST API with MMDB fallback
Configure [general] ipinfo_api_token (or PARSEDMARC_GENERAL_IPINFO_API_TOKEN)
and every IP lookup hits https://api.ipinfo.io/lite/<ip> first for fresh
country + ASN data. On HTTP 429 (rate-limit) or 402 (quota), the API is
disabled for the rest of the run and lookups fall through to the bundled /
cached MMDB; transient network errors fall through per-request without
disabling the API. An invalid token (401/403) raises InvalidIPinfoAPIKey,
which the CLI catches and exits fatally — including at startup via a probe
lookup so operators notice misconfiguration immediately. Added
ipinfo_api_url as a base-URL override for mirrors or proxies.
The API token is never logged. A new _normalize_ip_record() helper is
shared between the API path and the MMDB path so both paths produce the
same normalized shape (country code, asn int, asn_name, asn_domain).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* IPinfo API: cool down and retry instead of permanent disable
Previously a single 429 or 402 disabled the API for the whole run. Now
each event sets a cooldown (using Retry-After when present, defaulting to
5 minutes for rate limits and 1 hour for quota exhaustion). Once the
cooldown expires the next lookup retries; a successful retry logs
"IPinfo API recovered" once at info level so operators can see service
came back. Repeat rate-limit responses after the first event stay at
debug to avoid log spam.
Test now targets parsedmarc.log (the actual emitting logger) instead of
the parsedmarc parent — cli._main() sets the child's level to ERROR,
and assertLogs on the parent can't see warnings filtered before
propagation. Test also exercises the cooldown-then-recovery path.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* IPinfo API: log plan and quota from /me at startup
Configure-time probe now hits https://ipinfo.io/me first. That endpoint
is documented as quota-free and doubles as a free-of-quota token check,
so we use it to both validate the token and surface plan / month-to-date
usage / remaining-quota numbers at info level:
IPinfo API configured — plan: Lite, usage: 12345/50000 this month, 37655 remaining
Field names in /me have drifted across IPinfo plan generations, so the
summary formatter probes a few aliases before giving up. If /me is
unreachable (custom mirror behind ipinfo_api_url, network error) we
fall back to the original 1.1.1.1 lookup probe, which still validates
the token and logs a generic "configured" message.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* Drop speculative ipinfo_api_url override
It was added mirroring ip_db_url, but the two serve different needs.
ip_db_url has a real use (internal hosting of the MMDB); an
authenticated IPinfo API isn't something anyone mirrors, and /me was
always hardcoded anyway, making the override half-baked. YAGNI.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* AGENTS.md: warn against speculative config options
New section under Configuration spelling out that every option is
permanent surface area and must come from a real user need rather than
pattern-matching a nearby option. Cites the removed ipinfo_api_url as
the canonical cautionary tale so the next session doesn't reintroduce
it, and calls out "override the base URL" / "configurable retries" as
common YAGNI traps.
Also requires that new options land fully wired in one PR (INI schema,
_parse_config, Namespace defaults, docs, SIGHUP-reload path) rather
than half-implemented.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* Rename [general] ip_db_url to ipinfo_url
The bundled MMDB is specifically IPinfo Lite, so the option name
should say so. ip_db_url stays accepted as a deprecated alias and
logs a warning when used; env-var equivalents accept either spelling
via the existing PARSEDMARC_{SECTION}_{KEY} machinery.
Updated the AGENTS.md cautionary tale to refer to ipinfo_url (with
the note about the alias) so the anti-pattern example still reads
correctly post-rename.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* Fix testPSLDownload to reflect .akamaiedge.net override
PSL carries c.akamaiedge.net as a public suffix, but
psl_overrides.txt intentionally folds .akamaiedge.net so every
Akamai CDN-customer PTR (the aXXXX-XX.cXXXXX.akamaiedge.net pattern)
clusters under one akamaiedge.net display key. The override was added
in
|
||
|
|
2978436d89 |
Expand reverse-DNS map and PSL overrides from the live PSL (#716)
* Expand reverse-DNS map and PSL overrides from the live PSL Parses the private-domains section of the live Public Suffix List and adds 269 brand-owned suffixes as PSL overrides paired with map entries, so customer subdomains on shared hosting / SaaS / PaaS platforms fold to the operator's brand. Adds 33 ASN-domain entries for the subset of these brands whose IP space is registered under a different corporate domain in the MMDB, so both the PTR-derived lookup and the ASN-fallback lookup hit the same (name, type). Also normalizes ``a2hosting.com`` from ``A2Hosting`` to ``A2 Hosting`` for spelling consistency. PTR-path wins (overrides + map entries) - Web hosts: A2 Hosting, alwaysdata, Antagonist, Beget, bplaced, Bytemark, Combell, cyber_Folks, cyon, DreamHost, EasyWP, Gehirn, HelioHost, home.pl, HostyHosting, Hypernode, IONOS (6 suffixes), Jotelulu, JouwWeb, KaasHosting, Keyweb, LCube, LiquidNet, McHost, Memset, Mittwald, Mythic Beasts, NearlyFreeSpeech, Nimbus Hosting, One.com (20 ccTLD variants), OwnProvider, Pantheon, Planet-Work, prgmr, Rackmaze, Rad Web Hosting, Raidboxes, Servebolt, SpeedPartner, Uberspace, Whatbox, WP Engine, ZAP-Hosting, Zitcom. - Dynamic DNS: DuckDNS, DynDNS (24), No-IP (22), Now-DNS, dynv6, freemyip, nsupdate.info, ddnss.de, GoIP, DrayTek. - PaaS/SaaS/IaaS: Netlify, Vercel (6), Heroku, fly.io, Render, Firebase/GCP (4), Azure (5), AWS (4), DigitalOcean (2), Red Hat OpenShift, Hasura, Supabase, Snowflake/Streamlit, Read the Docs, PythonAnywhere, GitHub, GitLab, Adobe Magento. - Hosted sites/stores: Hatena (6), Notion, Figma, Webflow, Wix (4), Shopify, Shopware, Sellfy, Spreadshop (19 ccTLDs), Datto. - Email/Marketing: Fastmail, ActiveTrail, Leadpages, Heyflow, Carrd, Typeform. - CDN/Technology: Akamai (7), Fastly (3), Yandex Cloud. ASN-path wins (MMDB coverage now attributes 1,184,256 more IPv4 addresses to a named brand, 85.04% -> 85.08%): yandex.com, ya.ru, hosting.com (A2 Hosting), beget.com, cyberfolks.pl, fly.io, bytemark.co.uk, cyberfolks.ro, keyweb.de, mittwald.de, memset.com, zap-hosting.com, datto.com, jotelulu.com, yandex.cloud, github.com, asavie.com (Akamai), and 16 others. Entries are curated from the live PSL rather than any bundled copy; brand / as_name attribution was verified against the CLAUDE.md rule that the IP-WHOIS signal is only trusted when the domain name itself matches the host's name (name-collisions in MMDB were skipped — Hypernode AU, goipgroup.com, liquidnet.com, One.com substring noise, nimbusitsolutions.com, etc.). Types follow ``base_reverse_dns_types.txt``; ``sortlists.py`` re-sorts + dedupes + validates after the batch. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * Document PSL-derived override workflow and load_psl_overrides gotcha Adds three pieces of map-maintenance context learned while building this PR: - New subsection "Discovering overrides from the live PSL private-domains section" — distinct source from live DMARC data (unknown_base_reverse_dns.csv) and MMDB coverage-gap analysis. The private section is itself a list of brand-owned suffixes; each is a candidate (psl_override + map entry) pair. Emphasizes ruthless selectivity — most of the 600+ private-section orgs are dev sandboxes or hobby zones that will never appear in DMARC reports. - Two-path coverage as a single linked step, not two round-trips: when adding a PSL override for a hosted-content suffix (netlify.app), also add a map row for the brand's corporate as_domain (netlify.com) in the same pass. The override fixes the PTR path; the ASN-domain alias fixes the ASN-fallback path. - The load_psl_overrides() fetch-first gotcha. The no-arg form pulls the file from master on GitHub, so end-to-end testing of local overrides silently uses the old remote version. offline=True is required to test local changes against get_base_domain(). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> --------- Co-authored-by: Sean Whalen <seanthegeek@users.noreply.github.com> Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com> |
||
|
|
2cda5bf59b |
Surface ASN info and use it for source attribution when a PTR is absent (#715)
* Surface ASN info and fall back to it when a PTR is absent Adds three new fields to every IP source record — ``asn`` (integer, e.g. 15169), ``asn_name`` (``"Google LLC"``), ``asn_domain`` (``"google.com"``) — sourced from the bundled IPinfo Lite MMDB. These flow through to CSV, JSON, Elasticsearch, OpenSearch, and Splunk outputs as ``source_asn``, ``source_asn_name``, ``source_asn_domain``. More importantly: when an IP has no reverse DNS (common for many large senders), source attribution now falls back to the ASN domain as a lookup key into the same ``reverse_dns_map``. Thanks to #712 and #714, ~85% of routed IPv4 space now has an ``as_domain`` that hits the map, so rows that were previously unattributable now get a ``source_name``/``source_type`` derived from the ASN. When the ASN domain misses the map, the raw AS name is used as ``source_name`` with ``source_type`` left null — still better than nothing. Crucially, ``source_reverse_dns`` and ``source_base_domain`` remain null on ASN-derived rows, so downstream consumers can still tell a PTR-resolved attribution apart from an ASN-derived one. ASN is stored as an integer at the schema level (Elasticsearch / OpenSearch mappings use ``Integer``) so consumers can do range queries and numeric sorts; dashboards can prepend ``AS`` at display time. The MMDB reader normalizes both IPinfo's ``"AS15169"`` string and MaxMind's ``autonomous_system_number`` int to the same int form. Also fixes a pre-existing caching bug in ``get_ip_address_info``: entries without reverse DNS were never written to the IP-info cache, so every no-PTR IP re-did the MMDB read and DNS attempt on every call. The cache write is now unconditional. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * Bump to 9.9.0 and document the ASN fallback work Updates the changelog with a 9.9.0 entry covering the ASN-domain aliases (#712, #714), map-maintenance tooling fixes (#713), and the ASN-fallback source attribution added in this branch. Extends AGENTS.md to explain that ``base_reverse_dns_map.csv`` is now a mixed-namespace map (rDNS bases alongside ASN domains) and adds a short recipe for finding high-value ASN-domain misses against the bundled MMDB, so future contributors know where the map's second lookup path comes from. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * Document project conventions previously held only in agent memory Promotes four conventions out of per-agent memory and into AGENTS.md so every contributor — human or agent — works from the same baseline: - Run ruff check + format before committing (Code Style). - Store natively numeric values as numbers, not pre-formatted strings (e.g. ASN as int 15169, not "AS15169"; ES/OS mappings as Integer) (Code Style). - Before rewriting a tracked list/data file from freshly-generated content, verify the existing content via git — these files accumulate manually-curated entries across sessions (Editing tracked data files). - A release isn't done until hatch-built sdist + wheel are attached to the GitHub release page; full 8-step sequence documented (Releases). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> --------- Co-authored-by: Sean Whalen <seanthegeek@users.noreply.github.com> Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>9.9.0 |
||
|
|
c2678f8e21 |
Add second-pass ASN-domain aliases for the top remaining misses (#714)
Adds 43 more high-confidence aliases from the top IPv4-weighted misses remaining after #712. Bumps ASN-domain coverage of the bundled ipinfo lite MMDB from 84.0% to 85.0% — modest, as expected; the tail is a long list of small ASNs where diminishing returns kick in hard. This is the last bulk alias pass. Any remaining gap should be filled by falling back to the raw `as_name` from the MMDB at attribution time, not by continuing to hand-classify thousands of small ASNs. Also promotes nask.pl out of known_unknown_base_reverse_dns.txt — NASK is the Polish national research and academic network, which is unambiguous from ASN context. Co-authored-by: Sean Whalen <seanthegeek@users.noreply.github.com> Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com> |
||
|
|
35dda7c0a6 |
Fix map-maintenance tooling and stale classifications (#713)
sortlists.py had three bugs that let bad data through: - The `type` column validator was keyed on "Type" (capital T) but the CSV header is "type" (lowercase), so every row bypassed validation. - `types` was read via `f.readlines()` without stripping, so even if the key had matched, values like `"ISP\n"` would never equal `"ISP"`. - The map was sorted case-sensitively, but README and AGENTS.md both state the map is sorted alphabetically case-insensitive. Fixing the validator surfaced eight pre-existing rows with invalid or inconsistent `type` values. All are now corrected: - Two types listed in README but missing from base_reverse_dns_types.txt (Religion, Utilities) have been added so the README and authoritative types file agree. - dhl.com, ghm-grenoble.fr, regusnet.com had lowercase-casing type values (`logistics`, `healthcare`, `Real estate`) corrected to match the canonical spellings. - lodestonegroup.com was typed `Insurance`, which is not a listed industry; reclassified as `Finance` (the closest listed category for an insurance brokerage). Also fixes one stale map entry: `rt.ru` was listed as `RT,Government Media`, conflating Rostelecom (the Russian telco that owns and uses rt.ru) with RT / Russia Today (which uses rt.com). Corrected to `Rostelecom,ISP`. Switching to case-insensitive sort moves exactly one row — the sole mixed-case key `United-domains.de` — from the top of the file (where ASCII ordering placed it before all lowercase keys) into the "united" range where human readers would expect it. Co-authored-by: Sean Whalen <seanthegeek@users.noreply.github.com> Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com> |
||
|
|
15f7d269d5 |
Add ASN-domain aliases to base_reverse_dns_map.csv (#712)
* Add ASN-domain aliases to base_reverse_dns_map.csv Adds 457 entries keyed on the `as_domain` values that ship in `ipinfo_lite.mmdb`, so that the existing reverse_dns_map can serve as a lookup table for IPs that resolve no PTR — the common case for many large senders. Before this change only ~33.8% of routed IPv4 space had an `as_domain` that matched a map key; after, ~84.0%. All additions are brands that were already represented in the map under a different rDNS-base key (e.g. `comcast.com` alongside the existing `comcast.net`), plus a handful of well-known operators that previously had no representation at all. Also promotes 10 entries out of known_unknown_base_reverse_dns.txt (a1.net, actcorp.in, ais.co.th, emirates.net.ae, eolo.it, fpt.vn, ibm.com, movilnet.com.ve, ote.gr, singnet.com.sg) — each is a well-known operator whose identity is unambiguous from ASN context even if the original rDNS base alone was inconclusive. No code changes; this is purely data, in preparation for a follow-up that wires `as_domain` into the source-attribution fallback path when a report row has no reverse DNS. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * Reclassify Zscaler as SaaS Zscaler is consumed as a self-service security platform, not delivered as a managed service, so SaaS fits better than MSSP. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> --------- Co-authored-by: Sean Whalen <seanthegeek@users.noreply.github.com> Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com> |
||
|
|
2ac8cb406e |
Replace DB-IP Country Lite with IPinfo Lite (9.8.0) (#711)
Switch the bundled IP-to-country database from DB-IP Country Lite to IPinfo Lite for greater lookup accuracy. The download URL, cached filename, and packaged module path all move from dbip/dbip-country-lite.mmdb to ipinfo/ipinfo_lite.mmdb. IPinfo Lite uses a different MMDB schema (flat country_code) that is incompatible with geoip2's Reader.country() helper, so get_ip_address_country() now uses maxminddb directly and handles both the IPinfo schema and the MaxMind/DBIP nested country.iso_code schema so users who drop in their own MMDB from any of these providers continue to work. Drop the geoip2 dependency (it was only used for the incompatible helper) and add maxminddb as a direct dependency — it was already installed transitively through geoip2. Callers that imported parsedmarc.resources.dbip directly need to switch to parsedmarc.resources.ipinfo. Old parsedmarc versions downloading from the dbip/ GitHub raw URL will 404 and fall back to their bundled copy — this is the documented behavior of load_ip_db(). Co-authored-by: Sean Whalen <seanthegeek@users.noreply.github.com> Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>9.8.0 |
||
|
|
67f46a7ec9 |
DNS lookup reliability improvements (9.7.1) (#710)
Port DNS reliability fixes from checkdmarc 5.15.x: cap per-query UDP timeout at min(1.0, timeout) so a single dropped datagram no longer consumes the entire lifetime budget, scale lifetime by nameserver count for proper failover, and add a retries kwarg that retries on LifetimeTimeout, NoNameservers (SERVFAIL), and OSError during TCP fallback (NXDOMAIN and NoAnswer remain non-retryable). Thread dns_retries through the parser API and expose it via --dns-retries / the dns_retries INI option. Centralize DNS defaults in parsedmarc.constants and add RECOMMENDED_DNS_NAMESERVERS for opt-in cross-provider failover. Co-authored-by: Sean Whalen <seanthegeek@users.noreply.github.com> Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>9.7.1 |
||
|
|
6effd80604 |
9.7.0 (#709)
- Auto-download psl_overrides.txt at startup (and whenever the reverse DNS map is reloaded) via load_psl_overrides(); add local_psl_overrides_path and psl_overrides_url config options - Add collect_domain_info.py and detect_psl_overrides.py for bulk WHOIS/HTTP enrichment and automatic cluster-based PSL override detection - Block full-IPv4 reverse-DNS entries from ever entering base_reverse_dns_map.csv, known_unknown_base_reverse_dns.txt, or unknown_base_reverse_dns.csv, and sweep pre-existing IP entries - Add Religion and Utilities to the allowed service_type values - Document the full map-maintenance workflow in AGENTS.md - Substantial expansion of base_reverse_dns_map.csv (net ~+1,000 entries) - Add 26 tests covering the new loader, IP filter, PSL fold logic, and cluster detection Co-authored-by: Sean Whalen <seanthegeek@users.noreply.github.com>9.7.0 |
||
|
|
10dd7c0459 | Update base_reverse_dns_map.csv with additional ISP and organization entries |