Commit Graph

1504 Commits

Author SHA1 Message Date
Sean Whalen 7ef153b4da Classify reverse DNS map: next 5000 unmapped MMDB ASN domains (#755)
5x the typical batch size to chase complete ASN-domain coverage. Small ISPs
and web hosts are high-value targets for spam/phishing abuse, so the long
tail of unmapped operators is worth investing review effort in. Each
candidate at this depth represents 3,072–6,144 IPv4 addresses (well below
the 100K+ that head-batches saw); auto-classification rate is 43.5%, similar
to the prior batch.

- 2,177 added to base_reverse_dns_map.csv (ISP 1,477, Web Host 296,
  Education 214, MSP 65, Government 56, Healthcare 40, Finance 29).
- 2,823 added to known_unknown_base_reverse_dns.txt — parked / Cloudflare-
  challenged / generic-server-test pages, obscure-language homepages
  without telecom-keyword cognates the classifier recognized, or rows
  whose WHOIS / MMDB as_name / homepage couldn't combine into two
  corroborating sources.

ASN-domain coverage of the bundled IPinfo Lite MMDB after this batch:
  - by domain count:  12,678 / 63,993  (19.81%, up from 15.86%)
  - by IPv4 weight:   97.85%           (up from 97.55%)

Reused the batch-5 classifier (MMDB as_name as primary brand source with
domain-root-aware title-segment selection, multilingual ISP/Web Host/MSP
keyword regex, government and education TLD lists, Communications-with-
media-context-guard fallback, and the deep brand-suffix cleanup for
EPP/EIRELI/UAB/Druzstvo/etc. plus the UTF-8-as-Latin-1 mojibake fix).
No new classifier changes this batch.

Co-authored-by: Sean Whalen <seanthegeek@users.noreply.github.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-07 12:33:23 -04:00
Sean Whalen 34518585b6 Classify reverse DNS map: next 1000 unmapped MMDB ASN domains (#754)
The next 1000 by aggregate IPv4 weight, all sitting in the long tail (each
candidate ASN holds ~7,400 IPv4 addresses, ~0.21% of total v4 weight), so
auto-classification rate is modest compared to head-batches:

- 460 added to base_reverse_dns_map.csv (ISP 344, Web Host 60, Education 21,
  MSP 12, Healthcare 8, Government 8, Finance 7).
- 540 added to known_unknown_base_reverse_dns.txt — homepages that were
  parked, behind a Cloudflare bot challenge, returning a generic-server test
  page, in obscure languages with no telecom-keyword cognates the classifier
  recognized, or whose WHOIS / MMDB as_name didn't combine with any
  homepage signal to clear two corroborating sources.

Classifier improvements applied this batch (relative to prior batches' code):

- MMDB as_name is the primary brand source, with cleaned title as fallback
  and domain-derived as last resort (WHOIS is mostly privacy-redacted at
  this depth in the long tail).
- Title-segment selection now prefers the segment whose simplified form
  contains the domain root, catching cases like accessmontana.com whose
  as_name is the holding company "MONTANA WEST, L.L.C." but whose title
  surfaces the operator brand "Access Montana".
- as_name fallback for ISP added "Communications" (with a media-context
  guard so "Christian Broadcasting Network" doesn't hit) plus bare
  "Internet" / "Cable" / "Telephone Co." patterns common in rural-US ISP
  brands.
- Government TLD list expanded for .go.id, .gv.at, .gov.cn, .gob.cl/ar/gt,
  .admin.ch, etc.; Education TLD list expanded for .ac.kr / .ac.za /
  .ac.nz / .edu.cn / .edu.tw / .edu.sg / .edu.my / .edu.ph / .edu.eg.
- MSP detection re-added (`it solutions` / `managed it support` /
  `managed tech` patterns) for marconet.com / odyssey.uk / vmi.se type
  long-tail managed-IT shops.
- Brand cleanup deepened to handle Brazilian EPP / EIRELI ME, Italian
  s.c.a r.l, Polish sp z o.o variants, Lithuanian UAB, Czech Druzstvo,
  Venezuelan C.A., trailing-single-letter artifacts, and double-spaces.
- Encoding-mojibake fixer for the common UTF-8-as-Latin-1 cases
  ("Fibra óptica" → "Fibra óptica") so Spanish/Portuguese ISP pages
  classify even when collect_domain_info.py mishandled the encoding.

Co-authored-by: Sean Whalen <seanthegeek@users.noreply.github.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-07 12:06:22 -04:00
Sean Whalen 769b16bb03 Drift-detect rebrands: tighten regex; promote 11 verified rebrand-aliased map keys (#753)
* Tighten rebrand regex to drop CTA, third-party-mention, and CSS-asset FPs

The first run of detect_rebrands.py against the live map surfaced systemic
false-positive categories that drowned the real signals. Tightening over two
rounds of FP triage:

REBRAND_RE — drop bare "now <Cap>" and "joined the X" branches:

- "Buy Now PROMO", "Apply Now Who", "Order Now Free Shipping" — modern
  marketing pages saturate body text with CTA fragments and ~95% of bare
  "now <Capital>" matches were these. Replaced with the linguistically
  meaningful pattern "(is|are|was|were|am) now (?:(?:a )?part of)?" which
  still catches "BankOnIT is now Navanta", "We are now Cencora",
  "is now part of Lumen", etc.
- "joined the Festo Certified System Integrator Program", "joined the
  ClimateCAP Initiative", "joined the Fredonia Women's Rugby team" — the
  "joined the X" pattern was too generic; real "joined the X family"
  rebrand banners are rare enough that dropping the branch is the right
  trade.

REBRAND_RE — add `\b` word boundary at the start so triggers don't match
mid-word: "Stre*am* now Mystery" was matching `am now <Cap>` because the
last two letters of "Stream" satisfied the verb alternation.

REBRAND_PATH_RE — drop bare `rebrand`, `name change`, `new name for`, and
`brand-update` / `brand-refresh` patterns. They appeared too often as CSS
class names (`class="rebrand-page"`), CSS variables
(`--rebrand-underline-color`), image filenames (`bms-rebrand-logo.svg`,
`brand-update.css`), and JSON/JS strings (`"name change"` user-account
labels). Adding `\b` boundaries doesn't help because dashes are non-word
characters. The remaining narrow patterns (`brand-launch`,
`brand-announcement`, `brand-reveal`, `our-new-name`, `our-new-brand`,
`acquisition-announcement`, `merger-announcement`) still catch the
canonical bankonitusa.com case via its `brand-launch-frequently-asked-
questions` URL slug and `Brand announcement` alt text.

_REBRAND_NOISE — make the comparison case-insensitive and add
"included", "iso", "secure", "part" to suppress "is now ON" / "is now
LIVE" / "is now ISO 27001 certified" / "is now Secure Managed Wi-Fi" /
"is now Part of" patterns. Twitter/Facebook/Square (the social-platform
rebrand mentions in footers like "X (formerly Twitter)") moved to
lowercase since the comparison is now case-insensitive.

Net effect on a full sweep over the ~13,100-key map: rebrand-signal
flagged-row count dropped from ~270 (initial run) to 108 (round-3),
clearing the dominant FP categories while every real signal — verified
against the bankonitusa.com canonical case plus 11 other actual
rebrands — still fires.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* Promote 11 verified rebrands found by drift sweep; alias 4 acquirer domains

Renames produced by `detect_rebrands.py` running against the full ~13,100-key
map and verified by re-reading each operator's homepage. Type column
unchanged for every row — only the canonical `name` shifts to the new
operator. Where the new operator's primary domain wasn't already in the map,
a case-1 alias row is added pointing to the same `(name, type)`.

Renames:

- amerisourcebergen.com: AMERISOURCEBERGEN → Cencora
- aurorahealthcare.org: Aurora Health Care → Advocate Health
- consolidated.com: Consolidated Communications → Fidium Fiber
- databridgesites.com: Meridian Parkway Data Center Owner → TierPoint
- emarsys.com: SAP Emarsys → SAP Engagement Cloud
- rig.net: RigNet → Viasat
- rxlightning.com: RxLightning → CoverMyMeds
- telepoint.bg: Telepoint → Digital Realty
- thehostgroup.com: The Host Group → HostGo
- ultisat.com: Globecomm Services Maryland → UltiSat
- unifiedpostgroup.com: Unifiedpost Group → Banqup

New aliases (operator's primary domain not previously mapped):

- cencora.com → Cencora, Healthcare
- advocatehealth.com → Advocate Health, Healthcare
- covermymeds.com → CoverMyMeds, Healthcare
- banqup.com → Banqup, SaaS

Five sweep hits intentionally deferred for lack of a clear second source:
megatel.co.nz → Nova (`nova.co.nz` is for sale via a domain broker;
unclear which Nova entity), pogozone.com → NeuBeam (NeuBeam's homepage
doesn't acknowledge the PogoZone acquisition), prempub.com → Ingenious
Media (ingeniousmedia.com fetch failed), voltagepark.com → ? (merger
with Lightning AI rather than a clean rebrand), and a handful of more
ambiguous Synopsys/Ansys/OmniAccess/Rakuten/Indigital/Synthite signals
that need manual research.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* Document detect_rebrands.py cadence as run-once-a-year

The drift sweep is for catching operator rebrands and acquisitions that
accumulated since the previous run; M&A activity over the mapped operator
set is slow enough that yearly is sufficient. Annotate the script's own
docstring, the maps README, and the AGENTS.md "Related utility scripts"
entry so a future contributor doesn't mistake it for a per-batch step.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Sean Whalen <seanthegeek@users.noreply.github.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-07 11:31:58 -04:00
Sean Whalen c752e776de Detect map-key rebrands via homepage drift sweep (#752)
Adds two complementary pieces of M&A drift detection over base_reverse_dns_map.csv:

- `collect_domain_info.py` gains two derived columns. `rebrand_signal` combines
  a body-text regex ("now X" / "formerly known as X" / "we became X" / ...)
  with a narrow path-and-alt-text regex ("rebrand", "brand-launch",
  "brand-announcement", "name-change", "our-new-name", ...) that runs against
  the JSON-unescaped page bytes, so URL slugs and image alt attributes inside
  Elementor / hydration script blobs are reachable. The two-regex split is
  what catches image-only acquisition banners like bankonitusa.com's "now
  Navanta" — a `<a href="https://navanta.com/brand-launch-..."><img
  alt="Brand announcement"></a>` with no visible text — that pure body-text
  scanning misses. `external_links` collects the homepage's non-self,
  non-social outbound link hosts as review context only.

- `detect_rebrands.py` is a new sibling drift sweep. It re-fetches every key
  in base_reverse_dns_map.csv with the same fetch machinery, evaluates two
  default flag triggers (`rebrand_signal` matched, or final URL host doesn't
  sit under the input domain), and writes a compact TSV of just the flagged
  rows. `external_links` is captured into the row as context but is not a
  default trigger — most outbound links are to partners / customers / vendors,
  and flagging them would flood review with noise. `--flag-external-links`
  opts into that signal for thorough sweeps. Resume-safe via `-o`.

Output is review fodder, not automated map mutation: a single signal is one
corroborating source, and promoting a flagged row into the map still requires
a second source per the two-corroborating-sources rule.

README and AGENTS.md updated to document the new columns and script.

Co-authored-by: Sean Whalen <seanthegeek@users.noreply.github.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-06 21:22:30 -04:00
Sean Whalen 6fa561d172 Classify reverse DNS map: ~2,100 unmapped MMDB ASN domains; bankonitusa.com → Navanta (#751)
Adds ~2,125 ASN-domain classifications carried out across four ~1,000-domain
batches in a prior session that wasn't pushed before #748/#749 merged. The
overlap with those merged batches is dropped — origin/master's classifications
are kept as authoritative — and only the genuinely-new domains land here.
188 known-unknown rows are promoted out to the map for the same reason.

Also updates bankonitusa.com from BankOnIT to Navanta and adds navanta.com as
an alias after a spot check observed the operator's "now Navanta" rebrand
banner. Two corroborating sources: the banner on bankonitusa.com itself
(image-only `<a href="https://navanta.com/brand-launch-..."><img alt="Brand
announcement"></a>`) and the rebrand explainer on navanta.com ("Why We Became
Navanta", "MyBankonIT has been rebranded to MyBPC"). The MMDB still names the
pre-rebrand entity (BankOnIT, L.L.C.) — typical years-of-lag pattern.

Co-authored-by: Sean Whalen <seanthegeek@users.noreply.github.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-06 21:20:14 -04:00
Sean Whalen bf526f4e12 docs(AGENTS.md): require fresh branch off origin/master per batch (#750)
* docs(AGENTS.md): require fresh branch off origin/master per batch

Add a "Starting the next batch" subsection to the reverse-DNS-maps
workflow. Each batch must start from a fresh checkout of origin/master,
not from the previous batch's branch.

The trap: if the previous batch's commit has already merged via a PR
pushed from elsewhere (a co-worker's session, an unsynced laptop, an
earlier session), the local copy of that commit still sits on the old
branch. Stacking new work on top makes the new PR conflict with master,
because the merged commit and the local copy insert identical map rows
at identical sorted positions and the same lines collide.

Hit live this batch (PR #749) and recovered via
`git rebase --onto origin/master <stale-commit> <branch>` plus a
force-push, then a PR-description trim. Documenting the failure mode
and the recovery so the next contributor avoids the trap entirely.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* docs(AGENTS.md): also check for open map PRs before starting a batch

Add a pre-flight `gh pr list --search` step ahead of the branch-fresh-
off-master rule. Same scenario in mind: a previous batch's PR is still
in flight, started from a different machine or session, and starting
a new batch in parallel duplicates effort or splits attention across
two competing PRs touching the same files.

Cheap one-liner; cost of forgetting it is the kind of conflict #749
already documented at the branch-hygiene level.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Sean Whalen <seanthegeek@users.noreply.github.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-05 21:14:26 -04:00
Sean Whalen 7ef31f8083 Classify reverse DNS map: next 1000 unmapped MMDB ASN domains (#749)
Continued the MMDB ASN-domain coverage walk into the 14k-10k IPv4-weight
band. Added 883 new map entries and 117 new known-unknown entries from
the top 1,000 unmapped candidates.

ASN-domain coverage by IPv4 weight: 96.5% -> 96.8%.
ASN-domain coverage by domain count: 11.0% -> 12.4%.

Composition: ~50 globally-known brands (Vanguard, AIG, Aon, Equifax,
Mercedes-Benz USA, BP, BHP, Bechtel, Tetra Pak, Anheuser-Busch, Air
Canada, Maersk, NFL, NHL, MGM Resorts, Wolfram, Red Hat, Palo Alto
Networks, New Relic, Travelport, Epicor, IQVIA, Dassault Systemes,
Disney+, Valve, Seagate, Analog Devices, Renesas, Dow Jones, Lee
Enterprises, IGN, Mondadori, AtkinsRealis, Eiffage, Ogilvy, Interpublic,
Equifax, Ooredoo Maldives, MTN Zambia, Movistar Costa Rica, Telekom
Romania Mobile, Sparkle, Vodafone Ireland, etc.); ~30 universities
and government / state agencies (City of San Jose, City of Phoenix,
Bulgarian gov, Region Uppsala, Weld County, Long Beach Unified, Escambia
School District, Region 4 ESC, Merced COE, Santa Cruz COE, Politechnika
Warszawska, Bogazici, KAIST-affiliated Korean universities, Ural Federal
University, etc.); the long tail of regional ISPs / hosters / MSPs /
data-center operators classified via MMDB as_name + homepage / WHOIS
corroboration.

117 added to known-unknown where the two-corroborating-sources bar
wasn't met (Cloudflare-blocked sites with privacy-redacted WHOIS,
generic-token AS-names with empty homepages, parked domains, etc.).
Files remain disjoint per the workflow guardrail.

sortlists.py validates clean (types, sort, dedupe). CRLF preserved.

Co-authored-by: Sean Whalen <seanthegeek@users.noreply.github.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-05 21:08:53 -04:00
Sean Whalen ab9d4e93f5 Classify reverse DNS map: next 1000 unmapped MMDB ASN domains (#748)
Continued the MMDB ASN-domain coverage walk into the 18k–14k IPv4-weight
band. Added 941 new map entries and 59 new known-unknown entries from
the top 1,000 unmapped candidates.

ASN-domain coverage by IPv4 weight: 96.0% → 96.5%.
ASN-domain coverage by domain count: 9.5% → 11.0%.

Composition: ~50 universities and government / state agencies (HMRC, SSA,
DHS, DOJ, BART, Pittsburgh, Charlotte, NY courts, Bank of Canada, MTA,
gov.si, gov.ru, KAUST, Sharif University, Karolinska Institutet, IIT,
KTH, etc.), ~70 globally-known brands (Nvidia, AMD, BMW, Mastercard,
Nasdaq, NetApp, Allianz, Honeywell, JPMorgan, Goldman Sachs, Mitel,
Arista, Take-Two, Universal Music, Disney Go, Fox, Nike, Cigna, Aetna,
Humana, AbbVie, Mitsubishi Electric, Saint-Gobain, Reliance Industries,
Hyundai Autoever, Square Enix, NEXON, Riot Games, Mahidol University,
Hong Kong HSBC, Standard Chartered, etc.), and the long tail of regional
ISPs / hosters / MSPs / data center operators classified via MMDB
as_name + homepage corroboration.

59 added to known-unknown where the two-corroborating-sources bar
wasn't met. Files remain disjoint per the workflow guardrail.

sortlists.py validates clean (types, sort, dedupe). CRLF preserved.

Co-authored-by: Sean Whalen <seanthegeek@users.noreply.github.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-05 17:59:36 -04:00
Sean Whalen 1fd833bbf0 Classify reverse DNS map: next 1000 unmapped MMDB ASN domains (#747)
Continued the MMDB ASN-domain coverage walk into the 30k–18k IPv4-weight
band. Added 971 new map entries and 32 new known-unknown entries from
the top 1,000 unmapped candidates.

ASN-domain coverage by IPv4 weight: 95.3% → 96.0%.
ASN-domain coverage by domain count: 8.0% → 9.5%.

Composition: ~30 universities and government / state agencies
(maryland.gov, ok.gov, nj.gov, NIA Korea, NICTEC Thailand, etc.),
~80 globally-known brands (Nvidia, Tesla, Intel, Ford, GM, Volvo,
Disney, EA, Roblox, Riot Games, Sony PlayStation, JPMorgan, Goldman
Sachs, Morgan Stanley, Charles Schwab, AXA, Cigna, Cargill, Hallmark,
Pepsi, Kroger, Random House, NBCUniversal, Qualcomm, Deutsche Bank,
UBS, Citi, Lloyds Banking, Westpac, CommBank, Adobe, Broadcom, NXP,
Schaeffler, Saint-Gobain, Hanwha, Doosan, Hyundai Autoever, Square Enix,
Garena, etc.), and the long tail of regional ISPs / hosters / MSPs
classified via MMDB as_name + homepage corroboration.

1 entry promoted out of known_unknown_base_reverse_dns.txt; 32 added
where the two-corroborating-sources bar still wasn't met. Files remain
disjoint per the workflow guardrail.

sortlists.py validates clean (types, sort, dedupe). CRLF preserved.

Co-authored-by: Sean Whalen <seanthegeek@users.noreply.github.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-05 16:23:23 -04:00
Sean Whalen 05adb9c831 Classify reverse DNS map: top ~1,950 unmapped MMDB ASN domains (#746)
Walked the bundled IPinfo Lite MMDB for ASN domains absent from
base_reverse_dns_map.csv, processed the top ~1,950 by IPv4 weight
across five batches (collect_domain_info.py + tier-based classification
per AGENTS.md), and added 1,664 new map entries.

ASN-domain coverage by IPv4 weight: ~92.9% → 95.3%.
ASN-domain coverage by domain count: 5.4% → 8.0%.

Composition: ~250 universities/government (Tier 0 — restricted TLD +
MMDB as_name), ~80 globally-known brands (Saudi Telecom, JAXA, RailTel,
LY Corporation, Tesla, Intel, Citi, Schwab, Disney, EA, Volvo, Mitsubishi
Electric, Cargill, Hallmark, Medtronic, Banco do Brasil, Petrobras, etc.),
direct aliases for already-mapped brands (HKBN, Tata Teleservices, Cox,
NTT, T-Mobile, etc.), and the long tail of regional ISPs / hosters / DC
operators classified via MMDB as_name + homepage corroboration.

66 entries promoted out of known_unknown_base_reverse_dns.txt where the
new collector data cleared the two-corroborating-sources bar; 55 added
where the bar still wasn't met. Files remain disjoint per the workflow
guardrail.

sortlists.py validates clean (types, sort, dedupe). CRLF preserved.

Co-authored-by: Sean Whalen <seanthegeek@users.noreply.github.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-05 14:20:10 -04:00
Sean Whalen 7ba078bff1 Translate AS-name source rows via MMDB; classify reverse DNS batch (#745)
* feat(maps): translate AS-name source rows via MMDB

When parsedmarc's ASN-fallback path in utils.get_ip_address_info surfaces
a raw MMDB as_name (e.g. "Vodafone Group PLC") for an IP that has no PTR
and whose as_domain isn't in the map, find_unknown_base_reverse_dns.py
now looks the as_name up in the bundled ipinfo_lite.mmdb and substitutes
the matching as_domain so the row enters the unknown pipeline as a
researchable domain instead of being dropped or polluting the list.

Normalize non-breaking spaces (U+00A0) and runs of whitespace when
building and querying the as_name index — the source CSV and MMDB
disagree on NBSP placement for several names (e.g. "UDomain\xa0Web
Hosting Company Ltd" in the CSV vs. "UDomain Web Hosting Company Ltd"
in the MMDB), causing exact-match lookups to miss otherwise-identical
entries.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* chore(maps): classify a batch of unknown reverse DNS base domains

40 map additions (35 source domains + 5 redirect-target/promotion
aliases) and 35 known-unknown additions, covering the 71-entry
unknown_base_reverse_dns.csv refresh.

Newly mapped operators include several MMDB-AS-translated regional
ISPs (Babilon-T/TJ, MegaFon Tajikistan, Ucell, Ufone, PinPro, Teraline
Telecom, Transtelecom Kazakhstan, Satis, AlmaTV, Radius-NET, Burlington
Telecom), aliases of existing brands (Telstra/bigpond.net.au,
UDomain/udomain.hk, AG Telekom/katv1.net, EWE/ewe-ip-backbone.de,
Hostinger/hstgr.cloud, Docusign/docusign.net, Brevo/sp2-brevo.net,
MegaFon/megafon.tj, Beeline/beeline.uz), Tier-0 brands (Visa, Tripster,
Verde Agritech), one healthcare entry (Sanwakai Hospital), one
government entry (Special Communication Service of Azerbaijan), one
education entry (KazRENA), and an MSP (Otava). Redirect-target aliases
added for burlingtontelecom.com, alma.plus, cn.at, and
teraline-telecom.net per the post-batch sweep rule. fea.net promoted
out of known-unknown to West Coast Internet (WCI) after its homepage
redirect-target was already mapped.

Domains with single-source corroboration (privacy WHOIS plus
unreachable site, parked-domain pages, ambiguous categorizations) went
to known_unknown_base_reverse_dns.txt rather than the map.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Sean Whalen <seanthegeek@users.noreply.github.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-05 13:17:43 -04:00
Sean Whalen 6ff6261df9 docs: update installation instructions for IPinfo Lite and MaxMind GeoLite2 databases 2026-05-04 18:52:18 -04:00
Sean Whalen 06fd3f2b09 docs: update installation instructions and usage notes for parsedmarc 2026-05-04 16:34:51 -04:00
github-actions[bot] 7ba8a1d10f chore: update IPinfo Lite MMDB (#744)
Co-authored-by: seanthegeek <44679+seanthegeek@users.noreply.github.com>
2026-05-04 12:17:49 -04:00
Sean Whalen 02a8014893 Fix splunk SMTP TLS dashboard: add additional renames for failure details and adjust stats query 2026-05-03 19:58:29 -04:00
Sean Whalen 8317ffcde8 Fix rename syntax for parsed_sample headers in Splunk DMARC forensic dashboard 2026-05-03 19:09:10 -04:00
Sean Whalen 3b9e678533 Refactor SMTP TLS dashboard with base search
Refactored the SMTP TLS Splunk  dashboard to use a base search for improved query efficiency and maintainability. Updated input token names and adjusted search queries for better organization and clarity.
2026-05-03 18:50:54 -04:00
Sean Whalen 5ba72d2783 Add source AS name to fillnull and search queries in DMARC aggregate dashboard 2026-05-03 15:27:43 -04:00
Sean Whalen e40b53da64 Enhance Splunk DMARC aggregate dashboard: add source AS name dropdown and update search queries 2026-05-03 14:57:43 -04:00
Sean Whalen fe296ca869 Update dashboard documentation
- Introduced a new README.md for dashboard development with detailed instructions.
- Removed outdated README files for Grafana and Splunk dashboards.
2026-05-03 12:36:06 -04:00
Sean Whalen 397378de8e Bump mailsuite to >=2.0.2 for 9.11.1 release (#743)
Addresses RuntimeError: Event loop is closed in the MS Graph mailbox
backend (#742).

Co-authored-by: Sean Whalen <seanthegeek@users.noreply.github.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
9.11.1
2026-04-30 11:59:11 -04:00
Sean Whalen 5d816a4e56 Offload mailbox layer to mailsuite>=2.0.0 (#741)
mailsuite 2.0.0 extracted the IMAP, Microsoft Graph, Gmail, and Maildir
connections out of parsedmarc into mailsuite.mailbox so other projects
can reuse the same provider-agnostic interface. Replace the
parsedmarc/mail submodules with a thin re-export of mailsuite.mailbox
and drop the duplicated implementations.

Per the migration note in seanthegeek/mailsuite#22, pass
token_cache_name="parsedmarc" so existing AuthenticationRecord caches
on disk continue to work without re-prompting users to authenticate.
The existing graph_url config knob is forwarded unchanged.

Drop direct dependencies that are now installed transitively via
mailsuite[gmail,msgraph] (msgraph-core, imapclient, google-*). The
extras are pulled in non-optionally so Gmail and Microsoft Graph
support remain available out of the box.

Drop nine test classes that were exercising mailsuite-side
implementation internals (TestGmailConnection, TestGraphConnection,
TestImapConnection, the _get_creds/_generate_credential half of
TestGmailAuthModes, TestImapFallbacks, TestMSGraphFolderFallback,
TestMaildirConnection, TestMaildirReportsFolder, TestMaildirUidHandling,
TestTokenParentDirCreation); these are mailsuite's tests now. The CLI
integration tests that mock parsedmarc.cli.{IMAP,Gmail,MSGraph}Connection
are kept.

Co-authored-by: Sean Whalen <seanthegeek@users.noreply.github.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
9.11.0
2026-04-28 00:58:36 -04:00
Sean Whalen 900ee22525 Make map and country list side by side in the Splunk DMARC aggregate dashboard XML 2026-04-27 16:03:29 -04:00
Sean Whalen e709839f79 Fix typo in source ip viz 2026-04-27 15:20:45 -04:00
Sean Whalen e7f6e1b5e7 Update map files 2026-04-27 12:59:11 -04:00
Sean Whalen 26f54b1269 Add content rule to exclude adult websites from domain lists 2026-04-27 12:01:57 -04:00
Sean Whalen 44fd1aa555 Coerce malformed <email> in aggregate report metadata to None (#740)
xmltodict turns stray angle brackets in <email> (e.g.
"<bad-xml@bad-xml.net>") into a nested dict, which then flows through
parse_aggregate_report_xml as the org_email value. Parsing succeeds, but
Elasticsearch / OpenSearch reject the document at index time because the
org_email mapping is text — observed as document_parsing_exception /
mapper_parsing_exception with a "{#text=..., bad-xml=null}" preview.

When report_metadata["email"] comes back as a dict, log it at debug and
discard. The rest of the report still ingests with org_email=None
instead of failing the whole document downstream.

Co-authored-by: Sean Whalen <seanthegeek@users.noreply.github.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-27 12:00:55 -04:00
github-actions[bot] f3a2e894e0 chore: update IPinfo Lite MMDB (#739)
Co-authored-by: seanthegeek <44679+seanthegeek@users.noreply.github.com>
2026-04-27 08:28:00 -04:00
Sean Whalen 265bf64240 Align Grafana dashboard with OpenSearch Dashboards source-of-truth (#738)
* Align Grafana dashboard with OpenSearch Dashboards source-of-truth

Adds the two aggregate-DMARC panels that exist on the OSD dashboard but
were missing from the bundled Grafana dashboard:

- "Message sources by name and type" — buckets by source_name + source_type,
  sums message_count per (name, type) tuple. Mirrors the OSD viz from 9.4.x.
- "Message sources by Autonomous System" — buckets by source_asn +
  source_as_name + source_as_domain, sums message_count per ASN. Mirrors
  the OSD viz added in 9.9.0 with the IPinfo Lite ASN integration.

Both panels are patterned on the existing "Reporting Organisations" panel
(same datasource $datasourceag, same sum(message_count) metric, same
gradient-gauge "Messages" column with rename transforms). They sit at
the bottom of the existing layout (gridPos y=129 and y=140) so the
existing panel positions are unchanged.

Verified against the bundled grafana/grafana:12.3.0: dashboard import
returns status=success, both panels render with real data from the
sample-corpus indexes, and the ES aggregations (terms on source_name
+ source_type, numeric terms on source_asn) return the expected results.

Out of scope:
- Extras in the Grafana dashboard that aren't on OSD (SPF/DKIM Results
  Over Time, Alignment Over Time, Stat overview, Published Policies,
  Forensic IP / country tables) are left in place. They were
  community-contributed and likely valued by some users.
- Migrating the deprecated `graph` and `grafana-worldmap-panel` panel
  types to modern timeseries / geomap is a separate, larger task.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* Grafana: migrate deprecated graph and worldmap panels

Replaces the 6 legacy `graph` panels with `timeseries` panels and the
2 legacy `grafana-worldmap-panel` panels with `geomap` panels. Both
deprecated plugins still rendered in Grafana 12 via auto-migration but
were flagged for removal; this ships the modern saved shape.

graph -> timeseries (6 panels):
  SPF Results Over Time, DKIM Results Over Time, SPF Alignment Over Time,
  DKIM Alignment Over Time, DMARC Passage Over Time, Message Disposition
  Over Time. Panel `aliasColors` (e.g. {true: dark-green, false: dark-red})
  are translated into per-series `fieldConfig.overrides` so the green/red
  by-pass-fail colorings carry forward; legacy graph fields (lines, fill,
  yaxes, tooltip etc.) are dropped in favor of the new
  `fieldConfig.defaults.custom` block and `options.legend` / `options.tooltip`.

worldmap -> geomap (2 panels):
  Map of Message Source Countries (aggregate), Forensic Sample Sources
  by Country (forensic). The legacy `locationData=countries` lookup-by-ISO
  becomes a geomap markers layer with `location.mode=lookup`,
  `gazetteer=public/gazetteer/countries.json`, and `lookup=source_country.keyword`
  — same input data, modern renderer. Drops the date_histogram bucket
  from the geomap targets since the map is a snapshot over the panel
  time range, not a time series.

Verified against the bundled grafana/grafana:12.3.0: dashboard imports
with status=success and `version=19`, live panel types now report
`{timeseries: 6, geomap: 2, table: 14, grafana-piechart-panel: 3,
stat: 1, row: 3}` — no more `graph` or `grafana-worldmap-panel` entries.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Sean Whalen <seanthegeek@users.noreply.github.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-27 01:32:29 -04:00
Sean Whalen 4e8c28bbc0 Align Kibana dashboards with OpenSearch Dashboards source-of-truth (#737)
* Align Kibana dashboards with OpenSearch Dashboards source-of-truth

OSD is a fork of Kibana 7.10 and Kibana 8.x's saved-object migration
handlers accept OSD's saved-object format directly. Replace the legacy
Kibana export with a byte-identical copy of the OSD ndjson, so the two
backends ship the same panels, metric aggregations, panel titles, and
field assignments instead of drifting independently.

Verified against Kibana 8.19.7: import returns successCount=26 with no
errors and Kibana auto-migrates each viz / dashboard to its current
saved-object schema (typeMigrationVersion 8.5.0 for visualizations,
10.3.0 for dashboards) on import.

Net effects for Kibana users on import:

- Picks up the metric-aggregation fix from 9.10.3 — pies, tables, and
  the choropleth now sum(message_count) instead of counting OS docs,
  giving real message volume rather than distinct source-row counts.
- Adds "Message sources by Autonomous System" and "Message sources by
  name and type" panels (previously only on OSD).
- Forensic dashboard simplified to OSD's two-panel layout (markdown
  intro + samples table) — drops the Kibana-only IP-address and
  country-ISO tables and the choropleth.
- Adds the "SMTP TLS reporting" dashboard (was absent from the bundled
  Kibana export).
- Drops the extraneous "Evolution DMARC par source_reverse_DNS" Lens
  visualization that snuck in via a community contribution.

Updates docs/source/kibana.md to reflect the new dashboard names
("DMARC aggregate reports" / "DMARC failure reports") and adds a brief
section on the SMTP TLS reporting dashboard.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* Drop the duplicate Kibana ndjson; point Kibana users at the OSD file

Kibana 8.x's saved-object migration handlers accept the OpenSearch
Dashboards saved-object format directly (verified by import returning
successCount=26 with no errors), so a separate kibana/export.ndjson
was just two copies of the same bytes that would inevitably drift. Drop
it and update the bootstrap script and docs to point at the existing
dashboards/opensearch/opensearch_dashboards.ndjson.

Add a path-filtered CI workflow (.github/workflows/dashboards.yml) that
fires only when the OSD ndjson changes. It stands up an Elasticsearch +
Kibana 8.19.7 service pair, POSTs the file at the saved-objects import
endpoint, and asserts success=true with no errors. That keeps the
single-file source compatible with Kibana on every change.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Sean Whalen <seanthegeek@users.noreply.github.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-27 01:30:48 -04:00
Sean Whalen 826e78c390 Fix DMARC dashboard metrics (OSD + Splunk) and add dashboard-dev bootstrap (#736)
* OSD: fix aggregate dashboard metrics to sum(message_count)

13 panels on the DMARC aggregate dashboard were aggregating with `count`
(number of OSD docs) when they should have been summing `message_count`.
Each parsedmarc OSD doc represents one (source_ip, auth_results) tuple from
the XML and carries an integer message_count, so doc-counting reports
"distinct sources" rather than "messages". Panels with titles like "Message
volume by header from", "DMARC passage over time", etc. were producing
misleading numbers.

Affected panels: SPF/DKIM/Passed-DMARC pies; Reporting orgs; Sources by
reverse DNS / header from / name+type / ASN / country / IP; Map; SPF and
DKIM details. (DMARC failure email samples kept count — one OSD doc per
RUF sample, so it's correct. SMTP TLS panels untouched — they sum the
right session-count fields.)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* Splunk: align dashboards with OSD and fix query bugs

Aggregate dashboard:
- Add "Message sources by Autonomous System" panel (source_asn / as_name /
  as_domain), formatted "AS<n>" at render with eval, matching the OSD addition.
- DKIM details: add the missing dkim_aligned column.
- SPF details: reorder columns to OSD order (spf_aligned at end).
- Map / country titles renamed to match OSD ("Map of message sources by
  country", "Message sources by country").
- Map widget: stats count by Country -> stats sum(message_count) by
  Country, so the choropleth shades by message volume not record count.
- fillnull "none"/"unknown" applied to source_reverse_dns, source_base_domain,
  source_country to mirror OSD's missing-bucket labels.
- charting.fieldColors {true: green, false: red} on SPF/DKIM/Passed-DMARC
  pies and the DMARC-passage timechart.

Forensic dashboard:
- Restructure to match OSD's two-panel layout (markdown + samples table).
- Drop the country map / IP table / country-ISO table panels (not in OSD).
- Samples table columns aligned to OSD: arrival_date_utc, source.ip_address,
  from, subject, reply_to, authentication_results.
- Tolerate null headers in the base_search filter (was: parsed_sample.headers.From=*
  required field to exist; LinkedIn RUF sample with null From was filtered out).

SMTP TLS dashboard:
- Reorder metrics to OSD order (successful before failed).
- Domains panel: add policy_type bucket.
- Failure details: replace search-time `failed_session_count>0` (which
  doesn't evaluate against multivalued JSON paths in Splunk) with
  `result_type=*` for presence + post-stats `where failed_sessions>0`.
  Drop _time/successful_sessions columns; reorder to match OSD.
- Wire the existing policy_type input into all three searches.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* Add dashboard-dev bootstrap script and VSCode task

dashboard-dev-bootstrap.sh brings up docker-compose.dashboard-dev.yml,
seeds parsedmarc sample data into ES + OS + Splunk via parsedmarc-dev.ini,
and re-imports every dashboard into Kibana, OpenSearch Dashboards, Grafana,
and Splunk. Idempotent: existence checks skip provisioning that's already
done; only the dashboard imports re-run unconditionally on every invocation
(that's the point of running it after a dashboard edit).

Notable provisioning quirks the script handles:
- Splunk's auto-created HEC token (from the SPLUNK_HEC_TOKEN env) ships
  with indexes=[] and index=default; rewrites it to allow the email index.
- ES 8.x rejects wildcard DELETEs by default; RESEED=1 enumerates daily
  parsedmarc indexes via _cat/indices and deletes one at a time.
- Splunk has no clean-in-place REST endpoint for live indexes; RESEED=1
  deletes and recreates the email index (then re-applies the HEC token).
- OSD security plugin tenants: imports target global_tenant explicitly
  via the securitytenant header so they're visible to the shared workspace
  rather than landing in the API user's private tenant. Override with
  OSD_TENANT=<name>.
- Splunk ships an in-product announcement view (scheduled_export_dashboard)
  with sharing=global; the script narrows it to sharing=app so it stops
  showing up in every app's dashboards list.

Adds a "Dev Dashboard: Bootstrap" task to .vscode/tasks.json that runs
the script.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* CHANGELOG: 9.10.3 entry for the dashboard metric fix and alignment work

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* Bump version to 9.10.3

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* CHANGELOG: warn against the "Create new objects with unique IDs" import mode

OSD's import dialog has two modes: the default "Check for existing objects"
(which honors saved-object IDs and overwrites in place when "Automatically
overwrite conflicts" is on) and "Create new objects with unique IDs" (which
imports under fresh UUIDs and leaves the buggy originals untouched). Picking
the second one means the dashboards keep rendering the wrong numbers because
the originals are never replaced. Spell that out so users don't fall into
the trap.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* OSD: label the metric column "messages" instead of "Sum of message_count"

OSD's table column header defaults to "Sum of message_count" when the
metric agg has no customLabel. "messages" reads better and matches what
the panels are actually counting.

Applies to all 15 aggregate-DMARC visualizations that use sum(message_count).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* CHANGELOG: tighten the 9.10.3 entry — clearer and more actionable

Trim the verbose technical exposition; lead each fix with the user-visible
symptom. Move the action-required call out to its own header in upgrade
notes so the re-import instructions don't get lost in a wall of text.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* Move per-tool dashboard exports under a single dashboards/ directory

Consolidates the four sibling top-level folders (kibana/, opensearch/,
grafana/, splunk/) into dashboards/{kibana,opensearch,grafana,splunk}/.
Updates the only path references in tracked files: bootstrap script (5
lines), CHANGELOG.md (1 line), and the kibana/export.ndjson raw URL in
docs/source/elasticsearch.md.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* OSD: restore the "DKIM alignment" panel title on the aggregate dashboard

The DKIM alignment panel had no title override in panelsJSON, so OSD fell
back to the visualization's own name ("Aggregate DMARC DKIM alignment").
Every other pie/table on the same dashboard sets a clean title (SPF
alignment, Passed DMARC, etc.) — this was a stray regression. Set the
panel title to "DKIM alignment" to match.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* Splunk: color the message-disposition timechart by severity

Reject is red, quarantine is yellow, none is green — same semantic
mapping as the SPF/DKIM/Passed-DMARC pies and the DMARC-passage
timechart, applied via charting.fieldColors. Matches OSD's existing
color overrides on the equivalent viz.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* CHANGELOG: clarify that "Create new objects with unique IDs" is the default

The OSD import dialog defaults to that mode — users have to actively
switch away from it, not just avoid picking it. Reword the upgrade note
to lead with the switch and explain why the default would silently
preserve the bug.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Sean Whalen <seanthegeek@users.noreply.github.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
9.10.3
2026-04-27 00:40:01 -04:00
Sean Whalen 8cc017fe84 ASN-domain coverage sweep #3: 516 new map entries (#735)
* Add Tier 0 to the verification triage: globally-known brand at primary domain

In the previous ASN-domain coverage sweep, the agent ran web searches
for entries like `bestbuy.com → Best Buy`, `ups.com → United Parcel
Service`, `usps.gov → US Postal Service`, `marriott.com → Marriott`,
`henkel.cn → Henkel`, `experian.com → Experian`, `jd.com → JD.com`,
`ing.com → ING`, `verisign.com → Verisign`. For each of these the
domain ↔ brand pairing is encyclopedic — same outcome a few seconds
slower.

The two-corroborating-sources rule (rule 8) was being applied
mechanically: "MMDB as_name alone is one source, must fetch a second."
But for globally-known brands at their primary domain, the brand
identity itself is the second source. Searching for confirmation that
Best Buy owns bestbuy.com is the kind of busywork the tier system
exists to avoid.

Adds Tier 0 with explicit guardrails — must be globally known
(multinational or top-tier-national, decades-old, single canonical
entity), must be the entity's primary marketing/corporate domain
(not a tracking subdomain or regional ccTLD where ownership is
non-obvious), and no recent acquisition/rebrand status in question.
Cross-references the existing parent-too-generic sub-rule and
warns against stretching to mid-size brands the agent happens to
recognize. When in doubt: drop to Tier 3 and search.

Also generalizes the section's lead from "redirect-target candidates"
to cover MMDB coverage-gap and PSL private-domain candidates — the
tier logic transfers cleanly across all three workflows. Updates the
Tier 1 description with an explicit MMDB-coverage-gap analog.

Refreshes the held-back-review split stat to 0 / 109 / 2 / 34 / 35
(Tier 0 didn't apply to that batch because every candidate was a
redirect target that needed to inherit the *source row's* existing
canonical name, not its own brand identity).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* ASN-domain coverage sweep #3: 516 new map entries

Third pass against the IPinfo Lite MMDB coverage gap, processing the
top ~500 unmapped as_domain entries by IPv4 weight after the prior two
sweeps. Verifies each entry against AGENTS.md's tiered triage:

- **Tier 0** (globally-known brand at primary domain, no search
  needed): Barclays, Liberty Mutual, Zurich Insurance, ABN AMRO,
  Swedbank, CIBC, Allstate, Julius Baer, MUFG, Travelers, USPS-Bank,
  ING, Florida Blue, AgriBank, Energy Transfer, FirstEnergy, Scania,
  Evonik, Merck KGaA, Agfa, Bosch, Iveco, Applied Materials, Micron,
  Andritz, Whirlpool, Leonardo, QinetiQ, Atlas Elektronik, Draper,
  Airbus, Jacobs Engineering, Teledyne, Dropbox, Autodesk, Wind River,
  Stratus, Unisys, ByteDance, Chevron, BBC, CDC, NEC, HPE,
  Kimberly-Clark, U.S. Bank, NATO, EUROCONTROL, Federal Reserve, NIST,
  NSF, DARPA, Library of Congress, IMF, FAO, IAEA, ITU, several US
  state/county/city governments, Australian state/federal departments,
  European national agencies, United Airlines, Alaska Airlines,
  Rakuten Mobile, Coles, Woolworths.

- **Tier 1** (MMDB as_name lexically matches candidate domain, no
  search needed): ~150+ ISPs / hosters / cable TV operators where
  the as_name itself is the second corroborating source — major
  national/regional telcos (BTC Botswana, Uganda Telecom, ONE Albania,
  Tanzania Telecommunications, Kyrgyztelecom, Uzbektelekom, Telecom
  Algeria, MTN Rwanda, Vodacom Tanzania, Celcom Axiata, Triple T
  Broadcasting/Jasmine Thailand, MyRepublic Indonesia, Northwestel
  Canada, JT Jersey, Liberty Networks Colombia, ARLINK Argentina,
  Cable & Wireless Dominica, SETAR Aruba, AR Telecom Portugal),
  regional fiber providers (Trooli, Allied Telecom, OEC Fiber,
  Conexon Connect, Ben Lomand, Great Plains, BrightNet Oklahoma,
  All West, SDN, Tularosa, Blackfoot, Greeneville Energy, Avanti
  Broadband, Net at Once, Avanti, Aura Fiber, Stichting Breedband
  Delft), regional cable TV operators across Japan/Korea/Taiwan
  (Miyazaki Cable, Toyohashi Cable, Nagasaki Cable, Cable TV Toyama,
  Kurashiki Cable, Himeji Cable, Keumgang Cable Network), data center
  operators (eStruxture, PureVoltage, Hyonix, NovoServe, Voxility,
  Webzilla, Worldstream, Atman Poland, EO Data Center).

- **Education** (TLD-restricted .edu / .ac.* / .edu.* — restriction is
  itself a corroborating source): 200+ universities and research
  institutions across US, Canada, Europe, Asia, and Australia,
  including Notre Dame, Washington State, U Texas Rio Grande Valley /
  Arlington / El Paso / San Antonio / Medical Branch, McMaster, U
  Ottawa, U Calgary, U Waterloo, Memorial U Newfoundland, U Auckland,
  U Otago, TU Munich, U Cologne, Goethe Frankfurt, Ruhr-Bochum, U
  Warwick, Chalmers, Lund, Gothenburg, Luleå, Osaka, Yonsei, Kasetsart,
  Pusan, Kuwait U, Aristotle Thessaloniki, Ł Tech U, Vienna U Economics,
  several Cancer Research Centers (MSKCC, Fred Hutchinson, MD Anderson,
  Cold Spring Harbor), national research institutes (KEK, IAEA, ITRI
  Taiwan, ETRI, IPM Iran, Smithsonian, ucar, Jefferson Lab,
  CSHL, mbari, Lam Research, Andritz Hydropower, sri.com, GSI Germany,
  Max Delbrück, jhuapl).

- **Government** (.gov / .gov.* TLD-restricted, or as_name unambiguously
  names a government entity): NIST, NSF, NATO, DARPA, ITU, FAO, IAEA,
  IMF, US Centers for Disease Control, Federal Reserve, Library of
  Congress, Idaho/Chicago/King County/Pierce County/State of New York,
  Indianapolis, Tacoma, Fairfax County, Sweden's Vägverket and
  Forsakringskassan, Hessen GWDG, ANSTO Australia, South Florida
  Water Management District, Communications Research Centre Canada,
  Dataport Germany, Cenitex Victoria, EUROCONTROL.

Skipped: Cox Enterprises (multi-product parent, no clean type fit),
Tucows already added, sknt.ru already added, etc. Full triage shows
1 duplicate-skip from the apply pass.

Sortlists.py runs cleanly. All 516 type values validate against
base_reverse_dns_types.txt. No collisions.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Sean Whalen <seanthegeek@users.noreply.github.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-26 21:01:47 -04:00
Sean Whalen d6d50a45e5 Add Tier 0 to the verification triage: globally-known brand at primary domain (#734)
In the previous ASN-domain coverage sweep, the agent ran web searches
for entries like `bestbuy.com → Best Buy`, `ups.com → United Parcel
Service`, `usps.gov → US Postal Service`, `marriott.com → Marriott`,
`henkel.cn → Henkel`, `experian.com → Experian`, `jd.com → JD.com`,
`ing.com → ING`, `verisign.com → Verisign`. For each of these the
domain ↔ brand pairing is encyclopedic — same outcome a few seconds
slower.

The two-corroborating-sources rule (rule 8) was being applied
mechanically: "MMDB as_name alone is one source, must fetch a second."
But for globally-known brands at their primary domain, the brand
identity itself is the second source. Searching for confirmation that
Best Buy owns bestbuy.com is the kind of busywork the tier system
exists to avoid.

Adds Tier 0 with explicit guardrails — must be globally known
(multinational or top-tier-national, decades-old, single canonical
entity), must be the entity's primary marketing/corporate domain
(not a tracking subdomain or regional ccTLD where ownership is
non-obvious), and no recent acquisition/rebrand status in question.
Cross-references the existing parent-too-generic sub-rule and
warns against stretching to mid-size brands the agent happens to
recognize. When in doubt: drop to Tier 3 and search.

Also generalizes the section's lead from "redirect-target candidates"
to cover MMDB coverage-gap and PSL private-domain candidates — the
tier logic transfers cleanly across all three workflows. Updates the
Tier 1 description with an explicit MMDB-coverage-gap analog.

Refreshes the held-back-review split stat to 0 / 109 / 2 / 34 / 35
(Tier 0 didn't apply to that batch because every candidate was a
redirect target that needed to inherit the *source row's* existing
canonical name, not its own brand identity).

Co-authored-by: Sean Whalen <seanthegeek@users.noreply.github.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-26 19:03:45 -04:00
Sean Whalen 6926e69d01 ASN-domain coverage sweep #2: 142 new map entries (#733)
* Add 105 ASN-domain coverage-gap entries (commercial brands, universities, ISPs)

Sweeps the top ~250 unmapped as_domain entries from the IPinfo Lite
MMDB by IPv4 weight. Three buckets:

1. Globally-known commercial brands where the as_name and the
   well-established public brand identity match (UPS, Best Buy,
   Marriott, ING, Raytheon, Henkel, Experian, Tucows, Verisign,
   JD.com, Newfold Digital — alias from enduranceinternational.com,
   Hyundai Home Shopping, Qihoo 360, Kingsoft, SIAC).

2. Accredited educational institutions where the .edu / .ac.* / .edu.*
   TLD restriction is itself a corroborating source alongside the
   MMDB as_name (Texas Tech, U Wyoming, U Alaska, Western Washington,
   U Guadalajara, UNC Greensboro, Northern Arizona, U Miami, Texas
   Tech HSC, U Hong Kong + 3 sister HK universities, U Melbourne,
   JAIST, Maria Curie-Sklodowska, DoDEA, Clark County School District,
   AIST, Japan Atomic Energy Agency, Connecticut State Colleges,
   Kennesaw State, RESTENA Luxembourg, NKN India).

3. Regional ISPs / MSPs / hosters verified per-case via web search
   for two-corroborating-sources confirmation: Spectranet (Nigeria),
   Brisanet (Brazil), Hondutel (Honduras), WestCall (Russia),
   AKADO Telecom (formerly Comcor), HT Eronet (Bosnia), Trooli (UK),
   Spitfire (UK), Intermax (Netherlands), Sogetel (Quebec), Synoptek,
   Union Wireless (Wyoming), Bigleaf Networks, OzarksGo (Arkansas),
   Acantho (Hera Group, Italy), Istekki (Finland), AIS Advanced
   Wireless Network (Thailand), CSI Piemonte, Baxet Group, Verixi
   (Belgium), SBA Edge, Iron Mountain Data Centers (formerly Web
   Werks India), CITIC Telecom CPC (acquired Linx Telecommunications),
   Optus (Singtel), Tele2 Kazakhstan, Movistar (Telefónica México),
   C Spire (Mississippi), Wananchi Group (Kenya), Asiatech, Respina,
   Fanap Telecom, Sabanet, Mobinnet, Pishgaman (Iran), Power Line
   Datacenter (HK), Airtek Solutions (Venezuela), Tata Teleservices,
   ParsOnline, WorldLink Communications (Nepal), Sarenet (Spain),
   CETIN (Serbia), IPKO (Kosovo), Sure (Channel Islands), Swoop
   (Australia), Deutsche Glasfaser, ePLDT, Epic (formerly Vodafone
   Malta), Tigo Bolivia, Multipolar Technology, Silversky, YOU
   Broadband (Vodafone Idea India).

Also adds:
- Government / civic: USPS, DC, City of Toronto, City of Boston,
  Canton of Bern, Networking Tasmania, St. Joseph's Health Care
  London, Enoch Pratt Free Library.
- Logistics: UPS, JR East, Post Danmark.
- MSP: Otsuka Corporation, ANS (UK).
- IaaS: IABG Teleport.

Skipped — single-source / parking / parent-too-generic concerns:
globalcapacity.com (post-acquisition operator unclear), various
opaque AS-id-named domains, cox enterprises (multi-product
conglomerate, no clean type fit).

Sortlists.py runs cleanly. All 105 type values validate against
base_reverse_dns_types.txt. No collisions with existing map keys.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* Add 37 more ASN-domain coverage entries (Asian telcos, regional ISPs)

Continues the coverage-gap sweep. Each entry verified per-case via
web search for two-corroborating-sources confirmation
(domain-WHOIS / homepage content + MMDB as_name + an established
third-party directory like Wikipedia or industry trade press).

- Major established brands: Fujitsu (web.ad.jp), True Corporation
  (Thailand), One NZ (formerly Vodafone NZ), Partners Telecom Colombia
  (formerly WOM), Angola Telecom, Gabon Telecom (Maroc Telecom
  subsidiary), Sony Global Solutions, BEKKOAME (now part of GMO
  Internet), CS Loxinfo (now AIS), National Telecom (Thailand,
  formerly CAT Telecom).

- Regional cable / fiber operators in Japan (ZTV, Oita Cable Telecom,
  StarCat Cable Network, Community Network Center), Korea (Hyundai
  HCN, Areum Broadcasting Network), Taiwan (Peicity / TaipeiNet,
  Taiwan Optical Platform), China (Shaanxi Broadcast & TV, Qinghai
  Telecom under China Telecom umbrella, China Telecom Tianjin under
  same), Russia (Almatel, Seven Sky / Iskratelecom, Good Line /
  E-Light-Telecom in Kuzbass).

- Other regional ISPs / hosters: Orange Jordan (go.com.jo via Jordan
  Telecom Group), FASTtelco (Kuwait), Cyberzone (Panama-based hosting),
  Moselle Télécom (French regional), Africa on Cloud (South African
  IaaS), Computer Engineering & Consulting (CEC, Japan MSP),
  Macquarie Government (Australian sovereign data centers),
  Meteverse (Canadian/Korean edge cloud), Ningxia West Cloud Data
  (operator of AWS China Ningxia region), 21Vianet (Chinese hosting),
  China Broadcasting Network, China Networks Inter-Exchange (CNIX).

- Education: MANDA Darmstadt (TU Darmstadt + Hochschule Darmstadt
  shared MAN).

Skipped — single source / ambiguous: globalcapacity.com (post-GTT-
acquisition operator unclear), abcle.co.kr (single source, type
unclear), dr.com.tr (Andromeda TV connection couldn't be confirmed).

Sortlists.py runs cleanly. All type values validate against
base_reverse_dns_types.txt. No collisions.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Sean Whalen <seanthegeek@users.noreply.github.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-26 18:53:48 -04:00
Sean Whalen e8f1525757 Full-map redirect-target alias sweep (#732)
* Full-map redirect-target alias sweep: 146 new aliases

Follow-up to PR #730 — runs the same redirect-target-alias analysis
against the entire current map (5,509 rows) instead of only the rows
added in PR #729. The map predates this session by several years, so
acquisitions and rebrands accumulated without paired aliases.

Method: re-ran collect_domain_info.py against every existing map entry
(via --map /tmp/nonexistent.csv to bypass the skip-already-mapped
filter). For each row whose homepage's final_url base differs from the
domain, classified the redirect target as a same-operator alias or a
sister/placeholder/etTLD that should be skipped.

Three confidence tiers from 334 raw redirect-mismatch candidates:
- Multi-source (>=2 mapped domains redirect to the same target):
  20 aliases, all auto-included. Notable: hatena.blog (6 src — Hatena
  blog platform's brand consolidation), vercel.com (4 src — now.sh,
  vercel.app, vercel.dev), mailchimp.com (3 src — Mailchimp's tracking
  domains), liquid.tech (3 src — Liquid Intelligent Technologies after
  Neotel acquisition), supabase.com, streamlit.io (Snowflake), xfinity
  .com (Comcast).
- Single-source with lexical-token overlap between source brand and
  target host: 128 aliases. These are TLD/subdomain variants (ais.co
  .th -> ais.th, neubox.net -> neubox.com, duck.com -> duckduckgo.com)
  and obvious near-rebrands (slic.com -> slicfiber.com, soverin.net ->
  soverin.com).
- Single-source with no token overlap: 180 candidates. Held back from
  auto-promotion because token-mismatched single-source redirects are
  the bucket where false positives concentrate (small-operator pages
  redirecting to unrelated portals). Surfaced separately in a PR
  comment for hand review — many are real acquisitions (messagelabs
  .com -> broadcom.com, cincinnatibell.com -> altafiber.com,
  sparkpostmail.com -> bird.com, modis.com -> akkodis.com) that just
  need a maintainer's eye to confirm before mapping.

Manual overrides for 5 multi-source cases where the heuristic picked
the wrong source row's (name, type):
- ziggo.nl: chello.sk's UPC redirect was the case-2 sister-brand
  pattern AGENTS.md step 6 already calls out; the legitimate source
  is ziggozakelijk.nl. Mapped to Ziggo, ISP.
- zetaglobal.com: source rows pointed at Sailthru and Selligent (both
  acquired by Zeta Global). Canonical -> Zeta Global, Marketing.
- crisis24.com: source rows pointed at One Call Now and Topo.ai
  (both acquired by Crisis24). Canonical -> Crisis24, SaaS.
- directnic.com: heuristic picked "Directnic.com" from one source's
  name string; aligned to "Directnic" (matches the dnchosting.com
  source's convention).
- fortinet.com: source rows pointed at Fortinet FortiMail product and
  Perception Point (Fortinet acquisition). Canonical -> Fortinet,
  Email Security (parent brand).

Two false positives skipped from auto-promotion after sampling:
- aichi-colony.jp -> aichi.jp: a healthcare operator's homepage
  redirected to the Aichi prefecture government portal — different
  operator (case-2 sister-host equivalent).
- illinois.net -> illinois.gov: Illinois Century Network (academic)
  is not the State of Illinois government.

Cumulative map size: 5,509 -> 5,655 rows. MMDB IPv4 coverage stays at
~90.47% (these aliases are mostly non-as_domain hosts, so they don't
move the IPv4 metric — the win is PTR-side attribution coverage when
DMARC reports cite the redirect target's domain).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* Hand-review of held-back single-source aliases

Adds 143 aliases from the held-back single-source-no-token-overlap
list and updates 25 source rows to the post-rebrand brand name so
both the source and alias rows resolve to the same canonical brand.

Verification per case via public sources (acquisition press releases,
rebrand announcements, official corporate documentation). Cases where
the redirect target is a generic parent-company domain spanning many
products were skipped — broadcom.com being the explicit exception
where the alias uses the full product name "Broadcom Enterprise
Messaging Security" so DMARC reports tagged with broadcom.com still
land in the email-security bucket rather than overwriting other
Broadcom product lines. Suspicious targets (parking pages,
country-level TLDs, unrelated brands) were also skipped.

Source-row name updates capture rebrands where the legacy brand no
longer operates as such (Endurance International → Newfold Digital,
Symantec Email Security → Broadcom Enterprise Messaging Security,
Platform.sh → Upsun, Uninett → Sikt, SparkPost → Bird, etc.) and
fix three typos uncovered during review (Goranicus → Granicus,
Servastopol → Sevastopol, Wally-Wide → Valley-Wide).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* Document parent-company-too-generic alias guidance; rename SendGrid to "Twilio SendGrid"

Two related changes:

1. Rename the canonical name on `sendgrid.com` from `SendGrid (Twilio)`
   to `Twilio SendGrid` for consistency with the existing `sendgrid.net`
   and `dlivry.co` entries — the post-acquisition official product
   name.

2. Add `twilio.com,Twilio,SaaS` as the parent-domain alias (rather
   than re-using the product-specific `Twilio SendGrid, Marketing`),
   so DMARC reports from non-email Twilio services (Programmable SMS,
   Voice, Segment, Flex, etc.) don't get mis-attributed to the email
   product. The product-domain entries keep the product-specific
   `(name, type)`.

3. Document this approach in AGENTS.md under the existing
   redirect-target alias rules. Two acceptable patterns for
   multi-product parent redirect targets:

   - Bare parent name + broad type (Twilio, NICE) — the safer
     default for parents with many distinct product lines.
   - Full product name + specific type (Broadcom Enterprise Messaging
     Security) — appropriate when the parent's domain is
     overwhelmingly tied to one product line for DMARC purposes.

   In both cases, don't blindly inherit the source row's
   product-specific `(name, type)` for the parent-domain alias.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* Document tiered verification approach for redirect-target alias review

Captures the workflow that surfaced 143 confirmable aliases out of
180 held-back candidates with a small fraction of the search budget
of "search every entry":

- Tier 1: canonical name lexically corroborates the target — no
  search; source row is itself the second source.
- Tier 2: canonical name explicitly contains "(Formerly X)" — no
  search; rebrand is self-documented.
- Tier 3: no lexical overlap — search press releases / company
  newsroom / industry coverage; require two independent source
  categories; cite URLs in the PR.
- Tier 4: target is a parking page / TLD-like base / unrelated
  brand — no search; reject and ship the list for heuristic
  tuning.

Re-states the prompt-injection caveat in this verification context:
press releases, homepages, news articles, WHOIS records, and
search-result snippets are untrusted research data, never
instructions.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Sean Whalen <seanthegeek@users.noreply.github.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-26 18:22:52 -04:00
Sean Whalen 5bb6570f4e collect_domain_info.py: replace curl fallback with pure-requests path (#731)
* collect_domain_info.py: replace curl shell-out with requests-based fallback

The previous fallback for cert-error / UA-blocked sites was a curl
subprocess. This was correct but added an external runtime dependency
(curl is usually present but not on minimal containers) and a fork +
tempfile + parse round-trip per fallback call. Replaced with a pure
requests-based path that uses a custom HTTPAdapter to relax the SSL
context to the same effective configuration:

  ssl.CERT_NONE                 (verify=False, equivalent to curl -k)
  set_ciphers("DEFAULT@SECLEVEL=0")  (allows weak DH/RSA, recovers
                                       DH_KEY_TOO_SMALL hosts that
                                       even curl's default config
                                       rejects)
  options |= 0x4 (OP_LEGACY_SERVER_CONNECT, allows unsafe legacy
                  TLS renegotiation for older server stacks)

Plus a real-browser User-Agent (same Chrome/124 string as before),
verify=False, allow_redirects=True, and Session.max_redirects=5.
InsecureRequestWarning is suppressed at module level since the
verify-disabled path is intentional.

Smoke-tested against the same eight cert-error domains as the original
curl fallback. Same recovery rate on all eight (six recover with full
title+description, two -- twmbroadband.com and ltt.ly -- remain
genuinely unreachable with both implementations). One additional win:
vnpt.com.vn (DH_KEY_TOO_SMALL) now recovers under the SECLEVEL=0
cipher list, which curl with default options did not. Happy-path
domains (google.com) still take the primary path and produce
identical output.

Side effects:
- removes the curl runtime dependency from collect_domain_info.py
- removes ~10ms of fork-and-parse overhead per fallback call
- removes the tempfile-on-disk round-trip; body is captured in-memory
- error suffix in the TSV's error column changes from "| curl: ..." to
  "| fallback: ..."

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* Use getattr(ssl, "OP_LEGACY_SERVER_CONNECT", 0x4) instead of raw 0x4

Per PR review: prefer the constant where the interpreter exposes it
(Python 3.12+) and fall back to the raw value (0x4) only on older
interpreters that the project still supports. Self-documenting and
future-proof against any unlikely stdlib value reshuffle.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Sean Whalen <seanthegeek@users.noreply.github.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-26 16:34:57 -04:00
Sean Whalen ec2db7238e Map aliases for redirect targets + CC BY-SA 4.0 attribution (#730)
* README: declare base_reverse_dns_map.csv under CC BY-SA 4.0

The map is now a curated derivative of the bundled IPinfo Lite MMDB
(as_domain / as_name fields, walked for unmapped operators and
classified via the workflow in AGENTS.md). IPinfo Lite is licensed
under Creative Commons Attribution-ShareAlike 4.0, which propagates
to derivative works, so the CSV is distributed under CC BY-SA 4.0
with attribution to IPinfo for the underlying network identification
data.

Also updates the file-size estimate in the README from "over 1,400"
to "over 5,000" to reflect the current state.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* Alias redirect targets into the map and codify the practice in AGENTS.md

When a domain's homepage redirects to a different host *for the same
operator* (acquisition target's site, or a TLD/subdomain variant), PTR
reverse-DNS reports observed in the wild may reference either domain.
Mapping only the original loses attribution for the redirect target.

Adds 91 aliases discovered during the previous bulk PR's classification
work — every redirect target where the original was newly mapped, the
target wasn't already in the map, and the target was the same operator
(not a sister brand and not a placeholder/bot/parking page). Notable
examples: apogee.us + boldyn.com both -> Boldyn ISP; sungardas.com +
1111systems.com both -> 11:11 Systems MSP; vodafone.is + syn.is both
-> Sýn ISP; sendinblue.com + brevo.com both -> Brevo (Sendinblue)
Marketing; tigo.com + millicom.com both -> Tigo ISP; rockwellcollins.com
+ collinsaerospace.com both -> Collins Aerospace Defense.

Codifies the alias-target practice as a new paragraph under AGENTS.md
step 6 (the homepage-redirect disambiguation rule). Key guardrails:
- Alias only for case 1 (acquisition) and case 3 (TLD variant). Do
  NOT alias for case 2 (sister brand / shared infra) -- aliasing the
  redirect target there mis-attributes the redirect target's email.
  Cited example: do not alias ziggo.nl to UPC after the chello.sk fix.
- Skip generic-placeholder, bot-management, and TLD/eTLD redirect
  targets (example.com, perfdrive.com, umbler.com, co.uk, com.br...).
- When in doubt, drop the alias rather than commit it. A missing alias
  is recoverable; a wrong one mis-attributes mail.

Also fixes four canonical-naming inconsistencies surfaced during the
brand-mismatch sweep, aligning recent additions to pre-existing entries:
- ga.gov: "Georgia Government" -> "State of Georgia" (matches existing
  georgia.gov)
- goco.ca, radiant.net: "Telus" -> "TELUS" (matches existing telus.com)
- vee.com.tw: "VeeTime" -> "VeeTIME" (matches existing veetime.com)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* Promote 21 inbound-redirect aliases from KU to map

Sweeping the session's collector TSVs for the inverse pattern of the
91 outbound aliases in commit ddf962e: domains that stayed in
known-unknown this session but whose homepage final_url redirected to
an entry that's now in the map. These are acquisitions and TLD/
subdomain variants where the operator can be inferred from the
redirect-target's existing mapping.

Notable acquisitions surfaced:
- nitelusa.com -> Comcast (NITEL was acquired by Comcast Business)
- level3.net -> Lumen (Level 3 rebranded)
- novis.pt -> NOS (Novis acquired by NOS Portugal)
- oxfordnetworks.net -> FirstLight Fiber (acquisition)
- saunalahti.fi -> Elisa (acquisition)
- omnicity.net, wcoil.com -> Watch Communications (acquisitions)
- servercentral.net -> Summit (acquisition)

TLD / subdomain variants:
- as29550.net (Simply Transit ASN domain) -> Simply Transit
- asahi-net.or.jp -> ASAHI Net (.jp variant)
- cyber-folks.pl -> cyber_Folks (cyberfolks.pl)
- digicelsr.com -> Digicel (Suriname variant)
- edpnet.net -> EDPnet (.be variant)
- la.net.ua -> Lanet
- pair.net -> Pair Networks (pair.com)
- twlakes.net -> Twin Lakes Communications
- megamailservers.eu -> MegaMailServers (.com variant)

Cloudflare email/SMTP family:
- cloudflare-email.org, cloudflare-smtp.com/.net/.org -> Cloudflare,
  Email Security (matches cloudflare-email.com/.net, distinct from
  the bare cloudflare.com/.net which use SaaS)

Of 32 redirect-to-mapped hits in the session TSVs, 21 cleared the
same-operator bar. The other 11 were excluded as case-2-equivalent
redirects (homepage hosted on Google/Wordpress/Aruba), registrar
parking pages (Dynadot), or ambiguous brand relationships requiring
research beyond what the redirect alone could justify (frontiernet.net
-> yahoo.com from Frontier's 2017 email-services migration to Yahoo,
dido.com -> socket.net, evo.uz -> tps.uz, ncport.ru -> avantel.ru).
Those are flagged in the PR comment for follow-up review.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* AGENTS.md: document the inbound redirect-target alias sweep

After a batch lands, the same collector TSVs that drove the original
classifications are also the input to a free secondary pass: KU
domains whose final_url redirects to a host that's now mapped are
typically the inbound mirror of the outbound alias rule (step 6).
Each such pair is an acquisition or TLD/subdomain variant where the
operator is inferable from the redirect-target's existing mapping.

Adds a new bullet to "After a batch merge" describing the sweep and
the same case-2 exclusion list as the outbound rule (sister-brand,
generic hosting platform, bot-management proxy). Notes that the
sweep routinely surfaces 5-15% of the prior batch's KU additions as
legitimate map promotions, citing the actual examples that landed in
this PR (nitelusa.com -> Comcast, level3.net -> Lumen,
saunalahti.fi -> Elisa, oxfordnetworks.net -> FirstLight Fiber,
asahi-net.or.jp -> ASAHI Net, etc.).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Sean Whalen <seanthegeek@users.noreply.github.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-26 16:14:07 -04:00
Sean Whalen 851560a9b1 Bulk reverse-DNS map coverage: top-500 ASN audit + KU re-research + curl fallback (#729)
* collect_domain_info.py: add curl fallback for blocked/broken fetches

Many sites that returned no usable homepage under the original requests
fetch turned out to be soft-failures: misconfigured TLS certs (self-signed,
hostname mismatch, weak chain), 403/captcha pages from User-Agent-based
bot filters, or redirect chains the requests stack rejected. None of those
recover under a single retry with the same client config.

This wires a curl fallback into _fetch_homepage that triggers when the
primary attempt errors or returns a non-2xx status. Curl runs with
-k (skip TLS verify), -L (follow redirects), --max-time bound, and a
real-browser User-Agent string -- enough to clear the common UA-block
and bad-cert classes of failure that small ISPs and regional telcos
routinely ship. A 2xx-with-empty-head response is left alone (parked
pages do not improve on retry). When both attempts fail, the error
column carries both signatures so it is obvious that the fallback was
tried.

Smoke-tested against eight previously-failed cert-error domains: six
recovered full title/description (as1101.net, citictel-cpc.com,
xtrim.com.ec, etecsa.cu, zillion.network, sandia.gov), two remained
genuinely unreachable. Happy-path domains take the primary path
unchanged.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* Bulk reverse-DNS map coverage: top-500 ASN audit + KU re-research

Two passes against the bundled IPinfo Lite MMDB and the existing
known-unknown list, both classified under the two-corroborating-sources
rule (AGENTS.md):

1. Top-500 unmapped ASN-domain audit. Walked every record in
   ipinfo_lite.mmdb to find as_domain values not yet in the map,
   ranked by routed IPv4 count, took the top 500 (>= ~/15 each), and
   ran them through collect_domain_info.py. Yield: 435 new map rows
   from operators with two or more independent corroborating sources;
   65 entries to known-unknown for operators where homepage and WHOIS
   were both unavailable from the test environment. Recovered domains
   span ISPs, web hosts, IaaS/MSP/MSSP, education networks, government
   agencies, and a long tail of major industrials.

2. Full re-research of the existing 3,606-entry known-unknown file
   using the new curl fallback (separate commit). The fallback
   recovered homepage content for 1,686 of 3,670 (45.9%) previously
   dark domains. Of those, 770 had a corroborating WHOIS or as_name
   alongside; 508 cleared the strict service-category test and were
   promoted out of known-unknown into the map. The remaining 262
   recovered titles were brand-only / login-portal / under-construction
   pages where service category could not be assigned with confidence.

Also removed a stale "#name?" Excel auto-correction artifact from the
known-unknown file (it would never have matched any real reverse-DNS
base domain).

Cumulative result: base_reverse_dns_map.csv 3,946 -> 4,889 rows
(+943, +23.9%); known_unknown_base_reverse_dns.txt 3,606 -> 3,162
(-444 net after both batches plus the artifact). Every promotion has
two independent sources for the operator's identity and a homepage or
MMDB-as_name signal sufficient to assign a service type.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* Fix chello.sk classification: UPC, not Liberty Global

The original classification aliased chello.sk to "Liberty Global" based
on the IP-WHOIS netname (LGI-INFRASTRUCTURE) plus a stale homepage
redirect to ziggo.nl that the collector observed at fetch time. This
broke the AGENTS.md rule that IP-WHOIS only counts as a corroborating
source when the domain name matches the netname -- "chello" does not
match "LGI", so the IP-WHOIS should not have been treated as a source.

The WHOIS was unambiguous: UPC BROADBAND SLOVAKIA, s.r.o. UPC retains
its consumer brand in Slovakia (unlike Ireland, where upc.ie was
rebranded as Virgin Media Ireland in the existing map). Reverting to
the operator brand per WHOIS.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* Fix vodafone.is classification: Sýn, not Vodafone

Same pattern as the chello.sk fix in the previous commit: the historic
brand recorded in the MMDB as_name (Vodafone Iceland) is no longer the
operator. Sýn acquired Vodafone Iceland's operations and the homepage
redirects to syn.is, presenting Vodafone only as a partner relationship
rather than an active sub-brand. Following the upc.ie -> Virgin Media
Ireland precedent for rebranded markets, the canonical attribution is
the current operator.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* AGENTS.md: codify the homepage-redirect disambiguation rule

Three classification mistakes during the bulk batch (chello.sk,
vodafone.is, telia.dk, apogee.us) all came from the same gap in the
workflow: when a homepage's final URL is a different host from the
domain being classified, the right brand depends on the *relationship*
between the two domains, not on the WHOIS or as_name in isolation.

Adds a new step 6 to the unknown-domain classification workflow that
spells out the three patterns and the disambiguator:

- Acquisition / rebrand: the homepage shows the acquiring operator's
  marketing site. Use the new operator. MMDB as_name and IP-WHOIS
  netname are commonly stale for years post-acquisition; do not let
  them override an unambiguous current-operator homepage.
- Sister brand / shared infrastructure: the homepage redirects to a
  *sibling* brand under the same parent group, but the WHOIS for the
  original domain still names a *specific* current operator. Use the
  WHOIS operator, not the redirect target. Canonical cautionary tale:
  chello.sk (WHOIS: UPC BROADBAND SLOVAKIA) was originally classified
  as Liberty Global because the homepage redirected to ziggo.nl (a
  sibling Liberty Global brand). The right answer was UPC.
- TLD or subdomain variant: same operator, different domain. Trivial.

Renumbers the remaining steps. The IP-WHOIS rule (step 5) and the
two-source rule (now step 8) are unchanged but cross-referenced.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* Apply homepage-redirect rule to telia.dk and apogee.us

Same pattern as chello.sk and vodafone.is in earlier commits — the
historic operator name in the MMDB as_name and WHOIS does not reflect
who actually runs the IPs after an acquisition. The homepage redirect
is the current ground truth.

- telia.dk -> Norlys: Norlys acquired Telia Denmark; homepage now
  redirects to shop.norlys.dk and presents Norlys throughout.
- apogee.us -> Boldyn: Boldyn acquired Apogee Telecom; homepage now
  redirects to boldyn.com and shows the Boldyn marketing site for
  higher-education managed services.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* Bulk reverse-DNS map coverage: next-500 unmapped ASN-domain audit

Same workflow as the first top-500 batch in this branch, applied to
the next tier of unmapped MMDB as_domain values (ranked 501..1000 by
routed IPv4 count, each ~/15 to /14.5). Pre-screened against the
current state of base_reverse_dns_map.csv and
known_unknown_base_reverse_dns.txt.

Yield: 414 newly-classified map entries + 86 known-unknown additions.
Type breakdown skews ISP-heavy as expected at this scale, with strong
representation from Education (universities now reaching deeper into
the long tail), Government (state/county/national agencies), Web Host
(regional hosting providers), and IaaS (mid-market cloud).

Applied AGENTS.md step 6 (homepage-redirect disambiguation) on every
case where the homepage's final_url crossed hosts: kept new operator
when the redirect target was an acquiring brand (e.g. atlanticmetro.net
-> 365 Data Centers, performive.com -> CloudFirst, fasternet.com.br ->
Desktop, eatel.com -> REV, blic.net -> Supernova, dimensiondata.com ->
NTT DATA, virtela.net -> NTT Communications), used WHOIS operator when
the redirect was sister-brand or shared infra, used the same operator
when the redirect was a TLD/subdomain variant.

Coverage delta: 88.89% -> 90.40% of MMDB IPv4 (+1.51 pp, ~47M IPv4).
Cumulative for this PR: 85.10% -> 90.40% (+5.30 pp, ~165M IPv4).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* Reclassify the 262 left-dark KU re-research candidates with relaxed heuristic

Of the 770 two-source candidates from the curl-fallback KU re-research
pass earlier in this branch, 262 had homepage content and a corroborating
WHOIS/as_name but were left in known-unknown because the homepage was
brand-only or a login portal that didn't directly describe service
category.

Relaxing the heuristic on a re-pass: when the WHOIS legal name itself
contains a regulated-telecom keyword (TELECOM, TELECOMUNICAÇÕES,
INTERNET, FIBRA, BROADBAND, PROVEDOR DE INTERNET, NET TELECOM), that
*is* a service-category source -- in Brazil, Argentina, Chile, and
peers, operators must register under specific legal naming and the
registration is a regulator-vetted signal. Combined with two-source
identity, that clears the bar without forcing the homepage to also
spell out the service.

Same goes for brand-name-as-service signals: "X Server Limited" with a
customer-portal homepage and matching WHOIS reasonably maps to Web Host;
"X Fiber" + matching as_name maps to ISP. These are what readers would
naturally infer from the operator's own self-naming.

Yield: 95 promotions out of 262 (36% of the left-dark subset). The
remaining 167 stay in known-unknown because the homepage was a generic
placeholder ("Index of /", "Coming Soon", default Apache page), the
brand on the homepage didn't match the WHOIS, the operator was clearly
a non-telecom (e.g. INPASUPRI = supplies for IT, malugainfor =
Comércio de Produtos de Informática, hugel = pharma), or the service
category was genuinely ambiguous.

MMDB IPv4 delta is small (+0.03 pp, +888K IPv4) since most of these are
long-tail operators with low or zero MMDB footprint -- the value is in
PTR-side attribution coverage when these brands appear in actual
reverse-DNS reports.

Cumulative for this PR: map 4,889 -> 5,398 rows; KU 3,162 -> 3,153 lines;
MMDB IPv4 coverage 88.89% -> 90.42% (+1.53 pp from the next-500 batch
plus this re-pass).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Sean Whalen <seanthegeek@users.noreply.github.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-26 15:15:32 -04:00
Sean Whalen b3a608735f Revise classification guidelines to enforce two-corroborating-sources rule and clarify handling of unidentified domains 2026-04-26 12:10:56 -04:00
Sean Whalen d04eb89035 Clarify handling of TLS errors and user network issues in classification guidelines 2026-04-26 11:56:14 -04:00
Sean Whalen 55a1e79066 Add kamatera.com entry to base_reverse_dns_map 2026-04-26 11:43:19 -04:00
Sean Whalen c87aa3de08 fixes the incomplete changing of the headers in the SMTP TLS Reporting dashboard visualizations to match the rest of the project (lowercase words separated by _ 2026-04-25 19:17:28 -04:00
Sean Whalen 85554c2344 OpenSearch Dashboards: Restructure SMTP TLS dashboard to match Splunk layout (#728)
The bundled `splunk/smtp_tls_dashboard.xml` is three tables — Reporting
organizations, Domains, Failure details — sharing the same TLS-RPT data.
The OSD dashboard had drifted into five panels (two pies + three tables)
that didn't line up with what the Splunk one shows. Replace them with
three `data_table` viz mirroring the Splunk layout.

Each table uses sum-only metric aggs (no count column) on the per-policy
or per-failure-detail session-count fields. OSD's Visualize agg pipeline
auto-wraps each terms/sum on a `policies.*` or `policies.failure_details.*`
field in the right `nested:{path: …}` agg, so per-policy and per-detail
totals come out correctly without any schema or write-path changes.

Reuse the existing IDs of the three drop-in replacements so re-importing
overwrites in place:
- 4f3b4cb0… (was "TLSRPT reporting organizations") → "Reporting organizations"
- eeb47eb0… (was "TLSRPT policies by domain") → "Domains"
- 5cbcd040… (was "SMTP TLS failures") → "Failure details"

The two pie-chart viz removed by this change have no equivalent in the
new layout. Upgraders will need to delete the orphans manually from OSD's
Saved Objects management page:
- 25f321e0-26d0-11f1-96a6-fb3734bd0b21 ("SMTP TLS sessions")
- 12065020-26d1-11f1-96a6-fb3734bd0b21 ("TLSRPT policies")

Co-authored-by: Sean Whalen <seanthegeek@users.noreply.github.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-25 16:14:35 -04:00
Sean Whalen 342b467590 Mark maildir messages as read after they are read (#726)
MaildirConnection.fetch_message() previously returned the message body
without touching the on-disk file, so messages stayed in new/ with no
"S" (Seen) flag and any MUA scanning the same maildir kept showing them
as unread. The call site now passes mark_read=not test (mirroring the
existing MSGraphConnection plumbing); on True, the message is moved to
cur/ and gains the S flag. Test mode leaves the maildir unmodified.

Co-authored-by: Sean Whalen <seanthegeek@users.noreply.github.com>
9.10.2
2026-04-24 19:16:42 -04:00
Sean Whalen adf36ca6a3 Add bluevps.com entry to base_reverse_dns_map 2026-04-24 12:03:52 -04:00
Sean Whalen 81a0d4ce56 Add additional entries for 3z.net and 3zden.cloud to base_reverse_dns_map 2026-04-24 11:38:31 -04:00
Sean Whalen a4a2155ab0 OpenSearch Dashboards: Show rows in the Message sources by Autonomous System viz even if some fields are missing 2026-04-23 22:38:10 -04:00
Sean Whalen 168244af95 Add Message sources by Autonomous System to Opensearch Dashboards (#725)
Co-authored-by: Sean Whalen <seanthegeek@users.noreply.github.com>
2026-04-23 19:22:03 -04:00
Sean Whalen c989f27983 Add six base_reverse_dns_map entries from MMDB coverage-gap analysis (#722)
* Cover ASN-fallback path for the Evolus operator family

Only evolus-ix.com (the Internet Exchange product) was in the map,
so ASN-fallback lookups for IPs without PTR fell through to the raw
as_name string with no service type. The bundled IPinfo Lite MMDB
stores the same operator's blocks under two other as_domain values:

- evolus-it.com (the corporate domain, Evolus IT Solutions GmbH)
- evolusfibre.com (their consumer fiber ISP brand)

Both resolve to as_name "Evolus IT Solutions GmbH" in the MMDB,
confirming they're the same operator. WHOIS on evolus-it.com and
the evolusfibre.com homepage both pin the company to Austria. Added
both as aliases pointing at the existing (Evolus IX, ISP) entry so
all three product brands cluster under one display name, matching
the comcast.net / comcast.com pattern documented in AGENTS.md.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* Add aliases for centrilogic, 1gservers, etherni, globconnex

Four additional ASN-domain aliases discovered via coverage-gap
analysis against the bundled IPinfo Lite MMDB. None of the four
brands are currently represented in the map under any key, so these
are new brand entries (not alias-of-existing).

- centrilogic.com → Centrilogic, MSP
  82 MMDB nets, ~62K IPv4. Homepage describes the company as an
  "end-to-end I.T. transformation" managed-services provider.
- 1gservers.com → 1GServers, Web Host
  117 nets, ~23K IPv4. Homepage: bare-metal dedicated servers and
  Phoenix colocation.
- etherni.com → Ethernic, MSP
  2 nets, 768 IPv4. Homepage: cloud-migration / cloud-native
  consulting. Operates its own small ASN under Ethernic LLC.
- globconnex.com → Global Connectivity Solutions, ISP
  687 nets, ~63K IPv4. Homepage unreachable (self-signed cert); WHOIS
  privacy-redacted. Classification is inferred from the MMDB as_name
  "GLOBAL CONNECTIVITY SOLUTIONS LLP" and the routed-network scale.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Sean Whalen <seanthegeek@users.noreply.github.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-23 17:59:21 -04:00
Sean Whalen 15cf8f55b7 Skip caching weak-fallback IP attributions (#723)
get_reverse_dns() swallows every DNSException as None, so a transient
PTR lookup failure (timeout, SERVFAIL, socket error) is
indistinguishable from a genuine no-PTR case. When that lands on the
raw-as_name fallback branch (no map match for the ASN domain either),
the weak result was getting cached in the 4-hour IP-info cache —
locking in the misattribution even after the PTR became resolvable.

Observed in the wild: 91.244.70.212 has PTR customer.evolus-ix.com
(which the map correctly classifies as Evolus IX, ISP), but the
user's dataset showed it with source_name = raw as_name and
source_type = null — the signature of a transient PTR lookup
failure that then got cached.

Fix: skip the cache write when the row is in that specific
weak-fallback state (reverse_dns=None AND type=None AND
name=as_name). PTR-backed matches and ASN-domain matches are stable
attributions and continue to be cached as before.

Co-authored-by: Sean Whalen <seanthegeek@users.noreply.github.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-23 17:25:56 -04:00