Commit Graph

7 Commits

Author SHA1 Message Date
Sean Whalen f0781c6191 IPinfo API: keep only documented behavior (#721)
* Strip invented IPinfo API behavior; keep documented-only

The IPinfo Lite API docs (https://ipinfo.io/developers/lite-api) state:
"The API has no daily or monthly limit and provides unlimited access."
Auth is documented as a ?token= query param only. The /me endpoint shown
in the docs returns geolocation for the caller's IP; it is not a
documented account/quota endpoint for Lite.

Removed everything that was speculating beyond the docs:

- The /me probe that pretended to return plan/limit/remaining fields.
- 429 rate-limit handling, 402 quota-exhausted handling, Retry-After
  parsing, cooldown state, and the rate-limit warning / recovery-info
  logging around them.
- The Authorization: Bearer header (not documented for Lite).

Kept:

- Lookups against the documented /lite/<ip>?token=<token> endpoint.
- 401/403 treated as a fatal invalid-token (reasonable defensive check).
- Network-error and non-2xx fallback to the bundled/cached MMDB.
- A simple startup probe that validates the token with a single lookup
  and logs "IPinfo API configured" at info level.
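
The retained behavior can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the base URL and the InvalidIPinfoAPIKey name come from the #717 message, while build_lookup_url and handle_status are hypothetical helper names invented for this example.

```python
from urllib.parse import quote, urlencode

# Base URL as quoted in the #717 commit message.
API_BASE = "https://api.ipinfo.io/lite"


class InvalidIPinfoAPIKey(Exception):
    """Raised when the API rejects the token (HTTP 401 or 403)."""


def build_lookup_url(ip: str, token: str) -> str:
    # Hypothetical helper: auth travels in the ?token= query parameter,
    # not in an Authorization header.
    return f"{API_BASE}/{quote(ip, safe=':.')}?{urlencode({'token': token})}"


def handle_status(status: int) -> str:
    """Hypothetical helper: return "use_api" or "fallback"; raise on a bad token."""
    if status in (401, 403):
        raise InvalidIPinfoAPIKey(f"IPinfo API rejected the token (HTTP {status})")
    if 200 <= status < 300:
        return "use_api"
    # Any other non-2xx status falls through to the bundled/cached MMDB.
    return "fallback"
```

Network errors would take the same "fallback" branch; they are left out here because they surface as exceptions from the HTTP client rather than as status codes.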

Test consolidated to cover only documented paths: success, 401 fatal,
non-2xx fallback, and that auth goes in ?token= (not Authorization).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* AGENTS.md: warn against speculating past external-service docs

New subsection under Configuration spelling out that third-party API
integrations must start with a direct WebFetch of the canonical docs
page, not a subagent query. Calls out the two traps that produced the
IPinfo speculation: (1) asking subagents question shapes that
presuppose the answer exists, and (2) treating feature asks as "build
this" without first checking "does this apply to this service?".

Uses the now-reverted IPinfo speculation as the cautionary tale so the
next session has a concrete example to recognize the shape of the
mistake.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* Bump to 9.10.1; put removal under a new CHANGELOG section

Restored the 9.10.0 entry to its as-shipped wording and moved the
speculation-removal note into its own 9.10.1 Fixed section.
Editing the 9.10.0 entry would have misrepresented what was
actually released — the shipped tag does contain the /me probe,
429/402 cooldown, Retry-After parsing, and Bearer auth, and the
changelog should say so.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Sean Whalen <seanthegeek@users.noreply.github.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-23 11:51:44 -04:00
Sean Whalen c5f432c460 Add optional IPinfo Lite REST API with MMDB fallback (#717)
* Add optional IPinfo Lite REST API with MMDB fallback

Configure [general] ipinfo_api_token (or PARSEDMARC_GENERAL_IPINFO_API_TOKEN)
and every IP lookup hits https://api.ipinfo.io/lite/<ip> first for fresh
country + ASN data. On HTTP 429 (rate-limit) or 402 (quota), the API is
disabled for the rest of the run and lookups fall through to the bundled /
cached MMDB; transient network errors fall through per-request without
disabling the API. An invalid token (401/403) raises InvalidIPinfoAPIKey,
which the CLI catches and exits fatally — including at startup via a probe
lookup so operators notice misconfiguration immediately. Added
ipinfo_api_url as a base-URL override for mirrors or proxies.

The API token is never logged. A new _normalize_ip_record() helper is
shared between the API path and the MMDB path so both paths produce the
same normalized shape (country code, asn int, asn_name, asn_domain).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* IPinfo API: cool down and retry instead of permanent disable

Previously a single 429 or 402 disabled the API for the whole run. Now
each event sets a cooldown (using Retry-After when present, defaulting to
5 minutes for rate limits and 1 hour for quota exhaustion). Once the
cooldown expires the next lookup retries; a successful retry logs
"IPinfo API recovered" once at info level so operators can see service
came back. Repeat rate-limit responses after the first event stay at
debug to avoid log spam.
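
For readers of the history, the cooldown logic described above might look something like the sketch below. It is an illustration under assumptions, not the shipped code (which #721 later removed): the Cooldown class and its method names are hypothetical, and only the delta-seconds form of Retry-After is handled.

```python
import time
from typing import Optional

# Default cooldowns in seconds, as stated in the commit message:
# 5 minutes for rate limits (429), 1 hour for quota exhaustion (402).
DEFAULT_COOLDOWN = {429: 5 * 60, 402: 60 * 60}


class Cooldown:
    """Hypothetical cooldown tracker; the clock is injectable for testing."""

    def __init__(self, clock=time.monotonic):
        self._clock = clock
        self._until = 0.0
        self._recovering = False

    def start(self, status: int, retry_after: Optional[str] = None) -> float:
        seconds = DEFAULT_COOLDOWN[status]
        # Assumption: only an integer Retry-After is honored; an HTTP-date
        # value falls back to the default.
        if retry_after is not None and retry_after.strip().isdigit():
            seconds = int(retry_after.strip())
        self._until = self._clock() + seconds
        self._recovering = True
        return seconds

    def active(self) -> bool:
        # While active, lookups skip the API and go straight to the MMDB.
        return self._clock() < self._until

    def note_success(self) -> bool:
        """Return True exactly once after a cooldown so recovery is logged once."""
        recovered = self._recovering
        self._recovering = False
        return recovered
```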

Test now targets parsedmarc.log (the actual emitting logger) instead of
the parsedmarc parent — cli._main() sets the child's level to ERROR,
and assertLogs on the parent can't see warnings filtered before
propagation. Test also exercises the cooldown-then-recovery path.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* IPinfo API: log plan and quota from /me at startup

Configure-time probe now hits https://ipinfo.io/me first. That endpoint
is documented as quota-free, so we use it both to validate the token and
to surface plan / month-to-date usage / remaining-quota numbers at info
level:

  IPinfo API configured — plan: Lite, usage: 12345/50000 this month, 37655 remaining

Field names in /me have drifted across IPinfo plan generations, so the
summary formatter probes a few aliases before giving up. If /me is
unreachable (custom mirror behind ipinfo_api_url, network error) we
fall back to the original 1.1.1.1 lookup probe, which still validates
the token and logs a generic "configured" message.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* Drop speculative ipinfo_api_url override

It was added mirroring ip_db_url, but the two serve different needs.
ip_db_url has a real use (internal hosting of the MMDB); an
authenticated IPinfo API isn't something anyone mirrors, and /me was
always hardcoded anyway, making the override half-baked. YAGNI.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* AGENTS.md: warn against speculative config options

New section under Configuration spelling out that every option is
permanent surface area and must come from a real user need rather than
pattern-matching a nearby option. Cites the removed ipinfo_api_url as
the canonical cautionary tale so the next session doesn't reintroduce
it, and calls out "override the base URL" / "configurable retries" as
common YAGNI traps.

Also requires that new options land fully wired in one PR (INI schema,
_parse_config, Namespace defaults, docs, SIGHUP-reload path) rather
than half-implemented.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* Rename [general] ip_db_url to ipinfo_url

The bundled MMDB is specifically IPinfo Lite, so the option name
should say so. ip_db_url stays accepted as a deprecated alias and
logs a warning when used; env-var equivalents accept either spelling
via the existing PARSEDMARC_{SECTION}_{KEY} machinery.
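
A deprecated-alias lookup of this kind could be sketched as below. The function name resolve_ipinfo_url and the plain-dict view of the [general] section are assumptions made for illustration; the precedence rule (new name wins when both are set) is also an assumption, since the message does not say.

```python
import logging
from typing import Mapping, Optional

logger = logging.getLogger(__name__)


def resolve_ipinfo_url(general: Mapping[str, str]) -> Optional[str]:
    """Hypothetical helper: read ipinfo_url, accepting ip_db_url as a deprecated alias."""
    if "ipinfo_url" in general:
        # Assumption: the new spelling takes precedence if both are present.
        return general["ipinfo_url"]
    if "ip_db_url" in general:
        logger.warning("[general] ip_db_url is deprecated; use ipinfo_url instead")
        return general["ip_db_url"]
    return None
```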

Updated the AGENTS.md cautionary tale to refer to ipinfo_url (with
the note about the alias) so the anti-pattern example still reads
correctly post-rename.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* Fix testPSLDownload to reflect .akamaiedge.net override

PSL carries c.akamaiedge.net as a public suffix, but
psl_overrides.txt intentionally folds .akamaiedge.net so every
Akamai CDN-customer PTR (the aXXXX-XX.cXXXXX.akamaiedge.net pattern)
clusters under one akamaiedge.net display key. The override was added
in 2978436 as a design decision for source attribution; the test
assertion just predates it.
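
The folding effect can be illustrated with a small sketch. It assumes overrides are stored as dot-prefixed suffixes such as .akamaiedge.net and matched by a simple suffix test; fold_to_override is a hypothetical name, and the real get_base_domain() may apply overrides differently.

```python
from typing import Iterable, Optional


def fold_to_override(hostname: str, overrides: Iterable[str]) -> Optional[str]:
    """Hypothetical helper: return the override's base domain if the host matches one."""
    host = hostname.lower().rstrip(".")
    for suffix in overrides:
        if host.endswith(suffix):
            # ".akamaiedge.net" becomes the display key "akamaiedge.net".
            return suffix.lstrip(".")
    # No override matched; a caller would fall back to the live PSL answer.
    return None


# Every Akamai customer PTR clusters under one key.
print(fold_to_override("a1234-56.c12345.akamaiedge.net", [".akamaiedge.net"]))
```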

Updated the comment to explain why override wins over the live PSL
here so the next reader doesn't reach for the PSL answer again.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Sean Whalen <seanthegeek@users.noreply.github.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-23 10:11:37 -04:00
Sean Whalen 2978436d89 Expand reverse-DNS map and PSL overrides from the live PSL (#716)
* Expand reverse-DNS map and PSL overrides from the live PSL

Parses the private-domains section of the live Public Suffix List and
adds 269 brand-owned suffixes as PSL overrides paired with map
entries, so customer subdomains on shared hosting / SaaS / PaaS
platforms fold to the operator's brand. Adds 33 ASN-domain entries
for the subset of these brands whose IP space is registered under a
different corporate domain in the MMDB, so both the PTR-derived
lookup and the ASN-fallback lookup hit the same (name, type). Also
normalizes ``a2hosting.com`` from ``A2Hosting`` to ``A2 Hosting``
for spelling consistency.

PTR-path wins (overrides + map entries)
- Web hosts: A2 Hosting, alwaysdata, Antagonist, Beget, bplaced,
  Bytemark, Combell, cyber_Folks, cyon, DreamHost, EasyWP, Gehirn,
  HelioHost, home.pl, HostyHosting, Hypernode, IONOS (6 suffixes),
  Jotelulu, JouwWeb, KaasHosting, Keyweb, LCube, LiquidNet, McHost,
  Memset, Mittwald, Mythic Beasts, NearlyFreeSpeech, Nimbus Hosting,
  One.com (20 ccTLD variants), OwnProvider, Pantheon, Planet-Work,
  prgmr, Rackmaze, Rad Web Hosting, Raidboxes, Servebolt,
  SpeedPartner, Uberspace, Whatbox, WP Engine, ZAP-Hosting, Zitcom.
- Dynamic DNS: DuckDNS, DynDNS (24), No-IP (22), Now-DNS, dynv6,
  freemyip, nsupdate.info, ddnss.de, GoIP, DrayTek.
- PaaS/SaaS/IaaS: Netlify, Vercel (6), Heroku, fly.io, Render,
  Firebase/GCP (4), Azure (5), AWS (4), DigitalOcean (2), Red Hat
  OpenShift, Hasura, Supabase, Snowflake/Streamlit, Read the Docs,
  PythonAnywhere, GitHub, GitLab, Adobe Magento.
- Hosted sites/stores: Hatena (6), Notion, Figma, Webflow, Wix (4),
  Shopify, Shopware, Sellfy, Spreadshop (19 ccTLDs), Datto.
- Email/Marketing: Fastmail, ActiveTrail, Leadpages, Heyflow, Carrd,
  Typeform.
- CDN/Technology: Akamai (7), Fastly (3), Yandex Cloud.

ASN-path wins (MMDB coverage now attributes 1,184,256 more IPv4
addresses to a named brand, 85.04% -> 85.08%): yandex.com, ya.ru,
hosting.com (A2 Hosting), beget.com, cyberfolks.pl, fly.io,
bytemark.co.uk, cyberfolks.ro, keyweb.de, mittwald.de, memset.com,
zap-hosting.com, datto.com, jotelulu.com, yandex.cloud, github.com,
asavie.com (Akamai), and 16 others.

Entries are curated from the live PSL rather than any bundled copy;
brand / as_name attribution was verified against the CLAUDE.md rule
that the IP-WHOIS signal is only trusted when the domain name itself
matches the host's name (name-collisions in MMDB were skipped —
Hypernode AU, goipgroup.com, liquidnet.com, One.com substring noise,
nimbusitsolutions.com, etc.). Types follow
``base_reverse_dns_types.txt``; ``sortlists.py`` re-sorts + dedupes +
validates after the batch.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* Document PSL-derived override workflow and load_psl_overrides gotcha

Adds three pieces of map-maintenance context learned while building
this PR:

- New subsection "Discovering overrides from the live PSL
  private-domains section" — distinct source from live DMARC data
  (unknown_base_reverse_dns.csv) and MMDB coverage-gap analysis. The
  private section is itself a list of brand-owned suffixes; each is a
  candidate (psl_override + map entry) pair. Emphasizes ruthless
  selectivity — most of the 600+ private-section orgs are dev
  sandboxes or hobby zones that will never appear in DMARC reports.

- Two-path coverage as a single linked step, not two round-trips:
  when adding a PSL override for a hosted-content suffix
  (netlify.app), also add a map row for the brand's corporate
  as_domain (netlify.com) in the same pass. The override fixes the
  PTR path; the ASN-domain alias fixes the ASN-fallback path.

- The load_psl_overrides() fetch-first gotcha. The no-arg form pulls
  the file from master on GitHub, so end-to-end testing of local
  overrides silently uses the old remote version. offline=True is
  required to test local changes against get_base_domain().

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Sean Whalen <seanthegeek@users.noreply.github.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-23 09:12:32 -04:00
Sean Whalen 2cda5bf59b Surface ASN info and use it for source attribution when a PTR is absent (#715)
* Surface ASN info and fall back to it when a PTR is absent

Adds three new fields to every IP source record — ``asn`` (integer,
e.g. 15169), ``asn_name`` (``"Google LLC"``), ``asn_domain``
(``"google.com"``) — sourced from the bundled IPinfo Lite MMDB. These
flow through to CSV, JSON, Elasticsearch, OpenSearch, and Splunk
outputs as ``source_asn``, ``source_asn_name``, ``source_asn_domain``.

More importantly: when an IP has no reverse DNS (common for many
large senders), source attribution now falls back to the ASN domain
as a lookup key into the same ``reverse_dns_map``. Thanks to #712
and #714, ~85% of routed IPv4 space now has an ``as_domain`` that
hits the map, so rows that were previously unattributable now get a
``source_name``/``source_type`` derived from the ASN. When the ASN
domain misses the map, the raw AS name is used as ``source_name``
with ``source_type`` left null — still better than nothing.

Crucially, ``source_reverse_dns`` and ``source_base_domain`` remain
null on ASN-derived rows, so downstream consumers can still tell a
PTR-resolved attribution apart from an ASN-derived one.
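
The fallback order described above might be sketched like this. The function name, the dict-of-tuples shape of the map, and the toy entries are all assumptions for illustration; only the output field names and the order of precedence come from the message.

```python
from typing import Dict, Optional, Tuple

# Assumed shape: base domain or ASN domain -> (name, type).
ReverseDNSMap = Dict[str, Tuple[str, str]]


def attribute_source(
    base_domain: Optional[str],
    asn_domain: Optional[str],
    asn_name: Optional[str],
    reverse_dns_map: ReverseDNSMap,
) -> dict:
    """Hypothetical sketch of PTR-first, ASN-fallback source attribution."""
    record = {
        "source_base_domain": base_domain,
        "source_name": None,
        "source_type": None,
    }
    if base_domain is not None:
        # PTR path: the reverse-DNS base domain is the lookup key.
        entry = reverse_dns_map.get(base_domain)
    else:
        # No PTR: use the ASN domain as the key into the same map.
        entry = reverse_dns_map.get(asn_domain) if asn_domain else None
    if entry is not None:
        record["source_name"], record["source_type"] = entry
    elif base_domain is None and asn_name:
        # ASN domain missed the map: keep the raw AS name, leave the type null.
        record["source_name"] = asn_name
    return record
```

Because source_base_domain stays None on the ASN path, a consumer can distinguish the two kinds of attribution by checking that field alone.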

ASN is stored as an integer at the schema level (Elasticsearch /
OpenSearch mappings use ``Integer``) so consumers can do range
queries and numeric sorts; dashboards can prepend ``AS`` at display
time. The MMDB reader normalizes both IPinfo's ``"AS15169"`` string
and MaxMind's ``autonomous_system_number`` int to the same int form.
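
A minimal sketch of that normalization, assuming the reader hands over either an "AS"-prefixed string or an int; normalize_asn is a hypothetical name, not necessarily the helper used in the code.

```python
from typing import Optional, Union


def normalize_asn(value: Union[str, int, None]) -> Optional[int]:
    """Hypothetical helper: coerce "AS15169" or 15169 to the int 15169."""
    if value is None or isinstance(value, bool):
        return None
    if isinstance(value, int):
        return value
    text = str(value).strip().upper()
    if text.startswith("AS"):
        text = text[2:]
    # Anything that is not purely digits is treated as missing.
    return int(text) if text.isdigit() else None
```

Storing the bare integer keeps range queries and numeric sorts possible; the "AS" prefix can be added back at display time.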

Also fixes a pre-existing caching bug in ``get_ip_address_info``:
entries without reverse DNS were never written to the IP-info cache,
so every no-PTR IP re-did the MMDB read and DNS attempt on every
call. The cache write is now unconditional.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* Bump to 9.9.0 and document the ASN fallback work

Updates the changelog with a 9.9.0 entry covering the ASN-domain
aliases (#712, #714), map-maintenance tooling fixes (#713), and the
ASN-fallback source attribution added in this branch.

Extends AGENTS.md to explain that ``base_reverse_dns_map.csv`` is now
a mixed-namespace map (rDNS bases alongside ASN domains) and adds a
short recipe for finding high-value ASN-domain misses against the
bundled MMDB, so future contributors know where the map's second
lookup path comes from.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* Document project conventions previously held only in agent memory

Promotes four conventions out of per-agent memory and into AGENTS.md
so every contributor — human or agent — works from the same baseline:

- Run ruff check + format before committing (Code Style).
- Store natively numeric values as numbers, not pre-formatted strings
  (e.g. ASN as int 15169, not "AS15169"; ES/OS mappings as Integer)
  (Code Style).
- Before rewriting a tracked list/data file from freshly-generated
  content, verify the existing content via git — these files
  accumulate manually-curated entries across sessions (Editing tracked
  data files).
- A release isn't done until hatch-built sdist + wheel are attached to
  the GitHub release page; full 8-step sequence documented (Releases).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Sean Whalen <seanthegeek@users.noreply.github.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-23 02:13:30 -04:00
Sean Whalen 6effd80604 9.7.0 (#709)
- Auto-download psl_overrides.txt at startup (and whenever the reverse DNS
  map is reloaded) via load_psl_overrides(); add local_psl_overrides_path
  and psl_overrides_url config options
- Add collect_domain_info.py and detect_psl_overrides.py for bulk WHOIS/HTTP
  enrichment and automatic cluster-based PSL override detection
- Block full-IPv4 reverse-DNS entries from ever entering
  base_reverse_dns_map.csv, known_unknown_base_reverse_dns.txt, or
  unknown_base_reverse_dns.csv, and sweep pre-existing IP entries
- Add Religion and Utilities to the allowed service_type values
- Document the full map-maintenance workflow in AGENTS.md
- Substantial expansion of base_reverse_dns_map.csv (net ~+1,000 entries)
- Add 26 tests covering the new loader, IP filter, PSL fold logic, and
  cluster detection

Co-authored-by: Sean Whalen <seanthegeek@users.noreply.github.com>
2026-04-19 21:20:41 -04:00
Sean Whalen 1542936468 Bump version to 9.5.4, enhance Maildir folder handling, and add config key aliases for environment variable compatibility 2026-03-25 23:22:46 -04:00
Sean Whalen 9551c8b467 Add AGENTS.md for AI agent guidance and link from CLAUDE.md 2026-03-03 21:00:55 -05:00