From 2cda5bf59bd4d40fc7ae1e737be3be1e684bc0de Mon Sep 17 00:00:00 2001 From: Sean Whalen <44679+seanthegeek@users.noreply.github.com> Date: Thu, 23 Apr 2026 02:13:30 -0400 Subject: [PATCH] Surface ASN info and use it for source attribution when a PTR is absent (#715) MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit * Surface ASN info and fall back to it when a PTR is absent Adds three new fields to every IP source record — ``asn`` (integer, e.g. 15169), ``asn_name`` (``"Google LLC"``), ``asn_domain`` (``"google.com"``) — sourced from the bundled IPinfo Lite MMDB. These flow through to CSV, JSON, Elasticsearch, OpenSearch, and Splunk outputs as ``source_asn``, ``source_asn_name``, ``source_asn_domain``. More importantly: when an IP has no reverse DNS (common for many large senders), source attribution now falls back to the ASN domain as a lookup key into the same ``reverse_dns_map``. Thanks to #712 and #714, ~85% of routed IPv4 space now has an ``as_domain`` that hits the map, so rows that were previously unattributable now get a ``source_name``/``source_type`` derived from the ASN. When the ASN domain misses the map, the raw AS name is used as ``source_name`` with ``source_type`` left null — still better than nothing. Crucially, ``source_reverse_dns`` and ``source_base_domain`` remain null on ASN-derived rows, so downstream consumers can still tell a PTR-resolved attribution apart from an ASN-derived one. ASN is stored as an integer at the schema level (Elasticsearch / OpenSearch mappings use ``Integer``) so consumers can do range queries and numeric sorts; dashboards can prepend ``AS`` at display time. The MMDB reader normalizes both IPinfo's ``"AS15169"`` string and MaxMind's ``autonomous_system_number`` int to the same int form. 
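For reference, that normalization can be sketched standalone (the helper name ``normalize_asn`` is illustrative; the patch performs this inline in ``get_ip_address_db_record``):

```python
from typing import Optional

def normalize_asn(record: dict) -> Optional[int]:
    """Normalize an ASN from either MMDB schema to a plain int."""
    raw = record.get("asn")  # IPinfo Lite stores "AS15169"
    if isinstance(raw, int):
        return raw
    if isinstance(raw, str) and raw:
        digits = raw.removeprefix("AS").removeprefix("as")
        if digits.isdigit():
            return int(digits)
    mm = record.get("autonomous_system_number")  # MaxMind stores an int
    return mm if isinstance(mm, int) else None

print(normalize_asn({"asn": "AS15169"}))                   # 15169
print(normalize_asn({"autonomous_system_number": 15169}))  # 15169
print(normalize_asn({"asn": "garbage"}))                   # None
```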
Also fixes a pre-existing caching bug in ``get_ip_address_info``: entries without reverse DNS were never written to the IP-info cache, so every no-PTR IP re-did the MMDB read and DNS attempt on every call. The cache write is now unconditional. Co-Authored-By: Claude Opus 4.7 (1M context) * Bump to 9.9.0 and document the ASN fallback work Updates the changelog with a 9.9.0 entry covering the ASN-domain aliases (#712, #714), map-maintenance tooling fixes (#713), and the ASN-fallback source attribution added in this branch. Extends AGENTS.md to explain that ``base_reverse_dns_map.csv`` is now a mixed-namespace map (rDNS bases alongside ASN domains) and adds a short recipe for finding high-value ASN-domain misses against the bundled MMDB, so future contributors know where the map's second lookup path comes from. Co-Authored-By: Claude Opus 4.7 (1M context) * Document project conventions previously held only in agent memory Promotes four conventions out of per-agent memory and into AGENTS.md so every contributor — human or agent — works from the same baseline: - Run ruff check + format before committing (Code Style). - Store natively numeric values as numbers, not pre-formatted strings (e.g. ASN as int 15169, not "AS15169"; ES/OS mappings as Integer) (Code Style). - Before rewriting a tracked list/data file from freshly-generated content, verify the existing content via git — these files accumulate manually-curated entries across sessions (Editing tracked data files). - A release isn't done until hatch-built sdist + wheel are attached to the GitHub release page; full 8-step sequence documented (Releases). 
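The fixed caching flow, as a minimal sketch (function and helper names are illustrative, not the real parsedmarc API):

```python
calls = {"dns": 0, "mmdb": 0}

def lookup_ptr(ip):  # stand-in for the reverse DNS attempt
    calls["dns"] += 1
    return None  # simulate an IP with no PTR

def read_mmdb(ip):  # stand-in for the MMDB read
    calls["mmdb"] += 1
    return {"country": "US", "asn": 64496}

def get_info(ip, cache):
    if ip in cache:
        return cache[ip]
    info = {"reverse_dns": lookup_ptr(ip), **read_mmdb(ip)}
    # The fix: cache unconditionally, not only when reverse_dns is set.
    cache[ip] = info
    return info

cache = {}
get_info("192.0.2.1", cache)
get_info("192.0.2.1", cache)  # second call is served from cache
print(calls)  # {'dns': 1, 'mmdb': 1}
```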
Co-Authored-By: Claude Opus 4.7 (1M context) --------- Co-authored-by: Sean Whalen Co-authored-by: Claude Opus 4.7 (1M context) --- AGENTS.md | 65 ++++++++++++++--- CHANGELOG.md | 24 +++++++ docs/source/output.md | 18 +++-- parsedmarc/__init__.py | 12 ++++ parsedmarc/constants.py | 2 +- parsedmarc/elastic.py | 12 ++++ parsedmarc/opensearch.py | 12 ++++ parsedmarc/splunk.py | 3 + parsedmarc/types.py | 3 + parsedmarc/utils.py | 152 ++++++++++++++++++++++++++++++--------- tests.py | 61 ++++++++++++++++ 11 files changed, 315 insertions(+), 49 deletions(-) diff --git a/AGENTS.md b/AGENTS.md index 12fc094..b6d449c 100644 --- a/AGENTS.md +++ b/AGENTS.md @@ -62,22 +62,42 @@ IP address info cached for 4 hours, seen aggregate report IDs cached for 1 hour ## Code Style -- Ruff for formatting and linting (configured in `.vscode/settings.json`) -- TypedDict for structured data, type hints throughout -- Python ≥3.10 required -- Tests are in a single `tests.py` file using unittest; sample reports live in `samples/` -- File path config values must be wrapped with `_expand_path()` in `cli.py` -- Maildir UID checks are intentionally relaxed (warn, don't crash) for Docker compatibility -- Token file writes must create parent directories before opening for write +- Ruff for formatting and linting (configured in `.vscode/settings.json`). Run `ruff check .` and `ruff format --check .` after every code edit, before committing. +- TypedDict for structured data, type hints throughout. +- Python ≥3.10 required. +- Tests are in a single `tests.py` file using unittest; sample reports live in `samples/`. +- File path config values must be wrapped with `_expand_path()` in `cli.py`. +- Maildir UID checks are intentionally relaxed (warn, don't crash) for Docker compatibility. +- Token file writes must create parent directories before opening for write. +- Store natively numeric values as numbers, not pre-formatted strings. 
Example: ASN is stored as `int 15169`, not `"AS15169"`; Elasticsearch / OpenSearch mappings for such fields use `Integer()` so consumers can do range queries and numeric sorts. Display layers format with a prefix at render time. + +## Editing tracked data files + +Before rewriting a tracked list/data file from freshly-generated content (anything under `parsedmarc/resources/maps/`, CSVs, `.txt` lists), check the existing file first — `git show HEAD:<path> | wc -l`, `git log -1 -- <path>`, `git diff --stat`. Files like `known_unknown_base_reverse_dns.txt` and `base_reverse_dns_map.csv` accumulate manually-curated entries across many sessions, and a "fresh" regeneration that drops the row count is almost certainly destroying prior work. If the new content is meant to *add* rather than *replace*, use a merge/append pattern. Treat any unexpected row-count drop in the pending diff as a red flag. + +## Releases + +A release isn't done until built artifacts are attached to the GitHub release page. Full sequence: + +1. Bump version in `parsedmarc/constants.py`; update `CHANGELOG.md` with a new section under the new version number. +2. Commit on a feature branch, open a PR, merge to master. +3. `git fetch && git checkout master && git pull`. +4. `git tag -a -m "<message>" <tag>` and `git push origin <tag>`. +5. `rm -rf dist && hatch build`. Verify `git describe --tags --exact-match` matches the tag. +6. `gh release create <tag> --title "<title>" --notes-file <file>`. +7. `gh release upload <tag> dist/parsedmarc-<version>.tar.gz dist/parsedmarc-<version>-py3-none-any.whl`. +8. Confirm `gh release view --json assets` shows both the sdist and the wheel before considering the release complete. ## Maintaining the reverse DNS maps -`parsedmarc/resources/maps/base_reverse_dns_map.csv` maps reverse DNS base domains to a display name and service type. See `parsedmarc/resources/maps/README.md` for the field format and the service_type precedence rules.
+`parsedmarc/resources/maps/base_reverse_dns_map.csv` maps a base domain to a display name and service type. The same map is consulted at two points: first with a PTR-derived base domain, and — if the IP has no PTR — with the ASN domain from the bundled IPinfo Lite MMDB (`parsedmarc/resources/ipinfo/ipinfo_lite.mmdb`). See `parsedmarc/resources/maps/README.md` for the field format and the service_type precedence rules. + +Because both lookup paths read the same CSV, map keys are a mixed namespace — rDNS-base domains (e.g. `comcast.net`, discovered via `base_reverse_dns.csv`) coexist with ASN domains (e.g. `comcast.com`, discovered via coverage-gap analysis against the MMDB). Entries of both kinds should point to the same `(name, type)` when they describe the same operator — grep before inventing a new display name. ### File format - CSV uses **CRLF** line endings and UTF-8 encoding — preserve both when editing programmatically. -- Entries are sorted alphabetically (case-insensitive) by the first column. +- Entries are sorted alphabetically (case-insensitive) by the first column. `parsedmarc/resources/maps/sortlists.py` is authoritative — run it after any batch edit to re-sort, dedupe, and validate `type` values. - Names containing commas must be quoted. - Do not edit in Excel (it mangles Unicode); use LibreOffice Calc or a text editor. @@ -125,7 +145,32 @@ When `unknown_base_reverse_dns.csv` has new entries, follow this order rather th - `detect_psl_overrides.py` — scans the lists for clustered IP-containing patterns, auto-adds brand suffixes to `psl_overrides.txt`, folds affected entries to their base, and removes any remaining full-IP entries. Run before the collector on any new batch. - `collect_domain_info.py` — the bulk enrichment collector described above. Respects `psl_overrides.txt` and skips full-IP entries. - `find_bad_utf8.py` — locates invalid UTF-8 bytes (used after past encoding corruption). -- `sortlists.py` — sorting helper for the list files. 
+- `sortlists.py` — case-insensitive sort + dedupe + `type`-column validator for the list files; the authoritative sorter run after every batch edit. + +### Checking ASN-domain coverage of the MMDB + +Separately from `base_reverse_dns.csv`, the MMDB itself is a source of keys worth mapping. To find ASN domains with high IP weight that don't yet have a map entry, walk every record in `ipinfo_lite.mmdb`, aggregate IPv4 count per `as_domain`, and subtract what's already a map key: + +```python +import csv, maxminddb +from collections import defaultdict +keys = set() +with open("parsedmarc/resources/maps/base_reverse_dns_map.csv", newline="", encoding="utf-8") as f: + for row in csv.DictReader(f): + keys.add(row["base_reverse_dns"].strip().lower()) +v4 = defaultdict(int); names = {} +for net, rec in maxminddb.open_database("parsedmarc/resources/ipinfo/ipinfo_lite.mmdb"): + if net.version != 4 or not isinstance(rec, dict): continue + d = rec.get("as_domain") + if not d: continue + v4[d.lower()] += net.num_addresses + names[d.lower()] = rec.get("as_name", "") +miss = sorted(((d, v4[d], names[d]) for d in v4 if d not in keys), key=lambda x: -x[1]) +for d, c, n in miss[:50]: + print(f"{c:>12,} {d:<30} {n}") +``` + +Apply the same classification rules above (precedence, naming consistency, skip-if-ambiguous, privacy). Many top misses will be brands already in the map under a different rDNS-base key — the goal there is to alias the ASN domain to the same `(name, type)` so both lookup paths hit. For ASN domains with no obvious brand identity (small resellers, parked ASNs), don't map them — the attribution code falls back to the raw `as_name` from the MMDB, which is better than a guess. ### After a batch merge diff --git a/CHANGELOG.md b/CHANGELOG.md index 149ed7f..df7ce0c 100644 --- a/CHANGELOG.md +++ b/CHANGELOG.md @@ -1,5 +1,29 @@ # Changelog +## 9.9.0 + +### Changes + +- Source attribution now has an ASN fallback. 
Every IP source record carries three new fields — `asn` (integer, e.g. `15169`), `asn_name` (`"Google LLC"`), and `asn_domain` (`"google.com"`) — sourced from the bundled IPinfo Lite MMDB. When an IP has no reverse DNS, `get_ip_address_info()` uses `asn_domain` as a lookup into the same `reverse_dns_map`, and if that misses, falls back to the raw `asn_name`. `reverse_dns` and `base_domain` stay null on ASN-derived rows so consumers can still distinguish PTR-derived from ASN-derived attribution. +- Added `source_asn`, `source_asn_name`, `source_asn_domain` to CSV output (aggregate + forensic), JSON output, and the Elasticsearch / OpenSearch / Splunk integrations. `source_asn` is mapped as `Integer` at the schema level so consumers can do range queries and numeric sorts; dashboards can prepend `"AS"` at display time. +- Expanded `base_reverse_dns_map.csv` with 500 ASN-domain aliases for the most-routed IPv4 ranges. IPv4-weighted coverage of the bundled `ipinfo_lite.mmdb` went from ~34% of routed space matching a map entry via ASN domain to ~85%. Every alias is a brand that was already in the map under a different rDNS-base key (e.g. adding `comcast.com` alongside the existing `comcast.net`), plus a small number of large operators that previously had no entry. 11 entries were also promoted out of `known_unknown_base_reverse_dns.txt` because ASN context made their identity unambiguous. +- Added `get_ip_address_db_record()` in `parsedmarc.utils`, a single-open MMDB reader that returns country + ASN fields together. `get_ip_address_country()` is now a thin wrapper. Supports both IPinfo Lite's schema (`country_code`, `asn` as `"AS15169"`, `as_name`, `as_domain`) and MaxMind's (`country.iso_code`, `autonomous_system_number` as int, `autonomous_system_organization`) in one pass; ASN is normalized to a plain int from either. MaxMind users who drop in their own ASN MMDB get `asn` + `asn_name` populated; `asn_domain` stays null because MaxMind doesn't carry it. 
+ +### Fixed + +- `get_ip_address_info()` now caches entries for IPs without reverse DNS. Previously the cache write was inside the `if reverse_dns is not None` branch, so every no-PTR IP re-did the MMDB read and DNS attempt on every call. +- Fixed three bugs in `parsedmarc/resources/maps/sortlists.py` that silently disabled the `type`-column validator and sorted the map case-sensitively, contrary to its documented behavior: + - Validator allowed-values map was keyed on `"Type"` (capital T), but the CSV header is `"type"` (lowercase), so every row bypassed validation. + - Types were read with trailing newlines via `f.readlines()`, so comparisons would not have matched even if the column name had been right. + - `sort_csv()` was called without `case_insensitive_sort=True`, which moved the sole mixed-case key (`United-domains.de`) to the top of the file instead of into its alphabetical position. +- Fixed eight pre-existing map rows with invalid or inconsistent `type` values that the now-working validator surfaced: casing corrections for `dhl.com` (`logistics` → `Logistics`), `ghm-grenoble.fr` (`healthcare` → `Healthcare`), and `regusnet.com` (`Real estate` → `Real Estate`); reclassified `lodestonegroup.com` from the nonexistent `Insurance` type to `Finance`; added missing `Religion` and `Utilities` entries to `base_reverse_dns_types.txt` so it matches the README's industry list. +- Fixed the `rt.ru` map entry: was classified as `RT,Government Media`, which conflated Rostelecom (the Russian telco that owns and uses `rt.ru`) with RT / Russia Today (which uses `rt.com`). Corrected to `Rostelecom,ISP`. + +### Upgrade notes + +- Output schema change: CSV, JSON, Elasticsearch, OpenSearch, and Splunk all gain three new fields per row (`source_asn`, `source_asn_name`, `source_asn_domain`). Existing queries and dashboards keep working; dashboards that want to consume the new fields will need to be updated. 
Elasticsearch / OpenSearch will add the new mappings on next document write. +- Rows for IPs without reverse DNS now populate `source_name` / `source_type` via ASN fallback. If downstream dashboards treated "null `source_name`" as a signal for "no rDNS", switch to checking `source_reverse_dns IS NULL` instead — that remains the unambiguous signal. + ## 9.8.0 ### Changes diff --git a/docs/source/output.md b/docs/source/output.md index a8d19e4..bc73403 100644 --- a/docs/source/output.md +++ b/docs/source/output.md @@ -44,7 +44,10 @@ of the report schema. "reverse_dns": null, "base_domain": null, "name": null, - "type": null + "type": null, + "asn": 7018, + "asn_name": "AT&T Services, Inc.", + "asn_domain": "att.com" }, "count": 2, "alignment": { @@ -90,7 +93,7 @@ of the report schema. ### CSV aggregate report ```text -xml_schema,org_name,org_email,org_extra_contact_info,report_id,begin_date,end_date,normalized_timespan,errors,domain,adkim,aspf,p,sp,pct,fo,source_ip_address,source_country,source_reverse_dns,source_base_domain,source_name,source_type,count,spf_aligned,dkim_aligned,dmarc_aligned,disposition,policy_override_reasons,policy_override_comments,envelope_from,header_from,envelope_to,dkim_domains,dkim_selectors,dkim_results,spf_domains,spf_scopes,spf_results +xml_schema,org_name,org_email,org_extra_contact_info,report_id,begin_date,end_date,normalized_timespan,errors,domain,adkim,aspf,p,sp,pct,fo,source_ip_address,source_country,source_reverse_dns,source_base_domain,source_name,source_type,source_asn,source_asn_name,source_asn_domain,count,spf_aligned,dkim_aligned,dmarc_aligned,disposition,policy_override_reasons,policy_override_comments,envelope_from,header_from,envelope_to,dkim_domains,dkim_selectors,dkim_results,spf_domains,spf_scopes,spf_results draft,acme.com,noreply-dmarc-support@acme.com,http://acme.com/dmarc/support,9391651994964116463,2012-04-28 00:00:00,2012-04-28 
23:59:59,False,,example.com,r,r,none,none,100,0,72.150.241.94,US,,,,,2,True,False,True,none,,,example.com,example.com,,example.com,none,fail,example.com,mfrom,pass draft,acme.com,noreply-dmarc-support@acme.com,http://acme.com/dmarc/support,9391651994964116463,2012-04-28 00:00:00,2012-04-28 23:59:59,False,,example.com,r,r,none,none,100,0,72.150.241.94,US,,,,,2,True,False,True,none,,,example.com,example.com,,example.com,none,fail,example.com,mfrom,pass @@ -123,7 +126,12 @@ Thanks to GitHub user [xennn](https://github.com/xennn) for the anonymized "ip_address": "10.10.10.10", "country": null, "reverse_dns": null, - "base_domain": null + "base_domain": null, + "name": null, + "type": null, + "asn": null, + "asn_name": null, + "asn_domain": null }, "authentication_mechanisms": [], "original_envelope_id": null, @@ -193,7 +201,7 @@ Thanks to GitHub user [xennn](https://github.com/xennn) for the anonymized ### CSV forensic report ```text -feedback_type,user_agent,version,original_envelope_id,original_mail_from,original_rcpt_to,arrival_date,arrival_date_utc,subject,message_id,authentication_results,dkim_domain,source_ip_address,source_country,source_reverse_dns,source_base_domain,delivery_result,auth_failure,reported_domain,authentication_mechanisms,sample_headers_only +feedback_type,user_agent,version,original_envelope_id,original_mail_from,original_rcpt_to,arrival_date,arrival_date_utc,subject,message_id,authentication_results,dkim_domain,source_ip_address,source_country,source_reverse_dns,source_base_domain,source_name,source_type,source_asn,source_asn_name,source_asn_domain,delivery_result,auth_failure,reported_domain,authentication_mechanisms,sample_headers_only auth-failure,Lua/1.0,1.0,,sharepoint@domain.de,peter.pan@domain.de,"Mon, 01 Oct 2018 11:20:27 +0200",2018-10-01 09:20:27,Subject,<38.E7.30937.BD6E1BB5@ mailrelay.de>,"dmarc=fail (p=none, dis=none) header.from=domain.de",,10.10.10.10,,,,policy,dmarc,domain.de,,False ``` @@ -238,4 +246,4 @@ 
auth-failure,Lua/1.0,1.0,,sharepoint@domain.de,peter.pan@domain.de,"Mon, 01 Oct ] } ] -``` \ No newline at end of file +``` diff --git a/parsedmarc/__init__.py b/parsedmarc/__init__.py index f15293d..103520b 100644 --- a/parsedmarc/__init__.py +++ b/parsedmarc/__init__.py @@ -1114,6 +1114,9 @@ def parsed_aggregate_reports_to_csv_rows( row["source_base_domain"] = record["source"]["base_domain"] row["source_name"] = record["source"]["name"] row["source_type"] = record["source"]["type"] + row["source_asn"] = record["source"]["asn"] + row["source_asn_name"] = record["source"]["asn_name"] + row["source_asn_domain"] = record["source"]["asn_domain"] row["count"] = record["count"] row["spf_aligned"] = record["alignment"]["spf"] row["dkim_aligned"] = record["alignment"]["dkim"] @@ -1205,6 +1208,9 @@ def parsed_aggregate_reports_to_csv( "source_base_domain", "source_name", "source_type", + "source_asn", + "source_asn_name", + "source_asn_domain", "count", "spf_aligned", "dkim_aligned", @@ -1406,6 +1412,9 @@ def parsed_forensic_reports_to_csv_rows( row["source_base_domain"] = report["source"]["base_domain"] row["source_name"] = report["source"]["name"] row["source_type"] = report["source"]["type"] + row["source_asn"] = report["source"]["asn"] + row["source_asn_name"] = report["source"]["asn_name"] + row["source_asn_domain"] = report["source"]["asn_domain"] row["source_country"] = report["source"]["country"] del row["source"] row["subject"] = report["parsed_sample"].get("subject") @@ -1451,6 +1460,9 @@ def parsed_forensic_reports_to_csv( "source_base_domain", "source_name", "source_type", + "source_asn", + "source_asn_name", + "source_asn_domain", "delivery_result", "auth_failure", "reported_domain", diff --git a/parsedmarc/constants.py b/parsedmarc/constants.py index 6039f1b..94c0d13 100644 --- a/parsedmarc/constants.py +++ b/parsedmarc/constants.py @@ -1,4 +1,4 @@ -__version__ = "9.8.0" +__version__ = "9.9.0" USER_AGENT = f"parsedmarc/{__version__}" diff --git 
a/parsedmarc/elastic.py b/parsedmarc/elastic.py index 9103a80..72223fb 100644 --- a/parsedmarc/elastic.py +++ b/parsedmarc/elastic.py @@ -79,6 +79,9 @@ class _AggregateReportDoc(Document): source_base_domain = Text() source_type = Text() source_name = Text() + source_asn = Integer() + source_asn_name = Text() + source_asn_domain = Text() message_count = Integer disposition = Text() dkim_aligned = Boolean() @@ -173,6 +176,9 @@ class _ForensicReportDoc(Document): source_ip_address = Ip() source_country = Text() source_reverse_dns = Text() + source_asn = Integer() + source_asn_name = Text() + source_asn_domain = Text() source_authentication_mechanisms = Text() source_auth_failures = Text() dkim_domain = Text() @@ -489,6 +495,9 @@ def save_aggregate_report_to_elasticsearch( source_base_domain=record["source"]["base_domain"], source_type=record["source"]["type"], source_name=record["source"]["name"], + source_asn=record["source"]["asn"], + source_asn_name=record["source"]["asn_name"], + source_asn_domain=record["source"]["asn_domain"], message_count=record["count"], disposition=record["policy_evaluated"]["disposition"], dkim_aligned=record["policy_evaluated"]["dkim"] is not None @@ -673,6 +682,9 @@ def save_forensic_report_to_elasticsearch( source_country=forensic_report["source"]["country"], source_reverse_dns=forensic_report["source"]["reverse_dns"], source_base_domain=forensic_report["source"]["base_domain"], + source_asn=forensic_report["source"]["asn"], + source_asn_name=forensic_report["source"]["asn_name"], + source_asn_domain=forensic_report["source"]["asn_domain"], authentication_mechanisms=forensic_report["authentication_mechanisms"], auth_failure=forensic_report["auth_failure"], dkim_domain=forensic_report["dkim_domain"], diff --git a/parsedmarc/opensearch.py b/parsedmarc/opensearch.py index c9dcaf2..5260c1f 100644 --- a/parsedmarc/opensearch.py +++ b/parsedmarc/opensearch.py @@ -82,6 +82,9 @@ class _AggregateReportDoc(Document): source_base_domain = Text() 
source_type = Text() source_name = Text() + source_asn = Integer() + source_asn_name = Text() + source_asn_domain = Text() message_count = Integer disposition = Text() dkim_aligned = Boolean() @@ -176,6 +179,9 @@ class _ForensicReportDoc(Document): source_ip_address = Ip() source_country = Text() source_reverse_dns = Text() + source_asn = Integer() + source_asn_name = Text() + source_asn_domain = Text() source_authentication_mechanisms = Text() source_auth_failures = Text() dkim_domain = Text() @@ -519,6 +525,9 @@ def save_aggregate_report_to_opensearch( source_base_domain=record["source"]["base_domain"], source_type=record["source"]["type"], source_name=record["source"]["name"], + source_asn=record["source"]["asn"], + source_asn_name=record["source"]["asn_name"], + source_asn_domain=record["source"]["asn_domain"], message_count=record["count"], disposition=record["policy_evaluated"]["disposition"], dkim_aligned=record["policy_evaluated"]["dkim"] is not None @@ -703,6 +712,9 @@ def save_forensic_report_to_opensearch( source_country=forensic_report["source"]["country"], source_reverse_dns=forensic_report["source"]["reverse_dns"], source_base_domain=forensic_report["source"]["base_domain"], + source_asn=forensic_report["source"]["asn"], + source_asn_name=forensic_report["source"]["asn_name"], + source_asn_domain=forensic_report["source"]["asn_domain"], authentication_mechanisms=forensic_report["authentication_mechanisms"], auth_failure=forensic_report["auth_failure"], dkim_domain=forensic_report["dkim_domain"], diff --git a/parsedmarc/splunk.py b/parsedmarc/splunk.py index ff660f0..9f83c2a 100644 --- a/parsedmarc/splunk.py +++ b/parsedmarc/splunk.py @@ -104,6 +104,9 @@ class HECClient(object): new_report["source_base_domain"] = record["source"]["base_domain"] new_report["source_type"] = record["source"]["type"] new_report["source_name"] = record["source"]["name"] + new_report["source_asn"] = record["source"]["asn"] + new_report["source_asn_name"] = 
record["source"]["asn_name"] + new_report["source_asn_domain"] = record["source"]["asn_domain"] new_report["message_count"] = record["count"] new_report["disposition"] = record["policy_evaluated"]["disposition"] new_report["spf_aligned"] = record["alignment"]["spf"] diff --git a/parsedmarc/types.py b/parsedmarc/types.py index f0d367d..91e4b35 100644 --- a/parsedmarc/types.py +++ b/parsedmarc/types.py @@ -40,6 +40,9 @@ class IPSourceInfo(TypedDict): base_domain: Optional[str] name: Optional[str] type: Optional[str] + asn: Optional[int] + asn_name: Optional[str] + asn_domain: Optional[str] class AggregateAlignment(TypedDict): diff --git a/parsedmarc/utils.py b/parsedmarc/utils.py index 9f85728..ea37172 100644 --- a/parsedmarc/utils.py +++ b/parsedmarc/utils.py @@ -151,6 +151,9 @@ class IPAddressInfo(TypedDict): base_domain: Optional[str] name: Optional[str] type: Optional[str] + asn: Optional[int] + asn_name: Optional[str] + asn_domain: Optional[str] def decode_base64(data: str) -> bytes: @@ -457,20 +460,7 @@ def load_ip_db( logger.info("Using bundled IP database") -def get_ip_address_country( - ip_address: str, *, db_path: Optional[str] = None -) -> Optional[str]: - """ - Returns the ISO code for the country associated - with the given IPv4 or IPv6 address - - Args: - ip_address (str): The IP address to query for - db_path (str): Path to a MMDB file from IPinfo, MaxMind, or DBIP - - Returns: - str: And ISO country code associated with the given IP address - """ +def _get_ip_database_path(db_path: Optional[str]) -> str: db_paths = [ "ipinfo_lite.mmdb", "GeoLite2-Country.mmdb", @@ -486,14 +476,13 @@ def get_ip_address_country( "dbip-country.mmdb", ] - if db_path is not None: - if not os.path.isfile(db_path): - logger.warning( - f"No file exists at {db_path}. Falling back to an " - "included copy of the IPinfo IP to Country " - "Lite database." 
- ) - db_path = None + if db_path is not None and not os.path.isfile(db_path): + logger.warning( + f"No file exists at {db_path}. Falling back to an " + "included copy of the IPinfo IP to Country " + "Lite database." + ) + db_path = None if db_path is None: for system_path in db_paths: @@ -513,14 +502,37 @@ def get_ip_address_country( if db_age > timedelta(days=30): logger.warning("IP database is more than a month old") - db_reader = maxminddb.open_database(db_path) + return db_path + + +class _IPDatabaseRecord(TypedDict): + country: Optional[str] + asn: Optional[int] + asn_name: Optional[str] + asn_domain: Optional[str] + + +def get_ip_address_db_record( + ip_address: str, *, db_path: Optional[str] = None +) -> _IPDatabaseRecord: + """Look up an IP in the configured MMDB and return country + ASN fields. + + IPinfo Lite carries ``country_code``, ``as_name``, and ``as_domain`` on + every record. MaxMind/DBIP country-only databases carry only country, so + ``asn_name`` / ``asn_domain`` come back None for those users. + """ + resolved_path = _get_ip_database_path(db_path) + db_reader = maxminddb.open_database(resolved_path) record = db_reader.get(ip_address) - # Support both the IPinfo schema (flat top-level ``country_code``) and the - # MaxMind/DBIP schema (nested ``country.iso_code``) so users dropping in - # their own MMDB from any of these providers keeps working. country: Optional[str] = None + asn: Optional[int] = None + asn_name: Optional[str] = None + asn_domain: Optional[str] = None if isinstance(record, dict): + # Support both the IPinfo schema (flat top-level ``country_code``) and + # the MaxMind/DBIP schema (nested ``country.iso_code``) so users + # dropping in their own MMDB from any of these providers keeps working. code = record.get("country_code") if code is None: nested = record.get("country") @@ -529,7 +541,52 @@ def get_ip_address_country( if isinstance(code, str): country = code - return country + # Normalize ASN to a plain integer. 
IPinfo stores it as a string like + # "AS15169"; MaxMind's ASN DB uses ``autonomous_system_number`` as an + # int. Integer form lets consumers do range queries and sort + # numerically; display-time formatting with an "AS" prefix is trivial. + raw_asn = record.get("asn") + if isinstance(raw_asn, int): + asn = raw_asn + elif isinstance(raw_asn, str) and raw_asn: + digits = raw_asn.removeprefix("AS").removeprefix("as") + if digits.isdigit(): + asn = int(digits) + if asn is None: + mm_asn = record.get("autonomous_system_number") + if isinstance(mm_asn, int): + asn = mm_asn + + name = record.get("as_name") or record.get("autonomous_system_organization") + if isinstance(name, str) and name: + asn_name = name + domain = record.get("as_domain") + if isinstance(domain, str) and domain: + asn_domain = domain.lower() + + return { + "country": country, + "asn": asn, + "asn_name": asn_name, + "asn_domain": asn_domain, + } + + +def get_ip_address_country( + ip_address: str, *, db_path: Optional[str] = None +) -> Optional[str]: + """ + Returns the ISO code for the country associated + with the given IPv4 or IPv6 address. 
+ + Args: + ip_address (str): The IP address to query for + db_path (str): Path to an MMDB file from IPinfo, MaxMind, or DBIP + + Returns: + str: An ISO country code associated with the given IP address + """ + return get_ip_address_db_record(ip_address, db_path=db_path)["country"] def load_reverse_dns_map( @@ -723,6 +780,9 @@ def get_ip_address_info( "base_domain": None, "name": None, "type": None, + "asn": None, + "asn_name": None, + "asn_domain": None, } if offline: reverse_dns = None @@ -733,9 +793,13 @@ def get_ip_address_info( timeout=timeout, retries=retries, ) - country = get_ip_address_country(ip_address, db_path=ip_db_path) - info["country"] = country + db_record = get_ip_address_db_record(ip_address, db_path=ip_db_path) + info["country"] = db_record["country"] + info["asn"] = db_record["asn"] + info["asn_name"] = db_record["asn_name"] + info["asn_domain"] = db_record["asn_domain"] info["reverse_dns"] = reverse_dns + if reverse_dns is not None: base_domain = get_base_domain(reverse_dns) if base_domain is not None: @@ -750,12 +814,34 @@ def get_ip_address_info( info["base_domain"] = base_domain info["type"] = service["type"] info["name"] = service["name"] - - if cache is not None: - cache[ip_address] = info - logger.debug(f"IP address {ip_address} added to cache") else: logger.debug(f"IP address {ip_address} reverse_dns not found") + # Fall back to ASN data for source attribution. ``reverse_dns`` and + # ``base_domain`` are left null so consumers can still tell an + # ASN-derived row apart from one resolved via a real PTR.
+ map_value: ReverseDNSMap = ( + reverse_dns_map if reverse_dns_map is not None else {} + ) + if len(map_value) == 0: + load_reverse_dns_map( + map_value, + always_use_local_file=always_use_local_files, + local_file_path=reverse_dns_map_path, + url=reverse_dns_map_url, + offline=offline, + ) + if info["asn_domain"] and info["asn_domain"] in map_value: + service = map_value[info["asn_domain"]] + info["name"] = service["name"] + info["type"] = service["type"] + elif info["asn_name"]: + # ASN-domain not in the map: surface the raw AS name with no + # classification. Better than leaving the row unattributed. + info["name"] = info["asn_name"] + + if cache is not None: + cache[ip_address] = info + logger.debug(f"IP address {ip_address} added to cache") return info diff --git a/tests.py b/tests.py index 1b126ce..b964c85 100755 --- a/tests.py +++ b/tests.py @@ -223,6 +223,67 @@ class Test(unittest.TestCase): parsedmarc.parsed_smtp_tls_reports_to_csv(result["report"]) print("Passed!") + def testIpAddressInfoSurfacesASNFields(self): + """ASN number, name, and domain from the bundled MMDB appear on every + IP info result, even when no PTR resolves.""" + info = parsedmarc.utils.get_ip_address_info("8.8.8.8", offline=True) + self.assertEqual(info["asn"], 15169) + self.assertIsInstance(info["asn"], int) + self.assertEqual(info["asn_domain"], "google.com") + self.assertTrue(info["asn_name"]) + + def testIpAddressInfoFallsBackToASNMapEntryWhenNoPTR(self): + """When reverse DNS is absent, the ASN domain should be used as a + lookup into the reverse_dns_map so the row still gets attributed, + while reverse_dns and base_domain remain null.""" + info = parsedmarc.utils.get_ip_address_info("8.8.8.8", offline=True) + self.assertIsNone(info["reverse_dns"]) + self.assertIsNone(info["base_domain"]) + self.assertEqual(info["name"], "Google (Including Gmail and Google Workspace)") + self.assertEqual(info["type"], "Email Provider") + + def testIpAddressInfoFallsBackToRawASNameOnMapMiss(self): 
+ """When neither PTR nor an ASN-map entry resolves, the raw AS name + is used as source_name with type left null — better than leaving + the row unattributed.""" + # The DB record is mocked with an as_domain that is not in the map, + # so this test exercises the asn_name fallback branch without + # depending on a specific map state. + from unittest.mock import patch + + with patch( + "parsedmarc.utils.get_ip_address_db_record", + return_value={ + "country": "US", + "asn": 64496, + "asn_name": "Some Unmapped Org, Inc.", + "asn_domain": "unmapped-for-this-test.example", + }, + ): + # Bypass cache to avoid prior-test pollution. + info = parsedmarc.utils.get_ip_address_info( + "192.0.2.1", offline=True, cache=None + ) + self.assertIsNone(info["reverse_dns"]) + self.assertIsNone(info["base_domain"]) + self.assertIsNone(info["type"]) + self.assertEqual(info["name"], "Some Unmapped Org, Inc.") + self.assertEqual(info["asn_domain"], "unmapped-for-this-test.example") + + def testAggregateCsvExposesASNColumns(self): + """The aggregate CSV output should include source_asn, source_asn_name, + and source_asn_domain columns.""" + result = parsedmarc.parse_report_file( + "samples/aggregate/!example.com!1538204542!1538463818.xml", + always_use_local_files=True, + offline=True, + ) + csv_text = parsedmarc.parsed_aggregate_reports_to_csv(result["report"]) + header = csv_text.splitlines()[0].split(",") + self.assertIn("source_asn", header) + self.assertIn("source_asn_name", header) + self.assertIn("source_asn_domain", header) + def testOpenSearchSigV4RequiresRegion(self): with self.assertRaises(opensearch_module.OpenSearchError): opensearch_module.set_hosts(