mirror of
https://github.com/domainaware/parsedmarc.git
synced 2026-04-24 06:19:29 +00:00
Surface ASN info and use it for source attribution when a PTR is absent (#715)
* Surface ASN info and fall back to it when a PTR is absent

  Adds three new fields to every IP source record: ``asn`` (integer, e.g. 15169), ``asn_name`` (``"Google LLC"``), and ``asn_domain`` (``"google.com"``), sourced from the bundled IPinfo Lite MMDB. These flow through to CSV, JSON, Elasticsearch, OpenSearch, and Splunk outputs as ``source_asn``, ``source_asn_name``, and ``source_asn_domain``.

  More importantly: when an IP has no reverse DNS (common for many large senders), source attribution now falls back to the ASN domain as a lookup key into the same ``reverse_dns_map``. Thanks to #712 and #714, ~85% of routed IPv4 space now has an ``as_domain`` that hits the map, so rows that were previously unattributable now get a ``source_name``/``source_type`` derived from the ASN. When the ASN domain misses the map, the raw AS name is used as ``source_name`` with ``source_type`` left null, which is still better than nothing. Crucially, ``source_reverse_dns`` and ``source_base_domain`` remain null on ASN-derived rows, so downstream consumers can still tell a PTR-resolved attribution apart from an ASN-derived one.

  ASN is stored as an integer at the schema level (the Elasticsearch / OpenSearch mappings use ``Integer``) so consumers can do range queries and numeric sorts; dashboards can prepend ``AS`` at display time. The MMDB reader normalizes both IPinfo's ``"AS15169"`` string and MaxMind's ``autonomous_system_number`` int to the same int form.

  Also fixes a pre-existing caching bug in ``get_ip_address_info``: entries without reverse DNS were never written to the IP-info cache, so every no-PTR IP re-did the MMDB read and DNS attempt on every call. The cache write is now unconditional.

  Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* Bump to 9.9.0 and document the ASN fallback work

  Updates the changelog with a 9.9.0 entry covering the ASN-domain aliases (#712, #714), map-maintenance tooling fixes (#713), and the ASN-fallback source attribution added in this branch.
  Extends AGENTS.md to explain that ``base_reverse_dns_map.csv`` is now a mixed-namespace map (rDNS bases alongside ASN domains) and adds a short recipe for finding high-value ASN-domain misses against the bundled MMDB, so future contributors know where the map's second lookup path comes from.

  Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* Document project conventions previously held only in agent memory

  Promotes four conventions out of per-agent memory and into AGENTS.md so every contributor, human or agent, works from the same baseline:

  - Run ruff check + format before committing (Code Style).
  - Store natively numeric values as numbers, not pre-formatted strings (e.g. ASN as int 15169, not "AS15169"; ES/OS mappings as Integer) (Code Style).
  - Before rewriting a tracked list/data file from freshly-generated content, verify the existing content via git; these files accumulate manually-curated entries across sessions (Editing tracked data files).
  - A release isn't done until the hatch-built sdist + wheel are attached to the GitHub release page; the full 8-step sequence is documented (Releases).

  Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Sean Whalen <seanthegeek@users.noreply.github.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
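The attribution precedence described in this commit message can be sketched as follows. This is a minimal illustration only: the helper name `attribute_source` and its flat argument list are hypothetical, not parsedmarc's actual API, but the order (PTR-derived base domain first, then ASN domain against the same map, then the raw AS name with a null type) follows the message above.

```python
from typing import Optional, TypedDict


class Attribution(TypedDict):
    name: Optional[str]
    type: Optional[str]


def attribute_source(
    reverse_dns_base: Optional[str],
    asn_domain: Optional[str],
    asn_name: Optional[str],
    reverse_dns_map: dict[str, Attribution],
) -> Attribution:
    # A PTR-derived base domain wins when present.
    if reverse_dns_base and reverse_dns_base in reverse_dns_map:
        return reverse_dns_map[reverse_dns_base]
    # No PTR: try the ASN domain as a key into the same map.
    if asn_domain and asn_domain in reverse_dns_map:
        return reverse_dns_map[asn_domain]
    # Map miss: surface the raw AS name with no classification.
    if asn_name:
        return {"name": asn_name, "type": None}
    return {"name": None, "type": None}
```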
AGENTS.md (65 lines changed)
````diff
@@ -62,22 +62,42 @@ IP address info cached for 4 hours, seen aggregate report IDs cached for 1 hour
 
 ## Code Style
 
-- Ruff for formatting and linting (configured in `.vscode/settings.json`)
-- TypedDict for structured data, type hints throughout
-- Python ≥3.10 required
-- Tests are in a single `tests.py` file using unittest; sample reports live in `samples/`
-- File path config values must be wrapped with `_expand_path()` in `cli.py`
-- Maildir UID checks are intentionally relaxed (warn, don't crash) for Docker compatibility
-- Token file writes must create parent directories before opening for write
+- Ruff for formatting and linting (configured in `.vscode/settings.json`). Run `ruff check .` and `ruff format --check .` after every code edit, before committing.
+- TypedDict for structured data, type hints throughout.
+- Python ≥3.10 required.
+- Tests are in a single `tests.py` file using unittest; sample reports live in `samples/`.
+- File path config values must be wrapped with `_expand_path()` in `cli.py`.
+- Maildir UID checks are intentionally relaxed (warn, don't crash) for Docker compatibility.
+- Token file writes must create parent directories before opening for write.
+- Store natively numeric values as numbers, not pre-formatted strings. Example: ASN is stored as `int 15169`, not `"AS15169"`; Elasticsearch / OpenSearch mappings for such fields use `Integer()` so consumers can do range queries and numeric sorts. Display layers format with a prefix at render time.
+
+## Editing tracked data files
+
+Before rewriting a tracked list/data file from freshly-generated content (anything under `parsedmarc/resources/maps/`, CSVs, `.txt` lists), check the existing file first — `git show HEAD:<path> | wc -l`, `git log -1 -- <path>`, `git diff --stat`. Files like `known_unknown_base_reverse_dns.txt` and `base_reverse_dns_map.csv` accumulate manually-curated entries across many sessions, and a "fresh" regeneration that drops the row count is almost certainly destroying prior work. If the new content is meant to *add* rather than *replace*, use a merge/append pattern. Treat any unexpected row-count drop in the pending diff as a red flag.
+
+## Releases
+
+A release isn't done until built artifacts are attached to the GitHub release page. Full sequence:
+
+1. Bump version in `parsedmarc/constants.py`; update `CHANGELOG.md` with a new section under the new version number.
+2. Commit on a feature branch, open a PR, merge to master.
+3. `git fetch && git checkout master && git pull`.
+4. `git tag -a <version> -m "<version>" <sha>` and `git push origin <version>`.
+5. `rm -rf dist && hatch build`. Verify `git describe --tags --exact-match` matches the tag.
+6. `gh release create <version> --title "<version>" --notes-file <notes>`.
+7. `gh release upload <version> dist/parsedmarc-<version>.tar.gz dist/parsedmarc-<version>-py3-none-any.whl`.
+8. Confirm `gh release view <version> --json assets` shows both the sdist and the wheel before considering the release complete.
 
 ## Maintaining the reverse DNS maps
 
-`parsedmarc/resources/maps/base_reverse_dns_map.csv` maps reverse DNS base domains to a display name and service type. See `parsedmarc/resources/maps/README.md` for the field format and the service_type precedence rules.
+`parsedmarc/resources/maps/base_reverse_dns_map.csv` maps a base domain to a display name and service type. The same map is consulted at two points: first with a PTR-derived base domain, and — if the IP has no PTR — with the ASN domain from the bundled IPinfo Lite MMDB (`parsedmarc/resources/ipinfo/ipinfo_lite.mmdb`). See `parsedmarc/resources/maps/README.md` for the field format and the service_type precedence rules.
+
+Because both lookup paths read the same CSV, map keys are a mixed namespace — rDNS-base domains (e.g. `comcast.net`, discovered via `base_reverse_dns.csv`) coexist with ASN domains (e.g. `comcast.com`, discovered via coverage-gap analysis against the MMDB). Entries of both kinds should point to the same `(name, type)` when they describe the same operator — grep before inventing a new display name.
 
 ### File format
 
 - CSV uses **CRLF** line endings and UTF-8 encoding — preserve both when editing programmatically.
-- Entries are sorted alphabetically (case-insensitive) by the first column.
+- Entries are sorted alphabetically (case-insensitive) by the first column. `parsedmarc/resources/maps/sortlists.py` is authoritative — run it after any batch edit to re-sort, dedupe, and validate `type` values.
 - Names containing commas must be quoted.
 - Do not edit in Excel (it mangles Unicode); use LibreOffice Calc or a text editor.
@@ -125,7 +145,32 @@ When `unknown_base_reverse_dns.csv` has new entries, follow this order rather th
 
 - `detect_psl_overrides.py` — scans the lists for clustered IP-containing patterns, auto-adds brand suffixes to `psl_overrides.txt`, folds affected entries to their base, and removes any remaining full-IP entries. Run before the collector on any new batch.
 - `collect_domain_info.py` — the bulk enrichment collector described above. Respects `psl_overrides.txt` and skips full-IP entries.
 - `find_bad_utf8.py` — locates invalid UTF-8 bytes (used after past encoding corruption).
-- `sortlists.py` — sorting helper for the list files.
+- `sortlists.py` — case-insensitive sort + dedupe + `type`-column validator for the list files; the authoritative sorter run after every batch edit.
+
+### Checking ASN-domain coverage of the MMDB
+
+Separately from `base_reverse_dns.csv`, the MMDB itself is a source of keys worth mapping. To find ASN domains with high IP weight that don't yet have a map entry, walk every record in `ipinfo_lite.mmdb`, aggregate IPv4 count per `as_domain`, and subtract what's already a map key:
+
+```python
+import csv, maxminddb
+from collections import defaultdict
+
+keys = set()
+with open("parsedmarc/resources/maps/base_reverse_dns_map.csv", newline="", encoding="utf-8") as f:
+    for row in csv.DictReader(f):
+        keys.add(row["base_reverse_dns"].strip().lower())
+
+v4 = defaultdict(int)
+names = {}
+for net, rec in maxminddb.open_database("parsedmarc/resources/ipinfo/ipinfo_lite.mmdb"):
+    if net.version != 4 or not isinstance(rec, dict):
+        continue
+    d = rec.get("as_domain")
+    if not d:
+        continue
+    v4[d.lower()] += net.num_addresses
+    names[d.lower()] = rec.get("as_name", "")
+
+miss = sorted(((d, v4[d], names[d]) for d in v4 if d not in keys), key=lambda x: -x[1])
+for d, c, n in miss[:50]:
+    print(f"{c:>12,} {d:<30} {n}")
+```
+
+Apply the same classification rules above (precedence, naming consistency, skip-if-ambiguous, privacy). Many top misses will be brands already in the map under a different rDNS-base key — the goal there is to alias the ASN domain to the same `(name, type)` so both lookup paths hit. For ASN domains with no obvious brand identity (small resellers, parked ASNs), don't map them — the attribution code falls back to the raw `as_name` from the MMDB, which is better than a guess.
 
 ### After a batch merge
````
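The merge/append pattern recommended above for curated data files can be sketched as follows. This is illustrative only: the function name `merge_map_rows` is hypothetical, and the key column name follows the `base_reverse_dns_map.csv` format described above.

```python
def merge_map_rows(existing: list[dict], fresh: list[dict]) -> list[dict]:
    """Merge freshly generated rows into a curated list without dropping
    manually-curated entries: existing keys always win."""
    seen = {row["base_reverse_dns"].strip().lower() for row in existing}
    merged = existing + [
        r for r in fresh if r["base_reverse_dns"].strip().lower() not in seen
    ]
    # Guardrail from the guidance above: a merge can only grow the file.
    assert len(merged) >= len(existing)
    return merged
```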
CHANGELOG.md (24 lines changed)

```diff
@@ -1,5 +1,29 @@
 # Changelog
 
+## 9.9.0
+
+### Changes
+
+- Source attribution now has an ASN fallback. Every IP source record carries three new fields — `asn` (integer, e.g. `15169`), `asn_name` (`"Google LLC"`), and `asn_domain` (`"google.com"`) — sourced from the bundled IPinfo Lite MMDB. When an IP has no reverse DNS, `get_ip_address_info()` uses `asn_domain` as a lookup into the same `reverse_dns_map`, and if that misses, falls back to the raw `asn_name`. `reverse_dns` and `base_domain` stay null on ASN-derived rows so consumers can still distinguish PTR-derived from ASN-derived attribution.
+- Added `source_asn`, `source_asn_name`, `source_asn_domain` to CSV output (aggregate + forensic), JSON output, and the Elasticsearch / OpenSearch / Splunk integrations. `source_asn` is mapped as `Integer` at the schema level so consumers can do range queries and numeric sorts; dashboards can prepend `"AS"` at display time.
+- Expanded `base_reverse_dns_map.csv` with 500 ASN-domain aliases for the most-routed IPv4 ranges. IPv4-weighted coverage of the bundled `ipinfo_lite.mmdb` went from ~34% of routed space matching a map entry via ASN domain to ~85%. Every alias is a brand that was already in the map under a different rDNS-base key (e.g. adding `comcast.com` alongside the existing `comcast.net`), plus a small number of large operators that previously had no entry. 11 entries were also promoted out of `known_unknown_base_reverse_dns.txt` because ASN context made their identity unambiguous.
+- Added `get_ip_address_db_record()` in `parsedmarc.utils`, a single-open MMDB reader that returns country + ASN fields together. `get_ip_address_country()` is now a thin wrapper. Supports both IPinfo Lite's schema (`country_code`, `asn` as `"AS15169"`, `as_name`, `as_domain`) and MaxMind's (`country.iso_code`, `autonomous_system_number` as int, `autonomous_system_organization`) in one pass; ASN is normalized to a plain int from either. MaxMind users who drop in their own ASN MMDB get `asn` + `asn_name` populated; `asn_domain` stays null because MaxMind doesn't carry it.
+
+### Fixed
+
+- `get_ip_address_info()` now caches entries for IPs without reverse DNS. Previously the cache write was inside the `if reverse_dns is not None` branch, so every no-PTR IP re-did the MMDB read and DNS attempt on every call.
+- Fixed three bugs in `parsedmarc/resources/maps/sortlists.py` that silently disabled the `type`-column validator and sorted the map case-sensitively, contrary to its documented behavior:
+  - Validator allowed-values map was keyed on `"Type"` (capital T), but the CSV header is `"type"` (lowercase), so every row bypassed validation.
+  - Types were read with trailing newlines via `f.readlines()`, so comparisons would not have matched even if the column name had been right.
+  - `sort_csv()` was called without `case_insensitive_sort=True`, which moved the sole mixed-case key (`United-domains.de`) to the top of the file instead of into its alphabetical position.
+- Fixed eight pre-existing map rows with invalid or inconsistent `type` values that the now-working validator surfaced: casing corrections for `dhl.com` (`logistics` → `Logistics`), `ghm-grenoble.fr` (`healthcare` → `Healthcare`), and `regusnet.com` (`Real estate` → `Real Estate`); reclassified `lodestonegroup.com` from the nonexistent `Insurance` type to `Finance`; added missing `Religion` and `Utilities` entries to `base_reverse_dns_types.txt` so it matches the README's industry list.
+- Fixed the `rt.ru` map entry: was classified as `RT,Government Media`, which conflated Rostelecom (the Russian telco that owns and uses `rt.ru`) with RT / Russia Today (which uses `rt.com`). Corrected to `Rostelecom,ISP`.
+
+### Upgrade notes
+
+- Output schema change: CSV, JSON, Elasticsearch, OpenSearch, and Splunk all gain three new fields per row (`source_asn`, `source_asn_name`, `source_asn_domain`). Existing queries and dashboards keep working; dashboards that want to consume the new fields will need to be updated. Elasticsearch / OpenSearch will add the new mappings on next document write.
+- Rows for IPs without reverse DNS now populate `source_name` / `source_type` via ASN fallback. If downstream dashboards treated "null `source_name`" as a signal for "no rDNS", switch to checking `source_reverse_dns IS NULL` instead — that remains the unambiguous signal.
+
 ## 9.8.0
 
 ### Changes
```
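The ASN normalization described in the `get_ip_address_db_record()` entry above can be sketched in isolation. This is a standalone sketch of the stated behavior (IPinfo's `"AS15169"` string and MaxMind's plain int both becoming the same integer), and the helper name `normalize_asn` is hypothetical.

```python
from typing import Optional


def normalize_asn(value: object) -> Optional[int]:
    # MaxMind's autonomous_system_number is already a plain int.
    if isinstance(value, int):
        return value
    # IPinfo Lite stores the ASN as a string like "AS15169".
    if isinstance(value, str) and value:
        digits = value.removeprefix("AS").removeprefix("as")
        if digits.isdigit():
            return int(digits)
    # Anything else (None, malformed string) yields no ASN.
    return None
```

Storing the result as an integer is what lets the schema layer map the field as `Integer` for range queries and numeric sorts.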
````diff
@@ -44,7 +44,10 @@ of the report schema.
         "reverse_dns": null,
         "base_domain": null,
         "name": null,
-        "type": null
+        "type": null,
+        "asn": 7018,
+        "asn_name": "AT&T Services, Inc.",
+        "asn_domain": "att.com"
       },
       "count": 2,
       "alignment": {
@@ -90,7 +93,7 @@ of the report schema.
 ### CSV aggregate report
 
 ```text
-xml_schema,org_name,org_email,org_extra_contact_info,report_id,begin_date,end_date,normalized_timespan,errors,domain,adkim,aspf,p,sp,pct,fo,source_ip_address,source_country,source_reverse_dns,source_base_domain,source_name,source_type,count,spf_aligned,dkim_aligned,dmarc_aligned,disposition,policy_override_reasons,policy_override_comments,envelope_from,header_from,envelope_to,dkim_domains,dkim_selectors,dkim_results,spf_domains,spf_scopes,spf_results
+xml_schema,org_name,org_email,org_extra_contact_info,report_id,begin_date,end_date,normalized_timespan,errors,domain,adkim,aspf,p,sp,pct,fo,source_ip_address,source_country,source_reverse_dns,source_base_domain,source_name,source_type,source_asn,source_asn_name,source_asn_domain,count,spf_aligned,dkim_aligned,dmarc_aligned,disposition,policy_override_reasons,policy_override_comments,envelope_from,header_from,envelope_to,dkim_domains,dkim_selectors,dkim_results,spf_domains,spf_scopes,spf_results
 draft,acme.com,noreply-dmarc-support@acme.com,http://acme.com/dmarc/support,9391651994964116463,2012-04-28 00:00:00,2012-04-28 23:59:59,False,,example.com,r,r,none,none,100,0,72.150.241.94,US,,,,,2,True,False,True,none,,,example.com,example.com,,example.com,none,fail,example.com,mfrom,pass
@@ -123,7 +126,12 @@ Thanks to GitHub user [xennn](https://github.com/xennn) for the anonymized
       "ip_address": "10.10.10.10",
       "country": null,
       "reverse_dns": null,
-      "base_domain": null
+      "base_domain": null,
+      "name": null,
+      "type": null,
+      "asn": null,
+      "asn_name": null,
+      "asn_domain": null
     },
     "authentication_mechanisms": [],
     "original_envelope_id": null,
@@ -193,7 +201,7 @@ Thanks to GitHub user [xennn](https://github.com/xennn) for the anonymized
 ### CSV forensic report
 
 ```text
-feedback_type,user_agent,version,original_envelope_id,original_mail_from,original_rcpt_to,arrival_date,arrival_date_utc,subject,message_id,authentication_results,dkim_domain,source_ip_address,source_country,source_reverse_dns,source_base_domain,delivery_result,auth_failure,reported_domain,authentication_mechanisms,sample_headers_only
+feedback_type,user_agent,version,original_envelope_id,original_mail_from,original_rcpt_to,arrival_date,arrival_date_utc,subject,message_id,authentication_results,dkim_domain,source_ip_address,source_country,source_reverse_dns,source_base_domain,source_name,source_type,source_asn,source_asn_name,source_asn_domain,delivery_result,auth_failure,reported_domain,authentication_mechanisms,sample_headers_only
 auth-failure,Lua/1.0,1.0,,sharepoint@domain.de,peter.pan@domain.de,"Mon, 01 Oct 2018 11:20:27 +0200",2018-10-01 09:20:27,Subject,<38.E7.30937.BD6E1BB5@ mailrelay.de>,"dmarc=fail (p=none, dis=none) header.from=domain.de",,10.10.10.10,,,,policy,dmarc,domain.de,,False
 ```
 
@@ -238,4 +246,4 @@ auth-failure,Lua/1.0,1.0,,sharepoint@domain.de,peter.pan@domain.de,"Mon, 01 Oct
     ]
   }
 ]
 ```
````
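Per the upgrade notes, downstream consumers can still distinguish PTR-resolved from ASN-derived attributions by checking `source_reverse_dns`. A short sketch over hypothetical row dicts shaped like the CSV columns above:

```python
rows = [
    # ASN-derived: source_name populated, but source_reverse_dns is null.
    {"source_name": "AT&T", "source_reverse_dns": None, "source_asn_domain": "att.com"},
    # PTR-resolved: source_reverse_dns carries the actual PTR record.
    {"source_name": "Google", "source_reverse_dns": "mail-wr1.google.com", "source_asn_domain": "google.com"},
]

# source_reverse_dns is the unambiguous signal: null means the name came
# from the ASN fallback, not from a reverse DNS lookup.
asn_derived = [r for r in rows if r["source_reverse_dns"] is None]
ptr_derived = [r for r in rows if r["source_reverse_dns"] is not None]
```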
```diff
@@ -1114,6 +1114,9 @@ def parsed_aggregate_reports_to_csv_rows(
         row["source_base_domain"] = record["source"]["base_domain"]
         row["source_name"] = record["source"]["name"]
         row["source_type"] = record["source"]["type"]
+        row["source_asn"] = record["source"]["asn"]
+        row["source_asn_name"] = record["source"]["asn_name"]
+        row["source_asn_domain"] = record["source"]["asn_domain"]
         row["count"] = record["count"]
         row["spf_aligned"] = record["alignment"]["spf"]
         row["dkim_aligned"] = record["alignment"]["dkim"]
@@ -1205,6 +1208,9 @@ def parsed_aggregate_reports_to_csv(
         "source_base_domain",
         "source_name",
         "source_type",
+        "source_asn",
+        "source_asn_name",
+        "source_asn_domain",
         "count",
         "spf_aligned",
         "dkim_aligned",
@@ -1406,6 +1412,9 @@ def parsed_forensic_reports_to_csv_rows(
         row["source_base_domain"] = report["source"]["base_domain"]
         row["source_name"] = report["source"]["name"]
         row["source_type"] = report["source"]["type"]
+        row["source_asn"] = report["source"]["asn"]
+        row["source_asn_name"] = report["source"]["asn_name"]
+        row["source_asn_domain"] = report["source"]["asn_domain"]
         row["source_country"] = report["source"]["country"]
         del row["source"]
         row["subject"] = report["parsed_sample"].get("subject")
@@ -1451,6 +1460,9 @@ def parsed_forensic_reports_to_csv(
         "source_base_domain",
         "source_name",
         "source_type",
+        "source_asn",
+        "source_asn_name",
+        "source_asn_domain",
         "delivery_result",
         "auth_failure",
         "reported_domain",
```
```diff
@@ -1,4 +1,4 @@
-__version__ = "9.8.0"
+__version__ = "9.9.0"
 
 USER_AGENT = f"parsedmarc/{__version__}"
```
```diff
@@ -79,6 +79,9 @@ class _AggregateReportDoc(Document):
     source_base_domain = Text()
     source_type = Text()
    source_name = Text()
+    source_asn = Integer()
+    source_asn_name = Text()
+    source_asn_domain = Text()
     message_count = Integer()
     disposition = Text()
     dkim_aligned = Boolean()
@@ -173,6 +176,9 @@ class _ForensicReportDoc(Document):
     source_ip_address = Ip()
     source_country = Text()
     source_reverse_dns = Text()
+    source_asn = Integer()
+    source_asn_name = Text()
+    source_asn_domain = Text()
     source_authentication_mechanisms = Text()
     source_auth_failures = Text()
     dkim_domain = Text()
@@ -489,6 +495,9 @@ def save_aggregate_report_to_elasticsearch(
             source_base_domain=record["source"]["base_domain"],
             source_type=record["source"]["type"],
             source_name=record["source"]["name"],
+            source_asn=record["source"]["asn"],
+            source_asn_name=record["source"]["asn_name"],
+            source_asn_domain=record["source"]["asn_domain"],
             message_count=record["count"],
             disposition=record["policy_evaluated"]["disposition"],
             dkim_aligned=record["policy_evaluated"]["dkim"] is not None
@@ -673,6 +682,9 @@ def save_forensic_report_to_elasticsearch(
         source_country=forensic_report["source"]["country"],
         source_reverse_dns=forensic_report["source"]["reverse_dns"],
         source_base_domain=forensic_report["source"]["base_domain"],
+        source_asn=forensic_report["source"]["asn"],
+        source_asn_name=forensic_report["source"]["asn_name"],
+        source_asn_domain=forensic_report["source"]["asn_domain"],
         authentication_mechanisms=forensic_report["authentication_mechanisms"],
         auth_failure=forensic_report["auth_failure"],
        dkim_domain=forensic_report["dkim_domain"],
```
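Mapping `source_asn` as `Integer` is what makes numeric filtering possible. A raw Elasticsearch/OpenSearch query body sketching a range filter and numeric sort over the new field (the ASN bounds are illustrative values, not anything from this commit):

```python
# Works only because source_asn is mapped Integer, not an "AS7018"-style
# string: range comparisons and sorts are numeric, not lexicographic.
query = {
    "query": {
        "bool": {
            "filter": [
                {"range": {"source_asn": {"gte": 7000, "lte": 7100}}}
            ]
        }
    },
    "sort": [{"source_asn": "asc"}],
}
```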
```diff
@@ -82,6 +82,9 @@ class _AggregateReportDoc(Document):
     source_base_domain = Text()
     source_type = Text()
     source_name = Text()
+    source_asn = Integer()
+    source_asn_name = Text()
+    source_asn_domain = Text()
     message_count = Integer()
     disposition = Text()
     dkim_aligned = Boolean()
@@ -176,6 +179,9 @@ class _ForensicReportDoc(Document):
     source_ip_address = Ip()
     source_country = Text()
     source_reverse_dns = Text()
+    source_asn = Integer()
+    source_asn_name = Text()
+    source_asn_domain = Text()
     source_authentication_mechanisms = Text()
     source_auth_failures = Text()
     dkim_domain = Text()
@@ -519,6 +525,9 @@ def save_aggregate_report_to_opensearch(
             source_base_domain=record["source"]["base_domain"],
             source_type=record["source"]["type"],
             source_name=record["source"]["name"],
+            source_asn=record["source"]["asn"],
+            source_asn_name=record["source"]["asn_name"],
+            source_asn_domain=record["source"]["asn_domain"],
             message_count=record["count"],
             disposition=record["policy_evaluated"]["disposition"],
             dkim_aligned=record["policy_evaluated"]["dkim"] is not None
@@ -703,6 +712,9 @@ def save_forensic_report_to_opensearch(
         source_country=forensic_report["source"]["country"],
         source_reverse_dns=forensic_report["source"]["reverse_dns"],
         source_base_domain=forensic_report["source"]["base_domain"],
+        source_asn=forensic_report["source"]["asn"],
+        source_asn_name=forensic_report["source"]["asn_name"],
+        source_asn_domain=forensic_report["source"]["asn_domain"],
         authentication_mechanisms=forensic_report["authentication_mechanisms"],
         auth_failure=forensic_report["auth_failure"],
         dkim_domain=forensic_report["dkim_domain"],
```
```diff
@@ -104,6 +104,9 @@ class HECClient(object):
             new_report["source_base_domain"] = record["source"]["base_domain"]
             new_report["source_type"] = record["source"]["type"]
             new_report["source_name"] = record["source"]["name"]
+            new_report["source_asn"] = record["source"]["asn"]
+            new_report["source_asn_name"] = record["source"]["asn_name"]
+            new_report["source_asn_domain"] = record["source"]["asn_domain"]
             new_report["message_count"] = record["count"]
             new_report["disposition"] = record["policy_evaluated"]["disposition"]
             new_report["spf_aligned"] = record["alignment"]["spf"]
```
@@ -40,6 +40,9 @@ class IPSourceInfo(TypedDict):
|
||||
base_domain: Optional[str]
|
||||
name: Optional[str]
|
||||
type: Optional[str]
|
||||
asn: Optional[int]
|
||||
asn_name: Optional[str]
|
||||
asn_domain: Optional[str]
|
||||
|
||||
|
||||
class AggregateAlignment(TypedDict):
|
||||
|
||||
@@ -151,6 +151,9 @@ class IPAddressInfo(TypedDict):
|
||||
base_domain: Optional[str]
|
||||
name: Optional[str]
|
||||
type: Optional[str]
|
||||
asn: Optional[int]
|
||||
asn_name: Optional[str]
|
||||
asn_domain: Optional[str]
|
||||
|
||||
|
||||
def decode_base64(data: str) -> bytes:
|
||||
@@ -457,20 +460,7 @@ def load_ip_db(
|
||||
logger.info("Using bundled IP database")
|
||||
|
||||
|
||||
def get_ip_address_country(
|
||||
ip_address: str, *, db_path: Optional[str] = None
|
||||
) -> Optional[str]:
|
||||
"""
|
||||
Returns the ISO code for the country associated
|
||||
with the given IPv4 or IPv6 address
|
||||
|
||||
Args:
|
||||
ip_address (str): The IP address to query for
|
||||
db_path (str): Path to a MMDB file from IPinfo, MaxMind, or DBIP
|
||||
|
||||
Returns:
|
||||
str: And ISO country code associated with the given IP address
|
||||
"""
|
||||
def _get_ip_database_path(db_path: Optional[str]) -> str:
|
||||
db_paths = [
|
||||
"ipinfo_lite.mmdb",
|
||||
"GeoLite2-Country.mmdb",
|
||||
@@ -486,14 +476,13 @@ def get_ip_address_country(
|
||||
"dbip-country.mmdb",
|
||||
]
|
||||
|
||||
if db_path is not None:
|
||||
if not os.path.isfile(db_path):
|
||||
logger.warning(
|
||||
f"No file exists at {db_path}. Falling back to an "
|
||||
"included copy of the IPinfo IP to Country "
|
||||
"Lite database."
|
||||
)
|
||||
db_path = None
|
||||
if db_path is not None and not os.path.isfile(db_path):
|
||||
logger.warning(
|
||||
f"No file exists at {db_path}. Falling back to an "
|
||||
"included copy of the IPinfo IP to Country "
|
||||
"Lite database."
|
||||
)
|
||||
db_path = None
|
||||
|
||||
if db_path is None:
|
||||
for system_path in db_paths:
|
||||
@@ -513,14 +502,37 @@ def get_ip_address_country(
|
||||
if db_age > timedelta(days=30):
|
||||
logger.warning("IP database is more than a month old")
|
||||
|
||||
db_reader = maxminddb.open_database(db_path)
|
||||
return db_path
|
||||
|
||||
|
||||
class _IPDatabaseRecord(TypedDict):
|
||||
country: Optional[str]
|
||||
asn: Optional[int]
|
||||
asn_name: Optional[str]
|
||||
asn_domain: Optional[str]
|
||||
|
||||
|
||||
def get_ip_address_db_record(
|
||||
ip_address: str, *, db_path: Optional[str] = None
|
||||
) -> _IPDatabaseRecord:
|
||||
"""Look up an IP in the configured MMDB and return country + ASN fields.
|
||||
|
||||
IPinfo Lite carries ``country_code``, ``as_name``, and ``as_domain`` on
|
||||
every record. MaxMind/DBIP country-only databases carry only country, so
|
||||
``asn_name`` / ``asn_domain`` come back None for those users.
|
||||
"""
|
||||
resolved_path = _get_ip_database_path(db_path)
|
||||
db_reader = maxminddb.open_database(resolved_path)
|
||||
record = db_reader.get(ip_address)
|
||||
|
||||
# Support both the IPinfo schema (flat top-level ``country_code``) and the
|
||||
# MaxMind/DBIP schema (nested ``country.iso_code``) so users dropping in
|
||||
# their own MMDB from any of these providers keeps working.
|
||||
country: Optional[str] = None
|
||||
asn: Optional[int] = None
|
||||
asn_name: Optional[str] = None
|
||||
asn_domain: Optional[str] = None
|
||||
if isinstance(record, dict):
|
||||
# Support both the IPinfo schema (flat top-level ``country_code``) and
|
||||
# the MaxMind/DBIP schema (nested ``country.iso_code``) so users
|
||||
# dropping in their own MMDB from any of these providers keeps working.
|
||||
code = record.get("country_code")
|
||||
if code is None:
|
||||
nested = record.get("country")
|
||||
@@ -529,7 +541,52 @@ def get_ip_address_country(
|
||||
if isinstance(code, str):
|
||||
country = code
|
||||
|
||||
return country
|
||||
# Normalize ASN to a plain integer. IPinfo stores it as a string like
|
||||
# "AS15169"; MaxMind's ASN DB uses ``autonomous_system_number`` as an
|
||||
# int. Integer form lets consumers do range queries and sort
|
||||
# numerically; display-time formatting with an "AS" prefix is trivial.
|
||||
raw_asn = record.get("asn")
|
||||
if isinstance(raw_asn, int):
|
||||
asn = raw_asn
|
||||
elif isinstance(raw_asn, str) and raw_asn:
|
||||
digits = raw_asn.removeprefix("AS").removeprefix("as")
|
||||
if digits.isdigit():
|
||||
asn = int(digits)
|
||||
if asn is None:
|
||||
mm_asn = record.get("autonomous_system_number")
|
||||
if isinstance(mm_asn, int):
|
||||
asn = mm_asn
|
||||
|
||||
name = record.get("as_name") or record.get("autonomous_system_organization")
|
||||
if isinstance(name, str) and name:
|
||||
asn_name = name
|
||||
domain = record.get("as_domain")
|
||||
if isinstance(domain, str) and domain:
|
||||
asn_domain = domain.lower()
|
||||
|
||||
return {
|
||||
"country": country,
|
||||
"asn": asn,
|
||||
"asn_name": asn_name,
|
||||
"asn_domain": asn_domain,
|
||||
}
|
||||
|
||||
|
||||
def get_ip_address_country(
|
||||
ip_address: str, *, db_path: Optional[str] = None
|
||||
) -> Optional[str]:
|
||||
"""
|
||||
Returns the ISO code for the country associated
|
||||
with the given IPv4 or IPv6 address.
|
||||
|
||||
Args:
|
||||
ip_address (str): The IP address to query for
|
||||
db_path (str): Path to a MMDB file from IPinfo, MaxMind, or DBIP
|
||||
|
||||
Returns:
|
||||
str: And ISO country code associated with the given IP address
|
||||
"""
|
||||
return get_ip_address_db_record(ip_address, db_path=db_path)["country"]
|
||||
|
||||
|
||||
def load_reverse_dns_map(
|
||||
@@ -723,6 +780,9 @@ def get_ip_address_info(
|
||||
"base_domain": None,
|
||||
"name": None,
|
||||
"type": None,
|
||||
"asn": None,
|
||||
"asn_name": None,
|
||||
"asn_domain": None,
|
||||
}
    if offline:
        reverse_dns = None
@@ -733,9 +793,13 @@ def get_ip_address_info(
            timeout=timeout,
            retries=retries,
        )
    country = get_ip_address_country(ip_address, db_path=ip_db_path)
    info["country"] = country
    db_record = get_ip_address_db_record(ip_address, db_path=ip_db_path)
    info["country"] = db_record["country"]
    info["asn"] = db_record["asn"]
    info["asn_name"] = db_record["asn_name"]
    info["asn_domain"] = db_record["asn_domain"]

    info["reverse_dns"] = reverse_dns
    if reverse_dns is not None:
        base_domain = get_base_domain(reverse_dns)
        if base_domain is not None:
@@ -750,12 +814,34 @@ def get_ip_address_info(
            info["base_domain"] = base_domain
            info["type"] = service["type"]
            info["name"] = service["name"]

        if cache is not None:
            cache[ip_address] = info
            logger.debug(f"IP address {ip_address} added to cache")
    else:
        logger.debug(f"IP address {ip_address} reverse_dns not found")
        # Fall back to ASN data for source attribution. ``reverse_dns`` and
        # ``base_domain`` are left null so consumers can still tell an
        # ASN-derived row apart from one resolved via a real PTR.
        map_value: ReverseDNSMap = (
            reverse_dns_map if reverse_dns_map is not None else {}
        )
        if len(map_value) == 0:
            load_reverse_dns_map(
                map_value,
                always_use_local_file=always_use_local_files,
                local_file_path=reverse_dns_map_path,
                url=reverse_dns_map_url,
                offline=offline,
            )
        if info["asn_domain"] and info["asn_domain"] in map_value:
            service = map_value[info["asn_domain"]]
            info["name"] = service["name"]
            info["type"] = service["type"]
        elif info["asn_name"]:
            # ASN-domain not in the map: surface the raw AS name with no
            # classification. Better than leaving the row unattributed.
            info["name"] = info["asn_name"]

    if cache is not None:
        cache[ip_address] = info
        logger.debug(f"IP address {ip_address} added to cache")

    return info
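The attribution precedence implemented above (PTR-derived base domain first, then the ASN domain as a key into the same map, then the raw AS name with no type) can be sketched in isolation. ``attribute_source`` and ``SERVICE_MAP`` below are illustrative stand-ins, not parsedmarc APIs:

```python
from typing import Optional

# Toy stand-in for the reverse_dns_map: base or ASN domain -> service metadata.
SERVICE_MAP = {
    "google.com": {"name": "Google", "type": "Email Provider"},
}


def attribute_source(
    base_domain: Optional[str],
    asn_domain: Optional[str],
    asn_name: Optional[str],
) -> tuple[Optional[str], Optional[str]]:
    """Return (name, type): PTR base_domain wins, then the ASN domain
    via the map, then the raw AS name with type left as None."""
    for key in (base_domain, asn_domain):
        if key and key in SERVICE_MAP:
            service = SERVICE_MAP[key]
            return service["name"], service["type"]
    if asn_name:
        return asn_name, None
    return None, None


# No PTR, but the ASN domain hits the map: fully attributed.
print(attribute_source(None, "google.com", "Google LLC"))
# Map miss: raw AS name surfaces with no classification.
print(attribute_source(None, "example.net", "Example AS"))
```

Because ``reverse_dns`` and ``base_domain`` stay null on ASN-derived rows, consumers can always distinguish the two attribution paths even though both populate ``source_name``.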

tests.py
@@ -223,6 +223,67 @@ class Test(unittest.TestCase):
        parsedmarc.parsed_smtp_tls_reports_to_csv(result["report"])
        print("Passed!")

    def testIpAddressInfoSurfacesASNFields(self):
        """ASN number, name, and domain from the bundled MMDB appear on every
        IP info result, even when no PTR resolves."""
        info = parsedmarc.utils.get_ip_address_info("8.8.8.8", offline=True)
        self.assertEqual(info["asn"], 15169)
        self.assertIsInstance(info["asn"], int)
        self.assertEqual(info["asn_domain"], "google.com")
        self.assertTrue(info["asn_name"])

    def testIpAddressInfoFallsBackToASNMapEntryWhenNoPTR(self):
        """When reverse DNS is absent, the ASN domain should be used as a
        lookup into the reverse_dns_map so the row still gets attributed,
        while reverse_dns and base_domain remain null."""
        info = parsedmarc.utils.get_ip_address_info("8.8.8.8", offline=True)
        self.assertIsNone(info["reverse_dns"])
        self.assertIsNone(info["base_domain"])
        self.assertEqual(info["name"], "Google (Including Gmail and Google Workspace)")
        self.assertEqual(info["type"], "Email Provider")

    def testIpAddressInfoFallsBackToRawASNameOnMapMiss(self):
        """When neither a PTR nor an ASN-map entry resolves, the raw AS name
        is used as source_name with type left null, which is still better
        than leaving the row unattributed."""
        # Mock the MMDB record with a fictional ASN domain so this exercises
        # the asn_name fallback branch without depending on a specific state
        # of the bundled map.
        from unittest.mock import patch

        with patch(
            "parsedmarc.utils.get_ip_address_db_record",
            return_value={
                "country": "US",
                "asn": 64496,
                "asn_name": "Some Unmapped Org, Inc.",
                "asn_domain": "unmapped-for-this-test.example",
            },
        ):
            # Bypass the cache to avoid prior-test pollution.
            info = parsedmarc.utils.get_ip_address_info(
                "192.0.2.1", offline=True, cache=None
            )
        self.assertIsNone(info["reverse_dns"])
        self.assertIsNone(info["base_domain"])
        self.assertIsNone(info["type"])
        self.assertEqual(info["name"], "Some Unmapped Org, Inc.")
        self.assertEqual(info["asn_domain"], "unmapped-for-this-test.example")

    def testAggregateCsvExposesASNColumns(self):
        """The aggregate CSV output should include source_asn, source_asn_name,
        and source_asn_domain columns."""
        result = parsedmarc.parse_report_file(
            "samples/aggregate/!example.com!1538204542!1538463818.xml",
            always_use_local_files=True,
            offline=True,
        )
        csv_text = parsedmarc.parsed_aggregate_reports_to_csv(result["report"])
        header = csv_text.splitlines()[0].split(",")
        self.assertIn("source_asn", header)
        self.assertIn("source_asn_name", header)
        self.assertIn("source_asn_domain", header)

    def testOpenSearchSigV4RequiresRegion(self):
        with self.assertRaises(opensearch_module.OpenSearchError):
            opensearch_module.set_hosts(