Surface ASN info and use it for source attribution when a PTR is absent (#715)

* Surface ASN info and fall back to it when a PTR is absent

Adds three new fields to every IP source record — ``asn`` (integer,
e.g. 15169), ``asn_name`` (``"Google LLC"``), ``asn_domain``
(``"google.com"``) — sourced from the bundled IPinfo Lite MMDB. These
flow through to CSV, JSON, Elasticsearch, OpenSearch, and Splunk
outputs as ``source_asn``, ``source_asn_name``, ``source_asn_domain``.

More importantly: when an IP has no reverse DNS (common for many
large senders), source attribution now falls back to the ASN domain
as a lookup key into the same ``reverse_dns_map``. Thanks to #712
and #714, ~85% of routed IPv4 space now has an ``as_domain`` that
hits the map, so rows that were previously unattributable now get a
``source_name``/``source_type`` derived from the ASN. When the ASN
domain misses the map, the raw AS name is used as ``source_name``
with ``source_type`` left null — still better than nothing.

Crucially, ``source_reverse_dns`` and ``source_base_domain`` remain
null on ASN-derived rows, so downstream consumers can still tell a
PTR-resolved attribution apart from an ASN-derived one.

ASN is stored as an integer at the schema level (Elasticsearch /
OpenSearch mappings use ``Integer``) so consumers can do range
queries and numeric sorts; dashboards can prepend ``AS`` at display
time. The MMDB reader normalizes both IPinfo's ``"AS15169"`` string
and MaxMind's ``autonomous_system_number`` int to the same int form.

Also fixes a pre-existing caching bug in ``get_ip_address_info``:
entries without reverse DNS were never written to the IP-info cache,
so every no-PTR IP re-did the MMDB read and DNS attempt on every
call. The cache write is now unconditional.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* Bump to 9.9.0 and document the ASN fallback work

Updates the changelog with a 9.9.0 entry covering the ASN-domain
aliases (#712, #714), map-maintenance tooling fixes (#713), and the
ASN-fallback source attribution added in this branch.

Extends AGENTS.md to explain that ``base_reverse_dns_map.csv`` is now
a mixed-namespace map (rDNS bases alongside ASN domains) and adds a
short recipe for finding high-value ASN-domain misses against the
bundled MMDB, so future contributors know where the map's second
lookup path comes from.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* Document project conventions previously held only in agent memory

Promotes four conventions out of per-agent memory and into AGENTS.md
so every contributor — human or agent — works from the same baseline:

- Run ruff check + format before committing (Code Style).
- Store natively numeric values as numbers, not pre-formatted strings
  (e.g. ASN as int 15169, not "AS15169"; ES/OS mappings as Integer)
  (Code Style).
- Before rewriting a tracked list/data file from freshly-generated
  content, verify the existing content via git — these files
  accumulate manually-curated entries across sessions (Editing tracked
  data files).
- A release isn't done until hatch-built sdist + wheel are attached to
  the GitHub release page; full 8-step sequence documented (Releases).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Sean Whalen <seanthegeek@users.noreply.github.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
This commit is contained in: Sean Whalen, 2026-04-23 02:13:30 -04:00, committed by GitHub
parent c2678f8e21, commit 2cda5bf59b
11 changed files with 315 additions and 49 deletions


@@ -62,22 +62,42 @@ IP address info cached for 4 hours, seen aggregate report IDs cached for 1 hour
## Code Style
- Ruff for formatting and linting (configured in `.vscode/settings.json`). Run `ruff check .` and `ruff format --check .` after every code edit, before committing.
- TypedDict for structured data, type hints throughout.
- Python ≥3.10 required.
- Tests are in a single `tests.py` file using unittest; sample reports live in `samples/`.
- File path config values must be wrapped with `_expand_path()` in `cli.py`.
- Maildir UID checks are intentionally relaxed (warn, don't crash) for Docker compatibility.
- Token file writes must create parent directories before opening for write.
- Store natively numeric values as numbers, not pre-formatted strings. Example: ASN is stored as `int 15169`, not `"AS15169"`; Elasticsearch / OpenSearch mappings for such fields use `Integer()` so consumers can do range queries and numeric sorts. Display layers format with a prefix at render time.
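A minimal sketch of this convention, assuming provider values shaped like IPinfo's `"AS15169"` string or MaxMind's plain int (the helper name is illustrative; the real normalization lives in `get_ip_address_db_record`):

```python
from typing import Optional


def normalize_asn(raw: object) -> Optional[int]:
    """Coerce an ASN from any provider representation to a plain int.

    IPinfo Lite stores "AS15169"; MaxMind stores the int 15169 directly.
    Anything unparseable comes back None rather than a guessed value.
    """
    if isinstance(raw, int):
        return raw
    if isinstance(raw, str):
        digits = raw.upper().removeprefix("AS")
        if digits.isdigit():
            return int(digits)
    return None
```

Storing the int (and mapping it as `Integer()` in ES/OS) is what makes range queries like `source_asn:[64512 TO 65534]` possible; the `"AS"` prefix belongs in the display layer only.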
## Editing tracked data files
Before rewriting a tracked list/data file from freshly-generated content (anything under `parsedmarc/resources/maps/`, CSVs, `.txt` lists), check the existing file first — `git show HEAD:<path> | wc -l`, `git log -1 -- <path>`, `git diff --stat`. Files like `known_unknown_base_reverse_dns.txt` and `base_reverse_dns_map.csv` accumulate manually-curated entries across many sessions, and a "fresh" regeneration that drops the row count is almost certainly destroying prior work. If the new content is meant to *add* rather than *replace*, use a merge/append pattern. Treat any unexpected row-count drop in the pending diff as a red flag.
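One workable merge/append pattern, sketched with illustrative names (this is not project tooling), that adds freshly generated rows without dropping curated ones:

```python
import csv


def merge_map_rows(existing_path: str, new_rows: list[list[str]]) -> int:
    """Append-only merge: keep every curated row, add only unseen keys.

    Returns the number of rows actually added, so a caller can sanity-check
    that the tracked file only grew.
    """
    with open(existing_path, newline="", encoding="utf-8") as f:
        rows = list(csv.reader(f))
    header, body = rows[0], rows[1:]
    seen = {row[0].strip().lower() for row in body}
    added = 0
    for row in new_rows:
        key = row[0].strip().lower()
        if key not in seen:
            body.append(row)
            seen.add(key)
            added += 1
    body.sort(key=lambda row: row[0].lower())
    # newline="" plus csv.writer's default "\r\n" terminator preserves the
    # CRLF line endings the tracked maps use.
    with open(existing_path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(header)
        writer.writerows(body)
    return added
```

After a merge like this, `git diff --stat` should only ever show additions; any deletion in the pending diff means a curated row was lost.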
## Releases
A release isn't done until built artifacts are attached to the GitHub release page. Full sequence:
1. Bump version in `parsedmarc/constants.py`; update `CHANGELOG.md` with a new section under the new version number.
2. Commit on a feature branch, open a PR, merge to master.
3. `git fetch && git checkout master && git pull`.
4. `git tag -a <version> -m "<version>" <sha>` and `git push origin <version>`.
5. `rm -rf dist && hatch build`. Verify `git describe --tags --exact-match` matches the tag.
6. `gh release create <version> --title "<version>" --notes-file <notes>`.
7. `gh release upload <version> dist/parsedmarc-<version>.tar.gz dist/parsedmarc-<version>-py3-none-any.whl`.
8. Confirm `gh release view <version> --json assets` shows both the sdist and the wheel before considering the release complete.
## Maintaining the reverse DNS maps
`parsedmarc/resources/maps/base_reverse_dns_map.csv` maps a base domain to a display name and service type. The same map is consulted at two points: first with a PTR-derived base domain, and — if the IP has no PTR — with the ASN domain from the bundled IPinfo Lite MMDB (`parsedmarc/resources/ipinfo/ipinfo_lite.mmdb`). See `parsedmarc/resources/maps/README.md` for the field format and the service_type precedence rules.
Because both lookup paths read the same CSV, map keys are a mixed namespace — rDNS-base domains (e.g. `comcast.net`, discovered via `base_reverse_dns.csv`) coexist with ASN domains (e.g. `comcast.com`, discovered via coverage-gap analysis against the MMDB). Entries of both kinds should point to the same `(name, type)` when they describe the same operator — grep before inventing a new display name.
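The two lookup paths can be sketched like this (the function and its shape are illustrative; the real logic lives in `get_ip_address_info`):

```python
from typing import Optional, TypedDict


class Service(TypedDict):
    name: str
    type: str


def attribute_source(
    base_domain: Optional[str],   # PTR-derived base domain, None when no PTR
    asn_domain: Optional[str],    # as_domain from the bundled MMDB
    asn_name: Optional[str],      # raw AS name, last resort
    dns_map: dict[str, Service],
) -> tuple[Optional[str], Optional[str]]:
    """Return (source_name, source_type) using the shared mixed-namespace map."""
    if base_domain is not None:
        # PTR path: when a PTR resolved, only this path is consulted.
        service = dns_map.get(base_domain)
        return (service["name"], service["type"]) if service else (None, None)
    if asn_domain and asn_domain in dns_map:
        # No PTR: the ASN domain keys into the same map.
        service = dns_map[asn_domain]
        return service["name"], service["type"]
    if asn_name:
        # Map miss: surface the raw AS name with no classification.
        return asn_name, None
    return None, None
```

Because both branches read the same dict, aliasing `comcast.com` (ASN domain) to the same `(name, type)` as `comcast.net` (rDNS base) makes both paths agree for that operator.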
### File format
- CSV uses **CRLF** line endings and UTF-8 encoding — preserve both when editing programmatically.
- Entries are sorted alphabetically (case-insensitive) by the first column. `parsedmarc/resources/maps/sortlists.py` is authoritative — run it after any batch edit to re-sort, dedupe, and validate `type` values.
- Names containing commas must be quoted.
- Do not edit in Excel (it mangles Unicode); use LibreOffice Calc or a text editor.
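The invariants `sortlists.py` enforces on the map amount to roughly the following (an illustrative sketch, not the script's actual code; `ALLOWED_TYPES` is a stand-in for the real list in `base_reverse_dns_types.txt`):

```python
import csv
import io

# Stand-in for the allowed service types in base_reverse_dns_types.txt.
ALLOWED_TYPES = {"ISP", "Email Provider", "Finance"}


def sort_map_text(text: str) -> str:
    """Case-insensitive sort + dedupe; validates the lowercase `type` column."""
    reader = csv.DictReader(io.StringIO(text))
    rows: list[dict[str, str]] = []
    seen: set[str] = set()
    for row in reader:
        key = row["base_reverse_dns"].strip().lower()
        if key in seen:
            continue  # drop exact-key duplicates
        seen.add(key)
        service_type = row["type"].strip()  # note: lowercase header "type"
        if service_type and service_type not in ALLOWED_TYPES:
            raise ValueError(f"invalid type {service_type!r} for {key}")
        rows.append(row)
    rows.sort(key=lambda row: row["base_reverse_dns"].lower())
    out = io.StringIO()
    # lineterminator="\r\n" keeps the CRLF endings the tracked file uses.
    writer = csv.DictWriter(out, fieldnames=reader.fieldnames, lineterminator="\r\n")
    writer.writeheader()
    writer.writerows(rows)
    return out.getvalue()
```

Note the two details that were previously broken in the real script: the header key is compared in lowercase (`"type"`, not `"Type"`), and the sort key is lowercased so mixed-case entries land in alphabetical position.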
@@ -125,7 +145,32 @@ When `unknown_base_reverse_dns.csv` has new entries, follow this order rather th
- `detect_psl_overrides.py` — scans the lists for clustered IP-containing patterns, auto-adds brand suffixes to `psl_overrides.txt`, folds affected entries to their base, and removes any remaining full-IP entries. Run before the collector on any new batch.
- `collect_domain_info.py` — the bulk enrichment collector described above. Respects `psl_overrides.txt` and skips full-IP entries.
- `find_bad_utf8.py` — locates invalid UTF-8 bytes (used after past encoding corruption).
- `sortlists.py` — case-insensitive sort + dedupe + `type`-column validator for the list files; the authoritative sorter run after every batch edit.
### Checking ASN-domain coverage of the MMDB
Separately from `base_reverse_dns.csv`, the MMDB itself is a source of keys worth mapping. To find ASN domains with high IP weight that don't yet have a map entry, walk every record in `ipinfo_lite.mmdb`, aggregate IPv4 count per `as_domain`, and subtract what's already a map key:
```python
import csv, maxminddb
from collections import defaultdict
keys = set()
with open("parsedmarc/resources/maps/base_reverse_dns_map.csv", newline="", encoding="utf-8") as f:
for row in csv.DictReader(f):
keys.add(row["base_reverse_dns"].strip().lower())
v4 = defaultdict(int); names = {}
for net, rec in maxminddb.open_database("parsedmarc/resources/ipinfo/ipinfo_lite.mmdb"):
if net.version != 4 or not isinstance(rec, dict): continue
d = rec.get("as_domain")
if not d: continue
v4[d.lower()] += net.num_addresses
names[d.lower()] = rec.get("as_name", "")
miss = sorted(((d, v4[d], names[d]) for d in v4 if d not in keys), key=lambda x: -x[1])
for d, c, n in miss[:50]:
print(f"{c:>12,} {d:<30} {n}")
```
Apply the same classification rules above (precedence, naming consistency, skip-if-ambiguous, privacy). Many top misses will be brands already in the map under a different rDNS-base key — the goal there is to alias the ASN domain to the same `(name, type)` so both lookup paths hit. For ASN domains with no obvious brand identity (small resellers, parked ASNs), don't map them — the attribution code falls back to the raw `as_name` from the MMDB, which is better than a guess.
### After a batch merge


@@ -1,5 +1,29 @@
# Changelog
## 9.9.0
### Changes
- Source attribution now has an ASN fallback. Every IP source record carries three new fields — `asn` (integer, e.g. `15169`), `asn_name` (`"Google LLC"`), and `asn_domain` (`"google.com"`) — sourced from the bundled IPinfo Lite MMDB. When an IP has no reverse DNS, `get_ip_address_info()` uses `asn_domain` as a lookup into the same `reverse_dns_map`, and if that misses, falls back to the raw `asn_name`. `reverse_dns` and `base_domain` stay null on ASN-derived rows so consumers can still distinguish PTR-derived from ASN-derived attribution.
- Added `source_asn`, `source_asn_name`, `source_asn_domain` to CSV output (aggregate + forensic), JSON output, and the Elasticsearch / OpenSearch / Splunk integrations. `source_asn` is mapped as `Integer` at the schema level so consumers can do range queries and numeric sorts; dashboards can prepend `"AS"` at display time.
- Expanded `base_reverse_dns_map.csv` with 500 ASN-domain aliases for the most-routed IPv4 ranges. IPv4-weighted coverage of the bundled `ipinfo_lite.mmdb` went from ~34% of routed space matching a map entry via ASN domain to ~85%. Every alias is a brand that was already in the map under a different rDNS-base key (e.g. adding `comcast.com` alongside the existing `comcast.net`), plus a small number of large operators that previously had no entry. 11 entries were also promoted out of `known_unknown_base_reverse_dns.txt` because ASN context made their identity unambiguous.
- Added `get_ip_address_db_record()` in `parsedmarc.utils`, a single-open MMDB reader that returns country + ASN fields together. `get_ip_address_country()` is now a thin wrapper. Supports both IPinfo Lite's schema (`country_code`, `asn` as `"AS15169"`, `as_name`, `as_domain`) and MaxMind's (`country.iso_code`, `autonomous_system_number` as int, `autonomous_system_organization`) in one pass; ASN is normalized to a plain int from either. MaxMind users who drop in their own ASN MMDB get `asn` + `asn_name` populated; `asn_domain` stays null because MaxMind doesn't carry it.
### Fixed
- `get_ip_address_info()` now caches entries for IPs without reverse DNS. Previously the cache write was inside the `if reverse_dns is not None` branch, so every no-PTR IP re-did the MMDB read and DNS attempt on every call.
- Fixed three bugs in `parsedmarc/resources/maps/sortlists.py` that silently disabled the `type`-column validator and sorted the map case-sensitively, contrary to its documented behavior:
  - The validator's allowed-values map was keyed on `"Type"` (capital T), but the CSV header is `"type"` (lowercase), so every row bypassed validation.
  - Types were read with trailing newlines via `f.readlines()`, so comparisons would not have matched even if the column name had been right.
  - `sort_csv()` was called without `case_insensitive_sort=True`, which moved the sole mixed-case key (`United-domains.de`) to the top of the file instead of into its alphabetical position.
- Fixed eight pre-existing map rows with invalid or inconsistent `type` values that the now-working validator surfaced: casing corrections for `dhl.com` (`logistics` → `Logistics`), `ghm-grenoble.fr` (`healthcare` → `Healthcare`), and `regusnet.com` (`Real estate` → `Real Estate`); reclassified `lodestonegroup.com` from the nonexistent `Insurance` type to `Finance`; added missing `Religion` and `Utilities` entries to `base_reverse_dns_types.txt` so it matches the README's industry list.
- Fixed the `rt.ru` map entry: was classified as `RT,Government Media`, which conflated Rostelecom (the Russian telco that owns and uses `rt.ru`) with RT / Russia Today (which uses `rt.com`). Corrected to `Rostelecom,ISP`.
### Upgrade notes
- Output schema change: CSV, JSON, Elasticsearch, OpenSearch, and Splunk all gain three new fields per row (`source_asn`, `source_asn_name`, `source_asn_domain`). Existing queries and dashboards keep working; dashboards that want to consume the new fields will need to be updated. Elasticsearch / OpenSearch will add the new mappings on next document write.
- Rows for IPs without reverse DNS now populate `source_name` / `source_type` via ASN fallback. If downstream dashboards treated "null `source_name`" as a signal for "no rDNS", switch to checking `source_reverse_dns IS NULL` instead — that remains the unambiguous signal.
## 9.8.0
### Changes


@@ -44,7 +44,10 @@ of the report schema.
"reverse_dns": null,
"base_domain": null,
"name": null,
"type": null,
"asn": 7018,
"asn_name": "AT&T Services, Inc.",
"asn_domain": "att.com"
},
"count": 2,
"alignment": {
@@ -90,7 +93,7 @@ of the report schema.
### CSV aggregate report
```text
xml_schema,org_name,org_email,org_extra_contact_info,report_id,begin_date,end_date,normalized_timespan,errors,domain,adkim,aspf,p,sp,pct,fo,source_ip_address,source_country,source_reverse_dns,source_base_domain,source_name,source_type,source_asn,source_asn_name,source_asn_domain,count,spf_aligned,dkim_aligned,dmarc_aligned,disposition,policy_override_reasons,policy_override_comments,envelope_from,header_from,envelope_to,dkim_domains,dkim_selectors,dkim_results,spf_domains,spf_scopes,spf_results
draft,acme.com,noreply-dmarc-support@acme.com,http://acme.com/dmarc/support,9391651994964116463,2012-04-28 00:00:00,2012-04-28 23:59:59,False,,example.com,r,r,none,none,100,0,72.150.241.94,US,,,,,7018,"AT&T Services, Inc.",att.com,2,True,False,True,none,,,example.com,example.com,,example.com,none,fail,example.com,mfrom,pass
@@ -123,7 +126,12 @@ Thanks to GitHub user [xennn](https://github.com/xennn) for the anonymized
"ip_address": "10.10.10.10",
"country": null,
"reverse_dns": null,
"base_domain": null,
"name": null,
"type": null,
"asn": null,
"asn_name": null,
"asn_domain": null
},
"authentication_mechanisms": [],
"original_envelope_id": null,
@@ -193,7 +201,7 @@ Thanks to GitHub user [xennn](https://github.com/xennn) for the anonymized
### CSV forensic report
```text
feedback_type,user_agent,version,original_envelope_id,original_mail_from,original_rcpt_to,arrival_date,arrival_date_utc,subject,message_id,authentication_results,dkim_domain,source_ip_address,source_country,source_reverse_dns,source_base_domain,source_name,source_type,source_asn,source_asn_name,source_asn_domain,delivery_result,auth_failure,reported_domain,authentication_mechanisms,sample_headers_only
auth-failure,Lua/1.0,1.0,,sharepoint@domain.de,peter.pan@domain.de,"Mon, 01 Oct 2018 11:20:27 +0200",2018-10-01 09:20:27,Subject,<38.E7.30937.BD6E1BB5@ mailrelay.de>,"dmarc=fail (p=none, dis=none) header.from=domain.de",,10.10.10.10,,,,,,,,,policy,dmarc,domain.de,,False
```
@@ -238,4 +246,4 @@ auth-failure,Lua/1.0,1.0,,sharepoint@domain.de,peter.pan@domain.de,"Mon, 01 Oct
]
}
]
```


@@ -1114,6 +1114,9 @@ def parsed_aggregate_reports_to_csv_rows(
row["source_base_domain"] = record["source"]["base_domain"]
row["source_name"] = record["source"]["name"]
row["source_type"] = record["source"]["type"]
row["source_asn"] = record["source"]["asn"]
row["source_asn_name"] = record["source"]["asn_name"]
row["source_asn_domain"] = record["source"]["asn_domain"]
row["count"] = record["count"]
row["spf_aligned"] = record["alignment"]["spf"]
row["dkim_aligned"] = record["alignment"]["dkim"]
@@ -1205,6 +1208,9 @@ def parsed_aggregate_reports_to_csv(
"source_base_domain",
"source_name",
"source_type",
"source_asn",
"source_asn_name",
"source_asn_domain",
"count",
"spf_aligned",
"dkim_aligned",
@@ -1406,6 +1412,9 @@ def parsed_forensic_reports_to_csv_rows(
row["source_base_domain"] = report["source"]["base_domain"]
row["source_name"] = report["source"]["name"]
row["source_type"] = report["source"]["type"]
row["source_asn"] = report["source"]["asn"]
row["source_asn_name"] = report["source"]["asn_name"]
row["source_asn_domain"] = report["source"]["asn_domain"]
row["source_country"] = report["source"]["country"]
del row["source"]
row["subject"] = report["parsed_sample"].get("subject")
@@ -1451,6 +1460,9 @@ def parsed_forensic_reports_to_csv(
"source_base_domain",
"source_name",
"source_type",
"source_asn",
"source_asn_name",
"source_asn_domain",
"delivery_result",
"auth_failure",
"reported_domain",


@@ -1,4 +1,4 @@
__version__ = "9.9.0"
USER_AGENT = f"parsedmarc/{__version__}"


@@ -79,6 +79,9 @@ class _AggregateReportDoc(Document):
source_base_domain = Text()
source_type = Text()
source_name = Text()
source_asn = Integer()
source_asn_name = Text()
source_asn_domain = Text()
message_count = Integer
disposition = Text()
dkim_aligned = Boolean()
@@ -173,6 +176,9 @@ class _ForensicReportDoc(Document):
source_ip_address = Ip()
source_country = Text()
source_reverse_dns = Text()
source_asn = Integer()
source_asn_name = Text()
source_asn_domain = Text()
source_authentication_mechanisms = Text()
source_auth_failures = Text()
dkim_domain = Text()
@@ -489,6 +495,9 @@ def save_aggregate_report_to_elasticsearch(
source_base_domain=record["source"]["base_domain"],
source_type=record["source"]["type"],
source_name=record["source"]["name"],
source_asn=record["source"]["asn"],
source_asn_name=record["source"]["asn_name"],
source_asn_domain=record["source"]["asn_domain"],
message_count=record["count"],
disposition=record["policy_evaluated"]["disposition"],
dkim_aligned=record["policy_evaluated"]["dkim"] is not None
@@ -673,6 +682,9 @@ def save_forensic_report_to_elasticsearch(
source_country=forensic_report["source"]["country"],
source_reverse_dns=forensic_report["source"]["reverse_dns"],
source_base_domain=forensic_report["source"]["base_domain"],
source_asn=forensic_report["source"]["asn"],
source_asn_name=forensic_report["source"]["asn_name"],
source_asn_domain=forensic_report["source"]["asn_domain"],
authentication_mechanisms=forensic_report["authentication_mechanisms"],
auth_failure=forensic_report["auth_failure"],
dkim_domain=forensic_report["dkim_domain"],


@@ -82,6 +82,9 @@ class _AggregateReportDoc(Document):
source_base_domain = Text()
source_type = Text()
source_name = Text()
source_asn = Integer()
source_asn_name = Text()
source_asn_domain = Text()
message_count = Integer
disposition = Text()
dkim_aligned = Boolean()
@@ -176,6 +179,9 @@ class _ForensicReportDoc(Document):
source_ip_address = Ip()
source_country = Text()
source_reverse_dns = Text()
source_asn = Integer()
source_asn_name = Text()
source_asn_domain = Text()
source_authentication_mechanisms = Text()
source_auth_failures = Text()
dkim_domain = Text()
@@ -519,6 +525,9 @@ def save_aggregate_report_to_opensearch(
source_base_domain=record["source"]["base_domain"],
source_type=record["source"]["type"],
source_name=record["source"]["name"],
source_asn=record["source"]["asn"],
source_asn_name=record["source"]["asn_name"],
source_asn_domain=record["source"]["asn_domain"],
message_count=record["count"],
disposition=record["policy_evaluated"]["disposition"],
dkim_aligned=record["policy_evaluated"]["dkim"] is not None
@@ -703,6 +712,9 @@ def save_forensic_report_to_opensearch(
source_country=forensic_report["source"]["country"],
source_reverse_dns=forensic_report["source"]["reverse_dns"],
source_base_domain=forensic_report["source"]["base_domain"],
source_asn=forensic_report["source"]["asn"],
source_asn_name=forensic_report["source"]["asn_name"],
source_asn_domain=forensic_report["source"]["asn_domain"],
authentication_mechanisms=forensic_report["authentication_mechanisms"],
auth_failure=forensic_report["auth_failure"],
dkim_domain=forensic_report["dkim_domain"],


@@ -104,6 +104,9 @@ class HECClient(object):
new_report["source_base_domain"] = record["source"]["base_domain"]
new_report["source_type"] = record["source"]["type"]
new_report["source_name"] = record["source"]["name"]
new_report["source_asn"] = record["source"]["asn"]
new_report["source_asn_name"] = record["source"]["asn_name"]
new_report["source_asn_domain"] = record["source"]["asn_domain"]
new_report["message_count"] = record["count"]
new_report["disposition"] = record["policy_evaluated"]["disposition"]
new_report["spf_aligned"] = record["alignment"]["spf"]


@@ -40,6 +40,9 @@ class IPSourceInfo(TypedDict):
base_domain: Optional[str]
name: Optional[str]
type: Optional[str]
asn: Optional[int]
asn_name: Optional[str]
asn_domain: Optional[str]
class AggregateAlignment(TypedDict):


@@ -151,6 +151,9 @@ class IPAddressInfo(TypedDict):
base_domain: Optional[str]
name: Optional[str]
type: Optional[str]
asn: Optional[int]
asn_name: Optional[str]
asn_domain: Optional[str]
def decode_base64(data: str) -> bytes:
@@ -457,20 +460,7 @@ def load_ip_db(
logger.info("Using bundled IP database")
def _get_ip_database_path(db_path: Optional[str]) -> str:
db_paths = [
"ipinfo_lite.mmdb",
"GeoLite2-Country.mmdb",
@@ -486,14 +476,13 @@ def get_ip_address_country(
"dbip-country.mmdb",
]
if db_path is not None and not os.path.isfile(db_path):
logger.warning(
f"No file exists at {db_path}. Falling back to an "
"included copy of the IPinfo IP to Country "
"Lite database."
)
db_path = None
if db_path is None:
for system_path in db_paths:
@@ -513,14 +502,37 @@ def get_ip_address_country(
if db_age > timedelta(days=30):
logger.warning("IP database is more than a month old")
return db_path
class _IPDatabaseRecord(TypedDict):
country: Optional[str]
asn: Optional[int]
asn_name: Optional[str]
asn_domain: Optional[str]
def get_ip_address_db_record(
ip_address: str, *, db_path: Optional[str] = None
) -> _IPDatabaseRecord:
"""Look up an IP in the configured MMDB and return country + ASN fields.
IPinfo Lite carries ``country_code``, ``as_name``, and ``as_domain`` on
every record. MaxMind/DBIP country-only databases carry only country, so
``asn_name`` / ``asn_domain`` come back None for those users.
"""
resolved_path = _get_ip_database_path(db_path)
db_reader = maxminddb.open_database(resolved_path)
record = db_reader.get(ip_address)
country: Optional[str] = None
asn: Optional[int] = None
asn_name: Optional[str] = None
asn_domain: Optional[str] = None
if isinstance(record, dict):
# Support both the IPinfo schema (flat top-level ``country_code``) and
# the MaxMind/DBIP schema (nested ``country.iso_code``) so users
# dropping in their own MMDB from any of these providers keep working.
code = record.get("country_code")
if code is None:
nested = record.get("country")
@@ -529,7 +541,52 @@ def get_ip_address_country(
if isinstance(code, str):
country = code
# Normalize ASN to a plain integer. IPinfo stores it as a string like
# "AS15169"; MaxMind's ASN DB uses ``autonomous_system_number`` as an
# int. Integer form lets consumers do range queries and sort
# numerically; display-time formatting with an "AS" prefix is trivial.
raw_asn = record.get("asn")
if isinstance(raw_asn, int):
asn = raw_asn
elif isinstance(raw_asn, str) and raw_asn:
digits = raw_asn.removeprefix("AS").removeprefix("as")
if digits.isdigit():
asn = int(digits)
if asn is None:
mm_asn = record.get("autonomous_system_number")
if isinstance(mm_asn, int):
asn = mm_asn
name = record.get("as_name") or record.get("autonomous_system_organization")
if isinstance(name, str) and name:
asn_name = name
domain = record.get("as_domain")
if isinstance(domain, str) and domain:
asn_domain = domain.lower()
return {
"country": country,
"asn": asn,
"asn_name": asn_name,
"asn_domain": asn_domain,
}
def get_ip_address_country(
ip_address: str, *, db_path: Optional[str] = None
) -> Optional[str]:
"""
Returns the ISO code for the country associated
with the given IPv4 or IPv6 address.
Args:
ip_address (str): The IP address to query for
db_path (str): Path to a MMDB file from IPinfo, MaxMind, or DBIP
Returns:
str: An ISO country code associated with the given IP address
"""
return get_ip_address_db_record(ip_address, db_path=db_path)["country"]
def load_reverse_dns_map(
@@ -723,6 +780,9 @@ def get_ip_address_info(
"base_domain": None,
"name": None,
"type": None,
"asn": None,
"asn_name": None,
"asn_domain": None,
}
if offline:
reverse_dns = None
@@ -733,9 +793,13 @@ def get_ip_address_info(
timeout=timeout,
retries=retries,
)
db_record = get_ip_address_db_record(ip_address, db_path=ip_db_path)
info["country"] = db_record["country"]
info["asn"] = db_record["asn"]
info["asn_name"] = db_record["asn_name"]
info["asn_domain"] = db_record["asn_domain"]
info["reverse_dns"] = reverse_dns
if reverse_dns is not None:
base_domain = get_base_domain(reverse_dns)
if base_domain is not None:
@@ -750,12 +814,34 @@ def get_ip_address_info(
info["base_domain"] = base_domain
info["type"] = service["type"]
info["name"] = service["name"]
else:
logger.debug(f"IP address {ip_address} reverse_dns not found")
# Fall back to ASN data for source attribution. ``reverse_dns`` and
# ``base_domain`` are left null so consumers can still tell an
# ASN-derived row apart from one resolved via a real PTR.
map_value: ReverseDNSMap = (
reverse_dns_map if reverse_dns_map is not None else {}
)
if len(map_value) == 0:
load_reverse_dns_map(
map_value,
always_use_local_file=always_use_local_files,
local_file_path=reverse_dns_map_path,
url=reverse_dns_map_url,
offline=offline,
)
if info["asn_domain"] and info["asn_domain"] in map_value:
service = map_value[info["asn_domain"]]
info["name"] = service["name"]
info["type"] = service["type"]
elif info["asn_name"]:
# ASN-domain not in the map: surface the raw AS name with no
# classification. Better than leaving the row unattributed.
info["name"] = info["asn_name"]
if cache is not None:
cache[ip_address] = info
logger.debug(f"IP address {ip_address} added to cache")
return info


@@ -223,6 +223,67 @@ class Test(unittest.TestCase):
parsedmarc.parsed_smtp_tls_reports_to_csv(result["report"])
print("Passed!")
def testIpAddressInfoSurfacesASNFields(self):
"""ASN number, name, and domain from the bundled MMDB appear on every
IP info result, even when no PTR resolves."""
info = parsedmarc.utils.get_ip_address_info("8.8.8.8", offline=True)
self.assertEqual(info["asn"], 15169)
self.assertIsInstance(info["asn"], int)
self.assertEqual(info["asn_domain"], "google.com")
self.assertTrue(info["asn_name"])
def testIpAddressInfoFallsBackToASNMapEntryWhenNoPTR(self):
"""When reverse DNS is absent, the ASN domain should be used as a
lookup into the reverse_dns_map so the row still gets attributed,
while reverse_dns and base_domain remain null."""
info = parsedmarc.utils.get_ip_address_info("8.8.8.8", offline=True)
self.assertIsNone(info["reverse_dns"])
self.assertIsNone(info["base_domain"])
self.assertEqual(info["name"], "Google (Including Gmail and Google Workspace)")
self.assertEqual(info["type"], "Email Provider")
def testIpAddressInfoFallsBackToRawASNameOnMapMiss(self):
"""When neither PTR nor an ASN-map entry resolves, the raw AS name
is used as source_name with type left null — better than leaving
the row unattributed."""
# The DB record is patched so the asn_name fallback branch is
# exercised deterministically, without depending on the current
# contents of the bundled map or MMDB.
from unittest.mock import patch
with patch(
"parsedmarc.utils.get_ip_address_db_record",
return_value={
"country": "US",
"asn": 64496,
"asn_name": "Some Unmapped Org, Inc.",
"asn_domain": "unmapped-for-this-test.example",
},
):
# Bypass cache to avoid prior-test pollution.
info = parsedmarc.utils.get_ip_address_info(
"192.0.2.1", offline=True, cache=None
)
self.assertIsNone(info["reverse_dns"])
self.assertIsNone(info["base_domain"])
self.assertIsNone(info["type"])
self.assertEqual(info["name"], "Some Unmapped Org, Inc.")
self.assertEqual(info["asn_domain"], "unmapped-for-this-test.example")
def testAggregateCsvExposesASNColumns(self):
"""The aggregate CSV output should include source_asn, source_asn_name,
and source_asn_domain columns."""
result = parsedmarc.parse_report_file(
"samples/aggregate/!example.com!1538204542!1538463818.xml",
always_use_local_files=True,
offline=True,
)
csv_text = parsedmarc.parsed_aggregate_reports_to_csv(result["report"])
header = csv_text.splitlines()[0].split(",")
self.assertIn("source_asn", header)
self.assertIn("source_asn_name", header)
self.assertIn("source_asn_domain", header)
def testOpenSearchSigV4RequiresRegion(self):
with self.assertRaises(opensearch_module.OpenSearchError):
opensearch_module.set_hosts(