10.2.0 - Explain why a report is invalid instead of "Not a valid report" (#802)

* Explain why a report is invalid instead of "Not a valid report"

The parser catches broadly so one malformed report can't crash a batch,
but every failure surfaced as the generic ParserError("Not a valid
report"), telling operators nothing about the cause.

parse_report_file() now keeps each format parser's specific error as it
tries aggregate XML -> SMTP TLS JSON -> report email, and when all three
reject the input it content-sniffs the leading byte to surface the single
relevant reason (e.g. "Invalid aggregate report: Missing field:
'org_name'", or "Not a recognized report format (...)"). The CLI already
logs str(error), so this reaches the user with no cli.py change.

Every parser catch site also re-raises with `raise ... from <original>`,
preserving the underlying ExpatError / JSONDecodeError / KeyError /
archive errors on __cause__ for library callers and tracebacks. The same
exception *types* are still raised.

Finally, the catch-all "unexpected error" branches append
`(raised at <file>:<line>)` from the deepest traceback frame, but only
when the parsedmarc logger is at DEBUG level (e.g. the CLI's --debug);
normal-level output is unchanged.

Bumps the in-progress version to 10.2.0 and documents all three in the
CHANGELOG.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* Cover the failure-report path in parse_report_file error tests

The reason-surfacing tests covered the aggregate and SMTP TLS branches but
not failure reports, which reach parse_report_file only via the email
path. Add a malformed multipart/report failure email (missing the required
Source-IP) and assert the message names the failure format and the missing
field rather than collapsing to "Not a valid report".

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* Cover the new error-reporting lines; drop one unreachable catch

Bring the lines added by this PR to full test coverage:

- Test _exc_origin() with a never-raised exception (no __traceback__) so the
  "no frames" guard is exercised.
- Test _parse_smtp_tls_failure_details() with a non-dict, which raises
  TypeError (not KeyError) and exercises the generic catch-all.
- Test parse_report_email() with an unparseable Date header, which trips the
  initial mail-parse catch-all and becomes a ParserError.

Two dead lines are removed rather than hidden, per the project's "delete
unreachable branches, no # pragma: no cover" rule:

- _looks_like_email() looped with a `continue` for blank lines, but every
  caller passes lstrip()-ed text, so the first line is never blank. Simplified
  to inspect the first line directly.
- parse_report_email()'s `except Exception` after `except InvalidFailureReport`
  was unreachable: parse_failure_report wraps its entire body and provably
  raises only InvalidFailureReport, which the preceding handler already catches.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* Fix IndexError when backfilling envelope_from from SPF results

_parse_report_record() backfills a missing/empty envelope_from from the
last SPF auth result's domain. The "envelope_from is None" branch gated on
the raw auth_results["spf"] list but indexed the filtered
new_record["auth_results"]["spf"] list, which only holds results that have
a domain. A reporter sending an SPF result with no domain made the filtered
list empty while the raw list was non-empty, so [-1] raised IndexError and
the whole record failed to parse.

The two near-identical envelope_from backfill branches (missing identifier
vs. empty identifier) drifted apart -- only one was updated when the
filtered new_record list was introduced -- which is what let them disagree
on which list to read. Merge them into a single path, keyed on
dict.get("envelope_from") is None, that gates and indexes the same raw list
with the "domain" membership guard the missing-identifier branch already
used.

Regression test: envelope_from=None with an SPF result carrying no domain
now parses to envelope_from=None instead of raising. This is the bug that
motivated the surrounding error-reporting work in this PR.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* Trim comment

* Cover the touched error branches; drop a dead UnicodeDecodeError catch

Codecov flagged the pre-existing error branches this PR touched (adding
`from e` / `_exc_origin`) as changed-and-uncovered. Most are real
malformed-report paths, so add honest tests that drive them with realistic
inputs:

- parse_smtp_tls_report_json: nested missing key (date-range without
  start-datetime) -> InvalidSMTPTLSReport chaining a KeyError.
- parse_aggregate_report_xml: non-structured report_metadata -> the
  AttributeError branch ("Report missing required section").
- parse_report_email: valid legacy text/plain failure report (success path),
  a text report missing its fields, a base64 attachment of malformed
  aggregate XML, and one of invalid SMTP TLS JSON.
- parse_report_file: gzipped junk -> the str branch of the content sniff.

The `except UnicodeDecodeError` in extract_report is removed as dead code
(no `# pragma: no cover`, per the repo rule): str-mode streams are already
rejected by explicit isinstance checks, and every decode() uses
errors="ignore", so it can never fire. str-mode still raises ParserError.

Also rename the two new failure-report tests from "Forensic" to "Failure"
and add an AGENTS.md rule: RUF reports are "failure reports"; "forensic" is
reserved for the literal backward-compat alias identifiers only.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* Fix pyright errors in the new error-branch tests

CI runs pyright over the whole repo (tests included); these slipped through
because the local check only covered parsedmarc/__init__.py:

- _parse_smtp_tls_failure_details("not a dict") is a deliberate wrong-type
  test -> targeted `# pyright: ignore[reportArgumentType]`.
- result["report"]["source"]["ip_address"] on a ParsedReport TypedDict ->
  cast(FailureReport, result["report"]) first, matching existing tests.

This is what failed lint-docs-build (and, since `test` needs it, skipped the
Codecov upload) on the prior commits.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
This commit is contained in:
Sean Whalen
2026-06-25 14:53:56 -04:00
committed by GitHub
parent bc17e6afb2
commit a67e8d3ebc
5 changed files with 423 additions and 45 deletions
+2
View File
@@ -60,6 +60,8 @@ To skip DNS lookups during testing, set `GITHUB_ACTIONS=true`.
`ReportType = Literal["aggregate", "failure", "smtp_tls"]`. Exception hierarchy: `ParserError``InvalidDMARCReport``InvalidAggregateReport`/`InvalidFailureReport`, and `InvalidSMTPTLSReport`. Legacy alias `InvalidForensicReport = InvalidFailureReport` preserved.
**Terminology: say "failure report", never "forensic report".** RUF reports are **failure reports** (RFC 9991 terminology) everywhere in new code, test names, docstrings, comments, CHANGELOG entries, and prose. "Forensic" is the legacy term, kept *only* as backward-compatible API aliases (`parse_forensic_report`, `InvalidForensicReport`, `parsed_forensic_reports_to_csv`, `ForensicReport`, the `Forensic` archive-folder name). Use "forensic" only when naming one of those literal pre-existing identifiers — never as a generic description of the report type.
### RFC 9989 / RFC 9990 / RFC 9991 support
Aggregate reports parse under both RFC 7489 and RFC 9990 in one code path. RFC 9990 adds these fields, all surfaced through `AggregatePolicyPublished` / `AggregateReportMetadata` / `AggregateAuthResult*`:
+12
View File
@@ -1,5 +1,17 @@
# Changelog
## 10.2.0
### Changes
- **Invalid reports now explain *why* they are invalid.** The parser still catches broadly so one malformed report can never crash a batch, but the diagnostics are no longer a bare `Not a valid report`. `parse_report_file()` keeps each format parser's specific error as it tries aggregate XML → SMTP TLS JSON → report email, and when all three reject the input it sniffs the content shape to surface the single relevant reason — e.g. `Invalid aggregate report: Missing field: 'org_name'`, `Invalid SMTP TLS report: Missing required field: date-range`, or `Not a recognized report format (not a DMARC aggregate XML report, an SMTP TLS JSON report, or a DMARC report email)`. This is the message the CLI already logs (`Failed to parse <file> - <reason>`), so operators see the cause without a code change on their end.
- **Original exceptions are now chained.** Every parser catch site re-raises with `raise … from <original>`, so the underlying `ExpatError`, `json.JSONDecodeError`, `KeyError`, and archive/decompression errors are preserved on `__cause__` for library callers and tracebacks instead of being flattened into a string. Behavior is otherwise unchanged — the same exception *types* are still raised.
- **Unexpected errors cite their origin in debug mode.** When a catch-all branch turns an unforeseen exception into a generic `Unexpected error: …` / archive / mail-parse failure, the message now appends `(raised at <file>:<line>)` pinpointing the deepest traceback frame — but only when the `parsedmarc` logger is at `DEBUG` level (e.g. the CLI's `--debug`). Normal-level output is unchanged.
### Bug fixes
- **Fixed an `IndexError` when backfilling `envelope_from` from SPF results.** In `_parse_report_record()`, when an aggregate report's `<identifiers>` carried an empty `envelope_from`, the fallback checked the raw `auth_results["spf"]` list for length but then indexed the *filtered* `new_record["auth_results"]["spf"]` list (which only contains results that have a `domain`). A reporter that sent an SPF auth result with no `domain` made the filtered list empty while the raw list was non-empty, so the `[-1]` index raised `IndexError` and the whole record failed to parse. The two near-identical `envelope_from` backfill branches (one for a missing identifier, one for an empty one) are now a single code path that gates and indexes the same list, matching the already-correct missing-identifier branch.
## 10.1.1
### Changes
+127 -43
View File
@@ -8,11 +8,13 @@ import binascii
import email
import email.utils
import json
import logging
import mailbox
import os
import re
import shutil
import tempfile
import traceback
import xml.parsers.expat as expat
import zipfile
import zlib
@@ -144,6 +146,26 @@ class InvalidFailureReport(InvalidDMARCReport):
InvalidForensicReport = InvalidFailureReport
def _exc_origin(error: BaseException) -> str:
"""Returns a ``" (raised at <file>:<line>)"`` suffix pointing at where an
unexpected exception actually originated, but only when the parsedmarc
logger is at ``DEBUG`` level. Returns ``""`` otherwise, so normal output is
unchanged.
The deepest traceback frame is the line that raised, which is the useful
breadcrumb when a catch-all ``except Exception`` turns an unforeseen error
into a generic ``Invalid*Report`` -- without it, the message says *what*
failed but not *where*.
"""
if not logger.isEnabledFor(logging.DEBUG):
return ""
frames = traceback.extract_tb(error.__traceback__)
if not frames:
return ""
last = frames[-1]
return " (raised at {0}:{1})".format(last.filename, last.lineno)
def _text(value: Any) -> Optional[str]:
"""Unwrap a possibly-langAttrString value parsed by xmltodict.
@@ -510,7 +532,9 @@ def _parse_report_record(
new_result["human_result"] = _text(result.get("human_result"))
new_record["auth_results"]["spf"].append(new_result)
if "envelope_from" not in new_record["identifiers"]:
# Backfill envelope_from from the last SPF result's domain when the
# reporter omitted the identifier or sent it empty.
if new_record["identifiers"].get("envelope_from") is None:
envelope_from = None
if len(auth_results["spf"]) > 0:
spf_result = auth_results["spf"][-1]
@@ -520,13 +544,6 @@ def _parse_report_record(
envelope_from = str(envelope_from).lower()
new_record["identifiers"]["envelope_from"] = envelope_from
elif new_record["identifiers"]["envelope_from"] is None:
if len(auth_results["spf"]) > 0:
envelope_from = new_record["auth_results"]["spf"][-1]["domain"]
if envelope_from is not None:
envelope_from = str(envelope_from).lower()
new_record["identifiers"]["envelope_from"] = envelope_from
envelope_to = None
if "envelope_to" in new_record["identifiers"]:
envelope_to = new_record["identifiers"]["envelope_to"]
@@ -568,9 +585,11 @@ def _parse_smtp_tls_failure_details(failure_details: dict[str, Any]):
return new_failure_details
except KeyError as e:
raise InvalidSMTPTLSReport(f"Missing required failure details field: {e}")
raise InvalidSMTPTLSReport(
f"Missing required failure details field: {e}"
) from e
except Exception as e:
raise InvalidSMTPTLSReport(str(e))
raise InvalidSMTPTLSReport(str(e) + _exc_origin(e)) from e
def _parse_smtp_tls_report_policy(policy: dict[str, Any]):
@@ -608,9 +627,9 @@ def _parse_smtp_tls_report_policy(policy: dict[str, Any]):
return new_policy
except KeyError as e:
raise InvalidSMTPTLSReport(f"Missing required policy field: {e}")
raise InvalidSMTPTLSReport(f"Missing required policy field: {e}") from e
except Exception as e:
raise InvalidSMTPTLSReport(str(e))
raise InvalidSMTPTLSReport(str(e) + _exc_origin(e)) from e
def parse_smtp_tls_report_json(report: Union[str, bytes]) -> SMTPTLSReport:
@@ -650,9 +669,9 @@ def parse_smtp_tls_report_json(report: Union[str, bytes]) -> SMTPTLSReport:
return new_report
except KeyError as e:
raise InvalidSMTPTLSReport(f"Missing required field: {e}")
raise InvalidSMTPTLSReport(f"Missing required field: {e}") from e
except Exception as e:
raise InvalidSMTPTLSReport(str(e))
raise InvalidSMTPTLSReport(str(e) + _exc_origin(e)) from e
def parsed_smtp_tls_reports_to_csv_rows(
@@ -1026,15 +1045,21 @@ def parse_aggregate_report_xml(
return cast(AggregateReport, new_report)
except expat.ExpatError as error:
raise InvalidAggregateReport("Invalid XML: {0}".format(error.__str__()))
raise InvalidAggregateReport(
"Invalid XML: {0}".format(error.__str__())
) from error
except KeyError as error:
raise InvalidAggregateReport("Missing field: {0}".format(error.__str__()))
except AttributeError:
raise InvalidAggregateReport("Report missing required section")
raise InvalidAggregateReport(
"Missing field: {0}".format(error.__str__())
) from error
except AttributeError as error:
raise InvalidAggregateReport("Report missing required section") from error
except Exception as error:
raise InvalidAggregateReport("Unexpected error: {0}".format(error.__str__()))
raise InvalidAggregateReport(
"Unexpected error: {0}{1}".format(error.__str__(), _exc_origin(error))
) from error
def extract_report(content: Union[bytes, str, BinaryIO]) -> str:
@@ -1112,10 +1137,10 @@ def extract_report(content: Union[bytes, str, BinaryIO]) -> str:
else:
raise ParserError("Not a valid zip, gzip, json, or xml file")
except UnicodeDecodeError:
raise ParserError("File objects must be opened in binary (rb) mode")
except Exception as error:
raise ParserError("Invalid archive file: {0}".format(error.__str__()))
raise ParserError(
"Invalid archive file: {0}{1}".format(error.__str__(), _exc_origin(error))
) from error
finally:
if file_object:
try:
@@ -1176,7 +1201,7 @@ def parse_aggregate_report_file(
try:
xml = extract_report(_input)
except Exception as e:
raise InvalidAggregateReport(e)
raise InvalidAggregateReport(str(e) + _exc_origin(e)) from e
return parse_aggregate_report_xml(
xml,
@@ -1557,10 +1582,14 @@ def parse_failure_report(
return cast(FailureReport, parsed_report)
except KeyError as error:
raise InvalidFailureReport("Missing value: {0}".format(error.__str__()))
raise InvalidFailureReport(
"Missing value: {0}".format(error.__str__())
) from error
except Exception as error:
raise InvalidFailureReport("Unexpected error: {0}".format(error.__str__()))
raise InvalidFailureReport(
"Unexpected error: {0}{1}".format(error.__str__(), _exc_origin(error))
) from error
def parsed_failure_reports_to_csv_rows(
@@ -1728,7 +1757,7 @@ def parse_report_email(
msg = email.message_from_string(input_str)
except Exception as e:
raise ParserError(e.__str__())
raise ParserError(e.__str__() + _exc_origin(e)) from e
subject = None
feedback_report = None
smtp_tls_report = None
@@ -1785,10 +1814,10 @@ def parse_report_email(
fields["received-date"], fields["sender-ip-address"]
)
except Exception as e:
error = 'Unable to parse message with subject "{0}": {1}'.format(
subject, e
error = 'Unable to parse message with subject "{0}": {1}{2}'.format(
subject, e, _exc_origin(e)
)
raise InvalidDMARCReport(error)
raise InvalidDMARCReport(error) from e
sample = parts[1].lstrip()
logger.debug(sample)
@@ -1827,17 +1856,18 @@ def parse_report_email(
except (TypeError, ValueError, binascii.Error):
pass
except InvalidDMARCReport:
error = 'Message with subject "{0}" is not a valid DMARC report'.format(
subject
except InvalidDMARCReport as e:
error = (
'Message with subject "{0}" is not a valid '
"DMARC report: {1}".format(subject, e)
)
raise ParserError(error)
raise ParserError(error) from e
except Exception as e:
error = 'Unable to parse message with subject "{0}": {1}'.format(
subject, e
error = 'Unable to parse message with subject "{0}": {1}{2}'.format(
subject, e, _exc_origin(e)
)
raise ParserError(error)
raise ParserError(error) from e
if feedback_report and sample:
try:
@@ -1861,9 +1891,7 @@ def parse_report_email(
"is not a valid "
"failure DMARC report: {1}".format(subject, e)
)
raise InvalidFailureReport(error)
except Exception as e:
raise InvalidFailureReport(e.__str__())
raise InvalidFailureReport(error) from e
result = {"report_type": "failure", "report": failure_report}
return result
@@ -1875,6 +1903,54 @@ def parse_report_email(
return result
# An RFC 5322 header field name (printable ASCII excluding the colon) followed
# by a colon at the start of a line, or an mbox "From " separator.
_email_header_regex = re.compile(r"^(From |[\x21-\x39\x3b-\x7e]+:)")
def _looks_like_email(text: str) -> bool:
"""Returns True if the first line looks like an email header.
Callers pass already-``lstrip()``-ed text, so the first line is the first
meaningful line.
"""
first_line = text.split("\n", 1)[0]
return _email_header_regex.match(first_line) is not None
def _describe_parse_failure(
content: Union[str, bytes],
aggregate_error: InvalidAggregateReport,
smtp_tls_error: InvalidSMTPTLSReport,
email_error: InvalidDMARCReport,
) -> str:
"""Builds a human-readable reason for a parse_report_file failure.
parse_report_file tries the aggregate XML, SMTP TLS JSON, and report-email
parsers in turn; when all three reject the input, only the parser for the
format the content actually resembles produced a meaningful error. The
other two are noise -- a malformed aggregate report is also "not JSON" and
"not an email". Sniff the leading non-whitespace byte to surface the single
relevant reason.
"""
if isinstance(content, (bytes, bytearray, memoryview)):
sniff = bytes(content)[:512].decode("utf-8", errors="replace")
else:
sniff = content[:512]
sniff = sniff.lstrip()
if sniff.startswith("<"):
return "Invalid aggregate report: {0}".format(aggregate_error)
if sniff.startswith("{"):
return "Invalid SMTP TLS report: {0}".format(smtp_tls_error)
if _looks_like_email(sniff):
return "Invalid report email: {0}".format(email_error)
return (
"Not a recognized report format (not a DMARC aggregate XML report, "
"an SMTP TLS JSON report, or a DMARC report email)"
)
def parse_report_file(
input_: Union[bytes, str, os.PathLike[str], os.PathLike[bytes], BinaryIO],
*,
@@ -1930,6 +2006,10 @@ def parse_report_file(
results: Optional[ParsedReport] = None
# parse_report_file tries the three report formats in turn. When all three
# reject the input, keep each format's specific error so the final message
# can explain *why* the file is invalid instead of a bare "Not a valid
# report" (see _describe_parse_failure).
try:
report = parse_aggregate_report_file(
content,
@@ -1945,11 +2025,11 @@ def parse_report_file(
normalize_timespan_threshold_hours=normalize_timespan_threshold_hours,
)
results = {"report_type": "aggregate", "report": report}
except InvalidAggregateReport:
except InvalidAggregateReport as aggregate_error:
try:
report = parse_smtp_tls_report_json(content)
results = {"report_type": "smtp_tls", "report": report}
except InvalidSMTPTLSReport:
except InvalidSMTPTLSReport as smtp_tls_error:
try:
results = parse_report_email(
content,
@@ -1965,8 +2045,12 @@ def parse_report_file(
keep_alive=keep_alive,
normalize_timespan_threshold_hours=normalize_timespan_threshold_hours,
)
except InvalidDMARCReport:
raise ParserError("Not a valid report")
except InvalidDMARCReport as email_error:
raise ParserError(
_describe_parse_failure(
content, aggregate_error, smtp_tls_error, email_error
)
) from email_error
if results is None:
raise ParserError("Not a valid report")
+1 -1
View File
@@ -1,4 +1,4 @@
__version__ = "10.1.1"
__version__ = "10.2.0"
USER_AGENT = f"parsedmarc/{__version__}"
+281 -1
View File
@@ -5,7 +5,10 @@ parse_aggregate_report_xml, parse_failure_report, parse_smtp_tls_report_json,
extract_report, get_dmarc_reports_from_mbox, and the CSV / JSON renderers.
"""
import base64
import gzip
import json
import logging
import mailbox
import os
import unittest
@@ -935,6 +938,33 @@ class Test(unittest.TestCase):
result = parsedmarc._parse_report_record(record, offline=True)
self.assertEqual(result["identifiers"]["envelope_from"], "spf.example.com")
def testParseReportRecordEnvelopeFromNullNoSpfDomain(self):
"""envelope_from=None with SPF results that carry no domain must not
raise IndexError (regression: the branch gated on the raw SPF list but
indexed the filtered list, which is empty when no result has a domain)"""
record = {
"row": {
"source_ip": "192.0.2.1",
"count": "1",
"policy_evaluated": {
"disposition": "none",
"dkim": "pass",
"spf": "pass",
},
},
"identifiers": {
"header_from": "example.com",
"envelope_from": None,
},
# A raw SPF result with no "domain" -> filtered list is empty.
"auth_results": {
"dkim": [],
"spf": [{"scope": "mfrom", "result": "pass"}],
},
}
result = parsedmarc._parse_report_record(record, offline=True)
self.assertIsNone(result["identifiers"]["envelope_from"])
def testParseReportRecordEnvelopeTo(self):
"""envelope_to is preserved and moved correctly"""
record = {
@@ -1013,6 +1043,14 @@ class Test(unittest.TestCase):
with self.assertRaises(parsedmarc.InvalidSMTPTLSReport):
parsedmarc._parse_smtp_tls_failure_details({"result-type": "err"})
def testParseSmtpTlsFailureDetailsNonDict(self):
"""A non-dict failure-details value hits the catch-all (TypeError,
not KeyError) and is wrapped as InvalidSMTPTLSReport"""
with self.assertRaises(parsedmarc.InvalidSMTPTLSReport) as ctx:
# Deliberate wrong type to exercise the non-KeyError catch-all.
parsedmarc._parse_smtp_tls_failure_details("not a dict") # pyright: ignore[reportArgumentType]
self.assertIsInstance(ctx.exception.__cause__, TypeError)
def testParseSmtpTlsReportPolicyValid(self):
"""Valid STS policy parses correctly"""
policy = {
@@ -1822,8 +1860,25 @@ class TestMalformedXmlRecovery(unittest.TestCase):
<auth_results><spf><domain>example.com</domain><result>pass</result></spf></auth_results>
</record>
</feedback>"""
with self.assertRaises(parsedmarc.InvalidAggregateReport):
with self.assertRaises(parsedmarc.InvalidAggregateReport) as ctx:
parsedmarc.parse_aggregate_report_xml(xml, offline=True)
# The missing-field error chains the underlying KeyError so a library
# caller can inspect which field was absent.
self.assertIsInstance(ctx.exception.__cause__, KeyError)
def testReportMetadataNotAStructureRaises(self):
"""A non-structured report_metadata trips the AttributeError branch"""
# report_metadata is plain text rather than nested elements, so the
# parser's attribute access on it fails with AttributeError.
xml = (
"<feedback><report_metadata>x</report_metadata>"
"<policy_published><domain>x.com</domain></policy_published>"
"<record></record></feedback>"
)
with self.assertRaises(parsedmarc.InvalidAggregateReport) as ctx:
parsedmarc.parse_aggregate_report_xml(xml, offline=True)
self.assertIn("missing required section", str(ctx.exception))
self.assertIsInstance(ctx.exception.__cause__, AttributeError)
class TestPolicyPublishedEdgeCases(unittest.TestCase):
@@ -1995,6 +2050,134 @@ class TestParseReportFile(unittest.TestCase):
with self.assertRaises(parsedmarc.ParserError):
parsedmarc.parse_report_file(b"this is not a report", offline=True)
def testParseReportFileInvalidAggregateReason(self):
"""Malformed aggregate XML explains the aggregate-specific reason"""
xml = (
b'<?xml version="1.0"?>\n<feedback>\n'
b"<report_metadata><email>dmarc@example.com</email>"
b"<report_id>no-org</report_id></report_metadata>\n"
b"<policy_published><domain>example.com</domain><p>none</p>"
b"</policy_published>\n</feedback>"
)
with self.assertRaises(parsedmarc.ParserError) as ctx:
parsedmarc.parse_report_file(xml, offline=True)
message = str(ctx.exception)
# The reason must name the aggregate format and the missing field,
# not collapse to a bare "Not a valid report".
self.assertIn("aggregate", message.lower())
self.assertIn("org_name", message)
self.assertNotIn("Not a valid report", message)
def testParseReportFileInvalidSmtpTlsReason(self):
"""Malformed SMTP TLS JSON explains the SMTP-TLS-specific reason"""
with self.assertRaises(parsedmarc.ParserError) as ctx:
parsedmarc.parse_report_file(b'{"organization-name": "x"}', offline=True)
message = str(ctx.exception)
self.assertIn("SMTP TLS", message)
self.assertIn("date-range", message)
def testParseReportFileInvalidFailureReason(self):
"""A malformed failure report (email path) explains the reason"""
# A real DMARC failure report arrives as a multipart/report email;
# parse_report_file reaches it only via the email branch. Omit the
# required Source-IP so parse_failure_report rejects it.
eml = (
b"From: dmarc-noreply@example.com\n"
b"Subject: DMARC Failure Report\n"
b"MIME-Version: 1.0\n"
b"Content-Type: multipart/report; "
b'report-type=feedback-report; boundary="b"\n\n'
b"--b\n"
b"Content-Type: text/plain\n\n"
b"This is a DMARC failure report.\n"
b"--b\n"
b"Content-Type: message/feedback-report\n\n"
b"Feedback-Type: auth-failure\n"
b"Version: 1\n"
b"--b\n"
b"Content-Type: message/rfc822\n\n"
b"From: spoof@victim.example\n"
b"Subject: hi\n"
b"--b--\n"
)
with self.assertRaises(parsedmarc.ParserError) as ctx:
parsedmarc.parse_report_file(eml, offline=True)
message = str(ctx.exception)
# The reason must name the failure format and the missing field,
# not collapse to a bare "Not a valid report".
self.assertIn("failure", message.lower())
self.assertIn("source_ip", message)
self.assertNotIn("Not a valid report", message)
def testParseReportFileUnrecognizedGzip(self):
"""Gzipped junk decompresses to a str that matches no format.
Exercises the str branch of the content sniff (the decompressed
payload is a str, not bytes).
"""
blob = gzip.compress(b"plain junk not a report")
with self.assertRaises(parsedmarc.ParserError) as ctx:
parsedmarc.parse_report_file(blob, offline=True)
self.assertIn("recognized report format", str(ctx.exception))
def testParseReportFileUnrecognizedFormat(self):
"""Content matching no known format says so explicitly"""
with self.assertRaises(parsedmarc.ParserError) as ctx:
parsedmarc.parse_report_file(b"this is not a report", offline=True)
self.assertIn("recognized report format", str(ctx.exception))
def testParseReportFilePreservesCause(self):
"""The raised ParserError chains the underlying parse failure"""
with self.assertRaises(parsedmarc.ParserError) as ctx:
parsedmarc.parse_report_file(b"<feedback></feedback>", offline=True)
# raise ... from email_error must populate __cause__ for callers and
# tracebacks, even though the cross-format message is content-sniffed.
self.assertIsNotNone(ctx.exception.__cause__)
def _parse_unexpected_error(self):
# <feedback></feedback> parses as XML but trips a NoneType subscript
# deep in the aggregate parser, hitting the catch-all "Unexpected
# error" branch (not a narrow KeyError/ExpatError).
with self.assertRaises(parsedmarc.ParserError) as ctx:
parsedmarc.parse_report_file(b"<feedback></feedback>", offline=True)
return str(ctx.exception)
def testUnexpectedErrorOmitsOriginWhenNotDebug(self):
"""Catch-all errors stay clean when the logger is above DEBUG"""
logger = logging.getLogger("parsedmarc.log")
previous = logger.level
logger.setLevel(logging.WARNING)
try:
message = self._parse_unexpected_error()
finally:
logger.setLevel(previous)
self.assertIn("Unexpected error", message)
self.assertNotIn("raised at", message)
def testUnexpectedErrorCitesOriginInDebug(self):
"""Catch-all errors cite the source file:line when the logger is DEBUG"""
logger = logging.getLogger("parsedmarc.log")
previous = logger.level
logger.setLevel(logging.DEBUG)
try:
message = self._parse_unexpected_error()
finally:
logger.setLevel(previous)
self.assertIn("raised at", message)
self.assertIn("__init__.py:", message)
def testExcOriginEmptyWhenNoTraceback(self):
"""_exc_origin returns '' for an exception with no traceback"""
logger = logging.getLogger("parsedmarc.log")
previous = logger.level
logger.setLevel(logging.DEBUG)
try:
# A never-raised exception has __traceback__ is None, so there is
# no origin frame to cite even though debug logging is on.
self.assertEqual(parsedmarc._exc_origin(ValueError("x")), "")
finally:
logger.setLevel(previous)
class TestParseReportEmail(unittest.TestCase):
"""Tests for parse_report_email edge cases"""
@@ -2017,6 +2200,79 @@ This is not a DMARC report."""
with self.assertRaises(parsedmarc.InvalidDMARCReport):
parsedmarc.parse_report_email(email_str, offline=True)
def testUnparseableDateRaisesParserError(self):
"""An unparseable Date header trips the initial mail-parse catch-all"""
# human_timestamp_to_datetime() raises on a junk Date, which the
# catch-all around the initial parse turns into a ParserError.
email_str = "From: a@b.c\nDate: not-a-real-date\nSubject: x\n\nbody"
with self.assertRaises(parsedmarc.ParserError) as ctx:
parsedmarc.parse_report_email(email_str, offline=True)
self.assertIn("not-a-real-date", str(ctx.exception))
def testFailureTextReportParses(self):
"""A valid legacy text/plain failure report parses to a failure
report (the success path that builds the synthetic feedback report
and extracts the message sample)"""
# The field-name regex matches letters and spaces only, so the legacy
# fields are space-separated ("Received Date", "Sender IP Address").
eml = (
"From: report@example.com\nSubject: Failure Report\n"
"Content-Type: text/plain\n\n"
"A message claiming to be from you has failed authentication.\n"
"Received Date: Mon, 01 Jan 2024 00:00:00 +0000\n"
"Sender IP Address: 192.0.2.1\n"
"detected.\n"
"From: spoof@example.com\nTo: victim@example.com\nSubject: spam\n"
)
result = parsedmarc.parse_report_email(eml, offline=True)
self.assertEqual(result["report_type"], "failure")
report = cast(FailureReport, result["report"])
self.assertEqual(report["source"]["ip_address"], "192.0.2.1")
def testFailureTextMissingFieldsRaises(self):
"""A text/plain failure report missing its fields is rejected with
the subject named (not silently dropped)"""
# Has the trigger phrase and "detected." but none of the
# Received-Date / Sender-IP-Address fields, so building the synthetic
# feedback report raises KeyError, surfaced as InvalidDMARCReport.
eml = (
"From: a@b.c\nSubject: Failure\nContent-Type: text/plain\n\n"
"A message claiming to be from you has failed. "
"No fields here detected. nothing\n"
)
with self.assertRaises(parsedmarc.InvalidDMARCReport) as ctx:
parsedmarc.parse_report_email(eml, offline=True)
self.assertIn("Failure", str(ctx.exception))
def testAttachmentMalformedXmlRaises(self):
"""A base64 attachment of malformed aggregate XML is rejected"""
att = base64.b64encode(b"<feedback></feedback>").decode()
eml = (
"From: a@b.c\nSubject: Agg\nMIME-Version: 1.0\n"
"Content-Type: application/octet-stream\n"
"Content-Transfer-Encoding: base64\n\n" + att + "\n"
)
with self.assertRaises(parsedmarc.ParserError) as ctx:
parsedmarc.parse_report_email(eml, offline=True)
self.assertIn("not a valid DMARC report", str(ctx.exception))
def testAttachmentInvalidJsonRaises(self):
"""A base64 attachment of invalid SMTP TLS JSON is rejected.
parse_smtp_tls_report_json raises InvalidSMTPTLSReport, a sibling of
InvalidDMARCReport, so it falls through to the generic catch-all and
becomes a ParserError naming the subject.
"""
att = base64.b64encode(b"{not valid json").decode()
eml = (
"From: a@b.c\nSubject: Tls\nMIME-Version: 1.0\n"
"Content-Type: application/octet-stream\n"
"Content-Transfer-Encoding: base64\n\n" + att + "\n"
)
with self.assertRaises(parsedmarc.ParserError) as ctx:
parsedmarc.parse_report_email(eml, offline=True)
self.assertIn("Tls", str(ctx.exception))
class TestFailureReportParsing(unittest.TestCase):
"""Tests for failure report field defaults and edge cases"""
@@ -2232,6 +2488,30 @@ class TestSmtpTlsReportErrors(unittest.TestCase):
with self.assertRaises(parsedmarc.InvalidSMTPTLSReport):
parsedmarc.parse_smtp_tls_report_json("not json {{{")
def testInvalidJsonPreservesCause(self):
"""Invalid JSON chains the underlying JSONDecodeError"""
with self.assertRaises(parsedmarc.InvalidSMTPTLSReport) as ctx:
parsedmarc.parse_smtp_tls_report_json("not json {{{")
self.assertIsInstance(ctx.exception.__cause__, json.JSONDecodeError)
def testNestedMissingKeyNamesTheField(self):
"""A KeyError on a nested field reports which field was missing"""
# All five required top-level fields are present, so this gets past the
# top-level check and raises KeyError on date-range["start-datetime"].
report = json.dumps(
{
"organization-name": "x",
"date-range": {},
"contact-info": "x",
"report-id": "x",
"policies": [],
}
)
with self.assertRaises(parsedmarc.InvalidSMTPTLSReport) as ctx:
parsedmarc.parse_smtp_tls_report_json(report)
self.assertIn("start-datetime", str(ctx.exception))
self.assertIsInstance(ctx.exception.__cause__, KeyError)
class TestBucketIntervalEdgeCases(unittest.TestCase):
"""Tests for _bucket_interval_by_day edge cases"""