8 Commits

Author SHA1 Message Date
Sean Whalen aabcfb4298 Store numbers as number_value; fix conditional guards to != ""
Two corrections confirmed against Google's official content-hub parsers
(content/parsers/third_party/community/*/cbn):

1. Numbers as numbers. count, source_asn, successful_session_count and
   failed_session_count were being stored in additional.fields as string_value.
   Store them as number_value instead (build string -> convert to uinteger ->
   rename to number_value, the content-hub idiom), so SecOps can range-query and
   sort them, per parsedmarc's "store numbers as numbers" rule. Booleans stay
   string_value (content-hub never uses bool_value) and are still converted in
   step 1b for the == "true"/"false" comparisons.

2. Conditional guards. Replaced bare `if [field] {` with `if [field] != "" {`
   (76 guards + the detection cascade + policy_override). After 1a initializes
   every tested field to "", a bare `if` is true for an empty field (Logstash/CBN
   semantics), which would misfire detection and emit empty labels. content-hub
   uses `!= ""` about 111 times versus 2 bare guards (both on flags); parser
   flags (no_json_payload, not_json, *_nan) correctly stay bare.

Verified: braces balance, no stray bare field-guards, all if-tested fields
initialized, all four numeric fields emit number_value.
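
Illustrative sketch of the two corrections above, in plain Python rather than
CBN syntax. The guard functions are a hypothetical model of the semantics this
message describes, and the event values are made up:

```python
# Hypothetical model of the behaviors described above; not parser code.

def bare_guard(event, field):
    # Assumed model of a bare `if [field]`: true whenever the field exists,
    # even when step 1a initialized it to "".
    return field in event

def nonempty_guard(event, field):
    # Model of `if [field] != ""`: true only for a non-empty value.
    return event.get(field, "") != ""

# After initialization, an aggregate-only field is present but empty on a
# failure report.
event = {"xml_schema": "", "reported_domain": "example.com"}

print(bare_guard(event, "xml_schema"))      # True: detection would misfire
print(nonempty_guard(event, "xml_schema"))  # False: the guard is skipped

# Numbers stored as strings sort lexicographically; stored as numbers they
# sort by magnitude.
counts = [9, 10, 199]
print(sorted(str(c) for c in counts))  # ['10', '199', '9']
print(sorted(counts))                  # [9, 10, 199]
```

The string ordering is why range queries and sorts on count only behave once
the value is stored as a number.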

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-04 12:08:10 -04:00
Sean Whalen 05c9177d73 Cite the official Chronicle content-hub parser repo
Add github.com/chronicle/content-hub (Google's official third-party SecOps
parser repo) to the README references and re-anchor the in-code citations to
it. Its current CBN parsers (e.g. CLOUDFLARE_PAGESHIELD, Copyright 2025 Google
SecOps) confirm both fixes this parser makes: initialize every field before the
json{} filter, and convert JSON booleans/numbers to strings before comparison.
Replaces the dated "How to parse JSON data" citation with the authoritative,
actively-maintained source.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-04 11:27:38 -04:00
Sean Whalen 88034c7192 Define CBN up front for new SecOps users
Add a short, skippable callout explaining what a parser / configuration-based
normalizer (CBN) is and how it fits the SecOps ingest flow (log type → parser →
UDM event), so the README serves newcomers without slowing experienced users.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-04 11:13:03 -04:00
Sean Whalen 1c234de9ff Expand README references with the sources used
Add the remaining official Google docs the parser is built on (parser tips
& troubleshooting, manage parsers, UDM search, Bindplane install) and a
clearly-separated "Additional sources and tooling" section for the community
resources that drove the JSON type-handling and field-init fixes
(thatsiemguy's Parsing 101, the Corelight production parser, chronicle/cbn-tool).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-04 11:09:58 -04:00
Sean Whalen 46e694502d Detect aggregate reports by "xml_schema" instead of "domain"
xml_schema is aggregate-only (failure/SMTP TLS rows don't carry it) and a
distinctive, non-generic field name, addressing the concern that "domain"
could be confused with other logs. parsedmarc defaults xml_schema to "draft"
when the report omits <version> (parsedmarc/__init__.py:832), so it survives a
missing version element -- unlike a field with no default.

It is also a native JSON string straight out of the json{} filter, so unlike
dmarc_aligned it needs no convert step to be testable, keeping detection
independent of the type-conversion in step 1b. xml_schema is added to the
pre-json init block (required for any if-tested field); domain stays
initialized since it is still mapped to target.hostname.
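
A rough Python sketch of detection by distinctive field, assuming a simple
cascade. The function name and the check order are hypothetical; the failure
and SMTP TLS marker fields (reported_domain, policy_domain) are the ones named
in the earlier "domain" commit, not necessarily what the parser tests:

```python
def detect_report_type(event):
    # Hypothetical cascade: every tested field is assumed to be initialized
    # to "" beforehand, so each check is a non-empty test.
    if event.get("xml_schema", "") != "":
        return "aggregate"
    if event.get("reported_domain", "") != "":
        return "failure"
    if event.get("policy_domain", "") != "":
        return "smtp_tls"
    return "unknown"

# xml_schema falls back to "draft" when the report omits <version>, so an
# aggregate row is still detected.
print(detect_report_type({"xml_schema": "draft", "domain": "example.com"}))  # aggregate
print(detect_report_type({"xml_schema": "", "policy_domain": "example.com"}))  # smtp_tls
```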

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-04 10:36:49 -04:00
Sean Whalen 2d9a2a2a8f Fix JSON type handling and pre-json field init in SecOps parser
Two CBN behaviors, confirmed against Google's own "How to parse JSON data"
guide (statedump shows JSON true/199 retaining boolean/integer type) and the
published Corelight production parser:

1. The json{} filter preserves the original JSON type, so parsedmarc's boolean
   *_aligned / testing / normalized_timespan and numeric count / *_session_count
   / source_asn would never match string comparisons. Add a mutate{convert} step
   turning them into strings before any == "true"/"false" test or %{...} use.

2. CBN raises _failed_parsing_ when an `if [field]` references a field absent
   from the log, and most detection/mapping fields are absent in 2 of the 3
   report shapes (or null within one). Initialize every conditionally-checked
   field to "" before the json{} filter.

Without these, DMARC-fail records would not be categorized as AUTH_VIOLATION
and aggregate/TLS reports could fail parsing outright. README caveat and PR
validation steps updated accordingly.
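
The same two behaviors can be reproduced in plain Python as a loose analogy.
to_cbn_string is a hypothetical stand-in for the mutate{convert} step, and the
sample record is made up:

```python
import json

def to_cbn_string(value):
    # Hypothetical helper: render booleans as lowercase "true"/"false" and
    # everything else via str(), mimicking a convert-to-string step.
    if isinstance(value, bool):
        return "true" if value else "false"
    return str(value)

raw = '{"dmarc_aligned": false, "count": 199}'
event = json.loads(raw)

# Types survive JSON parsing, so a string comparison never matches.
print(event["dmarc_aligned"] == "false")  # False
print(event["count"] == "199")            # False

# After conversion the comparisons behave as intended.
converted = {k: to_cbn_string(v) for k, v in event.items()}
print(converted["dmarc_aligned"] == "false")  # True
print(converted["count"] == "199")            # True

# Initializing tested fields first means a lookup never hits a missing key;
# parsed values overwrite the "" defaults.
defaults = {"xml_schema": "", "reported_domain": "", "policy_domain": ""}
initialized = {**defaults, **converted}
print(initialized["xml_schema"] == "")  # True
```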

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-04 10:22:02 -04:00
Sean Whalen 784e3050bd Detect aggregate reports by "domain" instead of "adkim"
adkim is the published policy's DKIM alignment mode (defaulted to "r" by
parsedmarc), an obscure thing to key detection on. Switch the aggregate
detector to "domain" -- the reported From-domain, a required element present
and non-empty in every aggregate record (2388/2388 sample rows) and unique to
aggregate (failure uses reported_domain, SMTP TLS uses policy_domain).
header_from is unsuitable: it can be empty when a record carries no
identifiers.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-04 09:42:28 -04:00
Sean Whalen ca27428713 Add Google SecOps (Chronicle) UDM parser for syslog output
A SecOps-side custom parser (CBN) that maps parsedmarc's [syslog] JSON
events to the Unified Data Model. No library changes: parsedmarc already
emits structured JSON, so the DMARC->UDM mapping lives in the parser and a
downstream UDM schema change is a parser edit, not a parsedmarc release.

Covers all three report types:
- aggregate -> EMAIL_TRANSACTION
- failure   -> EMAIL_TRANSACTION
- smtp_tls  -> GENERIC_EVENT (noun from policy_domain, present on every row)

Built strictly against the official UDM and parser-syntax docs (cited
inline). Sets metadata.event_timestamp from the report window via date{},
maps disposition / auth-failure to security_result with valid action and
category enums (AUTH_VIOLATION on DMARC fail), uses real network.email
field names, and strips syslog framing before JSON parsing. Ships real
sample events generated from the project's sample reports for validation.

Not yet validated against a live SecOps tenant; caveats are documented in
the README.
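
A minimal Python sketch of the flow described above, for orientation only. The
syslog line is made up, and strip_syslog_framing is a hypothetical helper that
assumes the JSON payload starts at the first "{":

```python
import json

# Report type -> UDM event type, as listed in this commit message.
UDM_EVENT_TYPE = {
    "aggregate": "EMAIL_TRANSACTION",
    "failure": "EMAIL_TRANSACTION",
    "smtp_tls": "GENERIC_EVENT",
}

def strip_syslog_framing(line):
    # Assumption: everything before the first "{" is syslog framing.
    start = line.find("{")
    if start == -1:
        raise ValueError("no JSON payload found")
    return line[start:]

# Made-up syslog line; real framing depends on the syslog daemon.
line = '<14>Jun  4 09:24:20 mailhost parsedmarc: {"policy_domain": "example.com"}'
event = json.loads(strip_syslog_framing(line))

print(event["policy_domain"])      # example.com
print(UDM_EVENT_TYPE["smtp_tls"])  # GENERIC_EVENT
```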

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-04 09:24:20 -04:00