Fix duplicate detection for normalized aggregate reports in Elasticsearch/OpenSearch

Change date_begin/date_end queries from exact match to range queries (gte/lte) so that previously saved normalized time buckets are correctly detected as duplicates within the original report's date range.

Co-authored-by: seanthegeek <44679+seanthegeek@users.noreply.github.com>
Initial plan
2026-04-21 21:09:27 +00:00 · 2026-03-06 17:59:10 +00:00 · 2026-03-06 17:55:59 +00:00 · 2026-03-04 12:36:15 -05:00 · 2026-03-03 21:00:55 -05:00 · 2026-03-03 11:46:13 -05:00
15 changed files with 126 additions and 27 deletions
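
The switch from exact match to range queries described in the commit message can be illustrated with a small sketch. It builds the same query shapes as plain dictionaries and evaluates them against an in-memory document instead of a real cluster. The `matches` helper, the one-hour bucket, and the sample dates are hypothetical and exist only for illustration; real Elasticsearch/OpenSearch date matching has more nuance than this stand-in.

```python
from datetime import datetime


def match_query(begin, end):
    """Old shape: a stored document must carry exactly these two dates."""
    return [{"match": {"date_begin": begin}}, {"match": {"date_end": end}}]


def range_query(begin, end):
    """New shape: any stored bucket lying inside [begin, end] qualifies."""
    return [
        {"range": {"date_begin": {"gte": begin}}},
        {"range": {"date_end": {"lte": end}}},
    ]


def matches(doc, clauses):
    """Hypothetical in-memory stand-in for the search engine (all clauses ANDed)."""
    for clause in clauses:
        kind, body = next(iter(clause.items()))
        field, cond = next(iter(body.items()))
        value = doc[field]
        if kind == "match" and value != cond:
            return False
        if kind == "range":
            if "gte" in cond and value < cond["gte"]:
                return False
            if "lte" in cond and value > cond["lte"]:
                return False
    return True


# Date range of the original report being re-submitted.
report_begin = datetime(2026, 3, 1, 0, 0)
report_end = datetime(2026, 3, 2, 0, 0)

# A previously saved, normalized bucket from that report (assumed one hour long).
bucket = {
    "date_begin": datetime(2026, 3, 1, 5, 0),
    "date_end": datetime(2026, 3, 1, 6, 0),
}

old_hit = matches(bucket, match_query(report_begin, report_end))
new_hit = matches(bucket, range_query(report_begin, report_end))
print(old_hit, new_hit)  # False True
```

With the exact-match shape the saved bucket is missed, so the report would be indexed again; with the range shape it is found and treated as a duplicate.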
--- a/.github/workflows/python-tests.yml
+++ b/.github/workflows/python-tests.yml
@@ -30,7 +30,7 @@ jobs:
    strategy:
      fail-fast: false
      matrix:
-        python-version: ["3.9", "3.10", "3.11", "3.12", "3.13", "3.14"]
+        python-version: ["3.10", "3.11", "3.12", "3.13", "3.14"]

    steps:
    - uses: actions/checkout@v5
--- a/AGENTS.md
+++ b/AGENTS.md
@@ -0,0 +1,64 @@
+# AGENTS.md
+
+This file provides guidance to AI agents when working with code in this repository.
+
+## Project Overview
+
+parsedmarc is a Python module and CLI utility for parsing DMARC aggregate (RUA), forensic (RUF), and SMTP TLS reports. It reads reports from IMAP, Microsoft Graph, Gmail API, Maildir, mbox files, or direct file paths, and outputs to JSON/CSV, Elasticsearch, OpenSearch, Splunk, Kafka, S3, Azure Log Analytics, syslog, or webhooks.
+
+## Common Commands
+
+```bash
+# Install with dev/build dependencies
+pip install .[build]
+
+# Run all tests with coverage
+pytest --cov --cov-report=xml tests.py
+
+# Run a single test
+pytest tests.py::Test::testAggregateSamples
+
+# Lint and format
+ruff check .
+ruff format .
+
+# Test CLI with sample reports
+parsedmarc --debug -c ci.ini samples/aggregate/*
+parsedmarc --debug -c ci.ini samples/forensic/*
+
+# Build docs
+cd docs && make html
+
+# Build distribution
+hatch build
+```
+
+To skip DNS lookups during testing, set `GITHUB_ACTIONS=true`.
+
+## Architecture
+
+**Data flow:** Input sources → CLI (`cli.py:_main`) → Parse (`__init__.py`) → Enrich (DNS/GeoIP via `utils.py`) → Output integrations
+
+### Key modules
+
+- `parsedmarc/__init__.py` — Core parsing logic. Main functions: `parse_report_file()`, `parse_report_email()`, `parse_aggregate_report_xml()`, `parse_forensic_report()`, `parse_smtp_tls_report_json()`, `get_dmarc_reports_from_mailbox()`, `watch_inbox()`
+- `parsedmarc/cli.py` — CLI entry point (`_main`), config file parsing, output orchestration
+- `parsedmarc/types.py` — TypedDict definitions for all report types (`AggregateReport`, `ForensicReport`, `SMTPTLSReport`, `ParsingResults`)
+- `parsedmarc/utils.py` — IP/DNS/GeoIP enrichment, base64 decoding, compression handling
+- `parsedmarc/mail/` — Polymorphic mail connections: `IMAPConnection`, `GmailConnection`, `MSGraphConnection`, `MaildirConnection`
+- `parsedmarc/{elastic,opensearch,splunk,kafkaclient,loganalytics,syslog,s3,webhook,gelf}.py` — Output integrations
+
+### Report type system
+
+`ReportType = Literal["aggregate", "forensic", "smtp_tls"]`. Exception hierarchy: `ParserError` → `InvalidDMARCReport` → `InvalidAggregateReport`/`InvalidForensicReport`, and `InvalidSMTPTLSReport`.
+
+### Caching
+
+IP address info cached for 4 hours, seen aggregate report IDs cached for 1 hour (via `ExpiringDict`).
+
+## Code Style
+
+- Ruff for formatting and linting (configured in `.vscode/settings.json`)
+- TypedDict for structured data, type hints throughout
+- Python ≥3.10 required
+- Tests are in a single `tests.py` file using unittest; sample reports live in `samples/`
--- a/CHANGELOG.md
+++ b/CHANGELOG.md
@@ -1,5 +1,23 @@
 # Changelog

+## 9.1.1
+
+### Fixes
+
+- Fix the use of Elasticsearch and OpenSearch API keys (PR #660 fixes issue #653)
+
+### Changes
+
+- Drop support for Python 3.9 (PR #661)
+
+## 9.1.0
+
+### Enhancements
+
+- Add TCP and TLS support for syslog output. (#656)
+- Skip DNS lookups in GitHub Actions to prevent DNS timeouts during tests. (#657)
+- Remove microseconds from DMARC aggregate report time ranges before parsing them.
+
 ## 9.0.10

 - Support Python 3.14+
--- a/CLAUDE.md
+++ b/CLAUDE.md
@@ -0,0 +1,3 @@
+# CLAUDE.md
+
+@AGENTS.md
--- a/README.md
+++ b/README.md
@@ -56,9 +56,9 @@ for RHEL or Debian.
 | 3.6     | ❌         | Used in RHEL 8, but not supported by project dependencies |
 | 3.7     | ❌         | End of Life (EOL)                                          |
 | 3.8     | ❌         | End of Life (EOL)                                          |
-| 3.9     | ✅         | Supported until August 2026 (Debian 11); May 2032 (RHEL 9) |
+| 3.9     | ❌         | Used in Debian 11 and RHEL 9, but not supported by project dependencies |
 | 3.10    | ✅         | Actively maintained                                        |
 | 3.11    | ✅         | Actively maintained; supported until June 2028 (Debian 12) |
 | 3.12    | ✅         | Actively maintained; supported until May 2035 (RHEL 10)    |
 | 3.13    | ✅         | Actively maintained; supported until June 2030 (Debian 13) |
-| 3.14    | ✅         | Actively maintained                                        |
+| 3.14    | ✅         | Supported (requires `imapclient>=3.1.0`)                  |
--- a/ci.ini
+++ b/ci.ini
@@ -3,6 +3,7 @@ save_aggregate = True
 save_forensic = True
 save_smtp_tls = True
 debug = True
+offline = True

 [elasticsearch]
 hosts = http://localhost:9200
--- a/docs/source/index.md
+++ b/docs/source/index.md
@@ -56,12 +56,12 @@ for RHEL or Debian.
 | 3.6     | ❌         | Used in RHEL 8, but not supported by project dependencies |
 | 3.7     | ❌         | End of Life (EOL)                                          |
 | 3.8     | ❌         | End of Life (EOL)                                          |
-| 3.9     | ✅         | Supported until August 2026 (Debian 11); May 2032 (RHEL 9) |
+| 3.9     | ❌         | Used in Debian 11 and RHEL 9, but not supported by project dependencies |
 | 3.10    | ✅         | Actively maintained                                        |
 | 3.11    | ✅         | Actively maintained; supported until June 2028 (Debian 12) |
 | 3.12    | ✅         | Actively maintained; supported until May 2035 (RHEL 10)    |
 | 3.13    | ✅         | Actively maintained; supported until June 2030 (Debian 13) |
-| 3.14    | ✅         | Actively maintained                                        |
+| 3.14    | ✅         | Supported (requires `imapclient>=3.1.0`)                  |

 ```{toctree}
 :caption: 'Contents'
--- a/docs/source/installation.md
+++ b/docs/source/installation.md
@@ -162,10 +162,10 @@ sudo -u parsedmarc virtualenv /opt/parsedmarc/venv
 ```

 CentOS/RHEL 8 systems use Python 3.6 by default, so on those systems
-explicitly tell `virtualenv` to use `python3.9` instead
+explicitly tell `virtualenv` to use `python3.10` instead

 ```bash
-sudo -u parsedmarc virtualenv -p python3.9  /opt/parsedmarc/venv
+sudo -u parsedmarc virtualenv -p python3.10  /opt/parsedmarc/venv
 ```

 Activate the virtualenv
--- a/parsedmarc/cli.py
+++ b/parsedmarc/cli.py
@@ -1058,10 +1058,10 @@ def _main():
                opts.elasticsearch_password = elasticsearch_config["password"]
            # Until 8.20
            if "apiKey" in elasticsearch_config:
-                opts.elasticsearch_apiKey = elasticsearch_config["apiKey"]
+                opts.elasticsearch_api_key = elasticsearch_config["apiKey"]
            # Since 8.20
            if "api_key" in elasticsearch_config:
-                opts.elasticsearch_apiKey = elasticsearch_config["api_key"]
+                opts.elasticsearch_api_key = elasticsearch_config["api_key"]

        if "opensearch" in config:
            opensearch_config = config["opensearch"]
@@ -1098,10 +1098,10 @@ def _main():
                opts.opensearch_password = opensearch_config["password"]
            # Until 8.20
            if "apiKey" in opensearch_config:
-                opts.opensearch_apiKey = opensearch_config["apiKey"]
+                opts.opensearch_api_key = opensearch_config["apiKey"]
            # Since 8.20
            if "api_key" in opensearch_config:
-                opts.opensearch_apiKey = opensearch_config["api_key"]
+                opts.opensearch_api_key = opensearch_config["api_key"]

        if "splunk_hec" in config.sections():
            hec_config = config["splunk_hec"]
@@ -1470,8 +1470,12 @@ def _main():
                certfile_path=opts.syslog_certfile_path,
                keyfile_path=opts.syslog_keyfile_path,
                timeout=opts.syslog_timeout if opts.syslog_timeout is not None else 5.0,
-                retry_attempts=opts.syslog_retry_attempts if opts.syslog_retry_attempts is not None else 3,
-                retry_delay=opts.syslog_retry_delay if opts.syslog_retry_delay is not None else 5,
+                retry_attempts=opts.syslog_retry_attempts
+                if opts.syslog_retry_attempts is not None
+                else 3,
+                retry_delay=opts.syslog_retry_delay
+                if opts.syslog_retry_delay is not None
+                else 5,
            )
        except Exception as error_:
            logger.error("Syslog Error: {0}".format(error_.__str__()))
--- a/parsedmarc/constants.py
+++ b/parsedmarc/constants.py
@@ -1,3 +1,3 @@
-__version__ = "9.0.10"
+__version__ = "9.1.1"

 USER_AGENT = f"parsedmarc/{__version__}"
--- a/parsedmarc/elastic.py
+++ b/parsedmarc/elastic.py
@@ -413,8 +413,8 @@ def save_aggregate_report_to_elasticsearch(
    org_name_query = Q(dict(match_phrase=dict(org_name=org_name)))  # type: ignore
    report_id_query = Q(dict(match_phrase=dict(report_id=report_id)))  # pyright: ignore[reportArgumentType]
    domain_query = Q(dict(match_phrase={"published_policy.domain": domain}))  # pyright: ignore[reportArgumentType]
-    begin_date_query = Q(dict(match=dict(date_begin=begin_date)))  # pyright: ignore[reportArgumentType]
-    end_date_query = Q(dict(match=dict(date_end=end_date)))  # pyright: ignore[reportArgumentType]
+    begin_date_query = Q(dict(range=dict(date_begin=dict(gte=begin_date))))  # pyright: ignore[reportArgumentType]
+    end_date_query = Q(dict(range=dict(date_end=dict(lte=end_date))))  # pyright: ignore[reportArgumentType]

    if index_suffix is not None:
        search_index = "dmarc_aggregate_{0}*".format(index_suffix)
--- a/parsedmarc/opensearch.py
+++ b/parsedmarc/opensearch.py
@@ -413,8 +413,8 @@ def save_aggregate_report_to_opensearch(
    org_name_query = Q(dict(match_phrase=dict(org_name=org_name)))
    report_id_query = Q(dict(match_phrase=dict(report_id=report_id)))
    domain_query = Q(dict(match_phrase={"published_policy.domain": domain}))
-    begin_date_query = Q(dict(match=dict(date_begin=begin_date)))
-    end_date_query = Q(dict(match=dict(date_end=end_date)))
+    begin_date_query = Q(dict(range=dict(date_begin=dict(gte=begin_date))))
+    end_date_query = Q(dict(range=dict(date_end=dict(lte=end_date))))

    if index_suffix is not None:
        search_index = "dmarc_aggregate_{0}*".format(index_suffix)
--- a/parsedmarc/types.py
+++ b/parsedmarc/types.py
@@ -2,7 +2,7 @@ from __future__ import annotations

 from typing import Any, Dict, List, Literal, Optional, TypedDict, Union

-# NOTE: This module is intentionally Python 3.9 compatible.
+# NOTE: This module is intentionally Python 3.10 compatible.
 # - No PEP 604 unions (A | B)
 # - No typing.NotRequired / Required (3.11+) to avoid an extra dependency.
 #   For optional keys, use total=False TypedDicts.
--- a/pyproject.toml
+++ b/pyproject.toml
@@ -2,7 +2,7 @@
 requires = [
    "hatchling>=1.27.0",
 ]
-requires_python = ">=3.9,<3.14"
+requires_python = ">=3.10,<3.15"
 build-backend = "hatchling.build"

 [project]
@@ -29,7 +29,7 @@ classifiers = [
    "Operating System :: OS Independent",
    "Programming Language :: Python :: 3"
 ]
-requires-python = ">=3.9"
+requires-python = ">=3.10"
 dependencies = [
    "azure-identity>=1.8.0",
    "azure-monitor-ingestion>=1.0.0",
@@ -45,7 +45,7 @@ dependencies = [
    "google-auth-httplib2>=0.1.0",
    "google-auth-oauthlib>=0.4.6",
    "google-auth>=2.3.3",
-    "imapclient>=2.1.0",
+    "imapclient>=3.1.0",
    "kafka-python-ng>=2.2.2",
    "lxml>=4.4.0",
    "mailsuite>=1.11.2",
--- a/tests.py
+++ b/tests.py
@@ -12,6 +12,9 @@ from lxml import etree
 import parsedmarc
 import parsedmarc.utils

+# Detect if running in GitHub Actions to skip DNS lookups
+OFFLINE_MODE = os.environ.get("GITHUB_ACTIONS", "false").lower() == "true"
+

 def minify_xml(xml_string):
    parser = etree.XMLParser(remove_blank_text=True)
@@ -121,7 +124,7 @@ class Test(unittest.TestCase):
                continue
            print("Testing {0}: ".format(sample_path), end="")
            parsed_report = parsedmarc.parse_report_file(
-                sample_path, always_use_local_files=True
+                sample_path, always_use_local_files=True, offline=OFFLINE_MODE
            )["report"]
            parsedmarc.parsed_aggregate_reports_to_csv(parsed_report)
            print("Passed!")
@@ -129,7 +132,7 @@ class Test(unittest.TestCase):
    def testEmptySample(self):
        """Test empty/unparsable report"""
        with self.assertRaises(parsedmarc.ParserError):
-            parsedmarc.parse_report_file("samples/empty.xml")
+            parsedmarc.parse_report_file("samples/empty.xml", offline=OFFLINE_MODE)

    def testForensicSamples(self):
        """Test sample forensic/ruf/failure DMARC reports"""
@@ -139,8 +142,12 @@ class Test(unittest.TestCase):
            print("Testing {0}: ".format(sample_path), end="")
            with open(sample_path) as sample_file:
                sample_content = sample_file.read()
-                parsed_report = parsedmarc.parse_report_email(sample_content)["report"]
-            parsed_report = parsedmarc.parse_report_file(sample_path)["report"]
+                parsed_report = parsedmarc.parse_report_email(
+                    sample_content, offline=OFFLINE_MODE
+                )["report"]
+            parsed_report = parsedmarc.parse_report_file(
+                sample_path, offline=OFFLINE_MODE
+            )["report"]
            parsedmarc.parsed_forensic_reports_to_csv(parsed_report)
            print("Passed!")

@@ -152,7 +159,9 @@ class Test(unittest.TestCase):
            if os.path.isdir(sample_path):
                continue
            print("Testing {0}: ".format(sample_path), end="")
-            parsed_report = parsedmarc.parse_report_file(sample_path)["report"]
+            parsed_report = parsedmarc.parse_report_file(
+                sample_path, offline=OFFLINE_MODE
+            )["report"]
            parsedmarc.parsed_smtp_tls_reports_to_csv(parsed_report)
            print("Passed!")
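
The `OFFLINE_MODE` check added in the `tests.py` hunk above can be exercised on its own. The sketch below wraps the same expression in a hypothetical helper, `offline_from_env`, that takes the environment mapping as a parameter so it can be tried without touching the real process environment.

```python
import os


def offline_from_env(environ=os.environ):
    """Return True only when GITHUB_ACTIONS is set to "true" (case-insensitive)."""
    return environ.get("GITHUB_ACTIONS", "false").lower() == "true"


# Hypothetical environments, used purely for illustration.
in_ci = offline_from_env({"GITHUB_ACTIONS": "true"})
upper = offline_from_env({"GITHUB_ACTIONS": "TRUE"})
local = offline_from_env({})
numeric = offline_from_env({"GITHUB_ACTIONS": "1"})
```

Because the comparison is against the literal string `true`, a value such as `1` does not enable offline mode.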
Author	SHA1	Message	Date
copilot-swe-agent[bot]	a918d7582c	Fix duplicate detection for normalized aggregate reports in Elasticsearch/OpenSearch Change date_begin/date_end queries from exact match to range queries (gte/lte) so that previously saved normalized time buckets are correctly detected as duplicates within the original report's date range. Co-authored-by: seanthegeek <44679+seanthegeek@users.noreply.github.com>	2026-03-06 17:59:10 +00:00
copilot-swe-agent[bot]	e9b4031288	Initial plan	2026-03-06 17:55:59 +00:00
Kili	e98fdfa96b	Fix Python 3.14 support metadata and require imapclient 3.1.0 (#662)	2026-03-04 12:36:15 -05:00
Sean Whalen	9551c8b467	Add AGENTS.md for AI agent guidance and link from CLAUDE.md	2026-03-03 21:00:55 -05:00
Sean Whalen	d987943c22	Update changelog formatting for version 9.1.1	2026-03-03 11:46:13 -05:00
Sean Whalen	3d8a99b5d3	9.1.1 - Fix the use of Elasticsearch and OpenSearch API keys (PR #660 fixes issue #653) - Drop support for Python 3.9 (PR #661)	2026-03-03 11:43:53 -05:00
Majid Burney	5aaaedf463	Use correct key names for elasticsearch/opensearch api keys (#660)	2026-03-03 11:35:05 -05:00
Copilot	2e3ee25ec9	Drop Python 3.9 support (#661) * Initial plan * Drop Python 3.9 support: update CI matrix, pyproject.toml, docs, and README Co-authored-by: seanthegeek <44679+seanthegeek@users.noreply.github.com> * Update Python 3.9 version table entry to note Debian 11/RHEL 9 usage Co-authored-by: seanthegeek <44679+seanthegeek@users.noreply.github.com> --------- Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com> Co-authored-by: seanthegeek <44679+seanthegeek@users.noreply.github.com>	2026-03-03 11:34:35 -05:00
Sean Whalen	33eb2aaf62	9.1.0 ## Enhancements - Add TCP and TLS support for syslog output. (#656) - Skip DNS lookups in GitHub Actions to prevent DNS timeouts during tests timeouts. (#657) - Remove microseconds from DMARC aggregate report time ranges before parsing them.	2026-02-20 14:36:37 -05:00
Sean Whalen	1387fb4899	9.0.11 - Remove microseconds from DMARC aggregate report time ranges before parsing them.	2026-02-20 14:27:51 -05:00
Copilot	4d97bd25aa	Skip DNS lookups in GitHub Actions to prevent test timeouts (#657) * Add offline mode for tests in GitHub Actions to skip DNS lookups Co-authored-by: seanthegeek <44679+seanthegeek@users.noreply.github.com> --------- Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com> Co-authored-by: seanthegeek <44679+seanthegeek@users.noreply.github.com>	2026-02-18 18:19:28 -05:00
Copilot	17a612df0c	Add TCP and TLS transport support to syslog module (#656 ) - Updated parsedmarc/syslog.py to support UDP, TCP, and TLS protocols - Added protocol parameter with UDP as default for backward compatibility - Implemented TLS support with CA verification and client certificate auth - Added retry logic for TCP/TLS connections with configurable attempts and delays - Updated parsedmarc/cli.py with new config file parsing - Updated documentation with examples for TCP and TLS configurations Co-authored-by: seanthegeek <44679+seanthegeek@users.noreply.github.com> * Remove CLI arguments for syslog options, keep config-file only Per user request, removed command-line argument options for syslog parameters. All new syslog options (protocol, TLS settings, timeout, retry) are now only available via configuration file, consistent with other similar options. Co-authored-by: seanthegeek <44679+seanthegeek@users.noreply.github.com> * Fix code review issues: remove trailing whitespace and add cert validation - Removed trailing whitespace from syslog.py and usage.md - Added warning when only one of certfile_path/keyfile_path is provided - Improved error handling for incomplete TLS client certificate configuration Co-authored-by: seanthegeek <44679+seanthegeek@users.noreply.github.com> * Set minimum TLS version to 1.2 for enhanced security Explicitly configured ssl_context.minimum_version = TLSVersion.TLSv1_2 to ensure only secure TLS versions are used for syslog connections. Co-authored-by: seanthegeek <44679+seanthegeek@users.noreply.github.com> --------- Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com> Co-authored-by: seanthegeek <44679+seanthegeek@users.noreply.github.com>	2026-02-18 18:12:59 -05:00
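
The API key fix listed above (commit 5aaaedf463, visible in the `parsedmarc/cli.py` hunks) stores both config spellings, `apiKey` and `api_key`, under a single snake_case attribute. A minimal sketch of that pattern follows, assuming a standard-library `ConfigParser` and a hypothetical `read_api_key` helper; the `SimpleNamespace` stands in for the real options object, and the key values are made up.

```python
from configparser import ConfigParser
from types import SimpleNamespace


def read_api_key(config_text, section="elasticsearch"):
    """Read either spelling of the key into one attribute (illustrative helper)."""
    config = ConfigParser()
    config.read_string(config_text)
    opts = SimpleNamespace(elasticsearch_api_key=None)
    section_cfg = config[section]
    # Older spelling; ConfigParser lowercases option names on both write and lookup.
    if "apiKey" in section_cfg:
        opts.elasticsearch_api_key = section_cfg["apiKey"]
    # Newer spelling; checked second, so it wins when both are present.
    if "api_key" in section_cfg:
        opts.elasticsearch_api_key = section_cfg["api_key"]
    return opts


old_style = read_api_key("[elasticsearch]\napiKey = abc123\n")
new_style = read_api_key("[elasticsearch]\napi_key = xyz789\n")
both = read_api_key("[elasticsearch]\napiKey = abc123\napi_key = xyz789\n")
missing = read_api_key("[elasticsearch]\nhosts = http://localhost:9200\n")
```

Writing to one consistently named attribute means whatever code later reads the option only has to know a single name, which is presumably why the diff renames `elasticsearch_apiKey` and `opensearch_apiKey`.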