Compare commits

...

26 Commits

Author SHA1 Message Date
Sean Whalen
2b7ae50a27 Better wording 2025-08-17 17:01:22 -04:00
Sean Whalen
3feb478793 8.18.6
- Fix since option to correctly work with weeks (PR #604)
- Add 183 entries to `base_reverse_dns_map.csv`
- Add 57 entries to `known_unknown_base_reverse_dns.txt`
- Check for invalid UTF-8 bytes in `base_reverse_dns_map.csv` at build
- Remove unneeded items from the `parsedmarc.resources` module at build
2025-08-17 17:00:11 -04:00
Sean Whalen
01630bb61c Update code formatting 2025-08-17 16:01:45 -04:00
Sean Whalen
39347cb244 Add find_bad_utf8.py 2025-08-17 15:55:47 -04:00
Sean Whalen
ed25526d59 Update maps 2025-08-17 15:17:24 -04:00
alagendijk-minddistrict
880d7110fe Fix since option to correctly work with weeks (#604) 2025-08-14 18:39:04 -04:00
Martin Kjær Jørgensen
d62001f5a4 fix wrong configuration option for maildir (#606)
Signed-off-by: Martin Kjær Jørgensen <me@lagy.org>
2025-08-14 18:36:58 -04:00
Sean Whalen
0720bffcb6 Remove extra spaces 2025-06-10 19:05:06 -04:00
Sean Whalen
fecd55a97d Add SMTP TLS Reporting dashboard for Splunk
Closes #600
2025-06-10 18:54:43 -04:00
Sean Whalen
a121306eed Fix typo in the map 2025-06-10 10:53:55 -04:00
Sean Whalen
980c9c7904 Add Hostinger to the map 2025-06-10 10:50:06 -04:00
Sean Whalen
963f5d796f Fix build script 2025-06-10 09:51:12 -04:00
Sean Whalen
6532f3571b Update lists 2025-06-09 20:05:56 -04:00
Sean Whalen
ea878443a8 Update lists 2025-06-09 17:04:16 -04:00
Sean Whalen
9f6de41958 Update lists 2025-06-09 13:41:49 -04:00
Sean Whalen
119192701c Update lists 2025-06-09 12:02:50 -04:00
Sean Whalen
1d650be48a Fix typo 2025-06-08 21:41:07 -04:00
Sean Whalen
a85553fb18 Update lists 2025-06-08 21:40:10 -04:00
Sean Whalen
5975d8eb21 Fix sorting 2025-06-08 20:17:21 -04:00
Sean Whalen
87ae6175f2 Update lists 2025-06-08 19:51:13 -04:00
Sean Whalen
68b93ed580 Update map 2025-06-03 14:54:58 -04:00
Sean Whalen
55508b513b Remove debugging code 2025-06-03 14:38:15 -04:00
Sean Whalen
71511c0cfc 8.18.5
- Fix CSV download
2025-06-03 11:44:42 -04:00
Sean Whalen
7c45812284 8.18.4
- Fix webhooks
2025-06-02 16:52:48 -04:00
Sean Whalen
607a091a5f 8.18.3
- Move `__version__` to `parsedmarc.constants`
- Create a constant `USER_AGENT`
- Use the HTTP `User-Agent` header value `parsedmarc/version` for all HTTP requests
2025-06-02 16:43:26 -04:00
Sean Whalen
c308bf938c Update the README 2025-06-02 15:43:51 -04:00
20 changed files with 1322 additions and 50 deletions

1
.gitignore vendored

@@ -142,3 +142,4 @@ scratch.py
parsedmarc/resources/maps/base_reverse_dns.csv
parsedmarc/resources/maps/unknown_base_reverse_dns.csv
*.bak

11
.vscode/settings.json vendored

@@ -13,6 +13,7 @@
"automodule",
"backported",
"bellsouth",
"boto",
"brakhane",
"Brightmail",
"CEST",
@@ -36,6 +37,7 @@
"expiringdict",
"fieldlist",
"genindex",
"geoip",
"geoipupdate",
"Geolite",
"geolocation",
@@ -44,7 +46,10 @@
"hostnames",
"htpasswd",
"httpasswd",
"httplib",
"IMAP",
"imapclient",
"infile",
"Interaktive",
"IPDB",
"journalctl",
@@ -80,14 +85,18 @@
"nosecureimap",
"nosniff",
"nwettbewerb",
"opensearch",
"parsedmarc",
"passsword",
"Postorius",
"premade",
"procs",
"publicsuffix",
"publicsuffixlist",
"publixsuffix",
"pygelf",
"pypy",
"pytest",
"quickstart",
"Reindex",
"replyto",
@@ -95,10 +104,12 @@
"Rollup",
"Rpdm",
"SAMEORIGIN",
"sdist",
"Servernameone",
"setuptools",
"smartquotes",
"SMTPTLS",
"sortmaps",
"sourcetype",
"STARTTLS",
"tasklist",


@@ -1,6 +1,32 @@
Changelog
=========
8.18.6
------
- Fix since option to correctly work with weeks (PR #604)
- Add 183 entries to `base_reverse_dns_map.csv`
- Add 57 entries to `known_unknown_base_reverse_dns.txt`
- Check for invalid UTF-8 bytes in `base_reverse_dns_map.csv` at build
- Exclude unneeded items from the `parsedmarc.resources` module at build
8.18.5
------
- Fix CSV download
8.18.4
------
- Fix webhooks
8.18.3
------
- Move `__version__` to `parsedmarc.constants`
- Create a constant `USER_AGENT`
- Use the HTTP `User-Agent` header value `parsedmarc/version` for all HTTP requests
8.18.2
------
@@ -676,7 +702,7 @@ in the ``elasticsearch`` configuration file section (closes issue #78)
-----
- Add filename and line number to logging output
- Improved IMAP error handling
- Improved IMAP error handling
- Add CLI options
```text


@@ -18,8 +18,11 @@ if [ -d "./../parsedmarc-docs" ]; then
cp -rf build/html/* ../../parsedmarc-docs/
fi
cd ..
sort -o "parsedmarc/resources/maps/known_unknown_base_reverse_dns.txt" "parsedmarc/resources/maps/known_unknown_base_reverse_dns.txt"
./sortmaps.py
cd parsedmarc/resources/maps
python3 sortmaps.py
echo "Checking for invalid UTF-8 bytes in base_reverse_dns_map.csv"
python3 find_bad_utf8.py base_reverse_dns_map.csv
cd ../../..
python3 tests.py
rm -rf dist/ build/
hatch build
hatch build


@@ -369,7 +369,7 @@ The full set of configuration options are:
- `mode` - str: The GELF transport type to use. Valid modes: `tcp`, `udp`, `tls`
- `maildir`
- `reports_folder` - str: Full path for mailbox maildir location (Default: `INBOX`)
- `maildir_path` - str: Full path for mailbox maildir location (Default: `INBOX`)
- `maildir_create` - bool: Create maildir if not present (Default: False)
- `webhook` - Post the individual reports to a webhook url with the report as the JSON body


@@ -34,12 +34,13 @@ from parsedmarc.mail import (
MSGraphConnection,
GmailConnection,
)
from parsedmarc.constants import __version__
from parsedmarc.utils import get_base_domain, get_ip_address_info
from parsedmarc.utils import is_outlook_msg, convert_outlook_msg
from parsedmarc.utils import parse_email
from parsedmarc.utils import timestamp_to_human, human_timestamp_to_datetime
__version__ = "8.18.2"
logger.debug("parsedmarc v{0}".format(__version__))
@@ -1579,7 +1580,7 @@ def get_dmarc_reports_from_mailbox(
if since:
_since = 1440 # default one day
if re.match(r"\d+[mhd]$", since):
if re.match(r"\d+[mhdw]$", since):
s = re.split(r"(\d+)", since)
if s[2] == "m":
_since = int(s[1])
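The hunk above stops before the unit-to-minutes conversion, so here is a minimal sketch of how the widened `\d+[mhdw]` pattern plays out, assuming each suffix converts to minutes the same way the existing `m`/`h`/`d` branches do; the helper name and structure are illustrative, not the actual code:

```python
import re


def since_to_minutes(since: str) -> int:
    """Illustrative helper: convert a since value such as "30m", "12h",
    "2d" or "2w" into minutes, defaulting to one day (1440) as above."""
    if re.match(r"\d+[mhdw]$", since):
        number, unit = int(since[:-1]), since[-1]
        return number * {"m": 1, "h": 60, "d": 1440, "w": 10080}[unit]
    return 1440  # fall back to one day, as in the excerpt


# The fix from PR #604: a value like "2w" now maps to two weeks of minutes
assert since_to_minutes("2w") == 20160
```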

2
parsedmarc/constants.py Normal file

@@ -0,0 +1,2 @@
__version__ = "8.18.6"
USER_AGENT = f"parsedmarc/{__version__}"
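Later hunks in this compare (splunk.py, webhook.py, and utils.py) replace hard-coded strings with this constant. A minimal sketch of the resulting pattern, assuming the two constants above; the URL is a placeholder:

```python
import requests

from parsedmarc.constants import USER_AGENT  # e.g. "parsedmarc/8.18.6"

# On a shared session, as in the Splunk HEC and webhook clients below
session = requests.Session()
session.headers["User-Agent"] = USER_AGENT

# Or per request, as in the reverse DNS map download in utils.py below
response = requests.get(
    "https://example.com/map.csv",  # placeholder URL
    headers={"User-Agent": USER_AGENT},
)
```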


@@ -3,6 +3,8 @@
A mapping is meant to make it easier to identify who or what a sending source is. Please consider contributing
additional mappings in a GitHub Pull Request.
Do not open these CSV files in Excel. It will replace Unicode characters with question marks. Use LibreOffice Calc instead.
## base_reverse_dns_map.csv
A CSV file with three fields: `base_reverse_dns`, `name`, and `type`.
@@ -25,6 +27,7 @@ The `service_type` is based on the following rule precedence:
- Agriculture
- Automotive
- Beauty
- Conglomerate
- Construction
- Consulting
- Defense
@@ -41,6 +44,7 @@ The `service_type` is based on the following rule precedence:
- IaaS
- Industrial
- ISP
- Legal
- Logistics
- Manufacturing
- Marketing
@@ -50,6 +54,7 @@ The `service_type` is based on the following rule precedence:
- Nonprofit
- PaaS
- Photography
- Physical Security
- Print
- Publishing
- Real Estate
@@ -72,12 +77,16 @@ A list of reverse DNS base domains that could not be identified as belonging to
## base_reverse_dns.csv
A CSV with the fields `source_name` and optionally `message_countcount`. This CSV can be generated byy exporting the base DNS data from the Kibana on Splunk dashboards provided by parsedmarc. This file is not tracked by Git.
A CSV with the fields `source_name` and optionally `message_count`. This CSV can be generated by exporting the base DNS data from the Kibana or Splunk dashboards provided by parsedmarc. This file is not tracked by Git.
## unknown_base_reverse_dns.csv
A CSV file with the fields `source_name` and `message_count`. This file is not tracked by Git.
## find_bad_utf8.py
Locates invalid UTF-8 bytes in files and optionally tries to correct them. Generated by GPT-5. Helped me find where I had introduced invalid bytes in `base_reverse_dns_map.csv`.
## find_unknown_base_reverse_dns.py
This is a python script that reads the domains in `base_reverse_dns.csv` and writes the domains that are not in `base_reverse_dns_map.csv` or `known_unknown_base_reverse_dns.txt` to `unknown_base_reverse_dns.csv`. This is useful for identifying potential additional domains to contribute to `base_reverse_dns_map.csv` and `known_unknown_base_reverse_dns.txt`.
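As a rough sketch of how the map is meant to be consumed, assuming the three-field layout described above and a header row; the helper below is illustrative, not part of parsedmarc:

```python
import csv


def lookup_base_reverse_dns(base_domain, map_path="base_reverse_dns_map.csv"):
    """Illustrative lookup against the three-field map described above."""
    base_domain = base_domain.lower().strip()
    with open(map_path, newline="", encoding="utf-8") as f:
        reader = csv.reader(f)
        next(reader, None)  # skip the header row
        for base_reverse_dns, name, service_type in reader:
            if base_reverse_dns == base_domain:
                return {"name": name, "type": service_type}
    return None  # unknown: a candidate for known_unknown_base_reverse_dns.txt
```

A domain that comes back as `None` here is what `find_unknown_base_reverse_dns.py` collects into `unknown_base_reverse_dns.csv` for triage.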

File diff suppressed because it is too large


@@ -0,0 +1,488 @@
#!/usr/bin/env python3
import argparse
import codecs
import os
import sys
import shutil
from typing import List, Tuple
"""
Locates and optionally corrects bad UTF-8 bytes in a file.
Generated by GPT-5. Use at your own risk.
"""
# -------------------------
# UTF-8 scanning
# -------------------------
def scan_line_for_utf8_errors(
line_bytes: bytes, line_no: int, base_offset: int, context: int
):
"""
Scan one line of raw bytes for UTF-8 decoding errors.
Returns a list of dicts describing each error.
"""
pos = 0
results = []
while pos < len(line_bytes):
dec = codecs.getincrementaldecoder("utf-8")("strict")
try:
dec.decode(line_bytes[pos:], final=True)
break
except UnicodeDecodeError as e:
rel_index = e.start
abs_index_in_line = pos + rel_index
abs_offset = base_offset + abs_index_in_line
start_ctx = max(0, abs_index_in_line - context)
end_ctx = min(len(line_bytes), abs_index_in_line + 1 + context)
ctx_bytes = line_bytes[start_ctx:end_ctx]
bad_byte = line_bytes[abs_index_in_line : abs_index_in_line + 1]
col = abs_index_in_line + 1 # 1-based byte column
results.append(
{
"line": line_no,
"column": col,
"abs_offset": abs_offset,
"bad_byte_hex": bad_byte.hex(),
"context_hex": ctx_bytes.hex(),
"context_preview": ctx_bytes.decode("utf-8", errors="replace"),
}
)
# Move past the offending byte and continue
pos = abs_index_in_line + 1
return results
def scan_file_for_utf8_errors(path: str, context: int, limit: int):
errors_found = 0
limit_val = limit if limit != 0 else float("inf")
with open(path, "rb") as f:
total_offset = 0
line_no = 0
while True:
line = f.readline()
if not line:
break
line_no += 1
results = scan_line_for_utf8_errors(line, line_no, total_offset, context)
for r in results:
errors_found += 1
print(
f"[ERROR {errors_found}] Line {r['line']}, Column {r['column']}, "
f"Absolute byte offset {r['abs_offset']}"
)
print(f" Bad byte: 0x{r['bad_byte_hex']}")
print(f" Context (hex): {r['context_hex']}")
print(f" Context (preview): {r['context_preview']}")
print()
if errors_found >= limit_val:
print(f"Reached limit of {limit} errors. Stopping.")
return errors_found
total_offset += len(line)
if errors_found == 0:
print("No invalid UTF-8 bytes found. 🎉")
else:
print(f"Found {errors_found} invalid UTF-8 byte(s).")
return errors_found
# -------------------------
# Whole-file conversion
# -------------------------
def detect_encoding_text(path: str) -> Tuple[str, str]:
"""
Use charset-normalizer to detect file encoding.
Return (encoding_name, decoded_text). Falls back to cp1252 if needed.
"""
try:
from charset_normalizer import from_path
except ImportError:
print(
"Please install charset-normalizer: pip install charset-normalizer",
file=sys.stderr,
)
sys.exit(4)
matches = from_path(path)
match = matches.best()
if match is None or match.encoding is None:
# Fallback heuristic for Western single-byte text
with open(path, "rb") as fb:
data = fb.read()
try:
return "cp1252", data.decode("cp1252", errors="strict")
except UnicodeDecodeError:
print("Unable to detect encoding reliably.", file=sys.stderr)
sys.exit(5)
return match.encoding, str(match)
def convert_to_utf8(src_path: str, out_path: str, src_encoding: str = None) -> str:
"""
Convert an entire file to UTF-8 (re-decoding everything).
If src_encoding is provided, use it; else auto-detect.
Returns the encoding actually used.
"""
if src_encoding:
with open(src_path, "rb") as fb:
data = fb.read()
try:
text = data.decode(src_encoding, errors="strict")
except LookupError:
print(f"Unknown encoding: {src_encoding}", file=sys.stderr)
sys.exit(6)
except UnicodeDecodeError as e:
print(f"Decoding failed with {src_encoding}: {e}", file=sys.stderr)
sys.exit(7)
used = src_encoding
else:
used, text = detect_encoding_text(src_path)
with open(out_path, "w", encoding="utf-8", newline="") as fw:
fw.write(text)
return used
def verify_utf8_file(path: str) -> Tuple[bool, str]:
try:
with open(path, "rb") as fb:
fb.read().decode("utf-8", errors="strict")
return True, ""
except UnicodeDecodeError as e:
return False, str(e)
# -------------------------
# Targeted single-byte fixer
# -------------------------
def iter_lines_with_offsets(b: bytes):
"""
Yield (line_bytes, line_start_abs_offset). Preserves LF/CRLF/CR in bytes.
"""
start = 0
for i, byte in enumerate(b):
if byte == 0x0A: # LF
yield b[start : i + 1], start
start = i + 1
if start < len(b):
yield b[start:], start
def detect_probable_fallbacks() -> List[str]:
# Good defaults for Western/Portuguese text
return ["cp1252", "iso-8859-1", "iso-8859-15"]
def repair_mixed_utf8_line(line: bytes, base_offset: int, fallback_chain: List[str]):
"""
Strictly validate UTF-8 and fix *only* the exact offending byte when an error occurs.
This avoids touching adjacent valid UTF-8 (prevents mojibake like 'é').
"""
out_fragments: List[str] = []
fixes = []
pos = 0
n = len(line)
while pos < n:
dec = codecs.getincrementaldecoder("utf-8")("strict")
try:
s = dec.decode(line[pos:], final=True)
out_fragments.append(s)
break
except UnicodeDecodeError as e:
# Append the valid prefix before the error
if e.start > 0:
out_fragments.append(
line[pos : pos + e.start].decode("utf-8", errors="strict")
)
bad_index = pos + e.start # absolute index in 'line'
bad_slice = line[bad_index : bad_index + 1] # FIX EXACTLY ONE BYTE
# Decode that single byte using the first working fallback
decoded = None
used_enc = None
for enc in fallback_chain:
try:
decoded = bad_slice.decode(enc, errors="strict")
used_enc = enc
break
except Exception:
continue
if decoded is None:
# latin-1 always succeeds (byte->same code point)
decoded = bad_slice.decode("latin-1")
used_enc = "latin-1 (fallback)"
out_fragments.append(decoded)
# Log the fix
col_1based = bad_index + 1 # byte-based column
fixes.append(
{
"line_base_offset": base_offset,
"line": None, # caller fills line number
"column": col_1based,
"abs_offset": base_offset + bad_index,
"bad_bytes_hex": bad_slice.hex(),
"used_encoding": used_enc,
"replacement_preview": decoded,
}
)
# Advance exactly one byte past the offending byte and continue
pos = bad_index + 1
return "".join(out_fragments), fixes
def targeted_fix_to_utf8(
src_path: str,
out_path: str,
fallback_chain: List[str],
dry_run: bool,
max_fixes: int,
):
with open(src_path, "rb") as fb:
data = fb.read()
total_fixes = 0
repaired_lines: List[str] = []
line_no = 0
max_val = max_fixes if max_fixes != 0 else float("inf")
for line_bytes, base_offset in iter_lines_with_offsets(data):
line_no += 1
# Fast path: keep lines that are already valid UTF-8
try:
repaired_lines.append(line_bytes.decode("utf-8", errors="strict"))
continue
except UnicodeDecodeError:
pass
fixed_text, fixes = repair_mixed_utf8_line(
line_bytes, base_offset, fallback_chain=fallback_chain
)
for f in fixes:
f["line"] = line_no
repaired_lines.append(fixed_text)
# Log fixes
for f in fixes:
total_fixes += 1
print(
f"[FIX {total_fixes}] Line {f['line']}, Column {f['column']}, Abs offset {f['abs_offset']}"
)
print(f" Bad bytes: 0x{f['bad_bytes_hex']}")
print(f" Used encoding: {f['used_encoding']}")
preview = f["replacement_preview"].replace("\r", "\\r").replace("\n", "\\n")
if len(preview) > 40:
preview = preview[:40] + "…"
print(f" Replacement preview: {preview}")
print()
if total_fixes >= max_val:
print(f"Reached max fixes limit ({max_fixes}). Stopping scan.")
break
if total_fixes >= max_val:
break
if dry_run:
print(f"Dry run complete. Detected {total_fixes} fix(es). No file written.")
return total_fixes
# Join and verify result can be encoded to UTF-8
repaired_text = "".join(repaired_lines)
try:
repaired_text.encode("utf-8", errors="strict")
except UnicodeEncodeError as e:
print(f"Internal error: repaired text not valid UTF-8: {e}", file=sys.stderr)
sys.exit(3)
with open(out_path, "w", encoding="utf-8", newline="") as fw:
fw.write(repaired_text)
print(f"Fixed file written to: {out_path}")
print(f"Total fixes applied: {total_fixes}")
return total_fixes
# -------------------------
# CLI
# -------------------------
def main():
ap = argparse.ArgumentParser(
description=(
"Scan for invalid UTF-8; optionally convert whole file or fix only invalid bytes.\n\n"
"By default, --convert and --fix **edit the input file in place** and create a backup "
"named '<input>.bak' before writing. If you pass --output, the original file is left "
"unchanged and no backup is created. Use --dry-run to preview fixes without writing."
),
formatter_class=argparse.RawTextHelpFormatter,
)
ap.add_argument("path", help="Path to the CSV/text file")
ap.add_argument(
"--context",
type=int,
default=20,
help="Bytes of context to show around errors (default: 20)",
)
ap.add_argument(
"--limit",
type=int,
default=100,
help="Max errors to report during scan (0 = unlimited)",
)
ap.add_argument(
"--skip-scan", action="store_true", help="Skip initial scan for speed"
)
# Whole-file convert
ap.add_argument(
"--convert",
action="store_true",
help="Convert entire file to UTF-8 using auto/forced encoding "
"(in-place by default; creates '<input>.bak').",
)
ap.add_argument(
"--encoding",
help="Force source encoding for --convert or first fallback for --fix",
)
ap.add_argument(
"--output",
help="Write to this path instead of in-place (no .bak is created in that case)",
)
# Targeted fix
ap.add_argument(
"--fix",
action="store_true",
help="Fix only invalid byte(s) via fallback encodings "
"(in-place by default; creates '<input>.bak').",
)
ap.add_argument(
"--fallbacks",
help="Comma-separated fallback encodings (default: cp1252,iso-8859-1,iso-8859-15)",
)
ap.add_argument(
"--dry-run",
action="store_true",
help="(fix) Print fixes but do not write or create a .bak",
)
ap.add_argument(
"--max-fixes",
type=int,
default=0,
help="(fix) Stop after N fixes (0 = unlimited)",
)
args = ap.parse_args()
path = args.path
if not os.path.isfile(path):
print(f"File not found: {path}", file=sys.stderr)
sys.exit(2)
# Optional scan first
if not args.skip_scan:
scan_file_for_utf8_errors(path, context=args.context, limit=args.limit)
# Mode selection guards
if args.convert and args.fix:
print("Choose either --convert or --fix (not both).", file=sys.stderr)
sys.exit(9)
if not args.convert and not args.fix and args.skip_scan:
print("No action selected (use --convert or --fix).")
return
if not args.convert and not args.fix:
# User only wanted a scan
return
# Determine output path and backup behavior
# In-place by default: create '<input>.bak' before overwriting.
if args.output:
out_path = args.output
in_place = False
else:
out_path = path
in_place = True
# CONVERT mode
if args.convert:
print("\n[CONVERT MODE] Converting file to UTF-8...")
if in_place:
# Create backup before overwriting original
backup_path = path + ".bak"
shutil.copy2(path, backup_path)
print(f"Backup created: {backup_path}")
used = convert_to_utf8(path, out_path, src_encoding=args.encoding)
print(f"Source encoding used: {used}")
print(f"Saved UTF-8 file as: {out_path}")
ok, err = verify_utf8_file(out_path)
if ok:
print("Verification: output is valid UTF-8 ✅")
else:
print(f"Verification failed: {err}")
sys.exit(8)
return
# FIX mode (targeted, single-byte)
if args.fix:
print("\n[FIX MODE] Fixing only invalid bytes to UTF-8...")
if args.dry_run:
# Dry-run: never write or create backup
out_path_effective = os.devnull
in_place_effective = False
else:
out_path_effective = out_path
in_place_effective = in_place
# Build fallback chain (if --encoding provided, try it first)
if args.fallbacks:
fallback_chain = [e.strip() for e in args.fallbacks.split(",") if e.strip()]
else:
fallback_chain = detect_probable_fallbacks()
if args.encoding and args.encoding not in fallback_chain:
fallback_chain = [args.encoding] + fallback_chain
if in_place_effective:
# Create backup before overwriting original (only when actually writing)
backup_path = path + ".bak"
shutil.copy2(path, backup_path)
print(f"Backup created: {backup_path}")
fix_count = targeted_fix_to_utf8(
path,
out_path_effective,
fallback_chain=fallback_chain,
dry_run=args.dry_run,
max_fixes=args.max_fixes,
)
if not args.dry_run:
ok, err = verify_utf8_file(out_path_effective)
if ok:
print("Verification: output is valid UTF-8 ✅")
print(f"Fix mode completed — {fix_count} byte(s) corrected.")
else:
print(f"Verification failed: {err}")
sys.exit(8)
return
if __name__ == "__main__":
main()


@@ -9,6 +9,7 @@ def _main():
input_csv_file_path = "base_reverse_dns.csv"
base_reverse_dns_map_file_path = "base_reverse_dns_map.csv"
known_unknown_list_file_path = "known_unknown_base_reverse_dns.txt"
psl_overrides_file_path = "psl_overrides.txt"
output_csv_file_path = "unknown_base_reverse_dns.csv"
csv_headers = ["source_name", "message_count"]
@@ -23,6 +24,7 @@ def _main():
input_csv_file_path,
base_reverse_dns_map_file_path,
known_unknown_list_file_path,
psl_overrides_file_path,
]:
if not os.path.exists(p):
logger.error(f"{p} does not exist")
@@ -38,6 +40,17 @@ def _main():
)
else:
known_unknown_domains.append(domain)
logger.info(f"Loading {psl_overrides_file_path}")
psl_overrides = []
with open(psl_overrides_file_path) as f:
for line in f.readlines():
domain = line.lower().strip()
if domain in psl_overrides:
logger.warning(
f"{domain} is in {psl_overrides_file_path} multiple times"
)
else:
psl_overrides.append(domain)
logger.info(f"Loading {base_reverse_dns_map_file_path}")
known_domains = []
with open(base_reverse_dns_map_file_path) as f:
@@ -52,13 +65,20 @@ def _main():
if domain in known_unknown_domains and known_domains:
pass
logger.warning(
f"{domain} is in {known_unknown_list_file_path} and {base_reverse_dns_map_file_path}"
f"{domain} is in {known_unknown_list_file_path} and \
{base_reverse_dns_map_file_path}"
)
logger.info(f"Checking domains against {base_reverse_dns_map_file_path}")
with open(input_csv_file_path) as f:
for row in csv.DictReader(f):
domain = row["source_name"].lower().strip()
if domain == "":
continue
for psl_domain in psl_overrides:
if domain.endswith(psl_domain):
domain = psl_domain
break
if domain not in known_domains and domain not in known_unknown_domains:
logger.info(f"New unknown domain found: {domain}")
output_rows.append(row)
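The new `psl_overrides.txt` handling above collapses any source domain that ends with one of the listed suffixes down to that suffix before the known/unknown comparison. A standalone sketch of that normalization, using suffixes copied from the psl_overrides.txt added later in this compare:

```python
# Suffixes taken from the new psl_overrides.txt (subset shown)
psl_overrides = ["amazonaws.com", "cloudaccess.net", "plesk.page"]


def normalize_domain(domain):
    """Collapse a domain to its override suffix, mirroring the loop above."""
    domain = domain.lower().strip()
    for psl_domain in psl_overrides:
        if domain.endswith(psl_domain):
            return psl_domain
    return domain


# Per-tenant hostnames are grouped under one base domain instead of each
# showing up as a separate "unknown" entry.
assert normalize_domain("compute-1.amazonaws.com") == "amazonaws.com"
```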


@@ -1,125 +1,309 @@
185.in-addr.arpa
190.in-addr.arpa
200.in-addr.arpa
9services.com
a7e.ru
a94434500-blog.com
abv-10.top
adlucrumnewsletter.com
admin.corpivensa.gob.ve
advantageiq.com
advrider.ro
aerospacevitro.us.com
agenturserver.de
aghories.com
ai270.net
albagroup-eg.com
alchemy.net
anchorfundhub.com
anglishment.com
anteldata.net.uy
antis.edu
antonaoll.com
anviklass.org
anwrgrp.lat
aosau.net
arandomserver.com
aransk.ru
ardcs.cn
as29550.net
asmecam.it
aurelienvos.com
automatech.lat
avistaadvantage.com
b8sales.com
bearandbullmarketnews.com
bestinvestingtime.com
biocorp.com
bisno1.co.jp
bitter-echo.com
blguss.com
bluenet.ch
bluhosting.com
bodiax.pp.ua
bost-law.com
brainity.com
brnonet.cz
brushinglegal.de
brw.net
budgeteasehub.com
buoytoys.com
cashflowmasterypro.com
cavabeen.com
cbti.net
checkpox.fun
chegouseuvlache.org
christus.mx
ckaik.cn
cloud-edm.com
cloudaccess.net
cloudflare-email.org
cloudhosting.rs
cloudlogin.co
cnode.io
code-it.net
colombiaceropapel.org
commerceinsurance.com
coolblaze.com
coowo.com
corpemail.net
cp2-myorderbox.com
cps.com.ar
ctla.co.kr
dastans.ru
datahost36.de
descarca-counter-strike.net
detrot.xyz
digi.net.my
dinofelis.cn
diwkyncbi.top
dkginternet.com
dns-oid.com
domconfig.com
doorsrv.com
dreampox.fun
dreamtechmedia.com
ds.network
dvj.theworkpc.com
dwlcka.com
dyntcorp.com
economiceagles.com
egosimail.com
emailgids.net
emailperegrine.com
entretothom.net
epsilon-group.com
erestaff.com
example.com
exposervers.com-new
eyecandyhosting.xyz
fetscorp.shop
fin-crime.com
financeaimpoint.com
financeupward.com
flex-video.bnr.la
formicidaehunt.net
fosterheap.com
frontiernet.net
gendns.com
getthatroi.com
gigidea.net
giize.com
ginous.eu.com
gist-th.com
gophermedia.com
gqlists.us.com
gratzl.de
greatestworldnews.com
greennutritioncare.com
h-serv.co.uk
hgnbroken.us.com
hosting1337.com
hostinglotus.cloud
hostingmichigan.com
hostiran.name
hostmnl.com
hostname.localhost
hostnetwork.com
hosts.net.nz
hostwhitelabel.com
hpms1.jp
hypericine.com
i-mecca.net
iaasdns.com
iam.net.ma
iconmarketingguy.com
idcfcloud.net
idealconcept.live
igppevents.org.uk
imjtmn.cn
immenzaces.com
inshaaegypt.com
ip-147-135-108.us
ip-178-33-109.eu
ip-ptr.tech
itsidc.com
itwebs.com
ivol.co
jalanet.co.id
jimishare.com
jumanra.org
kahlaa.com
kbronet.com.tw
kdnursing.org
kihy.theworkpc.com
kitchenaildbd.com
layerdns.cloud
legenditds.com
lighthouse-media.com
listertermoformadoa.com
llsend.com
lohkal.com
lonestarmm.net
longwoodmgmt.com
lwl-puehringer.at
lynx.net.lb
magnetmail.net
mail-fire.com
manhattanbulletpoint.com
manpowerservices.com
marketmysterycode.com
marketwizardspro.com
masterclassjournal.com
matroguel.cam
maximpactipo.com
mechanicalwalk.store
mediavobis.com
misorpresa.com
moderntradingnews.com
moonjaws.com
morningnewscatcher.com
motion4ever.net
mschosting.com
msdp1.com
mspnet.pro
mts-nn.ru
mxserver.ro
mxthunder.net
myrewards.net
mysagestore.com
mysecurewebserver.com
myvps.jp
name.tools
nanshenqfurniture.com
nask.pl
ncport.ru
ncsdi.ws
nebdig.com
neovet-base.ru
netbri.com
netkl.org
newinvestingguide.com
newwallstreetcode.com
ngvcv.cn
nic.name
nidix.net
nlscanme.com
nmeuh.cn
noisndametal.com
nwo.giize.com
offerslatedeals.com
office365.us
ogicom.net
omegabrasil.inf.br
onnet21.com
oppt-ac.fit
orbitel.net.co
ovaltinalization.co
overta.ru
panaltyspot.space
passionatesmiles.com
perimetercenter.net
permanentscreen.com
phdns3.es
planethoster.net
plesk.page
pmnhost.net
pokupki5.ru
popiup.com
ports.net
prima.com.ar
prima.net.ar
prohealthmotion.com
proudserver.com
psnm.ru
pvcwindowsprices.live
qontenciplc.autos
quatthonggiotico.com
rapidns.com
raxa.host
reliablepanel.com
rgb365.eu
riddlecamera.net
rwdhosting.ca
s500host.com
sahacker-2020.com
samsales.site
saransk.ru
satirogluet.com
securednshost.com
seaspraymta3.net
secorp.mx
securen.net
securerelay.in
securev.net
servershost.biz
silvestrejaguar.sbs
silvestreonca.sbs
siriuscloud.jp
sisglobalresearch.com
smallvillages.com
smartape-vps.com
solusoftware.com
spiritualtechnologies.io
sprout.org
stableserver.net
stockepictigers.com
stockexchangejournal.com
suksangroup.com
sysop4.com
system.eu.com
szhongbing.com
t-jon.com
tecnoxia.net
tel-xyz.fit
tenkids.net
terminavalley.com
thaicloudsolutions.com
thaimonster.com
thepushcase.com
totaal.net
traderlearningcenter.com
tullostrucking.com
ultragate.com
unite.services
urawasl.com
us.servername.us
varvia.de
vendimetry.com
vibrantwellnesscorp.com
viviotech.us
vlflgl.com
volganet.ru
wallstreetsgossip.com
wealthexpertisepro.com
web-login.eu
weblinkinternational.com
webnox.io
welllivinghive.com
wisdomhard.com
wisewealthcircle.com
wsiph2.com
xnt.mx
xpnuf.cn
xsfati.us.com
xspmail.jp
yourciviccompass.com
yourinvestworkbook.com
yoursitesecure.net
zerowebhosting.net
znlc.jp
ztomy.com


@@ -0,0 +1,6 @@
akura.ne.jp
amazonaws.com
cloudaccess.net
h-serv.co.uk
linode.com
plesk.page


@@ -0,0 +1,59 @@
#!/usr/bin/env python3
import os
import csv
maps_dir = os.path.join(".")
map_files = ["base_reverse_dns_map.csv"]
list_files = ["known_unknown_base_reverse_dns.txt", "psl_overrides.txt"]
def sort_csv(filepath, column=0):
with open(filepath, mode="r", newline="") as infile:
reader = csv.reader(infile)
header = next(reader)
sorted_rows = sorted(reader, key=lambda row: row[column])
existing_values = []
for row in sorted_rows:
if row[column] in existing_values:
print(f"Warning: {row[column]} is in {filepath} multiple times")
with open(filepath, mode="w", newline="\n") as outfile:
writer = csv.writer(outfile)
writer.writerow(header)
writer.writerows(sorted_rows)
def sort_list_file(
filepath,
lowercase=True,
strip=True,
deduplicate=True,
remove_blank_lines=True,
ending_newline=True,
newline="\n",
):
with open(filepath, mode="r", newline=newline) as infile:
lines = infile.readlines()
for i in range(len(lines)):
if lowercase:
lines[i] = lines[i].lower()
if strip:
lines[i] = lines[i].strip()
if deduplicate:
lines = list(set(lines))
if remove_blank_lines:
while "" in lines:
lines.remove("")
lines = sorted(lines)
if ending_newline:
if lines[-1] != "":
lines.append("")
with open(filepath, mode="w", newline=newline) as outfile:
outfile.write("\n".join(lines))
for csv_file in map_files:
sort_csv(os.path.join(maps_dir, csv_file))
for list_file in list_files:
sort_list_file(os.path.join(maps_dir, list_file))


@@ -5,7 +5,7 @@ import json
import urllib3
import requests
from parsedmarc import __version__
from parsedmarc.constants import USER_AGENT
from parsedmarc.log import logger
from parsedmarc.utils import human_timestamp_to_unix_timestamp
@@ -51,7 +51,7 @@ class HECClient(object):
self._common_data = dict(host=self.host, source=self.source, index=self.index)
self.session.headers = {
"User-Agent": "parsedmarc/{0}".format(__version__),
"User-Agent": USER_AGENT,
"Authorization": "Splunk {0}".format(self.access_token),
}


@@ -37,7 +37,7 @@ import requests
from parsedmarc.log import logger
import parsedmarc.resources.dbip
import parsedmarc.resources.maps
from parsedmarc.constants import USER_AGENT
parenthesis_regex = re.compile(r"\s*\(.*\)\s*")
@@ -345,7 +345,8 @@ def get_service_from_reverse_dns_base_domain(
if not (offline or always_use_local_file) and len(reverse_dns_map) == 0:
try:
logger.debug(f"Trying to fetch reverse DNS map from {url}...")
response = requests.get(url)
headers = {"User-Agent": USER_AGENT}
response = requests.get(url, headers=headers)
response.raise_for_status()
csv_file.write(response.text)
csv_file.seek(0)
@@ -355,6 +356,7 @@ def get_service_from_reverse_dns_base_domain(
except Exception:
logger.warning("Not a valid CSV file")
csv_file.seek(0)
logging.debug("Response body:")
logger.debug(csv_file.read())
if len(reverse_dns_map) == 0:


@@ -1,6 +1,7 @@
import requests
from parsedmarc import logger
from parsedmarc.constants import USER_AGENT
class WebhookClient(object):
@@ -21,7 +22,7 @@ class WebhookClient(object):
self.timeout = timeout
self.session = requests.Session()
self.session.headers = {
"User-Agent": "parsedmarc",
"User-Agent": USER_AGENT,
"Content-Type": "application/json",
}


@@ -76,9 +76,20 @@ parsedmarc = "parsedmarc.cli:_main"
Homepage = "https://domainaware.github.io/parsedmarc"
[tool.hatch.version]
path = "parsedmarc/__init__.py"
path = "parsedmarc/constants.py"
[tool.hatch.build.targets.sdist]
include = [
"/parsedmarc",
]
[tool.hatch.build]
exclude = [
"base_reverse_dns.csv",
"find_bad_utf8.py",
"find_unknown_base_reverse_dns.py",
"unknown_base_reverse_dns.csv",
"sortmaps.py",
"README.md",
"*.bak"
]


@@ -1,25 +0,0 @@
#!/usr/bin/env python3
import os
import glob
import csv
maps_dir = os.path.join("parsedmarc", "resources", "maps")
csv_files = glob.glob(os.path.join(maps_dir, "*.csv"))
def sort_csv(filepath, column=0):
with open(filepath, mode="r", newline="") as infile:
reader = csv.reader(infile)
header = next(reader)
sorted_rows = sorted(reader, key=lambda row: row[column])
with open(filepath, mode="w", newline="\n") as outfile:
writer = csv.writer(outfile)
writer.writerow(header)
writer.writerows(sorted_rows)
for csv_file in csv_files:
sort_csv(csv_file)


@@ -0,0 +1,107 @@
<form version="1.1" theme="dark">
<label>SMTP TLS Reporting</label>
<fieldset submitButton="false" autoRun="true">
<input type="time" token="time">
<label></label>
<default>
<earliest>-7d@h</earliest>
<latest>now</latest>
</default>
</input>
<input type="text" token="organization_name" searchWhenChanged="true">
<label>Organization name</label>
<default>*</default>
<initialValue>*</initialValue>
</input>
<input type="text" token="policy_domain">
<label>Policy domain</label>
<default>*</default>
<initialValue>*</initialValue>
</input>
<input type="dropdown" token="policy_type" searchWhenChanged="true">
<label>Policy type</label>
<choice value="*">Any</choice>
<choice value="tlsa">tlsa</choice>
<choice value="sts">sts</choice>
<choice value="no-policy-found">no-policy-found</choice>
<default>*</default>
<initialValue>*</initialValue>
</input>
</fieldset>
<row>
<panel>
<title>Reporting organizations</title>
<table>
<search>
<query>index=email sourcetype=smtp:tls organization_name=$organization_name$ policies{}.policy_domain=$policy_domain$
| rename policies{}.policy_domain as policy_domain
| rename policies{}.policy_type as policy_type
| rename policies{}.failed_session_count as failed_sessions
| rename policies{}.failure_details{}.failed_session_count as failed_sessions
| rename policies{}.successful_session_count as successful_sessions
| rename policies{}.failure_details{}.sending_mta_ip as sending_mta_ip
| rename policies{}.failure_details{}.receiving_ip as receiving_ip
| rename policies{}.failure_details{}.receiving_mx_hostname as receiving_mx_hostname
| rename policies{}.failure_details{}.result_type as failure_type
| fillnull value=0 failed_sessions
| stats sum(failed_sessions) as failed_sessions sum(successful_sessions) as successful_sessions by organization_name
| sort -successful_sessions 0</query>
<earliest>$time.earliest$</earliest>
<latest>$time.latest$</latest>
</search>
<option name="drilldown">none</option>
<option name="refresh.display">progressbar</option>
</table>
</panel>
<panel>
<title>Domains</title>
<table>
<search>
<query>index=email sourcetype=smtp:tls organization_name=$organization_name$ policies{}.policy_domain=$policy_domain$
| rename policies{}.policy_domain as policy_domain
| rename policies{}.policy_type as policy_type
| rename policies{}.failed_session_count as failed_sessions
| rename policies{}.failure_details{}.failed_session_count as failed_sessions
| rename policies{}.successful_session_count as successful_sessions
| rename policies{}.failure_details{}.sending_mta_ip as sending_mta_ip
| rename policies{}.failure_details{}.receiving_ip as receiving_ip
| rename policies{}.failure_details{}.receiving_mx_hostname as receiving_mx_hostname
| rename policies{}.failure_details{}.result_type as failure_type
| fillnull value=0 failed_sessions
| stats sum(failed_sessions) as failed_sessions sum(successful_sessions) as successful_sessions by policy_domain
| sort -successful_sessions 0</query>
<earliest>$time.earliest$</earliest>
<latest>$time.latest$</latest>
</search>
<option name="drilldown">none</option>
<option name="refresh.display">progressbar</option>
</table>
</panel>
</row>
<row>
<panel>
<title>Failure details</title>
<table>
<search>
<query>index=email sourcetype=smtp:tls organization_name=$organization_name$ policies{}.policy_domain=$policy_domain$ policies{}.failure_details{}.result_type=*
| rename policies{}.policy_domain as policy_domain
| rename policies{}.policy_type as policy_type
| rename policies{}.failed_session_count as failed_sessions
| rename policies{}.failure_details{}.failed_session_count as failed_sessions
| rename policies{}.successful_session_count as successful_sessions
| rename policies{}.failure_details{}.sending_mta_ip as sending_mta_ip
| rename policies{}.failure_details{}.receiving_ip as receiving_ip
| rename policies{}.failure_details{}.receiving_mx_hostname as receiving_mx_hostname
| fillnull value=0 failed_sessions
| rename policies{}.failure_details{}.result_type as failure_type
| table _time organization_name policy_domain policy_type failed_sessions successful_sessions sending_mta_ip receiving_ip receiving_mx_hostname failure_type
| sort by -_time 0</query>
<earliest>$time.earliest$</earliest>
<latest>$time.latest$</latest>
</search>
<option name="drilldown">none</option>
<option name="refresh.display">progressbar</option>
</table>
</panel>
</row>
</form>