Map-data build hygiene: README single source of truth, drop maintainer scripts from wheel (9.11.2) (#768)

* Drop base_reverse_dns_types.txt; sortlists.py now reads types from README.md

The .txt file duplicated the README's industry list and introduced drift risk: twice in the project's history we had to add types to the .txt only because the README had been updated independently. Make the README the single source of truth.

- Add `<!-- types-list:start -->` / `<!-- types-list:end -->` HTML comment markers around the bullet list in parsedmarc/resources/maps/README.md. Markers don't render in GitHub's preview.
- New `load_types_from_readme()` in sortlists.py parses the bullet items between the markers and returns them. Errors clearly if the README is missing or the markers are absent.
- Delete base_reverse_dns_types.txt.
- Fix a pre-existing typo in README precedence rule 4: `Web Hosting` → `Web Host` (matches the canonical type used in 4,176 map rows).

Smoke test: feeding a row with a bogus type still triggers the validator (`'NotARealType' is not an allowed value for 'type'`), confirming the README-derived list flows through identically.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* sortlists.py: normalize README types-list block in place

Before validating the map, the validator now sorts the `<!-- types-list:start -->` / `<!-- types-list:end -->` block in README.md alphabetically (case-insensitively), trims leading and trailing whitespace from each item, and deduplicates case-insensitively, rewriting the README in place if any of those need fixing. Errors clearly when two entries differ only by casing (which would otherwise silently lose one).

Adding a new category is now just inserting a `- New Type` line anywhere inside the markers; `sortlists.py` will tidy it on the next run. Same shape as how the validator already normalizes known_unknown_base_reverse_dns.txt and psl_overrides.txt.

The pure read path is preserved as `load_types_from_readme()` for callers that don't want a side-effecting rewrite (tests, downstream tooling).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* Stop shipping maintainer scripts; bump to 9.11.2

The exclude list in [tool.hatch.build] was originally meant to keep maintainer-only batch tooling under parsedmarc/resources/maps/ out of the wheel and sdist (it lists `find_bad_utf8.py`, `find_unknown_base_reverse_dns.py`, the renamed-and-removed `sortmaps.py`). The list never grew when new tools were added, so `collect_domain_info.py`, `classify_unknown_domains.py`, `detect_psl_overrides.py`, `detect_rebrands.py`, and `sortlists.py` all started shipping in distributions despite contributing nothing to runtime functionality.

Replace the per-file basename list with a single glob pattern:

    parsedmarc/resources/maps/[!_]*.py

The leading-`_` exception keeps `__init__.py` shipping (required so that `importlib.resources.files(parsedmarc.resources.maps)` can locate the bundled CSV/TXT data files), while excluding any other .py file under that directory, including future maintainer scripts that haven't been written yet.

Drop the now-redundant per-file entries from the exclude list: `find_bad_utf8.py`, `find_unknown_base_reverse_dns.py`, and the already-removed `sortmaps.py`. The non-.py exclusions stay (`base_reverse_dns.csv`, `unknown_base_reverse_dns.csv`, `README.md`, `*.bak`).

Verified with `hatch build`:

- Wheel under parsedmarc/resources/maps/: __init__.py + 3 data files (CSV/TXTs), no maintainer .py
- sdist matches
- Clean-venv install of the built wheel loads 298 PSL overrides and `get_base_domain('host01.netlify.app')` returns `netlify.app`

Bump to 9.11.2 since this changes shipped artifacts.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Sean Whalen <seanthegeek@users.noreply.github.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-20 19:05:24 +00:00 · 2026-05-08 12:36:48 -04:00
parent 053195581b
commit ff6f75d740
6 changed files with 111 additions and 64 deletions
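
The normalization described in the commit message (trim, case-insensitive dedupe, case-insensitive sort, error on case conflicts) can be illustrated with a small in-memory sketch. This is not the code from the commit: `normalize_block` is a hypothetical name, it works on a string rather than rewriting README.md on disk, and its error messages are simplified. Only the marker regex is copied from `_TYPES_LIST_RE` in the diff below.

```python
import re
from typing import Dict, List

# Same marker pattern as `_TYPES_LIST_RE` in the diff.
MARKERS = re.compile(
    r"<!--\s*types-list:start\s*-->(.*?)<!--\s*types-list:end\s*-->",
    re.DOTALL,
)


def normalize_block(text: str) -> str:
    """Hypothetical helper: return `text` with the types-list block tidied."""
    m = MARKERS.search(text)
    if not m:
        raise ValueError("missing types-list markers")
    seen: Dict[str, str] = {}  # lowercase key -> first-seen spelling
    for line in m.group(1).splitlines():
        item = line.strip()
        if not item:
            continue  # blank lines inside the block are ignored
        if not item.startswith("- "):
            raise ValueError(f"unexpected line: {line!r}")
        name = item[2:].strip()
        key = name.lower()
        if key in seen and seen[key] != name:
            # Entries that differ only by casing are rejected, not merged.
            raise ValueError(f"case conflict: {seen[key]!r} vs {name!r}")
        seen.setdefault(key, name)
    items: List[str] = sorted(seen.values(), key=str.lower)
    block = "\n".join(f"- {t}" for t in items)
    return (
        text[: m.start()]
        + f"<!-- types-list:start -->\n{block}\n<!-- types-list:end -->"
        + text[m.end():]
    )


readme = (
    "intro\n"
    "<!-- types-list:start -->\n"
    "- Web Host\n"
    "-   ISP  \n"
    "- Finance\n"
    "- ISP\n"
    "<!-- types-list:end -->\n"
    "outro\n"
)
result = normalize_block(readme)
print(result)
```

In this sketch the untidy `-   ISP  ` line and the exact duplicate `- ISP` collapse into one entry, and the block comes back sorted as Finance, ISP, Web Host, with the text outside the markers untouched.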
@@ -1,5 +1,12 @@
 # Changelog

+## 9.11.2
+
+### Changes
+
+- **`base_reverse_dns_types.txt` removed; `sortlists.py` now reads the authoritative `type` list directly from `parsedmarc/resources/maps/README.md`.** The README's industry list (between new `<!-- types-list:start -->` / `<!-- types-list:end -->` HTML-comment markers) is now the single source of truth, eliminating the drift risk between the data file and the documented list. Before validating the map, `sortlists.py` also normalizes the README block in place: trims whitespace, deduplicates case-insensitively (errors on case-conflicting entries), and sorts entries alphabetically — so adding a new type is just inserting a `- New Type` line anywhere inside the markers. Also fixes a pre-existing typo in the precedence rules where rule 4 said `Web Hosting` but the canonical type used in 4,176 map rows is `Web Host`.
+- **Maintenance tooling no longer ships in the wheel/sdist.** The Python scripts under `parsedmarc/resources/maps/` (`collect_domain_info.py`, `classify_unknown_domains.py`, `detect_psl_overrides.py`, `detect_rebrands.py`, `sortlists.py`, plus the already-excluded `find_bad_utf8.py` and `find_unknown_base_reverse_dns.py`) are maintainer-only batch tooling, not parsedmarc runtime code. They have always been in the repository for convenience but were unnecessarily included in distributions, drawing reviewer attention while contributing nothing to end-user functionality. The build now excludes any `.py` file under `parsedmarc/resources/maps/` whose name doesn't start with an underscore via a single glob pattern (`parsedmarc/resources/maps/[!_]*.py`), so future maintainer scripts added to that directory are excluded automatically. The directory's `__init__.py` and the runtime data files (`base_reverse_dns_map.csv`, `known_unknown_base_reverse_dns.txt`, `psl_overrides.txt`) continue to ship because they're loaded at runtime via `importlib.resources.files(parsedmarc.resources.maps)`.
+
 ## 9.11.1

 ### Fixed
@@ -1,4 +1,4 @@
-__version__ = "9.11.1"
+__version__ = "9.11.2"

 USER_AGENT = f"parsedmarc/{__version__}"

@@ -19,11 +19,12 @@ The `service_type` is based on the following rule precedence:
 1. All email security services are identified as `Email Security`, no matter how or where they are hosted.
 2. All marketing services are identified as `Marketing`, no matter how or where they are hosted.
 3. All telecommunications providers that offer internet access are identified as `ISP`, even if they also offer other services, such as web hosting or email hosting.
-4. All web hosting providers are identified as `Web Hosting`, even if the service also offers email hosting.
+4. All web hosting providers are identified as `Web Host`, even if the service also offers email hosting.
 5. All email account providers are identified as `Email Provider`, no matter how or where they are hosted
 6. All legitimate platforms offering their Software as a Service (SaaS) are identified as `SaaS`, regardless of industry. This helps simplify metrics.
 7. All other senders that use their own domain as a Reverse DNS base domain should be identified based on their industry

+<!-- types-list:start -->
 - Agriculture
 - Automotive
 - Beauty
@@ -70,6 +71,9 @@ The `service_type` is based on the following rule precedence:
 - Travel
 - Utilities
 - Web Host
+<!-- types-list:end -->
+
+The list above is the authoritative set of allowed `type` values; `sortlists.py` parses the bullet items between the `<!-- types-list:start -->` and `<!-- types-list:end -->` HTML comment markers and uses them to validate every row's `type` column. Before validating the map, it also normalizes the block in place: trims whitespace, deduplicates case-insensitively, and sorts the entries alphabetically — so adding a new type is just a matter of inserting a `- New Type` line anywhere inside the markers, and `sortlists.py` will tidy it on the next run. Keep the markers themselves intact when editing.

 The file currently contains over 5,000 mappings from a wide variety of email sending sources.

@@ -97,10 +101,6 @@ A CSV with the fields `source_name` and optionally `message_count`. This CSV can

 A CSV file with the fields `source_name` and `message_count`. This file is not tracked by Git.

-## base_reverse_dns_types.txt
-
-A plaintext list (one per line) of the allowed `type` values. Should match the industry list in this README; used by `sortlists.py` as the authoritative set for validation.
-
 ## psl_overrides.txt

 A plaintext list of reverse-DNS suffixes used to fold noisy subdomain patterns down to a single base. Each line is a suffix with an optional leading separator:
@@ -181,4 +181,4 @@ The output of `detect_rebrands.py`. Tab-separated, one row per flagged map key.

 ## sortlists.py

-Validation and sorting helper invoked as a module. Alphabetically sorts `base_reverse_dns_map.csv` (case-insensitive by first column, preserving CRLF line endings), deduplicates entries, validates that every `type` appears in `base_reverse_dns_types.txt`, and warns on names that contain unescaped commas or stray whitespace. Run it after any batch merge before committing.
+Validation and sorting helper invoked as a module. Alphabetically sorts `base_reverse_dns_map.csv` (case-insensitive by first column, preserving CRLF line endings), deduplicates entries, validates that every `type` appears in this README's authoritative type list (parsed from the `<!-- types-list:start -->` / `<!-- types-list:end -->` block above), and warns on names that contain unescaped commas or stray whitespace. Run it after any batch merge before committing.
@@ -1,46 +0,0 @@
-Agriculture
-Automotive
-Beauty
-Conglomerate
-Construction
-Consulting
-Defense
-Education
-Email Provider
-Email Security
-Entertainment
-Event Planning
-Finance
-Food
-Government
-Government Media
-Healthcare
-ISP
-IaaS
-Industrial
-Legal
-Logistics
-MSP
-MSSP
-Manufacturing
-Marketing
-News
-Nonprofit
-PaaS
-Photography
-Physical Security
-Print
-Publishing
-Real Estate
-Religion
-Retail
-SaaS
-Science
-Search Engine
-Social Media
-Sports
-Staffing
-Technology
-Travel
-Utilities
-Web Host
@@ -4,10 +4,93 @@ from __future__ import annotations

 import os
 import csv
+import re
 from pathlib import Path
 from typing import Mapping, Iterable, Optional, Collection, Union, List, Dict


+_TYPES_LIST_RE = re.compile(
+    r"<!--\s*types-list:start\s*-->(.*?)<!--\s*types-list:end\s*-->",
+    re.DOTALL,
+)
+
+
+def _parse_types_block(block: str, source: str) -> List[str]:
+    """Extract type names from the raw text between the marker comments."""
+    types: List[str] = []
+    for line in block.splitlines():
+        stripped = line.strip()
+        if not stripped:
+            continue
+        if not stripped.startswith("- "):
+            raise ValueError(
+                f"{source}: unexpected line inside types-list block: {line!r}"
+            )
+        types.append(stripped[2:].strip())
+    return types
+
+
+def normalize_types_in_readme(readme_path: Union[str, Path]) -> List[str]:
+    """Validate, normalize, and load the authoritative `type` list from README.md.
+
+    Trims leading/trailing whitespace from each item, deduplicates
+    case-insensitively (preserving first-seen casing), and sorts the list
+    case-insensitively. If the on-disk list differs from the normalized
+    form, the README is rewritten in place. Returns the normalized list.
+
+    Raises ValueError if the markers are missing, the block is empty, a
+    line doesn't start with `- `, or two entries differ only by casing.
+    """
+    path = Path(readme_path)
+    text = path.read_text(encoding="utf-8")
+    m = _TYPES_LIST_RE.search(text)
+    if not m:
+        raise ValueError(
+            f"{path}: missing <!-- types-list:start --> / <!-- types-list:end --> markers"
+        )
+    raw_types = _parse_types_block(m.group(1), str(path))
+    if not raw_types:
+        raise ValueError(f"{path}: types-list block is empty")
+
+    seen: Dict[str, str] = {}
+    for t in raw_types:
+        key = t.lower()
+        if key in seen and seen[key] != t:
+            raise ValueError(
+                f"{path}: types-list contains case-conflicting entries: "
+                f"{seen[key]!r} and {t!r}"
+            )
+        seen.setdefault(key, t)
+    normalized = sorted(seen.values(), key=str.lower)
+
+    if normalized != raw_types:
+        new_block = "\n".join(f"- {t}" for t in normalized)
+        replacement = f"<!-- types-list:start -->\n{new_block}\n<!-- types-list:end -->"
+        new_text = text[: m.start()] + replacement + text[m.end() :]
+        path.write_text(new_text, encoding="utf-8")
+    return normalized
+
+
+def load_types_from_readme(readme_path: Union[str, Path]) -> List[str]:
+    """Read the authoritative `type` list out of README.md without rewriting.
+
+    Use `normalize_types_in_readme` to additionally sort, dedupe, and
+    rewrite the block in place. This thin wrapper is kept for callers
+    that only want to read the list (e.g. tests, downstream tools).
+    """
+    path = Path(readme_path)
+    text = path.read_text(encoding="utf-8")
+    m = _TYPES_LIST_RE.search(text)
+    if not m:
+        raise ValueError(
+            f"{path}: missing <!-- types-list:start --> / <!-- types-list:end --> markers"
+        )
+    types = _parse_types_block(m.group(1), str(path))
+    if not types:
+        raise ValueError(f"{path}: types-list block is empty")
+    return types
+
+
 class CSVValidationError(Exception):
    def __init__(self, errors: list[str]):
        super().__init__("\n".join(errors))
@@ -153,10 +236,16 @@ def _main():
    map_file = "base_reverse_dns_map.csv"
    map_key = "base_reverse_dns"
    list_files = ["known_unknown_base_reverse_dns.txt", "psl_overrides.txt"]
-    types_file = "base_reverse_dns_types.txt"
+    readme_file = "README.md"

-    with open(types_file) as f:
-        types = [line.strip() for line in f if line.strip()]
+    if not os.path.exists(readme_file):
+        print(f"Error: {readme_file} does not exist")
+        exit(1)
+    try:
+        types = normalize_types_in_readme(readme_file)
+    except ValueError as e:
+        print(f"Error: {e}")
+        exit(1)

    map_allowed_values = {"type": types}

@@ -165,10 +254,6 @@ def _main():
            print(f"Error: {list_file} does not exist")
            exit(1)
        sort_list_file(list_file)
-    if not os.path.exists(types_file):
-        print(f"Error: {types_file} does not exist")
-        exit(1)
-    sort_list_file(types_file, lowercase=False)
    if not os.path.exists(map_file):
        print(f"Error: {map_file} does not exist")
        exit(1)
@@ -88,10 +88,11 @@ include = [
 [tool.hatch.build]
 exclude = [
    "base_reverse_dns.csv",
-    "find_bad_utf8.py",
-    "find_unknown_base_reverse_dns.py",
    "unknown_base_reverse_dns.csv",
-    "sortmaps.py",
    "README.md",
-    "*.bak"
+    "*.bak",
+    # Maintenance tooling: any Python file under parsedmarc/resources/maps/
+    # whose name doesn't start with `_` (i.e. everything except __init__.py,
+    # which must keep shipping for `importlib.resources.files()` lookups).
+    "parsedmarc/resources/maps/[!_]*.py",
 ]
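
To see which files the `[!_]*.py` part of the new exclude pattern is meant to catch, here is a rough check using Python's `fnmatch` on basenames. This is an approximation only: Hatch is documented as using Git-style glob patterns, whereas `fnmatch` is a different matcher, so the sketch illustrates the negated character class `[!_]` rather than Hatch's actual behavior. The file list is taken from names mentioned in the changelog entry above.

```python
from fnmatch import fnmatchcase

# Basename portion of the exclude pattern from pyproject.toml.
PATTERN = "[!_]*.py"


def is_excluded(basename: str) -> bool:
    """Approximate check: does this basename match the exclude glob?"""
    return fnmatchcase(basename, PATTERN)


files = [
    "__init__.py",
    "sortlists.py",
    "detect_rebrands.py",
    "base_reverse_dns_map.csv",
    "psl_overrides.txt",
]
excluded = [f for f in files if is_excluded(f)]
kept = [f for f in files if not is_excluded(f)]
print("excluded:", excluded)
print("kept:", kept)
```

Under these assumptions, `__init__.py` is kept because its first character is an underscore, the data files are kept because they do not end in `.py`, and the two maintainer scripts are excluded.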