mirror of
https://github.com/domainaware/parsedmarc.git
synced 2026-05-18 09:55:24 +00:00
Clarify handling of TLS errors and user network issues in classification guidelines
This commit is contained in:
@@ -164,6 +164,8 @@ When `unknown_base_reverse_dns.csv` has new entries, follow this order rather th
|
||||
|
||||
4. **Classify from the TSV, not by re-fetching.** Feed the TSV to an LLM classifier (or skim it by hand). One pass over a ~200-byte-per-domain summary is roughly an order of magnitude cheaper than spawning research sub-agents that each run their own `whois`/WebFetch loop — observed: ~227k tokens per 186-domain sub-agent vs. a few tens of k total for the TSV pass.
|
||||
|
||||
**A self-signed-certificate or TLS-handshake error in the homepage column is not necessarily a property of the domain.** It can equally be the user's firewall or a TLS-intercepting proxy reissuing certs for outbound traffic, in which case *every* domain in the TSV will look broken in the same way. Same for a sweep of DNS-resolution failures. Before treating those rows as unclassifiable, **ask the user** whether their network is filtering DNS / HTTPS — if it is, the fetch failures carry no signal about the domains and you should not flag them as unreachable.
|
||||
|
||||
5. **IP-WHOIS identifies the hosting network, not the domain's operator.** Do not classify a domain as company X just because its A/AAAA record points into X's IP space. The hosting netname tells you who operates the machines; it tells you nothing about who operates the domain. **Only trust the IP-WHOIS signal when the domain name itself matches the host's name** — e.g. a domain `foohost.com` sitting on a netname like `FOOHOST-NET` corroborates its own identity; `random.com` sitting on `CLOUDFLARENET` tells you nothing. When the homepage and domain-WHOIS are both empty, don't reach for the IP signal to fill the gap — skip the domain and record it as known-unknown instead.
|
||||
|
||||
**Known exception — OVH's numeric reverse-DNS pattern.** OVH publishes reverse-DNS names like `ip-A-B-C.us` / `ip-A-B-C.eu` (three dash-separated octets, not four), and the domain WHOIS is OVH SAS. These are safe to map as `OVH,Web Host` despite the domain name not resembling "ovh"; the WHOIS is what corroborates it, not the IP netname. If you encounter other reverse-DNS-only brands with a similar recurring pattern, confirm via domain-WHOIS before mapping and document the pattern here.
|
||||
@@ -188,7 +190,7 @@ When someone points at a specific domain — from a DMARC report they inspected,
|
||||
|
||||
1. **MMDB check first.** Confirm the domain appears in `ipinfo_lite.mmdb` as an `as_domain`, and note the `as_name`, ASN(s), and network / IPv4 counts for scale context. If the domain doesn't appear as an `as_domain`, it's a PTR-side-only addition — fine, but call that out so the reviewer knows only the PTR path will hit it. See "Checking ASN-domain coverage of the MMDB" for the walk-the-MMDB pattern.
|
||||
2. **Grep existing map and known-unknown keys for the brand.** `grep -in "<brand>" base_reverse_dns_map.csv known_unknown_base_reverse_dns.txt`. If any variant of the brand is already classified, reuse that `(name, type)` rather than inventing a new display name (same rule as bulk workflows — one canonical display name per operator). If it's in `known_unknown_base_reverse_dns.txt`, understand *why* before promoting it out.
|
||||
3. **Corroborate identity from two sources.** Fetch the homepage with `WebFetch` and run `whois` on the domain. Confirm the service category (ISP, Web Host, MSP, SaaS, etc.) from what the homepage actually describes, cross-checked against the domain WHOIS's registrant organization. Privacy-redacted WHOIS plus an unreachable or self-signed homepage means you cannot confidently classify — do not reach for the IP-WHOIS as a substitute (rule 5 of the unknown-domain workflow applies here too: only trust IP-WHOIS when the domain name matches the host's name).
|
||||
3. **Corroborate identity from two sources.** Fetch the homepage with `WebFetch` and run `whois` on the domain. Confirm the service category (ISP, Web Host, MSP, SaaS, etc.) from what the homepage actually describes, cross-checked against the domain WHOIS's registrant organization. Privacy-redacted WHOIS plus an unreachable or self-signed homepage means you cannot confidently classify — do not reach for the IP-WHOIS as a substitute (rule 5 of the unknown-domain workflow applies here too: only trust IP-WHOIS when the domain name matches the host's name). **Caveat:** a self-signed cert or TLS-handshake error can also be the user's firewall / a TLS-intercepting proxy rather than a property of the domain — see step 4 of the bulk workflow above. Ask the user before chalking it up to the domain.
|
||||
4. **Apply the same precedence and naming rules as the bulk workflows.** README.md type precedence. Canonical display name per brand family (every Vodafone entity is "Vodafone", every Evolus alias points at the same `(name, type)` as the rest of the family, etc.).
|
||||
5. **Be honest about inference in the commit body.** If a domain has no verifiable homepage or WHOIS and you are classifying from MMDB `as_name` + routed-network scale alone, say so explicitly — e.g. *"Classification is inferred from the MMDB as_name 'GLOBAL CONNECTIVITY SOLUTIONS LLP' and the routed-network scale; homepage unreachable, WHOIS privacy-redacted."* A silent guess is indistinguishable from a verified fact in a diff, and the reviewer has no way to know to double-check it.
|
||||
6. **Privacy rule still applies.** No domains containing a full IPv4 address, regardless of how the domain was sourced.
|
||||
|
||||
Reference in New Issue
Block a user