mirror of
https://github.com/domainaware/parsedmarc.git
synced 2026-06-11 04:59:43 +00:00
Run --use-search-fallback against 10,544 bot-blocked KU rows; +474 promotions
Also expands the search-fallback trigger regex to recognize self-signed
TLS interception (firewall block via cert) and a wider class of
local-firewall block-page strings.
Mechanics
1. Identified 10,544 KU rows from the 34,647-row prior TSV that looked
bot-blocked (via the new `_looks_bot_blocked` detector).
2. Ran `collect_domain_info.py --use-search-fallback` against just
those rows. Throughput was ~3.4 rows/sec at 32 workers / 3s HTTP
timeout / 5s WHOIS timeout. ~50 min wall time.
3. Audited the resulting TSV and discovered 2,078 rows whose homepage
fetch had silently returned a corporate firewall's block page
(Fortinet "Web Filter Violation" being the most common, 1,419 of
them). The original `_SEARCH_FALLBACK_TRIGGER_RE` didn't recognize
those strings, so search-fallback wasn't firing — the firewall's
block-page text was being fed to the classifier as if it were the
operator's homepage. Almost no false promotions resulted (block-page
text doesn't match industry detectors), but the rows weren't
recovering either.
4. Expanded the trigger regex to catch web-filter block pages, then
re-fetched just the 2,078 affected rows.
5. Final classifier pass: 474 unambiguous map adds, 41 ambiguous, 1
silently dropped (adult content), 10,066 still in KU.
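
The trigger expansion in step 4 can be sketched as below. This is a minimal illustration, not the collector's code: the pattern is a small subset of the phrases quoted in this commit, and `BLOCK_PAGE_RE` and `needs_search_fallback` are hypothetical names. It assumes case-insensitive matching over the fetched title and description, and that empty metadata also counts as blocked.

```python
import re

# Hypothetical, abbreviated pattern: a few of the block-page phrases named in
# this commit. The real _SEARCH_FALLBACK_TRIGGER_RE is longer.
BLOCK_PAGE_RE = re.compile(
    r"just a moment|checking your browser|"
    r"web filter violation|web filter block|"
    r"this site has been blocked|blocked by administrator|"
    r"access denied",
    re.IGNORECASE,
)


def needs_search_fallback(title: str, description: str) -> bool:
    """Return True when fetched metadata is empty or looks like a block page."""
    text = f"{title} {description}".strip()
    if not text:
        return True  # nothing usable was fetched
    return BLOCK_PAGE_RE.search(text) is not None


# A Fortinet-style block page triggers the fallback; a real homepage does not.
print(needs_search_fallback("Web Filter Violation", ""))
print(needs_search_fallback("Acme Fiber", "Regional internet provider"))
```

Keeping the phrases in one alternation means a new block-page string costs one more line in the pattern, which matches how the diff below extends the existing regex.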
Self-signed-cert detection
A separate fix lands in this commit: when the primary fetch fails with
an SSL cert verification error matching "self-signed certificate", the
collector skips the verify=False browser fallback. Rationale: TLS-
intercepting firewalls (corporate or personal-network) present their
own self-signed cert specifically when blocking. The verify=False
fallback would happily retrieve the firewall's block page, which then
poisons the row's title/description. Skipping that path leaves the
row's metadata empty so search-fallback can recover real content.
Other cert errors (hostname mismatch, weak DH, legacy renegotiation)
keep the existing fallback path because they're typically real
operators with misconfigured TLS rather than firewall interception.
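
The decision described above can be sketched in isolation. This is an illustration under assumptions, not the function in `collect_domain_info.py`: `is_self_signed_error` and `annotate_error` are hypothetical helper names, and the sample error strings are only typical of OpenSSL output, whose exact wording varies by version.

```python
# Note appended to the error when the insecure fallback is skipped
# (wording taken from the diff in this commit).
SKIP_NOTE = " | firewall-blocked, skipped fallback"


def is_self_signed_error(primary_err: str) -> bool:
    """Match both spellings: 'self-signed' and the older 'self signed'."""
    err = primary_err.lower()
    return "self-signed" in err or "self signed" in err


def annotate_error(primary_err: str, limit: int = 200) -> str:
    """Append the skip note and truncate to the error-column limit."""
    return (primary_err + SKIP_NOTE)[:limit]


# Sample strings (assumed, for illustration only).
self_signed = "certificate verify failed: self-signed certificate"
mismatch = "certificate verify failed: Hostname mismatch"

for err in (self_signed, mismatch):
    if is_self_signed_error(err):
        print("skip verify=False fallback:", annotate_error(err))
    else:
        print("keep existing fallback path:", err)
```

Matching on the error text rather than an exception type keeps the check independent of which HTTP client raised the error, at the cost of depending on OpenSSL's wording.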
Numbers
Map: 37,640 → 38,114 (+474)
KU: 32,324 → 31,886 (−438)
Disjoint check: 0 shared keys
Unknown CSV: regenerated, just the header
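
The disjoint check amounts to a set intersection over the two key columns. A minimal sketch, assuming keys are domain strings compared case-insensitively after trimming whitespace; `shared_keys` is a hypothetical name and the sample domains are made up.

```python
from typing import Iterable, Set


def shared_keys(map_keys: Iterable[str], ku_keys: Iterable[str]) -> Set[str]:
    """Return keys present in both the map and the KU list."""
    normalized_map = {k.strip().lower() for k in map_keys}
    normalized_ku = {k.strip().lower() for k in ku_keys}
    return normalized_map & normalized_ku


# Made-up sample data: a promoted row must leave KU when it enters the map.
map_rows = ["example-isp.net", "example-host.com"]
ku_rows = ["still-unknown.org"]

overlap = shared_keys(map_rows, ku_rows)
print(f"Disjoint check: {len(overlap)} shared keys")
```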
Type distribution of the 474 promotions
  162 ISP                   17 MSP             4 MSSP / Marketing
   72 Web Host              16 Technology      4 Beauty / Agriculture
   41 Finance               14 Healthcare      3 IaaS / Science / Legal
   19 Government            11 Travel          2 Search / Religion / SaaS
   10 Logistics              8 Manufacturing   2 Email Sec / Email Provider
    9 Education / Retail     8 News            2 Entertainment
    7 Utilities / Phys Sec   6 Real Estate     1 Auto / Staff / PaaS
    6 Food / Consulting / Industrial / Conglomerate / Nonprofit
Most of the gains are network operators (162 ISPs, 72 Web Hosts) —
the population that's most likely to be Cloudflare-walled or DDoS-
Guard-walled at the homepage layer but show up clearly in DDG
abstracts.
Smoke audit on a 30-row random sample of map adds: 28 plausible, 2
borderline. `es.graphicpkg.com` → Food could also be Industrial, since
Graphic Packaging makes packaging *for* the food industry, but the
vertically-specialized rule applies. `annuairesante.ameli.fr` → Finance
was matched via French health-insurance vocabulary and is defensible.
The 41 ambiguous rows stay in KU per the established workflow; they
need the same one-row-at-a-time human triage that PR #766 used.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
File diff suppressed because it is too large
@@ -749,6 +749,24 @@ def _fetch_homepage(domain: str, timeout: float) -> dict:
            out["error"] = ""
            return out

        # Self-signed-cert detection: TLS-intercepting firewalls present
        # their own self-signed cert specifically when *blocking* a
        # request. The verify=False browser fallback would succeed but
        # return the firewall's block page, not the real operator's
        # content — that block page would then poison the row's title /
        # description and mislead the classifier. Skip the fallback for
        # this row so `_looks_bot_blocked` returns True (empty meta) and
        # the search-fallback path can recover real content.
        # Cert errors NOT covered here (hostname mismatch, weak DH,
        # legacy renegotiation) keep the existing fallback path because
        # they're typically real operators with misconfigured TLS rather
        # than firewall interception.
        if primary_err and (
            "self-signed" in primary_err.lower() or "self signed" in primary_err.lower()
        ):
            last_err = (primary_err + " | firewall-blocked, skipped fallback")[:200]
            continue

        # Curl fallback: trigger on errors or non-2xx. A 2xx with empty head
        # is left alone (likely a parked page; retrying rarely helps).
        non_success = primary_status and not primary_status.startswith("2")
@@ -800,6 +818,15 @@ _SEARCH_FALLBACK_TRIGGER_RE = re.compile(
    r"just a moment|are you a robot|checking your browser|"
    r"please enable javascript|"
    r"ddos[- ]guard|px-captcha|vercel security checkpoint|"
    r"\bcaptcha\b|"
    # Local firewall / DNS-filter block pages — corporate firewalls (Fortinet,
    # Palo Alto, Cisco Umbrella, Sophos, etc.) typically present a generic
    # block page with one of these phrases. The page is the *firewall's*,
    # not the operator's, so search-fallback is the only way to recover.
    r"web filter violation|web filter block|fortinet secure dns service|"
    r"this site has been blocked|access blocked by|"
    r"blocked by your network|blocked by administrator|"
    r"this content is blocked|"
    # Generic blocked / unavailable
    r"access denied|access to this page has been denied|"
    r"site is not available|page is not available|"