Reclassify KU pool: 2,248 promotions + new ambiguous-output worklist (#766)

* Reclassify KU pool: 2,248 promotions; surface 78 ambiguous rows for review

Re-fetched homepage / WHOIS / DNS for all 34,647 domains in
known_unknown_base_reverse_dns.txt via collect_domain_info.py and re-ran the
classifier. The classifier itself was extended in several directions while
auditing the unclassified pool — the changes are listed below.

Numbers
- 2,248 KU rows promoted to base_reverse_dns_map.csv (unambiguous matches).
- 78 rows surfaced as ambiguous (two or more distinct detector categories
  fired) — these are NOT auto-promoted; they need human adjudication.
- 32,399 rows remain in KU (34,647 − 2,248), including the 78 ambiguous
  rows pending review. The rest carry no usable signal: most have
  privacy-only WHOIS, parked / blocked / Cloudflare-walled homepages, or
  empty MMDB enrichment.
- Disjoint invariant verified: comm -12 of map keys and KU prints nothing.
- Unknown-list regenerated via find_unknown_base_reverse_dns.py.
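
The disjoint check above can also be expressed in Python. This is only a
sketch under assumptions: the function names are hypothetical, the map key
is taken to be the first CSV column, and keys are lower-cased before
comparison (plain `comm -12` is case-sensitive).

```python
import csv
import io


def map_keys(csv_text: str) -> set[str]:
    """Return the lower-cased first column of every non-empty CSV row."""
    reader = csv.reader(io.StringIO(csv_text))
    return {row[0].strip().lower() for row in reader if row and row[0].strip()}


def ku_entries(txt: str) -> set[str]:
    """Return the lower-cased non-empty lines of the known-unknown list."""
    return {line.strip().lower() for line in txt.splitlines() if line.strip()}


def overlap(map_csv_text: str, ku_text: str) -> set[str]:
    """Domains present in both files; an empty set means the invariant holds."""
    return map_keys(map_csv_text) & ku_entries(ku_text)


if __name__ == "__main__":
    # Made-up sample data, not rows from the real list files.
    sample_map = "example.com,Example,SaaS\nexample.net,Example Net,ISP\n"
    sample_ku = "example.org\nexample.edu\n"
    print(sorted(overlap(sample_map, sample_ku)))
```

A header row in the map CSV would end up in the key set, which is harmless
as long as the KU file never contains the same string.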

Classifier changes (classify_unknown_domains.py)

1. Three output buckets via new --ambiguous-out flag. Per-row outcome is now
   one of: map (auto-promote), ambiguous (worklist for human review), or
   ku (no signal). When ≥2 distinct detector categories fire on a row, the
   classifier picks a primary in precedence order but does NOT auto-promote
   — instead it writes the row to the ambiguous TSV with the alternatives
   listed. Rationale: the operator-typology question ("is this a SaaS
   company or an Energy company?") is a judgment call the classifier
   shouldn't make on its own.

2. Plural-matching fix: outer `\b` boundary changed to `s?\b` across all 46
   detectors so `dedicated server` matches `dedicated servers`,
   `law firm` matches `law firms`, etc. The old boundary silently dropped
   the majority of English-text matches.

3. TLD-only signal classification: bare-TLD rows (gov.kh / ac.id / mil.bd /
   .jus.br etc.) now classify even when title/desc/as_name are all empty.
   Previously these rows short-circuited at "need some signal".

4. TLD lists massively expanded:
   - Education: ~85 TLDs (every gov-restricted edu / ac suffix worldwide)
   - Government: ~110 TLDs incl. judicial branch (.jus.br) and legislative
     (.leg.br); covers Eastern Europe, MENA, SE Asia, Africa, Caribbean,
     Pacific
   - Military: ~45 .mil.* suffixes
   - Plus US K-12 regex (.k12.<state>.us)

5. New concrete-vocabulary patterns added based on KU-pool audit:
   - cybersecurity / cyber security for business → MSSP
   - autonomous system / asn owner / network operator / peering exchange
     / IXP → ISP
   - ICANN registrar / domain registrar / domain name platform / CDN /
     WAF / anti-DDoS → Web Host
   - BPM platform / CXM / CCaaS / CPaaS / contact center platform /
     compliance software → SaaS
   - katılım bankası / pensioen en verzekeringen / empréstimo consignado
     / credit (scores|reports|cards|comparison|bureau) /
     stock and commodity market → Finance
   - aeroportos de / passagem de ônibus / bilişim şirketi / havacılık →
     Travel & Tech variants
   - acciaio inossidabile / laminati piani → Industrial
   - Russian football-club declension forms (футбольного клуба, etc.)
   - tv channel / movie streaming / video streaming platform →
     Entertainment
   - genetic sequencing / next-generation sequencing /
     clinical diagnostic → Healthcare
   - punto vendita → Italian Retail
   - electrolyser / electrolyzer / green hydrogen → Energy

6. Mojibake table extended for Western European compounds: ã/â/ê/î/ô
   (Portuguese ã, French/PT â/ê/ô) plus uppercase variants.
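
To illustrate the plural-matching fix in item 2, here is a minimal sketch.
The two-keyword detector body and the names `OLD_RE` / `NEW_RE` are
hypothetical; the real detectors are far larger, but the boundary change
has the same shape.

```python
import re

# Hypothetical two-keyword detector body, for illustration only.
KEYWORDS = r"dedicated server|law firm"

# Old shape: the closing \b sits right after the keyword, so a trailing
# "s" (another word character) prevents the boundary from matching.
OLD_RE = re.compile(rf"\b(?:{KEYWORDS})\b", re.IGNORECASE)

# New shape: an optional "s" is consumed before the closing boundary.
NEW_RE = re.compile(rf"\b(?:{KEYWORDS})s?\b", re.IGNORECASE)

text = "Affordable dedicated servers for law firms"
print(bool(OLD_RE.search(text)))  # False: both keywords appear only as plurals
print(bool(NEW_RE.search(text)))  # True
```

The singular forms still match under the new pattern because the `s` is
optional.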

Bug fixes from cross-language collisions

The audit pass exposed three short tokens that meant one thing in the
language they were added for and something completely different in another
language the classifier also targets:

- `por` (added as Luxembourgish for "parish" → Religion). Also the Spanish
  and Portuguese preposition "for / by", which appears on roughly every
  Spanish-language page. Was producing ~34 Religion false positives on
  Mexican ISPs, Brazilian utilities, etc.
- `pura` (added as Indonesian/Sundanese/Balinese for "Hindu temple" →
  Religion). Also the feminine of "pure" in Portuguese / Spanish / Italian,
  and a frequent brand-name fragment ("Pura Energia", "Angkasa Pura").
  Was misclassifying Brazilian electric utilities and Indonesian aviation
  services.
- bare `broker` (added as Luxembourgish for Finance). Matched any English
  text containing "broker" / "brokers" — including Cushman & Wakefield's
  "real estate brokers" line, which forced the row into Finance instead
  of Real Estate.

All three removed; AGENTS.md now codifies the rule.

AGENTS.md additions

- "Three output buckets" subsection: documents map / ambiguous / ku output
  and how PRs should call out ambiguous review counts.
- "No taglines / slogans" rule: marketing copy ("we make it easy",
  "smarter decisions") doesn't belong in any detector.
- "No ambiguous signals" rule: cross-category bare words (gazette / academy
  / society / club / studio) are forbidden as classifier keywords; use the
  pinning compound instead. Same rule applies in every language.
- "Cross-language grammar / lexical overlap" rule: short tokens that mean X
  in language A often mean a function word / adjective / brand fragment in
  language B. Cites the por / pura / broker incidents.
- "Classify by what the operator literally provides" rule: clusters by
  acronym suffix (UCaaS / CCaaS / CPaaS) tempt mis-grouping; CCaaS is SaaS
  not ISP, etc. Includes the root-cause analysis of the
  contact-center-as-ISP mistake.
- "Genuinely-ambiguous-between-two-types" rule: phrases like
  "energy management software" that fit equally on a SaaS startup, an
  Industrial conglomerate, and a consultancy belong in NO detector — leave
  the row unmapped and rely on more-specific compounds.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* Triage 78 ambiguous rows + new classifier filters and rules

Interactive triage of the 78 rows the v1 classifier surfaced as ambiguous
(two or more distinct categories fired). Net result of this commit, on
top of the v1 promotions already in the branch:

- 74 ambiguous rows promoted to map with a human-adjudicated category;
  10 of those also got a human-corrected brand in place of the noisy
  as_name / title-bleed the v1 classifier captured.
- 1 row dropped silently per the AGENTS.md adult-content rule.
- 3 rows kept in KU (personal projects and parked pages that the
  classifier flagged mid-triage and that we then confirmed).

Map: 37,566 → 37,640 (+74). KU: 32,399 → 32,324 (−75). Disjoint clean.

Three new classifier filters added during triage as recurring patterns
surfaced — these run before category detectors and short-circuit to KU
or DROP rather than letting the operator-typology detectors fire on
parking-page / personal-page / adult-page text:

1. PARKED_PAGE_RE — Media Temple "automatically generated default server
   page", Hostinger Horizons, Apache default, parked-by-registrar pages,
   "site has shut down", "has completed its journey". Cloudflare /
   DDoS-Guard / "Are you a robot?" interstitials are explicitly NOT
   filtered (they leave the TLD-signal path open for gov / edu / mil
   sites that are bot-blocked).
2. PERSONAL_PROJECT_RE — "personal BGP project", "personal website and
   CV", "homelab", "hobby project", "side project". Hobbyists running
   their own ASN aren't commercial operators.
3. ADULT_CONTENT_RE — adult web design / adult-entertainment hosting /
   xxx / escort directory etc. Returns a sentinel ("DROP", None) so the
   caller drops the domain from both map and KU per the AGENTS.md
   content rule.

The classifier API now also writes a fourth output file (--dropped-out)
listing domains the adult-content filter caught, so the caller can
remove them from any tracked list files they currently sit in.
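
A rough sketch of how the three pre-filters might short-circuit ahead of
the category detectors. The regex bodies are assembled only from phrases
quoted above and are much shorter than the real ones. The `("DROP", None)`
sentinel is the one named above; the `("KU", None)` shape, the `prefilter`
name, and the filter order are assumptions.

```python
import re
from typing import Optional

# Illustrative patterns only; the real regexes cover many more phrasings.
PARKED_PAGE_RE = re.compile(
    r"automatically generated default server page"
    r"|site has shut down|has completed its journey",
    re.IGNORECASE,
)
PERSONAL_PROJECT_RE = re.compile(
    r"personal bgp project|personal website and cv"
    r"|homelab|hobby project|side project",
    re.IGNORECASE,
)
ADULT_CONTENT_RE = re.compile(r"adult web design|escort directory", re.IGNORECASE)


def prefilter(text: str) -> Optional[tuple[str, None]]:
    """Return a short-circuit outcome, or None to let the detectors run."""
    if ADULT_CONTENT_RE.search(text):
        return ("DROP", None)  # caller removes the domain from map and KU
    if PARKED_PAGE_RE.search(text) or PERSONAL_PROJECT_RE.search(text):
        return ("KU", None)  # assumed return shape for the KU outcome
    return None
```

Bot-block interstitial text such as "Are you a robot?" deliberately falls
through here, so the TLD-signal path stays open for those rows.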

Title-noise list extended to catch: "attention required" / "are you a
robot" / "checking your browser" / "please enable javascript" /
"ddos-guard" / "px-captcha" / "site is not available" / "page is not
available" / "access to this page has been denied". This stops these
strings from bleeding into the brand column when TLD-only classification
fires (the `health.gov.il → "Attention Required!"` shape of bug).
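
The title-noise guard could look roughly like the following. The
`brand_from_title` helper and its `fallback` parameter are hypothetical;
only the noise phrases come from the list above.

```python
import re

# Noise phrases taken from the list above, matched case-insensitively.
TITLE_NOISE_RE = re.compile(
    r"attention required|are you a robot|checking your browser"
    r"|please enable javascript|ddos-guard|px-captcha"
    r"|site is not available|page is not available"
    r"|access to this page has been denied",
    re.IGNORECASE,
)


def brand_from_title(title: str, fallback: str) -> str:
    """Use the page title as the brand unless it is empty or interstitial noise."""
    cleaned = title.strip()
    if not cleaned or TITLE_NOISE_RE.search(cleaned):
        return fallback  # hypothetical fallback, e.g. the domain itself
    return cleaned


print(brand_from_title("Attention Required! | Cloudflare", "health.gov.il"))
```

With this guard, a TLD-only classification keeps whatever fallback brand
the caller supplies instead of the interstitial's title.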

Several cross-language false positives caught during the triage — same
shape as the por / pura / broker incidents the previous commit fixed:

- bare French `e?mailing` matched "Mailing Solutions" (mail-server
  infrastructure on a Cisco VAR's product list, not marketing). The
  pattern now requires the leading `e`, which keeps the email-marketing
  meaning and drops the bare-mailing collision.
- Norwegian / Danish bare `avis` (newspaper) matched "Avis Romania" car
  rental and any French text saying "avis" (notice/opinion). Replaced
  with compound forms (`dagsavis`, `lokalavis`, `morgenavis`, etc.).
- Vietnamese bare `bộ` (ministry) matched "bộ phim" (a film), "bộ
  sưu tập" (collection), and the founding-text references on Vietnam
  Eximbank's about page. Replaced with compound forms (`bộ trưởng`, `bộ
  tài chính`, `bộ ngoại giao`, etc.).
- Russian bare `провайдер` (provider) matched "хостинг провайдер"
  (hosting provider, Web Host) on a Tajikistan domain registrar. Removed
  the bare form; only the internet-specific compounds remain.
- Luxembourgish bare `broker` (Finance) matched "real estate brokers"
  on Cushman & Wakefield's homepage and any English page mentioning
  brokers. Removed the bare form entirely.
- Turkish bare `vakıf` (foundation) matched "Vakıf Katılım Bankası", a
  for-profit Islamic-finance bank whose brand uses the word. Replaced
  with nonprofit-specific compounds (`yardım vakfı`, `hayır vakfı`,
  `kamu yararına vakıf`).
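
The bare-token versus compound difference is easy to see with the `avis`
case. The two regexes below are simplified stand-ins for the detector
fragments, and the third sample sentence is made up for illustration.

```python
import re

# Bare token: fires on the car-rental brand and on French "avis".
BARE_AVIS_RE = re.compile(r"\bavis\b", re.IGNORECASE)

# Compound forms named above: the newspaper meaning is pinned by the prefix.
COMPOUND_AVIS_RE = re.compile(
    r"\b(?:dagsavis|lokalavis|morgenavis)\b", re.IGNORECASE
)

samples = [
    "Avis Romania car rental",
    "Donnez votre avis",
    "Din lokalavis siden 1905",  # invented sample text
]
for s in samples:
    print(s, bool(BARE_AVIS_RE.search(s)), bool(COMPOUND_AVIS_RE.search(s)))
```

The first two samples trigger only the bare pattern, and the third
triggers only the compound pattern, which is the behaviour the replacement
is after.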

New positive-classification keywords added based on triage gaps:

- MSP rescue path now matches the SMB-IT-shop idiom in Polish
  (`usługi IT dla biznesu`, `obsługa informatyczna firm`,
  `outsourcing IT`), Spanish (`servicios informáticos para empresas`),
  German (`IT-Dienstleister für`, `managed-IT-services`), French
  (`infogérance`, `prestataire de services informatiques`), Italian
  (`servizi informatici gestiti`, `outsourcing informatico`),
  Portuguese (`serviços de TI gerenciados`, `terceirização de TI`),
  Dutch (`beheerd-IT`, `IT-beheer`), and Indonesian
  (`penyedia solusi IT`, `solusi IT terpadu/berbasis`).
- Finance now matches `accounting firm` / `cpa firm` /
  `certified public accountants` / `chartered accountants` /
  `tax preparation` / `tax advisory` / `audit firm` plus equivalents in
  Spanish, Portuguese, French, German, Italian, and Polish.
- SaaS now matches CCaaS / CPaaS / `contact-center-as-a-service` /
  `communications-platform-as-a-service` / `compliance software` /
  `regulatory management software`. CCaaS no longer lives in ISP
  (carryover from the user-flagged "contact centers are not ISPs"
  correction).

AGENTS.md additions:

- "Triage heuristics learned from the 78-row interactive review of
  PR #766's ambiguous bucket" subsection codifying every adjudication
  rule the user applied during the review:
  * pick the main-focus category (first / most-mentioned)
  * clients are not operator typology
  * vertically-specialized firms take the vertical
  * stream-hosting infrastructure is Web Host
  * multi-service SMB IT shops are MSP
  * VARs are Technology
  * CCaaS / CPaaS / UCaaS are SaaS
  * gov/edu/mil/jus TLD signal trumps Cloudflare interstitials
  * esports tournament organizers are Entertainment
  * personal projects / parked pages / adult content go to KU or DROP
  * brand quality is its own dimension — capture corrected brand
    during triage rather than shipping the noisy as_name

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Sean Whalen <seanthegeek@users.noreply.github.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Sean Whalen
2026-05-08 00:03:45 -04:00
committed by GitHub
parent 06d277686d
commit b31a9e022f
4 changed files with 3094 additions and 2551 deletions
+48 -1
@@ -225,7 +225,14 @@ When `unknown_base_reverse_dns.csv` has new entries, follow this order rather th
- `find_unknown_base_reverse_dns.py` — regenerates `unknown_base_reverse_dns.csv` from `base_reverse_dns.csv` by subtracting what is already mapped or known-unknown. Enforces the no-full-IP privacy rule at ingest. Translates non-domain-shaped `source_name` rows (raw MMDB `as_name` strings surfaced by the ASN-fallback path in `utils.py:get_ip_address_info` when the IP had no PTR and the `as_domain` was uncategorized) to their corresponding `as_domain` via the bundled MMDB, so the row enters the pipeline as a researchable domain (and drops out automatically if that `as_domain` is already mapped). Run after merging a batch.
- `detect_psl_overrides.py` — scans the lists for clustered IP-containing patterns, auto-adds brand suffixes to `psl_overrides.txt`, folds affected entries to their base, and removes any remaining full-IP entries. Run before the collector on any new batch.
- `collect_domain_info.py` — the bulk enrichment collector described above. Respects `psl_overrides.txt` and skips full-IP entries. Two derived columns surface drift signals that are also useful during initial classification: `rebrand_signal` combines a body-text regex (matches "now X", "formerly known as X", "is now part of X", etc.) with a path/alt-text regex (matches "rebrand", "brand-launch", "brand-announcement", "name-change", "our-new-name") so that image-only acquisition banners — `<a href="…/brand-launch-…"><img alt="Brand announcement"></a>` — also fire. `external_links` lists the homepage's non-self, non-social outbound link hosts; useful as review context but not a flag trigger by default in the drift sweep (most external links are to partners / customers / vendors and don't indicate a rebrand).
- `classify_unknown_domains.py` — regex-based multilingual classifier that consumes a `collect_domain_info.py` TSV and emits map / known-unknown additions. Useful for both lookup paths into `base_reverse_dns_map.csv`: the original PTR-side flow (classifying reverse-DNS base domains discovered from DMARC report source IPs) and the MMDB-coverage flow (classifying ASN domains lifted from the bundled IPinfo Lite MMDB). Detectors cover all 44 industry types in the README, and every detector aims for **concept parity across the same broad language pool** — see the concept-parity rule below. The classifier is the regex baseline of step 4 of the unknown-domain workflow (see "Workflow for classifying unknown domains" above) — it catches the obvious cases at scale and leaves the genuinely ambiguous to manual / LLM review. Append the script's `--map-out` to `base_reverse_dns_map.csv` and `--ku-out` to `known_unknown_base_reverse_dns.txt` (after the per-batch brand cleanup pass), then run `sortlists.py`. The HAND dict at the top of the script is an extension point for batch-specific overrides (e.g. acquisition aliases, brand-name corrections that don't fit any detector).
- `classify_unknown_domains.py` — regex-based multilingual classifier that consumes a `collect_domain_info.py` TSV and emits map / ambiguous / known-unknown additions. Useful for both lookup paths into `base_reverse_dns_map.csv`: the original PTR-side flow (classifying reverse-DNS base domains discovered from DMARC report source IPs) and the MMDB-coverage flow (classifying ASN domains lifted from the bundled IPinfo Lite MMDB). Detectors cover all 44 industry types in the README, and every detector aims for **concept parity across the same broad language pool** — see the concept-parity rule below. The classifier is the regex baseline of step 4 of the unknown-domain workflow (see "Workflow for classifying unknown domains" above) — it catches the obvious cases at scale and leaves the genuinely ambiguous to manual / LLM review.
**Three output buckets**. Per-row, the classifier returns one of three states:
1. `--map-out` (CSV `domain,name,type`) — exactly one detector category fired. Auto-promote: append to `base_reverse_dns_map.csv`.
2. `--ambiguous-out` (TSV `domain, name, primary_type, alternatives, title`) — **two or more distinct categories fired**. The classifier picks a primary in precedence order but does **not** auto-promote; a human must adjudicate. Use this file as a worklist: for each row, pick one of the candidates (or assign a different category, or send the row to KU). The PR description should call out the ambiguous count and how many were resolved manually vs. left in KU. This bucket is the relief valve for the operator-typology problem — when a regex hit could legitimately mean "this is a SaaS company" or "this is an Energy company" (or any other inter-category boundary case), the classifier surfaces the row instead of guessing.
3. `--ku-out` (text, one domain per line) — no detector fired. Append to `known_unknown_base_reverse_dns.txt`.
Append `--map-out` to `base_reverse_dns_map.csv` and `--ku-out` to `known_unknown_base_reverse_dns.txt` (after the per-batch brand cleanup pass), then run `sortlists.py`. The HAND dict at the top of the script is an extension point for batch-specific overrides (e.g. acquisition aliases, brand-name corrections that don't fit any detector).
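
As a rough illustration of the three-bucket decision, the sketch below routes a row by the set of categories that fired. The `route` function and the `PRECEDENCE` list are hypothetical; the script's real precedence order is not reproduced here.

```python
# Hypothetical precedence order, for illustration only.
PRECEDENCE = ["Government", "Education", "ISP", "Web Host", "SaaS", "Energy"]


def route(fired: set[str]) -> tuple[str, list[str]]:
    """Return (bucket, categories sorted by precedence) for one row."""
    rank = {name: i for i, name in enumerate(PRECEDENCE)}
    # Unlisted categories sort last; the name breaks ties deterministically.
    ordered = sorted(fired, key=lambda c: (rank.get(c, len(PRECEDENCE)), c))
    if not ordered:
        return ("ku", [])          # no detector fired
    if len(ordered) == 1:
        return ("map", ordered)    # exactly one category: auto-promote
    # Two or more categories: ordered[0] is the primary, the rest are the
    # alternatives listed for the human reviewer.
    return ("ambiguous", ordered)
```

Under this sketch a row that fires both SaaS and Energy lands in the ambiguous bucket with SaaS as primary, purely because of the assumed order above.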
**Concept parity rule for multilingual detectors.** When editing or extending any detector regex in `classify_unknown_domains.py`, every language section must cover the **same set of distinct concepts** that the English section covers — not just one or two transliterated keywords. The English section is the spec; each non-English section is an attempt to express that same concept set in idiomatic terms.
@@ -236,6 +243,46 @@ When `unknown_base_reverse_dns.csv` has new entries, follow this order rather th
- **British vs American spellings.** Where US/UK English diverge (`tire`/`tyre`, `defense`/`defence`, `center`/`centre`, `color`/`colour`), include both in the English section so the detector matches both spellings.
This rule applies equally to the smaller detectors (MSSP, IaaS/PaaS/SaaS, Defense, Conglomerate, Energy, etc.) — but for those, "skip rather than invent" does most of the work, since many languages have no native compound for "managed security services" or "infrastructure as a service" and the English term is itself loanword-shaped in most contexts.
**No taglines / slogans as classifier keywords.** Marketing taglines ("we make it easy", "smarter decisions", "your trusted partner", "innovation at scale", "where ideas come to life") are domain-agnostic — every consulting firm, every SaaS pitch, every law firm's homepage uses them. They carry no industry signal and produce false positives across every detector they touch. Keep classifier keywords to **concrete operator-typology vocabulary** — what the operator literally is (`law firm`, `data center`, `record label`, `automotive supplier`) or what it literally provides (`fiber internet`, `mortgage lending`, `pharmaceutical manufacturing`). If a phrase could plausibly appear on a hardware vendor, an MSP, an ad agency, and a government press release, it does not belong in any detector.
**No ambiguous signals.** A keyword belongs in a detector only if it identifies *that one* category. Cross-category words ("gazette" / "Gazette" — a newspaper, a school newsletter, a corporate bulletin, a neighborhood paper, all use it; "academy" — could be K-12, military, beauty, sports, or a SaaS product called "Academy"; "society" — a charity, a learned body, a university residence, a medical association; "club" — a sports team, a nightclub, a children's organization, a casino loyalty program; "studio" — film, photo, fitness, recording, dance) are forbidden as bare keywords. Use the concrete compound that pins the meaning ("rugby club", "photo studio", "research society", "K-12 school district"). The same rule applies in every language — bare Russian "клуб", Spanish "estudio", German "Verein" carry the same multi-meaning hazard as their English equivalents and need the same compounding before they go in. When in doubt, leave the row to manual review rather than feeding the detector a phrase that fires on multiple unrelated industries.
**Cross-language grammar / lexical overlap.** A short token that is a meaningful keyword in language A is often a function word, adjective, or brand-name fragment in language B — and the classifier runs every detector against every language's text without knowing which language the input is in. The result is silent false positives across whole regions of the input. Before adding any short keyword (≤4 letters, plus longer ones that overlap common loanwords), explicitly check whether it collides with a common word in any of the other languages the classifier targets. Two real cases that landed in the file and had to be removed:
- `por` was added as Luxembourgish for "parish" (Religion). It is the Spanish and Portuguese preposition "for / by", which appears on roughly every Spanish-language webpage. Re-classifying ~17k KU rows surfaced ~34 Religion false positives — Mexican ISPs, Brazilian utilities, anything whose homepage said *"para"* or *"por"* — before the bare token was removed.
- `pura` was added as Indonesian/Balinese for "Hindu temple" (Religion). It is also the feminine form of "pure" in Portuguese / Spanish / Italian and a frequent brand-name fragment ("Pura Energia", "Angkasa Pura"). It produced misclassifications on a Brazilian electric utility and an Indonesian aviation services company before being removed.
The defense is mechanical: when proposing a short keyword in any non-English language, run it past the same prepositions / common-adjectives / brand-name-fragments check in *every other language the classifier touches*, and reject the keyword if any of those collide. Compound terms ("পবিত্র মন্দির", "Mosquée Centrale", "religious order") carry their own pinning context and don't collide; bare 3- or 4-letter tokens almost always do. If the language genuinely has no longer compound for the concept, "skip rather than invent" applies — leave that language out of that detector and rely on as_name / WHOIS / TLD signals to pick up the operator instead.
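
The mechanical check described above might be sketched as follows. The word lists are tiny and purely illustrative, the `collisions` helper is hypothetical, and a real check would need much fuller per-language lists of function words, adjectives, and brand fragments.

```python
# Tiny illustrative word lists keyed by language code; not exhaustive.
COMMON_WORDS = {
    "es": {"por", "para", "de", "la", "pura"},
    "pt": {"por", "para", "de", "pura"},
    "it": {"per", "di", "pura"},
}


def collisions(keyword: str, source_lang: str) -> list[str]:
    """Languages other than source_lang where the bare keyword is a common word."""
    token = keyword.strip().lower()
    return sorted(
        lang
        for lang, words in COMMON_WORDS.items()
        if lang != source_lang and token in words
    )


print(collisions("por", "lb"))
print(collisions("pura", "id"))
```

Any non-empty result is a reason to reject the bare keyword and look for a compound instead.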
**Classify by what the operator literally provides commercially, not by what its product touches.** Acronym-similar but commercially-distinct categories regularly tempt mis-grouping:
- `UCaaS` (Microsoft Teams / RingCentral / Zoom Phone) is voice-telephony-flavored SaaS. Borderline-ISP but the customer pays for the application, not for connectivity.
- `CCaaS` (Five9, Talkdesk, Genesys Cloud, NICE inContact) is **SaaS** — the product is call-center software (agent desktops, queues, IVR builders, ticket routing). Sold to enterprise IT teams running a customer-service operation. Not an ISP.
- `CPaaS` (Twilio, Sinch, MessageBird) is **PaaS / SaaS** — a developer API for programmable SMS / voice. Sold to developers, not to network buyers.
- Bare BPO contact centers (Concentrix, Teleperformance) are **Staffing / services** operations, not ISPs.
All four show up in pages that mention "voice", "telephony", "communications", "real-time" — but voice runs over the internet, and that's a transport medium, not an industry. The operator-typology test: *what does the customer pay this company for?* An ISP customer pays for **connectivity** (fiber, cable, wireless transit). A CCaaS customer pays for **call-routing software**. Different products, different categories. Don't cluster acronyms by their `-aaS` / `-cloud` / `-platform` suffix; cluster by the actual line item on the invoice.
The same rule applies broadly: a "managed services" company that resells AWS is **MSP**, not IaaS; a "fintech platform" that runs lending is **Finance**, not SaaS; a "media company" running a streaming app is **Entertainment**, not Tech. When a phrase has multiple plausible homes, pick the home that matches the operator's commercial role, and route the row to the category whose customers would recognize the company as theirs.
**Triage heuristics learned from the 78-row interactive review of PR #766's ambiguous bucket** — these are the rules a reviewer should apply when adjudicating each row in the `--ambiguous-out` worklist:
- **Pick the main-focus category** — what comes first / appears most in the title, not what's listed in passing. A Turin IT firm whose description starts "software development, web design, …, video-surveillance, hosting" is **Technology**, not Physical Security.
- **Clients are not operator typology.** Aramark serves "hospitals, universities, school districts, stadiums" — Aramark is **Food**, not Healthcare/Education. Draffin Tucker accounting "serves businesses, individuals, governments, non-profits, and healthcare providers" — Draffin Tucker is **Finance**, not Healthcare/Nonprofit. Loomis Armored serves "retailers, banks and the public sector" — Loomis is **Physical Security**, not Government/Finance/Retail. The rule is identical to the parking-page rule (the operator's identity is what they are, not what their clients are).
- **Vertically-specialized firms take the vertical, not the operator typology.** PRC is "Leading Healthcare Survey & Advisory Company" exclusively in healthcare → **Healthcare**, not Consulting. Vhi is Ireland's largest health insurer (only health insurance) → **Healthcare**, not Finance. Western Carriers is alcoholic-beverage-only logistics → **Food**, not Logistics. SportLevel is sports-data-only → **Sports**, not SaaS. The diagnostic: *does this firm do anything outside the listed vertical?* If no, use the vertical. If yes (e.g. Aramark serves multiple verticals), use the operator typology.
- **Stream-hosting infrastructure (audio/video) is Web Host, not Entertainment.** ScaleEngine's Canadian video CDN, Kinescope's video hosting platform, iCastCenter's SHOUTcast hosting, Teleport's P2P CDN for OTT — the operator sells *bandwidth/transcoding/storage*; the customer (broadcaster) sells the content. Same "what does the customer pay for" diagnostic as elsewhere.
- **Multi-service SMB IT shops are MSP.** Pattern: title leads with "IT services" or the local equivalent (`prestataire de services informatiques` / `usługi IT dla biznesu` / `penyedia solusi IT` / `IT-Dienstleister` / `serviços de TI gerenciados` / `infogérance`), with hosting, networking, voice, and physical-security install bundled. Datech (Poland), Gigantara (Indonesia), Hilltop (USA), iVenture (USA Florida), Marmites (France), Subset (UK), Treten (Nigeria), TheBits (USA Bellingham), Ukrinfosystems (Ukraine), Techexpert (international) all classified MSP. **Use MSP, not MSSP, when title leads with "IT Services" even if cybersecurity is one of the offerings — reserve MSSP for operators whose primary product is security.**
- **VARs (value-added resellers) are Technology.** A "Cisco Premier Partner" / "Microsoft Gold Partner" / hardware-and-services reseller with no managed-services book of business is Technology. The MSP/MSSP labels are reserved for operators selling ongoing managed services (subscription IT operations).
- **CCaaS / CPaaS / UCaaS are SaaS, not ISP.** Established earlier in this section but worth restating because four rows in the ambiguous bucket were variants of this (Evolve IP, mGage, Star2Star/Sangoma, Voximplant). The customer pays for software (call-routing, voice APIs, call-center desks), not connectivity.
- **`.gov.<cc>` / `.edu.<cc>` / `.mil.<cc>` / `.jus.<cc>` / `.k12.<state>.us` TLD signal trumps homepage noise.** A row whose homepage is Cloudflare-walled or DDoS-Guard-walled but whose TLD is restricted to government / education / military / judicial / K-12 should still classify on the TLD signal. The bot-block interstitial is *not* a parked page.
- **Esports tournament organizers are Entertainment, not Sports.** Sports is reserved for traditional athletic competitions, federations, and clubs.
- **Personal projects, homelabs, and CV pages go to KU.** A hobbyist's personal ASN ("personal BGP networking project, homelab insights"), a developer's portfolio site, an "About me" / CV page — these aren't commercial operators. The classifier filters them via `PERSONAL_PROJECT_RE`; reviewers reach the same conclusion.
- **Parked / default / placeholder / shutdown pages go to KU.** The Media Temple "automatically generated default server page", Hostinger Horizons placeholder, Apache default, parked-by-registrar pages, "site has shut down / has completed its journey" wind-down pages — none reveal the actual operator. The classifier filters these via `PARKED_PAGE_RE`. Cloudflare / DDoS-Guard / "Are you a robot?" interstitials, on the other hand, are *not* parked pages — see the TLD-signal rule above.
- **Adult / sexually-explicit content domains are dropped silently from both files.** Same as the existing content rule earlier in this file. The classifier filters these via `ADULT_CONTENT_RE` and emits them to `--dropped-out` for the caller to remove from KU.
- **Brand quality is its own dimension — capture it during triage.** Many ambiguous rows had a poor brand pulled from a tagline (`#1 Custom Software Development Company` instead of `3 Edge Software`, `H.S. Oberoi Buildtech|Best Builder in Gurgaon` instead of `H.S. Oberoi Buildtech`, `Original WEMPI` instead of `West Edmonton Mall`, the parent's `Bronco Wine Co` as_name when the operator is `Classic Wines + Spirits of California`). Note the correct brand in the decision log so it can be applied during the map append; don't ship the tagline-derived brand into the CSV.
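
The TLD-signal heuristic in the list above can be sketched like this. The generic `.<label>.<cc>` patterns are a simplification (the real classifier uses explicit suffix lists), and the category names and the `tld_signal` helper are assumptions.

```python
import re
from typing import Optional

# Simplified stand-ins for the explicit suffix lists in the classifier.
SUFFIX_TYPES = [
    (re.compile(r"\.k12\.[a-z]{2}\.us$"), "Education"),
    (re.compile(r"\.(?:edu|ac)\.[a-z]{2}$"), "Education"),
    (re.compile(r"\.mil\.[a-z]{2}$"), "Military"),
    (re.compile(r"\.(?:gov|jus|leg)\.[a-z]{2}$"), "Government"),
]


def tld_signal(domain: str) -> Optional[str]:
    """Classify on the restricted suffix alone, ignoring homepage text."""
    # The leading dot lets a bare suffix row such as "gov.kh" match too.
    host = "." + domain.strip().lower().rstrip(".")
    for pattern, category in SUFFIX_TYPES:
        if pattern.search(host):
            return category
    return None
```

Because the function never looks at the title or description, a Cloudflare or DDoS-Guard interstitial on the homepage has no effect on the result.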
**When a phrase is genuinely ambiguous between two distinct operator types, leave it out of both detectors.** "Energy management software / platform" is the canonical example: it appears equally on (a) a pure-play SaaS startup selling to utilities, (b) a Schneider Electric / Honeywell / Siemens product brochure where the operator is an Industrial conglomerate, and (c) a consultancy's white-paper page. The same regex hit means three different category answers, and a regex has no way to tell them apart. Don't classify those phrases at all — leave the row known-unknown for manual review, and rely on more-specific compounds (`renewable energy company`, `gas distribution`, `electrolyser` for Energy; `crm platform`, `bpm system`, `low-code platform` for SaaS) that pin operator typology directly. The defense isn't "pick the most likely category" — it's "skip the ambiguous phrase". A row left unmapped is recoverable; a row misattributed across operator categories is not.
- `detect_rebrands.py` — drift sweep that re-fetches every key in `base_reverse_dns_map.csv` with the same machinery as `collect_domain_info.py` and emits a TSV of rows where `rebrand_signal` or `redirect_changed` (final URL host doesn't sit under the input domain) fired. **Run once a year, not more often** — operator rebrands accumulate slowly and a yearly cadence is enough to keep the map current without spending review effort on near-empty diffs. Not part of the standard per-batch workflow. Output is for periodic review — a single signal is one corroborating source; promoting a flagged row still needs a second source per the two-corroborating-sources rule. Resume-safe via `-o`. Use `--limit N` to spot-check a slice; `--include-clean` to also emit non-flagged rows; `--flag-external-links` to additionally flag rows whose only signal is an outbound non-self host (off by default to keep partner/vendor noise out of the review queue).
- `find_bad_utf8.py` — locates invalid UTF-8 bytes (used after past encoding corruption).
- `sortlists.py` — case-insensitive sort + dedupe + `type`-column validator for the list files; the authoritative sorter run after every batch edit.