From 769b16bb03de27e16b7269195e50d98f28f0ea56 Mon Sep 17 00:00:00 2001
From: Sean Whalen <44679+seanthegeek@users.noreply.github.com>
Date: Thu, 7 May 2026 11:31:58 -0400
Subject: [PATCH] Drift-detect rebrands: tighten regex; promote 11 verified rebrand-aliased map keys (#753)
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

* Tighten rebrand regex to drop CTA, third-party-mention, and CSS-asset FPs

The first run of detect_rebrands.py against the live map surfaced
systemic false-positive categories that drowned the real signals.
Tightening over two rounds of FP triage:

REBRAND_RE — drop bare "now " and "joined the X" branches:

- "Buy Now PROMO", "Apply Now Who", "Order Now Free Shipping" — modern
  marketing pages saturate body text with CTA fragments and ~95% of
  bare "now " matches were these. Replaced with the linguistically
  meaningful pattern "(is|are|was|were|am) now (?:(?:a )?part of)?"
  which still catches "BankOnIT is now Navanta", "We are now Cencora",
  "is now part of Lumen", etc.
- "joined the Festo Certified System Integrator Program", "joined the
  ClimateCAP Initiative", "joined the Fredonia Women's Rugby team" —
  the "joined the X" pattern was too generic; real "joined the X
  family" rebrand banners are rare enough that dropping the branch is
  the right trade.

REBRAND_RE — add `\b` word boundary at the start so triggers don't
match mid-word: "Stre*am* now Mystery" was matching `am now ` because
the last two letters of "Stream" satisfied the verb alternation.

REBRAND_PATH_RE — drop bare `rebrand`, `name change`, `new name for`,
and `brand-update` / `brand-refresh` patterns. They appeared too often
as CSS class names (`class="rebrand-page"`), CSS variables
(`--rebrand-underline-color`), image filenames (`bms-rebrand-logo.svg`,
`brand-update.css`), and JSON/JS strings (`"name change"` user-account
labels). Adding `\b` boundaries doesn't help because dashes are
non-word characters.
The remaining narrow patterns (`brand-launch`, `brand-announcement`,
`brand-reveal`, `our-new-name`, `our-new-brand`,
`acquisition-announcement`, `merger-announcement`) still catch the
canonical bankonitusa.com case via its
`brand-launch-frequently-asked-questions` URL slug and
`Brand announcement` alt text.

_REBRAND_NOISE — make the comparison case-insensitive and add
"installed", "iso", "secure", "part" to suppress "is now ON" /
"is now LIVE" / "is now ISO 27001 certified" /
"is now Secure Managed Wi-Fi" / "is now Part of" patterns.
Twitter/Facebook/Square (the social-platform rebrand mentions in
footers like "X (formerly Twitter)") moved to lowercase since the
comparison is now case-insensitive.

Net effect on a full sweep over the ~13,100-key map: rebrand-signal
flagged-row count dropped from ~270 (initial run) to 108 (round-3),
clearing the dominant FP categories while every real signal — verified
against the bankonitusa.com canonical case plus 11 other actual
rebrands — still fires.

Co-Authored-By: Claude Opus 4.7 (1M context)

* Promote 11 verified rebrands found by drift sweep; alias 4 acquirer domains

Renames produced by `detect_rebrands.py` running against the full
~13,100-key map and verified by re-reading each operator's homepage.
Type column unchanged for every row — only the canonical `name` shifts
to the new operator. Where the new operator's primary domain wasn't
already in the map, a case-1 alias row is added pointing to the same
`(name, type)`.

Renames:

- amerisourcebergen.com: AMERISOURCEBERGEN → Cencora
- aurorahealthcare.org: Aurora Health Care → Advocate Health
- consolidated.com: Consolidated Communications → Fidium Fiber
- databridgesites.com: Meridian Parkway Data Center Owner → TierPoint
- emarsys.com: SAP Emarsys → SAP Engagement Cloud
- rig.net: RigNet → Viasat
- rxlightning.com: RxLightning → CoverMyMeds
- telepoint.bg: Telepoint → Digital Realty
- thehostgroup.com: The Host Group → HostGo
- ultisat.com: Globecomm Services Maryland → UltiSat
- unifiedpostgroup.com: Unifiedpost Group → Banqup

New aliases (operator's primary domain not previously mapped):

- cencora.com → Cencora, Healthcare
- advocatehealth.com → Advocate Health, Healthcare
- covermymeds.com → CoverMyMeds, Healthcare
- banqup.com → Banqup, SaaS

Five sweep hits intentionally deferred for lack of a clear second
source: megatel.co.nz → Nova (`nova.co.nz` is for sale via a domain
broker; unclear which Nova entity), pogozone.com → NeuBeam (NeuBeam's
homepage doesn't acknowledge the PogoZone acquisition), prempub.com →
Ingenious Media (ingeniousmedia.com fetch failed), voltagepark.com → ?
(merger with Lightning AI rather than a clean rebrand), and a handful
of more ambiguous Synopsys/Ansys/OmniAccess/Rakuten/Indigital/Synthite
signals that need manual research.

Co-Authored-By: Claude Opus 4.7 (1M context)

* Document detect_rebrands.py cadence as run-once-a-year

The drift sweep is for catching operator rebrands and acquisitions that
accumulated since the previous run; M&A activity over the mapped
operator set is slow enough that yearly is sufficient. Annotate the
script's own docstring, the maps README, and the AGENTS.md "Related
utility scripts" entry so a future contributor doesn't mistake it for a
per-batch step.
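
Illustrative sketch of the regex tightening described in the first
bullet above. `REBRAND_RE` is copied from the diff in this patch.
Everything else is an assumption made for illustration: `NOISE` is a
small subset of `_REBRAND_NOISE`, `UNANCHORED_RE` is a hypothetical
reconstruction of the intermediate pattern without the leading `\b`,
and `captured_brands` is a hypothetical helper that mirrors the
filtering in `_rebrand_signal`, not the function itself.

```python
import re

# Copied from the patched collect_domain_info.py.
REBRAND_RE = re.compile(
    r"\b(?:"
    r"formerly(?: known as)? "
    r"|"
    r"(?:is|are|was|were|am) now (?:(?:a )?part of )?"
    r"|"
    r"(?:we became|rebranded(?: as| to)?|merged with|"
    r"acquired by|previously known as|previously operated as|"
    r"new name for) "
    r")"
    r"([A-Za-z][A-Za-z0-9&]+)\b",
    re.IGNORECASE,
)

# Hypothetical variant without the leading \b, shown only to reproduce
# the mid-word "Stre*am* now" match mentioned in the commit message.
UNANCHORED_RE = re.compile(
    r"(?:is|are|was|were|am) now ([A-Za-z][A-Za-z0-9&]+)",
    re.IGNORECASE,
)

# Illustrative subset of _REBRAND_NOISE (lowercase, compared case-insensitively).
NOISE = frozenset({"live", "iso", "secure", "part", "twitter"})


def captured_brands(text: str) -> list[str]:
    """Return captured brand tokens that survive the noise filter."""
    brands = []
    for m in REBRAND_RE.finditer(text):
        brand = m.group(1)
        if not brand[0].isupper():
            continue  # lowercase capture such as "is now hiring"
        if brand.lower() in NOISE:
            continue  # "is now LIVE", "(formerly Twitter)", ...
        brands.append(brand)
    return brands


if __name__ == "__main__":
    for sample in (
        "BankOnIT is now Navanta",
        "CenturyLink is now part of Lumen",
        "Buy Now PROMO",
        "Stream now Mystery",
        "Our status page is now LIVE",
        "Follow us on X (formerly Twitter)",
    ):
        print(f"{sample!r}: {captured_brands(sample)}")
```

With the word boundary in place, "Stream now Mystery" yields no capture,
while `UNANCHORED_RE.search("Stream now Mystery")` still matches on the
trailing "am" of "Stream".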

Co-Authored-By: Claude Opus 4.7 (1M context)

---------

Co-authored-by: Sean Whalen
Co-authored-by: Claude Opus 4.7 (1M context)
---
 AGENTS.md                                    |   2 +-
 parsedmarc/resources/maps/README.md          |   2 +
 .../resources/maps/base_reverse_dns_map.csv  |  26 ++--
 .../resources/maps/collect_domain_info.py    | 121 ++++++++++++------
 parsedmarc/resources/maps/detect_rebrands.py |   7 +
 5 files changed, 104 insertions(+), 54 deletions(-)

diff --git a/AGENTS.md b/AGENTS.md
index 9ed8900..06d43bb 100644
--- a/AGENTS.md
+++ b/AGENTS.md
@@ -225,7 +225,7 @@ When `unknown_base_reverse_dns.csv` has new entries, follow this order rather th
 - `find_unknown_base_reverse_dns.py` — regenerates `unknown_base_reverse_dns.csv` from `base_reverse_dns.csv` by subtracting what is already mapped or known-unknown. Enforces the no-full-IP privacy rule at ingest. Translates non-domain-shaped `source_name` rows (raw MMDB `as_name` strings surfaced by the ASN-fallback path in `utils.py:get_ip_address_info` when the IP had no PTR and the `as_domain` was uncategorized) to their corresponding `as_domain` via the bundled MMDB, so the row enters the pipeline as a researchable domain (and drops out automatically if that `as_domain` is already mapped). Run after merging a batch.
 - `detect_psl_overrides.py` — scans the lists for clustered IP-containing patterns, auto-adds brand suffixes to `psl_overrides.txt`, folds affected entries to their base, and removes any remaining full-IP entries. Run before the collector on any new batch.
 - `collect_domain_info.py` — the bulk enrichment collector described above. Respects `psl_overrides.txt` and skips full-IP entries. Two derived columns surface drift signals that are also useful during initial classification: `rebrand_signal` combines a body-text regex (matches "now X", "formerly known as X", "is now part of X", etc.) with a path/alt-text regex (matches "rebrand", "brand-launch", "brand-announcement", "name-change", "our-new-name") so that image-only acquisition banners — `Brand announcement` — also fire. `external_links` lists the homepage's non-self, non-social outbound link hosts; useful as review context but not a flag trigger by default in the drift sweep (most external links are to partners / customers / vendors and don't indicate a rebrand).
-- `detect_rebrands.py` — drift sweep that re-fetches every key in `base_reverse_dns_map.csv` with the same machinery as `collect_domain_info.py` and emits a TSV of rows where `rebrand_signal` or `redirect_changed` (final URL host doesn't sit under the input domain) fired. Output is for periodic review — a single signal is one corroborating source; promoting a flagged row still needs a second source per the two-corroborating-sources rule. Resume-safe via `-o`. Use `--limit N` to spot-check a slice; `--include-clean` to also emit non-flagged rows; `--flag-external-links` to additionally flag rows whose only signal is an outbound non-self host (off by default to keep partner/vendor noise out of the review queue).
+- `detect_rebrands.py` — drift sweep that re-fetches every key in `base_reverse_dns_map.csv` with the same machinery as `collect_domain_info.py` and emits a TSV of rows where `rebrand_signal` or `redirect_changed` (final URL host doesn't sit under the input domain) fired. **Run once a year, not more often** — operator rebrands accumulate slowly and a yearly cadence is enough to keep the map current without spending review effort on near-empty diffs. Not part of the standard per-batch workflow. Output is for periodic review — a single signal is one corroborating source; promoting a flagged row still needs a second source per the two-corroborating-sources rule. Resume-safe via `-o`. Use `--limit N` to spot-check a slice; `--include-clean` to also emit non-flagged rows; `--flag-external-links` to additionally flag rows whose only signal is an outbound non-self host (off by default to keep partner/vendor noise out of the review queue).
 - `find_bad_utf8.py` — locates invalid UTF-8 bytes (used after past encoding corruption).
 - `sortlists.py` — case-insensitive sort + dedupe + `type`-column validator for the list files; the authoritative sorter run after every batch edit.
diff --git a/parsedmarc/resources/maps/README.md b/parsedmarc/resources/maps/README.md
index 3fbe31f..8b8c043 100644
--- a/parsedmarc/resources/maps/README.md
+++ b/parsedmarc/resources/maps/README.md
@@ -140,6 +140,8 @@ The output of `collect_domain_info.py`. Tab-separated, one row per researched do
 
 ## detect_rebrands.py
 
+**Cadence: run roughly once a year.** Not part of the standard mapping workflow — operator rebrands and acquisitions accumulate slowly, and a yearly sweep is sufficient to keep `base_reverse_dns_map.csv` from drifting out of date. There is no benefit to running it more often.
+
 Drift sweep that re-fetches every key in `base_reverse_dns_map.csv` with the same machinery as `collect_domain_info.py` and writes a TSV (`rebrand_drift.tsv` by default) of rows where a drift signal fired. Two signals are flagged by default:
 
 - `rebrand_signal` — the collector's body-text and path/alt-text regexes (see above) matched.
diff --git a/parsedmarc/resources/maps/base_reverse_dns_map.csv b/parsedmarc/resources/maps/base_reverse_dns_map.csv
index 4d6305d..ff59362 100644
--- a/parsedmarc/resources/maps/base_reverse_dns_map.csv
+++ b/parsedmarc/resources/maps/base_reverse_dns_map.csv
@@ -290,6 +290,7 @@ advania.no,Advania Norway,MSP
 adventconstructions.co.tz,Advent Construction Limited,Industrial
 adventhealth.com,AdventHealth,Healthcare
 adventisthealth.org,Adventist Health,Healthcare
+advocatehealth.com,Advocate Health,Healthcare
 adyen.com,Adyen,Finance
 adyl.com.br,Adyl Telecom,ISP
 ae.com.br,Agência Estado,News
@@ -575,7 +576,7 @@ americantower.com,American Tower,Technology
 americantower.com.br,American Tower,Technology
 americatelnet.com.pe,Americatel Peru,ISP
 amerinoc.com,AmeriNOC,Web Host
-amerisourcebergen.com,AMERISOURCEBERGEN,Healthcare
+amerisourcebergen.com,Cencora,Healthcare
 ameslab.gov,Ames Laboratory,Government
 amethyst.co.jp,Amethyst,Healthcare
 amfam.com,American Family Insurance,Finance
@@ -918,7 +919,7 @@ auriga.com,Auriga,Technology
 auriganet.in,Auriganet Digital Technologies,ISP
 auris.com,Auris,SaaS
 aurologic.com,aurologic,Web Host
-aurorahealthcare.org,Aurora Health Care,Healthcare
+aurorahealthcare.org,Advocate Health,Healthcare
 ausgrid.com.au,Ausgrid,Utilities
 auspost.com.au,Australia Post,Logistics
 aussiebb.com.au,Aussie Broadband,ISP
@@ -1051,6 +1052,7 @@ bank-banque-canada.ca,Bank of Canada,Government
 bank-verlag.de,Bank-Verlag,Finance
 bankofamerica.com,Bank of America,Finance
 bankonitusa.com,Navanta,MSP
+banqup.com,Banqup,SaaS
 banxico.org.mx,Banco de Mexico,Government
 barak-online.net,Netvision,ISP
 barcconnects.net,BARC Connects,ISP
@@ -1760,6 +1762,7 @@ cello.co.nz,Cello Group,ISP
 celsiainternet.com,Celsia Internet,ISP
 celya.fr,Celya (Carrefour),ISP
 cencominc.com,Cencom,ISP
+cencora.com,Cencora,Healthcare
 cenet.catholic.edu.au,CEnet Catholic Education Network,Education
 cengagebrain.com,Cengage Learning,Education
 cenic.org,CENIC,Nonprofit
@@ -2292,7 +2295,7 @@ conrad.nyc,ConradIT,MSP
 consideredcreative.com,Considered Creative,Marketing
 consilio.com,Consilio,Legal
 consol.com,Consolidated Edison (ConEd),Utilities
-consolidated.com,Consolidated Communications,ISP
+consolidated.com,Fidium Fiber,ISP
 consolidated.coop,Consolidated,Utilities
 consolidatedlabel.com,Consolidated Label,Print
 consolidatednd.com,Consolidated Telcom,ISP
@@ -2399,6 +2402,7 @@ countyofriverside.us,County of Riverside,Government
 courierplus.net,Courier Plus,Logistics
 covage.com,Covage,ISP
 covenantuniversity.edu.ng,Covenant University,Education
+covermymeds.com,CoverMyMeds,Healthcare
 cox.com,Cox Communications,ISP
 cox.net,Cox Communications,ISP
 coxenterprises.com,Cox Enterprises,Conglomerate
@@ -2642,7 +2646,7 @@ data.cr,American Data Networks Costa Rica,ISP
 data102.com,Hivelocity (Data102),Web Host
 data443.com,Data443,Email Security
 databank.com,DataBank,Web Host
-databridgesites.com,Meridian Parkway Data Center Owner,Web Host
+databridgesites.com,TierPoint,Web Host
 datacamp.co.uk,CDN77,Web Host
 datacanopy.com,Data Canopy Colocation,IaaS
 datacate.com,Datacate,Web Host
@@ -3355,7 +3359,7 @@ emailsecurity.app,Mesh Security,Email Security
 emailservice.io,Mailprotector,Email Security
 emailsrv.net,Mailprotector,Email Security
 emailsrvr.com,Rackspace Email,Email Security
-emarsys.com,SAP Emarsys,Marketing
+emarsys.com,SAP Engagement Cloud,Marketing
 emberpoint.com,Ember Point,SaaS
 embou.com,Embou,ISP
 embrapa.br,Embrapa Brazilian Agricultural Research Corporation,Government
@@ -9410,7 +9414,7 @@ richmond.edu,University of Richmond,Education
 richmondfed.org,Federal Reserve Bank of Richmond,Government
 ricoh-usa.com,Ricoh USA,Manufacturing
 ridsa.com.ar,Red Intercable Digital,ISP
-rig.net,RigNet,ISP
+rig.net,Viasat,ISP
 rightel.ir,Rightel,ISP
 rightnowtech.com,Oracle Service Cloud,SaaS
 rightside.ru,Telecoma,ISP
@@ -9552,7 +9556,7 @@ rvurology.com,Rogue Valley Urology,Healthcare
 rwas.co.uk,Royal Welsh Agricultural Society,Agriculture
 rwth-aachen.de,RWTH Aachen University,Education
 rwts.com.au,Real World Technology Solutions,MSP
-rxlightning.com,RxLightning,Healthcare
+rxlightning.com,CoverMyMeds,Healthcare
 rybnet.pl,Rybnet,ISP
 ryoka.co.jp,Ryoka Denka Kasei,Manufacturing
 rzd.ru,Russian Railways,Logistics
@@ -11041,7 +11045,7 @@ telepacific.net,TPx Communications,ISP
 telepark-passau.de,Telepark Passau,ISP
 teleperformance.com,Teleperformance,SaaS
 telepermit.co.nz,Spark NZ,ISP
-telepoint.bg,Telepoint,Web Host
+telepoint.bg,Digital Realty,Web Host
 telered.com.ar,Telered,ISP
 telering.at,Magenta,ISP
 telesat.com,Telesat,ISP
@@ -11179,7 +11183,7 @@ thegirlandthefig.com,the girl & the fig,Food
 theglobalresearchnetwork.com,Global Healthcare Research,Healthcare
 thehartford.com,HARTFORD FIRE INSURANCE,Finance
 thehost.ua,TheHost Ukraine,Web Host
-thehostgroup.com,The Host Group,Web Host
+thehostgroup.com,HostGo,Web Host
 theice.com,Intercontinental Exchange (ICE),Finance
 theinternetsubway.us,The Internet Subway,ISP
 themercury.com,The Mercury,News
@@ -11759,7 +11763,7 @@ ultahost.com,UltaHost,Web Host
 ultel.net,Ultel,ISP
 ultimate-guitar.com,Ultimate Guitar,Entertainment
 ultimatedomain.hosting,Ultimate Domain Hosting,Web Host
-ultisat.com,Globecomm Services Maryland,MSP
+ultisat.com,UltiSat,MSP
 ultra.one,UltraOne,ISP
 ultralinkce.com.br,Ultralink,ISP
 ultralinkweb.com.br,Ultralink Telecom,ISP
@@ -11845,7 +11849,7 @@ unidadeditorial.es,Unidad Editorial,News
 unidata.it,Unidata,ISP
 unifesp.br,Universidade Federal de Sao Paulo,Education
 unifiedlayer.com,UnifiedLayers,Web Host
-unifiedpostgroup.com,Unifiedpost Group,SaaS
+unifiedpostgroup.com,Banqup,SaaS
 unifique.com.br,Unifique,ISP
 unifique.net,Unifique,ISP
 unijos.edu.ng,University of Jos,Education
diff --git a/parsedmarc/resources/maps/collect_domain_info.py b/parsedmarc/resources/maps/collect_domain_info.py
index 6b69c52..8bd5057 100644
--- a/parsedmarc/resources/maps/collect_domain_info.py
+++ b/parsedmarc/resources/maps/collect_domain_info.py
@@ -158,17 +158,30 @@ _FULL_IP_RE = re.compile(
 # Rebrand-signal scan. Triggered phrases are followed by a captured brand name
 # (capitalized, non-noise word). The reviewer ultimately judges whether a hit
 # is a real rebrand banner — the regex's job is to not miss the obvious ones.
-# Real cases: "now Navanta", "is now part of Lumen", "formerly known as
-# Symantec Email Security", "we became Newfold Digital".
+# Real cases: "BankOnIT is now Navanta", "is now part of Lumen", "we are now
+# Cencora", "formerly known as Symantec Email Security", "we became Newfold
+# Digital".
+#
+# A bare leading "now " was tried and dropped — modern marketing
+# pages saturate the body text with CTA fragments like "Buy Now PROMO",
+# "Order Now Free Shipping", "Apply Now Who We Are", which all match a bare
+# `now ` and are 95%+ false positives. Requiring a copular verb
+# (`is/are/was/were/am now`) keeps the linguistic shape of an actual
+# announcement and rules out CTA buttons. The same is true in reverse for
+# bare "formerly " — kept because "formerly" virtually never
+# appears in a CTA context, but the same noise list catches the residual
+# "Formerly Available" / "Formerly Open" cases.
 REBRAND_RE = re.compile(
-    r"(?:"
-    r"(?:now|formerly(?: known as)?) "
+    r"\b(?:"
+    r"formerly(?: known as)? "
+    r"|"
+    r"(?:is|are|was|were|am) now (?:(?:a )?part of )?"
     r"|"
     r"(?:we became|rebranded(?: as| to)?|merged with|"
     r"acquired by|previously known as|previously operated as|"
-    r"is now (?:a )?part of|new name for|joined the) "
+    r"new name for) "
     r")"
-    r"([A-Za-z][A-Za-z0-9&]+)",
+    r"([A-Za-z][A-Za-z0-9&]+)\b",
     re.IGNORECASE,
 )
 
@@ -182,14 +195,11 @@ REBRAND_RE = re.compile(
 # change / etc. by a space, dash, or underscore, which virtually never
 # occurs outside a rebrand context.
 REBRAND_PATH_RE = re.compile(
-    r"(?:"
-    r"rebrand"
-    r"|brand[ _-](?:launch|announcement|reveal|refresh|change|update)"
-    r"|name[ _-]change"
+    r"\b(?:"
+    r"brand[ _-](?:launch|announcement|reveal)"
     r"|our[ _-]new[ _-](?:name|brand)"
-    r"|new[ _-]name[ _-]for"
     r"|(?:acquisition|merger)[ _-]announcement"
-    r")",
+    r")\b",
     re.IGNORECASE,
 )
 
@@ -199,35 +209,62 @@ REBRAND_PATH_RE = re.compile(
 # narrow so real one-word brand names (Navanta, Lumen, Sykt, etc.) survive.
 _REBRAND_NOISE = frozenset(
     {
-        "Available",
-        "Accepting",
-        "Active",
-        "Booking",
-        "Closed",
-        "Complete",
-        "Enrolling",
-        "Expanding",
-        "Free",
-        "Hiring",
-        "Live",
-        "Loading",
-        "Offering",
-        "Online",
-        "Open",
-        "Operating",
-        "Pending",
-        "Playing",
-        "Powered",
-        "Selling",
-        "Serving",
-        "Shipping",
-        "Showing",
-        "Streaming",
-        "Supporting",
-        "Trending",
-        "Underway",
-        "You",
-        "Your",
+        # Past-participles / present-participles that the "are now "
+        # / "is now " pattern picks up from ordinary marketing prose.
+        # Compared case-insensitively against the captured brand, so a
+        # single entry covers any casing the page emits ("LIVE", "Live",
+        # "live"). Add lowercase forms here.
+        "available",
+        "accepting",
+        "active",
+        "booking",
+        "closed",
+        "complete",
+        "enrolling",
+        "expanding",
+        "free",
+        "hiring",
+        "installed",
+        "live",
+        "loading",
+        "offering",
+        "online",
+        "open",
+        "operating",
+        "part",  # "is now Part of [our family]" already filtered by structure;
+        # this catches inverted phrasing where "Part" is the captured token.
+        "pending",
+        "playing",
+        "powered",
+        "secure",  # "is now Secure Managed Wi-Fi" / "is now Secure Login"
+        "selling",
+        "serving",
+        "shipping",
+        "showing",
+        "streaming",
+        "supporting",
+        "trending",
+        "underway",
+        # Short prepositions / pronouns that grammatically follow the verb
+        # but are not brand names: "are now In Control", "is now On the air".
+        "down",
+        "in",
+        "off",
+        "on",
+        "out",
+        "up",
+        "you",
+        "your",
+        # Standards / certifications that follow "is now certified"
+        # in marketing copy (compliance announcements).
+        "iso",
+        # Social-media platform rebrands that ubiquitously appear in
+        # footers as "X (formerly Twitter)", "Meta (formerly Facebook)",
+        # "Block (formerly Square)". The mention is real but it's almost
+        # never about the page operator's own rebrand.
+        "twitter",
+        "facebook",
+        "square",
     }
 )
 
@@ -349,7 +386,7 @@ def _rebrand_signal(*texts: str) -> str:
             # post-trigger noise like "now hiring" / "formerly available".
             if not brand or not brand[0].isupper():
                 continue
-            if brand in _REBRAND_NOISE:
+            if brand.lower() in _REBRAND_NOISE:
                 continue
             start = max(0, m.start() - 30)
             end = min(len(text), m.end() + 80)
diff --git a/parsedmarc/resources/maps/detect_rebrands.py b/parsedmarc/resources/maps/detect_rebrands.py
index 35215ee..5461cbe 100644
--- a/parsedmarc/resources/maps/detect_rebrands.py
+++ b/parsedmarc/resources/maps/detect_rebrands.py
@@ -1,6 +1,13 @@
 #!/usr/bin/env python
 """Re-fetch mapped reverse-DNS base domains and surface possible rebrand signals.
 
+Cadence: run roughly once a year. Operator rebrands and acquisitions
+accumulate slowly, and a yearly sweep is sufficient to keep the map current
+without spending review effort on near-empty diffs. This is not part of the
+standard per-batch mapping workflow — that workflow uses the related
+`collect_domain_info.py` for unmapped domains. Use this script when you want
+to revisit the *already-mapped* set for drift.
+
 Walks `base_reverse_dns_map.csv`, fetches each domain's homepage with the
 same machinery used by `collect_domain_info.py`, and writes a TSV listing
 rows where one of two default drift signals fired: