Drift-detect rebrands: tighten regex; promote 11 verified rebrand-aliased map keys (#753)

* Tighten rebrand regex to drop CTA, third-party-mention, and CSS-asset FPs

The first run of detect_rebrands.py against the live map surfaced systemic
false-positive categories that drowned out the real signals. The regexes and
noise list were tightened over two rounds of FP triage:

REBRAND_RE — drop bare "now <Cap>" and "joined the X" branches:

- "Buy Now PROMO", "Apply Now Who", "Order Now Free Shipping" — modern
  marketing pages saturate body text with CTA fragments and ~95% of bare
  "now <Capital>" matches were these. Replaced with the linguistically
  meaningful pattern "(?:is|are|was|were|am) now (?:(?:a )?part of )?" which
  still catches "BankOnIT is now Navanta", "We are now Cencora",
  "is now part of Lumen", etc.
- "joined the Festo Certified System Integrator Program", "joined the
  ClimateCAP Initiative", "joined the Fredonia Women's Rugby team" — the
  "joined the X" pattern was too generic; real "joined the X family"
  rebrand banners are rare enough that dropping the branch is the right
  trade.

REBRAND_RE — add `\b` word boundary at the start so triggers don't match
mid-word: "Stre*am* now Mystery" was matching `am now <Cap>` because the
last two letters of "Stream" satisfied the verb alternation.
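Both REBRAND_RE changes can be checked with a short sketch. The patterns
below are trimmed stand-ins: the names `OLD_RE`, `NEW_RE`, `NO_BOUNDARY_RE`
and the `captured` helper are assumptions for illustration, and the shipped
pattern carries more trigger branches than shown here.

```python
import re

# Trimmed stand-in for the old pattern: bare "now <Cap>" / "formerly <Cap>".
OLD_RE = re.compile(
    r"(?:now|formerly(?: known as)?) ([A-Za-z][A-Za-z0-9&]+)",
    re.IGNORECASE,
)

# Trimmed stand-in for the tightened pattern: copular verb required,
# word boundary at both ends.
NEW_RE = re.compile(
    r"\b(?:"
    r"formerly(?: known as)? "
    r"|"
    r"(?:is|are|was|were|am) now (?:(?:a )?part of )?"
    r")"
    r"([A-Za-z][A-Za-z0-9&]+)\b",
    re.IGNORECASE,
)

# Same verb alternation without the leading \b, to show the mid-word match.
NO_BOUNDARY_RE = re.compile(
    r"(?:is|are|was|were|am) now ([A-Za-z][A-Za-z0-9&]+)",
    re.IGNORECASE,
)


def captured(pattern, text):
    """Return the captured brand token, or None when nothing matches."""
    m = pattern.search(text)
    return m.group(1) if m else None


for text in ("Buy Now PROMO", "BankOnIT is now Navanta",
             "Acme is now part of Lumen", "Stream now Mystery"):
    print(f"{text!r}: old={captured(OLD_RE, text)} new={captured(NEW_RE, text)}")
```

The "Stream now Mystery" case only fires on `NO_BOUNDARY_RE`, because the
leading `\b` cannot sit between the "e" and "a" of "Stream".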

REBRAND_PATH_RE — drop bare `rebrand`, `name change`, `new name for`, and
`brand-update` / `brand-refresh` patterns. They appeared too often as CSS
class names (`class="rebrand-page"`), CSS variables
(`--rebrand-underline-color`), image filenames (`bms-rebrand-logo.svg`,
`brand-update.css`), and JSON/JS strings (`"name change"` user-account
labels). Adding `\b` boundaries doesn't help because dashes are non-word
characters. The remaining narrow patterns (`brand-launch`,
`brand-announcement`, `brand-reveal`, `our-new-name`, `our-new-brand`,
`acquisition-announcement`, `merger-announcement`) still catch the
canonical bankonitusa.com case via its `brand-launch-frequently-asked-
questions` URL slug and `Brand announcement` alt text.
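A small check of why `\b` alone could not rescue the bare `rebrand` branch,
and what the narrowed pattern still catches. `OLD_PATH_RE` and `NEW_PATH_RE`
mirror the before/after patterns in this commit's diff; `BOUNDED_REBRAND_RE`
and the two sample lists are assumptions made up for this illustration.

```python
import re

# Pattern before the change.
OLD_PATH_RE = re.compile(
    r"(?:"
    r"rebrand"
    r"|brand[ _-](?:launch|announcement|reveal|refresh|change|update)"
    r"|name[ _-]change"
    r"|our[ _-]new[ _-](?:name|brand)"
    r"|new[ _-]name[ _-]for"
    r"|(?:acquisition|merger)[ _-]announcement"
    r")",
    re.IGNORECASE,
)

# Pattern after the change.
NEW_PATH_RE = re.compile(
    r"\b(?:"
    r"brand[ _-](?:launch|announcement|reveal)"
    r"|our[ _-]new[ _-](?:name|brand)"
    r"|(?:acquisition|merger)[ _-]announcement"
    r")\b",
    re.IGNORECASE,
)

# Hypothetical alternative: keep bare "rebrand" but wrap it in \b.
BOUNDED_REBRAND_RE = re.compile(r"\brebrand\b", re.IGNORECASE)

ASSET_NOISE = [
    'class="rebrand-page"',
    "--rebrand-underline-color",
    "bms-rebrand-logo.svg",
    "brand-update.css",
    '"name change"',
]
REAL_SIGNALS = [
    "/brand-launch-frequently-asked-questions",
    "Brand announcement",
]

old_hits = [s for s in ASSET_NOISE if OLD_PATH_RE.search(s)]
bounded_hits = [s for s in ASSET_NOISE if BOUNDED_REBRAND_RE.search(s)]
new_hits = [s for s in ASSET_NOISE if NEW_PATH_RE.search(s)]
real_hits = [s for s in REAL_SIGNALS if NEW_PATH_RE.search(s)]

print(len(old_hits), len(bounded_hits), len(new_hits), len(real_hits))
```

Dashes and quotes are non-word characters, so `\brebrand\b` still matches
inside `rebrand-page` and `--rebrand-underline-color`; only dropping the
branch removes those hits.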

_REBRAND_NOISE — make the comparison case-insensitive and add
"installed", "iso", "secure", "part", and short prepositions such as "on"
to suppress "is now ON" / "is now LIVE" / "is now ISO 27001 certified" /
"is now Secure Managed Wi-Fi" /
"is now Part of" patterns. Twitter/Facebook/Square (the social-platform
rebrand mentions in footers like "X (formerly Twitter)") moved to
lowercase since the comparison is now case-insensitive.
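The filter can be sketched as follows. The `_NOISE` set is a small excerpt
of the real list, and `keep_brand` is a hypothetical helper name; the two
checks mirror the capitalization test and the lowercased membership test in
`_rebrand_signal`.

```python
# Excerpt of the noise list, stored lowercase (assumption: the real set is larger).
_NOISE = frozenset({"live", "on", "iso", "secure", "part", "twitter"})


def keep_brand(brand: str) -> bool:
    """Return True when a captured token should be reported as a brand."""
    # A brand candidate must start with a capital letter.
    if not brand or not brand[0].isupper():
        return False
    # One lowercase entry covers every casing the page emits.
    return brand.lower() not in _NOISE


for token in ("Navanta", "LIVE", "Live", "ON", "ISO", "Twitter", "hiring"):
    print(token, keep_brand(token))
```

Storing entries in lowercase means "LIVE", "Live", and "live" need only a
single entry, which is why the list no longer carries capitalized forms.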

Net effect on a full sweep over the ~13,100-key map: rebrand-signal
flagged-row count dropped from ~270 (initial run) to 108 (round-3),
clearing the dominant FP categories while every real signal — verified
against the bankonitusa.com canonical case plus 11 other actual
rebrands — still fires.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* Promote 11 verified rebrands found by drift sweep; alias 4 acquirer domains

Renames produced by `detect_rebrands.py` running against the full ~13,100-key
map and verified by re-reading each operator's homepage. Type column
unchanged for every row — only the canonical `name` shifts to the new
operator. Where the new operator's primary domain wasn't already in the map,
a case-1 alias row is added pointing to the same `(name, type)`.

Renames:

- amerisourcebergen.com: AMERISOURCEBERGEN → Cencora
- aurorahealthcare.org: Aurora Health Care → Advocate Health
- consolidated.com: Consolidated Communications → Fidium Fiber
- databridgesites.com: Meridian Parkway Data Center Owner → TierPoint
- emarsys.com: SAP Emarsys → SAP Engagement Cloud
- rig.net: RigNet → Viasat
- rxlightning.com: RxLightning → CoverMyMeds
- telepoint.bg: Telepoint → Digital Realty
- thehostgroup.com: The Host Group → HostGo
- ultisat.com: Globecomm Services Maryland → UltiSat
- unifiedpostgroup.com: Unifiedpost Group → Banqup

New aliases (operator's primary domain not previously mapped):

- cencora.com → Cencora, Healthcare
- advocatehealth.com → Advocate Health, Healthcare
- covermymeds.com → CoverMyMeds, Healthcare
- banqup.com → Banqup, SaaS

Five sweep hits intentionally deferred for lack of a clear second source:
megatel.co.nz → Nova (`nova.co.nz` is for sale via a domain broker;
unclear which Nova entity), pogozone.com → NeuBeam (NeuBeam's homepage
doesn't acknowledge the PogoZone acquisition), prempub.com → Ingenious
Media (ingeniousmedia.com fetch failed), voltagepark.com → ? (merger
with Lightning AI rather than a clean rebrand), and a handful of more
ambiguous Synopsys/Ansys/OmniAccess/Rakuten/Indigital/Synthite signals
that need manual research.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* Document detect_rebrands.py cadence as run-once-a-year

The drift sweep is for catching operator rebrands and acquisitions that
accumulated since the previous run; M&A activity over the mapped operator
set is slow enough that yearly is sufficient. Annotate the script's own
docstring, the maps README, and the AGENTS.md "Related utility scripts"
entry so a future contributor doesn't mistake it for a per-batch step.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Sean Whalen <seanthegeek@users.noreply.github.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Sean Whalen
2026-05-07 11:31:58 -04:00
committed by GitHub
parent c752e776de
commit 769b16bb03
5 changed files with 104 additions and 54 deletions
+1 -1
@@ -225,7 +225,7 @@ When `unknown_base_reverse_dns.csv` has new entries, follow this order rather th
 - `find_unknown_base_reverse_dns.py` — regenerates `unknown_base_reverse_dns.csv` from `base_reverse_dns.csv` by subtracting what is already mapped or known-unknown. Enforces the no-full-IP privacy rule at ingest. Translates non-domain-shaped `source_name` rows (raw MMDB `as_name` strings surfaced by the ASN-fallback path in `utils.py:get_ip_address_info` when the IP had no PTR and the `as_domain` was uncategorized) to their corresponding `as_domain` via the bundled MMDB, so the row enters the pipeline as a researchable domain (and drops out automatically if that `as_domain` is already mapped). Run after merging a batch.
 - `detect_psl_overrides.py` — scans the lists for clustered IP-containing patterns, auto-adds brand suffixes to `psl_overrides.txt`, folds affected entries to their base, and removes any remaining full-IP entries. Run before the collector on any new batch.
 - `collect_domain_info.py` — the bulk enrichment collector described above. Respects `psl_overrides.txt` and skips full-IP entries. Two derived columns surface drift signals that are also useful during initial classification: `rebrand_signal` combines a body-text regex (matches "now X", "formerly known as X", "is now part of X", etc.) with a path/alt-text regex (matches "rebrand", "brand-launch", "brand-announcement", "name-change", "our-new-name") so that image-only acquisition banners — `<a href="…/brand-launch-…"><img alt="Brand announcement"></a>` — also fire. `external_links` lists the homepage's non-self, non-social outbound link hosts; useful as review context but not a flag trigger by default in the drift sweep (most external links are to partners / customers / vendors and don't indicate a rebrand).
-- `detect_rebrands.py` — drift sweep that re-fetches every key in `base_reverse_dns_map.csv` with the same machinery as `collect_domain_info.py` and emits a TSV of rows where `rebrand_signal` or `redirect_changed` (final URL host doesn't sit under the input domain) fired. Output is for periodic review — a single signal is one corroborating source; promoting a flagged row still needs a second source per the two-corroborating-sources rule. Resume-safe via `-o`. Use `--limit N` to spot-check a slice; `--include-clean` to also emit non-flagged rows; `--flag-external-links` to additionally flag rows whose only signal is an outbound non-self host (off by default to keep partner/vendor noise out of the review queue).
+- `detect_rebrands.py` — drift sweep that re-fetches every key in `base_reverse_dns_map.csv` with the same machinery as `collect_domain_info.py` and emits a TSV of rows where `rebrand_signal` or `redirect_changed` (final URL host doesn't sit under the input domain) fired. **Run once a year, not more often** — operator rebrands accumulate slowly and a yearly cadence is enough to keep the map current without spending review effort on near-empty diffs. Not part of the standard per-batch workflow. Output is for periodic review — a single signal is one corroborating source; promoting a flagged row still needs a second source per the two-corroborating-sources rule. Resume-safe via `-o`. Use `--limit N` to spot-check a slice; `--include-clean` to also emit non-flagged rows; `--flag-external-links` to additionally flag rows whose only signal is an outbound non-self host (off by default to keep partner/vendor noise out of the review queue).
 - `find_bad_utf8.py` — locates invalid UTF-8 bytes (used after past encoding corruption).
 - `sortlists.py` — case-insensitive sort + dedupe + `type`-column validator for the list files; the authoritative sorter run after every batch edit.
+2
@@ -140,6 +140,8 @@ The output of `collect_domain_info.py`. Tab-separated, one row per researched do
 ## detect_rebrands.py
+**Cadence: run roughly once a year.** Not part of the standard mapping workflow — operator rebrands and acquisitions accumulate slowly, and a yearly sweep is sufficient to keep `base_reverse_dns_map.csv` from drifting out of date. There is no benefit to running it more often.
 Drift sweep that re-fetches every key in `base_reverse_dns_map.csv` with the same machinery as `collect_domain_info.py` and writes a TSV (`rebrand_drift.tsv` by default) of rows where a drift signal fired. Two signals are flagged by default:
 - `rebrand_signal` — the collector's body-text and path/alt-text regexes (see above) matched.
@@ -290,6 +290,7 @@ advania.no,Advania Norway,MSP
adventconstructions.co.tz,Advent Construction Limited,Industrial
adventhealth.com,AdventHealth,Healthcare
adventisthealth.org,Adventist Health,Healthcare
+advocatehealth.com,Advocate Health,Healthcare
adyen.com,Adyen,Finance
adyl.com.br,Adyl Telecom,ISP
ae.com.br,Agência Estado,News
@@ -575,7 +576,7 @@ americantower.com,American Tower,Technology
americantower.com.br,American Tower,Technology
americatelnet.com.pe,Americatel Peru,ISP
amerinoc.com,AmeriNOC,Web Host
-amerisourcebergen.com,AMERISOURCEBERGEN,Healthcare
+amerisourcebergen.com,Cencora,Healthcare
ameslab.gov,Ames Laboratory,Government
amethyst.co.jp,Amethyst,Healthcare
amfam.com,American Family Insurance,Finance
@@ -918,7 +919,7 @@ auriga.com,Auriga,Technology
auriganet.in,Auriganet Digital Technologies,ISP
auris.com,Auris,SaaS
aurologic.com,aurologic,Web Host
-aurorahealthcare.org,Aurora Health Care,Healthcare
+aurorahealthcare.org,Advocate Health,Healthcare
ausgrid.com.au,Ausgrid,Utilities
auspost.com.au,Australia Post,Logistics
aussiebb.com.au,Aussie Broadband,ISP
@@ -1051,6 +1052,7 @@ bank-banque-canada.ca,Bank of Canada,Government
bank-verlag.de,Bank-Verlag,Finance
bankofamerica.com,Bank of America,Finance
bankonitusa.com,Navanta,MSP
+banqup.com,Banqup,SaaS
banxico.org.mx,Banco de Mexico,Government
barak-online.net,Netvision,ISP
barcconnects.net,BARC Connects,ISP
@@ -1760,6 +1762,7 @@ cello.co.nz,Cello Group,ISP
celsiainternet.com,Celsia Internet,ISP
celya.fr,Celya (Carrefour),ISP
cencominc.com,Cencom,ISP
+cencora.com,Cencora,Healthcare
cenet.catholic.edu.au,CEnet Catholic Education Network,Education
cengagebrain.com,Cengage Learning,Education
cenic.org,CENIC,Nonprofit
@@ -2292,7 +2295,7 @@ conrad.nyc,ConradIT,MSP
consideredcreative.com,Considered Creative,Marketing
consilio.com,Consilio,Legal
consol.com,Consolidated Edison (ConEd),Utilities
-consolidated.com,Consolidated Communications,ISP
+consolidated.com,Fidium Fiber,ISP
consolidated.coop,Consolidated,Utilities
consolidatedlabel.com,Consolidated Label,Print
consolidatednd.com,Consolidated Telcom,ISP
@@ -2399,6 +2402,7 @@ countyofriverside.us,County of Riverside,Government
courierplus.net,Courier Plus,Logistics
covage.com,Covage,ISP
covenantuniversity.edu.ng,Covenant University,Education
+covermymeds.com,CoverMyMeds,Healthcare
cox.com,Cox Communications,ISP
cox.net,Cox Communications,ISP
coxenterprises.com,Cox Enterprises,Conglomerate
@@ -2642,7 +2646,7 @@ data.cr,American Data Networks Costa Rica,ISP
data102.com,Hivelocity (Data102),Web Host
data443.com,Data443,Email Security
databank.com,DataBank,Web Host
-databridgesites.com,Meridian Parkway Data Center Owner,Web Host
+databridgesites.com,TierPoint,Web Host
datacamp.co.uk,CDN77,Web Host
datacanopy.com,Data Canopy Colocation,IaaS
datacate.com,Datacate,Web Host
@@ -3355,7 +3359,7 @@ emailsecurity.app,Mesh Security,Email Security
emailservice.io,Mailprotector,Email Security
emailsrv.net,Mailprotector,Email Security
emailsrvr.com,Rackspace Email,Email Security
-emarsys.com,SAP Emarsys,Marketing
+emarsys.com,SAP Engagement Cloud,Marketing
emberpoint.com,Ember Point,SaaS
embou.com,Embou,ISP
embrapa.br,Embrapa Brazilian Agricultural Research Corporation,Government
@@ -9410,7 +9414,7 @@ richmond.edu,University of Richmond,Education
richmondfed.org,Federal Reserve Bank of Richmond,Government
ricoh-usa.com,Ricoh USA,Manufacturing
ridsa.com.ar,Red Intercable Digital,ISP
-rig.net,RigNet,ISP
+rig.net,Viasat,ISP
rightel.ir,Rightel,ISP
rightnowtech.com,Oracle Service Cloud,SaaS
rightside.ru,Telecoma,ISP
@@ -9552,7 +9556,7 @@ rvurology.com,Rogue Valley Urology,Healthcare
rwas.co.uk,Royal Welsh Agricultural Society,Agriculture
rwth-aachen.de,RWTH Aachen University,Education
rwts.com.au,Real World Technology Solutions,MSP
-rxlightning.com,RxLightning,Healthcare
+rxlightning.com,CoverMyMeds,Healthcare
rybnet.pl,Rybnet,ISP
ryoka.co.jp,Ryoka Denka Kasei,Manufacturing
rzd.ru,Russian Railways,Logistics
@@ -11041,7 +11045,7 @@ telepacific.net,TPx Communications,ISP
telepark-passau.de,Telepark Passau,ISP
teleperformance.com,Teleperformance,SaaS
telepermit.co.nz,Spark NZ,ISP
-telepoint.bg,Telepoint,Web Host
+telepoint.bg,Digital Realty,Web Host
telered.com.ar,Telered,ISP
telering.at,Magenta,ISP
telesat.com,Telesat,ISP
@@ -11179,7 +11183,7 @@ thegirlandthefig.com,the girl & the fig,Food
theglobalresearchnetwork.com,Global Healthcare Research,Healthcare
thehartford.com,HARTFORD FIRE INSURANCE,Finance
thehost.ua,TheHost Ukraine,Web Host
-thehostgroup.com,The Host Group,Web Host
+thehostgroup.com,HostGo,Web Host
theice.com,Intercontinental Exchange (ICE),Finance
theinternetsubway.us,The Internet Subway,ISP
themercury.com,The Mercury,News
@@ -11759,7 +11763,7 @@ ultahost.com,UltaHost,Web Host
ultel.net,Ultel,ISP
ultimate-guitar.com,Ultimate Guitar,Entertainment
ultimatedomain.hosting,Ultimate Domain Hosting,Web Host
-ultisat.com,Globecomm Services Maryland,MSP
+ultisat.com,UltiSat,MSP
ultra.one,UltraOne,ISP
ultralinkce.com.br,Ultralink,ISP
ultralinkweb.com.br,Ultralink Telecom,ISP
@@ -11845,7 +11849,7 @@ unidadeditorial.es,Unidad Editorial,News
unidata.it,Unidata,ISP
unifesp.br,Universidade Federal de Sao Paulo,Education
unifiedlayer.com,UnifiedLayers,Web Host
-unifiedpostgroup.com,Unifiedpost Group,SaaS
+unifiedpostgroup.com,Banqup,SaaS
unifique.com.br,Unifique,ISP
unifique.net,Unifique,ISP
unijos.edu.ng,University of Jos,Education
@@ -158,17 +158,30 @@ _FULL_IP_RE = re.compile(
 # Rebrand-signal scan. Triggered phrases are followed by a captured brand name
 # (capitalized, non-noise word). The reviewer ultimately judges whether a hit
 # is a real rebrand banner — the regex's job is to not miss the obvious ones.
-# Real cases: "now Navanta", "is now part of Lumen", "formerly known as
-# Symantec Email Security", "we became Newfold Digital".
+# Real cases: "BankOnIT is now Navanta", "is now part of Lumen", "we are now
+# Cencora", "formerly known as Symantec Email Security", "we became Newfold
+# Digital".
+#
+# A bare leading "now <Capital>" was tried and dropped — modern marketing
+# pages saturate the body text with CTA fragments like "Buy Now PROMO",
+# "Order Now Free Shipping", "Apply Now Who We Are", which all match a bare
+# `now <Capital>` and are 95%+ false positives. Requiring a copular verb
+# (`is/are/was/were/am now`) keeps the linguistic shape of an actual
+# announcement and rules out CTA buttons. The same is true in reverse for
+# bare "formerly <Capital>" — kept because "formerly" virtually never
+# appears in a CTA context, but the same noise list catches the residual
+# "Formerly Available" / "Formerly Open" cases.
 REBRAND_RE = re.compile(
-    r"(?:"
-    r"(?:now|formerly(?: known as)?) "
+    r"\b(?:"
+    r"formerly(?: known as)? "
+    r"|"
+    r"(?:is|are|was|were|am) now (?:(?:a )?part of )?"
     r"|"
     r"(?:we became|rebranded(?: as| to)?|merged with|"
     r"acquired by|previously known as|previously operated as|"
-    r"is now (?:a )?part of|new name for|joined the) "
+    r"new name for) "
     r")"
-    r"([A-Za-z][A-Za-z0-9&]+)",
+    r"([A-Za-z][A-Za-z0-9&]+)\b",
     re.IGNORECASE,
 )
@@ -182,14 +195,11 @@ REBRAND_RE = re.compile(
 # change / etc. by a space, dash, or underscore, which virtually never
 # occurs outside a rebrand context.
 REBRAND_PATH_RE = re.compile(
-    r"(?:"
-    r"rebrand"
-    r"|brand[ _-](?:launch|announcement|reveal|refresh|change|update)"
-    r"|name[ _-]change"
+    r"\b(?:"
+    r"brand[ _-](?:launch|announcement|reveal)"
     r"|our[ _-]new[ _-](?:name|brand)"
-    r"|new[ _-]name[ _-]for"
     r"|(?:acquisition|merger)[ _-]announcement"
-    r")",
+    r")\b",
     re.IGNORECASE,
 )
@@ -199,35 +209,62 @@ REBRAND_PATH_RE = re.compile(
 # narrow so real one-word brand names (Navanta, Lumen, Sykt, etc.) survive.
 _REBRAND_NOISE = frozenset(
     {
-        "Available",
-        "Accepting",
-        "Active",
-        "Booking",
-        "Closed",
-        "Complete",
-        "Enrolling",
-        "Expanding",
-        "Free",
-        "Hiring",
-        "Live",
-        "Loading",
-        "Offering",
-        "Online",
-        "Open",
-        "Operating",
-        "Pending",
-        "Playing",
-        "Powered",
-        "Selling",
-        "Serving",
-        "Shipping",
-        "Showing",
-        "Streaming",
-        "Supporting",
-        "Trending",
-        "Underway",
-        "You",
-        "Your",
+        # Past-participles / present-participles that the "are now <Cap>"
+        # / "is now <Cap>" pattern picks up from ordinary marketing prose.
+        # Compared case-insensitively against the captured brand, so a
+        # single entry covers any casing the page emits ("LIVE", "Live",
+        # "live"). Add lowercase forms here.
+        "available",
+        "accepting",
+        "active",
+        "booking",
+        "closed",
+        "complete",
+        "enrolling",
+        "expanding",
+        "free",
+        "hiring",
+        "installed",
+        "live",
+        "loading",
+        "offering",
+        "online",
+        "open",
+        "operating",
+        "part",  # "is now Part of [our family]" already filtered by structure;
+        # this catches inverted phrasing where "Part" is the captured token.
+        "pending",
+        "playing",
+        "powered",
+        "secure",  # "is now Secure Managed Wi-Fi" / "is now Secure Login"
+        "selling",
+        "serving",
+        "shipping",
+        "showing",
+        "streaming",
+        "supporting",
+        "trending",
+        "underway",
+        # Short prepositions / pronouns that grammatically follow the verb
+        # but are not brand names: "are now In Control", "is now On the air".
+        "down",
+        "in",
+        "off",
+        "on",
+        "out",
+        "up",
+        "you",
+        "your",
+        # Standards / certifications that follow "is now <CERT> certified"
+        # in marketing copy (compliance announcements).
+        "iso",
+        # Social-media platform rebrands that ubiquitously appear in
+        # footers as "X (formerly Twitter)", "Meta (formerly Facebook)",
+        # "Block (formerly Square)". The mention is real but it's almost
+        # never about the page operator's own rebrand.
+        "twitter",
+        "facebook",
+        "square",
     }
 )
@@ -349,7 +386,7 @@ def _rebrand_signal(*texts: str) -> str:
 # post-trigger noise like "now hiring" / "formerly available".
 if not brand or not brand[0].isupper():
     continue
-if brand in _REBRAND_NOISE:
+if brand.lower() in _REBRAND_NOISE:
     continue
 start = max(0, m.start() - 30)
 end = min(len(text), m.end() + 80)
@@ -1,6 +1,13 @@
 #!/usr/bin/env python
 """Re-fetch mapped reverse-DNS base domains and surface possible rebrand signals.
+Cadence: run roughly once a year. Operator rebrands and acquisitions
+accumulate slowly, and a yearly sweep is sufficient to keep the map current
+without spending review effort on near-empty diffs. This is not part of the
+standard per-batch mapping workflow — that workflow uses the related
+`collect_domain_info.py` for unmapped domains. Use this script when you want
+to revisit the *already-mapped* set for drift.
 Walks `base_reverse_dns_map.csv`, fetches each domain's homepage with the same
 machinery used by `collect_domain_info.py`, and writes a TSV listing rows where
 one of two default drift signals fired: