diff --git a/AGENTS.md b/AGENTS.md
index 9ed8900..06d43bb 100644
--- a/AGENTS.md
+++ b/AGENTS.md
@@ -225,7 +225,7 @@ When `unknown_base_reverse_dns.csv` has new entries, follow this order rather th
- `find_unknown_base_reverse_dns.py` — regenerates `unknown_base_reverse_dns.csv` from `base_reverse_dns.csv` by subtracting what is already mapped or known-unknown. Enforces the no-full-IP privacy rule at ingest. Translates non-domain-shaped `source_name` rows (raw MMDB `as_name` strings surfaced by the ASN-fallback path in `utils.py:get_ip_address_info` when the IP had no PTR and the `as_domain` was uncategorized) to their corresponding `as_domain` via the bundled MMDB, so the row enters the pipeline as a researchable domain (and drops out automatically if that `as_domain` is already mapped). Run after merging a batch.
- `detect_psl_overrides.py` — scans the lists for clustered IP-containing patterns, auto-adds brand suffixes to `psl_overrides.txt`, folds affected entries to their base, and removes any remaining full-IP entries. Run before the collector on any new batch.
- `collect_domain_info.py` — the bulk enrichment collector described above. Respects `psl_overrides.txt` and skips full-IP entries. Two derived columns surface drift signals that are also useful during initial classification: `rebrand_signal` combines a body-text regex (matches "now X", "formerly known as X", "is now part of X", etc.) with a path/alt-text regex (matches "rebrand", "brand-launch", "brand-announcement", "name-change", "our-new-name") so that image-only acquisition banners — `
` — also fire. `external_links` lists the homepage's non-self, non-social outbound link hosts; useful as review context but not a flag trigger by default in the drift sweep (most external links are to partners / customers / vendors and don't indicate a rebrand).
-- `detect_rebrands.py` — drift sweep that re-fetches every key in `base_reverse_dns_map.csv` with the same machinery as `collect_domain_info.py` and emits a TSV of rows where `rebrand_signal` or `redirect_changed` (final URL host doesn't sit under the input domain) fired. Output is for periodic review — a single signal is one corroborating source; promoting a flagged row still needs a second source per the two-corroborating-sources rule. Resume-safe via `-o`. Use `--limit N` to spot-check a slice; `--include-clean` to also emit non-flagged rows; `--flag-external-links` to additionally flag rows whose only signal is an outbound non-self host (off by default to keep partner/vendor noise out of the review queue).
+- `detect_rebrands.py` — drift sweep that re-fetches every key in `base_reverse_dns_map.csv` with the same machinery as `collect_domain_info.py` and emits a TSV of rows where `rebrand_signal` or `redirect_changed` (final URL host doesn't sit under the input domain) fired. **Run once a year, not more often** — operator rebrands accumulate slowly and a yearly cadence is enough to keep the map current without spending review effort on near-empty diffs. Not part of the standard per-batch workflow. Output is for periodic review — a single signal is one corroborating source; promoting a flagged row still needs a second source per the two-corroborating-sources rule. Resume-safe via `-o`. Use `--limit N` to spot-check a slice; `--include-clean` to also emit non-flagged rows; `--flag-external-links` to additionally flag rows whose only signal is an outbound non-self host (off by default to keep partner/vendor noise out of the review queue).
- `find_bad_utf8.py` — locates invalid UTF-8 bytes (used after past encoding corruption).
- `sortlists.py` — case-insensitive sort + dedupe + `type`-column validator for the list files; the authoritative sorter run after every batch edit.
diff --git a/parsedmarc/resources/maps/README.md b/parsedmarc/resources/maps/README.md
index 3fbe31f..8b8c043 100644
--- a/parsedmarc/resources/maps/README.md
+++ b/parsedmarc/resources/maps/README.md
@@ -140,6 +140,8 @@ The output of `collect_domain_info.py`. Tab-separated, one row per researched do
## detect_rebrands.py
+**Cadence: run roughly once a year.** Not part of the standard mapping workflow — operator rebrands and acquisitions accumulate slowly, and a yearly sweep is sufficient to keep `base_reverse_dns_map.csv` from drifting out of date. There is no benefit to running it more often.
+
Drift sweep that re-fetches every key in `base_reverse_dns_map.csv` with the same machinery as `collect_domain_info.py` and writes a TSV (`rebrand_drift.tsv` by default) of rows where a drift signal fired. Two signals are flagged by default:
- `rebrand_signal` — the collector's body-text and path/alt-text regexes (see above) matched.
diff --git a/parsedmarc/resources/maps/base_reverse_dns_map.csv b/parsedmarc/resources/maps/base_reverse_dns_map.csv
index 4d6305d..ff59362 100644
--- a/parsedmarc/resources/maps/base_reverse_dns_map.csv
+++ b/parsedmarc/resources/maps/base_reverse_dns_map.csv
@@ -290,6 +290,7 @@ advania.no,Advania Norway,MSP
adventconstructions.co.tz,Advent Construction Limited,Industrial
adventhealth.com,AdventHealth,Healthcare
adventisthealth.org,Adventist Health,Healthcare
+advocatehealth.com,Advocate Health,Healthcare
adyen.com,Adyen,Finance
adyl.com.br,Adyl Telecom,ISP
ae.com.br,Agência Estado,News
@@ -575,7 +576,7 @@ americantower.com,American Tower,Technology
americantower.com.br,American Tower,Technology
americatelnet.com.pe,Americatel Peru,ISP
amerinoc.com,AmeriNOC,Web Host
-amerisourcebergen.com,AMERISOURCEBERGEN,Healthcare
+amerisourcebergen.com,Cencora,Healthcare
ameslab.gov,Ames Laboratory,Government
amethyst.co.jp,Amethyst,Healthcare
amfam.com,American Family Insurance,Finance
@@ -918,7 +919,7 @@ auriga.com,Auriga,Technology
auriganet.in,Auriganet Digital Technologies,ISP
auris.com,Auris,SaaS
aurologic.com,aurologic,Web Host
-aurorahealthcare.org,Aurora Health Care,Healthcare
+aurorahealthcare.org,Advocate Health,Healthcare
ausgrid.com.au,Ausgrid,Utilities
auspost.com.au,Australia Post,Logistics
aussiebb.com.au,Aussie Broadband,ISP
@@ -1051,6 +1052,7 @@ bank-banque-canada.ca,Bank of Canada,Government
bank-verlag.de,Bank-Verlag,Finance
bankofamerica.com,Bank of America,Finance
bankonitusa.com,Navanta,MSP
+banqup.com,Banqup,SaaS
banxico.org.mx,Banco de Mexico,Government
barak-online.net,Netvision,ISP
barcconnects.net,BARC Connects,ISP
@@ -1760,6 +1762,7 @@ cello.co.nz,Cello Group,ISP
celsiainternet.com,Celsia Internet,ISP
celya.fr,Celya (Carrefour),ISP
cencominc.com,Cencom,ISP
+cencora.com,Cencora,Healthcare
cenet.catholic.edu.au,CEnet Catholic Education Network,Education
cengagebrain.com,Cengage Learning,Education
cenic.org,CENIC,Nonprofit
@@ -2292,7 +2295,7 @@ conrad.nyc,ConradIT,MSP
consideredcreative.com,Considered Creative,Marketing
consilio.com,Consilio,Legal
consol.com,Consolidated Edison (ConEd),Utilities
-consolidated.com,Consolidated Communications,ISP
+consolidated.com,Fidium Fiber,ISP
consolidated.coop,Consolidated,Utilities
consolidatedlabel.com,Consolidated Label,Print
consolidatednd.com,Consolidated Telcom,ISP
@@ -2399,6 +2402,7 @@ countyofriverside.us,County of Riverside,Government
courierplus.net,Courier Plus,Logistics
covage.com,Covage,ISP
covenantuniversity.edu.ng,Covenant University,Education
+covermymeds.com,CoverMyMeds,Healthcare
cox.com,Cox Communications,ISP
cox.net,Cox Communications,ISP
coxenterprises.com,Cox Enterprises,Conglomerate
@@ -2642,7 +2646,7 @@ data.cr,American Data Networks Costa Rica,ISP
data102.com,Hivelocity (Data102),Web Host
data443.com,Data443,Email Security
databank.com,DataBank,Web Host
-databridgesites.com,Meridian Parkway Data Center Owner,Web Host
+databridgesites.com,TierPoint,Web Host
datacamp.co.uk,CDN77,Web Host
datacanopy.com,Data Canopy Colocation,IaaS
datacate.com,Datacate,Web Host
@@ -3355,7 +3359,7 @@ emailsecurity.app,Mesh Security,Email Security
emailservice.io,Mailprotector,Email Security
emailsrv.net,Mailprotector,Email Security
emailsrvr.com,Rackspace Email,Email Security
-emarsys.com,SAP Emarsys,Marketing
+emarsys.com,SAP Engagement Cloud,Marketing
emberpoint.com,Ember Point,SaaS
embou.com,Embou,ISP
embrapa.br,Embrapa Brazilian Agricultural Research Corporation,Government
@@ -9410,7 +9414,7 @@ richmond.edu,University of Richmond,Education
richmondfed.org,Federal Reserve Bank of Richmond,Government
ricoh-usa.com,Ricoh USA,Manufacturing
ridsa.com.ar,Red Intercable Digital,ISP
-rig.net,RigNet,ISP
+rig.net,Viasat,ISP
rightel.ir,Rightel,ISP
rightnowtech.com,Oracle Service Cloud,SaaS
rightside.ru,Telecoma,ISP
@@ -9552,7 +9556,7 @@ rvurology.com,Rogue Valley Urology,Healthcare
rwas.co.uk,Royal Welsh Agricultural Society,Agriculture
rwth-aachen.de,RWTH Aachen University,Education
rwts.com.au,Real World Technology Solutions,MSP
-rxlightning.com,RxLightning,Healthcare
+rxlightning.com,CoverMyMeds,Healthcare
rybnet.pl,Rybnet,ISP
ryoka.co.jp,Ryoka Denka Kasei,Manufacturing
rzd.ru,Russian Railways,Logistics
@@ -11041,7 +11045,7 @@ telepacific.net,TPx Communications,ISP
telepark-passau.de,Telepark Passau,ISP
teleperformance.com,Teleperformance,SaaS
telepermit.co.nz,Spark NZ,ISP
-telepoint.bg,Telepoint,Web Host
+telepoint.bg,Digital Realty,Web Host
telered.com.ar,Telered,ISP
telering.at,Magenta,ISP
telesat.com,Telesat,ISP
@@ -11179,7 +11183,7 @@ thegirlandthefig.com,the girl & the fig,Food
theglobalresearchnetwork.com,Global Healthcare Research,Healthcare
thehartford.com,HARTFORD FIRE INSURANCE,Finance
thehost.ua,TheHost Ukraine,Web Host
-thehostgroup.com,The Host Group,Web Host
+thehostgroup.com,HostGo,Web Host
theice.com,Intercontinental Exchange (ICE),Finance
theinternetsubway.us,The Internet Subway,ISP
themercury.com,The Mercury,News
@@ -11759,7 +11763,7 @@ ultahost.com,UltaHost,Web Host
ultel.net,Ultel,ISP
ultimate-guitar.com,Ultimate Guitar,Entertainment
ultimatedomain.hosting,Ultimate Domain Hosting,Web Host
-ultisat.com,Globecomm Services Maryland,MSP
+ultisat.com,UltiSat,MSP
ultra.one,UltraOne,ISP
ultralinkce.com.br,Ultralink,ISP
ultralinkweb.com.br,Ultralink Telecom,ISP
@@ -11845,7 +11849,7 @@ unidadeditorial.es,Unidad Editorial,News
unidata.it,Unidata,ISP
unifesp.br,Universidade Federal de Sao Paulo,Education
unifiedlayer.com,UnifiedLayers,Web Host
-unifiedpostgroup.com,Unifiedpost Group,SaaS
+unifiedpostgroup.com,Banqup,SaaS
unifique.com.br,Unifique,ISP
unifique.net,Unifique,ISP
unijos.edu.ng,University of Jos,Education
diff --git a/parsedmarc/resources/maps/collect_domain_info.py b/parsedmarc/resources/maps/collect_domain_info.py
index 6b69c52..8bd5057 100644
--- a/parsedmarc/resources/maps/collect_domain_info.py
+++ b/parsedmarc/resources/maps/collect_domain_info.py
@@ -158,17 +158,30 @@ _FULL_IP_RE = re.compile(
# Rebrand-signal scan. Triggered phrases are followed by a captured brand name
# (capitalized, non-noise word). The reviewer ultimately judges whether a hit
# is a real rebrand banner — the regex's job is to not miss the obvious ones.
-# Real cases: "now Navanta", "is now part of Lumen", "formerly known as
-# Symantec Email Security", "we became Newfold Digital".
+# Real cases: "BankOnIT is now Navanta", "is now part of Lumen", "we are now
+# Cencora", "formerly known as Symantec Email Security", "we became Newfold
+# Digital".
+#
+# A bare leading "now " trigger was tried and dropped: modern marketing
+# pages saturate the body text with CTA fragments like "Buy Now PROMO",
+# "Order Now Free Shipping", "Apply Now Who We Are", which all match a bare
+# `now ` and are 95%+ false positives. Requiring a copular verb
+# (`is/are/was/were/am now`) keeps the linguistic shape of an actual
+# announcement and rules out CTA buttons. Bare "formerly " is the opposite
+# case and stays as a trigger: "formerly" virtually never appears in a CTA
+# context, and the `_REBRAND_NOISE` set below catches the residual
+# "Formerly Available" / "Formerly Open" cases.
REBRAND_RE = re.compile(
- r"(?:"
- r"(?:now|formerly(?: known as)?) "
+ r"\b(?:"
+ r"formerly(?: known as)? "
+ r"|"
+ r"(?:is|are|was|were|am) now (?:(?:a )?part of )?"
r"|"
r"(?:we became|rebranded(?: as| to)?|merged with|"
r"acquired by|previously known as|previously operated as|"
- r"is now (?:a )?part of|new name for|joined the) "
+ r"new name for) "
r")"
- r"([A-Za-z][A-Za-z0-9&]+)",
+ r"([A-Za-z][A-Za-z0-9&]+)\b",
re.IGNORECASE,
)
@@ -182,14 +195,11 @@ REBRAND_RE = re.compile(
# change / etc. by a space, dash, or underscore, which virtually never
# occurs outside a rebrand context.
REBRAND_PATH_RE = re.compile(
- r"(?:"
- r"rebrand"
- r"|brand[ _-](?:launch|announcement|reveal|refresh|change|update)"
- r"|name[ _-]change"
+ r"\b(?:"
+ r"brand[ _-](?:launch|announcement|reveal)"
r"|our[ _-]new[ _-](?:name|brand)"
- r"|new[ _-]name[ _-]for"
r"|(?:acquisition|merger)[ _-]announcement"
- r")",
+ r")\b",
re.IGNORECASE,
)
@@ -199,35 +209,62 @@ REBRAND_PATH_RE = re.compile(
# narrow so real one-word brand names (Navanta, Lumen, Sykt, etc.) survive.
_REBRAND_NOISE = frozenset(
{
- "Available",
- "Accepting",
- "Active",
- "Booking",
- "Closed",
- "Complete",
- "Enrolling",
- "Expanding",
- "Free",
- "Hiring",
- "Live",
- "Loading",
- "Offering",
- "Online",
- "Open",
- "Operating",
- "Pending",
- "Playing",
- "Powered",
- "Selling",
- "Serving",
- "Shipping",
- "Showing",
- "Streaming",
- "Supporting",
- "Trending",
- "Underway",
- "You",
- "Your",
+ # Participles and adjectives that the "is now " / "are now "
+ # trigger picks up from ordinary marketing prose. The captured
+ # brand is lowercased before the lookup, so a single lowercase
+ # entry covers any casing the page emits ("LIVE", "Live", "live").
+ # Add new entries in lowercase only.
+ "available",
+ "accepting",
+ "active",
+ "booking",
+ "closed",
+ "complete",
+ "enrolling",
+ "expanding",
+ "free",
+ "hiring",
+ "installed",
+ "live",
+ "loading",
+ "offering",
+ "online",
+ "open",
+ "operating",
+ "part", # "is now Part of [our family]" already filtered by structure;
+ # this catches inverted phrasing where "Part" is the captured token.
+ "pending",
+ "playing",
+ "powered",
+ "secure", # "is now Secure Managed Wi-Fi" / "is now Secure Login"
+ "selling",
+ "serving",
+ "shipping",
+ "showing",
+ "streaming",
+ "supporting",
+ "trending",
+ "underway",
+ # Short prepositions / pronouns that grammatically follow the verb
+ # but are not brand names: "are now In Control", "is now On the air".
+ "down",
+ "in",
+ "off",
+ "on",
+ "out",
+ "up",
+ "you",
+ "your",
+ # Standards / certification names that directly follow "is now" in
+ # compliance announcements, e.g. "is now ISO certified".
+ "iso",
+ # Well-known platform rebrands that ubiquitously appear in
+ # footers as "X (formerly Twitter)", "Meta (formerly Facebook)",
+ # "Block (formerly Square)". The mention is real but it's almost
+ # never about the page operator's own rebrand.
+ "twitter",
+ "facebook",
+ "square",
}
)
@@ -349,7 +386,7 @@ def _rebrand_signal(*texts: str) -> str:
# post-trigger noise like "now hiring" / "formerly available".
if not brand or not brand[0].isupper():
continue
- if brand in _REBRAND_NOISE:
+ if brand.lower() in _REBRAND_NOISE:
continue
start = max(0, m.start() - 30)
end = min(len(text), m.end() + 80)
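To make the effect of the `REBRAND_RE` and `REBRAND_PATH_RE` changes in this file concrete, here is a minimal sketch that copies both patterns from the `+` side of the hunks above and runs them against a few sample strings. The helper `captured_brands` and the `NOISE_SUBSET` set are hypothetical names used only for illustration; the real filter lives in `_rebrand_signal` and uses the full `_REBRAND_NOISE` set. The sample strings are either quoted from the comments in this diff or made up.

```python
import re

# Copied from the `+` side of the REBRAND_RE hunk.
REBRAND_RE = re.compile(
    r"\b(?:"
    r"formerly(?: known as)? "
    r"|"
    r"(?:is|are|was|were|am) now (?:(?:a )?part of )?"
    r"|"
    r"(?:we became|rebranded(?: as| to)?|merged with|"
    r"acquired by|previously known as|previously operated as|"
    r"new name for) "
    r")"
    r"([A-Za-z][A-Za-z0-9&]+)\b",
    re.IGNORECASE,
)

# Copied from the `+` side of the REBRAND_PATH_RE hunk.
REBRAND_PATH_RE = re.compile(
    r"\b(?:"
    r"brand[ _-](?:launch|announcement|reveal)"
    r"|our[ _-]new[ _-](?:name|brand)"
    r"|(?:acquisition|merger)[ _-]announcement"
    r")\b",
    re.IGNORECASE,
)

# Illustrative subset of _REBRAND_NOISE (assumption: not the full set).
NOISE_SUBSET = frozenset({"available", "live", "open", "twitter"})


def captured_brands(text):
    """Hypothetical helper mirroring the capitalization and noise checks."""
    brands = []
    for m in REBRAND_RE.finditer(text):
        brand = m.group(1)
        # Skip lowercase captures such as "is now hiring".
        if not brand or not brand[0].isupper():
            continue
        # Case-insensitive noise lookup, as in the patched _rebrand_signal.
        if brand.lower() in NOISE_SUBSET:
            continue
        brands.append(brand)
    return brands


print(captured_brands("BankOnIT is now Navanta"))    # ['Navanta']
print(captured_brands("Acme is now part of Lumen"))  # ['Lumen']
print(captured_brands("Order Now Free Shipping"))    # []
print(captured_brands("Our store is now Open"))      # []
print(captured_brands("X (formerly Twitter)"))       # []
print(bool(REBRAND_PATH_RE.search("/img/our-new-name.png")))  # True
print(bool(REBRAND_PATH_RE.search("/blog/rebrand-2024")))     # False
```

One behavior worth knowing when reading review output: Python's `\b` treats `_` as a word character, so a path such as `/press/brand_launch_hero.jpg` does not match `REBRAND_PATH_RE`, while `/press/brand-launch-hero.jpg` does.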
diff --git a/parsedmarc/resources/maps/detect_rebrands.py b/parsedmarc/resources/maps/detect_rebrands.py
index 35215ee..5461cbe 100644
--- a/parsedmarc/resources/maps/detect_rebrands.py
+++ b/parsedmarc/resources/maps/detect_rebrands.py
@@ -1,6 +1,13 @@
#!/usr/bin/env python
"""Re-fetch mapped reverse-DNS base domains and surface possible rebrand signals.
+Cadence: run roughly once a year. Operator rebrands and acquisitions
+accumulate slowly, and a yearly sweep is sufficient to keep the map current
+without spending review effort on near-empty diffs. This is not part of the
+standard per-batch mapping workflow — that workflow uses the related
+`collect_domain_info.py` for unmapped domains. Use this script when you want
+to revisit the *already-mapped* set for drift.
+
Walks `base_reverse_dns_map.csv`, fetches each domain's homepage with the same
machinery used by `collect_domain_info.py`, and writes a TSV listing rows where
one of two default drift signals fired: