mirror of
https://github.com/domainaware/parsedmarc.git
synced 2026-05-20 10:55:24 +00:00
Drift-detect rebrands: tighten regex; promote 11 verified rebrand-aliased map keys (#753)
* Tighten rebrand regex to drop CTA, third-party-mention, and CSS-asset FPs

The first run of detect_rebrands.py against the live map surfaced systemic false-positive categories that drowned the real signals. Tightening over two rounds of FP triage:

REBRAND_RE: drop bare "now <Cap>" and "joined the X" branches:

- "Buy Now PROMO", "Apply Now Who", "Order Now Free Shipping": modern marketing pages saturate body text with CTA fragments and ~95% of bare "now <Capital>" matches were these. Replaced with the linguistically meaningful pattern "(is|are|was|were|am) now (?:(?:a )?part of)?" which still catches "BankOnIT is now Navanta", "We are now Cencora", "is now part of Lumen", etc.
- "joined the Festo Certified System Integrator Program", "joined the ClimateCAP Initiative", "joined the Fredonia Women's Rugby team": the "joined the X" pattern was too generic; real "joined the X family" rebrand banners are rare enough that dropping the branch is the right trade.

REBRAND_RE: add `\b` word boundary at the start so triggers don't match mid-word: "Stre*am* now Mystery" was matching `am now <Cap>` because the last two letters of "Stream" satisfied the verb alternation.

REBRAND_PATH_RE: drop bare `rebrand`, `name change`, `new name for`, and `brand-update` / `brand-refresh` patterns. They appeared too often as CSS class names (`class="rebrand-page"`), CSS variables (`--rebrand-underline-color`), image filenames (`bms-rebrand-logo.svg`, `brand-update.css`), and JSON/JS strings (`"name change"` user-account labels). Adding `\b` boundaries doesn't help because dashes are non-word characters. The remaining narrow patterns (`brand-launch`, `brand-announcement`, `brand-reveal`, `our-new-name`, `our-new-brand`, `acquisition-announcement`, `merger-announcement`) still catch the canonical bankonitusa.com case via its `brand-launch-frequently-asked-questions` URL slug and `Brand announcement` alt text.
_REBRAND_NOISE: make the comparison case-insensitive and add "included", "iso", "secure", "part" to suppress "is now ON" / "is now LIVE" / "is now ISO 27001 certified" / "is now Secure Managed Wi-Fi" / "is now Part of" patterns. Twitter/Facebook/Square (the social-platform rebrand mentions in footers like "X (formerly Twitter)") moved to lowercase since the comparison is now case-insensitive.

Net effect on a full sweep over the ~13,100-key map: rebrand-signal flagged-row count dropped from ~270 (initial run) to 108 (round-3), clearing the dominant FP categories while every real signal (verified against the bankonitusa.com canonical case plus 11 other actual rebrands) still fires.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* Promote 11 verified rebrands found by drift sweep; alias 4 acquirer domains

Renames produced by `detect_rebrands.py` running against the full ~13,100-key map and verified by re-reading each operator's homepage. Type column unchanged for every row; only the canonical `name` shifts to the new operator. Where the new operator's primary domain wasn't already in the map, a case-1 alias row is added pointing to the same `(name, type)`.
Renames:

- amerisourcebergen.com: AMERISOURCEBERGEN → Cencora
- aurorahealthcare.org: Aurora Health Care → Advocate Health
- consolidated.com: Consolidated Communications → Fidium Fiber
- databridgesites.com: Meridian Parkway Data Center Owner → TierPoint
- emarsys.com: SAP Emarsys → SAP Engagement Cloud
- rig.net: RigNet → Viasat
- rxlightning.com: RxLightning → CoverMyMeds
- telepoint.bg: Telepoint → Digital Realty
- thehostgroup.com: The Host Group → HostGo
- ultisat.com: Globecomm Services Maryland → UltiSat
- unifiedpostgroup.com: Unifiedpost Group → Banqup

New aliases (operator's primary domain not previously mapped):

- cencora.com → Cencora, Healthcare
- advocatehealth.com → Advocate Health, Healthcare
- covermymeds.com → CoverMyMeds, Healthcare
- banqup.com → Banqup, SaaS

Five sweep hits intentionally deferred for lack of a clear second source: megatel.co.nz → Nova (`nova.co.nz` is for sale via a domain broker; unclear which Nova entity), pogozone.com → NeuBeam (NeuBeam's homepage doesn't acknowledge the PogoZone acquisition), prempub.com → Ingenious Media (ingeniousmedia.com fetch failed), voltagepark.com → ? (merger with Lightning AI rather than a clean rebrand), and a handful of more ambiguous Synopsys/Ansys/OmniAccess/Rakuten/Indigital/Synthite signals that need manual research.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* Document detect_rebrands.py cadence as run-once-a-year

The drift sweep is for catching operator rebrands and acquisitions that accumulated since the previous run; M&A activity over the mapped operator set is slow enough that yearly is sufficient. Annotate the script's own docstring, the maps README, and the AGENTS.md "Related utility scripts" entry so a future contributor doesn't mistake it for a per-batch step.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Sean Whalen <seanthegeek@users.noreply.github.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
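The REBRAND_RE tightening described in the commit message can be illustrated with a reduced sketch. The patterns below are simplified stand-ins, not the actual `REBRAND_RE` (they cover only the "now" branch), and the names `BARE_NOW`, `UNBOUNDED`, `TIGHTENED`, `NOISE`, and `captured_brand` are hypothetical. The noise set is a small subset chosen for illustration.

```python
import re

# Simplified stand-ins for illustration only; the real REBRAND_RE has more branches.
_BRAND = r"([A-Za-z][A-Za-z0-9&]+)"
_VERB_NOW = r"(?:is|are|was|were|am) now (?:(?:a )?part of )?"

BARE_NOW = re.compile(r"now " + _BRAND, re.IGNORECASE)                      # old, CTA-prone shape
UNBOUNDED = re.compile(_VERB_NOW + _BRAND, re.IGNORECASE)                   # verb required, no \b
TIGHTENED = re.compile(r"\b" + _VERB_NOW + _BRAND + r"\b", re.IGNORECASE)   # verb plus boundaries

# Hypothetical subset of the noise list, stored lowercase.
NOISE = frozenset({"live", "on", "iso", "secure", "part"})


def captured_brand(pattern, text):
    """Return the first captured brand that survives the filters, else None."""
    for m in pattern.finditer(text):
        brand = m.group(1)
        if not brand[0].isupper():
            continue  # brand names are expected to be capitalized
        if brand.lower() in NOISE:
            continue  # case-insensitive noise comparison
        return brand
    return None


print(captured_brand(BARE_NOW, "Buy Now PROMO"))             # CTA fragment matches the bare pattern
print(captured_brand(TIGHTENED, "Buy Now PROMO"))            # no copular verb, so no match
print(captured_brand(UNBOUNDED, "Stream now Mystery"))       # "am" inside "Stream" matches mid-word
print(captured_brand(TIGHTENED, "Stream now Mystery"))       # leading \b rejects the mid-word hit
print(captured_brand(TIGHTENED, "BankOnIT is now Navanta"))  # real announcement still fires
print(captured_brand(TIGHTENED, "Our portal is now LIVE"))   # suppressed by the noise set
```

Requiring the verb removes the CTA matches, the leading `\b` removes the mid-word match, and the lowercase comparison lets a single noise entry cover every casing a page emits.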
@@ -225,7 +225,7 @@ When `unknown_base_reverse_dns.csv` has new entries, follow this order rather th
|
||||
- `find_unknown_base_reverse_dns.py` — regenerates `unknown_base_reverse_dns.csv` from `base_reverse_dns.csv` by subtracting what is already mapped or known-unknown. Enforces the no-full-IP privacy rule at ingest. Translates non-domain-shaped `source_name` rows (raw MMDB `as_name` strings surfaced by the ASN-fallback path in `utils.py:get_ip_address_info` when the IP had no PTR and the `as_domain` was uncategorized) to their corresponding `as_domain` via the bundled MMDB, so the row enters the pipeline as a researchable domain (and drops out automatically if that `as_domain` is already mapped). Run after merging a batch.
|
||||
- `detect_psl_overrides.py` — scans the lists for clustered IP-containing patterns, auto-adds brand suffixes to `psl_overrides.txt`, folds affected entries to their base, and removes any remaining full-IP entries. Run before the collector on any new batch.
|
||||
- `collect_domain_info.py` — the bulk enrichment collector described above. Respects `psl_overrides.txt` and skips full-IP entries. Two derived columns surface drift signals that are also useful during initial classification: `rebrand_signal` combines a body-text regex (matches "now X", "formerly known as X", "is now part of X", etc.) with a path/alt-text regex (matches "rebrand", "brand-launch", "brand-announcement", "name-change", "our-new-name") so that image-only acquisition banners — `<a href="…/brand-launch-…"><img alt="Brand announcement"></a>` — also fire. `external_links` lists the homepage's non-self, non-social outbound link hosts; useful as review context but not a flag trigger by default in the drift sweep (most external links are to partners / customers / vendors and don't indicate a rebrand).
|
||||
- `detect_rebrands.py` — drift sweep that re-fetches every key in `base_reverse_dns_map.csv` with the same machinery as `collect_domain_info.py` and emits a TSV of rows where `rebrand_signal` or `redirect_changed` (final URL host doesn't sit under the input domain) fired. Output is for periodic review — a single signal is one corroborating source; promoting a flagged row still needs a second source per the two-corroborating-sources rule. Resume-safe via `-o`. Use `--limit N` to spot-check a slice; `--include-clean` to also emit non-flagged rows; `--flag-external-links` to additionally flag rows whose only signal is an outbound non-self host (off by default to keep partner/vendor noise out of the review queue).
|
||||
- `detect_rebrands.py` — drift sweep that re-fetches every key in `base_reverse_dns_map.csv` with the same machinery as `collect_domain_info.py` and emits a TSV of rows where `rebrand_signal` or `redirect_changed` (final URL host doesn't sit under the input domain) fired. **Run once a year, not more often** — operator rebrands accumulate slowly and a yearly cadence is enough to keep the map current without spending review effort on near-empty diffs. Not part of the standard per-batch workflow. Output is for periodic review — a single signal is one corroborating source; promoting a flagged row still needs a second source per the two-corroborating-sources rule. Resume-safe via `-o`. Use `--limit N` to spot-check a slice; `--include-clean` to also emit non-flagged rows; `--flag-external-links` to additionally flag rows whose only signal is an outbound non-self host (off by default to keep partner/vendor noise out of the review queue).
|
||||
- `find_bad_utf8.py` — locates invalid UTF-8 bytes (used after past encoding corruption).
|
||||
- `sortlists.py` — case-insensitive sort + dedupe + `type`-column validator for the list files; the authoritative sorter run after every batch edit.
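The `redirect_changed` condition mentioned for `detect_rebrands.py` (final URL host doesn't sit under the input domain) can be sketched as below. This is an assumption about the shape of the check, not the script's actual code; `host_under_domain` and `redirect_changed` are hypothetical function names, and the domains are placeholders.

```python
from urllib.parse import urlsplit


def host_under_domain(host: str, domain: str) -> bool:
    """True when host equals domain or is a subdomain of it."""
    host = host.lower().rstrip(".")
    domain = domain.lower().rstrip(".")
    # The leading dot prevents "notexample.com" from counting as under "example.com".
    return host == domain or host.endswith("." + domain)


def redirect_changed(input_domain: str, final_url: str) -> bool:
    """True when the final URL's host is outside the input domain."""
    host = urlsplit(final_url).hostname
    if not host:
        return False  # no host to compare, so nothing is flagged
    return not host_under_domain(host, input_domain)


print(redirect_changed("old-brand.example", "https://www.old-brand.example/about"))
print(redirect_changed("old-brand.example", "https://www.new-brand.example/"))
```

A hit from a check like this is still only one corroborating source; the two-corroborating-sources rule applies before a row is promoted.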
|
||||
|
||||
|
||||
@@ -140,6 +140,8 @@ The output of `collect_domain_info.py`. Tab-separated, one row per researched do
|
||||
|
||||
## detect_rebrands.py
|
||||
|
||||
**Cadence: run roughly once a year.** Not part of the standard mapping workflow — operator rebrands and acquisitions accumulate slowly, and a yearly sweep is sufficient to keep `base_reverse_dns_map.csv` from drifting out of date. There is no benefit to running it more often.
|
||||
|
||||
Drift sweep that re-fetches every key in `base_reverse_dns_map.csv` with the same machinery as `collect_domain_info.py` and writes a TSV (`rebrand_drift.tsv` by default) of rows where a drift signal fired. Two signals are flagged by default:
|
||||
|
||||
- `rebrand_signal` — the collector's body-text and path/alt-text regexes (see above) matched.
|
||||
|
||||
@@ -290,6 +290,7 @@ advania.no,Advania Norway,MSP
|
||||
adventconstructions.co.tz,Advent Construction Limited,Industrial
|
||||
adventhealth.com,AdventHealth,Healthcare
|
||||
adventisthealth.org,Adventist Health,Healthcare
|
||||
advocatehealth.com,Advocate Health,Healthcare
|
||||
adyen.com,Adyen,Finance
|
||||
adyl.com.br,Adyl Telecom,ISP
|
||||
ae.com.br,Agência Estado,News
|
||||
@@ -575,7 +576,7 @@ americantower.com,American Tower,Technology
|
||||
americantower.com.br,American Tower,Technology
|
||||
americatelnet.com.pe,Americatel Peru,ISP
|
||||
amerinoc.com,AmeriNOC,Web Host
|
||||
amerisourcebergen.com,AMERISOURCEBERGEN,Healthcare
|
||||
amerisourcebergen.com,Cencora,Healthcare
|
||||
ameslab.gov,Ames Laboratory,Government
|
||||
amethyst.co.jp,Amethyst,Healthcare
|
||||
amfam.com,American Family Insurance,Finance
|
||||
@@ -918,7 +919,7 @@ auriga.com,Auriga,Technology
|
||||
auriganet.in,Auriganet Digital Technologies,ISP
|
||||
auris.com,Auris,SaaS
|
||||
aurologic.com,aurologic,Web Host
|
||||
aurorahealthcare.org,Aurora Health Care,Healthcare
|
||||
aurorahealthcare.org,Advocate Health,Healthcare
|
||||
ausgrid.com.au,Ausgrid,Utilities
|
||||
auspost.com.au,Australia Post,Logistics
|
||||
aussiebb.com.au,Aussie Broadband,ISP
|
||||
@@ -1051,6 +1052,7 @@ bank-banque-canada.ca,Bank of Canada,Government
|
||||
bank-verlag.de,Bank-Verlag,Finance
|
||||
bankofamerica.com,Bank of America,Finance
|
||||
bankonitusa.com,Navanta,MSP
|
||||
banqup.com,Banqup,SaaS
|
||||
banxico.org.mx,Banco de Mexico,Government
|
||||
barak-online.net,Netvision,ISP
|
||||
barcconnects.net,BARC Connects,ISP
|
||||
@@ -1760,6 +1762,7 @@ cello.co.nz,Cello Group,ISP
|
||||
celsiainternet.com,Celsia Internet,ISP
|
||||
celya.fr,Celya (Carrefour),ISP
|
||||
cencominc.com,Cencom,ISP
|
||||
cencora.com,Cencora,Healthcare
|
||||
cenet.catholic.edu.au,CEnet Catholic Education Network,Education
|
||||
cengagebrain.com,Cengage Learning,Education
|
||||
cenic.org,CENIC,Nonprofit
|
||||
@@ -2292,7 +2295,7 @@ conrad.nyc,ConradIT,MSP
|
||||
consideredcreative.com,Considered Creative,Marketing
|
||||
consilio.com,Consilio,Legal
|
||||
consol.com,Consolidated Edison (ConEd),Utilities
|
||||
consolidated.com,Consolidated Communications,ISP
|
||||
consolidated.com,Fidium Fiber,ISP
|
||||
consolidated.coop,Consolidated,Utilities
|
||||
consolidatedlabel.com,Consolidated Label,Print
|
||||
consolidatednd.com,Consolidated Telcom,ISP
|
||||
@@ -2399,6 +2402,7 @@ countyofriverside.us,County of Riverside,Government
|
||||
courierplus.net,Courier Plus,Logistics
|
||||
covage.com,Covage,ISP
|
||||
covenantuniversity.edu.ng,Covenant University,Education
|
||||
covermymeds.com,CoverMyMeds,Healthcare
|
||||
cox.com,Cox Communications,ISP
|
||||
cox.net,Cox Communications,ISP
|
||||
coxenterprises.com,Cox Enterprises,Conglomerate
|
||||
@@ -2642,7 +2646,7 @@ data.cr,American Data Networks Costa Rica,ISP
|
||||
data102.com,Hivelocity (Data102),Web Host
|
||||
data443.com,Data443,Email Security
|
||||
databank.com,DataBank,Web Host
|
||||
databridgesites.com,Meridian Parkway Data Center Owner,Web Host
|
||||
databridgesites.com,TierPoint,Web Host
|
||||
datacamp.co.uk,CDN77,Web Host
|
||||
datacanopy.com,Data Canopy Colocation,IaaS
|
||||
datacate.com,Datacate,Web Host
|
||||
@@ -3355,7 +3359,7 @@ emailsecurity.app,Mesh Security,Email Security
|
||||
emailservice.io,Mailprotector,Email Security
|
||||
emailsrv.net,Mailprotector,Email Security
|
||||
emailsrvr.com,Rackspace Email,Email Security
|
||||
emarsys.com,SAP Emarsys,Marketing
|
||||
emarsys.com,SAP Engagement Cloud,Marketing
|
||||
emberpoint.com,Ember Point,SaaS
|
||||
embou.com,Embou,ISP
|
||||
embrapa.br,Embrapa Brazilian Agricultural Research Corporation,Government
|
||||
@@ -9410,7 +9414,7 @@ richmond.edu,University of Richmond,Education
|
||||
richmondfed.org,Federal Reserve Bank of Richmond,Government
|
||||
ricoh-usa.com,Ricoh USA,Manufacturing
|
||||
ridsa.com.ar,Red Intercable Digital,ISP
|
||||
rig.net,RigNet,ISP
|
||||
rig.net,Viasat,ISP
|
||||
rightel.ir,Rightel,ISP
|
||||
rightnowtech.com,Oracle Service Cloud,SaaS
|
||||
rightside.ru,Telecoma,ISP
|
||||
@@ -9552,7 +9556,7 @@ rvurology.com,Rogue Valley Urology,Healthcare
|
||||
rwas.co.uk,Royal Welsh Agricultural Society,Agriculture
|
||||
rwth-aachen.de,RWTH Aachen University,Education
|
||||
rwts.com.au,Real World Technology Solutions,MSP
|
||||
rxlightning.com,RxLightning,Healthcare
|
||||
rxlightning.com,CoverMyMeds,Healthcare
|
||||
rybnet.pl,Rybnet,ISP
|
||||
ryoka.co.jp,Ryoka Denka Kasei,Manufacturing
|
||||
rzd.ru,Russian Railways,Logistics
|
||||
@@ -11041,7 +11045,7 @@ telepacific.net,TPx Communications,ISP
|
||||
telepark-passau.de,Telepark Passau,ISP
|
||||
teleperformance.com,Teleperformance,SaaS
|
||||
telepermit.co.nz,Spark NZ,ISP
|
||||
telepoint.bg,Telepoint,Web Host
|
||||
telepoint.bg,Digital Realty,Web Host
|
||||
telered.com.ar,Telered,ISP
|
||||
telering.at,Magenta,ISP
|
||||
telesat.com,Telesat,ISP
|
||||
@@ -11179,7 +11183,7 @@ thegirlandthefig.com,the girl & the fig,Food
|
||||
theglobalresearchnetwork.com,Global Healthcare Research,Healthcare
|
||||
thehartford.com,HARTFORD FIRE INSURANCE,Finance
|
||||
thehost.ua,TheHost Ukraine,Web Host
|
||||
thehostgroup.com,The Host Group,Web Host
|
||||
thehostgroup.com,HostGo,Web Host
|
||||
theice.com,Intercontinental Exchange (ICE),Finance
|
||||
theinternetsubway.us,The Internet Subway,ISP
|
||||
themercury.com,The Mercury,News
|
||||
@@ -11759,7 +11763,7 @@ ultahost.com,UltaHost,Web Host
|
||||
ultel.net,Ultel,ISP
|
||||
ultimate-guitar.com,Ultimate Guitar,Entertainment
|
||||
ultimatedomain.hosting,Ultimate Domain Hosting,Web Host
|
||||
ultisat.com,Globecomm Services Maryland,MSP
|
||||
ultisat.com,UltiSat,MSP
|
||||
ultra.one,UltraOne,ISP
|
||||
ultralinkce.com.br,Ultralink,ISP
|
||||
ultralinkweb.com.br,Ultralink Telecom,ISP
|
||||
@@ -11845,7 +11849,7 @@ unidadeditorial.es,Unidad Editorial,News
|
||||
unidata.it,Unidata,ISP
|
||||
unifesp.br,Universidade Federal de Sao Paulo,Education
|
||||
unifiedlayer.com,UnifiedLayers,Web Host
|
||||
unifiedpostgroup.com,Unifiedpost Group,SaaS
|
||||
unifiedpostgroup.com,Banqup,SaaS
|
||||
unifique.com.br,Unifique,ISP
|
||||
unifique.net,Unifique,ISP
|
||||
unijos.edu.ng,University of Jos,Education
|
||||
|
||||
|
@@ -158,17 +158,30 @@ _FULL_IP_RE = re.compile(
|
||||
# Rebrand-signal scan. Triggered phrases are followed by a captured brand name
|
||||
# (capitalized, non-noise word). The reviewer ultimately judges whether a hit
|
||||
# is a real rebrand banner — the regex's job is to not miss the obvious ones.
|
||||
# Real cases: "now Navanta", "is now part of Lumen", "formerly known as
|
||||
# Symantec Email Security", "we became Newfold Digital".
|
||||
# Real cases: "BankOnIT is now Navanta", "is now part of Lumen", "we are now
|
||||
# Cencora", "formerly known as Symantec Email Security", "we became Newfold
|
||||
# Digital".
|
||||
#
|
||||
# A bare leading "now <Capital>" was tried and dropped — modern marketing
|
||||
# pages saturate the body text with CTA fragments like "Buy Now PROMO",
|
||||
# "Order Now Free Shipping", "Apply Now Who We Are", which all match a bare
|
||||
# `now <Capital>` and are 95%+ false positives. Requiring a copular verb
|
||||
# (`is/are/was/were/am now`) keeps the linguistic shape of an actual
|
||||
# announcement and rules out CTA buttons. The same is true in reverse for
|
||||
# bare "formerly <Capital>" — kept because "formerly" virtually never
|
||||
# appears in a CTA context, but the same noise list catches the residual
|
||||
# "Formerly Available" / "Formerly Open" cases.
|
||||
REBRAND_RE = re.compile(
|
||||
r"(?:"
|
||||
r"(?:now|formerly(?: known as)?) "
|
||||
r"\b(?:"
|
||||
r"formerly(?: known as)? "
|
||||
r"|"
|
||||
r"(?:is|are|was|were|am) now (?:(?:a )?part of )?"
|
||||
r"|"
|
||||
r"(?:we became|rebranded(?: as| to)?|merged with|"
|
||||
r"acquired by|previously known as|previously operated as|"
|
||||
r"is now (?:a )?part of|new name for|joined the) "
|
||||
r"new name for) "
|
||||
r")"
|
||||
r"([A-Za-z][A-Za-z0-9&]+)",
|
||||
r"([A-Za-z][A-Za-z0-9&]+)\b",
|
||||
re.IGNORECASE,
|
||||
)
|
||||
|
||||
@@ -182,14 +195,11 @@ REBRAND_RE = re.compile(
|
||||
# change / etc. by a space, dash, or underscore, which virtually never
|
||||
# occurs outside a rebrand context.
|
||||
REBRAND_PATH_RE = re.compile(
|
||||
r"(?:"
|
||||
r"rebrand"
|
||||
r"|brand[ _-](?:launch|announcement|reveal|refresh|change|update)"
|
||||
r"|name[ _-]change"
|
||||
r"\b(?:"
|
||||
r"brand[ _-](?:launch|announcement|reveal)"
|
||||
r"|our[ _-]new[ _-](?:name|brand)"
|
||||
r"|new[ _-]name[ _-]for"
|
||||
r"|(?:acquisition|merger)[ _-]announcement"
|
||||
r")",
|
||||
r")\b",
|
||||
re.IGNORECASE,
|
||||
)
|
||||
|
||||
@@ -199,35 +209,62 @@ REBRAND_PATH_RE = re.compile(
|
||||
# narrow so real one-word brand names (Navanta, Lumen, Sykt, etc.) survive.
|
||||
_REBRAND_NOISE = frozenset(
|
||||
{
|
||||
"Available",
|
||||
"Accepting",
|
||||
"Active",
|
||||
"Booking",
|
||||
"Closed",
|
||||
"Complete",
|
||||
"Enrolling",
|
||||
"Expanding",
|
||||
"Free",
|
||||
"Hiring",
|
||||
"Live",
|
||||
"Loading",
|
||||
"Offering",
|
||||
"Online",
|
||||
"Open",
|
||||
"Operating",
|
||||
"Pending",
|
||||
"Playing",
|
||||
"Powered",
|
||||
"Selling",
|
||||
"Serving",
|
||||
"Shipping",
|
||||
"Showing",
|
||||
"Streaming",
|
||||
"Supporting",
|
||||
"Trending",
|
||||
"Underway",
|
||||
"You",
|
||||
"Your",
|
||||
# Past-participles / present-participles that the "are now <Cap>"
|
||||
# / "is now <Cap>" pattern picks up from ordinary marketing prose.
|
||||
# Compared case-insensitively against the captured brand, so a
|
||||
# single entry covers any casing the page emits ("LIVE", "Live",
|
||||
# "live"). Add lowercase forms here.
|
||||
"available",
|
||||
"accepting",
|
||||
"active",
|
||||
"booking",
|
||||
"closed",
|
||||
"complete",
|
||||
"enrolling",
|
||||
"expanding",
|
||||
"free",
|
||||
"hiring",
|
||||
"installed",
|
||||
"live",
|
||||
"loading",
|
||||
"offering",
|
||||
"online",
|
||||
"open",
|
||||
"operating",
|
||||
"part", # "is now Part of [our family]" already filtered by structure;
|
||||
# this catches inverted phrasing where "Part" is the captured token.
|
||||
"pending",
|
||||
"playing",
|
||||
"powered",
|
||||
"secure", # "is now Secure Managed Wi-Fi" / "is now Secure Login"
|
||||
"selling",
|
||||
"serving",
|
||||
"shipping",
|
||||
"showing",
|
||||
"streaming",
|
||||
"supporting",
|
||||
"trending",
|
||||
"underway",
|
||||
# Short prepositions / pronouns that grammatically follow the verb
|
||||
# but are not brand names: "are now In Control", "is now On the air".
|
||||
"down",
|
||||
"in",
|
||||
"off",
|
||||
"on",
|
||||
"out",
|
||||
"up",
|
||||
"you",
|
||||
"your",
|
||||
# Standards / certifications that follow "is now <CERT> certified"
|
||||
# in marketing copy (compliance announcements).
|
||||
"iso",
|
||||
# Social-media platform rebrands that ubiquitously appear in
|
||||
# footers as "X (formerly Twitter)", "Meta (formerly Facebook)",
|
||||
# "Block (formerly Square)". The mention is real but it's almost
|
||||
# never about the page operator's own rebrand.
|
||||
"twitter",
|
||||
"facebook",
|
||||
"square",
|
||||
}
|
||||
)
|
||||
|
||||
@@ -349,7 +386,7 @@ def _rebrand_signal(*texts: str) -> str:
|
||||
# post-trigger noise like "now hiring" / "formerly available".
|
||||
if not brand or not brand[0].isupper():
|
||||
continue
|
||||
if brand in _REBRAND_NOISE:
|
||||
if brand.lower() in _REBRAND_NOISE:
|
||||
continue
|
||||
start = max(0, m.start() - 30)
|
||||
end = min(len(text), m.end() + 80)
|
||||
|
||||
@@ -1,6 +1,13 @@
|
||||
#!/usr/bin/env python
|
||||
"""Re-fetch mapped reverse-DNS base domains and surface possible rebrand signals.
|
||||
|
||||
Cadence: run roughly once a year. Operator rebrands and acquisitions
|
||||
accumulate slowly, and a yearly sweep is sufficient to keep the map current
|
||||
without spending review effort on near-empty diffs. This is not part of the
|
||||
standard per-batch mapping workflow — that workflow uses the related
|
||||
`collect_domain_info.py` for unmapped domains. Use this script when you want
|
||||
to revisit the *already-mapped* set for drift.
|
||||
|
||||
Walks `base_reverse_dns_map.csv`, fetches each domain's homepage with the same
|
||||
machinery used by `collect_domain_info.py`, and writes a TSV listing rows where
|
||||
one of two default drift signals fired:
|
||||
|
||||