diff --git a/AGENTS.md b/AGENTS.md index 9ed8900..06d43bb 100644 --- a/AGENTS.md +++ b/AGENTS.md @@ -225,7 +225,7 @@ When `unknown_base_reverse_dns.csv` has new entries, follow this order rather th - `find_unknown_base_reverse_dns.py` — regenerates `unknown_base_reverse_dns.csv` from `base_reverse_dns.csv` by subtracting what is already mapped or known-unknown. Enforces the no-full-IP privacy rule at ingest. Translates non-domain-shaped `source_name` rows (raw MMDB `as_name` strings surfaced by the ASN-fallback path in `utils.py:get_ip_address_info` when the IP had no PTR and the `as_domain` was uncategorized) to their corresponding `as_domain` via the bundled MMDB, so the row enters the pipeline as a researchable domain (and drops out automatically if that `as_domain` is already mapped). Run after merging a batch. - `detect_psl_overrides.py` — scans the lists for clustered IP-containing patterns, auto-adds brand suffixes to `psl_overrides.txt`, folds affected entries to their base, and removes any remaining full-IP entries. Run before the collector on any new batch. - `collect_domain_info.py` — the bulk enrichment collector described above. Respects `psl_overrides.txt` and skips full-IP entries. Two derived columns surface drift signals that are also useful during initial classification: `rebrand_signal` combines a body-text regex (matches "now X", "formerly known as X", "is now part of X", etc.) with a path/alt-text regex (matches "rebrand", "brand-launch", "brand-announcement", "name-change", "our-new-name") so that image-only acquisition banners — `Brand announcement` — also fire. `external_links` lists the homepage's non-self, non-social outbound link hosts; useful as review context but not a flag trigger by default in the drift sweep (most external links are to partners / customers / vendors and don't indicate a rebrand). 
-- `detect_rebrands.py` — drift sweep that re-fetches every key in `base_reverse_dns_map.csv` with the same machinery as `collect_domain_info.py` and emits a TSV of rows where `rebrand_signal` or `redirect_changed` (final URL host doesn't sit under the input domain) fired. Output is for periodic review — a single signal is one corroborating source; promoting a flagged row still needs a second source per the two-corroborating-sources rule. Resume-safe via `-o`. Use `--limit N` to spot-check a slice; `--include-clean` to also emit non-flagged rows; `--flag-external-links` to additionally flag rows whose only signal is an outbound non-self host (off by default to keep partner/vendor noise out of the review queue). +- `detect_rebrands.py` — drift sweep that re-fetches every key in `base_reverse_dns_map.csv` with the same machinery as `collect_domain_info.py` and emits a TSV of rows where `rebrand_signal` or `redirect_changed` (final URL host doesn't sit under the input domain) fired. **Run once a year, not more often** — operator rebrands accumulate slowly and a yearly cadence is enough to keep the map current without spending review effort on near-empty diffs. Not part of the standard per-batch workflow. Output is for periodic review — a single signal is one corroborating source; promoting a flagged row still needs a second source per the two-corroborating-sources rule. Resume-safe via `-o`. Use `--limit N` to spot-check a slice; `--include-clean` to also emit non-flagged rows; `--flag-external-links` to additionally flag rows whose only signal is an outbound non-self host (off by default to keep partner/vendor noise out of the review queue). - `find_bad_utf8.py` — locates invalid UTF-8 bytes (used after past encoding corruption). - `sortlists.py` — case-insensitive sort + dedupe + `type`-column validator for the list files; the authoritative sorter run after every batch edit. 
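
Reviewer note on the `redirect_changed` signal described in the AGENTS.md hunk above ("final URL host doesn't sit under the input domain"): a minimal sketch of one way to read that check. The function name `redirect_changed_sketch` and its handling of edge cases are assumptions made for illustration; the actual implementation in `detect_rebrands.py` is not shown in this diff and may differ.

```python
from urllib.parse import urlsplit


def redirect_changed_sketch(input_domain: str, final_url: str) -> bool:
    """Return True when the final URL's host is not the input domain
    and not a subdomain of it (hypothetical helper, not the real code)."""
    # urlsplit().hostname is already lowercased; it is None for URLs
    # without a network location.
    host = (urlsplit(final_url).hostname or "").rstrip(".")
    domain = input_domain.lower().rstrip(".")
    if not host:
        # Assumption: with no usable host, report no signal rather than guess.
        return False
    # Require a label boundary so "notexample.com" does not count as
    # sitting under "example.com".
    return not (host == domain or host.endswith("." + domain))


if __name__ == "__main__":
    # A mapped key whose homepage now lands on a different registrable domain.
    print(redirect_changed_sketch("amerisourcebergen.com", "https://www.cencora.com/"))
    # A redirect to a subdomain of the same domain.
    print(redirect_changed_sketch("example.com", "https://www.example.com/home"))
```

The label-boundary comparison (`"." + domain`) is the one design choice worth noting: a plain suffix match would treat unrelated hosts that merely end in the same characters as "under" the input domain and suppress a real drift signal.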
diff --git a/parsedmarc/resources/maps/README.md b/parsedmarc/resources/maps/README.md index 3fbe31f..8b8c043 100644 --- a/parsedmarc/resources/maps/README.md +++ b/parsedmarc/resources/maps/README.md @@ -140,6 +140,8 @@ The output of `collect_domain_info.py`. Tab-separated, one row per researched do ## detect_rebrands.py +**Cadence: run roughly once a year.** Not part of the standard mapping workflow — operator rebrands and acquisitions accumulate slowly, and a yearly sweep is sufficient to keep `base_reverse_dns_map.csv` from drifting out of date. There is no benefit to running it more often. + Drift sweep that re-fetches every key in `base_reverse_dns_map.csv` with the same machinery as `collect_domain_info.py` and writes a TSV (`rebrand_drift.tsv` by default) of rows where a drift signal fired. Two signals are flagged by default: - `rebrand_signal` — the collector's body-text and path/alt-text regexes (see above) matched. diff --git a/parsedmarc/resources/maps/base_reverse_dns_map.csv b/parsedmarc/resources/maps/base_reverse_dns_map.csv index 4d6305d..ff59362 100644 --- a/parsedmarc/resources/maps/base_reverse_dns_map.csv +++ b/parsedmarc/resources/maps/base_reverse_dns_map.csv @@ -290,6 +290,7 @@ advania.no,Advania Norway,MSP adventconstructions.co.tz,Advent Construction Limited,Industrial adventhealth.com,AdventHealth,Healthcare adventisthealth.org,Adventist Health,Healthcare +advocatehealth.com,Advocate Health,Healthcare adyen.com,Adyen,Finance adyl.com.br,Adyl Telecom,ISP ae.com.br,Agência Estado,News @@ -575,7 +576,7 @@ americantower.com,American Tower,Technology americantower.com.br,American Tower,Technology americatelnet.com.pe,Americatel Peru,ISP amerinoc.com,AmeriNOC,Web Host -amerisourcebergen.com,AMERISOURCEBERGEN,Healthcare +amerisourcebergen.com,Cencora,Healthcare ameslab.gov,Ames Laboratory,Government amethyst.co.jp,Amethyst,Healthcare amfam.com,American Family Insurance,Finance @@ -918,7 +919,7 @@ auriga.com,Auriga,Technology 
auriganet.in,Auriganet Digital Technologies,ISP auris.com,Auris,SaaS aurologic.com,aurologic,Web Host -aurorahealthcare.org,Aurora Health Care,Healthcare +aurorahealthcare.org,Advocate Health,Healthcare ausgrid.com.au,Ausgrid,Utilities auspost.com.au,Australia Post,Logistics aussiebb.com.au,Aussie Broadband,ISP @@ -1051,6 +1052,7 @@ bank-banque-canada.ca,Bank of Canada,Government bank-verlag.de,Bank-Verlag,Finance bankofamerica.com,Bank of America,Finance bankonitusa.com,Navanta,MSP +banqup.com,Banqup,SaaS banxico.org.mx,Banco de Mexico,Government barak-online.net,Netvision,ISP barcconnects.net,BARC Connects,ISP @@ -1760,6 +1762,7 @@ cello.co.nz,Cello Group,ISP celsiainternet.com,Celsia Internet,ISP celya.fr,Celya (Carrefour),ISP cencominc.com,Cencom,ISP +cencora.com,Cencora,Healthcare cenet.catholic.edu.au,CEnet Catholic Education Network,Education cengagebrain.com,Cengage Learning,Education cenic.org,CENIC,Nonprofit @@ -2292,7 +2295,7 @@ conrad.nyc,ConradIT,MSP consideredcreative.com,Considered Creative,Marketing consilio.com,Consilio,Legal consol.com,Consolidated Edison (ConEd),Utilities -consolidated.com,Consolidated Communications,ISP +consolidated.com,Fidium Fiber,ISP consolidated.coop,Consolidated,Utilities consolidatedlabel.com,Consolidated Label,Print consolidatednd.com,Consolidated Telcom,ISP @@ -2399,6 +2402,7 @@ countyofriverside.us,County of Riverside,Government courierplus.net,Courier Plus,Logistics covage.com,Covage,ISP covenantuniversity.edu.ng,Covenant University,Education +covermymeds.com,CoverMyMeds,Healthcare cox.com,Cox Communications,ISP cox.net,Cox Communications,ISP coxenterprises.com,Cox Enterprises,Conglomerate @@ -2642,7 +2646,7 @@ data.cr,American Data Networks Costa Rica,ISP data102.com,Hivelocity (Data102),Web Host data443.com,Data443,Email Security databank.com,DataBank,Web Host -databridgesites.com,Meridian Parkway Data Center Owner,Web Host +databridgesites.com,TierPoint,Web Host datacamp.co.uk,CDN77,Web Host datacanopy.com,Data 
Canopy Colocation,IaaS datacate.com,Datacate,Web Host @@ -3355,7 +3359,7 @@ emailsecurity.app,Mesh Security,Email Security emailservice.io,Mailprotector,Email Security emailsrv.net,Mailprotector,Email Security emailsrvr.com,Rackspace Email,Email Security -emarsys.com,SAP Emarsys,Marketing +emarsys.com,SAP Engagement Cloud,Marketing emberpoint.com,Ember Point,SaaS embou.com,Embou,ISP embrapa.br,Embrapa Brazilian Agricultural Research Corporation,Government @@ -9410,7 +9414,7 @@ richmond.edu,University of Richmond,Education richmondfed.org,Federal Reserve Bank of Richmond,Government ricoh-usa.com,Ricoh USA,Manufacturing ridsa.com.ar,Red Intercable Digital,ISP -rig.net,RigNet,ISP +rig.net,Viasat,ISP rightel.ir,Rightel,ISP rightnowtech.com,Oracle Service Cloud,SaaS rightside.ru,Telecoma,ISP @@ -9552,7 +9556,7 @@ rvurology.com,Rogue Valley Urology,Healthcare rwas.co.uk,Royal Welsh Agricultural Society,Agriculture rwth-aachen.de,RWTH Aachen University,Education rwts.com.au,Real World Technology Solutions,MSP -rxlightning.com,RxLightning,Healthcare +rxlightning.com,CoverMyMeds,Healthcare rybnet.pl,Rybnet,ISP ryoka.co.jp,Ryoka Denka Kasei,Manufacturing rzd.ru,Russian Railways,Logistics @@ -11041,7 +11045,7 @@ telepacific.net,TPx Communications,ISP telepark-passau.de,Telepark Passau,ISP teleperformance.com,Teleperformance,SaaS telepermit.co.nz,Spark NZ,ISP -telepoint.bg,Telepoint,Web Host +telepoint.bg,Digital Realty,Web Host telered.com.ar,Telered,ISP telering.at,Magenta,ISP telesat.com,Telesat,ISP @@ -11179,7 +11183,7 @@ thegirlandthefig.com,the girl & the fig,Food theglobalresearchnetwork.com,Global Healthcare Research,Healthcare thehartford.com,HARTFORD FIRE INSURANCE,Finance thehost.ua,TheHost Ukraine,Web Host -thehostgroup.com,The Host Group,Web Host +thehostgroup.com,HostGo,Web Host theice.com,Intercontinental Exchange (ICE),Finance theinternetsubway.us,The Internet Subway,ISP themercury.com,The Mercury,News @@ -11759,7 +11763,7 @@ ultahost.com,UltaHost,Web Host 
ultel.net,Ultel,ISP ultimate-guitar.com,Ultimate Guitar,Entertainment ultimatedomain.hosting,Ultimate Domain Hosting,Web Host -ultisat.com,Globecomm Services Maryland,MSP +ultisat.com,UltiSat,MSP ultra.one,UltraOne,ISP ultralinkce.com.br,Ultralink,ISP ultralinkweb.com.br,Ultralink Telecom,ISP @@ -11845,7 +11849,7 @@ unidadeditorial.es,Unidad Editorial,News unidata.it,Unidata,ISP unifesp.br,Universidade Federal de Sao Paulo,Education unifiedlayer.com,UnifiedLayers,Web Host -unifiedpostgroup.com,Unifiedpost Group,SaaS +unifiedpostgroup.com,Banqup,SaaS unifique.com.br,Unifique,ISP unifique.net,Unifique,ISP unijos.edu.ng,University of Jos,Education diff --git a/parsedmarc/resources/maps/collect_domain_info.py b/parsedmarc/resources/maps/collect_domain_info.py index 6b69c52..8bd5057 100644 --- a/parsedmarc/resources/maps/collect_domain_info.py +++ b/parsedmarc/resources/maps/collect_domain_info.py @@ -158,17 +158,30 @@ _FULL_IP_RE = re.compile( # Rebrand-signal scan. Triggered phrases are followed by a captured brand name # (capitalized, non-noise word). The reviewer ultimately judges whether a hit # is a real rebrand banner — the regex's job is to not miss the obvious ones. -# Real cases: "now Navanta", "is now part of Lumen", "formerly known as -# Symantec Email Security", "we became Newfold Digital". +# Real cases: "BankOnIT is now Navanta", "is now part of Lumen", "we are now +# Cencora", "formerly known as Symantec Email Security", "we became Newfold +# Digital". +# +# A bare leading "now " was tried and dropped — modern marketing +# pages saturate the body text with CTA fragments like "Buy Now PROMO", +# "Order Now Free Shipping", "Apply Now Who We Are", which all match a bare +# `now ` and are 95%+ false positives. Requiring a copular verb +# (`is/are/was/were/am now`) keeps the linguistic shape of an actual +# announcement and rules out CTA buttons. 
The same is true in reverse for +# bare "formerly " — kept because "formerly" virtually never +# appears in a CTA context, but the same noise list catches the residual +# "Formerly Available" / "Formerly Open" cases. REBRAND_RE = re.compile( - r"(?:" - r"(?:now|formerly(?: known as)?) " + r"\b(?:" + r"formerly(?: known as)? " + r"|" + r"(?:is|are|was|were|am) now (?:(?:a )?part of )?" r"|" r"(?:we became|rebranded(?: as| to)?|merged with|" r"acquired by|previously known as|previously operated as|" - r"is now (?:a )?part of|new name for|joined the) " + r"new name for) " r")" - r"([A-Za-z][A-Za-z0-9&]+)", + r"([A-Za-z][A-Za-z0-9&]+)\b", re.IGNORECASE, ) @@ -182,14 +195,11 @@ REBRAND_RE = re.compile( # change / etc. by a space, dash, or underscore, which virtually never # occurs outside a rebrand context. REBRAND_PATH_RE = re.compile( - r"(?:" - r"rebrand" - r"|brand[ _-](?:launch|announcement|reveal|refresh|change|update)" - r"|name[ _-]change" + r"\b(?:" + r"brand[ _-](?:launch|announcement|reveal)" r"|our[ _-]new[ _-](?:name|brand)" - r"|new[ _-]name[ _-]for" r"|(?:acquisition|merger)[ _-]announcement" - r")", + r")\b", re.IGNORECASE, ) @@ -199,35 +209,62 @@ REBRAND_PATH_RE = re.compile( # narrow so real one-word brand names (Navanta, Lumen, Sykt, etc.) survive. _REBRAND_NOISE = frozenset( { - "Available", - "Accepting", - "Active", - "Booking", - "Closed", - "Complete", - "Enrolling", - "Expanding", - "Free", - "Hiring", - "Live", - "Loading", - "Offering", - "Online", - "Open", - "Operating", - "Pending", - "Playing", - "Powered", - "Selling", - "Serving", - "Shipping", - "Showing", - "Streaming", - "Supporting", - "Trending", - "Underway", - "You", - "Your", + # Past-participles / present-participles that the "are now " + # / "is now " pattern picks up from ordinary marketing prose. + # Compared case-insensitively against the captured brand, so a + # single entry covers any casing the page emits ("LIVE", "Live", + # "live"). Add lowercase forms here. 
+ "available", + "accepting", + "active", + "booking", + "closed", + "complete", + "enrolling", + "expanding", + "free", + "hiring", + "installed", + "live", + "loading", + "offering", + "online", + "open", + "operating", + "part", # "is now Part of [our family]" already filtered by structure; + # this catches inverted phrasing where "Part" is the captured token. + "pending", + "playing", + "powered", + "secure", # "is now Secure Managed Wi-Fi" / "is now Secure Login" + "selling", + "serving", + "shipping", + "showing", + "streaming", + "supporting", + "trending", + "underway", + # Short prepositions / pronouns that grammatically follow the verb + # but are not brand names: "are now In Control", "is now On the air". + "down", + "in", + "off", + "on", + "out", + "up", + "you", + "your", + # Standards / certifications that follow "is now certified" + # in marketing copy (compliance announcements). + "iso", + # Social-media platform rebrands that ubiquitously appear in + # footers as "X (formerly Twitter)", "Meta (formerly Facebook)", + # "Block (formerly Square)". The mention is real but it's almost + # never about the page operator's own rebrand. + "twitter", + "facebook", + "square", } ) @@ -349,7 +386,7 @@ def _rebrand_signal(*texts: str) -> str: # post-trigger noise like "now hiring" / "formerly available". if not brand or not brand[0].isupper(): continue - if brand in _REBRAND_NOISE: + if brand.lower() in _REBRAND_NOISE: continue start = max(0, m.start() - 30) end = min(len(text), m.end() + 80) diff --git a/parsedmarc/resources/maps/detect_rebrands.py b/parsedmarc/resources/maps/detect_rebrands.py index 35215ee..5461cbe 100644 --- a/parsedmarc/resources/maps/detect_rebrands.py +++ b/parsedmarc/resources/maps/detect_rebrands.py @@ -1,6 +1,13 @@ #!/usr/bin/env python """Re-fetch mapped reverse-DNS base domains and surface possible rebrand signals. +Cadence: run roughly once a year. 
Operator rebrands and acquisitions +accumulate slowly, and a yearly sweep is sufficient to keep the map current +without spending review effort on near-empty diffs. This is not part of the +standard per-batch mapping workflow — that workflow uses the related +`collect_domain_info.py` for unmapped domains. Use this script when you want +to revisit the *already-mapped* set for drift. + Walks `base_reverse_dns_map.csv`, fetches each domain's homepage with the same machinery used by `collect_domain_info.py`, and writes a TSV listing rows where one of two default drift signals fired: