Jan Wildeboer 😷:krulorange:jwildeboer@social.wildeboer.net
Jun 29, 2026, 8:12 AM
My timeline (which contains a lot of project leaders/sysadmins from big projects) is filling with posts about a new, ongoing wave of what most likely are scrapers collecting training data for „AI“ companies. They seem to be using botnets (or what some call „residential IP proxies“ to make it sound a bit more legitimate) with millions of IP addresses, making it really hard to defend against. Some have decided to take their sites down until this is over. This is now the world we live in :(
💬 11🔄 183⭐ 167

Replies

alalan@lighthouse.co.im
Jun 29, 2026, 8:18 AM
@jwildeboer It's weird that the response is to take sites down rather than reach for technical countermeasures -- rate limiting, UA filtering, datacenter ASN blocks. Is the residential proxy problem genuinely that hard to solve at scale, or is the downtime itself the point? A visible protest signal rather than a quiet WAF tweak feels like a different kind of statement about where people think the leverage actually is?
💬 7🔄 0⭐ 4
Koen Hufkens, PhDkoen_hufkens@mastodon.social
Jun 29, 2026, 8:20 AM
@alan @jwildeboer Many of the people affected don't have the (extra) time, money or expertise.
💬 1🔄 4⭐ 29
Kat SKatS@chaosfem.tw
Jul 3, 2026, 8:34 AM
@koen_hufkens @alan @jwildeboer This. My sites have been down for months because I just can't face the work involved in defending against this shit.
I'm sure it's within my technical capabilities; I'm just still too burnt out to properly think about it.
💬 1🔄 1⭐ 3
Koen Hufkens, PhDkoen_hufkens@mastodon.social
Jul 3, 2026, 8:49 AM
@KatS @alan @jwildeboer This further raises the issue of enclosure of the web.
Where sites could formerly be put up and maintained easily you now are forced / directed towards platforms as they have the means, at scale, to deal with this.
💬 1🔄 0⭐ 2
Libraries Are Data Centers 🌈coreysnipes@hachyderm.io
Jul 3, 2026, 9:51 AM
@koen_hufkens So true.
💬 0🔄 0⭐ 0
Agnieszka R. Turczyńskaagturcz@circumstances.run
Jun 29, 2026, 8:43 AM
@alan I'm sorry for asking that, but do you actually know what a "residential proxy" is?
@jwildeboer
💬 1🔄 1⭐ 3
Lauri Kotilainenrytmis@hachyderm.io
Jun 29, 2026, 8:21 PM
@agturcz @alan @jwildeboer
Apparently many ”smart” TV manufacturers ship proxy SDKs from companies like Bright, and they turn the TVs into nodes in a botnet that is used for ”AI” data scraping, so the traffic comes from all over the place.
I’d guess not many consumers know about it, let alone have the technical know-how to prevent it.
💬 1🔄 2⭐ 3
Beady Belle FanchannelProfpatsch@mastodon.xyz
Jul 2, 2026, 1:48 PM
@rytmis @agturcz @alan @jwildeboer omg this is a monetization angle for TVs that is just so obvious when you consider the race to the bottom in that industry
💬 1🔄 0⭐ 0
Lauri Kotilainenrytmis@hachyderm.io
Jul 2, 2026, 1:56 PM
@Profpatsch @agturcz @alan @jwildeboer
Yep. I just read about it some weeks back and immediately tried to look for dumb TVs as an alternative. Of course, they don’t really exist as a product category any more, so the next best thing was to block those things at the router. ☹️
💬 0🔄 0⭐ 0
Leonardo Di OttioLeonardoDiOttio@mastodon.social
Jun 29, 2026, 8:52 AM
@alan @jwildeboer As these IPs are largely from people’s personal connections (for instance because they have a malware infected Smart TV or router, or run some kind of smart TV/free game/browser extension with this dubious code deliberately inserted in it) you would effectively be blocking entire consumer ISPs.
If you run a regular website you want people using regular consumer ISPs to reach it.
That makes the use of these proxies so effective.
💬 1🔄 1⭐ 9
JPfroztbyte@mastodon.social
Jun 29, 2026, 2:50 PM
@LeonardoDiOttio @alan @jwildeboer it's not only that kind of stuff, fwiw - go look up bright data's sdk (for example), then do some speculative math on how many people are out there with phones that are full of apps with that sort of shit in it (there's more than only bright-sdk out there)
💬 0🔄 0⭐ 2
Jan Wildeboer 😷:krulorange:jwildeboer@social.wildeboer.net
Jun 29, 2026, 8:58 AM
@alan These botnets are more or less immune to rate limiting, as they use many (and I mean millions) of IP addresses for a run, and each IP address is only used for a few requests before it is put back in the queue. The IP addresses are also from many different providers, so a (sub-)net wide block also doesn't help. I wrote about those "residential IP proxies" in [1] and [2].
[1] https://jan.wildeboer.net/2025/02/Blocking-Stealthy-Botnets/
[2] https://jan.wildeboer.net/2025/04/Web-is-Broken-Botnet-Part-2/
💬 4🔄 15⭐ 29
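Editor's sketch of the point above, assuming a hypothetical fixed-window limiter (`PerIpLimiter`), a made-up limit of 10 requests per window, and synthetic addresses. None of these names or numbers come from the thread; the only thing it illustrates is that a per-IP threshold never fires when each address sends just a few requests.

```python
from collections import Counter
import ipaddress


class PerIpLimiter:
    """Hypothetical fixed-window limiter: at most `limit` requests per IP."""

    def __init__(self, limit):
        self.limit = limit
        self.counts = Counter()

    def allow(self, ip):
        # Count the request, then compare against the per-IP threshold.
        self.counts[ip] += 1
        return self.counts[ip] <= self.limit


def blocked_requests(ips, limit):
    """Return how many requests in `ips` a fresh limiter would reject."""
    limiter = PerIpLimiter(limit)
    return sum(0 if limiter.allow(ip) else 1 for ip in ips)


# One noisy client: 3000 requests from a single (documentation-range) address.
single_source = ["198.51.100.7"] * 3000

# Rotating pool: 1000 synthetic addresses, 3 requests each (same 3000 total).
base = int(ipaddress.IPv4Address("10.0.0.0"))
rotating_pool = [
    str(ipaddress.IPv4Address(base + i)) for i in range(1000) for _ in range(3)
]

blocked_single = blocked_requests(single_source, limit=10)
blocked_rotating = blocked_requests(rotating_pool, limit=10)
print(blocked_single, blocked_rotating)
```

The same request volume is almost fully rejected when it comes from one address and passes untouched when it is spread over the pool, which is the behaviour described in the post.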
alalan@lighthouse.co.im
Jun 29, 2026, 12:52 PM
@jwildeboer 1/2
Solidarity -- you're not alone in this. The one-attempt-per-IP pattern is specifically designed to be invisible to anything threshold-based. CrowdSec helps at the edges but a fresh residential IP making a single SASL attempt looks like a legitimate user having a bad day. Your manual cronjob approach is the right call. Automation just gives you false confidence.
💬 1🔄 4⭐ 10
alalan@lighthouse.co.im
Jun 29, 2026, 12:52 PM
@jwildeboer 2/2
The deeper problem is upstream. Apple, Google and Microsoft are allowing SDK-injected bandwidth harvesting through their app stores. Until that's addressed at source, we're all playing whack-a-mole with an essentially infinite residential IP pool. This isn't a mail security problem -- it's a platform accountability problem.
💬 2🔄 11⭐ 28
⠠⠵ avukoavuko@infosec.exchange
Jun 29, 2026, 4:02 PM
@alan @jwildeboer
It is all an accountability problem. 🤷🏻‍♂️
💬 0🔄 1⭐ 6
Cassandrichdalias@hachyderm.io
Jun 29, 2026, 6:36 PM
@alan @jwildeboer Yes, it is entirely Apple's and Google's fault that they are hosting botnet malware in their "walled gardens" as legitimate and vetted software. Without that, the botnets would not exist on any viable scale.
💬 0🔄 1⭐ 5
John Breenjab01701mid@mastodon.social
Jun 29, 2026, 6:55 PM
@jwildeboer @alan It's sad that we can't use MAC address-based filtering on the IoT client devices themselves. All of them reserve blocks of MAC addresses, usually from the NIC manufacturer's block, where it would be easy to block all traffic from "Samsung TV sets" or "Roku Devices" or "Apple TV".
💬 0🔄 0⭐ 0
Woozle Hypertwinwoozle@toot.cat
Jun 29, 2026, 9:48 PM
@jwildeboer @alan
I have a proposal: BotID
💬 0🔄 0⭐ 2
MidgePhotoPhoto55@mastodon.social
Jun 29, 2026, 10:58 PM
@jwildeboer @alan
I suppose one could rate limit the site's outward traffic, without reference to where it is going ...
And then have a very long list of specific addresses which get an extra rate.
And that list, which would be like a Squid access list, could be shared in some sections among quite a lot of sites.
💬 0🔄 0⭐ 0
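Editor's sketch of how that idea might look, assuming a token bucket for the site-wide outward budget plus a separate extra bucket for each listed address. The class names, rates, and the sample address are hypothetical, and the clock is passed in explicitly so the behaviour is easy to follow; this is one possible reading of the proposal, not a tested design.

```python
class TokenBucket:
    """Simple token bucket: refills at `rate` tokens/second up to `capacity`."""

    def __init__(self, rate, capacity, now=0.0):
        self.rate = rate
        self.capacity = capacity
        self.tokens = capacity
        self.last = now

    def take(self, now):
        # Refill according to elapsed time, then try to spend one token.
        elapsed = now - self.last
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False


class OutboundLimiter:
    """One shared budget for everybody, plus an extra budget per listed address."""

    def __init__(self, global_rate, extra_rate, allowlist):
        self.shared = TokenBucket(global_rate, global_rate)
        self.extra_rate = extra_rate
        self.allowlist = set(allowlist)  # the shareable list of known addresses
        self.extra = {}

    def allow(self, ip, now):
        if ip in self.allowlist:
            bucket = self.extra.setdefault(
                ip, TokenBucket(self.extra_rate, self.extra_rate, now)
            )
            if bucket.take(now):
                return True
        # Everyone else, and listed addresses past their extra rate, share this.
        return self.shared.take(now)


limiter = OutboundLimiter(global_rate=2, extra_rate=5, allowlist={"192.0.2.10"})
unknown = [limiter.allow("203.0.113.9", now=0.0) for _ in range(4)]
listed = [limiter.allow("192.0.2.10", now=0.0) for _ in range(6)]
print(unknown, listed)
```

The obvious cost is that ordinary visitors who are not on the list compete with the scrapers for the shared budget, so the list would have to be large and well maintained.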
🆘Bill Cole 🇺🇦grumpybozo@toad.social
Jun 29, 2026, 3:56 PM
@alan @jwildeboer The attacks have thwarted all of those tactics. They use UAs constructed from real UA tokens with minor variations. They have graduated from cheap VMs on Huawei Cloud and Digital Ocean to random IoTs in millions of households and mobile devices in millions of hands.
A few days ago I was able to measure over a thousand simultaneous sessions, each from a different /16 network.
My response to that isn’t taking the site down, but I am shedding load aggressively.
💬 1🔄 2⭐ 10
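Editor's sketch of one way to take that kind of measurement, assuming you already have a list of client IPv4 addresses for the currently open sessions. The function name and the sample addresses are hypothetical; it uses Python's standard `ipaddress` module and ignores IPv6.

```python
import ipaddress


def distinct_slash16(ips):
    """Return the set of /16 networks that the given IPv4 addresses fall into."""
    # strict=False masks off the host bits instead of raising an error.
    return {ipaddress.ip_network(f"{ip}/16", strict=False) for ip in ips}


# Documentation-range sample addresses, standing in for real session data.
session_ips = ["203.0.113.5", "203.0.200.9", "198.51.100.1"]
networks = distinct_slash16(session_ips)
print(len(networks), sorted(str(n) for n in networks))
```

When the number of distinct /16 networks is close to the number of sessions, as in the measurement above, subnet-wide blocks stop being a useful tool.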
zug zug? zig zig? zig zug? who knows?!algernon@come-from.mad-scientist.club
Jun 29, 2026, 6:39 PM
@grumpybozo @alan @jwildeboer FWIW, you can still mitigate most of them if you look at headers other than the user agent.
Many of the crawlers that try to disguise themselves as real browsers utterly fail at sending headers those browsers would, like sec-fetch-mode on any HTTPS request.
With few exceptions, if the UA contains Chrome/ or Firefox/ and the request doesn't have a sec-fetch-mode header, it is almost certainly a crawler.
I've been successfully mitigating pretty much all of them for about a year now (from ~100 million requests/day down to 3 million, the majority of which is served garbage).
💬 1🔄 1⭐ 0
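Editor's sketch of the check described above, assuming request headers arrive as a plain name-to-value mapping. The function name and the sample requests are hypothetical, and this is not how iocaine or any particular tool implements it. As the post says, there are exceptions (for example, older browser versions that predate the header), so treat a match as a strong hint and not as proof.

```python
def looks_like_disguised_crawler(headers):
    """True if the UA claims Chrome/Firefox but no Sec-Fetch-Mode header is sent."""
    # HTTP header names are case-insensitive, so normalise before the lookup.
    normalised = {name.lower(): value for name, value in headers.items()}
    user_agent = normalised.get("user-agent", "")
    claims_browser = "Chrome/" in user_agent or "Firefox/" in user_agent
    return claims_browser and "sec-fetch-mode" not in normalised


# Claims to be Chrome, but sends none of the fetch metadata a browser would.
suspicious = {"User-Agent": "Mozilla/5.0 (X11; Linux x86_64) Chrome/126.0.0.0"}

# Same UA with the header present, as a real browser navigation would send it.
plausible = {
    "User-Agent": "Mozilla/5.0 (X11; Linux x86_64) Chrome/126.0.0.0",
    "Sec-Fetch-Mode": "navigate",
}

# Honest non-browser client: not flagged by this particular check.
honest_tool = {"User-Agent": "curl/8.5.0"}

print(
    looks_like_disguised_crawler(suspicious),
    looks_like_disguised_crawler(plausible),
    looks_like_disguised_crawler(honest_tool),
)
```

What to do with flagged requests (reject, slow down, or serve garbage as mentioned above) is a separate decision.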
b_bb_b@mastodon.roflcopter.fr
Jul 1, 2026, 7:21 PM
@algernon @grumpybozo @alan @jwildeboer Did you try that trick? What tool did you use, and did it work well?
💬 1🔄 0⭐ 0
alalan@lighthouse.co.im
Jul 1, 2026, 8:32 PM
@b_b @algernon @grumpybozo @jwildeboer Haven't hit this scale of problem on my server, so no direct experience with the header trick — but algernon's UA/sec-fetch-mode mismatch approach sounds like a solid low-cost filter if it ever ramps up here.
💬 0🔄 0⭐ 0
zug zug? zig zig? zig zug? who knows?!algernon@come-from.mad-scientist.club
Jul 1, 2026, 8:38 PM
@b_b @grumpybozo @alan @jwildeboer I've been using this trick (+ a few tweaks) for about a year now, with iocaine, with great success.
💬 1🔄 0⭐ 0
Jan Wildeboer 😷:krulorange:jwildeboer@social.wildeboer.net
Jul 1, 2026, 8:39 PM
@algernon A wonderful understatement. Perfect answer :) @b_b @grumpybozo @alan
💬 0🔄 0⭐ 2
Peter Bindelsdascandy@infosec.exchange
Jun 30, 2026, 6:57 AM
@alan @jwildeboer The residential proxy is the *industrialized* scale use of smart TVs to host a proxy for companies to use to redirect requests through, so it's actually most people's regular TVs that are attacking you.
Which also means that if you block any of them, you're cutting off actual users too. Residences have one IP, and most users don't even know they're hosting a proxy for companies to lease.
💬 0🔄 0⭐ 1
Dj PorCus - WillPorCus@hostux.social
Jul 3, 2026, 9:07 AM
@alan @jwildeboer Well, all these techniques need time and resources. I can add that we also have this problem at my new job; my simple fix until we have more decent filtering was to block all of China Mobile. But it (all the rest) takes time :/
💬 0🔄 0⭐ 1
Koen Hufkens, PhDkoen_hufkens@mastodon.social
Jun 29, 2026, 8:20 AM
@jwildeboer I think this should be illegal in the EU if found out. Not entirely sure.
💬 1🔄 1⭐ 8
nxadmnxadm@infosec.exchange
Jun 29, 2026, 8:24 AM
@koen_hufkens @jwildeboer
De facto botnets sound illegal to me. But in today's EU only if the company is Russian or Chinese. US companies do whatever they like.
💬 1🔄 1⭐ 7
DamonHDDamonHD@mastodon.social
Jun 29, 2026, 8:26 AM
@nxadm @koen_hufkens @jwildeboer One set of those 'residential proxies' is apparently compromised 'smart' TVs; another is stuff silently embedded in 'free' mobile games. We all pay the price for those.
💬 2🔄 4⭐ 15
nxadmnxadm@infosec.exchange
Jun 29, 2026, 8:33 AM
@DamonHD @koen_hufkens @jwildeboer
That's alarming.
💬 0🔄 0⭐ 3
Peter Bindelsdascandy@infosec.exchange
Jun 30, 2026, 7:04 AM
@DamonHD @nxadm @koen_hufkens @jwildeboer Not so much compromise as shipped with the device.
💬 0🔄 0⭐ 1
JamoteuszJamoteusz@mastodon.com.pl
Jun 29, 2026, 8:40 AM
@jwildeboer “the great digital theft” known as “knowledge harvesting”
💬 0🔄 2⭐ 4
Andreas Kruthoffkruthoff@mastodon.social
Jun 29, 2026, 12:26 PM
@jwildeboer I can confirm. Even on my small blog I have been seeing over 1200 different IP addresses scraping for a while now.
💬 0🔄 2⭐ 4
galoophgalooph@masto.galooph.com
Jun 29, 2026, 1:30 PM
@jwildeboer We're definitely seeing this at @codeenigma :-(
💬 0🔄 1⭐ 4
Hanno Reinhannorein@mastodon.social
Jun 29, 2026, 1:41 PM
@jwildeboer same here. The documentation of REBOUND https://rebound.hanno-rein.de is getting hammered with scrapers. It was never a problem before because the website is small and only contains static pages. Ridiculous.
💬 0🔄 4⭐ 6
Dźwiedziudzwiedziu@mastodon.social
Jun 29, 2026, 1:43 PM
@jwildeboer
This is going to end up with an invitation-only graynet, where you'll be banned the moment you try trawling.
💬 1🔄 1⭐ 4
Face Thumbchrisp@cyberplace.social
Jun 29, 2026, 10:56 PM
@dzwiedziu @jwildeboer Web 4.0, invite only with lobste.rs style reputation system.
💬 0🔄 0⭐ 1
Tim Ward ⭐🇪🇺🔶 #FBPETimWardCam@c.im
Jun 29, 2026, 1:54 PM
@jwildeboer Why would an AI company need millions of copies of the same data?
💬 2🔄 0⭐ 2
Rupert V/rupert@mastodon.nz
Jun 29, 2026, 2:03 PM
@TimWardCam @jwildeboer Because their trawler is as sloppily coded as everything else they do.
💬 1🔄 0⭐ 15
Cassandrichdalias@hachyderm.io
Jun 29, 2026, 6:39 PM
@rupert @TimWardCam @jwildeboer Exactly. This is *ideological* - they deem actually-engineered solutions that do things efficiently as a backwards "dirty human" way of doing things. Obviously since their AI slop is superior, they should do the scraping in whatever way the AI slop vomits out code to do it.
💬 1🔄 0⭐ 3
Tim Ward ⭐🇪🇺🔶 #FBPETimWardCam@c.im
Jun 29, 2026, 6:42 PM
@dalias @rupert @jwildeboer People were doing this sort of thing before the AI garbage.
One place I worked, several years ago now, there was a performance problem.
"We'd better fix that by tweaking the cloud autoscaling parameters" they said.
FFS. I had a look at the actual code and made it go several times faster.
💬 0🔄 0⭐ 1
Jan Wildeboer 😷:krulorange:jwildeboer@social.wildeboer.net
Jun 29, 2026, 2:13 PM
@TimWardCam It's not necessarily the AI companies themselves. There's a whole new sector of (VC-backed) startups that claim to be able to deliver perfectly clean and curated training data for domain-specific models. And in a weird turn of events, they find out that many crawlers running in big datacenters are now being blocked by many sites they want to scrape. So using the "residential proxy IP" botnets seems to them a good option.
💬 1🔄 0⭐ 8
Harry Woodharry_wood@en.osm.town
Jun 30, 2026, 9:59 AM
@jwildeboer @TimWardCam Yes. This was puzzling me. Surely the big AI providers, OpenAI, Google, etc, wouldn't want to damage their brand by operating scrapers so incompetently.
But no. It's not them. The scraperpocalypse coincides with the arrival of LLMs *partly* because of increased demand for data sets, but partly just because LLM coding enables vast armies of script kiddies to easily develop scrapers that use circumvention tactics.
💬 1🔄 0⭐ 0
Jan Wildeboer 😷:krulorange:jwildeboer@social.wildeboer.net
Jun 30, 2026, 10:49 AM
@harry_wood The scrapers that hammer my server aren't using circumvention tactics and are, in fact, very stupid ones. What makes them hard to block is that they come from all over the world, with unique IP addresses that have no clearly identifiable origin. That's the "residential IP proxy" effect. They do a few requests and disappear again. So rate limiting doesn't catch them. @TimWardCam
💬 0🔄 0⭐ 1
maswanmaswan@mastodon.acc.sunet.se
Jun 29, 2026, 2:09 PM
@jwildeboer Yup. Our free software mirror sometimes takes a minute to respond to an apt update, because there are millions of scraper IPs hitting the entire namespace (we use a fast cache of popular files for performance), saturating the backend storage.
We've tried various blocking and qos approaches, but we have yet to find something that really helps.
Baseline performance for us is 10-40Gbit/s, and we are now down to hardware upgrades in the hopes that real users will have enough left over.
💬 1🔄 2⭐ 10
Jan Wildeboer 😷:krulorange:jwildeboer@social.wildeboer.net
Jun 29, 2026, 2:20 PM
@maswan Ouch :(
💬 0🔄 0⭐ 3
Dclawmgd81@infosec.exchange
Jun 30, 2026, 2:25 AM
@jwildeboer At work, I've simply begun blocking /8's at the firewall.... it's easier and actually causes less collateral damage than one might assume at this point.
💬 0🔄 0⭐ 0
Nicdnicd@masto.ahlcode.fi
Jul 3, 2026, 4:44 AM
@jwildeboer Just this week a repository in my Forgejo instance was under attack. In a day, I racked up over 130k distinct IPs with fail2ban and had to abandon that approach.
I now have a simple trick that cut out practically all of the traffic, but I hesitate to share it as it's not difficult to work around… I wish we didn't have to resort to such things.
💬 0🔄 0⭐ 1