there is currently a bot inside MIT IP space, address 18[.]4[.]38[.]176, scanning fedi at large. i have confirmed this with 5+ unrelated instance admins, large and small instances, across mastodon/misskey/pleroma/akkoma.
the bot is poorly behaved. i have observed it making repeated requests, multiple times per second, for the exact same paths (the paths being, generally: user profiles, specific posts, and sometimes following links in posts). returning 403s does not stop this activity. one of my domains received hundreds of additional requests despite replying with 403 to all of them. i have also seen it make requests for paths containing html tags - seems like a badly written parser. the purpose of these requests and what data is being gathered is unclear.
PTR on the ip returns sts-drand03.mit.edu. a quick web search for "mit drand" brings back https://mitsloan.mit.edu/faculty/directory/david-g-rand and his personal website: https://davidrand-cooperation.com/ (note: other IPs in the /24 also have names in the PTR which match up with names of MIT faculty, but only the .176 IP appears to be involved in this activity). seems he's doing research into "misinformation" and "fake news" on social media. he also appears to be on fedi! so @Drand@techhub.social, given this activity is sourced from an IP with your name on it, could you share the purpose of this traffic? what data is being collected and how is it being used? do you plan to respect robots.txt or identify yourself in your useragent? is there a process for instance admins to opt out of this activity other than blocking the source IP?
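for reference, here is a minimal sketch of what the questions above are asking for: checking robots.txt, sending an identifying useragent, and not coming back after a 403. this is not the bot's actual code, which nobody here has seen. the useragent string, the `allowed_by_robots` helper and the `RefusalTracker` class are all hypothetical names made up for illustration.

```python
import urllib.robotparser
from urllib.parse import urlsplit

# hypothetical identifying useragent: names the project and where to read about it
USER_AGENT = "example-research-bot/0.1 (+https://example.org/bot-info)"


def allowed_by_robots(robots_txt: str, url: str, user_agent: str = USER_AGENT) -> bool:
    """return True if the given robots.txt text permits user_agent to fetch url."""
    parser = urllib.robotparser.RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


class RefusalTracker:
    """remembers hosts that answered 403 so the crawler stops asking them."""

    def __init__(self) -> None:
        self.refused_hosts: set[str] = set()

    def record(self, url: str, status: int) -> None:
        # a 403 is treated as "go away" for the whole host (an assumption of this sketch)
        if status == 403:
            self.refused_hosts.add(urlsplit(url).hostname or "")

    def may_request(self, url: str) -> bool:
        return (urlsplit(url).hostname or "") not in self.refused_hosts
```

a crawler built on this would call `allowed_by_robots` and `may_request` before every fetch, and `record` after every response, so a single 403 ends the requests to that instance instead of producing hundreds more.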
@Drand @natalie I can't reply to you because, although you are happy to scrape my instance, you have chosen an instance that blocks me.
I am skeptical that "researching content moderation policies" is an accurate way to describe what you are doing: I have not been asked about my content moderation policies. And how are you going to determine whether a given piece of content is in violation of an instance's policy if you don't even know whether the moderators have seen it? If you are interested in misinformation, you may wish to join or operate an instance that ensures that you can interact with the people you are exploiting for your career.
Can you explain whether or not you told the IRB that you were going to reimburse me for my bandwidth overages? Also, when I am done going through my logs to finish figuring out whose posts you have scraped and I notify them, where should I send them to request the removal of their posts from your dataset without compromising their anonymity?
@natalie Hi all, apologies for this, I didn't realize the bot was being poorly behaved - we've now stopped it. In terms of why we were doing the scraping, we are doing research on how content moderation policies vary across servers, and how this can help inform the Fediverse more broadly about effective approaches to content moderation. You can get more of a sense of the kind of research we do here: https://docs.google.com/document/d/1k2D4zVqkSHB1M9wpXtAe3UzbeE0RPpD_E2UpaPf6Lds/edit?usp=sharing
Sorry again about causing problems for folks! (And thanks to a couple of people for emailing me to let me know about this)
@natalie first access I have from this IP on my public instance is [24/Nov/2023:19:39:06 +0100] "GET /api/v1/instance HTTP/1.1", repeated 3 times at ~4 minute intervals. it then seems to have loaded some js/css/fonts on the 25th, scraped the public timeline multiple times on the 28th, and started scraping users on the 29th. more or less the same behavior on this instance
for those who have checked logs on their instances, could you share the dates when the activity started? on this instance, the first request i have is from 2023/11/29, steadily ramping up since then
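if it helps anyone checking, here is a rough sketch for pulling the first-seen time, total request count and peak requests per second for this IP out of an access log. it assumes the log lines start the way nginx's default "combined" format does (client address, two fields, then a `[dd/Mon/yyyy:HH:MM:SS +zzzz]` timestamp), so adjust the regex if yours differs. `summarize` and the log path in the usage note are made-up names.

```python
import re
from collections import Counter
from datetime import datetime

# matches the start of a combined-format line: ip, two fields, [timestamp], "request", status
LOG_LINE = re.compile(r'^(\S+) \S+ \S+ \[([^\]]+)\] "([^"]*)" (\d{3})')

# the address from the top of the thread, un-defanged
SCANNER_IP = "18[.]4[.]38[.]176".replace("[.]", ".")


def summarize(lines, ip=SCANNER_IP):
    """return first-seen time, total requests and peak requests per second for ip."""
    first_seen = None
    per_second = Counter()
    total = 0
    for line in lines:
        match = LOG_LINE.match(line)
        if not match or match.group(1) != ip:
            continue
        seen_at = datetime.strptime(match.group(2), "%d/%b/%Y:%H:%M:%S %z")
        if first_seen is None or seen_at < first_seen:
            first_seen = seen_at
        per_second[seen_at] += 1  # timestamps have one-second resolution
        total += 1
    peak = max(per_second.values(), default=0)
    return {"first_seen": first_seen, "total": total, "peak_per_second": peak}


# usage, assuming the log lives at the usual nginx location:
# with open("/var/log/nginx/access.log") as log:
#     print(summarize(log))
```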
@natalie earliest one for me is 02/Dec/2023:18:45:36, and it seemed to have stopped on the 3rd after 3344 requests to my profile and the profile of the only other active user on my instance. the user agent was "Python-urllib/3.9". Actually, now it seems like they're trying to connect via IPv6, which I have explicitly disabled, so nginx is giving them a connection refused error. clearly that does not stop them, because the last request was about 2 hours ago
@Moon @natalie @Drand @coolboymew Same here. Incidentally, he has apparently screwed up some of his links, so his bot is fetching URLs like "GET /signin%3C/a%3E%3Cbr/%3E%3Cbr/%3EYang HTTP/1.1", "GET /about%3C/a%3E%3Cbr/%3ELooks HTTP/1.1", "GET /notice/ACV7evq9u0cd7bBo1Y%3C/a%3E".
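Percent-decoding those paths shows what is going on: each one is a real path with a closing anchor tag, some line breaks and the next word of the post glued onto the end. A small sketch using the three paths quoted above; `decode_paths` and `intended_path` are names made up here, and cutting at the first `<` is only a guess at what the link was meant to be.

```python
from urllib.parse import unquote

# the three request paths quoted above, exactly as they appear in the logs
PATHS = [
    "/signin%3C/a%3E%3Cbr/%3E%3Cbr/%3EYang",
    "/about%3C/a%3E%3Cbr/%3ELooks",
    "/notice/ACV7evq9u0cd7bBo1Y%3C/a%3E",
]


def decode_paths(paths):
    """percent-decode each path so the embedded HTML becomes visible."""
    return [unquote(path) for path in paths]


def intended_path(path):
    """guess the link the bot meant to follow: everything before the first '<'."""
    return unquote(path).split("<", 1)[0]


for raw in PATHS:
    print(unquote(raw), "->", intended_path(raw))
```

The decoded forms are `/signin</a><br/><br/>Yang`, `/about</a><br/>Looks` and `/notice/ACV7evq9u0cd7bBo1Y</a>`, which is what you would expect if links were being cut out of post HTML by splitting on whitespace rather than by parsing the markup. That cause is an inference from the paths, not something confirmed by anyone running the bot.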
Incidentally, he was scraping some Spinster user's profile from FSE, @Piss_Ant.
aaronsw killed himself because he bulk-scraped some academic papers from MIT and MIT had him arrested and the federal prosecutor wanted 35 years for exceeding authorized access under the CFAA. So, now, MIT faculty member David Rand is exceeding authorized access in order to produce more academic papers for MIT. I wonder: if he produces a paper by scraping and then someone scrapes that paper, what happens?
> seems he's doing research into "misinformation" and "fake news" on social media.
He's a Professor of Marketing in the Brain and Cognitive Sciences Department at Sloan. It is safe to reason that (1) he is a psychopath, (2) he has no idea how any of the technical shit works and has a grad student doing it, the grad student in turn having hired an engineer undergrad.
> is there a process for instance admins to opt out of this activity
I guess, per his faculty page, you could just call him, or perhaps email him. Despite scraping FSE, he has chosen an instance that blocks FSE.