[MacPorts] #71358: source/browser/repos links are 403 forbidden
MacPorts
noreply at macports.org
Fri Nov 22 08:53:23 UTC 2024
#71358: source/browser/repos links are 403 forbidden
-------------------------+------------------------
Reporter: ryandesign | Owner: neverpanic
Type: defect | Status: closed
Priority: Normal | Milestone:
Component: trac | Version:
Resolution: fixed | Keywords:
Port: |
-------------------------+------------------------
Changes (by neverpanic):
* status: accepted => closed
* resolution: => fixed
Comment:
These are all unauthenticated requests, so maybe? We'd still see the
requests from Fastly at least once, though, and a lot of the requests
also contain query parameters. Random examples from the log right now:
{{{
/browser/branches/gsoc15-dependency/base/src/pextlib1.0/tests/filemap.tcl?rev=143448&order=size&desc=1
/browser/trunk/dports/audio/libofa/files/patch-mathutils?rev=76560
/browser/trunk/dports/ruby/rb-gtkglext?rev=107210
}}}
I'm not sure Fastly would cache those, or that Fastly's cache would be
large enough to hold them. On the other hand, Fastly may have better
spam-filtering rules that we could simply enable to get rid of these.
I have now done some more analysis on the user agents that are being used
to send requests to trac, and most of the requests are being sent by AI
crawlers:
{{{
root at braeburn ~ # grep -Po '"[^"]+"$' /var/log/apache2/trac.access.log.1 |
sort | uniq -c | sort -n | tail -n 7
8137 "Mozilla/5.0 (compatible; SemrushBot/7~bl;
+http://www.semrush.com/bot.html)"
8233 "Mozilla/5.0 (X11; Linux i686) AppleWebKit/537.36 (KHTML, like
Gecko) Chrome/30.0.1599.66 Safari/537.36"
8621 "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36
(KHTML, like Gecko) Chrome/114.0.0.0 Safari/537.36 Edg/114.0.1823.43"
10221 "Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; compatible;
bingbot/2.0; +http://www.bing.com/bingbot.htm) Chrome/116.0.1938.76
Safari/537.36"
12553 "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1)
AppleWebKit/600.2.5 (KHTML, like Gecko) Version/8.0.2 Safari/600.2.5
(Amazonbot/0.1; +https://developer.amazon.com/support/amazonbot)"
22750 "Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; compatible;
GPTBot/1.2; +https://openai.com/gptbot)"
52977 "Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; compatible;
ClaudeBot/1.0; +claudebot at anthropic.com)"
root at braeburn ~ # grep -Po '"[^"]+"$' /var/log/apache2/trac.access.log |
sort | uniq -c | sort -n | tail -n 7
1195 "Mozilla/5.0 (X11; Linux i686) AppleWebKit/537.36 (KHTML, like
Gecko) Chrome/30.0.1599.66 Safari/537.36"
1379 "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7)
AppleWebKit/537.36 (KHTML, like Gecko) Chrome/131.0.0.0 Safari/537.36"
2072 "-"
2431 "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1)
AppleWebKit/600.2.5 (KHTML, like Gecko) Version/8.0.2 Safari/600.2.5
(Amazonbot/0.1; +https://developer.amazon.com/support/amazonbot)"
3211 "Mozilla/5.0 (compatible; SemrushBot/7~bl;
+http://www.semrush.com/bot.html)"
4823 "Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; compatible;
bingbot/2.0; +http://www.bing.com/bingbot.htm) Chrome/116.0.1938.76
Safari/537.36"
195556 "Mozilla/5.0 (compatible) Ai2Bot-Dolma
(+https://www.allenai.org/crawl)"
}}}
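The tally above is just a pipeline over the combined-format access log. As a self-contained sketch (the log lines, IPs, and user agents below are made up for illustration, and GNU grep's `-P` PCRE support is assumed):

```shell
#!/bin/sh
# Reproduce the user-agent tally on a synthetic access log. The real
# command ran against /var/log/apache2/trac.access.log*; everything in
# this sample log is invented.
log=$(mktemp)
cat > "$log" <<'EOF'
1.2.3.4 - - [22/Nov/2024:08:00:00 +0000] "GET /browser/a HTTP/1.1" 200 100 "-" "ExampleBot/1.0 (+https://example.org/bot)"
1.2.3.4 - - [22/Nov/2024:08:00:01 +0000] "GET /browser/b HTTP/1.1" 200 100 "-" "ExampleBot/1.0 (+https://example.org/bot)"
5.6.7.8 - - [22/Nov/2024:08:00:02 +0000] "GET /browser/c HTTP/1.1" 200 100 "-" "Mozilla/5.0"
EOF
# -P is a GNU grep extension; the pattern grabs the last quoted field
# on each line, which is the user agent in the combined log format.
grep -Po '"[^"]+"$' "$log" | sort | uniq -c | sort -n | tail -n 7
rm -f "$log"
```

On this sample the busiest agent ends up on the last line, which is what makes `tail -n 7` show the top offenders.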
The top entries are clearly crawlers, and some entries just don't make
sense. For example, do we really believe somebody on i686(!) Linux with
X11 running Chrome 30 (!) legitimately sent 1195 or 8233 requests?
I have added deny rules for anything matching outdated Chrome or Firefox
versions and have outright forbidden access for all the AI crawl bots I
saw:
{{{
<Location />
        # Block very aggressive spiders, spambots, and crawlers that
        # don't identify as bots but cause a lot of load. These Chrome
        # and Firefox versions are all very old.
        <If "%{HTTP_USER_AGENT} =~ m#(Sogou web spider|okhttp|Chrome/(1[01][0-9]|[1-9][0-9])[.]|Firefox/([1-9]|[1-8][0-9])[.])#i">
                Require all denied
        </If>

        # Deny AI crawler bots
        <If "%{HTTP_USER_AGENT} =~ m#(GPTBot|ClaudeBot|OAI-SearchBot|ImagesiftBot|SemanticScholarBot|Ai2Bot-Dolma)#i">
                Require all denied
        </If>
</Location>
}}}
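The browser-version part of that pattern can be sanity-checked outside Apache. A sketch using `grep -Ei` as a stand-in for Apache's `m#...#i` match (the sample user agents are illustrative, not copied from the real log):

```shell
#!/bin/sh
# Check sample user agents against (roughly) the outdated-browser
# pattern from the <If> block: Chrome below 120 and Firefox below 90.
pat='Chrome/(1[01][0-9]|[1-9][0-9])[.]|Firefox/([1-9]|[1-8][0-9])[.]'
check() {
    if printf '%s\n' "$1" | grep -Eiq "$pat"; then
        echo "DENY:  $1"
    else
        echo "ALLOW: $1"
    fi
}
check "Mozilla/5.0 (X11; Linux i686) Chrome/30.0.1599.66 Safari/537.36"  # ancient Chrome: denied
check "Mozilla/5.0 (Macintosh) Chrome/131.0.0.0 Safari/537.36"           # current Chrome: allowed
check "Mozilla/5.0 (X11; Linux x86_64) Firefox/52.0"                     # ancient Firefox: denied
```

The trailing `[.]` matters: it anchors the match to a complete major version, so `Chrome/131.` is not caught by the `[1-9][0-9]` branch.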
I've also checked which IPs cause most HTTP 403s with this configuration
and added the netblocks of the most egregious offenders to iptables. This
currently affects
- 101.47.146.0/24 and 101.47.17.0/24 from Byteplus in Singapore
- 14.155.212.0/24, 14.155.189.0/24 and 14.155.182.0/24 from China Telecom
- 47.76.0.0/14 from Alibaba Cloud
- 4.227.36.0/24 from Microsoft
- 64.124.0.0/17 allocated to Zayo Bandwidth in Denver, CO
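For reference, blocking those ranges amounts to one DROP rule per netblock. A dry-run sketch that only prints the commands (the actual chain layout and rule options on the server may differ):

```shell
#!/bin/sh
# Print one iptables DROP rule per offending netblock from the list
# above. This is a dry run: remove the leading "echo" (and run as
# root) to actually apply them.
for net in 101.47.146.0/24 101.47.17.0/24 \
           14.155.212.0/24 14.155.189.0/24 14.155.182.0/24 \
           47.76.0.0/14 4.227.36.0/24 64.124.0.0/17; do
    echo iptables -A INPUT -s "$net" -j DROP
done
```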
I also sent an abuse complaint to Digital Ocean, but I don't think
filing abuse complaints for each of these will scale.
The good news is that with these changes, our CPU load is lower than it
has been in months, at least until the next AI crawlbot comes along.
I guess we can consider this closed for now, then.
--
Ticket URL: <https://trac.macports.org/ticket/71358#comment:8>
MacPorts <https://www.macports.org/>
Ports system for macOS