author | Oscar Najera <hi@oscarnajera.com> | 2025-05-02 15:32:36 +0200 |
commit | 31ec4b3614a693d9aefe2d1308debeda77932221 (patch) | |
tree | 9d0f6cc7ae8b568def37e28f71ddb9bcd2b248eb /webstats | |
parent | 4bcd63ba807562e5a3abfd3625656c883141d8ea (diff) | |
finish post
Diffstat (limited to 'webstats')
-rw-r--r-- | webstats/workforrobots.org | 533 |
1 files changed, 248 insertions, 285 deletions
diff --git a/webstats/workforrobots.org b/webstats/workforrobots.org index ba1fa0f..08f124b 100644 --- a/webstats/workforrobots.org +++ b/webstats/workforrobots.org @@ -4,16 +4,14 @@ #+HTML_HEAD: <script src="http://127.0.0.1:8095/skewer"></script> #+HTML_HEAD: <link rel="stylesheet" href="static/uPlot.min.css" /> #+HTML_HEAD_EXTRA: <script src="static/uPlot.iife.min.js"></script> -#+HTML_HEAD_EXTRA: <script src="plots.js"></script> -#+HTML_HEAD_EXTRA: <script src="/skewer"></script> - I self-host some of my git repositories to keep sovereignty and independence from large Internet corporations. Public facing repositories are for everybody, -and today that means for robots. With the =AI-hype=, I wanted to have a look at -what are those AI companies taking. It is worse than everything, it is -idiotically everything. They can't recognize, that they are parsing git -repositories and use the appropriate way of downloading them. +and today that means for robots. Robots are the main consumers of my work. With +the =AI-hype=, I wanted to have a look at what those AI companies are collecting +from my work. It is worse than everything, it is idiotically everything. They +can't recognize that they are parsing git repositories and use the appropriate +way of downloading them. #+begin_src sqlite :exports none SELECT @@ -151,7 +149,9 @@ engine, but rather one to test its =Chrome= browser and how it renders pages. =Rest= is all the remaining /robots/ or /users/. They have consumed around =≈400MiB=, placing them in aggregate in a behavior like =Macintosh, Scrapy, -PetalBot & AhrefsBot=. Mostly likely is hacker bots proving the site. +PetalBot & AhrefsBot=. Most often it is hacker bots probing the site, which also +means that =~400MiB= is what you need to crawl the site. AI crawlers siphoning +*10X* that amount is abusive. * How should they visit? =CGit= is a web interface for =git= repositories. You can browse some of my @@ -165,9 +165,8 @@ scraping everything. 
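For the record, the appropriate way for robots to consume a git repository is a single clone followed by incremental fetches. A minimal sketch, demonstrated against a throwaway local repository (all paths and the commit message are illustrative):

```shell
# Polite consumption of a git repository: one full clone, then cheap
# incremental fetches, instead of crawling every generated HTML page.
# Throwaway local repository; paths are illustrative.
git init -q /tmp/demo-origin
git -C /tmp/demo-origin -c user.name=demo -c user.email=demo@example.org \
    commit -q --allow-empty -m "initial commit"
git clone -q /tmp/demo-origin /tmp/demo-clone   # one-time full copy
git -C /tmp/demo-clone fetch -q                 # later runs transfer only new objects
```

Against a CGit instance the same applies with the repository's clone URL; every subsequent =git fetch= transfers only the objects that changed since the last visit.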
How have the good citizens behaved? That is on table [[git-users]]. The =Software Heritage= keeps mirrors of git repositories. It thus watches for updates and then -downloads. There are other people besides them that downloaded, but in total, in -the same time they only downloaded =≈21MiB=. That is =0.3%= compared to -=ClaudeBot=. +downloads. There are other people besides them that downloaded, but in total +they only downloaded =≈21MiB=. That is =0.3%= compared to =ClaudeBot=. #+begin_src sqlite :exports results SELECT @@ -206,14 +205,14 @@ order by total(length) desc The web front end of git repositories of course, but is there a pattern? Table [[status-codes]] shows the status codes of all requests performed by the users. -The failure rate of =OpenAI= is alarming. It from its =3.5 million= requests -=15%= are client errors, mostly not found pages of error =404=. What is their -scraper doing so wrong? =ClaudeBot= as noted earlier, manages to scrape with -half the requests and a failure rate of =1.6%=. +The failure rate of =OpenAI= is alarming. From its =3.5 million= requests, =15%= +are client errors, mostly =404= not found, and they consume =≈2GiB= of +bandwidth. What is their scraper doing so wrong? =ClaudeBot=, as noted earlier, +manages to scrape with half the requests and an error rate of =1.6%=. -=Everybody else= are the =Rest=, all users that can't be grouped into the major -users. Yet they have a failure rate of =25%=, but that is normal as they are -mostly hacker robots scanning for vulnerabilities. +=Everybody else= are all the remaining users. They do have an error rate of +=25%=, but that is normal as they are generally hacker robots scanning for +vulnerabilities. You are always under attack on the internet. 
#+begin_src sqlite :exports results SELECT @@ -244,8 +243,9 @@ SELECT count(*) FILTER (WHERE status BETWEEN 200 AND 299) AS "2XX", count(*) FILTER (WHERE status BETWEEN 300 AND 399) AS "3XX", count(*) FILTER (WHERE status BETWEEN 400 AND 499) AS "4XX", - --(100.0 * count(*) FILTER (WHERE status BETWEEN 400 AND 499)) / count(*) AS "fails 4XX", - count(*) FILTER (WHERE status BETWEEN 500 AND 599) AS "5XX" + count(*) FILTER (WHERE status BETWEEN 500 AND 599) AS "5XX", + round((total(length) FILTER (WHERE status BETWEEN 400 AND 499))/1024/1024, 2) AS "4XX MiB", + round((100.0 * count(*) FILTER (WHERE status BETWEEN 400 AND 499)) / count(*), 2) AS "4XX %" FROM logs GROUP BY @@ -256,38 +256,39 @@ ORDER BY #+name: status-codes #+caption: HTTP status codes per user agent -#+RESULTS[4939ff0e53fd1b72a7f39a19a3a50483893b07d5]: -| Agent | 2XX | 3XX | 4XX | fails 4XX | 5XX | -|----------------+---------+-----+--------+--------------------+-----| -| OpenAI-GPTBot | 3017848 | 0 | 554511 | 15.521738400215 | 121 | -| Everybody else | 99066 | 467 | 34630 | 25.8091923354971 | 14 | -| ClaudeBot | 1591179 | 26 | 25611 | 1.58360240950446 | 446 | -| Barkrowler | 272343 | 0 | 1618 | 0.590579921742685 | 7 | -| Macintosh | 79071 | 2 | 1086 | 1.3548073204506 | 0 | -| Bytespider | 13609 | 0 | 531 | 3.75477301654646 | 2 | -| PetalBot | 69223 | 0 | 473 | 0.678651878847009 | 1 | -| Scrapy | 207240 | 0 | 348 | 0.167492094661911 | 183 | -| AhrefsBot | 59733 | 0 | 90 | 0.150421179302046 | 9 | -| Google | 3576 | 0 | 2 | 0.0558971492453885 | 0 | -| SeekportBot | 2500 | 0 | 0 | 0.0 | 0 | +#+RESULTS[b4402559f97ad9a4f1ec20091284651af575ffeb]: +| Agent | 2XX | 3XX | 4XX | 5XX | 4XX MiB | 4XX % | +|----------------+---------+-----+--------+-----+---------+-------| +| / | < | | | | < | | +| OpenAI-GPTBot | 3017848 | 0 | 554511 | 121 | 2060.23 | 15.52 | +| Everybody else | 99066 | 467 | 34630 | 14 | 101.96 | 25.81 | +| ClaudeBot | 1591179 | 26 | 25611 | 446 | 162.67 | 1.58 | +| Barkrowler | 272343 | 0 | 
1618 | 7 | 5.35 | 0.59 | +| Macintosh | 79071 | 2 | 1086 | 0 | 7.87 | 1.35 | +| Bytespider | 13609 | 0 | 531 | 2 | 3.94 | 3.75 | +| PetalBot | 69223 | 0 | 473 | 1 | 3.2 | 0.68 | +| Scrapy | 207240 | 0 | 348 | 183 | 1.14 | 0.17 | +| AhrefsBot | 59733 | 0 | 90 | 9 | 0.61 | 0.15 | +| Google | 3576 | 0 | 2 | 0 | 0.02 | 0.06 | +| SeekportBot | 2500 | 0 | 0 | 0 | 0.0 | 0.0 | Let's have a look at the most common not found pages. Listed in table [[fail-pages]] are each of the page paths, how much bandwidth (=tx=) each consumed, and then the -request per bots. With one exception, all pages are placeholder from the -template engine. The repository =hugo-minimalist-theme= is a [[https://gohugo.io][Hugo]] theme. Within -the curly braces ={{ }}= the rendering engine replaces values. Certainly the -html parser reads them raw an from the link =a= tag and requests the page. +requests per bot. With one exception, all pages are placeholder links used in +website theme templates. The repository =hugo-minimalist-theme= is a [[https://gohugo.io][Hugo]] theme. +Within the curly braces ={{ }}= the rendering engine replaces values. Certainly +the html parser reads them raw from the link =a= tag and requests the page. =ClaudeBot= seems to track error pages and not query them again. =OpenAI= is incapable of doing that, and stubbornly tries over and over. If you grep for the string /href="{{ .RelPermalink }}"/ over the entire git -history, you find it appears up to today =954= times. It is surprising how -=OpenAI= manage to request it =3= times that amount. +history of that repository, you find it appears up to today =954= times. It is +surprising and annoying how =OpenAI= manages to request it triple that amount. 
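One way to reproduce such a count over a repository's full history is to grep every revision and sum the hits. A hedged sketch on a throwaway repository (the =954= figure is the post's, not re-verified here):

```shell
# Count occurrences of a fixed string across every revision of a
# repository's history: grep each revision, then sum the per-file hits.
# Demonstrated on a throwaway repository with a single commit.
git init -q /tmp/grep-demo && cd /tmp/grep-demo
echo 'href="{{ .RelPermalink }}"' > layout.html
git add layout.html
git -c user.name=demo -c user.email=demo@example.org commit -qm "add placeholder"
git rev-list --all |
  while read -r rev; do
    # output format is <rev>:<path>:<count>
    git grep -F -c 'href="{{ .RelPermalink }}"' "$rev"
  done |
  awk -F: '{ total += $NF } END { print total }'   # prints 1 for this demo
```

Run inside a clone of the theme repository, the same pipeline totals the placeholder string over all revisions.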
#+begin_src sqlite :exports results SELECT - replace(path, '|', '%7C') AS Page, - round(total(length) / 1024 / 1024, 2) AS "tx MiB", + replace(replace(replace(replace(path, '%7B', '{'), '%7D', '}'), '|', '\vert'), '%20', ' ') AS Page, + round(total (length) / 1024 / 1024, 2) AS "tx MiB", count(*) FILTER (WHERE agentid = 143) AS "OpenAI", count(*) FILTER (WHERE agentid = 1) AS "ClaudeBot", count(*) FILTER (WHERE agentid NOT IN (1, 143)) AS "Rest" @@ -306,258 +307,220 @@ LIMIT 10 #+name: fail-pages #+caption: Top 10: =404= error not found pages. -#+attr_html: :class fail-pages -#+RESULTS[c7d6ef8ed6eac11d94e7e007c99399963237da56]: -| Page | tx MiB | OpenAI | ClaudeBot | Rest | -|--------------------------------------------------------------------------------------+--------+--------+-----------+------| -| /hugo-minimalist-theme/plain/layouts/partials/%7B%7B%20.RelPermalink%20%7D%7D | 8.36 | 2805 | 3 | 7 | -| /hugo-minimalist-theme/plain/layouts/partials/%7B%7B%20.URL%20%7D%7D | 5.39 | 1629 | 1 | 13 | -| /hugo-minimalist-theme/plain/layouts/partials/%7B%7B%20.%20%7D%7D | 4.82 | 1559 | 1 | 4 | -| /hugo-minimalist-theme/plain/layouts/partials/%7B%7B%20$href%20%7D%7D | 4.28 | 1209 | 4 | 5 | -| /.env | 3.84 | 0 | 0 | 744 | -| /hugo-minimalist-theme/plain/layouts/partials/%7B%7B%20.Permalink%20%7D%7D | 3.75 | 1060 | 2 | 15 | -| /hugo-minimalist-theme/plain/layouts/partials/%7B%7B%20$pag.Next.URL%20%7D%7D | 3.36 | 916 | 1 | 7 | -| /hugo-minimalist-theme/plain/layouts/partials/%7B%7B%20$pag.Prev.URL%20%7D%7D | 3.34 | 912 | 0 | 7 | -| /hugo-minimalist-theme/plain/layouts/partials/%7B%7B%20if%20ne%20.MediaType.SubType | 2.95 | 817 | 1 | 0 | -| /hugo-minimalist-theme/plain/layouts/taxonomy/%7B%7B%20.Name%20%7C%20urlize%20%7D%7D | 2.86 | 745 | 5 | 0 | +#+RESULTS[19605bfdef59599b47ed8f0e4b3bce71daaca7d3]: +| Page | tx MiB | OpenAI | ClaudeBot | Rest | +|---------------------------------------------------------------------------+--------+--------+-----------+------| +| 
/hugo-minimalist-theme/plain/layouts/partials/{{ .RelPermalink }} | 8.36 | 2805 | 3 | 7 | +| /hugo-minimalist-theme/plain/layouts/partials/{{ .URL }} | 5.39 | 1629 | 1 | 13 | +| /hugo-minimalist-theme/plain/layouts/partials/{{ . }} | 4.82 | 1559 | 1 | 4 | +| /hugo-minimalist-theme/plain/layouts/partials/{{ $href }} | 4.28 | 1209 | 4 | 5 | +| /.env | 3.84 | 0 | 0 | 744 | +| /hugo-minimalist-theme/plain/layouts/partials/{{ .Permalink }} | 3.75 | 1060 | 2 | 15 | +| /hugo-minimalist-theme/plain/layouts/partials/{{ $pag.Next.URL }} | 3.36 | 916 | 1 | 7 | +| /hugo-minimalist-theme/plain/layouts/partials/{{ $pag.Prev.URL }} | 3.34 | 912 | 0 | 7 | +| /hugo-minimalist-theme/plain/layouts/partials/{{ if ne .MediaType.SubType | 2.95 | 817 | 1 | 0 | +| /hugo-minimalist-theme/plain/layouts/taxonomy/{{ .Name \vert urlize }} | 2.86 | 745 | 5 | 0 | + +What about the hackers? Table [[hacker-attacks]] excludes the AI bots to look at +the attack surface. The top entry is requests to the main site producing a =400 +Bad Request=. Next come attempts to steal environment secrets from the =.env= +file or the git configuration. Then the most common type of attack aims to +exploit the remote code execution in =PHPUnit= by looking for the file +=eval-stdin=. 
#+begin_src sqlite :exports results SELECT - round(total(length) / 1024 / 1024, 2) AS "tx MiB", - count(*), - count(distinct agentid) agents, - replace(path, '|', '%7C') AS path - --substr(path, 0, 50) -FROM - logs -WHERE - path NOT LIKE '/ingrid/%' - AND status = 404 - and agentid NOT IN (1, 143) -GROUP BY - path -ORDER BY - 1 DESC -LIMIT 20 -#+end_src - -#+RESULTS[5d7adfe6034ad15603705e1501ba90fadc0a8f1c]: -| tx MiB | count(*) | agents | path | -|--------+----------+--------+----------------------------------------------------------------------------------------------------------| -| 3.84 | 744 | 368 | /.env | -| 2.26 | 409 | 1 | /cgi-bin/luci/;stok=/locale | -| 2.02 | 381 | 182 | /.git/config | -| 1.08 | 195 | 1 | /actuator/gateway/routes | -| 0.88 | 173 | 2 | /hello.world?%ADd+allow_url_include%3d1+%ADd+auto_prepend_file%3dphp://input | -| 0.65 | 119 | 6 | /sitemap.xml | -| 0.57 | 222 | 12 | /vendor/phpunit/phpunit/src/Util/PHP/eval-stdin.php | -| 0.55 | 72 | 10 | /dotfiles/tree/config/doom/snippets/org-mode/daily?h=semgrep&id=0b30f0bf0d1697504e05630b387b8e32302c3f7c | -| 0.54 | 71 | 10 | /dotfiles/tree/config/doom/snippets/org-mode/daily?h=semgrep&id=5b57ec9f30c80a391a1f7ac5c3a69a91658e2c53 | -| 0.47 | 102 | 10 | /config.json | -| 0.45 | 119 | 15 | /api/.env | -| 0.42 | 103 | 55 | /_profiler/phpinfo | -| 0.39 | 87 | 10 | /.env.prod | -| 0.37 | 93 | 11 | /.env.save | -| 0.37 | 92 | 9 | /.env.production | -| 0.35 | 64 | 3 | /src/.git/config | -| 0.35 | 63 | 1 | /server/.git/config | -| 0.35 | 63 | 2 | /media/.git/config | -| 0.35 | 63 | 3 | /cms/.git/config | -| 0.35 | 68 | 1 | /95.111.247.99/.env | - - -#+begin_src sqlite :results value file :file fails.csv -SELECT - count(*) as hits, - sum(length), - path - --substr(path, 0, 50) + round(total (length) / 1024 / 1024, 2) AS "Tx MiB", + count(*) Requests, + count(DISTINCT agentid) Agents, + count(DISTINCT ipid) IPs, + group_concat (DISTINCT status) AS "Errors", + group_concat (DISTINCT request_method) AS 
"Methods", + replace(path, '_', '\under{}') AS path FROM logs WHERE path NOT LIKE '/ingrid/%' - AND status = 404 + AND status >= 300 + AND agentid NOT IN (1, 143) GROUP BY path + --, status ORDER BY - hits desc, path -limit 50 - -#+end_src - -#+RESULTS[2972512bc0b63468a3eba78fde5efcce703e5d56]: -[[file:fails.csv]] - -#+begin_src sqlite -select count(*), sum(length), path,user_agent from logs -join agent on agentid=id -WHERE - path NOT LIKE '/ingrid/%' - AND status = 404 -group by path, agentid -ORDER BY - sum(count(*)) over (partition by path) desc, 1 desc, path, agentid -limit 50 -#+end_src - -#+RESULTS[cb18bd88215d0873626e54f3a56ea4b6851c4138]: -| count(*) | sum(length) | path | user_agent | | -|----------+-------------+-------------------------------------------------------------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------+---------------------------------------------------------------------------------------------------------| -| 2805 | 8732358 | /hugo-minimalist-theme/plain/layouts/partials/%7B%7B%20.RelPermalink%20%7D%7D | Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; compatible; GPTBot/1.2; +https://openai.com/gptbot) | | -| 7 | 18923 | /hugo-minimalist-theme/plain/layouts/partials/%7B%7B%20.RelPermalink%20%7D%7D | Mozilla/5.0 (compatible; ImagesiftBot; +imagesift.com) | | -| 3 | 16958 | /hugo-minimalist-theme/plain/layouts/partials/%7B%7B%20.RelPermalink%20%7D%7D | Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; compatible; ClaudeBot/1.0; +claudebot@anthropic.com) | | -| 1629 | 5597913 | /hugo-minimalist-theme/plain/layouts/partials/%7B%7B%20.URL%20%7D%7D | Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; compatible; GPTBot/1.2; +https://openai.com/gptbot) | | -| 7 | 25024 | /hugo-minimalist-theme/plain/layouts/partials/%7B%7B%20.URL%20%7D%7D | Mozilla/5.0 (compatible; ImagesiftBot; 
+imagesift.com) | | -| 3 | 8467 | /hugo-minimalist-theme/plain/layouts/partials/%7B%7B%20.URL%20%7D%7D | AliyunSecBot/Aliyun (AliyunSecBot@service.alibaba.com) | | -| 1 | 2655 | /hugo-minimalist-theme/plain/layouts/partials/%7B%7B%20.URL%20%7D%7D | Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; compatible; ClaudeBot/1.0; +claudebot@anthropic.com) | | -| 1 | 6609 | /hugo-minimalist-theme/plain/layouts/partials/%7B%7B%20.URL%20%7D%7D | Mozilla/5.0 (Macintosh; PPC Mac OS X 10_7_9 rv:3.0; mt-MT) AppleWebKit/532.12.1 (KHTML, like Gecko) Version/5.0.3 Safari/532.12.1 | | -| 1 | 6609 | /hugo-minimalist-theme/plain/layouts/partials/%7B%7B%20.URL%20%7D%7D | Mozilla/5.0 (X11; Linux i686; rv:1.9.6.20) Gecko/8855-02-04 23:24:49.455231 Firefox/3.6.10 | | -| 1 | 6543 | /hugo-minimalist-theme/plain/layouts/partials/%7B%7B%20.URL%20%7D%7D | Opera/9.30.(X11; Linux x86_64; tn-ZA) Presto/2.9.175 Version/11.00 | | -| 1559 | 5018667 | /hugo-minimalist-theme/plain/layouts/partials/%7B%7B%20.%20%7D%7D | Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; compatible; GPTBot/1.2; +https://openai.com/gptbot) | | -| 4 | 23869 | /hugo-minimalist-theme/plain/layouts/partials/%7B%7B%20.%20%7D%7D | Mozilla/5.0 (compatible; ImagesiftBot; +imagesift.com) | | -| 1 | 7021 | /hugo-minimalist-theme/plain/layouts/partials/%7B%7B%20.%20%7D%7D | Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; compatible; ClaudeBot/1.0; +claudebot@anthropic.com) | | -| 1209 | 4437078 | /hugo-minimalist-theme/plain/layouts/partials/%7B%7B%20$href%20%7D%7D | Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; compatible; GPTBot/1.2; +https://openai.com/gptbot) | | -| 5 | 27298 | /hugo-minimalist-theme/plain/layouts/partials/%7B%7B%20$href%20%7D%7D | Mozilla/5.0 (compatible; ImagesiftBot; +imagesift.com) | | -| 4 | 23683 | /hugo-minimalist-theme/plain/layouts/partials/%7B%7B%20$href%20%7D%7D | Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; compatible; ClaudeBot/1.0; +claudebot@anthropic.com) | | -| 1060 | 3867072 | 
/hugo-minimalist-theme/plain/layouts/partials/%7B%7B%20.Permalink%20%7D%7D | Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; compatible; GPTBot/1.2; +https://openai.com/gptbot) | | -| 12 | 41170 | /hugo-minimalist-theme/plain/layouts/partials/%7B%7B%20.Permalink%20%7D%7D | Mozilla/5.0 (compatible; ImagesiftBot; +imagesift.com) | | -| 3 | 8087 | /hugo-minimalist-theme/plain/layouts/partials/%7B%7B%20.Permalink%20%7D%7D | AliyunSecBot/Aliyun (AliyunSecBot@service.alibaba.com) | | -| 2 | 14042 | /hugo-minimalist-theme/plain/layouts/partials/%7B%7B%20.Permalink%20%7D%7D | Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; compatible; ClaudeBot/1.0; +claudebot@anthropic.com) | | -| 916 | 3490491 | /hugo-minimalist-theme/plain/layouts/partials/%7B%7B%20$pag.Next.URL%20%7D%7D | Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; compatible; GPTBot/1.2; +https://openai.com/gptbot) | | -| 7 | 29355 | /hugo-minimalist-theme/plain/layouts/partials/%7B%7B%20$pag.Next.URL%20%7D%7D | Mozilla/5.0 (compatible; ImagesiftBot; +imagesift.com) | | -| 1 | 2655 | /hugo-minimalist-theme/plain/layouts/partials/%7B%7B%20$pag.Next.URL%20%7D%7D | Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; compatible; ClaudeBot/1.0; +claudebot@anthropic.com) | | -| 912 | 3471446 | /hugo-minimalist-theme/plain/layouts/partials/%7B%7B%20$pag.Prev.URL%20%7D%7D | Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; compatible; GPTBot/1.2; +https://openai.com/gptbot) | | -| 7 | 29395 | /hugo-minimalist-theme/plain/layouts/partials/%7B%7B%20$pag.Prev.URL%20%7D%7D | Mozilla/5.0 (compatible; ImagesiftBot; +imagesift.com) | | -| 817 | 3081643 | /hugo-minimalist-theme/plain/layouts/partials/%7B%7B%20if%20ne%20.MediaType.SubType | Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; compatible; GPTBot/1.2; +https://openai.com/gptbot) | | -| 1 | 6981 | /hugo-minimalist-theme/plain/layouts/partials/%7B%7B%20if%20ne%20.MediaType.SubType | Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; compatible; ClaudeBot/1.0; 
+claudebot@anthropic.com) | | -| 798 | 2487903 | /hugo-minimalist-theme/plain/layouts/%7B%7B%20.Type%20%7D%7D | Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; compatible; GPTBot/1.2; +https://openai.com/gptbot) | | -| 5 | 26298 | /hugo-minimalist-theme/plain/layouts/%7B%7B%20.Type%20%7D%7D | Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; compatible; ClaudeBot/1.0; +claudebot@anthropic.com) | | -| 4 | 16227 | /hugo-minimalist-theme/plain/layouts/%7B%7B%20.Type%20%7D%7D | Mozilla/5.0 (compatible; ImagesiftBot; +imagesift.com) | | -| 745 | 2977378 | /hugo-minimalist-theme/plain/layouts/taxonomy/%7B%7B%20.Name%20 | %20urlize%20%7D%7D | Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; compatible; GPTBot/1.2; +https://openai.com/gptbot) | -| 5 | 22642 | /hugo-minimalist-theme/plain/layouts/taxonomy/%7B%7B%20.Name%20 | %20urlize%20%7D%7D | Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; compatible; ClaudeBot/1.0; +claudebot@anthropic.com) | -| 65 | 348406 | /.env | Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:77.0) Gecko/20100101 Firefox/77.0 | | -| 63 | 310359 | /.env | Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/81.0.4044.129 Safari/537.36 | | -| 58 | 310754 | /.env | Mozilla/5.0 (Linux; U; Android 4.4.2; en-US; HM NOTE 1W Build/KOT49H) AppleWebKit/534.30 (KHTML, like Gecko) Version/4.0 UCBrowser/11.0.5.850 U3/0.8.0 Mobile Safari/534.30 | | -| 52 | 301380 | /.env | l9explore/1.2.2 | | -| 22 | 111334 | /.env | Go-http-client/1.1 | | -| 18 | 38352 | /.env | - | | -| 16 | 85808 | /.env | python-requests/2.32.3 | | -| 16 | 36940 | /.env | Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36 | | -| 9 | 52245 | /.env | Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:103.0) Gecko/20100101 Firefox/103.0 abuse.xmco.fr | | -| 9 | 48190 | /.env | Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10_6_8; en-us) AppleWebKit/534.50 (KHTML, like Gecko) Version/5.1 Safari/534.50 | 
| -| 8 | 42840 | /.env | python-requests/2.26.0 | | -| 6 | 32086 | /.env | python-requests/2.31.0 | | -| 5 | 26773 | /.env | Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.3 | | -| 3 | 8719 | /.env | Mozilla/5.0 (iPhone; CPU iPhone OS 17_5_1 like Mac OS X) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/17.4.1 Mobile/15E148 Safari/604.1 | | -| 3 | 16043 | /.env | python-httpx/0.28.1 | | -| 3 | 17375 | /.env | Mozilla/5.0 (Linux; Android 9; Redmi Note 7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/76.0.3809.89 Mobile Safari/537.36 | | -| 3 | 17375 | /.env | Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36 SE 2.X MetaSr 1.0 | | -| 3 | 16139 | /.env | python-requests/2.27.1 | | - -#+begin_src sqlite :results value file :file leadfail.csv -SELECT - datetime (date, 'unixepoch'), - status, - count(*) as hits, - round(sum(length) / 1024, 2) as KB, - user_agent -FROM - logs - JOIN agent ON agent.id = logs.agentid -WHERE - path = '/hugo-minimalist-theme/plain/layouts/partials/%7B%7B%20.RelPermalink%20%7D%7D' -group by agentid -ORDER BY - referrer -#+end_src - -#+RESULTS[e3501152f1ec3e7ff822a6713736ff35fb0dcb46]: -[[file:leadfail.csv]] - -#+begin_src sqlite -select - datetime (date, 'unixepoch'), - agentid, - count(*), - path -from logs -where path like '/hello.world?%' -group by agentid -#+end_src - -#+RESULTS[2f837a0274a09a9c00545946c8992d57286667ff]: -| datetime (date, 'unixepoch') | agentid | count(*) | path | -|------------------------------+---------+----------+------------------------------------------------------------------------------| -| 2025-04-12 23:49:39 | 12 | 167 | /hello.world?%ADd+allow_url_include%3d1+%ADd+auto_prepend_file%3dphp://input | -| 2025-02-08 23:48:46 | 238 | 6 | /hello.world?%ADd+allow_url_include%3d1+%ADd+auto_prepend_file%3dphp://input | - -#+begin_src sqlite -select * from agent where id in (12, 238) -#+end_src - 
-#+RESULTS[6d8a93a2351c1668ba5a7b975fdbbc6271cace36]: -| id | user_agent | -|-----+-----------------------------------------------------------------------| -| 12 | Custom-AsyncHttpClient | -| 238 | Mozilla/5.0 (Linux; Linux x86_64; en-US) Gecko/20100101 Firefox/122.0 | - -#+begin_src sqlite -SELECT - round(total (length) / 1024, 2) "tx KiB", - count(*) hits, - agentid, - user_agent -FROM - logs - JOIN agent ON agentid = id - --where agentid in ( - --select id from agent -WHERE - user_agent LIKE '%google%' - --) -GROUP BY - id -ORDER BY - 1 DESC -#+end_src - - -#+begin_src sqlite -SELECT - count(*), - count(distinct ipid), - count(distinct agentid), - total (length), - request_method, - -- status, - substr(path, 0, 50) -FROM - logs -WHERE - path NOT LIKE '/ingrid/%' AND - status = 200 -GROUP BY - path -ORDER BY - 1 DESC -LIMIT 15 + --sum(count(*)) OVER (PARTITION BY path) DESC, + Requests DESC +LIMIT 10 #+end_src -#+RESULTS[8469c661009196d66d76364873866923b89c8205]: -| count(*) | count(distinct ipid) | count(distinct agentid) | total (length) | request_method | substr(path, 0, 50) | -|----------+----------------------+-------------------------+----------------+----------------+---------------------------------| -| 3540 | 1779 | 425 | 22903047.0 | GET | / | -| 2171 | 1563 | 48 | 9744532.0 | GET | /robots.txt | -| 855 | 518 | 50 | 4234481.0 | GET | /favicon.ico | -| 334 | 262 | 31 | 1304042.0 | GET | /cgit.css | -| 150 | 116 | 59 | 1030405.0 | GET | /hugo-minimalist-theme/about/ | -| 144 | 1 | 1 | 854165.0 | GET | /?XDEBUG_SESSION_START=phpstorm | -| 119 | 69 | 52 | 421470.0 | GET | /cgit.png | -| 98 | 36 | 14 | 330051.0 | GET | /hugo-minimalist-theme/ | -| 46 | 31 | 15 | 165446.0 | GET | /homepage/ | -| 42 | 32 | 13 | 157112.0 | GET | /dotfiles/ | -| 39 | 34 | 14 | 255204.0 | GET | /hugo-minimalist-theme/log/ | -| 37 | 30 | 13 | 132913.0 | GET | /hugo-minimalist-theme/tree/ | -| 37 | 33 | 11 | 103541.0 | GET | /?s=name | -| 36 | 33 | 12 | 242095.0 | GET | 
/dotfiles/log/ | | 35 | 32 | 14 | 124829.0 | GET | /dotfiles/tree/ | +#+name: hacker-attacks +#+caption: Top 10 attacks leading to error pages, ranked by number of requests. The Agents and IPs columns count the distinct agents and IPs making the requests. +#+RESULTS[5d9125d3f545af74d94789c0893d257619926437]: +| Tx MiB | Requests | Agents | IPs | Errors | Methods | path | +|--------+----------+--------+------+-------------+----------+----------------------------------------------------------------------------------------------------------| +| 3.17 | 3482 | 6 | 1139 | 400,421,408 | GET,POST | / | +| 3.84 | 744 | 368 | 256 | 404 | GET,POST | /.env | +| 2.26 | 409 | 1 | 11 | 404 | GET | /cgi-bin/luci/;stok=/locale | +| 2.02 | 381 | 182 | 121 | 404 | GET | /.git/config | +| 0.57 | 222 | 12 | 167 | 404 | GET,POST | /vendor/phpunit/phpunit/src/Util/PHP/eval-stdin.php | +| 1.08 | 195 | 1 | 1 | 404 | GET | /actuator/gateway/routes | +| 0.88 | 173 | 2 | 137 | 404 | POST | /hello.world?%ADd+allow\under{}url\under{}include%3d1+%ADd+auto\under{}prepend\under{}file%3dphp://input | +| 0.23 | 157 | 2 | 127 | 404 | GET | /vendor/phpunit/phpunit/Util/PHP/eval-stdin.php | +| 0.22 | 152 | 2 | 123 | 404 | GET | /vendor/phpunit/src/Util/PHP/eval-stdin.php | +| 0.22 | 148 | 2 | 119 | 404 | GET | /vendor/phpunit/phpunit/LICENSE/eval-stdin.php | + +* Future plans +Many webmasters have been annoyed by this abusive scraping by AI bots. The +project [[https://xeiaso.net/blog/2025/anubis/][Anubis]] implements a proof of work /tax/ on visitors of a webpage. This +way abusive AI bot scraping is reduced. + +I personally dislike that idea. It does create an extra expense for the AI +companies, which indiscriminately crawl the internet, but no one really wins. It +is a failure of our internet system that micro payments aren't yet a reality. +For myself, this means being part of the change and bringing my bitcoin +lightning tipping system back online, this time with real coins. 
We need to get people +used to paying for resources on the internet. For that we need a working +infrastructure; we can't wait for the banking system to do it. In my opinion, +the main reason why the internet so aggressively invades our privacy is because +the banking system never provided a way to move money across the internet. The +only people that could pay were advertising companies. + +Knowing how stupid the AI crawlers are, I believe poisoning the training data is +better than using a proof of work tax to cure AI companies of such aggressive +and mindless crawling. Projects like [[https://iocaine.madhouse-project.org/][Iocaine]] provide a way to do it and are what +I'll implement in the future. + +#+begin_export html +<script type="text/javascript"> + addEventListener("load", () => { + function csvFloat(data) { + let headers = data[0]; + let series = headers.map((_, idx) => + data.slice(1).map((row) => parseFloat(row[idx])), + ); + return [headers, series]; + } + function responseParseCSV(response) { + if (response.ok) + return response.text().then((data) => + data + .split(/\n/) + .filter((x) => x) + .map((row) => row.split(/,/)), + ); + throw new Error("not 2XX resp"); + } + function withSuffix(val, suffix) { + return val.toFixed(1).replace(/.?0+$/, "").concat("", suffix); + } + function siScaling(value) { + var v = Math.abs(value); + return 0 === v + ? [0, ""] + : v >= 1000000000000000.0 + ? [value / 1000000000000000.0, "P"] + : v >= 1000000000000.0 + ? [value / 1000000000000.0, "T"] + : v >= 1000000000.0 + ? [value / 1000000000.0, "G"] + : v >= 1000000.0 + ? [value / 1000000.0, "M"] + : v >= 1000.0 + ? [value / 1000.0, "K"] + : v >= 0.6 + ? [value, ""] + : v >= 0.001 + ? [value / 0.001, "m"] + : v >= 0.000001 + ? [value / 0.000001, "μ"] + : v >= 0.000000001 + ? [value / 0.000000001, "n"] + : v >= 0.000000000001 + ? 
[value / 0.000000000001, "p"] + : null; + } + function scaling(val, suffix) { + return withSuffix.apply(this, siScaling(val)); + } + + function spacedColor(idx, alpha) { + if (alpha === undefined) { + alpha = "/ 1"; + } + return "hsl(" + 137.506 * idx + " 70% 55% " + alpha + ")"; + } + function agentChart(header, series, container) { + const opts = { + width: 920, + height: 600, + hooks: { + setSeries: [ + (u, seriesIdx, opts) => { + if (opts.focus != null) { + u.series.forEach((s, i) => { + s.width = i == seriesIdx ? 3 : 1; + }); + } + }, + ], + }, + focus: { alpha: 0.5 }, + cursor: { + focus: { + prox: 1e6, + bias: 0, + dist: (self, seriesIdx, dataIdx, valPos, curPos) => { + return valPos - curPos; + }, + }, + }, + + series: [ + {}, + { + label: header[1], + stroke: "black", + dash: [10, 5], + value: (u, v) => (v ? scaling(v) : v), + scale: "hits", + }, + ].concat( + header.slice(2).map((name, idx) => ({ + label: name, + stroke: spacedColor(idx), + fill: spacedColor(idx, "/ 0.1"), + value: (u, v) => (v ? scaling(v) + "B" : v), + })), + ), + axes: [ + {}, + { + values: (u, vals, space) => + vals.map((v) => (v ? scaling(v) + "B" : v)), + size: 60, + label: "Bandwidth", + labelSize: 50, + }, + { + side: 1, + scale: "hits", + label: "Requests", + grid: { show: false }, + values: (u, vals, space) => vals.map((v) => (v ? scaling(v) : v)), + labelSize: 50, + }, + ], + }; + let uplot = new uPlot(opts, series, container); + container["uobj"] = uplot; + } + + fetch("/top_agent_traffic.csv") + .then(responseParseCSV) + .then(csvFloat) + .then(([headers, series]) => { + let cont = document.querySelector("#agent-traffic").parentNode; + cont.innerHTML = ""; + agentChart(headers, series, cont); + }); + }); +</script> +#+end_export |
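The proof of work /tax/ mentioned in the future plans can be sketched in a few lines of shell: the visitor grinds nonces until a hash of challenge plus nonce starts with a target prefix, which the server then verifies with a single hash. The challenge string and the tiny two-hex-digit difficulty below are illustrative, not Anubis' actual parameters:

```shell
# Proof-of-work sketch: find a nonce so that sha256(challenge + nonce)
# starts with a given prefix. Cheap to verify, costly to mass-produce.
# "example-challenge" and prefix "00" are illustrative values.
challenge="example-challenge"
prefix="00"          # real deployments demand far more leading zeros
nonce=0
until printf '%s%s' "$challenge" "$nonce" | sha256sum | grep -q "^$prefix"; do
  nonce=$((nonce + 1))
done
echo "nonce found: $nonce"
# the server verifies the submitted nonce with one hash:
printf '%s%s' "$challenge" "$nonce" | sha256sum | grep -q "^$prefix" && echo "valid"
```

The asymmetry is the point: a scraper hammering millions of URLs pays this cost on every page, while a human visitor pays it once per session.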