author | Oscar Najera <hi@oscarnajera.com> | 2025-05-01 23:54:17 +0200 |
---|---|---|
committer | Oscar Najera <hi@oscarnajera.com> | 2025-05-01 23:54:17 +0200 |
commit | 4bcd63ba807562e5a3abfd3625656c883141d8ea (patch) | |
tree | cf0e5b7728d0c5b92e2135f54dda66847c495118 /webstats/workforrobots.org | |
parent | d0a76b844210d7d84fb997a446948a50e1d268cd (diff) | |
review text
Diffstat (limited to 'webstats/workforrobots.org')
-rw-r--r-- | webstats/workforrobots.org | 102 |
1 files changed, 48 insertions, 54 deletions
diff --git a/webstats/workforrobots.org b/webstats/workforrobots.org
index 9b7cde6..ba1fa0f 100644
--- a/webstats/workforrobots.org
+++ b/webstats/workforrobots.org
@@ -5,15 +5,15 @@
 #+HTML_HEAD: <link rel="stylesheet" href="static/uPlot.min.css" />
 #+HTML_HEAD_EXTRA: <script src="static/uPlot.iife.min.js"></script>
 #+HTML_HEAD_EXTRA: <script src="plots.js"></script>
+#+HTML_HEAD_EXTRA: <script src="/skewer"></script>
 
 
 I self-host some of my git repositories to keep sovereignty and independence
-from large Internet corporations. The public facing repositories are for
-everybody, and today that means for robots. With the =AI-hype= on coding
-assistants, I wanted to have a look at what are those AI companies taking. It is
-worse than everything, it is idiotically everything. They can't recognize that
-they are parsing git repositories and use the appropriate way of downloading
-them.
+from large Internet corporations. Public facing repositories are for everybody,
+and today that means for robots. With the =AI-hype=, I wanted to have a look at
+what are those AI companies taking. It is worse than everything, it is
+idiotically everything. They can't recognize, that they are parsing git
+repositories and use the appropriate way of downloading them.
 
 #+begin_src sqlite :exports none
 SELECT
@@ -33,12 +33,12 @@ FROM
 | 1735686035 | 2024-12-31 23:00:35 | 2025-01-01 00:00:35 | 1745109504 | 2025-04-20 00:38:24 | 2025-04-20 02:38:24 |
 * Who is visiting
 I analyzed the =Apache= log files of my =cgit= service in the period from
-=2025-01-01= till =2025-04-20=. Table [[top-users]] shows the top /users/ of public
-facing git repository. The leading AI companies =OpenAI= and =Anthropic= with
-their respective bots =GPTBot= and =ClaudeBot= simply dominate. I found it
-unbelievable that they could extract about =≈7GiB= of data each. That is a lot
-of Bandwidth out of my server for a few git repositories and in a lightweight
-web interphase.
+=2025-01-01= till =2025-04-20=. Table [[top-users]] shows the top /users/ of my
+public facing git repository. The leading AI companies =OpenAI= and =Anthropic=
+with their respective bots =GPTBot= and =ClaudeBot= simply dominate the load on
+the service. I found it unbelievable that they could extract about =≈7GiB= of
+data each. That is a lot of Bandwidth out of my server for a few git
+repositories and in a lightweight web interface.
 
 #+begin_src sqlite :exports results
 --SELECT
@@ -70,26 +70,27 @@ LIMIT 10
 #+end_src
 
 #+name: top-users
-#+caption: Top 10 /users/ as self-identified by their /User Agent/ to the server
+this is confusing how can I rewrite this caption
+#+caption: Top 10 /users/ ranked by bandwidth usage (/Tx/). /User Agent/ user agent is how they self-identify themselves.
 #+RESULTS[36d7b647efa39c3af86581279748a2bb53d034f3]:
 | Requests | Tx MiB | User Agent |
 |----------+--------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------|
-| 3572480 | 8819.6 | Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; compatible; GPTBot/1.2; +https://openai.com/gptbot) |
-| 1617262 | 6766.3 | Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; compatible; ClaudeBot/1.0; +claudebot@anthropic.com) |
-| 273968 | 721.4 | Mozilla/5.0 (compatible; Barkrowler/0.9; +https://babbar.tech/crawler) |
-| 80159 | 498.3 | Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/109.0.0.0 Safari/537.36 |
-| 207771 | 475.8 | Scrapy/2.11.2 (+https://scrapy.org) |
-| 69697 | 466.1 | Mozilla/5.0 (Linux; Android 7.0;) AppleWebKit/537.36 (KHTML, like Gecko) Mobile Safari/537.36 (compatible; PetalBot;+https://webmaster.petalsearch.com/site/petalbot) |
-| 59832 | 416.4 | Mozilla/5.0 (compatible; AhrefsBot/7.0; +http://ahrefs.com/robot/) |
-| 14142 | 83.3 | Mozilla/5.0 (Linux; Android 5.0) AppleWebKit/537.36 (KHTML, like Gecko) Mobile Safari/537.36 (compatible; Bytespider; spider-feedback@bytedance.com) |
-| 2500 | 53.7 | Mozilla/5.0 (compatible; SeekportBot; +https://bot.seekport.com) |
-| 3578 | 30.9 | Mozilla/5.0 (Linux; Android 6.0.1; Nexus 5X Build/MMB29P) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/135.0.7049.52 Mobile Safari/537.36 (compatible; GoogleOther) |
-
-That is the total. What does it look like as a function of time? Figure
-[[fig:agent-traffic]] shows the load on CGit frontend service by each visiting
-agent. Hover over the plot to read the exact value for each agent at a given
-time on the legend. You can highlight a specific curve by hovering over it or
-its legend.
+| 3572480 | 8819.6 | Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; compatible; *GPTBot* /1.2; +https://openai.com/gptbot) |
+| 1617262 | 6766.3 | Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; compatible; *ClaudeBot* /1.0; +claudebot@anthropic.com) |
+| 273968 | 721.4 | Mozilla/5.0 (compatible; *Barkrowler* /0.9; +https://babbar.tech/crawler) |
+| 80159 | 498.3 | Mozilla/5.0 (*Macintosh*; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/109.0.0.0 Safari/537.36 |
+| 207771 | 475.8 | *Scrapy* /2.11.2 (+https://scrapy.org) |
+| 69697 | 466.1 | Mozilla/5.0 (Linux; Android 7.0;) AppleWebKit/537.36 (KHTML, like Gecko) Mobile Safari/537.36 (compatible; *PetalBot*;+https://webmaster.petalsearch.com/site/petalbot) |
+| 59832 | 416.4 | Mozilla/5.0 (compatible; *AhrefsBot* /7.0; +http://ahrefs.com/robot/) |
+| 14142 | 83.3 | Mozilla/5.0 (Linux; Android 5.0) AppleWebKit/537.36 (KHTML, like Gecko) Mobile Safari/537.36 (compatible; *Bytespider*; spider-feedback@bytedance.com) |
+| 2500 | 53.7 | Mozilla/5.0 (compatible; *SeekportBot*; +https://bot.seekport.com) |
+| 3578 | 30.9 | Mozilla/5.0 (Linux; Android 6.0.1; Nexus 5X Build/MMB29P) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/135.0.7049.52 Mobile Safari/537.36 (compatible; *Google* Other) |
+
+What does it look like as a function of time? Figure [[fig:agent-traffic]] shows the
+load on CGit frontend service by each visiting agent over time. Hover over the
+plot to read the exact value for each agent at a given time on the legend. You
+can highlight a specific curve by hovering over it or its legend. You can toggle
+the display of a curve by clicking on its legend.
 
 #+begin_src sqlite :results value file :file top_agent_traffic.csv :exports none
 SELECT
@@ -115,19 +116,21 @@ GROUP BY
 time
 #+end_src
 
-#+RESULTS[87f78b2de4d43785f81e682925c6b6542d794883]:
+#+RESULTS[2ec6d40ab4f5a844bdbd855884f8d0a6346fd780]:
 [[file:top_agent_traffic.csv]]
 
 #+attr_html: :id agent-traffic
-#+CAPTION: Load on CGit frontend service by each visiting agent. The first black dashed line shows the total request at the server and uses the right axis scale. All other solid-filled lines, use the left axis and represent the bandwidth usage.
+#+CAPTION: Load on CGit frontend service by each visiting agent. The black dashed line shows the total request at the server and uses the right axis scale. All other solid-filled lines, use the left axis and represent the bandwidth usage.
 #+NAME: fig:agent-traffic
 [[./jsfill.png]]
 
-You can see how aggressively the =ClaudeBot= scrapes pages, consuming a lot of
-bandwidth is a /short/ time. =OpenAI-GPTBot= works show its rate-limitation, and
-performs its scraping over a /longer/ period of time. However, as seen in table
-[[top-users]], it performs more than twice the amount of request consumes =30%= more
-bandwidth.
+This is confusing and hard to read. Rewrite it.
+
+You can see how aggressively the =ClaudeBot= scrapes pages, using a lot of
+bandwidth is a /short/ time. On the other hand =OpenAI-GPTBot= seems rate
+limited, because it scrapes over a /longer/ period of time. However, as seen in
+table [[top-users]], it performs more than twice the amount of request and consumes
+=30%= more bandwidth.
 
 The rest of the visitors are bots too. =Barkrowler= is a regular visitor
 gathering metrics for online marketing. =AhrefsBot= is of the same type, yet
@@ -141,25 +144,24 @@ sudden, took as much as it found useful, =<1%= of what the big AI bot take, and
 swiftly left again.
 
 =Bytespider= is almost background noise, but it is also to train an LLM, this
-time for ByteDance, the Chinese owner of TikTok.
+time for =ByteDance=, the Chinese owner of =TikTok=.
 
 The last one =Google= doesn't even seem to be the bot for indexing its search
-engine, but rather one to test its chrome browser and how it renders pages.
+engine, but rather one to test its =Chrome= browser and how it renders pages.
 
 =Rest= is all the remaining /robots/ or /users/. The have consumed around
 =≈400MiB=, placing them in aggregate in a behavior like =Macintosh, Scrapy,
-PetalBot & AhrefsBot=.
+PetalBot & AhrefsBot=. Mostly likely is hacker bots proving the site.
 
 
 * How should they visit?
-
-This is a collection of =git= repositories. Yet, you can browse some of my code,
-some files, that is it. Yet, when you want *everything*, the correct use of this
+=CGit= is a web interface for =git= repositories. You can browse some of my
+code, some files, that is it. If you want *everything*, the correct use of this
 service is through the =git= client, downloading my publicly available software.
-That makes the data a lot more useful, even for those AI companies as the data
-cleanup would be easier. They, themselves should use their AI to recognize what
-kind of page they are vising and act accordingly instead of stupidly scraping
-everything.
+That makes the data a lot more useful, even for those AI companies. Because the
+data cleanup would be easier. They, themselves should use their AI to recognize
+what kind of page they are vising and act accordingly instead of stupidly
+scraping everything.
 
 How have the good citizens behaved? That is on table [[git-users]]. The =Software
 Heritage= keeps mirrors of git repositories. It thus watches for updates and the
@@ -339,14 +341,6 @@ ORDER BY
 LIMIT 20
 #+end_src
 
-my cgit instance using nginx is not working correctly.
-On the home(landing/starting) page all links work. Say I click on sepo "prime".
-that leads to the path /prime. That is how is written on the html links.
-
-Now that I'm in path '/prime' I want to go to the about page.
-The generated link is '/prime/prime/about'. It is repeating the path, because "im on page /prime", the new pages should be relative "about", or absolute '/prime/about'. But it is concatenating '/prime', to where I want to go '/prime/about', leaving '/prime/prime/about'
-
-Where do I fix this? What would the nginx config look like.
 #+RESULTS[5d7adfe6034ad15603705e1501ba90fadc0a8f1c]:
 | tx MiB | count(*) | agents | path |
 |--------+----------+--------+----------------------------------------------------------------------------------------------------------|