review text

author: Oscar Najera <hi@oscarnajera.com> 2025-05-01 23:54:17 +0200
committer: Oscar Najera <hi@oscarnajera.com> 2025-05-01 23:54:17 +0200
commit: 4bcd63ba807562e5a3abfd3625656c883141d8ea (patch)
tree: cf0e5b7728d0c5b92e2135f54dda66847c495118 /webstats/workforrobots.org
parent: d0a76b844210d7d84fb997a446948a50e1d268cd (diff)
download: scratch-4bcd63ba807562e5a3abfd3625656c883141d8ea.tar.gz
scratch-4bcd63ba807562e5a3abfd3625656c883141d8ea.tar.bz2
scratch-4bcd63ba807562e5a3abfd3625656c883141d8ea.zip
1 files changed, 48 insertions, 54 deletions
diff --git a/webstats/workforrobots.org b/webstats/workforrobots.org
index 9b7cde6..ba1fa0f 100644
--- a/webstats/workforrobots.org
+++ b/webstats/workforrobots.org
@@ -5,15 +5,15 @@
 #+HTML_HEAD: <link rel="stylesheet" href="static/uPlot.min.css" />
 #+HTML_HEAD_EXTRA: <script src="static/uPlot.iife.min.js"></script>
 #+HTML_HEAD_EXTRA: <script src="plots.js"></script>
+#+HTML_HEAD_EXTRA: <script src="/skewer"></script>
 
 
 I self-host some of my git repositories to keep sovereignty and independence
-from large Internet corporations. The public facing repositories are for
-everybody, and today that means for robots. With the =AI-hype= on coding
-assistants, I wanted to have a look at what are those AI companies taking. It is
-worse than everything, it is idiotically everything. They can't recognize that
-they are parsing git repositories and use the appropriate way of downloading
-them.
+from large Internet corporations. Public-facing repositories are for everybody,
+and today that means for robots. With the =AI-hype=, I wanted to have a look at
+what those AI companies are taking. It is worse than everything, it is
+idiotically everything. They can't recognize that they are parsing git
+repositories and use the appropriate way of downloading them.
 
 #+begin_src sqlite :exports none
 SELECT
@@ -33,12 +33,12 @@ FROM
 | 1735686035 | 2024-12-31 23:00:35               | 2025-01-01 00:00:35                            | 1745109504 | 2025-04-20 00:38:24               | 2025-04-20 02:38:24                            |
 * Who is visiting
 I analyzed the =Apache= log files of my =cgit= service in the period from
-=2025-01-01= till =2025-04-20=. Table [[top-users]] shows the top /users/ of public
-facing git repository. The leading AI companies =OpenAI= and =Anthropic= with
-their respective bots =GPTBot= and =ClaudeBot= simply dominate. I found it
-unbelievable that they could extract about =≈7GiB= of data each. That is a lot
-of Bandwidth out of my server for a few git repositories and in a lightweight
-web interphase.
+=2025-01-01= till =2025-04-20=. Table [[top-users]] shows the top /users/ of my
+public-facing git repositories. The leading AI companies =OpenAI= and =Anthropic=
+with their respective bots =GPTBot= and =ClaudeBot= simply dominate the load on
+the service. I found it unbelievable that they could extract =≈7GiB= of data
+each. That is a lot of bandwidth out of my server for a few git repositories
+served through a lightweight web interface.
 
 #+begin_src sqlite :exports results
 --SELECT
@@ -70,26 +70,27 @@ LIMIT 10
 #+end_src
 
 #+name: top-users
-#+caption: Top 10 /users/ as self-identified by their /User Agent/ to the server
+this is confusing how can I rewrite this caption
+#+caption: Top 10 /users/ ranked by bandwidth usage (/Tx/). The /User Agent/ is how each visitor identifies itself to the server.
 #+RESULTS[36d7b647efa39c3af86581279748a2bb53d034f3]:
 | Requests | Tx MiB | User Agent                                                                                                                                                            |
 |----------+--------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------|
-|  3572480 | 8819.6 | Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; compatible; GPTBot/1.2; +https://openai.com/gptbot)                                                                |
-|  1617262 | 6766.3 | Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; compatible; ClaudeBot/1.0; +claudebot@anthropic.com)                                                               |
-|   273968 |  721.4 | Mozilla/5.0 (compatible; Barkrowler/0.9; +https://babbar.tech/crawler)                                                                                                |
-|    80159 |  498.3 | Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/109.0.0.0 Safari/537.36                                                 |
-|   207771 |  475.8 | Scrapy/2.11.2 (+https://scrapy.org)                                                                                                                                   |
-|    69697 |  466.1 | Mozilla/5.0 (Linux; Android 7.0;) AppleWebKit/537.36 (KHTML, like Gecko) Mobile Safari/537.36 (compatible; PetalBot;+https://webmaster.petalsearch.com/site/petalbot) |
-|    59832 |  416.4 | Mozilla/5.0 (compatible; AhrefsBot/7.0; +http://ahrefs.com/robot/)                                                                                                    |
-|    14142 |   83.3 | Mozilla/5.0 (Linux; Android 5.0) AppleWebKit/537.36 (KHTML, like Gecko) Mobile Safari/537.36 (compatible; Bytespider; spider-feedback@bytedance.com)                  |
-|     2500 |   53.7 | Mozilla/5.0 (compatible; SeekportBot; +https://bot.seekport.com)                                                                                                      |
-|     3578 |   30.9 | Mozilla/5.0 (Linux; Android 6.0.1; Nexus 5X Build/MMB29P) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/135.0.7049.52 Mobile Safari/537.36 (compatible; GoogleOther)  |
-
-That is the total. What does it look like as a function of time? Figure
-[[fig:agent-traffic]] shows the load on CGit frontend service by each visiting
-agent. Hover over the plot to read the exact value for each agent at a given
-time on the legend. You can highlight a specific curve by hovering over it or
-its legend.
+|  3572480 | 8819.6 | Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; compatible; *GPTBot* /1.2; +https://openai.com/gptbot)                                                           |
+|  1617262 | 6766.3 | Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; compatible; *ClaudeBot* /1.0; +claudebot@anthropic.com)                                                            |
+|   273968 |  721.4 | Mozilla/5.0 (compatible; *Barkrowler* /0.9; +https://babbar.tech/crawler)                                                                                             |
+|    80159 |  498.3 | Mozilla/5.0 (*Macintosh*; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/109.0.0.0 Safari/537.36                                               |
+|   207771 |  475.8 | *Scrapy* /2.11.2 (+https://scrapy.org)                                                                                                                                |
+|    69697 |  466.1 | Mozilla/5.0 (Linux; Android 7.0;) AppleWebKit/537.36 (KHTML, like Gecko) Mobile Safari/537.36 (compatible; *PetalBot*;+https://webmaster.petalsearch.com/site/petalbot) |
+|    59832 |  416.4 | Mozilla/5.0 (compatible; *AhrefsBot* /7.0; +http://ahrefs.com/robot/)                                                                                                 |
+|    14142 |   83.3 | Mozilla/5.0 (Linux; Android 5.0) AppleWebKit/537.36 (KHTML, like Gecko) Mobile Safari/537.36 (compatible; *Bytespider*; spider-feedback@bytedance.com)                |
+|     2500 |   53.7 | Mozilla/5.0 (compatible; *SeekportBot*; +https://bot.seekport.com)                                                                                                    |
+|     3578 |   30.9 | Mozilla/5.0 (Linux; Android 6.0.1; Nexus 5X Build/MMB29P) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/135.0.7049.52 Mobile Safari/537.36 (compatible; *Google* Other) |
+
+What does it look like as a function of time? Figure [[fig:agent-traffic]] shows the
+load on the CGit frontend service by each visiting agent over time. Hover over the
+plot to read the exact value for each agent at a given time in the legend. You
+can highlight a specific curve by hovering over it or its legend entry. You can
+toggle the display of a curve by clicking on its legend entry.
 
 #+begin_src sqlite :results value file :file top_agent_traffic.csv :exports none
 SELECT
@@ -115,19 +116,21 @@ GROUP BY
     time
 #+end_src
 
-#+RESULTS[87f78b2de4d43785f81e682925c6b6542d794883]:
+#+RESULTS[2ec6d40ab4f5a844bdbd855884f8d0a6346fd780]:
 [[file:top_agent_traffic.csv]]
 
 #+attr_html: :id agent-traffic
-#+CAPTION: Load on CGit frontend service by each visiting agent. The first black dashed line shows the total request at the server and uses the right axis scale. All other solid-filled lines, use the left axis and represent the bandwidth usage.
+#+CAPTION: Load on the CGit frontend service by each visiting agent. The black dashed line shows the total requests at the server and uses the right axis scale. All other solid-filled lines use the left axis and represent the bandwidth usage.
 #+NAME: fig:agent-traffic
 [[./jsfill.png]]
 
-You can see how aggressively the =ClaudeBot= scrapes pages, consuming a lot of
-bandwidth is a /short/ time. =OpenAI-GPTBot= works show its rate-limitation, and
-performs its scraping over a /longer/ period of time. However, as seen in table
-[[top-users]], it performs more than twice the amount of request consumes =30%= more
-bandwidth.
+This is confusing and hard to read. Rewrite it.
+
+You can see how aggressively the =ClaudeBot= scrapes pages, using a lot of
+bandwidth in a /short/ time. On the other hand, =OpenAI-GPTBot= seems to be rate
+limited, because it scrapes over a /longer/ period of time. However, as seen in
+table [[top-users]], it performs more than twice the number of requests and consumes
+=30%= more bandwidth.
 
 The rest of the visitors are bots too. =Barkrowler= is a regular visitor
 gathering metrics for online marketing. =AhrefsBot= is of the same type, yet
@@ -141,25 +144,24 @@ sudden, took as much as it found useful, =<1%= of what the big AI bot take, and
 swiftly left again.
 
 =Bytespider= is almost background noise, but it is also to train an LLM, this
-time for ByteDance, the Chinese owner of TikTok.
+time for =ByteDance=, the Chinese owner of =TikTok=.
 
 The last one =Google= doesn't even seem to be the bot for indexing its search
-engine, but rather one to test its chrome browser and how it renders pages.
+engine, but rather one to test its =Chrome= browser and how it renders pages.
 
 =Rest= is all the remaining /robots/ or /users/. They have consumed around
 =≈400MiB=, placing them in aggregate in a behavior like =Macintosh, Scrapy,
-PetalBot & AhrefsBot=.
+PetalBot & AhrefsBot=. Most likely these are hacker bots probing the site.
 
 * How should they visit?
-
-This is a collection of =git= repositories. Yet, you can browse some of my code,
-some files, that is it. Yet, when you want *everything*, the correct use of this
+=CGit= is a web interface for =git= repositories. You can browse some of my
+code, some files, that is it. If you want *everything*, the correct use of this
 service is through the =git= client, downloading my publicly available software.
 
-That makes the data a lot more useful, even for those AI companies as the data
-cleanup would be easier. They, themselves should use their AI to recognize what
-kind of page they are vising and act accordingly instead of stupidly scraping
-everything.
+That makes the data a lot more useful, even for those AI companies, because the
+data cleanup would be easier. They themselves should use their AI to recognize
+what kind of page they are visiting and act accordingly instead of stupidly
+scraping everything.
 
 How have the good citizens behaved? That is on table [[git-users]]. The =Software
 Heritage= keeps mirrors of git repositories. It thus watches for updates and the
@@ -339,14 +341,6 @@ ORDER BY
 LIMIT 20
 #+end_src
 
-my cgit instance using nginx is not working correctly.
-On the home(landing/starting) page all links work. Say I click on sepo "prime".
-that leads to the path /prime. That is how is written on the html links.
-
-Now that I'm in path '/prime' I want to go to the about page.
-The generated link is '/prime/prime/about'. It is repeating the path, because "im on page /prime", the new pages should be relative "about", or absolute '/prime/about'. But it is concatenating '/prime', to where I want to go '/prime/about', leaving '/prime/prime/about'
-
-Where do I fix this? What would the nginx config look like.
 #+RESULTS[5d7adfe6034ad15603705e1501ba90fadc0a8f1c]:
 | tx MiB | count(*) | agents | path                                                                                                     |
 |--------+----------+--------+----------------------------------------------------------------------------------------------------------|
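
The SQL behind the =top-users= table is elided from this diff, so here is a minimal sketch of the same kind of aggregation (requests and bytes sent per /User Agent/, ranked by bandwidth), written in Python instead. It assumes the Apache "combined" log format, which the post does not confirm. The names =LINE_RE= and =top_agents= are hypothetical, and this is not the author's actual pipeline.

```python
import re
from collections import defaultdict

# Assumed Apache "combined" format:
#   host ident user [time] "request" status bytes "referer" "user-agent"
# Lines whose quoted fields contain escaped quotes will not match and are skipped.
LINE_RE = re.compile(
    r'^\S+ \S+ \S+ \[[^\]]+\] "[^"]*" (?P<status>\d{3}) (?P<size>\d+|-) '
    r'"[^"]*" "(?P<agent>[^"]*)"$'
)


def top_agents(lines, limit=10):
    """Return (requests, tx_mib, user_agent) tuples ranked by bytes sent."""
    totals = defaultdict(lambda: [0, 0])  # agent -> [requests, bytes]
    for line in lines:
        match = LINE_RE.match(line.strip())
        if match is None:
            continue  # not in the assumed format
        # Apache logs "-" when no response body was sent.
        size = 0 if match["size"] == "-" else int(match["size"])
        entry = totals[match["agent"]]
        entry[0] += 1
        entry[1] += size
    ranked = sorted(totals.items(), key=lambda kv: kv[1][1], reverse=True)
    return [
        (requests, round(sent / 2**20, 1), agent)
        for agent, (requests, sent) in ranked[:limit]
    ]
```

Calling =top_agents(open("access.log"))= (the file name is a placeholder) would yield rows in the same column order as the table: requests, Tx MiB, user agent.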