author Oscar Najera <hi@oscarnajera.com> 2025-05-01 23:54:17 +0200
committer Oscar Najera <hi@oscarnajera.com> 2025-05-01 23:54:17 +0200
commit 4bcd63ba807562e5a3abfd3625656c883141d8ea
tree cf0e5b7728d0c5b92e2135f54dda66847c495118 /webstats/workforrobots.org
parent d0a76b844210d7d84fb997a446948a50e1d268cd
review text
Diffstat (limited to 'webstats/workforrobots.org')
-rw-r--r-- webstats/workforrobots.org 102
1 files changed, 48 insertions, 54 deletions
diff --git a/webstats/workforrobots.org b/webstats/workforrobots.org
index 9b7cde6..ba1fa0f 100644
--- a/webstats/workforrobots.org
+++ b/webstats/workforrobots.org
@@ -5,15 +5,15 @@
#+HTML_HEAD: <link rel="stylesheet" href="static/uPlot.min.css" />
#+HTML_HEAD_EXTRA: <script src="static/uPlot.iife.min.js"></script>
#+HTML_HEAD_EXTRA: <script src="plots.js"></script>
+#+HTML_HEAD_EXTRA: <script src="/skewer"></script>
I self-host some of my git repositories to keep sovereignty and independence
-from large Internet corporations. The public facing repositories are for
-everybody, and today that means for robots. With the =AI-hype= on coding
-assistants, I wanted to have a look at what are those AI companies taking. It is
-worse than everything, it is idiotically everything. They can't recognize that
-they are parsing git repositories and use the appropriate way of downloading
-them.
+from large Internet corporations. Public facing repositories are for everybody,
+and today that means for robots. With the =AI-hype=, I wanted to have a look at
+what are those AI companies taking. It is worse than everything, it is
+idiotically everything. They can't recognize, that they are parsing git
+repositories and use the appropriate way of downloading them.
#+begin_src sqlite :exports none
SELECT
@@ -33,12 +33,12 @@ FROM
| 1735686035 | 2024-12-31 23:00:35 | 2025-01-01 00:00:35 | 1745109504 | 2025-04-20 00:38:24 | 2025-04-20 02:38:24 |
* Who is visiting
I analyzed the =Apache= log files of my =cgit= service in the period from
-=2025-01-01= till =2025-04-20=. Table [[top-users]] shows the top /users/ of public
-facing git repository. The leading AI companies =OpenAI= and =Anthropic= with
-their respective bots =GPTBot= and =ClaudeBot= simply dominate. I found it
-unbelievable that they could extract about =≈7GiB= of data each. That is a lot
-of Bandwidth out of my server for a few git repositories and in a lightweight
-web interphase.
+=2025-01-01= till =2025-04-20=. Table [[top-users]] shows the top /users/ of my
+public facing git repository. The leading AI companies =OpenAI= and =Anthropic=
+with their respective bots =GPTBot= and =ClaudeBot= simply dominate the load on
+the service. I found it unbelievable that they could extract about =≈7GiB= of
+data each. That is a lot of Bandwidth out of my server for a few git
+repositories and in a lightweight web interface.
#+begin_src sqlite :exports results
--SELECT
@@ -70,26 +70,27 @@ LIMIT 10
#+end_src
#+name: top-users
-#+caption: Top 10 /users/ as self-identified by their /User Agent/ to the server
+this is confusing how can I rewrite this caption
+#+caption: Top 10 /users/ ranked by bandwidth usage (/Tx/). /User Agent/ user agent is how they self-identify themselves.
#+RESULTS[36d7b647efa39c3af86581279748a2bb53d034f3]:
| Requests | Tx MiB | User Agent |
|----------+--------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------|
-| 3572480 | 8819.6 | Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; compatible; GPTBot/1.2; +https://openai.com/gptbot) |
-| 1617262 | 6766.3 | Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; compatible; ClaudeBot/1.0; +claudebot@anthropic.com) |
-| 273968 | 721.4 | Mozilla/5.0 (compatible; Barkrowler/0.9; +https://babbar.tech/crawler) |
-| 80159 | 498.3 | Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/109.0.0.0 Safari/537.36 |
-| 207771 | 475.8 | Scrapy/2.11.2 (+https://scrapy.org) |
-| 69697 | 466.1 | Mozilla/5.0 (Linux; Android 7.0;) AppleWebKit/537.36 (KHTML, like Gecko) Mobile Safari/537.36 (compatible; PetalBot;+https://webmaster.petalsearch.com/site/petalbot) |
-| 59832 | 416.4 | Mozilla/5.0 (compatible; AhrefsBot/7.0; +http://ahrefs.com/robot/) |
-| 14142 | 83.3 | Mozilla/5.0 (Linux; Android 5.0) AppleWebKit/537.36 (KHTML, like Gecko) Mobile Safari/537.36 (compatible; Bytespider; spider-feedback@bytedance.com) |
-| 2500 | 53.7 | Mozilla/5.0 (compatible; SeekportBot; +https://bot.seekport.com) |
-| 3578 | 30.9 | Mozilla/5.0 (Linux; Android 6.0.1; Nexus 5X Build/MMB29P) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/135.0.7049.52 Mobile Safari/537.36 (compatible; GoogleOther) |
-
-That is the total. What does it look like as a function of time? Figure
-[[fig:agent-traffic]] shows the load on CGit frontend service by each visiting
-agent. Hover over the plot to read the exact value for each agent at a given
-time on the legend. You can highlight a specific curve by hovering over it or
-its legend.
+| 3572480 | 8819.6 | Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; compatible; *GPTBot* /1.2; +https://openai.com/gptbot) |
+| 1617262 | 6766.3 | Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; compatible; *ClaudeBot* /1.0; +claudebot@anthropic.com) |
+| 273968 | 721.4 | Mozilla/5.0 (compatible; *Barkrowler* /0.9; +https://babbar.tech/crawler) |
+| 80159 | 498.3 | Mozilla/5.0 (*Macintosh*; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/109.0.0.0 Safari/537.36 |
+| 207771 | 475.8 | *Scrapy* /2.11.2 (+https://scrapy.org) |
+| 69697 | 466.1 | Mozilla/5.0 (Linux; Android 7.0;) AppleWebKit/537.36 (KHTML, like Gecko) Mobile Safari/537.36 (compatible; *PetalBot*;+https://webmaster.petalsearch.com/site/petalbot) |
+| 59832 | 416.4 | Mozilla/5.0 (compatible; *AhrefsBot* /7.0; +http://ahrefs.com/robot/) |
+| 14142 | 83.3 | Mozilla/5.0 (Linux; Android 5.0) AppleWebKit/537.36 (KHTML, like Gecko) Mobile Safari/537.36 (compatible; *Bytespider*; spider-feedback@bytedance.com) |
+| 2500 | 53.7 | Mozilla/5.0 (compatible; *SeekportBot*; +https://bot.seekport.com) |
+| 3578 | 30.9 | Mozilla/5.0 (Linux; Android 6.0.1; Nexus 5X Build/MMB29P) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/135.0.7049.52 Mobile Safari/537.36 (compatible; *Google* Other) |
+
+What does it look like as a function of time? Figure [[fig:agent-traffic]] shows the
+load on CGit frontend service by each visiting agent over time. Hover over the
+plot to read the exact value for each agent at a given time on the legend. You
+can highlight a specific curve by hovering over it or its legend. You can toggle
+the display of a curve by clicking on its legend.
#+begin_src sqlite :results value file :file top_agent_traffic.csv :exports none
SELECT
@@ -115,19 +116,21 @@ GROUP BY
time
#+end_src
-#+RESULTS[87f78b2de4d43785f81e682925c6b6542d794883]:
+#+RESULTS[2ec6d40ab4f5a844bdbd855884f8d0a6346fd780]:
[[file:top_agent_traffic.csv]]
#+attr_html: :id agent-traffic
-#+CAPTION: Load on CGit frontend service by each visiting agent. The first black dashed line shows the total request at the server and uses the right axis scale. All other solid-filled lines, use the left axis and represent the bandwidth usage.
+#+CAPTION: Load on CGit frontend service by each visiting agent. The black dashed line shows the total request at the server and uses the right axis scale. All other solid-filled lines, use the left axis and represent the bandwidth usage.
#+NAME: fig:agent-traffic
[[./jsfill.png]]
-You can see how aggressively the =ClaudeBot= scrapes pages, consuming a lot of
-bandwidth is a /short/ time. =OpenAI-GPTBot= works show its rate-limitation, and
-performs its scraping over a /longer/ period of time. However, as seen in table
-[[top-users]], it performs more than twice the amount of request consumes =30%= more
-bandwidth.
+This is confusing and hard to read. Rewrite it.
+
+You can see how aggressively the =ClaudeBot= scrapes pages, using a lot of
+bandwidth is a /short/ time. On the other hand =OpenAI-GPTBot= seems rate
+limited, because it scrapes over a /longer/ period of time. However, as seen in
+table [[top-users]], it performs more than twice the amount of request and consumes
+=30%= more bandwidth.
The rest of the visitors are bots too. =Barkrowler= is a regular visitor
gathering metrics for online marketing. =AhrefsBot= is of the same type, yet
@@ -141,25 +144,24 @@ sudden, took as much as it found useful, =<1%= of what the big AI bot take, and
swiftly left again.
=Bytespider= is almost background noise, but it is also to train an LLM, this
-time for ByteDance, the Chinese owner of TikTok.
+time for =ByteDance=, the Chinese owner of =TikTok=.
The last one =Google= doesn't even seem to be the bot for indexing its search
-engine, but rather one to test its chrome browser and how it renders pages.
+engine, but rather one to test its =Chrome= browser and how it renders pages.
=Rest= is all the remaining /robots/ or /users/. The have consumed around
=≈400MiB=, placing them in aggregate in a behavior like =Macintosh, Scrapy,
-PetalBot & AhrefsBot=.
+PetalBot & AhrefsBot=. Mostly likely is hacker bots proving the site.
* How should they visit?
-
-This is a collection of =git= repositories. Yet, you can browse some of my code,
-some files, that is it. Yet, when you want *everything*, the correct use of this
+=CGit= is a web interface for =git= repositories. You can browse some of my
+code, some files, that is it. If you want *everything*, the correct use of this
service is through the =git= client, downloading my publicly available software.
-That makes the data a lot more useful, even for those AI companies as the data
-cleanup would be easier. They, themselves should use their AI to recognize what
-kind of page they are vising and act accordingly instead of stupidly scraping
-everything.
+That makes the data a lot more useful, even for those AI companies. Because the
+data cleanup would be easier. They, themselves should use their AI to recognize
+what kind of page they are vising and act accordingly instead of stupidly
+scraping everything.
How have the good citizens behaved? That is on table [[git-users]]. The =Software
Heritage= keeps mirrors of git repositories. It thus watches for updates and the
@@ -339,14 +341,6 @@ ORDER BY
LIMIT 20
#+end_src
-my cgit instance using nginx is not working correctly.
-On the home(landing/starting) page all links work. Say I click on sepo "prime".
-that leads to the path /prime. That is how is written on the html links.
-
-Now that I'm in path '/prime' I want to go to the about page.
-The generated link is '/prime/prime/about'. It is repeating the path, because "im on page /prime", the new pages should be relative "about", or absolute '/prime/about'. But it is concatenating '/prime', to where I want to go '/prime/about', leaving '/prime/prime/about'
-
-Where do I fix this? What would the nginx config look like.
#+RESULTS[5d7adfe6034ad15603705e1501ba90fadc0a8f1c]:
| tx MiB | count(*) | agents | path |
|--------+----------+--------+----------------------------------------------------------------------------------------------------------|