📒 Actions Speak Louder Than Words!

Auto-submit web content to the Wayback Machine

Posted: Jul 11, 2020 | Reading time: 3 min

One of the great ninja tricks I always use when I want to check an older version of a web design, an older revision of some web content, long-gone content from a dead website, or all sorts of other treasures is the “Wayback Machine” or “Archive Today”. Think of them as virtual storage that keeps content until they remove it.

I don’t know how the system works behind the scenes, but maybe the archived content is compressed and sent to the dark web, and they fetch it when you need it. Scary, huh? So not everyone and not every piece of content should be submitted.

Each time they take a snapshot they must put it somewhere; it could be stored in the Hadoop Distributed File System (HDFS) using Apache Hadoop and Apache Accumulo. Who knows?

Anyway, I plan to auto-submit my blog content to the Wayback Machine, because currently I do it manually using a Chrome extension, which is a pain in the ass.

IMG
https://web.archive.org/web//https://robbinespu.gitlab.io/

How to let the librarians know?

Well, maybe later on I will create a Ruby plugin (since I’m using Jekyll), but for now let’s just do it with a simple bash one-liner combining curl, grep and awk.

$ curl -s https://robbinespu.gitlab.io/sitemap.xml | grep "<loc>" | \
awk -F"<loc>" '{print $2}' | awk -F"</loc>" '{print $1}'
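
The pipeline above assumes every <loc> element sits on its own line. As a minimal sketch (not part of my original setup), the same extraction can be wrapped in a small function that also copes with a minified sitemap. The name extract_locs is made up for illustration, and it assumes a grep that supports -o (GNU and BSD grep both do).

```shell
# Print the contents of every <loc> element read from standard input,
# one URL per line. Works even if the whole sitemap is on a single line.
extract_locs() {
  grep -o '<loc>[^<]*</loc>' | sed -e 's/<loc>//' -e 's/<\/loc>//'
}

# Usage (needs network access):
#   curl -s https://robbinespu.gitlab.io/sitemap.xml | extract_locs
```

This is still text matching rather than real XML parsing, so it is only good enough for a plain sitemap like the one Jekyll generates.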

By extracting the URLs from my sitemap, I can feed them to existing tools such as wayback-machine-archiver (Python, installed with pip) or wayback_archiver (Ruby gem).

Note (update)

It seems both tools accept a sitemap as a parameter and take care of everything themselves.

So I can just call one of them to submit my web content to the Wayback Machine each time Jekyll is built via CI/CD.
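
A rough sketch of how such a CI step could look, assuming the archiving tool exits with a non-zero status when something goes wrong (I have not verified that for either tool). The wrapper name archive_nonfatal is hypothetical. The idea is only that a failed archive request should be reported but should never fail the build itself.

```shell
# Run the given command. Report a failure on stderr, but always return 0
# so that a flaky archive request cannot fail the CI job.
archive_nonfatal() {
  if "$@"; then
    echo "archive: submitted"
  else
    echo "archive: failed with status $?, continuing" >&2
  fi
  return 0
}

# Usage in a CI job, after the site is deployed (command taken from this post):
#   archive_nonfatal wayback_archiver https://robbinespu.gitlab.io/sitemap.xml --concurrency=1
```

Running the tool inside the if condition also keeps a shell started with set -e from aborting the job when the tool fails.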

But sometimes the Wayback Machine doesn’t work properly…

$ archiver https://robbinespu.gitlab.io
ERROR:root:404 Client Error: NOT FOUND for url: https://web.archive.org/save/https://robbinespu.gitlab.io
Traceback (most recent call last):
  File "c:\python3\lib\site-packages\wayback_machine_archiver\archiver.py", line 35, in call_archiver
    r.raise_for_status()
  File "c:\python3\lib\site-packages\requests\models.py", line 941, in raise_for_status
    raise HTTPError(http_error_msg, response=self)
requests.exceptions.HTTPError: 404 Client Error: NOT FOUND for url: https://web.archive.org/save/https://robbinespu.gitlab.io
$ wayback_archiver https://robbinespu.gitlab.io/sitemap.xml --concurrency=1 --verbose
I, [2020-07-11T13:58:40.708292 #9619]  INFO -- WaybackArchiver: Looking for Sitemap(s) in /robots.txt
I, [2020-07-11T13:58:40.824281 #9619]  INFO -- WaybackArchiver: Fetching Sitemap at https://robbinespu.gitlab.io/sitemap.xml
D, [2020-07-11T13:58:40.824443 #9619] DEBUG -- WaybackArchiver: Requesting https://robbinespu.gitlab.io/sitemap.xml
D, [2020-07-11T13:58:40.834043 #9619] DEBUG -- WaybackArchiver: [200, OK] Requested https://robbinespu.gitlab.io/sitemap.xml
I, [2020-07-11T13:58:40.848374 #9619]  INFO -- WaybackArchiver: Total URLs to be sent: 65
I, [2020-07-11T13:58:40.848427 #9619]  INFO -- WaybackArchiver: Request are sent with up to 1 parallel threads
.
.
.
D, [2020-07-11T13:58:51.086457 #9619] DEBUG -- WaybackArchiver: Requesting https://web.archive.org/save/https://robbinespu.gitlab.io/blog/2019/11/21/zekr-quran-on-linux-fedora-30/
D, [2020-07-11T13:58:51.422413 #9619] DEBUG -- WaybackArchiver: [404, NOT FOUND] Requested https://web.archive.org/save/https://robbinespu.gitlab.io/blog/2019/11/21/zekr-quran-on-linux-fedora-30/
I, [2020-07-11T13:58:51.422509 #9619]  INFO -- WaybackArchiver: Posted [404, NOT FOUND] https://robbinespu.gitlab.io/blog/2019/11/21/zekr-quran-on-linux-fedora-30/
.
.
D, [2020-07-11T13:58:54.283012 #9619] DEBUG -- WaybackArchiver: Requesting https://web.archive.org/save/https://robbinespu.gitlab.io/blog/2020/02/15/inkscape-tutorial-kad-jemputan/
D, [2020-07-11T13:58:54.573641 #9619] DEBUG -- WaybackArchiver: [429, Too Many Requests] Requested https://web.archive.org/save/https://robbinespu.gitlab.io/blog/2020/02/15/inkscape-tutorial-kad-jemputan/
I, [2020-07-11T13:58:54.573738 #9619]  INFO -- WaybackArchiver: Posted [429, Too Many Requests] https://robbinespu.gitlab.io/blog/2020/02/15/inkscape-tutorial-kad-jemputan/

These errors are beyond my control. I don’t have a proxy available, and sometimes the Wayback Machine is simply down. Dang! So that’s all. I can’t complain more because this is a free service. The save function in the Wayback Machine has been unreliable, but it is the only option I have. Let’s just try my luck there.
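
One thing that might soften the 429 responses is to slow down and retry. Below is a minimal sketch of that idea, not something either tool is known to do: it calls the same https://web.archive.org/save/ endpoint that appears in the logs above, directly with curl. The function names, the 5-second base delay and the default of 4 attempts are all my own assumptions, and there is no guarantee the service will accept the request on a later attempt.

```shell
# Succeed if the HTTP status is worth retrying: 429 (rate limited) or any 5xx.
is_retryable() {
  case "$1" in
    429|5[0-9][0-9]) return 0 ;;
    *) return 1 ;;
  esac
}

# Seconds to wait before retry number $1 (1-based): 5, 10, 20, 40, ...
backoff_delay() {
  echo $(( 5 * (1 << ($1 - 1)) ))
}

# Submit one URL to the save endpoint, trying up to $2 times (default 4).
# Prints "<status> <url>" for every attempt. Needs curl and network access.
save_url() {
  url=$1
  max=${2:-4}
  attempt=1
  while :; do
    status=$(curl -s -o /dev/null -w '%{http_code}' "https://web.archive.org/save/$url")
    echo "$status $url"
    if ! is_retryable "$status" || [ "$attempt" -ge "$max" ]; then
      break
    fi
    sleep "$(backoff_delay "$attempt")"
    attempt=$((attempt + 1))
  done
  # Report success only for a 2xx or 3xx final status.
  [ "$status" -ge 200 ] && [ "$status" -lt 400 ]
}

# Usage (one URL at a time, which also keeps the request rate low):
#   save_url https://robbinespu.gitlab.io/
```

The 404 seen in the logs is deliberately not retried here; adding 404 to the case pattern in is_retryable would change that.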
