
Auto-submit web content to the Wayback Machine

Posted: Jul 11, 2020 | Reading time: 3 min

One of the great ninja tricks I always use when I want to check an older version of a web design, look at an older revision of web content, or dig up long-gone content from a dead website (or all sorts of other treasures) is the “Wayback Machine” or “Archive Today”. Think of them as virtual storage that keeps the content until they remove it.

I don’t know how the system works behind the scenes; maybe the archived content is compressed and sent to the dark web, and they fetch it when you need it. Scary, huh? So not everyone should submit, and not every piece of content should be submitted.

Each time they take a snapshot, they must put it somewhere. It could be stored in the Hadoop Distributed File System (HDFS) using Apache Hadoop and Apache Accumulo. Who knows?

Anyway, I plan to auto-submit my blog content to the Wayback Machine, because currently I do it manually with a Chrome extension, which is a pain in the ass.


How to let the librarians know?

Well, maybe later on I will write a Ruby plugin (since I use Jekyll), but for now let’s just do it with a simple bash script that combines curl, grep, and awk.

$ curl https://robbinespu.gitlab.io/sitemap.xml | grep "<loc>" | \
    awk -F"<loc>" '{print $2}' | awk -F"</loc>" '{print $1}'

By extracting the URLs from my sitemap, I can feed them to existing tools such as wayback-machine-archiver (Python, installed with pip) or wayback_archiver (Ruby gem).
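If you would rather not install either tool, the same idea can be sketched in plain shell. This is only a rough sketch under a few assumptions: the function names (extract_locs, submit_url, submit_sitemap) and the 5-second pause are made up for illustration, and the only thing taken from the real service is the https://web.archive.org/save/<url> address that the tools themselves request.

```shell
#!/usr/bin/env bash
# Sketch only: function names and the 5-second pause are assumptions,
# not part of any existing tool.

# Print every <loc> value found in sitemap XML read from stdin.
extract_locs() {
  grep -o '<loc>[^<]*</loc>' | sed -e 's|<loc>||' -e 's|</loc>||'
}

# Ask the Wayback Machine to save one URL; print "<http status> <url>".
submit_url() {
  local url="$1"
  local status
  status=$(curl -s -o /dev/null -w '%{http_code}' "https://web.archive.org/save/${url}")
  echo "${status} ${url}"
}

# Submit every URL listed in a sitemap, one at a time, pausing between requests.
submit_sitemap() {
  local sitemap_url="$1"
  local url
  curl -s "$sitemap_url" | extract_locs | while read -r url; do
    submit_url "$url"
    sleep 5
  done
}
```

After sourcing the file, something like `submit_sitemap https://robbinespu.gitlab.io/sitemap.xml` would walk the whole sitemap. Unlike the awk version, grep -o also copes with several <loc> entries on the same line.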

Note (update)

It seems both tools let you pass the sitemap as a parameter, and they will take care of everything.

So I can just call one of them and submit my web content to the Wayback Machine each time Jekyll is built via CI/CD.

But sometimes the Wayback Machine doesn’t work properly:

$ archiver https://robbinespu.gitlab.io
ERROR:root:404 Client Error: NOT FOUND for url: https://web.archive.org/save/https://robbinespu.gitlab.io
Traceback (most recent call last):
  File "c:\python3\lib\site-packages\wayback_machine_archiver\archiver.py", line 35, in call_archiver
  File "c:\python3\lib\site-packages\requests\models.py", line 941, in raise_for_status
    raise HTTPError(http_error_msg, response=self)
requests.exceptions.HTTPError: 404 Client Error: NOT FOUND for url: https://web.archive.org/save/https://robbinespu.gitlab.io
$ wayback_archiver https://robbinespu.gitlab.io/sitemap.xml --concurrency=1 --verbose
I, [2020-07-11T13:58:40.708292 #9619]  INFO -- WaybackArchiver: Looking for Sitemap(s) in /robots.txt
I, [2020-07-11T13:58:40.824281 #9619]  INFO -- WaybackArchiver: Fetching Sitemap at https://robbinespu.gitlab.io/sitemap.xml
D, [2020-07-11T13:58:40.824443 #9619] DEBUG -- WaybackArchiver: Requesting https://robbinespu.gitlab.io/sitemap.xml
D, [2020-07-11T13:58:40.834043 #9619] DEBUG -- WaybackArchiver: [200, OK] Requested https://robbinespu.gitlab.io/sitemap.xml
I, [2020-07-11T13:58:40.848374 #9619]  INFO -- WaybackArchiver: Total URLs to be sent: 65
I, [2020-07-11T13:58:40.848427 #9619]  INFO -- WaybackArchiver: Request are sent with up to 1 parallel threads
D, [2020-07-11T13:58:51.086457 #9619] DEBUG -- WaybackArchiver: Requesting https://web.archive.org/save/https://robbinespu.gitlab.io/blog/2019/11/21/zekr-quran-on-linux-fedora-30/
D, [2020-07-11T13:58:51.422413 #9619] DEBUG -- WaybackArchiver: [404, NOT FOUND] Requested https://web.archive.org/save/https://robbinespu.gitlab.io/blog/2019/11/21/zekr-quran-on-linux-fedora-30/
I, [2020-07-11T13:58:51.422509 #9619]  INFO -- WaybackArchiver: Posted [404, NOT FOUND] https://robbinespu.gitlab.io/blog/2019/11/21/zekr-quran-on-linux-fedora-30/
D, [2020-07-11T13:58:54.283012 #9619] DEBUG -- WaybackArchiver: Requesting https://web.archive.org/save/https://robbinespu.gitlab.io/blog/2020/02/15/inkscape-tutorial-kad-jemputan/
D, [2020-07-11T13:58:54.573641 #9619] DEBUG -- WaybackArchiver: [429, Too Many Requests] Requested https://web.archive.org/save/https://robbinespu.gitlab.io/blog/2020/02/15/inkscape-tutorial-kad-jemputan/
I, [2020-07-11T13:58:54.573738 #9619]  INFO -- WaybackArchiver: Posted [429, Too Many Requests] https://robbinespu.gitlab.io/blog/2020/02/15/inkscape-tutorial-kad-jemputan/

These errors are beyond my control. I don’t have a proxy available, and sometimes the Wayback Machine is simply down. Dang! That’s all; I can’t complain much, because this is a free service. The save function of the Wayback Machine has been unreliable, but it is the only option I have. Let’s just try my luck there.
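Trying my luck could also mean trying more than once. Below is a minimal sketch of a retry loop with a growing pause, which may help with the 429 (Too Many Requests) responses seen above. Everything named here is an assumption for illustration: the functions request_save and save_with_retry, the MAX_TRIES and RETRY_DELAY variables, their defaults, and the choice to treat any 2xx or 3xx status as success.

```shell
#!/usr/bin/env bash
# Sketch only: names, defaults, and the success rule are assumptions.

# Print the HTTP status code of a save request for one URL.
request_save() {
  curl -s -o /dev/null -w '%{http_code}' "https://web.archive.org/save/$1"
}

# Try to save a URL up to MAX_TRIES times, doubling the wait after each failure.
save_with_retry() {
  local url="$1"
  local tries="${MAX_TRIES:-3}"
  local delay="${RETRY_DELAY:-10}"
  local attempt=1
  local status
  while [ "$attempt" -le "$tries" ]; do
    status=$(request_save "$url")
    case "$status" in
      2*|3*)  # assumption: any 2xx/3xx answer counts as a successful save
        echo "saved ${url}"
        return 0
        ;;
    esac
    echo "attempt ${attempt}: got ${status} for ${url}" >&2
    if [ "$attempt" -lt "$tries" ]; then
      sleep "$delay"
      delay=$((delay * 2))
    fi
    attempt=$((attempt + 1))
  done
  echo "gave up on ${url}" >&2
  return 1
}
```

In a CI job you would probably want to ignore the non-zero return value (for example with `save_with_retry "$url" || true`), so that a flaky archive service does not fail the whole blog build.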

