November 29, 2019

Patreon Archiver

Patreon is a platform that supports people who make things. I gladly donate to a few creators in exchange for high-quality content. The Patreon website itself serves as a content delivery channel: creators post updates there and provide access to paid content.

Unfortunately the Patreon page is slow, loads a bunch of ad trackers, and is hard to navigate. The pagination gives access to only a few posts at a time, and the performance of the page really makes you not want to load more… Additionally, when you stop your paid support for a creator you also lose access to everything that was posted.

To remedy this situation somewhat, I created a small utility that uses a headless Chrome instance via puppeteer to download that data and provide it in the form of single-page HTML files that contain just the posted content and nothing more.
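
For the curious, the core of the approach looks roughly like this. The following is a minimal sketch, not the actual scraper code; the URL and output file name are just placeholders, and the real tool additionally reuses your login session and strips the page down to the post content.

import puppeteer from "puppeteer";
import { promises as fs } from "fs";

// Sketch: let headless Chrome fully render a Patreon page, then save the
// resulting markup to disk.
async function archivePage(url: string, outFile: string) {
  const browser = await puppeteer.launch({ headless: true });
  const page = await browser.newPage();
  await page.goto(url, { waitUntil: "networkidle2" });
  const html = await page.content(); // the fully rendered HTML
  await fs.writeFile(outFile, html);
  await browser.close();
}

archivePage("https://www.patreon.com/darknetdiaries", "darknetdiaries.html").catch(console.error);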

Setup

You will need to have node.js installed. Either get it through your operating system's package manager or use nvm.

Clone the patreon scraper repository and install the dependencies:

$ git clone https://github.com/rksm/patreon-scraper
$ cd patreon-scraper
$ npm install

Usage & Customization

You will need a Patreon account and (obviously) will only have access to the content that your account currently has access to.

The first step is to log in to Patreon using the scraper. In the patreon-scraper directory, run

npx ts-node open_browser.ts

This will open a Chrome window. Log in using your Patreon credentials, then completely close the Chrome window and quit the application.
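
Closing the window matters because the login session needs to end up on disk so the headless runs below can reuse it. Conceptually this looks like the following sketch; the profile directory name is my assumption for illustration, not necessarily what open_browser.ts actually uses.

import puppeteer from "puppeteer";

// Sketch: open a visible Chrome window backed by a persistent profile
// directory. After you log in and close the window, the session cookie is
// stored in that profile and later headless launches that pass the same
// userDataDir can reuse it.
async function openForLogin() {
  const browser = await puppeteer.launch({
    headless: false,
    userDataDir: "./chrome-profile", // example name, an assumption
  });
  const page = await browser.newPage();
  await page.goto("https://www.patreon.com/login");
  // Log in by hand, then close Chrome yourself; the script deliberately
  // does not call browser.close().
}

openForLogin().catch(console.error);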

Next, check whether headless Chrome can find the Patreon login cookie. Run

npx ts-node open_browser.ts --check-cookies

It should print the cookie data in a JSON-like representation:

{
  name: 'session_id',
  value: '8hSpzJbzLX4ZaMV4Zes_YJqb2OkcSdMEqsxOdzSsylg',
  domain: '.patreon.com',
  path: '/',
  expires: 1637185860.958448,
  size: 53,
  httpOnly: true,
  secure: true,
  session: false,
  sameSite: 'Lax',
  sameParty: false,
  sourceScheme: 'Secure',
  sourcePort: 443
}
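
The check itself boils down to finding a non-expired session_id cookie for the .patreon.com domain. A sketch of that logic (the cookie type is reduced to the fields needed here, and this is not necessarily the repository's exact code):

// Minimal shape of the cookie fields we care about.
type Cookie = { name: string; domain: string; expires: number };

// Return the Patreon login cookie if it exists and has not expired yet.
function findPatreonSession(cookies: Cookie[]): Cookie | undefined {
  return cookies.find(
    (c) =>
      c.name === "session_id" &&
      c.domain === ".patreon.com" &&
      c.expires > Date.now() / 1000 // `expires` is a unix timestamp in seconds
  );
}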

Now to download and archive actual content: each creator gets their own “campaign id”, a number that identifies their content and posts. To figure out the campaign id of a given Patreon page, run

npx ts-node fetch_campaign_id.ts https://www.patreon.com/darknetdiaries

It should print out that number (replace the URL obviously). In the case of Darknet Diaries it is 1682532.
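
If you are curious how such a lookup can work: one way is to render the creator page and search the markup for a campaign_id key, as in the sketch below. The exact key name is an assumption on my part and may break whenever Patreon changes its page structure.

import puppeteer from "puppeteer";

// Sketch: load the creator page in headless Chrome and grep the rendered
// source for a numeric campaign id. Assumes a "campaign_id" key appears
// somewhere in the markup.
async function guessCampaignId(creatorUrl: string): Promise<string | null> {
  const browser = await puppeteer.launch({ headless: true });
  const page = await browser.newPage();
  await page.goto(creatorUrl, { waitUntil: "networkidle2" });
  const html = await page.content();
  await browser.close();
  const match = html.match(/"campaign_id"\s*:\s*"?(\d+)/);
  return match ? match[1] : null;
}

guessCampaignId("https://www.patreon.com/darknetdiaries").then(console.log);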

With that id you can now download the content and convert it into a single-page HTML document:

npx ts-node ./fetch_data.ts --campaign_id 1682532 --data_dir darknetdiaries_data --patreon_url https://www.patreon.com/darknetdiaries/
npx ts-node ./render_data.ts --campaign_id 1682532 --data_dir darknetdiaries_data --patreon_url https://www.patreon.com/darknetdiaries/

That is it. The folder darknetdiaries_data should now contain an HTML file with the creator’s posts as well as a JSON file with the raw data.
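
The rendering step is conceptually simple: read the raw JSON and emit one self-contained HTML document. Here is a sketch of that idea; the file names and the post fields (title, published_at, content) are made up for illustration and do not necessarily match the raw data format the scraper writes.

import { promises as fs } from "fs";

// Hypothetical post shape; the real raw data layout may differ.
type Post = { title: string; published_at: string; content: string };

// Sketch: turn an array of posts into a single HTML page.
async function renderPosts(jsonFile: string, htmlFile: string) {
  const posts: Post[] = JSON.parse(await fs.readFile(jsonFile, "utf8"));
  const body = posts
    .map((p) => `<article><h2>${p.title}</h2><time>${p.published_at}</time>${p.content}</article>`)
    .join("\n");
  await fs.writeFile(
    htmlFile,
    `<!doctype html><html><head><meta charset="utf-8"></head><body>${body}</body></html>`
  );
}

renderPosts("darknetdiaries_data/posts.json", "darknetdiaries_data/posts.html").catch(console.error);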
