Web scraping with Common Lisp: cookies and stuff

It’s been a long time since my last post, but let’s pick up where I left off.

Read part 1 here!

Often you want to grab some data from a website but you can’t just send a get request to a page, you need to log in first. How does the site even know you’re logged in? Well, attached to each request is a bunch of cookies, which are essentialy name-value pairs. Moreover, the server’s response may update or add new cookies, which are faithfully stored by your browser for some period of time. So when you are logging in to a site the following happens:

  1. You send a POST request with your name and password as parameters.
  2. The server responds by setting up your cookies in a way that allows it to recognize your future requests. Usually it sets a cookie that contains your “session id”, which uniquely identifies your current browser session.
  3. When you make any requests after that, the cookie that contains session id is sent along with them, and the server assumes you are logged in.

As you can see, the whole algorithm hinges on the fact that your browser must store and resend the cookie that the server has set up. And when you are making requests through a library or a command-line tool such as curl or wget, the responsibility to store and pass the cookies lies upon you.

Ok, so with Common Lisp we’re using the DRAKMA library. By default it will not send any cookies, or do anything with received cookies. However if you pass a special cookie jar object as a keyword parameter to http-request, it will send cookies from it, and update them based on the server’s response. If you use the same cookie jar object to POST a login request and later to retrieve some data, usually this will be enough to trick the server into serving you the content behind the authentication wall.

    (let ((cookie-jar (make-instance 'drakma:cookie-jar)))
        (drakma:http-request login-url :method :post :parameters login-parameters :cookie-jar cookie-jar)
        (drakma:http-request data-url :cookie-jar cookie-jar))

I think it’s annoying to always write “:cookie-jar cookie-jar” for every request, so in my library webgunk a special variable *webgunk-cookie-jar* is passed as the requests’ cookie jar (it’s nil by default). So you can instead:

    (let ((*webgunk-cookie-jar* (make-instance 'drakma:cookie-jar)))
        (http-request login-url :method :post :parameters login-parameters)
        (http-request data-url))

Special variables are sure handy. In webgunk/modules I created an object-oriented API that uses this feature and webgunk/reddit is a simple reddit browser based on it. Here’s the code for authorization:

(defmethod authorize ((obj reddit-browser) &key login password)
  (let ((login-url "https://ssl.reddit.com/api/login"))
    (with-cookie-jar obj
      (http-request login-url :method :post
                    :parameters `(("api_type" . "json")
                                  ("user" . ,login)
                                  ("passwd" . ,password))))))

where with-cookie-jar is just

(defmacro with-cookie-jar (obj &body body)
  `(let ((*webgunk-cookie-jar* (get-cookie-jar ,obj)))

Note that logging in isn’t always as easy. Sometimes the server’s procedure for setting the cookies is rather tricky (e.g. involving Javascript and redirects). However you almost always can trick the server that you’re logged in by logging in with your browser and then copying the cookie values from your browser (this is known as session hijacking, except you’re only hijacking your own session so it’s ok).

For example, I used to play an online game called SaltyBet, in which you place imaginary money bets on which character will win in a fight. The outcome could be predicted by analyzing the past fights of each character. After losing a million of SaltyBucks due to suboptimal betting, I have built a system that would collect the results of past fights from SaltyBet’s own website, calculate and display the stats for each character and also show their most recent fights, and the biggest upsets that they have been involved in. This was incredibly effective and I was able to recoup my lost money twice over!

But anyway, the data was available only to paid members so I needed to log in to scrape it. And the method described above did not work. In the end what worked was a simple:

(defparameter *cookies* 
  '(("PHPSESSID" . "4s5u76vufh0gt9hs6mrmjpioi0")
    ("__cfduid" . "dc683e3c2eb82b6c050c1446d5aa203dd1376731139271")))

(defmethod authorize ((obj saltybet-browser) &key)
  (clear-cookies obj)
  (loop for (name . value) in *cookies*
       do (add-cookie obj (make-instance 'drakma:cookie :name name :value value
                                         :domain ".saltybet.com"))))

How did I get these values? I just copy-pasted them from my Firefox! They were good for a few days, so it wasn’t much hassle at all. Sometimes a stupid solution is the most effective one.