Be a Web Browser for Halloween!
October 28th, 2005What are you dressing up for this Halloween? If you haven’t chosen a costume yet, let me suggest one. Using the command line tools “tcpdump” and “curl,” you can masquerade so convincingly, almost any web site will think you’re Safari!
Something that came up recently on the AppleScript-Users list was a query about how to use curl to essentially automate a web page access as if it were coming from Safari.
This is one of those things that is pretty easy to do, but you have to have had a reason to learn about all the tools required to do it. I figured that even some of the smarty-pants who read my blog might get a little education out of a clear tutorial on the steps required.
As a simple example, I’ll use Yahoo’s “Stock Portfolio” page. If you want to follow along at home, you’ll need to create a Yahoo account and a simple stock portfolio page listing ticker symbols and prices. If you are feeling intelligent you should also be able to follow along with any other site that depends on cookies and/or User-Agent to function properly.
For years I have used Yahoo as a repository for my “quick glance” addiction to stock stock market prices. I check this page usually at least once a day, and to make things easier, I even put in a Safari-specific keyboard shortcut (ctrl-Y) so I can easily bounce over there when I’m waiting for a compile, or whatever. Let’s imagine that instead of wasting web browsing time going to that page, I want to write a script to fetch the contents of this page on a daily basis and generate a plain-text summary to deliver to my Mail inbox. This is not easy to do unless you know how to convince Yahoo that you’re a real web browser. The URL to my stock portfolio page looks like this:
If you copy that URL and paste it into a browser that doesn’t have Yahoo cookies configured (with “remember me” selected), it will bring you to a Yahoo login page. Similarly, if you attempt to load the URL with curl’s default options:
curl -D "./MyHeaders" "http://finance.yahoo.com/p?v&k=pf_2"
You get a very boring, unhelpful response:
Additionally, a 302 Found
error was encountered while trying to use an ErrorDocument to handle the request.
The -D option to curl asks it to dump the headers it receives along with the response. If we examine the headers, we see that they contain the mojo that makes a regular browser redirect you to the “Please Login” page:
iBook> cat MyHeaders HTTP/1.1 302 Found Date: Fri, 28 Oct 2005 14:41:09 GMT P3P: policyref="http://p3p.yahoo.com/w3c/p3p.xml", CP="CAO DSP COR CUR ADM DEV TAI PSA PSD IVAi IVDi CONi TELo OTPi OUR DELi SAMi OTRi UNRi PUBi IND PHY ONL UNI PUR FIN COM NAV INT DEM CNT STA POL HEA PRE GOV" P3P: policyref="http://p3p.yahoo.com/w3c/p3p.xml", CP="CAO DSP COR CUR ADM DEV TAI PSA PSD IVAi IVDi CONi TELo OTPi OUR DELi SAMi OTRi UNRi PUBi IND PHY ONL UNI PUR FIN COM NAV INT DEM CNT STA POL HEA PRE GOV" Location: http://login.yahoo.com/?.src=quote&.intl=us&.mesg=You%20have%20requested%20a%20portfolio%20that%20requires%20a%20Yahoo!%20ID%20and%20a%20password.%20Please%20sign%20in.&.done=http://finance.yahoo.com%2fp%3fv%26k%3dpf_2 Content-Type: text/html X-Cache: MISS from finance.yahoo.com Connection: close Transfer-Encoding: chunked
At this point, if we were just trying to load the login page with curl, we could add the “-L” (follow Location: hints) option and try again. But we’re not interested in logging in – I just want to see my stock results!
Depending on the server in question, a number of factors may prevent a simple curl request for a URL from behaving identically to the same URL being loaded in a web browser. The most common problems have to do with cookies and the User-Agent string. Although more and more sites seem to be relaxing their User-Agent specific tests on pages, you will occasionally come upon a site that loads brilliantly in Safari, yet from curl it arrogantly tells you “You are not running a supported browser.”
In the case of Yahoo! Stocks, we only need to accommodate the cookies. But for the sake of a complete demonstration, I will outline the steps required to observe and masquerade as a web browser supporting both cookies and a forged User-Agent string.
The general procedure is something like this:
- Surveil the model.
- Make and try out the costume.
- Condition costume for long-term use.
Step 1. Surveilling the Model
The first step to any successful costume preparation is taking a long, hard look at the subject being imitated. If you throw your costume together quickly and omit Marilyn’s mole or Groucho’s moustache and cigar, you’ll be the costume that nobody can guess at the Halloween Ball.
A convenient surveillance gadget built-in to your Mac’s arsenal of equipment is the “tcpdump” command-line tool. We’re going to use tcpdump to closely examine the appearance of our subject as it walks into Yahoo Stocks. Then we’ll use the same tool to repeatedly try out our costume, until we’re convinced it will work. Fortunately, Yahoo won’t laugh at us for showing up at the party in an ever-more refined version of our get-up.
The tcpdump tool doesn’t boast the most user-friendly interface in the world, so I’m going to ask you to just “do what I say,” and let you become more familiar with the myriad options it supports on your own time. Over time I have developed a sense for the options I like to use to make the surveillance job easier. These options are so common, that I use an alias to tcpdump for my everyday chores. The alias on my computer looks like this:
alias tcpd='sudo tcpdump -Atq -s 0 -i en1'
The “-i en1” bit at the end is because I typically work over an Airport network connection. I’m instructing the tcpdump tool to listen on that interface, as opposed to my (utterly dull, disconnected) wired ethernet port. If you’re using a standard ethernet connection, you will probably want to omit the “-i” option altogether, or else replace it with the network interface you know you’re connecting through. I believe it will default to the en0 interface if the option is omitted. In the examples that follow, I will use “tcpd” as shorthand for the alias defined above. If you are following along at home and don’t want to define an alias, you’ll have to paste the above (modified as necessary for your network interface) into the examples wherever you see “tcpd.”
To set up our sting operation, we want to focus tcpdump’s attention on the site in question. Since tcpdump defaults to all the traffic coming in or out of your computer, it can be a bit overwhelming if you don’t filter the results in some way. One of the easiest ways to filter the results of the tool is to specify a particular host whose network communications you are interested in. This is done with a simple “host” parameter, which we can tack on to the end of our alias with all its options:
iBook> tcpd host finance.yahoo.com tcpdump: verbose output suppressed, use -v or -vv for full protocol decode listening on en1, link-type EN10MB (Ethernet), capture size 65535 bytes
At this point, if you go to your web browser and navigate to the target URL, you will see a mountain of data spew into your Terminal window. It’s important to make sure that your Terminal window is configured for “Unlimited Scrollback,” to be sure that you don’t miss the all-important beginning of every web transaction. You can adjust this setting in Terminal by selecting “Show Info” from the File menu, and then selecting “Buffer” from the pop-up menu in the floating info window. What I like to do is leave a tcpdump process running while I do my surveillance, periodically clearing (cmd-K) the contents of the Terminal window so I can be sure that what I’m about to test will be the only thing showing up in the window.
When I navigate to my Yahoo Stocks page from Safari, the output in my tcpdump Terminal window includes a bunch of data preceded by these all important lines:
GET /p?v&k=pf_2 HTTP/1.1 Accept: */* Accept-Language: en Accept-Encoding: gzip, deflate Cookie: [Omitted to Protect my Innocence] Referer: http://finance.yahoo.com/p?v&k=pf_2 User-Agent: Mozilla/5.0 (Macintosh; U; PPC Mac OS X; en) AppleWebKit/412.7 (KHTML, like Gecko) Safari/412.5 If-Modified-Since: Fri, 28 Oct 2005 15:16:17 GMT Connection: keep-alive Host: finance.yahoo.com
This is the heart of the costume. For a GET request in particular, anything you need to simulate about the way a browser interacts with the server is summarized here. The mole, glasses, moustache, cigar, and even fingerprints are all included in the above text. If you can emulate that, you’re almost undetectable as an intruder (if the site relies on JavaScript interpretation in the client, then you may be out of luck. That scenario is beyond the scope of this article).
To get a sense of what we’re starting with, let’s save a copy of the above “reference point” and clear the screen in Terminal for some more surveillance. We want to take a look at our curl approach in the buff, before we even attempt to clothe it in a garish Safari cloak. In a separate Terminal window, run the same curl command we ran earlier (the -D header dump is optional):
curl -D "MyHeaders" "http://finance.yahoo.com/p?v&k=pf_2"
Back in the surveillance window, we see something like the following near the top of our output:
GET /p?v&k=pf_2 HTTP/1.1 User-Agent: curl/7.13.1 (powerpc-apple-darwin8.0) libcurl/7.13.1 OpenSSL/0.9.7g zlib/1.2.3 Host: finance.yahoo.com Pragma: no-cache Accept: */*
My, our costume is very bare indeed! At this point, you could start frantically imitating every nuance of Safari’s output, in an effort to become unmistakably Safari-like. But the standard for imitation on the web isn’t as strict as that. You’re liable to pass without detection by imitating only a small subset of the behaviors your subject exhibits. When putting on these charades, it’s wise to don only one piece of fancy dress at a time, and then make a go of crashing the party.
Step 2: Trying on the Costume
The first thing you try on might depend on a number of factors. You’ll get the feel for what you need to do as you gain experience with this kind of undercover work. In many cases like this, where what we’re trying to do is to take advantage of an “remember me” type cookie that automatically logs the user in, it’s obvious that we’re going to need to imitate the cookies on some level. The easiest way to pass cookies through curl is with the “–cookie” option, which allows you to quite simply paste in to the command line the exact string of cookies you wish to hand to the browser. We can copy the string directly from the saved Safari tcpdump, and paste it like so:
curl -D "MyHeaders" --cookie '[Omitted to Protect my Innocence]' "http://finance.yahoo.com/p?v&k=pf_2"
It’s important to use single-quotes around the cookie text, because if your cookies look anything like mine, they will generate command-line parsing errors if you don’t. When I run the above command-line, lo-and-behold, it works! Spewing out into my Terminal window is the HTML content including all of my favorite ticker symbols, along with their current prices. That was easy. This is what the curl transaction looks like under surveillance, now:
HTTP/1.1 User-Agent: curl/7.13.1 (powerpc-apple-darwin8.0) libcurl/7.13.1 OpenSSL/0.9.7g zlib/1.2.3 Host: finance.yahoo.com Pragma: no-cache Accept: */* Cookie: [Omitted to Protect my Innocence]
In this case we got lucky, and the costume worked on the first try. But let’s suppose that Yahoo were more persnickety about User-Agents. If a site you’re attempting to fool is too smart for its own good and accuses you of being an “unsupported browser,” curl offers an easy workaround. Looking back at our original snippet, we see that Safari advertised itself to the browser as “Mozilla/5.0 (Macintosh; U; PPC Mac OS X; en) AppleWebKit/412.7 (KHTML, like Gecko) Safari/412.5”. Phew! That’s a mouthful. Who knows if we need to copy all of it or only a part of it to convince the server we’re legit. We might as well copy all of it since we’re playing games of misidentification anyway. The -A option to curl gives you easy access to the User-Agent string it sends along to the server. Again, single-quotes are probably a good idea, depending on the content of the string and your shell of choice:
curl -D "MyHeaders" -A 'Mozilla/5.0 (Macintosh; U; PPC Mac OS X; en) AppleWebKit/412.7 (KHTML, like Gecko) Safari/412.5' --cookie '[Omitted to Protect my Innocence]' "http://finance.yahoo.com/p?v&k=pf_2"
GET /p?v&k=pf_2 HTTP/1.1 User-Agent: Mozilla/5.0 (Macintosh; U; PPC Mac OS X; en) AppleWebKit/412.7 (KHTML, like Gecko) Safari/412.5 Host: finance.yahoo.com Pragma: no-cache Accept: */* Cookie: [Omitted to Protect my Innocence]
Oh it’s just too easy! You’re probably thinking to yourself now: “why did this guy even write this article? It’s child’s play!” It’s true, but it’s a child’s play that most people simply aren’t familiar with! Similarly to the “User-Agent” string, other strings can be easily tweaked by passing the right curl option. For example, to change the “Referer” URL, just pass along the desired string to the “-e” option. To get an exhaustive list of the options available to you, simply run curl with the “-h” option.
So that’s it. We’re done, right? Not quite. The successful command line solution above works today, right this minute, but there’s no guarantee it will work for the long haul! This is where things get tough.
Step 3. Reinforcing for the Long Haul
When you go to the Yahoo Stocks page every day, from the same browser, you can avoid entering your login information by asking it to “remember you.” When you come along tomorrow, your browser passes up the cookies it got yesterday, and Yahoo says “Dude, I totally know you from somewhere!” (in a Bill & Ted voice, I’m told). But when you slack off on your visits to Yahoo, your cookies get stale. In general, the way to solve this problem in curl is to configure a “cookie jar” for future interaction with the server. The cookie jar is a file used not only for sending cookies to the server, but also for saving new cookies it returns after you load a page. This is similar to what a browser is doing when you repeatedly visit a page on the same domain. The browser looks into its “.yahoo.com” cookie jar, and sees what it can find. Anything that looks appropriate gets tacked onto the “Cookie:” header of the outgoing request.
For our example, we might realize that while the trick is working for now, in a few minutes, hours, days, or months, the gig might be up. We will have lost our credibility because we’ll be passing today’s cookies at a time when the server is no longer trusting of them. To establish a cookie jar, use the –cookie-jar option to curl:
mkdir ~/MyHacks/ curl --cookie-jar ~/MyHacks/YahooCookieJar --cookie '[Omitted to Protect my Innocence]' "http://finance.yahoo.com/p?v&k=pf_2"
Now examine the cookie jar to see what we’ve got. Yikes! It’s empty. Well, ain’t that the darned-tootenest thing. Examining the tcpdump response from Yahoo, it’s easy to observe that in fact, there is no “Set-Cookie:” header in its response. The tricky devil! Examining the source of the resulting page, it looks like they might use some kind of JavaScript trick to set the cookie after the page loads. At this point you might be hoping I don’t spend any more of your precious reading time explaining how we might get around this obstacle. If you are, then you’re in luck. I’m tired of typing, and Yahoo is turning out to be a worse example than I had hoped for. I just don’t actually care to make this work for the long term. If you do, then the avenue I’d recommend pursuing is to see whether tickling another URL (e.g. the Yahoo home page) causes the desired “updated cookies” to be returned and therefore deposited into the cookie jar.
Assuming the page you’re accessing does set cookies in a more predictable manner, you should see inside your cookie jar file a list of everything the server sent to curl. Keep this file around, and from this point forward, don’t pass any literal cookies – just let curl use the cookie jar on your behalf:
curl --cookie-jar ~/MyHacks/YahooCookieJar "http://finance.yahoo.com/p?v&k=pf_2"
There! Now we’ve got you walking and talking like a real web browser! You can still insert whatever user-agent, referer, etc., arguments you need into the command line above. The world is your oyster!
My little run-in with Yahoo’s cookie stinginess demonstrates that the road to consistent imitation is not always bump-free, but hopefully with the tools described in this article you’ll consider yourself better equipped to make a go at crashing the browser party. Good luck, and be sure to take that fake moustache off before you show up at work tomorrow!
October 28th, 2005 at 6:52 pm
Nice, the TCP Dump intro will probably come in handy some day…
Now, if there were just some slick way of pushing fresh cookies into safari so you can stay logged into sites with a one day timeout you don’t visit everyday (http://kstruct.com/?p=7)
November 2nd, 2005 at 9:11 am
Matt: I wonder if there is some way of using Safari’s JavaScript commands to push the cookies in, similarly to how the sites themselves do it. Or maybe it’s possible to use XmlHTTPRequest or whatever to quietly reload a page in the background every so often to update the cookie?
November 30th, 2005 at 12:19 pm
Daniel: for monitoring HTTP headers, Firefox’s http://livehttpheaders.mozdev.org/ is very helpful (and maybe a bit friendlier than tcpdump).
November 30th, 2005 at 12:30 pm
Thanks, Beau. I’ll check that out. unfortunately, the mozdev.org site seems slammed to the point of inaccessibility right now. Maybe Firefox 1.5 is too successful!