When it comes to web scraping, having a real, physical web browser is a chore. Not only is it a problem to open 10+ copies of IE, but the whole idea of showing UI on a machine that may not have an operator to close it should things go awry feels, well, very wrong to me.
What is Phantom.js, anyway?
Phantom.js is a dual API: it contains both its own APIs as well as APIs exposed through the
fs modules of CommonJS. As a result, to get this API within SharpKit, I would have to trawl through these APIs and generate corresponding C# classes and methods for them. This is actually fairly trivial – all I have to do is decorate the class with
[JsType(JsMode.Prototype)] and the methods with
[JsMethod/JsProperty] and everything would work.
As you can see, it’s impossible to redefine property names, so the C# API has lost some of its gloss. But that’s not critical. What’s important is that I’ve been able to replicate the API of both the
WebPage objects in C#. I didn’t go for the CommonJS modules because it seemed that the existing features, together with the DOM and jQuery APIs provided by SharpKit, are more than enough.
Trying It Out
To try things out, I rewrote the Phantom pizza-searching example to use C#. Here it is, in all its glory:
Surprisingly enough, there aren’t that many caveats to note here. One issue I had was how to return types from methods that should, in all fairness, be made generic. In the end, I decided that it’s better to leave the return type as
JsObject and then use the
As<>() method to cast the result to whatever I actually need. The explicit type would have to appear eventually, unless of course we went with
dynamic – but then again, I don’t know if this is even possible.
Needless to say, the example worked out of the box, which shouldn’t really be surprising considering that I actually peeked into the resulting JS file and saw what was generated.
There’s no getting around the fact that, with this approach, your main data mining process will be spawning an extra process for PhantomJS to do its work. Given that Phantom seems to only support the
fs module, the expected mechanism for communication between the mining service and phantom would appear to be file storage. (This implies that progress monitoring is, effectively, impossible.)
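To make the file-based handoff a bit more concrete, here is a minimal sketch of the script side, written as the plain JavaScript that PhantomJS ends up running rather than as SharpKit C#. The record shape, the results.json file name and the serializeResults helper are all made up for illustration; fs.write and phantom.exit are the PhantomJS calls I am assuming, and they are guarded so the snippet does nothing special outside PhantomJS.

```javascript
// Turn scraped records into the text that gets written to the handoff file.
// Pure function: it uses no PhantomJS APIs, so it runs in any JavaScript engine.
function serializeResults(records) {
  return JSON.stringify({ count: records.length, records: records });
}

// Hypothetical scraped data and output file name, for illustration only.
var sampleRecords = [{ name: 'Example Pizza', phone: '555-0100' }];
var outputFile = 'results.json';

// Only touch PhantomJS-specific objects when actually running under PhantomJS.
if (typeof phantom !== 'undefined') {
  var fs = require('fs');
  // A relative name, so the file lands in the current working directory.
  fs.write(outputFile, serializeResults(sampleRecords), 'w');
  phantom.exit();
}
```

The mining service would then wait for the phantomjs process to exit and parse the file. Since nothing is written until the end, this does nothing about the progress monitoring problem.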
At any rate, I have replicated
fs in SharpKit, adding just enough of the methods in order to get the ball rolling. Annoyingly enough, I’ve declared the
require() instruction in a rather strange fashion:
There’s probably a better way out there, as this approach requires me to use the
As<>() method again to get the right module type. Ah well.
Given that PhantomJS uses Qt, Python and who knows what else, it’s hardly surprising that its file system API doesn’t like standard Windows paths such as
c:\somewhere.txt. Not a big deal, really, because we can just write to the current directory, read the data back in our web miner, and then delete the file when we’re done. The file name would typically be provided as a parameter to the
phantomjs.exe process (we get the parameters as a collection, remember?).
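As a rough sketch of that idea in plain JavaScript: take the first script argument, strip any directory or drive prefix so that only a bare file name remains, and fall back to a default otherwise. The helper names toLocalFileName and scriptArgs and the results.txt default are hypothetical. I am assuming phantom.args holds the script arguments in older PhantomJS builds and that the system module's args (with the script name in slot 0) is the replacement in later ones, so check this against the version you actually run.

```javascript
// Reduce whatever was passed in to a bare file name, so the script always
// writes into the current directory and never sees a Windows-style path.
function toLocalFileName(arg, fallback) {
  if (typeof arg !== 'string' || arg.length === 0) {
    return fallback;
  }
  // Drop everything up to the last slash, backslash, or drive colon.
  var name = arg.replace(/^.*[\\\/:]/, '');
  return name.length > 0 ? name : fallback;
}

// Fetch the script arguments; returns an empty list outside PhantomJS.
function scriptArgs() {
  if (typeof phantom !== 'undefined') {
    if (phantom.args) {
      return phantom.args; // assumed: older builds, script arguments only
    }
    return require('system').args.slice(1); // assumed: slot 0 is the script name
  }
  return [];
}

var outputFile = toLocalFileName(scriptArgs()[0], 'results.txt');
```

Stripping the path instead of rejecting it is a design choice: the caller can pass whatever it likes, and the script still only ever writes next to itself.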
It’s All Gone Pear-Shaped
Having verified that the file system calls work (somewhat), I decided to jump in and rewrite one of my WatiN-driven workflows using this new PhantomJS framework. That’s when I hit a couple of fairly serious problems:
- First of all, Phantom refused to work over SSL. It took some searching to figure out that, in the latest dynamic release, the authors simply forgot to include one of the required DLLs. Luckily, that was easy to fix.
- The next annoying thing I discovered is that doing something wrong within Phantom basically causes your program to either hang or exit without any explanation. There is no debugging as far as I can see, which makes tracking things down impossible (compare with WatiN, where debugging is sensible).
- The jQuery API turned out to be insufficient for the purposes of data mining. That’s when I realized where the business value of WatiN is – the ability to write something like
browser.Forms.TextField(t => t.Name.StartsWith("test")) is what makes WatiN a lot more usable. True, you can try to get the same result with jQuery selectors, but those typically fail silently and are not statically checked the same way WatiN-based LINQ expressions are.
That last bullet point is the killer, as far as I can tell. Essentially, what this means is that if I want to have a good LINQ-driven selection mechanism, I have to write my own APIs on top of jQuery to give me the collections in the format I want. Furthermore, were I to do this (and this is a fairly substantial task), there would still be a problem debugging into these collections as they are iterated and processed.
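To give a flavour of what the smallest piece of such a layer might look like, here is a sketch in plain JavaScript of a predicate-based lookup that fails loudly instead of silently. The whereOrThrow helper is hypothetical, and the field objects are stand-ins for real DOM inputs. It recovers none of the static checking that the WatiN expressions give you; it only turns an empty match into an error you can see.

```javascript
// Hypothetical helper: filter an array-like collection with a predicate and
// throw when nothing matches, instead of quietly returning an empty set.
function whereOrThrow(items, predicate, description) {
  var matches = Array.prototype.filter.call(items, predicate);
  if (matches.length === 0) {
    throw new Error('No element matched: ' + description);
  }
  return matches;
}

// Stand-ins for DOM input elements, for illustration only.
var fields = [
  { name: 'test_user', type: 'text' },
  { name: 'password', type: 'password' }
];

// Rough analogue of TextField(t => t.Name.StartsWith("test")).
var testFields = whereOrThrow(
  fields,
  function (f) { return f.type === 'text' && f.name.indexOf('test') === 0; },
  'text field whose name starts with "test"'
);
```

In a real page the collection would come from something like document.querySelectorAll('input') inside the page context. The thrown error would still need to be reported back to the outer script somehow, which brings us back to the debugging problem above.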