
Web Mining via Phantom.js and Sharpkit

When it comes to web scraping, driving a real, physical web browser is a chore. Not only is it a problem to open 10+ copies of IE, but the whole idea of popping up UI on a machine that may not have an operator to close it should things go awry feels, well, very wrong to me.

Unfortunately, there is no headless WebKit library for .NET: current approaches either offer Java or JavaScript bindings or, alternatively, suggest that you use IKVM. However, this got me thinking: if an IKVM-driven port works (albeit not very well, at least in terms of performance), then how about going in the direction of JavaScript solutions such as Phantom.js?

And then it hit me: I’ve already got a transcompiler called SharpKit that knows how to translate C# to JavaScript. So what if I were to simply write my code in C# and let Phantom.js execute it as JavaScript?

What is Phantom.js, anyway?

Phantom.js exposes a dual API: it has its own top-level API as well as APIs exposed through CommonJS modules such as webpage and fs. As a result, to get this API within SharpKit, I had to trawl through these APIs and generate corresponding C# classes and methods for them. This is actually fairly trivial – all I have to do is decorate each class with [JsType(JsMode.Prototype)] and the members with [JsMethod]/[JsProperty], and everything works.

As you can see, it’s impossible to redefine property names, so the C# API has lost some of its gloss. But that’s not critical. What’s important is that I’ve been able to replicate the API of both the phantom and WebPage objects in C#. I didn’t go for the CommonJS modules because it seemed that the existing features, together with the DOM and jQuery APIs provided by SharpKit, are more than enough.
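For the curious, prototype-mode classes translate to plain constructor functions with methods hung off the prototype. The sketch below only illustrates that shape – the method body is a placeholder, not SharpKit’s actual output or Phantom’s real loading logic:

```javascript
// Rough sketch of the shape a [JsType(JsMode.Prototype)] class takes after
// translation: a constructor function plus methods on its prototype.
// The body of open() is a placeholder, not real Phantom.js behavior.
function WebPage() {
  this.content = "";
}
WebPage.prototype.open = function (url, callback) {
  // In real Phantom.js this would load the page; here we just report success.
  callback("success");
};

var page = new WebPage();
page.open("http://example.com", function (status) {
  console.log(status); // prints "success" with this placeholder body
});
```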

Trying It Out

To try things out, I rewrote the Phantom pizza-searching example to use C#. Here it is, in all its glory:

var page = new WebPage();
string url = "http://lite.yelp.com/search?find_desc=pizza&find_loc=94040&find_submit=Search";
page.open(url, status => {
  if (status != "success") console.log("Unable to access network");
  else
  {
    var result = page.evaluate(() =>
    {
      var list = document.querySelectorAll("span.address");
      var pizza = new JsArray<string>();
      for (int i = 0; i < list.length; i++)
      {
        pizza.push(list[i].innerText);
      }
      return pizza;
    });
    console.log(result.As<JsArray<string>>().join("\n"));
  }
  Phantom.exit();
});

Surprisingly enough, there aren’t that many caveats to note here. One issue I did have is how to return types from methods that should, in all fairness, be made generic. In the end, I decided it’s better to leave the return type as JsObject and then use the As<>() method to cast the result to whatever I actually need. The explicit type would have to appear eventually anyway – unless of course we went with dynamic, but then again, I don’t know if that’s even possible.
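To make the cast concrete: As<>() exists purely to satisfy the C# type checker and, as far as I can tell, leaves no trace in the emitted JavaScript. Here is an approximation of what the evaluate() callback boils down to, with querySelectorAll stubbed out by a plain array since this sketch runs outside a browser:

```javascript
// Approximation of the JavaScript the evaluate() callback compiles to.
// The As<JsArray<string>>() call from the C# side simply disappears here.
// "list" stands in for document.querySelectorAll("span.address"), since
// there is no DOM outside a browser; the addresses are made up.
var list = [{ innerText: "123 Main St" }, { innerText: "456 Elm St" }];
var pizza = [];
for (var i = 0; i < list.length; i++) {
  pizza.push(list[i].innerText);
}
console.log(pizza.join("\n")); // one address per line
```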

Needless to say, the example worked out of the box – which shouldn’t really be surprising, considering that I peeked into the resulting JS file and saw what was generated.

Persisting Data

There’s no getting around the fact that, with this approach, your main data mining process will be spawning an extra process for PhantomJS to do its work. Given that Phantom seems to support only the fs module, the expected mechanism for communication between the mining service and Phantom would appear to be file storage. (This also implies that progress monitoring is, effectively, impossible.)

At any rate, I have replicated fs in SharpKit, adding just enough of its methods to get the ball rolling. Annoyingly, I had to declare the require() instruction in a rather strange fashion:

[JsType(JsMode.Prototype, Export=false, Name = "require")]
public static class Require
{
  [JsMethod(Name="require", Global = true)]
  public static JsObject Module(string moduleName)
  {
    return null;
  }
}

There’s probably a better way out there, as this approach requires me to use the As<>() method again to get the right module type. Ah well.

// not very pretty
var fs = Require.Module("fs").As<FileSystem>();

Given that PhantomJS uses Qt, Python and who knows what else, it’s hardly surprising that its file system API doesn’t like standard Windows paths such as c:\somewhere.txt. Not a big deal, really, because we can just write to the current directory, have our web miner read off the data, and then delete the file when we’re done. The file name would typically be provided as a parameter to the phantomjs.exe process (we get the parameters as a collection, remember?).
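Should you need an absolute path anyway, a trivial workaround is to flip the separators before handing the path to the script. The helper below (normalizePath is a made-up name, not part of Phantom’s API) shows the idea:

```javascript
// Phantom's file system API is happier with forward slashes, so the host
// can normalize a Windows path before passing it in. Hypothetical helper,
// not part of Phantom's API.
function normalizePath(winPath) {
  return winPath.replace(/\\/g, "/");
}

console.log(normalizePath("c:\\mining\\results.txt")); // prints c:/mining/results.txt
```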

It’s All Gone Pear-Shaped

Having verified that the file system calls work (somewhat), I’ve decided to jump in and rewrite one of my WatiN-driven workflows using this new PhantomJS framework. That’s when I hit a couple of fairly serious problems:

  • First of all, Phantom refused to work over SSL. It took some searching to figure out that, in the latest dynamic release, the authors simply forgot to include one of the required DLLs. Luckily, that was easy to fix.
  • The next annoying thing I discovered is that doing something wrong within Phantom basically causes your program to either hang or exit without any explanation. There is no debugging as far as I can see, which makes tracking things down impossible (compare this with WatiN, where debugging is sensible).
  • The jQuery API turned out to be insufficient for the purposes of data mining. That’s when I realized where the business value of WatiN lies – the ability to write something like browser.Forms[0].TextField(t => t.Name.StartsWith("test")) is what makes WatiN a lot more usable. True, you can try to get the same result with jQuery selectors, but those typically fail silently and are not statically checked the same way WatiN-based LINQ expressions are.

That last bullet point is the killer, as far as I can tell. Essentially, it means that if I want a good LINQ-driven selection mechanism, I have to write my own APIs on top of jQuery to give me the collections in the format I want. Furthermore, even if I did that (and it is a fairly substantial task), there would still be the problem of debugging into these collections as they are iterated and processed.
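For the record, here is the kind of predicate-driven selection I would have to build myself on top of the DOM. Both textFields and the stand-in document below are hypothetical – they only show the shape of the API, mimicking the WatiN lambda quoted earlier:

```javascript
// Hypothetical predicate-based selector, mimicking WatiN's
// TextField(t => t.Name.StartsWith("test")) on top of querySelectorAll.
function textFields(doc, predicate) {
  var inputs = doc.querySelectorAll("input[type=text]");
  var matches = [];
  for (var i = 0; i < inputs.length; i++) {
    if (predicate(inputs[i].name)) matches.push(inputs[i]);
  }
  return matches;
}

// Stand-in document, since there is no DOM outside the browser.
var fakeDoc = {
  querySelectorAll: function () {
    return [{ name: "testUser" }, { name: "email" }, { name: "testPass" }];
  }
};

var hits = textFields(fakeDoc, function (n) { return n.indexOf("test") === 0; });
console.log(hits.length); // prints 2
```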
