— words and things

the news harvester

Over the past few months with my mentors at ZEIT ONLINE I have been doing a lot of thinking about how to better monitor news organizations performance in the social media space. To successfully monitor how stories propagate into the social world you are going to need access to a few things:

  1. A feed of the home page ( or whatever list of articles/urls you want to monitor ).
  2. A machine to collect and store what you find.
  3. A way to find the metrics from the social media world.

The home page is a good source because it is always changing so it will provide you with a constant list of urls to collect that the news organization feels are important enough to show on their home page. This is usually provided in some sort of xml ( via rss ) or json format that is constantly updated and publicly available for you to access at any time. In other words you need a harvester to constantly pick the stories from the top of the pile.

Once you have access to those resources you can start to aggregate them together to create an access point so that yourself and others can use to make sense of the data. Below is a small diagram to explain visually exactly what I just mentioned followed by an explanation of exactly how I am harvesting ZEIT ONLINE.


Starting off in the top left, Amo provides me with all the share values for a given url. It is constantly being used to provide feedback to a story’s share counts. Google+ did not have an API for getting the number of +1s a url has received so I had to create one myself. Facebook returns their share data in XML. None of the services from the social world was available in JSON except for the Twitter API. Without Amo I would have to constantly poll Twitter, Facebook and Google separately and then merge them all together inside of my app code. Amo is just an abstraction of social share API’s all wrapped up into one nice “likeable” JSON object.

The next piece you see is the “Harvesting Layer”. It feeds the database everything that it needs to serve up the “API and Caching” layer. The following is how I update the database from the constantly polling feed of ZEIT ONLINE articles.

  1. Grab a list of the articles from the home page and compare them to the recent articles I have collected. In my case grabbing the ZEIT home page was easy. They use xslt to drive their whole site, in other words their whole site is one open API for me to consume. For example, go to the ZEIT home page. Once you are there replace the “www” with “xml” ( xml.zeit.de/index ). What you will find is a massive XML structure that represents all of the data and meta data behind the page you see their. It is pretty much a playground for developers. If you are using node make sure to check out xml-simple, it is the best xml parser i know of. Inside the centerpage you will find the feed. If you want to harvest another source you will have to become familiar with their output structure in order to properly break it down in your app. You really only need three key ingredients: publish date, title and url.
  2. Strip out any illegal characters that JSON won’t parse ( this can happen when dealing with non-english content, for example ). Make sure you set the encoding properly to deal with umlauts. Put the new article inside of the database with Mikeals’ request. Because I used CouchDB, I needed a unique identifier. I used a nice little node utility to create a UUID for me for every url. At this point you are done with the URL harvesting.
  3. Set up a worker to handle all of the updating of the share objects. For this I created a tool to constantly poll the database to check for articles. Once I have a list of the articles I want to track I send them off to Amo to collect the share information. Once I have the share information I put everything back into the database and create a new revision. Another great reason to use CouchDB is for how easy it is to go back in time and review all of the past revisions. This allows me to really track a stories growth and see when it is on the rise.

At this point you will have a constant feed of articles being monitored and collected as well as their share counts being updated at the same time. The database is filling up and the next step is to set up a way to access this from the outside. This is where the API layer comes in. You are going to want to sanitize all of the data you collected in a way that any developer in the future who wants to use your stuff can understand. Creating an API means defining a layer of urls to your data that will remain consistent and will hide the complexity of a system like this for future developers who rely on your data.  I really like express because it is easy to create routes and from those routes you can return whatever you like. In my case the database returns me a filtered result set based on publish date. Luckily with couch this is really easy. I just added a stream proxy to my express route and passed the data parameters requested to the database and boom anyone in the ZEIT network now is able to filter ZEIT ONLINE articles by date.