Over the past few months with my mentors at ZEIT ONLINE I have been doing a lot of thinking about how to better monitor a news organization’s performance in the social media space. To successfully monitor how stories propagate into the social world you are going to need access to a few things:
- A feed of the home page ( or whatever list of articles/urls you want to monitor ).
- A machine to collect and store what you find.
- A way to find the metrics from the social media world.
The home page is a good source because it is always changing so it will provide you with a constant list of urls to collect that the news organization feels are important enough to show on their home page. This is usually provided in some sort of xml ( via rss ) or json format that is constantly updated and publicly available for you to access at any time. In other words you need a harvester to constantly pick the stories from the top of the pile.
Once you have access to those resources you can start to aggregate them to create an access point that you and others can use to make sense of the data. Below is a small diagram to explain visually exactly what I just mentioned followed by an explanation of exactly how I am harvesting ZEIT ONLINE.
Starting off in the top left, Amo provides me with all the share values for a given url. It is constantly being used to refresh a story’s share counts. Google+ did not have an API for getting the number of +1s a url has received so I had to create one myself. Facebook returns their share data in XML. None of the services from the social world was available in JSON except for the Twitter API. Without Amo I would have to constantly poll Twitter, Facebook and Google separately and then merge them all together inside of my app code. Amo is just an abstraction of social share APIs all wrapped up into one nice “likeable” JSON object.
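To make the idea of one merged share object concrete, here is a minimal sketch of the merge step. It is not Amo’s actual code or output format: the function name `mergeShareCounts`, the field names and the sample counts are all assumptions made up for illustration.

```javascript
// Hypothetical merge of per-service share counts into one object.
// Field names are assumptions, not Amo's real output.
function mergeShareCounts(url, counts) {
  const services = ['twitter', 'facebook', 'googleplus'];
  const shares = {};
  let total = 0;
  for (const service of services) {
    // Treat a missing or non-numeric count as zero so that one failed
    // service does not break the whole object.
    const n = Number(counts[service]);
    shares[service] = Number.isFinite(n) && n > 0 ? Math.floor(n) : 0;
    total += shares[service];
  }
  return { url, shares, total };
}

// Made-up sample counts; the Google+ value is deliberately missing.
const merged = mergeShareCounts('http://www.zeit.de/index', {
  twitter: 12,
  facebook: '30', // some services hand back counts as strings
});
console.log(JSON.stringify(merged, null, 2));
```

The point of the wrapper is that the app code only ever sees this one shape, no matter how each service formats its response.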
The next piece you see is the “Harvesting Layer”. It feeds the database everything that it needs to serve up the “API and Caching” layer. The following is how I update the database from the constantly polling feed of ZEIT ONLINE articles.
- Grab a list of the articles from the home page and compare them to the recent articles I have collected. In my case grabbing the ZEIT home page was easy. They use XSLT to drive their whole site; in other words, their whole site is one open API for me to consume. For example, go to the ZEIT home page. Once you are there replace the “www” with “xml” ( xml.zeit.de/index ). What you will find is a massive XML structure that represents all of the data and metadata behind the page you see there. It is pretty much a playground for developers. If you are using node make sure to check out xml-simple, it is the best XML parser I know of. Inside the centerpage you will find the feed. If you want to harvest another source you will have to become familiar with its output structure in order to properly break it down in your app. You really only need three key ingredients: publish date, title and url.
- Strip out any illegal characters that JSON won’t parse ( this can happen when dealing with non-English content, for example ). Make sure you set the encoding properly to deal with umlauts. Put the new article inside of the database with Mikeal’s request. Because I used CouchDB, I needed a unique identifier. I used a nice little node utility to create a UUID for every url. At this point you are done with the URL harvesting.
- Set up a worker to handle all of the updating of the share objects. For this I created a tool to constantly poll the database to check for articles. Once I have a list of the articles I want to track I send them off to Amo to collect the share information. Once I have the share information I put everything back into the database and create a new revision. Another great reason to use CouchDB is how easy it is to go back in time and review all of the past revisions. This allows me to really track a story’s growth and see when it is on the rise.
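The first two steps above can be sketched roughly like this. It is only an illustration under assumptions: the helper names, the document fields and the example.com URLs are made up, and Node’s built-in `crypto.randomUUID` stands in for the UUID utility I mentioned. Parsing the feed and writing to the database are left out.

```javascript
const { randomUUID } = require('crypto');

// Unescaped control characters (U+0000 to U+001F) are not allowed inside
// JSON strings, so replace them and collapse the leftover whitespace.
// Umlauts and other non-ASCII letters are left untouched.
function cleanText(text) {
  return String(text).replace(/[\u0000-\u001F]/g, ' ').replace(/\s+/g, ' ').trim();
}

// Keep only feed items whose url has not been stored yet.
function findNewArticles(feedItems, knownUrls) {
  const seen = new Set(knownUrls);
  const fresh = [];
  for (const item of feedItems) {
    if (!item.url || seen.has(item.url)) continue;
    seen.add(item.url); // also guards against duplicates within one feed
    fresh.push(item);
  }
  return fresh;
}

// Build the document to store: the three key ingredients plus a unique id.
function toDocument(item) {
  return {
    _id: randomUUID(),
    url: item.url,
    title: cleanText(item.title),
    published: item.published,
  };
}

// Made-up feed items for illustration.
const feed = [
  { url: 'http://example.com/a', title: 'Erste\u0007 Meldung über Berlin', published: '2012-05-01T08:00:00Z' },
  { url: 'http://example.com/b', title: 'Zweite Meldung', published: '2012-05-01T09:00:00Z' },
];
const docs = findNewArticles(feed, ['http://example.com/b']).map(toDocument);
console.log(docs);
```

Only the first item survives here, because the second url is already in the list of known urls.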
At this point you will have a constant feed of articles being monitored and collected as well as their share counts being updated at the same time. The database is filling up and the next step is to set up a way to access this from the outside. This is where the API layer comes in. You are going to want to sanitize all of the data you collected in a way that any developer in the future who wants to use your stuff can understand. Creating an API means defining a layer of urls to your data that will remain consistent and will hide the complexity of a system like this for future developers who rely on your data. I really like express because it is easy to create routes and from those routes you can return whatever you like. In my case the database returns me a filtered result set based on publish date. Luckily with couch this is really easy. I just added a stream proxy to my express route and passed the data parameters requested to the database and boom anyone in the ZEIT network now is able to filter ZEIT ONLINE articles by date.
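As a rough sketch of the date filter, the helper below turns a requested date range into a CouchDB view query. The `startkey`, `endkey` and `include_docs` parameters are standard CouchDB view options, but the database, design document and view names are placeholders, and it assumes a view that emits the publish date as an ISO date string. The express route and the stream proxy are not shown.

```javascript
// Build the path for a CouchDB view query limited to a date range.
// '/articles/_design/articles/_view/by_date' is a placeholder path.
function buildDateRangeQuery(from, to) {
  const params = new URLSearchParams();
  // CouchDB expects view keys to be JSON-encoded, so strings keep their quotes.
  if (from) params.set('startkey', JSON.stringify(from));
  if (to) params.set('endkey', JSON.stringify(to));
  params.set('include_docs', 'true');
  return '/articles/_design/articles/_view/by_date?' + params.toString();
}

const path = buildDateRangeQuery('2012-05-01', '2012-05-31');
console.log(path);
```

A route handler would pass the date parameters from the incoming request to a helper like this and pipe the database response straight back to the caller.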
Six months into being a Knight-Mozilla Fellow, in what has proven to be the best year of my life, 2012 continues to get better in the ZEIT ONLINE news room in Berlin. Over a year of planning leads up to today. We moved into a beautiful brand new office next to Anhalter Bahnhof. The dust is still settling as the movers put the last pieces of furniture in place but I can already start to feel the energy of the news room buzzing throughout the place.
A question we are all often asked when we meet new people is “What do you do?”. I used to find that question annoying unless I was talking to other technical people because it meant that I had to explain to them the details of a highly technical field in order for them to get it. While I love telling people about hardware virtualization and all of the details of the work I did at IBM, it seems that most people get lost in that conversation and immediately switch the topic. Now I revel in the opportunity to explain what exactly it is that I spend time doing on the day to day hack with Mozilla and ZEIT ONLINE. It gives me a chance to explain how exciting working in an open way for the news room can be. Of course it comes with a unique set of challenges just like any software situation these days but ZEIT has done a great job at making it easy for a developer to get access to all of the proper tools necessary to get the job done.
On the other hand, Berlin does a really great job setting a creative vibe. With graffiti all over the city, endless open airs and a thriving music scene to explore it is almost as though the whole city is one big canvas for anything that has to do with the arts. This city is still being rebuilt and is starting to bud into one of the most creative cities in the world. Still being relatively cheap it is starting to attract a diverse set of makers and artists as well as a diverse group of start ups. Working as a fellow at ZEIT gives me a chance to be right in the middle of a transforming city. Mozilla will also be moving to the kiez so as a fellow you will have a space to show everyone all the awesome you are making!
Being a fellow gives you unique access to a very great resource: the other news fellows. One of the best parts about being an open news fellow is thinking outside of the box with your counterparts and getting to spend time with them in random cities all over the world. The first group of news fellows at this point has become one big family and when we have issues with things we are building we look to one another for advice. I find that to be a priceless aspect of being an open news fellow. The camaraderie formed between us is unique given our situation. We are all placed in some of the world’s most renowned news rooms acting almost as spies, sharing the info we find with one another in order to improve not only ourselves but the current news model.
If there is one down side to being an open news fellow it would be the fact that you have one of the coolest jobs in the world and after your 10 months of amazing are over you are left wondering “What now? How can anything top this?!”. I still have four months left in this fellowship and every day I wake up I look forward to hopping on my bike and scurrying through the Berlin streets heading into the office and working with my friends at ZEIT ONLINE. There are few things I can think of that are better than the life of an open news fellow. Luckily for you, you get a chance to do it too!
The German blogosphere came together last week at re:publica to kick-start Berlin Web Week. A few weeks before this Kai Bennerman approached me with an idea about tracking this conference as it was being talked about in real time via Twitter. He had a few ideas about some different keywords that he wanted to follow that were being tagged with the #rp12 hashtag. My approach to building this widget first started with the streaming API that Twitter provides. The only downside is that using the streaming API would require some server pieces to work properly, and the assets were not available to build such a service in the time frame that we had to complete this project. I decided to use Twitter’s public search API instead as my platform to build a jQuery plugin on top of.
The first idea was simple. Take the tweets that contain the rp12 hashtag and keep those that also contain any of the following words: ‘session’, ‘track’, ‘vortrag’, ‘talk’ and ‘panel’. At first you would think that you would just include these words inside of the query string that we send to the Twitter API, but if you do that the results seem to get over-filtered, and it raises the complexity of the query string, which increases your chance of getting rate limited by the search API. Another problem with adding the items to the query string is that Twitter would return tweets to you ONLY if all words were matched. I need an OR scenario, not an AND.
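One way to get the OR behaviour is to query only for the hashtag and then filter the results in the plugin. The sketch below shows that filter step on its own. It is not the plugin’s actual code: the function names and sample tweets are made up, and it assumes each result is an object with a `text` field.

```javascript
const KEYWORDS = ['session', 'track', 'vortrag', 'talk', 'panel'];

// Escape characters that have a special meaning inside a regular expression.
function escapeRegExp(s) {
  return s.replace(/[.*+?^${}()|[\]\\]/g, '\\$&');
}

// One case-insensitive pattern that matches ANY keyword as a whole word.
function buildKeywordMatcher(keywords) {
  const pattern = keywords.map(escapeRegExp).join('|');
  return new RegExp('\\b(' + pattern + ')\\b', 'i');
}

// Keep only the tweets whose text contains at least one keyword.
function filterTweets(tweets, keywords) {
  const matcher = buildKeywordMatcher(keywords);
  return tweets.filter((tweet) => matcher.test(tweet.text));
}

// Made-up sample tweets for illustration.
const sample = [
  { text: 'Great talk on open data #rp12' },
  { text: 'Coffee queue is long #rp12' },
  { text: 'Der Vortrag beginnt gleich #rp12' },
  { text: 'Still talking about lunch #rp12' },
];
const matched = filterTweets(sample, KEYWORDS);
console.log(matched.map((tweet) => tweet.text));
```

Matching on whole words keeps “talking” from counting as “talk”. Note that `\b` only knows about ASCII word characters, which is fine for this keyword list but would need more care for keywords with umlauts.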
You can find the code on the openNews github.
In the city of Berlin finding good coffee was a much harder task than I anticipated. Luckily I found this place right off Bernauer Straße: Bonanza Coffee lives up to its name. An easy 5 minute bike ride from my flat every morning for what I must say is the best coffee I have found so far.
During the train ride from Hamburg to Berlin this past Friday I was doing some thinking on how exactly to extract the count of plus ones for any given url. I knew that the API would not support my request and that was about it. I spent a few hours examining the button’s source code to try and figure out exactly what kind of voodoo they were doing to let you magically embed the button.
Typical of most “embeddable widgets” like this is an init script. Google gives you their version of this init script within the “Getting Started” section for the +1 Button ( plusone.js ). This script sets some globals ( among other things ) and then embeds the main script. In this particular case, the script that is being embedded is doing a lot of different things that I did not take the time to fully understand because I quickly found the main thing I was looking for: the point at which the browser creates the URL of the iframe that contains the actual button html. Once I found that it was easy for me to use jsdom and jquery to scrape the src of the iframe that is embedded by the main script.
Feel free to clone the simple utility I created to help me gain access to the number of plus ones for a given url. I will post more detail of the project I am building this script for in the future. The inline comments should be enough to help you get started.