Over the past few months with my mentors at ZEIT ONLINE I have been doing a lot of thinking about how to better monitor a news organization's performance in the social media space. To successfully monitor how stories propagate into the social world you are going to need access to a few things:
- A feed of the home page ( or whatever list of articles/urls you want to monitor ).
- A machine to collect and store what you find.
- A way to find the metrics from the social media world.
The home page is a good source because it is always changing so it will provide you with a constant list of urls to collect that the news organization feels are important enough to show on their home page. This is usually provided in some sort of xml ( via rss ) or json format that is constantly updated and publicly available for you to access at any time. In other words you need a harvester to constantly pick the stories from the top of the pile.
Once you have access to those resources you can start to aggregate them together to create an access point that you and others can use to make sense of the data. Below is a small diagram to explain visually exactly what I just mentioned, followed by an explanation of exactly how I am harvesting ZEIT ONLINE.
Starting off in the top left, Amo provides me with all the share values for a given url. It is constantly being used to update a story’s share counts. Google+ did not have an API for getting the number of +1s a url has received so I had to create one myself. Facebook returns their share data in XML. None of the services from the social world was available in JSON except for the Twitter API. Without Amo I would have to constantly poll Twitter, Facebook and Google separately and then merge them all together inside of my app code. Amo is just an abstraction of the social share APIs, all wrapped up into one nice “likeable” JSON object.
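To make the merging idea concrete, here is a minimal sketch in Python. It is not Amo's actual code: the function name, the field names and the sample numbers are all made up for illustration, and it assumes the per-network counts have already been fetched.

```python
import json


def merge_share_counts(url, per_network):
    """Combine per-network share counts for one url into a single dict.

    `per_network` maps a network name (hypothetical keys such as "twitter")
    to a count, or to None when that network could not be reached.
    """
    # Treat a missing count as zero so the total is always a number.
    counts = {name: (count if count is not None else 0)
              for name, count in per_network.items()}
    return {
        "url": url,
        "shares": counts,
        "total": sum(counts.values()),
    }


if __name__ == "__main__":
    merged = merge_share_counts(
        "http://www.zeit.de/index",
        {"twitter": 12, "facebook": 30, "googleplus": None},  # made-up numbers
    )
    print(json.dumps(merged, indent=2, sort_keys=True))
```

The point of the single object is that the app code only ever deals with one shape, no matter how many networks sit behind it.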
The next piece you see is the “Harvesting Layer”. It feeds the database everything that it needs to serve up the “API and Caching” layer. The following is how I update the database from the constantly polling feed of ZEIT ONLINE articles.
- Grab a list of the articles from the home page and compare them to the recent articles I have collected. In my case grabbing the ZEIT home page was easy. They use XSLT to drive their whole site; in other words their whole site is one open API for me to consume. For example, go to the ZEIT home page. Once you are there replace the “www” with “xml” ( xml.zeit.de/index ). What you will find is a massive XML structure that represents all of the data and metadata behind the page you see there. It is pretty much a playground for developers. If you are using node make sure to check out xml-simple; it is the best XML parser I know of. Inside the centerpage you will find the feed. If you want to harvest another source you will have to become familiar with their output structure in order to properly break it down in your app. You really only need three key ingredients: publish date, title and url.
- Strip out any illegal characters that JSON won’t parse ( this can happen when dealing with non-English content, for example ). Make sure you set the encoding properly to deal with umlauts. Put the new article inside of the database with Mikeal’s request module. Because I used CouchDB, I needed a unique identifier. I used a nice little node utility to create a UUID for me for every url. At this point you are done with the URL harvesting.
- Set up a worker to handle all of the updating of the share objects. For this I created a tool to constantly poll the database to check for articles. Once I have a list of the articles I want to track I send them off to Amo to collect the share information. Once I have the share information I put everything back into the database and create a new revision. Another great reason to use CouchDB is how easy it is to go back in time and review all of the past revisions. This allows me to really track a story’s growth and see when it is on the rise.
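The first two steps above can be sketched roughly like this. This is a Python illustration, not the node code I actually run: the function names, the document fields and the choice of a deterministic `uuid5` per url are assumptions made for the example, and the feed is assumed to be already parsed into dicts with the three key ingredients.

```python
import uuid


def clean_text(text):
    """Drop control characters below U+0020 (except newline and tab).

    Umlauts and other non-ASCII letters are left untouched.
    """
    return "".join(ch for ch in text if ch >= " " or ch in "\n\t")


def new_documents(feed_entries, known_urls):
    """Return database-ready documents for urls we have not collected yet.

    `feed_entries` is a list of dicts with "url", "title" and "publish_date";
    `known_urls` is any iterable of urls already stored.
    """
    seen = set(known_urls)
    docs = []
    for entry in feed_entries:
        url = entry["url"]
        if url in seen:
            continue  # already harvested, or repeated within this feed
        seen.add(url)
        docs.append({
            # Assumption: derive the id from the url so it is stable across runs.
            "_id": str(uuid.uuid5(uuid.NAMESPACE_URL, url)),
            "url": url,
            "title": clean_text(entry["title"]),
            "publish_date": entry["publish_date"],
        })
    return docs
```

Deriving the id from the url is only one option; a random UUID works too as long as you look articles up by url before inserting.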
At this point you will have a constant feed of articles being monitored and collected as well as their share counts being updated at the same time. The database is filling up and the next step is to set up a way to access this from the outside. This is where the API layer comes in. You are going to want to sanitize all of the data you collected in a way that any developer in the future who wants to use your stuff can understand. Creating an API means defining a layer of urls to your data that will remain consistent and will hide the complexity of a system like this from future developers who rely on your data. I really like express because it is easy to create routes and from those routes you can return whatever you like. In my case the database returns me a filtered result set based on publish date. Luckily with CouchDB this is really easy. I just added a stream proxy to my express route and passed the date parameters requested on to the database and, boom, anyone in the ZEIT network is now able to filter ZEIT ONLINE articles by date.
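In my setup the database does the date filtering and express only proxies the request, but the logic behind the route is simple enough to sketch in memory. The following Python is a stand-in, not the real route: the function name and the `publish_date` field in `YYYY-MM-DD` form are assumptions for the example.

```python
from datetime import date


def filter_by_publish_date(articles, start, end):
    """Return articles published within [start, end], oldest first.

    Each article is a dict whose "publish_date" is an ISO string
    such as "2012-05-01"; `start` and `end` are datetime.date objects.
    """
    selected = [a for a in articles
                if start <= date.fromisoformat(a["publish_date"]) <= end]
    # ISO date strings sort chronologically, so a plain string sort is enough.
    return sorted(selected, key=lambda a: a["publish_date"])
```

A route would parse the two dates from the query string, call something like this (or hand them to the database), and return the result as JSON.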
Six months into being a Knight-Mozilla Fellow, in what has proven to be the best year of my life, 2012 continues to get better in the ZEIT ONLINE news room in Berlin. Over a year of planning has led up to today. We moved into a beautiful brand new office next to Anhalter Bahnhof. The dust is still settling as the movers put the last pieces of furniture in place but I can already start to feel the energy of the news room buzzing throughout the place.
A question we are all often asked when we meet new people is “What do you do?”. I used to find that question annoying unless I was talking to other technical people because it meant that I had to explain to them the details of a highly technical field in order for them to get it. While I love telling people about hardware virtualization and all of the details of the work I did at IBM, it seems that most people get lost in that conversation and immediately switch the topic. Now I revel in the opportunity to explain what exactly it is that I spend time doing on the day to day hack with Mozilla and ZEIT ONLINE. It gives me a chance to explain how exciting working in an open way for the news room can be. Of course it comes with a unique set of challenges just like any software situation these days, but ZEIT has done a great job at making it easy for a developer to get access to all of the proper tools necessary to get the job done.
On the other hand, Berlin does a really great job setting a creative vibe. With graffiti all over the city, endless open airs and a thriving music scene to explore it is almost as though the whole city is one big canvas for anything that has to do with the arts. This city is still being rebuilt and is starting to bud into one of the most creative cities in the world. Still being relatively cheap it is starting to attract a diverse set of makers and artists as well as a diverse group of start ups. Working as a fellow at ZEIT gives me a chance to be right in the middle of a transforming city. Mozilla will also be moving to the kiez so as a fellow you will have a space to show everyone all the awesome you are making!
Being a fellow gives you unique access to a very great resource, the other news fellows. One of the best parts about being an open news fellow is thinking outside of the box with your counterparts and getting to spend time with them in random cities all over the world. The first group of news fellows at this point has become one big family and when we have issues with things we are building we look to one another for advice. I find that to be a priceless aspect of being an open news fellow. The camaraderie formed between us is unique given our situation. We are all placed in some of the world’s most renowned news rooms acting almost as spies, sharing the info we find with one another in order to improve not only ourselves but the current news model.
If there is one down side to being an open news fellow it would be the fact that you have one of the coolest jobs in the world and after your 10 months of amazing are over you are left wondering “What now? How can anything top this?!”. I still have four months left in this fellowship and every day I wake up I look forward to hopping on my bike and scurrying through the Berlin streets heading into the office and working with my friends at ZEIT ONLINE. There are few things I can think of that are better than the life of an open news fellow. Luckily for you, you get a chance to do it too!
The German blogosphere came together last week at re:publica to kick start Berlin web week. A few weeks before this Kai Bennerman approached me with an idea about tracking this conference as it was being talked about in real time via Twitter. He had a few ideas about some different key words that he wanted to follow that were being tagged with the #rp12 hashtag. My approach to building this widget first started with the streaming API that Twitter provides. The only downfall is that using the streaming API would require some server pieces to work properly, and the assets were not available to build such a service in the time frame that we had to complete this project. I decided to use Twitter’s public search API instead as my platform to build a jQuery plugin on top of.
The first idea was simple. Filter tweets that contain the rp12 hashtag down to those that also contain one of the following words: ‘session’, ‘track’, ‘vortrag’, ‘talk’ and ‘panel’. At first you would think that you would just include these words inside of the query string that we send to the Twitter API, but if you do that the results seem to get over filtered and the complexity of the query string goes up, which increases your chance of getting rate limited by the search API. Another problem with adding the items to the query string was that Twitter would return tweets to you ONLY if all words were matched. I needed an OR scenario, not an AND.
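The way around this is to query only for the hashtag and do the OR matching yourself on the results. The plugin itself is jQuery, so the following Python is only a sketch of the filtering logic; the function names and the `text` field on each tweet are assumptions for the example.

```python
import re

# The keywords from the post; any one of them is enough for a match.
KEYWORDS = ("session", "track", "vortrag", "talk", "panel")


def matches_any_keyword(text, keywords=KEYWORDS):
    """True if at least one keyword appears as a whole word in the text."""
    words = set(re.findall(r"\w+", text.lower()))
    return any(keyword in words for keyword in keywords)


def filter_tweets(tweets, keywords=KEYWORDS):
    """Keep only the tweets whose text matches at least one keyword (OR)."""
    return [t for t in tweets if matches_any_keyword(t["text"], keywords)]
```

Note that this matches whole words only, so "talking" would not count as "talk". Whether that is what you want depends on how noisy the hashtag is.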
You can find the code on the openNews github.
I find myself in this position often these days. Stuck in a new city with no real sense of where I am. I think this is due to the fact that I still have yet to purchase a smart device to navigate me throughout these cities. This is going to change as soon as I get back to Berlin. I was scheduled to give a talk at BBC at 1300 yesterday but after I had taken the tube the wrong way I knew that I was not going to be able to make it since I was now halfway up the northern line going the wrong direction. Thankfully Andrew from the BBC is awesome and let me reschedule for later in the year.
Right about the time I informed him I was not going to make it Nicola Hughes sent me a chat and said “You should come talk at the guardian today”. I gladly accepted and off I was southbound on the northern line. I made it to King’s Cross, which is the closest stop I could find on the tube map ( which is a beautiful piece of graphic design ), and headed north toward the Guardian’s building. The first thing I noticed when I walked in was that this place is HUGE. I came from IBM which is a massive corporation globally, but thankfully I was in a building off of the main campus and it was just a core group of about 15 of us working together. At the Guardian it is just one massive open working space on three different floors. I am not convinced that I would be very productive in that type of environment due to the large number of people and distractions moving around but it was fascinating to see everyone in this huge work space.
So after having less than an hour to prepare for this talk, there I was in front of 20+ developers who were waiting for me to share something cool with them. I decided that it would be best for me to just give as many super fast lightning talks as I could, introducing them to new tech or even just things I had worked on in the past. So this is an attempt to run down what I talked about in a blog post so others could hopefully find some new content or ideas from which to be inspired.
I started off talking about Mozilla’s partnership with The Living Docs Project by showing two of the projects from the original hack day I took part in at Mozilla’s offices in San Francisco that sort of proved this type of collaboration could really work. You should also check out the Wired.com blog because they did a really great post on the hack day. During this demonstration one of the main points I wanted to get across was that all of this is so new and that most people are still not really sure how they can make video interactive or if they even want to do this. To me it is a totally new exciting way of dealing with video on the web but to some film makers it is frightening because they do not want to take away from the art of their film making. One thing I really wish to communicate to all film makers is that this type of work and collaboration between hackers and film makers is only going to add quality to the story you are trying to tell. Working with technologists is going to let you tell the story you want, while at the same time giving the user / listener / watcher ways to discover so much more related to your story or cause.
The next thing I gave a quick overview of was the work we are starting to do at Zeit Online. The thing I love most about Zeit Online is that there are no limits to what we can do. Places like the BBC or The Guardian are great news organizations, but they are massive and a bit controlled by the politics that is a reality of everyday life within such a large organization. It seems to me that this is holding back the true potential of what it means to be an OpenNews fellow. I did not realize until I walked around The Guardian yesterday how lucky I am to be at a place like Zeit Online. It has not yet reached this massive scale and I get to work alongside the Editor In Chief and his posse to help them realize their vision. This is golden and it is just beginning. I am really interested to see the progress of each of the fellows at the end of this year and how much of it was truly open. With that said I was really impressed by how the Guardian is putting a lot of their development out in the open now. They have some really really smart people there doing some really amazing things. However in order to get the chance to do this work in the open they had to really push their cause to the upper management to let them do this. In the end it is just code that they are putting out. They are not releasing the brains behind the code, and that is where the real power lies. Code is just code; it is not the ability to solve really hard problems. More organizations should realize this and know that releasing this code is going to help others not only learn but innovate.
The next thing I talked about was the Twitter streaming API and the awesome maps that are being built on top of it. I gave a demonstration of the Twittermap I worked on with one of the smartest developers I know, Nate Hunzacker. He uses his own version of NLP called Speakeasy to calculate the sentiment of tweets in any given place in the world. In other words it is pretty freaking awesome.
After that I went on a social media rant about the google+ api but it turns out that no one at the Guardian or even in all of London thinks that google+ is relevant to sharing data, and if they do then they are in a tiny minority. This is not so much the case in Berlin. I see most of my colleagues at Zeit Online posting data constantly about the news in Germany. I am interested to see how this battle plays out in the coming years. Will google+ be the next google wave? They are trying too hard but failing. I had an interesting chat with a developer at the Guardian yesterday who told me that they are now seeing more traffic coming from facebook shares than they are from any google search or service. The only thing I really love about google+ is the hangouts. I taught my grandma and sister how to use them so I could hang out with them back in North Carolina from Europe.
I took a few questions and that was pretty much the end of my day at the guardian. Then I went to the awesome Mozilla offices in London for our Mozilla London Office party!!!
During the train ride from Hamburg to Berlin this past Friday I was doing some thinking on how exactly to extract the count of plus ones of any given url. I knew that the API would not support my request and that was about it. I spent a few hours examining the button’s source code to try and figure out exactly what kind of voodoo they were doing to let you magically embed the button.
Typical of most “embeddable widgets” like this is an init script. Google gives you their version of this init script within the “Getting Started” section for the +1 Button ( plusone.js ). This script sets some globals ( among other things ) and then embeds the main script. In this particular case, the script that is being embedded is doing a lot of different things that I did not take the time to fully understand because I quickly found the main thing I was looking for: the point at which the browser creates the URL of the iframe that contains the actual button html. Once I found that it was easy for me to use jsdom and jquery to scrape the src of the iframe that is embedded by the main script.
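To show only the last step, pulling the `src` out of an iframe once the markup exists, here is a small Python sketch using the standard library. It is not the utility I wrote: jsdom actually runs the button script so the iframe gets created, whereas this assumes you already have the rendered html in hand. The class and function names are made up for the example.

```python
from html.parser import HTMLParser


class IframeSrcCollector(HTMLParser):
    """Collect the src attribute of every iframe tag that has one."""

    def __init__(self):
        super().__init__()
        self.sources = []

    def handle_starttag(self, tag, attrs):
        if tag == "iframe":
            src = dict(attrs).get("src")
            if src:
                self.sources.append(src)


def iframe_sources(html):
    """Return the src values of all iframes found in an html string."""
    collector = IframeSrcCollector()
    collector.feed(html)
    collector.close()
    return collector.sources
```

From there you would request that url and read the count out of the button markup it returns, which is the part that depends on whatever the widget happens to emit at the time.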
Feel free to clone the simple utility I created to help me gain access to the number of plus ones for a given url. I will post more detail of the project I am building this script for in the future. The inline comments should be enough to help you get started.
WHOOOOAAAA! I need to put my foot on the “digital brake” for a second to slow down enough to write up a synopsis of the past 6 months ( mostly because the resilient Dan Sinker is forcing me to, it is much needed either way ). Since I last posted here I have been around the world ( thrice ) and back. I have had the privilege to have hacked alongside some geniuses at the New York Times Open Hack at the NYTimes Headquarters, The Living Docs Hack Day in San Francisco at the Mozilla Offices ( which has one of the best views [ fast forward to minute one if you want to see it ] I have ever seen at an office ), Hacktoberfest in Berlin and the Mozilla Festival in London. I want to recap a little bit of what I have been up to since I made the move away from my IBM family back in Raleigh, North Carolina, where some really cool server technology is being created by some of the smartest people I have ever met as part of IBM’s smarter planet initiative focused on networking and the cloud.
As much as I want to tell you about my moving experience ( you can read more about my first day in Berlin if you wish ) and the crazy fun nights I have had in the city so far ( omg! berlin! ), I am going to skip all of that and give you an overview of my experience ( so far ) with the news partner Zeit Online and how excited I am to be working along side their futurist media visionaries over the next 10 months. It did not take long for them to summon me to their offices. Officially I do not start working with them until March 1, but unofficially I was working with them 48 hours after I hit the ground in Berlin.
They have had a very long time to plan and to brainstorm internally, so they were very very prepared. Luckily I had been prepared to counter their preparedness. Armed with multiple machines, iPods, iPads and an extreme amount of awesome code within them all as well as extreme ideas, I was ready to blow their minds with what I planned to bring to their organization. The cool thing about this fellowship is that there is no command structure really when it comes to what can or must be done. The Knight Foundation and Mozilla are just putting faith in all the news fellows to do whatever needs to be done to make the future of journalism. I am coming in with my ideas, merging them with the news partner’s and then building everything in the open on github. I had someone ask me “Who is your boss?” I replied with “No one and nothing more than inspiration, focus flow and the mighty unyielding wrath of Dan Sinker’s beard.” ( imagine the looks that response invoked ). The even cooler thing about this fellowship is that we are all in it together, openly, on the same level, working alongside one another to help each other reach what we decide are our mutual goals, and we do that the best we can with the tools we have or the tools that we create. It does not matter if you are the editor in chief or the intern, if you have big ideas then bring it and let’s take it from idea to prototype as quickly and efficiently as possible.
Not only do we have plans to build tools that the news arena has never seen before, we plan to open source them all so every other news organization can take advantage of the blood sweat and beers we have put into them. We are basing the majority of these tools around the social media atmosphere and putting the consumers first, giving them full control of what they see and share. We are also building tools to redefine how news organizations understand comments internally to help better understand what people are talking about within their site. We want to help users of news media not only better understand what is happening from a news perspective but also how the users can have a massive impact on how that news gets shared. When we as consumers share news from places like Zeit, the BBC or the Guardian, we are sharing it exclusively with our network, a network that we ( from a news organization as well as user perspective ) may not understand fully. Over the next few months we are going to help news publishers as well as news consumers better understand the networks that are reading and sharing media worldwide. We will do this by diving head first into these networks and analyzing them for many things.
I have given you a very high level view of what we have been discussing, drawing on the white boards and building in our Berlin offices. I understand: talk is cheap. I am currently in Hamburg at the other Zeit offices meeting all the brains behind the front and back end development. Soon ( and very soon ) you will have working code that you can fork, manipulate and use as your own within your news company. I am going into this with the idea that these tools cannot be built exclusively for Zeit Online, they must be scalable for all institutions. The tools I am working on will be very well commented and as generic as possible to allow for optimal merge and simple manipulation. This means more work on my end, but the reward will be multiplied by the hard work and focus that I have already started capitalizing on. The energy that is being produced by the merging of ideas here at Zeit Online is awe inspiring and will be enough to keep anyone excited about what we are forging here. Stay tuned to the openNews blogs ( if you do not have us in your RSS feed you are already falling behind ) as well as my counterpart fellows throughout news rooms all over the world. Also hop in the bi-monthly open news calls where we discuss what is going down in our news rooms. You do not want to miss what we are doing because I believe that it is going to make a huge difference and help to shape the future of the news media not only within Germany but all around the world.