One such data source is Twitter. Every second, people from all over the globe send and share 140-character messages called “tweets” about what is happening around them. From politics to health to technology, real-time news on the latest trending topics can be obtained and shared. With the recent increase in talk about the Ebola virus and the rising death toll in the West African nations, we decided to analyze and visualize Ebola outbreaks using Twitter data.
The Ebola virus disease (EVD), formerly known as Ebola hemorrhagic fever, is continuing its rampage through West Africa, with the number of people infected doubling every three to four weeks. So far, more than 8,000 people are thought to have contracted the disease, and almost half of those have died, according to the World Health Organization. We wanted to analyze the correlation between the Ebola tweets coming from various countries and Ebola outbreaks in those communities. Analyzing the tweets could also help us understand what the global community was discussing when it came to Ebola, and whether Twitter was being used as a medium to create awareness and send alerts about the disease.
First step: get all ‘#Ebola’ tweets
- We used the Twitter Streaming API and Tweepy (a Python library for the Twitter API) to extract all tweets with the keyword “Ebola” and stored them in a CSV file. We then took a subset of a million tweets that included location information, for more reliable analysis.
- We visualized the Ebola stream in two different forms, historic and live. For the historic analysis, the first step was to collect all tweets related to the Ebola virus over a period of 5 days. For the live (dynamic) analysis, we used the real-time Twitter stream, keeping only tweets with location information.
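The core of that collection step, extracting the fields we needed from each streamed tweet and appending them to a CSV file, can be sketched as below. This is an illustration rather than our actual listener code; the field names follow Twitter's v1.1 JSON schema, and the helper names are our own:

```python
import csv
import json

def tweet_to_row(tweet):
    """Extract timestamp, text, and coordinates from one tweet payload.

    Returns None when the tweet carries no point coordinates, since only
    geotagged tweets were kept for the location-based analysis.
    """
    coords = tweet.get("coordinates")
    if not coords:
        return None
    lon, lat = coords["coordinates"]  # GeoJSON order is [longitude, latitude]
    return [tweet["created_at"], tweet["text"].replace("\n", " "), lat, lon]

def append_rows(raw_payloads, csv_path):
    """Append one CSV row per geotagged tweet, given raw JSON strings."""
    with open(csv_path, "a", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        for raw in raw_payloads:
            row = tweet_to_row(json.loads(raw))
            if row is not None:
                writer.writerow(row)
```

In a Tweepy stream listener, each `on_data` callback would hand its raw JSON string to logic like `append_rows`.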
Since ‘analysis is only as good as the data on which it is based’, data quality was our next priority. We cleansed the data set in RapidMiner, handling missing attributes and checking for invalid characters. We also used Excel functions such as Text to Columns to clean the data set. We then converted the latitude and longitude information of the tweets into location data using Steve Morse’s forward-geocoding tool*.
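We did this cleansing in RapidMiner and Excel, but the same two checks, stripping invalid characters and rejecting rows with missing or out-of-range coordinates, can be sketched in plain Python (an illustration, not the pipeline we actually ran):

```python
import re

def clean_text(text):
    """Strip control characters and collapse whitespace runs in a tweet."""
    text = re.sub(r"[\x00-\x1f\x7f]", " ", text)
    return re.sub(r"\s+", " ", text).strip()

def valid_coords(lat, lon):
    """Keep only rows whose latitude/longitude parse as in-range floats."""
    try:
        lat, lon = float(lat), float(lon)
    except (TypeError, ValueError):
        return False
    return -90.0 <= lat <= 90.0 and -180.0 <= lon <= 180.0
```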
Now that we have clean data, how do we tell our story?
According to the analysis of David McCandless and the Danish physicist Tor Norretranders, the sense of sight is the fastest at absorbing information. Hence, huge amounts of data can be simplified and clearly understood when visualized in the right way: the best way to absorb millions of rows worth of information is in a visual or graphic format.
Let’s Visualize!
We now needed a visualization tool. We went through several recommendations given by our professor and looked for open-source tools with which we could tell our story. We also wanted to focus our visualization on the following aspects of our data:
1. Spatial and Temporal Component
We visualized both historic and dynamic sources of Twitter data, and from these visuals we wanted to focus on the events surrounding the first European patient in Spain and the death of Thomas Duncan, the first Ebola patient diagnosed on US soil. In addition, we wanted to look at the reactions of global communities to these events.
Tools considered: Tableau, CartoDB, Leaflet, Redis, Mapbox
Tools used:
- Tableau: We picked Tableau because it gives you the ability to show how the data changes over time, so we could display the spatial and temporal components of our tweets effectively. Tableau is easy to use and allows users to publish the dashboards they create and embed them in a website.
- Mapbox and Leaflet: Mapbox provides a visually appealing base map, and Leaflet allows you to place markers on the map based on location. With a small tweak to the Python Tweepy code, we could map the tweets dynamically.
- Redis: We wanted to dynamically visualize live Twitter streams related to Ebola and generate a heat map showing the tweets coming in, in real time. We found a real-time heat map built with Redis, Tweepy, and MongoDB that provided a clear visual of incoming tweets.
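In the linked heat-map demo, Redis and MongoDB move and store the live tweets; the aggregation behind the heat map itself reduces to counting tweets per grid cell. That core step can be sketched as follows (our own illustration, not the demo's code):

```python
from collections import Counter

def heatmap_bins(points, cell_deg=1.0):
    """Count (lat, lon) points per cell of a cell_deg-degree grid.

    The resulting counts are what a heat-map layer renders: the more
    tweets that fall inside a cell, the "hotter" it is drawn.
    """
    counts = Counter()
    for lat, lon in points:
        cell = (int(lat // cell_deg), int(lon // cell_deg))
        counts[cell] += 1
    return counts
```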
2. Content Analysis
We analyzed the content of the tweets to understand which keywords were associated with Ebola, aiming to find stories or events through word clouds.
Tools considered: Wordle, Tableau, Tagul, D3.js
Tools used:
- Tagul [https://tagul.com]: Tagul is a web service that creates word clouds/tag clouds which can be embedded into a website. It is easy to use and, unlike Wordle and Tableau, offers a variety of visually appealing templates for the word clouds, along with easy options for importing the text to be visualized.
- RabbitMQ + D3.js: RabbitMQ was used in combination with D3.js to generate dynamic word clouds. Because it consumes live streams, it let us see the latest keywords associated with the disease in real time. The ability to visualize live tweets is not available in Wordle, Tableau, or Tagul, hence this combination was used.
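Whichever tool renders the cloud, the underlying input is the same: keyword frequencies across the tweets, with noise words removed. That counting step can be sketched in a few lines of Python (the stopword list here is illustrative, not the one any of these tools uses):

```python
import re
from collections import Counter

# Words too common to be informative in an #Ebola stream (illustrative list)
STOPWORDS = {"the", "a", "an", "to", "of", "in", "is", "and", "rt", "ebola"}

def word_frequencies(tweets, top_n=10):
    """Count keyword frequencies across tweets for a word cloud.

    URLs, @mentions, and stopwords are dropped; a word-cloud generator
    then sizes each remaining word by its count.
    """
    counts = Counter()
    for tweet in tweets:
        tweet = re.sub(r"https?://\S+|@\w+", " ", tweet.lower())
        for word in re.findall(r"[a-z']+", tweet):
            if word not in STOPWORDS:
                counts[word] += 1
    return counts.most_common(top_n)
```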
3. Sentiment Analysis
We analyzed the sentiments of the users to see whether there was just widespread panic, or whether positive messages about cooperation, funding, and medical aid were also being shared on Twitter.
Tools considered: Alchemy
Tool picked:
- Alchemy: Generates a dynamic chart based on the sentiments of the live stream of tweets. It can be easily embedded in a website, and the service is free to use.
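Alchemy's scoring model is proprietary, but the general idea behind lexicon-based sentiment scoring can be sketched as follows (the word lists below are made up for illustration and are far smaller than any real lexicon):

```python
# Tiny illustrative sentiment lexicon -- real services such as AlchemyAPI
# use far larger models; this only shows the shape of the computation.
POSITIVE = {"hope", "recover", "aid", "donate", "support", "cured"}
NEGATIVE = {"panic", "fear", "death", "outbreak", "crisis", "dying"}

def sentiment_score(tweet):
    """Return (#positive words - #negative words) for one tweet.

    A positive score suggests a hopeful tweet, a negative score a
    fearful one, and zero a neutral one.
    """
    words = tweet.lower().split()
    return sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)
```

Aggregating these per-tweet scores over time is what produces the kind of dynamic sentiment chart described above.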
The CDC is using mapping software from Esri to visualize the data and overlay existing sources of data, such as censuses, to build up a richer picture. The level of activity at each mobile-phone mast also gives a kind of heat map of where people are and, crucially, where and how far they are moving. This is therefore being used to map population movements and predict how the Ebola virus might spread.
Hence, visualization of data, whether dynamic or historic, has the potential to provide a wealth of useful information on which decisions can be based.
- *http://stevemorse.org/jcal/latlonbatch.html
- http://www.northeastern.edu/news/2014/10/ebola-infographics-2/
- http://blog.comsysto.com/2012/07/10/real-time-twitter-heat-map-with-mongodb/
- http://www.bbc.com/news/business-29617831
- http://www.dailymail.co.uk/sciencetech/article-2797178/could-text-help-halt-ebola-big-data-key-stopping-deadly-virus-tracks-help-cure.html
- http://www.who.int/mediacentre/factsheets/fs103/en/