Friday, December 5, 2014

This Thanksgiving, let us give thanks to analytics

Having spent a semester learning about big data and the increase in "datafication", it's hard not to feel like we are Neo in The Matrix and everything around us is data that can be used for analysis.
                               

Let's take the recent Thanksgiving holiday shopping weekend as an example.


(source: http://truthcentral.mccann.com/truth-studies-blog/twelve-truths-of-holiday-shopping/)
Consider the infographic on the left from a McCann report on big data analytics. The infographic is based on a survey of 10,000 people across eleven countries. The survey asked consumers how they plan to shop this holiday season, and although a majority of them still prefer in-store shopping, online shopping and even mobile shopping (including tablets) have been on the rise over the past couple of years.

This means there is going to be a substantial increase in data coming from all of these different sources. Businesses looking to improve sales, or simply to spot the latest shopping trends, need to seriously consider investing in some form of data analytics in order to take advantage of this growing trend. An example of this is point 4 of the infographic, where almost half of the young consumers surveyed said they would rather have certain stores pick out their gifts than the person giving them. Also important is point 3, where a third of the consumers surveyed globally would outsource their holiday shopping if they could. Who better to do this than the businesses providing the goods to the consumers!

The most important takeaway from this infographic is the first point. Consumers are using social media to talk about their holiday shopping, and even about the kind of gifts they would like to receive over the holidays. This should be a signal to businesses to make use of big data and data analytics in order to better serve their customers.


   
Source: Lowes.com
Take the example of Lowes. They collect data across several channels, online and offline, which they then use to offer product recommendations and drive more sales. Consider a customer who purchased a kitchen appliance: Lowes takes this information and recommends similar or companion products that can help them remodel their kitchen. This is a good example of how analytics can be used to predict customer needs, serving your consumer base better while driving sales at the same time.
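We don't know how Lowes' recommendation engine actually works, but the core idea can be sketched in a few lines of Python. Everything below (the purchase data, the product names and the scoring) is a hypothetical illustration of item co-purchase recommendations, not Lowes' implementation:

```python
from collections import Counter, defaultdict

# Hypothetical purchase histories: customer -> set of products bought.
purchases = {
    "cust_1": {"gas range", "range hood", "cabinet pulls"},
    "cust_2": {"gas range", "range hood", "backsplash tile"},
    "cust_3": {"gas range", "backsplash tile"},
    "cust_4": {"dishwasher", "cabinet pulls"},
}

# Count how often every other product is bought alongside each product.
co_purchase = defaultdict(Counter)
for basket in purchases.values():
    for item in basket:
        for other in basket - {item}:
            co_purchase[item][other] += 1

def recommend(item, k=3):
    """Return the k products most often co-purchased with `item`."""
    return [product for product, _ in co_purchase[item].most_common(k)]

# A customer who just bought a gas range gets kitchen-remodel companions.
print(recommend("gas range"))   # e.g. ['range hood', 'backsplash tile']
```

A real system would add purchase recency, customer segments and inventory rules on top, but the co-purchase counts above are the simplest version of "customers who bought this also bought".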

Consider another program piloted by IKEA in association with Yahoo! and Acxiom. They wanted to track the impact online marketing had on in-store sales. They matched Yahoo! IDs to IKEA's own customer data and were able to create specific user profiles that could be used for targeted advertising. Additionally, IKEA was able to track the relative success of the online marketing ads by comparing the in-store purchases of those users.

This IKEA program brings up an important point when it comes to online marketing campaigns. Consider the Yahoo survey in which consumers said that, compared to general ads, personalized ads tend to be more engaging, educational, time-saving and even more memorable. Users have also begun to complain about what's known as "discount fatigue", where consumers are bombarded with generic promotions that do not fit their needs. This is a prime opportunity for businesses like IKEA to make use of big data to provide a better end-user experience.

In conclusion, we would like to list a few ideas taken from the Huffington Post article on how brands can better use their data to provide more engaging marketing to consumers:
  • Predict their holiday shopping list. Can you determine the attributes of the people they bought for last year and recommend gifts that are popular among people with similar traits? Instead of promoting your line of kitchen gadgets, inform loyal customer Myles, who shops for a novice at the holidays when he himself is rather advanced, about beginner products, cooking classes or recipes that would be perfect for the budding chef on his list.
  • Understand who is likely to host holiday events at their home, as indicated by their social graph and the items they purchase. Use this information to help the stressed holiday hostess with tips, recipes, ideas to entertain the kiddos, playlists, etc.
  • Know how much they are likely to spend. Look at your customers' spending habits throughout the year and past holiday seasons to determine the size of their holiday budget. Yes, we would all love to buy the $900 cashmere throw, but which of your customers can actually afford it? Target only your big spenders with this, and promote more reasonable items to the rest of us.
  • Recognize habits. Understand customers' habits and when they are likely to shop (i.e. Michelle is an early-bird shopper, while Laurie waits until the very last minute). Does Joel pass by your store during his commute? Then don't send him a geo-targeted ad on a Monday morning when he is heading to work; try to entice him one evening after work with personal gift recommendations.
References:
1. http://truthcentral.mccann.com/truth-studies-blog/twelve-truths-of-holiday-shopping/
2. http://www.godigitalmarketing.com/big-datas-big-role-holiday-sales-nationalblog/
3. http://www.huffingtonpost.com/puneet-mehta/a-lesson-from-glengarry-g_b_6255342.html

Tuesday, November 11, 2014

Big Data Revolution!!!


“Prevention is better than cure.” Big data in healthcare justifies this saying.

The Big Data Symposium held at the University of Arizona Medical Center on October 20, 2014 focused on how big data can be leveraged in healthcare. The talks inspired us to write a blog post on big data in healthcare, and we wanted to explore what is happening in the industry.




Let us listen to what Nicolaus Henke, a McKinsey director, has to say about how data analytics is changing the practice of medicine.



McKinsey & Company calls the use of big data in healthcare the "big data revolution". But why is it a revolution? Here are some real-world examples of how big data has been useful in healthcare:

·  Kaiser Permanente has fully implemented a new computer system, HealthConnect, to ensure data exchange across all medical facilities and promote the use of electronic health records. The integrated system has improved outcomes in cardiovascular disease and achieved an estimated $1 billion in savings from reduced office visits and lab tests.

·   Blue Shield of California, in partnership with NantHealth, is improving health-care delivery and patient outcomes by developing an integrated technology system that will allow doctors, hospitals, and health plans to deliver evidence-based care that is more coordinated and personalized. This will help improve performance in a number of areas, including prevention and care coordination.

·   AstraZeneca established a four-year partnership with WellPoint’s data and analytics subsidiary, HealthCore, to conduct real-world studies to determine the most effective and economical treatments for some chronic illnesses and common diseases. AstraZeneca will use HealthCore data, together with its own clinical-trial data, to guide R&D investment decisions. The company is also in talks with payors about providing coverage for drugs already on the market, again using HealthCore data as evidence.



Data is everywhere!! For instance, Asthmapolis has created a GPS-enabled tracker that records inhaler usage by asthmatics. The information is ported to a central database and used to identify individual, group, and population-based trends. The data are then merged with Centers for Disease Control and Prevention information about known asthma catalysts (such as high pollen counts in the Northeast or volcanic fog in Hawaii). Together, the information helps physicians develop personalized treatment plans and spot prevention opportunities. 


This correlates with the lecture given by Professor Sudha Ram, who was able to predict asthma-related emergencies using big data from EHR, social media and sensor datasets. Tweets containing asthma-related keywords were collected for the research, and a predictive model was developed from this data. Research in this field is happening in every part of the world, which is why it is called a revolution.




The goal of using data is to develop predictive models. These models can be applied to patients to monitor their health, track diseases and better manage hospitals. American Healthways says it has identified 30 "impact conditions" that affect people's lives but can be prevented, or at least ameliorated, by timely interventions. American Healthways calculates that these 30 conditions accounted for half of the $1.4 trillion spent on direct medical expenses in the United States last year. Predictive modeling, combined with care enhancement programs, could save 20 percent of these costs, or $140 billion. If 70 percent of the savings is kept by health plans, that leaves $42 billion in fees to be garnered by American Healthways and its competitors. No wonder predictive modeling is hot, hot, hot!!
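The savings chain in that claim is easy to verify; a quick back-of-the-envelope calculation, using only the figures quoted above, reproduces the $140 billion and $42 billion numbers:

```python
direct_medical_spend = 1.4e12                           # $1.4 trillion in direct medical expenses
impact_condition_spend = 0.5 * direct_medical_spend     # the 30 "impact conditions" account for half
potential_savings = 0.2 * impact_condition_spend        # 20% saved via predictive modeling + care programs
vendor_fees = 0.3 * potential_savings                   # 30% left over after health plans keep 70%

print(potential_savings)   # 140,000,000,000  -> $140 billion
print(vendor_fees)         # 42,000,000,000   -> $42 billion
```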

References:

[1] http://www.mckinsey.com/insights/health_systems_and_services/the_big-data_revolution_in_us_health_care
[2] http://www.managedcaremag.com/archives/0109/0109.predictive.html
[3] https://www.facebook.com/BusinessIntelligenceAndAnalyticsCenter

Wednesday, October 22, 2014

Data Minions venture into the world of Data Visualization

With the increasing buzz about big data analytics and the availability of data in large volume, velocity, variety and veracity, there has been a revolution in how this data is utilized. Data visualization enables us to unleash the true power of BIG DATA. Zettabytes of data are generated and shared every second by sources ranging from mail and messages to photos, video clips and sensor data.

One such data source is Twitter. Every second, people from all over the globe send and share 140-character messages called "tweets" about what is happening around them. From politics to health to technology, real-time news on the latest trending topics can be obtained and shared. With the recent increase in talk about the Ebola virus and the rising death toll in West African nations, we decided to analyze and visualize Ebola outbreaks using Twitter data.

The Ebola virus disease (EVD), formerly known as Ebola hemorrhagic fever, is continuing its rampage throughout West Africa, with the number of people infected doubling every three to four weeks. So far, more than 8,000 people are thought to have contracted the disease, and almost half of those have died, according to the World Health Organization. We wanted to analyze the correlation between the Ebola tweets coming from various countries and Ebola outbreaks in those communities. Analyzing the tweets could also help us understand what the global community was discussing when it came to Ebola, and whether Twitter was being used as a medium to create awareness and send alerts about the disease.

First step, get all ‘#Ebola’ tweets
  • We used the Twitter Streaming API and Tweepy (a Twitter Python library) to extract all tweets with the keyword "Ebola" and stored them in a CSV file. We then took a subset of a million tweets with location information for improved analysis.
  • We visualized the Ebola stream in two different forms, Historic and Live. For the Historic Analysis, the first step was to collect all tweets related to the Ebola virus over a period of 5 days. For the Dynamic Analysis, we used the live Twitter stream, keeping only tweets with location information. (A minimal sketch of the collection step follows this list.)
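Our collection step looked roughly like the sketch below. The credentials are placeholders, the fields we kept were richer, and it is written against the Tweepy StreamListener API as it existed at the time, so treat it as a minimal illustration rather than our full script:

```python
import csv
import tweepy

# Placeholder credentials from a Twitter developer account.
auth = tweepy.OAuthHandler("CONSUMER_KEY", "CONSUMER_SECRET")
auth.set_access_token("ACCESS_TOKEN", "ACCESS_SECRET")

class EbolaListener(tweepy.StreamListener):
    """Append each matching tweet's time, text and coordinates to a CSV file."""

    def on_status(self, status):
        coords = getattr(status, "coordinates", None)
        lon, lat = coords["coordinates"] if coords else (None, None)
        with open("ebola_tweets.csv", "a", encoding="utf-8", newline="") as f:
            csv.writer(f).writerow([status.created_at, status.text, lon, lat])

    def on_error(self, status_code):
        return status_code != 420   # disconnect if we are being rate limited

stream = tweepy.Stream(auth=auth, listener=EbolaListener())
stream.filter(track=["Ebola"])      # collect every tweet containing "Ebola"
```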


As we know, ‘analysis is only as good as the data on which it is based’, so data quality was our next priority. We cleansed the data set in RapidMiner, handled missing attributes and checked for invalid characters. We also used Excel functions such as Text to Columns to clean the data set. We then took the latitude and longitude information of the tweets and converted it to location data using Steve Morse's geocoding tool*.
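The cleaning itself was done in RapidMiner and Excel, but an equivalent pass in pandas (a sketch with assumed column names, not our actual workflow) would look something like this:

```python
import pandas as pd

# Assumed column names for the collected tweets (no header row in the raw CSV).
tweets = pd.read_csv("ebola_tweets.csv",
                     names=["created_at", "text", "lon", "lat"])

# Drop tweets with missing coordinates or empty text.
tweets = tweets.dropna(subset=["lon", "lat", "text"])

# Strip characters that break downstream tools (a rough stand-in for the
# invalid-character checks we ran in RapidMiner).
tweets["text"] = tweets["text"].str.replace(r"[^\x20-\x7E]", "", regex=True)

# Keep only plausible coordinates before converting them to place names.
tweets = tweets[tweets["lat"].between(-90, 90) & tweets["lon"].between(-180, 180)]

tweets.to_csv("ebola_tweets_clean.csv", index=False)
```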

Now that we have clean data, how do we tell our story?

According to analysis by David McCandless and the Danish science writer Tor Nørretranders, the sense of sight is the fastest at absorbing information. Hence huge amounts of data can be simplified and clearly understood when visualized in the right way: the best way to absorb millions of rows of information is in a visual or graphic format.

Let’s Visualize! 

We now needed a visualization tool, so we went through several recommendations from our professor and looked for open-source tools with which we could tell our story. We wanted to focus our visualization on the following aspects of our data:

1. Spatial and Temporal Component

We visualized both the Historic and Dynamic sources of Twitter data, and from these visuals we wanted to focus on the events surrounding the first European patient, in Spain, and the death of Thomas Duncan, the first Ebola patient on US soil. In addition, we wanted to look at the reactions of the global community to these events.

Tools considered: Tableau, CartoDB, Leaflet, Redis, Mapbox

Tools used:

  • Tableau: We picked Tableau because it gives you the ability to show the change of the data over time, and we could display the spatial and temporal components of our tweets in an effective way. Tableau is easy to use and allows users to publish the dashboards they create and embed them in a website.
  • Mapbox and Leaflet: Mapbox provides a visually appealing base map and Leaflet allows you to place markers on the map based on location. With a small tweak to the Python Tweepy code, we could map the tweets dynamically (see the sketch after this list).
  • Redis: We wanted to dynamically visualize LIVE Twitter streams related to Ebola and generate a heat map of the real-time tweets coming in. We found a real-time heat map generated using Redis, Tweepy and MongoDB that provided a clear visual of incoming tweets.
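The "small tweak" mentioned above boiled down to turning the geotagged tweets into something the map libraries can plot. A minimal sketch (assuming the cleaned CSV from the earlier step; not our exact code) writes them out as GeoJSON that Leaflet or Mapbox can load as markers:

```python
import json
import pandas as pd

tweets = pd.read_csv("ebola_tweets_clean.csv")

# Build a GeoJSON FeatureCollection, one point feature per geotagged tweet.
features = [{
    "type": "Feature",
    "geometry": {"type": "Point", "coordinates": [float(row.lon), float(row.lat)]},
    "properties": {"text": str(row.text), "time": str(row.created_at)},
} for row in tweets.itertuples()]

with open("ebola_tweets.geojson", "w") as f:
    json.dump({"type": "FeatureCollection", "features": features}, f)

# On the web page, Leaflet can then render the file with something like
#   L.geoJson(data).addTo(map);
```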
2. Visual Text Representation

We analyzed the content of the tweets to understand which keywords were associated with Ebola, aiming to find stories or events through word clouds.

Tools considered: Wordle, Tableau, Tagul, D3js

Tools picked:

  • Tagul [https://tagul.com]: Tagul is a web service that creates word clouds/tag clouds which can be embedded into a website. It is easy to use and, unlike Wordle and Tableau, offers a variety of visually appealing templates for the word clouds, along with easy options to import the text to be visualized.
  • RabbitMQ + D3: RabbitMQ was used in combination with D3.js to generate dynamic word clouds from the live stream, letting us see the latest keywords associated with the disease in real time; this ability to visualize live tweets is not available in Wordle, Tableau or Tagul. (Either tool takes word frequencies as input; a quick count like the sketch below is enough to feed them.)
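A word cloud is only as good as the frequency counts behind it. The sketch below (assuming the cleaned CSV and a small hand-picked stop-word list; both are illustrative, not our exact setup) produces the kind of keyword counts we fed into these tools:

```python
import re
from collections import Counter

import pandas as pd

tweets = pd.read_csv("ebola_tweets_clean.csv")

# Words we do not want dominating the cloud (a tiny, hand-picked list).
stopwords = {"ebola", "rt", "the", "a", "and", "of", "to", "in", "is", "for", "on"}

counts = Counter()
for text in tweets["text"].astype(str):
    for word in re.findall(r"[a-z']+", text.lower()):
        if word not in stopwords and len(word) > 2:
            counts[word] += 1

# Top keywords, ready to paste into Tagul or feed a D3 word-cloud layout.
for word, n in counts.most_common(25):
    print(word, n)
```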
3. Sentiment of Tweets

We analyzed the sentiment of the tweets to see whether there was just widespread panic, or whether positive messages about cooperation and funding for medical aid were also being shared on Twitter (a toy sketch of the idea follows the tool list below).

Tools considered: Alchemy

Tool picked:
  • Alchemy: Generates a dynamic chart based on the sentiment of the live stream of tweets. It can be easily embedded in a website and is open source.
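We relied on the Alchemy service for the actual scoring, but the idea behind a tweet-level sentiment score can be illustrated with a tiny hand-rolled lexicon. This is a toy stand-in, not the Alchemy API:

```python
from collections import Counter

# Toy sentiment lexicon; a real service scores far more words plus context.
positive = {"hope", "recover", "recovered", "aid", "donate", "support", "cure", "help"}
negative = {"death", "dead", "fear", "panic", "outbreak", "dying", "crisis"}

def tweet_sentiment(text):
    """Label one tweet as 'positive', 'negative' or 'neutral'."""
    words = text.lower().split()
    score = sum(w in positive for w in words) - sum(w in negative for w in words)
    return "positive" if score > 0 else "negative" if score < 0 else "neutral"

sample_tweets = [
    "Donate to support the aid workers, there is hope",
    "Death toll keeps rising, widespread panic in the city",
]
print(Counter(tweet_sentiment(t) for t in sample_tweets))
# Counter({'positive': 1, 'negative': 1})
```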
To conclude, we would like to note that we aren't the only ones interested in Ebola data. Similar analytics are being performed by the US Centers for Disease Control and Prevention (CDC). They are collecting mobile phone mast activity data from mobile operators and mapping where calls to helplines are mostly coming from. Spikes in calls to a helpline from one particular area could suggest an outbreak and alert authorities to direct more resources there.

The CDC is using mapping software from Esri to visualize the data and overlay existing sources, such as census data, to build up a richer picture. The level of activity at each mobile phone mast also gives a kind of heat map of where people are and, crucially, where and how far they are moving. This is being used to map population movements and predict how the Ebola virus might spread.


Hence, visualization of data, dynamic or historic, has the potential to provide a plethora of useful information on which decisions can be based.

Visualization can thus act as the voice of the data and help us narrate the right story, one that can be interwoven with the facts we have at hand to add credibility, converting stories into insights!



Tuesday, September 30, 2014

Video Games and Big Data Analytics

Over the past decade, video games have undergone a change in the way that they are sold and marketed to consumers. Games have gone from physical packaged goods being sold exclusively through retail stores to an instant download, subscription based model accessible directly on your gaming device. The concept of gaming itself has shifted from playing with friends on your couch to playing with millions of players online around the world.

Bill Grosso, Principal Consultant at Osolog LLC, gave a webinar detailing some of these changes in the gaming industry and how analytics can be used by game studios to better understand their customers and, in turn, build better games. The hope is that, with the power of analytics, game studios will be able to capitalize on the growth of the gaming industry, now a $20 billion market in North America alone.


Zynga, a leading online game developer that builds its games on social media platforms, provided the infographic on the left about the kind of data being generated in its systems. Developers like Zynga have begun to build some form of social interaction directly into their games, from in-game chat functionality to "share" options through Facebook or Twitter.


An interesting observation is that many of these gaming companies used to rely on MySQL databases, but as the size of the data quickly grew, their data warehouses could not handle the load and many ETL operations would take about 24 hours to complete. Scaling the MySQL instances vertically didn't help either.





It turned out Hadoop was a natural solution to their problems because of the following qualities (a minimal Hadoop Streaming sketch follows the list):

  • Cost effective
  • Scalable
  • Open Source
  • Quick execution
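None of these companies' actual pipelines are public, but the workload they describe maps naturally onto Hadoop Streaming, where the mapper and reducer are plain scripts reading stdin and writing stdout. The sketch below assumes a hypothetical log format of one "player_id <tab> event" line per record and counts events per player, the kind of aggregation a single overloaded MySQL box struggled with:

```python
#!/usr/bin/env python
"""Hadoop Streaming job: count game events per player (illustrative sketch).

Run roughly as:
  hadoop jar hadoop-streaming.jar -file count_events.py \
    -mapper "count_events.py map" -reducer "count_events.py reduce" \
    -input /game/events -output /game/events_per_player
"""
import sys

def mapper():
    # Each input line is assumed to look like "<player_id>\t<event_name>".
    for line in sys.stdin:
        player_id = line.split("\t", 1)[0].strip()
        if player_id:
            print("%s\t1" % player_id)

def reducer():
    # Hadoop sorts mapper output by key, so each player's counts arrive grouped.
    current, total = None, 0
    for line in sys.stdin:
        player_id, count = line.rsplit("\t", 1)
        if player_id != current:
            if current is not None:
                print("%s\t%d" % (current, total))
            current, total = player_id, 0
        total += int(count)
    if current is not None:
        print("%s\t%d" % (current, total))

if __name__ == "__main__":
    mapper() if sys.argv[1] == "map" else reducer()
```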

Now that they could store all this information, the next question for game companies was what to do with the massive amounts of user-generated data they were collecting.

Listed below are some of the possible use cases for this data:

1.    Enhanced Customer Experience

Facebook is a popular platform for casual games and has an inherently wide consumer base. But for a game to be popular on this platform, it needs to address the needs of the people playing it so that they keep playing. Zynga used the data collected from the original FarmVille to make "animals" the central characters of the next version of the game: the data showed that people interacted a lot with animals in the first version, and this prompted the change in version 2.0.

Riot Games, which owns the successful "League of Legends" franchise, made significant changes to its game client based on data collected from its users. Certain components of the client that were loading slowly on users' computers were rewritten to reduce this delay.

2.    Increased sales

Virtual item sales are a major revenue generator for online games. Companies sell merchandise through in-game shops based on user behavior within the game. Dota 2, for example, is offered free of cost for users to play, but its revenue comes from the sale of in-game "virtual items" that users can buy and equip to personalize their characters.

3.    Increased player engagement

Based on data collected about user behavior, gaming companies can judge how engaging the game is. If a specific scenario in the game is very difficult and has a lot of users dropping out, the company can incorporate this insight when redesigning the game (a minimal sketch of such a drop-off analysis follows).
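A concrete version of that drop-off analysis is only a few lines of pandas. The event log here is made up (hypothetical player_id and level_reached columns), but the calculation is the one a studio would run at scale:

```python
import pandas as pd

# Hypothetical progression log: the furthest level each player reached.
events = pd.DataFrame({
    "player_id":     [1, 2, 3, 4, 5, 6, 7, 8],
    "level_reached": [1, 2, 2, 3, 3, 3, 7, 1],
})

# How many players made it at least as far as each level.
reached = (events.groupby("level_reached")["player_id"].count()
                 .sort_index(ascending=False).cumsum().sort_index())

# Share of players lost before the next observed level; a big spike flags a
# scenario that may be too difficult and worth redesigning.
drop_off = 1 - reached.shift(-1) / reached
print(drop_off)
```

On this toy data, 75% of the remaining players never get past level 3, which is exactly the kind of signal that would send a designer back to that scenario.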

As we look at these statistics, one thing stands out: video games are massively "data bloated", and with the increased adoption of Hadoop and NoSQL, game companies will be able to meet the growing needs of their customers and make better business decisions.

To conclude, here's a video of Barry Sohl, CTO of Buffalo Studios, talking about how analyzing game data helped their studio address its conversion goals.



References:
http://www.bigdata-startups.com/BigData-startup/zynga-is-a-big-data-company-masqueraded-as-a-gaming-company/#!prettyPhoto
http://www.qubole.com/big-data-gaming-industry/

http://www.slideshare.net/StampedeCon/big-data-at-riot-games-using-hadoop-to-understand-player-experience-stampedecon-2013?related=1

Tuesday, September 9, 2014

MongoDB v/s CouchDB ...In Simple Terms



When the Big Bee of Big Data buzzes around you, so do the other drones that stay within the hive. Before we get to the queen herself, we decided to pick one of the others today and let it sting its way once again through the Internet. As the Data Minion Team, this is our very first blog post and we dedicate it to NoSQL databases: CouchDB and MongoDB.


The MIS 586 Big Data class emphasizes the use of MongoDB, and we decided to go ahead and find out the possible pros and cons of using this NoSQL database over others of its kind in the class. Just letting curiosity be our guide! :)

One of the simplest and yet not so widely known facts is what NoSQL really stands for. Some people think the name pokes fun at SQL, but that isn't really the case: it stands for Not-only-SQL. Although human tendency may favor what one already knows and understands, NoSQL databases are actually simpler in the way they are structured compared to relational databases. There is no denying that in some cases relational databases might actually serve a better purpose than NoSQL databases. As always, it all depends on the kind of application.


Out of the four major types of NoSQL databases, CouchDB and MongoDB are the most popular document stores. Below we elaborate on why we, as a class, were probably asked to use MongoDB over CouchDB.


Consistency and availability: From what we understand, it is not essential for us to have a guarantee of receiving a response for every command, whether it executed successfully or not. Rather, consistency of the stored Twitter data is more important in this scenario, so that data retrieval for analysis is achieved without any ambiguity. MongoDB favors consistency while CouchDB favors availability (although CouchDB is eventually consistent too).
Why not both, you ask? Well, Brewer will tell you why. According to Brewer (in simpler terms), in a distributed computing system that can be subject to communication failures, only one of the properties of liveness or safety can be guaranteed. Liveness corresponds to availability and safety to consistency.

Quoting FoundationDB's documentation we found online:


In 2000, Eric Brewer conjectured that a distributed system cannot simultaneously provide all three of the following desirable properties:
  • Consistency: A read sees all previously completed writes.
  • Availability: Reads and writes always succeed.
  • Partition tolerance: Guaranteed properties are maintained even when network failures prevent some machines from communicating with others.
In 2002, Gilbert and Lynch proved this in the asynchronous and partially synchronous network models, so it is now commonly called the CAP Theorem. Brewer originally described this impossibility result as forcing a choice of "two out of the three" CAP properties, leaving three viable design options: CP, AP, and CA. However, further consideration shows that CA is not really a coherent option because a system that is not Partition-tolerant will, by definition, be forced to give up Consistency or Availability during a partition. Therefore, a more modern interpretation of the theorem is: during a network partition, a distributed system must choose either Consistency or Availability.

So, MongoDB with its CP properties serves us pretty well for the MIS 586 class.

Easy peasy - Mongo can retrieve documents from the database super-duper easily and quickly (thanks in part to indexing), and its query structure is similar to MySQL. CouchDB, on the other hand, requires MapReduce functions for most queries. So there's one less learning curve there! For the geeks in our class, there is still hope: MongoDB does provide support for MapReduce functions, which are better suited to complex queries.
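To make that concrete, here is what an ad-hoc query looks like from Python with PyMongo (the database and collection names are hypothetical); the comment at the end shows the rough CouchDB equivalent, which goes through a map function in a design document rather than an ad-hoc find:

```python
from pymongo import MongoClient

client = MongoClient("localhost", 27017)
tweets = client.mis586.tweets   # hypothetical database/collection names

# Ad-hoc query, MySQL-WHERE-clause style: geotagged tweets mentioning
# "ebola", newest first, projecting only the fields we care about.
cursor = (tweets.find({"text": {"$regex": "ebola", "$options": "i"},
                       "coordinates": {"$ne": None}},
                      {"text": 1, "created_at": 1, "coordinates": 1})
                .sort("created_at", -1)
                .limit(10))

for doc in cursor:
    print(doc["created_at"], doc["text"][:60])

# The rough CouchDB equivalent is a map function in a design document, e.g.
#   function(doc) { if (doc.coordinates) emit(doc.created_at, doc.text); }
# which is then queried through a view rather than with find().
```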

With a swish and a flick - What's the best part, you ask? MongoDB is quick and efficient, in part because it stores JSON documents as BSON (binary JSON), a format designed to be fast to scan and parse, which leads to faster computations.

Dealing with multiple nodes - CouchDB has master-master replication, while MongoDB uses master-slave replication. So, if you anticipate nodes being disconnected for quite some time, CouchDB would be a good choice for you! If not, MongoDB will work just fine without your having to deal with the complex, but mind you very safe, replication setup.

Collections: MongoDB stores its data in small buckets called collections, whereas CouchDB stores it all directly in one mammoth bucket! So, while querying, MongoDB knows exactly which collection to look in, which is another reason why MongoDB is faster.

Another good-to-know difference between the two databases: MongoDB is written in C++ and uses a binary protocol, while CouchDB is written in Erlang and uses HTTP/REST.

One example where using MongoDB is not such a good idea is when some devices are cut off from the network for a while and you have to get them back in sync. CouchDB makes this syncing very smooth and is the better choice in that scenario.

In summary, it all boils down to one question.

Is safety more important to you or speed?

In that question, lies all your answers! Well, at least ours did!
References:


https://foundationdb.com/key-value-store/white-papers/the-cap-theorem/
* We understand that this list of deciding factors is by no means comprehensive. We have simply attempted to create an article that lists the major reasons why we use MongoDB in class and at the same time one that would help a newbie Big data Internet surfer in his quest to understand the difference between the two databases. If you happen to know of more factors or if you have any opinion on this article, please feel free to leave it in the comments!

Tuesday, August 26, 2014

Data Minions and Big Data

The Team : Our take on Big Data    
                                                       

Numerous pages have been penned in the name of Big Data over the past few years, and we think much of that fiber usage is justified. As a team, we strongly believe that the Big Data age embodies a mammoth shift from what we think of as data…although at this point, it may seem like just another label given to a shiny new technology! Through this class, we will attempt to understand what it really entails, and use our knowledge of this powerful technology to spread the right meaning of it when we graduate.

We, the Data Minions, are an excited bunch of MIS graduate students, keen on working with data and learning more through the MIS 586 course. We may be novices in the Big Data arena right now, but we wish to learn the art of wading smoothly through it. We have included our pictures and our favorite Big Data quotes right here!

Name: Elma Pinto
About me: Selective. Adaptive. Creative.
Favorite data quote: "Data! Data! Data!" he cried impatiently. "I can't make bricks without clay!" – Sherlock Holmes, "The Adventure of the Copper Beeches"










Name: Karan Dhingra
About: Analytical since 1988
Favorite Data Quote: "Data Science is not VOODOO. We are not building fancy math models for their own sake. We are trying to listen to what the customer is telling us through their behavior" - Kevin Geraghty, SVP of Analytics at 360i







Name: Sidharth Agarwal
About me: I do what I said I would do
Favorite Data Quote: "Conversion of comment to customer is as conversion of data to business"








Name: Vanitha Venkatanarayanan
About me: Data Whisperer 
Favorite Data Quote: "Information is the oil of the 21st century, and analytics is the combustion engine"---Peter Sondergaard [SVP and Global Head of research at Gartner]







Name: Geethu Babu
About me: I believe in the little things that mean a lot
Favorite Data Quote: "Data is becoming the new raw material of business"- Craig Mundie, Senior Advisor to the CEO at Microsoft.









Name: Nickil Somanna
About me: Practicing narcissist and occasional bedroom guitar player
Favorite Data Quote: "In God we trust.All others must bring data". - W.Edwards Deming, statistician, professor, author, lecturer, and consultant.