Data Minioning and Mining : September 2014

Over the past decade, video games have undergone a change in the way that they are sold and marketed to consumers. Games have gone from physical packaged goods sold exclusively through retail stores to an instant-download, subscription-based model accessible directly on your gaming device. The concept of gaming itself has shifted from playing with friends on your couch to playing with millions of players online around the world.

Bill Grosso, Principal Consultant at Osolog LLC, gave a webinar detailing some of these changes in the gaming industry and how game studios can use analytics to better understand their customers and, in turn, build better games. The hope is that, with the power of analytics, game studios will be able to capitalize on the growth of the gaming industry, which is now a 20-billion-dollar industry in the North American market alone.

Zynga, a leading online game developer that builds its games on social media platforms, provided the infographic on the left about the kind of data being generated in its systems. Developers like Zynga have begun to build some form of social interaction directly into their games, from in-game chat functionality to “share” options for Facebook or Twitter.

An interesting observation is that many of these gaming companies used to rely on MySQL databases, but as the data grew quickly, their data warehouses could not handle the load and many ETL operations took about 24 hours to complete. Scaling the MySQL instances vertically did not help either.

It turned out that Hadoop was a natural solution to their problems because of the following qualities:

Cost effective
Scalable
Open Source
Quick execution

Now that they could store all this information, the next question for game companies was what to do with the massive amounts of user-generated data they were collecting.

Listed below are some of the possible use cases for this data:

1. Enhanced Customer Experience

Facebook is a popular platform for casual games and has an inherently wide consumer base. But for a game to stay popular on this platform, it has to address the needs of the people playing it so that they keep playing. Zynga used the data it collected from the original FarmVille to make "animals" the central characters in the next version of the game. The data showed that people interacted a lot with "animals" in the first version of FarmVille, and this prompted the change in version 2.0.

Riot Games, which owns the successful "League of Legends" franchise, made significant changes to its game client based on data collected from its users. Certain components of the client that were loading slowly on users’ computers were rewritten to reduce this delay.

2. Increased sales

Virtual item sales are a major revenue generator for online games. Companies sell their merchandise through in-game shops based on user behavior within the game. Dota 2, an online game, is free to play; its revenue comes from the sale of in-game "virtual items" that users can buy and equip to personalize their in-game characters.

3. Increased player engagement

Based on data collected about user behavior, gaming companies can judge how engaging the game is for the user. If a specific scenario in the game is very difficult and causes a lot of users to drop out, the company can use this insight when redesigning the game.
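As a rough illustration of the drop-out analysis described above, here is a minimal Python sketch. The event records, the level names, and the `drop_off_by_level` helper are all hypothetical and made up for this example; they do not describe any studio's actual pipeline.

```python
def drop_off_by_level(events):
    """Return {level: fraction of players who started it but never completed it}.

    `events` is an iterable of (player_id, level, action) tuples, where action
    is "start" or "complete". This record shape is an assumption for the sketch.
    """
    started = {}
    completed = {}
    for player, level, action in events:
        if action == "start":
            started.setdefault(level, set()).add(player)
        elif action == "complete":
            completed.setdefault(level, set()).add(player)

    rates = {}
    for level, players in started.items():
        finished = completed.get(level, set())
        rates[level] = len(players - finished) / len(players)
    return rates


# Hypothetical sample data: four players, two levels.
events = [
    ("p1", "level_1", "start"), ("p1", "level_1", "complete"),
    ("p2", "level_1", "start"), ("p2", "level_1", "complete"),
    ("p1", "level_2", "start"),
    ("p2", "level_2", "start"), ("p2", "level_2", "complete"),
    ("p3", "level_2", "start"),
    ("p4", "level_2", "start"),
]

rates = drop_off_by_level(events)
hardest = max(rates, key=rates.get)  # level with the highest drop-off rate
print(rates, hardest)
```

In this made-up data, three of the four players who start `level_2` never finish it, so that level would be the first candidate for a redesign.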

As we look at these statistics, one thing stands out: video games are massively "data bloated", and with the increased adoption of Hadoop and NoSQL, game companies will be able to meet the growing needs of their customers and make better business decisions.

To conclude, here's a video of Barry Sohl, CTO of Buffalo Studios, talking about how analyzing game data helped their studio address their conversion goals.

References:

http://www.bigdata-startups.com/BigData-startup/zynga-is-a-big-data-company-masqueraded-as-a-gaming-company/#!prettyPhoto

http://www.qubole.com/big-data-gaming-industry/

http://www.slideshare.net/StampedeCon/big-data-at-riot-games-using-hadoop-to-understand-player-experience-stampedecon-2013?related=1

When the Big Bee of Big Data buzzes around you, so do the other drones that stay within the Hive. Before we get to the queen herself, we decided to pick one of the others today and let it sting its way once again through the Internet. As the Data Minion Team, this is our very first blog post, and we dedicate it to NoSQL databases - CouchDB and MongoDB.

The MIS 586 Big Data class emphasizes the use of MongoDB, and we decided to find out the possible pros and cons of using this NoSQL database in class over others of its kind. Just letting curiosity be our guide! :)

One of the simplest and yet not so widely known things is what NoSQL really stands for. Some people think the name pokes fun at SQL, but that isn’t really the case. It stands for Not-only-SQL. Although human tendency may favor what one already knows and understands, NoSQL databases are actually simpler in the way they are structured compared to relational databases. There is no denying that in some cases relational databases might serve a better purpose than NoSQL databases. As always, it all depends on the kind of application.

Out of the four major types of NoSQL databases (key-value, document, column-family, and graph), CouchDB and MongoDB are the most popular document stores. Here we try to elaborate on why we as a class were probably asked to use MongoDB over CouchDB.

Consistency and availability: From what we understand, it is not essential for us to be guaranteed a response every time we issue a command. Rather, consistency of the stored Twitter data matters more in this scenario, so that data retrieved for analysis is unambiguous. MongoDB favors consistency, while CouchDB favors availability (although CouchDB is eventually consistent too).

Why not both, you ask? Well, Brewer will tell you why. According to Brewer (in simpler terms), a distributed computing system that can be subject to communication failures can guarantee only one of two properties: liveness or safety. Liveness corresponds to availability and safety to consistency.

Quoting FoundationDB's documentation we found online:

In 2000, Eric Brewer conjectured that a distributed system cannot simultaneously provide all three of the following desirable properties:

Consistency: A read sees all previously completed writes.
Availability: Reads and writes always succeed.
Partition tolerance: Guaranteed properties are maintained even when network failures prevent some machines from communicating with others.

In 2002, Gilbert and Lynch proved this in the asynchronous and partially synchronous network models, so it is now commonly called the CAP Theorem. Brewer originally described this impossibility result as forcing a choice of "two out of the three" CAP properties, leaving three viable design options: CP, AP, and CA. However, further consideration shows that CA is not really a coherent option because a system that is not Partition-tolerant will, by definition, be forced to give up Consistency or Availability during a partition. Therefore, a more modern interpretation of the theorem is: during a network partition, a distributed system must choose either Consistency or Availability.
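To make the "choose either Consistency or Availability during a partition" reading concrete, here is a toy Python sketch. `Replica` and `TinyCluster` are hypothetical names invented for this illustration; the model has only two replicas and is not how MongoDB or CouchDB is implemented.

```python
class Replica:
    """One node holding a copy of the data."""

    def __init__(self, name):
        self.name = name
        self.data = {}


class TinyCluster:
    """Two replicas that normally mirror every write; mode is "CP" or "AP"."""

    def __init__(self, mode):
        self.mode = mode
        self.a = Replica("a")
        self.b = Replica("b")
        self.partitioned = False  # True while the replicas cannot talk to each other

    def write(self, replica, key, value):
        other = self.b if replica is self.a else self.a
        if not self.partitioned:
            # Healthy network: both copies are updated together.
            replica.data[key] = value
            other.data[key] = value
            return True
        if self.mode == "CP":
            # Give up availability: refuse the write so the copies stay identical.
            return False
        # AP: give up consistency: accept the write locally, so the copies diverge.
        replica.data[key] = value
        return True


cp = TinyCluster("CP")
cp.write(cp.a, "x", 1)
cp.partitioned = True
cp_ok = cp.write(cp.a, "x", 2)   # rejected during the partition

ap = TinyCluster("AP")
ap.write(ap.a, "x", 1)
ap.partitioned = True
ap_ok = ap.write(ap.a, "x", 2)   # accepted, but replica b still holds the old value

print(cp_ok, cp.a.data, cp.b.data)
print(ap_ok, ap.a.data, ap.b.data)
```

In this simplified model the CP cluster stays consistent by rejecting the write, while the AP cluster stays available at the cost of the two replicas disagreeing until they can sync again.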

So, MongoDB with its CP properties suits us pretty well for the MIS 586 class.

Easy peasy - Mongo can retrieve documents from the database super-duper easily and quickly (probably due to indexing), and its query structure will feel familiar to anyone who knows MySQL. CouchDB, on the other hand, requires MapReduce functions for most queries. So there’s one less learning curve there! For the geeks in our class, there is still hope: MongoDB does support MapReduce functions, which are better suited to complex queries.
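The difference between the two query styles can be sketched in plain Python over an in-memory list of documents. The `find` and `map_reduce` helpers and the sample tweet-like documents are hypothetical stand-ins written for this post; they only imitate the flavor of each style and are not the actual MongoDB or CouchDB APIs.

```python
# Hypothetical tweet-like documents.
docs = [
    {"user": "alice", "lang": "en", "retweets": 3},
    {"user": "bob", "lang": "fr", "retweets": 0},
    {"user": "carol", "lang": "en", "retweets": 7},
]


def find(collection, query):
    """Declarative style: keep documents whose fields equal every key/value in `query`."""
    return [d for d in collection
            if all(d.get(k) == v for k, v in query.items())]


def map_reduce(collection, map_fn, reduce_fn):
    """Map/reduce style: map_fn yields (key, value) pairs per document,
    and reduce_fn folds the list of values collected for each key."""
    grouped = {}
    for doc in collection:
        for key, value in map_fn(doc):
            grouped.setdefault(key, []).append(value)
    return {key: reduce_fn(values) for key, values in grouped.items()}


# Declarative query: "give me the English tweets".
english = find(docs, {"lang": "en"})


# Map/reduce query: "count tweets per language".
def emit_lang(doc):
    yield doc["lang"], 1


counts = map_reduce(docs, emit_lang, sum)

print([d["user"] for d in english])
print(counts)
```

The first style only asks you to describe what you want, while the second asks you to write the functions that compute it, which is the extra learning curve mentioned above.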

With a swish and a flick - What’s the best part, you ask? MongoDB is often quicker, in part because it converts JSON documents into BSON (binary JSON). BSON stores typed, length-prefixed fields that are quick to scan, although it is not always smaller than the equivalent JSON.
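To show what the JSON-to-BSON conversion looks like at the byte level, here is a minimal hand-rolled encoder in Python. `encode_bson` is a hypothetical helper written for this post that handles only flat documents with string and 32-bit integer values; a real application would use the BSON support that ships with a MongoDB driver.

```python
import json
import struct


def encode_bson(doc):
    """Encode a flat dict (str keys; str or int32 values) as BSON bytes."""
    body = b""
    for key, value in doc.items():
        name = key.encode("utf-8") + b"\x00"          # element name as a C string
        if isinstance(value, str):
            data = value.encode("utf-8") + b"\x00"
            # Type 0x02: string, prefixed with its byte length (including the null).
            body += b"\x02" + name + struct.pack("<i", len(data)) + data
        elif isinstance(value, int) and not isinstance(value, bool):
            # Type 0x10: 32-bit little-endian integer.
            body += b"\x10" + name + struct.pack("<i", value)
        else:
            raise TypeError("this sketch only supports str and int32 values")
    # Total size counts the 4-byte length prefix and the trailing null byte.
    return struct.pack("<i", len(body) + 5) + body + b"\x00"


doc = {"hello": "world"}
bson_bytes = encode_bson(doc)
json_bytes = json.dumps(doc, separators=(",", ":")).encode("utf-8")

print(len(bson_bytes), bson_bytes)
print(len(json_bytes), json_bytes)
```

For this tiny document the BSON form is 22 bytes against 17 bytes of compact JSON. The type tags and length prefixes are what let a reader skip over fields without parsing text, which suggests that any speed benefit comes from easier traversal rather than from size alone.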

Dealing with multiple nodes - CouchDB uses master-master replication, while MongoDB uses master-slave replication. So, if you anticipate nodes being disconnected for quite some time, CouchDB would be a good idea for you! If not, MongoDB would work just fine, without your having to deal with CouchDB's complex but, mind you, very safe replication standards.

Collections: MongoDB stores its data in small buckets called collections, but CouchDB stores it all directly in one mammoth bucket! So, while querying, MongoDB knows exactly which collection to look in and hence is faster! There goes another reason why MongoDB is faster!

Another good-to-know difference between the two databases: MongoDB is written in C++ and uses a binary protocol, while CouchDB is written in Erlang and uses HTTP/REST.

One example where MongoDB is not such a good idea is when, for some reason, some devices are cut off from the network and you have to get them back in sync. CouchDB makes this syncing very smooth and hence is to be favored in this scenario.

In summary, it all boils down to one question.

Is safety more important to you or speed?

In that question lie all your answers! Well, at least ours did!

References:

https://foundationdb.com/key-value-store/white-papers/the-cap-theorem/

* We understand that this list of deciding factors is by no means comprehensive. We have simply attempted to create an article that lists the major reasons why we use MongoDB in class and that, at the same time, would help a newbie Big Data Internet surfer in their quest to understand the difference between the two databases. If you happen to know of more factors, or if you have any opinion on this article, please feel free to leave it in the comments!

Tuesday, September 30, 2014

Video Games and Big Data Analytics

Tuesday, September 9, 2014

MongoDB v/s CouchDB ...In Simple Terms