Data Minioning and Mining : MongoDB v/s CouchDB ...In Simple Terms

When the Big Bee of Big Data buzzes around you, so do the other drones that stay within the Hive. Before we get to the queen herself, we decided to pick one of the others today and let it sting its way through the Internet once again. As the Data Minion Team, this is our very first blog post, and we dedicate it to two NoSQL databases: CouchDB and MongoDB.

The MIS 586 Big Data class emphasizes the use of MongoDB, so we decided to find out the possible pros and cons of using this NoSQL database in class over others of its kind. Just letting curiosity be our guide! :)

One of the simplest and yet not so widely known things about NoSQL is what the name really stands for. Some people think the name pokes fun at SQL, but that isn’t really the case. It stands for Not-only-SQL. Although human tendency may favor what one already knows and understands, NoSQL databases are actually simpler in the way they are structured compared to relational databases. There is no denying that in some cases relational databases might serve a better purpose than NoSQL databases. As always, it all depends on the kind of application.

Out of the four major types of NoSQL databases, CouchDB and MongoDB are the most popular document stores. Below, we elaborate on why we, as a class, were probably asked to use MongoDB over CouchDB.

Consistency and availability: From what we understand, it is not essential for us that every request is guaranteed a response, whether or not the command executed successfully. Rather, consistency of the stored Twitter data is more important in this scenario, so that data retrieved for analysis is free of ambiguity. MongoDB favors consistency, while CouchDB favors availability (although CouchDB is eventually consistent too).

Why not both, you ask? Well, Brewer will tell you why. According to Brewer (in simpler terms), in a distributed computing system that can be subject to communication failures, only one of two properties, liveness or safety, can be guaranteed. Liveness corresponds to availability and safety to consistency.

Quoting FoundationDB's documentation we found online:

In 2000, Eric Brewer conjectured that a distributed system cannot simultaneously provide all three of the following desirable properties:

Consistency: A read sees all previously completed writes.
Availability: Reads and writes always succeed.
Partition tolerance: Guaranteed properties are maintained even when network failures prevent some machines from communicating with others.

In 2002, Gilbert and Lynch proved this in the asynchronous and partially synchronous network models, so it is now commonly called the CAP Theorem. Brewer originally described this impossibility result as forcing a choice of "two out of the three" CAP properties, leaving three viable design options: CP, AP, and CA. However, further consideration shows that CA is not really a coherent option because a system that is not Partition-tolerant will, by definition, be forced to give up Consistency or Availability during a partition. Therefore, a more modern interpretation of the theorem is: during a network partition, a distributed system must choose either Consistency or Availability.

So, MongoDB with its CP properties serves us pretty well for the MIS 586 class.
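To make the CP versus AP choice a little more concrete, here is a toy sketch in Python. It is not how MongoDB or CouchDB is implemented; the `Cluster` class, its two replicas named `a` and `b`, and the `demo` helper are all made up for illustration. With only two nodes there is no majority to fall back on, so the "CP" mode simply refuses writes during a partition, while the "AP" mode accepts them and lets the replicas drift apart.

```python
class Cluster:
    """Hypothetical two-replica store used only to illustrate CP vs AP."""

    def __init__(self, mode):
        self.mode = mode                      # "CP" or "AP"
        self.replicas = {"a": {}, "b": {}}    # each replica is a plain dict
        self.partitioned = False              # True = replicas cannot talk

    def write(self, node, key, value):
        if not self.partitioned:
            # Healthy network: every replica receives the write.
            for store in self.replicas.values():
                store[key] = value
            return
        if self.mode == "CP":
            # Consistency first: refuse the write rather than diverge.
            raise RuntimeError("write rejected: cannot reach the other replica")
        # Availability first: accept locally, replicas now disagree.
        self.replicas[node][key] = value

    def read(self, node, key):
        return self.replicas[node].get(key)


def demo(mode):
    """Write once, cut the network, then try to write again on replica 'a'."""
    cluster = Cluster(mode)
    cluster.write("a", "tweets", 1)
    cluster.partitioned = True
    try:
        cluster.write("a", "tweets", 2)
        accepted = True
    except RuntimeError:
        accepted = False
    return accepted, cluster.read("a", "tweets"), cluster.read("b", "tweets")


print(demo("CP"))  # (False, 1, 1): write refused, replicas still agree
print(demo("AP"))  # (True, 2, 1): write accepted, replicas disagree
```

In the AP run both replicas stay responsive, but a reader can get a different answer depending on which one it asks until the two are synced again.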

Easy peasy - Mongo can retrieve documents from the database super-duper easily and quickly (probably due to indexing), and its ad hoc query style will feel familiar to anyone coming from MySQL. CouchDB, on the other hand, requires MapReduce functions for most queries. So there’s one less learning curve there! For the geeks in our class, there is still hope: MongoDB does support MapReduce functions, which are better suited to complex queries.
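The difference between the two query styles can be sketched in plain Python. This is only an illustration over an in-memory list, not the actual MongoDB or CouchDB API: the `tweets` sample data and the `find` and `map_reduce` helpers are hypothetical stand-ins. A filter-style query just describes the documents you want, while a MapReduce query makes you write a map function that emits key/value pairs and a reduce function that combines them.

```python
from collections import defaultdict

# Hypothetical sample documents, standing in for stored tweets.
tweets = [
    {"user": "ann", "lang": "en", "text": "big data is buzzing"},
    {"user": "bob", "lang": "es", "text": "hola"},
    {"user": "ann", "lang": "en", "text": "nosql all the things"},
]


def find(collection, criteria):
    """Filter-style query: keep documents whose fields equal the criteria."""
    return [
        doc for doc in collection
        if all(doc.get(field) == wanted for field, wanted in criteria.items())
    ]


def map_reduce(collection, map_fn, reduce_fn):
    """MapReduce-style query: map emits (key, value) pairs, reduce folds them."""
    groups = defaultdict(list)
    for doc in collection:
        for key, value in map_fn(doc):
            groups[key].append(value)
    return {key: reduce_fn(key, values) for key, values in groups.items()}


# Filter style: "give me the English tweets".
english = find(tweets, {"lang": "en"})

# MapReduce style: "count tweets per user".
counts = map_reduce(
    tweets,
    map_fn=lambda doc: [(doc["user"], 1)],
    reduce_fn=lambda key, values: sum(values),
)

print(len(english))  # 2
print(counts)        # {'ann': 2, 'bob': 1}
```

The filter version is a one-liner to use, which is the "one less learning curve" point; the MapReduce version takes more setup but handles aggregations that a simple filter cannot express.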

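With a swish and a flick - What’s the best part, you ask? MongoDB is quick and efficient! That is partly because it stores JSON-like documents as BSON (binary JSON), a typed, length-prefixed binary format. BSON is not always smaller than the equivalent JSON text, but it is designed to be fast to scan and parse, which speeds up reads and writes.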
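To get a feel for what the binary form looks like, here is a hand-rolled sketch of a tiny BSON encoder in Python. It is an illustration only, not MongoDB's own code: the `encode_bson` function is our own name, and it handles just flat documents with string and 32-bit integer values. Note that for a tiny document the BSON bytes can come out slightly larger than the JSON text; what the format buys you is the length prefixes and type tags, which let a reader skip over fields without parsing them character by character.

```python
import json
import struct


def encode_bson(doc):
    """Encode a flat dict of str -> (str | int32) into BSON bytes.

    Simplified sketch: only UTF-8 strings (type 0x02) and
    32-bit integers (type 0x10) are supported.
    """
    body = b""
    for key, value in doc.items():
        name = key.encode("utf-8") + b"\x00"          # field name as a C string
        if isinstance(value, bool):
            raise TypeError("booleans are not handled in this sketch")
        if isinstance(value, str):
            data = value.encode("utf-8") + b"\x00"    # string plus trailing NUL
            # type tag, name, int32 byte length (including the NUL), bytes
            body += b"\x02" + name + struct.pack("<i", len(data)) + data
        elif isinstance(value, int):
            # type tag, name, little-endian int32 value
            body += b"\x10" + name + struct.pack("<i", value)
        else:
            raise TypeError(f"unsupported type: {type(value).__name__}")
    # Document = int32 total size + elements + terminating NUL byte.
    total = 4 + len(body) + 1
    return struct.pack("<i", total) + body + b"\x00"


doc = {"hello": "world"}
as_bson = encode_bson(doc)
as_json = json.dumps(doc).encode("utf-8")

print(as_bson)       # b'\x16\x00\x00\x00\x02hello\x00\x06\x00\x00\x00world\x00\x00'
print(len(as_bson))  # 22
print(len(as_json))  # 18
```

The first four bytes give the size of the whole document, and each string carries its own length, so a program can jump straight to the next field instead of hunting for closing quotes and braces.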

Dealing with multiple nodes - CouchDB offers master-master replication, whereas MongoDB uses master-slave replication. So, if you anticipate nodes being disconnected for quite some time, CouchDB would be a good idea for you! If not, MongoDB will work just fine, and you avoid having to deal with CouchDB’s more complex (but, mind you, very safe) replication model.

Collections: MongoDB stores its data in smaller buckets called collections, whereas CouchDB stores all the documents of a database directly in one mammoth bucket! So, when querying, MongoDB knows exactly which collection to look in, which narrows the search. There goes another reason why MongoDB is faster!

Another good-to-know difference between the two databases: MongoDB is written in C++ and uses a binary protocol, while CouchDB is written in Erlang and uses HTTP/REST.

One example where using MongoDB would not be such a good idea is when, for some reason, some devices are cut off from the network and you have to get them back in sync. CouchDB makes this syncing very smooth and hence should be favored in this scenario.

In summary, it all boils down to one question.

Is safety more important to you or speed?

In that question lie all your answers! Well, at least ours did!

References:

https://foundationdb.com/key-value-store/white-papers/the-cap-theorem/

* We understand that this list of deciding factors is by no means comprehensive. We have simply attempted to write an article that lists the major reasons why we use MongoDB in class and that, at the same time, helps a newbie Big Data Internet surfer in their quest to understand the difference between the two databases. If you happen to know of more factors, or if you have any opinion on this article, please feel free to leave it in the comments!


Tuesday, September 9, 2014
