Web Dev: Document Databases (MongoDB)

an intro to nosql

You don't need to be familiar with traditional databases to read and understand this guide, but those systems have done the most to define document-based databases, mostly by contrast.

The world of NoSQL solutions is precisely that: they're written and explained in the context of how a SQL database would do it, hence the name NoSQL, since many of these databases directly contradict how a SQL database would do things. Databases like CouchDB, MongoDB, Cassandra, and Redis were all made because traditional databases just weren't cutting it when it came to sheer size or diversity of information.

In this guide I will be mostly focusing on the document database MongoDB, because I like it a lot, and I just learned it myself after using MySQL/MSSQL for most of my life. But the concepts will be similar in other NoSQL databases.

NoSQL vs RDBMS

Note: if you don't know anything about traditional relational databases, you don't need to read this section. It just highlights the differences between the two worlds.

In a traditional RDBMS (relational database management system) you have discrete databases, tables, columns, data types, rows, indexes, views, et cetera, all being queried with an implementation of SQL.

This sounds all well and good, but what happens when half your rows in a table don't use a given column? Every row still carries that column whether you use it or not (and depending on the storage engine, it may still take up space). It makes your data homogeneous instead of dynamic.

Traditional databases also rely heavily on complete consistency, atomic operations, and other assurances that everything is okay. As I've described before, traditional databases are very nervous and require a lot of hand-holding.

What does this mean? It's simple: when you insert a row into a table, the database system wants time to interpret that incoming data, insert it, index it, and assure you that it completed the entire transaction. And that's great! Most people would love that. Perfect for banks and stuff where data integrity is key.

But wait -- that sounds like a slow, tedious process! And it damn well is! What if I have 10,000 (or 10,000,000) people all hitting my database at once, reading and writing data!? Traditional databases don't scale well to fit this kind of demand.

The traditional model is to segment a database into replicated copies called slaves, which can only be read from, and then have a few master databases that are writable and push their updates to the slaves. So now when you insert data, not only does the master have to do all that integrity checking, but it has to assure the user that it pushed all that data to all the slaves!

This is where NoSQL solutions come into play. In today's world of hundreds of millions of users swarming around a dynamic website, speed is king. Never mind the whole integrity-checks thing, just get my data in the database RIGHT NOW!

Databases like Cassandra and MongoDB excel at this, but the cost is that they are eventually consistent as opposed to being transactional. They also don't have columns! They don't have schemas! They're willing to spread databases over as many other machines as needed instead of having a master-slave architecture.

so what the hell does document-based mean?

MongoDB is document-based instead of row-based and relation-based. It's broken down this way:

databases!

collections! these are inside a database!

documents! these are inside collections!
indexes! these help organize documents!

Again, comparing this to a SQL database: what's missing? Users! Rows! Columns!

here's a simple database

lol_db

lol_collection:

{
	'_id': ObjectId('a928aa001910000001'),
	'name': 'cyle',
	'job_title': 'awesome'
}

{
	'_id': ObjectId('a928aa001910000002'),
	'name': 'frankie',
	'job_title': 'not as awesome',
	'phone': '555-0291'
}

lol_db is the database name and lol_collection is a collection of documents inside that database. What are those two things within the { } inside the collection? Those are documents!

You see two documents: they each have the fields (not columns) "_id", "name", and "job_title", but frankie has an additional "phone" field! As I said, a document database has no columns or schema... it's all determined at the document-level instead of at some table-level. There is no "right way" to make a document inside a collection, you could have two documents that have totally different fields. It's completely up to you to make it how you want.
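To make the "no columns" point concrete, here's a minimal sketch in plain JavaScript. It is not MongoDB itself: the collection is just an in-memory array, and the simplified numeric _id values are an assumption for illustration (real ones would be ObjectIds).

```javascript
// Two documents living in the same (in-memory, illustrative) collection.
const lolCollection = [
  { _id: 1, name: 'cyle', job_title: 'awesome' },
  { _id: 2, name: 'frankie', job_title: 'not as awesome', phone: '555-0291' },
];

// List the field names of each document; nothing forces them to match.
const fieldsPerDocument = lolCollection.map((doc) => Object.keys(doc));

// Since any field can be missing, the code reading it has to cope with that.
const phones = lolCollection.map((doc) =>
  'phone' in doc ? doc.phone : 'no phone on file'
);

console.log(fieldsPerDocument);
console.log(phones);
```

The takeaway is that the "schema" lives in your program: whoever reads the documents decides what to do when a field isn't there.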

What real difference does that make? Well, it pushes all of the normalization out of the database's hands. If speed is king, you don't want your database to worry about normalizing the data or even checking that the data is normalized. The database doesn't care; it just wants to store your data!

This makes the database a very laid-back and fast component. Don't get me wrong: it still does some things for you, but really the only thing it cares about is that _id field. You don't have to put anything there: if you leave it out when inserting, MongoDB generates one for you! (You can supply your own _id if you want, as long as it's unique within the collection.) The only thing MongoDB is really concerned about is whether every document has a unique identifier. Everything else is up to you.

Data Types

While MongoDB doesn't care what data you put where, it does have available data types, and in some ways they're more expansive than what a traditional database gives you.

Basically you should stop thinking in terms of rows or columns or tables and instead think of the database as a collection of objects. Specifically, JSON objects. JSON is JavaScript Object Notation; it's a standardized way of writing out different data types in discrete ways. So what that means is that every time you see square brackets like this: [ ], that signifies an Array. Whenever you see curly braces, { }, that signifies an Object. It's just a way of converting these abstract data types into plain old text!

In MongoDB, you can have objects (also known as associative arrays or hashes), standard simple arrays, integers, floating point numbers, dates, or plain old strings. You can even have JavaScript code embedded as objects! Which is crazy!

here's a document that exemplifies some of these data types:

{
	'_id': ObjectId('a928abc9180001'), // <--- that's of the type ObjectId
	'name': 'cyle', // <--- that's of type String
	'age': 24, // <--- that's of type Number
	'start_date': new Date('2011-01-01T15:30:00-05:00'), // <--- that's of type Date (3:30PM EST)
	'coworkers': [ 'frankie', 'monty' ], // <--- that's a simple Array!
	'favorites': { 'color': 'red', 'food': 'chocolate' }, // <--- that's an Object!
	'skills': { 'languages': [ 'php', 'ruby', 'javascript' ], 'servers': [ 'debian', 'red hat', 'windows' ] } // <--- that's an object with arrays inside!
}

So you can see... individual documents can be very simple or very complicated. And that's great!
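If the "objects as plain old text" idea feels abstract, here's a small sketch in plain JavaScript (no MongoDB involved) that turns a document-like object into JSON text and back. One caveat: plain JSON has no date or ObjectId type, which is part of why MongoDB actually stores documents in BSON, a binary extension of JSON. So this only shows the types JSON itself covers.

```javascript
// A document-like object using only types that plain JSON can represent.
const doc = {
  name: 'cyle',
  age: 24,
  coworkers: ['frankie', 'monty'],
  favorites: { color: 'red', food: 'chocolate' },
};

// Object -> plain old text.
const asText = JSON.stringify(doc);

// Plain old text -> object again, with the structure intact.
const backAgain = JSON.parse(asText);

console.log(asText);
console.log(Array.isArray(backAgain.coworkers)); // square brackets came back as an Array
console.log(typeof backAgain.favorites);         // curly braces came back as an Object
```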

Ok but how do I do relationships?

Many nerds have wondered how relationships work if there's not a built-in mechanism to handle them. lol

Traditional databases have very strict relational data models available to you, which they assert for you. You can't do it that way in MongoDB; it's mostly put on the program-side. So when you write a PHP interface to MongoDB, you have to do data-checking on the PHP side instead of the database side.

But that doesn't stop relational modeling, it just means you have to worry about it on a different end. Luckily MongoDB provides a unique ID for every single document, so you can share that ID across documents. You can have ObjectId references to other documents inside a document!

So, for example, a blog database in MongoDB could have a collection of posts (each post being a document), but each post has a field called 'comments' that is an array of ObjectIds to comments in another collection. No big deal. You can put that information as an array directly in a document without having to make a separate way of pointing the two pools of data (the posts and the comments) at each other.

So you don't need a third collection to describe the relationships between the posts and the comments: they can be built into either the posts documents, the comments documents, or both!
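Here's a rough sketch of what "handling the relationship on the program side" can look like. It's plain JavaScript over in-memory arrays, not a real MongoDB query, and the ids ('p1', 'c1', ...), field names, and the commentsForPost helper are all made up for illustration.

```javascript
// Hypothetical "comments" collection.
const comments = [
  { _id: 'c1', author: 'frankie', text: 'nice post' },
  { _id: 'c2', author: 'monty', text: 'carrots!' },
  { _id: 'c3', author: 'cyle', text: 'unrelated' },
];

// Hypothetical "posts" collection: each post holds an array of comment ids.
const posts = [
  { _id: 'p1', title: 'hello world', comments: ['c1', 'c2'] },
];

// Resolve a post's comment ids into the actual comment documents.
// The database won't enforce these references, so dangling ids are skipped.
function commentsForPost(post, allComments) {
  const byId = new Map(allComments.map((c) => [c._id, c]));
  return post.comments
    .map((id) => byId.get(id))
    .filter((c) => c !== undefined);
}

const resolved = commentsForPost(posts[0], comments);
console.log(resolved.map((c) => c.text));
```

Against a real server you'd fetch the comments with a second query rather than from an array, but the shape of the work is the same: your code follows the ids, not the database.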

start using mongodb

You'll need to know how to connect to a server via the command line, and it'll need to have MongoDB installed. Once you're in, type this:

> mongo

(Alternatively, mongo actually has on their website a tutorial that works just like the actual shell, here: http://try.mongodb.org/)

Tada! You should now be in. Note that in a default setup like this one there are typically no users in MongoDB: access control exists, but it has to be turned on. Security is often based on the architecture around the server rather than within the database service. So, for example, you'd want to make sure that only specified machines can access your database, or put MongoDB on the same server as your application and limit it to accepting only local connections.

So mongo has its own prompt, just like mysql, and you start off in the "test" database, as it tells you.

mongo queries (anything but SQL)

In traditional databases, you use a language called SQL to access things. In MongoDB, you use an altogether different method. In fact, if you've used javascript, it looks a lot like javascript!

Remember that a document-based database is made up of collections and objects rather than tables and rows. The way to get and put in information reflects this. Here's an example of a "query" to select documents; you'd write this in mongo's command line shell:

> db.things.find({'name':'cyle'})

What does that do? It tries to find all the entries in the things collection that have the field 'name' set to 'cyle'. To be even more specific, it's running the find() method on the things collection-object, and the argument for that find() command is the object {'name': 'cyle'}. Notice that this is a JSON object.

What's even cooler is that this is very flexible in the way Mongo interprets it... for example, if the 'name' field was a string in most cases, but in other cases it was an array, mongo is smart enough to see if the value 'cyle' is within the arrays if it encounters one. So the documents themselves can be flexible, and querying for information will be similarly flexible.
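To show the idea (not Mongo's actual implementation), here's a tiny plain-JavaScript imitation of that matching rule: equal if the field is a plain value, "contains" if the field is an array. The matchesField helper is hypothetical.

```javascript
// Imitates how an equality query treats a field that may be a value or an array.
function matchesField(doc, field, wanted) {
  const value = doc[field];
  if (Array.isArray(value)) {
    return value.includes(wanted); // array field: match if it contains the value
  }
  return value === wanted; // plain field: match on equality
}

const things = [
  { name: 'cyle' },
  { name: ['cyle', 'c-dawg'] },
  { name: 'monty' },
];

// Rough equivalent of db.things.find({'name':'cyle'}) over the array above.
const found = things.filter((doc) => matchesField(doc, 'name', 'cyle'));
console.log(found.length);
```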

Here's how you add a document to a collection:

> db.things.insert({ 'name': 'monty', 'likes': ['books', 'carrots', 'dolphins'] })

It's as simple as that... you're just putting an object into the collection. Mongo doesn't care about whether that document fits into any kind of schema, because there is no schema! It only cares that the object is a valid JSON object.

Furthermore, you never need to create a collection the way you create a table; mongo will automatically create the collection the first time you put a document in it! So there was no "CREATE COLLECTION things" before adding documents to it. (You could explicitly create it, but you don't have to.)
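If you want a mental model for that, here's a toy in-memory version in plain JavaScript. TinyDb is a made-up class, not anything from MongoDB; it only exists to show "the collection appears on first insert".

```javascript
// A toy database: collections are created lazily, on the first insert.
class TinyDb {
  constructor() {
    this.collections = new Map();
  }

  insert(collectionName, doc) {
    if (!this.collections.has(collectionName)) {
      this.collections.set(collectionName, []); // no explicit "create" step needed
    }
    this.collections.get(collectionName).push(doc);
  }

  collectionNames() {
    return [...this.collections.keys()];
  }
}

const db = new TinyDb();
const before = db.collectionNames(); // nothing exists yet
db.insert('things', { name: 'monty', likes: ['books', 'carrots', 'dolphins'] });
const after = db.collectionNames(); // 'things' now exists

console.log(before, after);
```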

mongo's considerations

As I said, mongo is very laid-back. It's very relaxed, yet it's very fast. It is also important to remember that it is used to handling large amounts of data, so the stance it takes to selecting and updating documents is one of very few assumptions. Let me tell you what I mean.

In a traditional database, if I were to run this SQL statement:

UPDATE some_table SET what='hahaha';

That command would update every single row in some_table with the new value for that column. A sweeping change that could take a long time if there are a lot of rows. Let's see that same command in mongo's language:

> db.things.update({}, {'what': 'hahaha'})

What does that do? It finds the first thing that matches the criteria (the empty {}, which means anything) and then changes the entire object to be { 'what': 'hahaha' }. So whatever object was there is now entirely replaced (only its _id is kept). It updates on the document level as opposed to on the field level.

You might have expected it to find all the records in the collection and update the 'what' field to be the value 'hahaha' on all of them. But that's not how mongodb works, and that's a good and bad thing. You have to be a bit more explicit, because mongo is so laid-back and unassuming about what you want.

Here's what it would actually have to look like:

> db.things.update({}, { $set: {'what': 'hahaha'} }, false, true)

Oh boy, that just got a little more complicated. That's okay, let's break this down step by step.

You're running the update() method on the things collection. The update() method can take up to four arguments.

The first argument is the criteria, or "what do I update?" You can think of it like a "where" clause in a SQL statement. The blank object, { }, means any document.
The second argument is what to do. In this case, we want it to $set the 'what' field to 'hahaha'. The explicit $set is needed if you don't want to replace the whole document; instead, it will find the "what" field if it exists and replace what's there with "hahaha", or if the document does not already have a "what" field, it'll add it!
The third argument is upserting, which sounds strange, but it's awesome. If this is set to true, it will insert this object if it doesn't exist! So it's essentially saying "if nothing matches the criteria, then there's nothing to update, but we want this information in the database, so make a new document with it inside!" This can be incredibly powerful as it replaces potentially three queries with only one. (In the example above it's left off.)
The fourth argument, if true, means do this for every single document that fits the criteria. This is implicit in a SQL statement; it needs to be explicit in a mongo command.
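The four arguments are easier to keep straight once you see them act on data. Below is a simplified in-memory imitation in plain JavaScript. It is a sketch of the behavior described above, not MongoDB's code: it only handles equality criteria and $set, and the update function and nextId counter are hypothetical.

```javascript
// Does a document match simple equality criteria? {} matches everything.
function matches(doc, criteria) {
  return Object.keys(criteria).every((key) => doc[key] === criteria[key]);
}

// { $set: {...} } touches only the named fields;
// a plain object replaces the whole document, keeping only _id.
function applyChange(doc, change) {
  if (change.$set) return { ...doc, ...change.$set };
  return { _id: doc._id, ...change };
}

let nextId = 1; // hypothetical stand-in for generated ids

function update(collection, criteria, change, upsert = false, multi = false) {
  let updated = 0;
  for (let i = 0; i < collection.length; i++) {
    if (!matches(collection[i], criteria)) continue;
    collection[i] = applyChange(collection[i], change);
    updated++;
    if (!multi) break; // without multi, stop after the first match
  }
  if (updated === 0 && upsert) {
    // nothing matched: build a new document from the criteria plus the change
    collection.push(applyChange({ _id: nextId++, ...criteria }, change));
  }
  return updated;
}

const things = [
  { _id: 'a', name: 'cyle', age: 24 },
  { _id: 'b', name: 'monty', age: 30 },
];

// Replacement: only the first document, and its other fields are gone.
update(things, {}, { what: 'hahaha' });
const afterReplace = JSON.parse(JSON.stringify(things));

// $set with multi: every document gains the field and keeps the rest.
update(things, {}, { $set: { what: 'hahaha' } }, false, true);

// Upsert: no 'bryce' exists, so one is created.
update(things, { name: 'bryce' }, { $set: { age: 27 } }, true);

console.log(afterReplace);
console.log(things);
```

Note how the first call quietly throws away cyle's name and age. That's the surprise the $set form avoids.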

In this same vein of no assumptions, when you are querying for documents in a collection, mongo technically does not give you those records immediately: it gives you a cursor, or reference, to them. With a SQL SELECT, you typically think of getting the whole result set dumped on you. Mongo just gives you a pointer to it. This saves memory!

So when you run this command to find all documents in a collection:

> db.things.find({})

It's actually just returning a cursor to those documents, which you can capture and then sort, limit, and offset. With a typical SQL client, you tend to grab all those rows and then perform actions on the whole chunk of them. Take a look at this:

> db.things.find({}).sort({'name':1}).limit(10)

See how mongo wants you to daisy-chain commands on top of each other like in JavaScript or other object-oriented languages?

It's essentially saying this: in the things collection, find every document. With every document, sort it by the name field. With that sorted list, limit it to the top 10. You haven't actually touched or retrieved any documents yet; it's only been using lightweight references to the possible collection.

And out of that comes a cursor which represents those documents! Not the actual documents! You then need to iterate with the cursor to go through each individual document.

That way, your code handles one document at a time (the driver fetches them from the server in batches behind the scenes), which conserves memory.
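The "nothing is fetched until you iterate" idea can be imitated with JavaScript generators. This is only an analogy in plain JavaScript, not how the real driver is built: cursorOver and limit are made-up helpers, and sorting is left out because a real sort needs to see every document before it can hand back the first one.

```javascript
let fetched = 0; // counts how many documents were actually handed out

// A lazy "cursor": documents are only produced when someone asks for the next one.
function* cursorOver(documents) {
  for (const doc of documents) {
    fetched++;
    yield doc;
  }
}

// Wraps a cursor so that it stops after n documents.
function* limit(cursor, n) {
  if (n <= 0) return;
  let taken = 0;
  for (const doc of cursor) {
    yield doc;
    taken++;
    if (taken >= n) return; // stop before pulling another document
  }
}

const things = [{ name: 'a' }, { name: 'b' }, { name: 'c' }, { name: 'd' }];

// Building the chain touches no documents at all.
const cursor = limit(cursorOver(things), 2);
const fetchedBeforeIterating = fetched;

// Only iterating pulls documents through, one at a time.
const names = [];
for (const doc of cursor) {
  names.push(doc.name);
}

console.log(fetchedBeforeIterating, names, fetched);
```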

practical example

As with MySQL and every other database, it's not just about what mongo wants you to do, but how the programming languages you use interface with the database. The mongo shell might want you to use the commands I described above, but the PHP API for MongoDB does things differently (albeit very similarly). This example shows how PHP would talk to MongoDB.

Note that you need to have PHP 5, PECL, and the MongoDB PECL extension installed.

<?php

// this gets 5 documents sorted by the "name" field from a collection

$m = new Mongo(); // initialize the MongoDB driver
$db = $m->lol_db; // select which database within our MongoDB server to use
$things = $db->things; // select a collection within that database

$thing_cursor = $things->find()->sort(array('name'=>1))->limit(5); // do the "query" which returns a cursor

foreach ($thing_cursor as $thing) {
	echo '<div><pre>';
	print_r($thing); // show us what's in the fetched document!
	echo '</pre></div>';
}

?>

You can see how this line in PHP:

$thing_cursor = $things->find()->sort(array('name'=>1))->limit(5);

Would look like this line in the mongo shell:

var thing_cursor = db.things.find({}).sort({'name':1}).limit(5)

The vast majority of mongo's shell-functions are written this way in the programming languages which interface with it.

Also note how we use a foreach loop to go through the documents given to us by the cursor. It'll only give us one document at a time unless we actually want it to dump all of them on us.

Similarly, inserting data through PHP looks like this:

<?php

$m = new Mongo(); // initialize the MongoDB driver
$db = $m->lol_db; // select which database within our MongoDB server to use
$things = $db->things; // select a collection within that database

// the PHP associative array which will be "translated" into a document:
$new_thing = array('name' => 'bryce', 'age' => 27, 'fav_colors' => array('green', 'brown')); 

$things->insert($new_thing); // that's it!

?>

Here you can see that we create a new associative array called $new_thing and we pass it to MongoDB to store. The MongoDB-PHP API does the translating for us; it turns PHP's associative array into a document (stored as BSON, MongoDB's binary flavor of JSON).

Check out MongoDB's PHP docs for more info!

one more consideration

The true power of laid-back document-based databases really shines when you enter an asynchronous world, like the one provided by fast platforms like node.js.

This is because MongoDB wants to just do it, as fast as possible! And even if that's not fast enough, it wants you to keep working and trust that the data will be saved eventually.

What do I mean by asynchronously? Well, look at synchronous (aka "blocking") programming like PHP: it executes the script one line at a time and waits for the functions on the line to be done before moving on.

This means that PHP will wait for MySQL to insert the row before doing anything else. That's blocking. The opposite is a script that just keeps going and trusts that MySQL will insert that row.

Think of it this way: you're on a basketball team and you have the ball and you're dribbling down court with it. A blocking scripting language wants you to take that ball all the way to the basket as if you were the only one playing, busting through anyone opposing you, no matter how long it takes. Conversely, an async language is perfectly willing to pass the ball (your data) to another player and let them worry about it so that you can run up ahead unimpeded. While that other player has the ball, you can do whatever you want! You could go have tea! You could read a book! Or do nothing at all and be idle! Then when that other player throws the ball back to you, you've already spent that time doing whatever else you had to do, using your time most efficiently. Does that make sense? I hope so.
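Here's the basketball analogy as a tiny JavaScript sketch. There's no database in it: fakeInsert is a hypothetical stand-in that just waits 10 milliseconds, so treat this as an illustration of "pass the ball and keep running", not as real driver code.

```javascript
const events = [];

// Hypothetical stand-in for a database write that takes a little while.
function fakeInsert(doc) {
  return new Promise((resolve) => {
    setTimeout(() => {
      events.push('insert finished: ' + doc.name);
      resolve(doc);
    }, 10);
  });
}

async function main() {
  const pending = fakeInsert({ name: 'bryce' }); // pass the ball
  events.push('doing other work');               // keep running up the court
  await pending;                                 // catch the ball when it comes back
  events.push('done');
  return events;
}

const finished = main();
finished.then((log) => console.log(log));
```

The line after fakeInsert runs before the insert finishes, which is exactly what a blocking script wouldn't let you do.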

But I'll cover async stuff in my node.js guide. It's a crazy paradigm-shift that is now happening in a big way.

in concluuuusion!

MongoDB and other document-based databases are very powerful and have a lot of potential. They're extremely useful in high-end scalable massive data stores.

They're not right for every project... sometimes a transactional atomic traditional MySQL database is the right way to go. Sometimes not. MongoDB is just another tool in your belt.

As always, email me if you have any suggestions/comments for this guide. cyle_gage@emerson.edu