Web Dev: Relational Databases (MySQL)

intro

Oh lawds, databases! There are so many to choose from, but the most obvious choice is MySQL. It's pretty much a defacto standard for a lot of web platforms. More than I can name here.

This will cover the very basics of SQL-like databases and then get into more advanced topics like what the hell relational means, what is normalization, and how best to architect a schema!

what you need!

MySQL installed on a server. I'm going to throw things out mostly as mysql command-line stuff, so it helps to know how to get into the mysql command line program. Many of the concepts are easily transferrable to GUIs like phpmyadmin, but I don't want to teach you through a GUI. I want to teach you how it actually works.

a database is what

a traditional database service is pretty simple. you have the following components:

databases!

tables! these are inside a database!

columns! these are inside tables! which are inside databases!
rows! these are inside tables! which are inside databases!
keys/indexes! this is a bit more complicated.

users! or, who can access a database, or a table inside a database

here's a simple database

lol_db

lol_tbl
id	name	job_title
1	cyle	awesome
2	frankie	not as awesome

lol_db is the database name and it contains the table lol_tbl. that table has columns id, name, and job_title. It has two rows of information.

what I can't easily illustrate in that little mockup is that:

the id column is of type int and is the primary key or index
the name column is of type varchar(255)
the job_title column is of type varchar(255)

So simply put: a database is a container of tables. A table is a container of rows. Those rows contain information. The kind of information in those rows is dictated by the columns.

For example, the id column only accepts numbers, specifically integers. The name column only allows up to 255 characters of text.

This is called the model or schema of the database. Those two terms mean how the data is structured, rather than being concerned about the content of the data. It's like designing a building: you need to be conscious of what the building will be used for, but it's not so much about who is going to fill the offices, rather just that it will be offices inside the building so we need to structure it for that.

Databases can contain as many tables as you want. MySQL can contain however many databases you want.

A real world example: a blog would probably only need one database. What does a blog contain? Let's say posts, authors, and comments. So we'd need a table for posts, a table for comments, right? The posts table would have columns for post ID number, the post title, the post content, the date, and the author. Comments would be similar.

Column/Data Types

Most every traditional database has strict data types for use with a column. Every column has to know what kind of data to accept into it. Here are the most common:

number (int, bigint, smallint, float) - numbers! no text. int is the most basic, it just stores a solid number. float stores a floating-point number like 1.3. int is useful for auto-incrementing unique ID numbers.
text (varchar, text) - text! varchar means variable character (between 0 and 65,535 characters, you specify the limit) and text means just a big huge mass of text.
date (datetime, timestamp) - stores dates in a server-friendly format. useful if you have to sort things by date.

that list is not exhaustive. there are lots of in-betweens and specificities. read the mysql docs for more detailed information.

Keys and Indexes

Keys/Indexes (sometimes they're interchangeable terms) are important for the performance of a table. They basically give a simple way for the table to be organized. You do this through one or many primary, unique, and foreign keys. They are what they sound like, mostly.

Keys apply to columns! So every row has an entry in that index.

A primary key is the first index the table will use. Most often, this'll be an auto-incrementing number, starting at 1. Every record has one, and each is unique.

A unique key means there will only one of each, obviously. Again, useful for a unique auto-incrementing number. Or a unique username.

A foreign key is a reference to a primary key in another table. We'll get into this later.

These are used so that the table can quickly move through the rows to find what you're looking for. It's basically just something the table can be assured of. Traditional databases need this kind of hand-holding because they're very nervous.

SOOO setting up a table

... is as simple as doing the following:

what does this table need to hold?
what will be the primary means of sorting it?
what piece could be used as a key/index?

Normally when you think of a blog, you don't think of ID numbers. But think of this: there needs to be some way to uniquely pick out a blog post. You could certainly use the post's title, but what if you use it again? You can't look it up that way.

So you abstract the data by adding an immutable ID number which will be unique for every post. Now you don't need to worry about those title-collisions.

SQL

Structured Query Language is the most common means by accessing a traditional database. You write queries for the data you want from a table in a database. (Note: in these examples I end each statement with a semicolon, which you need to do when using mysql via command line, but usually not if you're going through PHP or other langauge.)

A query for the above table example could look like this:

SELECT * FROM lol_tbl;

What does that do? It selects all columns (the asterisk) from the table lol_tbl. Let's be more specific.

SELECT name FROM lol_tbl ORDER BY name ASC;

Can you guess what that does? It selects all the names from lol_tbl and orders them by the name column alphabetically (since the name column is text) in ascending order (A to Z). Here's another one:

SELECT title FROM lol_tbl WHERE name='cyle';

That selects the title from lol_tbl for the row where the name is 'cyle'. Notice the quotation marks. When looking for text, you need to use them.

So what did you just learn? How to get everything from a table, how to sort it, how to get individual pieces, and how to select a specific row. Inserting things is just as easy.

INSERT INTO lol_tbl (name, title) VALUES ('monty', 'pretty cool');

Again, it's pretty straightforward. (At least I think, but I've been doing this for awhile.) You are inserting into lol_tbl, specifically into the columns name and title, the values 'monty' and 'pretty cool'. So that the "name" will be "monty" and the "title" will be "pretty cool". See how they correspond in order like that?

Similarly, there are statements for UPDATE and DELETEing content from the table. These four commands are typically referred to as basic CRUD functions, or Create, Read, Update, Delete. In MySQL's case, these are represented as INSERT, SELECT, UPDATE, and DELETE.

SQL statements are made up of clauses that make up the parameters for a statement. For example, the above statements have ORDER BY clauses, WHERE clauses, etc.

There are lots of other commands to do everything from CREATE TABLE to DROP DATABASE, but most of your work will be done with the four CRUD statements and their idiosyncrasies.

Users and Building Things with SQL

In order to connect to your database, you need to establish users with passwords and access settings to specific databases or tables.

You can log in as root in the command line using the mysql command, like so:

> mysql -u root -p

(If you don't know how to do this stuff via command line, you should probably read about linux servers.)

When you do that command, you'll need to input the root password, and then you'll get a mysql prompt. This is a lot like the shell prompt, except now instead of living in a file system, you're living in a database service. Here you can directly run SQL commands.

Type exit at any time to leave and go back to the shell.

mysql> SHOW DATABASES;

That'll show you all the databases currently in the service. Note that you don't need to use all-caps, I just like to. Make sure you remember the semicolon!

mysql> CREATE DATABASE lol_wut;

that'll make a database. super simple. but how do we access it? we need a user.

mysql> CREATE USER 'lol_user'@'localhost' IDENTIFIED BY 'password';

What are you seeing here? The 'lol_user'@'localhost' means the username will be lol_user and it will only be able to connect if it's on the same server (hence localhost). This is very secure if your MySQL database is on the same machine as your application (PHP, whatever) files.

And of course it has to have a password... ok so you've made a user. But how do you set permissions on the database you just made?

mysql> GRANT ALL ON lol_wut.* TO 'lol_user'@'localhost';

This grants all permissions on the lol_wut database (and all its tables, that's what the .* signifies) to the user we just made. So now you can access this database from another programming language and use it.

But let's go further: how do we actually make a table without having to go into phpmyadmin? First of all, make sure you're using the new database:

mysql> USE lol_wut;

And now check this out:

mysql> CREATE TABLE test_table ( id INT NOT NULL AUTO_INCREMENT PRIMARY KEY, name VARCHAR(255) NULL, wut VARCHAR(255) NULL );

Can you figure out how the table was created? I'll break it down into pieces.

CREATE TABLE test_table just lets you name the table.
( ) whatever is inside the parentheses following the CREATE TABLE [name] part will be the table's columns, in a comma-separated list.
id INT NOT NULL AUTO_INCREMENT PRIMARY KEY means make a column, call it id, of type INT, don't let it be NULL (basically this means "throw an error if it doesn't get a value"), the database should auto-increment it, and it's the primary key for the table. Most tables should have something like this.
The two columns after that are similar but simpler: name VARCHAR(255) NULL means make a column called name of type varchar (with a maximum of 255 characters) and allow it to be null if need be.

Now the table is created! We can start INSERTing data into it, and then SELECT data out of it.

I won't go into how to ALTER TABLE or stuff like that, I'll let the language reference tell you about that: http://dev.mysql.com/doc/refman/5.1/en/sql-syntax.html (specifically the first two sections)

Relational Architecture

Once you can think in terms of structuring data, you need to start thinking in terms of relational data and how to normalize data. What does this mean? We'll start with relationships.

Relational data is pretty simple. I'll go back to the blog schema I described earlier: a table for posts, a table for comments. But wait - how does the database know which comments go with which blog posts?

You could do this all on the application side, whether it's PHP or Ruby or Python or whatever. Alternatively, you could do it on the database side. Here's how the two tables look:

blog_db

posts_tbl
id	title	content	author
1	coolest post ever	this is the best blog post ever!	cyle
2	second coolest	not as awesome, but still awesome!	cyle

comments_tbl
id	comment	author
1	wow, so cool!	frankie
2	that's pretty cool!	monty

Okay so right now we have no way of knowing what comment goes to what post. How do we fix that? We add a relationship. Not like a love interest, but just a mutual understanding.

Let's add a column to the comments_tbl called post_id which contains the ID (in this case a foreign ID) of the post that the comment was for. Holy shit! We've just done something relational! The new comments table:

blog_db

comments_tbl
id	comment	author	post_id
1	wow, so cool!	frankie	1
2	that's pretty cool!	monty	1

That's awesome. So we could easily use this query to get the latest blog posts:

SELECT * FROM posts_tbl ORDER BY id DESC;

We use descending to get latest first. We could use this query to get the latest comments for a blog post:

SELECT * FROM comments_tbl WHERE post_id=# ORDER BY id DESC;

Replace the # with the post ID, of course. That's all there really is to relational database architecture on the concept side, but how far you go with it is determined by how you normalize the data.

Normalized Architecture

Normalization is the process of creating a kind of standard for how the data relates across tables to minimize redundancy. Here's a good example.

Again using our blog. You see cyle used repeatedly as the author of both posts. But what if I decide to change my username? I'd have to do an UPDATE query for all instances in every table! That's kind of heavy if our blog gets really big.

How do we fix that problem? Normalize! Let's add an authors table!

blog_db

authors_tbl
id	name
1	cyle
2	frankie
3	monty

Now let's change how the other tables store their information, in a relational and normalized way!

blog_db

posts_tbl
id	title	content	author_id
1	coolest post ever	this is the best blog post ever!	1
2	second coolest	not as awesome, but still awesome!	1

comments_tbl
id	comment	post_id	author_id
1	wow, so cool!	1	2
2	that's pretty cool!	1	3

So the same way we could get comments for a certain post ID, we can get the author's name for each post ID like so:

SELECT name FROM authors_tbl WHERE id=#;

With # being the author_id from another table.

But wait... this is terribly inefficient. Think about this: there is only one possible name per author ID, so why run two queries to get this data? Surely we can do it in one!

Multi-table queries, JOINs, subqueries, oh my!

We can do a lot across tables or by subqueries of tables. I'm just going to shove this in your face and see if you understand it:

SELECT posts_tbl.*, authors_tbl.name FROM posts_tbl, authors_tbl WHERE authors_tbl.id = posts_tbl.author_id ORDER BY posts_tbl.id DESC;

OH JEEZ that's a big query! What is it getting? Well, read it slowly: it's getting all the information from the posts_tbl, as well as the author's name from authors_tbl, and it's saying in the WHERE clause that for every author_id in the posts table there is a corresponding ID in the authors_tbl. And sort it by the post's ID number in descending order.

We could also do this with a JOIN clause:

SELECT posts_tbl.*, authors_tbl.name FROM posts_tbl JOIN authors_tbl ON authors_tbl.id = posts_tbl.author_id ORDER BY posts_tbl.id DESC;

That does the same thing as the above statement. I always find JOINs to be a bit more confusing, but it's personal preference. JOIN used to be the most efficient way of doing things, but now subqueries are just as effective.

The JOIN version is technically better because the first example will fail if there is no corresponding ID row in the authors_tbl. The JOIN wouldn't fail; it would simply not have a name, and continue on.

So what would a subquery look like? Well let's say we wanted all of cyle's latest posts:

SELECT * FROM posts_tbl WHERE author_id IN (SELECT id FROM authors_tbl WHERE name='cyle') ORDER BY id DESC;

You can see there is a statement within a statement, hence subquery. (You can go even deeper, Inception-style, if you want.)

The subquery will return the ID # for the user named 'cyle' and give that to the WHERE clause in the parent query! Wow that sounds fun. I hope you're following along.

So what is normalization? It's making sure all your data fits this kind of model. It's replacing redundancies in data with foreign keys. It's turning what seemed like it could be two tables into four or five.

Other ways of normalizing and relating!

You could make a table that exists between two tables, linking them together! Like so:

blog_db

posts_tbl
id	title	content
1	coolest post ever	this is the best blog post ever!
2	second coolest	not as awesome, but still awesome!

comments_tbl
id	comment
1	wow, so cool!
2	that's pretty cool!

comments_to_posts_tbl
id	post_id	comment_id
1	1	1
2	1	2

You understand that? There is a third table which simply describes which comment IDs go to which post IDs. That's a totally valid way of doing it. It's especially useful for a system in which one thing may go to many things or many things may go to many other things.

With the previous schema, a single comment could only relate to one post. With this setup, one comment could relate to as many posts as you want because the relationship is outside the comment itself. Get the difference? It's all about architecting how the data will flow together.

How would you write the SQL query to get the latest comments for a given blog post? Maybe like so:

SELECT * FROM comments_tbl WHERE id IN (SELECT comment_id FROM comments_to_posts_tbl WHERE post_id=#) ORDER BY id DESC;

Again, replace the # in there with the current post ID.

How to Get Here from There

The process of architecting these databases is not easy. Some of it, like making authors based on ID rather than by name, seems simple in hindsight, but it's tough to forecast what data needs to be normalized until you have a ton of it.

I find drawing out a web of what data is needed to be the most effective means of figuring it out. As I did before with the blog, ask yourself: what pieces of data makes up a blog? How do they relate? What can be normalized?

Literally sitting down with a piece of blank paper and a pen is often the easiest way of figuring it out.

A Practical Example: PHP/MySQL

Oh boy oh boy. Here's a sample of how the blog would work when accessed in MySQL via PHP using the lol_db example from above. This example gets the latest blog posts, the author name, and their comments.

<html>
<head>
<title>lol blog<title>
<head>
<body>
<?php

mysql_connect('localhost', 'db_username', 'db_password');
mysql_select_db('lol_db');

$posts_query = mysql_query('SELECT posts_tbl.*, authors_tbl.name FROM posts_tbl, authors_tbl WHERE authors_tbl.id=posts_tbl.author_id ORDER BY id DESC');
while ($post = mysql_fetch_assoc($posts_query)) {
	echo '<div class="post">';
	echo '<h1>'.$post['title'].'</h1>';
	echo '<p class="post_author">by '.$post['name'].'</p>';
	echo '<p class="post_content">'.$post['content'].'</p>';
	echo '<div class="comments">';
	$comments_query = mysql_query('SELECT comments_tbl.*, authors_tbl.name FROM comments_tbl, authors_tbl WHERE authors_tbl.id=comments_tbl.author_id AND post_id='.$post['id']);
	while ($comment = mysql_fetch_assoc($comments_query)) {
		echo '<p class="comment">'.$comment['comment'].' (by '.$comment['name']).'</p>';
	}
	echo '</div>';
	echo '</div>';
}

?>
</body>
</html>

Important concepts to notice:

Connecting to the database service and selecting the database are two discreet steps.
Executing the query is literally a SQL statement passed to a PHP function.
You have to iterate with a loop (in this case, a while loop) all the rows returned by the query.
In PHP, the easiest way to use a row is as if it were an associative array, via the mysql_fetch_assoc() function.

in conclusiooo

Traditional databases are great for a lot of stuff. A lot of simple (relatively simple) things that have very definable needs and data structures. But there are better solutions for other problems. Traditional databases don't scale up well; they're not very fast when a huge amount of people need to use them. Read the next guide for an alternative approach.