back to contents

Web Dev: Relational Databases (MySQL)

intro

Oh lawds, databases! There are so many to choose from, but the most obvious choice is MySQL. It's pretty much a defacto standard for a lot of web platforms. More than I can name here.

This will cover the very basics of SQL-like databases and then get into more advanced topics like what the hell relational means, what is normalization, and how best to architect a schema!

what you need!

MySQL installed on a server. I'm going to throw things out mostly as mysql command-line stuff, so it helps to know how to get into the mysql command line program. Many of the concepts are easily transferrable to GUIs like phpmyadmin, but I don't want to teach you through a GUI. I want to teach you how it actually works.

a database is what

a traditional database service is pretty simple. you have the following components:

here's a simple database

lol_db
lol_tbl
idnamejob_title
1cyleawesome
2frankienot as awesome

lol_db is the database name and it contains the table lol_tbl. that table has columns id, name, and job_title. It has two rows of information.

what I can't easily illustrate in that little mockup is that:

So simply put: a database is a container of tables. A table is a container of rows. Those rows contain information. The kind of information in those rows is dictated by the columns.

For example, the id column only accepts numbers, specifically integers. The name column only allows up to 255 characters of text.

This is called the model or schema of the database. Those two terms mean how the data is structured, rather than being concerned about the content of the data. It's like designing a building: you need to be conscious of what the building will be used for, but it's not so much about who is going to fill the offices, rather just that it will be offices inside the building so we need to structure it for that.

Databases can contain as many tables as you want. MySQL can contain however many databases you want.

A real world example: a blog would probably only need one database. What does a blog contain? Let's say posts, authors, and comments. So we'd need a table for posts, a table for comments, right? The posts table would have columns for post ID number, the post title, the post content, the date, and the author. Comments would be similar.

Column/Data Types

Most every traditional database has strict data types for use with a column. Every column has to know what kind of data to accept into it. Here are the most common:

that list is not exhaustive. there are lots of in-betweens and specificities. read the mysql docs for more detailed information.

Keys and Indexes

Keys/Indexes (sometimes they're interchangeable terms) are important for the performance of a table. They basically give a simple way for the table to be organized. You do this through one or many primary, unique, and foreign keys. They are what they sound like, mostly.

Keys apply to columns! So every row has an entry in that index.

A primary key is the first index the table will use. Most often, this'll be an auto-incrementing number, starting at 1. Every record has one, and each is unique.

A unique key means there will only one of each, obviously. Again, useful for a unique auto-incrementing number. Or a unique username.

A foreign key is a reference to a primary key in another table. We'll get into this later.

These are used so that the table can quickly move through the rows to find what you're looking for. It's basically just something the table can be assured of. Traditional databases need this kind of hand-holding because they're very nervous.

SOOO setting up a table

... is as simple as doing the following:

  1. what does this table need to hold?
  2. what will be the primary means of sorting it?
  3. what piece could be used as a key/index?

Normally when you think of a blog, you don't think of ID numbers. But think of this: there needs to be some way to uniquely pick out a blog post. You could certainly use the post's title, but what if you use it again? You can't look it up that way.

So you abstract the data by adding an immutable ID number which will be unique for every post. Now you don't need to worry about those title-collisions.

SQL

Structured Query Language is the most common means by accessing a traditional database. You write queries for the data you want from a table in a database. (Note: in these examples I end each statement with a semicolon, which you need to do when using mysql via command line, but usually not if you're going through PHP or other langauge.)

A query for the above table example could look like this:

SELECT * FROM lol_tbl;

What does that do? It selects all columns (the asterisk) from the table lol_tbl. Let's be more specific.

SELECT name FROM lol_tbl ORDER BY name ASC;

Can you guess what that does? It selects all the names from lol_tbl and orders them by the name column alphabetically (since the name column is text) in ascending order (A to Z). Here's another one:

SELECT title FROM lol_tbl WHERE name='cyle';

That selects the title from lol_tbl for the row where the name is 'cyle'. Notice the quotation marks. When looking for text, you need to use them.

So what did you just learn? How to get everything from a table, how to sort it, how to get individual pieces, and how to select a specific row. Inserting things is just as easy.

INSERT INTO lol_tbl (name, title) VALUES ('monty', 'pretty cool');

Again, it's pretty straightforward. (At least I think, but I've been doing this for awhile.) You are inserting into lol_tbl, specifically into the columns name and title, the values 'monty' and 'pretty cool'. So that the "name" will be "monty" and the "title" will be "pretty cool". See how they correspond in order like that?

Similarly, there are statements for UPDATE and DELETEing content from the table. These four commands are typically referred to as basic CRUD functions, or Create, Read, Update, Delete. In MySQL's case, these are represented as INSERT, SELECT, UPDATE, and DELETE.

SQL statements are made up of clauses that make up the parameters for a statement. For example, the above statements have ORDER BY clauses, WHERE clauses, etc.

There are lots of other commands to do everything from CREATE TABLE to DROP DATABASE, but most of your work will be done with the four CRUD statements and their idiosyncrasies.

Users and Building Things with SQL

In order to connect to your database, you need to establish users with passwords and access settings to specific databases or tables.

You can log in as root in the command line using the mysql command, like so:

> mysql -u root -p

(If you don't know how to do this stuff via command line, you should probably read about linux servers.)

When you do that command, you'll need to input the root password, and then you'll get a mysql prompt. This is a lot like the shell prompt, except now instead of living in a file system, you're living in a database service. Here you can directly run SQL commands.

Type exit at any time to leave and go back to the shell.

mysql> SHOW DATABASES;

That'll show you all the databases currently in the service. Note that you don't need to use all-caps, I just like to. Make sure you remember the semicolon!

mysql> CREATE DATABASE lol_wut;

that'll make a database. super simple. but how do we access it? we need a user.

mysql> CREATE USER 'lol_user'@'localhost' IDENTIFIED BY 'password';

What are you seeing here? The 'lol_user'@'localhost' means the username will be lol_user and it will only be able to connect if it's on the same server (hence localhost). This is very secure if your MySQL database is on the same machine as your application (PHP, whatever) files.

And of course it has to have a password... ok so you've made a user. But how do you set permissions on the database you just made?

mysql> GRANT ALL ON lol_wut.* TO 'lol_user'@'localhost';

This grants all permissions on the lol_wut database (and all its tables, that's what the .* signifies) to the user we just made. So now you can access this database from another programming language and use it.

But let's go further: how do we actually make a table without having to go into phpmyadmin? First of all, make sure you're using the new database:

mysql> USE lol_wut;

And now check this out:

mysql> CREATE TABLE test_table ( id INT NOT NULL AUTO_INCREMENT PRIMARY KEY, name VARCHAR(255) NULL, wut VARCHAR(255) NULL );

Can you figure out how the table was created? I'll break it down into pieces.

Now the table is created! We can start INSERTing data into it, and then SELECT data out of it.

I won't go into how to ALTER TABLE or stuff like that, I'll let the language reference tell you about that: http://dev.mysql.com/doc/refman/5.1/en/sql-syntax.html (specifically the first two sections)

Relational Architecture

Once you can think in terms of structuring data, you need to start thinking in terms of relational data and how to normalize data. What does this mean? We'll start with relationships.

Relational data is pretty simple. I'll go back to the blog schema I described earlier: a table for posts, a table for comments. But wait - how does the database know which comments go with which blog posts?

You could do this all on the application side, whether it's PHP or Ruby or Python or whatever. Alternatively, you could do it on the database side. Here's how the two tables look:

blog_db
posts_tbl
idtitlecontentauthor
1coolest post everthis is the best blog post ever!cyle
2second coolestnot as awesome, but still awesome!cyle
comments_tbl
idcommentauthor
1wow, so cool!frankie
2that's pretty cool!monty

Okay so right now we have no way of knowing what comment goes to what post. How do we fix that? We add a relationship. Not like a love interest, but just a mutual understanding.

Let's add a column to the comments_tbl called post_id which contains the ID (in this case a foreign ID) of the post that the comment was for. Holy shit! We've just done something relational! The new comments table:

blog_db
comments_tbl
idcommentauthorpost_id
1wow, so cool!frankie1
2that's pretty cool!monty1

That's awesome. So we could easily use this query to get the latest blog posts:

SELECT * FROM posts_tbl ORDER BY id DESC;

We use descending to get latest first. We could use this query to get the latest comments for a blog post:

SELECT * FROM comments_tbl WHERE post_id=# ORDER BY id DESC;

Replace the # with the post ID, of course. That's all there really is to relational database architecture on the concept side, but how far you go with it is determined by how you normalize the data.

Normalized Architecture

Normalization is the process of creating a kind of standard for how the data relates across tables to minimize redundancy. Here's a good example.

Again using our blog. You see cyle used repeatedly as the author of both posts. But what if I decide to change my username? I'd have to do an UPDATE query for all instances in every table! That's kind of heavy if our blog gets really big.

How do we fix that problem? Normalize! Let's add an authors table!

blog_db
authors_tbl
idname
1cyle
2frankie
3monty

Now let's change how the other tables store their information, in a relational and normalized way!

blog_db
posts_tbl
idtitlecontentauthor_id
1coolest post everthis is the best blog post ever!1
2second coolestnot as awesome, but still awesome!1
comments_tbl
idcommentpost_idauthor_id
1wow, so cool!12
2that's pretty cool!13

So the same way we could get comments for a certain post ID, we can get the author's name for each post ID like so:

SELECT name FROM authors_tbl WHERE id=#;

With # being the author_id from another table.

But wait... this is terribly inefficient. Think about this: there is only one possible name per author ID, so why run two queries to get this data? Surely we can do it in one!

Multi-table queries, JOINs, subqueries, oh my!

We can do a lot across tables or by subqueries of tables. I'm just going to shove this in your face and see if you understand it:

SELECT posts_tbl.*, authors_tbl.name FROM posts_tbl, authors_tbl WHERE authors_tbl.id = posts_tbl.author_id ORDER BY posts_tbl.id DESC;

OH JEEZ that's a big query! What is it getting? Well, read it slowly: it's getting all the information from the posts_tbl, as well as the author's name from authors_tbl, and it's saying in the WHERE clause that for every author_id in the posts table there is a corresponding ID in the authors_tbl. And sort it by the post's ID number in descending order.

We could also do this with a JOIN clause:

SELECT posts_tbl.*, authors_tbl.name FROM posts_tbl JOIN authors_tbl ON authors_tbl.id = posts_tbl.author_id ORDER BY posts_tbl.id DESC;

That does the same thing as the above statement. I always find JOINs to be a bit more confusing, but it's personal preference. JOIN used to be the most efficient way of doing things, but now subqueries are just as effective.

The JOIN version is technically better because the first example will fail if there is no corresponding ID row in the authors_tbl. The JOIN wouldn't fail; it would simply not have a name, and continue on.

So what would a subquery look like? Well let's say we wanted all of cyle's latest posts:

SELECT * FROM posts_tbl WHERE author_id IN (SELECT id FROM authors_tbl WHERE name='cyle') ORDER BY id DESC;

You can see there is a statement within a statement, hence subquery. (You can go even deeper, Inception-style, if you want.)

The subquery will return the ID # for the user named 'cyle' and give that to the WHERE clause in the parent query! Wow that sounds fun. I hope you're following along.

So what is normalization? It's making sure all your data fits this kind of model. It's replacing redundancies in data with foreign keys. It's turning what seemed like it could be two tables into four or five.

Other ways of normalizing and relating!

You could make a table that exists between two tables, linking them together! Like so:

blog_db
posts_tbl
idtitlecontent
1coolest post everthis is the best blog post ever!
2second coolestnot as awesome, but still awesome!
comments_tbl
idcomment
1wow, so cool!
2that's pretty cool!
comments_to_posts_tbl
idpost_idcomment_id
111
212

You understand that? There is a third table which simply describes which comment IDs go to which post IDs. That's a totally valid way of doing it. It's especially useful for a system in which one thing may go to many things or many things may go to many other things.

With the previous schema, a single comment could only relate to one post. With this setup, one comment could relate to as many posts as you want because the relationship is outside the comment itself. Get the difference? It's all about architecting how the data will flow together.

How would you write the SQL query to get the latest comments for a given blog post? Maybe like so:

SELECT * FROM comments_tbl WHERE id IN (SELECT comment_id FROM comments_to_posts_tbl WHERE post_id=#) ORDER BY id DESC;

Again, replace the # in there with the current post ID.

How to Get Here from There

The process of architecting these databases is not easy. Some of it, like making authors based on ID rather than by name, seems simple in hindsight, but it's tough to forecast what data needs to be normalized until you have a ton of it.

I find drawing out a web of what data is needed to be the most effective means of figuring it out. As I did before with the blog, ask yourself: what pieces of data makes up a blog? How do they relate? What can be normalized?

Literally sitting down with a piece of blank paper and a pen is often the easiest way of figuring it out.

A Practical Example: PHP/MySQL

Oh boy oh boy. Here's a sample of how the blog would work when accessed in MySQL via PHP using the lol_db example from above. This example gets the latest blog posts, the author name, and their comments.

<html>
<head>
<title>lol blog<title>
<head>
<body>
<?php

mysql_connect('localhost', 'db_username', 'db_password');
mysql_select_db('lol_db');

$posts_query = mysql_query('SELECT posts_tbl.*, authors_tbl.name FROM posts_tbl, authors_tbl WHERE authors_tbl.id=posts_tbl.author_id ORDER BY id DESC');
while ($post = mysql_fetch_assoc($posts_query)) {
	echo '<div class="post">';
	echo '<h1>'.$post['title'].'</h1>';
	echo '<p class="post_author">by '.$post['name'].'</p>';
	echo '<p class="post_content">'.$post['content'].'</p>';
	echo '<div class="comments">';
	$comments_query = mysql_query('SELECT comments_tbl.*, authors_tbl.name FROM comments_tbl, authors_tbl WHERE authors_tbl.id=comments_tbl.author_id AND post_id='.$post['id']);
	while ($comment = mysql_fetch_assoc($comments_query)) {
		echo '<p class="comment">'.$comment['comment'].' (by '.$comment['name']).'</p>';
	}
	echo '</div>';
	echo '</div>';
}

?>
</body>
</html>

Important concepts to notice:

in conclusiooo

Traditional databases are great for a lot of stuff. A lot of simple (relatively simple) things that have very definable needs and data structures. But there are better solutions for other problems. Traditional databases don't scale up well; they're not very fast when a huge amount of people need to use them. Read the next guide for an alternative approach.