Friday, June 6, 2014

Just enough SQL to be dangerous

Let's assume you have at least a clue what SQL is and what it is used for. I want to look at what is between a clue and knowing enough to be dangerous (aka nearly enough to be useful). So we will look at basic querying including joins and then... ok honestly I have no plan beyond that, but I need the writing practice, so let's have at it anyway!

Consider this basic statement:

SELECT id, name FROM users WHERE id = 123

There are 3 significant parts of this statement: SELECT, FROM and WHERE.
The SELECT clause is where we list the columns we want to see. 
The FROM clause is, perhaps obviously, where we define the data source of our query. This will very commonly be a table, but not always.
The WHERE clause contains conditions the records much match to show up in our result set. Expressions in the WHERE clause can be joined with AND and OR keywords and grouped with parentheses to build complex matching conditions.

In response to the query above, the database will go to the users table and look for records that have the value 123 in the id column. The server will return the id and name columns for the records that match.

Let's take a small step back and talk about a couple conventions. I'm sure you've seen plenty SELECT *  examples. I will not be using * as encouragement that you not use * either. It is simply bad practice and it's never too soon to start developing good habits.
The other convention to mention is capitalization. It is very common to see SQL reserved words in all caps like I have done here. SQL is not actually case sensitive. Do it however you want. Personally I prefer syntax highlighting to capitalization.

Here is another simple query:

SELECT id, title, due_date FROM tasks ORDER BY due_date DESC

This time we are selecting 3 columns instead of only 2 and our data source is the tasks table instead of users as above. This much should be familiar. The bigger differences is a lack of a WHERE clause and addition of ORDER BY. Having no WHERE clause means we will return all records in the table. This is ok for now, but eventually you will need to address how to handle large sets of data. The ORDER BY should be another potentially obvious one, it sorts (or orders) the results by the column name you specify. DESC is descending order, ASC is also available for ascending order. While the sort direction (DESC|ASC) is technically optional, I encourage you to always specify one.

This is all fine and dandy, but let's do something useful. The imaginary database I'm using here is for a task manager application. We've seen both the tasks and users table but a more common usage would be to list the tasks along with who they are assigned to. So let's add a column to the tasks table: assigned_to.
We will start with a baseline query like this:

SELECT id, title, assigned_to, due_date FROM tasks ORDER BY due_date DESC

In this case we could simply put the user's name in the assigned_to field and be done. This is bad, of course. "Data Normalization" is beyond the scope of this post, but I encourage you to go find some reading on the subject. Here we will just accept that you should only store data once, when possible.
This means we must refer to the user record for who owns this task. We do so by inserting the user id in the assigned_to field.

Let's use a task table like this:

| id | title      | assigned_to |
|  1 | mow lawn   | 1           |
|  2 | wash car   | 1           |
|  3 | paint shed | 2           |

with a users table like this:

| id | name  |
|  1 | alice |
|  2 | bob   |

Now if we run the query above, we just see the number 1 where we want to see 'alice'. What we need to do is JOIN the users table with the tasks table. We do this specifying the common field to join on.

SELECT, tasks.title,, tasks.due_date 
FROM tasks JOIN users ON = tasks.assigned_to
ORDER BY tasks.due_date DESC

There is a lot more going on here, let's a look one clause at a time.

In the SELECT clause we are specifying the table name and column name separated by a dot (.). This is because we have more than one table in the query, we need to specify which id column, for example, we want to see. With columns that only exist once even after all tables are joined, we don't strictly have to specify the table name in the select clause, but you should.

The FROM clause includes a JOIN. When we JOIN tables, we have to tell the database which value to match the rows on. We do this with the keyword ON. In the example above we are telling the database to "line up" the tasks tables rows with the users table rows matching the column value to the value in the tasks.assigned_to column. Joining the users table like this made it possible to select the column in place of the tasks.assigned_to column.

The ORDER BY clause is the same as before except we added the table name to the due_date column because we like clean readable code that will still make perfect sense when we come back to look at it a year and a half later and is also easier to maintain. Even tough the tasks table name was not required since there is no due_date column on any other tables being joined in this query.

Let's think this through in the order the database does. First we take the tasks table and join the users table to it matching the tasks.assigned_to field to the field. Our new data source, for the purposes of this query only, now looks like this:

| | tasks.title | tasks.assigned_to | | |
|  1       | mow lawn    | 1                 | 1        | alice      |
|  2       | wash car    | 1                 | 1        | alice      |
|  3       | paint shed  | 2                 | 2        | bob        |

If you're confused by alice showing up more than once, think about this happening one row at a time. The database sees task number 1 and matches the alice record based on the assigned_to field. On the next record, task id 2, it has no issue joining the same user record again.
Now when the database evaluates the WHERE clause it is working against this joined dataset, same as with the SELECT clause.

Speaking of a WHERE clause, how about seeing one with a JOIN?

SELECT, tasks.title,, tasks.due_date 
FROM tasks JOIN users ON = tasks.assigned_to
ORDER BY tasks.due_date DESC

Here we are only looking at Bob's tasks only.

Before we get too crazy, I want to address naming convention.
Too often I see terrible table and column names in the interest of saving a few characters. I strongly recommend naming tables and columns like I have done here. The table name should be a plural name of what type of data it is storing. Such as users, tasks, transactions, weapons, flowers, etc... Please do not name table: 'gems_table'. We already know it's table, just name it 'gems'. I really believe that either you or someone else will spend more time over the long run trying to find out what 'trans' means than you will save by leaving off 'ports'. Just name the table 'transports'.
Column names should never include the table name. The users table hold user data and only user data, no need to name the columns user_name or user_login. (of course there are cases where the field name and table name could be similar but this is a rare exception.)
This is all in the interest of easier to read SQL statements and easier to understand databases.

Now let's get crazy.
Let's add a categories table. Categories table has the fields: id and title. Also we will need to add a category field to the tasks table in order to assign the task to a category. 
Categories table:

| id | title      |
|  1 | yard       |
|  2 | automotive |

Now we want to see the same list of tasks but with the category title added.

SELECT, tasks.title, category.title,, tasks.due_date 
FROM tasks 
JOIN users ON = tasks.assigned_to
JOIN categories ON = tasks.category
ORDER BY tasks.due_date DESC

Here the categories table gets joined in after the users table is joined. Everything follows the same concepts as before, it just happens again.

Now after insisting you not try to save typing by shortening table and column name, I'm going to show you how you can save a few keystrokes.

Table aliases using the AS keyword. Take a look at the last example again using aliases:

SELECT, t.title, c.title,, t.due_date 
FROM tasks AS t
JOIN users AS u ON = t.assigned_to
JOIN categories AS c ON = t.category
ORDER BY t.due_date DESC

You can certainly make your own judgement, but I find that much easier to read and type.

That's all I have for now. Stay tuned for more opinionated randomness.

No comments:

Post a Comment