Due at 11:59:59 pm on Sunday, 05/03/2020.

Instructions

Download hw12.zip. Inside the archive, you will find starter files for the questions in this homework, along with a copy of the OK autograder.

Submission: When you are done, submit with python3 ok --submit. You may submit more than once before the deadline; only the final submission will be scored. Check that you have successfully submitted your code on okpy.org. See this article for more instructions on okpy and submitting assignments.

Readings: This homework relies on following references:

Usage

First, check that a file named sqlite_shell.py exists alongside the assignment files.

You can start an interactive SQLite session in your Terminal or Git Bash with the following command:

python3 sqlite_shell.py 

While the interpreter is running, you can type .help to see some of the commands you can run.

To exit out of the SQLite interpreter, type .exit or .quit or press Ctrl-C. Remember that if you see ...> after pressing enter, you probably forgot a ;.

You can also run all the statements in a .sql file by doing the following:

This command runs your code and then exits SQLite immediately afterwards.

python3 sqlite_shell.py < hw12.sql

This command runs your code and then opens an interactive SQLite session, which is similar to running Python code with the interactive -i flag.

python3 sqlite_shell.py --init hw12.sql

To complete this homework assignment, you will need to use SQLite version 3.8.3 or greater.

To check your progress, you can run sqlite3 directly by running:

python3 sqlite_shell.py --init hw12.sql

You should also check your work using ok:

python3 ok

SQL

Working with the Chinook Dataset

The Chinook dataset is a sample dataset that you'll be working with in this homework. It contains data about many different musical albums, as well as the tracks and musical artists on those albums. To simplify things, we've only given you the artists, albums, and tracks tables (note that this means that some columns in the tracks table, such as genreId and mediaTypeId won't reference anything). We've also cleaned up the data for you and truncated the tracks table (it used to be several thousand records long!). Let's take a look at the different tables we've given you:

  1. artists

    • artistId: the unique ID number assigned to this artist
    • name: the name of this artist
  2. albums

    • albumId: the unique ID number assigned to this album
    • title: the title of this album
    • artistId: the unique ID of the artist on this album
  3. tracks

    • trackId: the unique ID number assigned to this track
    • name: the name of this track
    • albumId: the unique ID of the album that this track is on
    • mediaTypeId: don't worry about this!
    • genreId: don't worry about this!
    • composer: the composer of this track
    • milliseconds: the length of this track in milliseconds
    • bytes: the size of this track file in bytes
    • unitPrice: the price of this track

You will write all of your solutions in the starter file provided. As with other labs, you can test your solutions with OK. In addition, you can use either of the following commands. You may need to refer to the Usage section to find the appropriate command for your OS:

python3 sqlite_shell.py < hw12.sql
python3 sqlite_shell.py --init hw12.sql

Question 1: What Would SQL print?

First, load the tables into sqlite3. If you're on Windows or Mac OS X (Mavericks or older), use the following command:

python3 sqlite_shell.py < hw12.sql

Before we start, inspect the schema of the tables that we've created for you:

sqlite> .schema

This tells you the name of each of our tables and their attributes.

Let's also take a look at some of the entries in our table. There are a lot of entries though, so let's just output the first 20:

sqlite> SELECT * FROM tracks LIMIT 20;

If you're curious about which artists/albums/tracks we've included, open up chinook.sql in your favorite text editor and take a look!

For each of the SQL queries below, decide to yourself and/or your partner what the query is looking for, then try running the query yourself and see!

sqlite> SELECT * FROM artists; -- This is a comment. * is shorthand for all columns!
selects all records from artists;
sqlite> SELECT name FROM tracks WHERE bytes > 3000000; -- There are 1000000 bytes in a megabyte.
selects the name of all tracks that are larger than 3 Mb
sqlite> SELECT name, composer FROM tracks WHERE albumId = 57 AND milliseconds > 180000; -- How many milliseconds are in a minute?
selects the name and composer of all tracks longer than 3 minutes on the album with albumID = 57

Question 2: Get rid of the fluff!

Write a SQL query to create a table that contains just the name and unitPrice of each track in the tracks table.

You should get the following output:

sqlite> SELECT * FROM pricing LIMIT 10;
For Those About To Rock (We Salute You)|0.99
Fast As a Shark|0.99
Restless and Wild|0.99
Princess of the Dawn|0.99
Put The Finger On You|0.99
Let's Get It Up|0.99
Inject The Venom|0.99
Snowballed|0.99
Evil Walks|0.99
C.O.D.|0.99
CREATE TABLE pricing as
  select "REPLACE THIS LINE WITH YOUR SOLUTION";

Use OK to test your code:

python3 ok -q pricing

Question 3: Long songs...

Now that we have learned how to select columns from a SQL table, let's filter the results to see some more interesting results!

Write a SQL query to create a table that contains the names of all tracks longer than 6 minutes (that's a long song!)

You should get the following output:

sqlite> SELECT * FROM long LIMIT 10;
Princess of the Dawn
Let There Be Rock
Overdose
Livin' On The Edge
You Oughta Know (Alternate)
Love, Hate, Love
Master Of Puppets
Harvester Of Sorrow
Wherever I May Roam
Snoopy's search-Red baron
CREATE TABLE long as
  select "REPLACE THIS LINE WITH YOUR SOLUTION";

Use OK to test your code:

python3 ok -q long

Question 4: Biggest Song

Which song takes up the most space (as a file)? Let's find out! (Incidentally this is also the longest song in our database.)

Write a SQL query with the columns name and milliseconds to try to determine what the largest song file is. In order to make it easier for us to inspect these values, use ORDER BY to sort the numerical values and LIMIT your result to just the biggest track.

You should get the following output:

sqlite> SELECT * FROM largest;
How Many More Times|711836
CREATE TABLE largest as
  select "REPLACE THIS LINE WITH YOUR SOLUTION";

Use OK to test your code:

python3 ok -q largest

Joins

We can use joins to include rows from another table that satisfy the where predicate. Joins can either be on different tables, or the same table if we include an alias. Here we are referencing the football table twice, once as the alias a and once as the alias b.

sqlite> SELECT a.Berkeley - b.Berkeley, a.Stanford - b.Stanford, a.Year, b.Year
...>        FROM Football as a, Football as b WHERE a.Year > b.Year;
-11|22|2014|2003
-13|21|2014|2002
-2|9|2003|2002

What is this query asking for?

You may notice that it does not seem like we actually performed any operations to do the join. However, the join is implicit in the fact that we listed more than one table after the FROM. In this example, we joined the table Football with itself and gave each instance of the table an alias, a and b so that we could distinctly refer to each table's attributes and perform selections and comparisons on them, such as a.Year > b.Year.

One way to think of a join is that it produces a cross-product between the two tables by matching each row from the first table with every other row in the second table, which creates a new, larger joined table.

Here's an illustration of what happened in the joining process during the above query.

joins

From here, the select statement examines the joined table and selects the values it desires: a.Berkeley - b.Berkeley and a.Stanford - b.Stanford but only from the rows WHERE a.Year > b.Year. This prevents duplicate results from appearing in our output!

Question 5: Long songs on albums

We know which songs are long, but now we want to know which albums have long songs on them!

Write a SQL query to create a table that contains the titles of all albums with tracks longer than 6 minutes on them.

You should get the following output:

sqlite> SELECT * FROM long_album LIMIT 10;
Restless and Wild
Let There Be Rock
Let There Be Rock
Big Ones
Jagged Little Pill
Facelift
Plays Metallica By Four Cellos
Plays Metallica By Four Cellos
Plays Metallica By Four Cellos
The Best Of Billy Cobham

Note that albums can appear more than once in this table. Why is this? What does the number of times an album appears tell you?

CREATE TABLE long_album as
  select "REPLACE THIS LINE WITH YOUR SOLUTION";

Use OK to test your code:

python3 ok -q long_album

Aggregate Functions

So far, we haven't given you a good way of counting the number of rows that fit certain criteria.

Bring in SQL aggregation, which is commonly used to aggregate values in order to answer these types of questions!

In order to perform SQL aggregation, we need to group rows in our table by one or more attributes. Once we have groups, we can aggregate over the groups in our table and find things like the maximum value (MAX), the minimum value (MIN), the number of rows in the group (COUNT), the average over all of the values (AVG), and more! SELECT statements that use aggregation are marked by two things: an aggregate function (MAX, MIN, COUNT, AVG, etc.) and a GROUP BY clause. For example, if we wanted to see how many tracks are on each album (identified by albumId):

sqlite> SELECT albumId, max(milliseconds) FROM tracks GROUP BY albumId LIMIT 10;
1|343719
3|375418
4|369319
5|381231
6|491885
7|387134
9|436453
10|343457
11|309786
12|163265

So the longest song on the album with albumId 57 is 343719 ms long!

This SELECT statement groups all of the rows in our table tracks by albumId. Then, within each group, we perform aggregation by MAXing over the attribute milliseconds. By selecting albumId and MAX(milliseconds), we then can see the greatest milliseconds for any track for a given albumId.

When using an aggregate function in your SELECT statement, any other columns in the SELECT statement must be in the GROUP BY statement!

Question 6: Number of songs on each album

Write a SQL query to create a table that contains the albumId and number of songs on the album with that albumId. HINT: Remember that count(*) will count the number of rows in each group of a GROUP BY statement.

You should get the following output:

sqlite> SELECT * FROM track_count LIMIT 10;
1|10
3|3
4|8
5|15
6|13
7|12
9|8
10|14
11|12
12|12
CREATE TABLE track_count as
  select "REPLACE THIS LINE WITH YOUR SOLUTION";

Use OK to test your code:

python3 ok -q track_count

Question 7: Counting albums

So far, we haven't yet considered the artists table. What if we want to know how many albums each artist has released?

Write a SQL query to create a table that contains the name of each artist along with how many albums that artist has released (that are in the albums table).

You should get the following output:

sqlite> SELECT * FROM album_count LIMIT 10;
Aaron Copland & London Symphony Orchestra|1
Aaron Goldberg|1
Academy of St. Martin in the Fields & Sir Neville Marriner|1
Academy of St. Martin in the Fields Chamber Ensemble & Sir Neville Marriner|1
Academy of St. Martin in the Fields, John Birch, Sir Neville Marriner & Sylvia McNair|1
Academy of St. Martin in the Fields, Sir Neville Marriner & Thurston Dart|1
Accept|2
Adrian Leaper & Doreen de Feis|1
Aerosmith|1
CREATE TABLE album_count as
  select "REPLACE THIS LINE WITH YOUR SOLUTION";

Use OK to test your code:

python3 ok -q album_count

Question 8: Busiest artists

How many songs have each artist produced? Write a SQL query to create a table containing artists' names and the number of tracks that they're produced (we'll assume an artist produced a track if it appears on one of their albums).

HINT: You may need all three tables!

You should get the following output:

sqlite> SELECT * FROM busiest_artists LIMIT 10;
AC/DC|18
Accept|3
Aerosmith|15
Alanis Morissette|13
Alice In Chains|12
Antônio Carlos Jobim|17
Apocalyptica|8
Audioslave|26
BackBeat|12
Billy Cobham|8
CREATE TABLE busiest_artists as
  select "REPLACE THIS LINE WITH YOUR SOLUTION";

Use OK to test your code:

python3 ok -q busiest_artists

Note that computing a join over all three tables is very expensive! To see how many rows are considered, run the following query:

SELECT count(*) FROM artists, albums, tracks;

There may be a slight delay (because you're computing so many different candidate rows!).