Copyright Jana Schaich Borg/Attribution-NonCommercial 4.0 International (CC BY-NC 4.0)

MySQL Exercise 5: Summaries of Groups of Data

So far you've learned how to select, reformat, manipulate, order, and summarize data from a single table in database. In this lesson, you are going to learn how to summarize multiple subsets of your data in the same query. The method for doing this is to include a "GROUP BY" clause in your SQL queries.

The GROUP BY clause

The GROUP BY clause comes after the WHERE clause, but before ORDER BY or LIMIT:

The GROUP BY clause is easy to incorporate into your queries. In fact, it might be a little too easy to incorporate into MySQL queries, because it can be used incorrectly in MySQL queries even when no error message is displayed. As a consequence, I suggest you adopt a healthy dose of caution every time you use the GROUP BY clause. By the end of this lesson, you will understand why. When used correctly, though, GROUP BY is one of the most useful and efficient parts of an SQL query, and once you are comfortable using it, you will use it very frequently.

To get started, load the SQL library and the Dognition database, and set the dognition database as the default:


In [1]:
%load_ext sql
%sql mysql://studentuser:studentpw@mysqlserver/dognitiondb
%sql USE dognitiondb

%config SqlMagic.displaylimit=25


0 rows affected.

Let's return to a question from MySQL Exercise 4. How would you query the average rating for each of the 40 tests in the Reviews table? As we discussed, one very inefficient method to do that would be to write 40 separate queries with each one having a different test name in the WHERE conditional clause. Then you could copy or transcribe the results from all 40 queries into one place. But that wouldn't be very pleasant. Here's how you could do the same thing using one query that has a GROUP BY clause:

SELECT test_name, AVG(rating) AS AVG_Rating
FROM reviews
GROUP BY test_name

This query will output the average rating for each test. More technically, this query will instruct MySQL to average all the rows that have the same value in the test_name column.

Notice that I included test_name in the SELECT statement. As a strong rule of thumb, if you are grouping by a column, you should also include that column in the SELECT statement. If you don't do this, you won't know to which group each row of your output corresponds.

To see what I mean, try the query above without test_name included in the SELECT statement:


In [2]:
%%sql
SELECT test_name, AVG(rating) AS AVG_Rating
FROM reviews
GROUP BY test_name


40 rows affected.
Out[2]:
test_name AVG_Rating
1 vs 1 Game 3.9206
3 vs 1 Game 4.2857
5 vs 1 Game 3.9272
Arm Pointing 4.2153
Cover Your Eyes 2.6741
Delayed Cup Game 3.3514
Different Perspective 2.7647
Expression Game 4.0000
Eye Contact Game 2.9372
Eye Contact Warm-up 0.9632
Foot Pointing 4.0093
Impossible Task Game 3.0965
Impossible Task Warm-up 0.2174
Inferential Reasoning Game 4.5223
Inferential Reasoning Warm-up 4.3066
Memory versus Pointing 3.5584
Memory versus Smell 4.2623
Navigation Game 2.9841
Navigation Learning 2.0303
Navigation Warm-up 1.9805
Numerosity Warm-Up 2.6173
One Cup Warm-up 1.3693
Physical Reasoning Game 3.8492
Physical Reasoning Warm-up 1.6625
Self Control Game 3.8519
40 rows, truncated to displaylimit of 25

You can form groups using derived values as well as original columns. To illustrate this, let's address another question: how many tests were completed during each month of the year?

To answer this question, we need to take advantage of another datetime function described in the website below:

http://www.w3resource.com/mysql/date-and-time-functions/date-and-time-functions.php

MONTH() will return a number representing the month of a date entry. To get the total number of tests completed each month, you could put the MONTH function into the GROUP BY clause, in this case through an alias:

SELECT test_name, MONTH(created_at) AS Month, COUNT(created_at) AS Num_Completed_Tests
FROM complete_tests
GROUP BY Month;

You can also group by multiple columns or derived fields. If we wanted to determine the total number of each type of test completed each month, you could include both "test_name" and the derived "Month" field in the GROUP BY clause, separated by a comma.

SELECT test_name, MONTH(created_at) AS Month, COUNT(created_at) AS Num_Completed_Tests
FROM complete_tests
GROUP BY test_name, Month;

MySQL allows you to use aliases in a GROUP BY clause, but some database systems do not. If you are using a database system that does NOT accept aliases in GROUP BY clauses, you can still group by derived fields, but you have to duplicate the calculation for the derived field in the GROUP BY clause in addition to including the derived field in the SELECT clause:

SELECT test_name, MONTH(created_at) AS Month, COUNT(created_at) AS Num_Completed_Tests
FROM complete_tests
GROUP BY test_name, MONTH(created_at);

Try the query once with test_name first in the GROUP BY list, and once with Month first in the GROUP BY list below. Inspect the outputs:


In [4]:
%%sql
SELECT test_name, MONTH(created_at) AS Month, COUNT(created_at) AS Num_Completed_Tests
FROM complete_tests
GROUP BY test_name, MONTH(created_at);


480 rows affected.
Out[4]:
test_name Month Num_Completed_Tests
1 vs 1 Game 1 25
1 vs 1 Game 2 28
1 vs 1 Game 3 22
1 vs 1 Game 4 12
1 vs 1 Game 5 13
1 vs 1 Game 6 18
1 vs 1 Game 7 36
1 vs 1 Game 8 17
1 vs 1 Game 9 28
1 vs 1 Game 10 27
1 vs 1 Game 11 15
1 vs 1 Game 12 14
3 vs 1 Game 1 35
3 vs 1 Game 2 28
3 vs 1 Game 3 34
3 vs 1 Game 4 16
3 vs 1 Game 5 34
3 vs 1 Game 6 42
3 vs 1 Game 7 37
3 vs 1 Game 8 23
3 vs 1 Game 9 24
3 vs 1 Game 10 28
3 vs 1 Game 11 22
3 vs 1 Game 12 45
5 vs 1 Game 1 59
480 rows, truncated to displaylimit of 25

In [3]:
%%sql
SELECT test_name, MONTH(created_at) AS Month, COUNT(created_at) AS Num_Completed_Tests
FROM complete_tests
GROUP BY MONTH(created_at), test_name;


480 rows affected.
Out[3]:
test_name Month Num_Completed_Tests
1 vs 1 Game 1 25
3 vs 1 Game 1 35
5 vs 1 Game 1 59
Arm Pointing 1 622
Cover Your Eyes 1 452
Delayed Cup Game 1 356
Different Perspective 1 2
Expression Game 1 6
Eye Contact Game 1 624
Eye Contact Warm-up 1 707
Foot Pointing 1 506
Impossible Task Game 1 49
Impossible Task Warm-up 1 51
Inferential Reasoning Game 1 410
Inferential Reasoning Warm-up 1 425
Memory versus Pointing 1 464
Memory versus Smell 1 465
Navigation Game 1 57
Navigation Learning 1 59
Navigation Warm-up 1 62
Numerosity Warm-Up 1 36
One Cup Warm-up 1 476
Physical Reasoning Game 1 317
Physical Reasoning Warm-up 1 384
Self Control Game 1 8
480 rows, truncated to displaylimit of 25

Notice that in the first case, the first block of rows share the same test_name, but are broken up into separate months (for those of you who took the "Data Visualization and Communication with Tableau" course of this specialization, this is similar to what would happen if you put test_name first and created_at second on the rows or columns shelf in Tableau).

In the second case, the first block of rows share the same month, but are broken up into separate tests (this is similar to what would happen if you put created_at first and test_name second on the rows or columns shelf in Tableau). If you were to visualize these outputs, they would look like the charts below.

Different database servers might default to ordering the outputs in a certain way, but you shouldn't rely on that being the case. To ensure the output is ordered in a way you intend, add an ORDER BY clause to your grouped query using the syntax you already know and have practiced:

SELECT test_name, MONTH(created_at) AS Month, COUNT(created_at) AS Num_Completed_Tests
FROM complete_tests
GROUP BY test_name, Month
ORDER BY test_name ASC, Month ASC;

Question 1: Output a table that calculates the number of distinct female and male dogs in each breed group of the Dogs table, sorted by the total number of dogs in descending order (the sex/breed_group pair with the greatest number of dogs should have 8466 unique Dog_Guids):


In [5]:
%%sql
Describe dogs


21 rows affected.
Out[5]:
Field Type Null Key Default Extra
gender varchar(255) YES None
birthday varchar(255) YES None
breed varchar(255) YES None
weight int(11) YES None
dog_fixed tinyint(1) YES None
dna_tested tinyint(1) YES None
created_at datetime NO None
updated_at datetime NO None
dimension varchar(255) YES None
exclude tinyint(1) YES None
breed_type varchar(255) YES None
breed_group varchar(255) YES None
dog_guid varchar(60) YES MUL None
user_guid varchar(60) YES MUL None
total_tests_completed varchar(255) YES None
mean_iti_days varchar(255) YES None
mean_iti_minutes varchar(255) YES None
median_iti_days varchar(255) YES None
median_iti_minutes varchar(255) YES None
time_diff_between_first_and_last_game_days varchar(255) YES None
time_diff_between_first_and_last_game_minutes varchar(255) YES None

In [15]:
%%sql
SELECT breed_group, gender, COUNT(DISTINCT dog_guid) AS 'Amount'
FROM dogs
GROUP BY breed_group, gender
ORDER BY Amount DESC, breed, gender


18 rows affected.
Out[15]:
breed_group gender Amount
None male 8466
None female 8367
Sporting male 2584
Sporting female 2262
Herding male 1736
Herding female 1704
Toy male 1473
Toy female 1145
Non-Sporting male 1098
Working male 1075
Terrier male 919
Non-Sporting female 919
Working female 895
Terrier female 794
Hound male 725
Hound female 614
male 147
female 127

Some database servers, including MySQL, allow you to use numbers in place of field names in the GROUP BY or ORDER BY fields to reduce the overall length of the queries. I tend to avoid this abbreviated method of writing queries because I find it challenging to troubleshoot when you are writing complicated queries with many fields, but it does allow you to write queries faster. To use this method, assign each field in your SELECT statement a number acording to the order the field appears in the SELECT statement. In the following statement:

SELECT test_name, MONTH(created_at) AS Month, COUNT(created_at) AS Num_Completed_Tests

test_name would be #1, Month would be #2, and Num_Completed_Tests would be #3. You could then rewrite the query above to read:

SELECT test_name, MONTH(created_at) AS Month, COUNT(created_at) AS Num_Completed_Tests
FROM complete_tests
GROUP BY 1, 2
ORDER BY 1 ASC, 2 ASC;

Question 2: Revise the query your wrote in Question 1 so that it uses only numbers in the GROUP BY and ORDER BY fields.


In [16]:
%%sql
SELECT breed_group, gender, COUNT(DISTINCT dog_guid) AS 'Amount'
FROM dogs
GROUP BY 1, 2
ORDER BY 3 DESC


18 rows affected.
Out[16]:
breed_group gender Amount
None male 8466
None female 8367
Sporting male 2584
Sporting female 2262
Herding male 1736
Herding female 1704
Toy male 1473
Toy female 1145
Non-Sporting male 1098
Working male 1075
Non-Sporting female 919
Terrier male 919
Working female 895
Terrier female 794
Hound male 725
Hound female 614
male 147
female 127

The HAVING clause

Just like you can query subsets of rows using the WHERE clause, you can query subsets of aggregated groups using the HAVING clause. However, wheras the expression that follows a WHERE clause has to be applicable to each row of data in a column, the expression that follows a HAVING clause has to be applicable or computable using a group of data.

If you wanted to examine the number of tests completed only during the winter holiday months of November and December, you would need to use a WHERE clause, because the month a test was completed in is recorded in each row. Your query might look like this:

SELECT test_name, MONTH(created_at) AS Month, COUNT(created_at) AS Num_Completed_Tests
FROM complete_tests
WHERE MONTH(created_at)=11 OR MONTH(created_at)=12
GROUP BY 1, 2
ORDER BY 3 DESC;

If you then wanted to output only the test-month pairs that had at least 20 records in them, you would add a HAVING clause, because the stipulation of at least 20 records only makes sense and is only computable at the aggregated group level:

SELECT test_name, MONTH(created_at) AS Month, COUNT(created_at) AS Num_Completed_Tests
FROM complete_tests
WHERE MONTH(created_at)=11 OR MONTH(created_at)=12
GROUP BY 1, 2
HAVING COUNT(created_at)>=20
ORDER BY 3 DESC;

Question 3: Revise the query your wrote in Question 2 so that it (1) excludes the NULL and empty string entries in the breed_group field, and (2) excludes any groups that don't have at least 1,000 distinct Dog_Guids in them. Your result should contain 8 rows. (HINT: sometimes empty strings are registered as non-NULL values. You might want to include the following line somewhere in your query to exclude these values as well):

breed_group!=""

In [17]:
%%sql
SELECT breed_group, gender, COUNT(DISTINCT dog_guid) AS 'Amount'
FROM dogs
WHERE breed_group IS NOT NULL AND breed_group <> ''
GROUP BY breed_group, gender
HAVING COUNT(DISTINCT dog_guid) >= 1000
ORDER BY Amount DESC, breed, gender


8 rows affected.
Out[17]:
breed_group gender Amount
Sporting male 2584
Sporting female 2262
Herding male 1736
Herding female 1704
Toy male 1473
Toy female 1145
Non-Sporting male 1098
Working male 1075

In [27]:
%%sql
SELECT breed_group, gender, COUNT(DISTINCT dog_guid) AS 'Amount' FROM dogs WHERE breed_group IS NOT NULL AND breed_group <> ''
GROUP BY breed_group, gender
HAVING COUNT(DISTINCT dog_guid) >= 1000
ORDER BY Amount DESC, breed, gender


8 rows affected.
Out[27]:
breed_group gender Amount
Sporting male 2584
Sporting female 2262
Herding male 1736
Herding female 1704
Toy male 1473
Toy female 1145
Non-Sporting male 1098
Working male 1075

In [28]:
result = _

In [33]:
type(result)


Out[33]:
sql.run.ResultSet

We will review several issues that can be tricky about using GROUP BY in your queries in the next lesson, but those issues will make more sense once you are sure you are comfortable with the basic functionality of the GROUP BY and HAVING clauses.

Practice incorporating GROUP BY and HAVING into your own queries.

Question 4: Write a query that outputs the average number of tests completed and average mean inter-test-interval for every breed type, sorted by the average number of completed tests in descending order (popular hybrid should be the first row in your output).


In [35]:
%%sql
Describe dogs


21 rows affected.
Out[35]:
Field Type Null Key Default Extra
gender varchar(255) YES None
birthday varchar(255) YES None
breed varchar(255) YES None
weight int(11) YES None
dog_fixed tinyint(1) YES None
dna_tested tinyint(1) YES None
created_at datetime NO None
updated_at datetime NO None
dimension varchar(255) YES None
exclude tinyint(1) YES None
breed_type varchar(255) YES None
breed_group varchar(255) YES None
dog_guid varchar(60) YES MUL None
user_guid varchar(60) YES MUL None
total_tests_completed varchar(255) YES None
mean_iti_days varchar(255) YES None
mean_iti_minutes varchar(255) YES None
median_iti_days varchar(255) YES None
median_iti_minutes varchar(255) YES None
time_diff_between_first_and_last_game_days varchar(255) YES None
time_diff_between_first_and_last_game_minutes varchar(255) YES None

In [37]:
%%sql
SELECT breed_type, AVG(total_tests_completed), AVG(mean_iti_days)
FROM dogs 
GROUP BY breed_type
ORDER BY AVG(total_tests_completed) DESC


4 rows affected.
/opt/conda/lib/python3.4/site-packages/SQLAlchemy-1.0.11-py3.4-linux-x86_64.egg/sqlalchemy/engine/default.py:450: Warning: Truncated incorrect DOUBLE value: 'NaN'
  cursor.execute(statement, parameters)
Out[37]:
breed_type AVG(total_tests_completed) AVG(mean_iti_days)
Popular Hybrid 10.257530120481928 1.9682781756219017
Cross Breed 9.945900537634408 1.994688302855342
Pure Breed 9.871602824737856 2.2176045095807226
Mixed Breed/ Other/ I Don't Know 9.54250850170034 2.0993549515344747

Question 5: Write a query that outputs the average amount of time (in hours) it took customers to complete each type of test where any individual reaction times over 6000 minutes are excluded and only average reaction times that are greater than 0 minutes are included (your output should end up with 67 rows).


In [44]:
%%sql
show tables


6 rows affected.
Out[44]:
Tables_in_dognitiondb
complete_tests
dogs
exam_answers
reviews
site_activities
users

In [45]:
%%sql
Describe exam_answers


8 rows affected.
Out[45]:
Field Type Null Key Default Extra
script_detail_id int(11) YES None
subcategory_name varchar(255) YES None
test_name varchar(255) YES None
step_type varchar(255) YES None
start_time datetime YES None
end_time datetime YES None
loop_number int(11) YES None
dog_guid varchar(60) YES None

In [39]:
%%sql
Describe complete_tests


6 rows affected.
Out[39]:
Field Type Null Key Default Extra
created_at datetime NO None
updated_at datetime NO None
user_guid varchar(60) YES MUL None
dog_guid varchar(60) YES MUL None
test_name varchar(60) YES None
subcategory_name varchar(60) YES None

In [46]:
%%sql
SELECT test_name, AVG(TIMESTAMPDIFF(HOUR,start_time,end_time)) AS avg_time
FROM exam_answers
WHERE TIMESTAMPDIFF(MINUTE,start_time,end_time) < 6000
GROUP BY test_name
HAVING AVG(TIMESTAMPDIFF(MINUTE,start_time,end_time)) > 0


67 rows affected.
Out[46]:
test_name avg_time
1 vs 1 Game 0.0000
3 vs 1 Game 0.0000
5 vs 1 Game 0.0000
Activity 0.5096
Arm Pointing 0.0238
Attachment 0.0381
Confinement 0.1757
Cover Your Eyes 0.0555
Delayed Cup Game 0.0158
Diet 0.6735
Different Perspective 0.0493
Emotions 0.1173
Environment 0.0779
Excitability 0.0263
Expression Game 0.0000
Eye Contact Game 0.0738
Eye Contact Warm-up 0.0513
Foot Pointing 0.0281
Gender 0.0882
Impossible Task Game 0.0238
Impossible Task Warm-up 0.1040
Inferential Reasoning Game 0.0157
Inferential Reasoning Warm-up 0.0336
Memory versus Pointing 0.0057
Memory versus Smell 0.0235
67 rows, truncated to displaylimit of 25

Question 6: Write a query that outputs the total number of unique User_Guids in each combination of State and ZIP code (postal code) in the United States, sorted first by state name in ascending alphabetical order, and second by total number of unique User_Guids in descending order (your first state should be AE and there should be 5043 rows in total in your output).


In [47]:
%%sql
Describe users


16 rows affected.
Out[47]:
Field Type Null Key Default Extra
sign_in_count int(11) YES 0
created_at datetime NO None
updated_at datetime NO None
max_dogs int(11) YES 0
membership_id int(11) YES None
subscribed tinyint(1) YES 0
exclude tinyint(1) YES None
free_start_user tinyint(1) YES None
last_active_at datetime YES None
membership_type int(11) YES None
user_guid text YES MUL None
city varchar(255) YES None
state varchar(255) YES None
zip varchar(255) YES None
country varchar(255) YES None
utc_correction varchar(255) YES None

In [50]:
%%sql
SELECT state, zip, COUNT(DISTINCT user_guid) AS user_number
FROM users
WHERE country='US'
GROUP BY state, zip
ORDER BY state ASC, user_number DESC


5043 rows affected.
Out[50]:
state zip user_number
AE 9128 2
AE 9845 1
AE 9053 1
AE 9107 1
AE 9469 1
AK 99507 3
AK 99709 3
AK 99501 2
AK 99577 2
AK 99502 1
AK 99509 1
AK 99516 1
AK 99518 1
AK 99567 1
AK 99587 1
AK 99611 1
AK 99645 1
AK 99676 1
AK 99705 1
AK 99712 1
AK 99752 1
AK 99775 1
AK 99824 1
AK 99928 1
AL 35209 4
5043 rows, truncated to displaylimit of 25

Question 7: Write a query that outputs the total number of unique User_Guids in each combination of State and ZIP code in the United States that have at least 5 users, sorted first by state name in ascending alphabetical order, and second by total number of unique User_Guids in descending order (your first state/ZIP code combination should be AZ/86303).


In [52]:
%%sql
SELECT state, zip, COUNT(DISTINCT user_guid) AS user_number
FROM users
WHERE country='US'
GROUP BY state, zip
HAVING user_number >= 5
ORDER BY state, user_number DESC


285 rows affected.
Out[52]:
state zip user_number
AZ 86303 14
AZ 85718 6
AZ 85253 5
AZ 85254 5
AZ 85260 5
AZ 85711 5
AZ 85749 5
CA 92107 16
CA 90046 13
CA 92130 12
CA 94107 12
CA 94110 10
CA 92024 10
CA 94611 9
CA 94114 9
CA 94941 9
CA 94025 9
CA 90049 8
CA 90068 8
CA 90069 8
CA 94131 8
CA 92064 8
CA 90803 7
CA 93003 7
CA 94117 7
285 rows, truncated to displaylimit of 25

**Be sure to watch the next video before beginning Exercise 6, the next set of MySQL exercises, and feel free to practice any other queries you wish below!**


In [ ]: