Copyright Jana Schaich Borg/Attribution-NonCommercial 4.0 International (CC BY-NC 4.0)

MySQL Exercise 12: Queries that Test Relationships Between Test Completion and Testing Circumstances

In this lesson, we are going to practice integrating more of the concepts we learned over the past few weeks to address whether issues in our Dognition sPAP are related to the number of tests dogs complete. We are going to focus on a subset of the issues listed in the "Features of Testing Circumstances" branch of our sPAP. You will need to look up new functions several times and the final queries at which we will arrive by the end of this lesson will be quite complex, but we will work up to them step-by-step.

To begin, load the sql library and database, and make the Dognition database your default database:


In [1]:
%load_ext sql
%sql mysql://studentuser:studentpw@mysqlserver/dognitiondb
%sql USE dognitiondb

%config SqlMagic.displaylimit=25


0 rows affected.

1. During which weekdays do Dognition users complete the most tests?

The first question we are going to address is whether there is a certain day of the week when users are more or less likely to complete Dognition tests. If so, targeting promotions or reminder emails to those times of the week might increase the number of tests users complete.

At first, the query we need to address this question might seem a bit intimidating, but once you can describe what the query needs to do in words, writing the query won't seem so challenging.

Ultimately, we want a count of the number of tests completed on each day of the week, with all of the dog_guids and user_guids the Dognition team flagged in their exclude column excluded. To achieve this, we are going to have to use the GROUP BY clause to break up counts of the records in the completed_tests table according to days of the week. We will also have to join the completed_tests table with the dogs and users table in order to exclude completed_tests records that are associated with dog_guids or user_guids that should be excluded. First, though, we need a method for extracting the day of the week from a time stamp. In MySQL Exercise 2 we used a function called "DAYNAME". That is the most efficient function to use for this purpose, but not all database systems have this function, so let's try using a different method for the queries in this lesson. Search these sites to find a function that will output a number from 1-7 for time stamps where 1 = Sunday, 2 = Monday, …, 7 = Saturday:

http://dev.mysql.com/doc/refman/5.7/en/func-op-summary-ref.html
http://www.w3resource.com/mysql/mysql-functions-and-operators.php

Question 1: Using the function you found in the websites above, write a query that will output one column with the original created_at time stamp from each row in the completed_tests table, and another column with a number that represents the day of the week associated with each of those time stamps. Limit your output to 200 rows starting at row 50.


In [3]:
%%sql
SELECT ct.created_at, DAYOFWEEK(ct.created_at)
FROM complete_tests ct
LIMIT 49, 200


200 rows affected.
Out[3]:
created_at DAYOFWEEK(ct.created_at)
2013-02-05 22:10:06 3
2013-02-05 22:23:49 3
2013-02-05 22:26:36 3
2013-02-05 22:29:02 3
2013-02-05 22:32:25 3
2013-02-05 22:33:09 3
2013-02-05 22:36:11 3
2013-02-05 22:38:01 3
2013-02-05 22:48:58 3
2013-02-05 22:53:45 3
2013-02-05 22:59:45 3
2013-02-05 23:01:38 3
2013-02-05 23:04:43 3
2013-02-05 23:06:10 3
2013-02-05 23:35:48 3
2013-02-05 23:40:57 3
2013-02-05 23:45:30 3
2013-02-05 23:48:46 3
2013-02-05 23:54:40 3
2013-02-05 23:59:15 3
2013-02-06 00:05:31 4
2013-02-06 00:12:23 4
2013-02-06 00:16:59 4
2013-02-06 00:20:04 4
2013-02-06 00:24:26 4
200 rows, truncated to displaylimit of 25

Of course, the results of the query in Question 1 would be much easier to interpret if the output included the name of the day of the week (or a relevant abbreviation) associated with each time stamp rather than a number index.

Question 2: Include a CASE statement in the query you wrote in Question 1 to output a third column that provides the weekday name (or an appropriate abbreviation) associated with each created_at time stamp.


In [5]:
%%sql
SELECT ct.created_at, DAYOFWEEK(ct.created_at),
    CASE DAYOFWEEK(ct.created_at)
    WHEN 1 THEN 'Sun'
    WHEN 2 THEN 'Mon'
    WHEN 3 THEN 'Tue'
    WHEN 4 THEN 'Wed'
    WHEN 5 THEN 'Thu'
    WHEN 6 THEN 'Fri'
    WHEN 7 THEN 'Sat'
    END AS weekday
FROM complete_tests ct
LIMIT 49, 200


200 rows affected.
Out[5]:
created_at DAYOFWEEK(ct.created_at) weekday
2013-02-05 22:10:06 3 Tue
2013-02-05 22:23:49 3 Tue
2013-02-05 22:26:36 3 Tue
2013-02-05 22:29:02 3 Tue
2013-02-05 22:32:25 3 Tue
2013-02-05 22:33:09 3 Tue
2013-02-05 22:36:11 3 Tue
2013-02-05 22:38:01 3 Tue
2013-02-05 22:48:58 3 Tue
2013-02-05 22:53:45 3 Tue
2013-02-05 22:59:45 3 Tue
2013-02-05 23:01:38 3 Tue
2013-02-05 23:04:43 3 Tue
2013-02-05 23:06:10 3 Tue
2013-02-05 23:35:48 3 Tue
2013-02-05 23:40:57 3 Tue
2013-02-05 23:45:30 3 Tue
2013-02-05 23:48:46 3 Tue
2013-02-05 23:54:40 3 Tue
2013-02-05 23:59:15 3 Tue
2013-02-06 00:05:31 4 Wed
2013-02-06 00:12:23 4 Wed
2013-02-06 00:16:59 4 Wed
2013-02-06 00:20:04 4 Wed
2013-02-06 00:24:26 4 Wed
200 rows, truncated to displaylimit of 25

Now that we are confident we have the correct syntax for extracting weekday labels from the created_at time stamps, we can start building our larger query that examines the number of tests completed on each weekday.

Question 3: Adapt the query you wrote in Question 2 to report the total number of tests completed on each weekday. Sort the results by the total number of tests completed in descending order. You should get a total of 33,190 tests in the Sunday row of your output.


In [9]:
%%sql
SELECT DAYOFWEEK(ct.created_at),
    CASE DAYOFWEEK(ct.created_at)
    WHEN 1 THEN 'Sun'
    WHEN 2 THEN 'Mon'
    WHEN 3 THEN 'Tue'
    WHEN 4 THEN 'Wed'
    WHEN 5 THEN 'Thu'
    WHEN 6 THEN 'Fri'
    WHEN 7 THEN 'Sat'
    END AS weekday,
    COUNT(ct.created_at) AS num_tests
FROM complete_tests ct
GROUP BY weekday
ORDER BY num_tests DESC


7 rows affected.
Out[9]:
DAYOFWEEK(ct.created_at) weekday num_tests
1 Sun 33190
2 Mon 30195
3 Tue 27989
7 Sat 27899
4 Wed 26473
5 Thu 24420
6 Fri 23080

So far these results suggest that users complete the most tests on Sunday night and the fewest tests on Friday night. We need to determine if this trend remains after flagged dog_guids and user_guids are excluded. Let's start by removing the dog_guids that have an exclude flag. We'll exclude user_guids with an exclude flag in later queries.

Question 4: Rewrite the query in Question 3 to exclude the dog_guids that have a value of "1" in the exclude column (Hint: this query will require a join.) This time you should get a total of 31,092 tests in the Sunday row of your output.


In [10]:
%%sql
SELECT DAYOFWEEK(ct.created_at),
    CASE DAYOFWEEK(ct.created_at)
    WHEN 1 THEN 'Sun'
    WHEN 2 THEN 'Mon'
    WHEN 3 THEN 'Tue'
    WHEN 4 THEN 'Wed'
    WHEN 5 THEN 'Thu'
    WHEN 6 THEN 'Fri'
    WHEN 7 THEN 'Sat'
    END AS weekday,
    COUNT(ct.created_at) AS num_tests
FROM complete_tests ct 
    JOIN dogs d ON ct.dog_guid=d.dog_guid
WHERE 
    (d.exclude=0 OR d.exclude IS NULL)
GROUP BY weekday
ORDER BY num_tests DESC


7 rows affected.
Out[10]:
DAYOFWEEK(ct.created_at) weekday num_tests
1 Sun 31092
2 Mon 28250
7 Sat 26231
3 Tue 25764
4 Wed 24501
5 Thu 22347
6 Fri 21028

Now we need to exclude the user_guids that have a value of "1" in the exclude column as well. One way to do this would be to join the completed_tests, dogs, and users table with a sequence of inner joins. However, we've seen in previous lessons that there are duplicate rows in the users table. These duplicates will get passed through the join and will affect the count calculations. To illustrate this, compare the following two queries.

Question 5: Write a query to retrieve all the dog_guids for users common to the dogs and users table using the traditional inner join syntax (your output will have 950,331 rows).


In [14]:
%%sql
SELECT COUNT(d.dog_guid)
FROM users u JOIN dogs d ON u.user_guid=d.user_guid


1 rows affected.
Out[14]:
COUNT(d.dog_guid)
950331

Question 6: Write a query to retrieve all the distinct dog_guids common to the dogs and users table using the traditional inner join syntax (your output will have 35,048 rows).


In [15]:
%%sql
SELECT COUNT(DISTINCT d.dog_guid)
FROM users u JOIN dogs d ON u.user_guid=d.user_guid


1 rows affected.
Out[15]:
COUNT(DISTINCT d.dog_guid)
35048

The strategy we will use to handle duplicate rows in the users table will be to, first, write a subquery that retrieves the distinct dog_guids from an inner join between the dogs and users table with the appropriate records excluded. Then, second, we will join the result of this subquery to the complete_tests table and group the results according to the day of the week.

Question 7: Start by writing a query that retrieves distinct dog_guids common to the dogs and users table, excuding dog_guids and user_guids with a "1" in their respective exclude columns (your output will have 34,121 rows).


In [18]:
%%sql
SELECT
    DISTINCT d.dog_guid
FROM users u JOIN dogs d ON u.user_guid=d.user_guid
WHERE (d.exclude=0 OR d.exclude IS NULL) AND (u.exclude=0 OR u.exclude IS NULL)


34121 rows affected.
Out[18]:
dog_guid
fd27b272-7144-11e5-ba71-058fbc01cf0b
fd417cac-7144-11e5-ba71-058fbc01cf0b
fd27b5ba-7144-11e5-ba71-058fbc01cf0b
fd3fb0f2-7144-11e5-ba71-058fbc01cf0b
fd27b6b4-7144-11e5-ba71-058fbc01cf0b
fd27b79a-7144-11e5-ba71-058fbc01cf0b
fd27b948-7144-11e5-ba71-058fbc01cf0b
fd27bbbe-7144-11e5-ba71-058fbc01cf0b
fd27c1c2-7144-11e5-ba71-058fbc01cf0b
fd27c0fa-7144-11e5-ba71-058fbc01cf0b
fd27c5be-7144-11e5-ba71-058fbc01cf0b
fd27c74e-7144-11e5-ba71-058fbc01cf0b
fd27c64a-7144-11e5-ba71-058fbc01cf0b
fd27c7d0-7144-11e5-ba71-058fbc01cf0b
fd27c852-7144-11e5-ba71-058fbc01cf0b
fd27c8d4-7144-11e5-ba71-058fbc01cf0b
fd27cd98-7144-11e5-ba71-058fbc01cf0b
fd27ce1a-7144-11e5-ba71-058fbc01cf0b
fd3d249a-7144-11e5-ba71-058fbc01cf0b
fd27c956-7144-11e5-ba71-058fbc01cf0b
fd3cf7b8-7144-11e5-ba71-058fbc01cf0b
fd3cf84e-7144-11e5-ba71-058fbc01cf0b
fd27cb72-7144-11e5-ba71-058fbc01cf0b
fd27cea6-7144-11e5-ba71-058fbc01cf0b
fd27cf28-7144-11e5-ba71-058fbc01cf0b
34121 rows, truncated to displaylimit of 25

Question 8: Now adapt your query from Question 4 so that it inner joins on the result of the subquery you wrote in Question 7 instead of the dogs table. This will give you a count of the number of tests completed on each day of the week, excluding all of the dog_guids and user_guids that the Dognition team flagged in the exclude columns.


In [20]:
%%sql
SELECT DAYOFWEEK(ct.created_at),
    CASE DAYOFWEEK(ct.created_at)
    WHEN 1 THEN 'Sun'
    WHEN 2 THEN 'Mon'
    WHEN 3 THEN 'Tue'
    WHEN 4 THEN 'Wed'
    WHEN 5 THEN 'Thu'
    WHEN 6 THEN 'Fri'
    WHEN 7 THEN 'Sat'
    END AS weekday,
    COUNT(ct.created_at) AS num_tests
FROM complete_tests ct 
    JOIN 
    (SELECT
        DISTINCT d.dog_guid
    FROM users u JOIN dogs d ON u.user_guid=d.user_guid
    WHERE (d.exclude=0 OR d.exclude IS NULL) AND (u.exclude=0 OR u.exclude IS NULL)
    ) AS dd
    ON ct.dog_guid=dd.dog_guid
GROUP BY weekday
ORDER BY num_tests DESC


7 rows affected.
Out[20]:
DAYOFWEEK(ct.created_at) weekday num_tests
1 Sun 31036
2 Mon 28138
7 Sat 26149
3 Tue 25696
4 Wed 24433
5 Thu 22323
6 Fri 21027

In [21]:
%%sql
SELECT DAYOFWEEK(c.created_at) AS dayasnum, YEAR(c.created_at) AS year,
COUNT(c.created_at) AS numtests,
(CASE
WHEN DAYOFWEEK(c.created_at)=1 THEN "Su"
WHEN DAYOFWEEK(c.created_at)=2 THEN "Mo"
WHEN DAYOFWEEK(c.created_at)=3 THEN "Tu"
WHEN DAYOFWEEK(c.created_at)=4 THEN "We"
WHEN DAYOFWEEK(c.created_at)=5 THEN "Th"
WHEN DAYOFWEEK(c.created_at)=6 THEN "Fr"
WHEN DAYOFWEEK(c.created_at)=7 THEN "Sa"
END) AS daylabel
FROM complete_tests c JOIN
    (SELECT DISTINCT dog_guid
        FROM dogs d JOIN users u
        ON d.user_guid=u.user_guid
        WHERE ((u.exclude IS NULL OR u.exclude=0)
        AND (d.exclude IS NULL OR d.exclude=0))
    ) AS dogs_cleaned
ON c.dog_guid=dogs_cleaned.dog_guid
GROUP BY daylabel
ORDER BY numtests DESC;


7 rows affected.
Out[21]:
dayasnum year numtests daylabel
1 2013 31036 Su
2 2013 28138 Mo
7 2013 26149 Sa
3 2013 25696 Tu
4 2013 24433 We
5 2014 22323 Th
6 2013 21027 Fr

These results still suggest that Sunday is the day when the most tests are completed and Friday is the day when the fewest tests are completed. However, our first query suggested that more tests were completed on Tuesday than Saturday; our current query suggests that slightly more tests are completed on Saturday than Tuesday, now that flagged dog_guids and user_guids are excluded.

It's always a good idea to see if a data pattern replicates before you interpret it too strongly. The ideal way to do this would be to have a completely separate and independent data set to analyze. We don't have such a data set, but we can assess the reliability of the day of the week patterns in a different way. We can test whether the day of the week patterns are the same in all years of our data set.

Question 9: Adapt your query from Question 8 to provide a count of the number of tests completed on each weekday of each year in the Dognition data set. Exclude all dog_guids and user_guids with a value of "1" in their exclude columns. Sort the output by year in ascending order, and then by the total number of tests completed in descending order. HINT: you will need a function described in one of these references to retrieve the year of each time stamp in the created_at field:

http://dev.mysql.com/doc/refman/5.7/en/func-op-summary-ref.html
http://www.w3resource.com/mysql/mysql-functions-and-operators.php


In [23]:
%%sql
SELECT DAYOFWEEK(ct.created_at),
    CASE DAYOFWEEK(ct.created_at)
    WHEN 1 THEN 'Sun'
    WHEN 2 THEN 'Mon'
    WHEN 3 THEN 'Tue'
    WHEN 4 THEN 'Wed'
    WHEN 5 THEN 'Thu'
    WHEN 6 THEN 'Fri'
    WHEN 7 THEN 'Sat'
    END AS weekday,
    YEAR(ct.created_at) AS year,
    COUNT(ct.created_at) AS num_tests
FROM complete_tests ct 
    JOIN 
    (SELECT
        DISTINCT d.dog_guid
    FROM users u JOIN dogs d ON u.user_guid=d.user_guid
    WHERE (d.exclude=0 OR d.exclude IS NULL) AND (u.exclude=0 OR u.exclude IS NULL)
    ) AS dd
    ON ct.dog_guid=dd.dog_guid
GROUP BY weekday, year
ORDER BY year ASC, num_tests DESC


21 rows affected.
Out[23]:
DAYOFWEEK(ct.created_at) weekday year num_tests
1 Sun 2013 8203
7 Sat 2013 6854
2 Mon 2013 5740
4 Wed 2013 5665
3 Tue 2013 5393
6 Fri 2013 4997
5 Thu 2013 4961
2 Mon 2014 9309
1 Sun 2014 9210
3 Tue 2014 9177
4 Wed 2014 8857
7 Sat 2014 8257
5 Thu 2014 7286
6 Fri 2014 6475
1 Sun 2015 13623
2 Mon 2015 13089
3 Tue 2015 11126
7 Sat 2015 11038
5 Thu 2015 10076
4 Wed 2015 9911
6 Fri 2015 9555

These results suggest that although the precise order of the weekdays with the most to fewest completed tests changes slightly from year to year, Sundays always have a lot of completed tests, and Fridays always have the fewest or close to the fewest completed tests. So far, it seems like it might be a good idea for Dognition to target reminder or encouragement messages to customers on Sundays. However, there is one more issue our analysis does not address. All of the time stamps in the created_at column are in Coordinated Universal Time (abbreviated UTC). This is a time convention that is constant around the globe. Nonetheless, as the picture below illustrates, countries and states have different time zones. The same UTC time can correspond with local times in different countries that are as much as 24 hours apart:

Therefore, the weekdays we have extracted so far may not accurately reflect the weekdays in the local times of different countries. The only way to correct the time stamps for time zone differences is to obtain a table with the time zones of every city, state, or country. Such a table was not available to us in this course, but we can run some analyses that approximate a time zone correction for United States customers.

Question 10: First, adapt your query from Question 9 so that you only examine customers located in the United States, with Hawaii and Alaska residents excluded. HINTS: In this data set, the abbreviation for the United States is "US", the abbreviation for Hawaii is "HI" and the abbreviation for Alaska is "AK". You should have 5,860 tests completed on Sunday of 2013.


In [24]:
%%sql
SELECT DAYOFWEEK(ct.created_at),
    CASE DAYOFWEEK(ct.created_at)
    WHEN 1 THEN 'Sun'
    WHEN 2 THEN 'Mon'
    WHEN 3 THEN 'Tue'
    WHEN 4 THEN 'Wed'
    WHEN 5 THEN 'Thu'
    WHEN 6 THEN 'Fri'
    WHEN 7 THEN 'Sat'
    END AS weekday,
    YEAR(ct.created_at) AS year,
    COUNT(ct.created_at) AS num_tests
FROM complete_tests ct 
    JOIN 
    (SELECT
        DISTINCT d.dog_guid
    FROM users u JOIN dogs d ON u.user_guid=d.user_guid
    WHERE (d.exclude=0 OR d.exclude IS NULL) AND (u.exclude=0 OR u.exclude IS NULL)
     AND u.country='US' AND u.state NOT IN ('HI', 'AK')
    ) AS dd
    ON ct.dog_guid=dd.dog_guid
GROUP BY weekday, year
ORDER BY year ASC, num_tests DESC


21 rows affected.
Out[24]:
DAYOFWEEK(ct.created_at) weekday year num_tests
1 Sun 2013 5860
7 Sat 2013 4674
2 Mon 2013 3695
4 Wed 2013 3496
3 Tue 2013 3449
6 Fri 2013 3163
5 Thu 2013 3090
2 Mon 2014 7278
3 Tue 2014 6800
1 Sun 2014 6632
4 Wed 2014 6331
7 Sat 2014 6006
5 Thu 2014 5271
6 Fri 2014 4831
2 Mon 2015 8784
1 Sun 2015 8570
3 Tue 2015 7218
7 Sat 2015 7116
5 Thu 2015 6343
4 Wed 2015 6164
6 Fri 2015 5689

The next step is to adjust the created_at times for differences in time zone. Most United States states (excluding Hawaii and Alaska) have a time zone of UTC time -5 hours (in the eastern-most regions) to -8 hours (in the western-most regions). To get a general idea for how much our weekday analysis is likely to change based on time zone, we will subtract 6 hours from every time stamp in the complete_tests table. Although this means our time stamps can be inaccurate by 1 or 2 hours, people are not likely to be playing Dognition games at midnight, so 1-2 hours should not affect the weekdays extracted from each time stamp too much.

The functions used to subtract time differ across database systems, so you should double-check which function you need to use every time you are working with a new database. We will use the date_sub function:

http://www.w3schools.com/sql/func_date_sub.asp

Question 11: Write a query that extracts the original created_at time stamps for rows in the complete_tests table in one column, and the created_at time stamps with 6 hours subtracted in another column. Limit your output to 100 rows.


In [33]:
%%sql
SELECT DAYOFWEEK(DATE_SUB(ct.created_at,INTERVAL 6 HOUR)) AS weeknum,
    CASE DAYOFWEEK(DATE_SUB(ct.created_at,INTERVAL 6 HOUR))
    WHEN 1 THEN 'Sun'
    WHEN 2 THEN 'Mon'
    WHEN 3 THEN 'Tue'
    WHEN 4 THEN 'Wed'
    WHEN 5 THEN 'Thu'
    WHEN 6 THEN 'Fri'
    WHEN 7 THEN 'Sat'
    END AS weekday,
    YEAR(ct.created_at) AS year,
    COUNT(ct.created_at) AS num_tests
FROM complete_tests ct 
    JOIN 
    (SELECT
        DISTINCT d.dog_guid
    FROM users u JOIN dogs d ON u.user_guid=d.user_guid
    WHERE (d.exclude=0 OR d.exclude IS NULL) AND (u.exclude=0 OR u.exclude IS NULL)
     AND u.country='US' AND u.state NOT IN ('HI', 'AK')
    ) AS dd
    ON ct.dog_guid=dd.dog_guid
GROUP BY weekday, year
ORDER BY year ASC, num_tests DESC


21 rows affected.
Out[33]:
weeknum weekday year num_tests
1 Sun 2013 6061
7 Sat 2013 4754
2 Mon 2013 3798
4 Wed 2013 3410
3 Tue 2013 3276
5 Thu 2013 3079
6 Fri 2013 3049
2 Mon 2014 7908
1 Sun 2014 7736
3 Tue 2014 6513
7 Sat 2014 6081
4 Wed 2014 5772
5 Thu 2014 4800
6 Fri 2014 4339
1 Sun 2015 10406
2 Mon 2015 8229
7 Sat 2015 7154
3 Tue 2015 6673
4 Wed 2015 6266
5 Thu 2015 5881
6 Fri 2015 5275

Question 12: Use your query from Question 11 to adapt your query from Question 10 in order to provide a count of the number of tests completed on each day of the week, with approximate time zones taken into account, in each year in the Dognition data set. Exclude all dog_guids and user_guids with a value of "1" in their exclude columns. Sort the output by year in ascending order, and then by the total number of tests completed in descending order. HINT: Don't forget to adjust for the time zone in your DAYOFWEEK statement and your CASE statement.


In [32]:
%%sql
SELECT DAYOFWEEK(DATE_SUB(ct.created_at,INTERVAL 6 HOUR)) AS weeknum,
    CASE DAYOFWEEK(DATE_SUB(ct.created_at,INTERVAL 6 HOUR))
    WHEN 1 THEN 'Sun'
    WHEN 2 THEN 'Mon'
    WHEN 3 THEN 'Tue'
    WHEN 4 THEN 'Wed'
    WHEN 5 THEN 'Thu'
    WHEN 6 THEN 'Fri'
    WHEN 7 THEN 'Sat'
    END AS weekday,
    YEAR(ct.created_at) AS year,
    COUNT(ct.created_at) AS num_tests
FROM complete_tests ct 
    JOIN 
    (SELECT
        DISTINCT d.dog_guid
    FROM users u JOIN dogs d ON u.user_guid=d.user_guid
    WHERE (d.exclude=0 OR d.exclude IS NULL) AND (u.exclude=0 OR u.exclude IS NULL)
     AND u.country='US' AND u.state NOT IN ('HI', 'AK')
    ) AS dd
    ON ct.dog_guid=dd.dog_guid
GROUP BY weekday, year
ORDER BY year ASC, num_tests DESC


21 rows affected.
Out[32]:
weeknum weekday year num_tests
1 Sun 2013 6061
7 Sat 2013 4754
2 Mon 2013 3798
4 Wed 2013 3410
3 Tue 2013 3276
5 Thu 2013 3079
6 Fri 2013 3049
2 Mon 2014 7908
1 Sun 2014 7736
3 Tue 2014 6513
7 Sat 2014 6081
4 Wed 2014 5772
5 Thu 2014 4800
6 Fri 2014 4339
1 Sun 2015 10406
2 Mon 2015 8229
7 Sat 2015 7154
3 Tue 2015 6673
4 Wed 2015 6266
5 Thu 2015 5881
6 Fri 2015 5275

You can try re-running the query with time-zone corrections of 5, 7, or 8 hours, and the results remain essentially the same. All of these analyses suggest that customers are most likely to complete tests around Sunday and Monday, and least likely to complete tests around the end of the work week, on Thursday and Friday. This is certainly valuable information for Dognition to take advantage of.

If you were presenting this information to the Dognition team, you might want to present the information in the form of a graph that you make in another program. The graph would be easier to read if the output was ordered according to the days of the week shown in standard calendars, with Monday being the first day and Sunday being the last day. MySQL provides an easy way to do this using the FIELD function in the ORDER BY statement:

https://www.virendrachandak.com/techtalk/mysql-ordering-results-by-specific-field-values/

Question 13: Adapt your query from Question 12 so that the results are sorted by year in ascending order, and then by the day of the week in the following order: Monday, Tuesday, Wednesday, Thursday, Friday, Saturday, Sunday.


In [34]:
%%sql
SELECT DAYOFWEEK(DATE_SUB(ct.created_at,INTERVAL 6 HOUR)) AS weeknum,
    CASE DAYOFWEEK(DATE_SUB(ct.created_at,INTERVAL 6 HOUR))
    WHEN 1 THEN 'Sun'
    WHEN 2 THEN 'Mon'
    WHEN 3 THEN 'Tue'
    WHEN 4 THEN 'Wed'
    WHEN 5 THEN 'Thu'
    WHEN 6 THEN 'Fri'
    WHEN 7 THEN 'Sat'
    END AS weekday,
    YEAR(ct.created_at) AS year,
    COUNT(ct.created_at) AS num_tests
FROM complete_tests ct 
    JOIN 
    (SELECT
        DISTINCT d.dog_guid
    FROM users u JOIN dogs d ON u.user_guid=d.user_guid
    WHERE (d.exclude=0 OR d.exclude IS NULL) AND (u.exclude=0 OR u.exclude IS NULL)
     AND u.country='US' AND u.state NOT IN ('HI', 'AK')
    ) AS dd
    ON ct.dog_guid=dd.dog_guid
GROUP BY weekday, year
ORDER BY year ASC, FIELD(weekday, 'Mon','Tue', 'Wed', 'Thu', 'Fri', 'Sat', 'Sun') ASC


21 rows affected.
Out[34]:
weeknum weekday year num_tests
2 Mon 2013 3798
3 Tue 2013 3276
4 Wed 2013 3410
5 Thu 2013 3079
6 Fri 2013 3049
7 Sat 2013 4754
1 Sun 2013 6061
2 Mon 2014 7908
3 Tue 2014 6513
4 Wed 2014 5772
5 Thu 2014 4800
6 Fri 2014 4339
7 Sat 2014 6081
1 Sun 2014 7736
2 Mon 2015 8229
3 Tue 2015 6673
4 Wed 2015 6266
5 Thu 2015 5881
6 Fri 2015 5275
7 Sat 2015 7154
1 Sun 2015 10406

Unfortunately other database platforms do not have the ORDER BY FIELD functionality. To achieve the same result in other platforms, you would have to use a CASE statement or a more advanced solution:

http://stackoverflow.com/questions/1309624/simulating-mysqls-order-by-field-in-postgresql

The link provided above is to a discussion on stackoverflow.com. Stackoverflow is a great website that, in their words, "is a community of 4.7 million programmers, just like you, helping each other." You can ask questions about SQL queries and get help from other experts, or search through questions posted previously to see if somebody else has already asked a question that is relevant to the problem you are trying to solve. It's a great resource to use whenever you run into trouble with your queries.

2. Which states and countries have the most Dognition users?

You ended up with a pretty long and complex query in the questions above that you tested step-by-step. Many people save these types of queries so that they can be adapted for similar queries in the future without having to redesign and retest the entire query.

In the next two questions, we will practice repurposing previously-designed queries for new questions. Both questions can be answered through relatively minor modifications of the queries you wrote above.

Question 14: Which 5 states within the United States have the most Dognition customers, once all dog_guids and user_guids with a value of "1" in their exclude columns are removed? Try using the following general strategy: count how many unique user_guids are associated with dogs in the complete_tests table, break up the counts according to state, sort the results by counts of unique user_guids in descending order, and then limit your output to 5 rows. California ("CA") and New York ("NY") should be at the top of your list.


In [ ]:

The number of unique Dognition users in California is more than two times greater than any other state. This information could be very helpful to Dognition. Useful follow-up questions would be: were special promotions run in California that weren't run in other states? Did Dognition use advertising channels that are particularly effective in California? If not, what traits differentiate California users from other users? Can these traits be taken advantage of in future marketing efforts or product developments?

Let's try one more analysis that examines testing circumstances from a different angle.

Question 15: Which 10 countries have the most Dognition customers, once all dog_guids and user_guids with a value of "1" in their exclude columns are removed? HINT: don't forget to remove the u.country="US" statement from your WHERE clause.


In [38]:
%%sql
SELECT dd.country,
    COUNT(ct.created_at) AS num_tests
FROM complete_tests ct 
    JOIN 
    (SELECT
        DISTINCT d.dog_guid, u.country
    FROM users u JOIN dogs d ON u.user_guid=d.user_guid
    WHERE (d.exclude=0 OR d.exclude IS NULL) AND (u.exclude=0 OR u.exclude IS NULL)
    ) AS dd
    ON ct.dog_guid=dd.dog_guid
GROUP BY dd.country
ORDER BY num_tests DESC


67 rows affected.
Out[38]:
country num_tests
US 121399
N/A 38970
CA 7276
GB 1885
AU 1861
NZ 670
NO 611
DE 608
DK 537
CH 419
FR 409
BR 364
IT 356
ES 306
SG 262
NL 258
EE 228
MX 208
BE 191
ZA 173
FI 155
IL 126
HU 121
IE 102
CL 91
67 rows, truncated to displaylimit of 25

The United States, Canada, Australia, and Great Britain are the countries with the most Dognition users. N/A refers to "not applicable" which essentially means we have no usable country data from those rows. After Great Britain, the number of Dognition users drops quite a lot. This analysis suggests that Dognition is most likely to be used by English-speaking countries. One question Dognition might want to consider is whether there are any countries whose participation would dramatically increase if a translated website were available.

3. Congratulations!

You have now written many complex queries on your own that address real analysis questions about a real business problem. You know how to look up new functions, you know how to troubleshoot your queries by isolating each piece of the query until you are sure the syntax is correct, and you know where to look for help if you get stuck. You are ready to start using SQL in your own business ventures. Keep learning, keep trying new things, and keep asking questions. Congratulations for taking your career to the next level!

There is another video to watch, and of course, more exercises to work through using the Dillard's data set.

In the meantime, enjoy practicing any other queries you want to try here:


In [ ]: