Copyright Jana Schaich Borg/Attribution-NonCommercial 4.0 International (CC BY-NC 4.0)
There are a few more logical operators we haven't covered yet that you might find useful when designing your queries. Expressions that use logical operators return a result of "true" or "false", depending on whether the conditions you specify are met. The "true" or "false" results are usually used to determine which, if any, subsequent parts of your query will be run. We will discuss the IF operator, the CASE operator, and the order of operations within logical expressions in this lesson.
Begin by loading the sql library and database, and making the Dognition database your default database:
In [8]:
%load_ext sql
%sql mysql://studentuser:studentpw@mysqlserver/dognitiondb
%sql USE dognitiondb
%config SqlMagic.displaylimit=25
IF expressions are used to return one of two results based on whether inputs to the expressions meet the conditions you specify. They are frequently used in SELECT statements as a compact way to rename values in a column. The basic syntax is as follows:
IF([your conditions],[value outputted if conditions are met],[value outputted if conditions are NOT met])
So we could write:
SELECT created_at, IF(created_at<'2014-06-01','early_user','late_user') AS user_type
FROM users
to output one column that provided the time stamp of when a user account was created, and a second column called user_type that used that time stamp to determine whether the user was an early or late user. User_type could then be used in a GROUP BY statement to segment summary calculations (in database systems that support the use of aliases in GROUP BY statements).
For example, since we know there are duplicate user_guids in the user table, we could combine a subquery with an IF statement to retrieve a list of unique user_guids with their classification as either an early or late user (based on when their first user entry was created):
SELECT cleaned_users.user_guid as UserID,
IF(cleaned_users.first_account<'2014-06-01','early_user','late_user') AS user_type
FROM (SELECT user_guid, MIN(created_at) AS first_account
FROM users
GROUP BY user_guid) AS cleaned_users
We could then use a GROUP BY statement to count the number of unique early or late users:
SELECT IF(cleaned_users.first_account<'2014-06-01','early_user','late_user') AS user_type,
COUNT(cleaned_users.first_account)
FROM (SELECT user_guid, MIN(created_at) AS first_account
FROM users
GROUP BY user_guid) AS cleaned_users
GROUP BY user_type
Try it yourself:
In [3]:
%%sql
SELECT IF(cleaned_users.first_account<'2014-06-01','early_user','late_user') AS user_type,
COUNT(cleaned_users.first_account)
FROM (SELECT user_guid, MIN(created_at) AS first_account
FROM users
GROUP BY user_guid) AS cleaned_users
GROUP BY user_type
Out[3]:
Question 1: Write a query that will output distinct user_guids and their associated country of residence from the users table, excluding any user_guids that have NULL values. You should get 16,261 rows in your result.
In [7]:
%%sql
SELECT DISTINCT u.user_guid, u.country
FROM users u
WHERE u.user_guid IS NOT NULL AND u.country IS NOT NULL
Out[7]:
Question 2: Use an IF expression and the query you wrote in Question 1 as a subquery to determine the number of unique user_guids who reside in the United States (abbreviated "US") and outside of the US.
In [8]:
%%sql
SELECT IF(cleanU.country='US', 'in US', 'out US') AS location, COUNT(cleanU.user_guid)
FROM (SELECT DISTINCT u.user_guid, u.country
FROM users u
WHERE u.user_guid IS NOT NULL AND u.country IS NOT NULL) AS cleanU
GROUP BY location
Out[8]:
In [9]:
%%sql
SELECT IF(cleaned_users.country='US','In US','Outside US') AS US_user,
count(cleaned_users.user_guid)
FROM (SELECT DISTINCT user_guid, country
FROM users
WHERE country IS NOT NULL) AS cleaned_users
GROUP BY US_user;
Out[9]:
Single IF expressions can only result in one of two specified outputs, but multiple IF expressions can be nested to result in more than two possible outputs. When you nest IF expressions, it is important to encase each IF expression--as well as the entire IF expression put together--in parentheses.
For example, if you examine the entries contained in the non-US countries category, you will see that many users are associated with a country called "N/A." "N/A" is an abbreviation for "Not Applicable"; it is not a real country name. We should separate these entries from the "Outside of the US" category we made earlier. We could use a nested query to say whenever "country" does not equal "US", use the results of a second IF expression to determine whether the outputed value should be "Not Applicable" or "Outside US." The IF expression would look like this:
IF(cleaned_users.country='US','In US', IF(cleaned_users.country='N/A','Not Applicable','Outside US'))
Since the second IF expression is in the position within the IF expression where you specify "value outputted if conditions are not met," its two possible outputs will only be considered if cleaned_users.country='US' is evaluated as false.
The full query to output the number of unique users in each of the three groups would be:
SELECT IF(cleaned_users.country='US','In US',
IF(cleaned_users.country='N/A','Not Applicable','Outside US')) AS US_user,
count(cleaned_users.user_guid)
FROM (SELECT DISTINCT user_guid, country
FROM users
WHERE country IS NOT NULL) AS cleaned_users
GROUP BY US_user
Try it yourself. You should get 5,642 unique user_guids in the "Not Applicable" category, and 1,263 users in the "Outside US" category.
In [10]:
%%sql
SELECT IF(cleaned_users.country='US','In US',
IF(cleaned_users.country='N/A','Not Applicable','Outside US')) AS US_user,
count(cleaned_users.user_guid)
FROM (SELECT DISTINCT user_guid, country
FROM users
WHERE country IS NOT NULL) AS cleaned_users
GROUP BY US_user
Out[10]:
The IF function is not supported by all database platforms, and some spell the function as IIF rather than IF, so be sure to double-check how the function works in the platform you are using.
If nested IF expressions seem confusing or hard to read, don't worry, there is a better function available for situations when you want to use conditional logic to output more than two groups. That function is called CASE.
The main purpose of CASE expressions is to return a singular value based on one or more conditional tests. You can think of CASE expressions as an efficient way to write a set of IF and ELSEIF statements. There are two viable syntaxes for CASE expressions. If you need to manipulate values in a current column of your data, you would use this syntax:
Using this syntax, our nested IF statement from above could be written as:
SELECT CASE WHEN cleaned_users.country="US" THEN "In US"
WHEN cleaned_users.country="N/A" THEN "Not Applicable"
ELSE "Outside US"
END AS US_user,
count(cleaned_users.user_guid)
FROM (SELECT DISTINCT user_guid, country
FROM users
WHERE country IS NOT NULL) AS cleaned_users
GROUP BY US_user
Go ahead and try it:
In [11]:
%%sql
SELECT CASE WHEN cleaned_users.country="US" THEN "In US"
WHEN cleaned_users.country="N/A" THEN "Not Applicable"
ELSE "Outside US"
END AS US_user,
COUNT(cleaned_users.user_guid)
FROM (SELECT DISTINCT user_guid, country
FROM users
WHERE country IS NOT NULL) AS cleaned_users
GROUP BY US_user
Out[11]:
Since our query does not require manipulation of any of the values in the country column, though, we could also take advantage of this syntax, which is slightly more compact:
Our query written in this syntax would look like this:
SELECT CASE cleaned_users.country
WHEN "US" THEN "In US"
WHEN "N/A" THEN "Not Applicable"
ELSE "Outside US"
END AS US_user,
count(cleaned_users.user_guid)
FROM (SELECT DISTINCT user_guid, country
FROM users
WHERE country IS NOT NULL) AS cleaned_users
GROUP BY US_user
Try this query as well:
In [12]:
%%sql
SELECT CASE cleaned_users.country
WHEN "US" THEN "In US"
WHEN "N/A" THEN "Not Applicable"
ELSE "Outside US"
END AS US_user,
count(cleaned_users.user_guid)
FROM (SELECT DISTINCT user_guid, country
FROM users
WHERE country IS NOT NULL) AS cleaned_users
GROUP BY US_user
Out[12]:
There are a couple of things to know about CASE expressions:
You will find that CASE statements are useful in many contexts. For example, they can be used to rename or revise values in a column.
Question 3: Write a query using a CASE statement that outputs 3 columns: dog_guid, dog_fixed, and a third column that reads "neutered" every time there is a 1 in the "dog_fixed" column of dogs, "not neutered" every time there is a value of 0 in the "dog_fixed" column of dogs, and "NULL" every time there is a value of anything else in the "dog_fixed" column. Limit your results for troubleshooting purposes.
In [16]:
%%sql
SELECT d.dog_guid, d.dog_fixed,
CASE
WHEN d.dog_fixed=1 THEN 'neutered'
WHEN d.dog_fixed=0 THEN 'not neutered'
ELSE NULL
END AS label
FROM dogs d
LIMIT 10
Out[16]:
You can also use CASE statements to standardize or combine several values into one.
Question 4: We learned that NULL values should be treated the same as "0" values in the exclude columns of the dogs and users tables. Write a query using a CASE statement that outputs 3 columns: dog_guid, exclude, and a third column that reads "exclude" every time there is a 1 in the "exclude" column of dogs and "keep" every time there is any other value in the exclude column. Limit your results for troubleshooting purposes.
In [20]:
%%sql
SELECT d.dog_guid, d.exclude,
CASE d.exclude
WHEN 1 THEN 'exclude'
ELSE 'keep'
END AS label
FROM dogs d
LIMIT 10
Out[20]:
Question 5: Re-write your query from Question 4 using an IF statement instead of a CASE statement.
In [21]:
%%sql
SELECT d.dog_guid, d.exclude,
IF(d.exclude=1, 'exclude', 'keep') AS label
FROM dogs d
LIMIT 10
Out[21]:
Case expressions are also useful for breaking values in a column up into multiple groups that meet specific criteria or that have specific ranges of values.
Question 6: Write a query that uses a CASE expression to output 3 columns: dog_guid, weight, and a third column that reads...
"very small" when a dog's weight is 1-10 pounds
"small" when a dog's weight is greater than 10 pounds to 30 pounds
"medium" when a dog's weight is greater than 30 pounds to 50 pounds
"large" when a dog's weight is greater than 50 pounds to 85 pounds
"very large" when a dog's weight is greater than 85 pounds
Limit your results for troubleshooting purposes.
In [25]:
%%sql
SELECT d.dog_guid, d.weight,
CASE
WHEN d.weight >=1 and d.weight <=10 THEN 'very small'
WHEN d.weight >10 and d.weight <=30 THEN 'small'
WHEN d.weight >30 and d.weight <=50 THEN 'medium'
WHEN d.weight >50 and d.weight <=85 THEN 'large'
WHEN d.weight >85 THEN 'very large'
ELSE 'just weird'
END AS label
FROM dogs d
LIMIT 10
Out[25]:
As you started to see with the query you wrote in Question 6, CASE expressions often end up needing multiple AND and OR operators to accurately describe the logical conditions you want to impose on the groups in your queries. You must pay attention to the order in which these operators are included in your logical expressions, because unless parentheses are included, the NOT operator is always evaluated before an AND operator, and an AND operator is always evaluated before the OR operator.
When parentheses are included, the expressions within the parenthese are evaluated first. That means this expression:
CASE WHEN "condition 1" OR "condition 2" AND "condition 3"...
will lead to different results than this expression:
CASE WHEN "condition 3" AND "condition 1" OR "condition 2"...
or this expression:
CASE WHEN ("condition 1" OR "condition 2") AND "condition 3"...
In the first case you will get rows that meet condition 2 and 3, or condition 1. In the second case you will get rows that meet condition 1 and 3, or condition 2. In the third case, you will get rows that meet condition 1 or 2, and condition 3.
Let's see a concrete example of how the order in which logical operators are evaluated affects query results.
Question 7: How many distinct dog_guids are found in group 1 using this query?
SELECT COUNT(DISTINCT dog_guid),
CASE WHEN breed_group='Sporting' OR breed_group='Herding' AND exclude!='1' THEN "group 1"
ELSE "everything else"
END AS groups
FROM dogs
GROUP BY groups
In [30]:
%%sql
SELECT COUNT(DISTINCT dog_guid),
CASE WHEN breed_group='Sporting' OR breed_group='Herding' AND exclude!='1' THEN "group 1"
ELSE "everything else"
END AS groups
FROM dogs
GROUP BY groups
Out[30]:
Question 8: How many distinct dog_guids are found in group 1 using this query?
SELECT COUNT(DISTINCT dog_guid),
CASE WHEN exclude!='1' AND breed_group='Sporting' OR breed_group='Herding' THEN "group 1"
ELSE "everything else"
END AS group_name
FROM dogs
GROUP BY group_name
In [27]:
%%sql
SELECT COUNT(DISTINCT dog_guid),
CASE WHEN exclude!='1' AND breed_group='Sporting' OR breed_group='Herding' THEN "group 1"
ELSE "everything else"
END AS group_name
FROM dogs
GROUP BY group_name
Out[27]:
Question 9: How many distinct dog_guids are found in group 1 using this query?
SELECT COUNT(DISTINCT dog_guid),
CASE WHEN exclude!='1' AND (breed_group='Sporting' OR breed_group='Herding') THEN "group 1"
ELSE "everything else"
END AS group_name
FROM dogs
GROUP BY group_name
In [28]:
%%sql
SELECT COUNT(DISTINCT dog_guid),
CASE WHEN exclude!='1' AND (breed_group='Sporting' OR breed_group='Herding') THEN "group 1"
ELSE "everything else"
END AS group_name
FROM dogs
GROUP BY group_name
Out[28]:
**So make sure you always pay attention to the order in which your logical operators are listed in your expressions, and whenever possible, include parentheses to ensure that the expressions are evaluated in the way you intend!**
Question 10: For each dog_guid, output its dog_guid, breed_type, number of completed tests, and use an IF statement to include an extra column that reads "Pure_Breed" whenever breed_type equals 'Pure Breed" and "Not_Pure_Breed" whenever breed_type equals anything else. LIMIT your output to 50 rows for troubleshooting. HINT: you will need to use a join to complete this query.
In [5]:
%%sql
SELECT d.dog_guid, d.breed_type, COUNT(ct.dog_guid),
IF(d.breed_type='Pure Breed', 'Pure_Breed', 'Not_Pure_Breed') AS 'Pure or not'
FROM dogs d JOIN complete_tests ct on d.dog_guid=ct.dog_guid
GROUP BY d.dog_guid
Out[5]:
In [6]:
%%sql
SELECT d.dog_guid AS dogID, d.breed_type AS breed_type, count(c.created_at) AS
numtests,
IF(d.breed_type='Pure Breed','pure_breed', 'not_pure_breed') AS pure_breed
FROM dogs d, complete_tests c
WHERE d.dog_guid=c.dog_guid
GROUP BY dogID
Out[6]:
Question 11: Write a query that uses a CASE statement to report the number of unique user_guids associated with customers who live in the United States and who are in the following groups of states:
Group 1: New York (abbreviated "NY") or New Jersey (abbreviated "NJ")
Group 2: North Carolina (abbreviated "NC") or South Carolina (abbreviated "SC")
Group 3: California (abbreviated "CA")
Group 4: All other states with non-null values
You should find 898 unique user_guids in Group1.
In [29]:
%%sql
SELECT CASE
WHEN (u.state='NY' OR u.state='NJ') THEN 'New York'
WHEN (u.state='CA') THEN 'California'
WHEN (u.state='NC' OR u.state='SC') THEN 'Carlolina'
ELSE NULL
END AS states_group,
COUNT(u.state)
FROM (SELECT DISTINCT uu.user_guid, uu.state
FROM users uu
WHERE uu.country='US'
) AS u
GROUP BY states_group
Out[29]:
Question 12: Write a query that allows you to determine how many unique dog_guids are associated with dogs who are DNA tested and have either stargazer or socialite personality dimensions. Your answer should be 70.
In [36]:
%%sql
SELECT DISTINCT d.dog_guid
FROM dogs d
WHERE (d.dimension='stargazer' OR d.dimension='socialite') AND d.dna_tested=1
Out[36]:
Feel free to practice any other queries you like here!
In [ ]: