Topic: Challenge Set 9 Part II
Subject: SQL
Date: 02/20/2017
Name: Prashant Tatineni
In [2]:
from sqlalchemy import create_engine
import pandas as pd
cnx = create_engine('postgresql://prashant:ptpro3@52.14.144.23:5432/prashant')
#port ~ 5432
In [2]:
pd.read_sql_query('''SELECT * FROM allstarfull LIMIT 5''',cnx)
Out[2]:
In [3]:
pd.read_sql_query('''SELECT * FROM schools LIMIT 5''',cnx)
Out[3]:
In [35]:
pd.read_sql_query('''SELECT * FROM salaries LIMIT 5''',cnx)
Out[35]:
In [4]:
pd.read_sql_query('''SELECT schoolstate,Count(schoolid) as ct FROM schools Group By schoolstate ORDER BY ct DESC LIMIT 5''',cnx)
Out[4]:
In [5]:
pd.read_sql_query('''SELECT playerid,salary
FROM Salaries
WHERE yearid = '1985' and salary > '500000' LIMIT 5;''',cnx)
Out[5]:
In [ ]:
In [6]:
Out[6]:
In [ ]:
In [ ]:
In [ ]:
In [ ]:
In [ ]:
In [ ]:
In [ ]:
In [ ]:
1. What was the total spent on salaries by each team, each year?
In [14]:
pd.read_sql_query('''SELECT yearid, teamid, SUM(salary) FROM salaries
GROUP BY 1,2
ORDER BY 1 DESC
LIMIT 10
''',cnx)
Out[14]:
2. What is the first and last year played for each player? Hint: Create a new table from 'Fielding.csv'.
In [19]:
pd.read_sql_query('''SELECT playerid, min(yearid), max(yearid)
FROM fielding
GROUP BY 1
LIMIT 10
''',cnx)
Out[19]:
3. Who has played the most all star games?
In [22]:
pd.read_sql_query('''SELECT playerid, COUNT(*)
FROM allstarfull
GROUP BY 1
ORDER BY 2 DESC
LIMIT 1
''',cnx)
Out[22]:
4. Which school has generated the most distinct players? Hint: Create new table from 'CollegePlaying.csv'.
In [8]:
pd.read_sql_query('''SELECT schoolid, count(distinct playerid)
FROM schoolsplayers
GROUP BY 1
ORDER BY 2 DESC
LIMIT 1
''',cnx)
Out[8]:
5. Which players have the longest career? Assume that the debut and finalGame columns comprise the start and end, respectively, of a player's career. Hint: Create a new table from 'Master.csv'. Also note that strings can be converted to dates using the DATE function and can then be subtracted from each other yielding their difference in days.
In [30]:
pd.read_sql_query('''SELECT playerid, finalgame, debut, (finalgame-debut) AS days
FROM master
WHERE finalgame IS NOT NULL and debut IS NOT NULL
ORDER BY 4 DESC
LIMIT 5''',cnx)
Out[30]:
6. What is the distribution of debut months? Hint: Look at the DATE and EXTRACT functions.
In [34]:
pd.read_sql_query('''SELECT EXTRACT(MONTH FROM debut) AS debut_month, COUNT(*)
FROM master
GROUP BY 1
ORDER BY 1 ASC''',cnx)
Out[34]:
7. What is the effect of table join order on mean salary for the players listed in the main (master) table? Hint: Perform two different queries, one that joins on playerID in the salary table and other that joins on the same column in the master table. You will have to use left joins for each since right joins are not currently supported with SQLalchemy.
In [43]:
pd.read_sql_query('''SELECT S.playerid, AVG(salary)
FROM Salaries S LEFT JOIN Master M
ON S.playerid = M.playerid
GROUP BY 1
LIMIT 5''',cnx)
Out[43]:
In [48]:
pd.read_sql_query('''SELECT M.playerid, AVG(salary)
FROM Master M LEFT JOIN Salaries S
ON M.playerid = S.playerid
GROUP BY 1
LIMIT 5''',cnx)
Out[48]:
By joining to the Master table, we can see that there's many players with no salary data.