Topic:    Challenge Set 9 Part II
Subject:  SQL
Date:     02/20/2017
Name:     Prashant Tatineni

In [2]:
from sqlalchemy import create_engine
import pandas as pd
cnx = create_engine('postgresql://prashant:ptpro3@52.14.144.23:5432/prashant')
#port ~ 5432

In [2]:
pd.read_sql_query('''SELECT * FROM allstarfull LIMIT 5''',cnx)


Out[2]:
playerid yearid gamenum gameid teamid lgid gp startingpos
0 gomezle01 1933 0 ALS193307060 NYA AL 1 1
1 ferreri01 1933 0 ALS193307060 BOS AL 1 2
2 gehrilo01 1933 0 ALS193307060 NYA AL 1 3
3 gehrich01 1933 0 ALS193307060 DET AL 1 4
4 dykesji01 1933 0 ALS193307060 CHA AL 1 5

In [3]:
pd.read_sql_query('''SELECT * FROM schools LIMIT 5''',cnx)


Out[3]:
schoolid schoolname schoolcity schoolstate schoolnick
0 abilchrist Abilene Christian University Abilene TX Wildcats
1 adelphi Adelphi University Garden City NY Panthers
2 adrianmi Adrian College Adrian MI Bulldogs
3 airforce United States Air Force Academy Colorado Springs CO Falcons
4 akron University of Akron Akron OH Zips

In [35]:
pd.read_sql_query('''SELECT * FROM salaries LIMIT 5''',cnx)


Out[35]:
yearid teamid lgid playerid salary
0 1985 BAL AL murraed02 1472819.0
1 1985 BAL AL lynnfr01 1090000.0
2 1985 BAL AL ripkeca01 800000.0
3 1985 BAL AL lacyle01 725000.0
4 1985 BAL AL flanami01 641667.0

In [4]:
pd.read_sql_query('''SELECT schoolstate,Count(schoolid) as ct FROM schools Group By schoolstate ORDER BY ct DESC LIMIT 5''',cnx)


Out[4]:
schoolstate ct
0 PA 57
1 CA 48
2 NY 45
3 TX 41
4 OH 33

In [5]:
pd.read_sql_query('''SELECT playerid,salary 
                     FROM Salaries 
                     WHERE yearid = '1985' and salary > '500000' LIMIT 5;''',cnx)


Out[5]:
playerid salary
0 barkele01 870000.0
1 bedrost01 550000.0
2 benedbr01 545000.0
3 campri01 633333.0
4 ceronri01 625000.0

In [ ]:


In [6]:



Out[6]:
schoolid schoolname schoolcity schoolstate schoolnick
0 abilchrist Abilene Christian University Abilene TX Wildcats
1 adelphi Adelphi University Garden City NY Panthers
2 adrianmi Adrian College Adrian MI Bulldogs
3 airforce United States Air Force Academy Colorado Springs CO Falcons
4 akron University of Akron Akron OH Zips

In [ ]:


In [ ]:


In [ ]:


In [ ]:


In [ ]:


In [ ]:


In [ ]:


In [ ]:

1. What was the total spent on salaries by each team, each year?


In [14]:
pd.read_sql_query('''SELECT yearid, teamid, SUM(salary) FROM salaries
                        GROUP BY 1,2
                        ORDER BY 1 DESC
                        LIMIT 10
                        ''',cnx)


Out[14]:
yearid teamid sum
0 2013 ARI 90132000.0
1 2013 PHI 169863189.0
2 2013 LAA 124174750.0
3 2013 SLN 92260110.0
4 2013 PIT 77062000.0
5 2013 SFN 140180334.0
6 2013 KCA 80091725.0
7 2013 MIN 75337500.0
8 2013 COL 74409071.0
9 2013 MIL 76947033.0

2. What is the first and last year played for each player? Hint: Create a new table from 'Fielding.csv'.


In [19]:
pd.read_sql_query('''SELECT playerid, min(yearid), max(yearid)
                        FROM fielding
                        GROUP BY 1
                        LIMIT 10
                        ''',cnx)


Out[19]:
playerid min max
0 gagnied01 1914 1915
1 ramospe01 1955 1970
2 putkolu01 2012 2013
3 boyerke01 1955 1969
4 dillibo01 1946 1951
5 jeterde01 1995 2013
6 sperrst01 1936 1938
7 vidrojo01 1997 2008
8 izturce01 2001 2013
9 clarkda06 2007 2007

3. Who has played the most all star games?


In [22]:
pd.read_sql_query('''SELECT playerid, COUNT(*)
                        FROM allstarfull
                        GROUP BY 1
                        ORDER BY 2 DESC
                        LIMIT 1
                        ''',cnx)


Out[22]:
playerid count
0 aaronha01 25

4. Which school has generated the most distinct players? Hint: Create new table from 'CollegePlaying.csv'.


In [8]:
pd.read_sql_query('''SELECT schoolid, count(distinct playerid)
                        FROM schoolsplayers
                        GROUP BY 1
                        ORDER BY 2 DESC
                        LIMIT 1
                        ''',cnx)


Out[8]:
schoolid count
0 usc 102

5. Which players have the longest career? Assume that the debut and finalGame columns comprise the start and end, respectively, of a player's career. Hint: Create a new table from 'Master.csv'. Also note that strings can be converted to dates using the DATE function and can then be subtracted from each other yielding their difference in days.


In [30]:
pd.read_sql_query('''SELECT playerid, finalgame, debut, (finalgame-debut) AS days
                        FROM master
                        WHERE finalgame IS NOT NULL and debut IS NOT NULL
                        ORDER BY 4 DESC
                        LIMIT 5''',cnx)


Out[30]:
playerid finalgame debut days
0 altroni01 1933-10-01 1898-07-14 12862
1 orourji01 1904-09-22 1872-04-26 11836
2 minosmi01 1980-10-05 1949-04-19 11492
3 olearch01 1934-09-30 1904-04-14 11126
4 lathaar01 1909-09-30 1880-07-05 10678

6. What is the distribution of debut months? Hint: Look at the DATE and EXTRACT functions.


In [34]:
pd.read_sql_query('''SELECT EXTRACT(MONTH FROM debut) AS debut_month, COUNT(*)
                        FROM master
                        GROUP BY 1
                        ORDER BY 1 ASC''',cnx)


Out[34]:
debut_month count
0 3.0 41
1 4.0 4711
2 5.0 2230
3 6.0 1893
4 7.0 1978
5 8.0 1943
6 9.0 5061
7 10.0 308
8 NaN 189

7. What is the effect of table join order on mean salary for the players listed in the main (master) table? Hint: Perform two different queries, one that joins on playerID in the salary table and other that joins on the same column in the master table. You will have to use left joins for each since right joins are not currently supported with SQLalchemy.


In [43]:
pd.read_sql_query('''SELECT S.playerid, AVG(salary)
                        FROM Salaries S LEFT JOIN Master M
                        ON S.playerid = M.playerid
                        GROUP BY 1
                        LIMIT 5''',cnx)


Out[43]:
playerid avg
0 bookech01 350000.0
1 wisede01 443750.0
2 contrjo01 6750000.0
3 myersro02 175000.0
4 jeterde01 14034338.5

In [48]:
pd.read_sql_query('''SELECT M.playerid, AVG(salary)
                        FROM Master M LEFT JOIN Salaries S
                        ON M.playerid = S.playerid
                        GROUP BY 1
                        LIMIT 5''',cnx)


Out[48]:
playerid avg
0 gagnied01 None
1 ramospe01 None
2 putkolu01 None
3 boyerke01 None
4 dillibo01 None

By joining to the Master table, we can see that there's many players with no salary data.