Pandas Exercise

Use the Adventure Works dataset to create the following reports. The data files can be directlty accessed by all users on ARC (our Hadoop Cluster).

No need to download or copy into your home directory!

Reference to the original data set https://msdn.microsoft.com/en-us/library/hh403424.aspx

Task

  1. write the Python Pandas expression to produce a table as described in the problem statements.
  2. The SQL expression may give you a hint. It also allows you to see both systems side-by-side.
  3. If you don't know SQL just ignore the SQL code.

In [1]:
import pandas as pd
import numpy as np

In [2]:
%%time
Employees = pd.read_excel('/home/data/AdventureWorks/Employees.xls')
print("shape:", Employees.shape)


shape: (291, 26)
CPU times: user 74.6 ms, sys: 2.6 ms, total: 77.2 ms
Wall time: 75.7 ms

In [6]:
%%time
Territory = pd.read_excel('/home/data/AdventureWorks/SalesTerritory.xls')
print("shape:", Territory.shape)


(12, 6)
CPU times: user 10.9 ms, sys: 340 µs, total: 11.3 ms
Wall time: 10.2 ms

In [11]:
%%time
Customers = pd.read_excel('/home/data/AdventureWorks/Customers.xls')
print("shape:", Customers.shape)


shape: (17, 6)
CPU times: user 8.18 ms, sys: 1.54 ms, total: 9.72 ms
Wall time: 8.64 ms

In [10]:
%%time
Orders = pd.read_excel('/home/data/AdventureWorks/ItemsOrdered.xls')
print("shape:", Orders.shape)


shape: (32, 5)
CPU times: user 11.2 ms, sys: 499 µs, total: 11.7 ms
Wall time: 10.9 ms

Filtering (with)

1. Provide a list of employees that are married.

SELECT e.EmployeeID, e.FirstName, e.LastName FROM dbo.Employees AS e WHERE e.MaritalStatus = 'M';

In [23]:
Employees.MaritalStatus == 'M'


Out[23]:
0       True
1       True
2       True
3       True
4       True
5      False
6      False
7      False
8      False
9      False
10      True
11     False
12      True
13     False
14      True
15      True
16     False
17     False
18      True
19      True
20     False
21     False
22     False
23     False
24      True
25      True
26      True
27      True
28      True
29      True
       ...  
261     True
262    False
263    False
264     True
265    False
266     True
267     True
268    False
269     True
270     True
271    False
272     True
273    False
274    False
275     True
276    False
277    False
278     True
279     True
280     True
281     True
282    False
283     True
284     True
285    False
286    False
287    False
288     True
289    False
290     True
Name: MaritalStatus, dtype: bool

In [29]:
Employees.loc[Employees.MaritalStatus == 'M', ['EmployeeID', 'FirstName', 'LastName', 'MaritalStatus']]


Out[29]:
EmployeeID FirstName LastName MaritalStatus
0 259 Ben Miller M
1 278 Garrett Vargas M
2 204 Gabe Mares M
3 78 Reuben D'sa M
4 255 Gordon Hee M
10 10 Michael Raheem M
12 175 Reed Koch M
14 144 Paul Singh M
15 192 Brenda Diaz M
18 203 Ken Myer M
19 245 Barbara Moreland M
24 163 Alex Nayberg M
25 118 Don Hall M
26 176 David Lawrence M
27 147 Sandra Reátegui Alayo M
28 138 Samantha Smith M
29 95 Jim Scardelis M
30 231 Jo Berry M
31 86 Ryan Cornelsen M
33 214 Andreas Berglund M
34 72 Steven Selikoff M
36 107 Christopher Hill M
37 269 Dan Bacon M
38 187 Yvonne McKay M
40 12 Thierry D'Hers M
42 30 Britta Simon M
43 198 Lorraine Nay M
45 31 Margie Shoop M
48 3 Roberto Tamburello M
49 199 Paula Nartker M
... ... ... ... ...
227 136 Sylvester Valdez M
231 35 Brandon Heidepriem M
232 228 Christian Kleinerman M
233 223 Sairaj Uddin M
234 100 Lolan Song M
239 200 Frank Lee M
240 279 Tsvi Reiter M
243 73 Carole Poland M
249 21 Terry Eminhizer M
252 254 Fukiko Ogisu M
255 53 Diane Glimp M
256 282 José Saraiva M
258 209 Kathie Flood M
259 50 Sidney Higa M
261 70 David Ortiz M
264 140 Prasanna Samarawickrama M
266 5 Gail Erickson M
267 252 Arvind Rao M
269 233 Magnus Hedlund M
270 83 Patrick Cook M
272 241 David Liu M
275 239 Mindy Martin M
278 38 Kim Abercrombie M
279 154 Raymond Sam M
280 114 Mindaugas Krapauskas M
281 123 Vamsi Kuppa M
283 101 Houman Pournasseh M
284 221 Chris Norred M
288 289 Jae Pak M
290 999 Chadwick Smith M

147 rows × 4 columns

2a. Show me a list of employees that have a lastname that begins with "R".

SELECT e.EmployeeID, e.FirstName, e.LastName FROM dbo.Employees AS e WHERE e.LastName LIKE 'R%';

In [4]:
filter = Employees.LastName.map(lambda x: x[0] == 'R')

In [5]:
Employees[filter]


Out[5]:
EmployeeID ManagerID TerritoryID Title FirstName MiddleName LastName Suffix JobTitle NationalIDNumber ... SickLeaveHours PhoneNumber PhoneNumberType EmailAddress AddressLine1 AddressLine2 City StateProvinceName PostalCode CountryName
9 124 250.0 NaN NaN Kim T Ralls NaN Stocker 420776180 ... 69 309-555-0129 Work kim0@adventure-works.com 1226 Shoe St. NaN Bothell Washington 98011 United States
10 10 25.0 NaN NaN Michael NaN Raheem NaN Research and Development Manager 879342154 ... 64 330-555-2568 Work michael6@adventure-works.com 1234 Seaside Way NaN San Francisco California 94109 United States
16 166 26.0 NaN NaN Jack S Richins NaN Production Supervisor - WC30 60114406 ... 54 552-555-0111 Work jack0@adventure-works.com 1356 Grove Way NaN Monroe Washington 98272 United States
27 147 26.0 NaN NaN Sandra NaN Reátegui Alayo NaN Production Technician - WC30 604664374 ... 38 896-555-0168 Work sandra0@adventure-works.com 1619 Stillman Court NaN Seattle Washington 98104 United States
44 133 26.0 NaN NaN Michael L Rothkugel NaN Production Technician - WC40 830150469 ... 55 454-555-0119 Work michael1@adventure-works.com 207 Berry Court NaN Edmonds Washington 98020 United States
95 44 26.0 NaN NaN Simon D Rapier NaN Production Technician - WC60 276751903 ... 39 963-555-0134 Work simon0@adventure-works.com 3421 Bouncing Road NaN Duvall Washington 98019 United States
99 65 26.0 NaN NaN Randy T Reeves NaN Production Technician - WC60 8066363 ... 34 961-555-0122 Cell randy0@adventure-works.com 3632 Bank Way NaN Edmonds Washington 98020 United States
128 145 26.0 NaN NaN Cynthia S Randall NaN Production Supervisor - WC30 386315192 ... 54 352-555-0138 Cell cynthia0@adventure-works.com 463 H Stagecoach Rd. NaN Kenmore Washington 98028 United States
131 149 26.0 NaN NaN Andy M Ruth NaN Production Technician - WC30 764853868 ... 39 118-555-0110 Cell andy0@adventure-works.com 4777 Rockne Drive NaN Bellevue Washington 98004 United States
166 74 26.0 NaN NaN Bjorn M Rettig NaN Production Technician - WC30 420023788 ... 41 199-555-0117 Work bjorn0@adventure-works.com 5802 Ampersand Drive NaN Seattle Washington 98104 United States
180 71 26.0 NaN NaN Michael Sean Ray NaN Production Supervisor - WC30 578935259 ... 55 156-555-0199 Cell michael3@adventure-works.com 6498 Mining Rd. NaN Seattle Washington 98104 United States
188 190 26.0 NaN NaN Robert J Rounthwaite NaN Production Technician - WC45 456839592 ... 61 589-555-0147 Cell robert0@adventure-works.com 6843 San Simeon Dr. NaN Seattle Washington 98104 United States
219 157 26.0 NaN NaN Linda A Randall NaN Production Technician - WC20 812797414 ... 26 696-555-0157 Work linda1@adventure-works.com 77 Birchwood NaN Seattle Washington 98104 United States
240 279 274.0 5.0 NaN Tsvi Michael Reiter NaN Sales Representative 716374314 ... 34 664-555-0112 Work tsvi0@adventure-works.com 8291 Crossbow Way NaN Memphis Tennessee 38103 United States
267 252 250.0 NaN NaN Arvind B Rao NaN Buyer 792847334 ... 50 848-555-0163 Cell arvind0@adventure-works.com 9495 Limewood Place NaN Renton Washington 98055 United States

15 rows × 26 columns


In [12]:
df = Employees[Employees.LastName.map(lambda x: x[0] == 'R')]
df[['EmployeeID', 'FirstName', 'LastName']]


Out[12]:
EmployeeID FirstName LastName
9 124 Kim Ralls
10 10 Michael Raheem
16 166 Jack Richins
27 147 Sandra Reátegui Alayo
44 133 Michael Rothkugel
95 44 Simon Rapier
99 65 Randy Reeves
128 145 Cynthia Randall
131 149 Andy Ruth
166 74 Bjorn Rettig
180 71 Michael Ray
188 190 Robert Rounthwaite
219 157 Linda Randall
240 279 Tsvi Reiter
267 252 Arvind Rao

2b. Show me a list of employees that have a lastname that ends with "r"

SELECT e.EmployeeID, e.FirstName, e.LastName FROM dbo.Employees AS e WHERE e.LastName LIKE '%r';

In [35]:
Employees.loc[Employees.LastName.map(lambda x: x[-1] == 'r'), ['EmployeeID', 'FirstName', 'LastName']]


Out[35]:
EmployeeID FirstName LastName
0 259 Ben Miller
8 161 Kirk Koenigsbauer
18 203 Ken Myer
49 199 Paula Nartker
53 41 Bryan Baker
56 104 Mary Baker
64 225 Alan Brewer
75 156 Lane Sacksteder
95 44 Simon Rapier
97 96 Elizabeth Keyser
98 229 Lori Penor
127 39 Ed Dudenhoefer
129 42 James Kramer
165 117 Chad Niswonger
168 57 Frank Miller
172 105 Kevin Homer
173 253 Linda Meisner
196 7 Dylan Miller
200 183 Barbara Decker
203 142 Olinda Turner
216 94 Russell Hunter
222 273 Brian Welcker
229 216 Sean Alexander
240 279 Tsvi Reiter
249 21 Terry Eminhizer
254 262 David Barber
282 29 Mark McArthur

2c. Provide a list of employees that have a hyphenated lastname.

SELECT e.EmployeeID, e.FirstName, e.LastName FROM dbo.Employees AS e WHERE e.LastName LIKE '%-%';

In [16]:
help(pd.Series.between)


Help on function between in module pandas.core.series:

between(self, left, right, inclusive=True)
    Return boolean Series equivalent to left <= series <= right. NA values
    will be treated as False
    
    Parameters
    ----------
    left : scalar
        Left boundary
    right : scalar
        Right boundary
    
    Returns
    -------
    is_between : Series


In [14]:
type(Employees.LastName)
dir(type(Employees.LastName))


Out[14]:
['T',
 '_AXIS_ALIASES',
 '_AXIS_IALIASES',
 '_AXIS_LEN',
 '_AXIS_NAMES',
 '_AXIS_NUMBERS',
 '_AXIS_ORDERS',
 '_AXIS_REVERSED',
 '_AXIS_SLICEMAP',
 '__abs__',
 '__add__',
 '__and__',
 '__array__',
 '__array_prepare__',
 '__array_priority__',
 '__array_wrap__',
 '__bool__',
 '__bytes__',
 '__class__',
 '__contains__',
 '__delattr__',
 '__delitem__',
 '__dict__',
 '__dir__',
 '__div__',
 '__doc__',
 '__eq__',
 '__finalize__',
 '__float__',
 '__floordiv__',
 '__format__',
 '__ge__',
 '__getattr__',
 '__getattribute__',
 '__getitem__',
 '__getstate__',
 '__gt__',
 '__hash__',
 '__iadd__',
 '__imul__',
 '__init__',
 '__int__',
 '__invert__',
 '__ipow__',
 '__isub__',
 '__iter__',
 '__itruediv__',
 '__le__',
 '__len__',
 '__long__',
 '__lt__',
 '__mod__',
 '__module__',
 '__mul__',
 '__ne__',
 '__neg__',
 '__new__',
 '__nonzero__',
 '__or__',
 '__pow__',
 '__radd__',
 '__rand__',
 '__rdiv__',
 '__reduce__',
 '__reduce_ex__',
 '__repr__',
 '__rfloordiv__',
 '__rmod__',
 '__rmul__',
 '__ror__',
 '__round__',
 '__rpow__',
 '__rsub__',
 '__rtruediv__',
 '__rxor__',
 '__setattr__',
 '__setitem__',
 '__setstate__',
 '__sizeof__',
 '__str__',
 '__sub__',
 '__subclasshook__',
 '__truediv__',
 '__unicode__',
 '__weakref__',
 '__xor__',
 '_accessors',
 '_add_numeric_operations',
 '_add_series_only_operations',
 '_add_series_or_dataframe_operations',
 '_agg_by_level',
 '_align_frame',
 '_align_series',
 '_allow_index_ops',
 '_at',
 '_binop',
 '_box_item_values',
 '_can_hold_na',
 '_check_inplace_setting',
 '_check_is_chained_assignment_possible',
 '_check_percentile',
 '_check_setitem_copy',
 '_clear_item_cache',
 '_consolidate_inplace',
 '_construct_axes_dict',
 '_construct_axes_dict_for_slice',
 '_construct_axes_dict_from',
 '_construct_axes_from_arguments',
 '_constructor',
 '_constructor_expanddim',
 '_constructor_sliced',
 '_convert',
 '_create_indexer',
 '_dir_additions',
 '_dir_deletions',
 '_expand_axes',
 '_from_axes',
 '_get_axis',
 '_get_axis_name',
 '_get_axis_number',
 '_get_axis_resolvers',
 '_get_block_manager_axis',
 '_get_bool_data',
 '_get_cacher',
 '_get_index_resolvers',
 '_get_item_cache',
 '_get_numeric_data',
 '_get_repr',
 '_get_values',
 '_get_values_tuple',
 '_get_with',
 '_iat',
 '_iget_item_cache',
 '_iloc',
 '_index',
 '_indexed_same',
 '_info_axis',
 '_info_axis_name',
 '_info_axis_number',
 '_init_mgr',
 '_internal_names',
 '_internal_names_set',
 '_is_cached',
 '_is_datelike_mixed_type',
 '_is_mixed_type',
 '_is_numeric_mixed_type',
 '_is_view',
 '_ix',
 '_ixs',
 '_loc',
 '_make_cat_accessor',
 '_make_dt_accessor',
 '_make_str_accessor',
 '_maybe_cache_changed',
 '_maybe_update_cacher',
 '_metadata',
 '_needs_reindex_multi',
 '_protect_consolidate',
 '_reduce',
 '_reindex_axes',
 '_reindex_axis',
 '_reindex_indexer',
 '_reindex_multi',
 '_reindex_with_indexers',
 '_reset_cache',
 '_reset_cacher',
 '_set_as_cached',
 '_set_axis',
 '_set_axis_name',
 '_set_is_copy',
 '_set_item',
 '_set_labels',
 '_set_name',
 '_set_subtyp',
 '_set_values',
 '_set_with',
 '_set_with_engine',
 '_setup_axes',
 '_slice',
 '_stat_axis',
 '_stat_axis_name',
 '_stat_axis_number',
 '_typ',
 '_unpickle_series_compat',
 '_update_inplace',
 '_validate_dtype',
 '_values',
 '_xs',
 'abs',
 'add',
 'add_prefix',
 'add_suffix',
 'align',
 'all',
 'any',
 'append',
 'apply',
 'argmax',
 'argmin',
 'argsort',
 'as_blocks',
 'as_matrix',
 'asfreq',
 'asobject',
 'asof',
 'astype',
 'at',
 'at_time',
 'autocorr',
 'axes',
 'base',
 'between',
 'between_time',
 'bfill',
 'blocks',
 'bool',
 'cat',
 'clip',
 'clip_lower',
 'clip_upper',
 'combine',
 'combine_first',
 'compound',
 'compress',
 'consolidate',
 'convert_objects',
 'copy',
 'corr',
 'count',
 'cov',
 'cummax',
 'cummin',
 'cumprod',
 'cumsum',
 'data',
 'describe',
 'diff',
 'div',
 'divide',
 'dot',
 'drop',
 'drop_duplicates',
 'dropna',
 'dt',
 'dtype',
 'dtypes',
 'duplicated',
 'empty',
 'eq',
 'equals',
 'ewm',
 'expanding',
 'factorize',
 'ffill',
 'fillna',
 'filter',
 'first',
 'first_valid_index',
 'flags',
 'floordiv',
 'from_array',
 'from_csv',
 'ftype',
 'ftypes',
 'ge',
 'get',
 'get_dtype_counts',
 'get_ftype_counts',
 'get_value',
 'get_values',
 'groupby',
 'gt',
 'hasnans',
 'head',
 'hist',
 'iat',
 'idxmax',
 'idxmin',
 'iget',
 'iget_value',
 'iloc',
 'imag',
 'index',
 'interpolate',
 'irow',
 'is_copy',
 'is_time_series',
 'is_unique',
 'isin',
 'isnull',
 'item',
 'items',
 'itemsize',
 'iteritems',
 'iterkv',
 'ix',
 'keys',
 'kurt',
 'kurtosis',
 'last',
 'last_valid_index',
 'le',
 'loc',
 'lt',
 'mad',
 'map',
 'mask',
 'max',
 'mean',
 'median',
 'memory_usage',
 'min',
 'mod',
 'mode',
 'mul',
 'multiply',
 'name',
 'nbytes',
 'ndim',
 'ne',
 'nlargest',
 'nonzero',
 'notnull',
 'nsmallest',
 'nunique',
 'order',
 'pct_change',
 'pipe',
 'plot',
 'pop',
 'pow',
 'prod',
 'product',
 'ptp',
 'put',
 'quantile',
 'radd',
 'rank',
 'ravel',
 'rdiv',
 'real',
 'reindex',
 'reindex_axis',
 'reindex_like',
 'rename',
 'rename_axis',
 'reorder_levels',
 'repeat',
 'replace',
 'resample',
 'reset_index',
 'reshape',
 'rfloordiv',
 'rmod',
 'rmul',
 'rolling',
 'round',
 'rpow',
 'rsub',
 'rtruediv',
 'sample',
 'searchsorted',
 'select',
 'sem',
 'set_axis',
 'set_value',
 'shape',
 'shift',
 'size',
 'skew',
 'slice_shift',
 'sort',
 'sort_index',
 'sort_values',
 'sortlevel',
 'squeeze',
 'std',
 'str',
 'strides',
 'sub',
 'subtract',
 'sum',
 'swapaxes',
 'swaplevel',
 'tail',
 'take',
 'to_clipboard',
 'to_csv',
 'to_dense',
 'to_dict',
 'to_frame',
 'to_hdf',
 'to_json',
 'to_msgpack',
 'to_period',
 'to_pickle',
 'to_sparse',
 'to_sql',
 'to_string',
 'to_timestamp',
 'to_xarray',
 'tolist',
 'transpose',
 'truediv',
 'truncate',
 'tshift',
 'tz_convert',
 'tz_localize',
 'unique',
 'unstack',
 'update',
 'valid',
 'value_counts',
 'values',
 'var',
 'view',
 'where',
 'xs']

In [36]:
Employees.loc[Employees.LastName.map(lambda x: '-' in x), ['EmployeeID', 'FirstName', 'LastName']]


Out[36]:
EmployeeID FirstName LastName
114 284 Tete Mensa-Annan
134 180 Katie McAskill-White
176 280 Pamela Ansman-Wolfe

3a. Provide a list of employees that are on salary and have more than 35 vacation hours left.

SELECT e.EmployeeID ,e.FirstName ,e.LastName ,e.VacationHours ,e.SalariedFlag FROM dbo.Employees AS e WHERE (e.SalariedFlag = 1) AND (e.VacationHours > 35);

In [38]:
Employees.loc[(Employees.SalariedFlag==1) & (Employees.VacationHours>35), ['EmployeeID', 'FirstName', 'LastName', 'SalariedFlag', 'VacationHours']]


Out[38]:
EmployeeID FirstName LastName SalariedFlag VacationHours
6 270 François Ajenstat 1 67
11 248 Mike Seamans 1 59
19 245 Barbara Moreland 1 58
37 269 Dan Bacon 1 72
57 263 Jean Trenary 1 65
65 212 Peng Wu 1 81
66 227 Gary Altman 1 86
93 286 Lynn Tsoflias 1 36
101 26 Peter Krebs 1 43
106 16 David Bradley 1 40
108 268 Ramesh Meyyappan 1 73
114 284 Tete Mensa-Annan 1 39
121 235 Paula Barreto de Mattos 1 54
123 1 Ken Sánchez 1 99
125 249 Wendy Kahn 1 55
130 8 Diane Margheim 1 62
140 211 Hazem Abolrous 1 80
145 267 Karen Berg 1 74
147 250 Sheela Word 1 49
151 121 Pilar Ackerman 1 93
167 271 Dan Wilson 1 66
169 272 Janaina Bueno 1 71
196 7 Dylan Miller 1 61
210 264 Stephanie Conroy 1 68
223 9 Gigi Matthew 1 63
232 228 Christian Kleinerman 1 92
235 275 Michael Blythe 1 38
272 241 David Liu 1 57
273 25 James Hamilton 1 64
288 289 Jae Pak 1 37

3b. Show the same as above but limit it to American employees.

SELECT DISTINCT CountryName FROM dbo.Employees; SELECT e.EmployeeID ,e.FirstName ,e.LastName ,e.VacationHours ,e.SalariedFlag ,e.CountryName FROM dbo.Employees AS e WHERE (e.SalariedFlag = 1) AND (e.VacationHours > 35) AND e.CountryName = 'United States';

In [ ]:
Employees.loc[(Employees.SalariedFlag==1) & (Employees.VacationHours>35), ['EmployeeID', 'FirstName', 'LastName', 'SalariedFlag', 'VacationHours']]

3c. Show the same as above but limit it to non-American employees.

SELECT e.EmployeeID ,e.FirstName ,e.LastName ,e.VacationHours ,e.SalariedFlag ,e.CountryName FROM dbo.Employees AS e WHERE (e.SalariedFlag = 1) AND (e.VacationHours > 5) AND e.CountryName != 'United States';

In [39]:
Employees.loc[
    (Employees.SalariedFlag==1) & (Employees.VacationHours>35) & (Employees.CountryName != 'United States')
    , ['EmployeeID', 'FirstName', 'LastName', 'SalariedFlag', 'VacationHours']
]


Out[39]:
EmployeeID FirstName LastName SalariedFlag VacationHours
93 286 Lynn Tsoflias 1 36
288 289 Jae Pak 1 37

4a. List the married employees with more than 35 vacation hours, only ones living in Washington state.

SELECT e.EmployeeID ,e.FirstName ,e.LastName ,e.VacationHours ,e.SalariedFlag ,e.StateProvinceName ,e.CountryName FROM dbo.Employees AS e WHERE e.MaritalStatus = 'M' AND e.VacationHours > 35 AND e.StateProvinceName = 'Washington' ;

In [40]:
Employees.loc[
    (Employees.SalariedFlag==1) & (Employees.VacationHours>35) & (Employees.StateProvinceName == 'Washington')
    , ['EmployeeID', 'FirstName', 'LastName', 'SalariedFlag', 'VacationHours', 'StateProvinceName']
]


Out[40]:
EmployeeID FirstName LastName SalariedFlag VacationHours StateProvinceName
6 270 François Ajenstat 1 67 Washington
11 248 Mike Seamans 1 59 Washington
19 245 Barbara Moreland 1 58 Washington
37 269 Dan Bacon 1 72 Washington
57 263 Jean Trenary 1 65 Washington
65 212 Peng Wu 1 81 Washington
66 227 Gary Altman 1 86 Washington
101 26 Peter Krebs 1 43 Washington
106 16 David Bradley 1 40 Washington
108 268 Ramesh Meyyappan 1 73 Washington
121 235 Paula Barreto de Mattos 1 54 Washington
123 1 Ken Sánchez 1 99 Washington
125 249 Wendy Kahn 1 55 Washington
130 8 Diane Margheim 1 62 Washington
140 211 Hazem Abolrous 1 80 Washington
145 267 Karen Berg 1 74 Washington
147 250 Sheela Word 1 49 Washington
151 121 Pilar Ackerman 1 93 Washington
167 271 Dan Wilson 1 66 Washington
169 272 Janaina Bueno 1 71 Washington
196 7 Dylan Miller 1 61 Washington
210 264 Stephanie Conroy 1 68 Washington
223 9 Gigi Matthew 1 63 Washington
232 228 Christian Kleinerman 1 92 Washington
272 241 David Liu 1 57 Washington
273 25 James Hamilton 1 64 Washington

4b. Change the logic to include anyone who meets any of the 3 conditions (i.e., people who are either married, live in Washington state, or have more than 35 vacation hours left)

SELECT e.EmployeeID ,e.FirstName ,e.LastName ,e.MaritalStatus ,e.VacationHours ,e.SalariedFlag ,e.StateProvinceName ,e.CountryName FROM dbo.Employees AS e WHERE e.MaritalStatus = 'M' OR e.VacationHours > 35 OR e.StateProvinceName = 'Washington' ;

In [41]:
Employees.loc[
    (Employees.SalariedFlag==1) | (Employees.VacationHours>35) | (Employees.StateProvinceName == 'Washington')
    , ['EmployeeID', 'FirstName', 'LastName', 'SalariedFlag', 'VacationHours', 'StateProvinceName']
]


Out[41]:
EmployeeID FirstName LastName SalariedFlag VacationHours StateProvinceName
0 259 Ben Miller 0 55 Washington
1 278 Garrett Vargas 1 33 Alberta
2 204 Gabe Mares 0 57 Washington
3 78 Reuben D'sa 0 72 Washington
4 255 Gordon Hee 0 52 Washington
5 66 Karan Khanna 0 28 Washington
6 270 François Ajenstat 1 67 Washington
7 22 Sariya Harnpadoungsataya 0 45 Washington
8 161 Kirk Koenigsbauer 0 74 Washington
9 124 Kim Ralls 0 98 Washington
10 10 Michael Raheem 1 16 California
11 248 Mike Seamans 1 59 Washington
12 175 Reed Koch 0 23 Washington
13 155 Fadi Fakhouri 0 16 Washington
14 144 Paul Singh 0 6 Washington
15 192 Brenda Diaz 0 71 Washington
16 166 Jack Richins 0 68 Washington
17 112 John Evans 0 91 Washington
18 203 Ken Myer 0 51 Alberta
19 245 Barbara Moreland 1 58 Washington
20 257 Eric Kurjan 0 54 Washington
21 244 Bryan Walton 0 62 Alberta
22 69 Steve Masters 0 31 Washington
23 210 Belinda Newman 0 83 Washington
24 163 Alex Nayberg 0 77 Washington
25 118 Don Hall 0 88 Washington
26 176 David Lawrence 0 29 Washington
27 147 Sandra Reátegui Alayo 0 37 Washington
28 138 Samantha Smith 0 7 Washington
29 95 Jim Scardelis 0 1 Washington
... ... ... ... ... ... ...
261 70 David Ortiz 0 33 Washington
262 219 Sean Chai 0 78 Washington
263 148 Jason Watters 0 38 Washington
264 140 Prasanna Samarawickrama 0 5 Washington
265 290 Ranjit Varkey Chudukatil 1 34 Gironde
266 5 Gail Erickson 1 5 Washington
267 252 Arvind Rao 0 60 Washington
268 266 Peter Connelly 0 69 Washington
269 233 Magnus Hedlund 0 87 Washington
270 83 Patrick Cook 0 61 Washington
271 251 Mikael Sandberg 0 59 Washington
272 241 David Liu 1 57 Washington
273 25 James Hamilton 1 64 Washington
274 108 Jinghao Liu 0 77 Washington
275 239 Mindy Martin 0 51 Washington
276 27 Jo Brown 0 80 Washington
277 186 Shane Kim 0 75 Washington
278 38 Kim Abercrombie 0 24 Washington
279 154 Raymond Sam 0 10 Washington
280 114 Mindaugas Krapauskas 0 97 Washington
281 123 Vamsi Kuppa 0 95 Washington
282 29 Mark McArthur 0 19 Washington
283 101 Houman Pournasseh 0 3 Washington
284 221 Chris Norred 0 75 Washington
285 18 John Wood 0 48 Washington
286 46 Eugene Kogan 0 36 Washington
287 202 Tawana Nusbaum 0 50 Washington
288 289 Jae Pak 1 37 England
289 288 Rachel Valdez 1 35 Hamburg
290 999 Chadwick Smith 0 55 Georgia

289 rows × 6 columns

4c. Show the same as above, but only for Production Technicians

SELECT DISTINCT JobTitle FROM dbo.Employees; --- look at job titles SELECT e.EmployeeID ,e.FirstName ,e.LastName ,e.MaritalStatus ,e.JobTitle ,e.VacationHours ,e.SalariedFlag ,e.StateProvinceName ,e.CountryName FROM dbo.Employees AS e WHERE e.JobTitle LIKE 'Production Technician%' AND (e.MaritalStatus = 'M' OR e.VacationHours > 35 OR e.StateProvinceName = 'Washington') ;

In [47]:
'abc'.startswith('ab')


Out[47]:
True

In [49]:
Employees.loc[ Employees.JobTitle.map(lambda x: x.startswith('Production Technician'))
    & (
        (Employees.SalariedFlag==1) | (Employees.VacationHours>35) | (Employees.StateProvinceName == 'Washington')
    ), ['EmployeeID', 'FirstName', 'LastName', 'SalariedFlag', 'VacationHours', 'StateProvinceName', 'JobTitle']
]


Out[49]:
EmployeeID FirstName LastName SalariedFlag VacationHours StateProvinceName JobTitle
2 204 Gabe Mares 0 57 Washington Production Technician - WC40
5 66 Karan Khanna 0 28 Washington Production Technician - WC60
8 161 Kirk Koenigsbauer 0 74 Washington Production Technician - WC45
12 175 Reed Koch 0 23 Washington Production Technician - WC30
13 155 Fadi Fakhouri 0 16 Washington Production Technician - WC20
14 144 Paul Singh 0 6 Washington Production Technician - WC20
17 112 John Evans 0 91 Washington Production Technician - WC50
18 203 Ken Myer 0 51 Alberta Production Technician - WC40
22 69 Steve Masters 0 31 Washington Production Technician - WC60
23 210 Belinda Newman 0 83 Washington Production Technician - WC45
24 163 Alex Nayberg 0 77 Washington Production Technician - WC45
25 118 Don Hall 0 88 Washington Production Technician - WC50
26 176 David Lawrence 0 29 Washington Production Technician - WC30
27 147 Sandra Reátegui Alayo 0 37 Washington Production Technician - WC30
28 138 Samantha Smith 0 7 Washington Production Technician - WC20
29 95 Jim Scardelis 0 1 Washington Production Technician - WC50
31 86 Ryan Cornelsen 0 59 Washington Production Technician - WC40
34 72 Steven Selikoff 0 41 Washington Production Technician - WC30
35 162 Laura Steele 0 75 Washington Production Technician - WC45
36 107 Christopher Hill 0 91 Washington Production Technician - WC10
38 187 Yvonne McKay 0 79 Washington Production Technician - WC45
39 75 Michiko Osada 0 44 Washington Production Technician - WC30
41 116 Michael Patten 0 98 Washington Production Technician - WC50
42 30 Britta Simon 0 14 Washington Production Technician - WC60
43 198 Lorraine Nay 0 49 Washington Production Technician - WC40
44 133 Michael Rothkugel 0 70 Washington Production Technician - WC40
45 31 Margie Shoop 0 18 Washington Production Technician - WC60
46 168 Garrett Young 0 34 Washington Production Technician - WC30
49 199 Paula Nartker 0 54 Washington Production Technician - WC40
50 48 Ruth Ellerbrock 0 83 Washington Production Technician - WC10
... ... ... ... ... ... ... ...
234 100 Lolan Song 0 7 Washington Production Technician - WC50
236 85 Brian Goldstein 0 63 Washington Production Technician - WC40
237 61 Diane Tibbott 0 8 Washington Production Technician - WC50
238 106 John Kane 0 92 Washington Production Technician - WC10
239 200 Frank Lee 0 56 Washington Production Technician - WC40
241 146 Jian Shuo Wang 0 36 Washington Production Technician - WC30
242 103 Ebru Ersan 0 93 Washington Production Technician - WC10
243 73 Carole Poland 0 42 Washington Production Technician - WC30
244 185 Stefen Hesse 0 19 Washington Production Technician - WC20
245 37 Chris Okelberry 0 16 Washington Production Technician - WC60
247 179 Jan Miksovsky 0 30 Washington Production Technician - WC30
248 91 Kimberly Zimmerman 0 95 Washington Production Technician - WC10
250 207 Greg Alderson 0 85 Washington Production Technician - WC45
251 82 Jack Creasey 0 62 Washington Production Technician - WC40
253 67 Jay Adams 0 32 Washington Production Technician - WC60
255 53 Diane Glimp 0 87 Washington Production Technician - WC10
257 169 Susan Metters 0 31 Washington Production Technician - WC30
258 209 Kathie Flood 0 87 Washington Production Technician - WC45
259 50 Sidney Higa 0 84 Washington Production Technician - WC10
261 70 David Ortiz 0 33 Washington Production Technician - WC60
263 148 Jason Watters 0 38 Washington Production Technician - WC30
264 140 Prasanna Samarawickrama 0 5 Washington Production Technician - WC20
270 83 Patrick Cook 0 61 Washington Production Technician - WC40
278 38 Kim Abercrombie 0 24 Washington Production Technician - WC60
279 154 Raymond Sam 0 10 Washington Production Technician - WC20
280 114 Mindaugas Krapauskas 0 97 Washington Production Technician - WC50
282 29 Mark McArthur 0 19 Washington Production Technician - WC60
283 101 Houman Pournasseh 0 3 Washington Production Technician - WC50
286 46 Eugene Kogan 0 36 Washington Production Technician - WC60
287 202 Tawana Nusbaum 0 50 Washington Production Technician - WC40

155 rows × 7 columns

5a. List all employees living in Redmond, Seattle, and Bellevue, showing EmployeeID, FirstName, LastName, and City. Sort the list alphabetically by city.

SELECT e.EmployeeID, e.FirstName, e.LastName, e.City FROM dbo.Employees AS e WHERE e.City in ('Redmond', 'Seattle', 'Bellevue') ORDER BY e.City ;

In [53]:
Employees.loc[Employees.City.isin(['Redmond', 'Seattle', 'Bellevue']),
              ['EmployeeID', 'FirstName', 'LastName', 'City']].sort_values('City')


Out[53]:
EmployeeID FirstName LastName City
203 142 Olinda Turner Bellevue
78 89 Patrick Wedge Bellevue
83 32 Rebecca Laszlo Bellevue
249 21 Terry Eminhizer Bellevue
88 113 Linda Moschell Bellevue
89 196 Shammi Mohamed Bellevue
160 34 Suchitra Mohan Bellevue
96 99 Nuan Yu Bellevue
250 207 Greg Alderson Bellevue
101 26 Peter Krebs Bellevue
104 151 Rostislav Shabalin Bellevue
205 128 Paul Komosinski Bellevue
111 177 Russell King Bellevue
223 9 Gigi Matthew Bellevue
121 235 Paula Barreto de Mattos Bellevue
227 136 Sylvester Valdez Bellevue
131 149 Andy Ruth Bellevue
102 93 Kok-Ho Loh Bellevue
73 258 Erin Hagens Bellevue
167 271 Dan Wilson Bellevue
263 148 Jason Watters Bellevue
14 144 Paul Singh Bellevue
208 141 Min Su Bellevue
266 5 Gail Erickson Bellevue
187 224 William Vong Bellevue
172 105 Kevin Homer Bellevue
179 206 Stuart Munson Bellevue
195 159 Terrence Earls Bellevue
28 138 Samantha Smith Bellevue
175 19 Mary Dempsey Bellevue
... ... ... ... ...
3 78 Reuben D'sa Seattle
5 66 Karan Khanna Seattle
8 161 Kirk Koenigsbauer Seattle
15 192 Brenda Diaz Seattle
17 112 John Evans Seattle
19 245 Barbara Moreland Seattle
22 69 Steve Masters Seattle
27 147 Sandra Reátegui Alayo Seattle
49 199 Paula Nartker Seattle
56 104 Mary Baker Seattle
69 62 John Campbell Seattle
75 156 Lane Sacksteder Seattle
85 132 Nicole Holliday Seattle
202 87 Cristian Petculescu Seattle
119 131 Baris Cetinok Seattle
134 180 Katie McAskill-White Seattle
136 152 Yuhong Li Seattle
137 171 David Yalovsky Seattle
142 36 Jose Lugo Seattle
148 60 Pete Male Seattle
149 120 Kitti Lertpiriyasuwat Seattle
151 121 Pilar Ackerman Seattle
156 77 Merav Netz Seattle
180 71 Michael Ray Seattle
182 102 Zheng Mu Seattle
185 165 Chris Preston Seattle
188 190 Robert Rounthwaite Seattle
198 40 JoLynn Dobney Seattle
127 39 Ed Dudenhoefer Seattle
287 202 Tawana Nusbaum Seattle

101 rows × 4 columns

5b. For the list above, make sure these are only in the state of Washington, just to be careful.

SELECT e.EmployeeID, e.FirstName, e.LastName, e.City, e.StateProvinceName FROM dbo.Employees AS e WHERE e.City in ('Redmond', 'Seattle', 'Bellevue') AND e.StateProvinceName = 'Washington' ORDER BY e.City ;

In [54]:
Employees.loc[Employees.City.isin(['Redmond', 'Seattle', 'Bellevue'])
              & (Employees.StateProvinceName=='Washington'),
              ['EmployeeID', 'FirstName', 'LastName', 'City']].sort_values('City')


Out[54]:
EmployeeID FirstName LastName City
203 142 Olinda Turner Bellevue
78 89 Patrick Wedge Bellevue
83 32 Rebecca Laszlo Bellevue
249 21 Terry Eminhizer Bellevue
88 113 Linda Moschell Bellevue
89 196 Shammi Mohamed Bellevue
160 34 Suchitra Mohan Bellevue
96 99 Nuan Yu Bellevue
250 207 Greg Alderson Bellevue
101 26 Peter Krebs Bellevue
104 151 Rostislav Shabalin Bellevue
205 128 Paul Komosinski Bellevue
111 177 Russell King Bellevue
223 9 Gigi Matthew Bellevue
121 235 Paula Barreto de Mattos Bellevue
227 136 Sylvester Valdez Bellevue
131 149 Andy Ruth Bellevue
102 93 Kok-Ho Loh Bellevue
73 258 Erin Hagens Bellevue
167 271 Dan Wilson Bellevue
263 148 Jason Watters Bellevue
14 144 Paul Singh Bellevue
208 141 Min Su Bellevue
266 5 Gail Erickson Bellevue
187 224 William Vong Bellevue
172 105 Kevin Homer Bellevue
179 206 Stuart Munson Bellevue
195 159 Terrence Earls Bellevue
28 138 Samantha Smith Bellevue
175 19 Mary Dempsey Bellevue
... ... ... ... ...
3 78 Reuben D'sa Seattle
5 66 Karan Khanna Seattle
8 161 Kirk Koenigsbauer Seattle
15 192 Brenda Diaz Seattle
17 112 John Evans Seattle
19 245 Barbara Moreland Seattle
22 69 Steve Masters Seattle
27 147 Sandra Reátegui Alayo Seattle
49 199 Paula Nartker Seattle
56 104 Mary Baker Seattle
69 62 John Campbell Seattle
75 156 Lane Sacksteder Seattle
85 132 Nicole Holliday Seattle
202 87 Cristian Petculescu Seattle
119 131 Baris Cetinok Seattle
134 180 Katie McAskill-White Seattle
136 152 Yuhong Li Seattle
137 171 David Yalovsky Seattle
142 36 Jose Lugo Seattle
148 60 Pete Male Seattle
149 120 Kitti Lertpiriyasuwat Seattle
151 121 Pilar Ackerman Seattle
156 77 Merav Netz Seattle
180 71 Michael Ray Seattle
182 102 Zheng Mu Seattle
185 165 Chris Preston Seattle
188 190 Robert Rounthwaite Seattle
198 40 JoLynn Dobney Seattle
127 39 Ed Dudenhoefer Seattle
287 202 Tawana Nusbaum Seattle

101 rows × 4 columns

6. Provide a list of employees who have no title, whether it's a NULL or empty string.

SELECT e.EmployeeID, e.Title, e.FirstName, e.LastName FROM dbo.Employees AS e WHERE e.Title IS NULL OR e.Title = '' ;

In [93]:
### Note: when exporting the SQL table to Excel the one row with a ' ' in the Title was already removed...Sorry

Employees.loc[pd.isnull(Employees.Title) | Employees.Title.map(lambda x: len(str(x).strip())==0)
              ,['EmployeeID', 'Title', 'FirstName', 'LastName', 'City']]


Out[93]:
EmployeeID Title FirstName LastName City
0 259 NaN Ben Miller Redmond
1 278 NaN Garrett Vargas Calgary
2 204 NaN Gabe Mares Edmonds
3 78 NaN Reuben D'sa Seattle
4 255 NaN Gordon Hee Bellevue
5 66 NaN Karan Khanna Seattle
6 270 NaN François Ajenstat Issaquah
7 22 NaN Sariya Harnpadoungsataya Everett
8 161 NaN Kirk Koenigsbauer Seattle
9 124 NaN Kim Ralls Bothell
10 10 NaN Michael Raheem San Francisco
11 248 NaN Mike Seamans Index
12 175 NaN Reed Koch Redmond
13 155 NaN Fadi Fakhouri Snohomish
14 144 NaN Paul Singh Bellevue
15 192 NaN Brenda Diaz Seattle
16 166 NaN Jack Richins Monroe
17 112 NaN John Evans Seattle
18 203 NaN Ken Myer Calgary
19 245 NaN Barbara Moreland Seattle
20 257 NaN Eric Kurjan Renton
21 244 NaN Bryan Walton Calgary
22 69 NaN Steve Masters Seattle
23 210 NaN Belinda Newman Bothell
24 163 NaN Alex Nayberg Newport Hills
25 118 NaN Don Hall Carnation
26 176 NaN David Lawrence Monroe
27 147 NaN Sandra Reátegui Alayo Seattle
28 138 NaN Samantha Smith Bellevue
29 95 NaN Jim Scardelis Edmonds
... ... ... ... ... ...
260 222 NaN A. Scott Wright Newport Hills
261 70 NaN David Ortiz Seattle
262 219 NaN Sean Chai Issaquah
263 148 NaN Jason Watters Bellevue
264 140 NaN Prasanna Samarawickrama Redmond
265 290 NaN Ranjit Varkey Chudukatil Bordeaux
267 252 NaN Arvind Rao Renton
268 266 NaN Peter Connelly Issaquah
269 233 NaN Magnus Hedlund Renton
270 83 NaN Patrick Cook Seattle
271 251 NaN Mikael Sandberg Bothell
272 241 NaN David Liu Gold Bar
273 25 NaN James Hamilton Monroe
274 108 NaN Jinghao Liu Carnation
275 239 NaN Mindy Martin Newport Hills
276 27 NaN Jo Brown Duvall
277 186 NaN Shane Kim Bellevue
278 38 NaN Kim Abercrombie Carnation
279 154 NaN Raymond Sam Edmonds
280 114 NaN Mindaugas Krapauskas Edmonds
281 123 NaN Vamsi Kuppa Bothell
282 29 NaN Mark McArthur Redmond
283 101 NaN Houman Pournasseh Redmond
284 221 NaN Chris Norred Issaquah
285 18 NaN John Wood Redmond
286 46 NaN Eugene Kogan Duvall
287 202 NaN Tawana Nusbaum Seattle
288 289 NaN Jae Pak Cambridge
289 288 NaN Rachel Valdez Berlin
290 999 NaN Chadwick Smith Atlanta

283 rows × 5 columns

7a. Provide a list of employees who have at least 60 vacation hours left.

SELECT e.EmployeeID, e.Title, e.FirstName, e.LastName, e.VacationHours, e.SickLeaveHours FROM dbo.Employees AS e WHERE e.VacationHours >= 60 ;

In [111]:
Employees.loc[Employees.VacationHours >= 60
              , ['EmployeeID', 'FirstName', 'LastName', 'VacationHours', 'SickLeaveHours']]


Out[111]:
EmployeeID FirstName LastName VacationHours SickLeaveHours
3 78 Reuben D'sa 72 56
6 270 François Ajenstat 67 53
8 161 Kirk Koenigsbauer 74 57
9 124 Kim Ralls 98 69
15 192 Brenda Diaz 71 55
16 166 Jack Richins 68 54
17 112 John Evans 91 65
21 244 Bryan Walton 62 51
23 210 Belinda Newman 83 61
24 163 Alex Nayberg 77 58
25 118 Don Hall 88 64
30 231 Jo Berry 91 65
33 214 Andreas Berglund 84 62
35 162 Laura Steele 75 57
36 107 Christopher Hill 91 65
37 269 Dan Bacon 72 56
38 187 Yvonne McKay 79 59
41 116 Michael Patten 98 69
44 133 Michael Rothkugel 70 55
50 48 Ruth Ellerbrock 83 61
51 126 Jimmy Bischoff 96 68
52 230 Stuart Macrae 88 64
56 104 Mary Baker 94 67
57 263 Jean Trenary 65 52
58 232 Pat Coleman 89 64
60 81 Mihail Frintu 64 52
62 119 Michael Entin 94 67
65 212 Peng Wu 81 60
66 227 Gary Altman 86 63
69 62 John Campbell 81 60
... ... ... ... ... ...
205 128 Paul Komosinski 68 54
209 208 Scott Gode 86 63
210 264 Stephanie Conroy 68 54
217 242 Deborah Poe 60 50
221 191 Lionel Penuchot 78 59
223 9 Gigi Matthew 63 51
229 216 Sean Alexander 82 61
232 228 Christian Kleinerman 92 66
236 85 Brian Goldstein 63 51
238 106 John Kane 92 66
242 103 Ebru Ersan 93 66
246 215 Mark Harrington 83 61
248 91 Kimberly Zimmerman 95 67
250 207 Greg Alderson 85 62
251 82 Jack Creasey 62 51
255 53 Diane Glimp 87 63
258 209 Kathie Flood 87 63
259 50 Sidney Higa 84 62
262 219 Sean Chai 78 59
267 252 Arvind Rao 60 50
268 266 Peter Connelly 69 54
269 233 Magnus Hedlund 87 63
270 83 Patrick Cook 61 50
273 25 James Hamilton 64 52
274 108 Jinghao Liu 77 58
276 27 Jo Brown 80 60
277 186 Shane Kim 75 57
280 114 Mindaugas Krapauskas 97 68
281 123 Vamsi Kuppa 95 67
284 221 Chris Norred 75 57

120 rows × 5 columns

7b. Provide a list of employees who have less than 60 vacation hours left.

SELECT e.EmployeeID, e.Title, e.FirstName, e.LastName, e.VacationHours, e.SickLeaveHours FROM dbo.Employees AS e WHERE e.VacationHours < 60 ;

In [106]:
Employees.loc[Employees.VacationHours < 60
              , ['EmployeeID', 'FirstName', 'LastName', 'VacationHours', 'SickLeaveHours']]


Out[106]:
EmployeeID Title FirstName LastName VacationHours SickLeaveHours
0 259 NaN Ben Miller 55 47
1 278 NaN Garrett Vargas 33 36
2 204 NaN Gabe Mares 57 48
4 255 NaN Gordon Hee 52 46
5 66 NaN Karan Khanna 28 34
7 22 NaN Sariya Harnpadoungsataya 45 42
10 10 NaN Michael Raheem 16 64
11 248 NaN Mike Seamans 59 49
12 175 NaN Reed Koch 23 31
13 155 NaN Fadi Fakhouri 16 28
14 144 NaN Paul Singh 6 23
18 203 NaN Ken Myer 51 45
19 245 NaN Barbara Moreland 58 49
20 257 NaN Eric Kurjan 54 47
22 69 NaN Steve Masters 31 35
26 176 NaN David Lawrence 29 34
27 147 NaN Sandra Reátegui Alayo 37 38
28 138 NaN Samantha Smith 7 23
29 95 NaN Jim Scardelis 1 20
31 86 NaN Ryan Cornelsen 59 49
32 181 NaN Michael Hines 20 30
34 72 NaN Steven Selikoff 41 40
39 75 NaN Michiko Osada 44 42
40 12 NaN Thierry D'Hers 9 24
42 30 NaN Britta Simon 14 27
43 198 NaN Lorraine Nay 49 44
45 31 NaN Margie Shoop 18 29
46 168 NaN Garrett Young 34 37
47 236 NaN Grant Culbertson 53 46
48 3 NaN Roberto Tamburello 2 21
... ... ... ... ... ... ...
241 146 NaN Jian Shuo Wang 36 38
243 73 NaN Carole Poland 42 41
244 185 NaN Stefen Hesse 19 29
245 37 NaN Chris Okelberry 16 28
247 179 NaN Jan Miksovsky 30 35
249 21 NaN Terry Eminhizer 44 42
252 254 NaN Fukiko Ogisu 57 48
253 67 NaN Jay Adams 32 36
254 262 NaN David Barber 56 48
256 282 NaN José Saraiva 31 35
257 169 NaN Susan Metters 31 35
260 222 NaN A. Scott Wright 44 42
261 70 NaN David Ortiz 33 36
263 148 NaN Jason Watters 38 39
264 140 NaN Prasanna Samarawickrama 5 22
265 290 NaN Ranjit Varkey Chudukatil 34 37
266 5 Ms. Gail Erickson 5 22
271 251 NaN Mikael Sandberg 59 49
272 241 NaN David Liu 57 48
275 239 NaN Mindy Martin 51 45
278 38 NaN Kim Abercrombie 24 32
279 154 NaN Raymond Sam 10 25
282 29 NaN Mark McArthur 19 29
283 101 NaN Houman Pournasseh 3 21
285 18 NaN John Wood 48 44
286 46 NaN Eugene Kogan 36 38
287 202 NaN Tawana Nusbaum 50 45
288 289 NaN Jae Pak 37 38
289 288 NaN Rachel Valdez 35 37
290 999 NaN Chadwick Smith 55 47

171 rows × 6 columns

7c. Show me employees who have more than 20 and less than 60 vacation hours left.

SELECT e.EmployeeID, e.Title, e.FirstName, e.LastName, e.VacationHours, e.SickLeaveHours FROM dbo.Employees AS e WHERE e.VacationHours BETWEEN 21 AND 59 ;

In [112]:
Employees.loc[(Employees.VacationHours > 20) & (Employees.VacationHours < 60)
              , ['EmployeeID', 'FirstName', 'LastName', 'VacationHours', 'SickLeaveHours']]


Out[112]:
EmployeeID FirstName LastName VacationHours SickLeaveHours
0 259 Ben Miller 55 47
1 278 Garrett Vargas 33 36
2 204 Gabe Mares 57 48
4 255 Gordon Hee 52 46
5 66 Karan Khanna 28 34
7 22 Sariya Harnpadoungsataya 45 42
11 248 Mike Seamans 59 49
12 175 Reed Koch 23 31
18 203 Ken Myer 51 45
19 245 Barbara Moreland 58 49
20 257 Eric Kurjan 54 47
22 69 Steve Masters 31 35
26 176 David Lawrence 29 34
27 147 Sandra Reátegui Alayo 37 38
31 86 Ryan Cornelsen 59 49
34 72 Steven Selikoff 41 40
39 75 Michiko Osada 44 42
43 198 Lorraine Nay 49 44
46 168 Garrett Young 34 37
47 236 Grant Culbertson 53 46
49 199 Paula Nartker 54 47
53 41 Bryan Baker 35 37
54 283 David Campbell 23 31
55 226 Brian LaMee 48 44
61 172 Marc Ingle 22 31
63 276 Linda Mitchell 27 33
64 225 Alan Brewer 47 43
68 194 Fred Northup 47 43
72 150 Michael Vanderhyde 35 37
73 258 Erin Hagens 53 46
... ... ... ... ... ...
228 173 Eugene Zabokritski 33 36
230 277 Jillian Carson 24 32
231 35 Brandon Heidepriem 22 31
233 223 Sairaj Uddin 46 43
235 275 Michael Blythe 38 39
239 200 Frank Lee 56 48
240 279 Tsvi Reiter 29 34
241 146 Jian Shuo Wang 36 38
243 73 Carole Poland 42 41
247 179 Jan Miksovsky 30 35
249 21 Terry Eminhizer 44 42
252 254 Fukiko Ogisu 57 48
253 67 Jay Adams 32 36
254 262 David Barber 56 48
256 282 José Saraiva 31 35
257 169 Susan Metters 31 35
260 222 A. Scott Wright 44 42
261 70 David Ortiz 33 36
263 148 Jason Watters 38 39
265 290 Ranjit Varkey Chudukatil 34 37
271 251 Mikael Sandberg 59 49
272 241 David Liu 57 48
275 239 Mindy Martin 51 45
278 38 Kim Abercrombie 24 32
285 18 John Wood 48 44
286 46 Eugene Kogan 36 38
287 202 Tawana Nusbaum 50 45
288 289 Jae Pak 37 38
289 288 Rachel Valdez 35 37
290 999 Chadwick Smith 55 47

115 rows × 5 columns

7d. If you did not use BETWEEN for 7c, do the same but use BETWEEN. If you did use BETWEEN for 7c, do it another way.

SELECT e.EmployeeID, e.Title, e.FirstName, e.LastName, e.VacationHours, e.SickLeaveHours FROM dbo.Employees AS e WHERE e.VacationHours > 20 AND e.VacationHours < 60 ;

In [114]:
Employees.loc[Employees.VacationHours.between(20, 60, inclusive=False)
              , ['EmployeeID', 'FirstName', 'LastName', 'VacationHours', 'SickLeaveHours']]


Out[114]:
EmployeeID FirstName LastName VacationHours SickLeaveHours
0 259 Ben Miller 55 47
1 278 Garrett Vargas 33 36
2 204 Gabe Mares 57 48
4 255 Gordon Hee 52 46
5 66 Karan Khanna 28 34
7 22 Sariya Harnpadoungsataya 45 42
11 248 Mike Seamans 59 49
12 175 Reed Koch 23 31
18 203 Ken Myer 51 45
19 245 Barbara Moreland 58 49
20 257 Eric Kurjan 54 47
22 69 Steve Masters 31 35
26 176 David Lawrence 29 34
27 147 Sandra Reátegui Alayo 37 38
31 86 Ryan Cornelsen 59 49
34 72 Steven Selikoff 41 40
39 75 Michiko Osada 44 42
43 198 Lorraine Nay 49 44
46 168 Garrett Young 34 37
47 236 Grant Culbertson 53 46
49 199 Paula Nartker 54 47
53 41 Bryan Baker 35 37
54 283 David Campbell 23 31
55 226 Brian LaMee 48 44
61 172 Marc Ingle 22 31
63 276 Linda Mitchell 27 33
64 225 Alan Brewer 47 43
68 194 Fred Northup 47 43
72 150 Michael Vanderhyde 35 37
73 258 Erin Hagens 53 46
... ... ... ... ... ...
228 173 Eugene Zabokritski 33 36
230 277 Jillian Carson 24 32
231 35 Brandon Heidepriem 22 31
233 223 Sairaj Uddin 46 43
235 275 Michael Blythe 38 39
239 200 Frank Lee 56 48
240 279 Tsvi Reiter 29 34
241 146 Jian Shuo Wang 36 38
243 73 Carole Poland 42 41
247 179 Jan Miksovsky 30 35
249 21 Terry Eminhizer 44 42
252 254 Fukiko Ogisu 57 48
253 67 Jay Adams 32 36
254 262 David Barber 56 48
256 282 José Saraiva 31 35
257 169 Susan Metters 31 35
260 222 A. Scott Wright 44 42
261 70 David Ortiz 33 36
263 148 Jason Watters 38 39
265 290 Ranjit Varkey Chudukatil 34 37
271 251 Mikael Sandberg 59 49
272 241 David Liu 57 48
275 239 Mindy Martin 51 45
278 38 Kim Abercrombie 24 32
285 18 John Wood 48 44
286 46 Eugene Kogan 36 38
287 202 Tawana Nusbaum 50 45
288 289 Jae Pak 37 38
289 288 Rachel Valdez 35 37
290 999 Chadwick Smith 55 47

115 rows × 5 columns

Grouping

1a. What is the earliest birthdate for all employees?

SELECT MIN(e.BirthDate) FROM dbo.Employees AS e;

In [115]:
Employees.BirthDate.min()


Out[115]:
'1945-11-17'

1b. Add to the above, the most recent birthdate for all employees

SELECT MIN(e.BirthDate) AS 'Earliest Birthday', MAX(e.BirthDate) AS 'Most Reecent Birthday' FROM dbo.Employees AS e;

In [117]:
(Employees.BirthDate.min(), Employees.BirthDate.max())


Out[117]:
('1945-11-17', '1985-07-01')

1c. Show the above results broken down by gender

SELECT e.Gender, MIN(e.BirthDate) AS 'Earliest Birthday', MAX(e.BirthDate) AS 'Most Reecent Birthday' FROM dbo.Employees AS e GROUP BY e.Gender ;

In [131]:
Employees.groupby(by=['Gender']).BirthDate.aggregate([min, max])


Out[131]:
min max
Gender
F 1946-10-29 1985-07-01
M 1945-11-17 1985-05-07

1d. Show the above results broken down by gender, and salaried/hourly

SELECT e.Gender, e.SalariedFlag, MIN(e.BirthDate) AS 'Earliest Birthday', MAX(e.BirthDate) AS 'Most Reecent Birthday' FROM dbo.Employees AS e GROUP BY e.Gender, e.SalariedFlag ;

In [134]:
Employees.groupby(by=['Gender', 'SalariedFlag']).BirthDate.aggregate([min, max])


Out[134]:
min max
Gender SalariedFlag
F 0 1948-05-25 1985-07-01
1 1946-10-29 1980-07-06
M 0 1946-04-03 1985-05-07
1 1945-11-17 1982-04-14

2a. What are the average vacation hours for all employees?

SELECT AVG(e.VacationHours) FROM dbo.Employees AS e ;

In [135]:
Employees.VacationHours.mean()


Out[135]:
50.628865979381445

2b. Add to the above, the minimum vacation hours for all employees

SELECT AVG(e.VacationHours) AS 'Average Vacation', MIN(e.VacationHours) AS 'Minumum Vacation' FROM dbo.Employees AS e ;

In [136]:
(Employees.VacationHours.mean(), Employees.VacationHours.min())


Out[136]:
(50.628865979381445, 0)

2c. Show the above results broken down and ordered by job title

SELECT e.JobTitle, AVG(e.VacationHours) AS 'Average Vacation', MIN(e.VacationHours) AS 'Minimum Vacation' FROM dbo.Employees AS e GROUP BY e.JobTitle ;

In [157]:
Employees.groupby(by=['JobTitle']).VacationHours.aggregate([np.mean, np.min])


Out[157]:
mean amin
JobTitle
Accountant 58.500000 58
Accounts Manager 57.000000 57
Accounts Payable Specialist 63.500000 63
Accounts Receivable Specialist 61.000000 60
Application Specialist 72.500000 71
Assistant to the Chief Financial Officer 56.000000 56
BI Professor 55.000000 55
Benefits Specialist 51.000000 51
Buyer 56.000000 52
Chief Executive Officer 99.000000 99
Chief Financial Officer 0.000000 0
Control Specialist 75.500000 75
Database Administrator 66.500000 66
Design Engineer 5.000000 4
Document Control Assistant 78.500000 78
Document Control Manager 77.000000 77
Engineering Manager 2.000000 2
European Sales Manager 21.000000 21
Facilities Administrative Assistant 87.000000 87
Facilities Manager 86.000000 86
Finance Manager 55.000000 55
Human Resources Administrative Assistant 52.500000 52
Human Resources Manager 54.000000 54
Information Services Manager 65.000000 65
Janitor 89.500000 88
Maintenance Supervisor 92.000000 92
Marketing Assistant 42.000000 41
Marketing Manager 40.000000 40
Marketing Specialist 46.000000 44
Master Scheduler 44.000000 44
... ... ...
Production Supervisor - WC40 72.000000 71
Production Supervisor - WC45 75.000000 74
Production Supervisor - WC50 78.000000 77
Production Supervisor - WC60 81.000000 80
Production Technician - WC10 91.000000 83
Production Technician - WC20 10.500000 0
Production Technician - WC30 34.000000 22
Production Technician - WC40 59.500000 47
Production Technician - WC45 80.000000 73
Production Technician - WC50 46.653846 0
Production Technician - WC60 26.500000 14
Purchasing Assistant 50.500000 50
Purchasing Manager 49.000000 49
Quality Assurance Manager 80.000000 80
Quality Assurance Supervisor 81.000000 81
Quality Assurance Technician 83.500000 82
Recruiter 49.500000 49
Research and Development Engineer 62.500000 62
Research and Development Manager 38.500000 16
Sales Representative 31.000000 22
Scheduling Assistant 46.500000 45
Senior Design Engineer 3.000000 3
Senior Tool Designer 27.500000 7
Shipping and Receiving Clerk 94.500000 94
Shipping and Receiving Supervisor 93.000000 93
Stocker 97.000000 96
Tool Designer 8.500000 8
Vice President of Engineering 1.000000 1
Vice President of Production 64.000000 64
Vice President of Sales 10.000000 10

68 rows × 2 columns

2d. Show the above results broken down by job title, and married/single employees

SELECT e.JobTitle, e.MaritalStatus, AVG(e.VacationHours) AS 'Average Vacation', MIN(e.VacationHours) AS 'Minimum Vacation' FROM dbo.Employees AS e GROUP BY e.JobTitle, e.MaritalStatus ;

In [158]:
Employees.groupby(by=['JobTitle', 'MaritalStatus']).VacationHours.aggregate([np.mean, np.min])


Out[158]:
mean amin
JobTitle MaritalStatus
Accountant M 58.000000 58
S 59.000000 59
Accounts Manager M 57.000000 57
Accounts Payable Specialist M 63.500000 63
Accounts Receivable Specialist M 60.000000 60
S 61.500000 61
Application Specialist M 71.500000 71
S 73.500000 73
Assistant to the Chief Financial Officer S 56.000000 56
BI Professor M 55.000000 55
Benefits Specialist M 51.000000 51
Buyer M 56.333333 52
S 55.333333 53
Chief Executive Officer S 99.000000 99
Chief Financial Officer M 0.000000 0
Control Specialist M 75.000000 75
S 76.000000 76
Database Administrator M 66.000000 66
S 67.000000 67
Design Engineer M 5.000000 4
Document Control Assistant M 79.000000 79
S 78.000000 78
Document Control Manager M 77.000000 77
Engineering Manager M 2.000000 2
European Sales Manager M 21.000000 21
Facilities Administrative Assistant M 87.000000 87
Facilities Manager M 86.000000 86
Finance Manager S 55.000000 55
Human Resources Administrative Assistant S 52.500000 52
Human Resources Manager M 54.000000 54
... ... ... ...
Production Technician - WC40 S 61.857143 47
Production Technician - WC45 M 80.833333 76
S 79.444444 73
Production Technician - WC50 M 40.076923 1
S 53.230769 0
Production Technician - WC60 M 24.714286 14
S 28.583333 16
Purchasing Assistant M 50.500000 50
Purchasing Manager S 49.000000 49
Quality Assurance Manager S 80.000000 80
Quality Assurance Supervisor M 81.000000 81
Quality Assurance Technician M 84.500000 84
S 82.500000 82
Recruiter S 49.500000 49
Research and Development Engineer M 63.000000 63
S 62.000000 62
Research and Development Manager M 38.500000 16
Sales Representative M 31.714286 26
S 30.285714 22
Scheduling Assistant M 46.500000 45
Senior Design Engineer S 3.000000 3
Senior Tool Designer S 27.500000 7
Shipping and Receiving Clerk M 94.500000 94
Shipping and Receiving Supervisor S 93.000000 93
Stocker M 96.000000 96
S 97.500000 97
Tool Designer M 8.500000 8
Vice President of Engineering S 1.000000 1
Vice President of Production S 64.000000 64
Vice President of Sales S 10.000000 10

95 rows × 2 columns

2e. Add to the above, the maximum vacation hours per group

SELECT e.JobTitle, e.MaritalStatus , AVG(e.VacationHours) AS 'Average Vacation' , MIN(e.VacationHours) AS 'Minimum Vacation' , MAX(e.VacationHours) AS 'Maximum Vacation' FROM dbo.Employees AS e GROUP BY e.JobTitle, e.MaritalStatus ;

In [159]:
Employees.groupby(by=['JobTitle', 'MaritalStatus']).VacationHours.aggregate([np.mean, np.min, np.max])


Out[159]:
mean amin amax
JobTitle MaritalStatus
Accountant M 58.000000 58 58
S 59.000000 59 59
Accounts Manager M 57.000000 57 57
Accounts Payable Specialist M 63.500000 63 64
Accounts Receivable Specialist M 60.000000 60 60
S 61.500000 61 62
Application Specialist M 71.500000 71 72
S 73.500000 73 74
Assistant to the Chief Financial Officer S 56.000000 56 56
BI Professor M 55.000000 55 55
Benefits Specialist M 51.000000 51 51
Buyer M 56.333333 52 60
S 55.333333 53 59
Chief Executive Officer S 99.000000 99 99
Chief Financial Officer M 0.000000 0 0
Control Specialist M 75.000000 75 75
S 76.000000 76 76
Database Administrator M 66.000000 66 66
S 67.000000 67 67
Design Engineer M 5.000000 4 6
Document Control Assistant M 79.000000 79 79
S 78.000000 78 78
Document Control Manager M 77.000000 77 77
Engineering Manager M 2.000000 2 2
European Sales Manager M 21.000000 21 21
Facilities Administrative Assistant M 87.000000 87 87
Facilities Manager M 86.000000 86 86
Finance Manager S 55.000000 55 55
Human Resources Administrative Assistant S 52.500000 52 53
Human Resources Manager M 54.000000 54 54
... ... ... ... ...
Production Technician - WC40 S 61.857143 47 72
Production Technician - WC45 M 80.833333 76 87
S 79.444444 73 85
Production Technician - WC50 M 40.076923 1 99
S 53.230769 0 98
Production Technician - WC60 M 24.714286 14 39
S 28.583333 16 38
Purchasing Assistant M 50.500000 50 51
Purchasing Manager S 49.000000 49 49
Quality Assurance Manager S 80.000000 80 80
Quality Assurance Supervisor M 81.000000 81 81
Quality Assurance Technician M 84.500000 84 85
S 82.500000 82 83
Recruiter S 49.500000 49 50
Research and Development Engineer M 63.000000 63 63
S 62.000000 62 62
Research and Development Manager M 38.500000 16 61
Sales Representative M 31.714286 26 39
S 30.285714 22 38
Scheduling Assistant M 46.500000 45 48
Senior Design Engineer S 3.000000 3 3
Senior Tool Designer S 27.500000 7 48
Shipping and Receiving Clerk M 94.500000 94 95
Shipping and Receiving Supervisor S 93.000000 93 93
Stocker M 96.000000 96 96
S 97.500000 97 98
Tool Designer M 8.500000 8 9
Vice President of Engineering S 1.000000 1 1
Vice President of Production S 64.000000 64 64
Vice President of Sales S 10.000000 10 10

95 rows × 3 columns

2f. Show the above results broken down by job title, married/single employees, and State

SELECT e.JobTitle, e.MaritalStatus, e.StateProvinceName , AVG(e.VacationHours) AS 'Average Vacation' , MIN(e.VacationHours) AS 'Minimum Vacation' , MAX(e.VacationHours) AS 'Maximum Vacation' FROM dbo.Employees AS e GROUP BY e.JobTitle, e.MaritalStatus, e.StateProvinceName ;

In [160]:
Employees.groupby(by=['JobTitle', 'MaritalStatus', 'StateProvinceName']).VacationHours.aggregate([np.mean, np.min, np.max])


Out[160]:
mean amin amax
JobTitle MaritalStatus StateProvinceName
Accountant M Washington 58.000000 58 58
S Washington 59.000000 59 59
Accounts Manager M Washington 57.000000 57 57
Accounts Payable Specialist M Washington 63.500000 63 64
Accounts Receivable Specialist M Washington 60.000000 60 60
S Alberta 62.000000 62 62
Washington 61.000000 61 61
Application Specialist M Washington 71.500000 71 72
S Washington 73.500000 73 74
Assistant to the Chief Financial Officer S Washington 56.000000 56 56
BI Professor M Georgia 55.000000 55 55
Benefits Specialist M Washington 51.000000 51 51
Buyer M Washington 56.333333 52 60
S Washington 55.333333 53 59
Chief Executive Officer S Washington 99.000000 99 99
Chief Financial Officer M Washington 0.000000 0 0
Control Specialist M Washington 75.000000 75 75
S Washington 76.000000 76 76
Database Administrator M Washington 66.000000 66 66
S Washington 67.000000 67 67
Design Engineer M Washington 5.000000 4 6
Document Control Assistant M Washington 79.000000 79 79
S Washington 78.000000 78 78
Document Control Manager M Washington 77.000000 77 77
Engineering Manager M Washington 2.000000 2 2
European Sales Manager M Washington 21.000000 21 21
Facilities Administrative Assistant M Washington 87.000000 87 87
Facilities Manager M Washington 86.000000 86 86
Finance Manager S Washington 55.000000 55 55
Human Resources Administrative Assistant S Washington 52.500000 52 53
... ... ... ... ... ...
Research and Development Engineer M Washington 63.000000 63 63
S Washington 62.000000 62 62
Research and Development Manager M California 16.000000 16 16
Washington 61.000000 61 61
Sales Representative M Alberta 33.000000 33 33
California 26.000000 26 26
England 37.000000 37 37
Massachusetts 39.000000 39 39
Ontario 31.000000 31 31
Tennessee 29.000000 29 29
Utah 27.000000 27 27
S Gironde 34.000000 34 34
Hamburg 35.000000 35 35
Michigan 38.000000 38 38
Minnesota 24.000000 24 24
Oregon 22.000000 22 22
Victoria 36.000000 36 36
Washington 23.000000 23 23
Scheduling Assistant M Washington 46.500000 45 48
Senior Design Engineer S Washington 3.000000 3 3
Senior Tool Designer S Minnesota 48.000000 48 48
Washington 7.000000 7 7
Shipping and Receiving Clerk M Washington 94.500000 94 95
Shipping and Receiving Supervisor S Washington 93.000000 93 93
Stocker M Washington 96.000000 96 96
S Washington 97.500000 97 98
Tool Designer M Washington 8.500000 8 9
Vice President of Engineering S Washington 1.000000 1 1
Vice President of Production S Washington 64.000000 64 64
Vice President of Sales S Washington 10.000000 10 10

113 rows × 3 columns

2g. Show the above results but only for American employees

SELECT e.JobTitle, e.MaritalStatus, e.StateProvinceName , AVG(e.VacationHours) AS 'Average Vacation' , MIN(e.VacationHours) AS 'Minimum Vacation' , MAX(e.VacationHours) AS 'Maximum Vacation' FROM dbo.Employees AS e WHERE e.CountryName = 'United States' GROUP BY e.JobTitle, e.MaritalStatus, e.StateProvinceName ;

In [164]:
Employees[Employees.CountryName=='United States'] \
.groupby(by=['JobTitle', 'MaritalStatus', 'StateProvinceName']).VacationHours.aggregate([np.mean, np.min, np.max])


Out[164]:
mean amin amax
JobTitle MaritalStatus StateProvinceName
Accountant M Washington 58.000000 58 58
S Washington 59.000000 59 59
Accounts Manager M Washington 57.000000 57 57
Accounts Payable Specialist M Washington 63.500000 63 64
Accounts Receivable Specialist M Washington 60.000000 60 60
S Washington 61.000000 61 61
Application Specialist M Washington 71.500000 71 72
S Washington 73.500000 73 74
Assistant to the Chief Financial Officer S Washington 56.000000 56 56
BI Professor M Georgia 55.000000 55 55
Benefits Specialist M Washington 51.000000 51 51
Buyer M Washington 56.333333 52 60
S Washington 55.333333 53 59
Chief Executive Officer S Washington 99.000000 99 99
Chief Financial Officer M Washington 0.000000 0 0
Control Specialist M Washington 75.000000 75 75
S Washington 76.000000 76 76
Database Administrator M Washington 66.000000 66 66
S Washington 67.000000 67 67
Design Engineer M Washington 5.000000 4 6
Document Control Assistant M Washington 79.000000 79 79
S Washington 78.000000 78 78
Document Control Manager M Washington 77.000000 77 77
Engineering Manager M Washington 2.000000 2 2
European Sales Manager M Washington 21.000000 21 21
Facilities Administrative Assistant M Washington 87.000000 87 87
Facilities Manager M Washington 86.000000 86 86
Finance Manager S Washington 55.000000 55 55
Human Resources Administrative Assistant S Washington 52.500000 52 53
Human Resources Manager M Washington 54.000000 54 54
... ... ... ... ... ...
Purchasing Manager S Washington 49.000000 49 49
Quality Assurance Manager S Washington 80.000000 80 80
Quality Assurance Supervisor M Washington 81.000000 81 81
Quality Assurance Technician M Washington 84.500000 84 85
S Washington 82.500000 82 83
Recruiter S Washington 49.500000 49 50
Research and Development Engineer M Washington 63.000000 63 63
S Washington 62.000000 62 62
Research and Development Manager M California 16.000000 16 16
Washington 61.000000 61 61
Sales Representative M California 26.000000 26 26
Massachusetts 39.000000 39 39
Tennessee 29.000000 29 29
Utah 27.000000 27 27
S Michigan 38.000000 38 38
Minnesota 24.000000 24 24
Oregon 22.000000 22 22
Washington 23.000000 23 23
Scheduling Assistant M Washington 46.500000 45 48
Senior Design Engineer S Washington 3.000000 3 3
Senior Tool Designer S Minnesota 48.000000 48 48
Washington 7.000000 7 7
Shipping and Receiving Clerk M Washington 94.500000 94 95
Shipping and Receiving Supervisor S Washington 93.000000 93 93
Stocker M Washington 96.000000 96 96
S Washington 97.500000 97 98
Tool Designer M Washington 8.500000 8 9
Vice President of Engineering S Washington 1.000000 1 1
Vice President of Production S Washington 64.000000 64 64
Vice President of Sales S Washington 10.000000 10 10

103 rows × 3 columns

2h. Change the grouping above so it's broken down by married/single and State, no more job title

SELECT e.MaritalStatus, e.StateProvinceName , AVG(e.VacationHours) AS 'Average Vacation' , MIN(e.VacationHours) AS 'Minimum Vacation' , MAX(e.VacationHours) AS 'Maximum Vacation' FROM dbo.Employees AS e WHERE e.CountryName = 'United States' GROUP BY e.MaritalStatus, e.StateProvinceName ;

In [165]:
Employees[Employees.CountryName=='United States'] \
.groupby(by=['MaritalStatus', 'StateProvinceName']).VacationHours.aggregate([np.mean, np.min, np.max])


Out[165]:
mean amin amax
MaritalStatus StateProvinceName
M California 21.000000 16 26
Georgia 55.000000 55 55
Massachusetts 39.000000 39 39
Tennessee 29.000000 29 29
Utah 27.000000 27 27
Washington 49.426471 0 99
S Michigan 38.000000 38 38
Minnesota 36.000000 24 48
Oregon 22.000000 22 22
Washington 54.362963 0 99

2i. Limit the results above to States where the average vacation hours is greater than 30

SELECT e.MaritalStatus, e.StateProvinceName , AVG(e.VacationHours) AS 'Average Vacation' , MIN(e.VacationHours) AS 'Minimum Vacation' , MAX(e.VacationHours) AS 'Maximum Vacation' FROM dbo.Employees AS e WHERE e.CountryName = 'United States' GROUP BY e.MaritalStatus, e.StateProvinceName HAVING AVG(e.VacationHours) > 30 ;
SELECT e.StateProvinceName , AVG(e.VacationHours) AS 'Average Vacation' , MIN(e.VacationHours) AS 'Minimum Vacation' , MAX(e.VacationHours) AS 'Maximum Vacation' FROM dbo.Employees AS e WHERE e.CountryName = 'United States' GROUP BY e.StateProvinceName HAVING AVG(e.VacationHours) > 30 ;

In [180]:
Employees[Employees.CountryName=='United States'] \
.groupby(by=['StateProvinceName']).filter(lambda x: x.VacationHours.mean()>30) \
.groupby(by=['StateProvinceName']).VacationHours.aggregate([np.mean, np.min, np.max])


Out[180]:
mean amin amax
StateProvinceName
Georgia 55.000000 55 55
Massachusetts 39.000000 39 39
Michigan 38.000000 38 38
Minnesota 36.000000 24 48
Washington 51.885609 0 99

2j. Limit the results above to States where the average vacation hours is greater than 30 and the maximum vacation hours is less than 50

SELECT e.StateProvinceName , AVG(e.VacationHours) AS 'Average Vacation' , MIN(e.VacationHours) AS 'Minimum Vacation' , MAX(e.VacationHours) AS 'Maximum Vacation' FROM dbo.Employees AS e WHERE e.CountryName = 'United States' GROUP BY e.StateProvinceName HAVING AVG(e.VacationHours) >30 AND 49 MAX((e.VacationHours) < 50) ;

In [182]:
Employees[Employees.CountryName=='United States'] \
.groupby(by=['StateProvinceName']).filter(lambda x: (x.VacationHours.mean() >30) and (x.VacationHours.max()<50)) \
.groupby(by=['StateProvinceName']).VacationHours.aggregate([np.mean, np.min, np.max])


Out[182]:
mean amin amax
StateProvinceName
Massachusetts 39 39 39
Michigan 38 38 38
Minnesota 36 24 48

2k. Show the same results but only for non-American employees

SELECT e.StateProvinceName , AVG(e.VacationHours) AS 'Average Vacation' , MIN(e.VacationHours) AS 'Minimum Vacation' , MAX(e.VacationHours) AS 'Maximum Vacation' FROM dbo.Employees AS e WHERE e.CountryName != 'United States' GROUP BY e.StateProvinceName HAVING AVG(e.VacationHours) BETWEEN 31 AND 49 ;

In [183]:
Employees[Employees.CountryName!='United States'] \
.groupby(by=['StateProvinceName']).filter(lambda x: (x.VacationHours.mean() >30) and (x.VacationHours.max()<50)) \
.groupby(by=['StateProvinceName']).VacationHours.aggregate([np.mean, np.min, np.max])


Out[183]:
mean amin amax
StateProvinceName
England 37 37 37
Gironde 34 34 34
Hamburg 35 35 35
Ontario 31 31 31
Victoria 36 36 36

3a. Report how many employees are in the company

SELECT COUNT(1) FROM dbo.Employees;

In [185]:
Employees.shape[0]


Out[185]:
291

3b. For the above report, show the number of employees per manager (hint: use ManagerID)

SELECT e.ManagerID, COUNT(1) AS 'NumEmployees' FROM dbo.Employees AS e GROUP BY e.ManagerID ;

In [186]:
Employees.groupby(by=['ManagerID']).EmployeeID.aggregate(lambda x: x.shape[0])


Out[186]:
ManagerID
1.0        8
16.0       8
25.0      30
26.0     178
227.0      6
234.0      3
235.0      5
249.0      9
250.0     17
263.0      9
273.0      3
274.0     10
285.0      1
287.0      3
Name: EmployeeID, dtype: int64

3c. Remove any manager ID's that are NULL from the results above

SELECT e.ManagerID, COUNT(1) AS 'NumEmployees' FROM dbo.Employees AS e WHERE e.ManagerID IS NOT NULL GROUP BY e.ManagerID ;

In [193]:
Employees.loc[pd.isnull(Employees.EmployeeID)==False] \
.groupby(by=['ManagerID']).EmployeeID.aggregate(lambda x: x.shape[0])


Out[193]:
ManagerID
1.0        8
16.0       8
25.0      30
26.0     178
227.0      6
234.0      3
235.0      5
249.0      9
250.0     17
263.0      9
273.0      3
274.0     10
285.0      1
287.0      3
Name: EmployeeID, dtype: int64

3d. Show the same results as above, but only for managers who have at least 5 employees

SELECT e.ManagerID, COUNT(1) AS 'NumEmployees' FROM dbo.Employees AS e WHERE e.ManagerID IS NOT NULL GROUP BY e.ManagerID HAVING COUNT(1) >= 5 ;

In [195]:
Employees.loc[pd.isnull(Employees.EmployeeID)==False] \
.groupby(by=['ManagerID']).filter(lambda x: x.shape[0]>=5) \
.groupby(by=['ManagerID']).EmployeeID.aggregate(lambda x: x.shape[0])


Out[195]:
ManagerID
1.0        8
16.0       8
25.0      30
26.0     178
227.0      6
235.0      5
249.0      9
250.0     17
263.0      9
274.0     10
Name: EmployeeID, dtype: int64

4a. List the average vacation hours of all employees

SELECT AVG(e.VacationHours) FROM dbo.Employees AS e ;

In [196]:
Employees.VacationHours.mean()


Out[196]:
50.628865979381445

4b. Break down the results by State

SELECT e.StateProvinceName, AVG(e.VacationHours) AS 'Average Vacation Hours' FROM dbo.Employees AS e GROUP BY e.StateProvinceName ;

In [197]:
Employees.groupby(by=['StateProvinceName']).VacationHours.mean()


Out[197]:
StateProvinceName
Alberta          35.000000
California       21.000000
England          37.000000
Georgia          55.000000
Gironde          34.000000
Hamburg          35.000000
Massachusetts    39.000000
Michigan         38.000000
Minnesota        36.000000
Ontario          31.000000
Oregon           22.000000
Tennessee        29.000000
Utah             27.000000
Victoria         36.000000
Washington       51.885609
Name: VacationHours, dtype: float64

4c. Break down the results by city and State

SELECT e.StateProvinceName, e.City, AVG(e.VacationHours) AS 'Average Vacation Hours' FROM dbo.Employees AS e GROUP BY e.StateProvinceName, e.City ;

In [198]:
Employees.groupby(by=['StateProvinceName', 'City']).VacationHours.mean()


Out[198]:
StateProvinceName  City         
Alberta            Calgary          35.000000
California         San Francisco    21.000000
England            Cambridge        37.000000
Georgia            Atlanta          55.000000
Gironde            Bordeaux         34.000000
Hamburg            Berlin           35.000000
Massachusetts      Cambridge        39.000000
Michigan           Detroit          38.000000
Minnesota          Duluth           24.000000
                   Minneapolis      48.000000
Ontario            Ottawa           31.000000
Oregon             Portland         22.000000
Tennessee          Memphis          29.000000
Utah               Nevada           27.000000
Victoria           Melbourne        36.000000
Washington         Bellevue         44.305556
                   Bothell          65.692308
                   Carnation        59.000000
                   Duvall           42.200000
                   Edmonds          47.640000
                   Everett          57.823529
                   Gold Bar         70.800000
                   Index            76.000000
                   Issaquah         67.666667
                   Kenmore          53.916667
                   Kent              5.000000
                   Monroe           47.916667
                   Newport Hills    62.833333
                   Redmond          37.428571
                   Renton           54.176471
                   Sammamish        42.823529
                   Seattle          54.659091
                   Snohomish        52.800000
Name: VacationHours, dtype: float64

4d. Add something that shows the number of employees per city

SELECT e.StateProvinceName, e.City , AVG(e.VacationHours) AS 'Average Vacation Hours' , COUNT(1) AS 'NumEmployees' FROM dbo.Employees AS e GROUP BY e.StateProvinceName, e.City ;

In [17]:
aggregations = {
    'VacationHours' : {
        'AverageVacationHours' : 'mean'
    },
    'EmployeeID' : {
        'NumEmployees' : 'count'
    }
}
Employees.groupby(by=['StateProvinceName', 'City'])\
.aggregate(aggregations)


Out[17]:
VacationHours EmployeeID
AverageVacationHours NumEmployees
StateProvinceName City
Alberta Calgary 35.000000 5
California San Francisco 21.000000 2
England Cambridge 37.000000 1
Georgia Atlanta 55.000000 1
Gironde Bordeaux 34.000000 1
Hamburg Berlin 35.000000 1
Massachusetts Cambridge 39.000000 1
Michigan Detroit 38.000000 1
Minnesota Duluth 24.000000 1
Minneapolis 48.000000 1
Ontario Ottawa 31.000000 1
Oregon Portland 22.000000 1
Tennessee Memphis 29.000000 1
Utah Nevada 27.000000 1
Victoria Melbourne 36.000000 1
Washington Bellevue 44.305556 36
Bothell 65.692308 13
Carnation 59.000000 5
Duvall 42.200000 10
Edmonds 47.640000 25
Everett 57.823529 17
Gold Bar 70.800000 5
Index 76.000000 5
Issaquah 67.666667 15
Kenmore 53.916667 12
Kent 5.000000 1
Monroe 47.916667 12
Newport Hills 62.833333 6
Redmond 37.428571 21
Renton 54.176471 17
Sammamish 42.823529 17
Seattle 54.659091 44
Snohomish 52.800000 10

4e. Sort the results by the city and state

SELECT e.StateProvinceName, e.City , AVG(e.VacationHours) AS 'Average Vacation Hours' , COUNT(1) AS 'NumEmployees' FROM dbo.Employees AS e GROUP BY e.StateProvinceName, e.City ORDER BY e.StateProvinceName, e.City ;

In [217]:
### Since Pandas already sorts the results, we show here how show resutls in DESCENDING order
aggregations = {
    'VacationHours' : {
        'AverageVacationHours' : 'mean'
    },
    'EmployeeID' : {
        'NumEmployees' : 'count'
    }
}
Employees.groupby(by=['StateProvinceName', 'City']).aggregate(aggregations) \
.sort_index(ascending=False, axis=0)


Out[217]:
EmployeeID VacationHours
NumEmployees AverageVacationHours
StateProvinceName City
Washington Snohomish 10 52.800000
Seattle 44 54.659091
Sammamish 17 42.823529
Renton 17 54.176471
Redmond 21 37.428571
Newport Hills 6 62.833333
Monroe 12 47.916667
Kent 1 5.000000
Kenmore 12 53.916667
Issaquah 15 67.666667
Index 5 76.000000
Gold Bar 5 70.800000
Everett 17 57.823529
Edmonds 25 47.640000
Duvall 10 42.200000
Carnation 5 59.000000
Bothell 13 65.692308
Bellevue 36 44.305556
Victoria Melbourne 1 36.000000
Utah Nevada 1 27.000000
Tennessee Memphis 1 29.000000
Oregon Portland 1 22.000000
Ontario Ottawa 1 31.000000
Minnesota Minneapolis 1 48.000000
Duluth 1 24.000000
Michigan Detroit 1 38.000000
Massachusetts Cambridge 1 39.000000
Hamburg Berlin 1 35.000000
Gironde Bordeaux 1 34.000000
Georgia Atlanta 1 55.000000
England Cambridge 1 37.000000
California San Francisco 2 21.000000
Alberta Calgary 5 35.000000

In [241]:
## This adds the row indices back as columns
df = Employees.groupby(by=['StateProvinceName', 'City']).aggregate(aggregations) \
.sort_index(ascending=False, axis=0)

df['SPN'] = df.index.get_level_values(0)
df['C'] = df.index.get_level_values(1)


Out[241]:
employeeid vacationhours spn c
numemployees averagevacationhours
StateProvinceName City
Washington Snohomish 10 52.800000 Washington Snohomish
Seattle 44 54.659091 Washington Seattle
Sammamish 17 42.823529 Washington Sammamish
Renton 17 54.176471 Washington Renton
Redmond 21 37.428571 Washington Redmond
Newport Hills 6 62.833333 Washington Newport Hills
Monroe 12 47.916667 Washington Monroe
Kent 1 5.000000 Washington Kent
Kenmore 12 53.916667 Washington Kenmore
Issaquah 15 67.666667 Washington Issaquah
Index 5 76.000000 Washington Index
Gold Bar 5 70.800000 Washington Gold Bar
Everett 17 57.823529 Washington Everett
Edmonds 25 47.640000 Washington Edmonds
Duvall 10 42.200000 Washington Duvall
Carnation 5 59.000000 Washington Carnation
Bothell 13 65.692308 Washington Bothell
Bellevue 36 44.305556 Washington Bellevue
Victoria Melbourne 1 36.000000 Victoria Melbourne
Utah Nevada 1 27.000000 Utah Nevada
Tennessee Memphis 1 29.000000 Tennessee Memphis
Oregon Portland 1 22.000000 Oregon Portland
Ontario Ottawa 1 31.000000 Ontario Ottawa
Minnesota Minneapolis 1 48.000000 Minnesota Minneapolis
Duluth 1 24.000000 Minnesota Duluth
Michigan Detroit 1 38.000000 Michigan Detroit
Massachusetts Cambridge 1 39.000000 Massachusetts Cambridge
Hamburg Berlin 1 35.000000 Hamburg Berlin
Gironde Bordeaux 1 34.000000 Gironde Bordeaux
Georgia Atlanta 1 55.000000 Georgia Atlanta
England Cambridge 1 37.000000 England Cambridge
California San Francisco 2 21.000000 California San Francisco
Alberta Calgary 5 35.000000 Alberta Calgary

4f. Make city and State a single column in the format of "City, State"

SELECT CONCAT(e.StateProvinceName, ', ', e.City) AS 'Location' , AVG(e.VacationHours) AS 'Average Vacation Hours' , COUNT(1) AS 'NumEmployees' FROM dbo.Employees AS e GROUP BY e.StateProvinceName, e.City ORDER BY e.StateProvinceName, e.City ;

In [362]:
# Version 1: first create new column 'CityState' ... rest is easy
aggregations = {
    'VacationHours' : {
        'AverageVacationHours' : 'mean'
    },
    'EmployeeID' : {
        'NumEmployees' : 'count'
    },
}

df = Employees.copy()
df['CityState'] = Employees.City.str.cat(Employees.StateProvinceName, sep=', ')
df.groupby(by = 'CityState').aggregate(aggregations).sort_index(ascending=False, axis=0)


Out[362]:
EmployeeID VacationHours
NumEmployees AverageVacationHours
CityState
Snohomish, Washington 10 52.800000
Seattle, Washington 44 54.659091
San Francisco, California 2 21.000000
Sammamish, Washington 17 42.823529
Renton, Washington 17 54.176471
Redmond, Washington 21 37.428571
Portland, Oregon 1 22.000000
Ottawa, Ontario 1 31.000000
Newport Hills, Washington 6 62.833333
Nevada, Utah 1 27.000000
Monroe, Washington 12 47.916667
Minneapolis, Minnesota 1 48.000000
Memphis, Tennessee 1 29.000000
Melbourne, Victoria 1 36.000000
Kent, Washington 1 5.000000
Kenmore, Washington 12 53.916667
Issaquah, Washington 15 67.666667
Index, Washington 5 76.000000
Gold Bar, Washington 5 70.800000
Everett, Washington 17 57.823529
Edmonds, Washington 25 47.640000
Duvall, Washington 10 42.200000
Duluth, Minnesota 1 24.000000
Detroit, Michigan 1 38.000000
Carnation, Washington 5 59.000000
Cambridge, Massachusetts 1 39.000000
Cambridge, England 1 37.000000
Calgary, Alberta 5 35.000000
Bothell, Washington 13 65.692308
Bordeaux, Gironde 1 34.000000
Berlin, Hamburg 1 35.000000
Bellevue, Washington 36 44.305556
Atlanta, Georgia 1 55.000000

In [301]:
# Version 2: using the multi-indexed result and adding index values as new column 'CityState'
aggregations = {
    'VacationHours' : {
        'AverageVacationHours' : 'mean'
    },
    'EmployeeID' : {
        'NumEmployees' : 'count'
    },
}

df = Employees.groupby(by=['StateProvinceName', 'City']).aggregate(aggregations) \
.sort_index(ascending=False, axis=0)
df['CityState'] = [
    r.City+', '+r.State \
        for i, r in pd.DataFrame({
            'City': df.index.get_level_values(1),
            'State': df.index.get_level_values(0)
            }).iterrows()
    ]
df


Out[301]:
EmployeeID VacationHours CityState
NumEmployees AverageVacationHours
StateProvinceName City
Washington Snohomish 10 52.800000 Snohomish, Washington
Seattle 44 54.659091 Seattle, Washington
Sammamish 17 42.823529 Sammamish, Washington
Renton 17 54.176471 Renton, Washington
Redmond 21 37.428571 Redmond, Washington
Newport Hills 6 62.833333 Newport Hills, Washington
Monroe 12 47.916667 Monroe, Washington
Kent 1 5.000000 Kent, Washington
Kenmore 12 53.916667 Kenmore, Washington
Issaquah 15 67.666667 Issaquah, Washington
Index 5 76.000000 Index, Washington
Gold Bar 5 70.800000 Gold Bar, Washington
Everett 17 57.823529 Everett, Washington
Edmonds 25 47.640000 Edmonds, Washington
Duvall 10 42.200000 Duvall, Washington
Carnation 5 59.000000 Carnation, Washington
Bothell 13 65.692308 Bothell, Washington
Bellevue 36 44.305556 Bellevue, Washington
Victoria Melbourne 1 36.000000 Melbourne, Victoria
Utah Nevada 1 27.000000 Nevada, Utah
Tennessee Memphis 1 29.000000 Memphis, Tennessee
Oregon Portland 1 22.000000 Portland, Oregon
Ontario Ottawa 1 31.000000 Ottawa, Ontario
Minnesota Minneapolis 1 48.000000 Minneapolis, Minnesota
Duluth 1 24.000000 Duluth, Minnesota
Michigan Detroit 1 38.000000 Detroit, Michigan
Massachusetts Cambridge 1 39.000000 Cambridge, Massachusetts
Hamburg Berlin 1 35.000000 Berlin, Hamburg
Gironde Bordeaux 1 34.000000 Bordeaux, Gironde
Georgia Atlanta 1 55.000000 Atlanta, Georgia
England Cambridge 1 37.000000 Cambridge, England
California San Francisco 2 21.000000 San Francisco, California
Alberta Calgary 5 35.000000 Calgary, Alberta

4g. Add a column that shows the difference between the maximum vacation hours and minimum vacation hours for each city

SELECT CONCAT(e.StateProvinceName, ', ', e.City) AS 'Location' , AVG(e.VacationHours) AS 'Average Vacation Hours' , COUNT(1) AS 'NumEmployees' , MAX(e.VacationHours)-MIN(e.VacationHours) AS 'Difference Vacation Hours' FROM dbo.Employees AS e GROUP BY e.StateProvinceName, e.City ORDER BY e.StateProvinceName, e.City ;

In [309]:
aggregations = {
    'VacationHours' : {
        'AverageVacationHours' : 'mean',
        'MinVacationHours' : min, 
        'MaxVacationHours' : max,
        'DeltaVacationHoues' : lambda x: x.max()-x.min()
    },
    'EmployeeID' : {
        'NumEmployees' : 'count'
    },
}

Employees.groupby(by=['StateProvinceName', 'City']).aggregate(aggregations) \
.sort_index(ascending=False, axis=0)


Out[309]:
EmployeeID VacationHours
NumEmployees AverageVacationHours DeltaVacationHoues MinVacationHours MaxVacationHours
StateProvinceName City
Washington Snohomish 10 52.800000 85 10 95
Seattle 44 54.659091 89 6 95
Sammamish 17 42.823529 93 0 93
Renton 17 54.176471 97 0 97
Redmond 21 37.428571 92 2 94
Newport Hills 6 62.833333 55 44 99
Monroe 12 47.916667 71 14 85
Kent 1 5.000000 0 5 5
Kenmore 12 53.916667 90 7 97
Issaquah 15 67.666667 73 10 83
Index 5 76.000000 32 59 91
Gold Bar 5 70.800000 39 53 92
Everett 17 57.823529 71 27 98
Edmonds 25 47.640000 97 1 98
Duvall 10 42.200000 92 0 92
Carnation 5 59.000000 64 24 88
Bothell 13 65.692308 90 9 99
Bellevue 36 44.305556 95 1 96
Victoria Melbourne 1 36.000000 0 36 36
Utah Nevada 1 27.000000 0 27 27
Tennessee Memphis 1 29.000000 0 29 29
Oregon Portland 1 22.000000 0 22 22
Ontario Ottawa 1 31.000000 0 31 31
Minnesota Minneapolis 1 48.000000 0 48 48
Duluth 1 24.000000 0 24 24
Michigan Detroit 1 38.000000 0 38 38
Massachusetts Cambridge 1 39.000000 0 39 39
Hamburg Berlin 1 35.000000 0 35 35
Gironde Bordeaux 1 34.000000 0 34 34
Georgia Atlanta 1 55.000000 0 55 55
England Cambridge 1 37.000000 0 37 37
California San Francisco 2 21.000000 10 16 26
Alberta Calgary 5 35.000000 53 9 62

4h. Now sort the results by the new column created above

SELECT CONCAT(e.StateProvinceName, ', ', e.City) AS 'Location' , AVG(e.VacationHours) AS 'Average Vacation Hours' , COUNT(1) AS 'NumEmployees' , MAX(e.VacationHours)-MIN(e.VacationHours) AS 'Difference Vacation Hours' FROM dbo.Employees AS e GROUP BY e.StateProvinceName, e.City ORDER BY 'Difference Vacation Hours' DESC ;

In [326]:
aggregations = {
    'VacationHours' : {
        'AverageVacationHours' : 'mean',
        'MinVacationHours' : min, 
        'MaxVacationHours' : max,
        'DeltaVacationHours' : lambda x: x.max()-x.min()
    },
    'EmployeeID' : {
        'NumEmployees' : 'count'
    },
}

df = Employees.groupby(by=['StateProvinceName', 'City']).aggregate(aggregations)
df.iloc[np.argsort(df['VacationHours']['DeltaVacationHours'])]


Out[326]:
EmployeeID VacationHours
NumEmployees AverageVacationHours DeltaVacationHours MinVacationHours MaxVacationHours
StateProvinceName City
Oregon Portland 1 22.000000 0 22 22
England Cambridge 1 37.000000 0 37 37
Georgia Atlanta 1 55.000000 0 55 55
Gironde Bordeaux 1 34.000000 0 34 34
Hamburg Berlin 1 35.000000 0 35 35
Massachusetts Cambridge 1 39.000000 0 39 39
Michigan Detroit 1 38.000000 0 38 38
Minnesota Duluth 1 24.000000 0 24 24
Minneapolis 1 48.000000 0 48 48
Ontario Ottawa 1 31.000000 0 31 31
Washington Kent 1 5.000000 0 5 5
Tennessee Memphis 1 29.000000 0 29 29
Utah Nevada 1 27.000000 0 27 27
Victoria Melbourne 1 36.000000 0 36 36
California San Francisco 2 21.000000 10 16 26
Washington Index 5 76.000000 32 59 91
Gold Bar 5 70.800000 39 53 92
Alberta Calgary 5 35.000000 53 9 62
Washington Newport Hills 6 62.833333 55 44 99
Carnation 5 59.000000 64 24 88
Monroe 12 47.916667 71 14 85
Everett 17 57.823529 71 27 98
Issaquah 15 67.666667 73 10 83
Snohomish 10 52.800000 85 10 95
Seattle 44 54.659091 89 6 95
Kenmore 12 53.916667 90 7 97
Bothell 13 65.692308 90 9 99
Duvall 10 42.200000 92 0 92
Redmond 21 37.428571 92 2 94
Sammamish 17 42.823529 93 0 93
Bellevue 36 44.305556 95 1 96
Renton 17 54.176471 97 0 97
Edmonds 25 47.640000 97 1 98

4i. Limit the results to cities that have more than 1 employee

SELECT CONCAT(e.StateProvinceName, ', ', e.City) AS 'Location' , AVG(e.VacationHours) AS 'Average Vacation Hours' , COUNT(1) AS 'NumEmployees' , MAX(e.VacationHours)-MIN(e.VacationHours) AS 'Difference Vacation Hours' FROM dbo.Employees AS e GROUP BY e.StateProvinceName, e.City HAVING COUNT(1) > 1 ORDER BY 'Difference Vacation Hours' DESC ;

In [50]:
def my_own_delta(x):
    return  x.max()-x.min()

aggregations = {
    'VacationHours' : {
        'AverageVacationHours' : 'mean',
        'MinVacationHours' : np.min, 
        'MaxVacationHours' : np.max,
        'DeltaVacationHours' : my_own_delta, ##lambda x: x.max()-x.min(),
        'cnt' : pd.Series.count
    },
    'EmployeeID' : {
        'NumEmployees' : 'count'
    },
}

df = Employees.groupby(by=['StateProvinceName', 'City']).aggregate(aggregations)
df['VacationHours']['DeltaVacationHours']>1
df2 = df[df['VacationHours']['DeltaVacationHours']>1]

df2.columns = df2.columns.droplevel(0)
df2.reset_index()
##df2.iloc[np.argsort(df2['VacationHours']['DeltaVacationHours'])]
##np.argsort(df2['VacationHours']['DeltaVacationHours'])


Out[50]:
StateProvinceName City cnt MinVacationHours AverageVacationHours MaxVacationHours DeltaVacationHours NumEmployees
0 Alberta Calgary 5 9 35.000000 62 53 5
1 California San Francisco 2 16 21.000000 26 10 2
2 Washington Bellevue 36 1 44.305556 96 95 36
3 Washington Bothell 13 9 65.692308 99 90 13
4 Washington Carnation 5 24 59.000000 88 64 5
5 Washington Duvall 10 0 42.200000 92 92 10
6 Washington Edmonds 25 1 47.640000 98 97 25
7 Washington Everett 17 27 57.823529 98 71 17
8 Washington Gold Bar 5 53 70.800000 92 39 5
9 Washington Index 5 59 76.000000 91 32 5
10 Washington Issaquah 15 10 67.666667 83 73 15
11 Washington Kenmore 12 7 53.916667 97 90 12
12 Washington Monroe 12 14 47.916667 85 71 12
13 Washington Newport Hills 6 44 62.833333 99 55 6
14 Washington Redmond 21 2 37.428571 94 92 21
15 Washington Renton 17 0 54.176471 97 97 17
16 Washington Sammamish 17 0 42.823529 93 93 17
17 Washington Seattle 44 6 54.659091 95 89 44
18 Washington Snohomish 10 10 52.800000 95 85 10

4j. Limit the results to non-U.S. cities

SELECT CONCAT(e.StateProvinceName, ', ', e.City) AS 'Location' , AVG(e.VacationHours) AS 'Average Vacation Hours' , COUNT(1) AS 'NumEmployees' , MAX(e.VacationHours)-MIN(e.VacationHours) AS 'Difference Vacation Hours' FROM dbo.Employees AS e WHERE e.CountryName != 'United States' GROUP BY e.StateProvinceName, e.City HAVING COUNT(1) > 1 ORDER BY 'Difference Vacation Hours' DESC ;

In [336]:
aggregations = {
    'VacationHours' : {
        'AverageVacationHours' : 'mean',
        'MinVacationHours' : min, 
        'MaxVacationHours' : max,
        'DeltaVacationHours' : lambda x: x.max()-x.min()
    },
    'EmployeeID' : {
        'NumEmployees' : 'count'
    },
}

df = Employees[Employees.CountryName != 'United States'].groupby(by=['StateProvinceName', 'City']).aggregate(aggregations)
df2 = df[df['VacationHours']['DeltaVacationHours']>1]
df2.iloc[np.argsort(df2['VacationHours']['DeltaVacationHours'])]


Out[336]:
EmployeeID VacationHours
NumEmployees AverageVacationHours DeltaVacationHours MinVacationHours MaxVacationHours
StateProvinceName City
Alberta Calgary 5 35 53 9 62

Extra

Here are two alternatives to create a 'City, State' column.


In [354]:
[cs for cs in map(lambda x,y: str(y)+', '+str(x), Employees.StateProvinceName.values, Employees.City.values)]


Out[354]:
['Redmond, Washington',
 'Calgary, Alberta',
 'Edmonds, Washington',
 'Seattle, Washington',
 'Bellevue, Washington',
 'Seattle, Washington',
 'Issaquah, Washington',
 'Everett, Washington',
 'Seattle, Washington',
 'Bothell, Washington',
 'San Francisco, California',
 'Index, Washington',
 'Redmond, Washington',
 'Snohomish, Washington',
 'Bellevue, Washington',
 'Seattle, Washington',
 'Monroe, Washington',
 'Seattle, Washington',
 'Calgary, Alberta',
 'Seattle, Washington',
 'Renton, Washington',
 'Calgary, Alberta',
 'Seattle, Washington',
 'Bothell, Washington',
 'Newport Hills, Washington',
 'Carnation, Washington',
 'Monroe, Washington',
 'Seattle, Washington',
 'Bellevue, Washington',
 'Edmonds, Washington',
 'Index, Washington',
 'Sammamish, Washington',
 'Calgary, Alberta',
 'Renton, Washington',
 'Everett, Washington',
 'Duvall, Washington',
 'Bothell, Washington',
 'Issaquah, Washington',
 'Everett, Washington',
 'Sammamish, Washington',
 'Bothell, Washington',
 'Edmonds, Washington',
 'Edmonds, Washington',
 'Edmonds, Washington',
 'Edmonds, Washington',
 'Edmonds, Washington',
 'Snohomish, Washington',
 'Gold Bar, Washington',
 'Redmond, Washington',
 'Seattle, Washington',
 'Everett, Washington',
 'Renton, Washington',
 'Renton, Washington',
 'Monroe, Washington',
 'Bellevue, Washington',
 'Everett, Washington',
 'Seattle, Washington',
 'Redmond, Washington',
 'Gold Bar, Washington',
 'Redmond, Washington',
 'Edmonds, Washington',
 'Sammamish, Washington',
 'Redmond, Washington',
 'Nevada, Utah',
 'Kenmore, Washington',
 'Bothell, Washington',
 'Renton, Washington',
 'Duvall, Washington',
 'Snohomish, Washington',
 'Seattle, Washington',
 'Edmonds, Washington',
 'Renton, Washington',
 'Edmonds, Washington',
 'Bellevue, Washington',
 'Edmonds, Washington',
 'Seattle, Washington',
 'Newport Hills, Washington',
 'Redmond, Washington',
 'Bellevue, Washington',
 'Carnation, Washington',
 'Index, Washington',
 'Snohomish, Washington',
 'Index, Washington',
 'Bellevue, Washington',
 'Everett, Washington',
 'Seattle, Washington',
 'Snohomish, Washington',
 'Edmonds, Washington',
 'Bellevue, Washington',
 'Bellevue, Washington',
 'Issaquah, Washington',
 'Everett, Washington',
 'Redmond, Washington',
 'Melbourne, Victoria',
 'Issaquah, Washington',
 'Duvall, Washington',
 'Bellevue, Washington',
 'Kent, Washington',
 'Index, Washington',
 'Edmonds, Washington',
 'Everett, Washington',
 'Bellevue, Washington',
 'Bellevue, Washington',
 'Edmonds, Washington',
 'Bellevue, Washington',
 'Monroe, Washington',
 'Redmond, Washington',
 'Calgary, Alberta',
 'Issaquah, Washington',
 'Gold Bar, Washington',
 'Carnation, Washington',
 'Bellevue, Washington',
 'Everett, Washington',
 'Duvall, Washington',
 'Cambridge, Massachusetts',
 'Kenmore, Washington',
 'Sammamish, Washington',
 'Kenmore, Washington',
 'Snohomish, Washington',
 'Seattle, Washington',
 'Sammamish, Washington',
 'Bellevue, Washington',
 'Renton, Washington',
 'Newport Hills, Washington',
 'Sammamish, Washington',
 'Sammamish, Washington',
 'Monroe, Washington',
 'Seattle, Washington',
 'Kenmore, Washington',
 'Monroe, Washington',
 'Everett, Washington',
 'Bellevue, Washington',
 'Snohomish, Washington',
 'Redmond, Washington',
 'Seattle, Washington',
 'Renton, Washington',
 'Seattle, Washington',
 'Seattle, Washington',
 'Monroe, Washington',
 'Edmonds, Washington',
 'Kenmore, Washington',
 'Duvall, Washington',
 'Seattle, Washington',
 'Kenmore, Washington',
 'Everett, Washington',
 'Issaquah, Washington',
 'Sammamish, Washington',
 'Sammamish, Washington',
 'Seattle, Washington',
 'Seattle, Washington',
 'Duvall, Washington',
 'Seattle, Washington',
 'Edmonds, Washington',
 'Everett, Washington',
 'Kenmore, Washington',
 'Kenmore, Washington',
 'Seattle, Washington',
 'Kenmore, Washington',
 'Renton, Washington',
 'Bothell, Washington',
 'Bellevue, Washington',
 'Minneapolis, Minnesota',
 'Snohomish, Washington',
 'San Francisco, California',
 'Monroe, Washington',
 'Bothell, Washington',
 'Seattle, Washington',
 'Bellevue, Washington',
 'Snohomish, Washington',
 'Issaquah, Washington',
 'Renton, Washington',
 'Edmonds, Washington',
 'Bellevue, Washington',
 'Bellevue, Washington',
 'Bellevue, Washington',
 'Bellevue, Washington',
 'Portland, Oregon',
 'Redmond, Washington',
 'Bothell, Washington',
 'Bellevue, Washington',
 'Seattle, Washington',
 'Renton, Washington',
 'Seattle, Washington',
 'Issaquah, Washington',
 'Everett, Washington',
 'Seattle, Washington',
 'Kenmore, Washington',
 'Bellevue, Washington',
 'Seattle, Washington',
 'Edmonds, Washington',
 'Sammamish, Washington',
 'Bothell, Washington',
 'Redmond, Washington',
 'Renton, Washington',
 'Sammamish, Washington',
 'Bellevue, Washington',
 'Kenmore, Washington',
 'Sammamish, Washington',
 'Seattle, Washington',
 'Seattle, Washington',
 'Sammamish, Washington',
 'Renton, Washington',
 'Seattle, Washington',
 'Bellevue, Washington',
 'Snohomish, Washington',
 'Bellevue, Washington',
 'Redmond, Washington',
 'Edmonds, Washington',
 'Bellevue, Washington',
 'Seattle, Washington',
 'Issaquah, Washington',
 'Seattle, Washington',
 'Bothell, Washington',
 'Edmonds, Washington',
 'Renton, Washington',
 'Seattle, Washington',
 'Seattle, Washington',
 'Everett, Washington',
 'Issaquah, Washington',
 'Seattle, Washington',
 'Monroe, Washington',
 'Seattle, Washington',
 'Issaquah, Washington',
 'Bellevue, Washington',
 'Sammamish, Washington',
 'Seattle, Washington',
 'Everett, Washington',
 'Bellevue, Washington',
 'Sammamish, Washington',
 'Renton, Washington',
 'Duluth, Minnesota',
 'Redmond, Washington',
 'Gold Bar, Washington',
 'Redmond, Washington',
 'Edmonds, Washington',
 'Detroit, Michigan',
 'Bothell, Washington',
 'Kenmore, Washington',
 'Duvall, Washington',
 'Seattle, Washington',
 'Memphis, Tennessee',
 'Monroe, Washington',
 'Sammamish, Washington',
 'Edmonds, Washington',
 'Duvall, Washington',
 'Redmond, Washington',
 'Issaquah, Washington',
 'Everett, Washington',
 'Seattle, Washington',
 'Bellevue, Washington',
 'Bellevue, Washington',
 'Seattle, Washington',
 'Bellevue, Washington',
 'Monroe, Washington',
 'Newport Hills, Washington',
 'Redmond, Washington',
 'Ottawa, Ontario',
 'Sammamish, Washington',
 'Everett, Washington',
 'Edmonds, Washington',
 'Newport Hills, Washington',
 'Seattle, Washington',
 'Issaquah, Washington',
 'Bellevue, Washington',
 'Redmond, Washington',
 'Bordeaux, Gironde',
 'Bellevue, Washington',
 'Renton, Washington',
 'Issaquah, Washington',
 'Renton, Washington',
 'Seattle, Washington',
 'Bothell, Washington',
 'Gold Bar, Washington',
 'Monroe, Washington',
 'Carnation, Washington',
 'Newport Hills, Washington',
 'Duvall, Washington',
 'Bellevue, Washington',
 'Carnation, Washington',
 'Edmonds, Washington',
 'Edmonds, Washington',
 'Bothell, Washington',
 'Redmond, Washington',
 'Redmond, Washington',
 'Issaquah, Washington',
 'Redmond, Washington',
 'Duvall, Washington',
 'Seattle, Washington',
 'Cambridge, England',
 'Berlin, Hamburg',
 'Atlanta, Georgia']

In [356]:
Employees.City.str.cat(Employees.StateProvinceName, sep=', ')


Out[356]:
0            Redmond, Washington
1               Calgary, Alberta
2            Edmonds, Washington
3            Seattle, Washington
4           Bellevue, Washington
5            Seattle, Washington
6           Issaquah, Washington
7            Everett, Washington
8            Seattle, Washington
9            Bothell, Washington
10     San Francisco, California
11             Index, Washington
12           Redmond, Washington
13         Snohomish, Washington
14          Bellevue, Washington
15           Seattle, Washington
16            Monroe, Washington
17           Seattle, Washington
18              Calgary, Alberta
19           Seattle, Washington
20            Renton, Washington
21              Calgary, Alberta
22           Seattle, Washington
23           Bothell, Washington
24     Newport Hills, Washington
25         Carnation, Washington
26            Monroe, Washington
27           Seattle, Washington
28          Bellevue, Washington
29           Edmonds, Washington
                 ...            
261          Seattle, Washington
262         Issaquah, Washington
263         Bellevue, Washington
264          Redmond, Washington
265            Bordeaux, Gironde
266         Bellevue, Washington
267           Renton, Washington
268         Issaquah, Washington
269           Renton, Washington
270          Seattle, Washington
271          Bothell, Washington
272         Gold Bar, Washington
273           Monroe, Washington
274        Carnation, Washington
275    Newport Hills, Washington
276           Duvall, Washington
277         Bellevue, Washington
278        Carnation, Washington
279          Edmonds, Washington
280          Edmonds, Washington
281          Bothell, Washington
282          Redmond, Washington
283          Redmond, Washington
284         Issaquah, Washington
285          Redmond, Washington
286           Duvall, Washington
287          Seattle, Washington
288           Cambridge, England
289              Berlin, Hamburg
290             Atlanta, Georgia
Name: City, dtype: object

Summary

In the simplest form one can specify either rows or columns


In [ ]:
# First ten rows
Employees[0:10]

In [5]:
# Rows satisfying a particular condition
Employees[Employees.MaritalStatus=='M']


Out[5]:
EmployeeID FirstName
0 259 Ben
1 278 Garrett
2 204 Gabe
3 78 Reuben
4 255 Gordon
10 10 Michael
12 175 Reed
14 144 Paul
15 192 Brenda
18 203 Ken
19 245 Barbara
24 163 Alex
25 118 Don
26 176 David
27 147 Sandra
28 138 Samantha
29 95 Jim
30 231 Jo
31 86 Ryan
33 214 Andreas
34 72 Steven
36 107 Christopher
37 269 Dan
38 187 Yvonne
40 12 Thierry
42 30 Britta
43 198 Lorraine
45 31 Margie
48 3 Roberto
49 199 Paula
... ... ...
227 136 Sylvester
231 35 Brandon
232 228 Christian
233 223 Sairaj
234 100 Lolan
239 200 Frank
240 279 Tsvi
243 73 Carole
249 21 Terry
252 254 Fukiko
255 53 Diane
256 282 José
258 209 Kathie
259 50 Sidney
261 70 David
264 140 Prasanna
266 5 Gail
267 252 Arvind
269 233 Magnus
270 83 Patrick
272 241 David
275 239 Mindy
278 38 Kim
279 154 Raymond
280 114 Mindaugas
281 123 Vamsi
283 101 Houman
284 221 Chris
288 289 Jae
290 999 Chadwick

147 rows × 2 columns


In [ ]:
# Select sertain columns
Employees[['EmployeeID', 'FirstName']]

Since expressions return DataFrames we can chain method calls and [...] operators

The order matters!


In [9]:
Employees[0:10][Employees.MaritalStatus=='M'][['EmployeeID', 'FirstName', 'MaritalStatus']]


/usr/lib/python3.4/site-packages/ipykernel/__main__.py:1: UserWarning: Boolean Series key will be reindexed to match DataFrame index.
  if __name__ == '__main__':
Out[9]:
EmployeeID FirstName MaritalStatus
0 259 Ben M
1 278 Garrett M
2 204 Gabe M
3 78 Reuben M
4 255 Gordon M

In [10]:
Employees[Employees.MaritalStatus=='M'][['EmployeeID', 'FirstName', 'MaritalStatus']][0:10]


Out[10]:
EmployeeID FirstName MaritalStatus
0 259 Ben M
1 278 Garrett M
2 204 Gabe M
3 78 Reuben M
4 255 Gordon M
10 10 Michael M
12 175 Reed M
14 144 Paul M
15 192 Brenda M
18 203 Ken M

Recommended to use .loc .iloc .at methods

Grouping should be performed by on factors (i.e. discrete values), aggregation usually on numeric values.


In [14]:
Employees.groupby(by=['MaritalStatus']).VacationHours.sum()


Out[14]:
MaritalStatus
M    7075
S    7658
Name: VacationHours, dtype: int64

Use 'aggregate' for more complex groupings and aggregations


In [ ]:
aggregations = {
    'VacationHours' : {
        'AverageVacationHours' : 'mean',
        'MinVacationHours' : min, 
        'MaxVacationHours' : max,
        'DeltaVacationHoues' : lambda x: x.max()-x.min()
    },
    'EmployeeID' : {
        'NumEmployees' : 'count'
    },
}

Employees.groupby(by=['StateProvinceName', 'City']).aggregate(aggregations) \
.sort_index(ascending=False, axis=0)

Assign intermediate results to new variables, and perform next operation on them

In particular to saticfy filtering ('HAVING') and sorting on aggregated values

Columns are not lists. Methods on columns are defined in Pandas.Series and NumPy


In [18]:
help(np.min)
help(pd.Series.between)


Help on function amin in module numpy.core.fromnumeric:

amin(a, axis=None, out=None, keepdims=False)
    Return the minimum of an array or minimum along an axis.
    
    Parameters
    ----------
    a : array_like
        Input data.
    axis : None or int or tuple of ints, optional
        Axis or axes along which to operate.  By default, flattened input is
        used.
    
        .. versionadded: 1.7.0
    
        If this is a tuple of ints, the minimum is selected over multiple axes,
        instead of a single axis or all the axes as before.
    out : ndarray, optional
        Alternative output array in which to place the result.  Must
        be of the same shape and buffer length as the expected output.
        See `doc.ufuncs` (Section "Output arguments") for more details.
    keepdims : bool, optional
        If this is set to True, the axes which are reduced are left
        in the result as dimensions with size one. With this option,
        the result will broadcast correctly against the original `arr`.
    
    Returns
    -------
    amin : ndarray or scalar
        Minimum of `a`. If `axis` is None, the result is a scalar value.
        If `axis` is given, the result is an array of dimension
        ``a.ndim - 1``.
    
    See Also
    --------
    amax :
        The maximum value of an array along a given axis, propagating any NaNs.
    nanmin :
        The minimum value of an array along a given axis, ignoring any NaNs.
    minimum :
        Element-wise minimum of two arrays, propagating any NaNs.
    fmin :
        Element-wise minimum of two arrays, ignoring any NaNs.
    argmin :
        Return the indices of the minimum values.
    
    nanmax, maximum, fmax
    
    Notes
    -----
    NaN values are propagated, that is if at least one item is NaN, the
    corresponding min value will be NaN as well. To ignore NaN values
    (MATLAB behavior), please use nanmin.
    
    Don't use `amin` for element-wise comparison of 2 arrays; when
    ``a.shape[0]`` is 2, ``minimum(a[0], a[1])`` is faster than
    ``amin(a, axis=0)``.
    
    Examples
    --------
    >>> a = np.arange(4).reshape((2,2))
    >>> a
    array([[0, 1],
           [2, 3]])
    >>> np.amin(a)           # Minimum of the flattened array
    0
    >>> np.amin(a, axis=0)   # Minima along the first axis
    array([0, 1])
    >>> np.amin(a, axis=1)   # Minima along the second axis
    array([0, 2])
    
    >>> b = np.arange(5, dtype=np.float)
    >>> b[2] = np.NaN
    >>> np.amin(b)
    nan
    >>> np.nanmin(b)
    0.0

Help on function between in module pandas.core.series:

between(self, left, right, inclusive=True)
    Return boolean Series equivalent to left <= series <= right. NA values
    will be treated as False
    
    Parameters
    ----------
    left : scalar
        Left boundary
    right : scalar
        Right boundary
    
    Returns
    -------
    is_between : Series


In [ ]:


In [ ]: