Homework 8: Dataset 2: Baggage claims

  • Open your dataset up using pandas in a Jupyter notebook
  • Do a .head() to get a feel for your data
  • Write down 12 questions to ask your data, or 12 things to hunt for in the data
  • Attempt to answer those ten questions using the magic of pandas
  • Make three charts with your dataset
  • Keep track of anything that was problematic - it can be non-standard country names, extra spaces in columns, trouble getting multiple colors in scatterplots, whatever you'd like.

In [2]:
import pandas as pd
import matplotlib.pyplot as plt
% matplotlib inline

In [3]:
df=pd.read_csv('baggageclaims_data.csv')

In [130]:
df.head()


Out[130]:
Claim Number Date Received Incident D Airport Code Airport Name Airline Name Claim Type Claim Site Item Category Close Amount Disposition Amount_float date
0 2015060222904 28-May-15 21-May-15 ABE Lehigh Valley International Airport, Allentown Allegiant Air Property Damage Checkpoint Personal Electronics $0.00 Deny 0.0 2015-12-31
1 2015082725115 21-Aug-15 1-Aug-15 ABE Lehigh Valley International Airport, Allentown - Property Damage Checked Baggage Sporting Equipment & Supplies $180.00 Settle 180.0 2015-12-31
2 2015102326219 7-Oct-15 2-Oct-15 ABE Lehigh Valley International Airport, Allentown Allegiant Air Passenger Property Loss Checked Baggage Clothing $60.00 Approve in Full 60.0 2015-12-31
3 2015122027695 27-Oct-15 19-Oct-15 ABE Lehigh Valley International Airport, Allentown Allegiant Air Property Damage Checkpoint Personal Accessories - - 0.0 2015-12-31
4 2015012220065 22-Jan-15 23-Dec-14 ABI Abilene Regional American Airlines Property Damage Checked Baggage Home Decor $0.00 Deny 0.0 2015-12-31

1) Which claim type is the most common?


In [5]:
df.columns


Out[5]:
Index(['Claim Number', 'Date Received', 'Incident D', 'Airport Code',
       'Airport Name', 'Airline Name', 'Claim Type', 'Claim Site',
       'Item Category', 'Close Amount', 'Disposition'],
      dtype='object')

In [6]:
df['Claim Type'].value_counts()
#Passenger Property Loss is most common claim type


Out[6]:
Passenger Property Loss    4551
Property Damage            3888
Personal Injury             122
Motor Vehicle                35
-                            34
Complaint                    28
Employee Loss (MPCECA)        9
Name: Claim Type, dtype: int64

2. How do airlines compare on this most common claim type?


In [30]:
loss = df[df['Claim Type'] == 'Passenger Property Loss']

In [19]:
loss_by_airline = pd.DataFrame(loss.groupby('Airline Name')['Claim Type'].value_counts())
loss_by_airline.sort_values(by='Claim Type', ascending = False).head(20)


Out[19]:
Claim Type
Airline Name Claim Type
Southwest Airlines Passenger Property Loss 851
Delta Air Lines Passenger Property Loss 680
American Airlines Passenger Property Loss 628
UAL Passenger Property Loss 578
USAir Passenger Property Loss 326
- Passenger Property Loss 225
Jet Blue Passenger Property Loss 223
Alaska Airlines Passenger Property Loss 203
Frontier Airlines Passenger Property Loss 94
British Airways Passenger Property Loss 87
Spirit Airlines Passenger Property Loss 80
Hawaiian Airlines Passenger Property Loss 70
Allegiant Air Passenger Property Loss 51
Virgin America Passenger Property Loss 43
Air Canada Passenger Property Loss 29
Lufthansa Passenger Property Loss 27
Westjet Airlines Ltd Passenger Property Loss 26
Emirates Passenger Property Loss 21
Etihad Airways Passenger Property Loss 17
Turkish Airlines Passenger Property Loss 12

In [34]:
plt.style.use('ggplot')
loss_by_airline.sort_values(by='Claim Type', ascending = True).tail(20).plot(kind='barh', xlim=(0,900), ylim=(0,900), legend=False, figsize=(10,10))


Out[34]:
<matplotlib.axes._subplots.AxesSubplot at 0x88c68b0>

3. How does most common claimtype disperse over item categories?


In [28]:
loss_by_item = loss.groupby('Item Category')['Claim Type'].value_counts()
loss_by_item.sort_values().tail(20)


Out[28]:
Item Category                         Claim Type             
Jewelry & Watches; Jewelry & Watches  Passenger Property Loss     39
Automobile Parts & Accessories        Passenger Property Loss     40
Tools & Home Improvement Supplies     Passenger Property Loss     47
Audio/Video                           Passenger Property Loss     54
Sporting Equipment & Supplies         Passenger Property Loss     71
Hunting & Fishing Items               Passenger Property Loss     77
Cameras                               Passenger Property Loss     83
Medical/Science                       Passenger Property Loss    100
Food & Drink                          Passenger Property Loss    111
Baggage/Cases/Purses                  Passenger Property Loss    122
Currency                              Passenger Property Loss    133
Personal Accessories                  Passenger Property Loss    149
Computer & Accessories                Passenger Property Loss    227
Cosmetics & Grooming                  Passenger Property Loss    232
Personal Electronics                  Passenger Property Loss    246
-                                     Passenger Property Loss    309
Other                                 Passenger Property Loss    327
Travel Accessories                    Passenger Property Loss    333
Jewelry & Watches                     Passenger Property Loss    448
Clothing                              Passenger Property Loss    480
Name: Claim Type, dtype: int64

In [43]:
loss_by_item_df = pd.DataFrame(loss.groupby('Item Category')['Claim Type'].value_counts())
loss_by_item_df.sort_values('Claim Type', ascending = False).head(20)


Out[43]:
Claim Type
Item Category Claim Type
Clothing Passenger Property Loss 480
Jewelry & Watches Passenger Property Loss 448
Travel Accessories Passenger Property Loss 333
Other Passenger Property Loss 327
- Passenger Property Loss 309
Personal Electronics Passenger Property Loss 246
Cosmetics & Grooming Passenger Property Loss 232
Computer & Accessories Passenger Property Loss 227
Personal Accessories Passenger Property Loss 149
Currency Passenger Property Loss 133
Baggage/Cases/Purses Passenger Property Loss 122
Food & Drink Passenger Property Loss 111
Medical/Science Passenger Property Loss 100
Cameras Passenger Property Loss 83
Hunting & Fishing Items Passenger Property Loss 77
Sporting Equipment & Supplies Passenger Property Loss 71
Audio/Video Passenger Property Loss 54
Tools & Home Improvement Supplies Passenger Property Loss 47
Automobile Parts & Accessories Passenger Property Loss 40
Jewelry & Watches; Jewelry & Watches Passenger Property Loss 39

4. Is there a correlation between the ten most lost items and when these get lost?


In [81]:
# convert dates into actual dates for python
# date format of 'Date Received': 28-May-15
from datetime import datetime
date_object = datetime.strptime('28-May-15', '%d-%b-%y')
date_object
#strftime.net


Out[81]:
datetime.datetime(2015, 5, 28, 0, 0)

In [251]:
right_formatted_dates = []
for date in df['Date Received']:
    #print(date)
    date_formatted = datetime.strptime(date, '%d-%b-%y')
    #print(date_formatted)
    right_formatted_dates.append(date_formatted)

df['right_dates'] = right_formatted_dates

In [252]:
new_table = df[df['Claim Type'] == 'Passenger Property Loss'] 
latest = new_table[['right_dates', 'Claim Type', 'Item Category', 'Claim Number']]
latest


Out[252]:
right_dates Claim Type Item Category Claim Number
2 2015-10-07 Passenger Property Loss Clothing 2015102326219
5 2015-08-05 Passenger Property Loss Baggage/Cases/Purses 2015081024734
6 2015-09-07 Passenger Property Loss Personal Electronics 2015092825697
7 2015-01-09 Passenger Property Loss Clothing 2015010919746
8 2015-02-27 Passenger Property Loss - 2015022720896
9 2015-04-06 Passenger Property Loss Clothing; Home Decor 2015040721792
11 2015-05-29 Passenger Property Loss Cameras; Computer & Accessories; Personal Elec... 2015060422932
12 2015-06-25 Passenger Property Loss Clothing 2015062523439
14 2015-07-21 Passenger Property Loss - 2015072124032
17 2015-08-18 Passenger Property Loss Personal Accessories 2015083125181
19 2015-09-04 Passenger Property Loss Jewelry & Watches 2015092325581
22 2015-10-26 Passenger Property Loss Jewelry & Watches 2015111626893
23 2015-10-29 Passenger Property Loss Jewelry & Watches 2015102926385
24 2015-11-03 Passenger Property Loss Other 2015112327063
25 2015-12-07 Passenger Property Loss Other 2015122127713
26 2015-07-20 Passenger Property Loss Clothing 2015080324436
27 2015-10-19 Passenger Property Loss Currency; Jewelry & Watches 2015110926712
30 2015-03-20 Passenger Property Loss Clothing; Clothing; Clothing; Clothing 2015032021422
31 2015-10-27 Passenger Property Loss Clothing 2015111826956
32 2015-07-15 Passenger Property Loss Currency; Currency; Other 2015072224105
34 2015-04-29 Passenger Property Loss Other 2015042922169
36 2015-06-30 Passenger Property Loss Hunting & Fishing Items 2015070823700
39 2015-09-16 Passenger Property Loss Outdoor Items 2015100525869
42 2015-07-01 Passenger Property Loss Clothing 2015071423780
45 2015-01-12 Passenger Property Loss Travel Accessories 2015011219787
46 2015-02-04 Passenger Property Loss Computer & Accessories 2015020420399
49 2015-02-18 Passenger Property Loss Other 2015021820652
50 2015-03-11 Passenger Property Loss Sporting Equipment & Supplies 2015031621239
52 2015-04-14 Passenger Property Loss Cosmetics & Grooming 2015041421906
54 2015-04-28 Passenger Property Loss Baggage/Cases/Purses 2015050522313
... ... ... ... ...
8568 2015-08-03 Passenger Property Loss Hunting & Fishing Items 2015080324457
8569 2015-10-01 Passenger Property Loss Currency 2015102126149
8610 2015-01-29 Passenger Property Loss - 2015012820265
8614 2015-03-16 Passenger Property Loss - 2015031621250
8617 2015-03-24 Passenger Property Loss - 2015032421459
8618 2015-04-09 Passenger Property Loss - 2015041321888
8619 2015-04-10 Passenger Property Loss Clothing; Clothing; Other 2015041021852
8620 2015-04-14 Passenger Property Loss Baggage/Cases/Purses; Clothing; Clothing 2015041721983
8622 2015-04-21 Passenger Property Loss Jewelry & Watches 2015050122261
8624 2015-05-08 Passenger Property Loss Cosmetics & Grooming 2015052222713
8625 2015-05-13 Passenger Property Loss Clothing; Medical/Science 2015052122646
8626 2015-05-20 Passenger Property Loss - 2015052622764
8628 2015-06-01 Passenger Property Loss Medical/Science 2015061523144
8631 2015-06-12 Passenger Property Loss Jewelry & Watches; Medical/Science 2015062223292
8633 2015-06-27 Passenger Property Loss Cosmetics & Grooming 2015070623670
8634 2015-07-09 Passenger Property Loss - 2015072124038
8639 2015-08-17 Passenger Property Loss - 2015081724898
8643 2015-09-09 Passenger Property Loss Personal Electronics 2015092125461
8644 2015-09-10 Passenger Property Loss - 2015092425598
8645 2015-09-28 Passenger Property Loss Jewelry & Watches 2015102626275
8651 2015-10-21 Passenger Property Loss - 2015102126161
8653 2015-10-26 Passenger Property Loss Personal Electronics 2015110926731
8654 2015-11-02 Passenger Property Loss Cameras; Clothing; Clothing; Clothing; Clothin... 2015110226446
8655 2015-11-02 Passenger Property Loss Books, Magazines & Other 2015110226479
8659 2015-11-16 Passenger Property Loss Cosmetics & Grooming 2015120727336
8660 2015-11-17 Passenger Property Loss Jewelry & Watches 2015120827394
8661 2015-11-17 Passenger Property Loss Clothing; Household Items 2015121427528
8664 2015-12-22 Passenger Property Loss Clothing 2016010428072
8665 2015-12-30 Passenger Property Loss Tools & Home Improvement Supplies 2016011328300
8666 2015-12-31 Passenger Property Loss Personal Accessories 2015123128015

4551 rows × 4 columns


In [253]:
Clothing = latest[latest['Item Category'] == 'Clothing']
clothing_count = Clothing['Item Category'].count()
clothing_count


Out[253]:
480

In [264]:
from datetime import date
for element in Clothing['right_dates']:
    if element.date() > date(2015, 1, 1) and element.date() < date(2015, 2, 1):
        clothing_count = Clothing['Item Category'].count()
        print(clothing_count)
# The idea is to loop though the dataframe latest and the clothing category (and subsequently for others --> function) loop through each month (here, so far only January) and count how many entries there are for this month, by doing .count() on any variable of that list (works, see cell above)


480
480
480
480
480
480
480
480
480
480
480
480
480
480
480
480
480
480
480
480
480
480
480
480
480
480
480
480
480
480
480
480
480
480
480
480
480
480
480

In [227]:
from datetime import date
d = date(2013, 12 ,22)
type(d)


Out[227]:
datetime.date

In [ ]:


In [124]:
Clothing = new_table[new_table['Item Category'] == 'Clothing']
Jewelry = new_table[new_table['Item Category'] == 'Jewelry & Watches']
Travel_Accessories = new_table[new_table['Item Category'] == 'Travel Accessories']
Personal_Electronics = new_table[new_table['Item Category'] == 'Personal Electronics']
Cosmetics = new_table[new_table['Item Category'] == 'Cosmetics & Grooming']
Computer = new_table[new_table['Item Category'] == 'Computer & Accessories']

plt.scatter(y= Clothing["date"], x= Clothing["date"], c='c', alpha=0.75, marker='x')

plt.scatter(y= Jewelry["date"], x= Jewelry["date"], c='y', alpha=0.75, marker='o')

plt.scatter(y= Travel_Accessories["date"], x= Travel_Accessories["date"], c='m', alpha=0.75, marker='v')

plt.scatter(y= Personal_Electronics["date"], x= Personal_Electronics["date"], c='m', alpha=0.75, marker='s')

plt.scatter(y= Cosmetics["date"], x= Cosmetics["date"], c='m', alpha=0.75, marker='.')

plt.scatter(y= Computer["date"], x= Computer["date"], c='m', alpha=0.75, marker='*')

#markers: http://matplotlib.org/api/markers_api.html
# make x a timeline: http://stackoverflow.com/questions/1574088/plotting-time-in-python-with-matplotlib
# https://blog.mafr.de/2012/03/11/time-series-data-with-matplotlib/


Out[124]:
<matplotlib.collections.PathCollection at 0x9b201d0>
Error in callback <function install_repl_displayhook.<locals>.post_execute at 0x06099A08> (for post_execute):
---------------------------------------------------------------------------
OverflowError                             Traceback (most recent call last)
c:\users\gianna-carina\appdata\local\programs\python\python35-32\lib\site-packages\matplotlib\pyplot.py in post_execute()
    145             def post_execute():
    146                 if matplotlib.is_interactive():
--> 147                     draw_all()
    148 
    149             # IPython >= 2

c:\users\gianna-carina\appdata\local\programs\python\python35-32\lib\site-packages\matplotlib\_pylab_helpers.py in draw_all(cls, force)
    148         for f_mgr in cls.get_all_fig_managers():
    149             if force or f_mgr.canvas.figure.stale:
--> 150                 f_mgr.canvas.draw_idle()
    151 
    152 atexit.register(Gcf.destroy_all)

c:\users\gianna-carina\appdata\local\programs\python\python35-32\lib\site-packages\matplotlib\backend_bases.py in draw_idle(self, *args, **kwargs)
   2024         if not self._is_idle_drawing:
   2025             with self._idle_draw_cntx():
-> 2026                 self.draw(*args, **kwargs)
   2027 
   2028     def draw_cursor(self, event):

c:\users\gianna-carina\appdata\local\programs\python\python35-32\lib\site-packages\matplotlib\backends\backend_agg.py in draw(self)
    472 
    473         try:
--> 474             self.figure.draw(self.renderer)
    475         finally:
    476             RendererAgg.lock.release()

c:\users\gianna-carina\appdata\local\programs\python\python35-32\lib\site-packages\matplotlib\artist.py in draw_wrapper(artist, renderer, *args, **kwargs)
     60     def draw_wrapper(artist, renderer, *args, **kwargs):
     61         before(artist, renderer)
---> 62         draw(artist, renderer, *args, **kwargs)
     63         after(artist, renderer)
     64 

c:\users\gianna-carina\appdata\local\programs\python\python35-32\lib\site-packages\matplotlib\figure.py in draw(self, renderer)
   1157         dsu.sort(key=itemgetter(0))
   1158         for zorder, a, func, args in dsu:
-> 1159             func(*args)
   1160 
   1161         renderer.close_group('figure')

c:\users\gianna-carina\appdata\local\programs\python\python35-32\lib\site-packages\matplotlib\artist.py in draw_wrapper(artist, renderer, *args, **kwargs)
     60     def draw_wrapper(artist, renderer, *args, **kwargs):
     61         before(artist, renderer)
---> 62         draw(artist, renderer, *args, **kwargs)
     63         after(artist, renderer)
     64 

c:\users\gianna-carina\appdata\local\programs\python\python35-32\lib\site-packages\matplotlib\axes\_base.py in draw(self, renderer, inframe)
   2317 
   2318         for zorder, a in dsu:
-> 2319             a.draw(renderer)
   2320 
   2321         renderer.close_group('axes')

c:\users\gianna-carina\appdata\local\programs\python\python35-32\lib\site-packages\matplotlib\artist.py in draw_wrapper(artist, renderer, *args, **kwargs)
     60     def draw_wrapper(artist, renderer, *args, **kwargs):
     61         before(artist, renderer)
---> 62         draw(artist, renderer, *args, **kwargs)
     63         after(artist, renderer)
     64 

c:\users\gianna-carina\appdata\local\programs\python\python35-32\lib\site-packages\matplotlib\axis.py in draw(self, renderer, *args, **kwargs)
   1106         renderer.open_group(__name__)
   1107 
-> 1108         ticks_to_draw = self._update_ticks(renderer)
   1109         ticklabelBoxes, ticklabelBoxes2 = self._get_tick_bboxes(ticks_to_draw,
   1110                                                                 renderer)

c:\users\gianna-carina\appdata\local\programs\python\python35-32\lib\site-packages\matplotlib\axis.py in _update_ticks(self, renderer)
    949 
    950         interval = self.get_view_interval()
--> 951         tick_tups = [t for t in self.iter_ticks()]
    952         if self._smart_bounds:
    953             # handle inverted limits

c:\users\gianna-carina\appdata\local\programs\python\python35-32\lib\site-packages\matplotlib\axis.py in <listcomp>(.0)
    949 
    950         interval = self.get_view_interval()
--> 951         tick_tups = [t for t in self.iter_ticks()]
    952         if self._smart_bounds:
    953             # handle inverted limits

c:\users\gianna-carina\appdata\local\programs\python\python35-32\lib\site-packages\matplotlib\axis.py in iter_ticks(self)
    892         Iterate through all of the major and minor ticks.
    893         """
--> 894         majorLocs = self.major.locator()
    895         majorTicks = self.get_major_ticks(len(majorLocs))
    896         self.major.formatter.set_locs(majorLocs)

c:\users\gianna-carina\appdata\local\programs\python\python35-32\lib\site-packages\matplotlib\dates.py in __call__(self)
   1005     def __call__(self):
   1006         'Return the locations of the ticks'
-> 1007         self.refresh()
   1008         return self._locator()
   1009 

c:\users\gianna-carina\appdata\local\programs\python\python35-32\lib\site-packages\matplotlib\dates.py in refresh(self)
   1025     def refresh(self):
   1026         'Refresh internal information based on current limits.'
-> 1027         dmin, dmax = self.viewlim_to_dt()
   1028         self._locator = self.get_locator(dmin, dmax)
   1029 

c:\users\gianna-carina\appdata\local\programs\python\python35-32\lib\site-packages\matplotlib\dates.py in viewlim_to_dt(self)
    769             vmin, vmax = vmax, vmin
    770 
--> 771         return num2date(vmin, self.tz), num2date(vmax, self.tz)
    772 
    773     def _get_unit(self):

c:\users\gianna-carina\appdata\local\programs\python\python35-32\lib\site-packages\matplotlib\dates.py in num2date(x, tz)
    417         tz = _get_rc_timezone()
    418     if not cbook.iterable(x):
--> 419         return _from_ordinalf(x, tz)
    420     else:
    421         x = np.asarray(x)

c:\users\gianna-carina\appdata\local\programs\python\python35-32\lib\site-packages\matplotlib\dates.py in _from_ordinalf(x, tz)
    269 
    270     ix = int(x)
--> 271     dt = datetime.datetime.fromordinal(ix).replace(tzinfo=UTC)
    272 
    273     remainder = float(x) - ix

OverflowError: Python int too large to convert to C long
---------------------------------------------------------------------------
OverflowError                             Traceback (most recent call last)
c:\users\gianna-carina\appdata\local\programs\python\python35-32\lib\site-packages\IPython\core\formatters.py in __call__(self, obj)
    337                 pass
    338             else:
--> 339                 return printer(obj)
    340             # Finally look for special method names
    341             method = _safe_get_formatter_method(obj, self.print_method)

c:\users\gianna-carina\appdata\local\programs\python\python35-32\lib\site-packages\IPython\core\pylabtools.py in <lambda>(fig)
    226 
    227     if 'png' in formats:
--> 228         png_formatter.for_type(Figure, lambda fig: print_figure(fig, 'png', **kwargs))
    229     if 'retina' in formats or 'png2x' in formats:
    230         png_formatter.for_type(Figure, lambda fig: retina_figure(fig, **kwargs))

c:\users\gianna-carina\appdata\local\programs\python\python35-32\lib\site-packages\IPython\core\pylabtools.py in print_figure(fig, fmt, bbox_inches, **kwargs)
    117 
    118     bytes_io = BytesIO()
--> 119     fig.canvas.print_figure(bytes_io, **kw)
    120     data = bytes_io.getvalue()
    121     if fmt == 'svg':

c:\users\gianna-carina\appdata\local\programs\python\python35-32\lib\site-packages\matplotlib\backend_bases.py in print_figure(self, filename, dpi, facecolor, edgecolor, orientation, format, **kwargs)
   2178                     orientation=orientation,
   2179                     dryrun=True,
-> 2180                     **kwargs)
   2181                 renderer = self.figure._cachedRenderer
   2182                 bbox_inches = self.figure.get_tightbbox(renderer)

c:\users\gianna-carina\appdata\local\programs\python\python35-32\lib\site-packages\matplotlib\backends\backend_agg.py in print_png(self, filename_or_obj, *args, **kwargs)
    525 
    526     def print_png(self, filename_or_obj, *args, **kwargs):
--> 527         FigureCanvasAgg.draw(self)
    528         renderer = self.get_renderer()
    529         original_dpi = renderer.dpi

c:\users\gianna-carina\appdata\local\programs\python\python35-32\lib\site-packages\matplotlib\backends\backend_agg.py in draw(self)
    472 
    473         try:
--> 474             self.figure.draw(self.renderer)
    475         finally:
    476             RendererAgg.lock.release()

c:\users\gianna-carina\appdata\local\programs\python\python35-32\lib\site-packages\matplotlib\artist.py in draw_wrapper(artist, renderer, *args, **kwargs)
     60     def draw_wrapper(artist, renderer, *args, **kwargs):
     61         before(artist, renderer)
---> 62         draw(artist, renderer, *args, **kwargs)
     63         after(artist, renderer)
     64 

c:\users\gianna-carina\appdata\local\programs\python\python35-32\lib\site-packages\matplotlib\figure.py in draw(self, renderer)
   1157         dsu.sort(key=itemgetter(0))
   1158         for zorder, a, func, args in dsu:
-> 1159             func(*args)
   1160 
   1161         renderer.close_group('figure')

c:\users\gianna-carina\appdata\local\programs\python\python35-32\lib\site-packages\matplotlib\artist.py in draw_wrapper(artist, renderer, *args, **kwargs)
     60     def draw_wrapper(artist, renderer, *args, **kwargs):
     61         before(artist, renderer)
---> 62         draw(artist, renderer, *args, **kwargs)
     63         after(artist, renderer)
     64 

c:\users\gianna-carina\appdata\local\programs\python\python35-32\lib\site-packages\matplotlib\axes\_base.py in draw(self, renderer, inframe)
   2317 
   2318         for zorder, a in dsu:
-> 2319             a.draw(renderer)
   2320 
   2321         renderer.close_group('axes')

c:\users\gianna-carina\appdata\local\programs\python\python35-32\lib\site-packages\matplotlib\artist.py in draw_wrapper(artist, renderer, *args, **kwargs)
     60     def draw_wrapper(artist, renderer, *args, **kwargs):
     61         before(artist, renderer)
---> 62         draw(artist, renderer, *args, **kwargs)
     63         after(artist, renderer)
     64 

c:\users\gianna-carina\appdata\local\programs\python\python35-32\lib\site-packages\matplotlib\axis.py in draw(self, renderer, *args, **kwargs)
   1106         renderer.open_group(__name__)
   1107 
-> 1108         ticks_to_draw = self._update_ticks(renderer)
   1109         ticklabelBoxes, ticklabelBoxes2 = self._get_tick_bboxes(ticks_to_draw,
   1110                                                                 renderer)

c:\users\gianna-carina\appdata\local\programs\python\python35-32\lib\site-packages\matplotlib\axis.py in _update_ticks(self, renderer)
    949 
    950         interval = self.get_view_interval()
--> 951         tick_tups = [t for t in self.iter_ticks()]
    952         if self._smart_bounds:
    953             # handle inverted limits

c:\users\gianna-carina\appdata\local\programs\python\python35-32\lib\site-packages\matplotlib\axis.py in <listcomp>(.0)
    949 
    950         interval = self.get_view_interval()
--> 951         tick_tups = [t for t in self.iter_ticks()]
    952         if self._smart_bounds:
    953             # handle inverted limits

c:\users\gianna-carina\appdata\local\programs\python\python35-32\lib\site-packages\matplotlib\axis.py in iter_ticks(self)
    892         Iterate through all of the major and minor ticks.
    893         """
--> 894         majorLocs = self.major.locator()
    895         majorTicks = self.get_major_ticks(len(majorLocs))
    896         self.major.formatter.set_locs(majorLocs)

c:\users\gianna-carina\appdata\local\programs\python\python35-32\lib\site-packages\matplotlib\dates.py in __call__(self)
   1005     def __call__(self):
   1006         'Return the locations of the ticks'
-> 1007         self.refresh()
   1008         return self._locator()
   1009 

c:\users\gianna-carina\appdata\local\programs\python\python35-32\lib\site-packages\matplotlib\dates.py in refresh(self)
   1025     def refresh(self):
   1026         'Refresh internal information based on current limits.'
-> 1027         dmin, dmax = self.viewlim_to_dt()
   1028         self._locator = self.get_locator(dmin, dmax)
   1029 

c:\users\gianna-carina\appdata\local\programs\python\python35-32\lib\site-packages\matplotlib\dates.py in viewlim_to_dt(self)
    769             vmin, vmax = vmax, vmin
    770 
--> 771         return num2date(vmin, self.tz), num2date(vmax, self.tz)
    772 
    773     def _get_unit(self):

c:\users\gianna-carina\appdata\local\programs\python\python35-32\lib\site-packages\matplotlib\dates.py in num2date(x, tz)
    417         tz = _get_rc_timezone()
    418     if not cbook.iterable(x):
--> 419         return _from_ordinalf(x, tz)
    420     else:
    421         x = np.asarray(x)

c:\users\gianna-carina\appdata\local\programs\python\python35-32\lib\site-packages\matplotlib\dates.py in _from_ordinalf(x, tz)
    269 
    270     ix = int(x)
--> 271     dt = datetime.datetime.fromordinal(ix).replace(tzinfo=UTC)
    272 
    273     remainder = float(x) - ix

OverflowError: Python int too large to convert to C long
<matplotlib.figure.Figure at 0x9a98950>

5. What is the airport with most property damage?


In [46]:
damage = df[df['Claim Type'] == 'Property Damage']
damage['Claim Type'].value_counts()


Out[46]:
Property Damage    3888
Name: Claim Type, dtype: int64

In [47]:
damage_by_airport = damage.groupby('Airport Name')['Claim Type'].value_counts()
damage_by_airport.sort_values().tail(30)


Out[47]:
Airport Name                                      Claim Type     
John Wayne                                        Property Damage     33
Dallas Love Field                                 Property Damage     33
Metropolitan Oakland International                Property Damage     35
Honolulu International Airport                    Property Damage     36
Salt Lake City International Airport              Property Damage     37
New Orleans International                         Property Damage     37
Ronald Reagan Washington National Airport         Property Damage     39
Baltimore/Washington Intl Thurgood Marshall       Property Damage     47
Minneapolis/ St.Paul International Airport        Property Damage     47
Tampa International                               Property Damage     54
Charlotte/Douglas International Airport           Property Damage     59
San Diego International                           Property Damage     66
LaGuardia                                         Property Damage     66
Houston - George Bush Intercontinental Airport    Property Damage     72
Philadelphia International Airport                Property Damage     79
Washington Dulles International                   Property Damage     80
Denver International Airport                      Property Damage     80
Ft. Lauderdale-Hollywood International            Property Damage     81
Phoenix Sky Harbor International                  Property Damage     84
Dallas-Fort Worth International Airport           Property Damage     91
Boston (Logan) International Airport              Property Damage     98
McCarran International                            Property Damage    105
Seattle-Tacoma International                      Property Damage    105
Newark International Airport                      Property Damage    119
Miami International Airport                       Property Damage    130
Chicago O'Hare International Airport              Property Damage    135
Orlando International Airport                     Property Damage    154
Hartsfield-Jackson Atlanta International Airport  Property Damage    178
John F. Kennedy International                     Property Damage    218
Los Angeles International Airport                 Property Damage    225
Name: Claim Type, dtype: int64

6. How many of the claims were granted, how many denied?


In [48]:
end = df['Disposition'].value_counts()
end


Out[48]:
Deny               3574
-                  2066
Approve in Full    1958
Settle             1069
Name: Disposition, dtype: int64

In [49]:
plt.style.use('ggplot')
end.plot(kind='pie')


Out[49]:
<matplotlib.axes._subplots.AxesSubplot at 0x8683930>

7. How do airlines compare on denial and full approval of claims?


In [50]:
approval = df[df['Disposition'] == 'Approve in Full']
approval['Disposition'].value_counts()


Out[50]:
Approve in Full    1958
Name: Disposition, dtype: int64

In [51]:
approval_by_airline = approval.groupby('Airline Name')['Disposition'].value_counts()
approval_by_airline.sort_values().tail(20)


Out[51]:
Airline Name                Disposition    
Sun Country Airlines Inc    Approve in Full      4
Turkish Airlines            Approve in Full      4
Virgin Atlantic             Approve in Full      5
Air Canada                  Approve in Full      6
Virgin America              Approve in Full     13
Westjet Airlines Ltd        Approve in Full     14
Lufthansa                   Approve in Full     14
British Airways             Approve in Full     17
Hawaiian Airlines           Approve in Full     26
Allegiant Air               Approve in Full     27
Spirit Airlines             Approve in Full     35
Frontier Airlines           Approve in Full     37
Alaska Airlines             Approve in Full     77
Jet Blue                    Approve in Full     79
-                           Approve in Full    138
USAir                       Approve in Full    160
American Airlines           Approve in Full    250
UAL                         Approve in Full    252
Delta Air Lines             Approve in Full    337
Southwest Airlines          Approve in Full    387
Name: Disposition, dtype: int64

In [52]:
denial = df[df['Disposition'] == 'Deny']
denial['Disposition'].value_counts()


Out[52]:
Deny    3574
Name: Disposition, dtype: int64

In [53]:
denial_by_airline = denial.groupby('Airline Name')['Disposition'].value_counts()
denial_by_airline.sort_values().tail(20)


Out[53]:
Airline Name            Disposition
Emirates                Deny             8
Etihad Airways          Deny             9
Lufthansa               Deny            13
Air France              Deny            13
Westjet Airlines Ltd    Deny            19
Air Canada              Deny            26
Virgin America          Deny            40
Allegiant Air           Deny            41
Hawaiian Airlines       Deny            60
Frontier Airlines       Deny            61
Spirit Airlines         Deny            72
British Airways         Deny            86
Alaska Airlines         Deny           145
-                       Deny           150
Jet Blue                Deny           161
USAir                   Deny           273
UAL                     Deny           459
American Airlines       Deny           496
Delta Air Lines         Deny           590
Southwest Airlines      Deny           658
Name: Disposition, dtype: int64

8. What is the average close amount? What is mean/max?


In [65]:
float_amount = df['Close Amount'].str.replace('$','').str.replace('-','0').str.replace(',','').astype(float)
df['Amount_float'] = float_amount
df.head(3)
df['Amount_float'].describe()


Out[65]:
count    8667.000000
mean       70.513101
std       216.240594
min         0.000000
25%         0.000000
50%         0.000000
75%        50.000000
max      5403.460000
Name: Amount_float, dtype: float64

9. Per airline, what is the average close amount?


In [66]:
df.groupby('Airline Name')['Amount_float'].mean().sort_values().tail(50)


Out[66]:
Airline Name
Allegiant Air                             63.900400
Delta Air Lines                           65.665711
Carribean Airlines                        66.104545
American Airlines                         66.141277
Royal Jordanian Airline                   66.750000
Emirates                                  67.813333
UAL                                       68.288594
Southwest Airlines                        68.696680
Air Canada                                69.911765
EVA airlines                              70.000000
Scandinavian Airlines Systems             70.535000
Island Air                                70.780000
Frontier Airlines                         70.792516
Spirit Airlines                           70.916341
Aero Flot                                 71.428571
USAir                                     72.987410
Pakistan International Airlines Cor       75.000000
OMNI Air Express- (OY)                    85.795000
Polish Airlines                           88.420000
Jet Blue                                  89.367430
Skywest Airlines Inc USA                  90.000000
Virgin Atlantic                           94.201923
Philippine Airlines                       99.000000
Great Lakes Airline                      100.000000
Austrian Airlines                        100.000000
Qantas Airlines                          101.980000
-                                        102.058140
Lufthansa                                103.968667
China Eastern Airlines                   105.400000
Aer Lingus                               107.877000
Copa Airlines                            111.935000
Southeast Airlines                       122.500000
LOT Polish Airlines                      131.633333
Lan Airlines                             136.905000
Aerolineas Argentinas                    143.716667
Etihad Airways                           148.070476
Air Berlin                               152.816250
Turkish Airlines                         168.130667
CanJet                                   170.000000
Cathay Pacific - CX                      181.271818
Silverjet                                186.275000
Asiana Airlines                          223.003750
Horizon Air                              235.640000
Norwegian Airlines                       262.878571
Air India                                286.185000
KLM Royal Dutch Airlines                 300.000000
SunWing Airlines                         377.400000
Czech Airlines                           388.890000
Iceland Air                              504.000000
Air New Zealand                          517.700000
Name: Amount_float, dtype: float64

In [76]:
close_df = pd.DataFrame(df.groupby('Airline Name')['Amount_float'].mean())
close_df.sort_values(by='Amount_float', ascending=True).tail(20).plot(kind="barh", legend=False, figsize=(10,10))


Out[76]:
<matplotlib.axes._subplots.AxesSubplot at 0x95222d0>

10. Per item category, what is the average close amount?


In [225]:
#loss['Amount_float'] = float_amount
#loss.groupby('Item Category')['Amount_float'].mean()

category_amount = pd.DataFrame(df.groupby('Item Category')['Amount_float'].mean())
cleaned_category_amount = category_amount['Amount_float'] != 0
category_amount[cleaned_category_amount].sort_values('Amount_float', ascending=False).head(10)

#how to take only those with one entry? ie only "Audio/Video" instead of "Audio/Video; Audio/Video"?
#--> exclude all the cells that have a ; in it as that is how multiple entries are separated?


Out[225]:
Amount_float
Item Category
Cameras; Computer & Accessories; Jewelry & Watches 1100.00
Baggage/Cases/Purses; Computer & Accessories; Travel Accessories 1083.12
Computer & Accessories; Cosmetics & Grooming; Household Items; Personal Electronics 724.76
Cosmetics & Grooming; Cosmetics & Grooming; Cosmetics & Grooming; Medical/Science 690.88
Travel Accessories; Travel Accessories 500.00
Clothing; Cosmetics & Grooming; Cosmetics & Grooming; Jewelry & Watches 500.00
Currency; Jewelry & Watches; Jewelry & Watches; Jewelry & Watches; Jewelry & Watches; Jewelry & Watches 500.00
Clothing; Hunting & Fishing Items; Other 448.66
Jewelry & Watches; Jewelry & Watches; Other 433.50
Home Decor; Other; Other 425.00

In [ ]: