Cleaning: Flight Delays

This data set comes from the featured Kaggle datasets, containing three tables that relate to commercial airline flights. The flight delay and cancellation data was collected and published by the DOT's Bureau of Transportation Statistics.

There are three tables:

  • airlines : contains airline IATA codes with corresponding airline names
  • airports : contains the following fields:

    IATA_CODE: Location Identifier (string)

    AIRPORT: Airport's Name (string)

    CITY: City Name of the Airport (string)

    STATE: State Name of the Airport (string)

    COUNTRY: Country Name of the Airport (string)

    LATITUDE: Latitude of the Airport (numeric)

    LONGITUDE: Longitude of the Airport (numeric)

  • flights : contains many features; descriptions can be found on the Kaggle dataset page by selecting flights.csv in the data section and clicking Edit descriptions above the data preview.

1. Convert column names to lower case (it's a pain to keep typing upper case column names)


In [3]:
import pandas as pd
import numpy as np

flights = pd.read_csv('flights/flights_sm_raw.csv')
airlines = pd.read_csv('flights/airlines.csv')
airports = pd.read_csv('flights/airports.csv')

In [4]:
f_names = [name.lower() for name in list(flights.columns)]
l_names = [name.lower() for name in list(airlines.columns)]
p_names = [name.lower() for name in list(airports.columns)]

flights.columns = f_names
airlines.columns = l_names
airports.columns = p_names

In [5]:
flights.head()


Out[5]:
year month day day_of_week airline flight_number tail_number origin_airport destination_airport scheduled_departure ... arrival_time arrival_delay diverted cancelled cancellation_reason air_system_delay security_delay airline_delay late_aircraft_delay weather_delay
0 2015 1 1 4 US 2013 N584UW LAX CLT 30 ... 753.0 -10.0 0 0 NaN NaN NaN NaN NaN NaN
1 2015 1 1 4 AA 1674 N853AA LAS MIA 35 ... 753.0 -10.0 0 0 NaN NaN NaN NaN NaN NaN
2 2015 1 1 4 DL 1560 N3743H ANC SEA 45 ... 451.0 -24.0 0 0 NaN NaN NaN NaN NaN NaN
3 2015 1 1 4 AA 371 N3GXAA SEA MIA 100 ... 939.0 1.0 0 0 NaN NaN NaN NaN NaN NaN
4 2015 1 1 4 AA 115 N3CTAA LAX MIA 105 ... 839.0 -12.0 0 0 NaN NaN NaN NaN NaN NaN

5 rows × 31 columns


In [6]:
airlines.head()


Out[6]:
iata_code airline
0 UA United Air Lines Inc.
1 AA American Airlines Inc.
2 US US Airways Inc.
3 F9 Frontier Airlines Inc.
4 B6 JetBlue Airways

In [7]:
airports.head()


Out[7]:
iata_code airport city state country latitude longitude
0 ABE Lehigh Valley International Airport Allentown PA USA 40.65236 -75.44040
1 ABI Abilene Regional Airport Abilene TX USA 32.41132 -99.68190
2 ABQ Albuquerque International Sunport Albuquerque NM USA 35.04022 -106.60919
3 ABR Aberdeen Regional Airport Aberdeen SD USA 45.44906 -98.42183
4 ABY Southwest Georgia Regional Airport Albany GA USA 31.53552 -84.19447

2. Remove all rows that have null values for critical features

  • critical features are those that are needed to compute missing values
  • you need to determine the sets of features that go together for computing values
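Determining these subsets matters because pandas' dropna treats them differently depending on the how argument; a quick toy reminder (the frame here is hypothetical, not from the dataset):

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({'a': [1, np.nan, np.nan],
                   'b': [1, 2, np.nan]})

# how='any' (the default) drops a row if ANY listed column is null;
# how='all' drops it only when ALL listed columns are null
print(len(df.dropna(subset=['a', 'b'])))             # 1
print(len(df.dropna(subset=['a', 'b'], how='all')))  # 2
```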

In [8]:
flights.iloc[0:5, 9:23]


Out[8]:
scheduled_departure departure_time departure_delay taxi_out wheels_off scheduled_time elapsed_time air_time distance wheels_on taxi_in scheduled_arrival arrival_time arrival_delay
0 30 44.0 14.0 13.0 57.0 273.0 249.0 228.0 2125 745.0 8.0 803 753.0 -10.0
1 35 27.0 -8.0 21.0 48.0 268.0 266.0 238.0 2174 746.0 7.0 803 753.0 -10.0
2 45 31.0 -14.0 25.0 56.0 210.0 200.0 171.0 1448 447.0 4.0 515 451.0 -24.0
3 100 52.0 -8.0 30.0 122.0 338.0 347.0 311.0 2724 933.0 6.0 938 939.0 1.0
4 105 103.0 -2.0 14.0 117.0 286.0 276.0 255.0 2342 832.0 7.0 851 839.0 -12.0

We have these features:

  1. YEAR: critical
  2. MONTH: critical
  3. DAY: critical
  4. DAY_OF_WEEK: may be derived from year, month and day
  5. AIRLINE: critical
  6. FLIGHT_NUMBER: critical
  7. TAIL_NUMBER: critical
  8. ORIGIN_AIRPORT: critical
  9. DESTINATION_AIRPORT: critical
  10. SCHEDULED_DEPARTURE: critical, timepoint
  11. DEPARTURE_TIME = WHEELS_OFF - TAXI_OUT: critical, timepoint; we'll derive taxi_out from here
  12. DEPARTURE_DELAY = DEPARTURE_TIME-SCHEDULED_DEPARTURE
  13. TAXI_OUT: derived from departure_time and wheels_off
  14. WHEELS_OFF: critical, timepoint
  15. SCHEDULED_TIME = SCHEDULED_ARRIVAL-SCHEDULED_DEPARTURE: critical; cannot be calculated because of timezones
  16. ELAPSED_TIME = AIR_TIME+TAXI_IN+TAXI_OUT
  17. AIR_TIME = WHEELS_ON - WHEELS_OFF: I can theoretically calculate it but the calculation won't be precise because of timezones
  18. DISTANCE: critical
  19. WHEELS_ON: critical, timepoint
  20. TAXI_IN: derived from arrival_time and wheels_on
  21. SCHEDULED_ARRIVAL: critical, timepoint
  22. ARRIVAL_TIME = WHEELS_ON+TAXI_IN: critical, timepoint
  23. ARRIVAL_DELAY = ARRIVAL_TIME-SCHEDULED_ARRIVAL
  24. DIVERTED: critical
  25. CANCELLED: critical
  26. CANCELLATION_REASON: critical
  27. AIR_SYSTEM_DELAY: ?
  28. SECURITY_DELAY: ?
  29. AIRLINE_DELAY: ?
  30. LATE_AIRCRAFT_DELAY: ?
  31. WEATHER_DELAY: ?
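One caveat on the arithmetic above: the time columns are HHMM-encoded integers (e.g. 130 means 01:30), so differences like DEPARTURE_DELAY = DEPARTURE_TIME - SCHEDULED_DEPARTURE must be taken in minutes, not on the raw values. A minimal sketch with a hypothetical helper (`hhmm_to_minutes` is not part of the notebook):

```python
def hhmm_to_minutes(hhmm):
    """Convert an HHMM-encoded time (e.g. 130 == 01:30) to minutes past midnight."""
    hhmm = int(hhmm)
    return (hhmm // 100) * 60 + hhmm % 100

# Raw subtraction gives 210 - 155 = 55, but the real gap is 15 minutes
delay = hhmm_to_minutes(210) - hhmm_to_minutes(155)
print(delay)  # 15
```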

The count column gives an idea of missing values for the following features:


In [9]:
flights.iloc[:, 9:23].describe().transpose()


Out[9]:
count mean std min 25% 50% 75% max
scheduled_departure 662804.0 1334.697796 485.897622 1.0 919.0 1330.0 1735.0 2359.0
departure_time 576651.0 1335.741712 496.812987 1.0 921.0 1330.0 1740.0 2400.0
departure_delay 576651.0 9.665097 37.730697 -45.0 -5.0 -2.0 7.0 1521.0
taxi_out 573757.0 16.055921 8.847659 1.0 11.0 14.0 19.0 185.0
wheels_off 573757.0 1356.680891 498.284208 1.0 935.0 1342.0 1754.0 2400.0
scheduled_time 662798.0 139.527372 74.158283 18.0 85.0 120.0 170.0 718.0
elapsed_time 571395.0 136.861871 74.173957 15.0 82.0 118.0 168.0 718.0
air_time 571395.0 113.388771 72.182318 7.0 60.0 94.0 144.0 669.0
distance 662804.0 802.052239 598.923492 21.0 363.0 632.0 1037.0 4983.0
wheels_on 572641.0 1470.117538 522.510025 1.0 1054.0 1507.0 1911.0 2400.0
taxi_in 572641.0 7.432660 5.628520 1.0 4.0 6.0 9.0 175.0
scheduled_arrival 662804.0 1496.783230 509.641339 1.0 1109.0 1523.0 1922.0 2400.0
arrival_time 572641.0 1475.421535 526.463841 1.0 1058.0 1511.0 1916.0 2400.0
arrival_delay 571395.0 4.404344 39.078085 -82.0 -13.0 -5.0 8.0 1499.0

I'm selecting the critical features among the ones above and deleting all rows which have null values for them:


In [11]:
critical = ['scheduled_departure', 'departure_time', 'wheels_off', 'scheduled_time',
            'wheels_on', 'scheduled_arrival', 'arrival_time']

flights.dropna(subset=critical, inplace=True)

len(flights)


Out[11]:
572641

The solution goes like this:


In [ ]:
cancelled = flights[flights.cancelled == 1].copy()
not_cancelled = flights[flights.cancelled == 0].copy()

In [ ]:
subset1 = ['departure_time', 'taxi_out', 'wheels_off']
subset2 = ['scheduled_departure', 'departure_delay']
subset3 = ['wheels_off', 'air_time', 'wheels_on']
subset4 = ['taxi_in', 'wheels_on', 'arrival_time']
subset5 = ['scheduled_arrival', 'arrival_delay']

from itertools import combinations

for subset in [subset1, subset3, subset4]:
    for c in combinations(subset, 2):
        not_cancelled.dropna(how='all', subset=list(c), inplace=True)

for subset in [subset2, subset5]:
    not_cancelled.dropna(how='all', subset=subset, inplace=True)

# sample is for randomizing order
data = pd.concat([cancelled, not_cancelled]).sample(frac=1).reset_index(drop=True)
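A toy illustration of the sample(frac=1) shuffle idiom used above (the small frame and the fixed random_state are only for demonstration):

```python
import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3, 4]})

# sample(frac=1) draws every row in random order;
# reset_index(drop=True) then discards the shuffled index
shuffled = df.sample(frac=1, random_state=0).reset_index(drop=True)

assert sorted(shuffled['a']) == [1, 2, 3, 4]   # same rows
assert list(shuffled.index) == [0, 1, 2, 3]    # clean 0..n-1 index
```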

3. Convert times from float values to full 24-hour time format strings


In [12]:
times = ['scheduled_departure', 'departure_time', 'wheels_off', 'wheels_on', 'scheduled_arrival', 'arrival_time']
for col in times:
    flights[col] = flights[col].apply(lambda x: str(int(x)).zfill(4)[:2] + ':' + str(int(x)).zfill(4)[2:] + ':00')

flights[times].head()


Out[12]:
scheduled_departure departure_time wheels_off wheels_on scheduled_arrival arrival_time
0 00:30:00 00:44:00 00:57:00 07:45:00 08:03:00 07:53:00
1 00:35:00 00:27:00 00:48:00 07:46:00 08:03:00 07:53:00
2 00:45:00 00:31:00 00:56:00 04:47:00 05:15:00 04:51:00
3 01:00:00 00:52:00 01:22:00 09:33:00 09:38:00 09:39:00
4 01:05:00 01:03:00 01:17:00 08:32:00 08:51:00 08:39:00

4. Convert new hour strings to datetime objects

  • Test the function and determine the cause of (and fix) any errors

The cause is that some columns have 24:00:00 as a value; I'm guessing this means midnight of the following day, so I'm creating a datetime value that includes the day and adding one day in these cases.


In [19]:
# hour must be in 0..23
for col in times:
    print(col)
    print(flights[flights[col] > '23:59:59'][col].unique())


scheduled_departure
[]
departure_time
['24:00:00']
wheels_off
['24:00:00']
wheels_on
['24:00:00']
scheduled_arrival
[]
arrival_time
['24:00:00']

In [20]:
def convert_dt(year, month, day, time):
    if time == '24:00:00':
        timestamp = pd.to_datetime(str(year) + '-' + str(month) + '-' + str(day) + ' ' + '00:00:00') + pd.Timedelta(1, unit='D')
    else:
        timestamp = pd.to_datetime(str(year) + '-' + str(month) + '-' + str(day) + ' ' + time)
    return timestamp

for col in times:
    flights[col] = flights.apply(lambda x: convert_dt(x['year'], x['month'], x['day'], x[col]), axis=1)

In [ ]:
# solution for testing the function:
c = flights.columns.get_loc('departure_time') + 1
_ = [hour_to_datetime(r, c) for r in data.itertuples()]

In [21]:
flights[times].head()


Out[21]:
scheduled_departure departure_time wheels_off wheels_on scheduled_arrival arrival_time
0 2015-01-01 00:30:00 2015-01-01 00:44:00 2015-01-01 00:57:00 2015-01-01 07:45:00 2015-01-01 08:03:00 2015-01-01 07:53:00
1 2015-01-01 00:35:00 2015-01-01 00:27:00 2015-01-01 00:48:00 2015-01-01 07:46:00 2015-01-01 08:03:00 2015-01-01 07:53:00
2 2015-01-01 00:45:00 2015-01-01 00:31:00 2015-01-01 00:56:00 2015-01-01 04:47:00 2015-01-01 05:15:00 2015-01-01 04:51:00
3 2015-01-01 01:00:00 2015-01-01 00:52:00 2015-01-01 01:22:00 2015-01-01 09:33:00 2015-01-01 09:38:00 2015-01-01 09:39:00
4 2015-01-01 01:05:00 2015-01-01 01:03:00 2015-01-01 01:17:00 2015-01-01 08:32:00 2015-01-01 08:51:00 2015-01-01 08:39:00

5. Fill in missing values that can be calculated from subsets (our critical features)

These are the features I can calculate and the count column indicates which ones need to be filled:


In [22]:
calculated = ['departure_delay', 'taxi_out', 'elapsed_time', 'taxi_in', 'arrival_delay', 'air_time']
flights[calculated].describe().transpose()

# departure_delay = departure_time - scheduled_departure
# taxi_out: wheels_off - departure_time
# elapsed_time = air_time + taxi_in + taxi_out
# taxi_in: arrival_time - wheels_on
# arrival_delay = arrival_time - scheduled_arrival
# air_time = wheels_on - wheels_off


Out[22]:
count mean std min 25% 50% 75% max
departure_delay 572641.0 9.326114 36.758575 -45.0 -5.0 -2.0 7.0 1521.0
taxi_out 572641.0 16.050209 8.834925 1.0 11.0 14.0 19.0 185.0
elapsed_time 571395.0 136.861871 74.173957 15.0 82.0 118.0 168.0 718.0
taxi_in 572641.0 7.432660 5.628520 1.0 4.0 6.0 9.0 175.0
arrival_delay 571395.0 4.404344 39.078085 -82.0 -13.0 -5.0 8.0 1499.0
air_time 571395.0 113.388771 72.182318 7.0 60.0 94.0 144.0 669.0

In [54]:
to_fill = flights[flights['air_time'].isnull()]
diff = (to_fill['wheels_on'] - to_fill['wheels_off']).astype('timedelta64[m]')
diff = diff.apply(lambda x: x + 24*60 if x < 0 else x)

flights.loc[to_fill.index, 'air_time'] = diff

In [55]:
to_fill = flights[flights['arrival_delay'].isnull()]
diff = (to_fill['arrival_time'] - to_fill['scheduled_arrival']).astype('timedelta64[m]')
diff = diff.apply(lambda x: x + 24*60 if x < 0 else x)

flights.loc[to_fill.index, 'arrival_delay'] = diff

In [56]:
to_fill = flights[flights['elapsed_time'].isnull()]
diff = flights['air_time'] + flights['taxi_in'] + flights['taxi_out']

flights.loc[to_fill.index, 'elapsed_time'] = diff

In [57]:
flights[calculated].describe().transpose()


Out[57]:
count mean std min 25% 50% 75% max
departure_delay 572641.0 9.326114 36.758575 -45.0 -5.0 -2.0 7.0 1521.0
taxi_out 572641.0 16.050209 8.834925 1.0 11.0 14.0 19.0 185.0
elapsed_time 572641.0 137.337194 75.215931 15.0 82.0 118.0 169.0 1384.0
taxi_in 572641.0 7.432660 5.628520 1.0 4.0 6.0 9.0 175.0
arrival_delay 572641.0 4.845046 40.820702 -82.0 -13.0 -5.0 8.0 1499.0
air_time 572641.0 113.854324 73.219570 7.0 60.0 94.0 144.0 1366.0

6. Filter out rows that contain numeric airport codes

  • there is a mix of character IATA codes and numeric strings

I'm going to match flights against airports and delete the rows that don't match. I reset the index so that the rows in the merged dataset line up with the rows in the original one (some rows were dropped earlier, so this is necessary):


In [59]:
unmatched = pd.merge(flights.reset_index()[['index', 'origin_airport']], airports, left_on='origin_airport', right_on='iata_code', how='left').set_index('index')

flights.drop(unmatched[unmatched['iata_code'].isnull()].index, inplace=True)

The previous step already removed all the mismatched rows, so this step isn't strictly necessary:


In [60]:
unmatched = pd.merge(flights.reset_index()[['index', 'destination_airport']], airports, left_on='destination_airport', right_on='iata_code', how='left').set_index('index')
unmatched[unmatched['iata_code'].isnull()]


Out[60]:
destination_airport iata_code airport city state country latitude longitude
index

In [61]:
len(flights)


Out[61]:
524141
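An alternative sketch for this filtering step (not the merge-based approach used above) drops numeric codes with a plain string test; the mini-frame is hypothetical:

```python
import pandas as pd

flights_demo = pd.DataFrame({'origin_airport': ['LAX', '10397', 'SEA']})

# IATA codes are alphabetic, while DOT numeric airport IDs are all digits
mask = ~flights_demo['origin_airport'].str.isnumeric()
filtered = flights_demo[mask]
print(list(filtered['origin_airport']))  # ['LAX', 'SEA']
```

Note that the merge approach additionally validates codes against the airports table, which the string test does not.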

7. BONUS: Write a function and script to correct dates for arrivals that are overnight flights

  • don't actually run this code (time consuming)

This should be quite similar to the function I wrote above, only checking if the time difference between what I'm looking at and the first time in the row is negative and adding one day in this case.


In [ ]:
times2 = ['departure_time', 'wheels_off', 'wheels_on', 'scheduled_arrival', 'arrival_time']

def correct_overnight(time, starttime):
    # both values are Timestamps here, so compare them directly
    if time < starttime:
        timestamp = time + pd.Timedelta(1, unit='D')
    else:
        timestamp = time
    return timestamp

for col in times2:
    flights[col] = flights.apply(lambda x: correct_overnight(x[col], x['scheduled_departure']), axis=1)

In [ ]:
# from solution: x.replace(day=day)

Cleaning: Legos

colors : This file contains information on LEGO colors, including a unique ID for each color, its name, its approximate RGB value, and whether it's transparent.

inventories : This table contains information on inventories, including a unique ID, its version and the set number.

inventory_parts : This table contains information on part inventories, including a unique ID number, the part number, the color of the part, how many are included and whether it's a spare.

inventory_sets : This file contains information on what inventory is included in which sets, including the inventory ID, the set number and the quantity of that inventory that is included.

part_categories : This dataset includes information on the part category (what type of part it is) and a unique ID for that part category.

parts : This dataset includes information on lego parts, including a unique ID number, the name of the part, and what part category it's from.

sets : This file contains information on LEGO sets, including a unique ID number, the name of the set, the year it was released, its theme and how many parts it includes.

themes : This file includes information on lego themes. Each theme is given a unique ID number, a name, and (if it's part of a bigger theme) which theme it's part of.

Schema

1. Efficiently import all data files into a dictionary for easy access


In [62]:
tables = ['colors', 'inventories', 'inventory_parts', 'inventory_sets', 'part_categories', 'parts', 'sets', 'themes']
legos = {}

for table in tables:
    legos[table] = pd.read_csv('legos/' + table + '.csv')

In [63]:
legos['colors'].head()


Out[63]:
id name rgb is_trans
0 -1 Unknown 0033B2 f
1 0 Black 05131D f
2 1 Blue 0055BF f
3 2 Green 237841 f
4 3 Dark Turquoise 008F9B f

2. Create a table that provides data that can be used to analyze colors by category

  • We do not want detailed part names, part numbers, or category ID
  • We also do not want any other ID values
  • Make sure the color name and category name columns are labeled clearly

In [64]:
# Merge colors and inventory_parts and dropping columns
color_cat = pd.merge(legos['colors'], legos['inventory_parts'], left_on='id', right_on='color_id', how='left')
color_cat.drop(['id', 'inventory_id', 'color_id', 'is_spare'], axis=1, inplace=True)
# Merge with parts and dropping columns
color_cat = pd.merge(color_cat, legos['parts'], left_on='part_num', right_on='part_num', how='outer')
color_cat.drop(['part_num', 'name_y'], axis=1, inplace=True)
# Merge with part_categories and dropping columns
color_cat = pd.merge(color_cat, legos['part_categories'], left_on='part_cat_id', right_on='id', how='outer')
color_cat.drop(['part_cat_id', 'id'], axis=1, inplace=True)
# Grouping and counting, setting column names and sorting by number
color_cat = color_cat.groupby(['name_x', 'rgb', 'is_trans', 'name']).sum().reset_index()
names = ['color_name', 'rgb', 'is_trans', 'category_name', 'number']
color_cat.columns = names
color_cat.sort_values(by='number', ascending=False).head(10)


Out[64]:
color_name rgb is_trans category_name number
2479 White FFFFFF f Bricks 55812.0
39 Black 05131D f Plates 53179.0
56 Black 05131D f Technic Pins 51642.0
1773 Red C91A09 f Bricks 48251.0
2500 White FFFFFF f Plates 39571.0
2534 Yellow F2CD37 f Bricks 38166.0
74 Blue 0055BF f Bricks 35878.0
17 Black 05131D f Bricks 34969.0
1052 Light Bluish Gray A0A5A9 f Plates 31356.0
456 Dark Bluish Gray 6C6E68 f Plates 29883.0

3. Create a table that allows us to analyze set names and their themes

  • Merge everything in one command
  • Do not include data that will not help with this analysis (remove all of this in a second command)
    • do consider that we might later want to look at specific colors or other part details
  • Rename columns where necessary

In [71]:
# Merging sets and themes and dropping columns
set_themes = pd.merge(legos['sets'], legos['themes'], left_on='theme_id', right_on='id', how='left')\
    .merge(legos['inventories'], on='set_num')\
    .merge(legos['inventory_parts'], left_on='id_y', right_on='inventory_id')
set_themes.drop(['set_num', 'theme_id', 'num_parts', 'id_x', 'id_y', 'is_spare'], axis=1, inplace=True)
# Reset names
set_themes.rename(columns={'name_x': 'set_name',
                           'name_y': 'theme_name'}, inplace=True)
set_themes.head()


Out[71]:
set_name year theme_name parent_id version inventory_id part_num color_id quantity
0 Weetabix Castle 1970 Castle 411.0 1 5574 29c01 4 8
1 Weetabix Castle 1970 Castle 411.0 1 5574 29c01 15 6
2 Weetabix Castle 1970 Castle 411.0 1 5574 3001a 15 25
3 Weetabix Castle 1970 Castle 411.0 1 5574 3001a 4 9
4 Weetabix Castle 1970 Castle 411.0 1 5574 3001a 1 4

4. Create a copy of the parts table and modify it to show what sets the parts belong to


In [69]:
# Creating the copy
parts_copy = legos['parts'].copy()
# Merging with inventory_parts
parts_copy = pd.merge(parts_copy, legos['inventory_parts'], left_on='part_num', right_on='part_num', how='left')
# Merging with inventories
parts_copy = pd.merge(parts_copy, legos['inventories'], left_on='inventory_id', right_on='id', how='left')
# Merging with sets
parts_copy = pd.merge(parts_copy, legos['sets'], left_on='set_num', right_on='set_num', how='left')
# Dropping and renaming columns
parts_copy.drop(['inventory_id', 'id'], axis=1, inplace=True)
names = ['part_num', 'part_name', 'part_cat_id', 'color_id', 'quantity', 'is_spare', 'version', 'set_num', 'set_name', 'year', 'theme_id', 'num_parts']
parts_copy.columns = names
parts_copy.head()


Out[69]:
part_num part_name part_cat_id color_id quantity is_spare version set_num set_name year theme_id num_parts
0 0687b1 Set 0687 Activity Booklet 1 17 9999.0 1.0 f 1.0 970687-1 Team Challenge Upgrade Kit 2002.0 524.0 95.0
1 0901 Baseplate 16 x 30 with Set 080 Yellow House Print 1 2.0 1.0 f 1.0 080-1 Basic Building Set with Train 1967.0 366.0 710.0
2 0902 Baseplate 16 x 24 with Set 080 Small White Hou... 1 2.0 1.0 f 1.0 080-1 Basic Building Set with Train 1967.0 366.0 710.0
3 0903 Baseplate 16 x 24 with Set 080 Red House Print 1 2.0 1.0 f 1.0 080-1 Basic Building Set with Train 1967.0 366.0 710.0
4 0904 Baseplate 16 x 24 with Set 080 Large White Hou... 1 2.0 1.0 f 1.0 080-1 Basic Building Set with Train 1967.0 366.0 710.0

5. Create a copy of the set_themes table created earlier. Modify it to create a new table indicating how many transparent parts there are for each set name


In [73]:
set_themes_copy.head()


Out[73]:
set_name year theme_name parent_id version inventory_id part_num color_id quantity id name rgb is_trans
0 Weetabix Castle 1970 Castle 411.0 1 5574 29c01 4 8 4 Red C91A09 0
1 Weetabix Castle 1970 Castle 411.0 1 5574 3001a 4 9 4 Red C91A09 0
2 Weetabix Castle 1970 Castle 411.0 1 5574 3002a 4 6 4 Red C91A09 0
3 Weetabix Castle 1970 Castle 411.0 1 5574 3004 4 22 4 Red C91A09 0
4 Weetabix Castle 1970 Castle 411.0 1 5574 3005 4 2 4 Red C91A09 0

In [75]:
# Creating the copy
set_themes_copy = set_themes.copy()
# Merging with colors
set_themes_copy = pd.merge(set_themes_copy, legos['colors'], left_on='color_id', right_on='id', how='inner')
# Transforming is_trans column to int
set_themes_copy['is_trans'] = set_themes_copy['is_trans'].apply(lambda x: 1 if x == 't' else 0)
# Grouping and counting
set_themes_copy = set_themes_copy.groupby(['set_name', 'theme_name'])[['quantity', 'is_trans']].sum()
# Resetting index and renaming columns
set_themes_copy.reset_index(inplace=True)
names = ['set_name', 'theme_name', 'quantity', 'transparent_quantity']
set_themes_copy.columns = names
set_themes_copy.head()


Out[75]:
set_name theme_name quantity transparent_quantity
0 Scenery and Dagger Trap polybag Ninjago 25 0
1 Spectre Series 14 Minifigures 7 1
2 'Where Are My Pants?' Guy The LEGO Movie Series 6 0
3 1 x 1 Bricks Supplemental 400 1
4 1 x 1 Bricks with Letters (System) Supplemental 44 0