In [0]:
import pandas as pd

Creating data¶ There are two core objects in pandas: the DataFrame and the Series.

DataFrame A DataFrame is a table. It contains an array of individual entries, each of which has a certain value. Each entry corresponds to a row (or record) and a column.

For example, consider the following simple DataFrame:


In [2]:
pd.DataFrame({'Yes': [50, 21], 'No': [131, 2]})


Out[2]:
Yes No
0 50 131
1 21 2

In this example, the "0, No" entry has the value of 131. The "0, Yes" entry has a value of 50, and so on.

DataFrame entries are not limited to integers. For instance, here's a DataFrame whose values are strings:


In [3]:
pd.DataFrame({'Bob': ['I liked it.', 'It was awful.'], 'Sue': ['Pretty good.', 'Bland.']})


Out[3]:
Bob Sue
0 I liked it. Pretty good.
1 It was awful. Bland.

The dictionary-list constructor assigns values to the column labels, but just uses an ascending count from 0 (0, 1, 2, 3, ...) for the row labels. Sometimes this is OK, but oftentimes we will want to assign these labels ourselves.

The list of row labels used in a DataFrame is known as an Index. We can assign values to it by using an index parameter in our constructor:


In [4]:
pd.DataFrame({'Bob': ['I liked it.', 'It was awful.'], 
              'Sue': ['Pretty good.', 'Bland.']},
             index=['Product A', 'Product B'])


Out[4]:
Bob Sue
Product A I liked it. Pretty good.
Product B It was awful. Bland.

Series

A Series, by contrast, is a sequence of data values. If a DataFrame is a table, a Series is a list. And in fact you can create one with nothing more than a list


In [5]:
pd.Series([1, 2, 3, 4, 5])


Out[5]:
0    1
1    2
2    3
3    4
4    5
dtype: int64

A Series is, in essence, a single column of a DataFrame. So you can assign column values to the Series the same way as before, using an index parameter. However, a Series does not have a column name, it only has one overall name:


In [6]:
pd.Series([30, 35, 40], index=['2015 Sales', '2016 Sales', '2017 Sales'], name='Product A')


Out[6]:
2015 Sales    30
2016 Sales    35
2017 Sales    40
Name: Product A, dtype: int64

The Series and the DataFrame are intimately related. It's helpful to think of a DataFrame as actually being just a bunch of Series "glued together". We'll see more of this in the next section of this tutorial.

Reading data files

Being able to create a DataFrame or Series by hand is handy. But, most of the time, we won't actually be creating our own data by hand. Instead, we'll be working with data that already exists.

Data can be stored in any of a number of different forms and formats. By far the most basic of these is the humble CSV file. When you open a CSV file you get something that looks like this:

Product A,Product B,Product C, 30,21,9, 35,34,1, 41,11,11

So a CSV file is a table of values separated by commas. Hence the name: "Comma-Separated Values", or CSV.

Let's now set aside our toy datasets and see what a real dataset looks like when we read it into a DataFrame. We'll use the pd.read_csv() function to read the data into a DataFrame. This goes thusly:


In [0]:
# wine_reviews = pd.read_csv("../input/wine-reviews/winemag-data-130k-v2.csv")

Para bajar los datos:

https://www.kaggle.com/luisrivera/exercise-creating-reading-and-writing/edit

Para subir los datos a Google Collaborative: D:\kaggle\Cursos\Panda


In [0]:
wine_reviews = pd.read_csv("winemag-data-130k-v2.csv")

We can use the shape attribute to check how large the resulting DataFrame is:


In [9]:
wine_reviews.shape


Out[9]:
(43904, 14)

So our new DataFrame has 130,000 records split across 14 different columns. That's almost 2 million entries!

We can examine the contents of the resultant DataFrame using the head() command, which grabs the first five rows:


In [10]:
wine_reviews.head()


Out[10]:
Unnamed: 0 country description designation points price province region_1 region_2 taster_name taster_twitter_handle title variety winery
0 0 Italy Aromas include tropical fruit, broom, brimston... Vulkà Bianco 87 NaN Sicily & Sardinia Etna NaN Kerin O’Keefe @kerinokeefe Nicosia 2013 Vulkà Bianco (Etna) White Blend Nicosia
1 1 Portugal This is ripe and fruity, a wine that is smooth... Avidagos 87 15.0 Douro NaN NaN Roger Voss @vossroger Quinta dos Avidagos 2011 Avidagos Red (Douro) Portuguese Red Quinta dos Avidagos
2 2 US Tart and snappy, the flavors of lime flesh and... NaN 87 14.0 Oregon Willamette Valley Willamette Valley Paul Gregutt @paulgwine Rainstorm 2013 Pinot Gris (Willamette Valley) Pinot Gris Rainstorm
3 3 US Pineapple rind, lemon pith and orange blossom ... Reserve Late Harvest 87 13.0 Michigan Lake Michigan Shore NaN Alexander Peartree NaN St. Julian 2013 Reserve Late Harvest Riesling ... Riesling St. Julian
4 4 US Much like the regular bottling from 2012, this... Vintner's Reserve Wild Child Block 87 65.0 Oregon Willamette Valley Willamette Valley Paul Gregutt @paulgwine Sweet Cheeks 2012 Vintner's Reserve Wild Child... Pinot Noir Sweet Cheeks

The pd.read_csv() function is well-endowed, with over 30 optional parameters you can specify. For example, you can see in this dataset that the CSV file has a built-in index, which pandas did not pick up on automatically. To make pandas use that column for the index (instead of creating a new one from scratch), we can specify an index_col.


In [0]:
# wine_reviews = pd.read_csv("../input/wine-reviews/winemag-data-130k-v2.csv", index_col=0)
# wine_reviews.head()

In [0]:
wine_reviews = pd.read_csv("winemag-data-130k-v2.csv", index_col=0)
wine_reviews.head()


Out[0]:
country description designation points price province region_1 region_2 taster_name taster_twitter_handle title variety winery
0 Italy Aromas include tropical fruit, broom, brimston... Vulkà Bianco 87.0 NaN Sicily & Sardinia Etna NaN Kerin O’Keefe @kerinokeefe Nicosia 2013 Vulkà Bianco (Etna) White Blend Nicosia
1 Portugal This is ripe and fruity, a wine that is smooth... Avidagos 87.0 15.0 Douro NaN NaN Roger Voss @vossroger Quinta dos Avidagos 2011 Avidagos Red (Douro) Portuguese Red Quinta dos Avidagos
2 US Tart and snappy, the flavors of lime flesh and... NaN 87.0 14.0 Oregon Willamette Valley Willamette Valley Paul Gregutt @paulgwine Rainstorm 2013 Pinot Gris (Willamette Valley) Pinot Gris Rainstorm
3 US Pineapple rind, lemon pith and orange blossom ... Reserve Late Harvest 87.0 13.0 Michigan Lake Michigan Shore NaN Alexander Peartree NaN St. Julian 2013 Reserve Late Harvest Riesling ... Riesling St. Julian
4 US Much like the regular bottling from 2012, this... Vintner's Reserve Wild Child Block 87.0 65.0 Oregon Willamette Valley Willamette Valley Paul Gregutt @paulgwine Sweet Cheeks 2012 Vintner's Reserve Wild Child... Pinot Noir Sweet Cheeks

In [0]:
import pandas as pd
pd.set_option('max_rows', 5)
# from learntools.core import binder; binder.bind(globals())
# from learntools.pandas.creating_reading_and_writing import *
# print("Setup complete.")

1.

In the cell below, create a DataFrame fruits that looks like this:


In [12]:
# Your code goes here. Create a dataframe matching the above diagram and assign it to the variable fruits.
fruits = pd.DataFrame({'Apples': ['30'], 
              'Bananas': ['21']})

#q1.check()
fruits


Out[12]:
Apples Bananas
0 30 21

In [14]:
fruits = pd.DataFrame([[30, 21]], columns=['Apples', 'Bananas'])
fruits


Out[14]:
Apples Bananas
0 30 21

2.

Create a dataframe fruit_sales that matches the diagram below:


In [15]:
fruit_sales = pd.DataFrame({'Apples': ['35', '41'], 
                            'Bananas': ['21', '34' ]},
                             index=['2017 Sales', '2018 Sales'])
fruit_sales


Out[15]:
Apples Bananas
2017 Sales 35 21
2018 Sales 41 34

3.

Create a variable ingredients with a Series that looks like:

Flour     4 cups
Milk       1 cup
Eggs     2 large
Spam       1 can
Name: Dinner, dtype: object

In [16]:
quantities = ['4 cups', '1 cup', '2 large', '1 can']
items = ['Flour', 'Milk', 'Eggs', 'Spam']
recipe = pd.Series(quantities, index=items, name='Dinner')
recipe


Out[16]:
Flour     4 cups
Milk       1 cup
Eggs     2 large
Spam       1 can
Name: Dinner, dtype: object

4.

Read the following csv dataset of wine reviews into a DataFrame called reviews:

The filepath to the csv file is ../input/wine-reviews/winemag-data_first150k.csv. The first few lines look like:

,country,description,designation,points,price,province,region_1,region_2,variety,winery
0,US,"This tremendous 100% varietal wine[...]",Martha's Vineyard,96,235.0,California,Napa Valley,Napa,Cabernet Sauvignon,Heitz
1,Spain,"Ripe aromas of fig, blackberry and[...]",Carodorum Selección Especial Reserva,96,110.0,Northern Spain,Toro,,Tinta de Toro,Bodega Carmen Rodríguez

In [18]:
#reviews = pd.read_csv("../input/wine-reviews/winemag-data_first150k.csv", index_col=0)
reviews = pd.read_csv("winemag-data_first150k.csv", index_col=0)

reviews


Out[18]:
country description designation points price province region_1 region_2 variety winery
0 US This tremendous 100% varietal wine hails from ... Martha's Vineyard 96 235.0 California Napa Valley Napa Cabernet Sauvignon Heitz
1 Spain Ripe aromas of fig, blackberry and cassis are ... Carodorum Selección Especial Reserva 96 110.0 Northern Spain Toro NaN Tinta de Toro Bodega Carmen Rodríguez
... ... ... ... ... ... ... ... ... ... ...
150928 France A perfect salmon shade, with scents of peaches... Grand Brut Rosé 90 52.0 Champagne Champagne NaN Champagne Blend Gosset
150929 Italy More Pinot Grigios should taste like this. A r... NaN 90 15.0 Northeastern Italy Alto Adige NaN Pinot Grigio Alois Lageder

150930 rows × 10 columns

5.

Run the cell below to create and display a DataFrame called animals:


In [19]:
animals = pd.DataFrame({'Cows': [12, 20], 'Goats': [22, 19]}, 
                       index=['Year 1', 'Year 2'])
animals


Out[19]:
Cows Goats
Year 1 12 22
Year 2 20 19

In the cell below, write code to save this DataFrame to disk as a csv file with the name cows_and_goats.csv.


In [0]:
animals.to_csv("cows_and_goats.csv")

https://www.kaggle.com/residentmario/indexing-selecting-assigning

Naive accessors

Native Python objects provide good ways of indexing data. Pandas carries all of these over, which helps make it easy to start with.

Consider this DataFrame:


In [21]:
reviews


Out[21]:
country description designation points price province region_1 region_2 variety winery
0 US This tremendous 100% varietal wine hails from ... Martha's Vineyard 96 235.0 California Napa Valley Napa Cabernet Sauvignon Heitz
1 Spain Ripe aromas of fig, blackberry and cassis are ... Carodorum Selección Especial Reserva 96 110.0 Northern Spain Toro NaN Tinta de Toro Bodega Carmen Rodríguez
... ... ... ... ... ... ... ... ... ... ...
150928 France A perfect salmon shade, with scents of peaches... Grand Brut Rosé 90 52.0 Champagne Champagne NaN Champagne Blend Gosset
150929 Italy More Pinot Grigios should taste like this. A r... NaN 90 15.0 Northeastern Italy Alto Adige NaN Pinot Grigio Alois Lageder

150930 rows × 10 columns

In Python, we can access the property of an object by accessing it as an attribute. A book object, for example, might have a title property, which we can access by calling book.title. Columns in a pandas DataFrame work in much the same way.

Hence to access the country property of reviews we can use:


In [22]:
reviews.country


Out[22]:
0             US
1          Spain
           ...  
150928    France
150929     Italy
Name: country, Length: 150930, dtype: object

If we have a Python dictionary, we can access its values using the indexing ([]) operator. We can do the same with columns in a DataFrame:


In [23]:
reviews['country']


Out[23]:
0             US
1          Spain
           ...  
150928    France
150929     Italy
Name: country, Length: 150930, dtype: object

These are the two ways of selecting a specific Series out of a DataFrame. Neither of them is more or less syntactically valid than the other, but the indexing operator [] does have the advantage that it can handle column names with reserved characters in them (e.g. if we had a country providence column, reviews.country providence wouldn't work).

Doesn't a pandas Series look kind of like a fancy dictionary? It pretty much is, so it's no surprise that, to drill down to a single specific value, we need only use the indexing operator [] once more:


In [24]:
reviews['country'][0]


Out[24]:
'US'

Indexing in pandas

The indexing operator and attribute selection are nice because they work just like they do in the rest of the Python ecosystem. As a novice, this makes them easy to pick up and use. However, pandas has its own accessor operators, loc and iloc. For more advanced operations, these are the ones you're supposed to be using.

Index-based selection Pandas indexing works in one of two paradigms. The first is index-based selection: selecting data based on its numerical position in the data. iloc follows this paradigm.

To select the first row of data in a DataFrame, we may use the following:


In [0]:
reviews.iloc[0]


Out[0]:
country                                                       US
description    This tremendous 100% varietal wine hails from ...
                                     ...                        
variety                                       Cabernet Sauvignon
winery                                                     Heitz
Name: 0, Length: 10, dtype: object

Both loc and iloc are row-first, column-second. This is the opposite of what we do in native Python, which is column-first, row-second.

This means that it's marginally easier to retrieve rows, and marginally harder to get retrieve columns. To get a column with iloc, we can do the following:


In [0]:
reviews.iloc[:, 0]


Out[0]:
0           US
1        Spain
         ...  
15739       US
15740       US
Name: country, Length: 15741, dtype: object

In [0]:
On its own, the : operator, which also comes from native Python, means "everything". When combined with other selectors, however, it can be used to indicate a range of values. For example, to select the country column from just the first, second, and third row, we would do:

In [0]:
reviews.iloc[:3, 0]


Out[0]:
0       US
1    Spain
2       US
Name: country, dtype: object

Or, to select just the second and third entries, we would do:


In [0]:
reviews.iloc[1:3, 0]


Out[0]:
1    Spain
2       US
Name: country, dtype: object

It's also possible to pass a list:


In [0]:
reviews.iloc[[0, 1, 2], 0]


Out[0]:
0       US
1    Spain
2       US
Name: country, dtype: object

Finally, it's worth knowing that negative numbers can be used in selection. This will start counting forwards from the end of the values. So for example here are the last five elements of the dataset.


In [0]:
reviews.iloc[-5:]


Out[0]:
country description designation points price province region_1 region_2 variety winery
15736 US Clean and zesty, with high acidity framing jam... NaN 85 17.0 California Napa Valley Napa Sauvignon Blanc Madrigal
15737 Australia Hazyblur has coaxed some expressive fruit out ... NaN 85 19.0 South Australia Kangaroo Island NaN Pinot Gris Hazyblur
15738 US Heavy and full-bodied, a dense, soft Pinot Noi... Bacigalupi Vineyards 85 42.0 California Russian River Valley Sonoma Pinot Noir John Tyler
15739 US Estate-grown, tart and tasting of citrus and g... NaN 85 12.0 Washington Columbia Valley (WA) Columbia Valley Riesling Jones of Washington
15740 US A good, rugged Petite Sirah, with the bigtime ... NaN 85 15.0 California North NaN NaN NaN

Label-based selection The second paradigm for attribute selection is the one followed by the loc operator: label-based selection. In this paradigm, it's the data index value, not its position, which matters.

For example, to get the first entry in reviews, we would now do the following:


In [0]:
reviews.loc[0, 'country']


Out[0]:
'US'

iloc is conceptually simpler than loc because it ignores the dataset's indices. When we use iloc we treat the dataset like a big matrix (a list of lists), one that we have to index into by position. loc, by contrast, uses the information in the indices to do its work. Since your dataset usually has meaningful indices, it's usually easier to do things using loc instead. For example, here's one operation that's much easier using loc:


In [0]:
reviews.loc[:, ['taster_name', 'taster_twitter_handle', 'points']]


/usr/local/lib/python3.6/dist-packages/pandas/core/indexing.py:1418: FutureWarning: 
Passing list-likes to .loc or [] with any missing label will raise
KeyError in the future, you can use .reindex() as an alternative.

See the documentation here:
https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#deprecate-loc-reindex-listlike
  return self._getitem_tuple(key)
Out[0]:
taster_name taster_twitter_handle points
0 NaN NaN 96
1 NaN NaN 96
... ... ... ...
15739 NaN NaN 85
15740 NaN NaN 85

15741 rows × 3 columns

Choosing between loc and iloc When choosing or transitioning between loc and iloc, there is one "gotcha" worth keeping in mind, which is that the two methods use slightly different indexing schemes.

iloc uses the Python stdlib indexing scheme, where the first element of the range is included and the last one excluded. So 0:10 will select entries 0,...,9. loc, meanwhile, indexes inclusively. So 0:10 will select entries 0,...,10.

Why the change? Remember that loc can index any stdlib type: strings, for example. If we have a DataFrame with index values Apples, ..., Potatoes, ..., and we want to select "all the alphabetical fruit choices between Apples and Potatoes", then it's a lot more convenient to index df.loc['Apples':'Potatoes'] than it is to index something like df.loc['Apples', 'Potatoet] (t coming after s in the alphabet).

This is particularly confusing when the DataFrame index is a simple numerical list, e.g. 0,...,1000. In this case df.iloc[0:1000] will return 1000 entries, while df.loc[0:1000] return 1001 of them! To get 1000 elements using loc, you will need to go one lower and ask for df.iloc[0:999].

Otherwise, the semantics of using loc are the same as those for iloc.

Manipulating the index

Label-based selection derives its power from the labels in the index. Critically, the index we use is not immutable. We can manipulate the index in any way we see fit.

The set_index() method can be used to do the job. Here is what happens when we set_index to the title field:


In [0]:
# reviews.set_index("title")  # da error, no existe esa columna

In [0]:
reviews.set_index("variety")


Out[0]:
country description designation points price province region_1 region_2 winery
variety
Cabernet Sauvignon US This tremendous 100% varietal wine hails from ... Martha's Vineyard 96 235.0 California Napa Valley Napa Heitz
Tinta de Toro Spain Ripe aromas of fig, blackberry and cassis are ... Carodorum Selección Especial Reserva 96 110.0 Northern Spain Toro NaN Bodega Carmen Rodríguez
... ... ... ... ... ... ... ... ... ...
Riesling US Estate-grown, tart and tasting of citrus and g... NaN 85 12.0 Washington Columbia Valley (WA) Columbia Valley Jones of Washington
NaN US A good, rugged Petite Sirah, with the bigtime ... NaN 85 15.0 California North NaN NaN

15741 rows × 9 columns

This is useful if you can come up with an index for the dataset which is better than the current one.

Conditional selection

So far we've been indexing various strides of data, using structural properties of the DataFrame itself. To do interesting things with the data, however, we often need to ask questions based on conditions.

For example, suppose that we're interested specifically in better-than-average wines produced in Italy.

We can start by checking if each wine is Italian or not:


In [0]:
reviews.country == 'Italy'


Out[0]:
0        False
1        False
         ...  
15739    False
15740    False
Name: country, Length: 15741, dtype: bool

This operation produced a Series of True/False booleans based on the country of each record. This result can then be used inside of loc to select the relevant data:


In [0]:
reviews.loc[reviews.country == 'Italy']


Out[0]:
country description designation points price province region_1 region_2 variety winery
10 Italy Elegance, complexity and structure come togeth... Ronco della Chiesa 95 80.0 Northeastern Italy Collio NaN Friulano Borgo del Tiglio
32 Italy Underbrush, scorched earth, menthol and plum s... Vigna Piaggia 90 NaN Tuscany Brunello di Montalcino NaN Sangiovese Abbadia Ardenga
... ... ... ... ... ... ... ... ... ... ...
15600 Italy This Negroamaro-Aglianico blend opens slowly, ... Terra Riserva 90 30.0 Southern Italy Leverano NaN Red Blend Conti Zecca
15609 Italy Here is a dense and intense Primitivo with exo... Punta Aquila 90 NaN Southern Italy Puglia NaN Primitivo Tenute Rubino

2303 rows × 10 columns

This DataFrame has ~20,000 rows. The original had ~130,000. That means that around 15% of wines originate from Italy.

We also wanted to know which ones are better than average. Wines are reviewed on a 80-to-100 point scale, so this could mean wines that accrued at least 90 points.

We can use the ampersand (&) to bring the two questions together:


In [0]:
reviews.loc[(reviews.country == 'Italy') & (reviews.points >= 90)]


Out[0]:
country description designation points price province region_1 region_2 variety winery
10 Italy Elegance, complexity and structure come togeth... Ronco della Chiesa 95 80.0 Northeastern Italy Collio NaN Friulano Borgo del Tiglio
32 Italy Underbrush, scorched earth, menthol and plum s... Vigna Piaggia 90 NaN Tuscany Brunello di Montalcino NaN Sangiovese Abbadia Ardenga
... ... ... ... ... ... ... ... ... ... ...
15600 Italy This Negroamaro-Aglianico blend opens slowly, ... Terra Riserva 90 30.0 Southern Italy Leverano NaN Red Blend Conti Zecca
15609 Italy Here is a dense and intense Primitivo with exo... Punta Aquila 90 NaN Southern Italy Puglia NaN Primitivo Tenute Rubino

776 rows × 10 columns

Suppose we'll buy any wine that's made in Italy or which is rated above average. For this we use a pipe (|):


In [0]:
reviews.loc[(reviews.country == 'Italy') | (reviews.points >= 90)]


Out[0]:
country description designation points price province region_1 region_2 variety winery
0 US This tremendous 100% varietal wine hails from ... Martha's Vineyard 96 235.0 California Napa Valley Napa Cabernet Sauvignon Heitz
1 Spain Ripe aromas of fig, blackberry and cassis are ... Carodorum Selección Especial Reserva 96 110.0 Northern Spain Toro NaN Tinta de Toro Bodega Carmen Rodríguez
... ... ... ... ... ... ... ... ... ... ...
15623 Australia Produced by Frankland Estate, this is a full-b... NaN 90 14.0 Western Australia Frankland River NaN Riesling Rocky Gully
15624 Germany A bit fuller and drier than you might expect f... Butterfly 90 16.0 Mosel-Saar-Ruwer NaN NaN Riesling Zilliken

7853 rows × 10 columns

Pandas comes with a few built-in conditional selectors, two of which we will highlight here.

The first is isin. isin is lets you select data whose value "is in" a list of values. For example, here's how we can use it to select wines only from Italy or France:


In [0]:
reviews.loc[reviews.country.isin(['Italy', 'France'])]


Out[0]:
country description designation points price province region_1 region_2 variety winery
4 France This is the top wine from La Bégude, named aft... La Brûlade 95 66.0 Provence Bandol NaN Provence red blend Domaine de la Bégude
10 Italy Elegance, complexity and structure come togeth... Ronco della Chiesa 95 80.0 Northeastern Italy Collio NaN Friulano Borgo del Tiglio
... ... ... ... ... ... ... ... ... ... ...
15704 France From fruit growing around the village of Wetto... Cuvée Albert 86 NaN Alsace Alsace NaN Riesling Domaine Albert Mann
15727 France Strong currant aromas, soft in texture with hi... La Montarlet 85 20.0 Loire Valley Sancerre NaN Sauvignon Blanc Joseph Mellot

5230 rows × 10 columns

The second is isnull (and its companion notnull). These methods let you highlight values which are (or are not) empty (NaN). For example, to filter out wines lacking a price tag in the dataset, here's what we would do:


In [0]:
reviews.loc[reviews.price.notnull()]


Out[0]:
country description designation points price province region_1 region_2 variety winery
0 US This tremendous 100% varietal wine hails from ... Martha's Vineyard 96 235.0 California Napa Valley Napa Cabernet Sauvignon Heitz
1 Spain Ripe aromas of fig, blackberry and cassis are ... Carodorum Selección Especial Reserva 96 110.0 Northern Spain Toro NaN Tinta de Toro Bodega Carmen Rodríguez
... ... ... ... ... ... ... ... ... ... ...
15739 US Estate-grown, tart and tasting of citrus and g... NaN 85 12.0 Washington Columbia Valley (WA) Columbia Valley Riesling Jones of Washington
15740 US A good, rugged Petite Sirah, with the bigtime ... NaN 85 15.0 California North NaN NaN NaN

14776 rows × 10 columns

Assigning data

Going the other way, assigning data to a DataFrame is easy. You can assign either a constant value:


In [0]:
reviews['critic'] = 'everyone'
reviews['critic']


Out[0]:
0        everyone
1        everyone
           ...   
15739    everyone
15740    everyone
Name: critic, Length: 15741, dtype: object

Or with an iterable of values:


In [0]:
reviews['index_backwards'] = range(len(reviews), 0, -1)
reviews['index_backwards']


Out[0]:
0        15741
1        15740
         ...  
15739        2
15740        1
Name: index_backwards, Length: 15741, dtype: int64

Run the following cell to load your data and some utility functions (including code to check your answers).


In [0]:
import pandas as pd

# reviews = pd.read_csv("../input/wine-reviews/winemag-data-130k-v2.csv", index_col=0)
reviews = pd.read_csv("winemag-data-130k-v2.csv", index_col=0)
pd.set_option("display.max_rows", 5)

# from learntools.core import binder; binder.bind(globals())
# from learntools.pandas.indexing_selecting_and_assigning import *
# print("Setup complete.")

Look at an overview of your data by running the following line.


In [0]:
reviews.head()


Out[0]:
country description designation points price province region_1 region_2 taster_name taster_twitter_handle title variety winery
0 Italy Aromas include tropical fruit, broom, brimston... Vulkà Bianco 87 NaN Sicily & Sardinia Etna NaN Kerin O’Keefe @kerinokeefe Nicosia 2013 Vulkà Bianco (Etna) White Blend Nicosia
1 Portugal This is ripe and fruity, a wine that is smooth... Avidagos 87 15.0 Douro NaN NaN Roger Voss @vossroger Quinta dos Avidagos 2011 Avidagos Red (Douro) Portuguese Red Quinta dos Avidagos
2 US Tart and snappy, the flavors of lime flesh and... NaN 87 14.0 Oregon Willamette Valley Willamette Valley Paul Gregutt @paulgwine Rainstorm 2013 Pinot Gris (Willamette Valley) Pinot Gris Rainstorm
3 US Pineapple rind, lemon pith and orange blossom ... Reserve Late Harvest 87 13.0 Michigan Lake Michigan Shore NaN Alexander Peartree NaN St. Julian 2013 Reserve Late Harvest Riesling ... Riesling St. Julian
4 US Much like the regular bottling from 2012, this... Vintner's Reserve Wild Child Block 87 65.0 Oregon Willamette Valley Willamette Valley Paul Gregutt @paulgwine Sweet Cheeks 2012 Vintner's Reserve Wild Child... Pinot Noir Sweet Cheeks

# Exercises

1.

Select the description column from reviews and assign the result to the variable desc.


In [0]:
desc = reviews['description']

In [0]:
desc = reviews.description
#or

#desc = reviews["description"]
# desc is a pandas Series object, with an index matching the reviews DataFrame. In general, when we select a single column from a DataFrame, we'll get a Series.

desc.head(10)


Out[0]:
0    Aromas include tropical fruit, broom, brimston...
1    This is ripe and fruity, a wine that is smooth...
                           ...                        
8    Savory dried thyme notes accent sunnier flavor...
9    This has great depth of flavor with its fresh ...
Name: description, Length: 10, dtype: object

2.

Select the first value from the description column of reviews, assigning it to variable first_description.


In [0]:
first_description = reviews["description"][0]

# q2.check()
first_description


Out[0]:
"Aromas include tropical fruit, broom, brimstone and dried herb. The palate isn't overly expressive, offering unripened apple, citrus and dried sage alongside brisk acidity."

In [0]:
# Solution:

first_description = reviews.description.iloc[0]
# Note that while this is the preferred way to obtain the entry in the DataFrame, many other options will return a valid result,
# such as reviews.description.loc[0], reviews.description[0], and more!
first_description


Out[0]:
"Aromas include tropical fruit, broom, brimstone and dried herb. The palate isn't overly expressive, offering unripened apple, citrus and dried sage alongside brisk acidity."

3.

Select the first row of data (the first record) from reviews, assigning it to the variable first_row.


In [0]:
first_row = reviews.iloc[0]

# q3.check()
first_row


Out[0]:
country                                                    Italy
description    Aromas include tropical fruit, broom, brimston...
                                     ...                        
variety                                              White Blend
winery                                                   Nicosia
Name: 0, Length: 13, dtype: object

In [0]:
# Solution:

first_row = reviews.iloc[0]

4.

Select the first 10 values from the description column in reviews, assigning the result to variable first_descriptions.

Hint: format your output as a pandas Series.


In [0]:
first_descriptions = reviews.iloc[:10, 1]
# first_descriptions = reviews.description.iloc[0:9]
# first_descriptions = reviews.description.loc[0:9,'description']

# q4.check()
first_descriptions


Out[0]:
0    Aromas include tropical fruit, broom, brimston...
1    This is ripe and fruity, a wine that is smooth...
                           ...                        
8    Savory dried thyme notes accent sunnier flavor...
9    This has great depth of flavor with its fresh ...
Name: description, Length: 10, dtype: object

In [0]:
# Solution:

first_descriptions = reviews.description.iloc[:10]
# Note that many other options will return a valid result, such as desc.head(10) and reviews.loc[:9, "description"].
first_descriptions


Out[0]:
0    Aromas include tropical fruit, broom, brimston...
1    This is ripe and fruity, a wine that is smooth...
                           ...                        
8    Savory dried thyme notes accent sunnier flavor...
9    This has great depth of flavor with its fresh ...
Name: description, Length: 10, dtype: object

5.

Select the records with index labels 1, 2, 3, 5, and 8, assigning the result to the variable sample_reviews.

In other words, generate the following DataFrame:


In [0]:
sample_reviews = reviews.iloc[[1,2,3,5,8],]

# q5.check()
sample_reviews


Out[0]:
country description designation points price province region_1 region_2 taster_name taster_twitter_handle title variety winery
1 Portugal This is ripe and fruity, a wine that is smooth... Avidagos 87 15.0 Douro NaN NaN Roger Voss @vossroger Quinta dos Avidagos 2011 Avidagos Red (Douro) Portuguese Red Quinta dos Avidagos
2 US Tart and snappy, the flavors of lime flesh and... NaN 87 14.0 Oregon Willamette Valley Willamette Valley Paul Gregutt @paulgwine Rainstorm 2013 Pinot Gris (Willamette Valley) Pinot Gris Rainstorm
3 US Pineapple rind, lemon pith and orange blossom ... Reserve Late Harvest 87 13.0 Michigan Lake Michigan Shore NaN Alexander Peartree NaN St. Julian 2013 Reserve Late Harvest Riesling ... Riesling St. Julian
5 Spain Blackberry and raspberry aromas show a typical... Ars In Vitro 87 15.0 Northern Spain Navarra NaN Michael Schachner @wineschach Tandem 2011 Ars In Vitro Tempranillo-Merlot (N... Tempranillo-Merlot Tandem
8 Germany Savory dried thyme notes accent sunnier flavor... Shine 87 12.0 Rheinhessen NaN NaN Anna Lee C. Iijima NaN Heinz Eifel 2013 Shine Gewürztraminer (Rheinhe... Gewürztraminer Heinz Eifel

In [0]:
# Solution:

indices = [1, 2, 3, 5, 8]
sample_reviews = reviews.loc[indices]
sample_reviews


Out[0]:
country description designation points price province region_1 region_2 taster_name taster_twitter_handle title variety winery
1 Portugal This is ripe and fruity, a wine that is smooth... Avidagos 87 15.0 Douro NaN NaN Roger Voss @vossroger Quinta dos Avidagos 2011 Avidagos Red (Douro) Portuguese Red Quinta dos Avidagos
2 US Tart and snappy, the flavors of lime flesh and... NaN 87 14.0 Oregon Willamette Valley Willamette Valley Paul Gregutt @paulgwine Rainstorm 2013 Pinot Gris (Willamette Valley) Pinot Gris Rainstorm
3 US Pineapple rind, lemon pith and orange blossom ... Reserve Late Harvest 87 13.0 Michigan Lake Michigan Shore NaN Alexander Peartree NaN St. Julian 2013 Reserve Late Harvest Riesling ... Riesling St. Julian
5 Spain Blackberry and raspberry aromas show a typical... Ars In Vitro 87 15.0 Northern Spain Navarra NaN Michael Schachner @wineschach Tandem 2011 Ars In Vitro Tempranillo-Merlot (N... Tempranillo-Merlot Tandem
8 Germany Savory dried thyme notes accent sunnier flavor... Shine 87 12.0 Rheinhessen NaN NaN Anna Lee C. Iijima NaN Heinz Eifel 2013 Shine Gewürztraminer (Rheinhe... Gewürztraminer Heinz Eifel

6.

Create a variable df containing the country, province, region_1, and region_2 columns of the records with the index labels 0, 1, 10, and 100. In other words, generate the following DataFrame:


In [0]:
df = reviews.loc[[0,1,10,100],['country', 'province', 'region_1', 'region_2']]

# q6.check()
df


Out[0]:
country province region_1 region_2
0 Italy Sicily & Sardinia Etna NaN
1 Portugal Douro NaN NaN
10 US California Napa Valley Napa
100 US New York Finger Lakes Finger Lakes

In [0]:
# Solution:

cols = ['country', 'province', 'region_1', 'region_2']
indices = [0, 1, 10, 100]
df = reviews.loc[indices, cols]
df


Out[0]:
country province region_1 region_2
0 Italy Sicily & Sardinia Etna NaN
1 Portugal Douro NaN NaN
10 US California Napa Valley Napa
100 US New York Finger Lakes Finger Lakes

7.

Create a variable df containing the country and variety columns of the first 100 records.

Hint: you may use loc or iloc. When working on the answer this question and the several of the ones that follow, keep the following "gotcha" described in the tutorial:

iloc uses the Python stdlib indexing scheme, where the first element of the range is included and the last one excluded. loc, meanwhile, indexes inclusively.

This is particularly confusing when the DataFrame index is a simple numerical list, e.g. 0,...,1000. In this case df.iloc[0:1000] will return 1000 entries, while df.loc[0:1000] return 1001 of them! To get 1000 elements using loc, you will need to go one lower and ask for df.iloc[0:999].


In [0]:
df = reviews.loc[0:99,['country', 'variety']]

# q7.check()
df


Out[0]:
country variety
0 Italy White Blend
1 Portugal Portuguese Red
... ... ...
98 Italy Sangiovese
99 US Bordeaux-style Red Blend

100 rows × 2 columns


In [0]:
# # Correct:

cols = ['country', 'variety']
df = reviews.loc[:99, cols]
# or

# cols_idx = [0, 11]
# df = reviews.iloc[:100, cols_idx]
df


Out[0]:
country variety
0 Italy White Blend
1 Portugal Portuguese Red
... ... ...
98 Italy Sangiovese
99 US Bordeaux-style Red Blend

100 rows × 2 columns

8.

Create a DataFrame italian_wines containing reviews of wines made in Italy. Hint: reviews.country equals what?


In [0]:
italian_wines = reviews.loc[reviews.country == 'Italy']

# q8.check()
italian_wines


Out[0]:
country description designation points price province region_1 region_2 taster_name taster_twitter_handle title variety winery
0 Italy Aromas include tropical fruit, broom, brimston... Vulkà Bianco 87 NaN Sicily & Sardinia Etna NaN Kerin O’Keefe @kerinokeefe Nicosia 2013 Vulkà Bianco (Etna) White Blend Nicosia
6 Italy Here's a bright, informal red that opens with ... Belsito 87 16.0 Sicily & Sardinia Vittoria NaN Kerin O’Keefe @kerinokeefe Terre di Giurfo 2013 Belsito Frappato (Vittoria) Frappato Terre di Giurfo
... ... ... ... ... ... ... ... ... ... ... ... ... ...
129961 Italy Intense aromas of wild cherry, baking spice, t... NaN 90 30.0 Sicily & Sardinia Sicilia NaN Kerin O’Keefe @kerinokeefe COS 2013 Frappato (Sicilia) Frappato COS
129962 Italy Blackberry, cassis, grilled herb and toasted a... Sàgana Tenuta San Giacomo 90 40.0 Sicily & Sardinia Sicilia NaN Kerin O’Keefe @kerinokeefe Cusumano 2012 Sàgana Tenuta San Giacomo Nero d... Nero d'Avola Cusumano

19540 rows × 13 columns


In [0]:
# Solution:

italian_wines = reviews[reviews.country == 'Italy']
italian_wines


Out[0]:
country description designation points price province region_1 region_2 taster_name taster_twitter_handle title variety winery
0 Italy Aromas include tropical fruit, broom, brimston... Vulkà Bianco 87 NaN Sicily & Sardinia Etna NaN Kerin O’Keefe @kerinokeefe Nicosia 2013 Vulkà Bianco (Etna) White Blend Nicosia
6 Italy Here's a bright, informal red that opens with ... Belsito 87 16.0 Sicily & Sardinia Vittoria NaN Kerin O’Keefe @kerinokeefe Terre di Giurfo 2013 Belsito Frappato (Vittoria) Frappato Terre di Giurfo
... ... ... ... ... ... ... ... ... ... ... ... ... ...
129961 Italy Intense aromas of wild cherry, baking spice, t... NaN 90 30.0 Sicily & Sardinia Sicilia NaN Kerin O’Keefe @kerinokeefe COS 2013 Frappato (Sicilia) Frappato COS
129962 Italy Blackberry, cassis, grilled herb and toasted a... Sàgana Tenuta San Giacomo 90 40.0 Sicily & Sardinia Sicilia NaN Kerin O’Keefe @kerinokeefe Cusumano 2012 Sàgana Tenuta San Giacomo Nero d... Nero d'Avola Cusumano

19540 rows × 13 columns

9.

Create a DataFrame top_oceania_wines containing all reviews with at least 95 points (out of 100) for wines from Australia or New Zealand.


In [0]:
top_oceania_wines = reviews.loc[reviews.country.isin(['Australia', 'New Zealand'])
                                & (reviews.points >= 95)]

# q9.check()
top_oceania_wines


Out[0]:
country description designation points price province region_1 region_2 taster_name taster_twitter_handle title variety winery
345 Australia This wine contains some material over 100 year... Rare 100 350.0 Victoria Rutherglen NaN Joe Czerwinski @JoeCz Chambers Rosewood Vineyards NV Rare Muscat (Ru... Muscat Chambers Rosewood Vineyards
346 Australia This deep brown wine smells like a damp, mossy... Rare 98 350.0 Victoria Rutherglen NaN Joe Czerwinski @JoeCz Chambers Rosewood Vineyards NV Rare Muscadelle... Muscadelle Chambers Rosewood Vineyards
... ... ... ... ... ... ... ... ... ... ... ... ... ...
122507 New Zealand This blend of Cabernet Sauvignon (62.5%), Merl... SQM Gimblett Gravels Cabernets/Merlot 95 79.0 Hawke's Bay NaN NaN Joe Czerwinski @JoeCz Squawking Magpie 2014 SQM Gimblett Gravels Cab... Bordeaux-style Red Blend Squawking Magpie
122939 Australia Full-bodied and plush yet vibrant and imbued w... The Factor 98 125.0 South Australia Barossa Valley NaN Joe Czerwinski @JoeCz Torbreck 2013 The Factor Shiraz (Barossa Valley) Shiraz Torbreck

49 rows × 13 columns


In [0]:
# Solution:

top_oceania_wines = reviews.loc[
    (reviews.country.isin(['Australia', 'New Zealand']))
    & (reviews.points >= 95)
]

In [0]:
reviews


Out[0]:
country description designation points price province region_1 region_2 taster_name taster_twitter_handle title variety winery
0 Italy Aromas include tropical fruit, broom, brimston... Vulkà Bianco 87 NaN Sicily & Sardinia Etna NaN Kerin O’Keefe @kerinokeefe Nicosia 2013 Vulkà Bianco (Etna) White Blend Nicosia
1 Portugal This is ripe and fruity, a wine that is smooth... Avidagos 87 15.0 Douro NaN NaN Roger Voss @vossroger Quinta dos Avidagos 2011 Avidagos Red (Douro) Portuguese Red Quinta dos Avidagos
... ... ... ... ... ... ... ... ... ... ... ... ... ...
129969 France A dry style of Pinot Gris, this is crisp with ... NaN 90 32.0 Alsace Alsace NaN Roger Voss @vossroger Domaine Marcel Deiss 2012 Pinot Gris (Alsace) Pinot Gris Domaine Marcel Deiss
129970 France Big, rich and off-dry, this is powered by inte... Lieu-dit Harth Cuvée Caroline 90 21.0 Alsace Alsace NaN Roger Voss @vossroger Domaine Schoffit 2012 Lieu-dit Harth Cuvée Car... Gewürztraminer Domaine Schoffit

129971 rows × 13 columns

Summary functions

Pandas provides many simple "summary functions" (not an official name) which restructure the data in some useful way. For example, consider the describe() method:


In [0]:
reviews.points.describe()


Out[0]:
count    129971.000000
mean         88.447138
             ...      
75%          91.000000
max         100.000000
Name: points, Length: 8, dtype: float64

This method generates a high-level summary of the attributes of the given column. It is type-aware, meaning that its output changes based on the data type of the input. The output above only makes sense for numerical data; for string data here's what we get:


In [0]:
reviews.taster_name.describe()


Out[0]:
count         103727
unique            19
top       Roger Voss
freq           25514
Name: taster_name, dtype: object

If you want to get some particular simple summary statistic about a column in a DataFrame or a Series, there is usually a helpful pandas function that makes it happen.

For example, to see the mean of the points allotted (e.g. how well an averagely rated wine does), we can use the mean() function:


In [0]:
reviews.points.mean()


Out[0]:
88.44713820775404

To see a list of unique values we can use the unique() function:


In [0]:
# reviews.taster_name.unique()    # se demora mucho??

To see a list of unique values and how often they occur in the dataset, we can use the value_counts() method:


In [0]:
reviews.taster_name.value_counts()


Out[0]:
Roger Voss           25514
Michael Schachner    15134
                     ...  
Fiona Adams             27
Christina Pickard        6
Name: taster_name, Length: 19, dtype: int64

Maps

A map is a term, borrowed from mathematics, for a function that takes one set of values and "maps" them to another set of values. In data science we often have a need for creating new representations from existing data, or for transforming data from the format it is in now to the format that we want it to be in later. Maps are what handle this work, making them extremely important for getting your work done!

There are two mapping methods that you will use often.

map() is the first, and slightly simpler one. For example, suppose that we wanted to remean the scores the wines received to 0. We can do this as follows:


In [0]:
review_points_mean = reviews.points.mean()
reviews.points.map(lambda p: p - review_points_mean)


Out[0]:
0        -1.447138
1        -1.447138
            ...   
129969    1.552862
129970    1.552862
Name: points, Length: 129971, dtype: float64

The function you pass to map() should expect a single value from the Series (a point value, in the above example), and return a transformed version of that value. map() returns a new Series where all the values have been transformed by your function.

apply() is the equivalent method if we want to transform a whole DataFrame by calling a custom method on each row.

La función que pase a map () debería esperar un único valor de la Serie (un valor de punto, en el ejemplo anterior) y devolver una versión transformada de ese valor. map () devuelve una nueva serie donde todos los valores han sido transformados por su función.

apply () es el método equivalente si queremos transformar un DataFrame completo llamando a un método personalizado en cada fila.


In [0]:
def remean_points(row):
    row.points = row.points - review_points_mean
    return row

reviews.apply(remean_points, axis='columns')


Out[0]:
country description designation points price province region_1 region_2 taster_name taster_twitter_handle title variety winery
0 Italy Aromas include tropical fruit, broom, brimston... Vulkà Bianco -1.447138 NaN Sicily & Sardinia Etna NaN Kerin O’Keefe @kerinokeefe Nicosia 2013 Vulkà Bianco (Etna) White Blend Nicosia
1 Portugal This is ripe and fruity, a wine that is smooth... Avidagos -1.447138 15.0 Douro NaN NaN Roger Voss @vossroger Quinta dos Avidagos 2011 Avidagos Red (Douro) Portuguese Red Quinta dos Avidagos
... ... ... ... ... ... ... ... ... ... ... ... ... ...
129969 France A dry style of Pinot Gris, this is crisp with ... NaN 1.552862 32.0 Alsace Alsace NaN Roger Voss @vossroger Domaine Marcel Deiss 2012 Pinot Gris (Alsace) Pinot Gris Domaine Marcel Deiss
129970 France Big, rich and off-dry, this is powered by inte... Lieu-dit Harth Cuvée Caroline 1.552862 21.0 Alsace Alsace NaN Roger Voss @vossroger Domaine Schoffit 2012 Lieu-dit Harth Cuvée Car... Gewürztraminer Domaine Schoffit

129971 rows × 13 columns

If we had called reviews.apply() with axis='index', then instead of passing a function to transform each row, we would need to give a function to transform each column.

Note that map() and apply() return new, transformed Series and DataFrames, respectively. They don't modify the original data they're called on. If we look at the first row of reviews, we can see that it still has its original points value.

Si hubiéramos llamado reviews.apply () con axis = 'index', entonces, en lugar de pasar una función para transformar cada fila, tendríamos que dar una función para transformar cada columna.

Tenga en cuenta que map () y apply () devuelven Series y DataFrames nuevos y transformados, respectivamente. No modifican los datos originales a los que se les solicita. Si miramos la primera fila de revisiones, podemos ver que todavía tiene su valor de puntos original.


In [0]:
reviews.head(1)


Out[0]:
country description designation points price province region_1 region_2 taster_name taster_twitter_handle title variety winery
0 Italy Aromas include tropical fruit, broom, brimston... Vulkà Bianco 87 NaN Sicily & Sardinia Etna NaN Kerin O’Keefe @kerinokeefe Nicosia 2013 Vulkà Bianco (Etna) White Blend Nicosia

Pandas provides many common mapping operations as built-ins. For example, here's a faster way of remeaning our points column:


In [0]:
review_points_mean = reviews.points.mean()
reviews.points - review_points_mean


Out[0]:
0        -1.447138
1        -1.447138
            ...   
129969    1.552862
129970    1.552862
Name: points, Length: 129971, dtype: float64

In this code we are performing an operation between a lot of values on the left-hand side (everything in the Series) and a single value on the right-hand side (the mean value). Pandas looks at this expression and figures out that we must mean to subtract that mean value from every value in the dataset.

Pandas will also understand what to do if we perform these operations between Series of equal length. For example, an easy way of combining country and region information in the dataset would be to do the following:


In [0]:
reviews.country + " - " + reviews.region_1


Out[0]:
0            Italy - Etna
1                     NaN
               ...       
129969    France - Alsace
129970    France - Alsace
Length: 129971, dtype: object

These operators are faster than map() or apply() because they uses speed ups built into pandas. All of the standard Python operators (>, <, ==, and so on) work in this manner.

However, they are not as flexible as map() or apply(), which can do more advanced things, like applying conditional logic, which cannot be done with addition and subtraction alone.

Estos operadores son más rápidos que map () o apply () porque usan aceleraciones integradas en pandas. Todos los operadores estándar de Python (>, <, ==, etc.) funcionan de esta manera.

Sin embargo, no son tan flexibles como map () o apply (), que pueden hacer cosas más avanzadas, como aplicar lógica condicional, que no se puede hacer solo con la suma y la resta.

https://www.kaggle.com/residentmario/grouping-and-sorting

Groupwise analysis

One function we've been using heavily thus far is the value_counts() function. We can replicate what value_counts() does by doing the following:


In [0]:
reviews.groupby('points').points.count()


Out[0]:
points
80     397
81     692
      ... 
99      33
100     19
Name: points, Length: 21, dtype: int64

In [0]:
reviews.groupby('points').price.min()


Out[0]:
points
80      5.0
81      5.0
       ... 
99     44.0
100    80.0
Name: price, Length: 21, dtype: float64

In [0]:
reviews.groupby('winery').apply(lambda df: df.title.iloc[0])


Out[0]:
winery
1+1=3                          1+1=3 NV Rosé Sparkling (Cava)
10 Knots                 10 Knots 2010 Viognier (Paso Robles)
                                  ...                        
àMaurice    àMaurice 2013 Fred Estate Syrah (Walla Walla V...
Štoka                         Štoka 2009 Izbrani Teran (Kras)
Length: 16757, dtype: object

In [0]:
reviews.groupby(['country', 'province']).apply(lambda df: df.loc[df.points.idxmax()])


Out[0]:
country description designation points price province region_1 region_2 taster_name taster_twitter_handle title variety winery
country province
Argentina Mendoza Province Argentina If the color doesn't tell the full story, the ... Nicasia Vineyard 97 120.0 Mendoza Province Mendoza NaN Michael Schachner @wineschach Bodega Catena Zapata 2006 Nicasia Vineyard Mal... Malbec Bodega Catena Zapata
Other Argentina Take note, this could be the best wine Colomé ... Reserva 95 90.0 Other Salta NaN Michael Schachner @wineschach Colomé 2010 Reserva Malbec (Salta) Malbec Colomé
... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
Uruguay San Jose Uruguay Baked, sweet, heavy aromas turn earthy with ti... El Preciado Gran Reserva 87 50.0 San Jose NaN NaN Michael Schachner @wineschach Castillo Viejo 2005 El Preciado Gran Reserva R... Red Blend Castillo Viejo
Uruguay Uruguay Cherry and berry aromas are ripe, healthy and ... Blend 002 Limited Edition 91 22.0 Uruguay NaN NaN Michael Schachner @wineschach Narbona NV Blend 002 Limited Edition Tannat-Ca... Tannat-Cabernet Franc Narbona

425 rows × 13 columns


In [0]:
reviews.groupby(['country']).price.agg([len, min, max])


Out[0]:
len min max
country
Argentina 3800.0 4.0 230.0
Armenia 2.0 14.0 15.0
... ... ... ...
Ukraine 14.0 6.0 13.0
Uruguay 109.0 10.0 130.0

43 rows × 3 columns

Multi-indexes

In all of the examples we've seen thus far we've been working with DataFrame or Series objects with a single-label index. groupby() is slightly different in the fact that, depending on the operation we run, it will sometimes result in what is called a multi-index.

A multi-index differs from a regular index in that it has multiple levels. For example:


In [0]:
countries_reviewed = reviews.groupby(['country', 'province']).description.agg([len])
countries_reviewed


Out[0]:
len
country province
Argentina Mendoza Province 3264
Other 536
... ... ...
Uruguay San Jose 3
Uruguay 24

425 rows × 1 columns


In [0]:
mi = countries_reviewed.index
type(mi)


Out[0]:
pandas.core.indexes.multi.MultiIndex

In [0]:
countries_reviewed.reset_index()


Out[0]:
country province len
0 Argentina Mendoza Province 3264
1 Argentina Other 536
... ... ... ...
423 Uruguay San Jose 3
424 Uruguay Uruguay 24

425 rows × 3 columns

Sorting

Looking again at countries_reviewed we can see that grouping returns data in index order, not in value order. That is to say, when outputting the result of a groupby, the order of the rows is dependent on the values in the index, not in the data.

To get data in the order want it in we can sort it ourselves. The sort_values() method is handy for this.


In [0]:
countries_reviewed = countries_reviewed.reset_index()
countries_reviewed.sort_values(by='len')


Out[0]:
country province len
179 Greece Muscat of Kefallonian 1
192 Greece Sterea Ellada 1
... ... ... ...
415 US Washington 8639
392 US California 36247

425 rows × 3 columns


In [0]:
countries_reviewed.sort_values(by='len', ascending=False)


Out[0]:
country province len
392 US California 36247
415 US Washington 8639
... ... ... ...
63 Chile Coelemu 1
149 Greece Beotia 1

425 rows × 3 columns


In [0]:
countries_reviewed.sort_index()


Out[0]:
country province len
0 Argentina Mendoza Province 3264
1 Argentina Other 536
... ... ... ...
423 Uruguay San Jose 3
424 Uruguay Uruguay 24

425 rows × 3 columns


In [0]:
countries_reviewed.sort_values(by=['country', 'len'])


Out[0]:
country province len
1 Argentina Other 536
0 Argentina Mendoza Province 3264
... ... ... ...
424 Uruguay Uruguay 24
419 Uruguay Canelones 43

425 rows × 3 columns


In [0]:
import pandas as pd

# reviews = pd.read_csv("../input/wine-reviews/winemag-data-130k-v2.csv", index_col=0)
#pd.set_option("display.max_rows", 5)
reviews = pd.read_csv("./winemag-data-130k-v2.csv", index_col=0)

# from learntools.core import binder; binder.bind(globals())
# from learntools.pandas.grouping_and_sorting import *
# print("Setup complete.")

1.

Who are the most common wine reviewers in the dataset? Create a Series whose index is the taster_twitter_handle category from the dataset, and whose values count how many reviews each person wrote.


In [0]:
reviews_written = reviews.groupby('taster_twitter_handle').size()

#or
reviews_written


Out[0]:
taster_twitter_handle
@AnneInVino        3685
@JoeCz             5147
                   ... 
@winewchristina       6
@worldwineguys     1005
Length: 15, dtype: int64

In [0]:
reviews_written = reviews.groupby('taster_twitter_handle').taster_twitter_handle.count()
reviews_written


Out[0]:
taster_twitter_handle
@AnneInVino        3685
@JoeCz             5147
                   ... 
@winewchristina       6
@worldwineguys     1005
Name: taster_twitter_handle, Length: 15, dtype: int64

2.

What is the best wine I can buy for a given amount of money? Create a Series whose index is wine prices and whose values is the maximum number of points a wine costing that much was given in a review. Sort the values by price, ascending (so that 4.0 dollars is at the top and 3300.0 dollars is at the bottom).


In [0]:
best_rating_per_price = reviews.groupby('price')['points'].max().sort_index()
best_rating_per_price


Out[0]:
price
4.0       86
5.0       87
          ..
2500.0    96
3300.0    88
Name: points, Length: 390, dtype: int64

3.

What are the minimum and maximum prices for each variety of wine? Create a DataFrame whose index is the variety category from the dataset and whose values are the min and max values thereof.


In [0]:
price_extremes = reviews.groupby('variety').price.agg([min, max])
price_extremes


Out[0]:
min max
variety
Abouriou 15.0 75.0
Agiorgitiko 10.0 66.0
... ... ...
Çalkarası 19.0 19.0
Žilavka 15.0 15.0

707 rows × 2 columns

4.

What are the most expensive wine varieties? Create a variable sorted_varieties containing a copy of the dataframe from the previous question where varieties are sorted in descending order based on minimum price, then on maximum price (to break ties).


In [0]:
sorted_varieties = price_extremes.sort_values(by=['min', 'max'], ascending=False)   
sorted_varieties


Out[0]:
min max
variety
Ramisco 495.0 495.0
Terrantez 236.0 236.0
... ... ...
Vital NaN NaN
Zelen NaN NaN

707 rows × 2 columns

5.

Create a Series whose index is reviewers and whose values is the average review score given out by that reviewer. Hint: you will need the taster_name and points columns.


In [0]:
reviewer_mean_ratings = reviews.groupby('taster_name').points.mean()
reviewer_mean_ratings


Out[0]:
taster_name
Alexander Peartree    85.855422
Anna Lee C. Iijima    88.415629
                        ...    
Susan Kostrzewa       86.609217
Virginie Boone        89.213379
Name: points, Length: 19, dtype: float64

In [0]:
reviewer_mean_ratings.describe()


Out[0]:
count    19.000000
mean     88.233026
           ...    
75%      88.975256
max      90.562551
Name: points, Length: 8, dtype: float64

6.

What combination of countries and varieties are most common? Create a Series whose index is a MultiIndexof {country, variety} pairs. For example, a pinot noir produced in the US should map to {"US", "Pinot Noir"}. Sort the values in the Series in descending order based on wine count.


In [0]:
country_variety_counts = reviews.groupby(['country', 'variety']).size().sort_values(ascending=False)
country_variety_counts


Out[0]:
country    variety           
US         Pinot Noir            9885
           Cabernet Sauvignon    7315
                                 ... 
Italy      Moscato di Noto          1
Argentina  Barbera                  1
Length: 1612, dtype: int64

Dtypes

The data type for a column in a DataFrame or a Series is known as the dtype.

You can use the dtype property to grab the type of a specific column. For instance, we can get the dtype of the price column in the reviews DataFrame:


In [0]:
reviews.price.dtype


Out[0]:
dtype('float64')

Alternatively, the dtypes property returns the dtype of every column in the DataFrame:


In [0]:
reviews.dtypes


Out[0]:
country        object
description    object
                ...  
variety        object
winery         object
Length: 13, dtype: object

Data types tell us something about how pandas is storing the data internally. float64 means that it's using a 64-bit floating point number; int64 means a similarly sized integer instead, and so on.

One peculiarity to keep in mind (and on display very clearly here) is that columns consisting entirely of strings do not get their own type; they are instead given the object type.

It's possible to convert a column of one type into another wherever such a conversion makes sense by using the astype() function. For example, we may transform the points column from its existing int64 data type into a float64 data type:


In [0]:
reviews.points.astype('float64')


Out[0]:
0         87.0
1         87.0
          ... 
129969    90.0
129970    90.0
Name: points, Length: 129971, dtype: float64

A DataFrame or Series index has its own dtype, too:


In [0]:
reviews.index.dtype


Out[0]:
dtype('int64')

Pandas also supports more exotic data types, such as categorical data and timeseries data. Because these data types are more rarely used, we will omit them until a much later section of this tutorial.

Missing data

Entries missing values are given the value NaN, short for "Not a Number". For technical reasons these NaN values are always of the float64 dtype.

Pandas provides some methods specific to missing data. To select NaN entries you can use pd.isnull() (or its companion pd.notnull()). This is meant to be used thusly:


In [0]:
reviews[pd.isnull(reviews.country)]


Out[0]:
country description designation points price province region_1 region_2 taster_name taster_twitter_handle title variety winery
913 NaN Amber in color, this wine has aromas of peach ... Asureti Valley 87 30.0 NaN NaN NaN Mike DeSimone @worldwineguys Gotsa Family Wines 2014 Asureti Valley Chinuri Chinuri Gotsa Family Wines
3131 NaN Soft, fruity and juicy, this is a pleasant, si... Partager 83 NaN NaN NaN NaN Roger Voss @vossroger Barton & Guestier NV Partager Red Red Blend Barton & Guestier
... ... ... ... ... ... ... ... ... ... ... ... ... ...
129590 NaN A blend of 60% Syrah, 30% Cabernet Sauvignon a... Shah 90 30.0 NaN NaN NaN Mike DeSimone @worldwineguys Büyülübağ 2012 Shah Red Red Blend Büyülübağ
129900 NaN This wine offers a delightful bouquet of black... NaN 91 32.0 NaN NaN NaN Mike DeSimone @worldwineguys Psagot 2014 Merlot Merlot Psagot

63 rows × 13 columns

Replacing missing values is a common operation. Pandas provides a really handy method for this problem: fillna(). fillna() provides a few different strategies for mitigating such data. For example, we can simply replace each NaN with an "Unknown":


In [0]:
reviews.region_2.fillna("Unknown")


Out[0]:
0         Unknown
1         Unknown
           ...   
129969    Unknown
129970    Unknown
Name: region_2, Length: 129971, dtype: object

Or we could fill each missing value with the first non-null value that appears sometime after the given record in the database. This is known as the backfill strategy.

Alternatively, we may have a non-null value that we would like to replace. For example, suppose that since this dataset was published, reviewer Kerin O'Keefe has changed her Twitter handle from @kerinokeefe to @kerino. One way to reflect this in the dataset is using the replace() method:


In [0]:
reviews.taster_twitter_handle.replace("@kerinokeefe", "@kerino")


Out[0]:
0            @kerino
1         @vossroger
             ...    
129969    @vossroger
129970    @vossroger
Name: taster_twitter_handle, Length: 129971, dtype: object

The replace() method is worth mentioning here because it's handy for replacing missing data which is given some kind of sentinel value in the dataset: things like "Unknown", "Undisclosed", "Invalid", and so on.


In [0]:
# import pandas as pd

# reviews = pd.read_csv("../input/wine-reviews/winemag-data-130k-v2.csv", index_col=0)

# from learntools.core import binder; binder.bind(globals())
# from learntools.pandas.data_types_and_missing_data import *
# print("Setup complete.")

Exercises

1. What is the data type of the points column in the dataset?


In [0]:
# Your code here
dtype = reviews.points.dtype

2.

Create a Series from entries in the points column, but convert the entries to strings. Hint: strings are str in native Python.


In [0]:
point_strings = reviews.points.astype(str)

point_strings


Out[0]:
0         87
1         87
          ..
129969    90
129970    90
Name: points, Length: 129971, dtype: object

3.

Sometimes the price column is null. How many reviews in the dataset are missing a price?


In [0]:
missing_price_reviews = reviews[reviews.price.isnull()]
n_missing_prices = len(missing_price_reviews)
n_missing_prices


Out[0]:
8996

In [0]:
# Cute alternative solution: if we sum a boolean series, True is treated as 1 and False as 0
n_missing_prices = reviews.price.isnull().sum()
n_missing_prices


Out[0]:
8996

In [0]:
# or equivalently:
n_missing_prices = pd.isnull(reviews.price).sum()
n_missing_prices


Out[0]:
8996

In [0]:
## 4.
What are the most common wine-producing regions? Create a Series counting the number of times each value occurs in the `region_1` field. This field is often missing data, so replace missing values with `Unknown`. Sort in descending order.  Your output should look something like this:

```
Unknown                    21247
Napa Valley                 4480
                           ...  
Bardolino Superiore            1
Primitivo del Tarantino        1
Name: region_1, Length: 1230, dtype: int64

In [0]:
reviews_per_region = reviews.region_1.fillna('Unknown').value_counts().sort_values(ascending=False)
reviews_per_region


Out[0]:
Unknown             21247
Napa Valley          4480
                    ...  
Canada                  1
Lake-Sonoma-Napa        1
Name: region_1, Length: 1230, dtype: int64

Introduction

Oftentimes data will come to us with column names, index names, or other naming conventions that we are not satisfied with. In that case, you'll learn how to use pandas functions to change the names of the offending entries to something better.

You'll also explore how to combine data from multiple DataFrames and/or Series.

Renaming

The first function we'll introduce here is rename(), which lets you change index names and/or column names. For example, to change the points column in our dataset to score, we would do:


In [0]:
import pandas as pd

# reviews = pd.read_csv("../input/wine-reviews/winemag-data-130k-v2.csv", index_col=0)
reviews = pd.read_csv("winemag-data-130k-v2.csv", index_col=0)
# from learntools.core import binder; binder.bind(globals())
# from learntools.pandas.renaming_and_combining import *
# print("Setup complete.")

In [0]:
reviews.head()


Out[0]:
country description designation points price province region_1 region_2 taster_name taster_twitter_handle title variety winery
0 Italy Aromas include tropical fruit, broom, brimston... Vulkà Bianco 87 NaN Sicily & Sardinia Etna NaN Kerin O’Keefe @kerinokeefe Nicosia 2013 Vulkà Bianco (Etna) White Blend Nicosia
1 Portugal This is ripe and fruity, a wine that is smooth... Avidagos 87 15.0 Douro NaN NaN Roger Voss @vossroger Quinta dos Avidagos 2011 Avidagos Red (Douro) Portuguese Red Quinta dos Avidagos
2 US Tart and snappy, the flavors of lime flesh and... NaN 87 14.0 Oregon Willamette Valley Willamette Valley Paul Gregutt @paulgwine Rainstorm 2013 Pinot Gris (Willamette Valley) Pinot Gris Rainstorm
3 US Pineapple rind, lemon pith and orange blossom ... Reserve Late Harvest 87 13.0 Michigan Lake Michigan Shore NaN Alexander Peartree NaN St. Julian 2013 Reserve Late Harvest Riesling ... Riesling St. Julian
4 US Much like the regular bottling from 2012, this... Vintner's Reserve Wild Child Block 87 65.0 Oregon Willamette Valley Willamette Valley Paul Gregutt @paulgwine Sweet Cheeks 2012 Vintner's Reserve Wild Child... Pinot Noir Sweet Cheeks

In [0]:
reviews.rename(columns={'points': 'score'})


Out[0]:
country description designation score price province region_1 region_2 taster_name taster_twitter_handle title variety winery
0 Italy Aromas include tropical fruit, broom, brimston... Vulkà Bianco 87 NaN Sicily & Sardinia Etna NaN Kerin O’Keefe @kerinokeefe Nicosia 2013 Vulkà Bianco (Etna) White Blend Nicosia
1 Portugal This is ripe and fruity, a wine that is smooth... Avidagos 87 15.0 Douro NaN NaN Roger Voss @vossroger Quinta dos Avidagos 2011 Avidagos Red (Douro) Portuguese Red Quinta dos Avidagos
2 US Tart and snappy, the flavors of lime flesh and... NaN 87 14.0 Oregon Willamette Valley Willamette Valley Paul Gregutt @paulgwine Rainstorm 2013 Pinot Gris (Willamette Valley) Pinot Gris Rainstorm
3 US Pineapple rind, lemon pith and orange blossom ... Reserve Late Harvest 87 13.0 Michigan Lake Michigan Shore NaN Alexander Peartree NaN St. Julian 2013 Reserve Late Harvest Riesling ... Riesling St. Julian
4 US Much like the regular bottling from 2012, this... Vintner's Reserve Wild Child Block 87 65.0 Oregon Willamette Valley Willamette Valley Paul Gregutt @paulgwine Sweet Cheeks 2012 Vintner's Reserve Wild Child... Pinot Noir Sweet Cheeks
... ... ... ... ... ... ... ... ... ... ... ... ... ...
129966 Germany Notes of honeysuckle and cantaloupe sweeten th... Brauneberger Juffer-Sonnenuhr Spätlese 90 28.0 Mosel NaN NaN Anna Lee C. Iijima NaN Dr. H. Thanisch (Erben Müller-Burggraef) 2013 ... Riesling Dr. H. Thanisch (Erben Müller-Burggraef)
129967 US Citation is given as much as a decade of bottl... NaN 90 75.0 Oregon Oregon Oregon Other Paul Gregutt @paulgwine Citation 2004 Pinot Noir (Oregon) Pinot Noir Citation
129968 France Well-drained gravel soil gives this wine its c... Kritt 90 30.0 Alsace Alsace NaN Roger Voss @vossroger Domaine Gresser 2013 Kritt Gewurztraminer (Als... Gewürztraminer Domaine Gresser
129969 France A dry style of Pinot Gris, this is crisp with ... NaN 90 32.0 Alsace Alsace NaN Roger Voss @vossroger Domaine Marcel Deiss 2012 Pinot Gris (Alsace) Pinot Gris Domaine Marcel Deiss
129970 France Big, rich and off-dry, this is powered by inte... Lieu-dit Harth Cuvée Caroline 90 21.0 Alsace Alsace NaN Roger Voss @vossroger Domaine Schoffit 2012 Lieu-dit Harth Cuvée Car... Gewürztraminer Domaine Schoffit

129971 rows × 13 columns

rename() lets you rename index or column values by specifying a index or column keyword parameter, respectively. It supports a variety of input formats, but usually a Python dictionary is the most convenient. Here is an example using it to rename some elements of the index.


In [0]:
reviews.rename(index={0: 'firstEntry', 1: 'secondEntry'})


Out[0]:
country description designation points price province region_1 region_2 taster_name taster_twitter_handle title variety winery
firstEntry Italy Aromas include tropical fruit, broom, brimston... Vulkà Bianco 87 NaN Sicily & Sardinia Etna NaN Kerin O’Keefe @kerinokeefe Nicosia 2013 Vulkà Bianco (Etna) White Blend Nicosia
secondEntry Portugal This is ripe and fruity, a wine that is smooth... Avidagos 87 15.0 Douro NaN NaN Roger Voss @vossroger Quinta dos Avidagos 2011 Avidagos Red (Douro) Portuguese Red Quinta dos Avidagos
2 US Tart and snappy, the flavors of lime flesh and... NaN 87 14.0 Oregon Willamette Valley Willamette Valley Paul Gregutt @paulgwine Rainstorm 2013 Pinot Gris (Willamette Valley) Pinot Gris Rainstorm
3 US Pineapple rind, lemon pith and orange blossom ... Reserve Late Harvest 87 13.0 Michigan Lake Michigan Shore NaN Alexander Peartree NaN St. Julian 2013 Reserve Late Harvest Riesling ... Riesling St. Julian
4 US Much like the regular bottling from 2012, this... Vintner's Reserve Wild Child Block 87 65.0 Oregon Willamette Valley Willamette Valley Paul Gregutt @paulgwine Sweet Cheeks 2012 Vintner's Reserve Wild Child... Pinot Noir Sweet Cheeks
... ... ... ... ... ... ... ... ... ... ... ... ... ...
129966 Germany Notes of honeysuckle and cantaloupe sweeten th... Brauneberger Juffer-Sonnenuhr Spätlese 90 28.0 Mosel NaN NaN Anna Lee C. Iijima NaN Dr. H. Thanisch (Erben Müller-Burggraef) 2013 ... Riesling Dr. H. Thanisch (Erben Müller-Burggraef)
129967 US Citation is given as much as a decade of bottl... NaN 90 75.0 Oregon Oregon Oregon Other Paul Gregutt @paulgwine Citation 2004 Pinot Noir (Oregon) Pinot Noir Citation
129968 France Well-drained gravel soil gives this wine its c... Kritt 90 30.0 Alsace Alsace NaN Roger Voss @vossroger Domaine Gresser 2013 Kritt Gewurztraminer (Als... Gewürztraminer Domaine Gresser
129969 France A dry style of Pinot Gris, this is crisp with ... NaN 90 32.0 Alsace Alsace NaN Roger Voss @vossroger Domaine Marcel Deiss 2012 Pinot Gris (Alsace) Pinot Gris Domaine Marcel Deiss
129970 France Big, rich and off-dry, this is powered by inte... Lieu-dit Harth Cuvée Caroline 90 21.0 Alsace Alsace NaN Roger Voss @vossroger Domaine Schoffit 2012 Lieu-dit Harth Cuvée Car... Gewürztraminer Domaine Schoffit

129971 rows × 13 columns

You'll probably rename columns very often, but rename index values very rarely. For that, set_index() is usually more convenient.

Both the row index and the column index can have their own name attribute. The complimentary rename_axis() method may be used to change these names. For example:


In [0]:
reviews.rename_axis("wines", axis='rows').rename_axis("fields", axis='columns')


Out[0]:
fields country description designation points price province region_1 region_2 taster_name taster_twitter_handle title variety winery
wines
0 Italy Aromas include tropical fruit, broom, brimston... Vulkà Bianco 87 NaN Sicily & Sardinia Etna NaN Kerin O’Keefe @kerinokeefe Nicosia 2013 Vulkà Bianco (Etna) White Blend Nicosia
1 Portugal This is ripe and fruity, a wine that is smooth... Avidagos 87 15.0 Douro NaN NaN Roger Voss @vossroger Quinta dos Avidagos 2011 Avidagos Red (Douro) Portuguese Red Quinta dos Avidagos
2 US Tart and snappy, the flavors of lime flesh and... NaN 87 14.0 Oregon Willamette Valley Willamette Valley Paul Gregutt @paulgwine Rainstorm 2013 Pinot Gris (Willamette Valley) Pinot Gris Rainstorm
3 US Pineapple rind, lemon pith and orange blossom ... Reserve Late Harvest 87 13.0 Michigan Lake Michigan Shore NaN Alexander Peartree NaN St. Julian 2013 Reserve Late Harvest Riesling ... Riesling St. Julian
4 US Much like the regular bottling from 2012, this... Vintner's Reserve Wild Child Block 87 65.0 Oregon Willamette Valley Willamette Valley Paul Gregutt @paulgwine Sweet Cheeks 2012 Vintner's Reserve Wild Child... Pinot Noir Sweet Cheeks
... ... ... ... ... ... ... ... ... ... ... ... ... ...
129966 Germany Notes of honeysuckle and cantaloupe sweeten th... Brauneberger Juffer-Sonnenuhr Spätlese 90 28.0 Mosel NaN NaN Anna Lee C. Iijima NaN Dr. H. Thanisch (Erben Müller-Burggraef) 2013 ... Riesling Dr. H. Thanisch (Erben Müller-Burggraef)
129967 US Citation is given as much as a decade of bottl... NaN 90 75.0 Oregon Oregon Oregon Other Paul Gregutt @paulgwine Citation 2004 Pinot Noir (Oregon) Pinot Noir Citation
129968 France Well-drained gravel soil gives this wine its c... Kritt 90 30.0 Alsace Alsace NaN Roger Voss @vossroger Domaine Gresser 2013 Kritt Gewurztraminer (Als... Gewürztraminer Domaine Gresser
129969 France A dry style of Pinot Gris, this is crisp with ... NaN 90 32.0 Alsace Alsace NaN Roger Voss @vossroger Domaine Marcel Deiss 2012 Pinot Gris (Alsace) Pinot Gris Domaine Marcel Deiss
129970 France Big, rich and off-dry, this is powered by inte... Lieu-dit Harth Cuvée Caroline 90 21.0 Alsace Alsace NaN Roger Voss @vossroger Domaine Schoffit 2012 Lieu-dit Harth Cuvée Car... Gewürztraminer Domaine Schoffit

129971 rows × 13 columns

Combining

When performing operations on a dataset, we will sometimes need to combine different DataFrames and/or Series in non-trivial ways. Pandas has three core methods for doing this. In order of increasing complexity, these are concat(), join(), and merge(). Most of what merge() can do can also be done more simply with join(), so we will omit it and focus on the first two functions here.

The simplest combining method is concat(). Given a list of elements, this function will smush those elements together along an axis.

This is useful when we have data in different DataFrame or Series objects but having the same fields (columns). One example: the YouTube Videos dataset, which splits the data up based on country of origin (e.g. Canada and the UK, in this example). If we want to study multiple countries simultaneously, we can use concat() to smush them together:

https://www.kaggle.com/datasnaek/youtube-new ; datos


In [0]:
# canadian_youtube = pd.read_csv("../input/youtube-new/CAvideos.csv")
# british_youtube = pd.read_csv("../input/youtube-new/GBvideos.csv")

canadian_youtube = pd.read_csv("CAvideos.csv")
british_youtube = pd.read_csv("GBvideos.csv")

pd.concat([canadian_youtube, british_youtube])


Out[0]:
video_id trending_date title channel_title category_id publish_time tags views likes dislikes comment_count thumbnail_link comments_disabled ratings_disabled video_error_or_removed description
0 n1WpP7iowLc 17.14.11 Eminem - Walk On Water (Audio) ft. Beyoncé EminemVEVO 10 2017-11-10T17:00:03.000Z Eminem|"Walk"|"On"|"Water"|"Aftermath/Shady/In... 17158579 787425 43420 125882 https://i.ytimg.com/vi/n1WpP7iowLc/default.jpg False False False Eminem's new track Walk on Water ft. Beyoncé i...
1 0dBIkQ4Mz1M 17.14.11 PLUSH - Bad Unboxing Fan Mail iDubbbzTV 23 2017-11-13T17:00:00.000Z plush|"bad unboxing"|"unboxing"|"fan mail"|"id... 1014651 127794 1688 13030 https://i.ytimg.com/vi/0dBIkQ4Mz1M/default.jpg False False False STill got a lot of packages. Probably will las...
2 5qpjK5DgCt4 17.14.11 Racist Superman | Rudy Mancuso, King Bach & Le... Rudy Mancuso 23 2017-11-12T19:05:24.000Z racist superman|"rudy"|"mancuso"|"king"|"bach"... 3191434 146035 5339 8181 https://i.ytimg.com/vi/5qpjK5DgCt4/default.jpg False False False WATCH MY PREVIOUS VIDEO ▶ \n\nSUBSCRIBE ► http...
3 d380meD0W0M 17.14.11 I Dare You: GOING BALD!? nigahiga 24 2017-11-12T18:01:41.000Z ryan|"higa"|"higatv"|"nigahiga"|"i dare you"|"... 2095828 132239 1989 17518 https://i.ytimg.com/vi/d380meD0W0M/default.jpg False False False I know it's been a while since we did this sho...
4 2Vv-BfVoq4g 17.14.11 Ed Sheeran - Perfect (Official Music Video) Ed Sheeran 10 2017-11-09T11:04:14.000Z edsheeran|"ed sheeran"|"acoustic"|"live"|"cove... 33523622 1634130 21082 85067 https://i.ytimg.com/vi/2Vv-BfVoq4g/default.jpg False False False 🎧: https://ad.gt/yt-perfect\n💰: https://atlant...
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
38911 l884wKofd54 18.14.06 Enrique Iglesias - MOVE TO MIAMI (Official Vid... EnriqueIglesiasVEVO 10 2018-05-09T07:00:01.000Z Enrique Iglesias feat. Pitbull|"MOVE TO MIAMI"... 25066952 268088 12783 9933 https://i.ytimg.com/vi/l884wKofd54/default.jpg False False False NEW SONG - MOVE TO MIAMI feat. Pitbull (Click ...
38912 IP8k2xkhOdI 18.14.06 Jacob Sartorius - Up With It (Official Music V... Jacob Sartorius 10 2018-05-11T17:09:16.000Z jacob sartorius|"jacob"|"up with it"|"jacob sa... 1492219 61998 13781 24330 https://i.ytimg.com/vi/IP8k2xkhOdI/default.jpg False False False THE OFFICIAL UP WITH IT MUSIC VIDEO!Get my new...
38913 Il-an3K9pjg 18.14.06 Anne-Marie - 2002 [Official Video] Anne-Marie 10 2018-05-08T11:05:08.000Z anne|"marie"|"anne-marie"|"2002"|"two thousand... 29641412 394830 8892 19988 https://i.ytimg.com/vi/Il-an3K9pjg/default.jpg False False False Get 2002 by Anne-Marie HERE ▶ http://ad.gt/200...
38914 -DRsfNObKIQ 18.14.06 Eleni Foureira - Fuego - Cyprus - LIVE - First... Eurovision Song Contest 24 2018-05-08T20:32:32.000Z Eurovision Song Contest|"2018"|"Lisbon"|"Cypru... 14317515 151870 45875 26766 https://i.ytimg.com/vi/-DRsfNObKIQ/default.jpg False False False Eleni Foureira represented Cyprus at the first...
38915 4YFo4bdMO8Q 18.14.06 KYLE - Ikuyo feat. 2 Chainz & Sophia Black [A... SuperDuperKyle 10 2018-05-11T04:06:35.000Z Kyle|"SuperDuperKyle"|"Ikuyo"|"2 Chainz"|"Soph... 607552 18271 274 1423 https://i.ytimg.com/vi/4YFo4bdMO8Q/default.jpg False False False Debut album 'Light of Mine' out now: http://ky...

79797 rows × 16 columns

The middlemost combiner in terms of complexity is join(). join() lets you combine different DataFrame objects which have an index in common. For example, to pull down videos that happened to be trending on the same day in both Canada and the UK, we could do the following:


In [0]:
left = canadian_youtube.set_index(['title', 'trending_date'])
right = british_youtube.set_index(['title', 'trending_date'])

left.join(right, lsuffix='_CAN', rsuffix='_UK')


Out[0]:
video_id_CAN channel_title_CAN category_id_CAN publish_time_CAN tags_CAN views_CAN likes_CAN dislikes_CAN comment_count_CAN thumbnail_link_CAN comments_disabled_CAN ratings_disabled_CAN video_error_or_removed_CAN description_CAN video_id_UK channel_title_UK category_id_UK publish_time_UK tags_UK views_UK likes_UK dislikes_UK comment_count_UK thumbnail_link_UK comments_disabled_UK ratings_disabled_UK video_error_or_removed_UK description_UK
title trending_date
!! THIS VIDEO IS NOTHING BUT PAIN !! | Getting Over It - Part 7 18.04.01 PNn8sECd7io Markiplier 20 2018-01-03T19:33:53.000Z getting over it|"markiplier"|"funny moments"|"... 835930 47058 1023 8250 https://i.ytimg.com/vi/PNn8sECd7io/default.jpg False False False Getting Over It continues with RAGE BEYOND ALL... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
#1 Fortnite World Rank - 2,323 Solo Wins! 18.09.03 DvPW66IFhMI AlexRamiGaming 20 2018-03-09T07:15:52.000Z PS4 Battle Royale|"PS4 Pro Battle Royale"|"Bat... 212838 5199 542 11 https://i.ytimg.com/vi/DvPW66IFhMI/default.jpg False False False Discord For EVERYONE - https://discord.gg/nhud... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
#1 Fortnite World Rank - 2,330 Solo Wins! 18.10.03 EXEaMjFeiEk AlexRamiGaming 20 2018-03-10T06:26:17.000Z PS4 Battle Royale|"PS4 Pro Battle Royale"|"Bat... 200764 5620 537 45 https://i.ytimg.com/vi/EXEaMjFeiEk/default.jpg False False False Discord For EVERYONE - https://discord.gg/nhud... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
#1 MOST ANTICIPATED VIDEO (Timber Frame House Raising) 17.20.12 bYvQmusLaxw Pure Living for Life 24 2017-12-20T02:49:11.000Z timber frame|"timber framing"|"timber frame ra... 79152 7761 159 1965 https://i.ytimg.com/vi/bYvQmusLaxw/default.jpg False False False Shelter Institute: http://bit.ly/2iwXj8B\nFull... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
17.21.12 bYvQmusLaxw Pure Living for Life 24 2017-12-20T02:49:11.000Z timber frame|"timber framing"|"timber frame ra... 232762 15515 329 3601 https://i.ytimg.com/vi/bYvQmusLaxw/default.jpg False False False Shelter Institute: http://bit.ly/2iwXj8B\nFull... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
😲She Is So Nervous But BLOWS The ROOF After Taking on OPERA Song! | Britain´s Got Talent 2018 18.02.05 WttN1Z0XF4k How Talented 24 2018-04-28T19:40:58.000Z bgt|"bgt 2018"|"britain got talent"|"britain´s... 713400 4684 260 266 https://i.ytimg.com/vi/WttN1Z0XF4k/default.jpg False False False Amy Marie Borg - Britain´s Got Talent 2018\n\n... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
18.29.04 WttN1Z0XF4k How Talented 24 2018-04-28T19:40:58.000Z bgt|"bgt 2018"|"britain got talent"|"britain´s... 231906 1924 78 146 https://i.ytimg.com/vi/WttN1Z0XF4k/default.jpg False False False Amy Marie Borg - Britain´s Got Talent 2018\n\n... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
18.30.04 WttN1Z0XF4k How Talented 24 2018-04-28T19:40:58.000Z bgt|"bgt 2018"|"britain got talent"|"britain´s... 476253 3417 176 240 https://i.ytimg.com/vi/WttN1Z0XF4k/default.jpg False False False Amy Marie Borg - Britain´s Got Talent 2018\n\n... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
🚨 BREAKING NEWS 🔴 Raja Live all Slot Channels Welcome 🎰 18.07.05 Wt9Gkpmbt44 TheBigJackpot 24 2018-05-07T06:58:59.000Z Slot Machine|"win"|"Gambling"|"Big Win"|"raja"... 28973 2167 175 10 https://i.ytimg.com/vi/Wt9Gkpmbt44/default.jpg False False False The Raja takes matters in to his own hands ton... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
🚨Active Shooter at YouTube Headquarters - LIVE BREAKING NEWS COVERAGE 18.04.04 Az72jrKbANA Right Side Broadcasting Network 25 2018-04-03T23:12:37.000Z YouTube shooter|"YouTube active shooter"|"acti... 103513 1722 181 76 https://i.ytimg.com/vi/Az72jrKbANA/default.jpg False False False An active shooter has been reported at the You... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN

40900 rows × 28 columns

The lsuffix and rsuffix parameters are necessary here because the data has the same column names in both British and Canadian datasets. If this wasn't true (because, say, we'd renamed them beforehand) we wouldn't need them.


In [0]:
# import pandas as pd

# reviews = pd.read_csv("../input/wine-reviews/winemag-data-130k-v2.csv", index_col=0)

# from learntools.core import binder; binder.bind(globals())
# from learntools.pandas.renaming_and_combining import *
# print("Setup complete.")

Exercises

View the first several lines of your data by running the cell below:


In [0]:
reviews.head()


Out[0]:
country description designation points price province region_1 region_2 taster_name taster_twitter_handle title variety winery
0 Italy Aromas include tropical fruit, broom, brimston... Vulkà Bianco 87 NaN Sicily & Sardinia Etna NaN Kerin O’Keefe @kerinokeefe Nicosia 2013 Vulkà Bianco (Etna) White Blend Nicosia
1 Portugal This is ripe and fruity, a wine that is smooth... Avidagos 87 15.0 Douro NaN NaN Roger Voss @vossroger Quinta dos Avidagos 2011 Avidagos Red (Douro) Portuguese Red Quinta dos Avidagos
2 US Tart and snappy, the flavors of lime flesh and... NaN 87 14.0 Oregon Willamette Valley Willamette Valley Paul Gregutt @paulgwine Rainstorm 2013 Pinot Gris (Willamette Valley) Pinot Gris Rainstorm
3 US Pineapple rind, lemon pith and orange blossom ... Reserve Late Harvest 87 13.0 Michigan Lake Michigan Shore NaN Alexander Peartree NaN St. Julian 2013 Reserve Late Harvest Riesling ... Riesling St. Julian
4 US Much like the regular bottling from 2012, this... Vintner's Reserve Wild Child Block 87 65.0 Oregon Willamette Valley Willamette Valley Paul Gregutt @paulgwine Sweet Cheeks 2012 Vintner's Reserve Wild Child... Pinot Noir Sweet Cheeks

1.

region_1 and region_2 are pretty uninformative names for locale columns in the dataset. Create a copy of reviews with these columns renamed to region and locale, respectively.


In [0]:
renamed = reviews.rename(columns=dict(region_1='region', region_2='locale'))
# q1.check()
renamed


Out[0]:
country description designation points price province region locale taster_name taster_twitter_handle title variety winery
0 Italy Aromas include tropical fruit, broom, brimston... Vulkà Bianco 87 NaN Sicily & Sardinia Etna NaN Kerin O’Keefe @kerinokeefe Nicosia 2013 Vulkà Bianco (Etna) White Blend Nicosia
1 Portugal This is ripe and fruity, a wine that is smooth... Avidagos 87 15.0 Douro NaN NaN Roger Voss @vossroger Quinta dos Avidagos 2011 Avidagos Red (Douro) Portuguese Red Quinta dos Avidagos
2 US Tart and snappy, the flavors of lime flesh and... NaN 87 14.0 Oregon Willamette Valley Willamette Valley Paul Gregutt @paulgwine Rainstorm 2013 Pinot Gris (Willamette Valley) Pinot Gris Rainstorm
3 US Pineapple rind, lemon pith and orange blossom ... Reserve Late Harvest 87 13.0 Michigan Lake Michigan Shore NaN Alexander Peartree NaN St. Julian 2013 Reserve Late Harvest Riesling ... Riesling St. Julian
4 US Much like the regular bottling from 2012, this... Vintner's Reserve Wild Child Block 87 65.0 Oregon Willamette Valley Willamette Valley Paul Gregutt @paulgwine Sweet Cheeks 2012 Vintner's Reserve Wild Child... Pinot Noir Sweet Cheeks
... ... ... ... ... ... ... ... ... ... ... ... ... ...
129966 Germany Notes of honeysuckle and cantaloupe sweeten th... Brauneberger Juffer-Sonnenuhr Spätlese 90 28.0 Mosel NaN NaN Anna Lee C. Iijima NaN Dr. H. Thanisch (Erben Müller-Burggraef) 2013 ... Riesling Dr. H. Thanisch (Erben Müller-Burggraef)
129967 US Citation is given as much as a decade of bottl... NaN 90 75.0 Oregon Oregon Oregon Other Paul Gregutt @paulgwine Citation 2004 Pinot Noir (Oregon) Pinot Noir Citation
129968 France Well-drained gravel soil gives this wine its c... Kritt 90 30.0 Alsace Alsace NaN Roger Voss @vossroger Domaine Gresser 2013 Kritt Gewurztraminer (Als... Gewürztraminer Domaine Gresser
129969 France A dry style of Pinot Gris, this is crisp with ... NaN 90 32.0 Alsace Alsace NaN Roger Voss @vossroger Domaine Marcel Deiss 2012 Pinot Gris (Alsace) Pinot Gris Domaine Marcel Deiss
129970 France Big, rich and off-dry, this is powered by inte... Lieu-dit Harth Cuvée Caroline 90 21.0 Alsace Alsace NaN Roger Voss @vossroger Domaine Schoffit 2012 Lieu-dit Harth Cuvée Car... Gewürztraminer Domaine Schoffit

129971 rows × 13 columns

2.

Set the index name in the dataset to wines.


In [0]:
reindexed = reviews.rename_axis('wines', axis='rows')
reindexed


Out[0]:
country description designation points price province region_1 region_2 taster_name taster_twitter_handle title variety winery
wines
0 Italy Aromas include tropical fruit, broom, brimston... Vulkà Bianco 87 NaN Sicily & Sardinia Etna NaN Kerin O’Keefe @kerinokeefe Nicosia 2013 Vulkà Bianco (Etna) White Blend Nicosia
1 Portugal This is ripe and fruity, a wine that is smooth... Avidagos 87 15.0 Douro NaN NaN Roger Voss @vossroger Quinta dos Avidagos 2011 Avidagos Red (Douro) Portuguese Red Quinta dos Avidagos
2 US Tart and snappy, the flavors of lime flesh and... NaN 87 14.0 Oregon Willamette Valley Willamette Valley Paul Gregutt @paulgwine Rainstorm 2013 Pinot Gris (Willamette Valley) Pinot Gris Rainstorm
3 US Pineapple rind, lemon pith and orange blossom ... Reserve Late Harvest 87 13.0 Michigan Lake Michigan Shore NaN Alexander Peartree NaN St. Julian 2013 Reserve Late Harvest Riesling ... Riesling St. Julian
4 US Much like the regular bottling from 2012, this... Vintner's Reserve Wild Child Block 87 65.0 Oregon Willamette Valley Willamette Valley Paul Gregutt @paulgwine Sweet Cheeks 2012 Vintner's Reserve Wild Child... Pinot Noir Sweet Cheeks
... ... ... ... ... ... ... ... ... ... ... ... ... ...
129966 Germany Notes of honeysuckle and cantaloupe sweeten th... Brauneberger Juffer-Sonnenuhr Spätlese 90 28.0 Mosel NaN NaN Anna Lee C. Iijima NaN Dr. H. Thanisch (Erben Müller-Burggraef) 2013 ... Riesling Dr. H. Thanisch (Erben Müller-Burggraef)
129967 US Citation is given as much as a decade of bottl... NaN 90 75.0 Oregon Oregon Oregon Other Paul Gregutt @paulgwine Citation 2004 Pinot Noir (Oregon) Pinot Noir Citation
129968 France Well-drained gravel soil gives this wine its c... Kritt 90 30.0 Alsace Alsace NaN Roger Voss @vossroger Domaine Gresser 2013 Kritt Gewurztraminer (Als... Gewürztraminer Domaine Gresser
129969 France A dry style of Pinot Gris, this is crisp with ... NaN 90 32.0 Alsace Alsace NaN Roger Voss @vossroger Domaine Marcel Deiss 2012 Pinot Gris (Alsace) Pinot Gris Domaine Marcel Deiss
129970 France Big, rich and off-dry, this is powered by inte... Lieu-dit Harth Cuvée Caroline 90 21.0 Alsace Alsace NaN Roger Voss @vossroger Domaine Schoffit 2012 Lieu-dit Harth Cuvée Car... Gewürztraminer Domaine Schoffit

129971 rows × 13 columns

3.

The Things on Reddit dataset includes product links from a selection of top-ranked forums ("subreddits") on reddit.com. Run the cell below to load a dataframe of products mentioned on the /r/gaming subreddit and another dataframe for products mentioned on the r//movies subreddit.


In [0]:
# gaming_products = pd.read_csv("../input/things-on-reddit/top-things/top-things/reddits/g/gaming.csv")
gaming_products = pd.read_csv("gaming.csv")
gaming_products['subreddit'] = "r/gaming"


# movie_products = pd.read_csv("../input/things-on-reddit/top-things/top-things/reddits/m/movies.csv")
movie_products = pd.read_csv("movies.csv")
movie_products['subreddit'] = "r/movies"

Create a DataFrame of products mentioned on either subreddit.


In [0]:
combined_products = pd.concat([gaming_products, movie_products])

# q3.check()
combined_products.head()


Out[0]:
name category amazon_link total_mentions subreddit_mentions subreddit
0 BOOMco Halo Covenant Needler Blaster Toys & Games https://www.amazon.com/BOOMco-Halo-Covenant-Ne... 4.0 4 r/gaming
1 Raspberry PI 3 Model B 1.2GHz 64-bit quad-core... Electronics https://www.amazon.com/Raspberry-Model-A1-2GHz... 19.0 3 r/gaming
2 CanaKit 5V 2.5A Raspberry Pi 3 Power Supply / ... Electronics https://www.amazon.com/CanaKit-Raspberry-Suppl... 7.0 3 r/gaming
3 Panasonic K-KJ17MCA4BA Advanced Individual Cel... Electronics https://www.amazon.com/Panasonic-Advanced-Indi... 29.0 2 r/gaming
4 Mayflash GameCube Controller Adapter for Wii U... Electronics https://www.amazon.com/GameCube-Controller-Ada... 24.0 2 r/gaming

4.

The Powerlifting Database dataset on Kaggle includes one CSV table for powerlifting meets and a separate one for powerlifting competitors. Run the cell below to load these datasets into dataframes:


In [0]:
# powerlifting_meets = pd.read_csv("../input/powerlifting-database/meets.csv")
# powerlifting_competitors = pd.read_csv("../input/powerlifting-database/openpowerlifting.csv")
powerlifting_meets = pd.read_csv("meets.csv")
powerlifting_meets.head()


Out[0]:
MeetID MeetPath Federation Date MeetCountry MeetState MeetTown MeetName
0 0 365strong/1601 365Strong 2016-10-29 USA NC Charlotte 2016 Junior & Senior National Powerlifting Cha...
1 1 365strong/1602 365Strong 2016-11-19 USA MO Ozark Thanksgiving Powerlifting Classic
2 2 365strong/1603 365Strong 2016-07-09 USA NC Charlotte Charlotte Europa Games
3 3 365strong/1604 365Strong 2016-06-11 USA SC Rock Hill Carolina Cup Push Pull Challenge
4 4 365strong/1605 365Strong 2016-04-10 USA SC Rock Hill Eastern USA Challenge

In [0]:
powerlifting_competitors = pd.read_csv("openpowerlifting.csv")
powerlifting_competitors.head()


Out[0]:
MeetID Name Sex Equipment Age Division BodyweightKg WeightClassKg Squat4Kg BestSquatKg Bench4Kg BestBenchKg Deadlift4Kg BestDeadliftKg TotalKg Place Wilks
0 0 Angie Belk Terry F Wraps 47.0 Mst 45-49 59.60 60 NaN 47.63 NaN 20.41 NaN 70.31 138.35 1 155.05
1 0 Dawn Bogart F Single-ply 42.0 Mst 40-44 58.51 60 NaN 142.88 NaN 95.25 NaN 163.29 401.42 1 456.38
2 0 Dawn Bogart F Single-ply 42.0 Open Senior 58.51 60 NaN 142.88 NaN 95.25 NaN 163.29 401.42 1 456.38
3 0 Dawn Bogart F Raw 42.0 Open Senior 58.51 60 NaN NaN NaN 95.25 NaN NaN 95.25 1 108.29
4 0 Destiny Dula F Raw 18.0 Teen 18-19 63.68 67.5 NaN NaN NaN 31.75 NaN 90.72 122.47 1 130.47

Both tables include references to a MeetID, a unique key for each meet (competition) included in the database. Using this, generate a dataset combining the two tables into one.


In [0]:
powerlifting_combined = powerlifting_meets.set_index("MeetID").join(powerlifting_competitors.set_index("MeetID"))

powerlifting_combined.head()


Out[0]:
MeetPath Federation Date MeetCountry MeetState MeetTown MeetName Name Sex Equipment Age Division BodyweightKg WeightClassKg Squat4Kg BestSquatKg Bench4Kg BestBenchKg Deadlift4Kg BestDeadliftKg TotalKg Place Wilks
MeetID
0 365strong/1601 365Strong 2016-10-29 USA NC Charlotte 2016 Junior & Senior National Powerlifting Cha... Angie Belk Terry F Wraps 47.0 Mst 45-49 59.60 60 NaN 47.63 NaN 20.41 NaN 70.31 138.35 1 155.05
0 365strong/1601 365Strong 2016-10-29 USA NC Charlotte 2016 Junior & Senior National Powerlifting Cha... Dawn Bogart F Single-ply 42.0 Mst 40-44 58.51 60 NaN 142.88 NaN 95.25 NaN 163.29 401.42 1 456.38
0 365strong/1601 365Strong 2016-10-29 USA NC Charlotte 2016 Junior & Senior National Powerlifting Cha... Dawn Bogart F Single-ply 42.0 Open Senior 58.51 60 NaN 142.88 NaN 95.25 NaN 163.29 401.42 1 456.38
0 365strong/1601 365Strong 2016-10-29 USA NC Charlotte 2016 Junior & Senior National Powerlifting Cha... Dawn Bogart F Raw 42.0 Open Senior 58.51 60 NaN NaN NaN 95.25 NaN NaN 95.25 1 108.29
0 365strong/1601 365Strong 2016-10-29 USA NC Charlotte 2016 Junior & Senior National Powerlifting Cha... Destiny Dula F Raw 18.0 Teen 18-19 63.68 67.5 NaN NaN NaN 31.75 NaN 90.72 122.47 1 130.47