https://www.kaggle.com/residentmario/creating-reading-and-writing
In [0]:
import pandas as pd
Creating data
There are two core objects in pandas: the DataFrame and the Series.
DataFrame
A DataFrame is a table. It contains an array of individual entries, each of which has a certain value. Each entry corresponds to a row (or record) and a column.
For example, consider the following simple DataFrame:
In [2]:
pd.DataFrame({'Yes': [50, 21], 'No': [131, 2]})
Out[2]:
In this example, the "0, No" entry has the value of 131. The "0, Yes" entry has a value of 50, and so on.
DataFrame entries are not limited to integers. For instance, here's a DataFrame whose values are strings:
In [3]:
pd.DataFrame({'Bob': ['I liked it.', 'It was awful.'], 'Sue': ['Pretty good.', 'Bland.']})
Out[3]:
The dictionary-list constructor assigns values to the column labels, but just uses an ascending count from 0 (0, 1, 2, 3, ...) for the row labels. Sometimes this is OK, but oftentimes we will want to assign these labels ourselves.
The list of row labels used in a DataFrame is known as an Index. We can assign values to it by using an index parameter in our constructor:
In [4]:
pd.DataFrame({'Bob': ['I liked it.', 'It was awful.'],
              'Sue': ['Pretty good.', 'Bland.']},
             index=['Product A', 'Product B'])
Out[4]:
Series
A Series, by contrast, is a sequence of data values. If a DataFrame is a table, a Series is a list. And in fact you can create one with nothing more than a list:
In [5]:
pd.Series([1, 2, 3, 4, 5])
Out[5]:
A Series is, in essence, a single column of a DataFrame. So you can assign column values to the Series the same way as before, using an index parameter. However, a Series does not have a column name, it only has one overall name:
In [6]:
pd.Series([30, 35, 40], index=['2015 Sales', '2016 Sales', '2017 Sales'], name='Product A')
Out[6]:
The Series and the DataFrame are intimately related. It's helpful to think of a DataFrame as actually being just a bunch of Series "glued together". We'll see more of this in the next section of this tutorial.
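To make the "glued together" picture concrete, here is a minimal sketch (not part of the original notebook): two Series sharing an index can be assembled into a DataFrame, and pulling a column back out returns a Series again.
```python
import pandas as pd

# Two Series with the same index, reusing the toy review data from above.
bob = pd.Series(['I liked it.', 'It was awful.'],
                index=['Product A', 'Product B'], name='Bob')
sue = pd.Series(['Pretty good.', 'Bland.'],
                index=['Product A', 'Product B'], name='Sue')

# "Glue" them together into a DataFrame...
df = pd.DataFrame({'Bob': bob, 'Sue': sue})

# ...and extracting a single column gives us a Series back.
type(df['Bob'])  # <class 'pandas.core.series.Series'>
```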
Being able to create a DataFrame or Series by hand is handy. But, most of the time, we won't actually be creating our own data by hand. Instead, we'll be working with data that already exists.
Data can be stored in any of a number of different forms and formats. By far the most basic of these is the humble CSV file. When you open a CSV file you get something that looks like this:
Product A,Product B,Product C,
30,21,9,
35,34,1,
41,11,11
So a CSV file is a table of values separated by commas. Hence the name: "Comma-Separated Values", or CSV.
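As a quick round-trip sketch (the file name toy.csv is just an illustration, not part of the original exercise), we can write that small table to disk by hand and read it back with pd.read_csv():
```python
import pandas as pd

# Write the toy CSV shown above to a file (trailing commas dropped)...
csv_text = "Product A,Product B,Product C\n30,21,9\n35,34,1\n41,11,11\n"
with open("toy.csv", "w") as f:
    f.write(csv_text)

# ...and read it back into a DataFrame.
pd.read_csv("toy.csv")
```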
Let's now set aside our toy datasets and see what a real dataset looks like when we read it into a DataFrame. We'll use the pd.read_csv() function to read the data into a DataFrame. This goes thusly:
In [0]:
# wine_reviews = pd.read_csv("../input/wine-reviews/winemag-data-130k-v2.csv")
To download the data:
https://www.kaggle.com/luisrivera/exercise-creating-reading-and-writing/edit
To upload the data to Google Colaboratory: D:\kaggle\Cursos\Panda
In [0]:
wine_reviews = pd.read_csv("winemag-data-130k-v2.csv")
We can use the shape attribute to check how large the resulting DataFrame is:
In [9]:
wine_reviews.shape
Out[9]:
So our new DataFrame has 130,000 records split across 14 different columns. That's almost 2 million entries!
We can examine the contents of the resultant DataFrame using the head() command, which grabs the first five rows:
In [10]:
wine_reviews.head()
Out[10]:
The pd.read_csv() function is well-endowed, with over 30 optional parameters you can specify. For example, you can see in this dataset that the CSV file has a built-in index, which pandas did not pick up on automatically. To make pandas use that column for the index (instead of creating a new one from scratch), we can specify an index_col.
In [0]:
# wine_reviews = pd.read_csv("../input/wine-reviews/winemag-data-130k-v2.csv", index_col=0)
# wine_reviews.head()
In [0]:
wine_reviews = pd.read_csv("winemag-data-130k-v2.csv", index_col=0)
wine_reviews.head()
Out[0]:
To practice directly on Kaggle:
https://www.kaggle.com/luisrivera/exercise-creating-reading-and-writing/edit
In [0]:
import pandas as pd
pd.set_option('display.max_rows', 5)
# from learntools.core import binder; binder.bind(globals())
# from learntools.pandas.creating_reading_and_writing import *
# print("Setup complete.")
In [12]:
# Your code goes here. Create a dataframe matching the above diagram and assign it to the variable fruits.
fruits = pd.DataFrame({'Apples': [30],
                       'Bananas': [21]})
#q1.check()
fruits
Out[12]:
In [14]:
fruits = pd.DataFrame([[30, 21]], columns=['Apples', 'Bananas'])
fruits
Out[14]:
In [15]:
fruit_sales = pd.DataFrame({'Apples': [35, 41],
                            'Bananas': [21, 34]},
                           index=['2017 Sales', '2018 Sales'])
fruit_sales
Out[15]:
In [16]:
quantities = ['4 cups', '1 cup', '2 large', '1 can']
items = ['Flour', 'Milk', 'Eggs', 'Spam']
recipe = pd.Series(quantities, index=items, name='Dinner')
recipe
Out[16]:
Read the following csv dataset of wine reviews into a DataFrame called reviews:
The filepath to the csv file is ../input/wine-reviews/winemag-data_first150k.csv. The first few lines look like:
,country,description,designation,points,price,province,region_1,region_2,variety,winery
0,US,"This tremendous 100% varietal wine[...]",Martha's Vineyard,96,235.0,California,Napa Valley,Napa,Cabernet Sauvignon,Heitz
1,Spain,"Ripe aromas of fig, blackberry and[...]",Carodorum Selección Especial Reserva,96,110.0,Northern Spain,Toro,,Tinta de Toro,Bodega Carmen Rodríguez
In [18]:
#reviews = pd.read_csv("../input/wine-reviews/winemag-data_first150k.csv", index_col=0)
reviews = pd.read_csv("winemag-data_first150k.csv", index_col=0)
reviews
Out[18]:
In [19]:
animals = pd.DataFrame({'Cows': [12, 20], 'Goats': [22, 19]},
                       index=['Year 1', 'Year 2'])
animals
Out[19]:
In the cell below, write code to save this DataFrame to disk as a csv file with the name cows_and_goats.csv.
In [0]:
animals.to_csv("cows_and_goats.csv")
https://www.kaggle.com/residentmario/indexing-selecting-assigning
Native Python objects provide good ways of indexing data. Pandas carries all of these over, which helps make it easy to start with.
Consider this DataFrame:
In [21]:
reviews
Out[21]:
In Python, we can access the property of an object by accessing it as an attribute. A book object, for example, might have a title property, which we can access by calling book.title. Columns in a pandas DataFrame work in much the same way.
Hence to access the country property of reviews we can use:
In [22]:
reviews.country
Out[22]:
If we have a Python dictionary, we can access its values using the indexing ([]) operator. We can do the same with columns in a DataFrame:
In [23]:
reviews['country']
Out[23]:
These are the two ways of selecting a specific Series out of a DataFrame. Neither of them is more or less syntactically valid than the other, but the indexing operator [] does have the advantage that it can handle column names with reserved characters in them (e.g. if we had a country providence column, reviews.country providence wouldn't work).
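As a toy illustration (the column name here is hypothetical, not part of the wine dataset), attribute access breaks down as soon as the column name contains a space, while the indexing operator still works:
```python
import pandas as pd

df = pd.DataFrame({'country providence': ['US', 'Spain']})

df['country providence']   # works fine
# df.country providence    # SyntaxError: not valid Python
```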
Doesn't a pandas Series look kind of like a fancy dictionary? It pretty much is, so it's no surprise that, to drill down to a single specific value, we need only use the indexing operator [] once more:
In [24]:
reviews['country'][0]
Out[24]:
The indexing operator and attribute selection are nice because they work just like they do in the rest of the Python ecosystem. As a novice, this makes them easy to pick up and use. However, pandas has its own accessor operators, loc and iloc. For more advanced operations, these are the ones you're supposed to be using.
Index-based selection Pandas indexing works in one of two paradigms. The first is index-based selection: selecting data based on its numerical position in the data. iloc follows this paradigm.
To select the first row of data in a DataFrame, we may use the following:
In [0]:
reviews.iloc[0]
Out[0]:
Both loc and iloc are row-first, column-second. This is the opposite of what we do in native Python, which is column-first, row-second.
This means that it's marginally easier to retrieve rows, and marginally harder to retrieve columns. To get a column with iloc, we can do the following:
In [0]:
reviews.iloc[:, 0]
Out[0]:
On its own, the : operator, which also comes from native Python, means "everything". When combined with other selectors, however, it can be used to indicate a range of values. For example, to select the country column from just the first, second, and third row, we would do:
In [0]:
reviews.iloc[:3, 0]
Out[0]:
Or, to select just the second and third entries, we would do:
In [0]:
reviews.iloc[1:3, 0]
Out[0]:
It's also possible to pass a list:
In [0]:
reviews.iloc[[0, 1, 2], 0]
Out[0]:
Finally, it's worth knowing that negative numbers can be used in selection. This will start counting backwards from the end of the values. So, for example, here are the last five elements of the dataset.
In [0]:
reviews.iloc[-5:]
Out[0]:
Label-based selection The second paradigm for attribute selection is the one followed by the loc operator: label-based selection. In this paradigm, it's the data index value, not its position, which matters.
For example, to get the first entry in reviews, we would now do the following:
In [0]:
reviews.loc[0, 'country']
Out[0]:
iloc is conceptually simpler than loc because it ignores the dataset's indices. When we use iloc we treat the dataset like a big matrix (a list of lists), one that we have to index into by position. loc, by contrast, uses the information in the indices to do its work. Since your dataset usually has meaningful indices, it's usually easier to do things using loc instead. For example, here's one operation that's much easier using loc:
In [0]:
reviews.loc[:, ['taster_name', 'taster_twitter_handle', 'points']]
Out[0]:
Choosing between loc and iloc When choosing or transitioning between loc and iloc, there is one "gotcha" worth keeping in mind, which is that the two methods use slightly different indexing schemes.
iloc uses the Python stdlib indexing scheme, where the first element of the range is included and the last one excluded. So 0:10 will select entries 0,...,9. loc, meanwhile, indexes inclusively. So 0:10 will select entries 0,...,10.
Why the change? Remember that loc can index any stdlib type: strings, for example. If we have a DataFrame with index values Apples, ..., Potatoes, ..., and we want to select "all the alphabetical fruit choices between Apples and Potatoes", then it's a lot more convenient to index df.loc['Apples':'Potatoes'] than it is to index something like df.loc['Apples':'Potatoet'] (t coming after s in the alphabet).
This is particularly confusing when the DataFrame index is a simple numerical list, e.g. 0,...,1000. In this case df.iloc[0:1000] will return 1000 entries, while df.loc[0:1000] returns 1001 of them! To get 1000 elements using loc, you will need to go one lower and ask for df.loc[0:999].
Otherwise, the semantics of using loc are the same as those for iloc.
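As a small sketch of the gotcha above (assuming reviews still carries its default 0, 1, 2, ... integer index), the two accessors give differently sized results for the same-looking range:
```python
# iloc slices exclusively: positions 0 through 999.
len(reviews.iloc[0:1000])   # 1000 rows

# loc slices inclusively on labels, so the same range gives one extra row.
len(reviews.loc[0:1000])    # 1001 rows

# To get exactly 1000 rows with loc, stop one label earlier.
len(reviews.loc[0:999])     # 1000 rows
```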
In [0]:
# reviews.set_index("title") # da error, no existe esa columna
In [0]:
reviews.set_index("variety")
Out[0]:
This is useful if you can come up with an index for the dataset which is better than the current one.
So far we've been indexing various strides of data, using structural properties of the DataFrame itself. To do interesting things with the data, however, we often need to ask questions based on conditions.
For example, suppose that we're interested specifically in better-than-average wines produced in Italy.
We can start by checking if each wine is Italian or not:
In [0]:
reviews.country == 'Italy'
Out[0]:
This operation produced a Series of True/False booleans based on the country of each record. This result can then be used inside of loc to select the relevant data:
In [0]:
reviews.loc[reviews.country == 'Italy']
Out[0]:
This DataFrame has ~20,000 rows. The original had ~130,000. That means that around 15% of wines originate from Italy.
We also wanted to know which ones are better than average. Wines are reviewed on an 80-to-100 point scale, so this could mean wines that accrued at least 90 points.
We can use the ampersand (&) to bring the two questions together:
In [0]:
reviews.loc[(reviews.country == 'Italy') & (reviews.points >= 90)]
Out[0]:
Suppose we'll buy any wine that's made in Italy or which is rated above average. For this we use a pipe (|):
In [0]:
reviews.loc[(reviews.country == 'Italy') | (reviews.points >= 90)]
Out[0]:
Pandas comes with a few built-in conditional selectors, two of which we will highlight here.
The first is isin. isin lets you select data whose value "is in" a list of values. For example, here's how we can use it to select wines only from Italy or France:
In [0]:
reviews.loc[reviews.country.isin(['Italy', 'France'])]
Out[0]:
The second is isnull (and its companion notnull). These methods let you highlight values which are (or are not) empty (NaN). For example, to filter out wines lacking a price tag in the dataset, here's what we would do:
In [0]:
reviews.loc[reviews.price.notnull()]
Out[0]:
Assigning data
Going the other way, assigning data to a DataFrame is easy. You can assign either a constant value:
In [0]:
reviews['critic'] = 'everyone'
reviews['critic']
Out[0]:
Or with an iterable of values:
In [0]:
reviews['index_backwards'] = range(len(reviews), 0, -1)
reviews['index_backwards']
Out[0]:
Run the following cell to load your data and some utility functions (including code to check your answers).
In [0]:
import pandas as pd
# reviews = pd.read_csv("../input/wine-reviews/winemag-data-130k-v2.csv", index_col=0)
reviews = pd.read_csv("winemag-data-130k-v2.csv", index_col=0)
pd.set_option("display.max_rows", 5)
# from learntools.core import binder; binder.bind(globals())
# from learntools.pandas.indexing_selecting_and_assigning import *
# print("Setup complete.")
Look at an overview of your data by running the following line.
In [0]:
reviews.head()
Out[0]:
In [0]:
desc = reviews['description']
In [0]:
desc = reviews.description
#or
#desc = reviews["description"]
# desc is a pandas Series object, with an index matching the reviews DataFrame. In general, when we select a single column from a DataFrame, we'll get a Series.
desc.head(10)
Out[0]:
In [0]:
first_description = reviews["description"][0]
# q2.check()
first_description
Out[0]:
In [0]:
# Solution:
first_description = reviews.description.iloc[0]
# Note that while this is the preferred way to obtain the entry in the DataFrame, many other options will return a valid result,
# such as reviews.description.loc[0], reviews.description[0], and more!
first_description
Out[0]:
In [0]:
first_row = reviews.iloc[0]
# q3.check()
first_row
Out[0]:
In [0]:
# Solution:
first_row = reviews.iloc[0]
In [0]:
first_descriptions = reviews.iloc[:10, 1]
# first_descriptions = reviews.description.iloc[0:10]
# first_descriptions = reviews.loc[0:9, 'description']
# q4.check()
first_descriptions
Out[0]:
In [0]:
# Solution:
first_descriptions = reviews.description.iloc[:10]
# Note that many other options will return a valid result, such as desc.head(10) and reviews.loc[:9, "description"].
first_descriptions
Out[0]:
In [0]:
sample_reviews = reviews.iloc[[1, 2, 3, 5, 8]]
# q5.check()
sample_reviews
Out[0]:
In [0]:
# Solution:
indices = [1, 2, 3, 5, 8]
sample_reviews = reviews.loc[indices]
sample_reviews
Out[0]:
In [0]:
df = reviews.loc[[0,1,10,100],['country', 'province', 'region_1', 'region_2']]
# q6.check()
df
Out[0]:
In [0]:
# Solution:
cols = ['country', 'province', 'region_1', 'region_2']
indices = [0, 1, 10, 100]
df = reviews.loc[indices, cols]
df
Out[0]:
Create a variable df containing the country and variety columns of the first 100 records.
Hint: you may use loc or iloc. When answering this question and several of the ones that follow, keep in mind the following "gotcha" described in the tutorial:
iloc uses the Python stdlib indexing scheme, where the first element of the range is included and the last one excluded. loc, meanwhile, indexes inclusively.
This is particularly confusing when the DataFrame index is a simple numerical list, e.g. 0,...,1000. In this case df.iloc[0:1000] will return 1000 entries, while df.loc[0:1000] returns 1001 of them! To get 1000 elements using loc, you will need to go one lower and ask for df.loc[0:999].
In [0]:
df = reviews.loc[0:99,['country', 'variety']]
# q7.check()
df
Out[0]:
In [0]:
# Solution:
cols = ['country', 'variety']
df = reviews.loc[:99, cols]
# or
# cols_idx = [0, 11]
# df = reviews.iloc[:100, cols_idx]
df
Out[0]:
In [0]:
italian_wines = reviews.loc[reviews.country == 'Italy']
# q8.check()
italian_wines
Out[0]:
In [0]:
# Solution:
italian_wines = reviews[reviews.country == 'Italy']
italian_wines
Out[0]:
In [0]:
top_oceania_wines = reviews.loc[reviews.country.isin(['Australia', 'New Zealand'])
& (reviews.points >= 95)]
# q9.check()
top_oceania_wines
Out[0]:
In [0]:
# Solution:
top_oceania_wines = reviews.loc[
(reviews.country.isin(['Australia', 'New Zealand']))
& (reviews.points >= 95)
]
https://www.kaggle.com/residentmario/summary-functions-and-maps
In [0]:
reviews
Out[0]:
Summary functions
Pandas provides many simple "summary functions" (not an official name) which restructure the data in some useful way. For example, consider the describe() method:
In [0]:
reviews.points.describe()
Out[0]:
This method generates a high-level summary of the attributes of the given column. It is type-aware, meaning that its output changes based on the data type of the input. The output above only makes sense for numerical data; for string data here's what we get:
In [0]:
reviews.taster_name.describe()
Out[0]:
If you want to get some particular simple summary statistic about a column in a DataFrame or a Series, there is usually a helpful pandas function that makes it happen.
For example, to see the mean of the points allotted (e.g. how well an averagely rated wine does), we can use the mean() function:
In [0]:
reviews.points.mean()
Out[0]:
To see a list of unique values we can use the unique() function:
In [0]:
# reviews.taster_name.unique()  # takes a long time??
To see a list of unique values and how often they occur in the dataset, we can use the value_counts() method:
In [0]:
reviews.taster_name.value_counts()
Out[0]:
A map is a term, borrowed from mathematics, for a function that takes one set of values and "maps" them to another set of values. In data science we often have a need for creating new representations from existing data, or for transforming data from the format it is in now to the format that we want it to be in later. Maps are what handle this work, making them extremely important for getting your work done!
There are two mapping methods that you will use often.
map() is the first, and slightly simpler one. For example, suppose that we wanted to remean the scores the wines received to 0. We can do this as follows:
In [0]:
review_points_mean = reviews.points.mean()
reviews.points.map(lambda p: p - review_points_mean)
Out[0]:
The function you pass to map() should expect a single value from the Series (a point value, in the above example), and return a transformed version of that value. map() returns a new Series where all the values have been transformed by your function.
apply() is the equivalent method if we want to transform a whole DataFrame by calling a custom method on each row.
In [0]:
def remean_points(row):
    row.points = row.points - review_points_mean
    return row

reviews.apply(remean_points, axis='columns')
Out[0]:
If we had called reviews.apply() with axis='index', then instead of passing a function to transform each row, we would need to give a function to transform each column.
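As a minimal sketch (not from the original notebook) of that column-wise variant: with axis='index' the function receives one whole column at a time, so it has to decide per column what to do.
```python
def remean_points_column(col):
    # Shift only the numeric 'points' column; pass every other column through unchanged.
    if col.name == 'points':
        return col - review_points_mean
    return col

reviews.apply(remean_points_column, axis='index')
```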
Note that map() and apply() return new, transformed Series and DataFrames, respectively. They don't modify the original data they're called on. If we look at the first row of reviews, we can see that it still has its original points value.
In [0]:
reviews.head(1)
Out[0]:
Pandas provides many common mapping operations as built-ins. For example, here's a faster way of remeaning our points column:
In [0]:
review_points_mean = reviews.points.mean()
reviews.points - review_points_mean
Out[0]:
In this code we are performing an operation between a lot of values on the left-hand side (everything in the Series) and a single value on the right-hand side (the mean value). Pandas looks at this expression and figures out that we must mean to subtract that mean value from every value in the dataset.
Pandas will also understand what to do if we perform these operations between Series of equal length. For example, an easy way of combining country and region information in the dataset would be to do the following:
In [0]:
reviews.country + " - " + reviews.region_1
Out[0]:
These operators are faster than map() or apply() because they use speed-ups built into pandas. All of the standard Python operators (>, <, ==, and so on) work in this manner.
However, they are not as flexible as map() or apply(), which can do more advanced things, like applying conditional logic, which cannot be done with addition and subtraction alone.
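As a small sketch (not from the original notebook) of the kind of conditional logic that plain operators can't express but map() can: labelling each wine by its score band.
```python
# Arithmetic operators can subtract a mean, but they can't branch; map() with a lambda can.
reviews.points.map(lambda p: 'good' if p >= 90 else 'ordinary')
```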
https://www.kaggle.com/residentmario/grouping-and-sorting
One function we've been using heavily thus far is the value_counts() function. We can replicate what value_counts() does by doing the following:
In [0]:
reviews.groupby('points').points.count()
Out[0]:
In [0]:
reviews.groupby('points').price.min()
Out[0]:
In [0]:
reviews.groupby('winery').apply(lambda df: df.title.iloc[0])
Out[0]:
In [0]:
reviews.groupby(['country', 'province']).apply(lambda df: df.loc[df.points.idxmax()])
Out[0]:
In [0]:
reviews.groupby(['country']).price.agg([len, min, max])
Out[0]:
In all of the examples we've seen thus far we've been working with DataFrame or Series objects with a single-label index. groupby() is slightly different in the fact that, depending on the operation we run, it will sometimes result in what is called a multi-index.
A multi-index differs from a regular index in that it has multiple levels. For example:
In [0]:
countries_reviewed = reviews.groupby(['country', 'province']).description.agg([len])
countries_reviewed
Out[0]:
In [0]:
mi = countries_reviewed.index
type(mi)
Out[0]:
In [0]:
countries_reviewed.reset_index()
Out[0]:
Looking again at countries_reviewed we can see that grouping returns data in index order, not in value order. That is to say, when outputting the result of a groupby, the order of the rows is dependent on the values in the index, not in the data.
To get data in the order we want it in, we can sort it ourselves. The sort_values() method is handy for this.
In [0]:
countries_reviewed = countries_reviewed.reset_index()
countries_reviewed.sort_values(by='len')
Out[0]:
In [0]:
countries_reviewed.sort_values(by='len', ascending=False)
Out[0]:
In [0]:
countries_reviewed.sort_index()
Out[0]:
In [0]:
countries_reviewed.sort_values(by=['country', 'len'])
Out[0]:
https://www.kaggle.com/luisrivera/exercise-grouping-and-sorting/edit
In [0]:
import pandas as pd
# reviews = pd.read_csv("../input/wine-reviews/winemag-data-130k-v2.csv", index_col=0)
#pd.set_option("display.max_rows", 5)
reviews = pd.read_csv("./winemag-data-130k-v2.csv", index_col=0)
# from learntools.core import binder; binder.bind(globals())
# from learntools.pandas.grouping_and_sorting import *
# print("Setup complete.")
In [0]:
reviews_written = reviews.groupby('taster_twitter_handle').size()
#or
reviews_written
Out[0]:
In [0]:
reviews_written = reviews.groupby('taster_twitter_handle').taster_twitter_handle.count()
reviews_written
Out[0]:
What is the best wine I can buy for a given amount of money? Create a Series whose index is wine prices and whose values are the maximum number of points a wine costing that much was given in a review. Sort the values by price, ascending (so that 4.0 dollars is at the top and 3300.0 dollars is at the bottom).
In [0]:
best_rating_per_price = reviews.groupby('price')['points'].max().sort_index()
best_rating_per_price
Out[0]:
In [0]:
price_extremes = reviews.groupby('variety').price.agg([min, max])
price_extremes
Out[0]:
In [0]:
sorted_varieties = price_extremes.sort_values(by=['min', 'max'], ascending=False)
sorted_varieties
Out[0]:
In [0]:
reviewer_mean_ratings = reviews.groupby('taster_name').points.mean()
reviewer_mean_ratings
Out[0]:
In [0]:
reviewer_mean_ratings.describe()
Out[0]:
In [0]:
country_variety_counts = reviews.groupby(['country', 'variety']).size().sort_values(ascending=False)
country_variety_counts
Out[0]:
https://www.kaggle.com/residentmario/data-types-and-missing-values
In [0]:
reviews.price.dtype
Out[0]:
Alternatively, the dtypes property returns the dtype of every column in the DataFrame:
In [0]:
reviews.dtypes
Out[0]:
Data types tell us something about how pandas is storing the data internally. float64 means that it's using a 64-bit floating point number; int64 means a similarly sized integer instead, and so on.
One peculiarity to keep in mind (and on display very clearly here) is that columns consisting entirely of strings do not get their own type; they are instead given the object type.
It's possible to convert a column of one type into another wherever such a conversion makes sense by using the astype() function. For example, we may transform the points column from its existing int64 data type into a float64 data type:
In [0]:
reviews.points.astype('float64')
Out[0]:
A DataFrame or Series index has its own dtype, too:
In [0]:
reviews.index.dtype
Out[0]:
Pandas also supports more exotic data types, such as categorical data and timeseries data. Because these data types are more rarely used, we will omit them until a much later section of this tutorial.
Entries missing values are given the value NaN, short for "Not a Number". For technical reasons these NaN values are always of the float64 dtype.
Pandas provides some methods specific to missing data. To select NaN entries you can use pd.isnull() (or its companion pd.notnull()). This is meant to be used thusly:
In [0]:
reviews[pd.isnull(reviews.country)]
Out[0]:
Replacing missing values is a common operation. Pandas provides a really handy method for this problem: fillna(). fillna() provides a few different strategies for mitigating such data. For example, we can simply replace each NaN with an "Unknown":
In [0]:
reviews.region_2.fillna("Unknown")
Out[0]:
Or we could fill each missing value with the first non-null value that appears sometime after the given record in the database. This is known as the backfill strategy.
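A minimal sketch of that backfill strategy (not shown in the original notebook): bfill() fills each missing value with the next non-null value that appears later in the column, and is equivalent to fillna(method='bfill').
```python
# Fill missing region_2 values with the next non-null value below them.
reviews.region_2.bfill()
```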
Alternatively, we may have a non-null value that we would like to replace. For example, suppose that since this dataset was published, reviewer Kerin O'Keefe has changed her Twitter handle from @kerinokeefe to @kerino. One way to reflect this in the dataset is using the replace() method:
In [0]:
reviews.taster_twitter_handle.replace("@kerinokeefe", "@kerino")
Out[0]:
The replace() method is worth mentioning here because it's handy for replacing missing data which is given some kind of sentinel value in the dataset: things like "Unknown", "Undisclosed", "Invalid", and so on.
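As a hypothetical illustration (toy data, not from the wine dataset), replace() can turn such sentinel strings back into real missing values:
```python
import numpy as np
import pandas as pd

prices = pd.Series(['15.0', 'Undisclosed', '22.0'])
prices.replace('Undisclosed', np.nan)   # 'Undisclosed' becomes NaN
```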
https://www.kaggle.com/luisrivera/exercise-data-types-and-missing-values/edit
In [0]:
# import pandas as pd
# reviews = pd.read_csv("../input/wine-reviews/winemag-data-130k-v2.csv", index_col=0)
# from learntools.core import binder; binder.bind(globals())
# from learntools.pandas.data_types_and_missing_data import *
# print("Setup complete.")
In [0]:
# Your code here
dtype = reviews.points.dtype
In [0]:
point_strings = reviews.points.astype(str)
point_strings
Out[0]:
In [0]:
missing_price_reviews = reviews[reviews.price.isnull()]
n_missing_prices = len(missing_price_reviews)
n_missing_prices
Out[0]:
In [0]:
# Cute alternative solution: if we sum a boolean series, True is treated as 1 and False as 0
n_missing_prices = reviews.price.isnull().sum()
n_missing_prices
Out[0]:
In [0]:
# or equivalently:
n_missing_prices = pd.isnull(reviews.price).sum()
n_missing_prices
Out[0]:
## 4.
What are the most common wine-producing regions? Create a Series counting the number of times each value occurs in the `region_1` field. This field is often missing data, so replace missing values with `Unknown`. Sort in descending order. Your output should look something like this:
```
Unknown                    21247
Napa Valley                 4480
                           ...
Bardolino Superiore            1
Primitivo del Tarantino        1
Name: region_1, Length: 1230, dtype: int64
```
In [0]:
reviews_per_region = reviews.region_1.fillna('Unknown').value_counts().sort_values(ascending=False)
reviews_per_region
Out[0]:
Oftentimes data will come to us with column names, index names, or other naming conventions that we are not satisfied with. In that case, you'll learn how to use pandas functions to change the names of the offending entries to something better.
You'll also explore how to combine data from multiple DataFrames and/or Series.
In [0]:
import pandas as pd
# reviews = pd.read_csv("../input/wine-reviews/winemag-data-130k-v2.csv", index_col=0)
reviews = pd.read_csv("winemag-data-130k-v2.csv", index_col=0)
# from learntools.core import binder; binder.bind(globals())
# from learntools.pandas.renaming_and_combining import *
# print("Setup complete.")
In [0]:
reviews.head()
Out[0]:
In [0]:
reviews.rename(columns={'points': 'score'})
Out[0]:
rename() lets you rename index or column values by specifying an index or columns keyword parameter, respectively. It supports a variety of input formats, but usually a Python dictionary is the most convenient. Here is an example using it to rename some elements of the index.
In [0]:
reviews.rename(index={0: 'firstEntry', 1: 'secondEntry'})
Out[0]:
You'll probably rename columns very often, but rename index values very rarely. For that, set_index() is usually more convenient.
Both the row index and the column index can have their own name attribute. The complementary rename_axis() method may be used to change these names. For example:
In [0]:
reviews.rename_axis("wines", axis='rows').rename_axis("fields", axis='columns')
Out[0]:
When performing operations on a dataset, we will sometimes need to combine different DataFrames and/or Series in non-trivial ways. Pandas has three core methods for doing this. In order of increasing complexity, these are concat(), join(), and merge(). Most of what merge() can do can also be done more simply with join(), so we will omit it and focus on the first two functions here.
The simplest combining method is concat(). Given a list of elements, this function will smush those elements together along an axis.
This is useful when we have data in different DataFrame or Series objects but having the same fields (columns). One example: the YouTube Videos dataset, which splits the data up based on country of origin (e.g. Canada and the UK, in this example). If we want to study multiple countries simultaneously, we can use concat() to smush them together:
In [0]:
# canadian_youtube = pd.read_csv("../input/youtube-new/CAvideos.csv")
# british_youtube = pd.read_csv("../input/youtube-new/GBvideos.csv")
canadian_youtube = pd.read_csv("CAvideos.csv")
british_youtube = pd.read_csv("GBvideos.csv")
pd.concat([canadian_youtube, british_youtube])
Out[0]:
The middlemost combiner in terms of complexity is join(). join() lets you combine different DataFrame objects which have an index in common. For example, to pull down videos that happened to be trending on the same day in both Canada and the UK, we could do the following:
In [0]:
left = canadian_youtube.set_index(['title', 'trending_date'])
right = british_youtube.set_index(['title', 'trending_date'])
left.join(right, lsuffix='_CAN', rsuffix='_UK')
Out[0]:
The lsuffix and rsuffix parameters are necessary here because the data has the same column names in both British and Canadian datasets. If this wasn't true (because, say, we'd renamed them beforehand) we wouldn't need them.
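A sketch of that alternative (the suffix choices here are assumptions): renaming the overlapping columns before the join removes the need for lsuffix and rsuffix.
```python
# Give every column a country-specific name up front...
left_renamed = left.rename(columns=lambda c: c + '_CAN')
right_renamed = right.rename(columns=lambda c: c + '_UK')

# ...so the joined result has no name collisions to resolve.
left_renamed.join(right_renamed)
```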
https://www.kaggle.com/luisrivera/exercise-renaming-and-combining/edit
In [0]:
# import pandas as pd
# reviews = pd.read_csv("../input/wine-reviews/winemag-data-130k-v2.csv", index_col=0)
# from learntools.core import binder; binder.bind(globals())
# from learntools.pandas.renaming_and_combining import *
# print("Setup complete.")
In [0]:
reviews.head()
Out[0]:
In [0]:
renamed = reviews.rename(columns=dict(region_1='region', region_2='locale'))
# q1.check()
renamed
Out[0]:
In [0]:
reindexed = reviews.rename_axis('wines', axis='rows')
reindexed
Out[0]:
The Things on Reddit dataset includes product links from a selection of top-ranked forums ("subreddits") on reddit.com. Run the cell below to load a dataframe of products mentioned on the /r/gaming subreddit and another dataframe for products mentioned on the /r/movies subreddit.
In [0]:
# gaming_products = pd.read_csv("../input/things-on-reddit/top-things/top-things/reddits/g/gaming.csv")
gaming_products = pd.read_csv("gaming.csv")
gaming_products['subreddit'] = "r/gaming"
# movie_products = pd.read_csv("../input/things-on-reddit/top-things/top-things/reddits/m/movies.csv")
movie_products = pd.read_csv("movies.csv")
movie_products['subreddit'] = "r/movies"
Create a DataFrame of products mentioned on either subreddit.
In [0]:
combined_products = pd.concat([gaming_products, movie_products])
# q3.check()
combined_products.head()
Out[0]:
The Powerlifting Database dataset on Kaggle includes one CSV table for powerlifting meets and a separate one for powerlifting competitors. Run the cell below to load these datasets into dataframes:
In [0]:
# powerlifting_meets = pd.read_csv("../input/powerlifting-database/meets.csv")
# powerlifting_competitors = pd.read_csv("../input/powerlifting-database/openpowerlifting.csv")
powerlifting_meets = pd.read_csv("meets.csv")
powerlifting_meets.head()
Out[0]:
In [0]:
powerlifting_competitors = pd.read_csv("openpowerlifting.csv")
powerlifting_competitors.head()
Out[0]:
Both tables include references to a MeetID, a unique key for each meet (competition) included in the database. Using this, generate a dataset combining the two tables into one.
In [0]:
powerlifting_combined = powerlifting_meets.set_index("MeetID").join(powerlifting_competitors.set_index("MeetID"))
powerlifting_combined.head()
Out[0]: