In [ ]:
import io
import numpy as np
import pandas as pd
In [ ]:
# MASTER ONLY
import re
# imports %%solution, %%submission, %%template, %%inlinetest, %%studenttest, %autotest
%load_ext prog_edu_assistant_tools.magics
from prog_edu_assistant_tools.magics import report, autotest
lang:enIn this exercise, we will consider what is a data frame and how to represent
the data in a "tidy" way. We will use the pandas
data frame library.
lang:jaこの講義では、「データフレーム」を使って、データをキレイに(tidy) 表現する方法を説明します。
本講義では データフレームのためのライブラリとしてpandas
を使用します。
lang:en
There are many ways to represent the tabular data, spreadsheets being the most popular one among general computer users. However, for the programmatic access, a simpler format may be even more useful.
It is easy to generate, even by typing manually, and relatively easy to parse. CSV stands for comma-separated values, so it uses a comma ,
to separate the values in a single row.
Here are the rules of the CSV data format:
Typically the CSV format is used in files with .csv
suffix, but Python language makes it easy enough to parse CSV defined directly in the source code in string literals. This is one of the easiest way to define small data frames in Jupyter notebooks. Here is an example.
lang:ja 表のようなデータを表現できる方法は複数がありますが、プログラムでデータを扱うのために特に使いやすいのはCSV形式です。 CSV形式は、プログラムによって生成または手動の生成、両方とも簡単で、読み込みも簡単にできます。 CSVはComma-separated valuesの略で、カンマ区切りという意味です。
CSV形式のルールは以下です。
"a,b"
"a""b"
普段はCSV形式.csv
のファイルとして保存しますが、Pythonでは直接のプログラムへの組み込みも可能です。
以下の例をご覧ください。
In [ ]:
df2 = pd.read_csv(io.StringIO("""
x,y
1,2
3,4
"""))
df2
lang:enIn case you are curious, pd.read_csv
accepts file-like objects to read the data from, and io.StringIO is way to create a file-like object from a string literal. Triple quotes """
are a Python syntax that allows to define multi-line string literal.
lang:ja詳しく見ると、pd.read_csv
はファイルのようなものを受け取ります、そしてio.StringIO
は文字からファイルのようなオブジェクトを作っています。
lang:enHere is the example of CSV data that we will use throughout this notebook.
lang:ja以下では、次のCSV形式のファイルを例に説明していきます。
In [ ]:
with open("data/tokyo-weather.csv") as f:
[print(next(f), end='') for i in range(5)]
lang:enA data frame is a table containing the data where every column has a name, and the data within each column has a uniform type (e.g. only numbers or only strings). For example, a standard spreadsheet with a data
can often be thought of as a data frame. Here is the definition
of the DataFrame
class in pandas
library: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.html#pandas.DataFrame
Let's look at an example.
lang:jaデータフレームとは2次元の表形式のデータ(tabular data)を扱うためのデータ構造です。各列は型や名前がついています。列はそれぞれ型が異なってもよいです。 たとえば、スプレッドシートのデータはデータフレームとしてみることができます。
pandas
のライブラリでのDataFrame
クラスの定義はこちらを参考にしてください: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.html#pandas.DataFrame
例をみてみましょう。
In [ ]:
df = pd.read_csv('data/tokyo-weather.csv')
df.head(5)
lang:enHere, the read_csv
call reads the data from a CSV file into a data frame.
# Read the CSV file into a new data frame.
df = pd.read_csv('data/tokyo-weather.csv')
And the df.head(5)
call displays the first few lines of the data frame.
# Display the first 5 rows of the data frame.
df.head(5)
lang:enThe data frame has columns, rows and the cells holding the values. The values in the cells can be numeric (including NaN to represent missing numbers), or they can be string values to represent text data or categorical data, but each column must have a single type. In a data frame, it is possible to address individual columns or rows of data.
The good way for representing the data using the data frame comes from statistics. Each column in the data frame corresponds to a variable, that is something that either can be measured, or can be controlled by us. Each row corresponds to one observation, with values in different columns logically being related. For example, in the table above, one row coresonds to the weather data for 1 hour.
In Python Pandas library, the column types can be inspected using dtypes property. Note that numeric types
are further subdivided into integer (int64
) and floating point (float64
) types. The string data is represented with dtype object
.
lang:jaデータフレームは値を含む「セル」が2次元格子状にならんだデータ構造です。 各セルには数値または文字列のデータを保存できます。 上の例ではいくつかのセルにNaN という値が入っていますが、これはNot A Numberの意味で、値が不正または欠けていることを表します
一つのデータフルームのよい考え方は統計分析に由来しますが、統計分析以外にもその考え方が役に立ちます。 各行は観測値を表し、各列は変数を表します。変数は直接に設定できる、または観測して図るものとします。 一つの観測値は同時に図るものなので、一つの行に入っている値は一つのものを記述します。 上記の例の表では、一つの行は一時間の観測を表しています。
Pythonのpandas
のライブラリでは、列の型を知るためにdtypes
というプロパティを使用できます。
数値型は更に整数(int64
)や浮動小数点(float64
)の型に分けられます。文字の場合はオブジェクトの型(object
)になります。
In [ ]:
# データフレームの列の型をご覧できます。
# 因子はCSVの中で文字列として
df.dtypes
In [ ]:
# 一目で分かるデータの平均値や標準偏差
df.describe()
lang:en There are many possible ways how one can put the same data into the tabular format.
Date | Rainfall | Wind |
---|---|---|
2019-08-08 | 50 | NE |
2019-08-07 | 0 | E |
Rainfall.8/8 | Rainfall.8/7 | Wind.8/8 | Wind.8/7 |
---|---|---|---|
50 | 0 | NE | E |
Date | Variable | Value |
---|---|---|
2019-08-08 | Rainfall | 50 |
2019-08-08 | Wind | NE |
2019-08-07 | Rainfall | 0 |
2019-08-07 | Wind | E |
One particularly useful way to think of the data has been inspired by statistics and looks like an experiment report. It is called tidy data and satisfies the following conditions:
All other possible formats of data that are not tidy are called messy by contrast. Of the examples above, only the first table is a tidy data frame. The second and third are messy.
There is some connection of tidy data frames to 3rd normal form in the database theory, but data frames tend to be more flexible and malleable. It is also worth noting, that depending on the purpose of data analysis and required computations, the definition of "one observation" may be different. For example, let's assume that we have the data about flight arrival and departure times. If we want to study flight durations, then it is convenient to have departure and arrival as independent variables in separate columns, which makes it really easy to compute flight duration. If on the other hand we want to study how the air stripe at an airport is used, then depatures and arrivals are just timestamps of events related to the airstripe use, and arrival/departure is better to be thought as an additional categorical variable.
There are two benefits to tidy data frames
Bringing all data into tidy frame format makes your life easier as you do not need to remember and handle various data format pecularities. Data handing becomes uniform.
There is an existing set of tools that work best when the data is in tidy format. The most important of those tools is a plotting library used for data visualiation. We will see some examples later in this unit.
See the paper https://vita.had.co.nz/papers/tidy-data.pdf for more details about tidy data frames.
lang:ja データフレームにデータを入れる方法はたくさんありますが、それはどちらでもよいという訳はありません。以下の例を見ましょう。
日付 | 降水量 | 風向 |
---|---|---|
2019-08-08 | 50 | NE |
2019-08-07 | 0 | E |
降水量.8/8 | 降水量.8/7 | 風向.8/8 | 風向.8/7 |
---|---|---|---|
50 | 0 | NE | E |
日付 | 変数 | 値 |
---|---|---|
2019-08-08 | 降水量 | 50 |
2019-08-08 | 風向 | NE |
2019-08-07 | 降水量 | 0 |
2019-08-07 | 風向 | E |
以上のデータの表現方法の中から一つは特に役に立ちます。それは「キレイな(tidy)データフレーム」といい、以下の条件に当てはまるデータフレームです。
キレイな(tidy)データフレームの条件に当てはまらないデータフレームは汚い(messy)といいます。 上の例では、1つ目の表はtidyで、2つ目と3つ目はmessyです。
データ解析の目的によって観測値の定義は異なる場合もあります。たとえば、飛行機の出発時間や到着時間は 別々の変数でしょうか。 飛行時間の解析であれば、別々の変数の扱いは便利です。なぜかというと、観測値ごとに 簡単に飛行時間を計算できるからです。 もし空港の飛行場の使い方の解析の場合は、離陸も着陸も飛行場を使う 機会なので、同じデータであっても、一つの変数にした方が解析しやすいのです。
詳しくキレイなデータフレームについてこちらの論文ご参考ください: https://vita.had.co.nz/papers/tidy-data.pdf (英語)
lang:enIn this exercise, you task is to create a tidy data frame based on the textual description provided below. A person (Aliсe) wants to do a data analysis on her coffee drinking habits.
Here is the Alices description of her week:
Your task: create a data frame named coffee
that would describe how much coffee Alice drinks on each day of the week, with the following columns describing the day:
day
: integer, describes the day (1: Monday, ... 7 = Sunday)work
: boolean (True/False) describes whether the day is workday (true) or weekends (false).gym
: boolean (True/False) describes whether Alice goes to the gym on that day (true - goes to gym, false -
does not go to gym).coffee_ml
: integer, describes how much coffee Alice drinks in the day.lang:jaアリスはコーヒーを大好きで、よく飲みます。コーヒーの消費量に気になってデータ解析を行いたいので、以下の記述を読んで、データフレームをCSV形式で作ってください。 アリスの一週間の説明こちらです:
課題として、データフレームを作ってcoffee
という名前をつけてください。データフレームには以下の列を入れましょう。
day
: 整数、一週間の中の一日を記述します (1:月曜日, 2:火曜日, ..., 6:土曜日, 7:日曜日)work
: 真理値、その日に会社に行くかどうか(1:会社に行く、0:行かない)gym
: 真理値、その日にジムに行くかどうか(1:ジムに行く、0:行かない)coffee_ml
: 整数、その日にコーヒーの消費量、mlの単位
In [ ]:
%%solution
""" # BEGIN PROMPT
coffee = pd.read_csv(io.StringIO('''day,work,gym,coffee_ml
...
'''))
""" # END PROMPT
# BEGIN SOLUTION
coffee = pd.read_csv(io.StringIO("""day,work,gym,coffee_ml
1,true,false,200
2,true,true,500
3,true,false,0
4,true,true,500
5,true,false,200
6,false,false,500
7,false,false,500
"""))
# END SOLUTION
In [ ]:
# Inspect the resulting data frame
coffee
In [ ]:
%%studenttest StudentTest
# Test the data frame. **lang:en**
# MASTER ONLY
assert len(coffee) == 7, "Your dataframe should have 7 rows for each day of the week"
assert 'day' in coffee, "Your dataframe should have a 'day' column"
assert 'coffee_ml' in coffee, "Your dataframe should have a 'coffee_ml' column"
assert 'work' in coffee, "Your dataframe should have a 'work' column"
assert 'gym' in coffee, "Your dataframe should have a 'gym' column"
In [ ]:
%%studenttest StudentTest
# Test the data frame. **lang:ja**
assert len(coffee) == 7, "データフレームには7つの行が入らなければなりません"
assert 'day' in coffee, "データフレームには'day'の列が入らなければなりません"
assert 'coffee_ml' in coffee, "データフレームには'coffee_ml'の列が入らなければなりません"
assert 'work' in coffee, "データフレームには'work'の列が入らなければなりません"
assert 'gym' in coffee, "データフレームには'gym'の列が入らなければなりません"
In [ ]:
%%inlinetest AutograderTest
# This test is not shown to student, but used by the autograder.
assert 'coffee' in globals(), "Did you define the data frame named 'coffee' in the solution cell?"
assert coffee.__class__ == pd.core.frame.DataFrame, "Did you define a data frame named 'coffee'? There was a %s instead" % coffee.__class__
assert len(coffee) == 7, "The data frame should have 7 rows, you have %d" % len(coffee)
assert len(np.unique(coffee['day']) == 7), "The data frame should have 7 unique values of the 'day', you have %d" % len(np.unique(coffee['day']))
assert str(np.sort(np.unique(coffee['coffee_ml'])).astype(list)) == '[0 200 500]', "The daily coffee_ml amount should have values of 0, 200, and 500, but you have got: %s" % (str(np.sort(np.unique(coffee['coffee_ml'])).astype(list)))
assert np.sum(coffee['coffee_ml']) == 2400, "The coffee amount is not correct, total should be 2400 ml per week, but you data frame has %d" % np.sum(coffee['coffee_ml'])
assert np.sum(coffee['work'].astype(int)) == 5, "There should be 5 work days in a week"
assert np.sum(coffee['gym'].astype(int)) == 2, "There should be 2 gym days in a week"
assert np.all(coffee.loc[coffee['gym'].astype(bool)]['coffee_ml'] == 500), "coffee_ml should be 500 ml on gym days"
assert np.all(coffee.loc[np.logical_not(coffee['work'].astype(bool))]['coffee_ml'] == 500), "coffee_ml should be 500 on weekends"
assert np.sum(coffee.loc[np.logical_and(coffee['work'].astype(bool), np.logical_not(coffee['gym'].astype(bool)))]['coffee_ml']) == 400, "coffee_ml should be 200 on Monday and Friday, and 0 on Wednesday"
In [ ]:
%%submission
1,1,0,200
2,1,1,500
3,1,0,0
4,1,1,500
5,1,0,200
6,0,0,500
7,0,0,500
In [ ]:
result, log = %autotest AutograderTest
report(AutograderTest, results=result.results)
In [ ]:
%%submission
coffee = pd.read_csv(io.StringIO("""day,coffee_ml,work,gym
Monday,201,true,false
Tuesday,500,true,true
Wednesday,0,true,false
Thursday,500,true,true
Friday,200,true,false
Saturday,500,false,false
Sunday,500,false,false
"""))
In [ ]:
result, logs = %autotest AutograderTest
assert re.search(r'should have values of 0, 200, and 500', str(result.results['error']))
report(AutograderTest, results=result.results)
In [ ]:
%%submission
coffee = True
In [ ]:
result, logs = %autotest AutograderTest
assert re.search(r'Did you define a data frame named .coffee.', str(result.results['error']))
report(AutograderTest, results=result.results, source=submission_source.source)
In [ ]:
result, logs = %autotest StudentTest
report(StudentTest, results=result.results)
In [ ]:
%%submission
coffee = pd.read_csv(io.StringIO("""day,coffee_ml,work,gym
Monday,200,1,0
Tuesday,500,1,0
Wednesday,0,1,0
Thursday,500,1,1
Friday,200,1,0
Saturday,500,0,0
Sunday,500,0,0
"""))
In [ ]:
result, logs = %autotest StudentTest
assert result.results['passed']
report(StudentTest, results=result.results)