In [ ]:
import io
import numpy as np
import pandas as pd
import plotly_express as px
In [ ]:
# MASTER ONLY
import ast
# imports %%solution, %%submission, %%template, %%inlinetest, %%studenttest, %autotest
%load_ext prog_edu_assistant_tools.magics
from prog_edu_assistant_tools.magics import report, autotest
lang:en
In this unit, we will get acquainted with how to visualize tidy data frames
using the library plotly_express.
Let's start with reading the data from a CSV file and peeking inside.
lang:ja この講義では、データフレームの可視化を紹介します。plotly_expressというライブラリを使用します。
まずはデータをCSVファイルから読み込みましょう。
In [ ]:
# CSVファイルからデータを読み込みましょう。 Read the data from CSV file.
df = pd.read_csv('data/16-July-2019-Tokyo-hourly.csv')
print("行数は %d です" % len(df))
print(df.dtypes)
df.head()
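If the data file is not at hand, the same load-and-peek steps can be tried on an in-memory CSV. This is a minimal sketch: the three rows below are hypothetical values made up for illustration, and only the column names follow the ones used in this unit.

```python
import io

import pandas as pd

# Hypothetical three-row excerpt; the values are made up for illustration
# and are not taken from the course data file.
csv_text = """Time_Hour,Temperature_degC,Pressure_hPa
1,21.4,1009.2
2,21.2,1009.0
3,21.0,1008.7
"""

# pd.read_csv accepts any file-like object, so a StringIO works like a file.
sample = pd.read_csv(io.StringIO(csv_text))
print("Number of rows: %d" % len(sample))
print(sample.dtypes)
```

Reading from `io.StringIO` keeps the example self-contained; with the real file, only the argument to `pd.read_csv` changes.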
lang:en
Today we will use a visualization library called Plotly Express, which is designed to make visualization easy and concise, provided the data frame is tidy. To use Plotly Express, import it as follows:
import plotly_express as px
If you get an import error at this point, you may need to install the library. The installation details may differ depending on your platform. Here is an example command using pip:
pip install plotly_express
With Anaconda, enter the following installation command in the conda shell:
conda install -c plotly plotly_express
All plotting commands with plotly_express follow the same pattern:
px.plot_method(data_frame, variable1='column1', variable2='column2', ...)
Here, plot_method is one of the plotting methods supported by the library (see https://www.plotly.express/plotly_express/ for comprehensive documentation). data_frame is the data frame that provides the data to visualize. variable1 and variable2 are names of plot variables that the plotting method can represent, for example x, y, color, size, symbol, facet_col, and facet_row. column1 and column2 are the names of the columns in the given data frame that are mapped to those plot variables.
Depending on the plot chosen, some variables may be omitted, and the visualization library will automatically fill in something reasonable. Let's have a look at the simplest example:
px.line(df, y='Temperature_degC')
Here we call the plotting method px.line, which draws a line plot. The line plot uses two variables, x and y, to place the points on the coordinate grid, and connects them in order. It is possible to give only one of x and y; the other is then filled in automatically with the integers from 0 to N-1, where N is the number of rows in the data frame. In our example, we map y to the Temperature_degC column of our data frame and do not specify a mapping for x.
lang:ja
plotly_expressを用いて可視化 (Visualization with plotly_express)
plotly_expressというライブラリはデータの簡単な可視化のために開発されました。可視化を簡単にするには、データフレームはキレイな状態に保たなければなりません。このライブラリを使うには、以下のインポートが必要です。
import plotly_express as px
このコマンドがエラーを返した場合は、ライブラリをインストールする必要があります。インストール方法はコンピューターシステムによって異なりますので、README.mdを参照するか、TAにお尋ねください。ご参考までに、Pipを用いたインストール方法は以下の通りです。
pip install plotly_express
Anacondaを使う場合、以下のコマンドをCondaシェルに入力してください。
conda install -c plotly plotly_express
可視化の命令は全て同じ形です。
px.可視化方法(データフレーム, 変数1='列1', 変数2='列2', ...)
可視化方法はライブラリの説明書をご参照ください (https://www.plotly.express/plotly_express/)。
データフレームには可視化するデータを持つデータフレームを指定します。変数1, 変数2には可視化変数を指定します。可視化方法によって異なりますが、以下の
可視化変数がよく使われます: x, y, color, size, symbol, facet_col, facet_row.
列1, 列2はデータフレームに入っている列の参照で、可視化変数とデータを関連づけます。
可視化方法によって変数は指定しなくてもよい場合もあります。その場合は可視化変数は自動で生成されます。簡単な例を見ましょう。
px.line(df, y='Temperature_degC')
px.lineは折れ線グラフを描きます。折れ線グラフにはxとyの2つの可視化変数を使います。xとyのどちらか一方を省略すれば、省略した変数には0からN-1までの整数が自動的に割り当てられます(Nはデータフレームの行数です)。
以下の例では、yは気温(Temperature_degC)に指定して、xは自動で生成されます。
In [ ]:
px.line(df, y='Temperature_degC')
lang:en Specifying x explicitly to map to the time column Time_Hour does not change the plot much. The only difference is that the automatically generated x values range from 0 to 23, while Time_Hour ranges from 1 to 24.
lang:ja xの可視化変数を明示的に指定することも可能です。 以下の例では、xに時間(Time_Hour)を指定しても、グラフはほとんど変わりません。
違いはたった一つです。Time_Hour(時間)の区間は[1,24]ですが、先の自動的に生成された区間は[0,23]でした。
In [ ]:
px.line(df, x='Time_Hour', y='Temperature_degC')
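The one-hour shift between the two plots can also be checked without drawing anything. The sketch below uses a hypothetical 24-row stand-in for the weather table (made-up temperatures) and assumes the frame has the default integer row index, which is what supplies the automatic 0 to N-1 values.

```python
import pandas as pd

# Hypothetical stand-in for the hourly table: 24 rows, made-up temperatures.
toy = pd.DataFrame({
    "Time_Hour": list(range(1, 25)),
    "Temperature_degC": [20.0 + 0.25 * h for h in range(24)],
})

implicit_x = list(toy.index)            # default row index: 0, 1, ..., 23
explicit_x = toy["Time_Hour"].tolist()  # column values: 1, 2, ..., 24

# Every explicit value is exactly one larger than the automatic one.
shift = {e - i for i, e in zip(implicit_x, explicit_x)}
print(implicit_x[0], implicit_x[-1])
print(explicit_x[0], explicit_x[-1])
print(shift)
```

Because the shift is constant, the curve has the same shape in both plots; only the labels on the horizontal axis move by one.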
lang:en Similarly, one can plot any other variable that is present in the data frame. To remind yourself what columns
you have in the data frame, use df.dtypes.
lang:ja データフレームに入っている変数は全て可視化できます。データフレームに入っている変数の一覧を見るには、df.dtypesをご利用ください。
In [ ]:
df.dtypes
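When a data frame mixes numeric and text columns, it can help to list only the numeric ones as candidates for x, y, or size. This is a minimal sketch on a hypothetical frame; the Weather text column is an assumption added for illustration and may not exist in the course data.

```python
import pandas as pd

# Hypothetical frame with made-up values; "Weather" is an illustrative
# text column, not a column known to exist in the course data.
toy = pd.DataFrame({
    "Time_Hour": [1, 2, 3],
    "Temperature_degC": [21.4, 21.2, 21.0],
    "Weather": ["cloudy", "cloudy", "rain"],
})

# select_dtypes keeps only the columns whose dtype matches the filter.
numeric_columns = toy.select_dtypes(include="number").columns.tolist()
print(numeric_columns)
```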
In [ ]:
px.line(df, y='Pressure_hPa')
lang:en You can make the plot show additional dimensions by providing additional mappings. The color mapping makes the points use a color scale to represent an additional variable. Here is a different kind of plot, with x and y showing different variables and color representing the time of day.
This example uses the plotting method px.scatter, which draws a scatter plot; the coordinates of the points are given by the plot variable mappings x and y.
lang:ja xとyの他に可視化変数があります。colorは色を使ってもう一つの変数を可視化することができます。以下の例は散布図です。xとyはそれぞれ違うデータ変数を可視化し、色は時間を表します。散布図はpx.scatterによって指定します。
In [ ]:
px.scatter(df, x='Temperature_degC', y='Pressure_hPa', color='Time_Hour')
lang:en Another useful additional dimension is size. Let's use it to visualize how the amount of rain is related to the temperature.
lang:ja もう一つの便利な可視化変数があります:size(大きさ)を使って、雨の量も可視化して、温度や気圧との関係をみてみましょう。
In [ ]:
px.scatter(df, x='Time_Hour', y='Temperature_degC', size='Precipitation_mm')
lang:en It is possible to map the same column of the data frame to multiple plot variables so that the plot shows the information redundantly, but please take care not to overuse this!
lang:ja 同じデータ変数を複数の可視化変数に用いてもいいです。そうすれば、グラフが読みやすくなることがあります。 色などを使いすぎないように気をつけましょう。
In [ ]:
px.scatter(df, x='Time_Hour', y='Temperature_degC', size='Precipitation_mm', color='Precipitation_mm')
In [ ]:
# MASTER ONLY
df.melt(id_vars=['Time_Hour'], value_vars=['Temperature_degC', 'Pressure_hPa']).iloc[[0,1,-2,-1]]
In [ ]:
# MASTER ONLY: This example is too confusing.
# This example demonstrates how Grammar of Graphics makes it hard to create confusing plots
# when starting from a tidy dataset. E.g., it is hard to plot different variables together.
# In this example it takes some effort even to try plotting temperature and pressure together,
# and even then it does not work well because of the different value ranges. Grammar of Graphics
# assumes that all plots shown together share one variable, and as a consequence chooses
# a single axis scale, which does not work here for both variables, because their ranges
# are very different and cannot be plotted with the same axis scale.
px.line(df.melt(id_vars=['Time_Hour'], value_vars=['Temperature_degC', 'Pressure_hPa']), x='Time_Hour', y='value', facet_col='variable')
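To see what the melted frame looks like and why a shared axis scale is awkward, the reshaping can be inspected on its own. This is a sketch on a hypothetical three-row frame with made-up values; only the column names follow this unit.

```python
import pandas as pd

# Hypothetical three-row frame; the values are made up for illustration.
toy = pd.DataFrame({
    "Time_Hour": [1, 2, 3],
    "Temperature_degC": [21.4, 21.2, 21.0],
    "Pressure_hPa": [1009.2, 1009.0, 1008.7],
})

# melt stacks the two value columns into (variable, value) pairs,
# producing one row per hour and per measured quantity.
long_form = toy.melt(id_vars=["Time_Hour"],
                     value_vars=["Temperature_degC", "Pressure_hPa"])
print(long_form)

# The per-variable value ranges do not overlap at all, which is why
# a single shared y axis cannot show both curves well.
ranges = long_form.groupby("variable")["value"].agg(["min", "max"])
print(ranges)
```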
lang:en Finally, we introduce one special plotting method that takes almost no inputs and automatically draws scatter plots of all pairs of variables present in the data frame. This makes it easy to see whether any pairs of variables are highly correlated, which shows up as a regular, line-like pattern in the scatter plot. Note that all plots on the diagonal look like lines. This is not surprising, because every variable is perfectly correlated with itself!
lang:ja 最後になりますが、ほぼ自動的な可視化方法を紹介しましょう。それはデータに入っている変数を二つずつ利用して、散布図を描きます。 このグラフを見れば、変数同士がどのように関係するのか一目で確認できます。対角線のグラフは全て線に見えますが、それはなぜかというと、どの変数であっても、自分自身に対しては完全に相関するからです。
In [ ]:
px.scatter_matrix(df)
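The visual impression from the scatter matrix can be cross-checked numerically with a correlation table. The sketch below uses a hypothetical five-row frame with made-up values; Temperature_degF is an illustrative derived column, not part of the course data, added so that one pair is exactly linear.

```python
import numpy as np
import pandas as pd

# Hypothetical frame; the values are made up for illustration.
toy = pd.DataFrame({
    "Time_Hour": [1, 2, 3, 4, 5],
    "Temperature_degC": [21.0, 21.5, 22.5, 22.0, 23.5],
})
# Illustrative derived column: an exact linear function of the temperature.
toy["Temperature_degF"] = toy["Temperature_degC"] * 9 / 5 + 32

# Pairwise Pearson correlation coefficients between all columns.
corr = toy.corr()
print(corr.round(3))

# Each variable is perfectly correlated with itself, so the diagonal is 1.
print(np.allclose(np.diag(corr.values), 1.0))
```

A coefficient close to 1 or -1 corresponds to the line-like pattern in the scatter matrix, while values near 0 correspond to a shapeless cloud of points.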
lang:en
Load the weather data set from data/15-July-2019-Tokyo-hourly.csv and
visualize the amount of sunshine per hour, then find out from the plot which hour had the most sunshine.
lang:ja
天気についてのデータをdata/15-July-2019-Tokyo-hourly.csvから読み込んで、
日向の量(SunshineDuration_h)を可視化して、どの時間帯で日向が一番多かったかを見つけてください。
In [ ]:
%%solution
""" # BEGIN PROMPT
df15 = pd.read_csv(...)
px.___(...)
""" # END PROMPT
# BEGIN SOLUTION
df15 = pd.read_csv('data/15-July-2019-Tokyo-hourly.csv')
px.bar(df15, x='Time_Hour', y='SunshineDuration_h')
# END SOLUTION
lang:en The expected answer is hours 13 and 14, which both had 0.1 h of sunshine. Please check whether your plot shows that clearly.
lang:ja 答えは13時と14時です。両方とも日向の時間は0.1でした。可視化によってそれははっきり見えるかご確認ください。
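A reading taken from the plot can be double-checked with a short pandas query. The frame below is a hypothetical stand-in for the 15 July data: it reproduces only the stated answer (0.1 h at hours 13 and 14), and the zeros for all other hours are an assumption made for illustration.

```python
import pandas as pd

# Hypothetical stand-in for df15: only the 0.1 h values at hours 13 and 14
# come from the text; the zeros elsewhere are assumed for illustration.
hours = list(range(1, 25))
toy15 = pd.DataFrame({
    "Time_Hour": hours,
    "SunshineDuration_h": [0.1 if h in (13, 14) else 0.0 for h in hours],
})

# Select every hour that attains the maximum, so that ties are not lost.
peak = toy15["SunshineDuration_h"].max()
sunniest_hours = toy15.loc[toy15["SunshineDuration_h"] == peak, "Time_Hour"].tolist()
print(sunniest_hours)
```

Comparing against the maximum rather than calling `idxmax` is a deliberate choice here: `idxmax` returns a single row, which would hide the tie between the two hours.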
In [ ]:
%%inlinetest FigureTest
try:
    df15
    assert len(df15) == 24, "Did you load the right data set? Expected to see 24 rows, but got %d" % len(df15)
except NameError:
    assert False, "Your code does not define df15"
# Check the submission syntactically.
import ast
# This import will be uncommented when executed on autograder server.
#import submission_source
try:
    a = ast.parse(submission_source.source)
    assert len(a.body) > 0, "Is your code submission empty?"
    e = None
    for x in a.body:
        if x.__class__ == ast.Expr:
            e = x
            break
    assert e is not None, "Your code does not have any function call in it?"
    assert e.value.__class__ == ast.Call, "Do you have a function call in your cell? The code may be okay, but I am just not sure"
    assert e.value.func.__class__ == ast.Attribute, "I do not recognize your function call. The code may be okay, but I am just not sure"
    assert e.value.func.attr in set(['line', 'bar', 'scatter']), "Expected to see a px.line() or px.bar() or px.scatter plot, but got %s. This may be okay, but I am just not sure" % e.value.func.attr
except AssertionError as e:
    raise e
except SyntaxError as e:
    assert False, "Your code does not compile: %s" % e
except IndentationError as e:
    assert False, "Something is wrong with the test: %s" % e
except Exception as e:
    assert False, "Something is wrong with the test: %s" % e
# TODO(salikh): Implement a library for easy checking of syntactic conditions.
#assert ae.TopExprIsCall(a), "Your "
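The syntactic check above could be factored into a small reusable helper. The function below is a hypothetical sketch, not the library mentioned in the TODO: the name first_call_attribute is an assumption. Unlike the inline test, which inspects only the first expression statement, it skips expression statements that are not attribute-style calls.

```python
import ast


def first_call_attribute(source):
    """Return the attribute name of the first top-level call of the form
    obj.method(...), or None if the source has no such statement.

    Hypothetical helper for illustration; raises SyntaxError if the
    source does not parse.
    """
    tree = ast.parse(source)
    for node in tree.body:
        # Only bare expression statements count; assignments are skipped.
        if isinstance(node, ast.Expr) and isinstance(node.value, ast.Call):
            func = node.value.func
            if isinstance(func, ast.Attribute):
                return func.attr
    return None


example = "df15 = pd.read_csv('data.csv')\npx.bar(df15, y='SunshineDuration_h')"
print(first_call_attribute(example))
```

The source is only parsed, never executed, so the helper can be applied to a submission even when names such as pd or px are not defined.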
In [ ]:
# MASTER ONLY
# This visualization should work as well.
px.line(df15, x='Time_Hour', y='SunshineDuration_h')
In [ ]:
%%submission
px.line(df15, y='SunshineDuration_h')