本培训教程,通过 jupyter-notebook 来指导怎么通过 Python Pandas 来进行数据分析和图表呈现。 包括下面这些内容:

  • 数据的各类获得方式
  • 数据接口的调用
  • 数据的整理和清洗
  • 数据的存储
  • 计算各类指标
  • 基本制图
  • 用图表来进行数据分析

我们用通过分析新冠疫情数据为例,来看看怎么从接口获取数据、数据整理和制作图表等。

其中会涉及到各类知识点,我们会尽量详细解释清楚。

我们先举一个完整的例子,从调用API接口开始,用 Pandas来处理基本数据,整个过程通过 Python 的 jupyter notebook 环境中实现。

  1. 设置 API的地址,调用token
  2. 设置 headers、payload等需要调用的参数
  3. 通过 requests 的 get 方法来访问数据
  4. 通过 Pandas 来简单处理数据
  5. 显示数据

In [3]:
# demo for infection/region
# input region, start_date, get data
# 接口:感染/国家地区

import requests
import pandas as pd

# API url
url = 'https://covid-19.adapay.tech/api/v1/'
# token, can call register function get the API token
token = '497115d0c2ff9586bf0fe03088cfdbe2'

# region or country
region='Italy'

# headers, need the API token
headers = {
    'token': token
}

# the params
payload = {
    'region': region,
    'start_date':'2020-03-24'
}

# call requets to load 
r = requests.get(url+'infection/region', params=payload, headers=headers)

data = r.json()

# use pandas to get the data
df = pd.DataFrame.from_dict(data['data']['region'][region])
print(df)
print('---')


               2020-03-24
confirmed           69176
confirmed_add        5249
deaths               6820
deaths_add            743
recovered            8326
recovered_add         894
---

In [2]:
# demo for infection/region
# input region, start_date, end_date, get data

# the params
payload = {
    'region': region,
    'start_date': '2020-03-24',
    'end_date': '2020-03-31'
}

# call requets to load 
r = requests.get(url+'infection/region', params=payload, headers=headers)

data = r.json()

# use pandas to get the data
df = pd.DataFrame.from_dict(data['data']['region'][region])
print(df)
print('---')


               2020-03-24  2020-03-25  2020-03-26  2020-03-27  2020-03-28  \
confirmed_add        5249        5210        6203        5909        5974   
deaths_add            743         683         712         919         889   
recovered_add         894        1036         999         589        1434   
confirmed           69176       74386       80589       86498       92472   
deaths               6820        7503        8215        9134       10023   
recovered            8326        9362       10361       10950       12384   

               2020-03-29  2020-03-30  2020-03-31  
confirmed_add        5217        4050        4053  
deaths_add            756         812         837  
recovered_add         646        1590        1109  
confirmed           97689      101739      105792  
deaths              10779       11591       12428  
recovered           13030       14620       15729  
---

In [3]:
# demo for infection/region
# input region, start_date, end_date, get data
# exchange the row and column by Pandas, the row index is date
# 交换数据的行和列

df = df.T
print(df)
print('---')


            confirmed_add  deaths_add  recovered_add  confirmed  deaths  \
2020-03-24           5249         743            894      69176    6820   
2020-03-25           5210         683           1036      74386    7503   
2020-03-26           6203         712            999      80589    8215   
2020-03-27           5909         919            589      86498    9134   
2020-03-28           5974         889           1434      92472   10023   
2020-03-29           5217         756            646      97689   10779   
2020-03-30           4050         812           1590     101739   11591   
2020-03-31           4053         837           1109     105792   12428   

            recovered  
2020-03-24       8326  
2020-03-25       9362  
2020-03-26      10361  
2020-03-27      10950  
2020-03-28      12384  
2020-03-29      13030  
2020-03-30      14620  
2020-03-31      15729  
---

In [4]:
# demo for infection/region
# input region, start_date, end_date, get data
# exchange the row and column by Pandas, the row index is date
# add calucate column, mortailty rate

df['mortality rate'] = df.apply(lambda x: x['deaths'] / x['confirmed'], axis=1)
print(df)
print('---')


            confirmed_add  deaths_add  recovered_add  confirmed  deaths  \
2020-03-24           5249         743            894      69176    6820   
2020-03-25           5210         683           1036      74386    7503   
2020-03-26           6203         712            999      80589    8215   
2020-03-27           5909         919            589      86498    9134   
2020-03-28           5974         889           1434      92472   10023   
2020-03-29           5217         756            646      97689   10779   
2020-03-30           4050         812           1590     101739   11591   
2020-03-31           4053         837           1109     105792   12428   

            recovered  mortality rate  
2020-03-24       8326        0.098589  
2020-03-25       9362        0.100866  
2020-03-26      10361        0.101937  
2020-03-27      10950        0.105598  
2020-03-28      12384        0.108390  
2020-03-29      13030        0.110340  
2020-03-30      14620        0.113929  
2020-03-31      15729        0.117476  
---

In [5]:
# draw the line chart, for column confirmed and deaths

import requests
import pandas as pd
import matplotlib.pyplot as plt

# API url
url = 'https://covid-19.adapay.tech/api/v1/'
# token, can call register function get the API token
token = '497115d0c2ff9586bf0fe03088cfdbe2'

# region or country
region='Italy'

# headers, need the API token
headers = {
    'token': token
}

# the params
payload = {
    'region': region,
    'start_date':'2020-03-24',
    'end_date':'2020-03-31'
}

# call requets to load 
r = requests.get(url+'infection/region', params=payload, headers=headers)

data = r.json()

# use pandas to get the data
df = pd.DataFrame.from_dict(data['data']['region'][region])

df = df.T

plt.figure()
df[['confirmed','deaths']].plot(kind='line')


Out[5]:
<matplotlib.axes._subplots.AxesSubplot at 0x1d542df3fc8>
<Figure size 432x288 with 0 Axes>

In [6]:
# draw the line chart, for column confirmed and deaths

plt.figure();
df[['confirmed_add','deaths_add','recovered_add']].plot(kind='bar')


Out[6]:
<matplotlib.axes._subplots.AxesSubplot at 0x1d542edd408>
<Figure size 432x288 with 0 Axes>

In [7]:
# demo for infection/region/detail 

import requests
import pandas as pd
import matplotlib.pyplot as plt

# API url
url = 'https://covid-19.adapay.tech/api/v1/'
# token, can call register function get the API token
token = '497115d0c2ff9586bf0fe03088cfdbe2'

# region or country
region='US'

# headers, need the API token
headers = {
    'token': token
}

# the params
payload = {
    'region': region,
    'start_date':'2020-03-24',
    'end_date':'2020-03-31'
}
r = requests.get(url+'infection/region/detail', params=payload, headers=headers)

data = r.json()
df = pd.DataFrame.from_dict(data['data'])
df.head()


Out[7]:
area
Alabama {'2020-03-24': {'confirmed_add': 46, 'deaths_a...
Alaska {'2020-03-24': {'confirmed_add': 4, 'deaths_ad...
American Samoa {'2020-03-24': {'confirmed_add': 0, 'deaths_ad...
Arizona {'2020-03-24': {'confirmed_add': 91, 'deaths_a...
Arkansas {'2020-03-24': {'confirmed_add': 27, 'deaths_a...

In [8]:
df = pd.DataFrame.from_dict(data['data']['area'])
df.head()


Out[8]:
Alabama Alaska American Samoa Arizona Arkansas California Colorado Connecticut Delaware District of Columbia ... Texas Utah Vermont Virgin Islands Virginia Washington West Virginia Wisconsin Wuhan Evacuee Wyoming
2020-03-24 {'confirmed_add': 46, 'deaths_add': 0, 'recove... {'confirmed_add': 4, 'deaths_add': 0, 'recover... {'confirmed_add': 0, 'deaths_add': 0, 'recover... {'confirmed_add': 91, 'deaths_add': 3, 'recove... {'confirmed_add': 27, 'deaths_add': 2, 'recove... {'confirmed_add': 430, 'deaths_add': 11, 'reco... {'confirmed_add': 19, 'deaths_add': 1, 'recove... {'confirmed_add': 203, 'deaths_add': 2, 'recov... {'confirmed_add': 36, 'deaths_add': 0, 'recove... {'confirmed_add': 21, 'deaths_add': 0, 'recove... ... {'confirmed_add': 197, 'deaths_add': 3, 'recov... {'confirmed_add': 41, 'deaths_add': 0, 'recove... {'confirmed_add': 20, 'deaths_add': 2, 'recove... {'confirmed_add': 10, 'deaths_add': 0, 'recove... {'confirmed_add': 39, 'deaths_add': 3, 'recove... {'confirmed_add': 107, 'deaths_add': 7, 'recov... {'confirmed_add': 6, 'deaths_add': 0, 'recover... {'confirmed_add': 56, 'deaths_add': 0, 'recove... {'confirmed_add': 0, 'deaths_add': 0, 'recover... {'confirmed_add': 3, 'deaths_add': 0, 'recover...
2020-03-25 {'confirmed_add': 139, 'deaths_add': 1, 'recov... {'confirmed_add': 7, 'deaths_add': 1, 'recover... {'confirmed_add': 0, 'deaths_add': 0, 'recover... {'confirmed_add': 75, 'deaths_add': 1, 'recove... {'confirmed_add': 61, 'deaths_add': 0, 'recove... {'confirmed_add': 460, 'deaths_add': 15, 'reco... {'confirmed_add': 298, 'deaths_add': 8, 'recov... {'confirmed_add': 257, 'deaths_add': 7, 'recov... {'confirmed_add': 15, 'deaths_add': 0, 'recove... {'confirmed_add': 46, 'deaths_add': 0, 'recove... ... {'confirmed_add': 274, 'deaths_add': 3, 'recov... {'confirmed_add': 42, 'deaths_add': 0, 'recove... {'confirmed_add': 30, 'deaths_add': 1, 'recove... {'confirmed_add': 0, 'deaths_add': 0, 'recover... {'confirmed_add': 103, 'deaths_add': 0, 'recov... {'confirmed_add': 263, 'deaths_add': 17, 'reco... {'confirmed_add': 17, 'deaths_add': 0, 'recove... {'confirmed_add': 140, 'deaths_add': 2, 'recov... NaN {'confirmed_add': 15, 'deaths_add': 0, 'recove...
2020-03-26 {'confirmed_add': 136, 'deaths_add': 0, 'recov... {'confirmed_add': 15, 'deaths_add': 0, 'recove... {'confirmed_add': 0, 'deaths_add': 0, 'recover... {'confirmed_add': 107, 'deaths_add': 2, 'recov... {'confirmed_add': 55, 'deaths_add': 0, 'recove... {'confirmed_add': 901, 'deaths_add': 16, 'reco... {'confirmed_add': 409, 'deaths_add': 3, 'recov... {'confirmed_add': 137, 'deaths_add': 2, 'recov... {'confirmed_add': 11, 'deaths_add': 1, 'recove... {'confirmed_add': 44, 'deaths_add': 1, 'recove... ... {'confirmed_add': 334, 'deaths_add': 6, 'recov... {'confirmed_add': 56, 'deaths_add': 0, 'recove... {'confirmed_add': 33, 'deaths_add': 1, 'recove... {'confirmed_add': 0, 'deaths_add': 0, 'recover... {'confirmed_add': 70, 'deaths_add': 1, 'recove... {'confirmed_add': 616, 'deaths_add': 17, 'reco... {'confirmed_add': 13, 'deaths_add': 0, 'recove... {'confirmed_add': 107, 'deaths_add': 3, 'recov... NaN {'confirmed_add': 9, 'deaths_add': 0, 'recover...
2020-03-27 {'confirmed_add': 70, 'deaths_add': 3, 'recove... {'confirmed_add': 2, 'deaths_add': 0, 'recover... {'confirmed_add': 0, 'deaths_add': 0, 'recover... {'confirmed_add': 157, 'deaths_add': 5, 'recov... {'confirmed_add': 46, 'deaths_add': 1, 'recove... {'confirmed_add': 758, 'deaths_add': 13, 'reco... {'confirmed_add': 3, 'deaths_add': 8, 'recover... {'confirmed_add': 279, 'deaths_add': 6, 'recov... {'confirmed_add': 33, 'deaths_add': 1, 'recove... {'confirmed_add': 40, 'deaths_add': 0, 'recove... ... {'confirmed_add': 374, 'deaths_add': 5, 'recov... {'confirmed_add': 76, 'deaths_add': 0, 'recove... {'confirmed_add': 26, 'deaths_add': 1, 'recove... {'confirmed_add': 2, 'deaths_add': 0, 'recover... {'confirmed_add': 141, 'deaths_add': 0, 'recov... {'confirmed_add': 270, 'deaths_add': 7, 'recov... {'confirmed_add': 24, 'deaths_add': 0, 'recove... {'confirmed_add': 198, 'deaths_add': 4, 'recov... NaN {'confirmed_add': 17, 'deaths_add': 0, 'recove...
2020-03-28 {'confirmed_add': 107, 'deaths_add': 0, 'recov... {'confirmed_add': 27, 'deaths_add': 1, 'recove... {'confirmed_add': 0, 'deaths_add': 0, 'recover... {'confirmed_add': 108, 'deaths_add': 2, 'recov... {'confirmed_add': 28, 'deaths_add': 2, 'recove... {'confirmed_add': 438, 'deaths_add': 16, 'reco... {'confirmed_add': 307, 'deaths_add': 4, 'recov... {'confirmed_add': 233, 'deaths_add': 6, 'recov... {'confirmed_add': 51, 'deaths_add': 3, 'recove... {'confirmed_add': 33, 'deaths_add': 1, 'recove... ... {'confirmed_add': 518, 'deaths_add': 4, 'recov... {'confirmed_add': 130, 'deaths_add': 0, 'recov... {'confirmed_add': 27, 'deaths_add': 2, 'recove... {'confirmed_add': 3, 'deaths_add': 0, 'recover... {'confirmed_add': 133, 'deaths_add': 3, 'recov... {'confirmed_add': 553, 'deaths_add': 31, 'reco... {'confirmed_add': 20, 'deaths_add': 0, 'recove... {'confirmed_add': 129, 'deaths_add': 3, 'recov... NaN {'confirmed_add': 12, 'deaths_add': 0, 'recove...

5 rows × 58 columns


In [9]:
df = pd.DataFrame.from_dict(data['data']['area']['New York'])
df


Out[9]:
2020-03-24 2020-03-25 2020-03-26 2020-03-27 2020-03-28 2020-03-29 2020-03-30 2020-03-31
confirmed_add 4797 5160 7036 6999 7534 7238 7015 9170
deaths_add 52 75 100 142 201 237 253 332
recovered_add 0 0 0 0 0 0 0 0
confirmed 25681 30841 37877 44876 52410 59648 66663 75833
deaths 210 285 385 527 728 965 1218 1550
recovered 0 0 0 0 0 0 0 0

In [10]:
df = df.T

plt.figure();
df[['confirmed_add','deaths_add','recovered_add']].plot(kind='bar')


Out[10]:
<matplotlib.axes._subplots.AxesSubplot at 0x1d54301f4c8>
<Figure size 432x288 with 0 Axes>

In [11]:
df = pd.DataFrame.from_dict(data['data']['area']['California'])
df


Out[11]:
2020-03-24 2020-03-25 2020-03-26 2020-03-27 2020-03-28 2020-03-29 2020-03-30 2020-03-31
confirmed_add 430 460 901 758 438 757 1286 1072
deaths_add 11 15 16 13 16 14 22 27
recovered_add 0 0 0 0 0 0 0 0
confirmed 2538 2998 3899 4657 5095 5852 7138 8210
deaths 50 65 81 94 110 124 146 173
recovered 0 0 0 0 0 0 0 0

In [20]:
# demo for infection/global

import requests
import pandas as pd
import matplotlib.pyplot as plt

# API url
url = 'https://covid-19.adapay.tech/api/v1/'
# token, can call register function get the API token
token = '497115d0c2ff9586bf0fe03088cfdbe2'

# headers, need the API token
headers = {
    'token': token
}

r = requests.get(url+'infection/global',  headers=headers)

data = r.json()

df = pd.DataFrame.from_dict(data['data']['global']['region'])
df = df.T
df


Out[20]:
confirmed_add deaths_add recovered_add
Afghanistan 664 21 32
Albania 465 23 232
Algeria 1982 313 601
Andorra 645 29 128
Angola 18 2 4
... ... ... ...
Vietnam 255 0 146
Western Sahara 2 0 0
Yemen 0 0 0
Zambia 43 2 30
Zimbabwe 16 3 0

189 rows × 3 columns


In [24]:
print(df.shape)
print(df.columns)
print(df.info())


(189, 3)
Index(['confirmed_add', 'deaths_add', 'recovered_add'], dtype='object')
<class 'pandas.core.frame.DataFrame'>
Index: 189 entries, Afghanistan to Zimbabwe
Data columns (total 3 columns):
 #   Column         Non-Null Count  Dtype
---  ------         --------------  -----
 0   confirmed_add  189 non-null    int64
 1   deaths_add     189 non-null    int64
 2   recovered_add  189 non-null    int64
dtypes: int64(3)
memory usage: 5.9+ KB
None

In [ ]:
df = pd.DataFrame.from_dict(data)

In [ ]:
""" 
@author:Bingo.he 
@file: get_target_value.py 
@time: 2017/12/22 
"""
def get_target_value(key, dic, tmp_list):
    """
    :param key: 目标key值
    :param dic: JSON数据
    :param tmp_list: 用于存储获取的数据
    :return: list
    """
    if not isinstance(dic, dict) or not isinstance(tmp_list, list):  # 对传入数据进行格式校验
        return 'argv[1] not an dict or argv[-1] not an list '

    if key in dic.keys():
        tmp_list.append(dic[key])  # 传入数据存在则存入tmp_list

    for value in dic.values():  # 传入数据不符合则对其value值进行遍历
        if isinstance(value, dict):
            get_target_value(key, value, tmp_list)  # 传入数据的value值是字典,则直接调用自身
        elif isinstance(value, (list, tuple)):
            _get_value(key, value, tmp_list)  # 传入数据的value值是列表或者元组,则调用_get_value


    return tmp_list


def _get_value(key, val, tmp_list):
    for val_ in val:
        if isinstance(val_, dict):  
            get_target_value(key, val_, tmp_list)  # 传入数据的value值是字典,则调用get_target_value
        elif isinstance(val_, (list, tuple)):
            _get_value(key, val_, tmp_list)   # 传入数据的value值是列表或者元组,则调用自身
    
list0=[]
print(get_target_value('New York',data,list0))

In [ ]:
url = 'https://covid-19.adapay.tech/api/v1/'

r = requests.post(url+'authentication/register', data = {'email':'wingfish@gmail.com'})

print(r)

In [ ]: