First Steps with Pandas


In [38]:
import pandas as pd

Importing Data


In [39]:
df = pd.read_csv('scraped_and_cleand_six.csv')

Looking at summary of the data


In [40]:
df.info()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1169 entries, 0 to 1168
Data columns (total 7 columns):
Unnamed: 0     1169 non-null int64
Company        1169 non-null object
Date           1169 non-null object
Price          1169 non-null float64
Share Total    1169 non-null int64
Type           1169 non-null object
Price_m        1169 non-null float64
dtypes: float64(2), int64(2), object(3)
memory usage: 64.0+ KB

In [11]:
df['Share Total'].apply(comma)


Out[11]:
0         14323
1         12032
2           500
3           500
4            15
5             6
6          2000
7       2190000
8        160000
9          3118
10        82000
11          787
12        13543
13         3290
14         5000
15       950000
16        40000
17          400
18      1460000
19          743
20           13
21         4000
22         2000
23         3418
24         1200
25         5000
26          100
27          868
28       622016
29          190
         ...   
1149          4
1150     150000
1151        316
1152        134
1153        131
1154        131
1155       5410
1156       9830
1157        513
1158          9
1159        105
1160         41
1161        153
1162       5000
1163       1214
1164        267
1165          4
1166         30
1167       1000
1168        292
1169          5
1170         57
1171      23703
1172         11
1173         57
1174       1800
1175         57
1176      30000
1177         50
1178         57
Name: Share Total, Length: 1179, dtype: object

In [12]:
df['Share Total'] = df['Share Total'].apply(comma)

In [13]:
df.info()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1179 entries, 0 to 1178
Data columns (total 5 columns):
Company        1179 non-null object
Date           1179 non-null object
Price          1169 non-null object
Share Total    1179 non-null object
Type           1179 non-null object
dtypes: object(5)
memory usage: 46.1+ KB

In [14]:
df['Share Total'] = df['Share Total'].astype(int)

In [15]:
df.info()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1179 entries, 0 to 1178
Data columns (total 5 columns):
Company        1179 non-null object
Date           1179 non-null object
Price          1169 non-null object
Share Total    1179 non-null int64
Type           1179 non-null object
dtypes: int64(1), object(4)
memory usage: 46.1+ KB

Lets sort the values


In [16]:
df['Share Total'].sort_values(ascending=False).head(20)


Out[16]:
1017    27205398
364      4859499
610      3496500
538      3000000
7        2190000
1082     2190000
1083     2090000
120      1800000
885      1751848
312      1531095
1085     1460000
18       1460000
1084     1340000
545      1340000
317      1290398
784      1076516
256      1000000
306      1000000
15        950000
324       940000
Name: Share Total, dtype: int64

In [17]:
df[df['Share Total']==27205398]


Out[17]:
Company Date Price Share Total Type
1017 KTM Industries AG 23.02.2017 133'306'450.20 27205398 Sale

Or, if we want to see the whole sorted table


In [18]:
df.sort_values(by='Company', ascending=True).head()


Out[18]:
Company Date Price Share Total Type
0 ABB Ltd 30.10.2017 362'229.00 14323 Purchase
1 ABB Ltd 30.10.2017 304'289.00 12032 Purchase
2 ABB Ltd 30.10.2017 10'060.00 500 Purchase
3 ABB Ltd 30.10.2017 10'060.00 500 Purchase
931 ABB Ltd 15.03.2017 15'549.00 679 Sale

Jetzt zu den Preisen. Wir wenden wieder unsere Function an, aber es gibt einen Fehler. Warum?


In [19]:
df['Price'].apply(comma)


---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-19-f63b360c6463> in <module>()
----> 1 df['Price'].apply(comma)

~/.virtualenvs/master/lib/python3.5/site-packages/pandas/core/series.py in apply(self, func, convert_dtype, args, **kwds)
   2353             else:
   2354                 values = self.asobject
-> 2355                 mapped = lib.map_infer(values, f, convert=convert_dtype)
   2356 
   2357         if len(mapped) and isinstance(mapped[0], Series):

pandas/_libs/src/inference.pyx in pandas._libs.lib.map_infer (pandas/_libs/lib.c:66645)()

<ipython-input-10-96f46206cf69> in comma(elem)
      1 def comma(elem):
----> 2     elem = elem.replace("'", '')
      3     return elem

AttributeError: 'float' object has no attribute 'replace'

In [20]:
def comma(elem):
    try:
        elem = elem.replace("'", '').split('.')[0]
        return int(elem)
    except:
        return elem

In [21]:
test = "26767.89"

In [22]:
int(test.split(".")[0])


Out[22]:
26767

In [23]:
df['Price'] = df['Price'].apply(comma)

Another way of dealing with NaN Values -> Drop them


In [24]:
df.info()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1179 entries, 0 to 1178
Data columns (total 5 columns):
Company        1179 non-null object
Date           1179 non-null object
Price          1169 non-null float64
Share Total    1179 non-null int64
Type           1179 non-null object
dtypes: float64(1), int64(1), object(3)
memory usage: 46.1+ KB

In [25]:
df = df.dropna()

In [26]:
df.info()


<class 'pandas.core.frame.DataFrame'>
Int64Index: 1169 entries, 0 to 1178
Data columns (total 5 columns):
Company        1169 non-null object
Date           1169 non-null object
Price          1169 non-null float64
Share Total    1169 non-null int64
Type           1169 non-null object
dtypes: float64(1), int64(1), object(3)
memory usage: 54.8+ KB

In [27]:
df['Price'] = df['Price'].apply(comma)

Price in Millions


In [28]:
df['Price_m'] = round(df['Price'] / 1000000, 1)

In [29]:
df.sort_values(by='Price_m', ascending=False).head()


Out[29]:
Company Date Price Share Total Type Price_m
1017 KTM Industries AG 23.02.2017 133306450.0 27205398 Sale 133.3
885 Bank Coop AG 22.03.2017 77992272.0 1751848 Purchase 78.0
364 Idorsia Ltd 19.06.2017 69765826.0 4859499 Purchase 69.8
68 Feintool International Holding AG 29.09.2017 64732500.0 616500 Sale 64.7
312 Idorsia Ltd 07.07.2017 30621900.0 1531095 Purchase 30.6

Who sold the most?

Lets just consider the sales


In [30]:
df['Type'].value_counts()


Out[30]:
Sale                  694
Purchase              473
Granting / Writing      2
Name: Type, dtype: int64

In [31]:
dfs = df[df['Type']=='Sale']

In [32]:
dfs.groupby('Company')['Price_m'].sum().sort_values(ascending=False).head(10)


Out[32]:
Company
KTM Industries AG                        139.8
Temenos Group AG                          67.8
Feintool International Holding AG         65.4
Chocoladefabriken Lindt & Sprüngli AG     37.9
Molecular Partners Ltd                    28.0
Cosmo Pharmaceuticals N.V.                26.6
Partners Group Holding AG                 16.4
Vontobel Holding AG                       15.9
ams AG                                    15.4
Compagnie Financière Richemont SA         15.0
Name: Price_m, dtype: float64

And the buys?


In [33]:
dfp = df[df['Type']=='Purchase']

In [34]:
dfp.groupby('Company')['Price_m'].sum().sort_values(ascending=False).head(10)


Out[34]:
Company
Idorsia Ltd                            147.0
Bank Coop AG                            78.0
Ypsomed Holding AG                      21.2
Compagnie Financière Tradition S.A.     19.8
Temenos Group AG                        14.9
Logitech International S.A.             10.8
ams AG                                  10.0
Arbonia AG                               9.6
Edmond de Rothschild (Suisse) S.A.       7.8
Nestlé AG                                7.3
Name: Price_m, dtype: float64

In [36]:
df.to_csv('scraped_and_cleand_six.csv')

Now lets look at the dates, this can get tricky


In [221]:
df.info()


<class 'pandas.core.frame.DataFrame'>
Int64Index: 1169 entries, 0 to 1178
Data columns (total 6 columns):
Company        1169 non-null object
Date           1169 non-null object
Price          1169 non-null float64
Share Total    1169 non-null int64
Type           1169 non-null object
Price_m        1169 non-null float64
dtypes: float64(2), int64(1), object(3)
memory usage: 63.9+ KB

In [222]:
df['Date'].head()


Out[222]:
0    30.10.2017
1    30.10.2017
2    30.10.2017
3    30.10.2017
4    27.10.2017
Name: Date, dtype: object

In [223]:
from datetime import datetime

In [224]:
pd.to_datetime(df['Date'], format='%d.%m.%Y')


Out[224]:
0      2017-10-30
1      2017-10-30
2      2017-10-30
3      2017-10-30
4      2017-10-27
5      2017-10-27
6      2017-10-27
7      2017-10-27
8      2017-10-27
9      2017-10-27
10     2017-10-26
11     2017-10-26
12     2017-10-26
13     2017-10-25
14     2017-10-25
15     2017-10-25
16     2017-10-25
17     2017-10-25
18     2017-10-24
19     2017-10-24
20     2017-10-23
21     2017-10-20
22     2017-10-20
23     2017-10-18
24     2017-10-17
25     2017-10-17
26     2017-10-16
27     2017-10-16
28     2017-10-16
29     2017-10-13
          ...    
1149   2017-01-12
1150   2017-01-12
1151   2017-01-12
1152   2017-01-12
1153   2017-01-12
1154   2017-01-12
1155   2017-01-12
1156   2017-01-11
1157   2017-01-11
1158   2017-01-11
1159   2017-01-11
1160   2017-01-10
1161   2017-01-10
1162   2017-01-10
1163   2017-01-10
1164   2017-01-09
1165   2017-01-09
1166   2017-01-09
1167   2017-01-09
1168   2017-01-06
1169   2017-01-06
1170   2017-01-06
1171   2017-01-05
1172   2017-01-05
1173   2017-01-05
1174   2017-01-05
1175   2017-01-04
1176   2017-01-03
1177   2017-01-03
1178   2017-01-03
Name: Date, Length: 1169, dtype: datetime64[ns]

In [225]:
df['Date'] = pd.to_datetime(df['Date'], format='%d.%m.%Y')

In [226]:
df.info()


<class 'pandas.core.frame.DataFrame'>
Int64Index: 1169 entries, 0 to 1178
Data columns (total 6 columns):
Company        1169 non-null object
Date           1169 non-null datetime64[ns]
Price          1169 non-null float64
Share Total    1169 non-null int64
Type           1169 non-null object
Price_m        1169 non-null float64
dtypes: datetime64[ns](1), float64(2), int64(1), object(2)
memory usage: 103.9+ KB

Let's plot the transaction counts

First we need to make the date the index


In [230]:
df.index = df['Date']

In [232]:
df.resample('W')['Date'].count().head()


Out[232]:
Date
2017-01-08    11
2017-01-15    25
2017-01-22    21
2017-01-29    18
2017-02-05    14
Freq: W-SUN, Name: Date, dtype: int64

Lets plot that


In [233]:
import matplotlib.pyplot as plt
import matplotlib
plt.style.use('ggplot')
%matplotlib inline

In [234]:
df.resample('W')['Date'].count().plot()


Out[234]:
<matplotlib.axes._subplots.AxesSubplot at 0x10c8eef60>

In [235]:
df.resample('W')['Price'].sum().plot()


Out[235]:
<matplotlib.axes._subplots.AxesSubplot at 0x10ca30cf8>

Shall we print this out?


In [237]:
df.resample('W')['Price'].sum().plot()
plt.savefig('hello.pdf')


But what about different shapes?


In [241]:
df.resample('M')['Price'].sum().plot(kind='bar')


Out[241]:
<matplotlib.axes._subplots.AxesSubplot at 0x10cc51390>

In [249]:
df.resample('M')['Price'].sum().plot(kind='barh')


Out[249]:
<matplotlib.axes._subplots.AxesSubplot at 0x10db4b908>

In [252]:
df.resample('M')['Price'].sum().plot(kind='pie')


Out[252]:
<matplotlib.axes._subplots.AxesSubplot at 0x10d905128>

In [255]:
df.resample('M')['Price'].sum().plot(kind='pie', radius=0.5, shadow=True)


Out[255]:
<matplotlib.axes._subplots.AxesSubplot at 0x10dd9fa20>

In [256]:
df.resample('Q')['Price'].sum().plot(kind='pie', radius=0.5, shadow=True)


Out[256]:
<matplotlib.axes._subplots.AxesSubplot at 0x10df25898>

In [268]:
labels = 'Q1', 'Q2', 'Q3', 'Q4'
colors = ['yellowgreen', 'grey', 'grey', 'grey']
explode = (0.05, 0.05, 0.05, 0.05)
plt.axis('equal')
df.resample('Q')['Price'].sum().plot(kind='pie', radius=0.5, autopct='%1.1f%%', shadow=False, labels=labels,colors=colors, explode=explode)


Out[268]:
<matplotlib.axes._subplots.AxesSubplot at 0x10e7b8da0>

In [ ]: