Mutual Fund Data Analysis

Peter (mr.peter.lee@hotmail.com)

Last Update: 31 Dec 2015

Version: 1.0

The file (Analysis.ipynb) is written protected to prevent changes. To make changes or execute code blocks, please make a copy of the file by clicking the mouse button twice:

  • Click on "File"
  • Click on "Make a Copy"

FAQs

How to inspect dataset quickly?

  • Method 1: Use clipboard (suitable for smaller datasets & subject to internet speed).
    • find the name of the DataFrame, e.g., data_main
    • run the command in any cell block: data_main.to_clipboard()
    • launch Excel and paste in the data
  • Method 2: Export to a file and use WinSCP to retrieve the file from the remote server (suitable for large files).
    • find the name of the DataFrame, e.g., data_main
    • run the command in any cell block: data_main.to_excel("data/output/temp.xlsx")
    • launch WinSCP to locate the file and download to local computer

To-do list

[x] completed [ ] to-do

  1. To get an overview of the date first:

    • [x] One Table for all fields of the data consisting of Numbers, Means, Medians, Stddev, Quantiles, Skewness, Kurtosis, …. Date of first observation and last,….
    • [x] This table separately for all single funds (over the total period)
    • [x] This table separately for all single years (over all funds)
    • [x] A list of all values for all definitions (like buy 10 and sell 20) found in the database together with the number of each of them. Do we know the meaning of all of them?
    • [x] The table again separately for all different definitions.
    • [x] A plot of aggregated daily buys and sells (and perhaps others) over the time.
  2. Daily holdings of all funds

    • [x] Theoretically this data will allow you to produce all holdings and stock and funds values for all funds on a daily base.
    • [ ] If this works you could check your holdings against the holdings we got from Bloomberg or Morningstar for on specific point in time. Then you could produce similar tabs as under 1. I think if you are able to get this data, you have done a great job – and typically it takes longer than expected…..
    • [Peter] Checking for problems in the computed holdings is time consuming. I will move on to other work for now and come back to revise the algorithm in a few days.

In [1]:
import pandas as pd
import numpy as np
import datetime as dt
from pandas import Series, DataFrame, Panel, datetime
from datetime import timedelta
import matplotlib.mlab as mlab
import matplotlib.pyplot as plt
%matplotlib inline
import io
import os
import random
import statsmodels.stats.api as sms
from warnings import warn
from pylab import rcParams
rcParams['figure.figsize'] = 10, 5  # Change default plot size
from ipywidgets import FloatProgress  # Progress bar - f = FloatProgress(min=0, max=1); display(f); then in a loop: f.value = percentage
from IPython.display import display  # Progress bar

# ****************** Program Settings ******************

Data_Folder = "data/new/"
Output_Folder = "data/output/"
Datafile_Main = "fin_mf_trades_1995_2014_fid.dta"  # The main dataset - mutual fund holdings
Datafile_stocks = "compustat_finland_1990_2015_all.dta"  # The dataset - stock prices / dividends

# Calculate Daily Holdings
Balanced_Panel = True  # Default: True
                       # True if add extra entries to the holdings data so that there is at least one observation per day;
                       # If there is no transaction in a day --> 1 entry in the day
                       # If there is n transactions --> n entries (n>0)
Dividends_Reinvested = True  # True if dividends are assumed to be fully reinvested; False otherwise.
                             # Note that the Barber's method excludes dividends
Adjust_Price_Before_1999 = True  # True if to divide 'price' before 1999 by 5.94573 
First_Day = ""             # Default: ""
                           # Set to "" if you would like the first day in the holdings data to be the date of the first transaction of a particular stock
                           # Or you can set it to a date such as "1994-Jan-01" - The holdings record for all isin per stock will begin with that day (filled with zeros if no-transactions have yet happened)
                           # Note the first obs in the test sample is 1994-Dec-19 - this day must to be earlier than the first obs to make sense
Vars_To_Keep = ['fund_id', 'isin', 'seq', 'acc', 'sect', 'owntype', 'legtype', 'ref_code', 
                'buy_sell', 'volume', 'price', 'date']
                           # To preserve memory usage, select the key variables to keep in the data.
                           # Note that 'date' is actually 'trade_date'
        
# ******************************************************

Variable Definitions - data_main

"data_main" is the mutual fund holdings dataset.

  • date: TR_DATE from the original dataset.
  • fund_id: identifier for funds as in Bloomberg
  • acc: identifier for funds in the Euroclear Data
  • sect: category of what types of invstors
  • buy_sell: 10 for buys; 20 for sells;
  • class: same as above (mutual funds are FI)
    • HH: Household
    • FI: Financial institution (domestic)
    • FR: Foreign with own account at euroclear
    • FI: Foreign nominee account where most foreign holdings are registered
    • NF: Non-financial
    • GG: General government
    • NP: Non-for-profit
  • leg_type: Legal type of the account holder
    • 11 NATURAL PERSON
    • 12 ESTATE OF A DECEASED PERSON
    • 13 TAXATION CONSORTIUM
    • 14 PARTNERSHIP
    • 15 BANKRUPT'S ESTATE
    • 21 LIMITED PARTNERSHIP
    • 22 LAIVAISÄNNISTÖYHTIÖ (EI OSAKEYHTIÖ)
    • 30 PUBLIC LIMITED COMPANY
    • 31 LIMITED COMPANY
    • 32 MUTUAL INSURANCE COMPANY
    • 33 SAVINGS BANK
    • 34 PENSION TRUST OR FUND
    • 35 HOUSING CORPORATION
    • 41 COOPERATIVE
    • 51 FOUNDATION, FUND
    • 52 NON-PROFIT ASSOCIATION
    • 53 MUTUAL INDEMNITY INSURANCE ASSOCIATION
    • 54 ECONOMIC ASSOCIATION
    • 61 PUBLIC AUTHORITY
    • 62 PUBLIC UTILITY
    • 63 PUBLIC CORPORATION
    • 71 STATE CHURCH
    • 72 OTHER RELIGIOUS COMMUNITY
    • 90 OTHER JURIDICAL FORM
  • ref_code: Reference code of the transaction
    • 01 MARKETPLACE TRADE
    • 02 OTHER TRADE
    • 03 TRANSFER TO BOOK-ENTRIES
    • 04 TRANSFER TO OWN ACCOUNT
    • 05 CONVERSION
    • 06 BEQUEST
    • 07 INHERITANCE
    • 08 WILL
    • 09 DIVISION OF HTE PROPERTY OF THE SPOU
    • 10 FORCED SALE
    • 11 TRANSFER INT. CERTIFICATE OF OWN
    • 12 TRANSFER OF INTERNAT. CERTIFICATE
    • 13 TRANSFER TO NOM.REG B-E ACCOUNT
    • 14 TRANSFER FROM NOM.REG B-E ACCOUNT
    • 15 TRADE-BY-TRADE SETTLEMENT
    • 16 DIRECT SELLING
    • 17 TRANSFER OF COLLATERAL
    • 18 RETURN OF COLLATERAL
    • 19 CANCELLATION OF RELATIONSHIP BASED O
    • 20 ISSUE ACCOUNT TRANSACTION
    • 50 RIGHTS ISSUE
    • 51 BONUS ISSUE
    • 52 SUBSCRIPTION
    • 53 FORFEITURE
    • 54 REDEMPTION
    • 55 SPLIT
    • 56 MERGER
    • 57 COMBINATION
    • 58 CANCELLATION OF A SECURITY
    • 59 OTHER SUBSCRIPTION
    • 60 SUBSCRIPTION RIGHT
    • 61 TRANSFER OF RECORD TIME
    • 62 SECURITY TYPE CONVERSION
    • 63 CANCELLING OF SECURITY TYPE LOAN
    • 64 REVERSE SPLIT
    • 65 CANCELLATION OF SUBSCRIPTION
    • 71 CONVERSION OF CONVERTIBLE BOND
    • 72 SUBSCRIPTION WITH WARRANT
    • 73 EXECUTION OF WARRANT
    • 75 DEMERGER
    • 76 STOCK LOAN
    • 77 RETURN OF STOCK LOAN
    • 78 RERUTN OF WARRANTS
    • 96 ACCOUNT TRANSFER
    • 97 ARCHIVING
    • 98 CORRECTION
    • 99 OTHER
  • volume: transaction volume
  • price: transaction price
  • prccd: market closing price (sourced from COMPUSTAT)
  • ajexdi: adjustment factor for stock prices (sourced from COMPUSTAT)
  • cshoc: Number of shares on issue (sourced from COMPUSTAT)
  • trans_date: Date of settlement
  • trade_date: Date of trade

Import Data


In [3]:
data_main = pd.read_stata(Data_Folder + Datafile_Main)
data_main.columns = [var.lower() for var in data_main.columns]

# There are several datetime variables -> remove the unused ones
data_main = data_main.drop(['date', 'tr_date', 'trans_date'], axis=1)

# Convert trade_date to datetime
data_main['date'] = pd.to_datetime(data_main.trade_date)

date_main = data_main.sort_index(axis=1)


# Generate an index for multiple transactions within the same day to preserve their order
def add_seq(grp):
    """Add the variable 'seq' to indicate the order of transactions within a day"""
    grp['seq'] = np.arange(1, len(grp) + 1)
    return grp

data_main = data_main.groupby(by=['fund_id', 'isin', 'date'], as_index=False).apply(add_seq)

date_main = data_main.sort_values(by=['fund_id', 'isin', 'date', 'seq', 'buy_sell'])

# data_main = data_main.ix[0:10000]

data_main.head(10)


Out[3]:
fund_id isin acc sect owntype legtype ref_code buy_sell volume price ... country language prccd ajexdi cshoc trade_date class nom_ind date seq
0 FI0008803119 FI0009000251 1128458 260 1 31 1 10 1000 228 ... 1 227.00 3.4137 6124980 1994-12-27 FI 0 1994-12-27 1
1 FI0008803119 FI0009000582 1128458 260 1 31 1 10 15000 131 ... 1 129.00 1.0000 82000000 1994-12-29 FI 0 1994-12-29 1
2 FI0008803119 FI0009000707 1128458 260 1 31 1 20 600 702 ... 1 700.00 8.0000 37596439 1994-12-29 FI 0 1994-12-29 1
3 FI0008803119 FI0009000707 1128458 260 1 31 1 20 200 702 ... 1 700.00 8.0000 37596439 1994-12-29 FI 0 1994-12-29 2
4 FI0008803119 FI0009000012 1128458 260 1 31 0 10 196000 0 ... 1 5.53 1.0000 726845084 1995-01-01 FI 0 1995-01-01 1
5 FI0008803119 FI0009000053 1128458 260 1 31 0 10 294000 0 ... 1 12.20 1.0000 416794279 1995-01-01 FI 0 1995-01-01 1
6 FI0008803119 FI0009000061 1128458 260 1 31 0 10 150000 0 ... 1 11.80 1.0000 67200000 1995-01-01 FI 0 1995-01-01 1
7 FI0008803119 FI0009000145 1128458 260 1 31 0 10 67000 0 ... 1 51.60 3.0000 20070000 1995-01-01 FI 0 1995-01-01 1
8 FI0008803119 FI0009000202 1128458 260 1 31 0 10 90000 0 ... 1 55.00 1.0000 90213400 1995-01-01 FI 0 1995-01-01 1
9 FI0008803119 FI0009000251 1128458 260 1 31 0 10 12500 0 ... 1 226.00 3.4137 6124980 1995-01-01 FI 0 1995-01-01 1

10 rows × 24 columns

Descriptive Statistics

Numerical Variables


In [4]:
def sum_stats(col):
    """Produce a table of summary statistics for a numerical variable"""
    count = col.count()
    mean = col.mean()
    median = col.median()
    stddev = col.std()
    skewness = col.skew()
    kurtosis = col.kurt()
    quantile1 = col.quantile(0.25)
    quantile2 = col.quantile(0.5)
    quantile3 = col.quantile(0.75)
    first_obs_index = col.first_valid_index()
    last_obs_index = col.last_valid_index()
    
    return pd.Series([count, mean, median, stddev, skewness, kurtosis, quantile1, quantile2, quantile3, first_obs_index, last_obs_index],
                         index=['count', 'mean', 'median', 'stddev', 'skewness', 'kurtosis', 'quantile1', 'quantile2', 'quantile3', 'first_obs_index', 'last_obs_index']
                        )
    
print("Descriptive Statistics for Numerical Variables:")

stats = data_main[['volume', 'price', 'prccd', 'ajexdi', 'cshoc', 'date']].sort_values(by=['date']).drop('date', axis=1).apply(sum_stats).T

# Fix up the dates
stats['first_obs'] = data_main.date[stats.T.ix['first_obs_index']].values 
stats['last_obs'] = data_main.date[stats.T.ix['last_obs_index']].values 
stats


Descriptive Statistics for Numerical Variables:
Out[4]:
count mean median stddev skewness kurtosis quantile1 quantile2 quantile3 first_obs_index last_obs_index first_obs last_obs
volume 290141 1.724857e+04 2.000000e+03 8.851346e+04 46.630421 4273.024135 500.00 2.000000e+03 1.000000e+04 127041 185236 1994-12-19 2014-12-26
price 290141 4.633595e+01 1.713000e+01 8.849064e+01 4.661467 30.979461 10.18 1.713000e+01 3.530000e+01 127041 185236 1994-12-19 2014-12-26
prccd 287501 4.708490e+01 1.737000e+01 8.993040e+01 4.693060 31.570475 10.40 1.737000e+01 3.600000e+01 127041 185236 1994-12-19 2014-12-26
ajexdi 287501 2.164172e+00 1.000000e+00 3.867670e+00 7.520359 87.975780 1.00 1.000000e+00 1.237900e+00 127041 185236 1994-12-19 2014-12-26
cshoc 287465 4.049428e+08 1.277820e+08 8.712507e+08 3.873649 14.645906 60844580.00 1.277820e+08 2.688760e+08 127041 185236 1994-12-19 2014-12-26

Categorical Variables


In [5]:
def_class = {
    'HH': 'Household',
    'FI': 'Financial institution (domestic)',
    'FR': 'Foreign with own account at euroclear',
    'FI': 'Foreign nominee account where most foreign holdings are registered',
    'NF': 'Non-financial',
    'GG': 'General government',
    'NP': 'Non-for-profit',
    'NA': "Not Available"
} 

def_legtype = {
    11 : 'NATURAL PERSON',
    12 : 'ESTATE OF A DECEASED PERSON',
    13 : 'TAXATION CONSORTIUM',
    14 : 'PARTNERSHIP',
    15 : "BANKRUPT'S ESTATE",
    21 : 'LIMITED PARTNERSHIP',
    22 : 'LAIVAISÄNNISTÖYHTIÖ (EI OSAKEYHTIÖ)',
    30 : 'PUBLIC LIMITED COMPANY',
    31 : 'LIMITED COMPANY',
    32 : 'MUTUAL INSURANCE COMPANY',
    33 : 'SAVINGS BANK',
    34 : 'PENSION TRUST OR FUND',
    35 : 'HOUSING CORPORATION',
    41 : 'COOPERATIVE',
    51 : 'FOUNDATION, FUND',
    52 : 'NONPROFIT ASSOCIATION',
    53 : 'MUTUAL INDEMNITY INSURANCE ASSOCIATION',
    54 : 'ECONOMIC ASSOCIATION',
    61 : 'PUBLIC AUTHORITY',
    62 : 'PUBLIC UTILITY',
    63 : 'PUBLIC CORPORATION',
    71 : 'STATE CHURCH',
    72 : 'OTHER RELIGIOUS COMMUNITY',
    90 : 'OTHER JURIDICAL FORM',
    'NA': "Not Available"
}

def_ref_code = {
    1 : 'MARKETPLACE TRADE',
    2 : 'OTHER TRADE',
    3 : 'TRANSFER TO BOOKENTRIES',
    4 : 'TRANSFER TO OWN ACCOUNT',
    5 : 'CONVERSION',
    6 : 'BEQUEST',
    7 : 'INHERITANCE',
    8 : 'WILL',
    9 : 'DIVISION OF HTE PROPERTY OF THE SPOU',
    10 : 'FORCED SALE',
    11 : 'TRANSFER INT. CERTIFICATE OF OWN',
    12 : 'TRANSFER OF INTERNAT. CERTIFICATE',
    13 : 'TRANSFER TO NOM.REG BE ACCOUNT',
    14 : 'TRANSFER FROM NOM.REG BE ACCOUNT',
    15 : 'TRADEBYTRADE SETTLEMENT',
    16 : 'DIRECT SELLING',
    17 : 'TRANSFER OF COLLATERAL',
    18 : 'RETURN OF COLLATERAL',
    19 : 'CANCELLATION OF RELATIONSHIP BASED O',
    20 : 'ISSUE ACCOUNT TRANSACTION',
    50 : 'RIGHTS ISSUE',
    51 : 'BONUS ISSUE',
    52 : 'SUBSCRIPTION',
    53 : 'FORFEITURE',
    54 : 'REDEMPTION',
    55 : 'SPLIT',
    56 : 'MERGER',
    57 : 'COMBINATION',
    58 : 'CANCELLATION OF A SECURITY',
    59 : 'OTHER SUBSCRIPTION',
    60 : 'SUBSCRIPTION RIGHT',
    61 : 'TRANSFER OF RECORD TIME',
    62 : 'SECURITY TYPE CONVERSION',
    63 : 'CANCELLING OF SECURITY TYPE LOAN',
    64 : 'REVERSE SPLIT',
    65 : 'CANCELLATION OF SUBSCRIPTION',
    71 : 'CONVERSION OF CONVERTIBLE BOND',
    72 : 'SUBSCRIPTION WITH WARRANT',
    73 : 'EXECUTION OF WARRANT',
    75 : 'DEMERGER',
    76 : 'STOCK LOAN',
    77 : 'RETURN OF STOCK LOAN',
    78 : 'RERUTN OF WARRANTS',
    96 : 'ACCOUNT TRANSFER',
    97 : 'ARCHIVING',
    98 : 'CORRECTION',
    99 : 'OTHER',
    0  : 'Not Available',
    'NA': "Not Available"
}

In [6]:
def tab(col, desc=None):
    """Create a simple tabluation table for a categorical variable."""
    obs = len(col)
    
    # convert .value_counts() to a DataFrame
    df = pd.DataFrame(col.value_counts()).reset_index()
    df.columns = ['code', 'obs']
    
    if desc is not None:
        df['desc'] = df.code.apply(lambda x: desc[x])
    else:
        df['desc'] = np.nan
        
    df.sort_index(axis=1, inplace=True)
    total_freq = np.sum(df.obs)
    df['percent'] = df['obs'] / total_freq * 100
    
    # report missing observations
    df.loc[len(df)] = [np.nan, 'non-missing obs; <-- divided by total obs;', col.count(), col.count()/obs*100]
    
    return df

class


In [7]:
var = 'class'
dct = def_class
print("Summary statistics of variable: {}".format(var))
tab(data_main[var], dct)


Summary statistics of variable: class
Out[7]:
code desc obs percent
0 FI Foreign nominee account where most foreign hol... 270883 93.362538
1 NF Non-financial 13026 4.489541
2 NA Not Available 6232 2.147921
3 NaN non-missing obs; <-- divided by total obs; 290141 100.000000

legtype


In [8]:
var = 'legtype'
dct = def_legtype
print("Summary statistics of variable: {}".format(var))

tab(data_main[var], dct)


Summary statistics of variable: legtype
Out[8]:
code desc obs percent
0 51 FOUNDATION, FUND 155321 53.532937
1 90 OTHER JURIDICAL FORM 100277 34.561472
2 31 LIMITED COMPANY 18410 6.345191
3 30 PUBLIC LIMITED COMPANY 8609 2.967178
4 54 ECONOMIC ASSOCIATION 7524 2.593222
5 NaN non-missing obs; <-- divided by total obs; 290141 100.000000

ref_code


In [9]:
var = 'ref_code'
dct = def_ref_code
print("Summary statistics of variable: {}".format(var))

tab(data_main[var], dct)


Summary statistics of variable: ref_code
Out[9]:
code desc obs percent
0 1 MARKETPLACE TRADE 225797 77.823196
1 2 OTHER TRADE 57017 19.651480
2 4 TRANSFER TO OWN ACCOUNT 2732 0.941611
3 76 STOCK LOAN 1004 0.346039
4 77 RETURN OF STOCK LOAN 977 0.336733
5 96 ACCOUNT TRANSFER 454 0.156476
6 99 OTHER 416 0.143379
7 55 SPLIT 390 0.134417
8 98 CORRECTION 287 0.098917
9 57 COMBINATION 247 0.085131
10 0 Not Available 195 0.067209
11 51 BONUS ISSUE 159 0.054801
12 56 MERGER 156 0.053767
13 50 RIGHTS ISSUE 98 0.033777
14 52 SUBSCRIPTION 97 0.033432
15 60 SUBSCRIPTION RIGHT 27 0.009306
16 62 SECURITY TYPE CONVERSION 24 0.008272
17 75 DEMERGER 24 0.008272
18 58 CANCELLATION OF A SECURITY 17 0.005859
19 15 TRADEBYTRADE SETTLEMENT 9 0.003102
20 3 TRANSFER TO BOOKENTRIES 7 0.002413
21 64 REVERSE SPLIT 3 0.001034
22 16 DIRECT SELLING 2 0.000689
23 7 INHERITANCE 2 0.000689
24 NaN non-missing obs; <-- divided by total obs; 290141 100.000000

buy_sell


In [10]:
def_buy_sell = {10: "buy", 20: "sell",
                11: "what is this?", 
                21: "Not sure what is this?"}

var = 'buy_sell'
dct = def_buy_sell
print("Summary statistics of variable: {}".format(var))

warn("Code 11, Code 21's definitions are unclear.")

tab(data_main[var], dct)


Summary statistics of variable: buy_sell
/opt/anaconda/lib/python3.4/site-packages/ipykernel/__main__.py:9: UserWarning: Code 11, Code 21's definitions are unclear.
Out[10]:
code desc obs percent
0 10 buy 150573 51.896492
1 20 sell 139557 48.099717
2 11 what is this? 7 0.002413
3 21 Not sure what is this? 4 0.001379
4 NaN non-missing obs; <-- divided by total obs; 290141 100.000000

owntype


In [11]:
var = 'owntype'
dct = None
print("Summary statistics of variable: {}".format(var))

tab(data_main[var], dct)


Summary statistics of variable: owntype
Out[11]:
code desc obs percent
0 1 NaN 290141 100
1 NaN non-missing obs; <-- divided by total obs; 290141 100

sect


In [12]:
var = 'sect'
dct = None
print("Summary statistics of variable: {}".format(var))

tab(data_main[var], dct)


Summary statistics of variable: sect
Out[12]:
code desc obs percent
0 123000 NaN 135802 46.805519
1 260 NaN 127215 43.845923
2 111211 NaN 8063 2.778994
3 122200 NaN 7866 2.711096
4 124100 NaN 5149 1.774654
5 121 NaN 4963 1.710548
6 111020 NaN 1083 0.373267
7 NaN non-missing obs; <-- divided by total obs; 290141 100.000000

currency


In [13]:
var = 'currency'
dct = None
print("Summary statistics of variable: {}".format(var))

print('20% of transactions are based in currency FIM.')

tab(data_main[var], dct)


Summary statistics of variable: currency
20% of transactions are based in currency FIM.
Out[13]:
code desc obs percent
0 EUR NaN 231290 79.716414
1 FIM NaN 58851 20.283586
2 NaN non-missing obs; <-- divided by total obs; 290141 100.000000

byear


In [14]:
var = 'byear'
dct = None
print("Summary statistics of variable: {}".format(var))

tab(data_main[var], dct)


Summary statistics of variable: byear
Out[14]:
code desc obs percent
0 NaN non-missing obs; <-- divided by total obs; 0 0

gender


In [15]:
var = 'gender'
dct = None
print("Summary statistics of variable: {}".format(var))

tab(data_main[var], dct)


Summary statistics of variable: gender
Out[15]:
code desc obs percent
0 1 NaN 102049 77.205738
1 2 NaN 30129 22.794262
2 NaN non-missing obs; <-- divided by total obs; 132178 45.556471

pcode


In [16]:
var = 'pcode'
dct = None
print("Summary statistics of variable: {}".format(var))

tab(data_main[var], dct)


Summary statistics of variable: pcode
Out[16]:
code desc obs percent
0 101 NaN 67780 23.361055
1 100 NaN 60932 21.000824
2 510 NaN 56073 19.326121
3 20 NaN 46471 16.016695
4 131 NaN 25457 8.774010
5 75 NaN 21277 7.333331
6 2650 NaN 6036 2.080368
7 180 NaN 3110 1.071893
8 120 NaN 3005 1.035703
9 NaN non-missing obs; <-- divided by total obs; 290141 100.000000

country


In [17]:
var = 'country'
dct = None
print("Summary statistics of variable: {}".format(var))

tab(data_main[var], dct)


Summary statistics of variable: country
Out[17]:
code desc obs percent
0 NaN 247761 85.393309
1 FI NaN 42380 14.606691
2 NaN non-missing obs; <-- divided by total obs; 290141 100.000000

language


In [18]:
var = 'language'
dct = None
print("Summary statistics of variable: {}".format(var))

tab(data_main[var], dct)


Summary statistics of variable: language
Out[18]:
code desc obs percent
0 1 NaN 237363 81.809534
1 2 NaN 52778 18.190466
2 NaN non-missing obs; <-- divided by total obs; 290141 100.000000

nom_ind


In [19]:
var = 'nom_ind'
dct = None
print("Summary statistics of variable: {}".format(var))

tab(data_main[var], dct)


Summary statistics of variable: nom_ind
Out[19]:
code desc obs percent
0 0 NaN 290141 100
1 NaN non-missing obs; <-- divided by total obs; 290141 100

Descriptive Stats by Funds


In [20]:
grouped = data_main.groupby('fund_id', as_index=False, axis=0)

list_of_numerical_vars = ['volume', 'price', 'prccd', 'ajexdi', 'cshoc', 'date']
list_of_categorical_vars = ['class', 'legtype', 'ref_code', 'buy_sell', 'sect', 'gender', 'pcode']
dict_of_categorical_vars = {'class': def_class, 
                            'legtype': def_legtype, 
                            'ref_code': def_ref_code, 
                            'buy_sell': def_buy_sell, 
                            'sect': None, 
                            'gender': None, 
                            'pcode': None}

In [21]:
for name, group in grouped:
    print("**********  Fund: {}  **********".format(name))
    print("Numerical variables:")
    print(group[list_of_numerical_vars].describe().T)
    print("\n")
    
    print("Categorical variables:")
    for var in list_of_categorical_vars:
        print("- Variable: {}".format(var))
        print(tab(group[var], dict_of_categorical_vars[var]).to_string(index=False))
        print("\n")
    print("\n")
    print("\n")


**********  Fund: FI0008800016  **********
Numerical variables:
        count          mean           std          min            25%  \
volume  18926  2.247476e+04  1.204354e+05       1.0000       543.0000   
price   18926  4.297073e+01  7.806916e+01       0.0000        10.5100   
prccd   18682  4.400443e+01  8.020642e+01       0.1200        10.8525   
ajexdi  18682  1.920676e+00  3.416968e+00       0.1144         1.0000   
cshoc   18680  4.214782e+08  8.997133e+08  477836.0000  53150000.0000   

                 50%           75%         max  
volume  2.700000e+03  1.400000e+04     7662900  
price   1.937500e+01  3.300000e+01         990  
prccd   1.998000e+01  3.310000e+01         990  
ajexdi  1.000000e+00  1.231800e+00          64  
cshoc   1.246307e+08  2.670158e+08  4796292460  


Categorical variables:
- Variable: class
code                                               desc    obs     percent
  FI  Foreign nominee account where most foreign hol...  17471   92.312163
  NA                                      Not Available   1455    7.687837
 NaN         non-missing obs; <-- divided by total obs;  18926  100.000000


- Variable: legtype
 code                                        desc    obs  percent
   51                            FOUNDATION, FUND  18926      100
  NaN  non-missing obs; <-- divided by total obs;  18926      100


- Variable: ref_code
 code                                        desc    obs     percent
    1                           MARKETPLACE TRADE  14877   78.606150
    2                                 OTHER TRADE   3283   17.346507
    4                     TRANSFER TO OWN ACCOUNT    503    2.657720
   99                                       OTHER    143    0.755574
   55                                       SPLIT     34    0.179647
    0                               Not Available     18    0.095107
   57                                 COMBINATION     16    0.084540
   56                                      MERGER     12    0.063405
   51                                 BONUS ISSUE     12    0.063405
   52                                SUBSCRIPTION      9    0.047554
   98                                  CORRECTION      6    0.031702
   50                                RIGHTS ISSUE      6    0.031702
   62                    SECURITY TYPE CONVERSION      3    0.015851
   75                                    DEMERGER      2    0.010567
   60                          SUBSCRIPTION RIGHT      1    0.005284
    3                     TRANSFER TO BOOKENTRIES      1    0.005284
  NaN  non-missing obs; <-- divided by total obs;  18926  100.000000


- Variable: buy_sell
 code                                        desc    obs     percent
   10                                         buy  10127   53.508401
   20                                        sell   8794   46.465180
   11                               what is this?      3    0.015851
   21                      Not sure what is this?      2    0.010567
  NaN  non-missing obs; <-- divided by total obs;  18926  100.000000


- Variable: sect
   code                                        desc    obs     percent
    260                                         NaN   9221   48.721336
 123000                                         NaN   8250   43.590827
 124100                                         NaN   1455    7.687837
    NaN  non-missing obs; <-- divided by total obs;  18926  100.000000


- Variable: gender
 code                                        desc   obs     percent
    1                                         NaN  9221  100.000000
  NaN  non-missing obs; <-- divided by total obs;  9221   48.721336


- Variable: pcode
 code                                        desc    obs  percent
   20                                         NaN  18926      100
  NaN  non-missing obs; <-- divided by total obs;  18926      100






**********  Fund: FI0008801048  **********
Numerical variables:
        count          mean           std           min          25%  \
volume  22564  1.769483e+04  8.403565e+04        1.0000      1000.00   
price   22564  3.009440e+01  6.909619e+01        0.0000         6.04   
prccd   21933  3.113174e+01  7.107623e+01        0.1100         6.24   
ajexdi  21933  1.829756e+00  2.836307e+00        0.0526         1.00   
cshoc   21931  3.094699e+08  6.305392e+08  1070000.0000  42550715.00   

                 50%           75%         max  
volume  5.000000e+03  1.616050e+04     4850000  
price   1.140500e+01  2.200000e+01        1210  
prccd   1.175000e+01  2.290000e+01        1210  
ajexdi  1.000000e+00  1.101700e+00          64  
cshoc   1.254277e+08  2.564037e+08  3800942739  


Categorical variables:
- Variable: class
code                                               desc    obs  percent
  FI  Foreign nominee account where most foreign hol...  22564      100
 NaN         non-missing obs; <-- divided by total obs;  22564      100


- Variable: legtype
 code                                        desc    obs  percent
   51                            FOUNDATION, FUND  22564      100
  NaN  non-missing obs; <-- divided by total obs;  22564      100


- Variable: ref_code
 code                                        desc    obs     percent
    1                           MARKETPLACE TRADE  11276   49.973409
    2                                 OTHER TRADE  10974   48.634994
   99                                       OTHER    118    0.522957
    4                     TRANSFER TO OWN ACCOUNT     49    0.217160
   98                                  CORRECTION     45    0.199433
   57                                 COMBINATION     21    0.093069
   55                                       SPLIT     18    0.079773
    0                               Not Available     17    0.075341
   56                                      MERGER     12    0.053182
   50                                RIGHTS ISSUE      9    0.039887
   52                                SUBSCRIPTION      8    0.035455
   51                                 BONUS ISSUE      7    0.031023
   60                          SUBSCRIPTION RIGHT      4    0.017727
   58                  CANCELLATION OF A SECURITY      3    0.013296
   75                                    DEMERGER      2    0.008864
   16                              DIRECT SELLING      1    0.004432
  NaN  non-missing obs; <-- divided by total obs;  22564  100.000000


- Variable: buy_sell
 code                                        desc    obs     percent
   10                                         buy  12290   54.467293
   20                                        sell  10273   45.528275
   11                               what is this?      1    0.004432
  NaN  non-missing obs; <-- divided by total obs;  22564  100.000000


- Variable: sect
   code                                        desc    obs     percent
    260                                         NaN  10838   48.032264
 122200                                         NaN   7866   34.860840
 123000                                         NaN   3860   17.106896
    NaN  non-missing obs; <-- divided by total obs;  22564  100.000000


- Variable: gender
 code                                        desc    obs     percent
    1                                         NaN  10838  100.000000
  NaN  non-missing obs; <-- divided by total obs;  10838   48.032264


- Variable: pcode
 code                                        desc    obs  percent
  101                                         NaN  22564      100
  NaN  non-missing obs; <-- divided by total obs;  22564      100






**********  Fund: FI0008801071  **********
Numerical variables:
        count          mean           std        min          25%         50%  \
volume  18534  8.174622e+03  4.299689e+04       1.00       383.00      1000.0   
price   18534  4.514100e+01  9.079845e+01       0.00         8.90        15.0   
prccd   18510  4.587253e+01  9.239912e+01       0.17         8.98        15.2   
ajexdi  18510  2.155561e+00  3.901301e+00       0.10         1.00         1.0   
cshoc   18507  2.929307e+08  7.246102e+08  415337.00  36340550.00  88163370.0   

                 75%         max  
volume  3.815750e+03     1735245  
price   3.350000e+01        1630  
prccd   3.400000e+01        1630  
ajexdi  1.274500e+00          64  
cshoc   2.193326e+08  4796292460  


Categorical variables:
- Variable: class
code                                               desc    obs  percent
  FI  Foreign nominee account where most foreign hol...  18534      100
 NaN         non-missing obs; <-- divided by total obs;  18534      100


- Variable: legtype
 code                                        desc    obs  percent
   51                            FOUNDATION, FUND  18534      100
  NaN  non-missing obs; <-- divided by total obs;  18534      100


- Variable: ref_code
 code                                        desc    obs     percent
    1                           MARKETPLACE TRADE  17201   92.807813
    2                                 OTHER TRADE   1058    5.708428
    4                     TRANSFER TO OWN ACCOUNT     91    0.490990
   55                                       SPLIT     40    0.215820
   57                                 COMBINATION     26    0.140283
   51                                 BONUS ISSUE     25    0.134887
    0                               Not Available     24    0.129492
   62                    SECURITY TYPE CONVERSION     12    0.064746
   99                                       OTHER     11    0.059350
   52                                SUBSCRIPTION     10    0.053955
   56                                      MERGER     10    0.053955
   98                                  CORRECTION      9    0.048559
   50                                RIGHTS ISSUE      8    0.043164
   60                          SUBSCRIPTION RIGHT      2    0.010791
    3                     TRANSFER TO BOOKENTRIES      2    0.010791
   58                  CANCELLATION OF A SECURITY      2    0.010791
   75                                    DEMERGER      2    0.010791
   16                              DIRECT SELLING      1    0.005395
  NaN  non-missing obs; <-- divided by total obs;  18534  100.000000


- Variable: buy_sell
 code                                        desc    obs     percent
   10                                         buy  10918   58.907953
   20                                        sell   7616   41.092047
  NaN  non-missing obs; <-- divided by total obs;  18534  100.000000


- Variable: sect
   code                                        desc    obs     percent
    260                                         NaN  10228   55.185065
 123000                                         NaN   8306   44.814935
    NaN  non-missing obs; <-- divided by total obs;  18534  100.000000


- Variable: gender
 code                                        desc    obs     percent
    1                                         NaN  10228  100.000000
  NaN  non-missing obs; <-- divided by total obs;  10228   55.185065


- Variable: pcode
 code                                        desc    obs  percent
  100                                         NaN  18534      100
  NaN  non-missing obs; <-- divided by total obs;  18534      100






**********  Fund: FI0008801485  **********
Numerical variables:
        count          mean           std           min          25%  \
volume  27545  7.624694e+03  5.514410e+04        1.0000       400.00   
price   27545  3.251798e+01  5.901595e+01        0.0000        11.15   
prccd   27272  3.313295e+01  6.073064e+01        0.2500        11.41   
ajexdi  27272  1.637745e+00  2.828950e+00        0.1144         1.00   
cshoc   27270  3.391632e+08  7.951268e+08  1080000.0000  48500125.00   

                 50%           75%         max  
volume  1.500000e+03  5.000000e+03     4637500  
price   1.730000e+01  2.964000e+01         990  
prccd   1.745000e+01  2.990000e+01         990  
ajexdi  1.000000e+00  1.000000e+00          64  
cshoc   1.073587e+08  1.875400e+08  4796292460  **OUTPUT MUTED**

Descriptive Stats by Year


In [22]:
# Resampling from daily to annually
# How to down-sample from daily to annually? Some variables are computed as means; some are aggregated.
vars_compute_average = ['price', 'prccd', 'ajexdi', 'cshoc']
vars_compute_sum = ['buy_value', 'sell_value', 'buy_volume', 'sell_volume', 'buy', 'sell']

In [23]:
ts_main = data_main
# Generate variables
ts_main['buy'] = (ts_main.buy_sell == 10) | (ts_main.buy_sell == 11)
ts_main['sell'] = (ts_main.buy_sell == 20) | (ts_main.buy_sell == 21)

ts_main['buy_volume'] = np.nan
ts_main.ix[ts_main['buy']==True, 'buy_volume'] = ts_main['volume']
ts_main['sell_volume'] = np.nan
ts_main.ix[ts_main['sell']==True, 'sell_volume'] = ts_main['volume']

ts_main['buy_value'] = np.nan
ts_main.ix[ts_main['buy']==True, 'buy_value'] = ts_main['volume'] * ts_main['price']
ts_main['sell_value'] = np.nan
ts_main.ix[ts_main['sell']==True, 'sell_value'] = ts_main['volume'] * ts_main['price']

ts_main.index = ts_main.date

In [24]:
ts_temp_mean = ts_main[vars_compute_average].resample('A', how='mean')
ts_temp_sum = ts_main[vars_compute_sum].resample('A', how='sum')
ts_annual = pd.concat([ts_temp_mean, ts_temp_sum], axis=1)
ts_annual.columns = ['avg_price', 'avg_prccd', 'avg_ajexdi', 'avg_cshoc', 'sum_buy_value', 'sum_sell_value', 'sum_buy_vol', 'sum_sell_vol', 'sum_buys', 'sum_sells']
print('Variable definitions:')
print('sum_buys: number of buys during the year')
print('sum_buy_vol: total buy volume during the year')
print('sum_buy_value (in 000,000s): total turonver of purchase value during the year')
ts_annual['sum_buy_value'] = ts_annual['sum_buy_value']/1000000 
ts_annual['sum_sell_value'] = ts_annual['sum_sell_value']/1000000 
ts_annual


Variable definitions:
sum_buys: number of buys during the year
sum_buy_vol: total buy volume during the year
sum_buy_value (in 000,000s): total turonver of purchase value during the year
Out[24]:
avg_price avg_prccd avg_ajexdi avg_cshoc sum_buy_value sum_sell_value sum_buy_vol sum_sell_vol sum_buys sum_sells
date
1994-12-31 153.809259 150.693889 6.528385 1.104380e+08 35.499000 26.813600 567400 589800 78 84
1995-12-31 138.213604 141.753156 6.147319 1.234849e+08 3683.562927 3833.692808 69590197 59328435 6512 7352
1996-12-31 116.350465 116.608281 5.765767 1.188861e+08 3784.796712 4072.432073 57469672 58546592 7438 8747
1997-12-31 161.441175 161.995295 4.636617 1.342176e+08 4616.198175 4483.420826 59152263 55704503 6818 6925
1998-12-31 208.063459 210.618799 4.112164 1.912685e+08 5439.818231 5210.158386 52115245 48363473 7748 7265
1999-12-31 34.303839 34.479339 2.675440 3.979546e+08 897.252070 899.840589 49069080 46026064 7576 7739
2000-12-31 33.533031 33.803717 1.948758 5.694018e+08 1437.123104 1316.744530 73429956 62885140 6949 6682
2001-12-31 13.694342 13.775393 1.569920 5.771282e+08 734.365436 654.063640 80264119 59810924 5268 4839
2002-12-31 13.404859 13.470798 1.533537 8.072455e+08 629.125871 590.531031 63529511 65233958 5174 4418
2003-12-31 11.696425 11.695220 1.482651 8.290779e+08 588.850666 442.608437 71239436 55856913 6605 4442
2004-12-31 12.495043 13.001033 1.570942 5.861513e+08 722.641738 601.799095 70193837 133266157 5059 2740
2005-12-31 17.840684 18.156664 1.550864 3.987343e+08 733.881262 739.963005 52969789 56608478 2330 1996
2006-12-31 18.392183 19.190453 1.303226 3.907589e+08 228.230178 229.494766 14859885 16878252 478 493
2007-12-31 23.090562 23.103537 1.246037 3.367412e+08 2350.847580 2367.310627 135339817 129138045 21024 15950
2008-12-31 17.967013 17.942003 1.175673 4.231516e+08 2557.008426 2484.484775 190018581 172309766 20855 18774
2009-12-31 12.137364 12.161051 1.203043 4.793680e+08 2346.552330 2049.202083 277441903 227760326 15264 13127
2010-12-31 14.542960 14.580589 1.136956 3.698097e+08 2219.364566 2092.553943 201083780 183930126 6047 5846
2011-12-31 15.275010 15.217217 1.107915 4.156995e+08 2086.850055 2025.551047 204763850 183538495 5758 5890
2012-12-31 13.666824 14.376986 1.124224 4.621582e+08 1993.079458 1904.954811 336439421 283760594 5470 6552
2013-12-31 16.170372 16.891658 1.047625 4.710433e+08 3043.694423 3103.375720 382381613 398314857 5600 6958
2014-12-31 14.727076 15.629244 1.013533 4.705342e+08 1181.346339 1396.723934 131832665 132913061 2529 2742

Plots

Time Series Plots - Number of Buys / Sells per Fund


In [25]:
ts_buys = data_main[['fund_id', 'buy', 'sell', 'buy_value', 'sell_value', 'date']].copy()
ts_buys.index = ts_buys.date
ts_buys.drop('date', axis=1, inplace=True)

In [26]:
# Group by fund_id
ts_buys_grouped = ts_buys.groupby('fund_id', as_index=False, axis=0)
i = 1
for name, group in ts_buys_grouped:
    # Resample from daily
    ts_temp = group.drop('fund_id', axis=1).resample('M', how='sum')
    
    # Plot
    plt.figure(i)
    plt.plot(ts_temp.index, ts_temp.buy, 'b', label='number of buys')
    plt.plot(ts_temp.index, ts_temp.sell, 'r', label='number of sells')
    plt.legend(loc='upper left')
    plt.xlabel('Time')
    plt.ylabel('Freq')
    plt.grid(True)
    plt.title('Number of Buys/Sells Per Month for Fund {}'.format(name))
    i += 1
    
plt.show()


Time Series Plots - Turnover of Buys/Sells per Fund


In [27]:
# Group by fund_id
i = 1
for name, group in ts_buys_grouped:
    # Resample from daily
    ts_temp = group.drop('fund_id', axis=1).resample('M', how='sum')
    
    # Plot
    plt.figure(i)
    plt.plot(ts_temp.index, ts_temp.buy_value, 'b', label='total value of buys')
    plt.plot(ts_temp.index, ts_temp.sell_value, 'r', label='total value of sells')
    plt.legend(loc='upper left')
    plt.xlabel('Time')
    plt.ylabel('Value in Reported Currency')
    plt.grid(True)
    plt.title('Value of Buys/Sells Per Month for Fund {}'.format(name))
    i += 1
    
plt.show()


Calculating Daily Holdings using FIFO

This method makes use of First-In-First-Out principle to calculate stock returns. Cash dividends can be assumed to be fully reinvested (see the switches at the beginning of the program).


In [28]:
# load stock data
data_stocks = pd.read_stata(Data_Folder + Datafile_stocks)

# Format date
data_stocks['date'] = pd.to_datetime(data_stocks['date'], format="%Y%m%d")

# Keep only useful variables
data_stocks = data_stocks[['isin', 'date', 'ajexdi', 'div', 'prccd']]

data_stocks.info()


<class 'pandas.core.frame.DataFrame'>
Int64Index: 741690 entries, 0 to 741689
Data columns (total 5 columns):
isin      741690 non-null object
date      741690 non-null datetime64[ns]
ajexdi    741288 non-null float64
div       2875 non-null float64
prccd     741288 non-null float64
dtypes: datetime64[ns](1), float64(3), object(1)
memory usage: 34.0+ MB

Dealing with missing data in the stock price data.


In [29]:
# Identify the number of missing isin
print("Number of missing isin entries in the stocks price data: {}".format(len(data_stocks[data_stocks['isin']==""])))
print("These missing entries are dropped.")
data_stocks.drop(data_stocks.index[data_stocks['isin']==""], inplace=True)

print("\n")
# Checking zero dividend payments
print("Number of zero-dividends paymetns: {}".format(np.sum(data_stocks['div']==0)))
print("zero-dividend payments are replaced as missing.")
data_stocks.ix[data_stocks['div']==0, 'div'] = np.nan

print("\n")
# Check missing prccd
print("Number of missing prccd entries: {}".format(np.sum(np.isnan(data_stocks['prccd']))))
print("They are dropped")
data_stocks.drop(data_stocks.index[np.isnan(data_stocks['prccd'])], inplace=True)

print("\n")
# Check missing ajexdi
print("Number of missing ajexdi entries: {}".format(np.sum(np.isnan(data_stocks['ajexdi']))))
print("They are dropped")
data_stocks.drop(data_stocks.index[np.isnan(data_stocks['ajexdi'])], inplace=True)


Number of missing isin entries in the stocks price data: 46583
These missing entries are dropped.


Number of zero-dividends paymetns: 0
zero-dividend payments are replaced as missing.


Number of missing prccd entries: 356
They are dropped


Number of missing ajexdi entries: 0
They are dropped

Compute the dividend index (divIndex)


In [30]:
# display the floating progress bar
f = FloatProgress(min=0, max=1)
display(f)

data_stocks['divIndex'] = 0
data_stocks.sort_values(by=['isin', 'date'], inplace=True)
gp_temp = data_stocks.groupby(by='isin', as_index=False)

# Identify the first entry per isin
df_temp = gp_temp.first()[['isin', 'date']]  
df_temp['first'] = 1

# Merge back
data_stocks = data_stocks.merge(df_temp, how="outer", on=['isin', 'date'])
data_stocks.ix[data_stocks['first']==1, 'divIndex'] = 1
data_stocks.drop('first', axis=1, inplace=True)

# Compute the rest of the divIndex
total_obs = len(data_stocks)
gp_temp = data_stocks.groupby(by='isin', as_index=False)
for name, group in gp_temp:
    # Reset all variables when moving to a new stock
    cumIndex = 1
    previousPrice = group.ix[group.index[0], 'prccd']  # get the first row
    
    for index, row in group.iterrows():
        # print("now in stock {} row {}".format(name, index))
        # print("cumIndex is {}".format(cumIndex))
        # check if there is a dividend payment
        if (not np.isnan(row['div'])) & (row['div']!=0):
            cumIndex = cumIndex*( row['div'] / previousPrice + 1)
            previousPrice = row['prccd']
        else:
            # There is no dividend payment
            previousPrice = row['prccd']
            
        data_stocks.ix[index, 'divIndex'] = cumIndex
        f.value = index/total_obs

data_stocks.head()


Out[30]:
isin date ajexdi div prccd divIndex
0 ARDEUT110533 2012-10-16 1 NaN 18.30 1
1 ARDEUT110533 2012-11-13 1 NaN 18.30 1
2 ARDEUT110533 2012-11-14 1 NaN 19.45 1
3 ARDEUT110533 2012-11-15 1 NaN 19.45 1
4 ARDEUT110533 2012-11-16 1 NaN 19.45 1

Adjust adjexdi to include dividend payments

The original ajexdi is assumed to exclude dividends. The ajexdi is adjusted so that it accounts for the implications of dividends.


In [31]:
def add_modAjexdi(grp):
    """Calculate mod_ajexdi per group."""
    grp['modAjexdi'] = np.nan
    
    for index, row in grp.iterrows():
        if index==grp.index[0]:
            grp.ix[index, 'modAjexdi'] = grp.ix[index, 'ajexdi']
        else:
            grp.ix[index, 'modAjexdi'] = grp.ix[index - 1, 'modAjexdi'] \
                                         / grp.ix[index -1, 'ajexdi'] \
                                         * grp.ix[index, 'ajexdi'] \
                                         / grp.ix[index, 'firstDiffDivIndex']
    return grp

if  Dividends_Reinvested:
    # Adjust ajexdi to account for dividends
    first_diff = lambda x: x / x.shift(1)
    data_stocks['firstDiffDivIndex'] = data_stocks.groupby(by='isin', as_index=False)['divIndex'].transform(first_diff)
    data_stocks = data_stocks.groupby(by='isin', as_index=False).apply(add_modAjexdi)
 
data_stocks.sort_values(by=['isin', 'date'], inplace=True)
data_stocks = data_stocks[['isin', 'date', 'divIndex', 'ajexdi', 'prccd']]
data_stocks.info()


<class 'pandas.core.frame.DataFrame'>
Int64Index: 694751 entries, 0 to 694750
Data columns (total 5 columns):
isin        694751 non-null object
date        694751 non-null datetime64[ns]
divIndex    694751 non-null float64
ajexdi      694751 non-null float64
prccd       694751 non-null float64
dtypes: datetime64[ns](1), float64(3), object(1)
memory usage: 31.8+ MB

Expand unbalanced panel to balanced panel

The current dataset "data_main" is unbalanced where entries are observed at the time of a transaction. The dataset is expanded to create a balanced panel dataset so that there is at least one entry per day.


In [32]:
data_main = data_main[Vars_To_Keep]
print("Check if there is any missing values in the data before expanding it to Panel:")
print(data_main.isnull().sum())
if np.sum(data_main.isnull().sum())>0:
    warn("Warning: There are some missing values --> check!")


Check if there is any missing values in the data before expanding it to Panel:
fund_id     0
isin        0
seq         0
acc         0
sect        0
owntype     0
legtype     0
ref_code    0
buy_sell    0
volume      0
price       0
date        0
dtype: int64

Create an index for multiple transactions within the same day

So that their order can be preserved in the dataset.


In [34]:
# Expand the data
def expand_panel(temp):
    """Convert the dataset to balanced panel."""
    
    # Keep the first row of everyday
    temp_first_rows = temp.groupby(by=['fund_id', 'isin', 'date'], as_index=False).first()

    gp_isin = temp_first_rows.ix[temp_first_rows.index[0], 'isin']
    gp_fund_id = temp_first_rows.ix[temp_first_rows.index[0], 'fund_id']
    
    # Find the remaining rows of everyday
    temp_first_rows['flag'] = 1

    temp_remaining_rows = temp.merge(temp_first_rows, on=list(temp.columns), how='outer')
    temp_remaining_rows.drop(temp_remaining_rows.index[temp_remaining_rows['flag']==1], inplace=True)
    temp_first_rows.drop('flag', axis=1, inplace=True)
    temp_remaining_rows.drop('flag', axis=1, inplace=True)

    # Create a time series
    temp_first_rows.index = temp_first_rows['date']
    temp_first_rows.drop('date', axis=1, inplace=True)

    # Create an entry for the first day
    if First_Day != "":
        temp_first_rows.loc[dt.datetime.strptime(First_Day, '%Y-%b-%d')] = np.nan
        temp_first_rows.sort_index(inplace=True)

    # Up-sample
    ts_daily = temp_first_rows.resample('D')

    lst_fill = ['acc', 'sect', 'owntype', 'legtype']
    for x in lst_fill:
        ts_daily[x] = ts_daily[x].fillna(method='ffill')
        ts_daily[x] = ts_daily[x].fillna(method='bfill')
    lst_fill = ['volume']
    for x in lst_fill:
        ts_daily[x] = ts_daily[x].fillna(value=0)
        ts_daily[x] = ts_daily[x].fillna(value=0)
    
    ts_daily.reset_index(inplace=True)
    ts_daily.rename(columns={'index':'date'}, inplace=True)
    ts_daily['fund_id'] = gp_fund_id
    ts_daily['isin'] = gp_isin
    
    # Merge the remaining rows back to the data
    # Sort the DataFrame to preserve the order of transactions within the same day
    return ts_daily.merge(temp_remaining_rows, on=list(temp.columns), how='outer').sort_values(by=['date', 'seq']).reset_index()

data_main = data_main.groupby(by=['fund_id', 'isin'], as_index=False).apply(expand_panel)
data_main.reset_index(drop=True, inplace=True)
data_main = data_main[['fund_id', 'isin', 'date', 'seq', 'acc', 'sect', 'owntype', 'legtype', 'ref_code', 'buy_sell', 'volume', 'price']]
data_main.sort_values(by=['fund_id', 'isin', 'date', 'seq'], inplace=True)
data_main.head()


Out[34]:
fund_id isin date seq acc sect owntype legtype ref_code buy_sell volume price
0 FI0008800016 FI0009000012 1994-12-28 1 1130139 260 1 51 1 20 50000 5.75
1 FI0008800016 FI0009000012 1994-12-29 1 1130139 260 1 51 1 20 50000 5.70
2 FI0008800016 FI0009000012 1994-12-29 2 1130139 260 1 51 1 20 140000 5.70
3 FI0008800016 FI0009000012 1994-12-29 3 1130139 260 1 51 1 20 20000 5.70
4 FI0008800016 FI0009000012 1994-12-29 4 1130139 260 1 51 1 20 30000 5.70

Calculate the initial holdings for each fund

Merge with data_stocks and impute missing prices.

Missing prices/ajexdi/divIndex is the result of merging non-trading days with the Stock Prices data (e.g., weekends and public holidays), or just missing observations for some reason. I am using "forward fill" to impute the missing values.


In [35]:
# Merge 
data_main = data_main[['fund_id', 'isin', 'date', 'seq', 'acc', 'sect', 'owntype', 'legtype', 'ref_code', 'buy_sell', 'volume', 'price']]

data_main = data_main.merge(data_stocks, how='left', on=['isin', 'date'])

In [36]:
# Inspect missing data
print(data_main.isnull().sum())

# Use forward fill to impute ajexdi; prccd, divIndex; 
gp_temp = data_main.groupby(by=['fund_id', 'isin'], as_index=False)

def fill_na(grp):
    lst_vars = ['ajexdi', 'prccd', 'divIndex']
    for x in lst_vars:
        grp[x] = grp[x].fillna(method='ffill')
        # I don't really like the missing values at the beginning of the dataset
        # just let me get rid of those
        grp[x] = grp[x].fillna(method='bfill')
    return grp

data_main = gp_temp.apply(fill_na)


fund_id           0
isin              0
date              0
seq         2983627
acc               0
sect              0
owntype           0
legtype           0
ref_code    2983627
buy_sell    2983627
volume            0
price       2983627
divIndex    1035253
ajexdi      1035253
prccd       1035253
dtype: int64

Adjust stock purchase/sell prices before 1999


In [37]:
if Adjust_Price_Before_1999:
    data_main.ix[data_main['date']<"1999-Jan-01", 'price'] = data_main['price']/5.94573

In [38]:
# Generate variables
data_main['buy'] = (data_main.buy_sell == 10) | (data_main.buy_sell == 11)
data_main['sell'] = (data_main.buy_sell == 20) | (data_main.buy_sell == 21)

# Create Adjusted Volume
data_main['adjSellVolume'] = 0
data_main.ix[data_main['sell']==True, 'adjSellVolume'] = data_main['volume']

data_main['adjBuyVolume'] = 0
data_main.ix[data_main['buy']==True, 'adjBuyVolume'] = data_main['volume']

data_main['cumulativeVolume'] = 0

In [39]:
# check if there is any missing values in ajexdi and divIndex
lst_vars = ['ajexdi', 'divIndex', 'price', 'buy_sell', 'volume']
print("Number of obs in data: {}".format(len(data_main)))

for x in lst_vars:
    print("missing values in {}: {}".format(x, np.sum(np.isnan(data_main[x]))))

print("Need to get rid of missing ajexdi or divIndex (mostly the outcome of expanding the data)")
data_main = data_main.ix[data_main['ajexdi'].notnull()].ix[data_main['divIndex'].notnull()]


Number of obs in data: 3273768
missing values in ajexdi: 51414
missing values in divIndex: 51414
missing values in price: 2983627
missing values in buy_sell: 2983627
missing values in volume: 0
Need to get rid of missing ajexdi or divIndex (mostly the outcome of expanding the data)

Check over-sales and compute holdings

Some funds may "over-sale" their stocks and result in negative balance. This section prohibits negative balances by identifying all possible "over-sales" and lower the sell-volume to the level that the holdings of zero shares occur at the point of sale. Subsequent sales are then void.

The number of stocks in fund holdings for each period is stored in the variable cumulativeVolume.


In [40]:
# Progress bar
f = FloatProgress(min=0, max=1)
display(f)
total_obs = len(data_main)

# Check for over-sales
data_main = data_main.sort_values(by=['fund_id', 'isin', 'date', 'seq']).reset_index(drop=True)

gp_temp = data_main.groupby(by=['fund_id', 'isin'], as_index=False)
for name, group in gp_temp:
    # Reset variables when moving to a new stock
    first_row = group.index[0]
    # If the first entry is a sell before buy --> the beginning balance is negative.
    cumVolume = 0
    previousAjexdi = group.ix[first_row, 'ajexdi'] 
    
    for index, row in group.iterrows():
        # check if there is a buy
        if (row['buy']==True) & (row['adjBuyVolume']!=0):
            cumVolume = cumVolume / row['ajexdi'] * previousAjexdi + row['adjBuyVolume'] 
            previousAjexdi = row['ajexdi']
        
        # check if there is a sell
        if (row['sell']==True) & (row['adjSellVolume']!=0):
            cumVolume = cumVolume / row['ajexdi'] * previousAjexdi
            previousAjexdi = row['ajexdi']
            
            # check if sell volume exceeds cumulative volume
            if (row['adjSellVolume'] > cumVolume):
                data_main.ix[index, 'adjSellVolume'] = cumVolume
                cumVolume = 0
            else:
                cumVolume = cumVolume - row['adjSellVolume']
                
        f.value = index/total_obs        
        data_main.ix[index, 'cumulativeVolume'] = cumVolume

Calcualte profits from sales


In [41]:
# Progress bar
f = FloatProgress(min=0, max=1)
display(f)
total_obs = len(data_main)

data_main['profit'] = np.nan

data_main['daysSincePurchase'] = np.nan
data_main['profit'] = np.nan

# Note that profit is from sale only (not buy)
# Calcualte profit from each sale
data_main = data_main.sort_values(by=['fund_id', 'isin', 'date', 'seq']).reset_index(drop=True)

gp_temp = data_main.groupby(by=['fund_id', 'isin'], as_index=False)
for name, group in gp_temp:
    # Move to a new stock and reset variables
    buyPrice = []
    buyVolume = []
    buyDate = []
    buyAjexdi = []
    
    salePrice = 0
    saleVol = 0
    saleRevenue = 0
    profit = 0
    daysSinceBuy = -1
    saleDate = -1
    ithPurchase = 1
    
    for index, row in group.iterrows():
        # read in buyPrice and buyVolume
        if (row['buy']==True) & (row['adjBuyVolume']>0):
            buyPrice.append(row['price'])
            buyVolume.append(row['adjBuyVolume'])
            buyAjexdi.append(row['ajexdi'])
            buyDate.append(index)
            ithPurchase += 1
            
        # update sellPrice and sellVolume
        if (row['sell']==True) & (row['adjSellVolume']>0):
            salePrice = row['price']
            saleVol = row['adjSellVolume']
            saleRevenue = salePrice * saleVol
            saleDate = index
            saleAjexdi = row['ajexdi']
            
        # determine purchase price -- total purchase cost is purchaseCost
        qtyTaken = 0  # number of shares to take from the element in list
        purchaseCost = 0
        while saleVol > 1:
            # loop from the first element of buyVolume
            for i, vol in enumerate(buyVolume):
                if vol > 0:
                    # the ith buy's volume >0 
                    daysSinceBuy = saleDate - buyDate[i]
                    data_main.ix[index, 'daysSincePurchase'] = daysSinceBuy
                    convertedBuyVolume = vol * buyAjexdi[i] / saleAjexdi
                    
                    if saleVol > convertedBuyVolume:
                        saleVol = saleVol - convertedBuyVolume
                        purchaseCost += vol * buyPrice[i]
                        buyVolume[i] = 0
                    else:
                        purchaseCost += vol * buyPrice[i] / convertedBuyVolume * saleVol
                        buyVolume[i] = vol - saleVol * saleAjexdi / buyAjexdi[i]
                        saleVol = 0
        data_main.ix[index, 'profit'] = saleRevenue - purchaseCost
        f.value = index/total_obs


                    
data_main.ix[np.isnan(data_main['profit']), 'profit'] = 0            
data_main.ix[data_main['adjSellVolume']==0, 'profit'] = 0

Output Files


In [42]:
# Output main dataset
data_main.to_excel(Output_Folder + 'data_main.xlsx')

# Holdings data 
gp_temp = data_main[['fund_id', 'acc', 'isin', 'date', 'cumulativeVolume']].groupby(by=['fund_id', 'isin', 'date'])
data_holdings = gp_temp.last().reset_index()
data_holdings.rename(columns={'level_0': 'fund_id', 'level_1':'isin', 'level_2':'date'}, inplace=True)
data_holdings.to_excel(Output_Folder + 'data_holdings.xlsx')